One of the marvels of modern science, made possible by the use of sophisticated computer algorithms, is the famous Human Genome Project. Some background and related resources are available in the slides on genome reassembly from fragments.
The idea of reassembling a long sequence from many overlapping fragments of the whole is not new. It is the basis for an early 20th-century technique called dendrochronology, where it has enabled the creation of year-by-year records of tree-ring growth patterns in some geographical regions over intervals as long as several thousand years. These sequences have been used to pinpoint the dates of construction for ancient structures with wooden members such as the cliff dwellings at Mesa Verde National Park in the southwestern U.S. However, the ability to reassemble very long patterns such as the sequence of some 3 billion nucleotides in a strand of human DNA — from millions of fragments — is new. This "scaling up" of the basic sequence-reassembly concept was crucial to the success of the Human Genome Project. How can this sort of thing be done? That is the question you will explore a bit in this project.
In following the instructions below, you may choose to write code for StringReassemblyTest.java first, or to write code for StringReassembly.java first, or to alternate between them method-by-method. Different strokes for different folks; do what works best for you! If you're not sure what works best for you, try writing the code for StringReassemblyTest.java first if this approach (called test-driven development) is new to you.
Edit StringReassembly.java to implement the methods not yet implemented, following the instructions and advice in the comments. As a guide, be sure to examine the code for the other methods whose bodies are provided already, as well as the main program. This directory holds some text files containing fragments derived from rather small original documents on which to try your solution. We recommend you download these files and place them in the data folder of your Eclipse project. For files with over 1000 lines, expect to wait a minute or more to reassemble the strings. Also, recall that the greedy solution does not always result in an optimal result, declaration-50-8.txt and gettysburg-30-4.txt will not be fully reassembled by our implementation (for added challenge, try to determine why it occurs for these files but not the others).
Note that each line separator in the original documents that were "shredded" to create the above fragment files has first been replaced by '~'. Separate lines of the fragment files contain separate fragments of the respective original documents.
Here are some possible additional activities related to this project. Any extra work is strictly optional, for your own benefit, and will not directly affect your grade.