In: Computer Science
In this problem you will design and implement C++ code that
identifies overlap in strings. Specifically, design and implement a
C++ program that does the following:
1. Asks a user to input a filename and then opens that file. If the
file open fails, then print the message “Unable to open file” and
terminate the program using exit(1).
2. Reads the file contents, in order, into an array of strings. (See
the file format explanation below.)
3. Computes the string overlapping order described below, and then
prints the strings out, one per line, in that order.
4. Closes the file.
File Format: The data file consists of a number of strings, each on
its own line. You do not know in advance how many strings will be
in the file, other than there will be no more than 30 strings.
Assume (i) there is at least one string in the file, (ii) the
strings do not have any whitespace in their interior, and (iii) the
strings consist entirely of alphabetic, upper case characters. Here
is an example of a data file containing four strings:
AGGTGTGGA AAAATTA AATTGTCGCTGA GGAAAA
Overlapping Order: Here is the explanation of the string
overlapping your program is to find in Step (3) above. Suppose we
have the strings above read, in order, into an array of strings. We
start with the first string in the array, AGGTGTGGA. What we want to
determine is which of the other strings’ beginning overlaps the
most with the end of the this first string. Note this is a
nontrivial problem since we do not know without further analysis
what the size of the overlapping substring will be. For example, if
we just look at the last character of AGGTGTGGA, the A, there are
two other strings that begin with A. If we look at the last two
characters, GA, there are no strings starting with GA. If we
1
look at the last three, GGA, there is one other string starting at
GGA. If we continue along this line, we notice that the largest
overlap is the three characters GGA, and the overlap is with the
fourth string. So move the fourth string into the second position
in the array by exchanging it with the current second string:
AGGTGTGGA GGAAAA AATTGTCGCTGA AAAATTA
Now which of the remaining strings (i.e., the third and fourth
ones), has the largest overlap at its beginning with the end of
GGAAAA? Both of the remaining strings have an overlap of one
character (A), and of two characters (AA). But only one has a
larger overlap, the fourth string, with maximum overlap AAAA. So
move the fourth string into the third position by exchanging it
with the third string:
AGGTGTGGA GGAAAA AAAATTA AATTGTCGCTGA
Then we look for the maximum overlap between AAAATT and the
remaining strings. However, since there is only one string left,
the final string trivially has the maximum overlap. So the strings
are now ordered as desired, and the program’s output should be the
strings printed one per line in the order immediately above.
Additional clarification, requirements, and advice: • If more than
one string has the maximum overlap (e.g., two string overlap with
the current string in a substring of 3 characters) choose the one
that occurs first in the current array. • It is possible for the
maximum overlap to be of length 0. • Your program should work on a
data file with any name, and with any reasonable number of strings,
as long as the number of strings is at most 30. • Your program just
needs to print the strings in the final order. It does not need to
print any additional information such as the amount of overlap. •
Make sure you understand how the overlap works here with the end of
the “current” string overlapping with the beginning of the other
strings. • Finding the length of overlap between any two strings is
not a simple task, but should not be extremely lengthy or difficult
either. Think about the problem carefully, and remember you can use
string functions as useful. Moreover, to organize your program,
write a function int findOverlap(string s1, string s2) that finds
the number of characters of overlap between the beginning of s2 and
end of s1, and returns the length of this overlap. For example,
when s1 = AGGTGTGGA and s2 = GATTTAG the function should return two
since the largest overlap is two characters long, namely GA.
2
Here’s a complete example using another data file. Suppose the file
is the following:
AGCTG GCGGACGGG GGG TGAGCGGA GGGGC
Then here is the program input and output. As usual, user input is
underlined, and there is a single space between the input prompt
and the input file name. And — also as usual — follow the input and
output example carefully. Input filename: testData.dat AGCTG
TGAGCGGA GCGGACGGG GGG GGGGC
And here is an example of an invalid filename: Input filename:
testData.cat Unable to open file
So basically we have to read some strings from a file and then find overlaps and give output . I've created a code file which is explained with the comments which will really help you . code:)-
#include<iostream>
#include<fstream>
#include<stdlib.h>
using namespace std;
int findoverlaps(string s1,string s2) // this will check maximum
overlap length
{
int l1=s1.length();
int l2=s2.length();
int count=0;
int len=min(l1,l2);
for(int i=1;i<=len;i++) //this loop run for
checking overlap length which can go from 1 to minimum length
{
int q=0;
int c=0;
while(q<i &&
s1[l1-(i-q)]==s2[q])
{
q++;
c++;
}
if(q==i) // if i length is a
overlap
{
count=q;
}
}
return count;
}
int main()
{
fstream file;
string s[31];
char file1[50];
cout<<"enter the file name:";
cin>>file1;
//this will read the filename
file.open(file1,ios::in);
if(file==NULL)
{
cout<<"unable to open the
file";
exit(0);
}
int len=0;
while(file)
{
file>>s[len++];
//this read a string from the
file
}
for(int i=0;i<len;i++) // here we go
for overlap at every position
{
int m=0;
int k=i+1;
for(int j=i+1;j<len;j++) // this
loop check all possible overlap with the remaining strings
{
int
overlap=findoverlaps(s[i],s[j]);
if(overlap>m)
{
m=overlap;
k=j;
}
}
if(k!=i+1)
{
swap(s[i+1],s[k]);
}
}
for(int i=0;i<len;i++)
{
cout<<s[i]<<"\n";
}
}
Here is the input file: