In: Computer Science
A DNA string is a sequence of the bases a, c, g, and t in any order, whose length is usually a multiple of three. In reality, it is not necessarily a multiple of three, but we will simplify it as such for discussion.
For example,
aacgtttgtaaccagaactgt
is a DNA string with a length of 21 bases. Recall that a sequence
of three consecutive letters is called a codon. Assuming the first
codon starts at position 1, the codons are
aac, gtt, tgt, aac, cag, aac, and tgt
Those of you who know a little about genomics know that the open reading frame can be shifted to get a different set of codons. I want any of you who know this much to assume for discussion simplicity that there is only one open reading frame – the one starting at position 1.
A DNA string can be hundreds of thousands of codons long, even millions of codons long, which means that it is infeasible to count them by hand. Assuming the central dogma of molecular biology[1] holds true, a histogram of the standard genetic code[2] will prove useful in reporting the frequency of each recognized codon.
Instructions
Write a script named codonhistogram that will accept the
pathname of a dna file on the command line as its only
argument.
Generally, the dna file should be a text file containing a valid
DNA string with no newline characters or white space characters of
any kind within it. (It will be terminated with a newline
character.) This dna file should contain nothing but a sequence of
the bases a, c, g, and t in any order.
Error checking:
The script should check that it has one single command line argument, If it is missing the argument, the program should output to the user what the user error is, a how-to-use-me message such as
codonhistogram <dnafile> |
and then exit.
The script should check that it is a file that it can read. If it cannot, the script should output to the user the user error such as
codonhistogram: cannot open file for reading |
where <filename argument> is the pathname of the dna file, a
how-to-use-me message, and then exit.
The script should check that the dna file has only the letters a, c, g, and t and no other letter characters. If it does not satisfy this constraint, the script should output an appropriate user error message and then exit.
The script does not have to check that the file contains a number of characters equal to a multiple of three. If the file ends with a newline character, then the number of characters is equal to a multiple of three plus 1. If It does not satisfy this constraint, the script should output an appropriate user error message and then exit.
The script must count the number of occurrences of every codon in the file, assuming the first codon starts at position 1, and it must output the number of times each codon occurs in the file, sorted in order of decreasing frequency. For example, if dnafile is a file containing the DNA string aacgtttgtaaccagaactgt, then the command
./codonhistogram dnafile |
should produce the following output:
3 aac 2 tgt 1 cag 1 gtt |
because there are 3 aac codons, 2 tgt, 1 cag, and 1 gtt. Notice that frequency comes first, then the codon name.
Important: If two or more codons have the same frequency, your script should break the tie using alphabetical order of the codons. In this example, cag and gtt each occur just once, but because c precedes g, cag comes before gtt above.
Stepwise procedure to create the required program in C:
1. Check that there are only 2 command line arguments.[1st for name of program and 2nd for name of file]/
2. Try to open the file.
3. Create a struct with two fields: One name and other cnt[Number of times the name occurs in the file]
4. Since there can be only 64 codons possible, create an array of structs of size 64.
5. Initialize the array of structs with all the possible names and cnt =0.
6. Iterate through the file and read three letter word in a temporary w and find its index in the array of structs and increase its count.
The index can be found using a helper find_index function.
the base values are a=0,c=1,g=2,t=3.
Index of cgt = 1*16 + 2*4 + 3*1 = 27 and so on. It is equivalent to a number system with base 4.
7. Use qsort to sort the array of structs. The compare function first checks the cnt of each element. If the cnt of two elements are not equal, the decision is made on basis of cnt otherwise if they are equal, the decision is made based on their names.
The final program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
//Helper structure to store the count and name of each codon
struct codon {
char name[4];
int cnt;
};
//There are only 64 possible number of cododns
struct codon c[64];
char bases[] = "acgt";
//Returns the index of codon
int find_index(char w[4]) {
int idx = 0;
if(w[0]=='c') idx += 16;
else if(w[0]=='g') idx += 32;
else if(w[0]=='t') idx += 48;
if(w[1]=='c') idx += 4;
else if(w[1]=='g') idx += 8;
else if(w[1]=='t') idx += 12;
if(w[2]=='c') idx += 1;
else if(w[2]=='g') idx += 2;
else if(w[2]=='t') idx += 3;
return idx;
}
//Helper function to be used by qsort()
int compare(const void *a, const void *b) {
struct codon *ca = (struct codon *)a;
struct codon *cb = (struct codon *)b;
if(ca->cnt != cb->cnt) return cb->cnt - ca->cnt;
//If they are equal check their names
return strcmp(ca->name, cb->name);
}
int main(int argc, char *argv[])
{
//First argument is the name of the program and the second is the
//Name of the dnafile
if(argc != 2) {
printf("%d\n",argc);
printf("Please use the following command format to use me:\n");
printf("./codonhistogram <dnafile>");
exit(2);
}
//Try to open the dnafile
FILE *fp = fopen(argv[1], "r");
//If the file could not be opened
if(!fp) {
printf("Cannot open file for reading");
exit(2);
}
int ptr = 0;
//Initialize the 64 codons
for(int i=0;i<4;i++) {
for(int j=0;j<4;j++) {
for(int k=0;k<4;k++) {
c[ptr].name[0] = bases[i];
c[ptr].name[1] = bases[j];
c[ptr].name[2] = bases[k];
c[ptr].name[3] = '\0';
c[ptr].cnt = 0;
ptr++;
}
}
}
int count = 0;
char ch;
char w[4];
int i=0;
while((ch = fgetc(fp))!=EOF) {
count++;
//Reached the end of file
if(ch=='\n' && fgetc(fp)==EOF) break;
//Check that the character must not be other than a,c, g and t
if(ch!='a' && ch!='c' && ch!='g' && ch!='t') {
printf("The Genome formatting contians an invalid character: %c", ch);
exit(3);
}
//Insert the character in w
w[i++] = ch;
if(i==3) {
w[i] = '\0';
i=0;
c[find_index(w)].cnt += 1;
}
}
//Check count must be a multiple of 3 plus 1
if(count % 3 != 1) {
printf("Error in number of bases--must be a multiple of 3 and terminated by newline");
exit(4);
}
//Sort the array of structs using qsort();
qsort(c, 64, sizeof(struct codon), compare);
for(int i=0;i<64;i++) {
if(c[i].cnt > 0) {
printf("%d %s\n",c[i].cnt,c[i].name);
}
}
fclose(fp);
return 0;
}
8. Save the above program as codonhistogram.c
9. Compie the program by:
gcc codonhistogram.c -o codonhistogram
10. Create a file named dnafile.txt in the same folder as the codonhistogram.c with the following contents:
aacgtttgtaaccagaactgt
Make sure that there is a newline at the end of the string as specified in the question and as shown above.
11. Run the following command:
./codonhistogram dnafile.txt
Output:
3 aac
2 tgt
1 cag
1 gtt