Question

In: Computer Science

A DNA string is a sequence of the bases a, c, g, and t in any...

A DNA string is a sequence of the bases a, c, g, and t in any order, whose length is usually a multiple of three. In reality, it is not necessarily a multiple of three, but we will simplify it as such for discussion.

For example,

aacgtttgtaaccagaactgt


is a DNA string with a length of 21 bases. Recall that a sequence of three consecutive letters is called a codon. Assuming the first codon starts at position 1, the codons are

aac, gtt, tgt, aac, cag, aac, and tgt

Those of you who know a little about genomics know that the open reading frame can be shifted to get a different set of codons. I want any of you who know this much to assume for discussion simplicity that there is only one open reading frame – the one starting at position 1.

A DNA string can be hundreds of thousands of codons long, even millions of codons long, which means that it is infeasible to count them by hand. Assuming the central dogma of molecular biology[1] holds true, a histogram of the standard genetic code[2] will prove useful in reporting the frequency of each recognized codon.

Instructions

Write a script named codonhistogram that will accept the pathname of a dna file on the command line as its only argument.

Generally, the dna file should be a text file containing a valid DNA string with no newline characters or white space characters of any kind within it. (It will be terminated with a newline character.) This dna file should contain nothing but a sequence of the bases a, c, g, and t in any order.

Error checking:

The script should check that it has one single command line argument, If it is missing the argument, the program should output to the user what the user error is, a how-to-use-me message such as

codonhistogram <dnafile>


and then exit.

The script should check that it is a file that it can read. If it cannot, the script should output to the user the user error such as

codonhistogram: cannot open file for reading


where <filename argument> is the pathname of the dna file, a how-to-use-me message, and then exit.

The script should check that the dna file has only the letters a, c, g, and t and no other letter characters. If it does not satisfy this constraint, the script should output an appropriate user error message and then exit.

The script does not have to check that the file contains a number of characters equal to a multiple of three. If the file ends with a newline character, then the number of characters is equal to a multiple of three plus 1. If It does not satisfy this constraint, the script should output an appropriate user error message and then exit.

The script must count the number of occurrences of every codon in the file, assuming the first codon starts at position 1, and it must output the number of times each codon occurs in the file, sorted in order of decreasing frequency. For example, if dnafile is a file containing the DNA string aacgtttgtaaccagaactgt, then the command

./codonhistogram dnafile


should produce the following output:

3 aac

2 tgt

1 cag

1 gtt

because there are 3 aac codons, 2 tgt, 1 cag, and 1 gtt. Notice that frequency comes first, then the codon name.

Important: If two or more codons have the same frequency, your script should break the tie using alphabetical order of the codons. In this example, cag and gtt each occur just once, but because c precedes g, cag comes before gtt above.

Solutions

Expert Solution

Stepwise procedure to create the required program in C:

1. Check that there are only 2 command line arguments.[1st for name of program and 2nd for name of file]/

2. Try to open the file.

3. Create a struct with two fields: One name and other cnt[Number of times the name occurs in the file]

4. Since there can be only 64 codons possible, create an array of structs of size 64.

5. Initialize the array of structs with all the possible names and cnt =0.

6. Iterate through the file and read three letter word in a temporary w and find its index in the array of structs and increase its count.

The index can be found using a helper find_index function.

the base values are a=0,c=1,g=2,t=3.

Index of cgt = 1*16 + 2*4 + 3*1 = 27 and so on. It is equivalent to a number system with base 4.

7. Use qsort to sort the array of structs. The compare function first checks the cnt of each element. If the cnt of two elements are not equal, the decision is made on basis of cnt otherwise if they are equal, the decision is made based on their names.

The final program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//Helper structure to store the count and name of each codon
struct codon {
    char name[4];
    int cnt;
};

//There are only 64 possible number of cododns
struct codon c[64];
char bases[] = "acgt";

//Returns the index of codon
int find_index(char w[4]) {
    int idx = 0;
    if(w[0]=='c') idx += 16;
    else if(w[0]=='g') idx += 32;
    else if(w[0]=='t') idx += 48;
    
    if(w[1]=='c') idx += 4;
    else if(w[1]=='g') idx += 8;
    else if(w[1]=='t') idx += 12;
    
    
    if(w[2]=='c') idx += 1;
    else if(w[2]=='g') idx += 2;
    else if(w[2]=='t') idx += 3;
    
    return idx;
}

//Helper function to be used by qsort()
int compare(const void *a, const void *b) {
    struct codon *ca = (struct codon *)a;
    struct codon *cb = (struct codon *)b;
    if(ca->cnt != cb->cnt) return cb->cnt - ca->cnt;
    //If they are equal check their names
    return strcmp(ca->name, cb->name);
}


int main(int argc, char *argv[])
{
    //First argument is the name of the program and the second is the 
    //Name of the dnafile
    if(argc != 2) {
        printf("%d\n",argc);
        printf("Please use the following command format to use me:\n");
        printf("./codonhistogram <dnafile>");
        exit(2);
    }
    
    //Try to open the dnafile
    FILE *fp = fopen(argv[1], "r");
    //If the file could not be opened
    if(!fp) {
        printf("Cannot open file for reading");
        exit(2);
    }
    
    int ptr = 0;
    
    //Initialize the 64 codons
    for(int i=0;i<4;i++) {
        for(int j=0;j<4;j++) {
            for(int k=0;k<4;k++) {
                c[ptr].name[0] = bases[i];
                c[ptr].name[1] = bases[j];
                c[ptr].name[2] = bases[k];
                c[ptr].name[3] = '\0';
                c[ptr].cnt = 0;
                ptr++;
            }
        }
    }
    
    int count = 0;
    char ch;
    char w[4];
    int i=0;
    while((ch = fgetc(fp))!=EOF) {
        count++;
        //Reached the end of file
        if(ch=='\n' && fgetc(fp)==EOF) break;
        //Check that the character must not be other than a,c, g and t
        if(ch!='a' && ch!='c' && ch!='g' && ch!='t') {
            printf("The Genome formatting contians an invalid character: %c", ch);
            exit(3);
        }
        //Insert the character in w
        w[i++] = ch;
        if(i==3) {
            w[i] = '\0';
            i=0;
            c[find_index(w)].cnt += 1;
        }
    }
    
    //Check count must be a multiple of 3 plus 1
    if(count % 3 != 1) {
        printf("Error in number of bases--must be a multiple of 3 and terminated by newline");
        exit(4);
    }
    //Sort the array of structs using qsort();
    qsort(c, 64, sizeof(struct codon), compare);
    for(int i=0;i<64;i++) {
        if(c[i].cnt > 0) {
            printf("%d %s\n",c[i].cnt,c[i].name);
        }
    }
    fclose(fp);
    return 0;
}

8. Save the above program as codonhistogram.c

9. Compie the program by:

gcc codonhistogram.c -o codonhistogram

10. Create a file named dnafile.txt in the same folder as the codonhistogram.c with the following contents:

aacgtttgtaaccagaactgt
    

Make sure that there is a newline at the end of the string as specified in the question and as shown above.

11. Run the following command:

./codonhistogram dnafile.txt

Output:

3 aac                                                                                                                         
2 tgt                                                                                                                         
1 cag                                                                                                                         
1 gtt

Related Solutions

A DNA strand is represented by a string of the characters A, C, G, and T,...
A DNA strand is represented by a string of the characters A, C, G, and T, each of which represents a nucleotide. Each nucleotide has its complement as indicated by this Ruby hash: NUCLEOTIDE_COMPLEMENT = { 'A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C' } The reverse-complement of a DNA string is a new string in which each nucleotide is replaced by its complement and the string is reversed. Reverse-complements are important in bioinformatics since a...
The sequence of nitrogenous bases in an RNA and DNA strand is always written in the...
The sequence of nitrogenous bases in an RNA and DNA strand is always written in the 5? to 3? direction because __________. 1. each strand of RNA or DNA has an unlinked 3? carbon and an unlinked 5? carbon 2. DNA and RNA are synthesized in this direction in cells 3. DNA strands are antiparallel 4. nucleotides are added to the 5? end of the nucleic acid
1. Compute for the A + C : T + G ratios of the given DNA...
1. Compute for the A + C : T + G ratios of the given DNA strand in "A" 5' A A C C T A C T C C T G C G T G G G T G T C T T 3' 3' T T G G A T G A G G A C G C A C C C A C A G A A 5' 2. What happens if any of the protein for...
Suppose the DNA bases in a gene sequence follow the distribution: DNA base Probability A 1/3...
Suppose the DNA bases in a gene sequence follow the distribution: DNA base Probability A 1/3 C θ G 1/3 T 1/3 - θ In an experiment, the number of observed bases that are “A” or “C” in a gene sequence is x, and the number of observed bases that are “G” or “T” is y. The EM method is used to find the best value for the parameter θ. Describe the Expectation step for computing the expected numbers of...
We would like to align two DNA sequences: (v) C G A T A C T,...
We would like to align two DNA sequences: (v) C G A T A C T, and (w) G A T T C G T : i) s(i, j) = 1.5 if vi = wj (matches); ii) s(i, j) = -1.0 if vi != wj (mismatches); iii) d = 0.25 (indels: insertions or deletions). What would be the maximum alignment score? Explain how you get the result.
Biologists use a sequence of the letters A, C, T, and G to model a genome....
Biologists use a sequence of the letters A, C, T, and G to model a genome. A gene is a substring of a genome that starts after a triplet ATG and ends before a triplet TAG, TAA, or TGA. Furthermore, the length of a gene string is a multiple of 3, and the gene does not contain any of the triplets ATG, TAG, TAA, or TGA. Write a program that prompts the user to enter a genome and displays all...
3` - T A T A G A G C A A T T G C...
3` - T A T A G A G C A A T T G C T A C G T G T A T C C C G A G A C T C C G T A A – 5` 5` - A T A T C T C G T T A A C G A T G C A C A T A G G G C T C T G A G G C A...
1. In DNA, the type of bonds that form the base pairing between the bases, A::T...
1. In DNA, the type of bonds that form the base pairing between the bases, A::T and C::G are? a. ionic bonds b. peptide bonds c. H-bonds d. covalent bonds 2. If you wanted to build a short DNA strand, which of the following nucleotides would you not use? Select one: a. dTTP b. dUTP c. dGTP d. dCTP
Given the following prokaryote sequence of DNA: Identify if any promoter sequences are present in DNA...
Given the following prokaryote sequence of DNA: Identify if any promoter sequences are present in DNA SEQ1 (Highlight/Underline these in the sequence) Predict the mRNA sequence that would arise from this DNA sequence. Identify the ribosome binding site, start and stop codons (if present). (Highlight/Underline these in the sequence). Predict the sequence of the peptide that would arise from the predicted mRNA sequence using the genetic code table (see Codon usage table). Based on your knowledge of the properties of...
Describe DNA denaturation, renaturation (annealing) and how base composition (A & T v. C & G)...
Describe DNA denaturation, renaturation (annealing) and how base composition (A & T v. C & G) affects DNA stability
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT