Question

Problem Statement

You have the mRNA sequence that results from the transcription of the Homo sapiens Hemoglobin subunit beta gene. Knowing that the 5' and 3' ends of the mRNA are processed post-transcriptionally, you know that the start codon and termination codon lie somewhere inside the sequence. A manual inspection of the mRNA sequence should reveal the locations of the start and stop codons, but to ensure you don't miss anything you decide to write a Python script to analyze the mRNA sequence and find the positions of both codons.

You have the mRNA sequence, in the 5' to 3' direction, in a text file:

acauuugcuucugacacaacuguguucacuagcaaccucaaacagacaccauggugcauc
ugacuccugaggagaagucugccguuacugcccuguggggcaaggugaacguggaugaag
uugguggugaggcccugggcaggcugcugguggucuacccuuggacccagagguucuuug
aguccuuuggggaucuguccacuccugaugcuguuaugggcaacccuaaggugaaggcuc
auggcaagaaagugcucggugccuuuagugauggccuggcucaccuggacaaccucaagg
gcaccuuugccacacugagugagcugcacugugacaagcugcacguggauccugagaacu
ucaggcuccugggcaacgugcuggucugugugcuggcccaucacuuuggcaaagaauuca
ccccaccagugcaggcugccuaucagaaagugguggcugguguggcuaaugcccuggccc
acaaguaucacuaagcucgcuuucuugcuguccaauuucuauuaaagguuccuuuguucc
cuaaguccaacuacuaaacugggggauauuaugaagggccuugagcaucuggauucugcc
uaauaaaaaacauuuauuuucauugcaa

From the lecture you know that the canonical start codon is AUG, and you know the 3 stop codons are UAA, UAG, and UGA.

Requirements

We have covered enough Python to accomplish this task. The basic idea is to store the mRNA sequence as a string value, then take advantage of the string's find() function to locate the start codon and the first stop codon. We haven't covered how to read data from a file in Python yet, but you can copy and paste the sequence into a script.

Your script's output should follow the template shown below:

Homo sapiens HBB mRNA:
acauuugcuucugacacaacuguguucacuagcaaccucaaacagacaccauggugcauc
ugacuccugaggagaagucugccguuacugcccuguggggcaaggugaacguggaugaag
uugguggugaggcccugggcaggcugcugguggucuacccuuggacccagagguucuuug
aguccuuuggggaucuguccacuccugaugcuguuaugggcaacccuaaggugaaggcuc
auggcaagaaagugcucggugccuuuagugauggccuggcucaccuggacaaccucaagg
gcaccuuugccacacugagugagcugcacugugacaagcugcacguggauccugagaacu
ucaggcuccugggcaacgugcuggucugugugcuggcccaucacuuuggcaaagaauuca
ccccaccagugcaggcugccuaucagaaagugguggcugguguggcuaaugcccuggccc
acaaguaucacuaagcucgcuuucuugcuguccaauuucuauuaaagguuccuuuguucc
cuaaguccaacuacuaaacugggggauauuaugaagggccuugagcaucuggauucugcc
uaauaaaaaacauuuauuuucauugcaa

Translation start: <position of first AUG codon>

Translation Stop: <position of first stop codon after the start codon>
<stop codon> found at position <Translation stop>

# of amino acids in the HBB protein: <number of amino acids encoded from Translation start to Translation stop>

Commenting

Be sure to comment your code meaningfully! This not only helps you to understand your code, it also helps me understand your thought processes, which is important for awarding partial credit when necessary. Commenting is also one of the rubric items, so if you do not comment your code you will lose points.

Data Storage in the Script

How you store the data inside a script is very important. In general, you want to minimize hardcoding data values, especially if they will be used repeatedly. "Hardcoding" means to use the literal data value in your code instead of storing it in a variable. Every place where a data value is hardcoded represents a potential source of error. If that data value has to be changed, and it is hardcoded, every instance of that value in the script must be changed to avoid errors. If you instead store the value in a variable, and use the variable name in the script instead of the data value itself, you only have to change the data value once, where the variable is initialized.

With this in mind, you should store the following initial data at the top of your script, to be used later:

  1. The mRNA sequence should be stored as a single-line string with no whitespace or line feed characters, in a variable named "HBB_CDS" (short for HBB CoDing Sequence). Although Python has a syntax for storing multiline strings, do not use it, since the line feed characters will be included when you perform searches on the sequence.
  2. The codon length, 3, should be stored in a variable named "Codon_length".
  3. The value of the start codon, "aug", should be stored in a variable named "Start_codon".
  4. The 3 stop codons, "uaa", "uag", and "uga", should be stored in a list. You could use 3 separate variables and store each stop codon separately, but using a list only requires 1 variable, and you can use list indexing to retrieve individual values.
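A minimal sketch of how the top of the script could look under these requirements. The list name Stop_codons is an assumption (the assignment does not name the list). The sequence is written here as adjacent string literals inside parentheses, which Python joins into one string with no line feeds; if a single physical line is required, paste the sequence as one long literal instead.

```python
# Initial data, defined once at the top of the script.
# Adjacent string literals are concatenated by Python, so HBB_CDS
# holds one continuous string with no whitespace or line feeds.
HBB_CDS = (
    "acauuugcuucugacacaacuguguucacuagcaaccucaaacagacaccauggugcauc"
    "ugacuccugaggagaagucugccguuacugcccuguggggcaaggugaacguggaugaag"
    "uugguggugaggcccugggcaggcugcugguggucuacccuuggacccagagguucuuug"
    "aguccuuuggggaucuguccacuccugaugcuguuaugggcaacccuaaggugaaggcuc"
    "auggcaagaaagugcucggugccuuuagugauggccuggcucaccuggacaaccucaagg"
    "gcaccuuugccacacugagugagcugcacugugacaagcugcacguggauccugagaacu"
    "ucaggcuccugggcaacgugcuggucugugugcuggcccaucacuuuggcaaagaauuca"
    "ccccaccagugcaggcugccuaucagaaagugguggcugguguggcuaaugcccuggccc"
    "acaaguaucacuaagcucgcuuucuugcuguccaauuucuauuaaagguuccuuuguucc"
    "cuaaguccaacuacuaaacugggggauauuaugaagggccuugagcaucuggauucugcc"
    "uaauaaaaaacauuuauuuucauugcaa"
)
Codon_length = 3                      # bases per codon
Start_codon = "aug"                   # canonical start codon
Stop_codons = ["uaa", "uag", "uga"]   # list name is an assumption

# Individual stop codons are retrieved by list index (0, 1, 2).
print("First stop codon in the list:", Stop_codons[0])
```

Every later step can then refer to these names instead of repeating the literal values.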

Use of Upper vs. Lower Case

Whether you use upper case or lower case for the sequence data and codons is entirely up to you. Just be sure that you are consistent throughout your script.

Displaying the mRNA Sequence

The first item in the output is the display of the mRNA sequence itself. On one line you should display the species, Homo sapiens, followed by the abbreviation ("HBB") of the gene. Below this line the mRNA sequence itself should be displayed at 60 bases per line, which is the same convention used by GenBank. You do not need to include numbering or spaces every 10 bases like GenBank does, however.

Hint: Use a for loop combined with the range function with an increment of 60, and print each line as a slice, or substring, of 60 bases beginning with the current position in the loop.
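The hint above can be sketched as follows. The function name wrap_sequence and the stand-in sequence demo_seq are hypothetical, chosen only for illustration; in the real script the loop would run over HBB_CDS.

```python
def wrap_sequence(seq, width=60):
    """Return seq split into consecutive slices of at most `width` bases."""
    lines = []
    # range() steps through the start index of each line: 0, 60, 120, ...
    for i in range(0, len(seq), width):
        # Slicing past the end of a string is safe; the last line is just shorter.
        lines.append(seq[i:i + width])
    return lines

demo_seq = "acgu" * 35  # 140-base stand-in, not the real HBB sequence

print("Homo sapiens HBB mRNA:")
for line in wrap_sequence(demo_seq):
    print(line)
```

Because a slice that runs past the end of the string simply stops at the end, no special case is needed for the final, shorter line.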

Finding and Displaying the Position of the Translation Start Codon

The translation start codon will be the first occurrence of the canonical start codon, AUG, as the mRNA is read from left to right. To find it, you can use the string's find() function, which we covered in the Module 1 Python lecture. One important thing to keep in mind is that Python string indexing is 0-based, meaning the first base in the mRNA is at position 0, not 1. When you display the position of the start codon you must remember to add 1 to the position returned by the find() function, since we read nucleotide sequences as 1-based, with the first base at position 1.

Caution: Do not add 1 to the position of the codon when you store it, or you will run the risk of error when you use the position for searches, etc. Only add 1 to the position when you are displaying the codon's position; e.g.:

print("Translation start:", start_codon_pos + 1)

In the example above, the variable start_codon_pos is not changed. The expression start_codon_pos + 1 is evaluated to a new, temporary value that is passed as an argument to the print() function, and that temporary value is discarded once print() is done.
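As a small illustration of the 0-based versus 1-based distinction, here is a sketch that uses the short example sequence gggaugacccagaaauaa rather than the full HBB mRNA. The variable name example_mrna is an assumption made for this sketch.

```python
example_mrna = "gggaugacccagaaauaa"  # short stand-in sequence
Start_codon = "aug"

# find() returns the 0-based index of the first match, or -1 if there is none.
start_codon_pos = example_mrna.find(Start_codon)

if start_codon_pos == -1:
    print("No start codon found")
else:
    # Add 1 only for display; the stored index stays 0-based.
    print("Translation start:", start_codon_pos + 1)

# The stored value is unchanged and can still be used for slicing.
print("Codon at stored index:", example_mrna[start_codon_pos:start_codon_pos + 3])
```

Checking for -1 is optional for this assignment, but it guards against silently using a bad index if the sequence has no AUG.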

Finding and Displaying the Position of the Translation Stop Codon

There are 3 possible stop codons, UAA, UAG, and UGA, and any one of these will signal translation to terminate. You can find the stop codon using a similar approach to finding the start codon. There are a few things to bear in mind, however:

  1. Translation begins at the position of the first AUG codon, so the stop codon must come after the start codon.
  2. You don't know beforehand which stop codon will be the first one encountered, so you must check for all 3 of them. Whichever of the 3 stop codons occurs first after the start codon will be the one that terminates protein synthesis.
  3. Translation reads the mRNA as codons, not individual bases, and codons do not overlap each other. Therefore, when you look for the stop codon you must read the sequence 1 codon, or 3 bases, at a time, with the first codon being the one immediately following the start codon. So if you have the following sequence:

    gggaugacccagaaauaa

    the start codon is at position 4 (in a Python string it will be position 3 since strings are 0-based). Reading the sequence 1 codon at a time to find the stop codon would result in the sequence being read as follows:

    ggg aug acc cag aaa uaa

    The stop codon, UAA, would thus be found at position 16 (index 15 in the Python string).

    If you were to read the sequence one base at a time instead of one codon at a time, you would find a stop codon at position 5 (index 4 in the Python string), which is incorrect.

Be sure to store the position of the stop codon in a variable so you can display it after you have found it.

Hint: This is another good use of a loop with the range function. The range function should begin at the first codon after the start codon, and use an increment of 3 to read the sequence one codon at a time. Inside the loop use an if-elif-elif block to check for each of the stop codons. The stop codons are stored in a list, so you can use list indexes (0, 1, 2) to access individual stop codons. Once a stop codon is found, use the break statement to terminate the loop immediately. Don't forget to store the position of the stop codon in a variable, since you will need to display it.
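One way the hint above might translate into code, sketched on the short example sequence gggaugacccagaaauaa rather than the full HBB mRNA. The names example_mrna, Stop_codons, stop_codon_pos, and stop_codon are assumptions made for this sketch.

```python
example_mrna = "gggaugacccagaaauaa"  # short stand-in sequence
Codon_length = 3
Start_codon = "aug"
Stop_codons = ["uaa", "uag", "uga"]

start_codon_pos = example_mrna.find(Start_codon)  # 0-based index of AUG

stop_codon_pos = -1   # -1 means "not found yet", mirroring find()
stop_codon = ""

# Begin at the codon right after the start codon and step one codon at a time,
# so only codons in the same reading frame as AUG are examined.
for i in range(start_codon_pos + Codon_length, len(example_mrna), Codon_length):
    codon = example_mrna[i:i + Codon_length]
    if codon == Stop_codons[0]:
        stop_codon_pos = i
        stop_codon = Stop_codons[0]
        break
    elif codon == Stop_codons[1]:
        stop_codon_pos = i
        stop_codon = Stop_codons[1]
        break
    elif codon == Stop_codons[2]:
        stop_codon_pos = i
        stop_codon = Stop_codons[2]
        break

# Index 15 is displayed as position 16 (add 1 only for display).
print("Translation Stop:", stop_codon_pos + 1)
print(stop_codon, "found at position", stop_codon_pos + 1)
```

The out-of-frame "uga" that starts at index 4 is never examined, because the loop only visits indexes 6, 9, 12, and 15.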

Calculating and Displaying the Number of Amino Acids in the HBB Protein

Once you have the positions of both the start and stop codons, you can calculate how many amino acids are encoded by the HBB mRNA. Keep in mind that the difference between the positions of the start and stop codons gives the length of the coding region in bases, not codons, and that the number of amino acids is equal to the number of codons.

Hint: The math involved here is pretty straightforward, but dividing with Python's / operator always gives a floating point result, even when the numbers divide evenly. To convert the floating point value to an integer, use Python's int() function:

int_value = int(floating_point_value)
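A worked sketch of this calculation, using index values that correspond to the short example sequence gggaugacccagaaauaa (start codon at index 3, in-frame stop codon at index 15). The variable names coding_bases and num_amino_acids are assumptions made for this sketch.

```python
Codon_length = 3
start_codon_pos = 3    # 0-based index of "aug" in gggaugacccagaaauaa
stop_codon_pos = 15    # 0-based index of the in-frame "uaa"

# Bases from the first base of the start codon up to, but not including,
# the stop codon. The stop codon itself does not encode an amino acid.
coding_bases = stop_codon_pos - start_codon_pos

# The / operator returns a float (12 / 3 is 4.0), so convert with int().
num_amino_acids = int(coding_bases / Codon_length)

print("# of amino acids in the example peptide:", num_amino_acids)
```

The four codons counted here are aug, acc, cag, and aaa. Floor division (coding_bases // Codon_length) would give the same integer directly, but the int() form follows the hint above.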

Solutions

Expert Solution

Input:

rna.py:

def getString(filename): #fetch sequence from file
    with open(filename) as f: #safely open file
        return "".join(f.read().split()) #drop line feeds and spaces so codons are not broken up

def displaySequence(): #display sequence at 60 bases per line
    for i in range(0, len(mrna), 60): #Iterate over range 0 - len of string with stepsize:60
        print(mrna[i:i+60]) #Slice 60 characters and print

def getStartCodon():
    return mrna.find("aug") #return 0-based index of the first start codon (-1 if absent)

def getStopCodon():
    for j in range(startCodonIndex + 3, len(mrna) - 2, 3): #Read in frame, one codon at a time, starting after the start codon
        if mrna[j:j+3] in ["uaa", "uag", "uga"]: #Check if the codon matches a stop codon
            return [j, mrna[j:j+3]] #return 0-based index and codon of the first in-frame stop
    raise ValueError("no in-frame stop codon found")

def getNumberofAmino():
    return (stopCodonIndex - startCodonIndex) // 3 #Codons from the start codon up to, but not including, the stop codon

if __name__ == "__main__": #Starting point of program
    mrna = getString("mRna.txt").lower() #get file stored in the same directory as code
    startCodonIndex = getStartCodon() #get index of start codon
    if startCodonIndex == -1: #find() returns -1 when there is no match
        raise SystemExit("no start codon found")
    stopCodonIndex, stopCodon = getStopCodon() #get index and name of stop codon
    print("Homo sapiens HBB mRNA:")
    displaySequence() #Display mRNA sequence
    print("\nTranslation start: {}\n".format(startCodonIndex + 1)) #add 1 only for display
    print("Translation Stop: {}".format(stopCodonIndex + 1))
    print("{} found at position {}\n".format(stopCodon, stopCodonIndex + 1))
    print("# of amino acids in the HBB protein: {}".format(getNumberofAmino()))

The mRNA sequence is stored in a text file, mRna.txt, which must be in the same directory as the code.

Explanation:

  • The given code has a total of 5 functions: getString(), displaySequence(), getStartCodon(), getStopCodon(), getNumberofAmino()
  • getString(): This function takes a filename as a parameter and returns the contents of the file, which is the mRNA sequence.
  • displaySequence(): This function displays the sequence. It iterates over the sequence with a step size of 60 and prints one slice of 60 bases per line.
  • getStartCodon(): This function finds the first occurrence of the start codon using the find() method.
  • getStopCodon(): This function loops from the codon after the start codon to the end of the string, reading 3 bases at a time, and reports the first stop codon found in that reading frame.
  • getNumberofAmino(): This function determines the number of codons between the start codon and the stop codon.

Note: Please follow indentations carefully!

