Problem Statement
You have the mRNA sequence that results from the transcription of the Homo sapiens Hemoglobin subunit beta gene. Knowing that the 5' and 3' ends of the mRNA are processed post-transcriptionally, you know that the start codon and termination codon lie somewhere inside the sequence. A manual inspection of the mRNA sequence should reveal the locations of the start and stop codons, but to ensure you don't miss anything you decide to write a Python script to analyze the mRNA sequence and find the positions of both codons. You have the mRNA sequence, in the 5' to 3' direction, in a text file:
acauuugcuucugacacaacuguguucacuagcaaccucaaacagacaccauggugcauc

From the lecture you know that the canonical start codon is AUG, and you know the 3 stop codons are UAA, UAG, and UGA.
Requirements
We have covered enough Python to accomplish this task. The basic idea is to store the mRNA sequence as a string value, then take advantage of the string's find() function to locate the start codon and the first stop codon. We haven't covered how to read data from a file in Python yet, but you can copy and paste the sequence into a script. Your script's output should follow the template shown below:

Homo sapiens HBB mRNA:
Translation start: <position of first AUG codon>
Translation Stop: <position of first stop codon after the start codon>
# of amino acids in the HBB protein: <number of amino acids encoded from Translation start to Translation stop>

Commenting
Be sure to comment your code meaningfully! This not only helps you to understand your code, it also helps me understand your thought processes, which is important for awarding partial credit when necessary. Commenting is also one of the rubric items, so if you do not comment your code you will lose points.

Data Storage in the Script
How you store the data inside a script is very important. In general, you want to minimize hardcoding data values, especially if they will be used repeatedly. "Hardcoding" means to use the literal data value in your code instead of storing it in a variable. Every place where a data value is hardcoded represents a potential source of error. If that data value has to be changed, and it is hardcoded, every instance of that value in the script must be changed to avoid errors. If you instead store the value in a variable, and use the variable name in the script instead of the data value itself, you only have to change the data value once, where the variable is initialized. With this in mind, you should store the following initial data at the top of your script, to be used later:
Use of Upper vs. Lower Case
Whether you use upper case or lower case for the sequence data and codons is entirely up to you. Just be sure that you are consistent throughout your script.

Displaying the mRNA Sequence
The first item in the output is the display of the mRNA sequence itself. On one line you should display the species, Homo sapiens, followed by the abbreviation ("HBB") of the gene. Below this line the mRNA sequence itself should be displayed at 60 bases per line, which is the same convention used by GenBank. You do not need to include numbering or spaces every 10 bases like GenBank does, however.

Hint: Use a for loop combined with the range function with an increment of 60, and print each line as a slice, or substring, of 60 bases beginning with the current position in the loop.

Finding and Displaying the Position of the Translation Start Codon
The translation start codon will be the first occurrence of the canonical start codon, AUG, as the mRNA is read from left to right. You can use the string's find() function, which we covered in the Module 1 Python lecture, to do this. One important thing to keep in mind is that Python treats strings as 0-based in terms of indexing, meaning the first base in the mRNA is at position 0, not 1. When you display the position of the start codon you must remember to add 1 to the position returned by the find() function, since we read nucleotide sequences as 1-based, with the first base starting at position 1.

Caution: Do not add 1 to the position of the codon when you store it, or you will run the risk of error when you use the position for searches, etc.
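As a rough illustration of the display and start-codon steps, here is a minimal sketch. It runs on a made-up 130-base sequence, not the real HBB mRNA, and the names wrap and LINE_WIDTH are assumptions chosen for this example only.

```python
# Toy sequence of 130 bases, NOT the HBB mRNA (hypothetical data for illustration).
mrna = "ccg" * 20 + "aug" + "gcu" * 22 + "a"
LINE_WIDTH = 60  # bases per displayed line, the GenBank convention


def wrap(sequence, width):
    """Split a sequence into consecutive slices of at most `width` bases."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Print the sequence 60 bases per line; the last line may be shorter.
for line in wrap(mrna, LINE_WIDTH):
    print(line)

# find() returns a 0-based index; store it unchanged for later searches.
start_codon_pos = mrna.find("aug")

# Convert to a 1-based position only at display time.
print("Translation start:", start_codon_pos + 1)
```

Slicing past the end of a string is safe in Python, so the final, shorter line needs no special handling.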
Only add 1 to the position when you are displaying the codon's position; e.g.:

print("Translation start:", start_codon_pos + 1)

In the example above, the variable start_codon_pos is not changed. The expression start_codon_pos + 1 is evaluated into a new, temporary value that is passed as an argument to the print() function, and that value is discarded once print() is done.

Finding and Displaying the Position of the Translation Stop Codon
There are 3 possible stop codons, UAA, UAG, and UGA, and any one of these will signal translation to terminate. You can find the stop codon using a similar approach to finding the start codon. There are a few things to bear in mind, however:
Be sure to store the position of the stop codon in a variable so you can display it after you have found it.

Hint: This is another good use of a loop with the range function. The range function should begin at the first codon after the start codon, and use an increment of 3 to read the sequence one codon at a time. Inside the loop use an if-elif-elif block to check for each of the stop codons. The stop codons are stored in a list, so you can use list indexes (0, 1, 2) to access individual stop codons. Once a stop codon is found, use the break statement to terminate the loop immediately. Don't forget to store the position of the stop codon in a variable, since you will need to display it.

Calculating and Displaying the Number of Amino Acids in the HBB Protein
Once you have the positions of both the start and stop codons, you can calculate how many amino acids are encoded by the HBB mRNA. Keep in mind the positions of the start and stop codons give the length of the mRNA in bases, not codons, but the number of amino acids will always be equal to the number of codons.

Hint: The math involved here is pretty straightforward, but Python will end up giving you a result that is a floating point value. To convert the floating point value to an integer, use Python's int() function:

int_value = int(floating_point_value)
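The stop-codon scan and the amino acid count described above can be sketched roughly as follows. This is a minimal sketch on a made-up sequence, not the real HBB mRNA. The names find_orf, START_CODON, and STOP_CODONS are assumptions for illustration. It also swaps in a list membership test for the if-elif-elif block the hint suggests, and integer division (//) for the int() conversion.

```python
# Data stored once at the top of the script (hypothetical names).
START_CODON = "aug"
STOP_CODONS = ["uaa", "uag", "uga"]


def find_orf(mrna):
    """Return (start, stop, n_amino_acids) with 0-based positions, or None."""
    start = mrna.find(START_CODON)  # -1 when there is no start codon
    if start == -1:
        return None
    # Step through the sequence one codon at a time, in frame with the
    # start codon, beginning at the first codon after it.
    for pos in range(start + 3, len(mrna) - 2, 3):
        if mrna[pos:pos + 3] in STOP_CODONS:
            # Codons from the start codon up to, but not including, the stop codon.
            n_amino_acids = (pos - start) // 3
            return start, pos, n_amino_acids
    return None  # no in-frame stop codon


# Toy sequence, NOT the HBB mRNA: 2-base leader, aug, two codons, uaa, 2-base tail.
toy = "ccaugguggcauaagg"
result = find_orf(toy)
if result is not None:
    start, stop, n_amino_acids = result
    print("Translation start:", start + 1)  # 1-based only for display
    print("Translation Stop:", stop + 1)
    print("# of amino acids:", n_amino_acids)
```

Stepping by 3 from the start codon keeps the scan in the reading frame, so a stop triplet that straddles two codons is ignored. The stop codon itself is not counted because it does not encode an amino acid.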
Input:
rna.py:
START_CODON = "aug" #canonical start codon
STOP_CODONS = ["uaa", "uag", "uga"] #the three stop codons
LINE_WIDTH = 60 #bases per displayed line

def getString(filename): #fetch string from file
    with open(filename) as f: #safely open file
        return "".join(f.read().split()) #return string without newlines or spaces

def displaySequence(): #display sequence
    for i in range(0, len(mrna), LINE_WIDTH): #Iterate over range 0 - len of string with stepsize:60
        print(mrna[i:i+LINE_WIDTH]) #Slice 60 characters and print

def getStartCodon():
    return mrna.find(START_CODON) #return 0-based index of the first start codon, or -1 if there is none

def getStopCodon():
    for j in range(startCodonIndex + 3, len(mrna) - 2, 3): #Read one codon at a time, beginning with the codon after the start codon
        if mrna[j:j+3] in STOP_CODONS: #Check if the codon matches any stop codon
            return j #return 0-based index of the first in-frame stop codon
    return -1 #no in-frame stop codon found

def getNumberofAmino():
    return (stopCodonIndex - startCodonIndex) // 3 #codons from the start codon up to, but not including, the stop codon

if __name__ == "__main__": #Starting point of program
    mrna = getString("mRna.txt").lower() #get file stored in the same directory as code
    startCodonIndex = getStartCodon() #get index of start codon
    stopCodonIndex = getStopCodon() #get index of stop codon
    print("Homo sapiens HBB mRNA:")
    displaySequence() #Display mRNA sequence
    if startCodonIndex == -1 or stopCodonIndex == -1: #nothing to report without both codons
        print("No start codon followed by an in-frame stop codon was found")
    else:
        print("Translation start: {}".format(startCodonIndex + 1)) #add 1 only for display (1-based)
        print("Translation Stop: {}".format(stopCodonIndex + 1)) #add 1 only for display (1-based)
        print("# of amino acids in the HBB protein: {}".format(getNumberofAmino()))
Output:
As per the instructions, I created a file to store the mRNA sequence. This file is stored in the same directory as the code.
Explanation:
Note: Please follow indentations carefully!