Practical electronics strongly favors binary representations for encoding information (two-state; 0 or 1). In DNA, biology uses a quaternary representation of four different nucleobases (four-state; cytosine [C], guanine [G], adenine [A] or thymine [T]) to code for the amino acids with which all the proteins in our bodies are constructed. All proteins are long chemical chains assembled from 20 different types of amino acids.
a) What is the minimum number of nucleobase “digits” required to code for the 20 different amino acids? (Groups of nucleotides, each coding for a single amino acid in a protein chain, are called “codons” by biologists.) (Hint: How many four-state "digits" are needed to represent 20 unique things?)
b) A binary representation is used to store the human genome on a digital computer. How many bits are required to represent each nucleotide in the genetic code? (Note: There is one nucleobase in a nucleotide)
c) The DNA in a human genome consists of about 3 billion (3,000,000,000) nucleotides (base pairs). How many bits are required to store this information in a binary format?
d) The memory in a computer is usually organized in groups of 8 bits, called “Bytes.” How many bytes of memory are required to store your genome in a digital computer?
(a) The 20 different amino acids are specified by combinations of three nucleotides, each drawn from the four possible bases.
The number of different bases in the genome = four.
So, a single nucleotide can code for only four different things.
So, two nucleotides can code in $4^2 = 16$ ways, which is not sufficient. Anticodons should be able to differentiate at least twenty amino acids.
Three nucleotides can code in $4^3 = 64$ ways. So, three is sufficient.
This set of three nucleotides is called a codon when it is in messenger RNA (mRNA) and an anticodon when it is in transfer RNA (tRNA).
The amino acids are joined at the ribosome like colored beads on a string, forming a chain that eventually folds into a protein. The order of the "beads" is determined by the order of the codons carried by the mRNA.
So, the codons should be three nucleotides long.
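The counting argument in part (a) can be checked with a short sketch. The function name `min_digits` is made up for this illustration; it simply finds the smallest number of base-4 "digits" whose combinations cover 20 items.

```python
def min_digits(n_items: int, base: int) -> int:
    """Return the smallest k such that base**k >= n_items."""
    if n_items < 1 or base < 2:
        raise ValueError("need n_items >= 1 and base >= 2")
    k = 0
    capacity = 1  # base**0: zero digits can label exactly one item
    while capacity < n_items:
        capacity *= base  # one more digit multiplies the capacity by base
        k += 1
    return k


# 20 amino acids, 4 possible bases per position
codon_length = min_digits(20, 4)
print(codon_length)
```

Using integer multiplication instead of a floating-point logarithm avoids rounding surprises at exact powers of the base.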
(b) There are four possible nucleobases and $2^2 = 4$, so each nucleobase needs 2 bits.
Each nucleotide contains one nucleobase. So, we need 2 bits to represent each nucleotide in the genetic code.
(c) This information can be stored in a 2-bit representation. Each base pair takes 2 bits, which give $2^2 = 4$ combinations: one can use 00, 01, 10, 11 for T, G, C, and A.
Number of base pairs in the human genome = 3 billion, as given in the question (2.9 billion is a slightly lower estimate).
So, the number of required bits = 2 × 3 billion = $6 \times 10^9$ bits. Using 2.9 billion instead gives $5.8 \times 10^9$ bits, which is about 691 MiB.
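The arithmetic in part (c) can be reproduced with a small sketch. The helper `genome_bits` is a name invented here, and the two genome sizes are the 3 billion from the question and the 2.9 billion used in the answer.

```python
def genome_bits(n_bases: int, alphabet_size: int = 4) -> int:
    """Bits needed to store n_bases symbols with a fixed-width code."""
    # (alphabet_size - 1).bit_length() is the exact integer ceil(log2(alphabet_size))
    bits_per_base = (alphabet_size - 1).bit_length()
    return n_bases * bits_per_base


for n in (3_000_000_000, 2_900_000_000):
    bits = genome_bits(n)
    n_bytes = bits // 8
    mib = n_bytes / 2**20  # 1 MiB = 1,048,576 bytes
    print(f"{n:,} bases -> {bits:,} bits = {n_bytes:,} bytes (~{mib:.0f} MiB)")
```

Note that "MB" is ambiguous: 725,000,000 bytes is 725 MB in decimal units but about 691 in units of $2^{20}$ bytes.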
(d) To represent a DNA sequence on a computer, all 4 possible bases should be represented in binary format (0 and 1). One byte consists of 8 bits. Using a minimum of 2 bits we can denote each base pair, which provides 4 different bit combinations (00, 01, 10, and 11). One DNA base pair is represented by a pair of bits. So, one byte can represent 4 DNA base pairs. In order to represent the entire human genome of 3 billion base pairs in terms of bytes:
$(6 \times 10^9) / 8$ bytes = $0.75 \times 10^9$ bytes = 750 megabytes (about 715 MiB).
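To make the "4 base pairs per byte" idea concrete, here is a minimal sketch of packing a DNA string at 2 bits per base. It uses the 00, 01, 10, 11 assignment for T, G, C, A from part (c). The names `pack`, `unpack` and `packed_size` are hypothetical, and real genome file formats may differ.

```python
# 2-bit code from part (c): T=00, G=01, C=10, A=11
CODE = {"T": 0b00, "G": 0b01, "C": 0b10, "A": 0b11}
BASES = "TGCA"  # index i is the base whose code is i


def packed_size(n_bases: int) -> int:
    """Bytes needed for n_bases at 4 bases per byte (rounded up)."""
    return (n_bases + 3) // 4


def pack(seq: str) -> bytes:
    """Pack a string of T/G/C/A into bytes, first base in the high bits."""
    out = bytearray(packed_size(len(seq)))
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # bit positions 6, 4, 2, 0 within the byte
        out[i // 4] |= CODE[base] << shift
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(n_bases)
    )


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

The sequence length has to be stored separately, because the last byte may be only partly filled. With this scheme, `packed_size(3_000_000_000)` is 750,000,000 bytes.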
Hope this helps.