In: Computer Science
Question no 1:
The d-neighborhood Neighbors(Pattern, d) is the set of all k-mers
whose Hamming distance from Pattern does not exceed d.
Generate the d-Neighborhood of a String
Find all the neighbors of a pattern.
Given: A DNA string Pattern and an integer d.
Return: The collection of strings Neighbors(Pattern, d).
Sample Dataset
ACG
1
Sample Output
CCG
TCG
GCG
AAG
ATG
AGG
ACA
ACC
ACT
ACG
Question no 2:
We say that a k-mer Pattern appears as a substring of Text with at
most d mismatches if there is some k-mer substring Pattern' of Text
having d or fewer mismatches with Pattern, i.e.,
HammingDistance(Pattern, Pattern') ≤ d. Our observation that a DnaA
box may appear with slight variations leads to the following
generalization of the Pattern Matching Problem.
Approximate Pattern Matching Problem
Find all approximate occurrences of a pattern in a string.
Given: Strings Pattern and Text along with an integer d.
Return: All starting positions where Pattern appears as a substring
of Text with at most d mismatches.
Sample Dataset
ATTCTGGA
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAATGCCTAGCGGCTTGTGGTTTCTCCTACGCTCC
3
Sample Output
6 7 26 27 78
Question no 3:
We defined a mismatch in “Compute the Hamming Distance Between Two
Strings”. We now generalize “Find the Most Frequent Words in a
String” to incorporate mismatches as well.
Given strings Text and Pattern as well as an integer d, we define
Countd(Text, Pattern) as the total number of occurrences of Pattern
in Text with at most d mismatches. For example,
Count1(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4 because AAAAA appears
four times in this string with at most one mismatch: AACAA, ATAAA,
AAACA, and AAAGA. Note that two of these occurrences overlap.
A most frequent k-mer with up to d mismatches in Text is simply a
string Pattern maximizing Countd(Text, Pattern) among all k-mers.
Note that Pattern does not need to actually appear as a substring
of Text; for example, AAAAA is the most frequent 5-mer with 1
mismatch in AACAAGCTGATAAACATTTAAAGAG, even though AAAAA does not
appear exactly in this string. Keep this in mind while solving the
following problem.
Frequent Words with Mismatches Problem
Find the most frequent k-mers with mismatches in a string.
Given: A string Text as well as integers k and d.
Return: All most frequent k-mers with up to d mismatches in
Text.
Sample Dataset
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1
Sample Output
GATG ATGC ATGT
Question no 4
We now extend “Find the Most Frequent Words with Mismatches in a
String” to find frequent words with both mismatches and reverse
complements. Recall that Pattern refers to the reverse complement
of Pattern.
Frequent Words with Mismatches and Reverse Complements
Problem
Find the most frequent k-mers (with mismatches and reverse
complements) in a DNA string.
Given: A DNA string Text as well as integers k and d.
Return: All k-mers Pattern maximizing the sum Countd(Text, Pattern)
+ Countd(Text, Pattern) over all possible k-mers.
Sample Dataset
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1
Sample Output
ATGT ACAT
1.
def Suffix(Pattern):
return Pattern[1:]
def FirstSymbol(Pattern):
return Pattern[0]
def HammingDistance(str1, str2):
hd = 0
for (x, y) in zip(str1, str2):
if x != y:
hd = hd + 1
return hd
def Neighbors(Pattern, d):
nucleotide = ['A', 'C', 'G', 'T']
if d==0:
return [Pattern]
if len(Pattern)==1:
return ['A', 'C', 'G', 'T']
Neighborhood = []
SuffixNeighbors = Neighbors(Suffix(Pattern),d)
for SuffixNeighbor in SuffixNeighbors:
if HammingDistance(Suffix(Pattern), SuffixNeighbor)<d:
for x in nucleotide:
Neighborhood.append(x + SuffixNeighbor)
else:
Neighborhood.append(FirstSymbol(Pattern)+SuffixNeighbor)
return Neighborhood
2.
def Text(DNA, i, n):
return DNA[i:i + n]
def HammingDistance(str1, str2):
hd = 0
for (x, y) in zip(str1, str2):
if x != y:
hd = hd + 1
return hd
def ApproximatePatternMatching(Genome, Pattern, d):
positions = []
for i in range(0, len(Genome) - len(Pattern) + 1):
Patterndash = Text(Genome, i, len(Pattern))
if (HammingDistance(Pattern, Patterndash) <= d):
positions.append(i)
return positions
3.
def NumberToSymbol(index):
dict = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
return dict[index]
def Quotient(m,n):
return m//n
def Remainder(m,n):
return m%n
def NumberToPattern(index, k):
if k == 1:
return NumberToSymbol(index)
prefixIndex = Quotient(index, 4)
r = Remainder(index,4)
symbol = NumberToSymbol(r)
PrefixPattern = NumberToPattern(prefixIndex, k-1)
return PrefixPattern+symbol
def createKmers(k):
kmers = []
for i in range (0, pow(4,k)):
kmer = NumberToPattern(i, k)
kmers.append(kmer)
return kmers
def Text(DNA, i, n):
return DNA[i:i + n]
def HammingDistance(str1, str2):
hd = 0
for (x, y) in zip(str1, str2):
if x != y:
hd = hd + 1
return hd
def ApproximatePatternCount(Genome, Pattern, d):
positions = []
for i in range(0, len(Genome) - len(Pattern) + 1):
Patterndash = Text(Genome, i, len(Pattern))
if (HammingDistance(Pattern, Patterndash) <= d):
positions.append(i)
return len(positions)
def FrequentWordsWithMismatches(pattern, k, d):
kmers = createKmers(k)
Count = []
res = []
for kmer in kmers:
Count.append(ApproximatePatternCount(pattern, kmer, d))
maxCount = max(Count)
for i in range(0, len(Count)):
if Count[i] == maxCount:
res.append(kmers[i])
return res
4.
def ReverseComplement(DNAstr):
RevCompStr = ""
for nucleotide in DNAstr:
if nucleotide == 'a' or nucleotide == 'A':
complement = 'T'
elif nucleotide == 't' or nucleotide == 'T':
complement = 'A'
elif nucleotide == 'g' or nucleotide == 'G':
complement = 'C'
elif nucleotide == 'c' or nucleotide == 'C':
complement = 'G'
RevCompStr = RevCompStr + complement
RevCompStr = RevCompStr[::-1]
return RevCompStr
def NumberToSymbol(index):
dict = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
return dict[index]
def Quotient(m,n):
return m//n
def Remainder(m,n):
return m%n
def NumberToPattern(index, k):
if k == 1:
return NumberToSymbol(index)
prefixIndex = Quotient(index, 4)
r = Remainder(index,4)
symbol = NumberToSymbol(r)
PrefixPattern = NumberToPattern(prefixIndex, k-1)
return PrefixPattern+symbol
def createKmers(k):
kmers = []
for i in range (0, pow(4,k)):
kmer = NumberToPattern(i, k)
kmers.append(kmer)
return kmers
def createKmersRC(kmers):
kmersRC = []
for kmer in kmers:
kmersRC.append(ReverseComplement(kmer))
return kmersRC
def Text(DNA, i, n):
return DNA[i:i + n]
def HammingDistance(str1, str2):
hd = 0
for (x, y) in zip(str1, str2):
if x != y:
hd = hd + 1
return hd
def ApproximatePatternCount(Genome, Pattern, d):
positions = []
for i in range(0, len(Genome) - len(Pattern) + 1):
Patterndash = Text(Genome, i, len(Pattern))
if (HammingDistance(Pattern, Patterndash) <= d):
positions.append(i)
return len(positions)
def FrequentWordsWithMismatchesAndReverseComplements(text, k , d):
kmers = createKmers(k)
kmersRC = createKmersRC(kmers)
Countkmers = []
CountkmersRC = []
CountTotal = []
res = []
for kmer in kmers:
Countkmers.append(ApproximatePatternCount(text, kmer, d))
for kmer in kmersRC:
CountkmersRC.append(ApproximatePatternCount(text, kmer, d))
for i in range (0, len(kmers)):
CountTotal.insert(i, Countkmers[i] + CountkmersRC[i])
maxCount = max(CountTotal)
for i in range(0, len(CountTotal)):
if CountTotal[i] == maxCount:
res.append(kmers[i])
return res