In: Computer Science
Dynamic Programming Question. Make the table for finding a LCS (Longest Common Subsequence) of the strings SLWOVNNDK and ALWGQVNBKB. You need to Show the traceback for the LCS.
here string LWVNK is the LCS.
------------------------
The longest subsequence common to R = (GAC), and C = (AGCAT) will be found. Because the LCS function uses a "zeroth" element, it is convenient to define zero prefixes that are empty for these sequences: R0 = Ø; and C0 = Ø. All the prefixes are placed in a table with C in the first row (making it a column header) and R in the first column (making it a row header).
Ø | A | G | C | A | T | |
---|---|---|---|---|---|---|
Ø | Ø | Ø | Ø | Ø | Ø | Ø |
G | Ø | |||||
A | Ø | |||||
C | Ø |
This table is used to store the LCS sequence for each step of the calculation. The second column and second row have been filled in with Ø, because when an empty sequence is compared with a non-empty sequence, the longest common subsequence is always an empty sequence.
LCS(R1, C1) is determined by comparing the first elements in each sequence. G and A are not the same, so this LCS gets (using the "second property") the longest of the two sequences, LCS(R1, C0) and LCS(R0, C1). According to the table, both of these are empty, so LCS(R1, C1) is also empty, as shown in the table below. The arrows indicate that the sequence comes from both the cell above, LCS(R0, C1) and the cell on the left, LCS(R1, C0).
LCS(R1, C2) is determined by comparing G and G. They match, so G is appended to the upper left sequence, LCS(R0, C1), which is (Ø), giving (ØG), which is (G).
For LCS(R1, C3), G and C do not match. The sequence above is empty; the one to the left contains one element, G. Selecting the longest of these, LCS(R1, C3) is (G). The arrow points to the left, since that is the longest of the two sequences.
LCS(R1, C4), likewise, is (G).
LCS(R1, C5), likewise, is (G).
Ø | A | G | C | A | T | |
---|---|---|---|---|---|---|
Ø | Ø | Ø | Ø | Ø | Ø | Ø |
G | Ø | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}Ø | {\displaystyle {\overset {\nwarrow }{\ }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) |
A | Ø | |||||
C | Ø |
For LCS(R2, C1), A is compared with A. The two elements match, so A is appended to Ø, giving (A).
For LCS(R2, C2), A and G do not match, so the longest of LCS(R1, C2), which is (G), and LCS(R2, C1), which is (A), is used. In this case, they each contain one element, so this LCS is given two subsequences: (A) and (G).
For LCS(R2, C3), A does not match C. LCS(R2, C2) contains sequences (A) and (G); LCS(R1, C3) is (G), which is already contained in LCS(R2, C2). The result is that LCS(R2, C3) also contains the two subsequences, (A) and (G).
For LCS(R2, C4), A matches A, which is appended to the upper left cell, giving (GA).
For LCS(R2, C5), A does not match T. Comparing the two sequences, (GA) and (G), the longest is (GA), so LCS(R2, C5) is (GA).
Ø | A | G | C | A | T | |
---|---|---|---|---|---|---|
Ø | Ø | Ø | Ø | Ø | Ø | Ø |
G | Ø | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}Ø | {\displaystyle {\overset {\nwarrow }{\ }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) |
A | Ø | {\displaystyle {\overset {\nwarrow }{\ }}}(A) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) | {\displaystyle {\overset {\nwarrow }{\ }}}(GA) | {\displaystyle {\overset {\ }{\leftarrow }}}(GA) |
C | Ø |
For LCS(R3, C1), C and A do not match, so LCS(R3, C1) gets the longest of the two sequences, (A).
For LCS(R3, C2), C and G do not match. Both LCS(R3, C1) and LCS(R2, C2) have one element. The result is that LCS(R3, C2) contains the two subsequences, (A) and (G).
For LCS(R3, C3), C and C match, so C is appended to LCS(R2, C2), which contains the two subsequences, (A) and (G), giving (AC) and (GC).
For LCS(R3, C4), C and A do not match. Combining LCS(R3, C3), which contains (AC) and (GC), and LCS(R2, C4), which contains (GA), gives a total of three sequences: (AC), (GC), and (GA).
Finally, for LCS(R3, C5), C and T do not match. The result is that LCS(R3, C5) also contains the three sequences, (AC), (GC), and (GA).
Ø | A | G | C | A | T | |
---|---|---|---|---|---|---|
Ø | Ø | Ø | Ø | Ø | Ø | Ø |
G | Ø | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}Ø | {\displaystyle {\overset {\nwarrow }{\ }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) | {\displaystyle {\overset {\ }{\leftarrow }}}(G) |
A | Ø | {\displaystyle {\overset {\nwarrow }{\ }}}(A) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) | {\displaystyle {\overset {\nwarrow }{\ }}}(GA) | {\displaystyle {\overset {\ }{\leftarrow }}}(GA) |
C | Ø | {\displaystyle {\overset {\ \uparrow }{\ }}}(A) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) | {\displaystyle {\overset {\nwarrow }{\ }}}(AC) & (GC) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(AC) & (GC) & (GA) | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(AC) & (GC) & (GA) |
The final result is that the last cell contains all the longest subsequences common to (AGCAT) and (GAC); these are (AC), (GC), and (GA). The table also shows the longest common subsequences for every possible pair of prefixes. For example, for (AGC) and (GA), the longest common subsequence are (A) and (G).
Traceback approach[edit]
Calculating the LCS of a row of the LCS table requires only the solutions to the current row and the previous row. Still, for long sequences, these sequences can get numerous and long, requiring a lot of storage space. Storage space can be saved by saving not the actual subsequences, but the length of the subsequence and the direction of the arrows, as in the table below.
Ø | A | G | C | A | T | |
---|---|---|---|---|---|---|
Ø | 0 | 0 | 0 | 0 | 0 | 0 |
G | 0 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}0 | {\displaystyle {\overset {\nwarrow }{\ }}}1 | {\displaystyle {\overset {\ }{\leftarrow }}}1 | {\displaystyle {\overset {\ }{\leftarrow }}}1 | {\displaystyle {\overset {\ }{\leftarrow }}}1 |
A | 0 | {\displaystyle {\overset {\nwarrow }{\ }}}1 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 | {\displaystyle {\overset {\nwarrow }{\ }}}2 | {\displaystyle {\overset {\ }{\leftarrow }}}2 |
C | 0 | {\displaystyle {\overset {\ \uparrow }{\ }}}1 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 | {\displaystyle {\overset {\nwarrow }{\ }}}2 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2 |
The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table. When the length decreases, the sequences must have had a common element. Several paths are possible when two arrows are shown in a cell. Below is the table for such an analysis, with numbers colored in cells where the length is about to decrease. The bold numbers trace out the sequence, (GA).[6]
Ø | A | G | C | A | T | |
---|---|---|---|---|---|---|
Ø | 0 | 0 | 0 | 0 | 0 | 0 |
G | 0 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}0 | {\displaystyle {\overset {\nwarrow }{\ }}}1 | {\displaystyle {\overset {\ }{\leftarrow }}}1 | {\displaystyle {\overset {\ }{\leftarrow }}}1 | {\displaystyle {\overset {\ }{\leftarrow }}}1 |
A | 0 | {\displaystyle {\overset {\nwarrow }{\ }}}1 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 | {\displaystyle {\overset {\nwarrow }{\ }}}2 | {\displaystyle {\overset {\ }{\leftarrow }}}2 |
C | 0 | {\displaystyle {\overset {\ \uparrow }{\ }}}1 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 | {\displaystyle {\overset {\nwarrow }{\ }}}2 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2 | {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2 |