Needleman Wunsch Algorithm utilizes Dynamic programming to solve a traditional problem of bioinformatics: align 2 sequences of nucleotide in the optimal way such that the number of matched nucleotides is maximized and the number of deletions and insertions is minimized.

For example, we have 2 nucleotide sequences like below:

A | C | G | T | C | A |

C | G | T | A | T | G |

The alignment could be:

C | G | T | A | ||

C | G | T | A |

For clarification: the stricken-through ones are the ones to be deleted. In the above alignment, the first sequence is reduced to 4 nucleotides (A at position 1 and C at position 5 are deleted), the second sequence is also reduced to 4 nucleotides (the last 2 nucleotides are deleted).

After deletions, the 2 sequences become:

C | G | T | A |

C | G | T | A |

In the above alignment, we delete nucleotide 4 times and get matching of length 4. Hence, we have the aligning valuation equals (-4)delete_cost + 4match_reward.

Needleman-Wunsch Algorithm

As we stated at the beginning of the blog, the Needleman-Wunsch Algorithm solves this optimal alignment problem by using dynamic programming.

From now, we will call the 2 sequences: *s1* and *s2*.

we define a 2-D array * dp* with size

*n*x

*m*, while

*n*is the length of

*s1*and

*m*is the length of

*s2*.

*dp[i][j]*represents the maximum aligning valuation we can have when aligning the first

*i*nucleotide of sequence 1 with the first

*j*nucleotide of sequence 2.

Initially, we set:

for (int i = 0; i <= n; ++i) dp[i][0] = - delete_cost * i; for (int i = 0; i <= m; ++i) dp[0][i] = - delete_cost * i;

Then, for each pair of *i* and *j*, we have:

dp[i][j] = max(dp[i][j-1] - delete_cost, dp[i-1][j] - delete_cost); if (s1[i] == s2[j]) dp[i][j] = max(dp[i][j], dp[i-1][j-1] + match_reward);

The result (optimal aligning valuation) is stored in *dp[n][m]*.

Implementation

Unfortunately, I don’t find any reliable libraries in python that support this algorithm. Hence, we will have to write our own.

As for the optimization of speed, we write this on C++.

The sample code is below. Depends on different situations, we may tweak the code a little bit to fit our cases.

// Sample usage : // vector<string> s1 = {"One", "Two", "There"}; // vector<string> s2 = {"One", "Two", "Four"}; // NeedlemanWunsch needle(0, 1); // cout << needle.computeMatchValue(s1, s2) << endl; struct NeedlemanWunsch { vector< vector<int> > dp; int delete_cost; int match_reward; NeedlemanWunsch(int delete_cost_=0, int match_reward_=1) { delete_cost = delete_cost_; match_reward = match_reward_; } float computeMatchValue(vector<string> s1, vector<string> s2) { // get size of each vector int n = s1.size(); int m = s2.size(); // prepare a dp matrix dp.resize(n+1, vector<int>(m+1, 0)); // initialize dp matrix for (int i = 0; i <= n; ++i) dp[i][0] = -delete_cost*i; for (int j = 0; j <= m; ++j) dp[0][j] = -delete_cost*j; // compute dp for (int i = 1; i <= n; ++i) for (int j = 1; j <= m; ++j) { dp[i][j] = max(dp[i-1][j] - delete_cost, dp[i][j-1] - delete_cost); if (s1[i-1] == s2[j-1]) dp[i][j] = max(dp[i][j], dp[i-1][j-1] + match_reward); } return dp[n][m]; } };

Apart from aligning nucleotide, we can use this algorithm to compute alignments of words (consider each character to be a nucleotide) or alignment of sentences (consider each word to be a nucleotide).

*References:*

*Wikipedia’s page on Needleman-Wunsch algorithm: link*