Needleman Wunsch Algorithm utilizes Dynamic programming to solve a traditional problem of bioinformatics: align 2 sequences of nucleotide in the optimal way such that the number of matched nucleotides is maximized and the number of deletions and insertions is minimized.
For example, we have 2 nucleotide sequences like below:
A | C | G | T | C | A |
C | G | T | A | T | G |
The alignment could be:
C | G | T | A | ||
C | G | T | A |
For clarification: the stricken-through ones are the ones to be deleted. In the above alignment, the first sequence is reduced to 4 nucleotides (A at position 1 and C at position 5 are deleted), the second sequence is also reduced to 4 nucleotides (the last 2 nucleotides are deleted).
After deletions, the 2 sequences become:
C | G | T | A |
C | G | T | A |
In the above alignment, we delete nucleotide 4 times and get matching of length 4. Hence, we have the aligning valuation equals (-4)delete_cost + 4match_reward.
Needleman-Wunsch Algorithm
As we stated at the beginning of the blog, the Needleman-Wunsch Algorithm solves this optimal alignment problem by using dynamic programming.
From now, we will call the 2 sequences: s1 and s2.
we define a 2-D array dp with size n x m, while n is the length of s1 and m is the length of s2.
dp[i][j] represents the maximum aligning valuation we can have when aligning the first i nucleotide of sequence 1 with the first j nucleotide of sequence 2.
Initially, we set:
for (int i = 0; i <= n; ++i) dp[i][0] = - delete_cost * i; for (int i = 0; i <= m; ++i) dp[0][i] = - delete_cost * i;
Then, for each pair of i and j, we have:
dp[i][j] = max(dp[i][j-1] - delete_cost, dp[i-1][j] - delete_cost); if (s1[i] == s2[j]) dp[i][j] = max(dp[i][j], dp[i-1][j-1] + match_reward);
The result (optimal aligning valuation) is stored in dp[n][m].
Implementation
Unfortunately, I don’t find any reliable libraries in python that support this algorithm. Hence, we will have to write our own.
As for the optimization of speed, we write this on C++.
The sample code is below. Depends on different situations, we may tweak the code a little bit to fit our cases.
// Sample usage : // vector<string> s1 = {"One", "Two", "There"}; // vector<string> s2 = {"One", "Two", "Four"}; // NeedlemanWunsch needle(0, 1); // cout << needle.computeMatchValue(s1, s2) << endl; struct NeedlemanWunsch { vector< vector<int> > dp; int delete_cost; int match_reward; NeedlemanWunsch(int delete_cost_=0, int match_reward_=1) { delete_cost = delete_cost_; match_reward = match_reward_; } float computeMatchValue(vector<string> s1, vector<string> s2) { // get size of each vector int n = s1.size(); int m = s2.size(); // prepare a dp matrix dp.resize(n+1, vector<int>(m+1, 0)); // initialize dp matrix for (int i = 0; i <= n; ++i) dp[i][0] = -delete_cost*i; for (int j = 0; j <= m; ++j) dp[0][j] = -delete_cost*j; // compute dp for (int i = 1; i <= n; ++i) for (int j = 1; j <= m; ++j) { dp[i][j] = max(dp[i-1][j] - delete_cost, dp[i][j-1] - delete_cost); if (s1[i-1] == s2[j-1]) dp[i][j] = max(dp[i][j], dp[i-1][j-1] + match_reward); } return dp[n][m]; } };
Apart from aligning nucleotide, we can use this algorithm to compute alignments of words (consider each character to be a nucleotide) or alignment of sentences (consider each word to be a nucleotide).
References:
- Wikipedia’s page on Needleman-Wunsch algorithm: link