Alignment

Notebook

Example Notebook: Kerasy.examples.alignment.ipynb

Why compare DNA sequences?

  1. To understand the function of the DNA. (The more similar two DNA sequences are, the more similar their functions tend to be.)
  2. To find where the important regions are.
  3. To infer the phylogenetic tree of organisms.
  4. To investigate where the DNA comes from.

Needleman-Wunsch-Gotoh

class kerasy.Bio.alignment.NeedlemanWunshGotoh(**kwargs)
  • Global Alignment: Align the two sequences over their entire lengths.
  • Direction: Forward
  • params:
    • match
    • mismatch
    • gap_opening
    • gap_extension
$$ \begin{aligned} M(i,j) &= \max \begin{cases} M(i-1,j-1) + s(x_i, y_j)\\ X(i-1,j-1) + s(x_i, y_j)\\ Y(i-1,j-1) + s(x_i, y_j) \end{cases}\\ X(i,j) &= \max \begin{cases} M(i-1,j) - d\\ X(i-1,j) - e\\ \end{cases}\\ Y(i,j) &= \max \begin{cases} M(i,j-1) - d\\ Y(i,j-1) - e\\ \end{cases}\\ d &= \text{gap opening penalty}\\ e &= \text{gap extension penalty}\\ s(x_i,y_j) &= \begin{cases} \text{match score} & (\text{if $x_i=y_j$})\\ \text{mismatch score} & (\text{otherwise.})\\ \end{cases} \end{aligned} $$
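
A minimal NumPy sketch of this forward Gotoh recursion (score only, no traceback) is shown below. It illustrates the formulas above and is not the kerasy implementation; the function name, boundary convention, and default scores are assumptions.

```python
import numpy as np

def nw_gotoh_score(x, y, match=1, mismatch=-1, gap_opening=3, gap_extension=1):
    """Best global alignment score under the affine gap model above (sketch)."""
    n, m = len(x), len(y)
    d, e = gap_opening, gap_extension
    NEG = -np.inf
    M = np.full((n + 1, m + 1), NEG); M[0, 0] = 0.0
    X = np.full((n + 1, m + 1), NEG)   # x_i aligned to a gap
    Y = np.full((n + 1, m + 1), NEG)   # y_j aligned to a gap
    # Boundary: a leading gap is opened once and then extended.
    for i in range(1, n + 1):
        X[i, 0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Y[0, j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i, j] = s + max(M[i - 1, j - 1], X[i - 1, j - 1], Y[i - 1, j - 1])
            X[i, j] = max(M[i - 1, j] - d, X[i - 1, j] - e)
            Y[i, j] = max(M[i, j - 1] - d, Y[i, j - 1] - e)
    return max(M[n, m], X[n, m], Y[n, m])

print(nw_gotoh_score("GATTACA", "GCATGCA"))
```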

Backward Needleman-Wunsch-Gotoh

class kerasy.Bio.alignment.BackwardNeedlemanWunshGotoh(**kwargs)
  • Global Alignment: Align the two sequences over their entire lengths, filling the tables from the ends of the sequences toward their beginnings.
  • Direction: Backward
  • params:
    • match
    • mismatch
    • gap_opening
    • gap_extension
$$ \begin{aligned} bM(i,j) &= \max \begin{cases} bM(i+1,j+1) + s(x_{i+1}, y_{j+1})\\ bX(i+1,j) - d\\ bY(i,j+1) - d \end{cases}\\ bX(i,j) &= \max \begin{cases} bM(i+1,j+1) + s(x_{i+1},y_{j+1})\\ bX(i+1,j) - e \end{cases}\\ bY(i,j) &= \max \begin{cases} bM(i+1,j+1) + s(x_{i+1},y_{j+1})\\ bY(i,j+1) - e\\ \end{cases}\\ \end{aligned} $$

This algorithm is used to find the maximum alignment score under the condition that \(x_i\) and \(y_j\) are aligned. It is given by:

$$A(i,j) = M(i,j) + bM(i,j)$$
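
As a hypothetical sketch, given the forward matrix \(M\) and the backward matrix \(bM\) as NumPy arrays of the same shape, the pair \((i,j)\) aligned in the best-scoring global alignment is simply the argmax of their sum (the function name is mine):

```python
import numpy as np

def best_aligned_pair(M, bM):
    A = M + bM                                     # A(i, j) = M(i, j) + bM(i, j)
    i, j = np.unravel_index(np.argmax(A), A.shape)
    return (i, j), A[i, j]
```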

Smith-Waterman

class kerasy.Bio.alignment.SmithWaterman(**kwargs)
  • Local Alignment: Detect the most similar parts of the two sequences and align them.
  • Direction: Forward
  • params:
    • match
    • mismatch
    • gap_opening
    • gap_extension
$$ \begin{aligned} M(i,j) &= \max \begin{cases} 0\\ M(i-1,j-1) + s(x_i, y_j)\\ X(i-1,j-1) + s(x_i, y_j)\\ Y(i-1,j-1) + s(x_i, y_j) \end{cases}\\ X(i,j) &= \max \begin{cases} M(i-1,j) - d\\ X(i-1,j) - e\\ \end{cases}\\ Y(i,j) &= \max \begin{cases} M(i,j-1) - d\\ Y(i,j-1) - e\\ \end{cases}\\ \end{aligned} $$
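
A minimal NumPy sketch of the local recursion follows; compared with the global sketch above, the only changes are the extra 0 branch in \(M(i,j)\) and taking the best score over the whole matrix rather than at the corner. Again, the function name and default scores are assumptions, not the kerasy API.

```python
import numpy as np

def smith_waterman_score(x, y, match=1, mismatch=-1, gap_opening=3, gap_extension=1):
    """Best local alignment score under the affine gap model above (sketch)."""
    n, m = len(x), len(y)
    d, e = gap_opening, gap_extension
    M = np.zeros((n + 1, m + 1))            # local alignments may start anywhere
    X = np.full((n + 1, m + 1), -np.inf)
    Y = np.full((n + 1, m + 1), -np.inf)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i, j] = max(0.0,
                          M[i - 1, j - 1] + s,
                          X[i - 1, j - 1] + s,
                          Y[i - 1, j - 1] + s)
            X[i, j] = max(M[i - 1, j] - d, X[i - 1, j] - e)
            Y[i, j] = max(M[i, j - 1] - d, Y[i, j - 1] - e)
    return M.max()                          # the best local alignment ends at the argmax of M

print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```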

Pair HMM

class kerasy.Bio.alignment.PairHMM(**kwargs)

We can interpret alignment probabilistically.

$$ \begin{aligned} P(\pi|x,y) &= \frac{P(x,y,\pi)}{P(x,y)}\\ P(x,y) &= \sum_{\text{alignments }\pi}P(x,y,\pi)\\ P(x_i\diamondsuit y_j|x,y) &= \frac{\sum_{\pi\in\Omega_{ij}}P(x,y,\pi)}{P(x,y)}\\ &= \frac{F^M(i,j)\cdot B^M(i,j)}{P(x,y)} \end{aligned} $$
Probability | Explanation
\(P(\pi\mid x,y)\) | The probability of alignment \(\pi\) given the sequences \(x\) and \(y\).
\(P(x_i\diamondsuit y_j\mid x,y)\) | The probability that \(x_i\) and \(y_j\) are aligned.

PairHMM

  • Global Alignment
  • Direction: Forward
  • params:
    • delta
    • epsilon
    • tau
    • px_e_y
    • px_ne_y
    • qx
    • qy

Viterbi Algorithm

$$ \begin{aligned} V^M(i,j) &= p_{x_iy_j}\max\begin{cases}(1-2\delta-\tau)V^M(i-1,j-1)\\(1-\varepsilon-\tau)V^X(i-1,j-1)\\(1-\varepsilon-\tau)V^Y(i-1,j-1)\end{cases}\\ V^X(i,j) &= q_{x_i}\max\begin{cases}\delta V^M(i-1,j)\\\varepsilon V^X(i-1,j)\end{cases}\\ V^Y(i,j) &= q_{y_j}\max\begin{cases}\delta V^M(i,j-1)\\\varepsilon V^Y(i,j-1)\end{cases}\\ V^E(n,m) &= \tau\max\left(V^M(n,m),V^X(n,m),V^Y(n,m)\right) \end{aligned} $$
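
A minimal sketch of this recursion in plain Python/NumPy is shown below (probability space, no traceback). The emission terms are simplified to the scalar params listed above: \(p_{x_iy_j}\) becomes px_e_y / px_ne_y and \(q_{x_i},q_{y_j}\) become the flat backgrounds qx, qy. The function name and default values are assumptions, not the kerasy API.

```python
import numpy as np

def pairhmm_viterbi(x, y, delta=0.2, epsilon=0.3, tau=0.1,
                    px_e_y=0.2, px_ne_y=0.05, qx=0.25, qy=0.25):
    """Probability of the most likely pairwise alignment (V^E(n, m) above)."""
    n, m = len(x), len(y)
    VM = np.zeros((n + 1, m + 1))
    VX = np.zeros((n + 1, m + 1))
    VY = np.zeros((n + 1, m + 1))
    VM[0, 0] = 1.0                              # virtual begin state
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:                 # match/mismatch state M
                p = px_e_y if x[i - 1] == y[j - 1] else px_ne_y
                VM[i, j] = p * max((1 - 2 * delta - tau) * VM[i - 1, j - 1],
                                   (1 - epsilon - tau) * VX[i - 1, j - 1],
                                   (1 - epsilon - tau) * VY[i - 1, j - 1])
            if i > 0:                           # x_i emitted against a gap (state X)
                VX[i, j] = qx * max(delta * VM[i - 1, j], epsilon * VX[i - 1, j])
            if j > 0:                           # y_j emitted against a gap (state Y)
                VY[i, j] = qy * max(delta * VM[i, j - 1], epsilon * VY[i, j - 1])
    return tau * max(VM[n, m], VX[n, m], VY[n, m])
```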

Forward Algorithm

$$ \begin{aligned} F^M(i,j) &= p_{x_iy_j}\text{sum}\begin{cases}(1-2\delta-\tau)F^M(i-1,j-1)\\(1-\varepsilon-\tau)F^X(i-1,j-1)\\(1-\varepsilon-\tau)F^Y(i-1,j-1)\end{cases}\\ F^X(i,j) &= q_{x_i}\text{sum}\begin{cases}\delta F^M(i-1,j)\\\varepsilon F^X(i-1,j)\end{cases}\\ F^Y(i,j) &= q_{y_j}\text{sum}\begin{cases}\delta F^M(i,j-1)\\\varepsilon F^Y(i,j-1)\end{cases}\\ F^E(n,m) &= \tau\ \text{sum}\left(F^M(n,m),F^X(n,m),F^Y(n,m)\right) \end{aligned} $$

Backward Algorithm

$$ \begin{aligned} B^M(i,j) &= \text{sum}\begin{cases}(1-2\delta-\tau)p_{x_{i+1}y_{j+1}}B^M(i+1,j+1)\\\delta q_{x_{i+1}}B^X(i+1,j)\\\delta q_{y_{j+1}}B^Y(i,j+1)\end{cases}\\ B^X(i,j) &= (1-\varepsilon-\tau)p_{x_{i+1},y_{j+1}}B^M(i+1,j+1) + \varepsilon q_{x_{i+1}}B^X(i+1,j)\\ B^Y(i,j) &= (1-\varepsilon-\tau)p_{x_{i+1},y_{j+1}}B^M(i+1,j+1) + \varepsilon q_{y_{j+1}}B^Y(i,j+1)\\ \end{aligned} $$
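
The backward recursion can be sketched in the same simplified way (again an illustration, not the kerasy.Bio.alignment.PairHMM API; the function name and defaults are assumptions):

```python
import numpy as np

def pairhmm_backward(x, y, delta=0.2, epsilon=0.3, tau=0.1,
                     px_e_y=0.2, px_ne_y=0.05, qx=0.25, qy=0.25):
    """Backward tables B^M, B^X, B^Y for the simplified PairHMM above (sketch)."""
    n, m = len(x), len(y)
    BM = np.zeros((n + 1, m + 1))
    BX = np.zeros((n + 1, m + 1))
    BY = np.zeros((n + 1, m + 1))
    BM[n, m] = BX[n, m] = BY[n, m] = tau        # termination: transition to End
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            # contribution of emitting the next aligned pair (x_{i+1}, y_{j+1}), if any
            pm = ((px_e_y if x[i] == y[j] else px_ne_y) * BM[i + 1, j + 1]
                  if (i < n and j < m) else 0.0)
            gx = qx * BX[i + 1, j] if i < n else 0.0   # q_{x_{i+1}} B^X(i+1, j)
            gy = qy * BY[i, j + 1] if j < m else 0.0   # q_{y_{j+1}} B^Y(i, j+1)
            BM[i, j] = (1 - 2 * delta - tau) * pm + delta * (gx + gy)
            BX[i, j] = (1 - epsilon - tau) * pm + epsilon * gx
            BY[i, j] = (1 - epsilon - tau) * pm + epsilon * gy
    return BM, BX, BY
```

Combined with a forward table \(F^M\) (the Viterbi sketch above with sums in place of \(\max\)) and \(P(x,y)=F^E(n,m)\), the posterior \(P(x_i\diamondsuit y_j\mid x,y)\) is the elementwise ratio \(F^M(i,j)B^M(i,j)/P(x,y)\).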

Baum-Welch Algorithm

Whereas an ordinary Hidden Markov Model estimates its optimal parameters along a single time axis \(t\), the PairHMM works with the positions \(u,v\) of the two sequences.

E step

$$ \begin{aligned} \xi_{u,v}\left(i,j\right) &=\frac{f_i\left(u,v\right)A_{ij}\phi_j\left(\diamond_1\right)b_j\left(\diamond_2,\diamond_3\right)}{\sum_{k=1}^Kf_k\left(u,v\right)b_k\left(u,v\right)}\\ \gamma_i\left(u,v\right) &=\frac{f_i\left(u,v\right)b_i\left(u,v\right)}{\sum_{k=1}^Kf_k\left(u,v\right)b_k\left(u,v\right)} \end{aligned} $$
variable | meaning
\(f_i\left(u,v\right)\) | The probability of ending up in state \(i\) after aligning the two sequences \(\mathbf{X},\mathbf{Y}\) up to observations \(u\) and \(v\), respectively.
\(A_{ij}\) | The transition probability from state \(i\) to state \(j\).
\(\phi_j\left(\diamond_1\right)\) | The emission probability of emitting \(\diamond_1\) in state \(j\).
\(b_j\left(\diamond_2,\diamond_3\right)\) | The probability of being in state \(j\), given that the sequences \(\mathbf{X},\mathbf{Y}\) are aligned from observations \(\diamond_2+1\) and \(\diamond_3+1\) to the ends of the sequences, respectively.

※ "\(\diamond_{[1,2,3]}\)" depends on the states (\(M,X,Y\)) as below:

state | \(\phi_j\left(\diamond_1\right)\) | \(b_j\left(\diamond_2,\diamond_3\right)\)
\(M\) | \(e\left(x_{u+1},y_{v+1}\right)\) | \(b_j\left(u+1,v+1\right)\)
\(X\) | \(e\left(x_{u+1}\right)\) | \(b_j\left(u+1,v\right)\)
\(Y\) | \(e\left(y_{v+1}\right)\) | \(b_j\left(u,v+1\right)\)
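
As a small illustration of the \(\gamma\) term above, if the forward and backward tables of the \(K=3\) states are stacked into NumPy arrays f and b of shape (K, U+1, V+1) (names assumed here), the state posteriors are an elementwise normalisation:

```python
import numpy as np

def state_posteriors(f, b):
    """gamma_k(u, v) from stacked forward/backward tables of shape (K, U+1, V+1)."""
    num = f * b                                   # f_k(u, v) * b_k(u, v)
    den = num.sum(axis=0, keepdims=True)          # sum over the K states
    # cells with zero total mass (unreachable positions) are left at 0
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)
```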

M step

\(\pi_i\)
$$\pi^{\star}_i = \sum_w\gamma_i^w(0,0)$$
\(A_{ij}\)
$$A^{\star}_{ij} = \frac{\sum_w\sum_{u=0}^{\tilde{U}}\sum_{v=0}^{\tilde{V}}\xi_{u,v}^{w}\left(i,j\right)}{\sum_w\sum_{u=0}^{\tilde{U}}\sum_{v=0}^{\tilde{V}}\sum_{j}^{K}\xi_{u,v}^{w}\left(i,j\right)}$$
final emitting state | \(\tilde{U}\) | \(\tilde{V}\)
\(M\) | \(U-1\) | \(V-1\)
\(X\) | \(U-1\) | \(V\)
\(Y\) | \(U\) | \(V-1\)
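
A hypothetical sketch of this update: if xi_list holds, for each training pair \(w\), a NumPy array of \(\xi^w_{u,v}(i,j)\) values with shape (U_w+1, V_w+1, K, K), the update is a ratio of sums normalised over the destination state \(j\):

```python
def update_transitions(xi_list):
    """A*_ij from per-sequence xi arrays of shape (U_w+1, V_w+1, K, K)."""
    # numerator: sum xi over positions (u, v) and over training pairs w -> shape (K, K)
    num = sum(xi.sum(axis=(0, 1)) for xi in xi_list)
    # denominator: the same quantity summed over the destination state j
    return num / num.sum(axis=1, keepdims=True)
```
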
\(\phi_{i}(k)\)
$$\phi_i\left(k\right) = \frac{\sum_w\sum_{u=0}^{U^{\ast1}}\sum_{v=0}^{V^{\ast2}}\gamma_i^w\left(u,v\right)}{\sum_w\sum_{u=0}^{U}\sum_{v=0}^{V}\gamma_i^w\left(u,v\right)}$$
  • \(\ast1\): only positions where \(x_u=k_x\) and state \(i\) is \(M\) or \(X\) are counted. (In state \(Y\), a gap is present in observation stream \(x\), so \(k_x\) does not appear.)
  • \(\ast2\): only positions where \(y_v=k_y\) and state \(i\) is \(M\) or \(Y\) are counted.
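
A hypothetical sketch of this update for one state \(i\) and one symbol \(k\): gamma_list holds the \(\gamma_i^w(u,v)\) tables for each training pair \(w\), and mask_list marks the positions \((u,v)\) that satisfy the \(\ast1\)/\(\ast2\) conditions (both names are mine; the entries are assumed to be NumPy arrays of matching shape):

```python
def update_emission(gamma_list, mask_list):
    """phi_i(k): masked sum of gammas over the full sum of gammas (sketch)."""
    # numerator: gammas at positions where symbol k is actually emitted in state i
    num = sum((g * m).sum() for g, m in zip(gamma_list, mask_list))
    # denominator: all gammas for state i, over every position and every training pair
    den = sum(g.sum() for g in gamma_list)
    return num / den
```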