第2回 2019/6/17

講師：Martin Frith

Today's topic

Motivations
Aims of sequence comparison
"Repeats"
Statistical models
Seeds & indexing
Sparse seeding & minimizers
Spaced seeds & subset seeds
Adaptive seeds (LAST)
Exercises

1. Motivations

Diverse bio-molecule sequence data

topic	images
What kind of microbes are in here?
What's wrong with his DNA?
Why so long-lived?
Ancient DNA
Religious relics?
Infectious diseases
What genes are active in each cell?

How do we analyze these sequences?

The main way is by comparing and aligning them.

Diverse alignment tasks

Compare two genome assemblies
Align bisulfite-converted DNA reads to a genome
Compare metagenomic DNA reads to all known proteins (via the genetic code)
Align ancient / damaged DNA
Align (trans-)spliced RNA
Align long, high-error reads from nanopore
Compare DNA reads to each other

2. Aims of sequence comparison

What are we really trying to do?

Find and align similar sequences?
Find and align homologous sequences?
Find and align orthologous sequences?
Find and align paralogous sequences?

It depends on the purpose and researchers' options.

Ortholog and Paralog

term	explanation
Homology	descent from a common ancestor
Orthology	descent from a common ancestor by genome division
Paralogy	descent from a common ancestor by duplication within a genome

Orthology is not necessary to 1-to-1. (many-to-many.)
Orthology is not transitive. Given "A and B are Orthologs" and "B and C are Orthologs", A and C are not always Orthologs. (Not an equivalence relation.)

Similarity versus homology

	Homologous sequences	Not
Similar sequences		Convergent evolution
Not	Rapid evolution over a long time span

The most frequent case of convergent evolution is simple sequences.

Simple sequences

DNA (and RNA and protein) frequently has simple sequences:

atgatcgattatcgtagtctaggtcgtatgctatgatt
cgat "aaaaaaaaaaaaaaaaaaa" cggtatgcgtagctg
cgatcgtagtgactatatgagagaggattcgatgctaa
gttctctaggagaggcttaggctgagcgcgtatcactg
gctcgcggc "tgtgtgtgtgtgtgtgtgtgtgtgtgtg" a
cgtatcgcacatcgtcgattttgagattcccgatggcc

Of course, these simple sequences also appear randomly, but such sequences appear with a higher probability in the genome sequences of the actual organisms. (Why??)

How do simple sequences evolve?

Strand slippage during DNA replication:
- An initial (short, mild) simple sequence occurs by chance
- Due to slippage, it gets longer...
- And longer...
Retro-transcription of RNA poly-A tails

3. "Repeats"

Different thigs are called "repeats":
- Tandem Repeats
- Simple sequences/low-complexity tracts
- Interspersed repeats

Tandem repeats

The repeating unit may have any length (minimum = 1, maximum = millions?)
The number of repeats may not be an integer
The repeats may be inexact (Not perfect repeats.)

Simple sequences/low-complexity tracts

Maybe the same as short-period (inexact) tandem repeats?
Examples:

atatatatatatctctctctct
aaaagagaaaacaaaaaagggaaaaa

Maybe evolve by slippage during DNA replication, with change slip offsets?

Interspersed repeats

Completely different from the above two.

Identical, or non-identical, sequences scattered around the genome:

ctt "gttattgctgatc" gtcctctctgtaaatt "gttattgctgatc" atgctttaac

Many of them are virus-like, and copy themselves
- Example: one human genome has about 1 million Alu elements, and about 1 million LINE elements.

Effects of repeats on sequence comparison

The influence of the "repeat sequence" is very serious, but it is divided into the following two types.

Simple sequences cause similarities that are not homologies
- Interspersed repeats do not, because they are true homologs
All kinds of repeat cause huge numbers of similarities
- E.g. human and chimp genomes both have $\sim10^6$ Alu elements $\rightarrow10^{12}$ pair-wise similarities

Dealing with repeats

It is common to identify and "mask" repeats, to exclude them from sequence comparison. (ignore)
Unfortunately, many researchers just use a popular method, and assume it will "work".
Care is needed!
- There is no exact, objective definition of "repeat"
- Repeats can be biologically important!
- There are different kinds of "repeat", with differing effects.

4. Statistical models

The most important thing about statistical models

All models are wrong, but some are useful

Classic sequence alignment

Define a scoring scheme.

	a	c	g	t
a	2	-3	-1	-3
c	-3	2	-3	-1
g	-1	-3	2	-3
t	-3	-1	-3	2

Gap existence cost: 5
Gap extension cost: 1
Find alignments with high (optimal) scores

Why is this a good method??

Because using above matrix means equivalent to finding alignments with high likelihood, in a statistical model of related sequences.

	a	c	g	t
a	0.2	0.03	0.05	0.03
c	0.03	0.2	0.03	0.05
g	0.05	0.03	0.2	0.03
t	0.03	0.05	0.03	0.2

Gap existence probability: 0.1
Gap extension probability: 0.7

All models are wrong

The classic alignment model is a gross simplification. In reality: - Substitutions are much more frequent at cg dinucleotides (neighbor effect). - Insertions and deletions are much more frequent in tandem repeats. - Gap lengths do not follow a geometric distribution. - Substitution rates vary across a genome - Etc.

But, some are useful

Nevertheless, the classic alignment model often works well enough to give useful results.
More complex models have been proposed (but they are still wrong). There is a tradeoff between model complexity and wrongness.

Scores are log likelihood ratios

$$S_{xy}=t \times \log \left(\frac{A_{xy}}{P_{x} \times Q_{y}}\right)$$

variable	meaning
$A_{xy}$	Probability of $x$ aligned to $y$ in a true alignment
$P_{x}$	Probability of $x$ in the first sequence
$Q_{y}$	Probability of $y$ in the second sequence
$t$	an arbitrary positive constant.

We get more accurate alignments if we use model parameters that fit the data.

Sequence quality data

Some DNA sequencers estimate the rror probability of every base. We ought to use this information when comparing sequences. ex.) |base|t|a|t|g|c|g|a| |:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:| |error prob|0.01|0.02|0.07|0.24|0.32|0.75|0.62|

$$\begin{aligned} t\times\log\left(\frac{A_{xy}}{P_x\times Q_y}\right) &\Longrightarrow t\times\log\left(\sum_{i,j}\frac{A_{ij}}{P_i\times Q_j}p(x=i)p(y=j)\right)\\ &= t\times\log\left(cd\frac{A_{xy}}{P_xQ_y} + (1-cd)\right) \end{aligned}$$

Why this formula can model probability well ?? → Home work:(

A flaw with classic alignment

Alignment scores	$200$	$-150$	$300$	$-450$
Sequence A	▲	◀︎	◼︎	▶︎
Sequence B	▲	🔵	◼︎	⬛️

→ Maximum alignment score: $350$

We Assuming that there is only one homologous sequence, but there may be two homologous sequences.

5. Seeds & indexing

Alignment algorithms

Exact	Heuristic
Guaranteed to find the highest alignment score	Not guaranteed to find the highest alignment score
A bit slow	Fast
Smith-Waterman algotirhm, Gotoh algorithm	seed-and-extend approach

Seed-and-extend approach (e.g. BLAST)

Find "seeds"
Try extending an alignment

Types of seed

Fixed-length exact matches
Spaced seeds (inexact)
Subset seeds (inexact)
Adaptive seeds (variable length)

How to find seed matches efficiently?

A common approach is indexing
Preprocessing step: make an "index" of the first sequence
Search step: scan across the second sequence, looking up seeds in the index

indexing

First sequence: Use suffix array algorithm etc. to make suffix array. (or k-mer table).
Second sequence: Move fixed-length indexes from the end to end and search each of them in above.

6. Sparse seeding & minimizers

Sparse seeding reduce run time and/or memory use, and there are 3 types of it. Examine only some of the indexes, not all the exact ones.

Sparse seeding in the "query" (Second sequence) sequence(s)
- Chose query's every $2^{\mathrm{nd}}$ position (or every $3^{\mathrm{rd}}$ position, etc.)
Sparse seeding in the "reference" (First sequence) sequence(s)
- Chose reference's every $2^{\mathrm{nd}}$ position (or every $3^{\mathrm{rd}}$ position, etc.)
Minimizers
- Use sparse seeding in both query and reference.
- Chose position which are "minimum" (the sequence starting here is alphabetically lowest) in sliding windows of $W$ consecutive positions.
- We can define "minimum" as we like, we have to use the same definition in both query and reference. (If so, the probability that "minimum" is included in the "reference" will increase. Avoid dropping too much.)

7. Spaced seeds & subset seeds

Fixed-length exact matches
Spaced seeds (inexact)
Subset seeds (inexact)
Adaptive seeds (variable length)

Name	Methods	Biological background
Spaced seeds	Ignore some positions.	protein-coding DNA tends to change at every $3^{\mathrm{rd}}$ position
Subset seeds	A generalization of spaced seeds. Use reduced alphabets (allow a mismatch between "g" and "t") at some positions.	Proteins: similar (e.g. small) amino acids tend to replace each other DNA: a ↔︎ g and c ↔︎ t are often more common
Spaced & subset seeds	Both	Even for randomly evolving sequences, they can be more sensitive than exact match seeds (surprising)

Analogy: word counts

We have a random DNA sequence of length 5.
Which of these words is more likely to occur? "aaaa" or "atgc" (Ans. "aaaa" ($7/4^5$) < "atgc" ($8/4^5$) )

Spaced seeds

# 70%-identical sequences, length 50
ctaactagctagctgccgctatcttagctagctagctctctggtcgta
cttgctagctaactacccctatctaaactagttaaatccctgggcata

Un-spaced seed (length=$5$)		Spaced seed (length=$7$, weight=$5$)
■■■■■	seed	■■□■□■■
$(0.7)^5$	Prob(match at $1$ position)	$(0.7)^5$
$46$	Number of positions	$44$
$46\times(0.7)^5$	Expected number of matches	$44\times(0.7)^5$

Almost same probability, but...

# 70%-identical sequences, length 50
index: 0123456789...
seq A: ctaa"ctagcta"gctg"ccgctat"cttag"ctagcta"gctctctggtcgta
seq B: cttg"ctagcta"acta"cccctat"ctaaa"ctagtta"aatccctgggcata

Un-spaced seed: $\Longrightarrow [4:9],[5:10],[6:11]$
Spaced seed: $\Longrightarrow [4:11],[15:22],[27:34]$

In this case, the spaced pattern finds more pairs of similar sequence. Even though the number of seed hits is no greater.

Un-spaced seeds tend to have several overlapping hits, or no hits Spaced seeds are (sometimes) more likely to have at least on hit.

Seed design

Which pattern is best??

"■■□■□■■","■■□■■■","■■□■□□■■","■■□■□□□■■"

It is very hard and depends on these numbers. ($70$% identical sequences, length $50%)

Co-designed seed patterns

# 1: must match
# T: transitions are allowed (A ↔︎ G, C ↔︎ T)
# 0: any mismatch is allowed
1101T1T0T1T00TT1TT11
1TTTTT010TT0TT01011TTT1
1TTTT10010T011T0TTTT11
111T011T0T01T100111
1T10T100TT01000TT01TT111
111T101TT000T0T10T00T1T1
111100T011TTT00T0TT01T1
1T1T10T11011011T1

Each pattern tends to find alignments that the other ones miss.
Design is very and very hard, so usually use Heuristics.

8. Adaptive seeds (LAST)

Fixed-length exact matches
Spaced seeds (inexact)
Subset seeds (inexact)
Adaptive seeds (variable length)

Three algorithms so far

Find "seeds" (initial matches) of a fixed length.
Try extending an alignment from each seed.

The three algorithms so far used "fixed-length" seeds. However, if we align repeated array (such as "LINEs", "CpG islands", "Alu", "SINEs", "Isochores" etc.), we could get too many seeds, and also they could extend very long. As a result, calculation costs are huge.

Example) If we compare the human and chimp genomes, as each genome has $\sim1$ million Alu elements, we will get $\sim10^{12}$ seed matches...

Solution: adaptive seeds

Find "seeds" (initial matches) of a fixed ~~length~~ rareness.
Try extending an alignment from each seed.

Definition of adaptive seeds

At each position in the query, find the shortest sequence that occurs $\leq M$ times in the reference. (For example, $M=10$, and query starts from "|")

# Query
…atcgtatcgtatcgtactgctggcc | tagtggggga…
# Reference
…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…

"tag" occurs $135$ times in the reference.
"tagt" occurs $19$ times in the reference.
"tagtg" occurs $7$ times in the reference.

→ Chose "tagtg" as the query.

Adaptive versus fixed-length seeds

Reference: Adaptive seeds tame genomic sequence comparison Material Supplemental

Performance comparison of adaptive seeds (circles) versus fixed-length seeds (crosses)

How to find adaptive seeds?

One way is to use a suffix array . About Suffix array, please look at here (Doubling), and here (SA-IS).

Adaptive spaced seeds & Adaptive subset seeds

We can combine the benefits of spaced/subset seeds and adaptive seeds

Spaced suffix array

Also, there is a Spaced suffix array to find "Adaptive spaced seeds". (Algotirhm is same as normal one.)

# Sequence
acgatcgatg
# Spaced suffixes
0 ac_at_ga_g
1  cg_tc_at_
2   ga_cg_tg
3    at_ga_g
4     tc_at_
5      cg_tg
6       ga_g
7        at_
8         tg
9          g

Subset suffix array

# Sequence
acgatcgatg
# Subset suffixes
# a,g → R / c,t → Y
0 acRatYgaYg
1  cgRtcRatR
2   gaYcgRtg
3    atYgaYg
4     tcRatR
5      cgRtg
6       gaYg
7        atR
8         tg
9          g

Advertising

Frith(teacher) and his colleagues made LAST. "Adaptive seeds", and "Subset seeds" are used in the algorithm.

Summary

To do meaningful sequence comparison, the most important thing is to understand the aim clearly

"Repeats" have a massive effect on bio-sequence comparison

Statistical models

are the basis of accurate sequence comparison

Spaced and subset seeds improve sensitivity, but are hard to design

Adaptive seeds are useful when there is non-uniform composition

Exercises

Find the minimizer positions, using $W=3$, for this sequence: "acgatcgatgaagatgca"
Write the sparse suffix array, which uses only these minimizer positions
Provide an argument for why equation [1] is appropriate.
Show how to simpligy equation [1] into equation [2]. What are $c$ and $d$ ?

$$\begin{aligned} t\times\log\left(\frac{A_{xy}}{P_x\times Q_y}\right) &\Longrightarrow t\times\log\left(\sum_{i,j}\frac{A_{ij}}{P_i\times Q_j}p(x=i)p(y=j)\right) & [1]\\ &= t\times\log\left(cd\frac{A_{xy}}{P_xQ_y} + (1-cd)\right) & [2] \end{aligned}$$

Answer

1

$0,3,5,7,10,11,13,16,17$

acgatcgatgaagatgca
=--
 --=
  -=-
   =--
    -=-
acgatcgatgaagatgca
     --=
      -=-
       =--
        --=
         -=-
acgatcgatgaagatgca
          =--
           =--
            -=-
             =--
              --=
               --=

2

17 a
10 aagatgca
 0 acgatcgatgaagatgca
11 agatgca
 3 atcgatgaagatgca
 7 atgaagatgca
13 atgca
16 ca
 5 cgatgaagatgca

3

$S_{xy} (x,y\in\{a,c,g,t\})$ specifies the score for aligning the nucleotides $x$ and $y$. The scoring matrix can be interpreted as a log likelihood ratio:

$$S_{x y}=t \times \ln \left(R_{x y}\right)$$

where $T$ is an arbitrary scaling factor, and $R_{xy}$ is the likelihood ratio:

$$R_{x y}=\frac{P(x y | A)}{P(x) P(y)}$$

$P(xy|A)$ is the probability of observing $x$ aligned to $y$ in a probabilistic model of aligned sequences.
$P(x)$ and $P(y)$ are the probabilities of observing $x$ and $y$ individually.

Then, extending the standard scoring matrix derivation to take sequencing error probabilities into account.

We assume that the sequencing instrument estimates $P(z|d)$, the probability that a base is $z$ (where $z\in\{a,c,g,t\}$) based on some observed data $d_z$ (e.g. image intensities).

Following the likelihood-ratio principle, we define a generalization of standard substitution scores:

$$R_{\alpha \beta}^{\prime}=\frac{P(\alpha \beta | A)}{P(\alpha) P(\beta)}$$

This formula can be rearranged by observing,

$$\begin{aligned} P(\alpha) &=\sum_{x \in[a, c, g, t]} P(x) P(\alpha | x) \\ P(\beta) &=\sum_{y \in[a, c, g, t]} P(y) P(\beta | y) \\ P(\alpha \beta | A) &=\sum_{x \in[a, c, g, t]}\sum_{y \in[a, c, g, t]} P(x y | A) P(\alpha | x) P(\beta | y) \end{aligned}$$

and by Bayes formula,

$$ P(\alpha | x)=\frac{P(x | \alpha) P(\alpha)}{P(x)}\\ P(\beta | y)=\frac{P(y | \beta) P(\beta )}{P(y)} $$

to obtain:

$$ \begin{aligned} R_{\alpha \beta}^{\prime} &= \frac{P(\alpha \beta | A)}{P(\alpha) P(\beta)}\\ &= \frac{\sum_{x \in[a, c, g, t]}\sum_{y \in[a, c, g, t]} P(x y | A) P(\alpha | x) P(\beta | y)}{\sum_{x \in[a, c, g, t]} P(x) P(\alpha | x) \sum_{y \in[a, c, g, t]} P(y) P(\beta | y)} \\ &= \frac{\sum_{x \in[a, c, g, t]}\sum_{y \in[a, c, g, t]} P(x y | A) P(\alpha | x) P(\beta | y)}{\sum_{x \in[a, c, g, t]} P(x | \alpha) P(\alpha) \sum_{y \in[a, c, g, t]} P(y | \beta) P(\beta )} \\ &= \frac{\sum_{x \in[a, c, g, t]}\sum_{y \in[a, c, g, t]} P(x y | A) P(\alpha | x) P(\beta | y)}{P(\alpha) P(\beta )} \\ &= \sum_{x \in[a, c, g, t]}\sum_{y \in[a, c, g, t]}\frac{ P(x y | A) P(x|\alpha) P(y|\beta )}{P(x) P(y )} \\ &= \sum_{x \in[a, c, g, t]}\sum_{y \in[a, c, g, t]}R_{xy}P(x|\alpha) P(y|\beta ) \\ \end{aligned} $$

Finally, we define scores by the usual log transformation:

$$S_{\alpha\beta}^{\prime}=t \times \ln \left(R_{\alpha\beta}^{\prime}\right)$$

Reference: Incorporating sequence quality data into alignment improves DNA read mapping

4

Given the probability of one of the bases, $x', y'$, we can make an assumption that the probability of the others (ex. $x''$) is following the respective $P_{x''}$.

Therefore, the following equation holds;

$$p(x=i) = \begin{cases} p(x=x'), & (\text{if}\ i=x') \\ \left(1-p(x=x')\right)\frac{P_i}{1-P_{x'}}, & (\text{otherwise}) \end{cases} $$

Then, the formula can be rearranged as follows;

$$\begin{aligned} &\sum_{i, j} \frac{A_{i j}}{P_{i} \times Q_{j}} p(x=i) p(y=j)\\ &= \sum_{i=x',j=y'}+\sum_{i=x',j\neq y'}+\sum_{i\neq x',j=y'}+\sum_{x\neq x',y\neq y'}\\ &= \frac{A_{x' y'}}{P_{x'} \times Q_{y'}} p(x=x') p(y=y') + \sum_{j\neq y'} \frac{A_{x' j}}{P_{x'} \times Q_{j}}p(x=x')\left\{\left(1-p(y=y')\right)\frac{Q_j}{1-Q_{y'}}\right\}\\ &+ \sum_{i\neq x'} \frac{A_{i y'}}{P_{i} \times Q_{y'}}\left\{\left(1-p(x=x')\right)\frac{P_i}{1-P_{x'}}\right\}p(y=y') + \sum_{i\neq x',j\neq y'} \frac{A_{i j}}{P_{i} \times Q_{j}}\left\{\left(1-p(x=x')\right)\frac{P_i}{1-P_{x'}}\right\}\left\{\left(1-p(y=y')\right)\frac{Q_i}{1-Q_{y'}}\right\}\\ &= \frac{A_{x' y'}}{P_{x'} \times Q_{y'}} p(x=x') p(y=y') + \frac{p(x=x')}{P_{x'}}\frac{1-p(y=y')}{1-Q_{y'}}\sum_{j\neq y'}A_{x' j}\\ &+ \frac{1-p(x=x')}{1-P_{x'}}\frac{p(y=y')}{Q_{y'}}\sum_{i\neq x'}A_{i y'} + \frac{1-p(x=x')}{1-P_{x'}}\frac{1-p(y=y')}{1-Q_{y'}} \sum_{i\neq x',j\neq y'} A_{i j} \\ &= \frac{A_{x' y'}}{P_{x'} \times Q_{y'}} p(x=x') p(y=y') + \frac{p(x=x')}{P_{x'}}\frac{1-p(y=y')}{1-Q_{y'}}\left(P_{x'}-A_{x' y'}\right)\\ &+ \frac{1-p(x=x')}{1-P_{x'}}\frac{p(y=y')}{Q_{y'}}\left(Q_{y'}-A_{x' y'}\right) + \frac{1-p(x=x')}{1-P_{x'}}\frac{1-p(y=y')}{1-Q_{y'}} \left(1-\left(P_{x'}+Q_{y'}-A_{x' y'}\right)\right) \\ &=\cdots\\ &=\left(\frac{p(x=x')-P_{x'}}{1-P_{x'}}\right)\left(\frac{p(y=y')-Q_{y'}}{1-Q_{y'}}\right)\frac{A_{x' y'}}{P_{x'} \times Q_{y'}} + \left(1-\left(\frac{p(x=x')-P_{x'}}{1-P_{x'}}\right)\left(\frac{p(y=y')-Q_{y'}}{1-Q_{y'}}\right)\right) \end{aligned}$$

So, we get the following equation:

$$ t\times\log\left(\sum_{i,j}\frac{A_{ij}}{P_i\times Q_j}p(x=i)p(y=j)\right)= t\times\log\left(cd\frac{A_{xy}}{P_xQ_y} + (1-cd)\right)\\ c=\left(\frac{p(x=x')-P_{x'}}{1-P_{x'}}\right),\quad d=\left(\frac{p(y=y')-Q_{y'}}{1-Q_{y'}}\right) $$

variable	meaning
\(A_{xy}\)	Probability of \(x\) aligned to \(y\) in a true alignment
\(P_{x}\)	Probability of \(x\) in the first sequence
\(Q_{y}\)	Probability of \(y\) in the second sequence
\(t\)	an arbitrary positive constant.

Un-spaced seed (length=\(5\))		Spaced seed (length=\(7\), weight=\(5\))
■■■■■	seed	■■□■□■■
\((0.7)^5\)	Prob(match at \(1\) position)	\((0.7)^5\)
\(46\)	Number of positions	\(44\)
\(46\times(0.7)^5\)	Expected number of matches	\(44\times(0.7)^5\)