第2回 2019/6/17
- 講師:Martin Frith
Today's topic
- Motivations
- Aims of sequence comparison
- "Repeats"
- Statistical models
- Seeds & indexing
- Sparse seeding & minimizers
- Spaced seeds & subset seeds
- Adaptive seeds (LAST)
- Exercises
1. Motivations
Diverse bio-molecule sequence data
topic | images |
---|---|
What kind of microbes are in here? | |
What's wrong with his DNA? | |
Why so long-lived? | |
Ancient DNA | |
Religious relics? | |
Infectious diseases | |
What genes are active in each cell? |
How do we analyze these sequences?
The main way is by comparing and aligning them.
Diverse alignment tasks
- Compare two genome assemblies
- Align bisulfite-converted DNA reads to a genome
- Compare metagenomic DNA reads to all known proteins (via the genetic code)
- Align ancient / damaged DNA
- Align (trans-)spliced RNA
- Align long, high-error reads from nanopore
- Compare DNA reads to each other
2. Aims of sequence comparison
What are we really trying to do?
- Find and align similar sequences?
- Find and align homologous sequences?
- Find and align orthologous sequences?
- Find and align paralogous sequences?
It depends on the purpose and researchers' options.
Ortholog and Paralog
term | explanation |
---|---|
Homology | descent from a common ancestor |
Orthology | descent from a common ancestor by genome division |
Paralogy | descent from a common ancestor by duplication within a genome |
- Orthology is not necessary to 1-to-1. (many-to-many.)
- Orthology is not transitive. Given "A and B are Orthologs" and "B and C are Orthologs", A and C are not always Orthologs. (Not an equivalence relation.)
Similarity versus homology
Homologous sequences | Not | |
---|---|---|
Similar sequences | Convergent evolution | |
Not | Rapid evolution over a long time span |
The most frequent case of convergent evolution is simple sequences.
Simple sequences
DNA (and RNA and protein) frequently has simple sequences:
atgatcgattatcgtagtctaggtcgtatgctatgatt
cgat "aaaaaaaaaaaaaaaaaaa" cggtatgcgtagctg
cgatcgtagtgactatatgagagaggattcgatgctaa
gttctctaggagaggcttaggctgagcgcgtatcactg
gctcgcggc "tgtgtgtgtgtgtgtgtgtgtgtgtgtg" a
cgtatcgcacatcgtcgattttgagattcccgatggcc
Of course, these simple sequences also appear randomly, but such sequences appear with a higher probability in the genome sequences of the actual organisms. (Why??)
How do simple sequences evolve?
- Strand slippage during DNA replication:
- An initial (short, mild) simple sequence occurs by chance
- Due to slippage, it gets longer...
- And longer...
- Retro-transcription of RNA poly-A tails
3. "Repeats"
- Different thigs are called "repeats":
- Tandem Repeats
- Simple sequences/low-complexity tracts
- Interspersed repeats
Tandem repeats
- The repeating unit may have any length (minimum = 1, maximum = millions?)
- The number of repeats may not be an integer
- The repeats may be inexact (Not perfect repeats.)
Simple sequences/low-complexity tracts
- Maybe the same as short-period (inexact) tandem repeats?
- Examples:
atatatatatatctctctctct
aaaagagaaaacaaaaaagggaaaaa
- Maybe evolve by slippage during DNA replication, with change slip offsets?
Interspersed repeats
Completely different from the above two.
- Identical, or non-identical, sequences scattered around the genome:
ctt "gttattgctgatc" gtcctctctgtaaatt "gttattgctgatc" atgctttaac
- Many of them are virus-like, and copy themselves
- Example: one human genome has about 1 million Alu elements, and about 1 million LINE elements.
Effects of repeats on sequence comparison
The influence of the "repeat sequence" is very serious, but it is divided into the following two types.
- Simple sequences cause similarities that are not homologies
- Interspersed repeats do not, because they are true homologs
- All kinds of repeat cause huge numbers of similarities
- E.g. human and chimp genomes both have \(\sim10^6\) Alu elements \(\rightarrow10^{12}\) pair-wise similarities
Dealing with repeats
- It is common to identify and "mask" repeats, to exclude them from sequence comparison. (ignore)
- Unfortunately, many researchers just use a popular method, and assume it will "work".
- Care is needed!
- There is no exact, objective definition of "repeat"
- Repeats can be biologically important!
- There are different kinds of "repeat", with differing effects.
4. Statistical models
All models are wrong, but some are useful
Classic sequence alignment
- Define a scoring scheme.
a | c | g | t | |
---|---|---|---|---|
a | 2 | -3 | -1 | -3 |
c | -3 | 2 | -3 | -1 |
g | -1 | -3 | 2 | -3 |
t | -3 | -1 | -3 | 2 |
- Gap existence cost: 5
-
Gap extension cost: 1
-
Find alignments with high (optimal) scores
Why is this a good method??
Because using above matrix means equivalent to finding alignments with high likelihood, in a statistical model of related sequences.
a | c | g | t | |
---|---|---|---|---|
a | 0.2 | 0.03 | 0.05 | 0.03 |
c | 0.03 | 0.2 | 0.03 | 0.05 |
g | 0.05 | 0.03 | 0.2 | 0.03 |
t | 0.03 | 0.05 | 0.03 | 0.2 |
- Gap existence probability: 0.1
- Gap extension probability: 0.7
All models are wrong
The classic alignment model is a gross simplification. In reality: - Substitutions are much more frequent at cg dinucleotides (neighbor effect). - Insertions and deletions are much more frequent in tandem repeats. - Gap lengths do not follow a geometric distribution. - Substitution rates vary across a genome - Etc.
But, some are useful
- Nevertheless, the classic alignment model often works well enough to give useful results.
- More complex models have been proposed (but they are still wrong). There is a tradeoff between model complexity and wrongness.
Scores are log likelihood ratios
variable | meaning |
---|---|
\(A_{xy}\) | Probability of \(x\) aligned to \(y\) in a true alignment |
\(P_{x}\) | Probability of \(x\) in the first sequence |
\(Q_{y}\) | Probability of \(y\) in the second sequence |
\(t\) | an arbitrary positive constant. |
We get more accurate alignments if we use model parameters that fit the data.
Sequence quality data
Some DNA sequencers estimate the rror probability of every base. We ought to use this information when comparing sequences. ex.) |base|t|a|t|g|c|g|a| |:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:| |error prob|0.01|0.02|0.07|0.24|0.32|0.75|0.62|
Why this formula can model probability well ?? → Home work:(
A flaw with classic alignment
Alignment scores | \(200\) | \(-150\) | \(300\) | \(-450\) |
---|---|---|---|---|
Sequence A | ▲ | ◀︎ | ◼︎ | ▶︎ |
Sequence B | ▲ | 🔵 | ◼︎ | ⬛️ |
→ Maximum alignment score: \(350\)
We Assuming that there is only one homologous sequence, but there may be two homologous sequences.
5. Seeds & indexing
Alignment algorithms
Exact | Heuristic |
---|---|
Guaranteed to find the highest alignment score | Not guaranteed to find the highest alignment score |
A bit slow | Fast |
Smith-Waterman algotirhm, Gotoh algorithm | seed-and-extend approach |
Seed-and-extend approach (e.g. BLAST)
- Find "seeds"
- Try extending an alignment
Types of seed
- Fixed-length exact matches
- Spaced seeds (inexact)
- Subset seeds (inexact)
- Adaptive seeds (variable length)
How to find seed matches efficiently?
- A common approach is indexing
- Preprocessing step: make an "index" of the first sequence
- Search step: scan across the second sequence, looking up seeds in the index
indexing
- First sequence: Use suffix array algorithm etc. to make suffix array. (or k-mer table).
- Second sequence: Move fixed-length indexes from the end to end and search each of them in above.
6. Sparse seeding & minimizers
Sparse seeding reduce run time and/or memory use, and there are 3 types of it. Examine only some of the indexes, not all the exact ones.
- Sparse seeding in the "query" (Second sequence) sequence(s)
- Chose query's every \(2^{\mathrm{nd}}\) position (or every \(3^{\mathrm{rd}}\) position, etc.)
- Sparse seeding in the "reference" (First sequence) sequence(s)
- Chose reference's every \(2^{\mathrm{nd}}\) position (or every \(3^{\mathrm{rd}}\) position, etc.)
- Minimizers
- Use sparse seeding in both query and reference.
- Chose position which are "minimum" (the sequence starting here is alphabetically lowest) in sliding windows of \(W\) consecutive positions.
- We can define "minimum" as we like, we have to use the same definition in both query and reference. (If so, the probability that "minimum" is included in the "reference" will increase. Avoid dropping too much.)
7. Spaced seeds & subset seeds
- Fixed-length exact matches
- Spaced seeds (inexact)
- Subset seeds (inexact)
- Adaptive seeds (variable length)
Name | Methods | Biological background |
---|---|---|
Spaced seeds | Ignore some positions. | protein-coding DNA tends to change at every \(3^{\mathrm{rd}}\) position |
Subset seeds | A generalization of spaced seeds. Use reduced alphabets (allow a mismatch between "g" and "t") at some positions. |
Proteins: similar (e.g. small) amino acids tend to replace each other DNA: a ↔︎ g and c ↔︎ t are often more common |
Spaced & subset seeds | Both | Even for randomly evolving sequences, they can be more sensitive than exact match seeds (surprising) |
Analogy: word counts
- We have a random DNA sequence of length 5.
- Which of these words is more likely to occur? "aaaa" or "atgc" (Ans. "aaaa" (\(7/4^5\)) < "atgc" (\(8/4^5\)) )
Spaced seeds
# 70%-identical sequences, length 50
ctaactagctagctgccgctatcttagctagctagctctctggtcgta
cttgctagctaactacccctatctaaactagttaaatccctgggcata
Un-spaced seed (length=\(5\)) | Spaced seed (length=\(7\), weight=\(5\)) | |
---|---|---|
■■■■■ | seed | ■■□■□■■ |
\((0.7)^5\) | Prob(match at \(1\) position) | \((0.7)^5\) |
\(46\) | Number of positions | \(44\) |
\(46\times(0.7)^5\) | Expected number of matches | \(44\times(0.7)^5\) |
Almost same probability, but...
# 70%-identical sequences, length 50
index: 0123456789...
seq A: ctaa"ctagcta"gctg"ccgctat"cttag"ctagcta"gctctctggtcgta
seq B: cttg"ctagcta"acta"cccctat"ctaaa"ctagtta"aatccctgggcata
- Un-spaced seed: \(\Longrightarrow [4:9],[5:10],[6:11]\)
- Spaced seed: \(\Longrightarrow [4:11],[15:22],[27:34]\)
In this case, the spaced pattern finds more pairs of similar sequence. Even though the number of seed hits is no greater.
Un-spaced seeds tend to have several overlapping hits, or no hits Spaced seeds are (sometimes) more likely to have at least on hit.
Seed design
Which pattern is best??
"■■□■□■■","■■□■■■","■■□■□□■■","■■□■□□□■■"
It is very hard and depends on these numbers. (\(70\)% identical sequences, length $50%)
Co-designed seed patterns
# 1: must match
# T: transitions are allowed (A ↔︎ G, C ↔︎ T)
# 0: any mismatch is allowed
1101T1T0T1T00TT1TT11
1TTTTT010TT0TT01011TTT1
1TTTT10010T011T0TTTT11
111T011T0T01T100111
1T10T100TT01000TT01TT111
111T101TT000T0T10T00T1T1
111100T011TTT00T0TT01T1
1T1T10T11011011T1
- Each pattern tends to find alignments that the other ones miss.
- Design is very and very hard, so usually use Heuristics.
8. Adaptive seeds (LAST)
- Fixed-length exact matches
- Spaced seeds (inexact)
- Subset seeds (inexact)
- Adaptive seeds (variable length)
Three algorithms so far
- Find "seeds" (initial matches) of a fixed length.
- Try extending an alignment from each seed.
The three algorithms so far used "fixed-length" seeds. However, if we align repeated array (such as "LINEs", "CpG islands", "Alu", "SINEs", "Isochores" etc.), we could get too many seeds, and also they could extend very long. As a result, calculation costs are huge.
Example) If we compare the human and chimp genomes, as each genome has \(\sim1\) million Alu elements, we will get \(\sim10^{12}\) seed matches...
Solution: adaptive seeds
- Find "seeds" (initial matches) of a fixed ~~length~~ rareness.
- Try extending an alignment from each seed.
Definition of adaptive seeds
At each position in the query, find the shortest sequence that occurs \(\leq M\) times in the reference. (For example, \(M=10\), and query starts from "|")
# Query
…atcgtatcgtatcgtactgctggcc | tagtggggga…
# Reference
…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
- "tag" occurs \(135\) times in the reference.
- "tagt" occurs \(19\) times in the reference.
- "tagtg" occurs \(7\) times in the reference.
→ Chose "tagtg" as the query.
Reference: Adaptive seeds tame genomic sequence comparison Material Supplemental
Performance comparison of adaptive seeds (circles) versus fixed-length seeds (crosses)
How to find adaptive seeds?
One way is to use a suffix array . About Suffix array, please look at here (Doubling), and here (SA-IS).
Adaptive spaced seeds & Adaptive subset seeds
We can combine the benefits of spaced/subset seeds and adaptive seeds
Spaced suffix array
Also, there is a Spaced suffix array to find "Adaptive spaced seeds". (Algotirhm is same as normal one.)
# Sequence
acgatcgatg
# Spaced suffixes
0 ac_at_ga_g
1 cg_tc_at_
2 ga_cg_tg
3 at_ga_g
4 tc_at_
5 cg_tg
6 ga_g
7 at_
8 tg
9 g
Subset suffix array
# Sequence
acgatcgatg
# Subset suffixes
# a,g → R / c,t → Y
0 acRatYgaYg
1 cgRtcRatR
2 gaYcgRtg
3 atYgaYg
4 tcRatR
5 cgRtg
6 gaYg
7 atR
8 tg
9 g
Advertising
Frith(teacher) and his colleagues made LAST. "Adaptive seeds", and "Subset seeds" are used in the algorithm.
Summary
Exercises
- Find the minimizer positions, using \(W=3\), for this sequence: "acgatcgatgaagatgca"
- Write the sparse suffix array, which uses only these minimizer positions
- Provide an argument for why equation [1] is appropriate.
- Show how to simpligy equation [1] into equation [2]. What are \(c\) and \(d\) ?
Answer
1
\(0,3,5,7,10,11,13,16,17\)
acgatcgatgaagatgca
=--
--=
-=-
=--
-=-
acgatcgatgaagatgca
--=
-=-
=--
--=
-=-
acgatcgatgaagatgca
=--
=--
-=-
=--
--=
--=
2
17 a
10 aagatgca
0 acgatcgatgaagatgca
11 agatgca
3 atcgatgaagatgca
7 atgaagatgca
13 atgca
16 ca
5 cgatgaagatgca
3
\(S_{xy} (x,y\in\{a,c,g,t\})\) specifies the score for aligning the nucleotides \(x\) and \(y\). The scoring matrix can be interpreted as a log likelihood ratio:
where \(T\) is an arbitrary scaling factor, and \(R_{xy}\) is the likelihood ratio:
- \(P(xy|A)\) is the probability of observing \(x\) aligned to \(y\) in a probabilistic model of aligned sequences.
- \(P(x)\) and \(P(y)\) are the probabilities of observing \(x\) and \(y\) individually.
Then, extending the standard scoring matrix derivation to take sequencing error probabilities into account.
We assume that the sequencing instrument estimates \(P(z|d)\), the probability that a base is \(z\) (where \(z\in\{a,c,g,t\}\)) based on some observed data \(d_z\) (e.g. image intensities).
Following the likelihood-ratio principle, we define a generalization of standard substitution scores:
This formula can be rearranged by observing,
and by Bayes formula,
to obtain:
Finally, we define scores by the usual log transformation:
Reference: Incorporating sequence quality data into alignment improves DNA read mapping
4
Given the probability of one of the bases, \(x', y'\), we can make an assumption that the probability of the others (ex. \(x''\)) is following the respective \(P_{x''}\).
Therefore, the following equation holds;
Then, the formula can be rearranged as follows;
So, we get the following equation: