Post on 19-Dec-2015
CS262 Lecture 9, Win07, Batzoglou
Fragment Assembly
Given N reads…
Where N ~ 30 million…
We need to use a linear-time algorithm
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
Some Terminology
read: a word 500-900 letters long that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
consensus sequence: the sequence derived from the multiple alignment of reads in a contig
1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment; throw away if not >98% similar
[Figure: two overlapping reads, TAGATTACACAGATTAC, aligned base by base, next to short fragments that match only partially]
• Caveat: repeats
  A k-mer that occurs N times causes O(N²) read/read comparisons
  ALU k-mers could cause up to 1,000,000² comparisons
• Solution: discard all k-mers that occur “too often”
• Set the cutoff to balance the sensitivity/speed tradeoff, according to the genome at hand and the computing resources available
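The k-mer step above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `candidate_pairs`, the `max_occurrences` cutoff value, and counting a k-mer once per read are all assumptions, and reverse-complement matches are ignored.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=24, max_occurrences=50):
    """Return pairs of read indices that share at least one non-repetitive k-mer."""
    index = defaultdict(set)              # k-mer -> indices of reads containing it
    for read_id, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            index[read[start:start + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_occurrences:
            continue                      # discard k-mers that occur "too often"
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["TAGATTACACAG", "TTACACAGATTA", "CCCCCCCCCCCC"]
print(candidate_pairs(reads, k=6))        # only reads 0 and 1 share a 6-mer
```

Each surviving pair would then be extended to a full alignment and dropped if it is not >98% similar. Building the index is linear in the total read length, which is why the cutoff on repetitive k-mers matters: it bounds the quadratic pair-generation step.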
1. Find Overlapping Reads
• Correct errors using multiple alignment
Isolated errors are corrected:
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA   (replace T with C)
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA   (insert A)

Correlated errors, probably caused by repeats: disentangle overlaps
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA

After disentangling:
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA
In practice, error correction removes up to 98% of the errors
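A minimal sketch of the idea, assuming the reads are already placed in a multiple alignment: a letter that disagrees with its column and is seen in only one read is treated as a sequencing error, while a disagreement shared by several reads is left alone because it may mark a different repeat copy. The function name and the "seen exactly once" rule are illustrative assumptions, not the lecture's exact procedure.

```python
from collections import Counter

def correct_isolated_errors(aligned_reads):
    """Replace letters that disagree with the column majority in only one read."""
    corrected = [list(read) for read in aligned_reads]
    for col in range(len(aligned_reads[0])):
        counts = Counter(read[col] for read in aligned_reads)
        majority, _ = counts.most_common(1)[0]
        for row, read in enumerate(aligned_reads):
            # A minority symbol seen once is treated as an error; one seen in
            # several reads is kept (correlated, possibly a repeat copy).
            if read[col] != majority and counts[read[col]] == 1:
                corrected[row][col] = majority
    return ["".join(read) for read in corrected]

isolated = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",
]
print(correct_isolated_errors(isolated))
```

On the second alignment of the slide, where two reads share both the gap and the T, this sketch changes nothing; those reads would instead be separated when the overlaps are disentangled.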
2. Merge Reads into Contigs
• Overlap graph:
  Nodes: reads r1 … rn
  Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don’t know the “color” of these nodes
Overlap graph after forming contigs
Unitigs: Gene Myers, 95
Repeats, errors, and contig lengths
• Repeats shorter than the read length are easily resolved
  A read that spans a repeat disambiguates the order of the flanking regions
• Repeats with more base-pair differences than the sequencing error rate are OK
  We throw away overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
  Increase read length
  Decrease sequencing error rate
Role of error correction: discards up to 98% of single-letter sequencing errors
  ⇒ decreases error rate ⇒ decreases effective repeat content ⇒ increases contig length
3. Link Contigs into Supercontigs
[Figure: contigs annotated by read density and link consistency]
Normal density
Too dense ⇒ overcollapsed
Inconsistent links ⇒ overcollapsed?
3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
Result: a supercontig (aka scaffold)
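The linking rule can be sketched with a union-find structure. This is only an illustration under simplifying assumptions: the names `build_supercontigs` and `min_links` are hypothetical, each mate-pair link is given as a pair of contig names, and the order and orientation of contigs inside a supercontig are ignored.

```python
from collections import Counter

def build_supercontigs(contigs, mate_links, min_links=2):
    """Group contigs that are joined by at least `min_links` mate-pair links."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]     # path halving
            c = parent[c]
        return c

    link_counts = Counter(tuple(sorted(link)) for link in mate_links)
    for (a, b), count in link_counts.items():
        if a != b and count >= min_links:
            parent[find(a)] = find(b)         # connect the two groups

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c3", "c4")]
print(build_supercontigs(["c1", "c2", "c3", "c4"], links))
```

Requiring two or more links guards against a single chimeric or misplaced mate pair joining unrelated contigs; here the lone c2-c3 link is not enough to merge the two groups.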
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
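Weighted voting can be sketched as below. The per-base weights stand in for base quality values; how qualities are turned into weights is an assumption here, as is the choice to emit no base when a gap wins the column.

```python
from collections import defaultdict

def consensus(aligned_reads, qualities):
    """Pick, per column, the symbol with the largest summed quality weight."""
    result = []
    for col in range(len(aligned_reads[0])):
        votes = defaultdict(float)
        for read, quals in zip(aligned_reads, qualities):
            votes[read[col]] += quals[col]
        winner = max(votes, key=votes.get)
        if winner != "-":                 # a winning gap contributes no base
            result.append(winner)
    return "".join(result)

# One high-quality A outvotes two low-quality Cs.
print(consensus(["A", "C", "C"], [[10.0], [1.0], [1.0]]))   # prints A
```

The alternative on the slide, taking the maximum-quality letter, would replace the summed vote with a per-column maximum over individual qualities.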
Some Assemblers
• PHRAP
  • Early assembler, widely used, good model of read errors
  • Overlap O(n²) → layout (no mate pairs) → consensus
• Celera
  • First assembler to handle large genomes (fly, human, mouse)
  • Overlap → layout → consensus
• Arachne
  • Public assembler (mouse, several fungi)
  • Overlap → layout → consensus
• Phusion
  • Overlap → clustering → PHRAP → assemblage → consensus
• Euler
  • Indexing → Euler graph → layout by picking paths → consensus
Quality of assemblies
Celera’s assemblies of human and mouse
Quality of assemblies—mouse
Quality of assemblies—mouse
Terminology: N50 contig length. If we sort contigs from largest to smallest and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
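The definition translates directly into a few lines of code. One assumption in this sketch: the 50th percentile is taken relative to the total assembled length, since the true genome size is usually not known exactly.

```python
def n50(contig_lengths):
    """Length of the contig at which half of the total length is covered."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0                                # no contigs at all

print(n50([80, 70, 50, 40, 30, 20]))        # 80 + 70 = 150 >= 145, so prints 70
```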
Quality of assemblies—rat
History of WGA
• 1982: λ-virus, 48,502 bp
• 1995: H. influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present: human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes

1997:
Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
Phil Green: “That is impossible, and a bad idea anyway”
Genomes Sequenced
• http://www.genome.gov/10002154
Multiple Sequence Alignments
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
Protein Phylogenies
• Proteins evolve by both duplication and species divergence
Orthology and Paralogy
[Figure: tree with leaves HB Human, WB Worm, HA1 Human, HA2 Human, Yeast, WA Worm]
Orthologs: derived by speciation
Paralogs: everything else
Orthology, Paralogy, Inparalogs, Outparalogs
Definition
• Given N sequences x1, x2, …, xN: insert gaps (-) in each sequence xi, such that
  • All sequences have the same length L
  • Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology
• The patterns of conservation can help us tell function of the element
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep two rows and delete the columns in which both rows have a gap

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
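A short sketch of the induced alignment and of the sum-of-pairs score built from it. The scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions; the lecture does not fix a particular pairwise scoring function s.

```python
from itertools import combinations

def induced_pairwise(row_a, row_b):
    """Project a multiple alignment onto two rows, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Linear-gap score of one pairwise alignment (assumed scoring values)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        else:
            score += match if a == b else mismatch
    return score

def sum_of_pairs(rows):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pairwise(a, b)) for a, b in combinations(rows, 2))

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))     # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs([x, y, z]))    # 0 + (-4) + 3 = -1 under the assumed scores
```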
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
[Figure: tree with leaves Human, Mouse, Chicken, Duck]
• Weighted SOP:
  $S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$
A Profile Representation
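• Given a multiple alignment M = m1…mn
  Replace each column mi with profile entry pi
  • Frequency of each letter in the alphabet
  • # gaps
  • Optional: # gap openings, extensions, closings
Can think of this as a “likelihood” of each letter in each position

Alignment:
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

Profile (blank = 0):
| Column | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A |  | 1 |  |  |  |  | 1 |  |  | .8 |  |  |  |  |
| C | .6 |  |  |  | 1 |  |  | .4 | 1 |  | .6 | .2 |  |  |
| G |  |  | 1 | .2 |  |  |  |  |  | .2 |  |  | .4 | 1 |
| T | .2 |  |  |  |  | 1 |  | .6 |  |  |  |  | .2 |  |
| - | .2 |  |  | .8 |  |  |  |  |  |  | .4 | .8 | .4 |  |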
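Building such a profile from an alignment is a column-wise frequency count. A minimal sketch, with the function name `build_profile` and the dict-per-column layout chosen here for illustration only:

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT-"):
    """Return one {symbol: frequency} dict per alignment column."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({s: counts[s] / len(rows) for s in alphabet})
    return profile

rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
profile = build_profile(rows)
print(profile[0])     # first column: C 0.6, T 0.2, gap 0.2
```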
Multiple Sequence Alignments
Algorithms
Multidimensional DP
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \dots, i_N)$: score of the optimal alignment up to $(i_1, \dots, i_N)$
$F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\mathrm{nbr}) + S(\mathrm{nbr}) \big)$
Multidimensional DP
• Example: in 3D (three sequences):
• 7 neighbors/cell

$$
F(i,j,k) = \max \begin{cases}
F(i-1,\, j-1,\, k-1) + S(x_i, x_j, x_k) \\
F(i-1,\, j-1,\, k) + S(x_i, x_j, -) \\
F(i-1,\, j,\, k-1) + S(x_i, -, x_k) \\
F(i-1,\, j,\, k) + S(x_i, -, -) \\
F(i,\, j-1,\, k-1) + S(-, x_j, x_k) \\
F(i,\, j-1,\, k) + S(-, x_j, -) \\
F(i,\, j,\, k-1) + S(-, -, x_k)
\end{cases}
$$
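The three-sequence recurrence can be sketched as below. It is a minimal illustration, assuming a sum-of-pairs column score with match +1, mismatch -1, gap -2 and gap-gap pairs scoring 0; it returns only the optimal score and does no traceback.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; a gap-gap pair scores 0."""
    score = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        score += gap if "-" in (p, q) else (match if p == q else mismatch)
    return score

def align3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    moves = [m for m in product((0, 1), repeat=3) if any(m)]   # the 7 neighbors
    F = {(0, 0, 0): 0}
    cells = product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in cells:                 # lexicographic order: neighbors come first
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue                  # neighbor falls outside the cube
            column = (x[pi] if di else "-", y[pj] if dj else "-", z[pk] if dk else "-")
            candidate = F[pi, pj, pk] + sp_column(*column)
            if best is None or candidate > best:
                best = candidate
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]

print(align3_score("ACG", "ACG", "ACG"))   # 3 columns x 3 matching pairs = 9
```

The table has one cell per index triple and each cell inspects up to 7 neighbors, which is the 3-sequence case of the exponential cost in the number of sequences.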
Multidimensional DP
Running Time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
• How do gap states generalize?
• VERY badly!
  Require $2^N - 1$ states, one per combination of gapped/ungapped sequences
  Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$
[Figure: the 7 states for three sequences X, Y, Z: X, Y, Z, XY, XZ, YZ, XYZ]
Progressive Alignment
• When evolutionary tree is known:
  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Weighted version:
  Tree edges have weights, proportional to the divergence in that edge
  New profile is a weighted average of two old profiles
[Figure: tree over x, y, z, w; x and y are aligned into pxy, z and w into pzw, then pxy and pzw into pxyzw]
Example
Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
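The example can be checked with a few lines of code. This is a sketch under assumptions: the pairwise score `s` below (match +1, mismatch -1, letter against gap -2, gap against gap 0) is made up for illustration, and the merge uses equal weights, which reproduces the unweighted averages shown above.

```python
ALPHABET = "ACGT-"

def profile_score(px, py, s):
    """Expected pairwise score between two profile columns."""
    return sum(px[i] * py[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))

def merge_profiles(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

def s(a, b):
    """Assumed letter-pair score, for illustration only."""
    if a == "-" and b == "-":
        return 0
    if "-" in (a, b):
        return -2
    return 1 if a == b else -1

px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
gap_column = (0, 0, 0, 0, 1.0)

print(profile_score(px, py, s))        # 0.48 - 0.12 - 0.64 - 0.16, about -0.44
print(merge_profiles(px, py))          # about (0.7, 0.1, 0, 0, 0.2)
print(merge_profiles(px, gap_column))  # about (0.4, 0.1, 0, 0, 0.5)
```

In the weighted version, `wx` and `wy` would be derived from the tree edge weights rather than fixed at 0.5.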
Progressive Alignment
• When evolutionary tree is unknown:
  Perform all pairwise alignments
  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
  Construct a tree (UPGMA / Neighbor Joining / other methods)
  Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]
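As one way to construct the guide tree, here is a compact UPGMA sketch (average-linkage clustering). The input format (a dict from name pairs to distances), the Newick-like output string without branch lengths, and the alphabetical tie-breaking are choices made for this illustration.

```python
def upgma(names, dist):
    """Cluster `names` by average linkage; return a nested parenthesized tree."""
    size = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Closest pair of clusters; ties are broken alphabetically.
        a, b = min((tuple(sorted(p)) for p in d),
                   key=lambda p: (d[frozenset(p)], p))
        merged = f"({a},{b})"
        for c in size:
            if c not in (a, b):
                # Size-weighted average of the distances to the merged clusters.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

distances = {("x", "y"): 2, ("z", "w"): 2,
             ("x", "z"): 6, ("x", "w"): 6, ("y", "z"): 6, ("y", "w"): 6}
print(upgma(["x", "y", "z", "w"], distances))   # ((w,z),(x,y))
```

Progressive alignment would then follow this tree bottom-up: align w with z, align x with y, and finally align the two resulting profiles.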
Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN
2. Repeat step 1 until convergence
[Figure: x and z form a fixed projection; y is allowed to vary]
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:
x: GAAGTTA
y: G-ACTTA   (+3 matches)
z: GAACTGA
w: GTACTGA
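The "+3 matches" can be reproduced with a small sketch of one refinement step: take y out, keep the other rows fixed, and put y back in the best position. For brevity this tries every gap placement by brute force and keeps the alignment width fixed; a real implementation would realign the sequence to the profile of the remaining rows with DP. The match-count score is also an assumption made for the illustration.

```python
from itertools import combinations

def column_matches(row, others):
    """Count letter matches between `row` and every row in `others`."""
    return sum(a == b for other in others
               for a, b in zip(row, other) if a != "-")

def realign(sequence, others):
    """Try every gap placement of `sequence` against the fixed rows."""
    width = len(others[0])
    best_row, best_score = None, -1
    for gap_cols in combinations(range(width), width - len(sequence)):
        letters = iter(sequence)
        row = "".join("-" if c in gap_cols else next(letters) for c in range(width))
        score = column_matches(row, others)
        if score > best_score:
            best_row, best_score = row, score
    return best_row, best_score

fixed = ["GAAGTTA", "GAACTGA", "GTACTGA"]     # x, z, w stay frozen
print(column_matches("GAC-TTA", fixed))        # 12 matches before realigning
print(realign("GACTTA", fixed))                # ('G-ACTTA', 15): three more matches
```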
Iterative Refinement
Example not handled well:
x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA
Realigning any single yi changes nothing
Consistency
Basic method for applying consistency
• Compute all pairs of alignments xy, xz, yz, …
• When aligning x, y during progressive alignment:
  For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
  Align x and y with DP using the modified s(.,.) function
[Figure: sequences x, y, z with aligned positions xi, yj, yj’, zk]
Real-world protein aligners
• MUSCLE
  High throughput
  One of the best in accuracy
• ProbCons
  High accuracy
  Reasonable speed
MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences
   • D_DRAFT(x, y) defined in terms of # common k-mers (k ~ 3); O(N² L log L) time
2. Build tree T_DRAFT based on those distances, with UPGMA
3. Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
4. Measure new Kimura-based distances D(x, y) based on M_DRAFT
5. Build tree T based on D
6. Progressive alignment over T, to build M
   • Only perform alignment steps for the parts of the tree that have changed
7. Iterative refinement; for many rounds, do:
   • Tree partitioning: split M on one branch and realign the two resulting profiles
   • If the new alignment M’ has a better sum-of-pairs score than the previous one, accept
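Step 1 avoids alignment entirely by comparing k-mer content. The sketch below shows one plausible way to turn shared k-mers into a distance; the exact formula MUSCLE uses is not given in the slides, so `draft_distance` and its normalization by the shorter sequence are assumptions for illustration.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of the k-mers occurring in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def draft_distance(x, y, k=3):
    """1 minus the fraction of k-mers shared by x and y (multiset overlap)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())          # Counter '&' keeps minimum counts
    shortest = min(len(x), len(y)) - k + 1    # k-mers in the shorter sequence
    return 1.0 - shared / shortest if shortest > 0 else 1.0

print(draft_distance("ACGTACGT", "ACGTACGT"))   # 0.0 for identical sequences
print(draft_distance("AAAAAA", "CCCCCC"))       # 1.0 when no k-mer is shared
```

Because no alignment is computed, all N² distances are cheap, and the resulting matrix only has to be good enough to build the draft tree that the later steps refine.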
PROBCONS at a glance
1. Computation of all posterior matrices $M_{xy}$: $M_{xy}(i, j) = \mathrm{Prob}(x_i \sim y_j)$, using an HMM
2. Re-estimation of posterior matrices $M'_{xy}$ with probabilistic consistency
   • $M'_{xy}(i, j) = \frac{1}{N} \sum_{\text{sequence } z} \sum_k M_{xz}(i, k)\, M_{yz}(j, k)$; in matrix form, $M'_{xy} = \mathrm{Avg}_z(M_{xz} M_{zy})$
3. Compute, for every pair x, y, the maximum expected accuracy alignment
   • $A_{xy}$: alignment that maximizes $\sum_{\text{aligned } (i, j) \in A} M'_{xy}(i, j)$
   • Define $E(x, y) = \sum_{\text{aligned } (i, j) \in A_{xy}} M'_{xy}(i, j)$
4. Build tree T with hierarchical clustering using similarity measure E(x, y)
5. Progressive alignment on T to maximize E(.,.)
6. Iterative refinement; for many rounds, do:
   • Randomized partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
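The consistency re-estimation in step 2 is an average of matrix products. A minimal pure-Python sketch follows, under two stated assumptions: posterior matrices are supplied for every ordered pair of distinct sequences, and the terms for z = x and z = y each contribute $M_{xy}$ itself (that is, $M_{xx}$ is treated as the identity). Function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def consistency_transform(M, names):
    """Return M'_xy = (1/N) * sum over z of M_xz M_zy for every ordered pair."""
    n = len(names)
    out = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            rows, cols = len(M[x, y]), len(M[x, y][0])
            # z = x and z = y each contribute M_xy (M_xx taken as identity).
            total = [[2 * M[x, y][i][j] for j in range(cols)] for i in range(rows)]
            for z in names:
                if z in (x, y):
                    continue
                through_z = matmul(M[x, z], M[z, y])
                for i in range(rows):
                    for j in range(cols):
                        total[i][j] += through_z[i][j]
            out[x, y] = [[v / n for v in row] for row in total]
    return out

# Three one-letter sequences: x~y is uncertain, but both align confidently to z.
M = {("x", "y"): [[0.5]], ("y", "x"): [[0.5]],
     ("x", "z"): [[1.0]], ("z", "x"): [[1.0]],
     ("y", "z"): [[1.0]], ("z", "y"): [[1.0]]}
print(consistency_transform(M, ["x", "y", "z"])["x", "y"])   # rises from 0.5 to about 0.667
```

The toy input shows the intended effect: evidence routed through a third sequence raises the posterior for a pair that looked ambiguous on its own.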
Some Resources
Genome Resources
Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/

Protein Multiple Aligners
CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
PROBCONS (most accurate): http://probcons.stanford.edu/