CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 15
Genome assembly
CMSC423 Fall 2008 2
Admin
• Project questions?
Questions/answers
• Why do you need a multiple alignment for phylogeny?
• What is the running time of the neighbor-joining algorithm, given k sequences of length L?
• What is the parsimony score of the following tree, and what are the labels at internal nodes?
[Figure: tree with leaves C, T, A, G, T, T. Score tables at the internal nodes, one row per node:]
A C T G
2 1 1 2
2 2 1 1
2 3 2 2
4 4 3 4
5 5 3 5
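The five score tables on this slide (columns A, C, T, G) are consistent with Sankoff's dynamic program under unit substitution cost. The sketch below is a minimal illustration under two assumptions that the slide does not state: the tree topology (((C,T),(A,(G,T))),T), which is inferred from the numbers, and a cost of 1 for every substitution. The function name `sankoff` and the nested-tuple tree encoding are hypothetical.

```python
BASES = "ACTG"  # same column order as the score tables on the slide
INF = float("inf")

def sankoff(node, tables):
    """Return {base: minimum cost of the subtree if this node is labeled base}.

    A leaf is a one-letter string; an internal node is a (left, right) tuple.
    Each internal node's table is appended to `tables` in post-order.
    """
    if isinstance(node, str):
        return {b: (0 if b == node else INF) for b in BASES}
    children = [sankoff(child, tables) for child in node]
    scores = {}
    for b in BASES:
        # For each child, pick the child label c that minimizes
        # (cost below the child) + (1 if the edge b -> c is a substitution).
        scores[b] = sum(min(child[c] + (b != c) for c in BASES)
                        for child in children)
    tables.append([scores[b] for b in BASES])
    return scores

# Assumed topology, inferred from the slide's score tables.
tree = ((("C", "T"), ("A", ("G", "T"))), "T")
tables = []
root = sankoff(tree, tables)
parsimony_score = min(root.values())
root_label = min(root, key=root.get)
print(tables, parsimony_score, root_label)
```

Under these assumptions the root table has its minimum at T, so the parsimony score is the T entry of the last table. Labels for the other internal nodes would follow from a top-down traceback, which is omitted here.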
Reading assignment
• http://www.cbcb.umd.edu/research/assembly_primer.shtml
• Chapter 4.5 – coverage statistics
• Chapter 8 – genome assembly
Shotgun sequencing
[Figure: shearing, then sequencing, then assembly, giving back the original DNA (hopefully)]
Overview of terms
Assembly
Scaffolding
Shortest common superstring problem
Given a set of strings, Σ = (s1, ..., sn), determine the shortest string S such that every si is a substring of S.
NP-hard
Approximations: 4, 3, 2.89, ...
Greedy algorithm (4-approximation)
phrap, TIGR Assembler, CAP
...ACAGGACTGCACAGATTGATAG ACTGCACAGATTGATAGCTGA...
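As a rough illustration of the greedy approach to the superstring problem, the sketch below merges the pair of strings with the longest exact suffix-prefix overlap until no overlap remains. It assumes exact (error-free) overlaps, which real assemblers such as phrap, TIGR Assembler, and CAP do not; the names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(reads, min_len=1):
    """Repeatedly join the best-overlapping pair; return the remaining strings."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read; they add nothing to S.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no more joins possible
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads

# The two read fragments shown on the slide.
r1 = "ACAGGACTGCACAGATTGATAG"
r2 = "ACTGCACAGATTGATAGCTGA"
print(overlap(r1, r2), greedy_superstring([r1, r2]))
```

This version recomputes all overlaps after every join, which keeps the code short but is far slower than what a practical assembler would do.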
Greedy algorithm details
Compute all pairwise overlaps
* Pick best (e.g. in terms of alignment score) overlap
Join corresponding reads
Repeat from * until no more joins possible
• How do you compute an overlap alignment?
• Hint: modify Smith-Waterman dynamic programming algorithm
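One possible answer to the hint, as a sketch rather than the course's reference solution: keep the dynamic programming recurrence, but let the alignment start anywhere in the first read at no cost, force it to start at the beginning of the second read, and read the best score off the last row (the end of the first read). The scoring values (match +1, mismatch -1, gap -2) and the function name `overlap_alignment` are assumptions made for illustration.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of a against a prefix of b.

    Returns (score, j), where b[:j] is the prefix of b used in the overlap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # First column is 0: skipping a prefix of a is free.
    dp = [[0] * cols for _ in range(rows)]
    # First row is penalized: the overlap must start at the beginning of b.
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Unlike Smith-Waterman, scores are not clamped at zero.
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    # The overlap must run to the end of a, so only the last row counts.
    best_j = max(range(cols), key=lambda j: dp[-1][j])
    return dp[-1][best_j], best_j

score, prefix_len = overlap_alignment("ACAGGACTGCACAGATTGATAG",
                                      "ACTGCACAGATTGATAGCTGA")
print(score, prefix_len)
```

The table has one cell per pair of positions, so the cost is proportional to the product of the two read lengths.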
Repeats (where greedy fails)
AAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAA AAAAAA
AAAAAA AAAAAAAAAAAA AAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Impact of randomness – non-uniform coverage
[Figure: coverage (scale 1 to 6) along a contig, with the underlying reads]
Imagine raindrops on a sidewalk
Lander-Waterman statistics
L = read length
T = minimum overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
E(#islands) = N e^(–cσ)
E(island size) = L [ (e^(cσ) – 1) / c + 1 – σ ]
contig = island with 2 or more reads
See chapter 4.5
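A small numeric sketch of these statistics, using the slide's symbols. The island-size expression follows the bracketed Lander-Waterman form L[(e^(cσ) – 1)/c + (1 – σ)]. The function name `lander_waterman` and the example values for G, L, N, and T are made up for illustration.

```python
import math

def lander_waterman(G, L, N, T):
    """Expected island statistics for N reads of length L from a genome of size G.

    T is the minimum overlap needed to detect that two reads overlap.
    """
    c = N * L / G        # coverage
    sigma = 1 - T / L    # fraction of a read usable for detecting overlaps
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + (1 - sigma))
    return {"coverage": c, "sigma": sigma,
            "islands": islands, "island_size": island_size}

# Hypothetical project: 1 Mbp genome, 500 bp reads, 16,000 reads, 50 bp overlap.
stats = lander_waterman(G=1_000_000, L=500, N=16_000, T=50)
print(stats)
```

Raising N increases c, which shrinks the expected number of islands exponentially.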
All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from k-mer table (single pass through k-mer table)
[Figure: reads A through I, with a shared k-mer highlighted]
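The two passes described above can be sketched as follows: one pass over the reads to build the k-mer table, and one pass over the table to emit every pair of reads that share a k-mer. This is a minimal sketch that ignores reverse complements and sequencing errors; the function name `candidate_pairs` and the toy reads are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-name pairs that share at least one k-mer.

    reads: dict mapping read name -> sequence.
    """
    # Pass 1: table from each k-mer to the set of reads containing it.
    table = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(name)
    # Pass 2: any two reads listed under the same k-mer are candidates.
    pairs = set()
    for names in table.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

toy_reads = {"A": "ACGTACGTTT", "B": "CGTTTGGA", "C": "GGGGCCCC"}
pairs = candidate_pairs(toy_reads, k=5)
print(pairs)
```

Only the pairs returned here would then be passed to the more expensive overlap alignment. A k-mer that occurs in many reads (a repeat) still produces many pairs, so practical implementations usually cap or skip such k-mers.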
Additional pairwise-alignment details
• 4 types of overlaps: Normal, Innie, Outie, Anti-normal
• Often – assume first read is “forward”
• Representing the alignment: A-hang, B-hang
• Why not store length of overlap?
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph of reads 1 to 9, alternative layouts of reads 1, 2, 3, and a consensus example with sequence text ACCTGAACCTGAAGCTGAACCAGA]
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure: graph on nodes A through I, and reads A through I laid out along the genome]
Sequencing by hybridization
probes - all possible k-mers
AAAA AAAC AAAG AAAT AACA AACG AACT AAGA ...
AACAGTAGCTAGATG
AACA TAGC AGAT ACAG AGCT GATG CAGT GCTA AGTA CTAG GTAG TAGA
Assembling SBH data
Main entity: oligomer (overlap)
Relationship between oligomers: adjacency
ACCTGATGCCAATTGCACT...
CTGAT follows CCTGA (they share 4 nucleotides: CTGA)
Problem: given all the k-mers, find the original string
In assembly: fake the SBH experiment - break the reads into k-mers
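A minimal sketch of "faking" the SBH experiment: break a sequence into its k-mers and record which k-mer can follow which, meaning the last k-1 letters of one equal the first k-1 letters of the next. The helper names `kmers`, `follows`, and `successors` are hypothetical; the example uses the sequence fragment from the slide with k = 5.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def follows(a, b):
    """True if b can follow a: they share k-1 letters (suffix of a, prefix of b)."""
    return a[1:] == b[:-1]

def successors(spectrum):
    """Adjacency relation: for each k-mer, the k-mers that can follow it."""
    return {a: [b for b in spectrum if follows(a, b)] for a in spectrum}

spectrum = kmers("ACCTGATGCCAATTGCACT", 5)
adjacency = successors(spectrum)
print(adjacency["CCTGA"])
```

Comparing every k-mer against every other is quadratic; indexing k-mers by their (k-1)-letter prefix would avoid that, at the cost of a slightly longer example.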
Eulerian circuit
• Eulerian circuit: visit each edge (bridge) exactly once and come back to the start
ACCTAGATTGAGGTCG
CCTAGATTGAGGTCG ACCTAGATTGAGGTC
de Bruijn graph
• Nodes – set of k-mers obtained from the reads
• Edges – link k-mers that overlap by k-1 letters
ACCAGTGCA
CCAGTGCAT
• This formulation is particularly useful for very short reads
• Solution – Eulerian path through the graph
• Note – multiple Eulerian paths possible (exponential number) due to repeats
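The sketch below builds a de Bruijn graph and walks an Eulerian path with Hierholzer's algorithm. It uses the common variant in which nodes are (k-1)-mers and each k-mer is an edge, so that an Eulerian path uses every k-mer exactly once; this shifts k by one relative to the wording above. It assumes error-free k-mers and that an Eulerian path exists, and all function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Edge list per node: each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in sorted(graph)
                  if len(graph[u]) - in_deg.get(u, 0) == 1), None)
    if start is None:
        start = next(iter(graph))  # circuit case: any node with edges works
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]

def spell(path):
    """String spelled by a path of (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

def reconstruct(seq, k):
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return spell(eulerian_path(de_bruijn(kmers)))

print(reconstruct("ACCTGATG", 3))
```

When the sequence contains repeats longer than k-1, the graph can admit several Eulerian paths and this function returns just one of them, which need not be the original sequence.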
de Bruijn graph of Mycoplasma genitalium
Read-length vs. genome complexity