A New Approach to Fragment Assembly in DNA Sequencing
A New Approach to Fragment Assembly in DNA Sequencing
Fei Wu
April 24, 2006
2
Preface
• Introduce the author
• The background of the paper
• The history of DNA sequencing
3
Traditional DNA Sequencing
• Read 500 – 700 nucleotides at a time from the small fragments (Sanger method)
• Shear DNA into millions of small fragments
[Figure labels: Shake, DNA]
4
Fragment Assembly
• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring")
• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem
5
Shortest Superstring Problem
Problem: Given a set of strings, find a shortest string that contains all of them
Input: Strings s1, s2, …, sn
Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized
Complexity: NP-complete
Note: this formulation does not take into account sequencing errors
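Because the exact problem is NP-complete, small examples are often explored with a heuristic. The sketch below is a greedy merge (repeatedly join the two strings with the largest overlap); it is a swapped-in approximation, not the method of the slides, and it is not guaranteed to return a shortest superstring. The names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    # Drop duplicates and reads that are already contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the shared part only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


result = greedy_superstring(["ATG", "TGG", "GGC"])
print(result, len(result))
```

On these three toy reads the greedy result happens to be optimal; on inputs with repeats it can be longer than the true shortest superstring.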
6
Reducing SSP to Eulerian Path Problem
Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.
[Figure: overlap example with copies of the read aaaggcatcaaatctaaaggcatcaaa aligned against each other]
Construct a graph with n vertices representing the n strings s1, s2, …, sn.
Insert edges of length overlap(si, sj) between vertices si and sj.
Find the shortest path which visits every vertex exactly once, where "shortest" means the shortest resulting superstring, i.e. the largest total overlap. This is the Traveling Salesman Problem (TSP), which is also NP-complete.
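The construction above can be sketched for a handful of reads as follows. This is a minimal illustration under stated assumptions: the reads are distinct, none contains another, and the path is found by trying every ordering, which is only feasible for tiny inputs. The function names are hypothetical.

```python
from itertools import permutations


def overlap(s_i, s_j):
    """Length of the longest prefix of s_j that matches a suffix of s_i."""
    for k in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i.endswith(s_j[:k]):
            return k
    return 0


def overlap_graph(reads):
    """Directed graph as a dict: (s_i, s_j) -> overlap(s_i, s_j), for distinct reads."""
    return {(a, b): overlap(a, b) for a in reads for b in reads if a != b}


def best_hamiltonian_path(reads):
    """Brute force: the ordering of all reads with the largest total overlap."""
    graph = overlap_graph(reads)

    def total_overlap(order):
        return sum(graph[(a, b)] for a, b in zip(order, order[1:]))

    return max(permutations(reads), key=total_overlap)


reads = ["ATG", "TGG", "GGC"]
path = best_hamiltonian_path(reads)
superstring = path[0] + "".join(
    nxt[overlap(prev, nxt):] for prev, nxt in zip(path, path[1:])
)
print(path, superstring)
```

The brute-force search visits all n! orderings, which is the practical face of the NP-completeness noted on the slide.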
7
de Bruijn Graph Properties
(Here m is the number of symbols and n is the length of the sequence labelling each vertex.)
If n = 1 then the condition for any two vertices forming an edge holds vacuously, and hence all the vertices are connected, forming a total of m^2 edges.
Each vertex has exactly m incoming and m outgoing edges.
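These two properties can be checked numerically on small cases. The sketch below assumes the usual definition (vertices are all length-n sequences over m symbols, with an edge from v to v minus its first symbol plus one appended symbol); the helper name `de_bruijn_edges` is hypothetical.

```python
from collections import Counter
from itertools import product


def de_bruijn_edges(alphabet, n):
    """All edges of the n-dimensional de Bruijn graph over the given alphabet."""
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    # Shift the label left by one symbol and append any symbol of the alphabet.
    return [(v, v[1:] + c) for v in vertices for c in alphabet]


alphabet = "ACGT"  # m = 4
m = len(alphabet)

edges_n1 = de_bruijn_edges(alphabet, 1)
print(len(edges_n1) == m ** 2)  # n = 1: every vertex connects to every vertex

edges_n2 = de_bruijn_edges(alphabet, 2)
out_degree = Counter(src for src, _ in edges_n2)
in_degree = Counter(dst for _, dst in edges_n2)
print(set(out_degree.values()), set(in_degree.values()))
```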
8
Sequencing by Hybridization
9
l-mer (l-tuple) composition
Spectrum(s, l): unordered multiset of all (n – l + 1) l-mers in a string s of length n
The order of individual elements in Spectrum(s, l) does not matter
For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
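A minimal sketch of the spectrum definition, using `collections.Counter` as the multiset so that ordering is ignored but repeated l-mers are still counted. The function name `spectrum` is chosen here for illustration.

```python
from collections import Counter


def spectrum(s, l):
    """Multiset of all (n - l + 1) l-mers of s, where n = len(s)."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


spec = spectrum("TATGGTGC", 3)
print(sorted(spec.elements()))

# The three listings on the slide are the same multiset in different orders.
a = Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
b = Counter(["ATG", "GGT", "GTG", "TAT", "TGC", "TGG"])
c = Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"])
print(spec == a == b == c)
```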
10
SBH: Eulerian Path Approach
S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S
[Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG]
Path visited every EDGE once
11
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }, on the vertices { AT, TG, GC, GG, GT, CA, CG }, corresponds to two different paths:
ATGGCGTGCA
ATGCGTGGCA
[Figure: the two paths drawn on the graph with vertices AT, TG, GC, GG, GT, CA, CG]
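The ambiguity can be reproduced with a small sketch that builds the graph from a list of 3-mers (vertices are 2-mers, each 3-mer is one edge) and enumerates every Eulerian path by backtracking. The input below is the set of 3-mers shared by the two sequences ATGGCGTGCA and ATGCGTGGCA; the function name is hypothetical, and exhaustive backtracking is an assumption made for clarity on toy input, not how an assembler would do it.

```python
from collections import defaultdict


def eulerian_reconstructions(lmers):
    """Every string spelled by a path that uses each l-mer (edge) exactly once."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])  # edge: prefix (l-1)-mer -> suffix (l-1)-mer
    total_edges = len(lmers)
    results = set()

    def extend(vertex, used, spelled):
        if len(used) == total_edges:
            results.add(spelled)
            return
        for idx, nxt in enumerate(graph.get(vertex, [])):
            if (vertex, idx) not in used:
                extend(nxt, used | {(vertex, idx)}, spelled + nxt[-1])

    # Try every vertex with an outgoing edge as a possible start.
    for start in list(graph):
        extend(start, frozenset(), start)
    return results


lmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
for sequence in sorted(eulerian_reconstructions(lmers)):
    print(sequence)
```

Both reconstructions use every edge exactly once, so the spectrum alone cannot distinguish them.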
12
Error Correction or Data Corruption
The EULER algorithm sometimes introduces errors.
It introduces them as the price of reducing the complexity of the de Bruijn graph.
Reducing the de Bruijn graph eliminates false edges.
For example, in the N. meningitidis sequencing project, orphan elimination corrects 234410 errors and introduces 1452 errors.
13
Observations of EULER
14
Conclusions
Finishing is a bottleneck in large-scale DNA sequencing.
EULER has excellent scaling potential.
The complexity of EULER is mainly defined by the number of tangles rather than the number of repeats or the length of the genomes.
Any Questions?