short read genome assembly - brown...
Sorin Istrail
CSCI1820
Short-read genome assembly algorithms
3/6/2014
Genomathica Assembler
• Mathematica notebook for genome assembly simulation
• Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_assembler.nb
• Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta
• Remember to
– Change the input genome to your FASTA file's location
– Evaluate each cell initially; after that you only need to evaluate the last two cells, which re-run the assembly and display the results, respectively
– Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/
[Figures: simulated assemblies at coverage = 1, 2, 3, 4, and 5, followed by coverage = 2 with paired ends]
• Sequence reads are in black
• Contiguous strings of assembled DNA (contigs) are in red
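To make the coverage values concrete, here is a minimal sketch of a read simulator. It assumes the usual definition of coverage (number of reads × read length / genome length), uniform read start positions, and error-free reads. The function names `simulate_reads` and `covered_fraction`, the random toy genome, and the read length of 50 are illustrative choices, not part of the Genomathica notebook.

```python
import math
import random

def simulate_reads(genome, read_len, coverage, seed=0):
    """Sample fixed-length reads uniformly until the target coverage is reached."""
    rng = random.Random(seed)  # seeded so repeated runs give the same reads
    n_reads = math.ceil(coverage * len(genome) / read_len)
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]

def covered_fraction(genome_len, reads):
    """Fraction of genome positions touched by at least one read."""
    covered = [False] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / genome_len

# Toy random genome (an assumption; the slides use phix174.fasta).
genome = "".join(random.Random(1).choice("ACGT") for _ in range(1000))
for c in (1, 2, 3, 4, 5):
    reads = simulate_reads(genome, read_len=50, coverage=c)
    print(c, len(reads), round(covered_fraction(len(genome), reads), 2))
```

At low coverage many positions are hit by no read at all, which is one reason the contigs at coverage = 1 are short and fragmented.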
Raw Sequence Reads
Sample prep
Sequence data
• wet-lab experimental methods to isolate, prepare, and sequence the DNA
• results in a number of large FASTQ files
• FASTQC can be used to check basic statistics of the files
– http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• many tools available for QC
– e.g. http://hannonlab.cshl.edu/fastx_toolkit/
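As a rough illustration of the "basic statistics" such QC tools report, the sketch below parses FASTQ records and computes the read count, mean read length, and mean base quality. It assumes four lines per record (no wrapped sequences) and Phred+33 quality encoding. The helper names and the two toy records are made up for the example, and this is not how FASTQC itself is implemented.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def basic_stats(records, offset=33):
    """Read count, mean read length, and mean Phred quality per base."""
    n_reads = total_bases = total_qual = 0
    for _name, seq, qual in records:
        n_reads += 1
        total_bases += len(seq)
        total_qual += sum(ord(c) - offset for c in qual)
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads,
        "mean_quality": total_qual / total_bases,
    }

# Toy input (assumed data): 'I' encodes quality 40 and '#' encodes quality 2.
toy = io.StringIO("@r1\nGACGTACGT\n+\nIIIIIIIII\n@r2\nACGTACGTT\n+\nIIIII####\n")
stats = basic_stats(read_fastq(toy))
print(stats)
```

For a real file, replace the `io.StringIO` object with `open(path)`; an empty file would need a guard against division by zero.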
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.
http://www.ncbi.nlm.nih.gov/Traces/sra/
Genome Assembly Software
• Overlap-layout-consensus
– Celera: http://wgs-assembler.sourceforge.net/
• K-mer based
– Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
– SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html
– ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
Two graph models
• A first graph model
– Nodes (vertices) are contiguous sequences of k characters (k-mers)
– Directed edge from v_i to v_j if v_i[2..k] = v_j[1..k-1]
Sequence: A C G T T C
Nodes (k=3): ACG CGT GTT TTC
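The first graph model can be sketched in a few lines. The 1-indexed condition v_i[2..k] = v_j[1..k-1] becomes `u[1:] == v[:-1]` with Python's 0-indexed slices. The function names are illustrative, and the quadratic pairwise comparison is only meant for small examples.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def overlap_graph(seq, k):
    """Nodes are the distinct k-mers; edge u -> v when u[1:] == v[:-1]."""
    nodes = sorted(set(kmers(seq, k)))
    edges = [(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]]
    return nodes, edges

nodes, edges = overlap_graph("ACGTTC", 3)
print(nodes)  # ['ACG', 'CGT', 'GTT', 'TTC']
print(edges)  # [('ACG', 'CGT'), ('CGT', 'GTT'), ('GTT', 'TTC')]
```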
Two graph models
• De Bruijn graph
– Nodes (vertices) are contiguous sequences of k-1 characters ((k-1)-mers)
– Directed edge from v_i to v_j if v_i[1..k-1] + v_j[k-1] is a valid k-mer
Sequence: A C G T T C
Nodes (k=3): AC CG GT TT TC
Edges: ACG CGT GTT TTC
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Note edges that are not reflected in the input!
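A minimal sketch of the de Bruijn construction for the same sequence follows. It takes "valid k-mer" to mean a k-mer that actually occurs in the input, which is an assumption about the slide's wording; the function name `de_bruijn` is illustrative.

```python
from collections import defaultdict

def de_bruijn(seq, k):
    """Nodes are (k-1)-mers; each k-mer in seq adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
        graph[kmer[1:]]  # touch the suffix so nodes without out-edges are kept
    return dict(graph)

graph = de_bruijn("ACGTTC", 3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Each edge corresponds to one k-mer: the edge AC -> CG spells ACG, CG -> GT spells CGT, and so on.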
Genome Assembly
• Building the k-mer graph
– nodes as k-mers, edges (k-1) overlap
Genome assembly
Genome: GACGTACGTT
Reads: GACGTACGT, ACGTACGTT
[Figure, built up over three frames: k-mer graphs for k=4 and k=3, with count labels on the graph]
k=4 nodes: GACG ACGT CGTA GTAC TACG CGTT
k=3 nodes: GAC ACG CGT GTA TAC GTT
Genome Assembly
• Building the k-mer graph
– nodes as k-mers, edges (k-1) overlap
– nodes as (k-1)-mers, edges form k-mers
Genome assembly
Genome: GACGTACGTT
Reads: GACGTACGT, ACGTACGTT
[Figure, built up over three frames: de Bruijn graphs for k=4 and k=3, with count labels on the graph]
k=4 nodes ((k-1)-mers): GAC ACG CGT GTA TAC GTT
k=3 nodes ((k-1)-mers): GA AC CG GT TA TT
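One way to obtain edge counts for a de Bruijn graph built from reads is to count how often each k-mer occurs across all reads. The sketch below does this for the example genome GACGTACGTT. It assumes the fused read string in the transcript splits into the two reads GACGTACGT and ACGTACGTT, which is the only two-way split in which both pieces are substrings of the genome.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts

reads = ["GACGTACGT", "ACGTACGTT"]  # assumed split of the fused read string
for k in (4, 3):
    for kmer, n in kmer_counts(reads, k).items():
        # Each k-mer is one de Bruijn edge: prefix (k-1)-mer -> suffix (k-1)-mer.
        print(f"k={k}: {kmer[:-1]} -> {kmer[1:]}  ({kmer} seen {n}x)")
```

For k=4 the edge ACG -> CGT is seen four times because ACGT occurs twice in each read, a small instance of how repeats show up as high-multiplicity edges.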
Genome Assembly
• Building the k-mer graph
– G(k): nodes as k-mers, edges (k-1) overlap
– H(k): nodes as (k-1)-mers, edges form k-mers
• H(k) = G(k-1)
– So it really does not matter which you choose to implement
• Where does the complexity come from?
– Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions
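The relation H(k) = G(k-1) can be checked on small inputs. In this sketch, G adds an edge whenever two nodes overlap, while H adds an edge only for k-mers that occur in the input. That stricter rule for H is an assumption (a common implementation choice), made here to show where the two graphs can differ: the node sets always agree, and the edges of H(k) are a subset of those of G(k-1).

```python
def kmer_set(seq, k):
    """The set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def G(seq, k):
    """G(k): nodes are k-mers, edge u -> v whenever they overlap by k-1."""
    nodes = kmer_set(seq, k)
    edges = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, edges

def H(seq, k):
    """H(k): nodes are (k-1)-mers, one edge per k-mer that occurs in seq."""
    nodes = kmer_set(seq, k - 1)
    edges = {(kmer[:-1], kmer[1:]) for kmer in kmer_set(seq, k)}
    return nodes, edges

for seq, k in (("GACGTACGTT", 4), ("ACGTTC", 3)):
    g_nodes, g_edges = G(seq, k - 1)
    h_nodes, h_edges = H(seq, k)
    print(seq, "same nodes:", g_nodes == h_nodes,
          "| edges only in G(k-1):", sorted(g_edges - h_edges))
```

For GACGTACGTT with k=4 the two graphs are identical. For ACGTTC with k=3, G(2) contains three overlap edges (GT -> TC, TC -> CG, TT -> TT) whose 3-mers never occur in the input.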
Popping bubbles
An error occurs in the middle of a read and is propagated to many k-mers.
Trimming tips
An error creates an erroneous ending k-mer.
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
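Tip trimming can be sketched as follows. This is a simplified illustration and not Velvet's actual procedure: a dead-end chain of at most `max_tip_len` nodes is removed when it hangs off a branching node and its connecting edge has lower count than another edge leaving that node. The function names, the threshold, and the toy reads (three correct copies plus one read with an error in its last base) are all assumptions made for the example.

```python
from collections import defaultdict

def build_graph(reads, k):
    """De Bruijn graph with counts: graph[u][v] = times the k-mer u + v[-1] was seen."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
            graph[kmer[1:]]  # register nodes that only appear as targets
    return {u: dict(vs) for u, vs in graph.items()}

def trim_tips(graph, max_tip_len):
    """Remove short, weakly supported dead-end chains; returns a new graph."""
    graph = {u: dict(vs) for u, vs in graph.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        preds = defaultdict(set)  # rebuilt after every removal
        for u, vs in graph.items():
            for v in vs:
                preds[v].add(u)
        for end in [u for u in graph if not graph[u]]:  # nodes with no successor
            chain, node = [end], end
            while len(preds[node]) == 1 and len(chain) <= max_tip_len:
                (parent,) = preds[node]
                if len(graph[parent]) > 1:
                    # Branch point reached: drop the chain only if a
                    # better-supported sibling edge exists.
                    if graph[parent][node] < max(graph[parent].values()):
                        for n in chain:
                            del graph[n]
                        del graph[parent][node]
                        changed = True
                    break
                chain.append(parent)
                node = parent
            if changed:
                break
    return graph

reads = ["ACGTTCAG"] * 3 + ["ACGTTG"]  # last read ends in an erroneous base
graph = build_graph(reads, 4)
trimmed = trim_tips(graph, max_tip_len=2)
print(sorted(set(graph) - set(trimmed)))  # nodes removed as tips
```

The count comparison matters: the true end of the sequence is also a dead end, so a rule based on length alone could delete genuine sequence.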
Chimeric extensions
Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence.
Repetitive regions
• Satellites, SINEs, LINEs
• Homologous genes
– Ortholog: descended from the same ancestral sequence and separated by speciation
– Paralog: genes created by a duplication event
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Velvet assembler
• Four stages
– Hashing reads into k-mers
– Constructing the de Bruijn graph (not all 4^k k-mers, only those that exist in the input)
– Correct errors
– Resolve repeats
• But what after?
– Paper gives very little information on this...
The Chinese postman problem (CPP)
• Compute a closed tour of minimum length that visits each edge at least once
– Similar to what we want, except we may want to visit edges more than once due to repeats
  • How do we deal with repeats?
– Also, the starting and ending vertices are distinct in genome assembly
  • How can we convert the closed tour to an open one?
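For the idealized case where every k-mer is listed once per occurrence in the genome, an open Eulerian path through the de Bruijn multigraph spells the sequence. The sketch below uses Hierholzer's algorithm. It assumes such a path exists and does not check for one, and it is not the Edmonds-Johnson CPP procedure; real read data would first need error correction and an estimate of edge multiplicities. The function names are illustrative.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm on a directed multigraph given as (u, v) pairs."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # An open tour starts at the node with one surplus outgoing edge, if any.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the tour
    return path[::-1]

def assemble(kmers):
    """Spell the sequence along an Eulerian path of the k-mers' de Bruijn graph."""
    edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
    path = eulerian_path(edges)
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "GACGTACGTT"
k = 4
kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]  # ACGT appears twice
print(assemble(kmers))  # GACGTACGTT
```

The repeated k-mer ACGT is kept as two parallel edges, so the tour crosses the repeat twice; with only distinct k-mers the loop through GTA and TAC could not be both entered and left.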
Your homework
• You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching/Edmonds-Johnson-chinese-postman.pdf
• You are not even required to model genome assembly as CPP
• But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour.
Evaluating assembly
• The Assemblathon 2 study lists 102 measures for evaluating assembly quality.
• Bradnam et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
1. NG50 scaffold length: the largest length x such that all scaffolds of length x or longer together make up at least 50% of the genome size
2. NG50 contig length: the largest length x such that all contigs of length x or longer together make up at least 50% of the genome size
3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding.
4. CEGMA: Number of 458 core genes mapped
5. Fosmid coverage: How many validated fosmid regions were captured in the assembly
6. Fosmid validity: Percentage of the assembly validated by validated fosmid regions
7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold, multiplied by the percentage of uniquely mapping tag pairs that map with the correct distance. Rewards short-range accuracy.
8. and 9. Using local and global alignments of optical map data, how well the assembly is ordered.
10. REAPR summary score: a tool that evaluates the accuracy of an assembly using paired reads
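The NG50 definition in items 1 and 2 translates directly into code: sort the lengths in decreasing order and accumulate until half the genome size is reached. The sketch also includes the related N50, which uses the total assembly size instead of the genome size; N50 is not on the slide and is added only for comparison. The scaffold lengths and genome sizes are made-up toy values.

```python
def ng50(lengths, genome_size):
    """Largest x such that pieces of length >= x sum to at least half of genome_size."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= genome_size / 2:
            return length
    return None  # the assembly covers less than half of the genome

def n50(lengths):
    """Same computation, relative to the total assembly size."""
    return ng50(lengths, sum(lengths))

scaffolds = [80, 70, 50, 40, 30, 20]  # toy scaffold lengths, total 290
print(n50(scaffolds))                     # 70
print(ng50(scaffolds, genome_size=400))   # 50
print(ng50(scaffolds, genome_size=1000))  # None
```

Because NG50 is normalized by the genome size, it stays comparable between assemblies of the same genome even when their total lengths differ.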
Evaluating assembly