Short-read genome assembly algorithms
Sorin Istrail, CSCI1820, 3/6/2014


Source: Brown University, cs.brown.edu/Courses/Csci1810/Lectures/Slides/Genome2.pdf


Genomathica Assembler

• Mathematica notebook for genome assembly simulation
• Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_assembler.nb
• Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta
• Remember to
– Change the input genome to your FASTA file's location
– Evaluate each cell initially; after that you only need to evaluate the last two cells to re-run the assembly and display the results, respectively
– Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/

Coverage = 1

• Sequence reads are in black

• Contiguous strings of assembled DNA (contigs) are in red

Coverage = 2

• Sequence reads are in black

• Contiguous strings of assembled DNA (contigs) are in red

Coverage = 3

• Sequence reads are in black

• Contiguous strings of assembled DNA (contigs) are in red

Coverage = 4

• Sequence reads are in black

• Contiguous strings of assembled DNA (contigs) are in red

Coverage = 5

• Sequence reads are in black

• Contiguous strings of assembled DNA (contigs) are in red
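The effect of coverage on contig count shown in the slides above can be imitated with a small simulation. This is only a sketch under simplifying assumptions that are not from the slides: reads are sampled uniformly, are error-free, all have the same length, and two reads are merged into one contig as soon as they share at least one base. The function names and the genome and read lengths are illustrative.

```python
import random


def contigs_from_reads(starts, read_len):
    """Merge reads (given by start position) into maximal covered intervals."""
    contigs = []
    for s in sorted(starts):
        e = s + read_len
        if contigs and s < contigs[-1][1]:
            # The read shares at least one base with the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], e)
        else:
            # Gap (or mere adjacency) before this read: start a new contig.
            contigs.append([s, e])
    return [(s, e) for s, e in contigs]


def simulate(genome_len, read_len, coverage, seed=0):
    """Return (number of contigs, fraction of genome covered) for one random draw."""
    rng = random.Random(seed)
    n_reads = coverage * genome_len // read_len
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    contigs = contigs_from_reads(starts, read_len)
    covered = sum(e - s for s, e in contigs)
    return len(contigs), covered / genome_len


if __name__ == "__main__":
    for c in range(1, 6):
        n, frac = simulate(genome_len=5000, read_len=50, coverage=c)
        print(f"coverage={c}: {n} contigs, {frac:.1%} of genome covered")
```

A real assembler needs a minimum overlap longer than one base before joining reads, so it would report more and shorter contigs than this sketch at the same coverage.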

Coverage = 2, paired ends

Raw Sequence Reads

Sample prep

Sequence data

• wet-lab experimental methods to isolate, prepare, and sequence the DNA
• results in a number of large FASTQ files
• FastQC can be used to check basic statistics of the files
– http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• many tools available for QC
– e.g. http://hannonlab.cshl.edu/fastx_toolkit/
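To make the FASTQ files mentioned above concrete, here is a minimal sketch of reading records and computing a per-read mean quality. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; neither is stated on the slide, and older files may use a different offset. The sample records and function names are made up for illustration and this is not how FastQC itself is implemented.

```python
SAMPLE = """@read1
GACGTACG
+
IIIIIIII
@read2
TACG
+
##II
"""


def parse_fastq(text):
    """Yield (name, sequence, quality) from unwrapped four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (offset=33 is an assumption)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


if __name__ == "__main__":
    for name, seq, qual in parse_fastq(SAMPLE):
        print(name, len(seq), mean_quality(qual))
```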


Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.


http://www.ncbi.nlm.nih.gov/Traces/sra/

Genome Assembly Software

• Overlap-layout-consensus
– Celera: http://wgs-assembler.sourceforge.net/
• K-mer based
– Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
– SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
– ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

Two graph models

• A first graph model
– Nodes (vertices) are contiguous sequences of k characters (k-mers)
– Directed edge from v_i to v_j if v_i[2..k] = v_j[1..k-1]

Example (k = 3): sequence ACGTTC
Nodes: ACG CGT GTT TTC
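The first graph model can be sketched in a few lines. The function names below are illustrative, and the quadratic all-pairs overlap test is chosen for readability, not efficiency.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_graph(seqs, k):
    """Nodes are distinct k-mers; edge u -> v when u[2..k] equals v[1..k-1]."""
    nodes = sorted({km for s in seqs for km in kmers(s, k)})
    edges = [(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]]
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = kmer_graph(["ACGTTC"], 3)
    print(nodes)
    print(edges)
```

Note that edges are decided purely by the overlap test, so two k-mers are connected whenever they overlap, whether or not they were adjacent in any input sequence.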

Two graph models

• De Bruijn graph
– Nodes (vertices) are contiguous sequences of k-1 characters ((k-1)-mers)
– Directed edge from v_i to v_j if v_i[1..k-1] + v_j[k-1] is a k-mer present in the input

Example (k = 3): sequence ACGTTC
Nodes: AC CG GT TT TC
Edges: ACG CGT GTT TTC
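A minimal sketch of the de Bruijn construction for the same example follows. Each k-mer found in the input becomes one edge from its (k-1)-prefix to its (k-1)-suffix; the function name and the triple representation of edges are choices made for this illustration.

```python
def de_bruijn(seqs, k):
    """Return (nodes, edges); each edge is (prefix, suffix, k-mer label)."""
    labels = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    edges = sorted((km[:-1], km[1:], km) for km in labels)
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = de_bruijn(["ACGTTC"], 3)
    print(nodes)
    for u, v, label in edges:
        print(f"{u} -> {v}  ({label})")
```

Because edges are created only from observed k-mers, this graph never contains an edge that is absent from the input.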


Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Note edges that are not reflected in the input!

Genome Assembly

• Building the k-mer graph
– nodes as k-mers, edges (k-1) overlap

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

k = 4: GACG ACGT CGTA
k = 3: GAC ACG CGT GTA

(Figure: k-mer graphs for k = 4 and k = 3 after adding the first read, with numeric count labels.)

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

k = 4: GACG ACGT CGTA GTAC TACG
k = 3: GAC ACG CGT GTA TAC

(Figure: k-mer graphs for k = 4 and k = 3 after adding the second read, with numeric count labels.)

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

k = 4: GACG ACGT CGTA GTAC TACG CGTT
k = 3: GAC ACG CGT GTA TAC GTT

(Figure: k-mer graphs for k = 4 and k = 3 after adding all three reads, with numeric count labels.)
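The counts in the worked example can be recomputed with a short script. It assumes the read string on the slide splits into three 6-base reads, GACGTA, CGTACG and TACGTT; that split is an inference from the k-mers that appear slide by slide, not something the slides state. In the k-mer graph, an edge between two overlapping k-mers observed in a read corresponds to a (k+1)-mer, which is how the edge multiplicities are counted here.

```python
from collections import Counter

READS = ["GACGTA", "CGTACG", "TACGTT"]  # assumed split of the slide's read string


def count_kmers(reads, k):
    """How many times each k-mer occurs across all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def edge_multiplicities(reads, k):
    """Count of each observed edge u -> v between overlapping k-mers."""
    observed = count_kmers(reads, k + 1)  # a (k+1)-mer spans exactly one edge
    return Counter({(w[:-1], w[1:]): n for w, n in observed.items()})


if __name__ == "__main__":
    for k in (3, 4):
        print(f"k={k} node counts:", dict(count_kmers(READS, k)))
        print(f"k={k} edge counts:", dict(edge_multiplicities(READS, k)))
```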

Genome Assembly

• Building the k-mer graph
– nodes as k-mers, edges (k-1) overlap
– nodes as (k-1)-mers, edges form k-mers

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

k = 4, k = 3
2-mers: GA AC CG GT TA
3-mers: GAC ACG CGT GTA

(Figure: de Bruijn graphs after adding the first read, with numeric count labels.)

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

k = 4, k = 3
2-mers: GA AC CG GT TA
3-mers: GAC ACG CGT GTA TAC

(Figure: de Bruijn graphs after adding the second read, with numeric count labels.)

Genome assembly

Genome: GACGTACGTT
Reads: GACGTA, CGTACG, TACGTT

k = 4, k = 3
2-mers: GA AC CG GT TA TT
3-mers: GAC ACG CGT GTA TAC GTT

(Figure: de Bruijn graphs after adding all three reads, with numeric count labels.)

Genome Assembly

• Building the k-mer graph
– G(k): nodes as k-mers, edges (k-1) overlap
– H(k): nodes as (k-1)-mers, edges form k-mers
• H(k) = G(k-1)
– So it really does not matter which you choose to implement
• Where does the complexity come from?
– Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions
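The relation between H(k) and G(k-1) can be checked directly on the example genome GACGTACGTT. The sketch below uses illustrative function names and treats the genome as a single input sequence. On this input the two graphs have the same nodes; the edge sets coincide for k = 4, while for k = 3 the overlap-based graph G(2) also links 2-mers that never form a 3-mer in the input. Those are the edges "not reflected in the input" noted earlier, so the equality is best read as holding up to such extra edges.

```python
def kmer_set(seqs, k):
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def G(seqs, k):
    """Nodes are k-mers; edge whenever two nodes overlap by k-1 characters."""
    nodes = kmer_set(seqs, k)
    edges = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, edges


def H(seqs, k):
    """Nodes are (k-1)-mers; one edge per k-mer that occurs in the input."""
    edges = {(km[:-1], km[1:]) for km in kmer_set(seqs, k)}
    nodes = {n for e in edges for n in e}
    return nodes, edges


if __name__ == "__main__":
    genome = ["GACGTACGTT"]
    for k in (4, 3):
        h_nodes, h_edges = H(genome, k)
        g_nodes, g_edges = G(genome, k - 1)
        print(f"k={k}: same nodes={h_nodes == g_nodes}, "
              f"extra edges in G(k-1)={sorted(g_edges - h_edges)}")
```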

Popping bubbles

Error occurs in the middle of a read and is propagated to many k-mers.
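How far a single mid-read error spreads can be seen by comparing the k-mers of a read before and after one substitution. In this sketch the read, the error position and the substituted base are invented for illustration. A substitution far enough from both ends changes exactly k consecutive k-mers, which is what produces the parallel path (the bubble) in the graph.

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def affected_kmers(read, pos, base, k):
    """Pairs (correct k-mer, erroneous k-mer) changed by one substitution."""
    mutated = read[:pos] + base + read[pos + 1:]
    return [(a, b) for a, b in zip(kmers(read, k), kmers(mutated, k)) if a != b]


if __name__ == "__main__":
    read = "GACGTACGTT"
    for k in (3, 4):
        # Hypothetical sequencing error: position 5 read as C instead of A.
        print(k, affected_kmers(read, pos=5, base="C", k=k))
```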

Trimming tips

Error creates an erroneous ending k-mer
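One simple way to trim tips is to look for dead-end chains that hang off a branching node and are at most a few nodes long. The sketch below is an assumption-laden illustration, not the procedure of any particular assembler: the graph is a plain dictionary of successor sets, node names are placeholders, and the length cutoff `max_len` is a hypothetical parameter. Real tools typically also consult coverage before deleting anything.

```python
def remove_tips(graph, max_len):
    """Return a copy of graph without dead-end chains of at most max_len nodes."""
    succ = {u: set(vs) for u, vs in graph.items()}
    for vs in list(succ.values()):
        for v in vs:
            succ.setdefault(v, set())  # make sure sink nodes have an entry
    pred = {u: set() for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)

    dead_ends = [u for u in succ if not succ[u]]
    for end in dead_ends:
        chain, branch = [end], None
        # Walk backwards along the unbranched chain until a branching node.
        while len(chain) <= max_len and len(pred[chain[-1]]) == 1:
            (p,) = pred[chain[-1]]
            if len(succ[p]) > 1:
                branch = p  # the chain hangs off this node
                break
            chain.append(p)
        if branch is None:
            continue  # too long, or not attached to a branch: keep it
        succ[branch].discard(chain[-1])
        for node in chain:
            del succ[node]
    return succ


if __name__ == "__main__":
    # Main path A -> B -> C -> D -> E with a one-node tip X hanging off B.
    graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}}
    print(remove_tips(graph, max_len=2))
```

The cutoff has to stay short relative to genuine paths; with a large `max_len` the true continuation after a branch could itself be mistaken for a tip.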


Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Chimeric extensions

Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence

Repetitive regions

• Satellites, SINEs, LINEs
• Homologous genes
– Ortholog: descended from the same ancestral sequence and separated by speciation
– Paralog: genes created by a duplication event

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly


Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Velvet assembler

• Four stages
– Hashing reads into k-mers
– Constructing the de Bruijn graph (not all 4^k k-mers, only those that exist in input)
– Correct errors
– Resolve repeats
• But what after?
– Paper gives very little information on this...

The Chinese postman problem (CPP)

• Compute a closed tour of minimum length that visits each edge at least once
– Similar to what we want, except we may want to visit edges more than once due to repeats
  • How do we deal with repeats?
– Also, the starting and ending vertices are distinct in genome assembly
  • How can we convert the closed tour to an open one?
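When every k-mer is given with its true multiplicity, the open-tour version of the problem reduces to finding an Eulerian path in the de Bruijn multigraph. The sketch below uses Hierholzer's algorithm for that special case; it is not a Chinese postman solver and is not taken from the slides. An open path exists when one node has one more outgoing than incoming edge (the start) and one node has the reverse (the end); adding a single edge from the end back to the start would turn it into a closed tour. Function names are illustrative.

```python
from collections import Counter, defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm on a directed multigraph given as (u, v) pairs."""
    out = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per node
    for u, v in edges:
        out[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next((n for n in sorted(balance) if balance[n] == 1), min(out))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())    # dead end: node is final in the path
    if len(path) != len(edges) + 1:
        raise ValueError("graph has no Eulerian path")
    return path[::-1]


def assemble(kmers):
    """Spell the sequence along an Eulerian path; one edge per listed k-mer."""
    path = eulerian_path([(km[:-1], km[1:]) for km in kmers])
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    genome = "GACGTACGTT"
    four_mers = [genome[i:i + 4] for i in range(len(genome) - 3)]
    print(assemble(four_mers))
```

The repeated 4-mer ACGT appears twice in the input list here. With real reads its multiplicity is not known exactly and has to be estimated, which is one place where the repeat question raised on the slide becomes hard.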

Your homework

• You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching/Edmonds-Johnson-chinese-postman.pdf

• You are not even required to model genome assembly as CPP

• But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour.

Evaluating assembly

• The Assemblathon 2 study lists 102 measures for evaluating assembly quality.
• Bradnam et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

1. NG50 scaffold length: the largest length x such that the scaffolds of length x or longer together make up at least 50% of the genome size

2. NG50 contig length: the largest length x such that the contigs of length x or longer together make up at least 50% of the genome size

3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding.

4. CEGMA: Number of 458 core genes mapped

5. Fosmid coverage: How many validated fosmid regions were captured in the assembly

6. Fosmid validity: Percentage of the assembly validated by validated fosmid regions

7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold, multiplied by the percentage of uniquely mapping tag pairs that map with the correct distance. Rewards short-range accuracy.

8. and 9. Using local and global alignments of optical map data, how well the assembly is ordered.

10. REAPR summary score: a tool that evaluates the accuracy of an assembly using paired reads
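The NG50 definition in items 1 and 2 can be written as a short function. This is a sketch of the standard calculation, not code from the Assemblathon 2 study; the scaffold lengths and genome size are invented. N50 is the same calculation with the total assembly length used in place of the genome size.

```python
def ng50(lengths, genome_size):
    """Largest x such that pieces of length >= x sum to >= 50% of genome_size."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return None  # the assembly covers less than half of the genome size


def n50(lengths):
    """Same statistic measured against the assembly's own total length."""
    return ng50(lengths, sum(lengths))


if __name__ == "__main__":
    scaffolds = [80, 70, 50, 40, 30, 20]  # made-up lengths
    print(ng50(scaffolds, genome_size=300), n50(scaffolds))
```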