haplotype resolved structural variation assembly with long reads
TRANSCRIPT
Haplotype resolved structural variation assembly with long reads
Mount Sinai: Ali Bashir, Oscar Rodriguez, Matthew Pendleton
Reed College: Anna Ritz, Alex Ledger
Overview
• Background
– Automated Hybrid Assembly
• Phased Diploid Assembly
– Existing Limitation of assembly
– 10X + PacBio
• Issues with calling SVs and comparing datasets
Technologies of Greater Scale Needed to Study Human Complexity
PacBio: 44X, ~4.9 kb avg. length
BioNano: 80X, 278 kb mean span
Schematic for Hybrid Scaffolding (A. Pang, M. Pendleton)
[Figure labels: NGS; De Novo Assemble; Sequence Contigs; In silico digestion; Contig Maps; BioNano Raw Molecules; Genome Maps; Align Genome Maps to Contigs / Contigs to Genome Maps; Aligned Contig-Scaffold Pairs; Scaffold Graph Construction and Layout; Hybrid Scaffolds]
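The "in silico digestion" step in the schematic turns sequence contigs into label maps that can be aligned to genome maps. A minimal sketch of that idea follows, assuming labels sit wherever a nicking-enzyme motif occurs on either strand. The default motif GCTCTTC (the BspQI recognition site) and the function names are illustrative assumptions, not the pipeline's actual code.

```python
def revcomp(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def in_silico_digest(contig: str, motif: str = "GCTCTTC") -> list[int]:
    """Return sorted 0-based start positions of the motif on either strand."""
    contig = contig.upper()
    sites = set()
    # Search the forward motif and its reverse complement on the forward strand.
    for pattern in {motif.upper(), revcomp(motif)}:
        start = contig.find(pattern)
        while start != -1:
            sites.add(start)
            start = contig.find(pattern, start + 1)
    return sorted(sites)


def label_intervals(sites: list[int]) -> list[int]:
    """Distances between consecutive labels, the pattern compared against genome maps."""
    return [b - a for a, b in zip(sites, sites[1:])]


# Toy contig: forward site, reverse-complement site, forward site.
contig = "AA" + "GCTCTTC" + "AAAAA" + "GAAGAGC" + "TTTTT" + "GCTCTTC" + "AA"
sites = in_silico_digest(contig)
intervals = label_intervals(sites)
print(sites, intervals)
```

Real label spacing is on the order of kilobases, so the toy sequence only illustrates the mechanics.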
Scaffolding is symmetric
Inconsistencies can be flagged to break or eliminate Contigs or Genome Maps
Hybrid Assembly Between BioNano & PacBio Supercontigs
- As anticipated, heterochromatic regions are most difficult to span
- Completes as many as 28 gaps in hg38
- Other methods and data types can be used to further resolve additional gaps in the assembly
Orthogonal Error Profiles enable dramatic improvements
Complex Structural Rearrangements can be Validated Relative to Current References
[Figure labels: hg19 chr1; Superscaffold; hg38 chr1; hg19 gap (x2); Superscaffold; Molecule pileup]
Still resolves at least 28 gaps in the hg38 assembly, for >400 kb in predicted gap intervals
Many large structural variants are predicted like the one above. Are they real?
Technology Continues to March Forward: GIAB AJ and 1000 Genomes Trios
• GIAB
– Long-read sequencing of Trios
• 20-30X parents
• 40-70X children
– 10X Chromium Data
– Ashkenazi Jewish
• 1kG
– Long-read sequencing
• ~20-25X parents
• ~45-50X parents
– 10X Chromium Data
• Chinese, Puerto Rican, and Yoruban ancestry
SubRL N50 = 11,087 bp
Total # Bases = 220 Gb
# of Reads = 27.4 M reads
Current Genome Assembly Results on GIAB and 1kG Trios

Sample    Contigs  Average  N50      Max      Total Size
HG002     13231    230 kb   4.1 Mb   31.6 Mb  3.04 Gb
HG003     17873    172 kb   4.6 Mb   21.5 Mb  3.08 Gb
HG004     16487    185 kb   5.3 Mb   22.6 Mb  3.05 Gb
HG00512   23146    117 kb   369 kb   2.6 Mb   2.72 Gb
HG00513   18443    151 kb   401 kb   2.4 Mb   2.78 Gb
HG00514   11517    264 kb   7.2 Mb   61.1 Mb  3.04 Gb
HG00731   20811    132 kb   451 kb   3.8 Mb   2.74 Gb
HG00732   13672    214 kb   1.3 Mb   10.9 Mb  2.93 Gb
HG00733   11143    281 kb   11.4 Mb  57.4 Mb  3.14 Gb
NA19238   56480    39 kb    70 kb    645 kb   2.20 Gb
NA19239   73478    23 kb    40 kb    1.01 Mb  1.71 Gb
NA19240   15245    203 kb   3.8 Mb   20.1 Mb  3.09 Gb
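The table reports both an average contig length and an N50, and the two can differ by an order of magnitude because a few long contigs carry most of the bases. A small sketch of the usual N50 definition, using made-up contig lengths rather than any of the assemblies above:

```python
def n50(lengths: list[int]) -> int:
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths (arbitrary units), not taken from the slides.
toy_lengths = [100, 80, 50, 30, 20, 10, 10]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
print(toy_n50, round(toy_mean, 1))
```

Here the two longest contigs already cover more than half of the 300 total bases, so the N50 is 80 while the mean is under 43.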
Hybrid Assembly with multiple Enzymes dramatically improves contiguity and coverage of the genome

Sample                Enzyme         Contigs      N50                 Total Size
HG00733 (Genome Map)  BspQI          2185         4.2 Mb              5.6 Gb
HG00733 (Genome Map)  BssSI          5038         1.39 Mb             5.1 Gb
HG00733 (Hybrid)      BspQI          133 / 10637  56.4 Mb / 52.2 Mb   2.8 Gb
HG00733 (Hybrid)      BssSI          234 / 10749  40.26 Mb / 29.7 Mb  2.8 / 3.2 Gb
HG00733 (Hybrid)      BspQI + BssSI  104 / 10590  72.6 Mb / 61.5 Mb   2.9 Gb / 3.2 Gb

Joyce Lee (BioNano Genomics)
Previous Work in NA12878: Hybrid Assembly / Variation Analysis Pipeline
Pendleton et al., Nature Methods 2015
M. Pendleton
A. Pang
J. Chin
Summary Of 1kG/GIAB PacBio SV Calls

          Insertion                            Deletion
Sample    # Calls  # TR   # Alu  # L1  # SVA   # Calls  # TR*  # Alu  # L1  # SVA   Complex
HG002     13471    5573   325    68    7       9639     6880   798    201   22      2493
HG003*    12947    5133   411    74    5       9692     6776   411    74    5       2580
HG004     12769    5066   475    160   96      9509     7233   971    282   33      2599
HG00512   9830     4164   366    75    67      7672     5781   768    275   23      2157
HG00513   9761     4175   351    86    79      7791     5936   770    258   27      2314
HG00514   1285     4866   212    42    3       9636     6770   767    222   26      2635
HG00731   9874     4322   357    76    75      7678     5797   790    256   17      2174
HG00732   11059    4884   400    85    85      8227     6274   813    271   24      2351
HG00733*  11769    5365   330    45    4       8848     6179   743    191   25      2313
NA19238   7512     2999   280    72    59      6320     4765   628    237   12      1910
NA19239   5909     2357   199    46    50      5061     3809   528    161   21      1468
NA19240*  13285    5185   345    78    7       9791     7596   911    275   23      2600

* Ran through a streamlined version of the SV pipeline; may not be comparable
Other Notes:
- MEI calls use conservative parameters (likely undercalling insertions)
- "Other" calls contain some improperly flagged insertions/deletions as well as complex events and inversions
Venn Diagrams Between all Trios: YRI/PUR/CHS Deletions
Not seeing all proband calls (blue) in the parents
- Under-calling of hets?
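The Venn comparison above amounts to asking, for each proband call, whether a matching call exists in the mother, the father, both, or neither. A rough sketch of that bookkeeping, assuming calls have already been reduced to comparable keys (the string identifiers here are hypothetical placeholders for whatever matching criterion is used):

```python
def proband_call_sharing(child: set[str], mother: set[str], father: set[str]) -> dict[str, int]:
    """Partition proband calls by which parents carry a matching call."""
    return {
        "both_parents": len(child & mother & father),
        "mother_only": len((child & mother) - father),
        "father_only": len((child & father) - mother),
        "neither_parent": len(child - mother - father),
    }


# Hypothetical deletion calls for one trio.
child = {"del1", "del2", "del3", "del4", "del5"}
mother = {"del1", "del2", "del6"}
father = {"del2", "del3"}

sharing = proband_call_sharing(child, mother, father)
print(sharing)
```

A large "neither_parent" count for inherited variant classes is the pattern the slide flags as a possible sign of under-called heterozygous events in the parents.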
Assembly Sequences Often Mix Haplotype Information
http://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating
Bubbles represent divergent alleles
Blue = Maternal, Yellow = Paternal
A conservative assembly will NOT try to link across the black bubbles without some sort of scaffolding information
Assemblies often take a single path in this graph; this could mix maternal and paternal alleles
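To make the single-path problem concrete, here is a toy sketch. Each bubble is modeled as a (maternal, paternal) allele pair, a collapsed assembly is a list with one chosen allele per bubble, and we count how often the path changes parent of origin. The data structure and the example alleles are illustrative assumptions, not how any particular assembler represents its graph.

```python
# Each bubble holds (maternal allele, paternal allele); values are made up.
bubbles = [("A", "G"), ("CT", "C"), ("T", "TAA"), ("G", "C")]


def origins(path: list[str], bubbles: list[tuple[str, str]]) -> list[str]:
    """Label each chosen allele as maternal ('M') or paternal ('P')."""
    return ["M" if allele == maternal else "P"
            for allele, (maternal, _paternal) in zip(path, bubbles)]


def count_switches(labels: list[str]) -> int:
    """Number of adjacent bubbles where the path changes parent of origin."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


# A collapsed path that picks alleles without regard to phase.
mixed_path = ["A", "C", "T", "C"]
# A fully phased maternal haplotype for comparison.
maternal_path = [maternal for maternal, _ in bubbles]

mixed_labels = origins(mixed_path, bubbles)
print(mixed_labels, count_switches(mixed_labels))
print(count_switches(origins(maternal_path, bubbles)))
```

The mixed path is a valid walk through the graph, yet it corresponds to neither parental haplotype.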
Revised Call Sets
* PacBio calls taken from "Sniffles" – developed at the Schatz Lab (Johns Hopkins)
** 10X calls provided using Long Ranger from 10X Genomics
Hybrid Haplotype Separated Callsets Add Many SVs
Example Workarounds
- Breakpoint calls (not assembled sequence): If L1 is within 10% of L2 size and L1 and L2 are both in R we call it a "match"
- Assembly Calls: Perform MSA to identify truly "homozygous" sequences
[Figure labels: L1; L2; Tandem Repeat]
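The breakpoint-call workaround can be written down as a small predicate. This sketch makes two assumptions that the slide does not spell out: "within 10% of L2 size" is read as a length difference of at most 10% of L2, and R is treated as a single genomic interval (for example a tandem repeat). The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    pos: int      # breakpoint position
    length: int   # SV size in bp


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int    # inclusive
    end: int      # exclusive

    def contains(self, call: SVCall) -> bool:
        return call.chrom == self.chrom and self.start <= call.pos < self.end


def is_match(l1: SVCall, l2: SVCall, region: Region, tolerance: float = 0.10) -> bool:
    """Match if both calls fall in the region and L1's size is within tolerance of L2's."""
    same_region = region.contains(l1) and region.contains(l2)
    similar_size = abs(l1.length - l2.length) <= tolerance * l2.length
    return same_region and similar_size


repeat = Region("chr1", 1000, 2000)
l2 = SVCall("chr1", 1800, 100)
print(is_match(SVCall("chr1", 1100, 95), l2, repeat))   # similar size, same region
print(is_match(SVCall("chr1", 1200, 80), l2, repeat))   # size differs by 20%
print(is_match(SVCall("chr1", 2500, 100), l2, repeat))  # outside the region
```

Note that the rule is asymmetric as written, since the tolerance is scaled by L2; swapping the arguments can change the answer near the threshold.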
Assembly Provides Detailed Indications Of Quality
• Provides sequence of breakpoint
• Potentially provides co-located events
• Potentially provides information on accuracy of the assembly in that region
[Figure labels: Ctg 33; Ctg 33 mapped to Chr1; Ctg 120; Mis-assembly point]
Slide from Jason Chin at the SMRT Informatics Workshop
Assemblies have additional information
Ongoing/Future Work
• GIAB is working on how to integrate assembly calls robustly
– Surprisingly poor overlap!
• Typing calls
• Integrating parental information
• Providing full length haplotype assemblies for genomes
– Can be done with trio phasing
– But, it's now possible with new technologies!
• Hi-C
• StrandSeq
• Integrating Graphs from 10x and PacBio
• Pulling in very large SVs with BioNano
• Moving into full indel resolution and tools for comparing datasets rapidly to alt haps
Acknowledgements
• Mount Sinai – Eric Schadt – Matt Pendleton – Ajay Ummat – Oscar Franzen – Gintaras Deikus – Robert Sebra – Oscar Rodriguez*
• Reed College – Anna Ritz – Alex Ledger
• UCSF – Pui Kwok
• PacBio – Jason Chin
• 1000 Genomes SV Working Group
• UW – Mark Chaisson
• EMBL – Jan Korbel – Markus H.-Y. Fritz – Tobias Rausch
• BioNano Genomics – Han Cao – Alex Hastie – Heng Dai – Andy Pang – Joyce Lee
• 10X Genomics – Patrick Marks – Deanna Church – Mike Schnall-Levin – Sofia Kyriazopoulou-Panagiotopoulou