

De Novo Human Genome Map Assembly and Depth Analysis

Structural Variation Discovery by De Novo Genome Mapping of the Human Genome at the Single Molecule Level Using NanoChannel Linearization

A Hastie, E Lam, T Chan, M Requa, T Anantharaman, M Austin, F Trintchouk, M Saghbini, YYY Lai², ACY Mak², P-Y Kwok², M Xiao³, H Cao

BioNano Genomics, San Diego, California, USA; ²UCSF, San Francisco, CA, USA; ³Drexel University, Philadelphia, PA, USA

Abstract

Because of the remaining limitations of DNA sequencing and analysis technologies, even ten years after the completion of the Human Genome Project, there remain about 400 gaps in the human reference sequence assembly, hundreds of millions of unassembled bases in those regions, and no effective tools to comprehensively characterize the structural variation in an individual’s genome. Although the ungapped reference sequence is of extremely high quality, it is not feasible to create similarly high-quality assemblies of individuals to detect and interpret the many types of structural variation that are refractory to high-throughput or short-read technologies. We present a single-molecule genome analysis system (Irys) based on NanoChannel Array technology that linearizes extremely long DNA molecules for direct observation. This high-throughput platform automates the imaging of single molecules of genomic DNA hundreds of kilobases in size to measure sufficient sequence uniqueness for unambiguous assembly of complex genomes. High-resolution genome maps assembled de novo preserve the long-range structural information necessary for structural variation detection and assembly applications. We have used Irys genome mapping for the assembly and characterization of two human genomes. From these assemblies, we have spanned many of the remaining gaps, identified known and novel structural variants, and phased some haplotype blocks, including in the MHC region. We also resolve and measure long tandem repeat regions that are likely impossible to assemble by other methods.

Background

Generating high-quality finished genomes, with accurate identification of structural variation and high completion (minimal gaps), remains challenging using short-read sequencing technologies alone. The Irys platform provides direct visualization of long DNA molecules in their native state, bypassing the statistical inference needed to align paired-end reads with an uncertain insert size distribution. These long labeled molecules are assembled de novo into physical maps spanning the whole genome. The resulting order and orientation of sequence elements in the map can be used for anchoring NGS contigs and for structural variation detection.

Methods

(1) DNA is labeled with IrysPrep™ reagents by incorporation of fluorophore-labeled nucleotides at a specific sequence motif throughout the genome.

(2) The labeled genomic DNA is then linearized in the IrysChip™ nanochannels, and single molecules are imaged by Irys.

(3) Irys performs automated data collection and image processing.

(4) Instrument software detects the molecules and their labels in the images; the label pattern along each molecule gives it a uniquely identifiable signature.

(5) Molecules are assembled into genome maps, and downstream analysis of the maps is performed with the IrysView™ software suite.
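The labeling step can be illustrated in silico. The sketch below is not the IrysPrep or IrysView implementation; it only shows how a sequence motif, found on either strand, turns a sequence into the ordered pattern of label positions and inter-label distances that a genome map records. The function names and the motif string are illustrative assumptions, since the poster does not name the recognition motif.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def label_positions(sequence: str, motif: str) -> list[int]:
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    motif = motif.upper()
    # A site on the reverse strand appears as the reverse complement
    # when scanning the forward strand.
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [i for i in range(len(sequence) - k + 1) if sequence[i:i + k] in targets]


def inter_label_distances(positions: list[int]) -> list[int]:
    """Distances between consecutive label sites (the map's signature pattern)."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence with two forward-strand sites and one reverse-strand site.
# The motif is a placeholder chosen for illustration only.
MOTIF = "GCTCTTC"
toy_sequence = "AA" + MOTIF + "AAAAA" + reverse_complement(MOTIF) + "TTTT" + MOTIF + "AA"

sites = label_positions(toy_sequence, MOTIF)
print(sites)                         # [2, 14, 25]
print(inter_label_distances(sites))  # [12, 11]
```

On real data the distances are hundreds to tens of thousands of base pairs and are measured optically, so any comparison between maps has to tolerate sizing error; that tolerance is not modeled here.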

Conclusions

BioNano Genomics Irys enables visualization of extremely long, single DNA molecules for the direct characterization of complex structural events in the genome. This system permits rapid, accurate, genome-wide de novo assembly and detection of structural variants that typically confound short-read genome assembly and comparative genomic analysis. Here we demonstrate the de novo human genome map assembly capabilities of the IrysChip nanochannel arrays and the Irys imaging system to characterize genome-wide structural variation in the human genome. By comparing de novo assemblies of a father-daughter pair, we show that genome mapping detects large structural variants with very good cross-validation. We are also able to map regions of the genome that are refractory to assembly by other methods (remaining gaps in the human reference genome).


1) IrysPrep reagents label DNA at specific sequence motifs

2) IrysChip linearizes DNA in NanoChannels

3) Irys automates imaging of single molecules in NanoChannels

4) Molecules and labels detected in images by instrument software

5) IrysView software assembles genome maps

[Figure: nick-labeling schematic (nickase recognition motif, nick site, polymerase, displaced strand) and DNA conformation under increasing confinement: free DNA in solution (Gaussian coil), DNA in a microchannel (partially elongated), DNA in a NanoChannel (linearized). Axis: position (kb).]

Irys Genome Maps Are More Complete Than NGS or Third-Gen

A Streptomyces genome was sequenced and assembled by a combination of different sequencing platforms (green maps). In contrast to these fragmented assemblies, one intact contiguous genome map (blue) was assembled de novo by the Irys system. The genome map anchored 19 of the 3rd-Gen Long Reads contigs.

Short-Read NGS Only (9.08 Mb, 124 contigs, 92 kb N50)

NGS + Cosmids (11.38 Mb, 97 contigs, 154 kb N50)

3rd-Gen Long Reads (11.63 Mb, 20 contigs, 918 kb N50)

BioNano (11.87 Mb, 1 contig)
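The N50 values quoted for each assembly follow the standard definition: the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative and are not taken from the Streptomyces assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or map) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total is 1000, and the two longest contigs (400 + 300)
# are the first to reach half of it.
print(n50([100, 200, 300, 400]))  # 300
```

For an assembly made of a single contig, as in the BioNano row, the N50 is simply the length of that contig.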

BioNano Genome Map Anchors 3rd Gen Contigs

Human MHC Structural Variation and Haplotype Phasing

Genome-Wide Structural Variation Analysis

Cross-Validation by Pedigree

[Figure: Single-Molecule Depth Plots and Genome Map Coverage of hg19. Axis: chromosome (1 to 22, X, Y). Legend: assembled maps, reference gaps, map gaps.]

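[Figure: MHC region genome maps. Labels: 70 kb insertion; RCCX; HLA-D; repeat expansion of a 4.6 kb unit, 14 units (7 in hg19).]

[Table: genome map statistics. Row labels: Coverage; Maps aligning to hg19; Maps not aligning to hg19; Map N50. Values in extraction order: 93.3%, 99.2%, 150 Mb, 1.37 Mb, 507, 201, 196.]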

Comparison of SV Size Distribution

[Figure: histograms of SV count by size. x-axis: size in bp, binned from <-200,000 to >200,000; y-axis: count. Daughter (NA12878): 167 insertions, 224 deletions. Father (NA12891): 436 insertions, 264 deletions.]

[Figure: single-molecule views. Labels: 65 kb tandem amplification; single molecules; segmental duplication; 25 kb, 25 kb.]

Long single-molecule sequence motif maps (>150 kb) from cell lines of two individuals were aligned to the human reference. As the depth plot shows, coverage is broad and genome-wide, with interspersed deviations indicating amplifications and deletions relative to one another and to the reference. The single-molecule maps were then assembled de novo into consensus genome maps, which cover 93% of the non-N bases of the human reference when aligned to it.
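Coverage of the non-N bases can be computed from aligned map intervals and the reference's N-gap intervals. The sketch below is one way to do that arithmetic under stated assumptions: intervals are half-open (start, end) coordinates on a single reference sequence, all lie within the reference, and the function names and toy numbers are hypothetical. It is not the procedure used to obtain the figure reported above.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals; return sorted tuples."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


def total_length(intervals):
    """Number of bases covered by the union of the intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def overlap_length(first, second):
    """Number of bases shared by two interval sets."""
    a, b = merge_intervals(first), merge_intervals(second)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def non_n_coverage(ref_length, n_gaps, aligned):
    """Fraction of non-N reference bases covered by aligned maps."""
    non_n_bases = ref_length - total_length(n_gaps)
    covered = total_length(aligned) - overlap_length(aligned, n_gaps)
    return covered / non_n_bases


# Toy reference of 1000 bp with one 100 bp N gap.
fraction = non_n_coverage(
    ref_length=1000,
    n_gaps=[(400, 500)],
    aligned=[(0, 450), (300, 600), (800, 900)],
)
print(round(fraction, 4))  # 0.6667
```

Bases where a map spans an N gap are excluded from the numerator as well as the denominator, so spanning a gap does not inflate the coverage figure.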

Structural variation was detected across a broad range of sizes, including variants refractory to many high-throughput and short-read technologies. Insertions are called from the presence of novel label sites and the expansion of intervals between adjacent labels. Deletions appear as missing label sites or as narrowed inter-label intervals. Maps also identify variation in difficult-to-sequence, highly repetitive regions, such as the immune-related MHC and regions near telomeres and centromeres. Concordance between father and daughter supports the validity of the detected SVs.
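The interval-based part of this calling logic can be sketched as a comparison of inter-label distances between a genome map and the reference. The code below is a simplified illustration, not the IrysView algorithm: it assumes labels have already been matched between map and reference, it ignores novel or missing label sites, and the `min_size` threshold and the sign convention (negative sizes for deletions) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    kind: str       # "insertion" or "deletion"
    ref_start: int  # reference position of the left matched label (bp)
    ref_end: int    # reference position of the right matched label (bp)
    size: int       # map interval minus reference interval (bp)


def call_size_variants(matched_labels, min_size=5000):
    """Call insertions and deletions from matched label pairs.

    matched_labels: list of (ref_pos, map_pos) tuples for labels aligned
    between the reference and the genome map, sorted along the reference.
    min_size: hypothetical minimum size difference (bp) to report a call.
    """
    calls = []
    for (ref_a, map_a), (ref_b, map_b) in zip(matched_labels, matched_labels[1:]):
        # Positive delta: the map interval is longer than the reference interval.
        delta = (map_b - map_a) - (ref_b - ref_a)
        if delta >= min_size:
            calls.append(SVCall("insertion", ref_a, ref_b, delta))
        elif delta <= -min_size:
            calls.append(SVCall("deletion", ref_a, ref_b, delta))
    return calls


# Toy alignment: the second interval is 70 kb longer in the map,
# and the fourth interval is 25 kb shorter.
matched = [(0, 0), (10000, 10000), (30000, 100000), (50000, 120000), (90000, 135000)]
for call in call_size_variants(matched):
    print(call)
```

The same calls made independently in a parent and a child could then be compared by position and size to estimate concordance, which is the cross-validation idea described above.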

Spanning Gaps in Human Reference Sequence

Mapping into Centromere Gaps

[Figure: genome maps at reference gaps. Labels: Chr1 gap sizing; gap; 100 kb; Chr15 centromere; Chr17 centromere.]

Long after the completion of the Human Genome Project, many gaps remain in the assembly. In particular, the regions around centromeres are very hard to assemble accurately because they contain large repeated blocks of DNA. We have assembled much of this area with genome maps that extend into gaps and centromere sequence.

The difficult-to-assemble MHC region has been assembled by genome mapping. Two discrete haplotypes were resolved, phasing the variants between them. A 70 kb insertion was detected in one haplotype, in addition to single motif site differences and deletions.

[Figure labels: NA12878, NA12891]