Diploid Genome Assembly and Comprehensive Haplotype Sequence Reconstruction

Jason Chin, Paul Peluso, David Rank, Fritz Sedlazeck, Maria Nattestad, Michael Schatz, Greg Concepcion, Alicia Clum, Kerrie Barry, Alex Copeland, Ronan O’Malley

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.

Acknowledgments

- All PacBio colleagues
- Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk Institute)
- Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute)
- Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL)
- Open source toolsets
  - Daligner (https://dazzlerblog.wordpress.com), Gene Myers
  - BLASR (https://github.com/PacificBiosciences/blasr), Mark Chaisson
  - Python, NetworkX for rapid algorithm prototyping
  - Gephi, Graphviz for graph visualization
  - FALCON (https://github.com/PacificBiosciences/falcon)


SOLVING THE DIPLOID ASSEMBLY PROBLEM

- Falcon (a polyploid-aware assembler): generating the contigs through the bubbles
- Falcon Unzip: identifying smaller variants and using them to separate the haplotypes

• Bubbles = big variants between the haplotypes

• Collapsed Path = smaller variants between the haplotypes


WHY DO WE SEE BUBBLES?

[Figure: genome sequences of haplotype 1 and haplotype 2, with SNPs and SVs marked, and the corresponding assembly graph]

In most OLC assembler designs, the overlapper does not detect differences at the SNP level, but structural variations are naturally segregated.
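A toy illustration of the point above, not FALCON's overlapper: an overlap test has to tolerate read errors, so a single SNP leaves two haplotype sequences above the identity cutoff, while a large insertion pushes them below it. The cutoff `MIN_IDENTITY`, the sequences, and the use of Python's `difflib` as a stand-in aligner are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff: loose enough to tolerate noisy reads.
MIN_IDENTITY = 0.90

def identity(a, b):
    """Similarity ratio in [0, 1] between two sequences (difflib stand-in)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def overlaps(a, b):
    """True if the toy overlapper would treat a and b as the same locus."""
    return identity(a, b) >= MIN_IDENTITY

# 60 bp of made-up haplotype 1 sequence.
hap1 = "ATGGCGTACCTGAAGTCGATCCGTTAGCAAGGCTTCAGTACGGATTACCGTGAACTGCTA"

# Haplotype 2 with one SNP (G -> A at index 30).
snp_variant = hap1[:30] + "A" + hap1[31:]

# Haplotype 2 with a 30 bp insertion (a toy structural variation).
sv_variant = hap1[:30] + "CATCATCATCATCATCATCATCATCATCAT" + hap1[30:]

print("SNP variant overlaps:", overlaps(hap1, snp_variant))
print("SV variant overlaps: ", overlaps(hap1, sv_variant))
```

In this sketch the SNP pair collapses into one path and the SV pair does not, which is the behaviour that produces bubbles in the assembly graph.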


THE FALCON UNZIP PROCESS

[Figure: FALCON produces a primary contig carrying SNPs and SVs, plus associate contig 1 and associate contig 2 (alternative alleles). The graph is then augmented with the haplotype information of each read. FALCON-Unzip produces an updated primary contig + “associate haplotigs”.]


PHASING READS INTO HAPLOTYPE GROUPS

- Align SMRT reads to the initial primary contig
- Identify het-SNPs
- Phase het-SNPs
- Group reads with phased SNPs
- Reconstruct haplotypes (haplotype 0 and haplotype 1)

Longer reads span more het-SNPs: an 8% to 15% sequence error rate is not an issue for phasing, given enough long-read coverage.
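The read-grouping step can be sketched as a majority vote over the phased het-SNPs that a read covers. This is a minimal illustration under stated assumptions, not the FALCON-Unzip implementation: `assign_read`, the SNP positions, and the alleles are all made up for the example.

```python
def assign_read(read_calls, phased_snps):
    """Assign a read to haplotype 0 or 1 by majority vote.

    read_calls:  {position: base observed in the read}
    phased_snps: {position: (haplotype-0 allele, haplotype-1 allele)}
    Returns 0, 1, or None when the vote is tied or uninformative.
    """
    votes = [0, 0]
    for pos, base in read_calls.items():
        alleles = phased_snps.get(pos)
        if alleles is None:
            continue  # not a phased het-SNP
        if base == alleles[0]:
            votes[0] += 1
        elif base == alleles[1]:
            votes[1] += 1
        # a base matching neither allele is treated as a sequencing error
    if votes[0] == votes[1]:
        return None
    return 0 if votes[0] > votes[1] else 1

# Hypothetical phased het-SNPs on a primary contig.
phased_snps = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A"), 900: ("T", "C")}

# A long read with one miscalled SNP (position 400) is still placed correctly.
long_read = {100: "A", 250: "C", 400: "A", 900: "T"}
# A short read covering two SNPs, one of them miscalled, cannot be placed.
short_read = {100: "A", 250: "T"}

print(assign_read(long_read, phased_snps))
print(assign_read(short_read, phased_snps))
```

The two example reads show why read length matters: the more het-SNPs a read spans, the less a single erroneous call can change its assignment.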


QUESTION: HOW TO RESOLVE STRUCTURAL VARIATIONS & HET-SNP PHASING AT ONCE?

Information source: Structural variations (3 kb – 100 kb)
  Pros & cons:
    ✓ Overlap-layout process catches SV haplotypes
    ✗ Collapsed paths when there is no SV
  Assembly graph features:
    ✓ Nearby SVs may be phased automatically
    ✗ Haplotype-fused paths

Information source: het-SNPs (300 b – 10 kb)
  Pros & cons:
    ✓ Easy to group SNPs/reads into different haplotypes
    ✗ No phasing information associated with SVs
  Assembly graph features:
    ✓ Haplotype-specific paths
    ✗ More fragmented contigs


MERGE HAPLOTYPE INFORMATION AND “UNZIP”

Tiling path of haplotype 0

Tiling path of haplotype 1

Remove edges connecting different haplotypes
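The edge-removal step can be sketched on a toy graph whose nodes are reads labeled with a haplotype phase. This is an assumption-laden illustration: `unzip_edges`, the node names, and the rule that unphased nodes keep all their edges are hypothetical, and the real process also takes phasing blocks into account.

```python
def unzip_edges(edges, phase):
    """Drop edges whose two endpoints are phased to different haplotypes.

    edges: list of (u, v) node pairs
    phase: {node: 0 or 1}; nodes missing from the dict are unphased
    """
    kept = []
    for u, v in edges:
        pu, pv = phase.get(u), phase.get(v)
        if pu is not None and pv is not None and pu != pv:
            continue  # edge connects haplotype 0 to haplotype 1
        kept.append((u, v))
    return kept

# Toy graph: "s" and "t" are unphased, r1/r2 are haplotype 0, r3/r4 haplotype 1.
phase = {"r1": 0, "r2": 0, "r3": 1, "r4": 1}
edges = [
    ("s", "r1"), ("s", "r3"),
    ("r1", "r2"), ("r3", "r4"),
    ("r1", "r4"), ("r3", "r2"),  # cross-haplotype edges
    ("r2", "t"), ("r4", "t"),
]

for edge in unzip_edges(edges, phase):
    print(edge)
```

After filtering, the graph keeps two separate tiling paths between `s` and `t`, one per haplotype.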


PUT EVERYTHING TOGETHER

[Figure: the “Falcon Unzip Process” on a ~ 4.80 Mb graph. Labels: 4 major haplotype-phased blocks determined by het-SNPs; un-phased region.]

- Add missing haplotype-specific nodes & edges
- Remove edges that connect different haplotypes

The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs.


POLISHING: ALLELE-SPECIFIC ALIGNMENT FOR FINAL CONSENSUS

“Augmented alignment”: each read carries extra attributes (e.g., contig identifier, phasing block, haplotype phase), and an aligner uses that information to place the read on a specific reference sequence or region.

[Figure: reads from the same region but different haplotypes; one group is aligned to the “red” haplotig and the other to the “blue” haplotig]
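A minimal sketch of the routing idea behind augmented alignment, assuming each read already carries its contig identifier, phasing block, and haplotype phase. The `Read` tuple, `alignment_target`, and the sequence names are hypothetical; no existing aligner interface is implied.

```python
from collections import namedtuple

# Hypothetical per-read attributes produced by the phasing step.
Read = namedtuple("Read", ["name", "contig", "block", "phase"])

def alignment_target(read, haplotig_index):
    """Pick the reference sequence a read should be aligned to.

    haplotig_index maps (contig, block, phase) to a haplotig name.
    Unphased reads, and reads with no matching haplotig, fall back
    to the primary contig.
    """
    if read.phase is None:
        return read.contig
    key = (read.contig, read.block, read.phase)
    return haplotig_index.get(key, read.contig)

# Made-up index: one phasing block with a "blue" and a "red" haplotig.
haplotig_index = {
    ("contig_1", 0, 0): "contig_1_block0_blue",
    ("contig_1", 0, 1): "contig_1_block0_red",
}

reads = [
    Read("read_a", "contig_1", 0, 0),
    Read("read_b", "contig_1", 0, 1),
    Read("read_c", "contig_1", None, None),
]
for r in reads:
    print(r.name, "->", alignment_target(r, haplotig_index))
```

Restricting each read to one target in this way is what would let the consensus step avoid aligning every read against every haplotig.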


CONSTRUCT ARABIDOPSIS THALIANA COL-0 X CVI-0 DIPLOID F1 LINE

Image credits: Pajoro, et al, Trends in plant science 21.1 (2016): 6-8.

Col-0 Cvi-0

Col-0 x Cvi-0

• Two inbred lines sequenced in 2013 (P4 chemistry), assembled as haploid genomes

• F1 line constructed and sequenced in 2015 (P6 chemistry), assembled with FALCON and FALCON-Unzip


DIPLOID ASSEMBLY PRIMARY CONTIGS AND HAPLOTIGS

[Figure: the Col-0 chromosome and the Cvi-0 chromosome, with the primary contig and haplotigs of the Col-0 x Cvi-0 assembly]

• Primary contigs ~ 1n representation of the genome
• Haplotigs ~ phased sequences from regions where the homologous chromosomes are distinguishable


ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY STATISTICS

Strain                  Inbred Col-0   Inbred Cvi-0   Col-0 x Cvi-0 F1   Col-0 x Cvi-0 F1   Col-0 x Cvi-0 F1
Assembler               CA/HGAP        CA/HGAP        FALCON             FALCON-Unzip       FALCON-Unzip
Contig type                                           primary contigs    primary contigs    haplotigs
Assembly size (Mb)      126            119            143                140                105
# contigs               1325           194            426                172                248
N50 size (Mb)           6.21           4.79           7.92               7.96               6.92
Max contig size (Mb)    10.25          11.25          13.39              13.32              11.65

[Bar charts: assembly size (Mb) and N50 size (Mb) per assembly]

Assembly                Assembly size (Mb)   N50 size (Mb)
Inbred Col-0            126                  6.21
Inbred Cvi-0            119                  4.79
F1 FALCON p-contigs     143                  7.92
F1 FALCON a-contigs     57                   0.146
F1 Unzip p-contigs      140                  7.96
F1 Unzip haplotigs      105                  6.92
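For readers unfamiliar with the N50 statistic reported on this slide, the usual definition is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly size. A minimal sketch follows; the contig lengths in the example are made up and are not taken from the assemblies above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one contig length")

# Made-up contig lengths in Mb: total 20, so N50 is reached at 6 + 5 = 11.
example = [2, 3, 4, 5, 6]
print(n50(example))
```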


EVALUATE THE DIPLOID ASSEMBLY RESULT

[Figure: the primary contig and haplotigs of the Col-0 x Cvi-0 assembly aligned to haploid-like contigs in the inbred-line Col-0 and Cvi-0 assemblies. Alignments are labeled either “few or no variations” or “many variations”.]

By aligning the haplotigs to the parental genome assemblies, we can evaluate the haplotigs’ quality, e.g., haplotyping accuracy and CDS prediction consistency.


COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES

- We call SNPs and SVs against the parental inbred assemblies for all primary contigs and haplotigs.

- Most haplotigs can be fully assigned to one of the parental haplotypes.
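One way to picture the assignment step: a haplotig is given to the parent against which it shows far fewer variants. The sketch below is hypothetical throughout; `assign_parent`, the `min_ratio` cutoff, and the variant counts are illustrative and are not the criteria or data used for the results presented here.

```python
def assign_parent(variant_counts, min_ratio=5.0):
    """Assign a haplotig to the parent with far fewer variants.

    variant_counts: {parent name: number of SNPs + SVs called against it},
                    with exactly two parents.
    min_ratio:      hypothetical cutoff; the other parent must show at
                    least this many times more variants.
    Returns the parent name, or None if the haplotig looks mixed.
    """
    ranked = sorted(variant_counts.items(), key=lambda kv: kv[1])
    (best, best_n), (_, other_n) = ranked[:2]
    if other_n >= min_ratio * max(best_n, 1):
        return best
    return None

# Made-up variant counts for two haplotigs.
print(assign_parent({"Col-0": 3, "Cvi-0": 412}))
print(assign_parent({"Col-0": 120, "Cvi-0": 150}))
```

A haplotig that returns `None` in this sketch would be a candidate phase switch, since it matches neither parent cleanly.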


[Figure: two panels, primary contigs and haplotigs, each compared against the Col-0 and Cvi-0 assemblies]


ANNOTATION COMPARISON

[Figure panels: Predicted Coding Sequences; TAIR 10 Genome & Predicted Transcripts; Homopolymer Length Distributions]

Compare de novo gene prediction (with AUGUSTUS (Stanke 2003)) between different assemblies

Assemblies                  TAIR10           Col-0            Cvi-0
Number of predicted CDS     27,946           30,006           27,393

100% indel-free full-length overlaps:

Assembly (predicted CDS)    TAIR10           Col-0            Cvi-0
Col-0 (30,006)              25,966 (92.9%)
Col-0 x Cvi-0 (56,775)      25,865 (92.5%)   26,537 (88.4%)   27,370 (99.9%)


OTHER SMALLER AND LARGER DIPLOID GENOMES

                       Clavicorona pyxidata   Cabernet Sauvignon+*   Human*
                       (Coral Fungus)
Haploid genome size    ~ 44 Mb                ~ 500 Mb               ~ 3 Gb

FALCON-Unzip results:
Primary contig size    41.9 Mb                591.0 Mb               2.76 Gb
Primary contig N50     1.5 Mb                 2.2 Mb                 22.9 Mb
Haplotig size          25.5 Mb                372.2 Mb               2.0 Gb
Haplotig N50           872 kb                 767 kb                 330 kb

+Led by Cantu lab, UC Davis and Cramer lab, UN Reno

*Preliminary results. Fast file system and efficient computational infrastructure are currently needed for large genomes.


SUMMARY

- Single data type for routine diploid assembly
- Large genomes are more computationally challenging, but it is mostly an engineering problem now:
  - Haplotype phasing improvement, incorporate 3rd-party phasing code
  - Develop a sequence aligner for “augmented alignment” for a faster Quiver consensus process
- FALCON-Unzip code: (No code, No truth!!) if you would like to hack it for now, email me ([email protected])
- Want to attack the algorithm problem for polyploid assembly? Let us help you!

Thanks for your attention!


For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx.

All other trademarks are the sole property of their respective owners.

www.pacb.com