diploidgenome assembly and comprehensive haplotype ......arabidopsis thaliana f1 diploid assembly...

Post on 30-May-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.

Diploid Genome Assembly and Comprehensive Haplotype Sequence ReconstructionJason Chin, Paul Peluso, David Rank, Fri tz Sedlazeck, Maria Nattestad, Michael Schatz, Greg Concepcion, Al icia Clum, Kerrie Barry, Alex Copeland, Ronan O’Malley

Acknowledgments

-All PacBio Colleagues

-Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk

Institute )

-Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute)

-Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL)

- Open source toolsets-Daligner (https://dazzlerblog.wordpress.com), Gene Myers-BLASR (https://github.com/PacificBiosciences/blasr), Mark Chaisson-Python, NetworkX for rapid algorithm protyping-Gephi, Graphviz for graph visualization-FALCON (https://github.com/PacificBiosciences/falcon,

https://github.com/PacificBiosciences/falcon)

SOLVING THE DIPLOID ASSEMBLY PROBLEM

- Falcon (a polyploid-aware assembler) : generating the contigs through the bubbles- Falcon Unzip: identifying smaller variants and using them to separate the haplotypes

• Bubbles = big variants between the haplotypes

• Collapsed Path = smaller variants between the haplotypes

WHY DO WE SEE BUBBLES?

SNPs SNPs SNPsSVsSVs

SNPsSNPs SNPs

SVsSVs

Haplotype 1

Haplotype 2

Genome Sequences

Assembly Graph

In most OLC assembler design, the overlapper does not catch differences at SNP level but structural variations are naturally segregated.

THE FALCON UNZIP PROCESS

SNPs SNPs SNPsSVsSVs

Associate contig 1(Alternative allele)

Associate contig 2(Alternative allele)

SNPs SNPs SNPsSVsSVs

Primary contig

Augmented with haplotype information of each reads

FALCON

FALCON-Unzip

Updated primary contig + “associate haplotigs”

PHASING READ INTO HAPLOTYPE GROUPS

Haplotype 0

Haplotype 1

Identify het-SNPs

Phase het-SNPs

Group reads with

phased SNPs

Reconstruct haplotypes

Align SMRT reads to the initial primary contig

More het-SNPs in longer reads: 8% to 15% sequence error rate is not an issuesgiven enough long read coverage for phasing.

QUESTION: HOW TO RESOLVE STRUCTURAL VARIATIONS & HET-SNPS PHASING AT ONCE

3 kb – 100 kb

300 b – 10 kb

Structural Variations

het-SNP

ü Overlap-layout process catches SV haplotypes

✗ Collapsed paths when there is no SV

ü Easy to group SNPs/reads into different haplotypes

✗ No phasing information associated with SVs

ü Nearby SVs may be phased automatically

✗ Haplotype-fused paths

ü Haplotype-specific paths✗ More fragmented contigs

Information Sources Pros & Cons Assembly graph features

MERGE HAPLOTYPE INFORMATION AND “UNZIP”

Tiling path of haplotype 0

Tiling path of haplotype 1

Remove edges connectingdifferent haplotypes

PUT EVERYTHING TOGETHER

“Falcon Unzip P

rocess”

~ 4.80 Mb

Add missing haplotype specific nodes & edges

Remove edges that connect different haplotypes

The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs.

4 major haplotype phased blocks determined by het-SNPs Un-phased region

POLISHING: ALLELE-SPECIFIC ALIGNMENT FOR FINAL CONSENSUS

“Augmented alignment”: Each read has extra attribute (e.g., contig identifier, phasing block, haplotype phase), an aligner uses those information to place the read to specific reference sequence or regions.

Align the “red” haplotig

Align the “blue” haplotig

Read from same regionbut different haplotypes

CONSTRUCT ARABIDOPSIS THALIANA COL-0 X CVI-0 DIPLOID F1 LINE

Image credits: Pajoro, et al, Trends in plant science 21.1 (2016): 6-8.

Col-0 Cvi-0

Col-0 x Cvi-0

• Two inbred lines sequenced in 2013 (P4 chemistry), assembled as haploid genomes

• F1 line constructed and sequenced in 2015 (P6 chemistry), assembled with FALCON and FALCON-Unzip

Col-0 x Cvi-0 assembly

DIPLOID ASSEMBLY PRIMARY CONTIGS AND HAPLOTIGS

.

Col-0 chromosome

Cvi-0 chromosome

haplotigs

primary contig

• Primary contigs ~ 1n representation of the genome• Haplotigs ~ phased sequences from where the homologuous

chromosomes are distinguishable

ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY STATISTICS

Strain InbredCol-0 InbredCvi-0Col-0 xCvi-0

F1

Assembler CA/HGAP CA/HGAPFALCON FALCON-Unzip FALCON-Unzip

primary contigs primary contigs haplotigsAssemblySize(Mb) 126 119 143 140 105#contigs 1325 194 426 172 248N50size(Mb) 6.210 4.79 7.92 7.96 6.92MaxContig size(Mb) 10.25 11.25 13.39 13.32 11.65

126

119

143

57

140

105

0 20 40 60 80 100 120 140 160

Inbred Col-0

Inbred Cvi-0

F1 FALCON p-contigs

F1 FALCON a-contigs

F1 Unzip p-contigs

F1 Unzip haplotigs

Assembly Size (Mb)

6.21

4.79

7.92

0.146

7.96

6.92

0 2 4 6 8 10

N50 size (Mb)

Col-0 x Cvi-0 assembly

EVALUATE THE DIPLOID ASSEMBLY RESULT

.

Col-0 assembly

Cvi-0 assembly

haplotigs

primary contig

Haploid-like contig in the inbred-line assemblies

Many variations

Few or no variations

Few or no variations

Many variations

By aligning the haplotigs to the parental genome assemblies, we can evaluate the haplotigs’ quality, e.g. haplotyping accuracy and CDS prediction consistency.

COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES

- We call the SNP and SVs against the parental inbred assemblies for all primary contigs and haplotigs.

- Most haplotigs can be fully assigned to one of the parental haplotypes.

COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES

- We call the SNP and SVs against the parental inbred assemblies for all primary contigs and haplotigs.

- Most haplotigs can be fully assigned to one of the parental haplotypes.

Cvi-0Col-0

Primary Contigs

Haplotigs

Col-0Cvi-0

ANNOTATION COMPARISION

Predicted Coding Sequences

TAIR 10 Genome &

Predicted Transcripts

Homopolymer Length Distributions

Compare de novo gene prediction (with AUGUSTUS (Stanke 2003)) between different assemblies

Assemblies TAIR10 Col-0 Cvi-0Numberofpredicted

CDS 27,946 30,006 27,393

100%indel-free fulllengthoverlaps

Col-0 3000625,966(92.9%)

Col-0xCvi-0 5677525,865(92.5%)

26,537(88.4%)

27,370(99.9%)

OTHER SMALLER AND LARGER DIPLOID GENOMES

Clavicoronapyxidata

(Coral Fungus)Cabernet

Sauvignon+* Human*Haploid Genome Size: ~ 44 Mb ~ 500 Mb ~ 3 Gb

FALCON-Unzip Results:

Primary contig size 41.9 Mb 591.0 Mb 2.76 GbPrimary contig N50 1.5 Mb 2.2 Mb 22.9 Mb

Haplotig size 25.5 Mb 372.2 Mb 2.0 GbHaplotig N50 872 kb 767 kb 330 kb

+Led by Cantu lab, UC Davis and Cramer lab, UN Reno

*Preliminary results. Fast file system and efficient computational infrastructure are currently needed for large genomes.

SUMMARY

-Single data type for routine diploid assembly-Large genomes are more computationally challenging but it is mostly an

engineering problem now:-Haplotype phasing improvement, incorporate 3rd party phasing code -Develop a sequence aligner for “augmented alignment” for faster

Quiver consensus process - FALCON-Unzip code: (No code, No truth!!) if you like to hack it for now,

email me (jchin@pacb.com)- Want to attack the algorithm problem for polyploid assembly? Let us

help you!

Thanks for your attention!

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx.

All other trademarks are the sole property of their respective owners.

www.pacb.com

top related