aug2015 ali bashir and jason chin pac bio giab_assembly_summary_ali3

FIND MEANING IN COMPLEXITY © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures. GIAB workshop, Aug 27, 2015 PacBio GIAB Assembly Summary

Upload: genomeinabottle

Post on 17-Jan-2017


Page 1: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

FIND MEANING IN COMPLEXITY © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.

GIAB workshop, Aug 27, 2015

PacBio GIAB Assembly Summary

Page 2: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Two Draft Assemblies Generated

• Two assemblies:
  – “Family genome”: uses all data from the three genomes to create a family-level reference with better contiguity. It can be used for other downstream analyses.
  – Child genome: standard Falcon diploid-aware genome assembly, used for “Falcon Unzip” to get “haplotigs” of regions of interest.

• Primary contig statistics summary:

Statistic   Child contigs     “Family” contigs
#Seqs       9,973             5,680
Max         39,181,442        50,291,873
Total       2,959,326,490*    2,892,908,408
n50         7,162,062         9,242,933
n90         668,759           855,896
n95         66,926            233,041

*We use a shorter read-length cutoff and more sensitive overlapping parameters for the child assembly. This might explain why its assembly size is bigger than the family one.
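For readers unfamiliar with the n50/n90/n95 rows, here is a minimal sketch of how such values are conventionally computed from a list of contig lengths. The function name `nx` and the toy lengths are illustrative assumptions, not taken from the Falcon tooling.

```python
def nx(lengths, percent):
    """Return the Nx value: the length L such that contigs of length >= L
    together contain at least `percent` % of all assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    raise ValueError("lengths must be non-empty and percent <= 100")


# Toy contig lengths (hypothetical), total = 30 bases.
toy = [10, 8, 6, 4, 2]
print("n50 =", nx(toy, 50))
print("n90 =", nx(toy, 90))
```

Because contigs are visited from longest to shortest, n90 and n95 are never larger than n50, which matches the ordering of the values in the table.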

Page 3: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

“Naïve” Structural Variation Calls by Whole Genome Alignments

• Whole-genome alignments are used to identify differences between the assembled contigs and GRCh38. A couple of examples are shown here.

[Figure: child contigs aligned to the reference, with tracks for Haplotype 1 and Haplotype 2; ~50 kb deletion in one haplotype]

Page 4: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Heterozygous Insertion

[Figure: tracks for Haplotype 1 and Haplotype 2; ~3.7 kb insertion in one haplotype]

~22,000 SV calls

Need to develop a method to filter out some alignment artifacts
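The slides do not describe how the calls were derived from the alignments. As a rough sketch of one naive approach, the code below compares the reference gap and the contig gap between consecutive alignment blocks and reports an insertion or deletion when they differ. The block layout, the function name `naive_sv_calls`, the `min_size` threshold, and the coordinates are all hypothetical; only the ~50 kb and ~3.7 kb event sizes echo the examples above.

```python
def naive_sv_calls(blocks, min_size=50):
    """Call insertions/deletions from co-linear alignment blocks.

    blocks: list of (ref_start, ref_end, qry_start, qry_end) tuples,
            half-open, same strand, sorted by ref_start (an assumption).
    Returns a list of (type, ref_position, size) tuples.
    """
    calls = []
    for prev, nxt in zip(blocks, blocks[1:]):
        ref_gap = nxt[0] - prev[1]   # unaligned reference bases between blocks
        qry_gap = nxt[2] - prev[3]   # unaligned contig bases between blocks
        diff = qry_gap - ref_gap
        if diff >= min_size:
            calls.append(("INS", prev[1], diff))    # contig has extra sequence
        elif diff <= -min_size:
            calls.append(("DEL", prev[1], -diff))   # contig is missing sequence
    return calls


# Hypothetical blocks for one contig against one reference chromosome.
blocks = [
    (0, 1000, 0, 1000),
    (51000, 60000, 1000, 10000),    # 50 kb of reference skipped
    (60000, 70000, 13700, 23700),   # 3.7 kb of contig skipped
]
print(naive_sv_calls(blocks))
```

A sketch like this reports every gap between blocks, including ones caused by alignment artifacts, which is the kind of call the filtering step mentioned above would need to remove.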

Page 5: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Falcon “Unzip”

Contig 000400F, ~5.1 Mb (MHC, HLA Class I region)

Primary contig with phased sequence + alternative haplotigs

[Figure: Falcon “Unzip” in three steps (Step 1, Step 2, Step 3). Panel labels: “Haplotype Fused” graph; “Unzipped” graphs; haplotype blocks (phased blocks); “haplotig”; region of low-density het-SNPs]

Page 6: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Phased Variants of All Kinds

Sequence alignments between the haplotigs

Phased structural variants + SNPs between the haplotypes

[Figure: alignment plot of the haplotigs across two haplotype blocks; axes: Haplotype 0, Haplotype 1]

Page 7: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Some Haplotype Statistics

EM on hybrid SNPs

Test region: 33 Mb
Number of haplotigs: 218
Haplotig coverage: 72.1%
N50: 287,557 bp
Switches: 336
Switch error rate: 4.13%
Total phased variants: 8,131
Concordant variants: 7,049
S50: 259

*Many small (10-15 kb) gaps where heterozygous SNPs were not present between blocks
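The slide does not say how switches were counted. The sketch below assumes a common convention: a switch is a change, between adjacent heterozygous sites, in whether the predicted haplotype agrees with a truth haplotype. The function names and toy alleles are hypothetical. The reported 4.13% equals 336 / 8131 to two decimals, so the denominator here is taken to be the total number of phased variants.

```python
def count_switches(pred, truth):
    """Count phase switches between a predicted and a truth haplotype.

    pred, truth: equal-length lists of 0/1 alleles at heterozygous sites.
    """
    # True where the predicted haplotype matches the truth haplotype.
    agrees = [p == t for p, t in zip(pred, truth)]
    # A switch is any change of that relation between neighbouring sites.
    return sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)


def switch_error_rate(switches, phased_variants):
    """Switches per phased variant (denominator is an assumption, see text)."""
    return switches / phased_variants


# Toy example (hypothetical alleles): agreement flips twice.
print(count_switches([0, 0, 1, 1, 0], [0, 0, 0, 0, 0]))

# Arithmetic check against the numbers on the slide.
print(round(100 * switch_error_rate(336, 8131), 2))
```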

Page 8: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Assembly Statistics

• We assembled linked SNP and indel information from the previous step into finished haplotigs by phasing reads

• Reads were partitioned using a greedy algorithm on alleles

• Phased reads were then added to a network graph of read overlaps
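The slides give no detail on the greedy partitioning, so the following is only one possible sketch, under stated assumptions: variants are biallelic (alleles 0/1), each read is a mapping from site to observed allele, and reads are processed in the order given. The function name `greedy_partition` and the toy reads are hypothetical.

```python
def greedy_partition(reads):
    """Greedily assign reads to haplotype 0 or 1 based on their alleles.

    reads: dict of read name -> {site: allele}, alleles are 0 or 1.
    Returns (assignment, phase), where phase[site] is the allele placed
    on haplotype 0 (haplotype 1 carries the other allele).
    """
    phase = {}
    assignment = {}
    for name, alleles in reads.items():
        # Score agreement with haplotype 0 over already-phased sites.
        score = sum(1 if phase[site] == allele else -1
                    for site, allele in alleles.items() if site in phase)
        # Ties (including reads with no phased sites) go to haplotype 0.
        hap = 0 if score >= 0 else 1
        assignment[name] = hap
        # Extend the phase with sites first seen in this read.
        for site, allele in alleles.items():
            if site not in phase:
                phase[site] = allele if hap == 0 else 1 - allele
    return assignment, phase


# Toy reads (hypothetical): r1/r3 share one haplotype, r2/r4 the other.
reads = {
    "r1": {1: 0, 2: 0, 3: 1},
    "r2": {2: 1, 3: 0, 4: 1},
    "r3": {3: 1, 4: 0},
    "r4": {1: 1, 2: 1},
}
assignment, phase = greedy_partition(reads)
print(assignment)
```

Being greedy, this sketch never revisits an earlier assignment, so its result depends on read order, and a read that shares no phased site with earlier reads is placed on haplotype 0 arbitrarily.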

Page 9: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Rough Schematic for Hybrid Scaffolding - Flow Chart

[Flow chart; recoverable labels, in approximate order:]
Inputs: Contig/Scaffold (fasta) → in silico digestion → NGS.cmap; genome maps → de novo assembly → BN.cmap
Alignment: NGS.vs.BN.xmap
1. Flag inconsistencies (QC and/or manual curation): Filtered NGS.cmap, Filtered BN.cmap, LeftOver NGS.cmap, LeftOver BN.cmap
2. Scaffolding pipeline: Merged Hybrid Scaffold.cmap (optional iterations of different stringencies)
3. Export: Fasta, AGP
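To illustrate the in silico digestion step, the sketch below lists the positions of an enzyme recognition motif on both strands of a sequence, which is the information a map of label sites is built from. The slides do not name the enzyme; the motif `GCTCTTC` is an assumption chosen for illustration, and the function names and toy sequence are hypothetical. Writing an actual .cmap file is not shown.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, motif):
    """Return sorted 0-based start positions of `motif` on either strand."""
    patterns = {motif, reverse_complement(motif)}
    sites = set()
    for pattern in patterns:
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            # Step by one base so overlapping occurrences are also found.
            start = seq.find(pattern, start + 1)
    return sorted(sites)


# Toy sequence (hypothetical) with one forward and one reverse-strand site.
toy_seq = "AAGCTCTTCAAGAAGAGCTT"
print(find_sites(toy_seq, "GCTCTTC"))
```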

Page 10: Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3

Hybrid Scaffolding Stats

Input         Input Contigs                 # of Scaffolds  Mean Length  N50     Max     Total
HG002         GIAB upload (Falcon)          248             9.5Mb        22.7Mb  92.8Mb  2.4Gb
HG002         celera child                  275             8.1Mb        16.9Mb  61.0Mb  2.2Gb
HG002         updated Falcon Child          302             7.4Mb        18.2Mb  61.0Mb  2.3Gb
Trio          (more) updated Falcon         210             11.1Mb       29.3Mb  87.6Mb  2.3Gb
2 Step Trio   celera child + falcon trio    187             13.9Mb       34.3Mb  98.0Mb  2.6Gb
2 Step Child  celera child + falcon child   200             7.8Mb        23.8Mb  77.9Mb  1.6Gb