Size matters: accurate detection and phasing of Structural Variations
Fritz Sedlazeck
June 14, 2018
Identification of SVs with long reads
Sedlazeck et al. Nature Methods (2018)
Short-read validation / False Discovery
[Figure: ONT, PacBio, and Illumina data tracks. Labeled examples: insertion in repetitive region, inversion, translocation, truncated reads]
Sedlazeck et al. Nature Methods (2018)
How can we leverage these technologies in large cohorts?
CCDG 44k+
Short-Read WGS
Long-range sequencing
Comprehensive Genomes
● Informed Sample Selection
● Disease Context
● Validate complex variation
● Novel SVs
● Phasing information
● Ethnicity Variant Catalogs
● Technology Strategies
● Data Merging Pipelines
How to select samples: SVCollector
SVCollector for
[Figure: cumulative fraction of SVs (0.0 to 1.0) against number of samples (0 to 100) for the greedy, topN, and random strategies]
1000 Genomes: selecting 4% of the samples
Sedlazeck et al. (bioRxiv)
Population      TopN     Greedy
AFR             99       60
SAS             0        16
EAS             0        14
EUR             0        6
AMR             1        4
Subpopulation   30.77%   96.15%
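The contrast between the TopN and Greedy columns above can be illustrated with a small sketch. It is not SVCollector's implementation: it assumes each sample is already reduced to a set of SV identifiers, and the function names and toy data are hypothetical. TopN ranks samples by how many SVs they carry, while greedy repeatedly takes the sample that adds the most SVs not yet covered.

```python
from typing import Dict, List, Set


def select_topn(sample_svs: Dict[str, Set[str]], n: int) -> List[str]:
    """Pick the n samples with the most SVs, ignoring overlap between samples."""
    ranked = sorted(sample_svs, key=lambda s: (-len(sample_svs[s]), s))
    return ranked[:n]


def select_greedy(sample_svs: Dict[str, Set[str]], n: int) -> List[str]:
    """Repeatedly pick the sample that contributes the most SVs not yet covered."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(sample_svs)
    for _ in range(min(n, len(remaining))):
        # Ties are broken alphabetically because max keeps the first maximum.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


def covered_fraction(sample_svs: Dict[str, Set[str]], chosen: List[str]) -> float:
    """Fraction of all observed SVs present in at least one chosen sample."""
    all_svs: Set[str] = set().union(*sample_svs.values())
    if not all_svs:
        return 0.0
    got: Set[str] = set().union(*(sample_svs[s] for s in chosen))
    return len(got) / len(all_svs)


# Hypothetical toy cohort: samples A and B share most of their SVs.
toy = {
    "A": {"sv1", "sv2", "sv3", "sv4"},
    "B": {"sv1", "sv2", "sv3"},
    "C": {"sv5", "sv6"},
    "D": {"sv7"},
}

if __name__ == "__main__":
    for name, picker in (("topN", select_topn), ("greedy", select_greedy)):
        picked = picker(toy, 2)
        print(name, picked, round(covered_fraction(toy, picked), 3))
```

In the toy cohort, topN selects two largely redundant samples, whereas greedy trades the second-largest sample for one that carries different SVs. This is the same effect the table suggests at population scale.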
[Figure: % SVs in population against number of samples]
• Quantitative sample selection
• Avoid dependency on self-reported phenotypes
• Assuming observed variation (Freeze 1) is a prior for real variation
• Using common SVs (AF > 0.001)
• ~100 Baylor CCDG F1 samples
• Multiple ethnicities
Sedlazeck et al. (bioRxiv)
# Samples selected
Ethnicity           Male   Female
African American    22     40
Hispanic American   9      8
Caucasian           6      10
Unknown (5 samples)
SVCollector: CCDG Sample Selection
Comprehensive Genomes Pilot: Preliminary Data
@HGSC / External
Family Study            Illumina, PCR-Free   PacBio   10X Genomics   RNA-Seq (read pairs, M)
Ashkenazi Jewish trio   38x                  18x      30x            50
HGSC Control Trio       >100x                19x      56x            68
HapMap CEU Trio         >100x                40x      30x            35
Sedlazeck et al. (in preparation)
HLA-F deletion: Long + RNA-Seq
Tracks (PacBio and RNA-Seq): HS-1011, NA12878, NA24385
FPKM: 54.2613
FPKM: 19.3986
FPKM: 16.9305
Sedlazeck et al. (in preparation)
Phasing: chr6
Technology       N50 Phasing (Mbp)
PacBio           0.243
10x-Longranger   0.901
10x-Hapcut2      0.978
PacBio+10x       1.039

Technology       N50 Phasing (Mbp)
PacBio           0.276
10x-Longranger   8.523
10x-Hapcut2      67.576
PacBio+10x       67.576
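The tables report an N50 over phase blocks. The slides do not state how it was computed. The sketch below assumes the usual N50 definition: the block length at which blocks of that length or longer cover at least half of the total phased length. The function name and the example lengths are hypothetical.

```python
from typing import Iterable


def n50(block_lengths: Iterable[float]) -> float:
    """Return the N50 of a collection of phase block lengths.

    Blocks are sorted from longest to shortest and accumulated until the
    running sum reaches half of the total length; the length of the block
    that crosses that threshold is the N50.
    """
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one block length")
    half = sum(lengths) / 2
    running = 0.0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return lengths[-1]  # not reached for positive lengths


if __name__ == "__main__":
    # Hypothetical block lengths in Mbp; total is 12, so half is 6.
    print(n50([5, 3, 2, 1, 1]))
```

A single very long block can dominate this statistic, which fits the identical 67.576 Mbp values for 10x-Hapcut2 and PacBio+10x in the second table.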
[Figure, two panels: PacBio, 10x Genomics, and combined tracks, with the MHC/HLA and LPA loci marked]
HS-1011: DNA Mol. Length 27.6 kb
NA24385: DNA Mol. Length 99.9 kb
Sedlazeck et al. (in preparation)
Comprehensive Genomics
Comprehensive Genomes: Sedlazeck et al. (in preparation)
Interaction of SVs+SNV with methylation and RNA-Seq
Overview of long read projects
Diploid genomes
Regions (Parkinson, Gaucher): Leija-Salazar (bioRxiv)
Entex consortium (in preparation)
Detection of Variants
NGMLR + Sniffles: Sedlazeck et al. (2018)
SURVIVOR: Jeffares et al. (2017)
Clairvoyante: Luo et al. (bioRxiv)
GiaB (in preparation)
SVs in Genomes
Cancer (SKBR3): Nattestad et al. (bioRxiv)
44,000 Population (CCDG): Sedlazeck et al. (in prep)
CHR6: Average SV Allele Frequency per 100kb
[Figure: allele frequency (0.00 to 0.20) against position (0.0e+00 to 1.5e+08)]
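A windowed average like the one named in the chr6 plot title could be computed along the following lines. This is only a sketch under stated assumptions: each SV is represented by a start position and an allele frequency, it is assigned to the 100 kb window containing its start, and windows without SVs are simply omitted. The slides do not specify any of these choices, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_af_per_window(
    svs: Iterable[Tuple[int, float]], window: int = 100_000
) -> Dict[int, float]:
    """Average allele frequency of SVs per fixed-size window.

    svs: (start_position, allele_frequency) pairs for one chromosome.
    Returns a mapping from window start coordinate to the mean AF of the
    SVs whose start falls inside that window.
    """
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for pos, af in svs:
        start = (pos // window) * window  # window the SV start falls into
        sums[start] += af
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


if __name__ == "__main__":
    # Hypothetical SVs: two in the first window, one in the second.
    example = [(10, 0.5), (50_000, 0.25), (150_000, 1.0)]
    print(mean_af_per_window(example))
```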
Methods
SURVIVOR:
• Tool kit for SVs
• Published: Nature Communications (2017)
• Available: github.com/fritzsedlazeck/SURVIVOR
Sniffles:
• SV detection for long reads
• Published: Nature Methods (2018)
• Also nested SVs
• Available: github.com/fritzsedlazeck/Sniffles
NextGenMap-LR:
• Long read mapper
• Published: Nature Methods (2018)
• Available: github.com/philres/nextgenmap-lr
SVCollector:
• Automated sample selection
• bioRxiv
• Available: github.com/fritzsedlazeck/SVCollector
Clairvoyante:
• SNV caller
• bioRxiv
• Available: github.com/aquaskyline/Clairvoyante
Crossstitch:
• Localized assembly + phasing of SVs+SNV
• Available: github.com/schatzlab/crossstitch
Acknowledgments
William Salerno
Stephen Richards
Richard Gibbs
Philipp Rescheneder
Moritz Smolka
Arndt von Haeseler
Michael Schatz
Schatz lab
3.4 How much coverage do we need?
NA12878 (55x original)
SKBR3 (69x original)
Short indels 8-100bp
SV    Scalpel   Sniffles found (%)   Sniffles additional
DEL   30,988    90.5%                871
INS   191,817   71.70%               13,503
Minimap2: PacBio