ashg sedlazeck grc_share
Structural Variation Characterization Across the Human Genome and Populations
Fritz Sedlazeck
October 17, 2017
Scientific interests
Detection of Variants
• Sniffles (in bioRxiv)
• SURVIVOR, Jeffares et al. (2017)
• BOD-Score, Sedlazeck et al. (2013)
Mapping / Assembly of reads
• NextGenMap-LR (in bioRxiv)
• Falcon Unzip, Chin et al. (2016)
• NextGenMap, Sedlazeck et al. (2013)
Benchmarking / Biases
• DangerTrack, Dolgalev et al. (2017)
• Teaser, Smolka et al. (2015)
• Sequencing, Jünemann et al. (2013)
Applications
Model organisms:
• Cancer (SKBR3) (in bioRxiv)
• miRNA editing (Vesely et al. 2012)
Non-model organisms:
• Cottus transposons (Dennenmoser et al. 2017)
• Clunio (Kaiser et al. 2016)
• Seabass (Vij et al. 2016)
• Pineapple (Ming et al. 2015)
Structural Variations
• Genomic Disorders
• Evolution
• Impact on regulation
• Impact on phenotypes
[Figure: two heatmaps of regulatory state by cell line. Regulatory states: CTCF binding site, enhancer, open chromatin region, promoter, promoter flanking region, and TF binding site, each ACTIVE, INACTIVE, POISED, or REPRESSED (plus NA for open chromatin region and TF binding site). Cell lines: A549 through Thymus. Color scale: number affected.]
Diploid genome
• Impact on Regulation
• Variability of genes
• Need to understand the full structure
Challenges: Pursuing the diploid genome
1. Accurate prediction of SVs
2. Comparison of SVs
3. Annotation and interpretation of SVs
4. Population analysis
5. Diploid Genome
Layer et al. (2014)
1.1 How to detect Structural Variations (SVs)
• (+) SVs in repetitive regions
• (+) Span SVs
• (+) Uniform coverage
• (+) Can identify more complex SVs
• (-) Higher seq. error rate
• (-) Hard to align
1.1 Long Read Technologies
1.1 Accurate mapping and SV calling
NextGenMap-LR (NGMLR):
• Long read mapper
• Convex gap costs
• Faster than BWA-MEM
Sniffles:
• SV caller for long reads
• All types of SVs
• Phasing of SVs
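To illustrate the "convex gap costs" bullet, here is a minimal sketch comparing an affine gap cost with a convex one, where each additional gap base costs less than the previous one. The log-shaped function and all parameter values (`gap_open`, `gap_extend`, `scale`) are assumptions chosen for illustration; they are not NGMLR's actual scoring function or defaults.

```python
import math


def affine_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Affine model: every additional gap base costs the same amount."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def convex_gap_cost(length, gap_open=5.0, scale=4.0):
    """Convex model (illustrative, log-shaped): the marginal cost of
    extending a gap shrinks as the gap gets longer."""
    if length <= 0:
        return 0.0
    return gap_open + scale * math.log(length)


if __name__ == "__main__":
    # Compare both models for short and long gaps.
    for n in (1, 10, 100, 1000):
        print(n, affine_gap_cost(n), round(convex_gap_cost(n), 2))
```

Under such a model a single long gap, as expected at an SV breakpoint, is penalized far less than under the affine model, while many scattered short gaps remain comparatively expensive.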
1.2 NA12878: SV calling
Tech.                    Coverage  Avg read len  Method               SVs     TRA
PacBio                   55x       4,334         Sniffles             22,877  119
Oxford Nanopore @Baylor  34x       4,982         Sniffles             12,596  46
Illumina                 50x       2 x 101       Manta, Delly, Lumpy  7,275   2,247
Sedlazeck et al. (2017)
1.1 NA12878: SV calling
Tech.                    Coverage  Avg read len  Method               SVs     TRA    DEL    INS
PacBio                   55x       4,334         Sniffles             22,877  119    9,933  12,052
Oxford Nanopore @Baylor  34x       4,982         Sniffles             12,596  46     7,102  5,166
Illumina                 50x       2 x 101       Manta, Delly, Lumpy  7,275   2,247  3,744  0
Sedlazeck et al. (2017)
1.1 NA12878: check 2,247 vs 119 TRA
[Figure: a translocation call in Illumina data compared with PacBio and ONT data; panels labeled "Truncated reads" and "Insertion in rep. region"]
Overlap         Illumina TRA (%)
Insertions      53.05
Deletions       12.06
Duplications    0.57
Nested          0.31
High coverage   1.87
Low complexity  9.79
Explained       77.65
Sedlazeck et al. (2017)
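As a quick check of the table above, the "Explained" row is the sum of the individual categories. The sketch below assumes the categories are non-overlapping, which the slide's total implies; the variable names are illustrative only.

```python
# Percent of Illumina TRA calls overlapping each category (values from the slide).
overlap_pct = {
    "Insertions": 53.05,
    "Deletions": 12.06,
    "Duplications": 0.57,
    "Nested": 0.31,
    "High coverage": 1.87,
    "Low complexity": 9.79,
}

# Sum of all categories, rounded to the slide's two decimals.
explained = round(sum(overlap_pct.values()), 2)
# Remainder of Illumina TRA calls not accounted for by any category.
unexplained = round(100 - explained, 2)

print(explained, unexplained)
```

The total reproduces the 77.65% on the slide, leaving roughly 22% of the Illumina translocation calls without one of these explanations.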
1.1 NA12878: check 2,247 vs 119 TRA
[Figure: ONT, PacBio, and Illumina data; panels labeled "Inversion", "Translocation", "Truncated reads", and "Insertion in rep. region"]
Sedlazeck et al. (2017)
1.2 More complex SVs
Inverted tandem duplication:
• Pelizaeus-Merzbacher disease
• MECP2
• VIPR2
Sedlazeck et al. (2017)
[Figure: PacBio data vs. Illumina data]
1.2 More complex SVs
Inversion flanked by deletions:
• Haemophilia A
• Only found over long range PCR! (2007)
Sedlazeck et al. (2017)
[Figure: Illumina data vs. PacBio data]
Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs
3. Annotation and interpretation of SVs
4. Population analysis
5. Diploid Genome
Layer et al. (2014)
2. Comparison of SVs
SURVIVOR Framework:
• Compare SVs
• GiaB: 95 VCF files: 1 minute
• Simulate SVs
• Simulate long reads
• Summarize SV results
Jeffares et al. (2017)
[Figure: new SVs vs. observed SVs]
2. Genome in a Bottle: merging 95 vcfs (1 min)
10x Genomics
BioNano
Complete Genomics
Illumina
PacBio
Minimum 2 callers:
SV Caller Comparison:
Using PCR + Sanger to validate SVs from multiple categories.
Join CSHL + Baylor to help with validations!
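The idea of comparing call sets and keeping SVs supported by a minimum of 2 callers can be sketched as follows. This is a simplified illustration, not SURVIVOR's actual algorithm: the `SV` record, the `merge_calls` function, the `max_dist` of 1000 bp, and the toy coordinates are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical minimal SV record; real VCF entries carry far more information.
SV = namedtuple("SV", "caller chrom start end svtype")


def merge_calls(calls, max_dist=1000, min_callers=2):
    """Cluster same-type calls whose breakpoints lie within max_dist of the
    previous call in the cluster, then keep clusters seen by >= min_callers."""
    clusters = []
    for sv in sorted(calls, key=lambda s: (s.chrom, s.svtype, s.start)):
        last = clusters[-1] if clusters else None
        if (last
                and last[-1].chrom == sv.chrom
                and last[-1].svtype == sv.svtype
                and abs(last[-1].start - sv.start) <= max_dist
                and abs(last[-1].end - sv.end) <= max_dist):
            last.append(sv)
        else:
            clusters.append([sv])
    # Count distinct callers per cluster, not the number of calls.
    return [c for c in clusters if len({s.caller for s in c}) >= min_callers]


# Toy calls (coordinates are made up for illustration).
calls = [
    SV("manta", "chr1", 1000, 2000, "DEL"),
    SV("delly", "chr1", 1050, 2010, "DEL"),
    SV("lumpy", "chr1", 50000, 50500, "DEL"),
    SV("manta", "chr2", 300, 900, "INV"),
    SV("lumpy", "chr2", 320, 880, "INV"),
    SV("delly", "chr2", 310, 905, "DUP"),
]
merged = merge_calls(calls)

if __name__ == "__main__":
    for cluster in merged:
        print(cluster[0].chrom, cluster[0].svtype, sorted(s.caller for s in cluster))
```

Requiring agreement on SV type as well as position keeps the chr2 DUP call separate from the two INV calls even though their coordinates nearly coincide.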
Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs
4. Population analysis
5. Diploid Genome
[Figure: histogram over genes impacted; x-axis: number of SVs hitting a gene (0 to 80); y-axis: frequency (0 to 6000)]
3. Annotation: SURVIVOR_ant
Annotating SVs with:
• Multiple GTF, BED, VCF
Genome in a Bottle:
• 63,677 genes (GTF)
• 1,733,686 regions (3 BED files)
• 22 seconds:
• 8,314 genes impacted
Sedlazeck et al. (2017)
[Figure: genes impacted by SVs; x-axis: number of SVs hitting a gene; y-axis: number of genes]
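The annotation step boils down to interval overlap between SVs and annotated features, followed by a count per gene. Below is a minimal sketch under stated assumptions: the gene names, coordinates, and the `count_sv_hits` helper are hypothetical, and the quadratic loop stands in for whatever indexed lookup a real tool would use on tens of thousands of genes.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def count_sv_hits(svs, genes):
    """Return a Counter mapping gene name -> number of overlapping SVs."""
    hits = Counter()
    for sv_chrom, sv_start, sv_end in svs:
        for name, g_chrom, g_start, g_end in genes:
            if sv_chrom == g_chrom and overlaps(sv_start, sv_end, g_start, g_end):
                hits[name] += 1
    return hits


# Toy annotation and SV calls (all values are made up for illustration).
genes = [
    ("GENE_A", "chr1", 1000, 5000),
    ("GENE_B", "chr1", 8000, 9000),
    ("GENE_C", "chr2", 100, 700),
]
svs = [
    ("chr1", 4500, 4800),
    ("chr1", 4900, 8200),
    ("chr1", 20000, 21000),
    ("chr2", 650, 2000),
]

hits = count_sv_hits(svs, genes)
# Histogram: how many genes are hit by exactly k SVs.
histogram = Counter(hits.values())

if __name__ == "__main__":
    print(len(hits), "genes impacted")
    print(sorted(histogram.items()))
```

The `histogram` counter corresponds to the kind of plot shown on the slide: number of SVs per gene on one axis, number of genes on the other.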
Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs: SURVIVOR_ANT
4. Population analysis
5. Diploid Genome
4. SVs in Population: SURVIVOR
• Birth defect study (Karyn Meltz Steinberg, WashU: Wed. 9am: Room 310A)
• 4 callers, 114 samples
• CCDG (William Salerno, HGSC: poster on Friday, #1281)
• 5 callers, 22,600 samples
• Non-human:
• S. pombe: 3 callers, 161 samples
• Tomato: 3 callers, 846 samples
4. SVs in 22,600 Individuals
We need large SV studies:
• Common vs. rare SVs
• Inform GWAS studies
• Ethnicity-specific SVs
• Catalog variability of regions
• MHC, LPA, etc.
[Figure: CHR6: Average SV Allele Frequency per 100kb; x-axis: position (0 to 1.5e+08); y-axis: allele frequency (0.00 to 0.20); MHC and LPA regions marked]
[Figure: number of SVs shared across individuals by position]
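A per-window average such as "average SV allele frequency per 100kb" can be computed by binning SV positions. The sketch below assumes each SV is reduced to a (position, allele frequency) pair on one chromosome; the `mean_af_per_window` function and the toy values are illustrative assumptions, not the CCDG data or pipeline.

```python
from collections import defaultdict


def mean_af_per_window(svs, window=100_000):
    """Average allele frequency per fixed-size window.

    svs: iterable of (position, allele_frequency) on a single chromosome.
    Returns {window_start: mean allele frequency}, ordered by position.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, af in svs:
        w = pos // window  # index of the window containing this SV
        sums[w] += af
        counts[w] += 1
    return {w * window: sums[w] / counts[w] for w in sorted(sums)}


# Toy SVs (positions and frequencies are made up for illustration).
toy_svs = [
    (29_950_000, 0.10),
    (29_990_000, 0.30),
    (31_020_000, 0.25),
    (160_560_000, 0.05),
]
profile = mean_af_per_window(toy_svs)

if __name__ == "__main__":
    for start, af in profile.items():
        print(start, round(af, 3))
```

Windows without any SV are simply absent from the result; a plot like the one on the slide would treat them as gaps or zeros, depending on the convention chosen.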
Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs: SURVIVOR_ANT
4. Population analysis: SURVIVOR
5. Diploid Genome
5.1 Diploid Genome
Challenges:
• Sequencing technology
• Computational methods
• Money
HGSC Approach: GADGET
1. Sequence 100 individuals: PacBio + 10x Genomics
2. SV detection/genotyping
3. Phasing of SVs + SNPs
4. Population-based genotyping of SVs using short reads.
5.2 Diploid Genome
Selecting 100 samples
• We want to maximize the outcome/ $ spent
• Selection of samples (red)
• Select top 100 (red)
• Random selection of samples (boxplot)
[Figure: histogram of SVs per patient; x-axis: number of SVs (2e+04 to 1e+05); y-axis: number of patients (0 to 250)]
[Figure: Random vs. informed choice of samples (CCDG); x-axis: number of chosen samples (1 to 96); y-axis: SVs in population (%); series: Informed, Top100, Random]
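The three strategies named in the figure (Informed, Top100, Random) can be compared on toy data. The slides do not say how the informed selection is computed; the sketch below assumes a greedy choice that repeatedly picks the sample adding the most SVs not yet covered, which is one plausible reading. All function names and the toy SV sets are hypothetical.

```python
import random


def coverage(selected, sample_svs, all_svs):
    """Percent of all population SVs present in at least one selected sample."""
    seen = set().union(*(sample_svs[s] for s in selected)) if selected else set()
    return 100.0 * len(seen) / len(all_svs)


def greedy_pick(sample_svs, k):
    """Informed choice (assumed greedy): add the sample with the most new SVs."""
    chosen, seen = [], set()
    remaining = set(sample_svs)
    for _ in range(min(k, len(remaining))):
        # Sort first so ties are broken deterministically by sample name.
        best = max(sorted(remaining), key=lambda s: len(sample_svs[s] - seen))
        chosen.append(best)
        seen |= sample_svs[best]
        remaining.remove(best)
    return chosen


def top_pick(sample_svs, k):
    """Top-k choice: the k samples carrying the most SVs, ignoring redundancy."""
    return sorted(sample_svs, key=lambda s: (-len(sample_svs[s]), s))[:k]


def random_pick(sample_svs, k, seed=0):
    """Random choice with a fixed seed for reproducibility."""
    return random.Random(seed).sample(sorted(sample_svs), k)


# Toy population: sample name -> set of SV identifiers (made up).
sample_svs = {"s1": {1, 2, 3, 4}, "s2": {1, 2, 3, 5}, "s3": {6, 7}, "s4": {8}}
all_svs = set().union(*sample_svs.values())

informed_cov = coverage(greedy_pick(sample_svs, 2), sample_svs, all_svs)
top_cov = coverage(top_pick(sample_svs, 2), sample_svs, all_svs)
random_cov = coverage(random_pick(sample_svs, 2), sample_svs, all_svs)

if __name__ == "__main__":
    print("informed", informed_cov, "top", top_cov, "random", random_cov)
```

In the toy data the two samples with the most SVs largely share the same variants, so picking by SV count alone covers less of the population than the greedy choice, which is the effect the figure contrasts.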
5.3 Diploid Genome (Prototype)
Challenges/ Summary
1. Accurate prediction of SVs: Sniffles (Talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs: SURVIVOR_ANT
4. Population analysis: SURVIVOR
5. Diploid Genome: GADGET
All methods are available:
https://github.com/fritzsedlazeck
https://fritzsedlazeck.github.io/
Acknowledgments
William Salerno
Stephen Richards
Richard Gibbs
Michael Schatz
Schatz lab
Daniel Jeffares
Jürg Bähler
Christophe Dessimoz
Justin Zook
GiaB consortium