genotyping structural variants in pangenome graphs using ... · introduction 3 mapping reads across...
TRANSCRIPT
Genotyping structural variants inpangenome graphs using the vg toolkit
Jean Monlong
November 7, 2019
Genome Informatics
Introduction 2
Pangenome graphs and variant-aware read mapping
A C A TT
T
G
G
G
G
G
C
C
C
T
T
G
G
G
G
G
C
C
Linear reference genome
Variation graph
1KGP 1KGP 1KGP
G
C
G
C
G
Seq
. rea
dsS
eq. r
eads
Introduction 3
Mapping reads across structural variants
Structural variants are genomic variants larger than 50 bp,e.g. insertions, deletions, inversions translocations.
C
Linear reference genome
Variation graph
G
C
G
C
G
DELETION
INSERTION
DELETION
Introduction 4
SV catalogs from long-read sequencing studies
Ref. Project Samples
Chaisson et al. 2019Human Genome Structural
3Variation Consortium (HGSVC)
Audano et al. 2019 SVPOP 15
Zook et al. 2019 Genome in a Bottle (GIAB) 1
HG
SV
Cgnom
ad−S
V
10 100 1,000 10,000 100,000
0
2000
4000
6000
0
20000
40000
60000
size (bp)
varia
nt SV typeDELINS
Goal 5
The vg toolkit is a complete, open sourcesolution for graph construction, readmapping, and variant calling.
https://github.com/vgteam/vg
Garrison et al. Nature Biotech 2018
Can we genotype SVs from short-read sequencing datasetswith the vg toolkit?
Starting from public SV catalogs or de novo assemblies.
Hickey et al. bioRxiv 2019
From SV catalogs in human 6
Genotyping public SV catalogs in human
short reads
vg
vg
HG00514 VCF
VCF
HGSVCSV catalog
genotyped SVsEvaluation
HG
00514
HG
0051
4GRCh38
Evaluate genotype predictions for a sample from the truth set(e.g. HG00514).
From SV catalogs in human 7
Genotyping variants in vg
reads not mapped on linear reference
Line
ar re
fere
nce
Gra
ph re
fere
nce
insertion
readsinsertion
deletion
Snarl 1Path coverage ratio 1:1.6 → het
Snarl 2Path coverage ratio 0:2 → hom
Read mapping to reference pathVariant pathReference path Read mapping to variant path
Genotyping is based on the path coverage.
A snarl is a variant site in the graph, a “bubble”.
From SV catalogs in human 8
Evaluating SV genotypes with a truth set
truthcalls
Deletions/Inversions
<10% rec. overlap
At least 50% coverage and 10% reciprocal overlap<50% coverage
truth
calls
20bp 20bp
Insertions At least 50% of inserted sequence matching nearby insertions
R package: https://github.com/jmonlong/sveval
From SV catalogs in human 9
Results on HGSVC - Simulated reads
whole−genome non−repeat regions
INS
DE
L
vg
Paragraph
BayesT
yper
Delly
SVTyper vg
Paragraph
BayesT
yper
Delly
SVTyper
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00F1
graph-basedSV genotypers
traditionalSV genotypers
Non-repeat regions: regions not overlapping segmental duplications or simple repeats
From SV catalogs in human 10
Results on HGSVC - Real reads
whole−genome non−repeat regions
INS
DE
L
vg
Paragraph
BayesT
yper
Delly
SVTyper vg
Paragraph
BayesT
yper
Delly
SVTyper
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
F1
graph-based SV genotypers
traditionalSV genotypers
Non-repeat regions: regions not overlapping segmental duplications or simple repeats
From SV catalogs in human 11
Simple repeat/low complexity regions are challenging
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00precision
reca
ll
repeat class/family
SINE/Alu
LTR/ERV1
LINE/L1
Retroposon/SVA
Low_complexity
Satellite
Satellite/centr
Simple_repeat
SV type
INS
DEL
SV sequence annotated with RepeatMasker. Class assigned if covered ≥80% by a repeat element.
From de novo assemblies in yeast 12
Challenges with the VCF format
Multiple equivalent representations, over-simplification,impractical.
VCF v4.2 specs
Why not start directly from de novo assemblies?
From de novo assemblies in yeast 13
Analysis of 12 yeast strains from 2 clades
Selected 5 strains to build graph: one reference + 2 per clade.
Assemblies for 5 yeast strains
VCF
Cactus graph
Cactus aligner
vg
vg
Pairwise alignment with reference
Illumina reads
VCF
Compare mappingmetrics
vg genotype
vg genotype
vg
VCF vg
map
map
VCF graph
Cactus graph
From de novo assemblies in yeast 14
Evaluating SV genotyping using mapping statistics
No gold-standard to compare with.
Map reads to a sample graph built from the SV calls:
Assemblies for 5 yeast strains
VCF
Cactus graph
Cactus aligner
vg
vg
Pairwise alignment with reference
VCF
Compare mapping metrics
vg genotype
vg genotype
vg
VCF vg
map
map
VCF graph
Cactus graph
short reads
Evaluation
Mapping quality ∼ Sample graph quality ∼ SV callsquality.
From de novo assemblies in yeast 14
Evaluating SV genotyping using mapping statistics
No gold-standard to compare with.
Map reads to a sample graph built from the SV calls:
Assemblies for 5 yeast strains
VCF
Cactus graph
Cactus aligner
vg
vg
Pairwise alignment with reference
VCF
Compare mapping metrics
vg genotype
vg genotype
vg
VCF vg
map
map
VCF graph
Cactus graph
short reads
Evaluation
Mapping quality ∼ Sample graph quality ∼ SV callsquality.
From de novo assemblies in yeast 15
Better mapping for SVs called in the cactus graph
Analysis restricted to reads at variation sites.
●
●
UWOPS919171
UFRJ50816
YPS138
N44
CBS432
UWOPS034614
YPS128
Y12
SK1
DBVPG6765 DBVPG6044
0.4
0.5
0.6
0.7
0.8
0.4 0.5 0.6 0.7 0.8VCF: average mapping identity
Cac
tus:
ave
rage
map
ping
iden
tity
during graphconstruction
●
excluded
included
clade
● cerevisiae
paradoxus
Conclusions and future directions 16
Conclusions
The vg toolkit can integrate and genotype SVs.
Graphs from de novo assemblies alignment performs better.
Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/
Future directions
Experiment with high-quality human de novoassemblies (e.g. the Human PanGenome Project).
Combine public SV catalogs and genotype SVs in a largeand diverse cohort.
Conclusions and future directions 16
Conclusions
The vg toolkit can integrate and genotype SVs.
Graphs from de novo assemblies alignment performs better.
Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/
Future directions
Experiment with high-quality human de novoassemblies (e.g. the Human PanGenome Project).
Combine public SV catalogs and genotype SVs in a largeand diverse cohort.
Acknowledgment 17
Acknowledgment
Benedict PatenGlenn HickeyDavid HellerAdam NovakErik GarrisonJouni SirenJordan EizengaCharles MarkelloXian ChangRobin RounthwaiteJonas SibbesenEric T. Dawson
18
Universal genome graph
19
Some methods “over-genotype” similar variants
●●● ●
●●
●
●●
●●
●●
vg
Paragraph
BayesTyper
SVTyper
Delly Genotyper
SMRT−SV v2 Genotyper
0.9 1.0 1.1 1.2average number of genotyped calls per truth call
met
hod
experiment
●
●
●
●
HGSVC real reads
GIAB
CHM−PD
SVPOP
type
● INS
DEL
S1
S2
S3
Truth setS1
ParagraphS1vg
20
Deletion correctly genotyped by vg
deletion
3’ UTR of LONRF2 gene
reads
graph
GRCh38 chr2
51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.
21
Simulation experiment
True SVs in VCF Errors in VCF
INS
DE
LIN
V
1 3 7 13 20 1 3 7 13 20
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Depth
Bes
t F1
Method
vg
Paragraph
BayesTyper
SVTyper
Delly Genotyper
22
SV catalog summary results
HGSVC simulated reads HGSVC real reads GIAB CHM−PD SVPOP
INS
DE
L
all non−repeat all non−repeat all non−repeat all non−repeat all non−repeat
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Genomic regions
Bes
t F1
Methodvg
Paragraph
BayesTyper
SVTyper
Delly Genotyper
SMRT−SV v2 Genotyper
SV evaluation presence genotype
23
Precision-recall curve
●
●
●
●
●●
●●
●
INS DEL
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Recall
Pre
cisi
on
Genomic regions ● all non−repeatMethod
●
●
●
●
●vg
Paragraph
BayesTyper
SVTyper
Delly Genotyper
24
Evaluation per SV size
INS
all regions
INS
non−repeat regions
DEL
all regions
DEL
non−repeat regions
HGSVCsimulated
reads
HGSVCreal
reads
GIAB
[50,100]
(100,200]
(200,300]
(300,400]
(400,600]
(600,800]
(800,1K]
(1K,2.5K]
(2.5K,5K]>5K
[50,100]
(100,200]
(200,300]
(300,400]
(400,600]
(600,800]
(800,1K]
(1K,2.5K]
(2.5K,5K]>5K
[50,100]
(100,200]
(200,300]
(300,400]
(400,600]
(600,800]
(800,1K]
(1K,2.5K]
(2.5K,5K]>5K
[50,100]
(100,200]
(200,300]
(300,400]
(400,600]
(600,800]
(800,1K]
(1K,2.5K]
(2.5K,5K]>5K
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Size (bp)
F1
scor
e
25
Better mapping for SVs called in the cactus graph
Analysis restricted to reads at variation sites.
●
●
UFRJ50816
YPS138
N44
CBS432
UWOPS034614
YPS128
Y12
SK1
DBVPG6765
DBVPG6044
20
25
30
35
40
20 25 30 35 40VCF: average mapping quality
Cac
tus:
ave
rage
map
ping
qua
lity
during graphconstruction
●
excluded
included
clade
● cerevisiae
paradoxus