genotyping structural variants in pangenome graphs using ... · introduction 3 mapping reads across...

27
Genotyping structural variants in pangenome graphs using the vg toolkit Jean Monlong November 7, 2019 Genome Informatics

Upload: others

Post on 25-May-2020

18 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Genotyping structural variants inpangenome graphs using the vg toolkit

Jean Monlong

November 7, 2019

Genome Informatics

Page 2: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Introduction 2

Pangenome graphs and variant-aware read mapping

A C A TT

T

G

G

G

G

G

C

C

C

T

T

G

G

G

G

G

C

C

Linear reference genome

Variation graph

1KGP 1KGP 1KGP

G

C

G

C

G

Seq

. rea

dsS

eq. r

eads

Page 3: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Introduction 3

Mapping reads across structural variants

Structural variants are genomic variants larger than 50 bp,e.g. insertions, deletions, inversions translocations.

C

Linear reference genome

Variation graph

G

C

G

C

G

DELETION

INSERTION

DELETION

Page 4: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Introduction 4

SV catalogs from long-read sequencing studies

Ref. Project Samples

Chaisson et al. 2019Human Genome Structural

3Variation Consortium (HGSVC)

Audano et al. 2019 SVPOP 15

Zook et al. 2019 Genome in a Bottle (GIAB) 1

HG

SV

Cgnom

ad−S

V

10 100 1,000 10,000 100,000

0

2000

4000

6000

0

20000

40000

60000

size (bp)

varia

nt SV typeDELINS

Page 5: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Goal 5

The vg toolkit is a complete, open sourcesolution for graph construction, readmapping, and variant calling.

https://github.com/vgteam/vg

Garrison et al. Nature Biotech 2018

Can we genotype SVs from short-read sequencing datasetswith the vg toolkit?

Starting from public SV catalogs or de novo assemblies.

Hickey et al. bioRxiv 2019

Page 6: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From SV catalogs in human 6

Genotyping public SV catalogs in human

short reads

vg

vg

HG00514 VCF

VCF

HGSVCSV catalog

genotyped SVsEvaluation

HG

00514

HG

0051

4GRCh38

Evaluate genotype predictions for a sample from the truth set(e.g. HG00514).

Page 7: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From SV catalogs in human 7

Genotyping variants in vg

reads not mapped on linear reference

Line

ar re

fere

nce

Gra

ph re

fere

nce

insertion

readsinsertion

deletion

Snarl 1Path coverage ratio 1:1.6 → het

Snarl 2Path coverage ratio 0:2 → hom

Read mapping to reference pathVariant pathReference path Read mapping to variant path

Genotyping is based on the path coverage.

A snarl is a variant site in the graph, a “bubble”.

Page 8: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From SV catalogs in human 8

Evaluating SV genotypes with a truth set

truthcalls

Deletions/Inversions

<10% rec. overlap

At least 50% coverage and 10% reciprocal overlap<50% coverage

truth

calls

20bp 20bp

Insertions At least 50% of inserted sequence matching nearby insertions

R package: https://github.com/jmonlong/sveval

Page 9: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From SV catalogs in human 9

Results on HGSVC - Simulated reads

whole−genome non−repeat regions

INS

DE

L

vg

Paragraph

BayesT

yper

Delly

SVTyper vg

Paragraph

BayesT

yper

Delly

SVTyper

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00F1

graph-basedSV genotypers

traditionalSV genotypers

Non-repeat regions: regions not overlapping segmental duplications or simple repeats

Page 10: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From SV catalogs in human 10

Results on HGSVC - Real reads

whole−genome non−repeat regions

INS

DE

L

vg

Paragraph

BayesT

yper

Delly

SVTyper vg

Paragraph

BayesT

yper

Delly

SVTyper

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

F1

graph-based SV genotypers

traditionalSV genotypers

Non-repeat regions: regions not overlapping segmental duplications or simple repeats

Page 11: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From SV catalogs in human 11

Simple repeat/low complexity regions are challenging

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00precision

reca

ll

repeat class/family

SINE/Alu

LTR/ERV1

LINE/L1

Retroposon/SVA

Low_complexity

Satellite

Satellite/centr

Simple_repeat

SV type

INS

DEL

SV sequence annotated with RepeatMasker. Class assigned if covered ≥80% by a repeat element.

Page 12: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From de novo assemblies in yeast 12

Challenges with the VCF format

Multiple equivalent representations, over-simplification,impractical.

VCF v4.2 specs

Why not start directly from de novo assemblies?

Page 13: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From de novo assemblies in yeast 13

Analysis of 12 yeast strains from 2 clades

Selected 5 strains to build graph: one reference + 2 per clade.

Assemblies for 5 yeast strains

VCF

Cactus graph

Cactus aligner

vg

vg

Pairwise alignment with reference

Illumina reads

VCF

Compare mappingmetrics

vg genotype

vg genotype

vg

VCF vg

map

map

VCF graph

Cactus graph

Page 14: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From de novo assemblies in yeast 14

Evaluating SV genotyping using mapping statistics

No gold-standard to compare with.

Map reads to a sample graph built from the SV calls:

Assemblies for 5 yeast strains

VCF

Cactus graph

Cactus aligner

vg

vg

Pairwise alignment with reference

VCF

Compare mapping metrics

vg genotype

vg genotype

vg

VCF vg

map

map

VCF graph

Cactus graph

short reads

Evaluation

Mapping quality ∼ Sample graph quality ∼ SV callsquality.

Page 15: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From de novo assemblies in yeast 14

Evaluating SV genotyping using mapping statistics

No gold-standard to compare with.

Map reads to a sample graph built from the SV calls:

Assemblies for 5 yeast strains

VCF

Cactus graph

Cactus aligner

vg

vg

Pairwise alignment with reference

VCF

Compare mapping metrics

vg genotype

vg genotype

vg

VCF vg

map

map

VCF graph

Cactus graph

short reads

Evaluation

Mapping quality ∼ Sample graph quality ∼ SV callsquality.

Page 16: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

From de novo assemblies in yeast 15

Better mapping for SVs called in the cactus graph

Analysis restricted to reads at variation sites.

UWOPS919171

UFRJ50816

YPS138

N44

CBS432

UWOPS034614

YPS128

Y12

SK1

DBVPG6765 DBVPG6044

0.4

0.5

0.6

0.7

0.8

0.4 0.5 0.6 0.7 0.8VCF: average mapping identity

Cac

tus:

ave

rage

map

ping

iden

tity

during graphconstruction

excluded

included

clade

● cerevisiae

paradoxus

Page 17: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Conclusions and future directions 16

Conclusions

The vg toolkit can integrate and genotype SVs.

Graphs from de novo assemblies alignment performs better.

Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/

Future directions

Experiment with high-quality human de novoassemblies (e.g. the Human PanGenome Project).

Combine public SV catalogs and genotype SVs in a largeand diverse cohort.

Page 18: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Conclusions and future directions 16

Conclusions

The vg toolkit can integrate and genotype SVs.

Graphs from de novo assemblies alignment performs better.

Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/

Future directions

Experiment with high-quality human de novoassemblies (e.g. the Human PanGenome Project).

Combine public SV catalogs and genotype SVs in a largeand diverse cohort.

Page 19: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

Acknowledgment 17

Acknowledgment

Benedict PatenGlenn HickeyDavid HellerAdam NovakErik GarrisonJouni SirenJordan EizengaCharles MarkelloXian ChangRobin RounthwaiteJonas SibbesenEric T. Dawson

Page 20: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

18

Universal genome graph

Page 21: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

19

Some methods “over-genotype” similar variants

●●● ●

●●

●●

●●

●●

vg

Paragraph

BayesTyper

SVTyper

Delly Genotyper

SMRT−SV v2 Genotyper

0.9 1.0 1.1 1.2average number of genotyped calls per truth call

met

hod

experiment

HGSVC real reads

GIAB

CHM−PD

SVPOP

type

● INS

DEL

S1

S2

S3

Truth setS1

ParagraphS1vg

Page 22: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

20

Deletion correctly genotyped by vg

deletion

3’ UTR of LONRF2 gene

reads

graph

GRCh38 chr2

51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.

Page 23: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

21

Simulation experiment

True SVs in VCF Errors in VCF

INS

DE

LIN

V

1 3 7 13 20 1 3 7 13 20

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

Depth

Bes

t F1

Method

vg

Paragraph

BayesTyper

SVTyper

Delly Genotyper

Page 24: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

22

SV catalog summary results

HGSVC simulated reads HGSVC real reads GIAB CHM−PD SVPOP

INS

DE

L

all non−repeat all non−repeat all non−repeat all non−repeat all non−repeat

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

Genomic regions

Bes

t F1

Methodvg

Paragraph

BayesTyper

SVTyper

Delly Genotyper

SMRT−SV v2 Genotyper

SV evaluation presence genotype

Page 25: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

23

Precision-recall curve

●●

●●

INS DEL

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Recall

Pre

cisi

on

Genomic regions ● all non−repeatMethod

●vg

Paragraph

BayesTyper

SVTyper

Delly Genotyper

Page 26: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

24

Evaluation per SV size

INS

all regions

INS

non−repeat regions

DEL

all regions

DEL

non−repeat regions

HGSVCsimulated

reads

HGSVCreal

reads

GIAB

[50,100]

(100,200]

(200,300]

(300,400]

(400,600]

(600,800]

(800,1K]

(1K,2.5K]

(2.5K,5K]>5K

[50,100]

(100,200]

(200,300]

(300,400]

(400,600]

(600,800]

(800,1K]

(1K,2.5K]

(2.5K,5K]>5K

[50,100]

(100,200]

(200,300]

(300,400]

(400,600]

(600,800]

(800,1K]

(1K,2.5K]

(2.5K,5K]>5K

[50,100]

(100,200]

(200,300]

(300,400]

(400,600]

(600,800]

(800,1K]

(1K,2.5K]

(2.5K,5K]>5K

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

Size (bp)

F1

scor

e

Page 27: Genotyping structural variants in pangenome graphs using ... · Introduction 3 Mapping reads across structural variants Structural variants are genomic variants larger than 50 bp,

25

Better mapping for SVs called in the cactus graph

Analysis restricted to reads at variation sites.

UFRJ50816

YPS138

N44

CBS432

UWOPS034614

YPS128

Y12

SK1

DBVPG6765

DBVPG6044

20

25

30

35

40

20 25 30 35 40VCF: average mapping quality

Cac

tus:

ave

rage

map

ping

qua

lity

during graphconstruction

excluded

included

clade

● cerevisiae

paradoxus