ashg sedlazeck grc_share

Post on 22-Jan-2018


Structural Variation Characterization Across the Human Genome and Populations

Fritz Sedlazeck

October 17, 2017

Scientific interests

Detection of Variants
• Sniffles (in bioRxiv)
• SURVIVOR: Jeffares et al. (2017)
• BOD-Score: Sedlazeck et al. (2013)

Mapping/Assembly reads
• NextGenMap-LR (in bioRxiv)
• Falcon Unzip: Chin et al. (2016)
• NextGenMap: Sedlazeck et al. (2013)

Benchmarking/Biases
• DangerTrack: Dolgalev et al. (2017)
• Teaser: Smolka et al. (2015)
• Sequencing: Jünemann et al. (2013)

Applications

Model organisms:
• Cancer (SKBR3) (in bioRxiv)
• miRNA editing (Vesely et al. 2012)

Non-model organisms:
• Cottus transposons (Dennenmoser et al. 2017)
• Clunio (Kaiser et al. 2016)
• Seabass (Vij et al. 2016)
• Pineapple (Ming et al. 2015)


Structural Variations

• Genomic Disorders
• Evolution
• Impact on regulation
• Impact on phenotypes

[Figure: heatmap of regulatory state (CTCF binding site, enhancer, open chromatin region, promoter, promoter flanking region, TF binding site; each ACTIVE, INACTIVE, POISED, REPRESSED, or NA) by cell line (A549 through Thymus); color scale: # affected]

Diploid genome

• Impact on Regulation

• Variability of genes

• Need to understand the full structure

Challenges: Pursuing the diploid genome

1. Accurate prediction of SVs

2. Comparison of SVs

3. Annotation and interpretation of SVs

4. Population analysis

5. Diploid Genome

Layer et al. (2014)

1.1 How to detect Structural Variations (SVs)

1.1 Long Read Technologies

• (+) SVs in repetitive regions
• (+) Span SVs
• (+) Uniform coverage
• (+) Can identify more complex SVs
• (-) Higher sequencing error rate
• (-) Hard to align

1.1 Accurate mapping and SV calling

NextGenMap-LR (NGMLR):
• Long read mapper
• Convex gap costs
• Faster than BWA-MEM

Sniffles:
• SV caller for long reads
• All types of SVs
• Phasing of SVs
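To illustrate what "convex gap costs" means for a long-read mapper, here is a minimal sketch comparing an affine gap penalty with a convex one (in the alignment-literature sense: each additional gap base costs no more than the previous one). The logarithmic formula and the parameter values are assumptions chosen for illustration; they are not NGMLR's actual scoring function.

```python
import math


def affine_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Affine model: every additional gap base costs the same amount."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def convex_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Illustrative convex model (hypothetical formula): the cost of
    extending a gap shrinks as the gap grows, so one long gap (a likely SV)
    is penalized far less than under the affine model."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * math.log2(1 + length)


for gap_length in (1, 10, 100, 1000):
    print(gap_length,
          affine_gap_cost(gap_length),
          round(convex_gap_cost(gap_length), 2))
```

Under such a model a single 1,000 bp gap stays cheap relative to many scattered short gaps, which is one plausible way to keep an alignment together across a large insertion or deletion.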

1.2 NA12878: SV calling

| Tech. | Coverage | Avg read len | Method | SVs | TRA |
|---|---|---|---|---|---|
| PacBio | 55x | 4,334 | Sniffles | 22,877 | 119 |
| Oxford Nanopore @Baylor | 34x | 4,982 | Sniffles | 12,596 | 46 |
| Illumina | 50x | 2 x 101 | Manta, Delly, Lumpy | 7,275 | 2,247 |

Sedlazeck et al. (2017)

1.1 NA12878: SV calling

| Tech. | Coverage | Avg read len | Method | SVs | TRA | DEL | INS |
|---|---|---|---|---|---|---|---|
| PacBio | 55x | 4,334 | Sniffles | 22,877 | 119 | 9,933 | 12,052 |
| Oxford Nanopore @Baylor | 34x | 4,982 | Sniffles | 12,596 | 46 | 7,102 | 5,166 |
| Illumina | 50x | 2 x 101 | Manta, Delly, Lumpy | 7,275 | 2,247 | 3,744 | 0 |

Sedlazeck et al. (2017)
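As a quick worked check of how differently the technologies report translocations, the sketch below computes the share of TRA calls among all SV calls from the table's numbers. It assumes the TRA counts are included in the "SVs" totals, which the slide does not state explicitly.

```python
# Counts copied from the NA12878 table above.
calls = {
    "PacBio": {"SVs": 22877, "TRA": 119},
    "Oxford Nanopore": {"SVs": 12596, "TRA": 46},
    "Illumina": {"SVs": 7275, "TRA": 2247},
}


def tra_share(counts):
    """Percentage of SV calls that are translocations (TRA).

    Assumption: the TRA count is part of the SVs total.
    """
    return 100.0 * counts["TRA"] / counts["SVs"]


for tech, counts in calls.items():
    print(f"{tech}: {tra_share(counts):.2f}% of SV calls are TRA")
```

Under that assumption, roughly 31% of the short-read calls are translocations, against about 0.5% or less for either long-read call set.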

1.1 NA12878: check 2,247 vs 119 TRA

[Figure: alignment views of Illumina, PacBio, and ONT data at one locus; labels: Translocation, Truncated reads, Insertion in repetitive region]

| Overlap | Illumina TRA (%) |
|---|---|
| Insertions | 53.05 |
| Deletions | 12.06 |
| Duplications | 0.57 |
| Nested | 0.31 |
| High coverage | 1.87 |
| Low complexity | 9.79 |
| Explained | 77.65 |

Sedlazeck et al. (2017)
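One way such an overlap table could be produced is to check, for each short-read translocation breakpoint, whether a long-read SV call lies at the same position. The sketch below is an assumption about the general approach, not the analysis used in the talk; the tuple layouts, the `slack` parameter, and the toy coordinates are hypothetical.

```python
from collections import Counter


def explain_breakpoint(chrom, pos, long_read_svs, slack=100):
    """Return the type of the first long-read SV within `slack` bp of the
    breakpoint, or None if no long-read SV is nearby."""
    for sv_chrom, start, end, sv_type in long_read_svs:
        if sv_chrom == chrom and start - slack <= pos <= end + slack:
            return sv_type
    return None


def summarize(breakpoints, long_read_svs, slack=100):
    """Percentage of breakpoints explained by each long-read SV type."""
    counts = Counter(
        explain_breakpoint(chrom, pos, long_read_svs, slack)
        for chrom, pos in breakpoints
    )
    total = len(breakpoints)
    return {sv_type: 100.0 * n / total
            for sv_type, n in counts.items() if sv_type is not None}


# Toy data (hypothetical coordinates).
long_read_svs = [("chr1", 1000, 1300, "INS"), ("chr2", 5000, 9000, "DEL")]
breakpoints = [("chr1", 1050), ("chr1", 1350), ("chr2", 7000), ("chr3", 42)]
summary = summarize(breakpoints, long_read_svs)
print(summary, "explained:", sum(summary.values()))

# The category percentages on the slide add up to its "Explained" row.
slide_rows = [53.05, 12.06, 0.57, 0.31, 1.87, 9.79]
print(round(sum(slide_rows), 2))
```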

1.1 NA12878: check 2,247 vs 119 TRA

[Figure: alignment views of ONT, PacBio, and Illumina data; labels: Inversion, Translocation, Truncated reads, Insertion in repetitive region]

Sedlazeck et al. (2017)

1.2 More complex SVs

Inverted tandem duplication:
• Pelizaeus-Merzbacher disease
• MECP2
• VIPR2

[Figure: PacBio data and Illumina data]

Sedlazeck et al. (2017)

1.2 More complex SVs

Inversion flanked by deletions:
• Haemophilia A
• Only found via long-range PCR! (2007)

[Figure: Illumina data and PacBio data]

Sedlazeck et al. (2017)

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs

3. Annotation and interpretation of SVs

4. Population analysis

5. Diploid Genome

Layer et al. (2014)

2. Comparison of SVs

SURVIVOR Framework:
• Compare SVs
• GiaB: 95 VCF files: 1 minute
• Simulate SVs
• Simulate long reads
• Summarize SV results

Jeffares et al. (2017)

[Figure labels: New SVs, Observed SVs]

2. Genome in a Bottle: merging 95 VCFs (1 min)

• 10x Genomics
• BioNano
• Complete Genomics
• Illumina
• PacBio

[Figure panels: Minimum 2 callers; SV Caller Comparison]

Using PCR + Sanger to validate SVs from multiple categories.

Join CSHL + Baylor to help with validations!
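Merging call sets and keeping only SVs supported by a minimum of two callers can be sketched as follows. This is a simplified illustration, not SURVIVOR's actual algorithm: calls are reduced to a chromosome, a start position, and a type, and the `max_dist` and `min_callers` defaults are hypothetical.

```python
def merge_calls(calls_by_caller, max_dist=1000, min_callers=2):
    """Cluster calls of the same type on the same chromosome whose start
    positions lie within `max_dist` bp of the previous call in the cluster,
    then keep clusters reported by at least `min_callers` callers."""
    flat = sorted(
        (chrom, sv_type, pos, caller)
        for caller, calls in calls_by_caller.items()
        for chrom, pos, sv_type in calls
    )
    clusters = []
    for chrom, sv_type, pos, caller in flat:
        last = clusters[-1] if clusters else None
        if (last is not None
                and last["chrom"] == chrom
                and last["type"] == sv_type
                and pos - last["positions"][-1] <= max_dist):
            last["positions"].append(pos)
            last["callers"].add(caller)
        else:
            clusters.append({"chrom": chrom, "type": sv_type,
                             "positions": [pos], "callers": {caller}})
    return [c for c in clusters if len(c["callers"]) >= min_callers]


# Toy call sets (hypothetical coordinates).
toy_calls = {
    "manta": [("chr1", 1000, "DEL"), ("chr1", 50000, "INS")],
    "delly": [("chr1", 1200, "DEL")],
    "lumpy": [("chr1", 1100, "DEL"), ("chr2", 1000, "DEL")],
}
merged = merge_calls(toy_calls)
for cluster in merged:
    print(cluster["chrom"], cluster["type"],
          cluster["positions"], sorted(cluster["callers"]))
```

In the toy data only the chr1 deletion survives, because it is the only event seen by more than one caller.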

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs

4. Population analysis

5. Diploid Genome

[Figure: histogram over genes impacted; x-axis: # genes hit by SVs; y-axis: Frequency]

3. Annotation: SURVIVOR_ant

Annotating SVs with:
• Multiple GTF, BED, VCF

Genome in a Bottle:
• 63,677 genes (GTF)
• 1,733,686 regions (3 BED files)
• 22 seconds
• 8,314 genes impacted

Sedlazeck et al. (2017)
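The core of such an annotation step is an interval overlap between SVs and gene coordinates. The sketch below counts how many SVs hit each gene; it is a naive quadratic illustration with hypothetical tuple layouts and toy coordinates, not the SURVIVOR_ant implementation, which would need an indexed search to reach the runtime quoted above.

```python
def genes_hit(svs, genes):
    """Count SVs overlapping each gene.

    svs:   list of (chrom, start, end)
    genes: list of (chrom, start, end, name)
    Returns {gene_name: number_of_overlapping_svs} for genes with >= 1 hit.
    """
    hits = {}
    for g_chrom, g_start, g_end, name in genes:
        n = sum(
            1
            for s_chrom, s_start, s_end in svs
            # Two closed intervals overlap if each starts before the other ends.
            if s_chrom == g_chrom and s_start <= g_end and g_start <= s_end
        )
        if n:
            hits[name] = n
    return hits


# Toy annotation and SV calls (hypothetical coordinates).
toy_genes = [("chr1", 100, 500, "GENE_A"),
             ("chr1", 1000, 2000, "GENE_B"),
             ("chr2", 100, 500, "GENE_C")]
toy_svs = [("chr1", 450, 1100), ("chr1", 1500, 1600), ("chr2", 600, 700)]
hits = genes_hit(toy_svs, toy_genes)
print(hits, "genes impacted:", len(hits))
```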

[Figure: Genes impacted by SVs; x-axis: # SVs hitting a gene; y-axis: # genes]

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs: SURVIVOR_ANT

4. Population analysis

5. Diploid Genome

4. SVs in Population: SURVIVOR

• Birth defect study (Karyn Meltz Steinberg, WashU: Wed. 9am, Room 310A)
  • 4 callers, 114 samples
• CCDG (William Salerno, HGSC: poster on Friday, #1281)
  • 5 callers, 22,600 samples
• Non-human:
  • S. pombe: 3 callers, 161 samples
  • Tomato: 3 callers, 846 samples

4. SVs in 22,600 Individuals

We need large SV studies:
• Common vs. rare SVs
• Inform GWAS studies
• Ethnicity-specific SVs
• Catalog variability of regions
  • MHC, LPA, etc.

[Figure: CHR6: Average SV Allele Frequency per 100kb; x-axis: Position; y-axis: Allele frequency; MHC and LPA marked]

[Figure: # SVs shared across individuals, by position]
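A per-window average like the one plotted for chromosome 6 can be computed by binning SVs into fixed 100 kb windows. This is a minimal sketch that assumes each SV already has a population allele frequency; the input layout and the toy values are hypothetical.

```python
from collections import defaultdict


def windowed_mean_af(svs, window=100_000):
    """Average SV allele frequency per fixed-size window.

    svs: list of (position, allele_frequency) on a single chromosome.
    Returns {window_start: mean allele frequency}, ordered by position.
    """
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [AF sum, SV count]
    for pos, af in svs:
        start = (pos // window) * window
        sums[start][0] += af
        sums[start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}


# Toy input (hypothetical positions and frequencies).
toy_svs_af = [(10_000, 0.25), (90_000, 0.75), (150_000, 0.125)]
print(windowed_mean_af(toy_svs_af))
```

Windows with an unusually high average, such as those around MHC or LPA in the plot, would stand out directly in the returned mapping.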

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs: SURVIVOR_ANT

4. Population analysis: SURVIVOR

5. Diploid Genome

5.1 Diploid Genome

Challenges:
• Sequencing technology
• Computational methods
• Money

HGSC Approach: GADGET
1. Sequence 100 individuals: PacBio + 10x Genomics
2. SV detection/genotyping
3. Phasing of SVs + SNPs
4. Population-based genotyping of SVs with short reads

5.2 Diploid Genome

Selecting 100 samples

• We want to maximize the outcome per $ spent

• Selection of samples (red)

• Select top 100 (red)

• Random selection of samples (boxplot)
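The slides do not say how the informed selection is made. One plausible reading is a greedy choice that repeatedly picks the sample contributing the most SVs not yet covered, compared here with simply taking the samples with the most SVs. The sketch below is an assumption for illustration only, and the sample names and SV identifiers are hypothetical.

```python
def greedy_select(sample_svs, n):
    """Pick up to n samples, each time taking the one that adds the most
    SVs not yet covered (ties broken by sample name)."""
    remaining = dict(sample_svs)
    chosen, covered = [], set()
    for _ in range(min(n, len(remaining))):
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


def top_n_select(sample_svs, n):
    """Pick the n samples with the most SVs, ignoring redundancy."""
    ranked = sorted(sample_svs, key=lambda s: (-len(sample_svs[s]), s))
    chosen = ranked[:n]
    covered = set()
    for sample in chosen:
        covered |= sample_svs[sample]
    return chosen, covered


# Toy population: sample -> set of SV identifiers (hypothetical).
toy_samples = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {5, 6}, "D": {6}}
all_svs = set().union(*toy_samples.values())

for label, select in (("greedy", greedy_select), ("top-n", top_n_select)):
    chosen, covered = select(toy_samples, 2)
    print(label, chosen, f"{100.0 * len(covered) / len(all_svs):.1f}% of SVs")
```

In the toy data the greedy choice covers every SV with two samples, while the two largest samples overlap heavily and cover only two thirds.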

[Figure: histogram of # SVs per patient; x-axis: # SVs; y-axis: # patients]

[Figure: Random vs. informed choice of samples (CCDG); x-axis: Number of chosen samples; y-axis: SV in population (%); series: Informed, Top100, Random]

5.3 Diploid Genome (Prototype)

Challenges/ Summary

1. Accurate prediction of SVs: Sniffles (Talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs: SURVIVOR_ANT

4. Population analysis: SURVIVOR

5. Diploid Genome: GADGET

All methods are available:

https://github.com/fritzsedlazeck

https://fritzsedlazeck.github.io/


Acknowledgments

William Salerno
Stephen Richards
Richard Gibbs
Michael Schatz
Schatz lab
Daniel Jeffares
Jürg Bähler
Christophe Dessimoz
Justin Zook
GiaB consortium
