1000G Pilot 3 Progress: in silico analysis and comparison to experimental validation


Page 1: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

1000G Pilot 3 Progress

in silico analysis and comparison to experimental validation

Gabor Marth (Boston College) + A + L
Kiran Garimella (Broad Institute) + C
February 2, 2010

Page 2: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Acknowledgements

Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
WUGSC: Dan Koboldt, Bob Fulton
WTSI: Aarno Palotie
Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
Cornell: Andy Clark
Stanford: Simon Gravel, Carlos Bustamante
Michigan: Tom Blackwell

Page 3: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Data

                       CEU      TSI      CHB      CHD      JPT      LWK      YRI
Number of samples      90       66       109      107      105      108      112
Sequencing technology  SLX+454  SLX      SLX+454  SLX+454  SLX+454  454      SLX+454
Per-sample coverage    78.20X   65.20X   45.40X   60.25X   52.79X   31.29X   58.12X

• Capture technologies:
  – Nimblegen solid phase
  – Agilent liquid phase
• Sequencing technologies:
  – SLX
  – 454
• Data producers:
  – BCM
  – BI
  – WTSI
  – WUGSC
• Capture targets:
  – Started with ~1,000 genes / ~10,000 exons / 2.3Mb
  – 1.43Mb of total target length shared between 4 data centers used for this analysis
• Samples:
  – 697 total samples
  – 7 populations
• Sequence coverage:
  – Goal was deep per-sample coverage
  – Effective coverage somewhat reduced by fragment duplications

Page 4: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Pipelines

Processing step                BC                                                    BI
Read mapping SW                MOSAIK                                                MAQ (SLX); SSAHA2 (454)
Duplicate filtering SW         Picard MarkDuplicates (SLX); BCMMarkduplicates (454)  Picard MarkDuplicates (SLX); Picard MarkDuplicates (454)
Base quality recalibration SW  GATK (SLX); None (454)                                GATK (SLX); GATK (454)
SNP calling SW                 GigaBayes (BamBayes)                                  UnifiedGenotyper

[Figure: SNP calling and SNP statistics. Labels: "All 697 samples"; "Union of all called sites in all 697 samples"; "Segregating sites in each population sample" (CEU, TSI, CHB, CHD, JPT, LWK, YRI).]

Page 5: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

BC and BI call sets are converging

All called sites: number of sites (% of union)

Comparison #  BC call version  BC total calls   BC unique calls  BC & BI (intersection)  BC || BI (union)  BI unique calls  BI total calls   BI call version
1             2009/11/20       11,580 (55.96%)  733 (3.54%)      10,847 (54.34%)         20,695 (100%)     9,115 (44.04%)   19,962 (96.46%)  v2
2             2009/11/20       11,580 (65.75%)  1,480 (8.40%)    10,100 (62.60%)         17,613 (100%)     6,033 (34.25%)   16,133 (91.60%)  v3
3             2010/01/20       14,502 (79.35%)  2,144 (11.73%)   12,358 (76.60%)         18,277 (100%)     3,775 (20.65%)   16,133 (88.27%)  v3
4             2010/01/20       14,502 (72.91%)  1,741 (8.75%)    12,761 (64.16%)         19,890 (100%)     5,388 (27.01%)   18,149 (91.25%)  v4

Called sites per population (BC/BI intersection): intersection (% of union)

Comparison #  CEU             TSI             CHB             CHD             JPT             LWK             YRI
1             3,354 (73.87%)  3,168 (65.88%)  3,279 (66.23%)  3,226 (68.42%)  2,942 (47.79%)  4,922 (70.56%)  4,917 (72.08%)
2             3,036 (70.62%)  2,893 (69.34%)  2,938 (62.23%)  2,783 (60.58%)  2,545 (55.64%)  4,486 (65.33%)  4,253 (66.30%)
3             3,333 (74.63%)  3,155 (73.15%)  3,294 (66.80%)  3,201 (66.69%)  2,795 (58.40%)  5,165 (73.18%)  4,728 (71.29%)
4             3,489 (78.78%)  3,281 (69.32%)  3,415 (69.74%)  3,431 (72.81%)  2,900 (50.86%)  5,459 (78.55%)  5,175 (78.59%)
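The percentages in the "All called sites" table can be re-derived from the three disjoint counts in each row (BC unique, intersection, BI unique). The following is a minimal sketch of that arithmetic for comparison 4; `overlap_summary` is a hypothetical helper name, not part of either pipeline.

```python
def overlap_summary(bc_unique, shared, bi_unique):
    """Summarize two call sets from their three disjoint Venn counts."""
    union = bc_unique + shared + bi_unique

    def pct(n):
        # Percentage of the union, rounded to two decimals as on the slide.
        return round(100.0 * n / union, 2)

    return {
        "union": union,
        "bc_total": bc_unique + shared,
        "bi_total": bi_unique + shared,
        "bc_total_pct": pct(bc_unique + shared),
        "bc_unique_pct": pct(bc_unique),
        "shared_pct": pct(shared),
        "bi_unique_pct": pct(bi_unique),
        "bi_total_pct": pct(bi_unique + shared),
    }


# Counts taken from comparison 4 (BC 2010/01/20 vs. BI v4).
summary = overlap_summary(bc_unique=1741, shared=12761, bi_unique=5388)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Run on comparison 4, this reproduces the slide's union (19,890) and the 72.91%, 8.75%, 64.16% and 91.25% figures; the BI-unique share comes out as 27.09% rather than the 27.01% shown. For comparisons 1 to 3, the intersection percentage on the slide equals intersection / BI total (for example 10,847 / 19,962 = 54.34%), not intersection / union.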

Page 6: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

SNP calls (per population)

                     Caller  CEU    TSI    CHB    CHD    JPT    LWK    YRI
samples              BC      90     66     109    107    105    108    112
                     BI      90     66     109    107    105    108    112
called SNPs          BC      4,102  3,729  4,340  4,262  3,883  6,039  5,891
                     BI      3,816  4,285  3,972  3,881  4,719  6,370  5,869
dbSNPs               BC      2,422  2,257  2,042  1,924  1,950  2,872  2,897
                     BI      2,352  2,200  1,827  1,753  1,710  2,825  2,856
% dbSNP              BC      59.04  60.53  47.05  45.14  50.22  47.56  49.18
                     BI      61.64  51.34  46.00  45.17  36.24  44.35  48.66
Ts/Tv (called SNPs)  BC      2.73   2.78   2.82   3.06   2.85   3.45   2.92
                     BI      3.14   2.38   3.15   3.16   1.83   3.17   3.15
novel SNPs           BC      1,680  1,472  2,298  2,338  1,933  3,167  2,994
                     BI      1,464  2,085  2,145  2,128  3,009  3,545  3,013
Ts/Tv (novel SNPs)   BC      2.05   2.10   2.44   2.81   2.43   3.44   2.56
                     BI      2.92   1.72   3.03   3.05   1.36   3.07   2.99
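Ts/Tv in the table is the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes). A minimal sketch of that count, assuming SNPs are given as (ref, alt) base pairs; the function name and the toy input are illustrative only and do not come from the BC or BI pipelines.

```python
BASES = {"A", "C", "G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Pairs that are not single-base substitutions are skipped.
    Returns None when there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None


# Toy input: four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
ratio = ts_tv_ratio(toy_snps)
print(ratio)  # 2.0
```

Because there are twice as many possible transversions as transitions, substitutions drawn uniformly at random would give a ratio near 0.5, which is why a lower Ts/Tv among novel calls is commonly read as a sign of more false positives.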

Page 7: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

SNP calls (all samples)

                BC      BI
Samples         697     697
Called SNPs     14,502  18,149
dbSNPs          3,948   4,041
dbSNP fraction  27.22%  22.27%

[Venn diagram: BC: 14,502 SNPs; BI: 18,149 SNPs; BC ∪ BI = 19,890]
BC only:  1,741 SNPs, 79 dbSNPs, dbSNP = 4.54%
BC ∩ BI:  12,761 SNPs, 3,869 dbSNPs, dbSNP = 30.32%
BI only:  5,388 SNPs, 172 dbSNPs, dbSNP = 3.19%

Page 8: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Genotype call accuracy relative to HapMap3

                                                Caller  CEU    TSI    CHB    CHD    JPT    LWK    YRI
FDR of variant genotypes in HapMap3 (%)         BC      0.96   0.23   2.61   1.42   3.60   0.47   0.57
                                                BI      1.41   0.45   2.99   1.82   3.56   0.66   1.25
Correct calls (%)                               BC      98.39  98.98  96.76  98.20  95.72  99.06  98.63
                                                BI      97.22  98.26  95.45  97.35  94.55  98.74  96.68
Accuracy of homozygote reference calls (%)      BC      99.20  99.81  97.52  98.62  96.42  99.64  99.59
                                                BI      98.79  99.62  97.07  98.21  96.33  99.48  99.07
Accuracy of heterozygote calls (%)              BC      97.50  97.72  97.98  99.12  96.81  98.37  96.53
                                                BI      94.49  95.43  94.19  97.37  92.30  97.89  90.81
Accuracy of homozygote non-reference calls (%)  BC      97.31  98.44  93.27  95.78  92.69  98.21  98.45
                                                BI      96.77  98.76  93.26  95.16  93.43  97.67  97.46

Data quality in CHB and JPT samples seems consistently lower

Statistics only include genotype calls at SNP sites in BC∩BI
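The slide does not spell out how each accuracy figure is defined, so the following is only a sketch under stated assumptions: genotypes are coded as non-reference allele counts (0, 1, 2), per-class accuracy is conditioned on the called genotype, and the variant FDR is the share of called non-reference genotypes that the reference panel reports as homozygous reference. The function name and toy data are hypothetical.

```python
from collections import Counter


def genotype_concordance(pairs):
    """pairs: iterable of (called, truth) genotypes coded 0, 1 or 2.

    Assumed definitions (not confirmed by the slides):
      - class accuracy = matching calls / all calls of that class
      - variant FDR    = called variant with truth hom-ref / all called variant
    """
    table = Counter(pairs)
    total = sum(table.values())
    correct = sum(n for (called, truth), n in table.items() if called == truth)

    def class_accuracy(genotype):
        n_called = sum(n for (called, _), n in table.items() if called == genotype)
        return table[(genotype, genotype)] / n_called if n_called else None

    variant_calls = sum(n for (called, _), n in table.items() if called > 0)
    false_variants = sum(
        n for (called, truth), n in table.items() if called > 0 and truth == 0
    )
    return {
        "correct": correct / total,
        "hom_ref": class_accuracy(0),
        "het": class_accuracy(1),
        "hom_nonref": class_accuracy(2),
        "variant_fdr": false_variants / variant_calls if variant_calls else None,
    }


# Toy data: 100 (called, truth) genotype pairs.
toy_pairs = (
    [(0, 0)] * 50 + [(1, 1)] * 30 + [(2, 2)] * 15
    + [(1, 0)] * 2 + [(0, 1)] * 1 + [(2, 1)] * 2
)
metrics = genotype_concordance(toy_pairs)
print(metrics)
```

Conditioning on the truth genotype instead of the called genotype would give a per-class sensitivity rather than a per-class accuracy; which of the two the slide reports is not stated.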

Page 9: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Genotype calls

[Figure: BC vs. BI comparison, two panels]
All SNP sites considered: # SNP sites = 3,489, r = 0.9921
Only SNP sites with >= 80% called genotypes: # SNP sites = 3,075, r = 0.9979

• Filtering:
  – BC filters on genotype call quality
  – BI reports a genotype for any site covered by at least one read
• Nominally, BI makes more calls than BC and has, on average, higher AF
  – The Broad caller does not filter on genotype quality
• Good allele frequency concordance between BC and BI
• At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
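The r values on this slide compare per-site allele frequencies from the two call sets, with and without a call-rate filter. A minimal sketch of such a comparison follows, assuming genotypes are coded 0/1/2 with `None` for no call and that the 80% threshold is applied to both callers; the slides do not say which caller's call rate was used, and all names and toy data here are illustrative.

```python
from math import sqrt


def allele_frequency(genotypes):
    """Non-reference allele frequency from genotypes coded 0/1/2 (None = no call)."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called)) if called else None


def call_rate(genotypes):
    """Fraction of samples with a called genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def af_concordance(bc, bi, min_call_rate=0.0):
    """Return (number of shared sites kept, r) for two {site: genotypes} dicts."""
    xs, ys = [], []
    for site in sorted(bc.keys() & bi.keys()):
        if min(call_rate(bc[site]), call_rate(bi[site])) < min_call_rate:
            continue
        fx, fy = allele_frequency(bc[site]), allele_frequency(bi[site])
        if fx is None or fy is None:
            continue
        xs.append(fx)
        ys.append(fy)
    return len(xs), pearson_r(xs, ys)


# Toy call sets: four shared sites, five samples each.
bc_calls = {
    ("1", 100): [0, 1, 1, 2, 0],
    ("1", 200): [0, 0, 1, 0, 0],
    ("1", 300): [2, 2, 1, 2, None],
    ("1", 400): [0, None, None, 1, None],
}
bi_calls = {
    ("1", 100): [0, 1, 1, 2, 0],
    ("1", 200): [0, 0, 1, 0, 1],
    ("1", 300): [2, 2, 1, 2, 2],
    ("1", 400): [0, 1, 1, 1, 0],
}
n_all, r_all = af_concordance(bc_calls, bi_calls)
n_filtered, r_filtered = af_concordance(bc_calls, bi_calls, min_call_rate=0.8)
print(n_all, round(r_all, 4), n_filtered, round(r_filtered, 4))
```

In the toy data the sparsely called site is dropped by the filter, mirroring how the slide's site count falls from 3,489 to 3,075 when the 80% threshold is applied.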

Page 10: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

1KG validation executive summary

• Evaluated BI and BC calls against validation
  – 1KG chip (1)
    • 312/697 samples across 7 populations represented
    • ~300 sites (150 novel) overlap with Pilot 3 target region
• Concordance with 1KG chip is very high
  – Where covered (> 5 reads):
    • 302/312 (97%) of samples have >90% variant sensitivity
    • 269/312 (86%) of samples have >90% genotype sensitivity
  – Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
    • Later sequencing has far greater concordance with chip than earlier sequencing

1. Details in Appendix

Page 11: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Nearly all samples in call-set overlap have high sensitivity and specificity

[Figure: % Variant PPV, % Variant Sensitivity, and % Ignored Low-Coverage Bases per individual. X-axis: Pilot 3 individual (312 individuals total after eliminating low-coverage samples). Y-axis: percent.]

• These 10 low-sensitivity samples have strange allele balances and are likely contaminated
• All but one sample with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)

Page 12: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations

[Figure: Mean % PPV and Mean % Sensitivity per population (CEU, CHB, CHD, JPT, LWK, TSI, YRI). Y-axis: percent.]

N Samples: 69 13 27 102 69 3 24

Per-population annotations (sequencing date, technology, centers), in extracted order:
8/2008, ILMN/454, All Ctrs
8/2008, ILMN/454, All Ctrs
8/2008, ILMN/454, BI/BCM
1/2009, 454, BCM
8/2008, ILMN/454, BI/BCM
10/2008, ILMN, BI/SC
2008/2009, ILMN/454, All Ctrs

Page 13: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Low-frequency / singleton validation: executive summary

• Low-frequency Sequenom assay (1)
  – Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
  – Validated sites in those 46 individuals
    • 89/105 are true singletons
    • 16/105 are false-positive singletons (hom-refs and two non-singletons)
• Concordance with low-frequency assay is very high
  – Callsets today (January 2010)
    • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
    • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons

1. Details in Appendix

Page 14: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Call Set     Loci Tested (after Sequenom filtering)  Overlap with Test Set  TP (PPV)   FP  True, but not Singleton
BC ∩ BI      105                                     71                     71 (100%)  0   0
BC ∪ BI      105                                     92                     89 (97%)   3   0
Whole Assay  105                                     105                    89 (85%)   16  2

Callers are able to detect most singletons with very low false-positive rate

Joint calls find every singleton in the assay, with exceedingly few false positives.
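The PPV column, and the recovery rates quoted in the executive summary, follow directly from the counts in the table. A short worked sketch, using the 89 validated true singletons as the denominator for sensitivity; the dictionary keys and function name are illustrative.

```python
TRUE_SINGLETONS = 89  # validated true singletons among the 105 assayed sites

# Overlap with the test set and true positives, as listed in the table.
call_sets = {
    "intersection": {"overlap": 71, "tp": 71},  # BC and BI both call the site
    "union": {"overlap": 92, "tp": 89},         # BC or BI calls the site
}


def ppv_and_sensitivity(overlap, tp, n_true=TRUE_SINGLETONS):
    """PPV = TP / sites called; sensitivity = TP / validated true singletons."""
    return tp / overlap, tp / n_true


results = {name: ppv_and_sensitivity(**counts) for name, counts in call_sets.items()}
for name, (ppv, sensitivity) in results.items():
    print(f"{name}: PPV={ppv:.1%}, sensitivity={sensitivity:.1%}")
# intersection: PPV=100.0%, sensitivity=79.8%
# union: PPV=96.7%, sensitivity=100.0%
```

The trade-off is the usual one: the intersection gives up about a fifth of the true singletons in exchange for no false positives in this assay, while the union recovers all of them at the cost of three false positives.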

Page 15: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Conclusions / future directions

• Data quality has improved significantly over the life of the project
• Both BC and BI pipelines produce high-quality call sets
  – Good agreement between call sets
  – Intersection highly concordant with experimental validation data
  – Estimated FP rate below 5%
• The current Pilot 3 release is the BC∩BI (intersection) call set
• We are proceeding with validations
  – Dual focus: accuracy and functional classes
  – Results will inform future releases

Page 16: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

APPENDIX

Page 17: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Population spectrum of called SNPs

Page 18: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Population-spectrum of called SNPs

             Caller  CEU    TSI    CHB    CHD    JPT    LWK    YRI    ALL
called SNPs  BC      4,102  3,729  4,340  4,262  3,883  6,039  5,891  14,502
             BI      3,816  4,285  3,972  3,881  4,719  6,370  5,869  18,149

• Observation: BC calls more SNPs at the population level, but fewer SNP sites overall
• Reason: BC tends to call the same site in more populations…
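One way to put a number on the "same site in more populations" explanation: a site called in k populations is counted k times across the per-population columns but only once in ALL, so the ratio of the two is the mean number of populations per site. The sketch below uses the counts from the table and assumes the ALL column is the union of the per-population calls, which the slides suggest but do not state outright; the function name is illustrative.

```python
# Per-population called SNPs in the order CEU, TSI, CHB, CHD, JPT, LWK, YRI.
per_population = {
    "BC": [4102, 3729, 4340, 4262, 3883, 6039, 5891],
    "BI": [3816, 4285, 3972, 3881, 4719, 6370, 5869],
}
all_samples = {"BC": 14502, "BI": 18149}


def mean_populations_per_site(pop_counts, n_sites_overall):
    """Average number of populations in which a called site appears."""
    return sum(pop_counts) / n_sites_overall


sharing = {
    caller: mean_populations_per_site(per_population[caller], all_samples[caller])
    for caller in per_population
}
for caller, value in sharing.items():
    print(f"{caller}: {value:.2f} populations per site")
# BC: 2.22 populations per site
# BI: 1.81 populations per site
```

Under that assumption a BC site appears in about 2.2 populations on average versus about 1.8 for BI, which is consistent with the stated reason.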

Page 19: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

BC/BI SNP calls per population (more detail)

Page 20: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

SNP calls (per population)

                     Caller  CEU    TSI    CHB    CHD    JPT    LWK    YRI
samples              BC      90     66     109    107    105    108    112
                     BI      90     66     109    107    105    108    112
called SNPs          BC      4,102  3,729  4,340  4,262  3,883  6,039  5,891
                     BI      3,816  4,285  3,972  3,881  4,719  6,370  5,869
dbSNPs               BC      2,422  2,257  2,042  1,924  1,950  2,872  2,897
                     BI      2,352  2,200  1,827  1,753  1,710  2,825  2,856
% dbSNP              BC      59.04  60.53  47.05  45.14  50.22  47.56  49.18
                     BI      61.64  51.34  46.00  45.17  36.24  44.35  48.66
Ts/Tv (called SNPs)  BC      2.73   2.78   2.82   3.06   2.85   3.45   2.92
                     BI      3.14   2.38   3.15   3.16   1.83   3.17   3.15
novel SNPs           BC      1,680  1,472  2,298  2,338  1,933  3,167  2,994
                     BI      1,464  2,085  2,145  2,128  3,009  3,545  3,013
Ts/Tv (novel SNPs)   BC      2.05   2.10   2.44   2.81   2.43   3.44   2.56
                     BI      2.92   1.72   3.03   3.05   1.36   3.07   2.99
singletons           BC      1,378  1,264  1,654  1,686  1,284  1,430  1,457
                     BI      1,240  1,911  1,555  1,500  2,347  1,692  1,489
Ts/Tv (singletons)   BC      2.72   3.36   3.33   3.39   3.09   4.68   3.04
                     BI      2.84   1.72   2.81   3.03   1.11   3.26   2.73

Page 21: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: CEU

Population: CEU (90 samples)  BC            Broad
# SNPs called (Ts/Tv)         4,102 (2.73)  3,816 (3.14)
# dbSNP (Ts/Tv)               2,422 (3.40)  2,352 (3.28)
# novel SNPs (Ts/Tv)          1,680 (2.05)  1,464 (2.92)
# Singleton (Ts/Tv)           1,378 (2.72)  1,240 (2.84)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     613 SNPs, 122 dbSNP (19.90%), Ts/Tv 0.92
BC ∩ Broad:  3,489 SNPs, 2,300 dbSNP (65.92%), Ts/Tv 3.47
Broad only:  327 SNPs, 52 dbSNP (15.90%), Ts/Tv 1.32

Page 22: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: CHB

Population: CHB (109 samples)  BC            Broad
# SNPs called (Ts/Tv)          4,340 (2.82)  3,972 (3.15)
# dbSNP (Ts/Tv)                2,042 (3.37)  1,827 (3.30)
# novel SNPs (Ts/Tv)           2,298 (2.44)  2,145 (3.03)
# Singleton (Ts/Tv)            1,654 (3.33)  1,555 (2.81)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     925 SNPs, 247 dbSNP (26.70%), Ts/Tv 1.23
BC ∩ Broad:  3,415 SNPs, 1,795 dbSNP (52.56%), Ts/Tv 3.74
Broad only:  557 SNPs, 32 dbSNP (5.75%), Ts/Tv 1.37

Page 23: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: CHD

Population: CHD (107 samples)  BC            Broad
# SNPs called (Ts/Tv)          4,262 (3.06)  3,881 (3.16)
# dbSNP (Ts/Tv)                1,924 (3.40)  1,753 (3.30)
# novel SNPs (Ts/Tv)           2,338 (2.81)  2,128 (3.05)
# Singleton (Ts/Tv)            1,686 (3.39)  1,500 (3.03)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     831 SNPs, 200 dbSNP (24.07%), Ts/Tv 1.68
BC ∩ Broad:  3,431 SNPs, 1,724 dbSNP (50.25%), Ts/Tv 3.64
Broad only:  450 SNPs, 31 dbSNP (6.44%), Ts/Tv 1.33

Page 24: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: JPT

Population: JPT (105 samples)  BC            Broad
# SNPs called (Ts/Tv)          3,883 (2.85)  4,719 (1.83)
# dbSNP (Ts/Tv)                1,950 (3.39)  1,710 (3.31)
# novel SNPs (Ts/Tv)           1,933 (2.43)  3,009 (1.36)
# Singleton (Ts/Tv)            1,284 (3.09)  2,347 (1.11)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     983 SNPs, 271 dbSNP (27.57%), Ts/Tv 1.54
BC ∩ Broad:  2,900 SNPs, 1,679 dbSNP (57.90%), Ts/Tv 3.67
Broad only:  1,819 SNPs, 31 dbSNP (1.70%), Ts/Tv 0.74

Page 25: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: LWK

Population: LWK (108 samples)  BC            Broad
# SNPs called (Ts/Tv)          6,039 (3.45)  6,370 (3.17)
# dbSNP (Ts/Tv)                2,872 (3.46)  2,825 (3.31)
# novel SNPs (Ts/Tv)           3,167 (3.44)  3,545 (3.08)
# Singleton (Ts/Tv)            1,430 (4.68)  1,692 (3.26)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     580 SNPs, 136 dbSNP (23.45%), Ts/Tv 2.09
BC ∩ Broad:  5,459 SNPs, 2,736 dbSNP (50.12%), Ts/Tv 3.67
Broad only:  911 SNPs, 89 dbSNP (9.77%), Ts/Tv 1.56

Page 26: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: TSI

Population: TSI (66 samples)  BC            Broad
# SNPs called (Ts/Tv)         3,729 (2.78)  4,285 (2.39)
# dbSNP (Ts/Tv)               2,257 (3.42)  2,200 (3.40)
# novel SNPs (Ts/Tv)          1,472 (2.10)  2,085 (1.72)
# Singleton (Ts/Tv)           1,264 (3.36)  1,911 (1.72)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     448 SNPs, 105 dbSNP (23.44%), Ts/Tv 0.71
BC ∩ Broad:  3,281 SNPs, 2,152 dbSNP (65.59%), Ts/Tv 3.54
Broad only:  1,004 SNPs, 48 dbSNP (4.78%), Ts/Tv 0.85

Page 27: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Broad & BC calls: YRI

Population: YRI (112 samples)  BC            Broad
# SNPs called (Ts/Tv)          5,891 (2.92)  5,869 (3.15)
# dbSNP (Ts/Tv)                2,897 (3.38)  2,856 (3.34)
# novel SNPs (Ts/Tv)           2,994 (2.56)  3,013 (2.99)
# Singleton (Ts/Tv)            1,489 (3.04)  1,457 (2.73)

[Venn diagram: SNP #, dbSNP # (%), Ts/Tv]
BC only:     716 SNPs, 112 dbSNP (15.64%), Ts/Tv 0.95
BC ∩ Broad:  5,175 SNPs, 2,785 dbSNP (53.82%), Ts/Tv 3.56
Broad only:  694 SNPs, 71 dbSNP (10.23%), Ts/Tv 1.48

Page 28: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set

Page 29: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

BC/BI genotype calls (CHB & CHD)

[Figure: BC vs. BI comparison plots for CHB and CHD; each population has two panels, "All SNPs" and "SNPs with >= 80% called genotypes".]

Panel annotations, in extracted order: #sites=3,415, r=0.9925; #sites=3,431, r=0.9941; #sites=3,028, r=0.9993; #sites=3,310, r=0.9991

Page 30: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

BC/BI genotype calls (TSI & JPT)

[Figure: BC vs. BI comparison plots for TSI and JPT; each population has two panels, "All SNPs" and "SNPs with >= 80% called genotypes".]

Panel annotations, in extracted order: #sites=2,900, r=0.9922; #sites=2,370, r=0.9991; #sites=3,108, r=0.9973; #sites=3,281, r=0.9912

Page 31: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

BC/BI genotype calls (LWK & YRI)

[Figure: BC vs. BI comparison plots for LWK and YRI; each population has two panels, "All SNPs" and "SNPs with >= 80% called genotypes".]

Panel annotations, in extracted order: #sites=5,337, r=0.9984; #sites=5,459, r=0.9924; #sites=4,276, r=0.9978; #sites=5,175, r=0.9917

Page 32: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Low frequency / singleton validation design

Page 33: 1000G Pilot 3 Progress in  silico  analysis and comparison to experimental validation

Per population PPV and sensitivity