using 1000 genomes data in disease studies · imputation, and re-imputation with other references,...

26
Using 1000 genomes data in disease studies Jeff Barrett ASHG, 3 November 2010

Upload: others

Post on 26-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Using 1000 genomes data in disease studies

Jeff Barrett

ASHG, 3 November 2010

Page 2: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Two parallel goals in complex disease genetics

For a given disease, can we:

1. Explain heritability

2. Understand biology

Using 1000 genomes data in disease studies ASHG, 3 November 2010 2 / 12

Page 3: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Two parallel goals in complex disease genetics

For a given disease, can we:

1. Explain heritability (prediction/prognosis?)

2. Understand biology (treatment?)

Using 1000 genomes data in disease studies ASHG, 3 November 2010 2 / 12

Page 4: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Two parallel goals in complex disease genetics

For a given disease, can we:

1. Explain heritability (imputation)

2. Understand biology (annotation)

1000 genomes data can play an important role in boththese goals.

Using 1000 genomes data in disease studies ASHG, 3 November 2010 2 / 12

Page 5: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Accurate and deep reference sets are key to imputing lowfrequency variants

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

MAF

Impr

ovem

ent v

s HM

2CEU

CEUCEU+TSICEU+TSI+GIH+MEXWORLD

Using 1000 genomes data in disease studies ASHG, 3 November 2010 3 / 12

Page 6: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Imputation software

I IMPUTE v2 (Howie & Marchini)mathgen.stats.ox.ac.uk/impute/impute v2.html

I BEAGLE (Browning & Browning)faculty.washington.edu/browning/beagle/beagle.html

I MACH & Minimac (Li, Fuchsberger & Abecasis)www.sph.umich.edu/csg/abecasis/MACH/genome.sph.umich.edu/wiki/Minimac

Using 1000 genomes data in disease studies ASHG, 3 November 2010 4 / 12

Page 7: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Imputation is computationally heavy duty

I Imputing into ≈ 16, 000 WTCCC samples using combined SNP/indelpilot data

I Formatting files, aligning strands, etc. can be fiddly

I IMPUTE v2 ‘factory default’ settings

I Genome split into ≈ 600 chunks, each chunk submitted as a job toSanger farm, each job requiring 4–6 GB memory

I Total processing time > 2 CPU years

I 1–2 CPU hours per sample (scales approx linearly with sample size)

Using 1000 genomes data in disease studies ASHG, 3 November 2010 5 / 12

Page 8: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Imputation is computationally heavy duty

I Imputing into ≈ 16, 000 WTCCC samples using combined SNP/indelpilot data

I Formatting files, aligning strands, etc. can be fiddly

I IMPUTE v2 ‘factory default’ settings

I Genome split into ≈ 600 chunks, each chunk submitted as a job toSanger farm, each job requiring 4–6 GB memory

I Total processing time > 2 CPU years

I 1–2 CPU hours per sample (scales approx linearly with sample size)

Using 1000 genomes data in disease studies ASHG, 3 November 2010 5 / 12

Page 9: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Pre-phasing can save a great deal of time

I Simplistically, imputation aims to match skeletal target haplotypes tomore complete (in terms of variation) reference haplotypes.

I In the past, target datasets have been unphased genotype data (e.g.basic GWAS output). This requires a combination of phasing andmatching, which underlies much of the computational burden.

I Phasing target data in advance (and saving the result) meansimputation, and re-imputation with other references, is much faster(comparing haplotypes to each other, rather than genotypes to pairsof haplotypes).

I Implemented via flags in IMPUTE v2, BEAGLE and via Minimac forMACH.

Using 1000 genomes data in disease studies ASHG, 3 November 2010 6 / 12

Page 10: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Reference data, past, present & future

I Past: HapMap2 and HapMap3 (270–1000 samples, 2 million SNPs)

I Present: 1000 genomes pilot (179 samples, >10 million SNPs & smallindels, SV coming)www.1000genomes.orgmathgen.stats.ox.ac.uk/impute/impute v2.html

I Future: 1000 genomes complete data (2,500 samples, 30(?) millionSNPs, indels, SVs). Phased releases of data integrated from allplatforms (low coverage sequence, high coverage exomes, genotypingarrays, arrayCGH. . . )

Using 1000 genomes data in disease studies ASHG, 3 November 2010 7 / 12

Page 11: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Reference data, past, present & future

I Past: HapMap2 and HapMap3 (270–1000 samples, 2 million SNPs)

I Present: 1000 genomes pilot (179 samples, >10 million SNPs & smallindels, SV coming)www.1000genomes.orgmathgen.stats.ox.ac.uk/impute/impute v2.html

I Future: 1000 genomes complete data (2,500 samples, 30(?) millionSNPs, indels, SVs). Phased releases of data integrated from allplatforms (low coverage sequence, high coverage exomes, genotypingarrays, arrayCGH. . . )

Using 1000 genomes data in disease studies ASHG, 3 November 2010 7 / 12

Page 12: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Reference data, past, present & future

I Past: HapMap2 and HapMap3 (270–1000 samples, 2 million SNPs)

I Present: 1000 genomes pilot (179 samples, >10 million SNPs & smallindels, SV coming)www.1000genomes.orgmathgen.stats.ox.ac.uk/impute/impute v2.html

I Future: 1000 genomes complete data (2,500 samples, 30(?) millionSNPs, indels, SVs). Phased releases of data integrated from allplatforms (low coverage sequence, high coverage exomes, genotypingarrays, arrayCGH. . . )

Using 1000 genomes data in disease studies ASHG, 3 November 2010 7 / 12

Page 13: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Example: Crohn’s disease

I Hit at 28 MB missed inWTCCC and 2008meta-analysis (p > 10−4).Hit SNP MAF: 3%

I Picked up in 2010meta-analysis (> 20, 000total samples). Hit SNPMAF: 13%

I Hit at 42 MB not supportedat all in meta-analysis. . .

Using 1000 genomes data in disease studies ASHG, 3 November 2010 8 / 12

Page 14: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Example: Crohn’s disease

I Hit at 28 MB missed inWTCCC and 2008meta-analysis (p > 10−4).Hit SNP MAF: 3%

I Picked up in 2010meta-analysis (> 20, 000total samples). Hit SNPMAF: 13%

I Hit at 42 MB not supportedat all in meta-analysis. . .

Using 1000 genomes data in disease studies ASHG, 3 November 2010 8 / 12

Page 15: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Example: Crohn’s disease

I Hit at 28 MB missed inWTCCC and 2008meta-analysis (p > 10−4).Hit SNP MAF: 3%

I Picked up in 2010meta-analysis (> 20, 000total samples). Hit SNPMAF: 13%

I Hit at 42 MB not supportedat all in meta-analysis. . .

Using 1000 genomes data in disease studies ASHG, 3 November 2010 8 / 12

Page 16: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Example: Crohn’s disease

I Hit at 28 MB missed inWTCCC and 2008meta-analysis (p > 10−4).Hit SNP MAF: 3%

I Picked up in 2010meta-analysis (> 20, 000total samples). Hit SNPMAF: 13%

I Hit at 42 MB not supportedat all in meta-analysis. . .

Using 1000 genomes data in disease studies ASHG, 3 November 2010 8 / 12

Page 17: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Validation and ‘gold standards’

Using 1000 genomes data in disease studies ASHG, 3 November 2010 9 / 12

Page 18: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Validation and ‘gold standards’

Using 1000 genomes data in disease studies ASHG, 3 November 2010 9 / 12

Page 19: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Validation and ‘gold standards’

Using 1000 genomes data in disease studies ASHG, 3 November 2010 9 / 12

Page 20: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Validation and ‘gold standards’

Using 1000 genomes data in disease studies ASHG, 3 November 2010 9 / 12

Page 21: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Functional annotation

Using 1000 genomes data in disease studies ASHG, 3 November 2010 10 / 12

Page 22: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Functional annotation

. . . < 10% of GWAS hit SNPs have r2 > 0.9 with a coding SNP

Using 1000 genomes data in disease studies ASHG, 3 November 2010 10 / 12

Page 23: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Conclusions

I 1000 genomes data will increasingly become thedefault imputation reference panel.

I The project also enables the discovery and annotationof variants, standardization of files and developmentof genotyping products.

I Coming to grips with the subtleties of the data willtake time and continue to evolve.

Using 1000 genomes data in disease studies ASHG, 3 November 2010 11 / 12

Page 24: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Conclusions

I 1000 genomes data will increasingly become thedefault imputation reference panel.

I The project also enables the discovery and annotationof variants, standardization of files and developmentof genotyping products.

I Coming to grips with the subtleties of the data willtake time and continue to evolve.

Using 1000 genomes data in disease studies ASHG, 3 November 2010 11 / 12

Page 25: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Conclusions

I 1000 genomes data will increasingly become thedefault imputation reference panel.

I The project also enables the discovery and annotationof variants, standardization of files and developmentof genotyping products.

I Coming to grips with the subtleties of the data willtake time and continue to evolve.

Using 1000 genomes data in disease studies ASHG, 3 November 2010 11 / 12

Page 26: Using 1000 genomes data in disease studies · imputation, and re-imputation with other references, is much faster (comparing haplotypes to each other, rather than genotypes to pairs

Thanks

1000 Genomes ProjectGoncalo AbecasisBrian BrowningBryan HowieJonathan MarchiniDaniel MacarthurJames Morris

Luke Jostins

Using 1000 genomes data in disease studies ASHG, 3 November 2010 12 / 12