Post on 23-Feb-2016

Large-scale association studies: brute force and ignorance

Thomas Lumley
BIOINF 744/STATS 771

My experience: the CHARGE consortium

I’m in CHARGE

Genome-wide Association Studies

• If a protein is important in a particular disease or trait then small changes in function or expression of that protein should have a small effect on the disease or trait.
• A small effect. Very small. No, smaller than that.
• But not confounded with environmental or lifestyle factors: like a tiny randomised experiment
• And not relying on prior biological knowledge
– we can find surprises.
• Because SNP associations are very weak, need 10^3 to 10^5 people in the study
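
A rough sketch of why sample sizes land in the 10^3 to 10^5 range. It uses the standard large-sample approximation n ≈ (z_alpha + z_beta)^2 / R^2 for detecting a predictor that explains a fraction R^2 of trait variance. The threshold alpha = 5e-8, the 80% power, and the function name `approx_sample_size` are assumptions for illustration, not values taken from the slides.

```python
from statistics import NormalDist


def approx_sample_size(r2, alpha=5e-8, power=0.8):
    """Approximate n needed to detect a SNP explaining a fraction r2 of trait variance.

    Large-sample approximation; alpha and power are illustrative assumptions.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    return (z_alpha + z_beta) ** 2 / r2


for r2 in (0.01, 0.001, 0.0001):
    n = approx_sample_size(r2)
    print(f"variance explained {r2:.2%}: about {n:,.0f} people")
```

Under these assumptions, a SNP explaining 1% of variance needs a few thousand people, and each tenfold drop in variance explained needs ten times as many.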

Does it work?

• Not for risk prediction or treatment choice
– exception: some adverse drug reactions
• Yes, for discovering new mechanisms or potential drug targets
– mystery 9p21 variant in CHD
– autophagy in Crohn’s Disease
– sodium transporter affecting uric acid levels
– new ion channels important in heart rhythm

Measurement technologies

• SNP chips
– cDNA attached to glass/silicon
– multiple probes per SNP
– planned layout (Affymetrix) or random (Illumina)
– DNA binds to cDNA, fluorescent tags for readout
• Off-the-shelf
– 10^5 to 10^7 SNPs, genome-wide coverage
• Custom chips
– 384 to 250000 SNPs for a particular purpose

Scale

• SNP chips are cost-effective only for large sample sizes and numbers of SNPs
– new ‘exome chip’ has all known coding variants segregating in the population
– 1.5 million chips sold
– a few hundred dollars in large volumes = dozens of SNPs per 1c.

Quality control

[Figure: three genotype groups labelled homozygote, heterozygote, homozygote]

Quality control

[Figure: plot annotated with question marks]

Quality control

• Easy by hand, but there are 500000 of them
– 1 minute each = 24 hrs/day for a year.
– Need to be brutal and automated
– 10% of SNPs discarded is not unusual
• Batch effects
• Missingness (per SNP and per sample)
• Hardy-Weinberg equilibrium
• low minor allele frequency (eg Np<100)
• Big differences from expected allele frequency
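
A minimal sketch of two of the automated per-SNP filters listed above: missingness and low minor allele frequency. It assumes genotypes are coded as 0/1/2 allele counts with `None` for a failed call, and it reads "Np<100" as (number of called samples) × (minor allele frequency). The 5% missingness cut-off and the function name `snp_qc` are hypothetical choices, not taken from the slides.

```python
def snp_qc(genotypes, max_missing=0.05, min_np=100):
    """Return (passes, reasons) for one SNP.

    genotypes: list of 0/1/2 allele counts, None where the call failed.
    max_missing and min_np are illustrative thresholds.
    """
    called = [g for g in genotypes if g is not None]
    reasons = []

    # Per-SNP missingness: fraction of samples without a genotype call.
    missing = 1 - len(called) / len(genotypes)
    if missing > max_missing:
        reasons.append("missingness")

    # Minor allele frequency, then the expected count N * p.
    if called:
        freq = sum(called) / (2 * len(called))
        maf = min(freq, 1 - freq)
        if len(called) * maf < min_np:
            reasons.append("low minor allele frequency")

    return (not reasons, reasons)


rare_snp = [0] * 900 + [1] * 90 + [2] * 5 + [None] * 5
common_snp = [0] * 500 + [1] * 400 + [2] * 100
print(snp_qc(rare_snp))
print(snp_qc(common_snp))
```

Because each SNP is checked independently with fixed rules, the same function can be run over all 500000 SNPs without anyone looking at a plot.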

HWE

• Not looking at population structure here
• Bad SNPs tend to drop out either heterozygotes or rare-allele homozygotes
• Calling error leads to massive HWE violations
– p-value < 10^-5 is a common standard
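
A sketch of a Hardy-Weinberg check using the one-degree-of-freedom chi-square test on genotype counts. This is a simplification (exact tests are often preferred for rare alleles), and the function name `hwe_chisq_pvalue` is hypothetical. The second example mimics a SNP where many heterozygotes have been miscalled.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of the 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chisq / 2))


well_behaved = hwe_chisq_pvalue(360, 480, 160)  # counts match HWE proportions
het_dropout = hwe_chisq_pvalue(500, 200, 300)   # far too few heterozygotes
for label, pval in (("well-behaved", well_behaved), ("heterozygote dropout", het_dropout)):
    print(label, pval, "FAIL" if pval < 1e-5 else "ok")
```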

Batch effects

• Experimental design is important
– mix cases and controls in same batches
– especially important with new technologies

After online publication of our report “Genetic Signatures of Exceptional Longevity in Humans” (1) we discovered that technical errors in the Illumina 610 array and an inadequate quality control protocol introduced false positive single nucleotide polymorphisms (SNPs) in our findings.

-- Sebastiani et al. Science retraction.

Analyses

• Must be simple and fast
• Usually additive genetic model
• Adjust for
– sampling factors such as recruitment site
– precision variables (eg heart rate in QT interval)
– age and sex (because epidemiologists can’t stop themselves)
– population structure summaries
– not for any post-conception exposures that don’t affect genes.
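
A minimal sketch of the additive genetic model for one SNP and a quantitative trait: regress the trait on the number of copies of an allele (0, 1 or 2) and test the slope. To stay short it leaves out the adjustment covariates listed above, which is a simplification, and it uses a normal approximation for the p-value. The function name `additive_snp_test` and the toy data are hypothetical.

```python
import math
from statistics import fmean


def additive_snp_test(allele_counts, trait):
    """Slope, standard error and p-value for trait ~ allele count (no covariates)."""
    n = len(allele_counts)
    g_bar, y_bar = fmean(allele_counts), fmean(trait)
    sxx = sum((g - g_bar) ** 2 for g in allele_counts)
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(allele_counts, trait))
    beta = sxy / sxx  # change in trait per extra copy of the allele
    resid_ss = sum(
        (y - y_bar - beta * (g - g_bar)) ** 2 for g, y in zip(allele_counts, trait)
    )
    se = math.sqrt(resid_ss / (n - 2) / sxx)
    # Two-sided p-value from the normal approximation; erfc keeps tiny p-values accurate.
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p_value


counts = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 2.1, 1.9, 3.0, 3.1]
print(additive_snp_test(counts, trait))
```

The same closed-form calculation is repeated for every SNP, which is why the analysis can be both simple and fast.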

Results

[Figure: one dot for each SNP, ordered by position within chromosome; axis shows number of zeroes in p-value (need 7-8)]
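
The "number of zeroes in p-value" is -log10(p). A small sketch of that scale follows. The Bonferroni-style threshold of 0.05 divided by one million tests is a commonly used convention and an assumption here; the slides only say that 7-8 zeroes are needed.

```python
import math

# Assumed convention: 0.05 spread over roughly a million independent tests.
GENOME_WIDE_THRESHOLD = 0.05 / 1_000_000


def zeroes_in_p(p_value):
    """Height of a SNP on the plot: -log10 of its p-value."""
    return -math.log10(p_value)


for p_value in (1e-3, 5e-8, 1e-8, 1e-12):
    flag = "genome-wide significant" if p_value <= GENOME_WIDE_THRESHOLD else "not significant"
    print(f"p = {p_value:g}: {zeroes_in_p(p_value):.1f} zeroes, {flag}")
```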

Zoom in

Meta-analysis

• Most genome-wide studies involve multiple samples

• Usually share results, not individual data
– combine by precision-weighted meta-analysis
– no loss of efficiency for single-parameter analyses
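
A sketch of precision-weighted (inverse-variance, fixed-effect) meta-analysis for one SNP. Each study contributes only its estimate and standard error, so no individual-level data are shared. The function name `precision_weighted_meta` and the example numbers are hypothetical.

```python
import math


def precision_weighted_meta(estimates, std_errors):
    """Combine per-study estimates for one SNP, weighting each by 1 / SE^2."""
    weights = [1 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    se = math.sqrt(1 / total_weight)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))  # two-sided, normal approximation
    return beta, se, p_value


# Three hypothetical studies reporting the same SNP.
print(precision_weighted_meta([0.10, 0.20, 0.15], [0.05, 0.05, 0.10]))
```

The combined standard error is smaller than any single study's, which is how pooling summary results recovers the precision of a pooled analysis for a single parameter.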

Computation: not a big deal

• In R, roughly 12 cpu-hrs for quantitative traits, 36 cpu-hrs for binary or time-to-event traits
• Parallelises very well
– we split by chromosome
• Limited by disk bandwidth
– eg, six parallel R sessions on a cheap eight-core server
– eg, 500 parallel R sessions on high-quality supercomputer
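
A sketch of the split-by-chromosome pattern, written in Python rather than the separate R sessions the slides describe. `analyse_chromosome` is a hypothetical placeholder that only returns an output file name; a real version would read one chromosome's genotypes and run the per-SNP regressions. Threads are used here to keep the example self-contained; CPU-bound work would normally run in separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse_chromosome(chromosome):
    """Hypothetical placeholder for the per-chromosome analysis."""
    return chromosome, f"results_chr{chromosome}.txt"


def run_all(max_workers=6):
    """Run the 22 autosomes as independent jobs, six at a time."""
    chromosomes = range(1, 23)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(analyse_chromosome, chromosomes))


results = run_all()
print(len(results), "chromosomes analysed")
```

No job needs anything from another, so the only shared resource is the disk the genotype files are read from.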

Population structure

• Full Bayesian modelling is too slow at this scale

• Use first few principal components of the genotype correlation matrix
– population structure is a concern because it leads to systematic variation in allele frequencies along the whole genome
– systematic variation in allele frequencies along the whole genome shows up in principal components

Principal components

• Genotype matrix G has 10^6 columns, 10^4 rows
– don’t want to form G^T G, with 10^12 entries
– work with G G^T, with 10^8 entries
– first few eigenvectors are population structure components (or common inversions)
– ‘EIGENSTRAT’ was first program to do this
– Reduce effort further by using just 10^5 or 10^4 random SNPs (some loss in quality)
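
A toy sketch of the G G^T trick: centre each SNP column, form the small people-by-people matrix, and extract its leading eigenvector by power iteration. This is an illustration only, not EIGENSTRAT's implementation. It skips the per-SNP variance scaling that such programs usually apply. The six-person, five-SNP matrix and the function name `first_principal_component` are made up.

```python
import math


def first_principal_component(G, iterations=200):
    """Leading eigenvector of X X^T, where X is G with each SNP column centred."""
    n, m = len(G), len(G[0])
    col_means = [sum(row[j] for row in G) / n for j in range(m)]
    X = [[row[j] - col_means[j] for j in range(m)] for row in G]

    # n x n people-by-people matrix; the m x m SNP-by-SNP matrix is never formed.
    K = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)] for i in range(n)]

    # Power iteration. Start from a unit basis vector, not the all-ones vector,
    # because centring makes the all-ones vector an eigenvector with eigenvalue 0.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(K[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Rows are people, columns are SNPs; the first three and last three people
# come from two made-up populations with different allele frequencies.
G = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 1],
    [0, 0, 0, 2, 2],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 1],
    [1, 2, 2, 0, 0],
]
print([round(x, 2) for x in first_principal_component(G)])
```

In this toy data the first three people get scores of one sign and the last three the other, which is the population split the component is meant to capture.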

Principal components: MESA study

Principal components

• Does it work?
– if not, ancestry-informative loci would be over-represented in association findings
– largely not the case
• slight suggestion in very largest studies that ABO blood group and lactase persistence loci are cropping up too often.

Imputation

• Meta-analysis often involves studies using different SNP chips

• Can only combine results for the same SNPs
– usually a minority

• Imputation allows everyone to use the same SNPs

• Based on linkage disequilibrium
– with 500,000 SNPs, we are very far from linkage equilibrium

Imputation

• Haplotyping
– estimate possible haplotypes and their probabilities for each person in your sample
• In reference panel with all the SNPs (eg HapMap)
– look up which allele is on each haplotype
• Compute posterior mean genotype
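
The three steps above can be sketched for one person and one untyped SNP. The haplotype labels, their probabilities, and the reference lookup are all made-up illustrative values. Real imputation software estimates the haplotype probabilities with a statistical model that is not shown here.

```python
def posterior_mean_genotype(haplotype_pair_probs, reference_allele):
    """Expected allele count at an untyped SNP.

    haplotype_pair_probs: {(hap1, hap2): posterior probability} for one person.
    reference_allele: {haplotype: 0 or 1}, the allele carried on each haplotype
    in the reference panel.
    """
    return sum(
        prob * (reference_allele[h1] + reference_allele[h2])
        for (h1, h2), prob in haplotype_pair_probs.items()
    )


# Hypothetical reference panel: which allele sits on each local haplotype.
reference_allele = {"ACG": 1, "ATG": 0, "GTG": 0}

# Hypothetical haplotyping result for one person.
person = {("ACG", "ATG"): 0.7, ("ACG", "ACG"): 0.2, ("ATG", "GTG"): 0.1}

print(posterior_mean_genotype(person, reference_allele))
```

The result is a number between 0 and 2 that need not be an integer; it plays the role of the measured allele count in later analyses.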

Imputation

• Imputation does not use phenotype data
– slightly underestimates association
– but only for SNPs that explain a large fraction of variation in phenotype
– which basically don’t exist.
• Just plug imputed genotype into regression as if it was measured.
– some people filter out SNPs where imputation is low-quality: compare to 2p(1-p)
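
One common reading of "compare to 2p(1-p)" is the ratio of the observed variance of the imputed genotypes to the variance 2p(1-p) expected for a perfectly measured SNP in Hardy-Weinberg equilibrium. The sketch below assumes that reading. The function name `imputation_quality` is hypothetical, and any cut-off applied to the ratio would be a study-specific choice.

```python
from statistics import fmean, pvariance


def imputation_quality(imputed_genotypes):
    """Observed variance of imputed genotypes divided by 2p(1-p)."""
    p = fmean(imputed_genotypes) / 2  # estimated allele frequency
    expected = 2 * p * (1 - p)
    if expected == 0:
        return 0.0  # monomorphic: nothing to impute
    return pvariance(imputed_genotypes) / expected


well_imputed = [0] * 36 + [1] * 48 + [2] * 16  # confident calls in HWE proportions
uninformative = [0.8] * 100                    # everyone gets the population mean
print(imputation_quality(well_imputed))
print(imputation_quality(uninformative))
```

When imputation carries no information, every person's posterior mean shrinks toward the population average, so the variance and hence the ratio fall toward zero.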

Imputation

• For meta-analysis, need to impute to the same set of SNPs before analysis
– most people use 2.5 million HapMap Phase II SNPs
– starting to use 38 million 1000 Genomes SNPs
– for additive genetic model, doesn’t matter whether SNPs are measured or imputed.
– slightly more work needed for non-additive genetic models or SNP:SNP interaction models

Resequencing

• 2.5 million SNPs is one per 1000 bases
• Every base varies somewhere in the human population
• Association studies by sequencing are just becoming possible
• US$1000 genome probably coming next year

Resequencing

• Basic idea is similar to GWAS, but
– most variants will be rare
– some variants will have stronger associations
– the true functional variant will be measured.

• For sufficiently-common SNPs, use the same analysis as in GWAS

• For rare variants (SNPs and indels), use a burden test

Burden tests

• Might expect most mutations to reduce function
– people with more copies of rare variants should have lower function for that gene (or non-gene locus)
• Use number of variants for each person as predictor in a regression model
– rarer variants may have larger effects: give them more weight
– we know or guess that some bases are more likely to matter: give them more weight
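
A sketch of the burden score for one gene: a weighted count of rare-variant copies per person, which then serves as a single predictor in a regression model. The weights 1/sqrt(p(1-p)), computed from smoothed allele frequencies, are one possible way of giving rarer variants more weight and are an assumption here, as are the function names and the toy data.

```python
import math


def frequency_weights(genotypes):
    """One possible weighting: rarer variants get larger weights."""
    n = len(genotypes)
    weights = []
    for j in range(len(genotypes[0])):
        # Add-one smoothing keeps the weight finite for very rare variants.
        p = (sum(row[j] for row in genotypes) + 1) / (2 * n + 2)
        weights.append(1 / math.sqrt(p * (1 - p)))
    return weights


def burden_scores(genotypes, weights=None):
    """Weighted number of rare-variant copies carried by each person."""
    if weights is None:
        weights = [1.0] * len(genotypes[0])
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Rows are people, columns are rare variants in one gene (made-up data).
genotypes = [[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 1, 1]]
print(burden_scores(genotypes))
print(burden_scores(genotypes, frequency_weights(genotypes)))
```

Collapsing a gene to one number per person keeps the test at a single parameter, at the cost of assuming all variants push in the same direction.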

Omnidirectional burden tests

• ‘Loss of function’ is tricky
– ion channel function is to open and close: which direction is loss of function?
– Leiden variant in Factor V removes ability to be turned off: loss or gain of function?
• Would be nice to find important genes even if variants act both ways
• Hard: huge increase in dimension of problem
• Simple meta-analyses are no longer efficient.

Omnidirectional burden tests

• Typically based on correlation
– do people with more similar genotypes have more similar phenotypes?
• Power is very low if there are many unimportant variants
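
A sketch of why a direction-free statistic can find genes that a simple burden count misses. The statistic sums the squared per-variant covariance between genotype and phenotype, in the spirit of variance-component tests such as SKAT. The slides do not name a specific method, so this choice, the function names, and the toy data are assumptions. No p-value is computed.

```python
from statistics import fmean


def omnidirectional_statistic(genotypes, phenotype, weights=None):
    """Sum over variants of weight * (genotype-phenotype score)^2."""
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    y_bar = fmean(phenotype)
    total = 0.0
    for j in range(n_variants):
        score = sum(row[j] * (y - y_bar) for row, y in zip(genotypes, phenotype))
        total += weights[j] * score ** 2  # squaring discards the direction
    return total


def burden_signal(genotypes, phenotype):
    """Same score computed on the unweighted variant count per person."""
    y_bar = fmean(phenotype)
    return sum(sum(row) * (y - y_bar) for row, y in zip(genotypes, phenotype))


# Made-up gene: the first variant raises the trait, the second lowers it.
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
phenotype = [2.0, 2.0, -2.0, -2.0, 0.0, 0.0]
print(burden_signal(genotypes, phenotype))
print(omnidirectional_statistic(genotypes, phenotype))
```

In this toy data the opposite effects cancel in the burden signal but add up in the squared statistic. Unimportant variants add only noise to the sum, which fits the point about low power when there are many of them.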

Third generation sequencing

• Pacific Biosciences: tethered polymerase copies single DNA molecule, with spotlight small enough to see just one base fluoresce

• Oxford Nanopore: drag a single DNA strand through a tiny hole and measure its shadow

• Ion Torrent: tethered polymerases copy one base at a time, read-out uses H+ ion released by adding the base.