Large-scale association studies: brute force and ignorance

Thomas Lumley
BIOINF 744/STATS 771

My experience: the CHARGE consortium

I’m in CHARGE




Genome-wide Association Studies

• If a protein is important in a particular disease or trait then small changes in function or expression of that protein should have a small effect on the disease or trait.

• A small effect. Very small. No, smaller than that.

• But not confounded with environmental or lifestyle factors: like a tiny randomised experiment

• And not relying on prior biological knowledge
  – we can find surprises.

• Because SNP associations are very weak, need 10^3 to 10^5 people in the study

Does it work?

• Not for risk prediction or treatment choice
  – exception: some adverse drug reactions

• Yes, for discovering new mechanisms or potential drug targets
  – mystery 9p21 variant in CHD
  – autophagy in Crohn’s Disease
  – sodium transporter affecting uric acid levels
  – new ion channels important in heart rhythm

Measurement technologies

• SNP chips
  – cDNA attached to glass/silicon
  – multiple probes per SNP
  – planned layout (Affymetrix) or random (Illumina)
  – DNA binds to cDNA, fluorescent tags for readout

• Off-the-shelf
  – 10^5 to 10^7 SNPs, genome-wide coverage

• Custom chips
  – 384 to 250000 SNPs for a particular purpose

Scale

• SNP chips are cost-effective only for large sample sizes and numbers of SNPs
  – new ‘exome chip’ has all known coding variants segregating in the population
  – 1.5 million chips sold
  – a few hundred dollars in large volumes = dozens of SNPs per 1c.

Quality control

[Figure: genotype cluster plot; clusters labelled homozygote, heterozygote, homozygote]


Quality control

• Easy by hand, but there are 500000 of them
  – 1 minute each = 24hrs/day for a year.
  – Need to be brutal and automated
  – 10% of SNPs discarded is not unusual

• Batch effects
• Missingness (per SNP and per sample)
• Hardy-Weinberg equilibrium
• Low minor allele frequency (eg Np<100)
• Big differences from expected allele frequency
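A minimal sketch of what an automated per-SNP filter might look like, covering only two of the checks listed above (missingness and low minor allele count). The function name, the `None` coding for a failed call, and the 5% missingness threshold are assumptions for illustration; the `Np < 100` rule is taken from the slide, read here as (number of called samples) x (minor allele frequency).

```python
def snp_passes_qc(calls, max_missing=0.05, min_np=100):
    """Return True if one SNP passes two simple QC checks.

    calls: list of genotype calls for one SNP, each 0, 1 or 2
           (copies of one allele), or None for a failed call.
    max_missing: hypothetical threshold on the fraction of failed calls.
    min_np: minimum value of N * p, with p the minor allele frequency.
    """
    observed = [g for g in calls if g is not None]
    if not observed:
        return False  # nothing was called for this SNP
    missing_rate = 1 - len(observed) / len(calls)
    allele_freq = sum(observed) / (2 * len(observed))
    minor_freq = min(allele_freq, 1 - allele_freq)
    return missing_rate <= max_missing and len(observed) * minor_freq >= min_np


# Example: 1000 samples, 300 heterozygotes, no missing calls (N*p = 150)
print(snp_passes_qc([1] * 300 + [0] * 700))
```

In practice the same loop would also apply the batch, Hardy-Weinberg and allele-frequency checks before a SNP is kept.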

HWE

• Not looking at population structure here
• Bad SNPs tend to drop out either heterozygotes or rare-allele homozygotes
• Calling error leads to massive HWE violations
  – p-value < 10^-5 is a common standard
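As an illustration of the check, here is a sketch of a one-degree-of-freedom chi-square test of Hardy-Weinberg equilibrium from genotype counts. The function name is hypothetical, and the choice of the chi-square test (rather than an exact test) is an assumption made to keep the example short.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2))


# A SNP that has lost a fifth of its heterozygotes fails the 10^-5 standard
print(hwe_chisq_pvalue(300, 400, 300) < 1e-5)
```

With 1000 samples, losing 100 of the expected 500 heterozygotes already gives a chi-square statistic of 40, which is why calling errors show up so clearly.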

Batch effects

• Experimental design is important
  – mix cases and controls in same batches
  – especially important with new technologies

After online publication of our report “Genetic Signatures of Exceptional Longevity in Humans” (1) we discovered that technical errors in the Illumina 610 array and an inadequate quality control protocol introduced false positive single nucleotide polymorphisms (SNPs) in our findings.

-- Sebastiani et al. Science retraction.

Analyses

• Must be simple and fast
• Usually additive genetic model
• Adjust for
  – sampling factors such as recruitment site
  – precision variables (eg heart rate in QT interval)
  – age and sex (because epidemiologists can’t stop themselves)
  – population structure summaries
  – not for any post-conception exposures that don’t affect genes.
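A sketch of the per-SNP analysis for a quantitative trait under the additive model: ordinary least squares of the phenotype on the allele count (0, 1, 2, or an imputed value in between) plus adjustment variables. The function name and argument layout are assumptions, and NumPy is used here only for illustration (the course itself works in R).

```python
import numpy as np


def additive_snp_test(y, allele_count, covariates):
    """Estimate the per-allele effect of one SNP on a quantitative trait.

    y: phenotype, length n.
    allele_count: copies of the coded allele per person, length n.
    covariates: adjustment variables, length n or shape (n, k).
    Returns (beta, standard_error) for the SNP term.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Design matrix: intercept, adjustment variables, then the SNP
    X = np.column_stack([np.ones(n), covariates, allele_count])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[-1], np.sqrt(cov[-1, -1])
```

The same call is repeated for every SNP, which is why the model has to be simple and fast; the estimate and its standard error are the two numbers a study would share for meta-analysis.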

Results

[Figure: one dot for each SNP, ordered by position within chromosome; vertical axis is the number of zeroes in the p-value (need 7-8)]

Zoom in

Meta-analysis

• Most genome-wide studies involve multiple samples

• Usually share results, not individual data
  – combine by precision-weighted meta-analysis
  – no loss of efficiency for single-parameter analyses
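A minimal sketch of precision-weighted (inverse-variance, fixed-effect) meta-analysis for one SNP, assuming each study shares an effect estimate and its standard error for the same coded allele. The function name is hypothetical.

```python
import math


def precision_weighted_meta(betas, ses):
    """Combine per-study estimates for one SNP.

    betas: effect estimates from each study.
    ses: their standard errors.
    Returns (pooled_beta, pooled_se).
    """
    weights = [1 / se ** 2 for se in ses]  # precision = 1 / variance
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1 / total)
    return pooled_beta, pooled_se


# Two equally precise studies: the pooled estimate is their average
print(precision_weighted_meta([0.1, 0.3], [0.1, 0.1]))
```

Each study contributes in proportion to its precision, so a large study dominates a small one without anyone exchanging individual-level data.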

Computation: not a big deal

• In R, roughly 12 cpu-hrs for quantitative traits, 36 cpu-hrs for binary, time-to-event

• Parallelises very well
  – we split by chromosome

• Limited by disk bandwidth
  – eg, six parallel R sessions on a cheap eight-core server
  – eg, 500 parallel R sessions on high-quality supercomputer

Population structure

• Full Bayesian modelling is too slow at this scale

• Use first few principal components of the genotype correlation matrix
  – population structure is a concern because it leads to systematic variation in allele frequencies along the whole genome
  – systematic variation in allele frequencies along the whole genome shows up in principal components

Principal components

• Genotype matrix G has 10^6 columns, 10^4 rows
  – don’t want to form G^T G, with 10^12 entries
  – work with G G^T, with 10^8 entries
  – first few eigenvectors are population structure components (or common inversions)
  – ‘EIGENSTRAT’ was first program to do this
  – Reduce effort further by using just 10^5 or 10^4 random SNPs (some loss in quality)
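The computation can be sketched as follows, assuming G is stored with one row per person and one column per SNP. The function name and the particular standardisation (centre each SNP at twice its allele frequency and scale by its binomial standard deviation) are assumptions for illustration, not a description of what EIGENSTRAT does internally.

```python
import numpy as np


def ancestry_components(G, n_components=2):
    """Top eigenvectors of the people-by-people genotype similarity matrix.

    G: array of allele counts, shape (n_people, n_snps).
    Returns (vectors, values): vectors has shape (n_people, n_components).
    """
    G = np.asarray(G, dtype=float)
    freq = G.mean(axis=0) / 2
    keep = (freq > 0) & (freq < 1)  # drop monomorphic SNPs
    f = freq[keep]
    Z = (G[:, keep] - 2 * f) / np.sqrt(2 * f * (1 - f))
    # n_people x n_people matrix: small, unlike the n_snps x n_snps one
    K = Z @ Z.T / Z.shape[1]
    values, vectors = np.linalg.eigh(K)  # eigenvalues in ascending order
    order = np.argsort(values)[::-1][:n_components]
    return vectors[:, order], values[order]
```

The returned columns are what would be added to the regression as population structure summaries; passing a random subset of the columns of G is the shortcut mentioned in the last bullet.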

Principal components: MESA study

Principal components

• Does it work?
  – if not, ancestry-informative loci would be over-represented in association findings
  – largely not the case
    • slight suggestion in very largest studies that ABO blood group and lactase persistence loci are cropping up too often.

Imputation

• Meta-analysis often involves studies using different SNP chips

• Can only combine results for the same SNPs
  – usually a minority

• Imputation allows everyone to use the same SNPs

• Based on linkage disequilibrium
  – with 500,000 SNPs, we are very far from linkage equilibrium

Imputation

• Haplotyping
  – estimate possible haplotypes and their probabilities for each person in your sample
• In reference panel with all the SNPs (eg HapMap)
  – look up which allele is on each haplotype
• Compute posterior mean genotype
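The last two steps can be sketched for a single person and a single untyped SNP. The data layout is an assumption made for illustration: a dictionary of candidate haplotype pairs with their probabilities (the output of the haplotyping step), and a dictionary giving the allele (0 or 1) carried by each reference haplotype at the untyped SNP.

```python
def posterior_mean_genotype(hap_pair_probs, ref_allele):
    """Expected allele count at an untyped SNP for one person.

    hap_pair_probs: {(hap1, hap2): probability} over candidate haplotype pairs.
    ref_allele: {haplotype: 0 or 1}, the allele on each reference haplotype.
    """
    total = sum(hap_pair_probs.values())  # renormalise in case of rounding
    expected = sum(
        prob * (ref_allele[h1] + ref_allele[h2])
        for (h1, h2), prob in hap_pair_probs.items()
    )
    return expected / total


ref = {"hapA": 0, "hapB": 1}  # hypothetical reference haplotypes
probs = {("hapA", "hapA"): 0.7, ("hapA", "hapB"): 0.2, ("hapB", "hapB"): 0.1}
print(posterior_mean_genotype(probs, ref))
```

The result is a number between 0 and 2 rather than a hard genotype call, which is what gets carried into the association analysis.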

Imputation

• Imputation does not use phenotype data
  – slightly underestimates association
  – but only for SNPs that explain a large fraction of variation in phenotype
  – which basically don’t exist.

• Just plug imputed genotype into regression as if it was measured.
  – some people filter out SNPs where imputation is low-quality: compare to 2p(1-p)
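One way to read "compare to 2p(1-p)" is as a ratio: the observed variance of the imputed genotypes divided by the variance 2p(1-p) expected for a well-measured SNP with allele frequency p under Hardy-Weinberg equilibrium. The sketch below assumes that reading; the function name is hypothetical and no particular cut-off is implied.

```python
def imputation_quality(imputed):
    """Ratio of observed to expected variance of imputed genotypes for one SNP.

    imputed: posterior mean genotypes (values between 0 and 2), one per person.
    Values near 1 suggest informative imputation; values near 0 suggest the
    imputed genotype is close to the population mean for everyone.
    """
    n = len(imputed)
    mean = sum(imputed) / n
    p = mean / 2  # estimated allele frequency
    expected_var = 2 * p * (1 - p)
    if expected_var == 0:
        return 0.0  # monomorphic: no information either way
    observed_var = sum((d - mean) ** 2 for d in imputed) / n
    return observed_var / expected_var


print(imputation_quality([0, 1, 2, 1]))          # perfectly informative
print(imputation_quality([1.0, 1.0, 1.0, 1.0]))  # no information
```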

Imputation

• For meta-analysis, need to impute to the same set of SNPs before analysis
  – most people use 2.5 million HapMap Phase II SNPs
  – starting to use 38 million 1000 Genomes SNPs
  – for additive genetic model, doesn’t matter whether SNPs are measured or imputed.
  – slightly more work needed for non-additive genetic models or SNP:SNP interaction models

Resequencing

• 2.5 million SNPs is one per 1000 bases
• Every base varies somewhere in the human population
• Association studies by sequencing are just becoming possible

• US$1000 genome probably coming next year

Resequencing

• Basic idea is similar to GWAS, but
  – most variants will be rare
  – some variants will have stronger associations
  – the true functional variant will be measured.

• For sufficiently-common SNPs, use the same analysis as in GWAS

• For rare variants (SNPs and indels), use a burden test

Burden tests

• Might expect most mutations to reduce function
  – people with more copies of rare variants should have lower function for that gene (or non-gene locus)
• Use number of variants for each person as predictor in a regression model
  – rarer variants may have larger effects: give them more weight
  – we know or guess that some bases are more likely to matter: give them more weight
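A sketch of the burden score for one gene: a weighted count of rare-variant copies per person, which then goes into the regression in place of a single SNP. Both function names are hypothetical, and the frequency-based weight 1/sqrt(f(1-f)) with a small pseudo-count is just one illustrative way to up-weight rarer variants.

```python
import math


def rarity_weights(genotypes):
    """One weight per variant, larger for rarer variants (illustrative choice).

    genotypes: list of rows, one per person, each a list of variant
               allele counts (0, 1 or 2) across the variants in one gene.
    """
    n_people = len(genotypes)
    weights = []
    for j in range(len(genotypes[0])):
        copies = sum(row[j] for row in genotypes)
        # Pseudo-count keeps the frequency strictly between 0 and 1
        f = (copies + 1) / (2 * n_people + 2)
        weights.append(1 / math.sqrt(f * (1 - f)))
    return weights


def burden_scores(genotypes, weights=None):
    """Weighted number of variant copies carried by each person."""
    if weights is None:
        weights = [1.0] * len(genotypes[0])
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


gene = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
print(burden_scores(gene))  # unweighted counts per person
```

Weights reflecting prior guesses about which bases matter could be multiplied into the same vector.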

Omnidirectional burden tests

• ‘Loss of function’ is tricky
  – ion channel function is to open and close: which direction is loss of function?
  – Leiden variant in Factor V removes ability to be turned off: loss or gain of function?
• Would be nice to find important genes even if variants act both ways
• Hard: huge increase in dimension of problem
• Simple meta-analyses are no longer efficient.

Omnidirectional burden tests

• Typically based on correlation
  – do people with more similar genotypes have more similar phenotypes?
• Power is very low if there are many unimportant variants
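One simple statistic of this kind, offered as an assumption rather than as the method used in any particular package, sums the squared covariance between the centred phenotype and each variant. This equals the sum over all pairs of people of (phenotype residual x phenotype residual x genotype similarity), so it is large when people who share variants have residuals of the same sign, whichever direction each variant acts. The function names and the permutation p-value are illustrative choices.

```python
import random


def similarity_statistic(y, genotypes, weights=None):
    """Q = sum_j w_j * (sum_i G_ij * (y_i - mean(y)))^2 for one gene."""
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    mean_y = sum(y) / len(y)
    resid = [yi - mean_y for yi in y]
    q = 0.0
    for j in range(n_variants):
        s_j = sum(row[j] * r for row, r in zip(genotypes, resid))
        q += weights[j] * s_j ** 2  # squaring removes the direction of effect
    return q


def permutation_pvalue(y, genotypes, n_perm=999, seed=0):
    """Permutation p-value: shuffle phenotypes, recompute the statistic."""
    rng = random.Random(seed)
    observed = similarity_statistic(y, genotypes)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if similarity_statistic(shuffled, genotypes) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Two variants with opposite effects: a burden score would see nothing here
gene = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
trait = [2, 2, -2, -2, 0, 0]
print(similarity_statistic(trait, gene))
```

Every variant with no effect adds noise to the sum, which is one way to see the power problem in the last bullet.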

Third generation sequencing

• Pacific Biosciences: tethered polymerase copies single DNA molecule, with spotlight small enough to see just one base fluoresce

• Oxford Nanopore: drag a single DNA strand through a tiny hole and measure its shadow

• Ion Torrent: tethered polymerases copy one base at a time, read-out uses H+ ion released by adding the base.
