

Efficient Bayesian mixed model analysis increases association power in large cohorts

10/20/14: ASHG 2014

Po-Ru Loh Harvard School of Public Health

Method | Linear regression | Existing mixed model methods | New method: BOLT-LMM
Time | $O(MN)$ | $O(MN^2)$ | $\approx O(MN^{1.5})$
Corrects for confounding? |
Power |

Mixed model association is hot

• Linear mixed models (LMMs) for GWAS have attracted intense research interest in the past few years

• High-profile computational speedups:
– EMMAX (Kang et al. 2010 Nat Gen)
– P3D/TASSEL (Zhang et al. 2010 Nat Gen)
– FaST-LMM (Lippert et al. 2011 Nat Meth)
– GEMMA (Zhou & Stephens 2012 Nat Gen)
– GRAMMAR-Gamma (Svishcheva et al. 2012 Nat Gen)

• Extensions to standard LMM:
– Multiple loci (Segura et al. 2012 Nat Gen)
– Multiple traits (Korte et al. 2012 Nat Gen)
– Sparse modeling (Listgarten et al. 2013 Nat Gen)
– GCTA-LOCO (Yang et al. 2014 Nat Gen)

• See also: ASHG 2014 talk 169, Fusi & poster 1458S, Heckerman

Mixed model association corrects for confounding and increases power

Method | Linear regression | Existing mixed model methods
Time | $O(MN)$ | $O(MN^2)$
Corrects for confounding? |
Power |

M = # SNPs, N = # samples

New mixed model method (BOLT-LMM) increases speed, further increases power

Method | Linear regression | Existing mixed model methods | New method: BOLT-LMM
Time | $O(MN)$ | $O(MN^2)$ | $\approx O(MN^{1.5})$
Corrects for confounding? |
Power |
Models non-infinitesimal genetic architectures |

BOLT-LMM’s iterative algorithm increases speed; non-infinitesimal model increases power

Speed: New retrospective statistic allows rapid computation via iterative methods

$$\frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta^* (V^{-1} y)}$$

Power: Flexible Bayesian prior on SNP effect sizes models genetic architectures with large-effect loci

• Normal: “infinitesimal model”
• Non-normal (heavier tails): “non-infinitesimal model”

BOLT-LMM requires far less time and memory than existing methods

[Figure, panel a: CPU time (hrs) vs. N = # of samples (7,500 to 480,000), with $N^2$ and $N^{1.5}$ reference slopes. Methods: FaST-LMM-Select, GEMMA, EMMAX, GCTA-LOCO, BOLT-LMM.]

[Figure, panel b: Memory (GB) vs. N = # of samples (7,500 to 480,000), with $N^2$ and $N^1$ reference slopes. Methods: FaST-LMM-Select, GCTA-LOCO, EMMAX, GEMMA, BOLT-LMM.]

Figure annotations: 10x RAM, 100x CPU

BOLT-LMM can handle big data

Number of samples (N) | BOLT-LMM: Time, memory | Existing methods: Time, memory
60,000 | 5 hours, 4.6 GB | ~2 weeks, >60 GB
480,000 | 3 days, 35 GB | ~4 years, >3.5 TB

Benchmark data sets:
• Simulated genotypes generated from WTCCC2 samples, M = 300K SNPs
• Simulated phenotypes with $h_g^2 = 0.2$ SNP heritability
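As a rough consistency check on the memory figures quoted for this slide, the short sketch below extrapolates the N = 60,000 values to N = 480,000. It assumes linear memory growth for BOLT-LMM and quadratic growth for existing methods. The helper name `extrapolate` is hypothetical, and the ">60 GB" entry is treated as a lower bound.

```python
def extrapolate(value, n_from, n_to, exponent):
    """Scale a measured cost from n_from to n_to samples, assuming cost ~ N**exponent."""
    return value * (n_to / n_from) ** exponent

# Memory figures at N = 60,000 are the ones quoted on the slide.
bolt_mem_gb = extrapolate(4.6, 60_000, 480_000, exponent=1)       # assumed linear in N
existing_mem_gb = extrapolate(60.0, 60_000, 480_000, exponent=2)  # assumed quadratic in N

print(f"BOLT-LMM memory at N=480,000: about {bolt_mem_gb:.1f} GB")
print(f"Existing methods at N=480,000: more than {existing_mem_gb / 1000:.2f} TB")
```

Under these assumptions the extrapolated values (about 37 GB and more than 3.8 TB) are in line with the 35 GB and >3.5 TB reported for N = 480,000.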

BOLT-LMM increases power in simulations (while controlling false positives)

[Figure, three panels. Y-axis in each: mean standardized effect SNP $\chi^2$ (scale 2 to 10). Methods compared: BOLT-LMM, Standard LMM, PCA.]

[Panel a: x-axis Mcausal = # of causal SNPs (up to 10,000); h2causal = 0.5, N = 15633.]

[Panel b: x-axis h2causal (0.3, 0.5, 0.7); Mcausal = 2500, N = 15633.]

[Panel c: x-axis N = # of samples (5211, 15633); Mcausal = 2500, h2causal = 0.5.]

Data:
• Real genotypes from N=15,633 WTCCC2 samples
• Simulated phenotypes with varying numbers of causal SNPs + environmental stratification

Increased $\chi^2$ stats at causal SNPs = Increased effective sample size

BOLT-LMM increases association power in WGHS phenotypes

[Figure, panel a: Increase in $\chi^2$ statistics at known loci vs. PCA (%), scale 0 to 14, by phenotype: ApoB, LDL, HDL, Cholesterol, Triglycerides, Height, BMI, Systolic BP, Diastolic BP. Series: Standard LMM vs. PCA, BOLT-LMM vs. PCA.]

[Figure, panel b: Prediction $R^2$ (%), scale 0 to 14, for the same phenotypes. Series: PCA, Standard LMM, BOLT-LMM.]

Data: Women’s Genome Health Study, N=23,294 European-ancestry samples

How does BOLT-LMM increase power?

Under the hood, Part 1

Mixed models increase power by conditioning on polygenic effects

• Joint modeling reveals association!
• Example: Is phenotype $y$ associated with $x_1$? Hard to tell…
• How about if we also have $x_2$? Knowledge of $x_2$ reveals the association of $x_1$ through the joint model

Note: In the mixed model, $x_1$ (test SNP) is modeled as a fixed effect; $x_2$ is modeled as a random effect

Better joint model of phenotype => better conditioning => more power
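The conditioning effect in this example can be reproduced with a small simulation. The sketch below is a simplification, not the mixed model itself: both predictors enter an ordinary least-squares fit as fixed effects, and all effect sizes, sample sizes, and names (`chi2_for_first_column`, `marginal`, `joint`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.standard_normal(n)   # test SNP (hypothetical standardized genotype)
x2 = rng.standard_normal(n)   # second predictor standing in for polygenic effects
y = 0.1 * x1 + 1.0 * x2 + 0.3 * rng.standard_normal(n)

def chi2_for_first_column(design, y):
    """Squared t-statistic of the first column in an OLS fit of y on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = len(y) - design.shape[1]
    sigma2 = resid @ resid / dof                     # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)  # covariance of the estimates
    return beta[0] ** 2 / cov[0, 0]

ones = np.ones(n)  # intercept column
marginal = chi2_for_first_column(np.column_stack([x1, ones]), y)
joint = chi2_for_first_column(np.column_stack([x1, x2, ones]), y)

print(f"chi2 for x1 alone:    {marginal:.1f}")
print(f"chi2 for x1 given x2: {joint:.1f}")
```

Including $x_2$ removes most of the residual variance, so the same estimated effect of $x_1$ is tested against a much smaller standard error.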

Infinitesimal model
• Standard mixed model: all SNPs causal with normally distributed effect sizes:

$$\beta \sim N\left(0, \frac{h_g^2}{M}\right)$$

Non-infinitesimal model
• Reality: Only a small fraction of SNPs causal with larger effects
• BOLT-LMM: SNP effect sizes modeled with mixture of two Gaussians

Effect-size distributions: Normal (infinitesimal) vs. non-normal with heavier tails (non-infinitesimal)

ASHG 2014 poster 1767S, Vilhjalmsson
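To make the "heavier tails" contrast concrete, the sketch below draws SNP effect sizes from a single normal and from a mixture of two Gaussians with the same total variance, then compares excess kurtosis. The mixture weight `p`, the variance share `f2`, and the values of `M` and `h2g` are assumptions chosen for illustration, not parameters taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 200_000   # number of SNPs (hypothetical)
h2g = 0.2     # variance explained by all SNPs together (hypothetical)

# Infinitesimal model: every effect drawn from one normal, N(0, h2g / M).
beta_inf = rng.normal(0.0, np.sqrt(h2g / M), size=M)

# Mixture of two Gaussians: a fraction p of SNPs gets a much larger variance.
p = 0.02      # weight of the large-effect component (hypothetical)
f2 = 0.1      # share of total variance in the small-effect component (hypothetical)
var_large = h2g * (1 - f2) / (M * p)
var_small = h2g * f2 / (M * (1 - p))
is_large = rng.random(M) < p
beta_mix = np.where(
    is_large,
    rng.normal(0.0, np.sqrt(var_large), size=M),
    rng.normal(0.0, np.sqrt(var_small), size=M),
)

def excess_kurtosis(x):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

print(f"excess kurtosis, single normal: {excess_kurtosis(beta_inf):.2f}")
print(f"excess kurtosis, mixture:       {excess_kurtosis(beta_mix):.2f}")
```

Both draws have the same expected total variance, $h_g^2/M$ per SNP; only the shape differs, with the mixture placing far more mass on large effects.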

How does BOLT-LMM increase speed?

(And what exactly does it compute?)

Under the hood, Part 2

New retrospective statistic + algorithm increases speed (for infinitesimal LMM)

Idea 1:

$$\frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta^* (V^{-1} y)}$$

Retrospective mixed model assoc statistic
• Compute quasi-likelihood score test similar to MASTOR
– Jakobsdottir & McPeek 2013 AJHG
• Calibrate using approach similar to GRAMMAR-Gamma
– Svishcheva et al. 2012 NG

Fast iterative algorithm
• Replace expensive eigendecomposition with linear system solving
– Legarra & Misztal 2008, VanRaden 2008 J Dairy Sci
• No need to compute genetic relationship matrix (GRM) => save time, RAM
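The idea of replacing the eigendecomposition with linear system solving can be sketched as follows. Conjugate gradient is used here as one possible iterative solver, and the parameterization $V = \sigma_g^2 X X^T / M + \sigma_e^2 I$ is an assumption for illustration; the function name `solve_v_inverse_y` and the random stand-in genotypes are hypothetical.

```python
import numpy as np

def solve_v_inverse_y(X, y, var_g, var_e, tol=1e-8, max_iter=500):
    """Solve V u = y by conjugate gradient, with V = var_g * X X^T / M + var_e * I.

    V (and hence the GRM X X^T / M) is never formed: each product V @ v
    needs only two passes over X, which costs O(MN) time.
    """
    n, m = X.shape

    def apply_v(v):
        return var_g * (X @ (X.T @ v)) / m + var_e * v

    u = np.zeros(n)
    r = y - apply_v(u)   # residual of the linear system
    d = r.copy()         # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) <= tol * np.linalg.norm(y):
            break
        vd = apply_v(d)
        alpha = rs_old / (d @ vd)
        u += alpha * d
        r -= alpha * vd
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return u

rng = np.random.default_rng(2)
N, M = 300, 1000
X = rng.standard_normal((N, M))   # stand-in for normalized genotypes (hypothetical)
y = rng.standard_normal(N)
u = solve_v_inverse_y(X, y, var_g=0.5, var_e=0.5)   # approximates V^{-1} y
```

With `u` in hand, the numerator of the statistic for SNP m is `(X[:, m] @ u) ** 2`, so the solve is done once and reused for every SNP tested.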

Bayesian extension of linear mixed model (non-inf. model) increases power

Idea 2:

Retrospective mixed model assoc statistic:

$$\frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta^* (V^{-1} y)}$$

Retrospective mixed model assoc statistic using Bayesian prior:

$$\frac{(x_m^T y_{\mathrm{resid}})^2}{y_{\mathrm{resid}}^T \Theta^* y_{\mathrm{resid}}}$$

• Compute fast variational approximation of posteriors
– Meuwissen et al. 2009, Logsdon et al. 2010, Carbonetto & Stephens 2012
• Calibrate using LD score regression
– Bulik-Sullivan et al. bioRxiv (under revision, Nat Genet)

ASHG 2014 talk 351, Finucane & poster 1787T, Bulik-Sullivan

New mixed model method (BOLT-LMM) increases speed, further increases power

Method | Linear regression | Existing mixed model methods | New method: BOLT-LMM
Time | $O(MN)$ | $O(MN^2)$ | $\approx O(MN^{1.5})$
Corrects for confounding? |
Power |
Models non-infinitesimal genetic architectures |

Limitations and future directions

• Limitations
– Loss of power from case-control ascertainment (ASHG 2014 poster 1746S, Hayeck)
– Potential breakdown of approximations in family studies, rare variant tests, non-human data sets

• Future directions
– Fast heritability estimation
– Multiple variance components
– Multiple phenotypes

Acknowledgments

• Alkes Price
• Nick Patterson
• Bjarni Vilhjalmsson
• Hilary Finucane
• Sasha Gusev

• Daniel Chasman

• Brendan Bulik-Sullivan
• Benjamin Neale
• Rany Salem

• George Tucker
• Bonnie Berger

BOLT-LMM software: http://hsph.harvard.edu/alkes-price/software/ (v1.1 handles imputed SNPs)
Google “BOLT-LMM”
Loh et al. bioRxiv (under revision, Nat Genet)