Correlation and Large-Scale Simultaneous Significance Testing, Bradley Efron, 2007, JASA Stat 300C: Final Presentation Leonid Pekelis June 03, 2011


Page 1: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011

Correlation and Large-Scale Simultaneous Significance Testing, Bradley Efron, 2007, JASA

Stat 300C: Final Presentation

Leonid Pekelis

June 03, 2011


Main Points

- Correlation between test statistics can have varied effects on multiple hypothesis testing procedures, making it harder to trust FDR procedures that don't account for correlation.

- Under some assumptions, we can formalize a model that describes how correlations propagate to false discovery estimates.

- There is some evidence that this model describes how the world actually works (at least for microarrays).

- It is straightforward to adjust FDR procedures to account for such correlations.


Effect of Correlations

1. Breast Cancer study (BC), Hedenfalk et al. (2001): compared gene activity between groups of patients observed to have one of two different genetic mutations known to increase breast cancer risk, "BRCA1" or "BRCA2"
   - 7 BRCA1, 8 BRCA2, 15 patients total
   - N = 3225 genes measured

2. HIV study, van't Wout et al. (2003)
   - 4 HIV positive, 4 HIV negative controls
   - N = 7680 genes per microarray


Ensemble Distribution

$$z_i = \Phi^{-1}(G_0(t_i)) \sim N(0, 1), \qquad i = 1, 2, \ldots, N$$

$$z_i^{\mathrm{BC}} \sim N(-0.09,\ 1.55^2), \qquad z_i^{\mathrm{HIV}} \sim N(-0.11,\ 0.75^2)$$


Outline of the talk

1. Count vector model

1.1 Covariance of count vector under correlation

2. Poisson process model for counts

3. Numerical examples of model’s accuracy

4. Conditional FDR estimates

5. Numerical simulation comparing conditional to traditional FDR

6. Data example, NBA


Counts Model

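$K = 82$ bins of width $\Delta = 0.1$ from $-4.1$ to $4.1$, $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$

Count vector $\mathbf{y}$, $y_k = \#\{z_i \text{ in the } k\text{th bin}\}$

$$\pi_k(i) = P(z_i \in \mathcal{Z}_k), \qquad \pi_{k\cdot} = \sum_{i=1}^{N} \pi_k(i)/N \doteq \Delta\,\phi(z[k])$$

$$\gamma_{kl}(i,j) = P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l), \qquad \gamma_{kl\cdot} = \frac{\sum_{i \neq j} \gamma_{kl}(i,j)}{N(N-1)}$$

$$E(\mathbf{y}) = N\pi, \qquad \mathrm{Cov}(\mathbf{y}) = C_0 + C_1$$

$$C_0 = N\left(\mathrm{diag}(\pi) - \pi\pi'\right)$$

$$C_1 = N(N-1)\,\mathrm{diag}(\pi)\,\delta\,\mathrm{diag}(\pi), \qquad \delta_{kl} = \frac{\gamma_{kl\cdot}}{\pi_{k\cdot}\,\pi_{l\cdot}} - 1$$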
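As a rough illustration of the count vector and the independence term $C_0$, the sketch below bins simulated z-values and builds $N(\mathrm{diag}(\pi) - \pi\pi')$. It assumes that $z[k]$ is the midpoint of bin $k$ and uses independent $N(0,1)$ draws with a hypothetical $N = 3000$; neither choice comes from the slides.

```python
import math
import random

K, DELTA, LO = 82, 0.1, -4.1   # 82 bins of width 0.1 covering [-4.1, 4.1]

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Assumption: z[k] is the midpoint of bin k.
midpoints = [LO + (k + 0.5) * DELTA for k in range(K)]
pi = [DELTA * phi(m) for m in midpoints]          # pi_k ≈ Δ φ(z[k])

def count_vector(z_values):
    """y_k = number of z-values in bin k; values outside [-4.1, 4.1) are dropped."""
    y = [0] * K
    for z in z_values:
        k = math.floor((z - LO) / DELTA)
        if 0 <= k < K:
            y[k] += 1
    return y

def c0(n):
    """Independence (multinomial) covariance C0 = N (diag(pi) - pi pi')."""
    return [[n * ((pi[k] if k == l else 0.0) - pi[k] * pi[l]) for l in range(K)]
            for k in range(K)]

random.seed(0)
N = 3000                                          # hypothetical number of tests
z = [random.gauss(0.0, 1.0) for _ in range(N)]    # independent null z-values
y = count_vector(z)
C0 = c0(N)
print("total count:", sum(y), " sum of pi:", round(sum(pi), 5))
```

Under correlation, the observed counts vary more than `C0` alone predicts; that extra variability is what $C_1$ captures.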


Counts Model

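Further assume bivariate normality, $\mathrm{Cov}(z_i, z_j) = \rho_{ij}$.

$$\gamma_{kl}(i,j) = \int_{\mathcal{Z}_k}\int_{\mathcal{Z}_l} \psi_2(z_i, z_j, \rho_{ij})\,dz \doteq \frac{\Delta^2}{2\pi\sqrt{1-\rho_{ij}^2}}\exp\left(-\frac{1}{2}\,\frac{z[k]^2 - 2\rho_{ij}\,z[k]z[l] + z[l]^2}{1-\rho_{ij}^2}\right)$$

$$\delta_{kl} + 1 = \frac{\sum_{i \neq j} P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l)}{\sum_i P(z_i \in \mathcal{Z}_k)\,\sum_j P(z_j \in \mathcal{Z}_l)} \doteq \int_{-1}^{1} \frac{1}{\sqrt{1-\rho^2}}\exp\left(-\frac{\rho}{2(1-\rho^2)}\left(\rho z[k]^2 - 2z[k]z[l] + \rho z[l]^2\right)\right) g(\rho)\,d\rho$$

$$= \int_{-1}^{1} R_{kl}(\rho)\,g(\rho)\,d\rho$$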


Counts Model

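Suppose $\rho \sim (0, \alpha^2)$, with $\alpha^2 = \int_{-1}^{1} \rho^2 g(\rho)\,d\rho$. Then a second-order Taylor approximation of $R_{kl}(\rho)$ around $\rho = 0$ gives

$$\delta \doteq \alpha^2\,\mathbf{q}\mathbf{q}', \qquad q_k = (z[k]^2 - 1)/\sqrt{2}.$$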
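To make the Taylor step explicit, here is a short worked expansion, writing $x = z[k]$ and $y = z[l]$ and dropping terms of order $\rho^3$. It is a sketch that uses $\int g(\rho)\,d\rho = 1$ and $\int \rho\,g(\rho)\,d\rho = 0$, as implied by $\rho \sim (0, \alpha^2)$.

```latex
\begin{aligned}
R_{kl}(\rho) &= (1-\rho^2)^{-1/2}
  \exp\left(-\frac{\rho\,(\rho x^2 - 2xy + \rho y^2)}{2(1-\rho^2)}\right) \\
&\doteq \left(1 + \tfrac{\rho^2}{2}\right)
  \left(1 + \rho xy - \tfrac{\rho^2}{2}(x^2 + y^2) + \tfrac{\rho^2}{2}x^2y^2\right) \\
&\doteq 1 + \rho xy + \tfrac{\rho^2}{2}(x^2 - 1)(y^2 - 1), \\
\delta_{kl} &\doteq \int_{-1}^{1} R_{kl}(\rho)\,g(\rho)\,d\rho - 1
  \doteq \alpha^2\,\frac{x^2 - 1}{\sqrt{2}}\cdot\frac{y^2 - 1}{\sqrt{2}}
  = \alpha^2 q_k q_l .
\end{aligned}
```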

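Putting the previous results together (Theorem 1):

$$\mathrm{Cov}(\mathbf{y}) \doteq N\left(\mathrm{diag}(\pi) - \pi\pi'\right) + \frac{N(N-1)}{2}\,\alpha^2\,\mathbf{w}\mathbf{w}'$$

$$w_k = \Delta\,w(z[k]), \qquad w(z) = \phi''(z) = \phi(z)(z^2 - 1)$$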
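To see where the correlation term matters, the following sketch evaluates the diagonal of the Theorem 1 approximation bin by bin, split into the independence part $N\pi_k(1-\pi_k)$ and the correlation part $\tfrac{N(N-1)}{2}\alpha^2 w_k^2$. The values $N = 6000$ and $\alpha = 0.10$ are hypothetical, and $z[k]$ is assumed to be the bin midpoint.

```python
import math

K, DELTA, LO = 82, 0.1, -4.1

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def bin_variances(n, alpha):
    """Per-bin variance of y_k under the Theorem 1 approximation.

    Returns (z[k], independence term, correlation term) for each bin, where
    independence term = N pi_k (1 - pi_k) and
    correlation term  = N (N - 1) / 2 * alpha^2 * w_k^2.
    """
    rows = []
    for k in range(K):
        z = LO + (k + 0.5) * DELTA            # assumed bin midpoint z[k]
        pi_k = DELTA * phi(z)
        w_k = DELTA * phi(z) * (z * z - 1.0)  # w_k = Δ φ''(z[k])
        indep = n * pi_k * (1.0 - pi_k)
        corr = 0.5 * n * (n - 1) * alpha ** 2 * w_k ** 2
        rows.append((z, indep, corr))
    return rows

rows = bin_variances(n=6000, alpha=0.10)      # hypothetical N and alpha
for z, indep, corr in rows[40:62:5]:
    print(f"z[k]={z:+.2f}  independence={indep:8.2f}  correlation={corr:8.2f}")
```

Because $w(z) = \phi(z)(z^2-1)$, the correlation part vanishes at $|z| = 1$ and has local maxima at $z = 0$ and $|z| = \sqrt{3}$, so correlation mostly widens or narrows the histogram rather than shifting it.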


Poisson Model

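Suppose $\mathbf{y}\,|\,\mathbf{u} \sim \mathrm{Po}(\mathbf{u})$, $\mathbf{u} \sim (\mathbf{v}, \Gamma)$; this requires $N \sim \mathrm{Po}(N_0)$.

This simplifies the covariance to

$$\mathrm{Cov}(\mathbf{y}) \doteq N\,\mathrm{diag}(\pi) + \frac{N^2}{2}\,\alpha^2\,\mathbf{w}\mathbf{w}'.$$

Matching with $\mathbf{y} \sim (\mathbf{v}, \mathrm{diag}(\mathbf{v}) + \Gamma)$ gives

$$\mathbf{y} \sim \mathrm{Po}\left(N\pi + A\,\frac{N}{\sqrt{2}}\,\mathbf{w}\right), \qquad A \sim (0, \alpha^2).$$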
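A minimal simulation of this Poisson model might look like the sketch below. Several choices are assumptions rather than part of the slides: $A$ is drawn from a normal distribution (only its mean and variance are specified), negative means are clipped to zero, $z[k]$ is the bin midpoint, and $N = 3000$, $\alpha = 0.10$ are hypothetical.

```python
import math
import random

K, DELTA, LO = 82, 0.1, -4.1

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def poisson_means(n, a):
    """u_k = N pi_k + A N w_k / sqrt(2); clipping at 0 is an added safeguard."""
    u = []
    for k in range(K):
        z = LO + (k + 0.5) * DELTA               # assumed bin midpoint z[k]
        pi_k = DELTA * phi(z)
        w_k = DELTA * phi(z) * (z * z - 1.0)
        u.append(max(0.0, n * pi_k + a * n * w_k / math.sqrt(2.0)))
    return u

def draw_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(n, alpha, rng):
    """Draw A, then y_k ~ Po(u_k) independently given A."""
    a = rng.gauss(0.0, alpha)                    # assumption: A is normal
    return a, [draw_poisson(m, rng) for m in poisson_means(n, a)]

rng = random.Random(1)
a, y = simulate_counts(n=3000, alpha=0.10, rng=rng)
print(f"A = {a:+.3f}, total count = {sum(y)}")
```

A positive draw of $A$ moves expected counts from the center of the histogram to the tails (a wider null), and a negative draw does the opposite.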

Numerical Examples, α = 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50

[Figures: one slide per value of α]

Numerical Examples

α:      0.05    0.10    0.15    0.20    0.25    0.30    0.35    0.40    0.45    0.50

C1      0.9958  0.9925  0.9828  0.9657  0.9291  0.8679  0.8085  0.7758  0.7748  0.8081
Cnorm   0.1007  0.2776  0.4582  0.5962  0.6794  0.7059  0.6996  0.6938  0.7043  0.7390
Cpois   0.1074  0.2790  0.4563  0.5931  0.6765  0.7036  0.6978  0.6923  0.7028  0.7374

Table: Proportion of total variance explained by first eigenvector, as a function of α.
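The computations behind the table are not shown on the slide. As a rough illustration of the quantity being tabulated, the sketch below applies power iteration to the Theorem 1 approximation of $\mathrm{Cov}(\mathbf{y})$ and reports the largest eigenvalue as a share of the trace. The value $N = 6000$ and the use of bin midpoints for $z[k]$ are assumptions, so the output is not expected to match the table.

```python
import math

K, DELTA, LO = 82, 0.1, -4.1

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def theorem1_cov(n, alpha):
    """N (diag(pi) - pi pi') + N (N - 1) / 2 * alpha^2 * w w'."""
    mids = [LO + (k + 0.5) * DELTA for k in range(K)]   # assumed z[k]
    pi = [DELTA * phi(z) for z in mids]
    w = [DELTA * phi(z) * (z * z - 1.0) for z in mids]
    c = 0.5 * n * (n - 1) * alpha ** 2
    return [[n * ((pi[k] if k == l else 0.0) - pi[k] * pi[l]) + c * w[k] * w[l]
             for l in range(K)] for k in range(K)]

def top_eigen_share(cov, iters=200):
    """Largest eigenvalue (via power iteration) divided by the trace."""
    m = len(cov)
    # Symmetric quadratic start vector, chosen so it is not orthogonal to w.
    v = [(k - (m - 1) / 2.0) ** 2 + 1.0 for k in range(m)]
    for _ in range(iters):
        v = [sum(row[j] * v[j] for j in range(m)) for row in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    cv = [sum(row[j] * v[j] for j in range(m)) for row in cov]
    top = sum(a * b for a, b in zip(v, cv))             # Rayleigh quotient
    return top / sum(cov[k][k] for k in range(m))

shares = {}
for alpha in (0.05, 0.10, 0.20):
    shares[alpha] = top_eigen_share(theorem1_cov(n=6000, alpha=alpha))
    print(f"alpha={alpha:.2f}  share of variance in first eigenvector={shares[alpha]:.4f}")
```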


Conditional FDR

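Given $A$, can approximate

$$\mathbf{u} = N\pi + A\,\frac{N}{\sqrt{2}}\,\mathbf{w} \doteq N\Delta f_A(z[k]), \qquad f_A(z) = \phi(z)\left(1 + A\,q(z)\right).$$

Matching moments, can approximate, at $x = z[k]$,

$$u_k \doteq N\Delta\,\frac{1}{\sigma_A}\,\psi(x/\sigma_A), \qquad \sigma_A^2 = 1 + \sqrt{2}A.$$

I took the 2nd term in the Edgeworth expansion,

$$f_A(x) \doteq \frac{1}{\sigma_A}\,\psi(x/\sigma_A)\left(1 + \frac{\mu_4 - 3\sigma^4}{24\sigma^4}\,H_4(x)\right).$$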
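The moment match can be checked numerically. The sketch below integrates $f_A(z) = \phi(z)(1 + A q(z))$ with $q(z) = (z^2 - 1)/\sqrt{2}$ by the midpoint rule and compares the second moment with $1 + \sqrt{2}A$. The integration range and step count are arbitrary choices made for this illustration.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def f_A(z, a):
    """Conditional density phi(z) * (1 + A q(z)) with q(z) = (z^2 - 1) / sqrt(2)."""
    return phi(z) * (1.0 + a * (z * z - 1.0) / math.sqrt(2.0))

def moment(a, power, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of z^power * f_A(z)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        total += (z ** power) * f_A(z, a)
    return total * h

for a in (-0.1, 0.0, 0.1, 0.3):
    print(f"A={a:+.1f}  mass={moment(a, 0):.4f}  "
          f"variance={moment(a, 2):.4f}  1+sqrt(2)A={1 + math.sqrt(2) * a:.4f}")
```

Since $f_A$ is symmetric about zero, its mean is zero and the second moment equals the variance $\sigma_A^2$.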


Conditional FDR

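Use a GLM to fit $y_k \sim \mathrm{Po}\left(e^{\beta_0 + \beta_1 z[k] + \beta_2 z[k]^2}\right)$ for $k \in K_0$.

Using the normal approximation for $u_k$, with $p_0$ the proportion of nulls, gives $E(y_k) = p_0 u_k$, hence

$$\hat\sigma_A = (-2\hat\beta_2)^{-1/2}.$$

Estimate $p_0$ by $\hat p_0 = \hat P_0 / P_0(\hat\sigma_A)$, where $P_0(\sigma) = 2\Phi(x_0;\sigma) - 1$ and $\hat P_0 = Y_0/N$.

$$\mathrm{Fdr}(x\,|\,\hat\sigma_A) = N\hat p_0\,\bar\Phi(x;\hat\sigma_A)/T(x)$$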
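The steps on this slide can be sketched end to end as follows. The sketch fills in several details the slide leaves open, all of them assumptions: $K_0$ is taken to be the bins inside $[-x_0, x_0]$ with $x_0 = 1$, $Y_0$ is the number of z-values in that interval, $T(x)$ is the number of z-values at or above $x$, $\Phi(\cdot;\sigma)$ is the $N(0, \sigma^2)$ CDF, and the Poisson GLM is fitted with a small hand-written Newton-Raphson routine rather than a statistics package. The simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for j in range(c, 4):
                m[r][j] -= f * m[c][j]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][j] * x[j] for j in range(r + 1, 3))) / m[r][r]
    return x

def fit_quadratic_poisson(zs, ys, iters=25):
    """Newton-Raphson for the model log E(y_k) = b0 + b1 z[k] + b2 z[k]^2."""
    beta = [math.log(max(sum(ys) / len(ys), 1e-8)), 0.0, 0.0]
    for _ in range(iters):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for z, y in zip(zs, ys):
            x = (1.0, z, z * z)
            mu = math.exp(sum(b * xi for b, xi in zip(beta, x)))
            for i in range(3):
                grad[i] += (y - mu) * x[i]
                for j in range(3):
                    info[i][j] += mu * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve3(info, grad))]
    return beta

def conditional_fdr(z_values, x, x0=1.0, delta=0.1):
    """Return (sigma_hat, p0_hat, Fdr(x | sigma_hat)) from the central counts."""
    n = len(z_values)
    nb = round(2 * x0 / delta)
    mids = [-x0 + (k + 0.5) * delta for k in range(nb)]
    counts = [0] * nb
    for z in z_values:
        k = math.floor((z + x0) / delta)
        if 0 <= k < nb:
            counts[k] += 1
    b2 = fit_quadratic_poisson(mids, counts)[2]
    if b2 >= 0:
        raise ValueError("central fit is not concave; sigma_hat is undefined")
    sigma = (-2.0 * b2) ** -0.5
    null = NormalDist(0.0, sigma)
    p0_hat = (sum(counts) / n) / (2.0 * null.cdf(x0) - 1.0)   # P0_hat / P0(sigma_hat)
    t_x = max(1, sum(1 for z in z_values if z >= x))          # T(x), floored at 1
    return sigma, p0_hat, n * p0_hat * (1.0 - null.cdf(x)) / t_x

rng = random.Random(0)
z = [rng.gauss(0.0, 1.5) for _ in range(19000)] + [rng.gauss(5.0, 1.0) for _ in range(1000)]
sigma_hat, p0_hat, fdr = conditional_fdr(z, x=4.0)
print(f"sigma_hat={sigma_hat:.3f}  p0_hat={p0_hat:.3f}  Fdr(4 | sigma_hat)={fdr:.3f}")
```

Note that $\hat p_0$ is not capped at 1 here, and the estimate of $\hat\sigma_A$ from $\hat\beta_2$ can be noisy when the central bins hold few counts.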


Simulation


Data Example: NBA

1. Which professional basketball players can really be called exceptional?

2. Data from http://www.databasebasketball.com/

3. 1946-2009, stats on every player, each year, ≈ 22,000 entries

4. Will focus on ppm = (points scored in season) / (minutes played in season)

5. Idea: get a z-value for each player, apply the BH procedure to determine non-null players

6. Can hypothesise there is some correlation between players' ppm scores.

7. Cleaned data (year > 1950, minutes ≥ 10)


Data Example: NBA

- Detrend: year effect, shot clock (1954), 3 pointer (1979), center

- Aggregate years by players, keep only careers ≥ 5 years

- Gives N = 1535 players

- Calculate $t_k = \dfrac{\sum_{i=1}^{c_k} \mathrm{ppm}_i / c_k}{SE}$, where $c_k$ is career length

- Convert to z values, $z_k = \Phi^{-1}(T_{c_k - 1}(t_k))$
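The last two steps can be sketched as below. The sketch reads $T_{c_k-1}$ as the Student-t CDF with $c_k - 1$ degrees of freedom and takes $SE$ to be the sample standard deviation of a player's detrended seasonal ppm divided by $\sqrt{c_k}$; both readings are assumptions, and the seasonal values in the example are made up. The t CDF is computed by Simpson's rule to keep the example free of third-party dependencies.

```python
import math
from statistics import NormalDist, mean, stdev

def t_cdf(t, df, steps=2000):
    """Student-t CDF, integrating the density over [0, |t|] with Simpson's rule."""
    if t == 0.0:
        return 0.5
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - 0.5 * (df + 1) * math.log1p(x * x / df))

    h = abs(t) / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + math.copysign(total * h / 3.0, t)

def career_z(ppm_by_season):
    """t_k = mean / SE, then z_k = Phi^{-1}(T_{c_k - 1}(t_k))."""
    c = len(ppm_by_season)
    se = stdev(ppm_by_season) / math.sqrt(c)            # assumed definition of SE
    t = mean(ppm_by_season) / se
    p = min(max(t_cdf(t, c - 1), 1e-15), 1 - 1e-15)     # keep p strictly inside (0, 1)
    return t, NormalDist().inv_cdf(p)

# Hypothetical detrended ppm values for one six-season career.
t_k, z_k = career_z([0.05, 0.08, 0.02, 0.07, 0.06, 0.04])
print(f"t_k = {t_k:.3f}, z_k = {z_k:.3f}")
```

Because the t distribution has heavier tails than the normal, $|z_k|$ is smaller than $|t_k|$, most noticeably for short careers.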


Data Example: NBA

Max = 6.74 (Kareem Abdul-Jabbar, ’69-’89), Min = -6.43 (E.C. Coleman, ’94-’00)

Wilt Chamberlain (’59-’72) = 3.31, Michael Jordan (’84-’02) = 6.49


Data Example: NBA

- Naive BH(.10) procedure gives 891 rejections

- Estimate correlation from the central spread with a Poisson GLM: $z_{\mathrm{null}} \sim N(0, 2^2)$

- Trying BH(.10) with the correlated null gives 1 rejection

- Third approach: estimate $\hat p_0 = \hat P_0 / P_0(1.92) \approx 0.588$, $\hat P_0 = Y_0(1)/N$

- Conditional Fdr estimates: Fdr(naive|2) = .347, Fdr(cor|2) = 0.673

- Both > .10!

- $x^* = \arg\max\{x : \mathrm{Fdr}(x|2) \le 0.10\}$ gives 36 rejections

- Actually used $x^* = \arg\min \mathrm{Fdr}(x|2)$, since $\min \mathrm{Fdr}(x|2) = .121 > .10$.
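The gap between the naive and correlated-null BH results comes entirely from the null standard deviation used to compute p-values. The sketch below runs the Benjamini-Hochberg step-up rule under $N(0, \sigma_0^2)$ for two values of $\sigma_0$. It assumes two-sided p-values, and the z-values are a synthetic grid plus three planted outliers, not the NBA data.

```python
from statistics import NormalDist

def bh_rejections(z_values, q=0.10, null_sd=1.0):
    """Number of BH(q) rejections with two-sided p-values under N(0, null_sd^2)."""
    null = NormalDist(0.0, null_sd)
    pvals = sorted(2.0 * (1.0 - null.cdf(abs(z))) for z in z_values)
    n = len(pvals)
    k_max = 0
    for k, p in enumerate(pvals, start=1):
        if p <= q * k / n:
            k_max = k                      # largest k with p_(k) <= q k / n
    return k_max

# Synthetic example: 1000 z-values spread like N(0, 2^2), plus three large outliers.
wide_null = NormalDist(0.0, 2.0)
z = [wide_null.inv_cdf((i + 0.5) / 1000) for i in range(1000)] + [9.0, -9.5, 10.0]

naive = bh_rejections(z, q=0.10, null_sd=1.0)       # theoretical N(0, 1) null
adjusted = bh_rejections(z, q=0.10, null_sd=2.0)    # wider, correlation-adjusted null
print(f"naive BH rejections: {naive}, with N(0, 2^2) null: {adjusted}")
```

With the wider null only the planted outliers survive, while the theoretical null rejects a large share of values that are consistent with the wider central spread.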


Data Example: NBA

Theoretical Null Dist $N(0, 1)$, Correlated Null Dist $N(0, 2^2)$


Data Example: NBA (Best Players)

[1] "Kareem , Abdul-jabbar" "Tim , Duncan" "Shaquille , O'neal"
[4] "Michael , Jordan" "Karl , Malone" "Julius , Erving"
[7] "Walter , Davis" "Glenn , Robinson" "Jerry , West"
[10] "Dominique , Wilkins" "Tim , Thomas" "Calvin , Murphy"
[13] "Bob , Pettit" "Eddie , Johnson" "Sam , Cassell"
[16] "James , Worthy" "George , Gervin" "John , Drew"
[19] "Allen , Iverson" "Dan , Issel"


Data Example: NBA (Worst Players)

[1] "Charles , Jones" "Tree , Rollins" "Ben , Wallace"
[4] "Nate , Mcmillan" "Greg , Kite" "Manute , Bol"
[7] "Harvey , Catchings" "Paul , Mokeski" "Don , Buse"
[10] "Adonal , Foyle" "Kurt , Rambis" "Bo , Outlaw"
[13] "Matt , Guokas" "Bruce , Bowen" "George , Johnson"
[16] "Chris , Dudley"