
Dimension Reduction Methods with Application to High-dimensional Data with a Censored Response

Tuan Nguyen
under the supervision of Javier Rojo

Pan-American Statistics Institute
May 1, 2010

Source: jrojo/PASI/lectures/Tuan.pdf

SMALL n, LARGE p PROBLEM

- Dimension reduction
- Motivation: survival analysis using microarray gene expression data
  - few patients, but 10-30K gene expression levels per patient
  - patients' survival times
- Predict patient survival taking into account microarray gene expression data

OUTLINE

- Microarray Data
- Dimension Reduction Methods
  - Random Projection
  - Rank-based Partial Least Squares
- Application to microarray data with censored response
- Conclusions

OUTLINE: OUR CONTRIBUTIONS

Dimension Reduction Methods
- Random Projection
  - Improvements on the lower bound for k from the Johnson-Lindenstrauss (JL) Lemma
    - L2-L2: Gaussian and Achlioptas random matrices
    - L2-L1: Gaussian and Achlioptas random matrices
- Variant of Partial Least Squares (PLS): Rank-based PLS
  - insensitive to outliers
  - weight vectors as the solution to an optimization problem

OUTLINE

- Microarray Data
- Dimension Reduction Methods
- Small application
- Conclusions

DNA MICROARRAY

- Traditional methods: one gene, one experiment
- DNA microarray: thousands of genes in a single experiment
  - interactions among the genes
  - identification of gene sequences
  - expression levels of genes

WHAT IS A DNA MICROARRAY?

- Medium for matching known and unknown DNA samples
- Goal: derive an expression level for each gene (abundance of mRNA)

[Figure: Oligonucleotide array]

OLIGONUCLEOTIDE MICROARRAY

- GeneChip (Affymetrix)
  - ~30,000 sample spots
  - each gene is represented by more than one spot
  - thousands of genes on the array
- Hybridization principle
  - glowing spots: gene expressed

[Figure: Oligonucleotide microarray]

OLIGONUCLEOTIDE MICROARRAY

- Intensity of each spot is scanned
- Expression level for each gene = total expression across all its spots
- Multiple samples: one array per sample
- Matrix of gene expression levels:

         X11  X12  ...  X1p
    X =  X21  X22  ...  X2p
         ...  ...  ...  ...
         XN1  XN2  ...  XNp

SOME OTHER ARRAYS

- One flavor in the past: expression arrays
- Innovation: many flavors
  - SNPs: single nucleotide polymorphisms within or between populations
  - Protein: protein-protein, protein-DNA/RNA, and protein-drug interactions
  - TMAs: comparative studies between tissue samples
  - Exon: alternative splicing and gene expression

APPLICATIONS OF MICROARRAYS

- Gene discovery and disease diagnosis
- Functions of new genes
- Inter-relationships among the genes
- Identify genes involved in the development of diseases
- Analyses: gene selection, classification, clustering, prediction

DIFFICULTIES OF MICROARRAY DATA

- thousands of genes, few samples (N << p)
- Survival information
  - Observe the triple (X, T, δ)
  - X = (x1, ..., xp)' gene expression data matrix
  - Ti = min(yi, ci) observed survival times (i = 1, ..., N)
  - δi = I(yi ≤ ci) censoring indicators

ANALYZING MICROARRAY DATA

Two-stage procedure:

1) Dimension reduction:

     M_{N×k} = X_{N×p} W_{p×k},   k < N << p

2) Regression model

OUTLINE: OUR CONTRIBUTIONS

Dimension Reduction Methods
- Random Projection
  - Improvements on the lower bound for k from the Johnson-Lindenstrauss (JL) Lemma
    - L2-L2: Gaussian and Achlioptas random matrices
    - L2-L1: Gaussian and Achlioptas random matrices
- Variant of Partial Least Squares (PLS): Rank-based PLS

RANDOM PROJECTION (RP)

The original matrix X is projected onto M by a random matrix Γ:

     M_{N×k} = X_{N×p} Γ_{p×k}     (1)

- Johnson-Lindenstrauss Lemma (1984)
  - preserves pairwise distances among the points (within 1 ± ε)
  - k cannot be too small


RP: JOHNSON-LINDENSTRAUSS (JL) LEMMA (1984)

Johnson-Lindenstrauss Lemma: For any 0 < ε < 1 and integer n, let k = O(ln n / ε²). For any set V of n points in R^p, there is a linear map f: R^p → R^k such that for any u, v ∈ V,

     (1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||².

- n points in p-dimensional space can be projected onto a k-dimensional space such that the pairwise distance between any 2 points is preserved within (1 ± ε).
- Euclidean distance in both the original and the projected spaces (L2-L2 projection)
- f is linear, but not specified
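The guarantee is easy to watch in action. A minimal sketch in Python (assuming the Gaussian choice of f that later slides make concrete; `gaussian_projection` is an illustrative helper name, not from the slides):

```python
import math
import random

random.seed(0)

def gaussian_projection(p, k):
    """A p x k matrix with i.i.d. N(0, 1) entries, scaled by 1/sqrt(k)."""
    return [[random.gauss(0, 1) / math.sqrt(k) for _ in range(k)] for _ in range(p)]

def project(x, G):
    """Row vector x (length p) times the p x k matrix G."""
    k = len(G[0])
    return [sum(x[i] * G[i][j] for i in range(len(x))) for j in range(k)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

p, k, n = 500, 300, 4
points = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
G = gaussian_projection(p, k)
images = [project(x, G) for x in points]

# each squared pairwise distance should be preserved within roughly (1 +/- eps)
ratios = [sq_dist(images[a], images[b]) / sq_dist(points[a], points[b])
          for a in range(n) for b in range(a + 1, n)]
```

With k = 300 every ratio typically falls well inside (1 ± 0.3); shrinking k widens the spread, which is the sense in which k cannot be too small.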

JL LEMMA: IMPROVEMENT ON THE LOWER BOUND FOR k

For x = u − v ∈ R^p:

- L2 distance: ||x|| = (Σ_{j=1}^p x_j²)^{1/2}
- L1 distance: ||x||_1 = Σ_{j=1}^p |x_j|

Γ is of dimension p by k.

- Frankl and Maehara (1988):

     k ≥ ⌈27 ln n / (3ε² − 2ε³)⌉ + 1     (2)

- Indyk and Motwani (1998):
  - entries of Γ are i.i.d. N(0, 1)

JL LEMMA: IMPROVEMENT ON THE LOWER BOUND FOR k

Dasgupta and Gupta (2003), mgf technique: For any 0 < ε < 1 and integer n, let k be such that

     k ≥ 24 ln n / (3ε² − 2ε³).     (3)

For any set V of n points in R^p, there is a linear map f: R^p → R^k such that for any u, v ∈ V,

     P[(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²] ≥ 1 − 2/n².

- f(x) = (1/√k) x Γ, where x = u − v
- Gaussian random matrix: k ||f(x)||² / ||x||² ~ χ²_k
- best available lower bound for k
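Plugging numbers into (3) shows how modest the requirement is in n and how demanding it is in ε. A small helper (hypothetical name, stdlib only):

```python
import math

def dg_lower_bound(n, eps):
    """Smallest integer k with k >= 24 ln n / (3 eps^2 - 2 eps^3), Eq. (3)."""
    return math.ceil(24 * math.log(n) / (3 * eps ** 2 - 2 * eps ** 3))

# k grows only like ln n, but roughly like 1/eps^2 as the tolerance shrinks
```

For n = 10 this gives k = 1974 at ε = 0.1 but only k = 256 at ε = 0.3.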

JL LEMMA: DASGUPTA AND GUPTA

Considering the tail probabilities separately:

     P[||f(x)||² ≥ (1 + ε) ||x||²] ≤ 1/n²,     (4)
     P[||f(x)||² ≤ (1 − ε) ||x||²] ≤ 1/n².     (5)

- f(x) = (1/√k) x Γ, where x = u − v

     k ||f(x)||² / ||x||² ~ χ²_k     (6)

- use the mgf technique to bound the tail probabilities

JL LEMMA: DASGUPTA AND GUPTA

Sketch of proof:

- define f(x) = (1/√k) x Γ, and y = √k f(x)/||x||
- then y_j = x r_j / ||x|| ~ N(0, 1) and y_j² ~ χ²_1, with E(||y||²) = k
- let α1 = k(1 + ε) and α2 = k(1 − ε)
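The distributional claim (the y_j are i.i.d. N(0,1), so k ||f(x)||²/||x||² ~ χ²_k with mean k) can be checked by simulation. A small Monte Carlo sketch, assuming a Gaussian Γ:

```python
import math
import random

random.seed(1)

p, k = 100, 30
x = [random.gauss(0, 1) for _ in range(p)]
norm_x = math.sqrt(sum(xi * xi for xi in x))

def chi_square_sample():
    """Draw a fresh Gaussian Gamma and return k * ||f(x)||^2 / ||x||^2."""
    total = 0.0
    for _ in range(k):
        # y_j = x r_j / ||x|| for one N(0, 1) column r_j of Gamma
        y_j = sum(xi * random.gauss(0, 1) for xi in x) / norm_x
        total += y_j * y_j  # y_j^2 ~ chi^2_1
    return total

samples = [chi_square_sample() for _ in range(300)]
mean = sum(samples) / len(samples)
```

The empirical mean sits near k = 30, matching E(||y||²) = k.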

JL LEMMA: DASGUPTA AND GUPTA

Sketch of proof (continued):

- Right-tail probability:

     P[||f(x)||² ≥ (1 + ε) ||x||²] = P[||y||² ≥ α1]
                                   ≤ (e^{−s(1+ε)} E(e^{s y_j²}))^k,  s > 0
                                   = e^{−s α1} (1 − 2s)^{−k/2},  s ∈ (0, 1/2).

- Left-tail probability:

     P[||f(x)||² ≤ (1 − ε) ||x||²] = P[||y||² ≤ α2]
                                   ≤ (e^{s(1−ε)} E(e^{−s y_j²}))^k,  s > 0
                                   = e^{s α2} (1 + 2s)^{−k/2}
                                   ≤ e^{−s α1} (1 − 2s)^{−k/2},  s ∈ (0, 1/2).

JL LEMMA: DASGUPTA AND GUPTA

Sketch of proof (continued):

- Minimize e^{−s(1+ε)} (1 − 2s)^{−1/2} with respect to s; the minimizer is s* = ε/(2(1 + ε)), giving

     P[||y||² ≥ α1] ≤ exp(−(k/2)(ε − ln(1 + ε)))
                    ≤ exp(−(k/12)(3ε² − 2ε³))     (7)

- if k ≥ 24 ln n / (3ε² − 2ε³), then

     P[||f(x)||² ≥ (1 + ε) ||x||²] ≤ 1/n²  and  P[||f(x)||² ≤ (1 − ε) ||x||²] ≤ 1/n²

JL LEMMA

Projection of all (n choose 2) pairs of distinct points:

- the results so far preserve the distance between a single pair of points
- the lower bound for the probability was chosen to be 1 − 2/n²
- Interest: simultaneously preserve the distances among all (n choose 2) pairs of distinct points

JL LEMMA

Projection of all (n choose 2) pairs of distinct points:

     P[ ∩_{u,v ∈ V, u ≠ v} { (1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||² } ]     (8)
       ≥ 1 − Σ_{u,v ∈ V, u ≠ v} P[ { (1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||² }^c ]
       ≥ 1 − (n choose 2) (2/n²)
       = 1/n.

- Introduce β > 0 (Achlioptas (2001)) so that for each pair u, v ∈ V,

     P[(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²] ≥ 1 − 2/n^{2+β}.

- the probability in Eq. (8) is then bounded from below by 1 − 1/n^β
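The arithmetic behind the union bound is worth spelling out: (n choose 2) pairs, each failing with probability at most 2/n^(2+β), give a total failure probability of at most (n − 1)/n^(1+β) ≤ 1/n^β. A quick check (hypothetical helper name):

```python
import math

def success_lower_bound(n, beta):
    """Union-bound guarantee over all C(n, 2) pairs, per-pair failure 2 / n^(2 + beta)."""
    return 1 - math.comb(n, 2) * 2 / n ** (2 + beta)

# beta = 0 recovers the "= 1/n" value in Eq. (8); beta > 0 pushes the
# guarantee up to at least 1 - 1/n^beta
```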

IMPROVEMENT ON THE DASGUPTA AND GUPTA BOUND

Rojo and Nguyen (2009a) (NR1 Bound): k is the smallest even integer satisfying

     ((1 + ε)/ε) g(k, ε) ≤ 1/n^{2+β}     (9)

- g(k, ε) = e^{−λ1} λ1^{d−1} / (d − 1)! is decreasing in k
- λ1 = k(1 + ε)/2 and d = k/2
- Gaussian random matrix
- works directly with the exact distribution of k ||f(x)||² / ||x||², where x = u − v
- 12%-34% improvement
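Since g(k, ε) is decreasing in k, the smallest even k satisfying (9) can be found by direct search. A sketch that evaluates g in log space via `math.lgamma` (an implementation detail assumed here, not from the slides):

```python
import math

def nr1_ok(k, n, eps, beta):
    """Does ((1 + eps)/eps) * g(k, eps) <= 1 / n^(2 + beta) hold?  (Eq. (9))"""
    d = k // 2
    lam1 = k * (1 + eps) / 2
    # log of g(k, eps) = exp(-lam1) * lam1^(d-1) / (d-1)!;  lgamma(d) = log((d-1)!)
    log_g = -lam1 + (d - 1) * math.log(lam1) - math.lgamma(d)
    return math.log((1 + eps) / eps) + log_g <= -(2 + beta) * math.log(n)

def nr1_bound(n, eps, beta):
    """Smallest even k satisfying Eq. (9)."""
    k = 2
    while not nr1_ok(k, n, eps, beta):
        k += 2
    return k
```

For n = 10, ε = 0.3, β = 1 the search lands near the tabulated NR1 value of 254 on the comparison slide, well below the corresponding DG value of 384.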

OUTLINE OF PROOF

- Gamma-Poisson relationship: suppose X ~ Gamma(d, 1), for d = 1, 2, 3, ..., and Y ~ Poisson(x); then

     P(X ≥ x) = ∫_x^∞ (1/Γ(d)) z^{d−1} e^{−z} dz = Σ_{y=0}^{d−1} x^y e^{−x} / y! = P(Y ≤ d − 1)     (10)

- ||y||² = Σ_{j=1}^k y_j² ~ χ²_k
- let d = k/2, α1 = k(1 + ε), and α2 = k(1 − ε); then:
- Right-tail probability:

     P[||y||² ≥ α1] = ∫_{α1/2}^∞ (1/Γ(d)) z^{d−1} e^{−z} dz = Σ_{y=0}^{d−1} (α1/2)^y e^{−α1/2} / y!

- Left-tail probability:

     P[||y||² ≤ α2] = ∫_0^{α2/2} (1/Γ(d)) z^{d−1} e^{−z} dz = Σ_{y=d}^∞ (α2/2)^y e^{−α2/2} / y!

OUTLINE OF PROOF

Theorem 1: Given d a positive integer:

a) Suppose 1 ≤ d < λ1; then

     Σ_{y=0}^{d−1} λ1^y / y! ≤ (λ1/(λ1 − d)) (λ1^{d−1}/(d − 1)!)     (11)

b) Suppose 0 < λ2 < d; then

     Σ_{y=d}^∞ λ2^y / y! ≤ (λ2/(d − λ2)) (λ2^{d−1}/(d − 1)!)     (12)

IMPROVEMENT ON THE DASGUPTA AND GUPTA BOUND

Right-tail probability:

     P[||y||² ≥ α1] = e^{−λ1} Σ_{y=0}^{d−1} λ1^y / y!
                    ≤ ((1 + ε)/ε) (λ1^{d−1}/(d − 1)!) e^{−λ1}     (13)

Left-tail probability:

     P[||y||² ≤ α2] = e^{−λ2} Σ_{y=d}^∞ λ2^y / y!
                    ≤ ((1 − ε)/ε) (λ2^{d−1}/(d − 1)!) e^{−λ2}
                    ≤ ((1 + ε)/ε) (λ1^{d−1}/(d − 1)!) e^{−λ1}

- lower bound for k:

     P[||y||² ≥ α1] + P[||y||² ≤ α2] ≤ 2 e^{−λ1} ((1 + ε)/ε) (λ1^{d−1}/(d − 1)!) ≤ 2/n^{2+β}

COMPARISON OF THE LOWER BOUNDS ON k

Table: L2-L2 distance, N(0,1) entries: NR1 Bound and DG Bound.

     n     ε     β    NR1 Bound   DG Bound   % Improv.
     10   0.1    1       2058       2961        30
     10   0.3    1        254        384        34
     10   0.1    2       2962       3948        25
     10   0.3    2        368        512        28
     50   0.1    1       3976       5030        21
     50   0.3    1        494        653        24
     50   0.1    2       5572       6707        17
     50   0.3    2        692        870        20
     100  0.1    1       4822       5921        19
     100  0.3    1        598        768        22
     100  0.1    2       6716       7895        15
     100  0.3    2        834       1024        19
     500  0.1    1       6808       7991        15
     500  0.3    1        846       1036        18
     500  0.1    2       9390      10654        12
     500  0.3    2       1168       1382        15

EXTENSION OF THE JL LEMMA TO THE L1 NORM

- JL Lemma (L2-L2 projection): the space of original points is in L2, and the space of projected points is in L2.
- L1-L1 random projection:
  - the JL Lemma cannot be extended to the L1 norm
  - Charikar & Sahai (2002), Brinkman & Charikar (2003), Lee & Naor (2004), Indyk (2006)

OUTLINE: OUR CONTRIBUTIONS

Dimension Reduction Methods
- Random Projection
  - Improvements on the lower bound for k from the Johnson-Lindenstrauss (JL) Lemma
    - L2-L2: Gaussian and Achlioptas random matrices
    - L2-L1: Gaussian and Achlioptas random matrices
- Variant of Partial Least Squares (PLS): Rank-based PLS

EXTENSION OF THE JL LEMMA TO THE L1 NORM

- L2-L1 random projection: the space of original points is in L2, and the space of projected points is in L1.
- Ailon & Chazelle (2006) and Matousek (2007):
  - the original L2 pairwise distances are preserved within (1 ± ε)√(2/π) of the projected L1 distances with k,

       k ≥ C ε^{−2} (2 ln(1/δ))     (14)

  - δ ∈ (0, 1), ε ∈ (0, 1/2), C sufficiently large
  - when δ = 1/n^{2+β}, then k = O((4 + 2β) ln n / ε²)
- sparse Gaussian random matrix (Ailon & Chazelle, and Matousek), sparse Achlioptas random matrix (Matousek)

L2-L1 RP: IMPROVEMENT ON THE AILON & CHAZELLE AND MATOUSEK BOUNDS

Rojo and Nguyen (2009a) (NR2 Bound):

     k ≥ (2 + β) ln n / (− ln(A(s*_ε)))     (15)

- A(s) = 2 e^{−s√(2/π)(1+ε) + s²/2} Φ(s), s > 0
- s*_ε is the unique minimizer of A(s)
- based on the mgf technique
- Gaussian and Achlioptas random matrices
- 36%-40% improvement

OUTLINE OF PROOF

- Define f(x) = (1/k) x Γ
- Let y_j = x r_j / ||x||_2 ~ N(0, 1); then E(||y||_1) = k√(2/π) and M_{|y_j|}(s) = 2 e^{s²/2} Φ(s) for all s.
- Let α1 = k√(2/π)(1 + ε) and α2 = k√(2/π)(1 − ε); then:
- Right-tail probability:

     P[||y||_1 ≥ α1] ≤ (2 e^{−(s α1/k) + s²/2} Φ(s))^k,  s > 0.

- Left-tail probability:

     P[||y||_1 ≤ α2] ≤ (2 e^{(s α2/k) + s²/2} (1 − Φ(s)))^k,  s > 0
                     ≤ (2 e^{−(s α1/k) + s²/2} Φ(s))^k,  s > 0.     (16)

- Eq. (16) follows since e^{2s√(2/π)} < Φ(s)/(1 − Φ(s)) for s > 0.
- minimize Eq. (16) with respect to s, and plug in s* to get the bound.

COMPARISON OF THE LOWER BOUNDS ON k

Table: L2-L1 distance, N(0,1) entries: Matousek Bound (C = 1) and NR2 Bound.

     n     ε     β    Matousek   NR2 Bound   % Improv.
     10   0.1    1      1382        823         40
     10   0.3    1       154         99         36
     10   0.1    2      5527       3290         40
     10   0.3    2       615        394         36
     50   0.1    1      2348       1398         40
     50   0.3    1       261        168         36
     50   0.1    2      3130       1863         40
     50   0.3    2       348        223         36
     100  0.1    1      2764       1645         40
     100  0.3    1       308        197         36
     100  0.1    2      3685       2193         40
     100  0.3    2       410        263         36
     500  0.1    1      3729       2220         40
     500  0.3    1       415        266         36
     500  0.1    2      4972       2960         40
     500  0.3    2       553        354         36

OUTLINE: OUR CONTRIBUTIONS

Dimension Reduction Methods
- Random Projection
  - Improvements on the lower bound for k from the Johnson-Lindenstrauss (JL) Lemma
    - L2-L2: Gaussian and Achlioptas random matrices
    - L2-L1: Gaussian and Achlioptas random matrices
- Variant of Partial Least Squares (PLS): Rank-based PLS

ACHLIOPTAS-TYPED RANDOM MATRICES

- Achlioptas (2001): for q = 1 or q = 3,

     r_ij = √q × { +1 with prob. 1/(2q);  0 with prob. 1 − 1/q;  −1 with prob. 1/(2q) }     (17)

- L2-L2 projection
- same lower bound for k as in the case of a Gaussian random matrix:

     k ≥ (24 + 12β) ln n / (3ε² − 2ε³)     (18)
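Sampling from (17) is cheap and needs no Gaussian generator. A minimal sketch checking that the first two moments match those of N(0, 1):

```python
import random

random.seed(2)

def achlioptas_entry(q):
    """One draw of r_ij from Eq. (17): sqrt(q) times +1, 0, -1 with probs 1/(2q), 1 - 1/q, 1/(2q)."""
    u = random.random()
    if u < 1 / (2 * q):
        return q ** 0.5
    if u < 1 / q:
        return -q ** 0.5
    return 0.0

draws = [achlioptas_entry(3) for _ in range(20000)]
mean = sum(draws) / len(draws)                      # E(r_ij) = 0
mean_sq = sum(r * r for r in draws) / len(draws)    # E(r_ij^2) = q * (1/q) = 1
zero_frac = sum(1 for r in draws if r == 0.0) / len(draws)  # 1 - 1/q = 2/3 for q = 3
```

With q = 3, two thirds of the entries are exactly zero, so the projection can exploit sparse multiplication while Eq. (18) still matches the Gaussian case.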

ACHLIOPTAS

- Define f(x) = (1/√k) x Γ, and y_j = x r_j / ||x||_2
- Goal: bound the mgf of y_j²
- bound the even moments of y_j by the even moments in the case r_ij ~ N(0, 1)
- ⇒ the mgf of y_j² using r_ij from the Achlioptas proposal is bounded by the mgf using r_ij ~ N(0, 1).

L2-L2 RP: IMPROVEMENT ON THE ACHLIOPTAS BOUND

- Rademacher random matrix: Eq. (17) with q = 1
- r_ij² = 1, and the products r_lj r_mj =_D r_ij are independent

Rojo and Nguyen (2009b):
- Let y_j = Σ_{i=1}^p c_i r_ij, with c_i = x_i / ||x||_2; then

     k (||f(x)||² / ||x||²) = Σ_{j=1}^k y_j² =_D k + 2 Σ_{j=1}^k Σ_{l=1}^p Σ_{m=l+1}^p c_lm r_lmj     (19)

  with c_lm = c_l c_m and r_lmj = r_lj r_mj

- 3 improvements on the Achlioptas bound

L2-L2 RP: IMPROVEMENT ON THE ACHLIOPTAS BOUND

Rojo and Nguyen (2009b): Method 1
- Hoeffding's inequality based on the mgf:

  Let the Ui's be independent and bounded random variables such that Ui falls in the interval [ai, bi] (i = 1, ..., n) with probability 1. Let Sn = Σ_{i=1}^n Ui; then for any t > 0,

     P[Sn − E(Sn) ≥ t] ≤ e^{−2t²/Σ_{i=1}^n (bi − ai)²}
  and
     P[Sn − E(Sn) ≤ −t] ≤ e^{−2t²/Σ_{i=1}^n (bi − ai)²}

- Method 1: lower bound for k:

     k ≥ ((8 + 4β) ln n / ε²) ((p − 1)/p)     (20)
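Eq. (20) is immediate to evaluate. A small helper (hypothetical name) reproduces the Method 1 entries of the later comparison table:

```python
import math

def method1_bound(n, eps, beta, p):
    """Smallest k satisfying Eq. (20): k >= ((8 + 4 beta) ln n / eps^2) * (p - 1)/p."""
    return math.ceil((8 + 4 * beta) * math.log(n) / eps ** 2 * (p - 1) / p)
```

For example, method1_bound(10, 0.1, 0.5, 5000) gives 2303 and method1_bound(10, 0.1, 1, 5000) gives 2763; the dependence on p enters only through the factor (p − 1)/p, which is why the Method 1 column barely moves between p = 5000 and p = 10000.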

OUTLINE OF PROOF

- Right-tail probability:

     P[||y||² ≥ k(1 + ε)] = P[ Σ_{j=1}^k Σ_{l=1}^p Σ_{m=l+1}^p c_lm r_lmj ≥ kε/2 ]
                          ≤ exp( −kε² / (8 Σ_{l=1}^p Σ_{m=l+1}^p c_l² c_m²) )     (21)

- Left-tail probability:

     P[||y||² ≤ k(1 − ε)] = P[ Σ_{j=1}^k Σ_{l=1}^p Σ_{m=l+1}^p c_lm r_lmj ≤ −kε/2 ]
                          ≤ exp( −kε² / (8 Σ_{l=1}^p Σ_{m=l+1}^p c_l² c_m²) )     (22)

OUTLINE OF PROOF

- Σ_{l=1}^p Σ_{m=l+1}^p c_l² c_m² is maximized when c_l² = 1/p for every l, so

     4 Σ_{l=1}^p Σ_{m=l+1}^p c_l² c_m² ≤ 2 (p − 1)/p

- Right-tail probability (combining this with (21)):

     P[||y||² ≥ k(1 + ε)] ≤ exp( −(kε²/4) (p/(p − 1)) )

- Left-tail probability:

     P[||y||² ≤ k(1 − ε)] ≤ exp( −(kε²/4) (p/(p − 1)) )

L2-L2 RP: IMPROVEMENT ON THE ACHLIOPTAS BOUND

Rojo and Nguyen (2009b): Method 2
- Berry-Esseen inequality:

  Let X1, ..., Xm be i.i.d. random variables with E(Xi) = 0, σ² = E(Xi²) > 0, and ρ = E|Xi|³ < ∞. Also, let X̄m be the sample mean, and Fm the cdf of X̄m √m/σ. Then for all x and m, there exists a positive constant C such that

     |Fm(x) − Φ(x)| ≤ C ρ / (σ³ √m)

- ρ/σ³ = 1, and C = 0.7915 for independent Xi's (Shiganov)
- Method 2: lower bound for k (10%-30% improvement):

     1 − Φ( ε √(kp/(2(p − 1))) ) + 0.7915 / √(kp(p − 1)/2) ≤ 1/n^{2+β}     (23)

L2-L2 RP: IMPROVEMENT ON THE ACHLIOPTAS BOUND

Rojo and Nguyen (2009b): Method 3
- Pinelis inequality:

  Let the Ui's be independent Rademacher random variables. Let d1, ..., dm be real numbers such that Σ_{i=1}^m d_i² = 1. Let Sm = Σ_{i=1}^m d_i U_i; then for any t > 0,

     P[|Sm| ≥ t] ≤ min( 1/t², 2(1 − Φ(t − 1.495/t)) )

- Method 3: lower bound for k (15% improvement at ε = 0.1):

     k ≥ 2(p − 1) a_n² / (p ε²)     (24)

  where a_n = (Q_n + √(Q_n² + 4(1.495)))/2, and Q_n = Φ^{−1}(1 − 1/n^{2+β}).
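Eq. (24) needs only the standard normal quantile, which the Python stdlib provides. A sketch (assuming `statistics.NormalDist` for Φ⁻¹):

```python
import math
from statistics import NormalDist

def method3_bound(n, eps, beta, p):
    """Smallest k satisfying Eq. (24)."""
    q_n = NormalDist().inv_cdf(1 - 1 / n ** (2 + beta))  # Q_n = Phi^{-1}(1 - n^-(2+beta))
    a_n = (q_n + math.sqrt(q_n ** 2 + 4 * 1.495)) / 2
    return math.ceil(2 * (p - 1) * a_n ** 2 / (p * eps ** 2))
```

For n = 10, ε = 0.1, β = 0.5, p = 5000 this comes out close to the tabulated Method 3 value of 2045 on the next slide's table.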

COMPARISON OF THE LOWER BOUNDS ON k: RADEMACHER RANDOM MATRIX

Table: L2-L2 distance with ε = 0.1: 1) Method 1 (Hoeffding's inequality), 2) Method 2 (Berry-Esseen inequality), 3) Method 3 (Pinelis inequality), and 4) Achlioptas Bound (which does not depend on p). The last column is the % improvement of Method 2 over the Achlioptas Bound.

     n     β     p       Method 1   Method 3   Method 2   Achlioptas   % Improv.
     10   0.5   5000       2303       2045       1492        2468         40
     10   0.5   10000      2303       2046       1492        2468         40
     10   1     5000       2763       2472       1912        2961         35
     10   1     10000      2763       2472       1911        2961         35
     50   0.5   5000       3912       3553       3009        4192         28
     50   0.5   10000      3912       3554       2995        4192         29
     50   1     5000       4694       4300       3948        5030         22
     50   1     10000      4694       4300       3821        5030         24
     100  0.5   5000       4605       4214       3809        4935         23
     100  0.5   10000      4605       4215       3715        4935         25
     100  1     30000      5527       5100       4816        5921         19
     100  1     70000      5527       5100       4623        5921         22

L2-L1: ACHLIOPTAS RANDOM MATRIX

- same lower bound as in the L2-L1 case using a Gaussian random matrix (NR2 Bound)
- bound the mgf of the Achlioptas-typed r.v. by that of a standard Gaussian
- 36%-40% improvement

SUMMARY

Random Projection
- Gaussian random matrix
  - L2-L2 projection: improvement of 12%-34% on the Dasgupta and Gupta bound
  - L2-L1 projection: improvement of 36%-40% on the Ailon & Chazelle bound
- Achlioptas-typed random matrix
  - L2-L2 projection: improvement of 20%-40% on the Achlioptas bound (Rademacher random matrix)
  - L2-L1 projection: improvement of 36%-40% on the Matousek bound
- the lower bound for k is still large for practical purposes: an active research area

RP VS. PCA, PLS, ...

Random Projection (RP)
- criterion: preserve pairwise distances (within 1 ± ε)
- k is large

PCA, PLS
- optimization criterion

PRINCIPAL COMPONENT ANALYSIS (PCA)

- Karl Pearson (1901)
- Variance optimization criterion:

     w_k = argmax_{w'w=1} Var(Xw) = argmax_{w'w=1} (N − 1)^{−1} w'X'Xw

     s.t. w_k'X'X w_j = 0 for all 1 ≤ j < k.

- ignores the response y
- eigenvalue decomposition of the sample covariance matrix

PCA

- Sample covariance matrix: S = (N − 1)^{−1} X'X
- Eigenvalue decomposition: S = V∆V'
  - ∆ = diag(λ1 ≥ ... ≥ λN): eigenvalues
  - V = (v1, ..., vN): unit eigenvectors
- weight vectors: w_k = v_k
- PCs are M_k = X w_k, k = 1, ..., N
- Cumulative variation explained by the first K PCs is Σ_{k=1}^K λ_k
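A dependency-free sketch of the first PC: power iteration on S stands in for the full eigendecomposition (an illustrative substitution; a real analysis would use a linear-algebra library), with X assumed centered:

```python
import math
import random

random.seed(3)

def first_pc(X, iters=200):
    """Leading unit eigenvector of S = X'X / (N - 1) by power iteration (X assumed centered)."""
    N, p = len(X), len(X[0])
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        Xw = [sum(row[j] * w[j] for j in range(p)) for row in X]                      # X w
        w = [sum(X[i][j] * Xw[i] for i in range(N)) / (N - 1) for j in range(p)]      # S w
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
    return w

# toy data: column 0 carries almost all the variance, so w should point along it
X = [[random.gauss(0, 10), random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
means = [sum(row[j] for row in X) / len(X) for j in range(3)]
Xc = [[row[j] - means[j] for j in range(3)] for row in X]
w = first_pc(Xc)
```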

PCA

How to select K, the number of PCs:

- Proportion of variation explained
- Cross-validation

PARTIAL LEAST SQUARES (PLS)

- Herman Wold (1960)
- Covariance optimization criterion:

     w_k = argmax_{w'w=1} Cov(Xw, y) = argmax_{w'w=1} (N − 1)^{−1} w'X'y

     s.t. w_k'X'X w_j = 0 for all 1 ≤ j < k.

- incorporates both the covariates and the response
- ignores censoring
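For a single component the PLS criterion has a closed-form maximizer: by Cauchy-Schwarz, w1 ∝ X'y for centered data. A minimal sketch on toy data (helper names are hypothetical):

```python
import math
import random

random.seed(4)

def centered(X):
    """Subtract column means."""
    N, p = len(X), len(X[0])
    m = [sum(row[j] for row in X) / N for j in range(p)]
    return [[row[j] - m[j] for j in range(p)] for row in X]

def pls_first_weight(X, y):
    """w1 = X'y / ||X'y||: the unit w maximizing w'X'y (Cauchy-Schwarz)."""
    p = len(X[0])
    xty = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
    norm = math.sqrt(sum(c * c for c in xty))
    return [c / norm for c in xty]

# toy data: y depends on column 0 only, so w1 should load mostly on it
N, p = 60, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(N)]
y_raw = [2 * X[i][0] + random.gauss(0, 0.1) for i in range(N)]
ybar = sum(y_raw) / N
y = [yi - ybar for yi in y_raw]
w1 = pls_first_weight(centered(X), y)
```

Unlike the PCA weight, w1 here is steered by the response y, which is the point of the covariance criterion.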

INCORPORATING CENSORING IN PLS

- Nguyen and Rocke (2002):
  - PLS weights: w_k = Σ_{i=1}^N θ_ik v_i, where the v_i are eigenvectors of X'X
  - the θ_ik depend on y only through a_i = u_i'y, where the u_i are eigenvectors of XX'
  - a_i is the slope coefficient of a simple linear regression of y on u_i if X is centered.
- Nguyen and Rocke: Modified PLS (MPLS)
  - replace a_i by the slope coefficient from a Cox PH regression of y on u_i to incorporate censoring

PLS: NONPARAMETRIC APPROACHES TO INCORPORATE CENSORING

- Datta (2007):
  - Reweighting: Inverse Probability of Censoring Weighted PLS (RWPLS)
    - replace each censored response with 0
    - reweigh each uncensored response by the inverse probability that it corresponds to a censored observation
  - Mean Imputation: MIPLS
    - keep the uncensored responses
    - replace each censored response by its expected value given that the true survival time exceeds the censoring time

OUTLINE: OUR CONTRIBUTIONS

Dimension Reduction Methods
- Random Projection
- Variant of Partial Least Squares (PLS): Rank-based PLS
  - insensitive to outliers
  - weight vectors as the solution to an optimization problem

NOTATION

For vectors u = (u1, ..., un)' and v = (v1, ..., vn)':

- Ranks of the ui's: indices of their positions in ascending or descending order
- sample Pearson correlation coefficient:

     Cor(u, v) = Σ_{i=1}^n (ui − ū)(vi − v̄) / ( √(Σ_{i=1}^n (ui − ū)²) √(Σ_{i=1}^n (vi − v̄)²) )

- sample Spearman correlation coefficient: the Pearson correlation computed on the ranks
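The practical difference between the two coefficients shows up with a single outlier. A short sketch (no ties assumed in the rank helper):

```python
import math

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (math.sqrt(sum((a - mu) ** 2 for a in u)) *
           math.sqrt(sum((b - mv) ** 2 for b in v)))
    return num / den

def ranks(u):
    """1-based ranks of u (no ties assumed)."""
    order = sorted(range(len(u)), key=lambda i: u[i])
    r = [0] * len(u)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(u, v):
    return pearson(ranks(u), ranks(v))

u = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
v = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # one outlier in an otherwise monotone pair
```

The pair is perfectly monotone, so Spearman stays at 1 while the outlier drags Pearson well below it.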

Page 57: Dimension Reduction Methods with Application to High ...jrojo/PASI/lectures/Tuan.pdf · Dimension Reduction Methods with Application to High-dimensional Data with a Censored Response

NGUYEN AND ROJO (2009A, B): RANK-BASED PARTIAL LEAST SQUARES (RPLS)

- the covariance/correlation measure in PLS is influenced by outliers
- replace the Pearson correlation by the Spearman rank correlation:

  w_k = \arg\max_{w'w = 1} (N - 1)^{-1} w' R_X' R_y    (25)

- use Nguyen and Rocke's procedure (MPLS) and Datta's RWPLS and MIPLS to incorporate censoring


RPLS: DERIVATION OF THE WEIGHTS

  w_1 = \frac{R_X' R_y}{\lVert R_X' R_y \rVert}    (26)

  w_k \propto P_{k-1} w_1,  k \ge 2    (27)

where

  P_{k-1} = I - \zeta_1 S_{RX} - \zeta_2 S_{RX}^2 - \cdots - \zeta_{k-1} S_{RX}^{k-1}    (28)

with S_{RX}^j = S_{RX} S_{RX} \cdots S_{RX} (j times), S_{RX} = R_X' R_X, S_X = X'X, and the ζ's obtained from

  w_1' P_{k-1} S_X w_1 = 0
  w_1' P_{k-1} S_X w_2 = 0
  ...
  w_1' P_{k-1} S_X w_{k-1} = 0
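The first weight vector (26) needs only the rank transforms of y and of each column of X. A rough pure-Python sketch (our own function names; ranks centered before the product, as is standard in PLS, and no ties assumed for brevity):

```python
def simple_ranks(x):
    # 1-based ranks; assumes no ties, for brevity.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def rpls_first_weight(X, y):
    # w1 = R_X' R_y / ||R_X' R_y||  (equation 26), where R_X ranks
    # each column of the n x p matrix X and R_y ranks y.
    n, p = len(X), len(X[0])
    ry = simple_ranks(y)
    rybar = sum(ry) / n
    ryc = [r - rybar for r in ry]          # centered ranks of y
    w = []
    for j in range(p):
        col = simple_ranks([X[i][j] for i in range(n)])
        cbar = sum(col) / n
        w.append(sum((c - cbar) * ryc[i] for i, c in enumerate(col)))
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]
```

A column whose ordering agrees perfectly with y's ordering gets a large positive weight, regardless of how extreme its raw values are; this is the outlier insensitivity claimed for RPLS.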


OUTLINE

- Microarray Data
- Dimension Reduction Methods
- Small application to microarray data with censored response
- Conclusions


REAL DATASETS

[Figure: histograms (frequency vs. time, 0-100) of the survival times for the DLBCL, Harvard, and Michigan datasets and of the recurrence times for the Duke dataset.]

- Diffuse Large B-cell Lymphoma (DLBCL): 240 cases, 7399 genes, 42.5% censored
- Harvard Lung Carcinoma: 84 cases, 12625 genes, 42.9% censored
- Michigan Lung Adenocarcinoma: 86 cases, 7129 genes, 72.1% censored
- Duke Breast Cancer: 49 cases, 7129 genes, 69.4% censored


SURVIVAL ANALYSIS

Cox Proportional Hazards (PH) Model

  h(t; z_i, β) = h_0(t) e^{z_i'β}    (29)

- h_0: unspecified baseline hazard

  S(t; z_i, β) = S_0(t)^{e^{z_i'β}}    (30)

- incorporates the covariates and the censoring information
- proportional hazards assumption
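Relation (30) follows from h = -d log S / dt: multiplying the hazard by e^{z'β} raises the baseline survival to that power, so the log-survivals of any two subjects stay in a constant ratio over time. A quick numeric check with illustrative values (not from the talk):

```python
import math

def cox_survival(s0_t, z, beta):
    # S(t; z, beta) = S0(t) ** exp(z'beta), per equation (30).
    lp = sum(zi * bi for zi, bi in zip(z, beta))  # linear predictor z'beta
    return s0_t ** math.exp(lp)

s0 = 0.8                                   # baseline survival at some time t
s = cox_survival(s0, z=[1.0, 0.5], beta=[0.3, -0.2])
# z'beta = 0.3*1.0 - 0.2*0.5 = 0.2, so log S / log S0 should equal exp(0.2)
ratio = math.log(s) / math.log(s0)
```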


SURVIVAL ANALYSIS

Accelerated Failure Time (AFT) Model

- model for the logarithm of the true survival time:

  log(y_i) = μ + X_i'β + σ u_i    (31)

- the y_i's are the true survival times
- μ and σ are location and scale parameters
- the u_i's are i.i.d. errors with some distribution


ASSESSMENT OF THE METHODS

Selection of K

1) K is fixed across all methods
   - the first K PCs explain a certain proportion of the predictor variability
2) K is chosen by cross-validation (CV)
   - K is inherent within each method


ASSESSMENT OF THE METHODS

Selection of K by CV

- Cox PH Model

  CV(surv.error) = \frac{1}{sM} \sum_{i=1}^{s} \sum_{m=1}^{M} \sum_{t \in D_m} \left[ \hat{S}_{-m}(t) - \hat{S}_m(t) \right]^2

- i = 1, ..., s is the index of the simulation
- m = 1, ..., M is the index of the fold
- D_m is the set of death times in the mth fold
- \hat{S}_m denotes the estimated survival function for the mth fold
- \hat{S}_{-m} denotes the estimated survival function when the mth fold is removed

  \hat{S}_m(t) = \frac{1}{N_m} \sum_{n=1}^{N_m} \hat{S}_{m,n}(t),    \hat{S}_{-m}(t) = \frac{1}{N_{-m}} \sum_{n=1}^{N_{-m}} \hat{S}_{-m,n}(t)


ASSESSMENT OF THE METHODS

Selection of K by CV

- AFT Model

  CV(fit.error) = \frac{1}{sM} \sum_{i=1}^{s} \sum_{m=1}^{M} \frac{\sum_{l=1}^{N_m} \delta_{m,l}(i) \left( y^*_{m,l}(i) - \hat{y}^*_{m,l}(i) \right)^2}{\sum_{l=1}^{N_m} \delta_{m,l}(i)}

- i = 1, ..., s is the index of the simulation
- m = 1, ..., M is the index of the fold
- l = 1, ..., N_m is the index of the individual in the mth fold
- y^*_{m,l}(i) = ln(y_{m,l}(i))
- \hat{y}^*_{m,l}(i) are the estimates of y^*_{m,l}(i):

  \hat{y}^*_{m,l}(i) = \hat{\mu}_{-m,AFT}(i) + M_{m,l}(i)' \hat{\beta}_{-m,AFT}(i)


PCA, MPLS, RWPLS, MIPLS, RMPLS

- Principal Component Analysis (PCA):
  - variance optimization criterion
  - ignores the response
- Modified Partial Least Squares (MPLS), Reweighted PLS (RWPLS), and Mean Imputation PLS (MIPLS):
  - covariance/correlation optimization criterion
  - incorporate the response and the censoring
- Proposed method: Rank-based Partial Least Squares (RPLS):
  - covariance/correlation optimization criterion
  - incorporates the response and the censoring; based on ranks


UNIV, SPCR, CPCR

- Univariate Selection (UNIV): Bovelstad
  1. fit a univariate regression model for each gene, and test the null hypothesis β_g = 0 vs. the alternative β_g ≠ 0
  2. arrange the genes by increasing p-value
  3. pick out the top K-ranked genes
- Supervised Principal Component Regression (SPCR): Bair and Tibshirani
  1. use UNIV to pick out a subset of the original genes
  2. apply PCA to the subsetted genes
- Correlation Principal Component Regression (CPCR): Zhao and Sun
  1. use PCA but retain all PCs
  2. apply UNIV to pick out the top K-ranked PCs
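The UNIV ranking step can be sketched in a few lines. Here a simple linear regression per gene stands in for the per-gene survival model (genes with larger |t| for the slope have smaller p-values, so ranking by |t| matches ranking by increasing p-value); the function name and the linear stand-in are our simplifications, not the actual procedure:

```python
def univ_select(X, y, K):
    # Rank genes (columns of the n x p matrix X) by the absolute
    # t-statistic of the slope of the simple regression of y on each
    # gene, then keep the K genes with the largest |t|.
    n = len(y)
    ybar = sum(y) / n
    stats = []
    for j in range(len(X[0])):
        x = [row[j] for row in X]
        xbar = sum(x) / n
        sxx = sum((v - xbar) ** 2 for v in x)
        sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
        b1 = sxy / sxx                      # least-squares slope
        resid = [b - (ybar + b1 * (a - xbar)) for a, b in zip(x, y)]
        s2 = sum(r * r for r in resid) / (n - 2)
        t = b1 / (s2 / sxx) ** 0.5 if s2 > 0 else float("inf")
        stats.append((abs(t), j))
    stats.sort(reverse=True)
    return [j for _, j in stats[:K]]
```

SPCR and CPCR then reuse this ranking: SPCR applies it to the genes before PCA, CPCR applies it to the PCs after PCA.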


REAL DATASETS

Table: Cox model, DLBCL and Harvard datasets. K chosen by CV for the different methods; the min(CV(surv.error)) of the 1000 repeated runs is shown.

            DLBCL            HARVARD
  Method    K    error       K    error
  PCA       7    0.1026      13   0.121
  MPLS      1    0.1076      1    0.1304
  RMPLS     1    0.1056      1    0.1124
  CPCR      2    0.1014      2    0.1402
  SPCR      1    0.1063      3    0.1473
  UNIV      11   0.1221      14   0.1663


REAL DATASETS

Table: AFT lognormal mixture model, DLBCL, Harvard, Michigan, and Duke datasets. K chosen by CV for the different methods; the min(CV(fit.error)) of the 1000 repeated runs is shown.

            DLBCL        HARVARD      MICHIGAN     DUKE
  Method    K   error    K   error    K   error    K   error
  PCA       5   5.33     6   1.43     4   4.14     4   19.15
  MPLS      3   2.56     1   0.51     3   1.04     1   15.91
  RMPLS     5   1.62     2   0.36     4   0.64     3   5.41
  RWPLS     1   5.54     1   1.35     1   4.90     1   13.11
  RRWPLS    1   5.32     1   2.06     1   4.07     2   9.60
  MIPLS     3   2.80     2   0.58     2   1.72     2   11.79
  RMIPLS    4   1.92     2   0.56     2   1.11     1   10.42
  CPCR      7   4.59     5   1.11     4   2.47     4   9.88
  SPCR      1   5.48     1   2.12     2   5.10     2   22.14
  UNIV      5   4.09     6   0.84     6   1.72     7   10.80


GENE RANKING

- Select significant genes
- Ranking of genes: absolute value of the estimated weights on the genes (AEW)

  AEW = | W \hat{\beta}^*_R |    (32)

  where \hat{\beta}^*_R = \hat{\beta}_R / se(\hat{\beta}_R)

- R denotes either the Cox or the AFT model


GENE RANKING

Table: Number of top-ranked genes in common between the ranked versions of PLS and their un-ranked counterparts for the DLBCL, Harvard, Michigan, and Duke datasets, using the absolute value of the estimated weights on the genes for the 1st component. The first row gives the number of top-ranked genes considered.

  K top-ranked genes            25   50   100  250  500  1000
  DLBCL     MPLS and RMPLS      15   33   74   188  397  802
            RWPLS and RRWPLS    0    0    1    32   140  405
            MIPLS and RMIPLS    18   36   76   201  409  843
  HARVARD   MPLS and RMPLS      12   28   58   171  368  822
            RWPLS and RRWPLS    0    0    3    15   80   273
            MIPLS and RMIPLS    14   28   69   170  371  804
  MICHIGAN  MPLS and RMPLS      10   20   46   117  273  601
            RWPLS and RRWPLS    0    0    0    2    20   126
            MIPLS and RMIPLS    0    0    1    12   45   158
  DUKE      MPLS and RMPLS      3    3    3    21   73   210
            RWPLS and RRWPLS    0    3    7    36   105  287
            MIPLS and RMIPLS    0    0    2    18   59   194


ASSESSMENT OF THE METHODS

Simulation: generating gene-expression values

  x_{ij} = \exp(x^*_{ij})    (33)

  x^*_{ij} = \sum_{k=1}^{d} r_{ki} \tau_{kj} + \varepsilon_{ij}    (34)

- \tau_{kj} ~ iid N(\mu_\tau, \sigma_\tau^2)
- \varepsilon_{ij} ~ N(\mu_\varepsilon, \sigma_\varepsilon^2)
- r_{ki} ~ Unif(-0.2, 0.2), fixed for all simulations
- d = 6, \mu_\varepsilon = 0, \mu_\tau = 5/d, \sigma_\tau = 1, \sigma_\varepsilon = 0.3
- 5,000 simulations
- p ∈ {100, 300, 500, 800, 1000, 1200, 1400, 1600}
- N = 50
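The generator (33)-(34) is a low-rank factor model on the log scale. A direct pure-Python sketch with the slide's parameter settings (function name and seeding are ours):

```python
import math
import random

def simulate_expression(N=50, p=100, d=6, seed=0):
    # x_ij = exp(x*_ij), x*_ij = sum_k r_ki * tau_kj + eps_ij
    # (equations 33-34), with tau_kj ~ N(5/d, 1), eps_ij ~ N(0, 0.3^2),
    # and r_ki ~ Unif(-0.2, 0.2) held fixed across genes.
    rng = random.Random(seed)
    r = [[rng.uniform(-0.2, 0.2) for _ in range(N)] for _ in range(d)]
    tau = [[rng.gauss(5 / d, 1.0) for _ in range(p)] for _ in range(d)]
    X = []
    for i in range(N):
        row = []
        for j in range(p):
            star = (sum(r[k][i] * tau[k][j] for k in range(d))
                    + rng.gauss(0.0, 0.3))
            row.append(math.exp(star))
        X.append(row)
    return X
```

The exponential in (33) guarantees positive expression values, and the shared factors r_ki induce correlation across genes, mimicking co-regulated gene groups.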


SIMULATION SETUP

Simulation: generating survival times

- Cox PH model

  Exponential distribution:
    y_i = y_{0i} e^{-X_i'\beta},  c_i = c_{0i} e^{-X_i'\beta}
    y_{0i} ~ Exp(\lambda_y),  c_{0i} ~ Exp(\lambda_c)

  Weibull distribution:
    y_i = y_{0i} (e^{-X_i'\beta})^{1/\lambda_y},  c_i = c_{0i} (e^{-X_i'\beta})^{1/\lambda_c}
    y_{0i} ~ Weib(\lambda_y, \alpha_y),  c_{0i} ~ Weib(\lambda_c, \alpha_c)


SIMULATION SETUP

Simulation: generating survival times

- AFT model
  - ln(y_i) = \mu + X_i'\beta + u_i and ln(c_i) = \mu + X_i'\beta + w_i
  - lognormal mixture error: f_{u_i}(u) = 0.9 \phi(u) + \frac{0.1}{10} \phi(u/10)
  - also exponential, lognormal, and log-t errors
  - w_i ~ Gamma(a_c, s_c)


SIMULATION SETUP

Simulation: generating survival times

- both Cox and AFT models
  - observed survival time T_i = min(y_i, c_i)
  - censoring indicator \delta_i = I(y_i < c_i)
  - censoring rate P[y_i > c_i]
  - \beta_j ~ N(0, \sigma_\pi^2), \sigma_\pi = 0.2 for all p's
  - outliers in the response for large p
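The final step, common to both models, can be sketched as follows for the AFT case. The error choices here (standard normal u_i and a mean-shifted normal w_i, whose shift tunes the censoring rate) are illustrative stand-ins for the slide's mixture/Gamma errors, and the function name is ours:

```python
import math
import random

def simulate_aft(beta, X, mu=2.0, seed=0):
    # ln(y_i) = mu + X_i'beta + u_i, ln(c_i) = mu + X_i'beta + w_i,
    # then the observed pair is (T_i, delta_i) = (min(y_i, c_i), I(y_i < c_i)).
    rng = random.Random(seed)
    T, delta = [], []
    for x in X:
        lp = mu + sum(a * b for a, b in zip(x, beta))
        y = math.exp(lp + rng.gauss(0.0, 1.0))   # true survival time
        c = math.exp(lp + rng.gauss(0.5, 1.0))   # shift 0.5 sets ~24% censoring
        T.append(min(y, c))
        delta.append(1 if y < c else 0)
    return T, delta
```

Only (T_i, delta_i) reaches the analysis methods; the true y_i's are kept aside to score each method afterwards.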


PERFORMANCE MEASURES

Performance measures, once K is selected:

- Mean squared error of the weights on the genes:

  MSE(\hat{\beta}) = \frac{1}{s} \sum_{i=1}^{s} \sum_{j=1}^{p} \left( \beta_j - \hat{\beta}_j(i) \right)^2

  where \hat{\beta} = W \hat{\beta}_{Cox,AFT}

- AFT model: mean squared error of fit

  MSE(fit) = \frac{1}{s} \sum_{i=1}^{s} \left[ \frac{\sum_{n=1}^{N} \delta_n(i) \left( y^*_n(i) - \hat{y}^*_n(i) \right)^2}{\sum_{n=1}^{N} \delta_n(i)} \right]

- y^*_n(i) = \log(y_n(i))
- \hat{y}^*_n(i) = \hat{\mu}_{AFT}(i) + M_n(i)' \hat{\beta}_{AFT}(i)
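The inner term of MSE(fit) is just a squared error averaged over the uncensored observations. A one-function sketch for a single simulated dataset (function name is ours):

```python
def mse_fit(y_star, y_hat, delta):
    # Censored MSE of fit: average squared error of the predicted log
    # survival times over the uncensored observations only (delta_n = 1),
    # matching the bracketed term of MSE(fit).
    num = sum(d * (a - b) ** 2 for a, b, d in zip(y_star, y_hat, delta))
    den = sum(delta)
    return num / den
```

For example, errors of 1 and 2 on the two uncensored points (with the censored third point ignored) give (1 + 4) / 2 = 2.5.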


PERFORMANCE MEASURES

- Cox PH model
  - MSE of the estimated survival function evaluated at the average of the covariates:

    ave(d^2) = \frac{1}{s} \sum_{i=1}^{s} \sum_{t \in D_s} \left( S_i(t) - \hat{S}_i(t) \right)^2

  - MSE of the estimated survival function evaluated using the covariates of each individual:

    ave(d^2.ind) = \frac{1}{sN} \sum_{i=1}^{s} \sum_{n=1}^{N} \sum_{t \in D_s} \left( S_{in}(t) - \hat{S}_{in}(t) \right)^2


SIMULATION RESULTS: COX MODEL, FIXED K

[Figure: Cox model, 1/3 censored. MSE of the estimated survival function evaluated at the average of the covariates (ave(d^2)) vs. number of genes, for datasets with approximately 40% and 70% TVPE accounted for by the first 3 PCs, comparing PCA, MPLS, RMPLS, CPCR, SPCR, and UNIV.]


SIMULATION RESULTS: COX MODEL, FIXED K

[Figure: Cox model, 1/3 censored. MSE of the estimated survival function evaluated using the covariates of each individual (ave(d^2.ind)) vs. number of genes, for datasets with approximately 40% and 70% TVPE accounted for by the first 3 PCs, comparing PCA, MPLS, RMPLS, CPCR, SPCR, and UNIV.]


SIMULATION RESULTS: COX MODEL, CV

[Figure: Cox model, 1/3 censored, K chosen by CV. Minimized CV squared error of the estimated survival function (min(CV(surv.error))), MSE of the estimated survival function evaluated at the average of the covariates (ave(d^2)), and MSE of the estimated survival function evaluated using the covariates of each individual (ave(d^2.ind)), each vs. number of genes, comparing PCA, MPLS, RMPLS, CPCR, SPCR, and UNIV.]


SIMULATION RESULTS: AFT MODEL, CV

[Figure 1: AFT lognormal mixture model, 1/3 censored, K chosen by CV. min(CV(fit.error)), MSE(\hat{\beta}), and MSE(fit) vs. number of genes, comparing RWPLS, RRWPLS, MIPLS, RMIPLS, MPLS, and RMPLS (top row) and comparing PCA, MPLS, RMPLS, SPCR, CPCR, and UNIV (bottom row), based on 5000 simulations.]


SIMULATION RESULTS: AFT MODEL, CV

[Figure: AFT exponential model, 1/3 censored, K chosen by CV, based on 5000 simulations. min(CV(fit.error)), MSE(\hat{\beta}), and MSE(fit) vs. number of genes, comparing RWPLS, RRWPLS, MIPLS, RMIPLS, MPLS, and RMPLS, and comparing PCA, MPLS, RMPLS, CPCR, SPCR, and UNIV.]


SUMMARY

- Rank-based Partial Least Squares (RPLS)
  - replaces the Pearson correlation with the Spearman rank correlation
  - incorporates censoring via Nguyen and Rocke's MPLS and Datta's RWPLS and MIPLS
- RPLS works well
  - in the presence of outliers in the response
  - comparable to MPLS and PCA in the absence of outliers


CONCLUSIONS

Dimension reduction methods

- Random Projection
  - Johnson-Lindenstrauss (JL) Lemma
  - improvements on the lower bound for k
    - L2-L2: Gaussian and Achlioptas-type random matrices
    - L2-L1: Gaussian and Achlioptas-type random matrices
- Rank-based Partial Least Squares (RPLS)
  - a competitive dimension reduction method


THANK YOU!

Special Thanks:

Dr. Rojo, Chair and Advisor, Professor ofStatistics, Rice University



ACKNOWLEDGMENT

This work is supported by:

NSF Grant SES-0532346, NSA RUSIS GrantH98230-06-1-0099, and NSF REU GrantMS-0552590



DATTA: REWEIGHTING (RWPLS)

- Kaplan-Meier estimator of the censoring distribution:

  \hat{S}_c(t) = \prod_{t_i \le t} \left[ 1 - \frac{c_i}{n_i} \right]

  where t_1 < \cdots < t_m are the ordered censoring times, c_i = number of censored observations at t_i, and n_i = number still alive at time t_i

- Let \tilde{y}_i = 0 for \delta_i = 0 and \tilde{y}_i = T_i / \hat{S}_c(T_i^-) for \delta_i = 1
- Apply PLS to (X, \tilde{y})
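The two steps above — a Kaplan-Meier curve fitted to the censoring times, then the inverse-probability reweighting — can be sketched in pure Python (function names are ours; a small offset approximates the left limit S_c(T_i^-)):

```python
def km_censoring(times, delta):
    # Kaplan-Meier estimate of the censoring survival function:
    # S_c(t) = prod_{t_i <= t} (1 - c_i / n_i), where c_i counts the
    # censored observations (delta = 0) at each ordered censoring time.
    steps, s = [], 1.0
    for t in sorted(set(u for u, d in zip(times, delta) if d == 0)):
        ni = sum(1 for u in times if u >= t)                     # at risk
        ci = sum(1 for u, d in zip(times, delta) if u == t and d == 0)
        s *= 1 - ci / ni
        steps.append((t, s))

    def S(t):
        out = 1.0
        for ti, si in steps:
            if ti <= t:
                out = si
        return out
    return S

def reweight(times, delta):
    # Datta's RWPLS response: 0 for censored cases, T_i / S_c(T_i-)
    # for uncensored ones; PLS is then run on (X, y~).
    Sc = km_censoring(times, delta)
    eps = 1e-9  # evaluate just before T_i to get the left limit
    return [t / Sc(t - eps) if d == 1 else 0.0
            for t, d in zip(times, delta)]
```

Deaths occurring after censoring events get inflated weights, compensating for the censored subjects whose responses were zeroed out.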


DATTA: MEAN IMPUTATION (MIPLS)

- Conditional expectation:

  \hat{y}^*_i = \frac{\sum_{t_j > c_i} t_j \, \Delta\hat{S}(t_j)}{\hat{S}(c_i)}

  where the t_j are the ordered death times and \Delta\hat{S}(t_j) is the jump size of \hat{S} at t_j

- Let \tilde{y}_i = y_i for \delta_i = 1 and \tilde{y}_i = \hat{y}^*_i for \delta_i = 0
- Apply PLS to (X, \tilde{y})
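The imputation formula amounts to averaging the Kaplan-Meier jump points that lie beyond the censoring time, weighted by the jump sizes and renormalized by S(c_i). A pure-Python sketch (function name is ours; a censored point beyond the last death is left as-is, one of several possible tail conventions):

```python
def mean_impute(times, delta):
    # Datta's MIPLS response: keep uncensored y_i; replace a censored
    # y_i by sum_{t_j > c_i} t_j * dS(t_j) / S(c_i), where S is the
    # Kaplan-Meier survival estimate and dS(t_j) its jump at death t_j.
    s, jumps, curve = 1.0, [], []
    for t in sorted(set(u for u, d in zip(times, delta) if d == 1)):
        ni = sum(1 for u in times if u >= t)                     # at risk
        di = sum(1 for u, d in zip(times, delta) if u == t and d == 1)
        new_s = s * (1 - di / ni)
        jumps.append((t, s - new_s))       # jump size at this death time
        curve.append((t, new_s))
        s = new_s

    def S(t):
        out = 1.0
        for ti, si in curve:
            if ti <= t:
                out = si
        return out

    out = []
    for t, d in zip(times, delta):
        if d == 1:
            out.append(t)
        else:
            num = sum(tj * dj for tj, dj in jumps if tj > t)
            out.append(num / S(t) if S(t) > 0 else t)
    return out
```

With times [1, 2, 3] and delta [1, 0, 1], the only death beyond the censoring time 2 is at 3, so the censored response is imputed as 3.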


BIBLIOGRAPHY

- Achlioptas, D. Database-friendly random projections. Proc. ACM Symp. on Principles of Database Systems, 2001, pp. 274-281.
- Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., and Yu, X. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 2000, pp. 503-511.
- Bair, E., Hastie, T., and Tibshirani, R. Prediction by supervised principal components. Journal of the American Statistical Association, 101, 2006, pp. 119-137.
- Bovelstad, H.M., Nygard, S., Storvold, H.L., Aldrin, M., Borgan, O., Frigessi, A., and Lingjaerde, O.C. Predicting survival from microarray data - a comparative study. Bioinformatics Advance Access, 2007.
- Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B, 34, 1972, pp. 187-220.
- Dai, J.J., Lieu, L., and Rocke, D.M. Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, vol. 5, issue 1, article 6, 2006.
- Dasgupta, S. and Gupta, A. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report 99-006, UC Berkeley, March 1999.
- Datta, S., Le Rademacher, J., and Datta, S. Predicting patient survival by accelerated failure time modeling using partial least squares and lasso. Biometrics, 63, 2007, pp. 259-271.
- Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz maps into a Hilbert space. Contemp. Math., 26, 1984, pp. 189-206.
- Kaplan, E.L. and Meier, P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 1958, pp. 457-481.
- Klein, J.P. and Moeschberger, M.L. Survival Analysis: Techniques for Censored and Truncated Data. Springer, second edition, New York, 2003.
- Kohonen, T. et al. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), May 2000, pp. 574-585.


BIBLIOGRAPHY

- Li, K.C., Wang, J.L., and Chen, C.H. Dimension reduction for censored regression data. The Annals of Statistics, 27, 1999, pp. 1-23.
- Li, L. and Li, H. Dimension reduction methods for microarrays with application to censored survival data. Center for Bioinformatics and Molecular Biostatistics, Paper surv2, 2004.
- Li, P., Hastie, T.J., and Church, K.W. Very sparse random projections. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 287-296.
- Mardia, K.V., Kent, J.T., and Bibby, J.M. Multivariate Analysis. Academic Press, 2003.
- Nguyen, D.V. and Rocke, D.M. Assessing patient survival using microarray gene expression data via partial least squares proportional hazard regression. Computing Science and Statistics, 33, 2001, p. 376.
- Nguyen, D.V. and Rocke, D.M. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics, 18, 2002, p. 1625.
- Nguyen, D.V. Partial least squares dimension reduction for microarray gene expression data with a censored response. Mathematical Biosciences, 193, 2005, pp. 119-137.
- Nguyen, T.S. and Rojo, J. Dimension reduction of microarray data in the presence of a censored survival response: a simulation study. Statistical Applications in Genetics and Molecular Biology, 8.1.4, 2009.
- Nguyen, T.S. and Rojo, J. Dimension reduction of microarray gene expression data: the accelerated failure time model. Journal of Bioinformatics and Computational Biology, 7.6, 2009, pp. 939-954.
- Rojo, J. and Nguyen, T.S. Improving the Johnson-Lindenstrauss lemma. Journal of Computer and System Sciences, submitted, 2009.
- Rojo, J. and Nguyen, T.S. Improving the Achlioptas bound for Rademacher random matrices. In preparation, 2009.
- Sun, J. Correlation principal component regression analysis of NIR data. Journal of Chemometrics, 9, 1995, pp. 21-29.
- Wold, H. Estimation of principal components and related models by iterative least squares. In Krishnaiah, P. (ed.), Multivariate Analysis. Academic Press, New York, 1966, pp. 391-420.
- Zhao, Q. and Sun, J. Cox survival analysis of microarray gene expression data using correlation principal component regression. Statistical Applications in Genetics and Molecular Biology, vol. 6.1, article 16. Berkeley Electronic Press, 2007.