Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles



Page 1: Tree Based Methods for Analyzing

Tree Based Methods for Analyzing Tissue Microarray Data

Steve Horvath

Human Genetics and Biostatistics

University of California, Los Angeles

Page 2: Tree Based Methods for Analyzing

Acknowledgements

• Horvath Lab
  – Yunda Huang
  – Xueli Liu, Ph.D.
  – Zeke Fang, Ph.D.
  – Tuyen Hoang

• UCLA Tissue Microarray Core
  – David Seligson
  – Aarno Palotie

• Clinicians
  – Hyung Kim
  – Arie Belldegrun

Page 3: Tree Based Methods for Analyzing

Contents

• Statistical issues with tissue microarray (TMA) data

• Random forest (RF) predictors

• RF clustering

• Application of RF clustering to TMA data

• Supervised Learning Methods

Page 4: Tree Based Methods for Analyzing

Background TMA data

Page 5: Tree Based Methods for Analyzing

Description of TMA data

• TMAs are a high-throughput tool for validating biomarkers newly identified in genome-wide discovery studies

• The basic technique is described in Kononen et al. (1998)

Page 6: Tree Based Methods for Analyzing

Tissue Microarray (TMA) Technology

Kononen et al., Nature Medicine 1998

[Figure: donor block, array block, slide]

• Hundreds of tiny (typically 0.6 mm diameter) cylindrical tissue cores are densely and precisely arrayed into a single histologic paraffin block.

• From this new array block, up to 300 serial sections, each 4-8 µm thick, may be produced.

• The sections serve as targets for fluorescence in situ hybridization (FISH) and for protein expression studies by immunohistochemistry.

Page 7: Tree Based Methods for Analyzing

Pathologists score each spot by looking through a microscope. (Slide by David Seligson.)

The resulting scores are non-normal and highly correlated.

Page 8: Tree Based Methods for Analyzing

Several Spots per Pathology Case, Several “Scores” per Spot

• Maximum intensity = Max (1 – 4)

• Percent of cells staining = Pos (0 – 100)

• Percent of cells staining with the maximum intensity = PosMax (0 – 100)

• Spots have a spot grade: NL,1,2,..

• Indicator of informativeness

• Each case is usually represented by 4 or more spots

– >3 malignant lesions, 1 matched normal

Page 9: Tree Based Methods for Analyzing

[Figure: Histograms of tumor marker expression scores for P53, CA9 and EpCam. One row of panels shows the percent of cells staining (POS, x-axis 0 to 100); the other shows the maximum intensity (MAX, x-axis 0 to 3).]

Page 10: Tree Based Methods for Analyzing

P53 and Ki67: Max versus Pos

[Figure: Scatterplot matrices of maximum intensity versus percent of cells staining. Ki67 panel: KiNuclMax versus KiPos. P53 panel: P5NuclMax versus P5Pos.]

Page 11: Tree Based Methods for Analyzing

Characteristics of TMA data

• Non-normal, discrete, strongly correlated

• Mixed variable types

• Pooling (combining) spot measurements for each patient
  – between 1 and 10 spots of different grade per patient
  – current strategy pools tumor spots and forms the median, mean, minimum or maximum

• Message: tumor marker intensity is measured by up to 12 highly correlated staining scores, which leads to multicollinearity
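The pooling strategy described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `pool_spot_scores`, the patient IDs and the scores are all hypothetical, and only tumor spots are assumed to be passed in.

```python
from statistics import mean, median

def pool_spot_scores(spots, how="median"):
    """Pool per-spot scores into one value per patient.

    spots: iterable of (patient_id, score) pairs (tumor spots only).
    how:   one of "median", "mean", "min", "max".
    """
    poolers = {"median": median, "mean": mean, "min": min, "max": max}
    pool = poolers[how]
    by_patient = {}
    for patient_id, score in spots:
        by_patient.setdefault(patient_id, []).append(score)
    return {pid: pool(scores) for pid, scores in by_patient.items()}

# Hypothetical Pos scores (percent of cells staining) for two patients.
spots = [("A", 10), ("A", 40), ("A", 70), ("B", 0), ("B", 100)]
pooled_median = pool_spot_scores(spots, "median")
pooled_max = pool_spot_scores(spots, "max")
```

Pooling the same marker in several ways (median, mean, minimum, maximum) produces covariates that are strongly correlated with each other, which is one source of the multicollinearity noted above.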

Page 12: Tree Based Methods for Analyzing

Our main tools are random forest predictors

• Unsupervised analysis of TMA data
  – RF clustering

• Supervised analysis
  – RF based pre-validation method

Page 13: Tree Based Methods for Analyzing

Background random forest predictors

L. Breiman 1999

Page 14: Tree Based Methods for Analyzing

Random Forests (RFs)

• RFs are a collection of tree predictors such that each tree depends on the values of an independently sampled random vector

Page 15: Tree Based Methods for Analyzing

Classification and Regression Trees (CART)

by
– Leo Breiman, UC Berkeley
– Jerry Friedman, Stanford University
– Charles J. Stone, UC Berkeley
– Richard Olshen, Stanford University

Page 16: Tree Based Methods for Analyzing

An example of CART

• Goal: for patients admitted to the ER, predict who is at higher risk of heart attack

• Training data set:
  – # of subjects = 215
  – Outcome variable = High/Low Risk determined
  – 19 noninvasive clinical and lab variables were used as the predictors

Page 17: Tree Based Methods for Analyzing

CART construction

High 17%, Low 83%: Is BP <= 91?
  Yes: High 70%, Low 30%. Classified as high risk!
  No: High 12%, Low 88%: Is age <= 62.5?
    Yes: High 2%, Low 98%. Classified as low risk!
    No: High 23%, Low 77%: Is ST present?
      Yes: High 50%, Low 50%. Classified as high risk!
      No: High 11%, Low 89%. Classified as low risk!

Page 18: Tree Based Methods for Analyzing

CART Construction

BINARY RECURSIVE PARTITIONING

• Binary: split parent node into two child nodes

• Recursive: each child node can be treated as parent node

• Partitioning: data set is partitioned into mutually exclusive subsets in each split
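Binary recursive partitioning can be sketched as a short recursive function. The slides do not say which splitting criterion is used, so this sketch assumes Gini impurity; the function names and the toy data (blood pressure, age) are hypothetical and are not the heart attack data set.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) minimizing weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Binary: two children. Recursive: each child becomes a parent."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    j, t = split
    # Partitioning: the two subsets are mutually exclusive.
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

# Hypothetical training data: (blood pressure, age) and a risk label.
rows = [(85, 70), (90, 50), (120, 55), (130, 60), (125, 70), (140, 75)]
labels = ["high", "high", "low", "low", "high", "high"]
tree = grow(rows, labels)
```

A fitted tree is a nested tuple `(feature, threshold, left subtree, right subtree)` with class labels at the leaves, so `predict(tree, (120, 58))` walks the same kind of yes/no questions as the CART diagram.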

Page 19: Tree Based Methods for Analyzing

RF Construction

Page 20: Tree Based Methods for Analyzing

Prediction by plurality voting

• The forest consists of N trees.

• Class prediction:
  – Each tree votes for a class; the predicted class for an observation $x$ is the plurality, $\hat{C}(x) = \arg\max_C \sum_{k=1}^{N} \mathbf{1}[f_k(x, T) = C]$

• Regression random forest:
  – the predicted value is the average of the tree predictions
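The two aggregation rules can be illustrated with stand-in trees. In this sketch the "trees" are plain Python functions with hypothetical split rules, not fitted predictors; `forest_classify` and `forest_regress` are illustrative names.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, x):
    """Plurality vote: the class predicted by the largest number of trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def forest_regress(trees, x):
    """Regression forest: average of the individual tree predictions."""
    return mean(tree(x) for tree in trees)

# Hypothetical "trees": plain functions standing in for fitted tree predictors.
class_trees = [
    lambda x: "high" if x["bp"] <= 91 else "low",
    lambda x: "high" if x["age"] > 62.5 else "low",
    lambda x: "low",
]
reg_trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]

patient = {"bp": 85, "age": 70}
predicted_class = forest_classify(class_trees, patient)
predicted_value = forest_regress(reg_trees, patient)
```

Ties in the vote are broken here by whichever class was counted first, which is a choice of this sketch rather than something the slides specify.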

Page 21: Tree Based Methods for Analyzing

Clustering with random forest predictors

Page 22: Tree Based Methods for Analyzing

Intrinsic Proximity Measure

• Terminal tree nodes contain few observations

• If case i and case j both land in the same terminal node, increase the proximity between i and j by 1.

• At the end of the run divide by 2* no. of trees.

• Dissimilarity=sqrt(1-Proximity)
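The proximity computation can be sketched from a table of terminal-node assignments. This is an illustrative sketch with hypothetical node IDs. It normalizes by the number of trees so that each proximity is the fraction of trees in which two cases share a terminal node; the factor of 2 on the slide is not reproduced, on the assumption that it reflects how the original implementation accumulates counts.

```python
from math import sqrt

def rf_proximity(leaves):
    """leaves[t][i] = terminal node id of case i in tree t.

    Returns the matrix of fractions of trees in which cases i and j
    land in the same terminal node (every entry lies in [0, 1]).
    """
    n_trees, n = len(leaves), len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for tree_leaves in leaves:
        for i in range(n):
            for j in range(n):
                if tree_leaves[i] == tree_leaves[j]:
                    prox[i][j] += 1.0
    return [[p / n_trees for p in row] for row in prox]

def rf_dissimilarity(prox):
    """Dissimilarity = sqrt(1 - proximity)."""
    return [[sqrt(1.0 - p) for p in row] for row in prox]

# Hypothetical terminal-node assignments: 4 trees (rows), 3 cases (columns).
leaves = [
    [1, 1, 2],
    [3, 3, 3],
    [5, 6, 6],
    [7, 7, 8],
]
prox = rf_proximity(leaves)
diss = rf_dissimilarity(prox)
```

Cases 0 and 1 share a terminal node in three of the four trees, so their proximity is high and their dissimilarity is small; a case always has proximity 1 and dissimilarity 0 with itself.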

Page 23: Tree Based Methods for Analyzing

Casting an unsupervised problem into a supervised RF problem

• Key Idea (Breiman 1999)
  – Label observed data as class 1
  – Generate synthetic observations and label them as class 2
  – Construct a RF predictor to distinguish class 1 from class 2
  – Use the resulting dissimilarity measure in unsupervised analysis

Page 24: Tree Based Methods for Analyzing

How to generate synthetic observations

• Synthetic observations are simulated to contain no clusters
  – e.g. by randomly sampling from the product of empirical marginal distributions of the input.
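Sampling from the product of empirical marginals amounts to drawing each variable independently from its own observed values. The sketch below shows one way to do this; the function name, the seed and the toy data are hypothetical.

```python
import random

def synthetic_from_marginals(data, n_synthetic=None, seed=0):
    """Sample each variable independently from its observed values.

    data: list of rows (one row per observation).
    Independent sampling keeps each marginal distribution but breaks
    the dependence between variables.
    """
    rng = random.Random(seed)
    n_synthetic = len(data) if n_synthetic is None else n_synthetic
    columns = list(zip(*data))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_synthetic)]

# Hypothetical observed data: two perfectly dependent staining scores.
observed = [(0, 0), (0, 0), (100, 100), (100, 100)]
synthetic = synthetic_from_marginals(observed, n_synthetic=200)

# Class 1 = observed, class 2 = synthetic; an RF classifier would be
# trained on (rows, labels) to tell the two classes apart.
rows = observed + synthetic
labels = [1] * len(observed) + [2] * len(synthetic)
```

In the observed toy data the two scores always agree, whereas the synthetic rows also contain combinations such as (0, 100); it is this difference in dependence structure that the forest learns to detect.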

Page 25: Tree Based Methods for Analyzing

RF clustering

• Compute distance matrix from RF
  – distance matrix = sqrt(1 - proximity matrix)

• Compute the first 2 to 3 classical multi-dimensional scaling coordinates based on the distance matrix

• Conduct partitioning around medoids (PAM) clustering analysis
  – input parameter = no. of clusters k
  – use the Euclidean distance between the resulting scaling points
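The scaling and clustering steps can be sketched without external libraries. This is a simplified illustration under stated assumptions: `classical_mds` finds the leading eigenvectors by power iteration (a real analysis would use a library eigen-solver), and `pam` is a reduced version of PAM with a naive initialization followed by medoid swaps. The distance matrix is hypothetical; in the RF setting it would be sqrt(1 - proximity).

```python
from math import dist, sqrt

def classical_mds(d, k=2, iters=500):
    """Classical MDS of a symmetric distance matrix d into k coordinates."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    tot = sum(row) / n
    # Double-centered matrix B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for c in range(k):
        v = [float(i + 1) for i in range(n)]        # deterministic start vector
        for _ in range(iters):                      # power iteration
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        vv = sum(x * x for x in v)
        lam = sum(x * y for x, y in zip(v, bv)) / vv    # Rayleigh quotient
        if lam <= 1e-9:
            break                                   # remaining coordinates stay 0
        for i in range(n):
            coords[i][c] = sqrt(lam / vv) * v[i]
        # Deflation: remove the component just found.
        b = [[b[i][j] - lam * v[i] * v[j] / vv for j in range(n)]
             for i in range(n)]
    return coords

def pam(points, k, max_iter=100):
    """Simplified PAM: swap medoids while the total Euclidean cost decreases."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]

    def cost(meds):
        return sum(min(d[i][m] for m in meds) for i in range(n))

    medoids = list(range(k))                        # naive initialization
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                c = cost(trial)
                if c < best - 1e-12:
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = [min(range(k), key=lambda m: d[i][medoids[m]]) for i in range(n)]
    return labels, medoids

# Hypothetical distance matrix with two well-separated groups of cases.
positions = [0.0, 1.0, 10.0, 11.0]
d = [[abs(a - b) for b in positions] for a in positions]
coords = classical_mds(d, k=2)
labels, medoids = pam(coords, k=2)
```

Clustering the scaling coordinates rather than the raw dissimilarities matches the slide, which applies PAM to the Euclidean distance between the scaling points.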

Page 26: Tree Based Methods for Analyzing

Theoretical Study of RF Clustering

Ref: Using random forest proximity for unsupervised learning, BIOKDD-CBGI'03, 7th Joint Conference on Information Sciences, Cary, North Carolina.

Page 27: Tree Based Methods for Analyzing

Applying Random Forest Clustering to Tissue Microarray Data--Application to Kidney Cancer

Tao Shi and Steve Horvath

Page 28: Tree Based Methods for Analyzing

Scientific Question: Can one discover cancer subtypes based on the protein expression patterns of tumor markers?

Page 29: Tree Based Methods for Analyzing

Why use RF clustering for TMA data?

• no need to transform the often highly skewed features
  – based on ranks of features

• natural way of weighing tumor marker contributions to the dissimilarity

• elegant way to deal with missing covariates

• intrinsic proximity matrix handles mixed variable types well

Page 30: Tree Based Methods for Analyzing

Kidney Multi-marker Data

• 366 patients with renal cell carcinoma (RCC) admitted to UCLA between 1989 and 2000.

• Immunohistological measures of a total of 8 tumor markers were obtained from tissue microarrays constructed from the tumor samples of these patients.

Page 31: Tree Based Methods for Analyzing

MDS plot of clear cell patients

• Labeled and colored by their RF cluster

[Figure: classical MDS plot ("cmd plot"), coordinate 1 versus coordinate 2; each patient is plotted as its RF cluster label (1, 2 or 3).]

Page 32: Tree Based Methods for Analyzing

Interpreting the clusters in terms of survival

[Figure: K-M curves by clustering label; x-axis: Time to death (Months); y-axis: Survival; log rank p value = 0.00037423.]

Clustering label   Non-clear cell patients   Clear cell patients
1                  0                         92
2                  20                        215
3                  30                        9

Page 33: Tree Based Methods for Analyzing

Hierarchical clustering with Euclidean distance leads to less satisfactory results

[Figure: Cluster dendrogram from hclust (average linkage) applied to dist(KidneyRF); y-axis: Height; leaves labeled 0 or 1.]

Clustering label   Non-clear cell patients   Clear cell patients
1                  9 (20)                    286 (307)
2                  41 (30)                   30 (9)

* Values in parentheses give the RF clustering grouping (shown in red on the slide).

Page 34: Tree Based Methods for Analyzing

Euclidean vs. RF Distance

[Figure: scatterplot of RF distance (y-axis) versus Euclidean distance (x-axis).]

Page 35: Tree Based Methods for Analyzing

Molecular grouping vs. Pathological grouping

Message: molecular grouping is superior to pathological grouping

[Figure: two K-M panels, Molecular Grouping and Pathological Grouping; x-axis: Time to death (years); y-axis: Survival.]

Molecular Grouping: 327 patients in clusters 1 and 2, 39 patients in cluster 3

Pathological Grouping: 316 clear cell patients, 50 non-clear cell patients

p-values shown on the panels: p = 0.0229 and p = 9.03e-05

Page 36: Tree Based Methods for Analyzing

Identify “irregular” patients

Clustering label   Non-clear cell patients   Clear cell patients
1                  0                         92
2                  20                        215
3                  30                        9

Message: molecular grouping can be used to refine the clear cell definition.

[Figure: K-M curves; x-axis: Time to death (years); y-axis: Survival; p = 0.00522. Groups: 9 irregular clear cell patients, 307 regular clear cell patients, 50 non-clear cell patients.]

Page 37: Tree Based Methods for Analyzing

Detect novel cancer subtypes

• Group clear cell grade 2 patients into two clusters with significantly different survival.

[Figure: K-M curves for the two clusters; x-axis: Time to death (years); y-axis: Survival; p value = 0.0125.]

Page 38: Tree Based Methods for Analyzing

Results TMA clustering

• Clusters reproduce well known clinical subgroups
  – Ex: global expression differences between clear cell and non-clear cell patients
  – RF clustering works better than clustering based on the Euclidean distance for TMA data

• RF clustering allows one to identify “outlying” tumor samples.

• Can detect previously unknown sub-groups

Page 39: Tree Based Methods for Analyzing

Boxplots of tumor marker expression vs. cluster

[Figure: boxplots of tumor marker expression (y-axis) by cluster 1, 2, 3 (x-axis), one panel per marker. Panel variables and p-values:]

• CA9MemPosMn: p = 9.95e-28

• CA12MemPosMn: p = 4.61e-15

• Ki67PosMn: p = 3.51e-13

• GePosHarriMn: p = 3.33e-21

• p53PosMn: p = 1.7e-10

• EpDctPosMn: p = 1.64e-14

• pTENPosMn: p = 1.43e-27

• VimPos: p = 7.97e-14

Message: clusters can be explained in terms of tumor marker expression values, i.e. in terms of biological pathways.

Page 40: Tree Based Methods for Analyzing

Conclusions

• There is a need to develop tailor-made data mining methods for TMA data
  – Major differences:
    • highly non-normal data
    • Euclidean distance metrics seem to be suboptimal for TMA data

• Tree or forest based methods work well for kidney and prostate TMA data