discrimination and clustering with microarray gene expression data

Discrimination and clustering with microarray gene expression data Terry Speed, Jane Fridlyand, Yee Hwa Yang and Sandrine Dudoit* Department of Statistics, UC Berkeley, *Department of Biochemistry, Stanford University ENAR, Charlotte NC, March 27 2001

Page 1: Discrimination and clustering with microarray gene expression data

Discrimination and clustering with microarray gene expression data

Terry Speed, Jane Fridlyand, Yee Hwa Yang and Sandrine Dudoit*

Department of Statistics, UC Berkeley, *Department of Biochemistry, Stanford University

ENAR, Charlotte NC, March 27 2001

Page 2: Discrimination and clustering with microarray gene expression data

Outline

Introductory comments

Classification

Clustering

A synthesis

Concluding remarks

Page 3: Discrimination and clustering with microarray gene expression data

Tumor classification

A reliable and precise classification of tumors is essential for successful treatment of cancer.

Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables.

In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous.

DNA microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression on a genomic scale.

Page 4: Discrimination and clustering with microarray gene expression data

Tumor classification, ctd

There are three main types of statistical problems associated with tumor classification:

1. The identification of new/unknown tumor classes using gene expression profiles;

2. The classification of malignancies into known classes;

3. The identification of “marker” genes that characterize the different tumor classes.

These issues are relevant to other questions we meet, e.g. characterising/classifying neurons or the toxicity of chemicals administered to cells or model animals.

Page 5: Discrimination and clustering with microarray gene expression data

Gene Expression Data

Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples)

Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...

Gene expression level of gene i in mRNA sample j = Log(Red intensity / Green intensity), or Log(Avg. PM - Avg. MM)

Page 6: Discrimination and clustering with microarray gene expression data

Comparison of discrimination methods

In this field many people are inventing new methods of classification or using quite complex ones (e.g. SVMs). Is this necessary?

We did a study comparing several methods on three publicly available tumor data sets: the Leukemia data set, the Lymphoma data set, and the NIH 60 tumor cell line data, as well as some unpublished data sets.

We compared NN, FLDA, DLDA, DQDA and CART, the last with or without aggregation (bagging or boosting).

The results were unequivocal: simplest is best!
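To give a sense of how simple the simplest methods are, here is a minimal sketch of DLDA, read here as diagonal linear discriminant analysis (the slide does not expand the abbreviation). It assumes equal class priors, and the function names `fit_dlda` and `predict_dlda` are hypothetical. Each sample is assigned to the class whose mean profile is closest once every gene is scaled by its pooled within-class variance.

```python
from collections import defaultdict


def fit_dlda(X, y):
    """Return per-class gene means and pooled per-gene variances.

    X: list of samples, each a list of p expression values; y: class labels.
    Assumes more samples than classes and non-zero pooled variances.
    """
    p = len(X[0])
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    means = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
             for k, rows in by_class.items()}
    # Pooled within-class variance per gene: only the diagonal of the
    # covariance matrix is used, correlations between genes are ignored.
    dof = len(X) - len(by_class)
    pooled = [sum((r[j] - means[k][j]) ** 2
                  for k, rows in by_class.items() for r in rows) / dof
              for j in range(p)]
    return means, pooled


def predict_dlda(x, means, pooled):
    """Assign x to the class with the smallest variance-scaled distance."""
    def score(k):
        return sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, means[k], pooled))
    return min(means, key=score)
```

With p much larger than n, a full covariance matrix cannot be estimated reliably, which is one plausible reason a diagonal rule of this kind holds up well.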

Page 7: Discrimination and clustering with microarray gene expression data
Page 8: Discrimination and clustering with microarray gene expression data

Lymphoma data set: 29 B-CLL, 9 FL, 43 DLBCL.

[Figure: images of the correlation matrix between the 81 samples, based on 4,682 genes and on 50 genes.]

Page 9: Discrimination and clustering with microarray gene expression data
Page 10: Discrimination and clustering with microarray gene expression data
Page 11: Discrimination and clustering with microarray gene expression data

Cluster Analysis

Can cluster genes, cell samples, or both.

Strengthens signal when averages are taken within clusters of genes (Eisen).

Useful (essential?) when seeking new subclasses of cells, tumors, etc.

Leads to readily interpreted figures.

Page 12: Discrimination and clustering with microarray gene expression data

Clusters

Taken from Nature, February 2000: paper by A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling".

Page 13: Discrimination and clustering with microarray gene expression data

Discovering sub-groups

Page 14: Discrimination and clustering with microarray gene expression data

Clustering problems

Suppose we have gene expression data on p genes for n tumor mRNA samples in the form of gene expression profiles xi = (xi1, …, xip), i=1,…,n.

Three related tasks are:

1. Estimating the number of tumor clusters;

2. Assigning each tumor sample to a cluster;

3. Assessing the strength/confidence of cluster assignments for individual tumors.

These are generic clustering problems.

Page 15: Discrimination and clustering with microarray gene expression data

Assessing the strength/confidence of cluster assignments

The silhouette width of an observation is

s = (b - a) / max(a, b)

where a is the average dissimilarity between the observation and all others in the cluster to which it belongs, and b is the smallest of the average dissimilarities between the observation and ones in other clusters. Large s means well clustered.
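The definition above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the talk: the function name `silhouette_widths` is hypothetical, the input is assumed to be a full matrix of pairwise dissimilarities, and observations that are alone in their cluster are given s = 0 by convention.

```python
def silhouette_widths(D, labels):
    """Silhouette width s = (b - a) / max(a, b) for every observation.

    D: symmetric matrix (list of lists) of pairwise dissimilarities.
    labels: cluster label of each observation (at least two clusters).
    """
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        def avg_to(c):
            # Average dissimilarity from observation i to cluster c,
            # leaving out i itself.
            others = [D[i][j] for j in range(n) if labels[j] == c and j != i]
            return sum(others) / len(others) if others else None

        a = avg_to(labels[i])
        if a is None:
            # Singleton cluster: s is set to 0 (assumed convention).
            widths.append(0.0)
            continue
        # b: smallest average dissimilarity to any other cluster.
        b = min(avg_to(c) for c in clusters if c != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths
```

For four points at 0, 1, 10 and 11 on a line, split into the clusters {0, 1} and {10, 11}, the point at 0 has a = 1 and b = 10.5, so s = 9.5 / 10.5, which is close to 1 as expected for a well-separated cluster.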

Page 16: Discrimination and clustering with microarray gene expression data

Bagging

• In discriminant analysis, it is well known that gains in accuracy can be obtained by aggregating predictors built from perturbed versions of the learning set (cf. bagging and boosting).

• In the bootstrap aggregating or bagging procedure, perturbed learning sets of the same size as the original learning set are formed by drawing at random with replacement from the learning set, i.e., by forming non-parametric bootstrap replicates of the learning set.

• Predictors are built for each perturbed dataset and aggregated by plurality voting.
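As a rough sketch of the bagging procedure for a single new observation: the helper name `bagged_predict`, the `fit`/`predict` callables and the default of B = 50 replicates are all assumptions made for illustration, and any base classifier could be plugged in.

```python
import random
from collections import Counter


def bagged_predict(train_X, train_y, x, fit, predict, B=50, seed=0):
    """Classify x by plurality vote over B bootstrap-trained predictors.

    fit(X, y) -> model and predict(model, x) -> label are supplied by the
    caller (hypothetical interface).
    """
    rng = random.Random(seed)
    n = len(train_X)
    votes = Counter()
    for _ in range(B):
        # Non-parametric bootstrap: draw n cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([train_X[i] for i in idx], [train_y[i] for i in idx])
        votes[predict(model, x)] += 1
    # Plurality vote; ties go to the label that was seen first.
    return votes.most_common(1)[0][0], votes
```

The returned vote counts can also serve as a crude measure of how confident the aggregated prediction is.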

Page 17: Discrimination and clustering with microarray gene expression data

Bagging a clustering algorithm

For a fixed number k of clusters:

– Generate multiple bootstrap learning sets (B=50);

– Apply the clustering algorithm to each bootstrap learning set;

– Re-label the clusters for the bootstrap learning sets so that there is maximum overlap with the original clustering of these observations;

– The cluster assignment of each observation is then obtained by plurality voting.

Record for each observation its cluster vote (CV), which is the proportion of votes in favour of the “winning” cluster.
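The steps on this slide might be sketched as follows. The names `relabel` and `bag_clustering` are hypothetical, `cluster_fn(X, k)` stands for any clustering algorithm that returns integer labels 0..k-1, and two details are assumptions not stated on the slide: an observation drawn several times in one bootstrap set casts one vote per appearance, and an observation that is never drawn keeps its original label.

```python
import random
from itertools import permutations


def relabel(labels, reference, k):
    """Permute cluster labels to maximise overlap with the reference labels."""
    best = max(permutations(range(k)),
               key=lambda perm: sum(perm[l] == r
                                    for l, r in zip(labels, reference)))
    return [best[l] for l in labels]


def bag_clustering(X, cluster_fn, k, B=50, seed=0):
    """Return (cluster assignment, cluster vote CV) for each observation."""
    rng = random.Random(seed)
    n = len(X)
    original = cluster_fn(X, k)
    votes = [[0] * k for _ in range(n)]
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap learning set
        boot = cluster_fn([X[i] for i in idx], k)
        # Align bootstrap labels with the original clustering of the same
        # observations before counting votes.
        boot = relabel(boot, [original[i] for i in idx], k)
        for i, label in zip(idx, boot):
            votes[i][label] += 1
    assignment, cluster_vote = [], []
    for i in range(n):
        total = sum(votes[i])
        if total == 0:
            # Never drawn (assumed fallback): keep the original label.
            assignment.append(original[i])
            cluster_vote.append(None)
            continue
        winner = max(range(k), key=lambda c: votes[i][c])
        assignment.append(winner)
        cluster_vote.append(votes[i][winner] / total)
    return assignment, cluster_vote
```

Trying all k! permutations in `relabel` is only practical for a small number of clusters. A CV near 1 indicates an observation that lands in the same cluster in almost every bootstrap replicate.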

Page 18: Discrimination and clustering with microarray gene expression data

Lymphoma data set

Page 19: Discrimination and clustering with microarray gene expression data

Leukemia data set

Page 20: Discrimination and clustering with microarray gene expression data
Page 21: Discrimination and clustering with microarray gene expression data

Comparison of clustering and other approaches to microarray data analysis

Cluster analyses:

1) are usually outside the normal framework of statistical inference;

2) are less appropriate when only a few genes are likely to change;

3) need lots of experiments.

Single gene approaches:

1) may be too noisy in general to show much;

2) may not reveal coordinated effects of positively correlated genes;

3) are harder to relate to pathways.

Page 22: Discrimination and clustering with microarray gene expression data
Page 23: Discrimination and clustering with microarray gene expression data

Clustering as a means to an end

We and others (Stanford) are working on methods which try to combine clustering with more traditional approaches to microarray data analysis.

Idea: find clusters of genes and average their responses to reduce noise and enhance interpretability.

Use testing to assign significance with averages of clusters of genes as we would with single genes.

Page 24: Discrimination and clustering with microarray gene expression data

Clustering genes

Let p = number of genes.

1. Calculate within class correlation.

2. Perform hierarchical clustering which will produce (2p-1) clusters of genes.

3. Average within clusters of genes.

4. Perform testing on averages of clusters of genes as if they were single genes.

E.g. p=5: the single genes 1, 2, 3, 4, 5 count as clusters 1 to 5, and the hierarchical clustering adds

Cluster 6 = (1,2)

Cluster 7 = (1,2,3)

Cluster 8 = (4,5)

Cluster 9 = (1,2,3,4,5)

giving 2p-1 = 9 clusters in all.
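The four steps can be sketched in plain Python as below. Several choices here are assumptions, since the slide does not specify them: average linkage on the dissimilarity 1 - correlation, a Welch two-sample t-statistic as the test, and all function names. The sketch also assumes no gene is constant within every class, which would make its correlation undefined.

```python
from statistics import mean, variance


def within_class_corr(expr, classes):
    """Gene-gene correlation after removing each gene's class means.

    expr[g][s]: expression of gene g in sample s; classes[s]: class of s.
    """
    resid = []
    for row in expr:
        cm = {c: mean(v for v, k in zip(row, classes) if k == c)
              for c in set(classes)}
        resid.append([v - cm[k] for v, k in zip(row, classes)])

    def corr(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5

    p = len(expr)
    return [[corr(resid[i], resid[j]) for j in range(p)] for i in range(p)]


def gene_clusters(corr):
    """Average-linkage agglomeration; returns all 2p-1 clusters as tuples."""
    active = [(g,) for g in range(len(corr))]
    all_clusters = list(active)          # the p single genes come first

    def dist(a, b):
        return mean(1 - corr[i][j] for i in a for j in b)

    while len(active) > 1:
        pairs = [(dist(a, b), a, b)
                 for x, a in enumerate(active) for b in active[x + 1:]]
        _, a, b = min(pairs)             # closest pair of clusters
        merged = tuple(sorted(a + b))
        active = [c for c in active if c not in (a, b)] + [merged]
        all_clusters.append(merged)
    return all_clusters


def t_statistic(values, classes, treated):
    """Welch t-statistic comparing the treated class with the rest."""
    x = [v for v, k in zip(values, classes) if k == treated]
    y = [v for v, k in zip(values, classes) if k != treated]
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / se


def cluster_tests(expr, classes, treated):
    """Test the per-sample average of each cluster like a single gene."""
    results = []
    for cluster in gene_clusters(within_class_corr(expr, classes)):
        avg = [mean(expr[g][s] for g in cluster) for s in range(len(classes))]
        results.append((cluster, t_statistic(avg, classes, treated)))
    return results
```

Because the single genes are kept as clusters of size one, the output can be ranked in one list, with single genes and gene clusters competing on the same statistic.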

Page 25: Discrimination and clustering with microarray gene expression data

Data - Ro1

Transgenic mice with a modified Gi-coupled receptor (Ro1).

Experiment: induced expression of Ro1 in mice.

8 control (ctl) mice

9 treatment mice, eight weeks after Ro1 was induced.

Long-term question: Which groups of genes work together?

Based on paper: Conditional expression of a Gi-coupled receptor causes ventricular conduction delay and a lethal cardiomyopathy, see Redfern C. et al. PNAS, April 25, 2000.

http://www.pnas.org also

http://www.GenMAPP.org/ (Conklin lab, UCSF)

Page 26: Discrimination and clustering with microarray gene expression data

Histogram

Cluster of genes (1703, 3754)

Page 27: Discrimination and clustering with microarray gene expression data

Top 15 averages of gene clusters

T        Group ID
-13.4    7869 = (1703, 3754)
-12.1    3754
 11.8    6175
 11.7    4689
 11.3    6089
 11.2    1683
-10.7    2272
 10.7    9955 = (6194, 1703, 3754)
 10.7    5179
 10.6    3916
-10.4    8255 = (4572, 4772, 5809)
-10.4    4772
-10.4    10548 = (2534, 1343, 1954)
 10.3    9476 = (6089, 5455, 3236, 4014)

Might be influenced by 3754

Correlation matrices shown on the slide:

\begin{bmatrix} 1 & 0.7 & 0.7 \\ 0.7 & 1 & 0.8 \\ 0.7 & 0.8 & 1 \end{bmatrix}

\begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.8 \\ 0.5 & 0.8 & 1 \end{bmatrix}

Page 28: Discrimination and clustering with microarray gene expression data

Closing remarks

More sophisticated classification methods may become justified when data sets are larger.

There seems to be considerable room for approaches which bring cluster analysis into a more traditional statistical framework.

The idea of using clustering to obtain derived variables seems promising, but has yet to realise this promise.

Page 29: Discrimination and clustering with microarray gene expression data

Acknowledgments

UCB
Yee Hwa Yang
Jane Fridlyand

WEHI
Natalie Thorne

PMCI
David Bowtell
Chuang Fong Kong

Stanford
Sandrine Dudoit

UCSF
Bruce Conklin
Karen Vranizan