Analyzing Expression Data: Clustering and Stats

Chapter 16

Goals

• We’ve measured the expression of genes or proteins using the technologies discussed previously.

• What can we do with that information?

– Identify significant differences in expression

– Identify similar patterns of expression (clustering)

Analysis steps

1. Data normalization

2. Statistical Analysis

3. Cluster Analysis

I. Data Normalization

• Why normalize?

– Removes systematic errors

– Makes the data easier to analyze statistically

Sources of Error

• Measurements always contain errors.

– Systematic (oops)

– Random (noise!)

• Subtracting the background level can remove some systematic error.

– Using the ratio in two-channel experiments does this.

– Subtracting the overall average intensity can be used with one-channel data.

• Taking averages over replicates of the experiment reduces the random error.

• Advanced error models are mentioned on p. 628 and covered in “Further Reading”.
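The two simplest corrections above (background subtraction and averaging over replicates) can be sketched as follows. This is a minimal illustration and not an error model; the function names and the intensity values are made up for the example.

```python
from statistics import mean

def subtract_background(intensities, background):
    """Remove a constant background level from every spot intensity."""
    return [x - background for x in intensities]

def average_replicates(replicates):
    """Average each spot over replicate experiments.

    `replicates` is a list of experiments; each experiment lists the
    intensities of the same spots in the same order.
    """
    return [mean(spot_values) for spot_values in zip(*replicates)]

# Three hypothetical replicate measurements of the same four spots.
replicates = [
    [120.0, 480.0, 95.0, 1010.0],
    [130.0, 510.0, 105.0, 990.0],
    [110.0, 510.0, 100.0, 1000.0],
]
corrected = [subtract_background(r, background=100.0) for r in replicates]
averaged = average_replicates(corrected)
print(averaged)
```

Averaging n replicates shrinks the random error of each spot by roughly a factor of √n, while the background subtraction targets the systematic part.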

Expression data usually not Gaussian (normal)

• Many statistical tests assume that the data is normally distributed.

• Expression microarray spot intensity data (for example) is not.

• Intensity ratio data (two-channel) is not normal either.

• Both range from 0 to infinity and are skewed to the right, whereas normally distributed data is symmetric about its mean.


Taking the logarithm helps normalize expression ratio data

• The expression ratio can be plotted against the expression level (the geometric mean of the intensities in the two channels).

• Plotting the log ratio vs. the log expression level gives data that is centered around y=0 and fairly “normal looking”.
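A short sketch of the log transform, assuming two-channel intensities called `red` and `green` (names chosen here for illustration). The labels M and A follow the common "MA plot" convention and do not come from the slides.

```python
import math

def log_ratio_and_level(red, green):
    """Return (M, A) for one spot.

    M = log2(R / G) is the log expression ratio.
    A = log2(sqrt(R * G)) is the log of the geometric-mean expression level.
    """
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

# A four-fold increase and a four-fold decrease (hypothetical intensities).
up = log_ratio_and_level(400.0, 100.0)
down = log_ratio_and_level(100.0, 400.0)
print(up, down)
```

On the raw scale the two ratios are 4 and 0.25, which are not symmetric around 1. On the log scale they become +2 and -2, symmetric around 0.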



Taking the log of the expression ratio “fixes” the left tail



LOWESS Normalization

• Sometimes there is still a bias that depends on the expression level.

• This can be removed by a type of regression called “Locally Weighted Scatterplot Smoothing”.

• This computes the local mean of the log ratio at each value of the expression level (√(RG), the geometric mean of the two channels) and subtracts it.
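The idea can be sketched with a simplified locally weighted linear fit. This is an illustrative sketch only: it uses tricube weights over the nearest fraction `frac` of points and omits the robustness reweighting of full LOWESS. The names `lowess_fit`, `frac`, `a_values` and `m_values` are assumptions of this example.

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Fit y locally as a weighted straight line in x, evaluated at every x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        # Window half-width: distance to the k-th nearest point.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at the window edge.
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical data: log ratio with a bias that grows with expression level.
a_values = [0.5 * i for i in range(20)]          # log expression level
m_values = [0.3 * a - 1.0 for a in a_values]     # biased log ratio
trend = lowess_fit(a_values, m_values)
normalized = [m - t for m, t in zip(m_values, trend)]
print(max(abs(v) for v in normalized))
```

Subtracting the fitted trend recentres the log ratios on zero at every expression level.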



II. Statistical Analysis

• Determining what differences in expression are statistically significant

• Controlling false positives

When are two measurements significantly different?

• We want to say that an expression ratio is significant if it is big enough (>1) or small enough (<1).

• A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.

• The significance is related to the area of the overlap of the underlying distributions.



The Z-test

• If the data is approximately normal, convert it to a Z-score.

– X can be the log expression ratio; μ is then 0; σ is the sample standard deviation; n is the number of repeats.

• The Z-score is distributed N(0,1) (standard normal).

• The significance level is the area in the tail(s) of the standard normal distribution.
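A minimal sketch of the Z-test, assuming the usual form Z = (mean(X) − μ) / (σ / √n) with μ = 0 for log ratios. The slide's own formula did not survive extraction, so this form is an assumption. The function names and the sample values are illustrative.

```python
import math
from statistics import mean, stdev

def z_score(values, mu=0.0):
    """Z-score of the sample mean against the expected value mu."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / math.sqrt(n))

def two_sided_p(z):
    """Area in both tails of the standard normal beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Five hypothetical repeated measurements of one gene's log ratio.
log_ratios = [1.0, 1.2, 0.8, 1.1, 0.9]
z = z_score(log_ratios)
p = two_sided_p(z)
print(z, p)
```

A one-sided test would use half of the two-sided area.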




The t-test

• The t-test makes fewer assumptions about the data than the Z-test

• It can be applied to compare two average measurements, which can have

– Different variances

– Different numbers of observations

• You compute the t-statistic (see pages 654-655) and then look up its significance level in a table of Student’s t distribution.
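A sketch of the t-statistic for two groups with different variances and different numbers of observations. It assumes Welch's form of the test, which may differ in detail from the formula on the cited pages. The function name and the data are illustrative.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, degrees of freedom) for two samples with unequal variances."""
    na, nb = len(a), len(b)
    va = variance(a) / na   # squared standard error of mean(a)
    vb = variance(b) / nb   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical expression levels under two conditions.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 4.0, 6.0, 8.0]
t, df = welch_t(control, treated)
print(t, df)
```

The pair (t, df) is what you would take to a t-distribution table to read off the significance level.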

III. Cluster Analysis

• Similar expression patterns

– Groups of genes/proteins with similar expression profiles

• Similar expression sub-patterns

– Groups of genes/proteins with similar expression profiles in a subset of conditions

• Different clustering methods

• Assessing the value of clusters

Example: Gene Expression Profiles

• Expression level of a gene is measured at different time points after treating cells.

• Many different expression profiles are possible.

– No effect

– Immediate increase or decrease

– Delayed increase or decrease

– Transient increase or decrease

Clustering by Eye

• n genes or proteins

• m different samples (or conditions)

• Represent a gene as a point:

– X = <x1, x2, …, xm>

• If m is 1 or 2 (or even 3) you can plot the points and look for clusters of genes with similar expression.

– But what if m is bigger than 3?

– Need to reduce the dimensionality: PCA

Reducing the Dimensionality of Data: Principal Components Analysis

• PCA linearly maps each point to a small set of dimensions (components).

– The principal components are dimensions that capture the maximum variation in the data.

• The principal components capture most of the important information in the data (usually).

• Plotting each point’s values in two of the principal component dimensions allows us to see clusters.



2-D Gel Data

PCA: An Illustration

Yeast Cell Cycle Gene Expression

• The singular value decomposition (SVD) of a matrix X is

– X = U Σ V^T

• The mapped value of X is

– Y = X V (equivalently, Y = U Σ)

• The rows of Y give the mapping of each gene.

– Mapped gene i: Yi = <y1, y2, …, ym>
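The mapping can be sketched without a linear algebra library. Note the swap: instead of a full SVD, this example finds the leading principal directions by power iteration on X^T X with deflation. For a column-centred X these directions are its leading right singular vectors. Centring the columns first is an assumption of this sketch, and all names and data are illustrative.

```python
import math

def center_columns(X):
    """Subtract each column (sample) mean so the data is centred."""
    m = len(X[0])
    means = [sum(row[j] for row in X) / len(X) for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in X]

def leading_eigenvector(C, iters=500):
    """Leading eigenvector and eigenvalue of symmetric C by power iteration."""
    m = len(C)
    v = [float(j + 1) for j in range(m)]  # arbitrary non-zero start
    norm = math.sqrt(sum(a * a for a in v))
    v = [a / norm for a in v]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    Cv = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
    return v, sum(a * b for a, b in zip(v, Cv))

def pca(X, n_components=2):
    """Return (Y, components): the mapped rows and the principal directions."""
    Xc = center_columns(X)
    m = len(Xc[0])
    C = [[sum(r[i] * r[j] for r in Xc) for j in range(m)] for i in range(m)]
    components = []
    for _ in range(n_components):
        v, lam = leading_eigenvector(C)
        components.append(v)
        # Deflation: remove the direction just found before the next search.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    Y = [[sum(r[j] * v[j] for j in range(m)) for v in components] for r in Xc]
    return Y, components

# Four hypothetical genes (rows) measured in three samples (columns).
X = [[7.0, 3.0, 1.0], [3.0, 3.0, 1.0], [5.0, 4.0, 1.0], [5.0, 2.0, 1.0]]
Y, components = pca(X, n_components=2)
print(Y)
```

Each row of `Y` holds one gene's coordinates in the first two principal components, which is what gets plotted to look for clusters by eye.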

@PNAS (2000)

Clustering Using Statistics

• Algorithm identifies groups.

– Example: similar expression profiles

• Distance measure between pairs of points is needed.



Distance Measures Between Pairs of Points

• In order to cluster the points (genes or conditions), we need some concept of which points are “close” to each other.

• So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n by m matrix.

• We can then compute all the pair-wise distances between rows (or columns).

Standard Distance Measures

• Euclidean Distance

• Pearson Correlation Coefficient

• Mahalanobis Distance

Euclidean Distance

• Standard, everyday distance

– Treats all dimensions equally

– If some genes vary more than others (have higher variance), they influence the distance more.

Mahalanobis Distance

• The “normalized” Euclidean distance

• Scales each dimension by the variance in that dimension.

– This is useful if the genes tend to vary much more in one sample than in others, since it reduces the effect of that sample on the distances.
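The difference between the plain and the variance-scaled distance can be sketched as below. The sketch follows the per-dimension scaling described above. The general Mahalanobis distance uses the full covariance matrix, which this example does not do. Function names and data are illustrative.

```python
import math
from statistics import variance

def euclidean(x, y):
    """Plain Euclidean distance: every dimension counts equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def variance_scaled(x, y, variances):
    """Euclidean distance after dividing each squared difference
    by the variance of that dimension."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Three hypothetical genes in two samples; the second sample varies far more.
profiles = [[0.0, 0.0], [1.0, 100.0], [2.0, 200.0]]
variances = [variance(column) for column in zip(*profiles)]

d_plain = euclidean(profiles[0], profiles[1])
d_scaled = variance_scaled(profiles[0], profiles[1], variances)
print(d_plain, d_scaled)
```

The plain distance is dominated by the high-variance sample, while after scaling both samples contribute equally.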

Pearson Correlation Coefficient

• Distances are small when two genes have similar patterns of change, even if the sizes of the changes differ.

• This is accomplished by scaling by the sample variance of the gene’s expression levels under different conditions.
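A sketch of a correlation-based distance. Using 1 − r as the distance is a common convention assumed here, since the slides do not say how the coefficient is turned into a distance. Names and profiles are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    """0 for identical patterns, 2 for exactly opposite patterns."""
    return 1.0 - pearson(x, y)

gene_a = [1.0, 2.0, 3.0, 4.0]        # rises steadily
gene_b = [10.0, 20.0, 30.0, 40.0]    # same pattern, ten times larger
gene_c = [4.0, 3.0, 2.0, 1.0]        # opposite pattern

d_ab = pearson_distance(gene_a, gene_b)
d_ac = pearson_distance(gene_a, gene_c)
print(d_ab, d_ac)
```

Genes a and b end up at distance 0 despite the tenfold difference in magnitude, which is the behaviour described above.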

Choice of Distance Matters

• Hierarchical clustering (dendrogram) of tissues.

– Corresponds to clustering the columns of the matrix.

• Branches are different (cancer B/C vs A/B).



Clustering Algorithms

• Hierarchical Clustering

• K-means clustering

• Self-organizing maps and trees

Hierarchical Clustering

• Algorithms progressively merge clusters or split clusters.

– Merging criterion can be single-linkage or complete-linkage.

• Produce dendrograms

– Can be interpreted at different thresholds.

Types of Linkage

• A. Single Linkage

• B. Complete Linkage

• C. Centroid Method
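A minimal sketch of merge-based (agglomerative) clustering with single and complete linkage. The centroid method is not covered. It assumes Euclidean distance and stops when a requested number of clusters remains, which corresponds to cutting the dendrogram at one threshold. All names and data are illustrative.

```python
import math
from itertools import combinations

def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices).

    single:   distance between the closest pair of members
    complete: distance between the farthest pair of members
    """
    d = [math.dist(points[i], points[j]) for i in a for j in b]
    return min(d) if linkage == "single" else max(d)

def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (cluster, cluster, merge distance): the dendrogram record
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], points, linkage),
        )
        d = cluster_distance(clusters[i], clusters[j], points, linkage)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Four hypothetical one-dimensional expression values.
points = [(0.0,), (1.0,), (2.5,), (4.9,)]
single, _ = agglomerate(points, 2, linkage="single")
complete, _ = agglomerate(points, 2, linkage="complete")
print(sorted(sorted(c) for c in single))
print(sorted(sorted(c) for c in complete))
```

On this data single linkage chains the third point onto the first two, while complete linkage pairs it with the fourth. The linkage rule alone changes the result.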



K-means Clustering

• Related to Expectation Maximization

• You specify the number of clusters

• Iteratively moves the means of the clusters to maximize the likelihood (minimize total error).
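The iteration can be sketched as the standard alternation between assigning points to the nearest mean and recomputing the means. This is a minimal sketch assuming Euclidean distance. The `init` and `seed` parameters are additions for reproducibility in this example and are not part of the slides.

```python
import math
import random

def kmeans(points, k, init=None, iters=100, seed=0):
    """Return (labels, centers) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest cluster mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no change, so the means would not move either
        labels = new_labels
        # Update step: move each mean to the centre of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old mean
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated hypothetical groups of expression profiles.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, 2, init=[points[0], points[3]])
print(labels, centers)
```

The result depends on the starting means, so in practice the procedure is usually run from several random starts and the solution with the lowest total error is kept.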


