Analyzing Expression Data: Clustering and Stats Chapter 16

Upload: meredith-whitehead

Post on 18-Jan-2016

Page 1: Analyzing Expression Data: Clustering and Stats Chapter 16

Analyzing Expression Data: Clustering and Stats

Chapter 16

Goals

• We’ve measured the expression of genes or proteins using the technologies discussed previously.

• What can we do with that information?
– Identify significant differences in expression
– Identify similar patterns of expression (clustering)

Analysis steps

1. Data normalization

2. Statistical Analysis

3. Cluster Analysis

I. Data Normalization

• Why normalize?
– Removes systematic errors
– Makes the data easier to analyze statistically

Sources of Error

• Measurements always contain errors.
– Systematic (oops)
– Random (noise!)

• Subtracting the background level can remove some systematic error.
– Using the ratio in two-channel experiments does this.
– Subtracting the overall average intensity can be used with one-channel data.

• Taking averages over replicates of the experiment reduces the random error.

• Advanced error models are mentioned on p. 628 and covered in “Further Reading”.

Expression data usually not Gaussian (normal)

• Many statistical tests assume that the data is normally distributed.

• Expression microarray spot intensity data (for example) is not.

• Intensity ratio data (two-channel) is not normal either.

• Both range from 0 to infinity and are skewed to the right, whereas normally distributed data is symmetric about its mean.

Taking the logarithm helps normalize expression ratio data

• The expression ratio is plotted against the expression level (the geometric mean of the two channels).

• Plotting the log ratio vs. the log expression level gives data that is centered around y=0 and fairly “normal looking”.
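
The log transform described above can be sketched as follows. This is a minimal illustration, assuming hypothetical background-corrected red and green channel intensities; the function name and the choice of base-2 logarithms are assumptions, not taken from the slides.

```python
import math

def log_ratio_and_level(red, green):
    """Return (log2 ratio, log2 geometric-mean intensity) for one spot.

    red, green: hypothetical background-corrected intensities (must be > 0).
    """
    log_ratio = math.log2(red / green)
    # log2 of the geometric mean sqrt(red * green)
    log_level = 0.5 * math.log2(red * green)
    return log_ratio, log_level

# Two-fold up (ratio 2) and two-fold down (ratio 0.5) are asymmetric around 1
# on the raw scale but symmetric around 0 on the log scale.
up, _ = log_ratio_and_level(400.0, 200.0)
down, _ = log_ratio_and_level(200.0, 400.0)
print(up, down)
```

On the raw scale all down-regulated ratios are squeezed between 0 and 1; after the transform they occupy the same range as the up-regulated ones.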

Taking the log of the expression ratio “fixes” the left tail

LOWESS Normalization

• Sometimes there is still a bias that depends on the expression level.

• This can be removed by a type of regression called “Locally Weighted Scatterplot Smoothing”.

• This computes and subtracts the mean locally for various values of expression level (RG).
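
A rough sketch of the idea, under stated assumptions: for each spot, a straight line is fitted to its nearest neighbours using tricube weights, and the fitted value is subtracted from the log ratio. This omits the robustness iterations of full LOWESS, and the function names and the `frac` parameter (fraction of points used in each local fit) are illustrative choices.

```python
import math

def local_linear_fit(x, y, frac=0.5):
    """Smooth y as a function of x with tricube-weighted local lines."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest neighbour.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(log_level, log_ratio, frac=0.5):
    """Subtract the locally fitted trend from each log ratio."""
    trend = local_linear_fit(log_level, log_ratio, frac)
    return [m - t for m, t in zip(log_ratio, trend)]

# Hypothetical data: a log ratio that drifts upward with expression level.
levels = [float(i) for i in range(10)]
biased = [0.1 * a + 0.5 for a in levels]
corrected = lowess_normalize(levels, biased)
```

In this toy input the bias is exactly linear, so the corrected log ratios come out at essentially zero; real data would keep the spot-to-spot scatter around the removed trend.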

II. Statistical Analysis

• Determining what differences in expression are statistically significant

• Controlling false positives

When are two measurements significantly different?

• We want to say that an expression ratio is significant if it is big enough (>1) or small enough (<1).

• A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.

• The significance is related to the area of the overlap of the underlying distributions.

The Z-test

• If the data is approximately normal, convert it to a Z-score: Z = (X − μ) / (σ / √n)
– X can be the mean log expression ratio, in which case μ is 0; σ is the sample standard deviation; n is the number of repeats.

• The Z-score is distributed N(0,1) (standard normal).

• The significance level is the area in the tail(s) of the standard normal distribution.
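
As a small sketch of this test, assuming a list of replicate log expression ratios for one gene and a two-sided alternative (the function name and the sample values are hypothetical; `statistics.NormalDist` requires Python 3.8 or later):

```python
import math
from statistics import NormalDist, mean, stdev

def z_test(log_ratios, mu=0.0):
    """Return (z, two-sided p-value) for the mean of replicate log ratios."""
    n = len(log_ratios)
    standard_error = stdev(log_ratios) / math.sqrt(n)
    z = (mean(log_ratios) - mu) / standard_error
    # Area in both tails of the standard normal beyond |z|.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical replicates: a consistently up-regulated gene and an unchanged one.
print(z_test([1.0, 1.2, 0.8, 1.1, 0.9]))
print(z_test([0.5, -0.5, 0.3, -0.3]))
```

With only a handful of replicates the sample standard deviation is itself uncertain, which is one reason the t-test is often preferred.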

The t-test

• The t-test makes fewer assumptions about the data than the Z-test

• It can be applied to compare two average measurements which can have
– Different variances
– Different numbers of observations

• You compute the t-statistic (see pages 654-655) and then look up the significance level of Student’s t distribution in a table.
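
A minimal sketch of the computation, assuming the unequal-variance (Welch) form of the statistic; the textbook pages cited above may define it slightly differently. The function name and sample values are hypothetical. The code returns the statistic and the approximate degrees of freedom needed for the table lookup, not a p-value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, degrees of freedom) for two samples with unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical log expression levels under two conditions.
treated = [2.0, 2.2, 1.8, 2.1, 1.9]
control = [1.0, 1.1, 0.9]
print(welch_t(treated, control))
```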

III. Cluster Analysis

• Similar expression patterns
– Groups of genes/proteins with similar expression profiles

• Similar expression sub-patterns
– Groups of genes/proteins with similar expression profiles in a subset of conditions

• Different clustering methods

• Assessing the value of clusters

Example: Gene Expression Profiles

• Expression level of a gene is measured at different time points after treating cells.

• Many different expression profiles are possible.
– No effect
– Immediate increase or decrease
– Delayed increase or decrease
– Transient increase or decrease

Clustering by Eye

• n genes or proteins

• m different samples (or conditions)

• Represent a gene as a point:
– X = <x1, x2, …, xm>

• If m is 1 or 2 (or even 3) you can plot the points and look for clusters of genes with similar expression.
– But what if m is bigger than 3?
– Need to reduce the dimensionality: PCA

Reducing the Dimensionality of Data: Principal Components Analysis

• PCA linearly maps each point to a small set of dimensions (components).
– The principal components are dimensions that capture the maximum variation in the data.

• The principal components capture most of the important information in the data (usually).

• Plotting each point’s values in two of the principal component dimensions allows us to see clusters.

2-D Gel Data

PCA: An Illustration

Yeast Cell Cycle Gene Expression

• Singular value decomposition (SVD) of a matrix X is
– X = U Σ Vᵀ

• The mapped value of X is
– Y = X V

• The rows of Y give the mapping of each gene.
– Mapped gene i: Yi = <y1, y2, …, ym>

@PNAS (2000)
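
A dependency-free sketch of the projection, under stated assumptions: instead of calling an SVD routine, it finds the leading directions by power iteration on the covariance matrix of the column-centered data, which yields the same principal directions (up to sign) when the leading eigenvalues are distinct. The function name, the fixed start vector, and the toy data are all hypothetical; in practice a linear algebra library's SVD would normally be used.

```python
import math

def pca(rows, k=2, iters=200):
    """Return (components, scores) for the top-k principal components."""
    n, m = len(rows), len(rows[0])
    # Center each column (condition) so components describe variation, not offsets.
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    xc = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix between conditions (m x m).
    cov = [[sum(r[a] * r[b] for r in xc) / (n - 1) for b in range(m)]
           for a in range(m)]
    components = []
    for _ in range(k):
        # Power iteration from a fixed, arbitrary unit start vector (an assumption).
        scale = math.sqrt(sum((j + 1) ** 2 for j in range(m)))
        v = [(j + 1) / scale for j in range(m)]
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(m) for b in range(m))
        components.append(v)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(m)]
               for a in range(m)]
    # Project each centered gene onto the component directions.
    scores = [[sum(r[j] * v[j] for j in range(m)) for v in components]
              for r in xc]
    return components, scores

# Hypothetical genes measured in three conditions; most variation lies along
# the direction where the first two conditions change together.
genes = [[-2.0, -2.0, 0.1], [-1.0, -1.0, -0.1], [0.0, 0.0, 0.0],
         [1.0, 1.0, -0.1], [2.0, 2.0, 0.1]]
components, scores = pca(genes, k=2)
```

Plotting the two score columns against each other gives the kind of 2-D view in which clusters can be picked out by eye.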

Clustering Using Statistics

• Algorithm identifies groups.
– Example: similar expression profiles

• Distance measure between pairs of points is needed.

Distance Measures Between Pairs of Points

• In order to cluster the points (genes or conditions), we need some concept of which points are “close” to each other.

• So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n by m matrix.

• We can then compute all the pair-wise distances between rows (or columns).

Standard Distance Measures

• Euclidean Distance

• Pearson Correlation Coefficient

• Mahalanobis Distance

Euclidean Distance

• Standard, everyday distance
– Treats all dimensions equally
– If some genes vary more than others (have higher variance), they influence the distance more.
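
A short sketch of the Euclidean distance between two expression profiles (the function name and the example values are hypothetical):

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# The second dimension has a much larger spread, so it dominates the result.
print(euclidean([0.0, 0.0], [1.0, 10.0]))
```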

Mahalanobis Distance

• The “normalized” Euclidean distance

• Scales each dimension by the variance in that dimension.
– This is useful if the genes tend to vary much more in one sample than in others, since it reduces the effect of that sample on the distances.
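
A sketch of the per-dimension scaling described above. Note the assumption: this uses only the variance of each dimension (a diagonal covariance matrix), as the slide describes; the general Mahalanobis distance uses the full inverse covariance matrix and so also accounts for correlation between dimensions. Function names and data are hypothetical.

```python
import math
from statistics import variance

def column_variances(rows):
    """Sample variance of each column (dimension) of the data matrix."""
    return [variance(col) for col in zip(*rows)]

def scaled_distance(x, y, variances):
    """Euclidean distance with each squared difference divided by its variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# The second sample varies ten times as much as the first (in standard deviation).
data = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0]]
variances = column_variances(data)
print(scaled_distance(data[0], data[1], variances))
```

After scaling, a difference of 10 in the noisy sample counts the same as a difference of 1 in the quiet one.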

Pearson Correlation Coefficient

• Distances are small when two genes have similar patterns of change even if the sizes of the changes are different.

• This is accomplished by centering each gene’s expression levels and scaling by their sample standard deviation across conditions.
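
A sketch of a correlation-based distance, assuming the common convention distance = 1 − r (other conventions, such as 1 − |r|, are also in use). The function names and profiles are hypothetical.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    """0 for identical patterns of change, 2 for exactly opposite patterns."""
    return 1.0 - pearson(x, y)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same pattern, ten times larger changes
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite pattern
print(pearson_distance(gene_a, gene_b), pearson_distance(gene_a, gene_c))
```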

Choice of Distance Matters

• Hierarchical clustering (dendrogram) of tissues.
– Corresponds to clustering the columns of the matrix.

• Branches are different (cancer B/C vs A/B).

Clustering Algorithms

• Hierarchical Clustering

• K-means clustering

• Self-organizing maps and trees

Hierarchical Clustering

• Algorithms progressively merge clusters or split clusters.
– Merging criterion can be single-linkage or complete-linkage.

• Produce dendrograms
– Can be interpreted at different thresholds.
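
A small sketch of the merging (agglomerative) variant, assuming Euclidean distance and a distance threshold at which merging stops, which corresponds to cutting the dendrogram at that height. The function names and the one-dimensional toy data are hypothetical, and the quadratic pair search is written for clarity rather than speed.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerate(points, threshold, linkage="single"):
    """Merge clusters bottom-up until the closest pair exceeds threshold.

    Returns clusters as lists of indices into points.
    """
    # Single linkage compares the closest members, complete linkage the farthest.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        return pick(euclidean(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break  # equivalent to cutting the dendrogram at this height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [[0.0], [1.0], [2.0], [10.0], [11.0]]
print(agglomerate(points, 1.5, "single"))
print(agglomerate(points, 1.5, "complete"))
```

On this toy input single linkage chains the first three points into one cluster, while complete linkage leaves the third point on its own at the same threshold, illustrating how the merging criterion changes the result.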

Types of Linkage

• A. Single Linkage

• B. Complete Linkage

• C. Centroid Method

K-means Clustering

• Related to Expectation Maximization

• You specify the number of clusters

• Iteratively moves the means of the clusters to maximize the likelihood (minimize total error).
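
A minimal sketch of the iteration, assuming the standard alternating scheme (Lloyd's algorithm): assign each point to its nearest mean, then move each mean to the average of its assigned points. The caller supplies the initial means here, which is an illustrative simplification; the function name and data are hypothetical.

```python
def kmeans(points, centers, max_iters=100):
    """Return (centers, labels) after alternating assignment and update steps."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each mean moves to the average of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(list(old))  # keep an empty cluster in place
        if new_centers == centers:
            break  # converged: no mean moved
        centers = new_centers
    return centers, labels

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(centers, labels)
```

The result depends on the starting means and on the chosen number of clusters, so in practice the algorithm is usually run from several starting points.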
