statistical analysis of microarray data - ndsumcclean/plsc411/gene...statistical analysis of...

Statistical analysis of microarray data

Figure 4. Pre-analysis data transformation using the spatial lowess function. The upper left panel shows log ratios of raw intensity data on a red-green color scale plotted on the spatial coordinates of the array. A bright red corner is apparent in the figure. The upper right panel shows an RI plot of the same data. Spots from the bright corner of the array are highlighted in red. The lower two panels show the array view and RI plots of the data after transformation with the spatial lowess smoother. From http://www.jax.org/staff/churchill/labsite/pubs/index.html, a useful site for details of various microarray analysis procedures.

Clustering

Clustering implies co-regulation, which in turn implies that the genes are involved in a similar biological process

Describes level of coordinate regulation of gene expression on the genome-wide scale

Formulate hypothesis concerning the possible function of unknown genes

Raw fluorescence intensities are transformed into false color

• Relatively high ratios of expression of experimental-to-reference sample are coded red (yellow)

• Relatively low ratios of expression of experimental-to-reference sample are coded green (blue)

• Brightness of color is proportional to the magnitude of the differential expression

• A ratio of 1 is black
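The color coding above can be sketched in a few lines. This is a minimal illustration, not the scheme of any particular package: the function name `ratio_to_rgb`, the use of log base 2, and saturation at a log ratio of 3 are all assumptions made for the example.

```python
import math

def ratio_to_rgb(ratio, max_abs_log2=3.0):
    """Map an experimental/reference ratio to an (R, G, B) tuple, 0-255 per channel.

    Assumption: brightness scales linearly with |log2(ratio)| and saturates
    at max_abs_log2. Both choices are illustrative.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    log_ratio = math.log2(ratio)
    # Clip so that extreme ratios saturate instead of exceeding the scale.
    clipped = max(-max_abs_log2, min(max_abs_log2, log_ratio))
    brightness = round(255 * abs(clipped) / max_abs_log2)
    if clipped > 0:
        return (brightness, 0, 0)  # relatively high ratio: red
    if clipped < 0:
        return (0, brightness, 0)  # relatively low ratio: green
    return (0, 0, 0)               # ratio of 1: black

for r in (0.125, 0.5, 1.0, 2.0, 8.0):
    print(r, ratio_to_rgb(r))
```

Working on the log scale makes a 2-fold increase and a 2-fold decrease equally bright, which a linear scale on the raw ratio would not.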

Simplest method is to sort using a spreadsheet application (e.g. Excel)

Statistical software and automated clustering standards

Hierarchical Clustering

• Profiles are compared pair-wise, and the ones that are most similar are connected. Their values are averaged.

• The process is iterated until one cluster is left.
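The two steps above can be sketched as a short agglomerative loop. This is a minimal sketch under stated assumptions: similarity is taken to be the Pearson correlation, merged profiles are averaged weighted by the number of genes they contain, and the names `pearson` and `hierarchical_cluster` are hypothetical. It returns the tree as nested tuples rather than drawing a dendrogram.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def hierarchical_cluster(profiles):
    """profiles: dict mapping gene name -> list of expression values.

    Returns the clustering tree as nested 2-tuples of gene names.
    """
    # Each node holds (subtree, averaged profile, number of genes in it).
    nodes = [(name, list(values), 1) for name, values in profiles.items()]
    while len(nodes) > 1:
        # Compare all pairs and pick the most similar one.
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: pearson(nodes[p[0]][1], nodes[p[1]][1]))
        (tree_a, vals_a, n_a), (tree_b, vals_b, n_b) = nodes[i], nodes[j]
        # Connect the pair and average their values (weighted by size).
        averaged = [(a * n_a + b * n_b) / (n_a + n_b)
                    for a, b in zip(vals_a, vals_b)]
        merged = ((tree_a, tree_b), averaged, n_a + n_b)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

# Hypothetical profiles: g1/g2 rise together, g3/g4 fall together.
demo = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.0, 3.2, 3.9],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [3.9, 3.1, 2.0, 1.2],
}
print(hierarchical_cluster(demo))
```

The all-pairs search repeats at every merge, so this form is only practical for small examples; statistical packages use more efficient implementations.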

To demonstrate the biological origins of patterns seen in A, data from A were clustered by using methods described above before and after random permutation within rows (random 1), within columns (random 2), and both (random 3).
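The randomization controls described in this caption can be sketched as follows, assuming the data are held as a list of rows (genes) with one value per column (condition). The function names and the fixed seed are hypothetical and used only for illustration.

```python
import random

def permute_within_rows(matrix, rng):
    """Shuffle the values of each row independently (random 1)."""
    return [rng.sample(row, len(row)) for row in matrix]

def permute_within_columns(matrix, rng):
    """Shuffle the values of each column independently (random 2)."""
    columns = [rng.sample(list(col), len(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def permute_both(matrix, rng):
    """Shuffle within rows, then within columns (random 3)."""
    return permute_within_columns(permute_within_rows(matrix, rng), rng)

rng = random.Random(0)  # fixed seed so the example is reproducible
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(permute_within_rows(data, rng))
print(permute_within_columns(data, rng))
print(permute_both(data, rng))
```

Each permutation keeps the set of values in a row or column but destroys their order. If the clustered patterns disappear after permutation, they are unlikely to be an artifact of the clustering procedure itself.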

Hierarchical Clustering

Redundant (i.e. duplicate) elements cluster together

Genes of similar function cluster together

For human data set (see figure below), this includes genes involved in:

• Cholesterol biosynthesis

• The cell cycle

• The immediate-early response

• Signaling and angiogenesis

• Tissue remodeling and wound healing

Clustered display of data from time course of serum stimulation of primary human fibroblasts. Fibroblasts in culture were deprived of serum for 48 hr. Serum was added back and samples taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr. Data were measured by using a cDNA microarray containing 8,600 human genes. All measurements are relative to time 0. Genes were selected for this analysis if their expression level deviated from time 0 by at least a factor of 3.0 in at least 2 time points. The color scale of the image ranges from saturated green for log ratios -3.0 and below to saturated red for log ratios 3.0 and above. Each gene is represented by a single row of colored boxes; each time point is represented by a single column. Five separate clusters are indicated by colored bars and by identical coloring of the corresponding region of the dendrogram. These clusters contain named genes involved in (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous uncharacterized genes.
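The gene selection rule in this caption (a deviation from time 0 of at least a factor of 3.0 in at least 2 time points) can be written as a simple filter. This is a sketch only: it assumes each gene's values are already ratios relative to time 0, and the function name and example gene names are hypothetical.

```python
def select_genes(ratios_by_gene, fold=3.0, min_points=2):
    """Keep genes whose ratio to time 0 changes by >= fold, up or down,
    in at least min_points time points."""
    selected = {}
    for gene, ratios in ratios_by_gene.items():
        # A deviation counts in either direction: induced or repressed.
        hits = sum(1 for r in ratios if r >= fold or r <= 1.0 / fold)
        if hits >= min_points:
            selected[gene] = ratios
    return selected

# Hypothetical ratios relative to time 0 (first value is time 0 itself).
demo = {
    "geneA": [1.0, 3.5, 4.0, 1.2],   # induced at two time points
    "geneB": [1.0, 0.25, 0.3, 0.9],  # repressed at two time points
    "geneC": [1.0, 3.2, 1.1, 0.9],   # only one deviating time point
    "geneD": [1.0, 1.1, 0.9, 1.0],   # essentially unchanged
}
print(sorted(select_genes(demo)))
```

Filtering before clustering removes genes whose profiles are dominated by noise, which would otherwise form clusters with no biological meaning.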

In this example, microarray data are clustered over a time course of expression changes.

Clustering is not restricted to data sets of this type, and can easily be generalized to data having N dimensions.

Different clustering algorithms can be employed for this purpose.

Hierarchical clustering can be performed both on the genes and the treatments, allowing detection of patterns in two dimensions.

Clustering of different transcripts, according to similarity across genotypes, or treatments of wildtype

Different genotypes, or treatments of the wild-type, clustered according to similarity of transcriptional profile

For these genotypes, the transcript profiles are almost identical. Thus, the lesions are in closely-related genes or pathways.

For these genes, the levels of expression are almost identical no matter what the genotype. They are therefore co-regulated.

Self-organizing maps (SOMs) and k-means clusters

Specify the number of clusters that are desired in advance and force the data to conform to this structure

For example in k-means:

1. All genes are initially assigned at random to one of k clusters

2. The mean value for each treatment in each cluster is computed

3. Each gene is then reassigned to the cluster to which it shows the closest similarity

4. Repeat steps 2 and 3 until a stable structure is achieved

The number of clusters is determined according to biological criteria, or by comparing the results of analyses run with different numbers of clusters.
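The k-means steps listed above can be sketched directly. Several choices here are assumptions made for the example and are not stated in the source: similarity is measured by squared Euclidean distance, an empty cluster is re-seeded with a randomly chosen gene's profile, and the names `kmeans` and `sq_dist` are hypothetical.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(profiles, k, seed=0, max_iter=100):
    """profiles: dict mapping gene name -> list of values, one per treatment.

    Returns a dict mapping gene name -> cluster index in range(k).
    """
    rng = random.Random(seed)
    genes = list(profiles)
    # Step 1: assign every gene at random to one of k clusters.
    assignment = {g: rng.randrange(k) for g in genes}
    for _ in range(max_iter):
        # Step 2: compute the mean of each treatment within each cluster.
        centroids = []
        for c in range(k):
            members = [profiles[g] for g in genes if assignment[g] == c]
            if not members:
                # Assumption: re-seed an empty cluster with a random gene.
                members = [profiles[rng.choice(genes)]]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 3: reassign each gene to the closest cluster mean.
        new_assignment = {
            g: min(range(k), key=lambda c: sq_dist(profiles[g], centroids[c]))
            for g in genes
        }
        # Step 4: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

# Hypothetical profiles forming two well-separated groups.
demo = {"a1": [0, 0], "a2": [0, 1], "b1": [10, 10], "b2": [10, 11]}
print(kmeans(demo, k=2))
```

Because the starting assignment is random, different seeds can give different results on less clearly separated data, which is one reason to compare runs with different settings as noted above.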

Principal Component Analysis (PCA)

Major axes of variation among treatments are identified

Each gene is assigned a value representing how that component of variation contributes to its profile of expression
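The two statements above can be illustrated with a small sketch that finds only the first principal axis. It assumes a matrix of genes (rows) by treatments (columns) and uses power iteration on the covariance matrix, a simple technique swapped in for illustration; statistical packages normally compute all components with a full eigen or singular value decomposition. The function name is hypothetical.

```python
from math import sqrt

def first_principal_component(matrix, n_iter=200):
    """matrix: list of rows (genes), each a list of values (treatments).

    Returns (axis, scores): the leading axis of variation among treatments,
    and one score per gene giving its position along that axis.
    """
    n_genes, n_treat = len(matrix), len(matrix[0])
    # Center each treatment (column) on its mean.
    means = [sum(col) / n_genes for col in zip(*matrix)]
    centered = [[v - m for v, m in zip(row, means)] for row in matrix]
    # Covariance between treatments.
    cov = [[sum(r[i] * r[j] for r in centered) / (n_genes - 1)
            for j in range(n_treat)] for i in range(n_treat)]
    # Power iteration: repeated multiplication converges to the leading axis.
    # Caveat: a start vector exactly orthogonal to that axis would not converge.
    axis = [1.0] * n_treat
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * axis[j] for j in range(n_treat))
               for i in range(n_treat)]
        norm = sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        axis = [v / norm for v in nxt]
    # Score of each gene: projection of its centered profile on the axis.
    scores = [sum(v * a for v, a in zip(row, axis)) for row in centered]
    return axis, scores

# Hypothetical data: treatments 1 and 2 vary together, treatment 3 is flat.
demo = [[1, 1, 0], [2, 2, 0], [3, 3, 0], [4, 4, 0]]
axis, scores = first_principal_component(demo)
print(axis)
print(scores)
```

In this example the axis has equal weight on the first two treatments and none on the third, showing how a single axis can combine contributions from several treatments.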

Axes typically contain contributions from multiple treatments

Studying the effect of sex, cancer type and chemotherapeutic agent

Clustering is sensitive to interaction as well as to primary effects

In most biological cases gene regulation is affected by multiple factors and circumstances

Hierarchical clustering presumes a one-to-one relationship between a gene and an expression pattern

Plot of the coefficients for characteristic mode 1 against the coefficients for characteristic mode 2. Symbols of different colors and shapes are used for genes that belong to the different clusters identified by the original authors (3–5). (a) cdc15 data (first 12 time points). (b) Sporulation data (7 time points). (c) Fibroblast data (13 time points). (d) random data (7 time points).
