computational analysis of microarray data · more basic tools.although the focus here is on spotted...

418 | JUNE 2001 | VOLUME 2 www.nature.com/reviews/genetics

R E V I E W S

The advent of the genome project has vastly increasedour knowledge of the genomic sequences of humans andother organisms, as well as the genes that they encode.Various techniques have been developed to exploit thisgrowing body of data, including serial analysis of geneexpression (SAGE)1, oligonucleotide arrays2 and cDNAmicroarrays3,4, that provide rapid, parallel surveys ofgene-expression patterns for hundreds or thousands ofgenes in a single assay. These transcriptional profilingtechniques promise a wealth of data that can be used to develop a more complete understanding of gene function, regulation and interactions.

The most powerful applications of transcriptionalprofiling involve the study of patterns of gene expres-sion across many experiments that survey a wide arrayof cellular responses, phenotypes and conditions. Thesimplest way to identify genes of potential interestthrough several related experiments is to search forthose that are consistently either up- or downregulated.To that end, a simple statistical analysis of gene-expres-sion levels will suffice. However, identifying patterns ofgene expression and grouping genes into expressionclasses might provide much greater insight into theirbiological function and relevance. Several techniqueshave been used for the analysis of gene-expression data,including hierarchical clustering5–9, mutual informa-tion5,10 and self-organizing maps (SOMs)11,12.

The implementation of a successful programme ofexpression analysis requires the development of variouslaboratory protocols, as well as the development of data-base and software tools for efficient data collection andanalysis. Although detailed laboratory protocols havebeen published13,14, the computational tools necessary toanalyse the data are rapidly evolving and no clear consen-sus exists as to the best method for revealing patterns ofgene expression. Indeed, it is becoming increasingly clearthat there might never be a ‘best’ approach and that theapplication of various techniques will allow differentaspects of the data to be explored. Furthermore, withouta more complete understanding of the underlying biolo-gy, particularly of gene regulation, there might never be asingle technique that will allow us to find all the relation-ships in the data. Consequently, choosing the appropriatealgorithms for analysis is a crucial element of the experi-mental design. The purpose of this review is to provide ageneral overview of some existing computationalapproaches. This review is not comprehensive, as new,more sophisticated techniques are rapidly being devel-oped, but instead represents a tutorial on some of themore basic tools. Although the focus here is on spottedDNA microarrays, the techniques described are generallyapplicable to expression data generated using oligonu-cleotide arrays,Affymetrix GeneChipsTM, or SAGE, pro-vided the data is presented in an appropriate format.

COMPUTATIONAL ANALYSIS OF MICROARRAY DATAJohn Quackenbush

Microarray experiments are providing unprecedented quantities of genome-wide data ongene-expression patterns. Although this technique has been enthusiastically developed andapplied in many biological contexts, the management and analysis of the millions of datapoints that result from these experiments has received less attention. Sophisticatedcomputational tools are available, but the methods that are used to analyse the data canhave a profound influence on the interpretation of the results. A basic understanding ofthese computational tools is therefore required for optimal experimental design andmeaningful data analysis.

The Institute for GenomicResearch, 9,712 MedicalCenter Drive, Rockville,Maryland 20850, USA.e-mail: [email protected]

C O M P U TAT I O N A L G E N E T I C S

© 2001 Macmillan Magazines Ltd

NATURE REVIEWS | GENETICS VOLUME 2 | JUNE 2001 | 419

R E V I E W S

Indices (TGI)16 (TIGR is The Insititute for GenomicResearch), the STACK database17 and the Database ofTranscribed Sequences (DoTS) (see links box). Eachdatabase attempts to group ESTs from the same geneand to provide a common annotation. Although theprecise approaches taken by the databases vary, they allgenerally provide high-quality annotation for thecDNAs represented in the public databases. Regardlessof which resource is used to select clones, the cDNAsbeing arrayed should have their sequences verified tovalidate the clone identities. The consensus emerging inthe community is that stable descriptors, either acces-sion numbers or DNA sequence, should be used as theprimary identifier for arrayed cDNA clones.

For other array-based assays, such as pre-spotted fil-ter arrays and Affymetrix GeneChipsTM (REF. 2), theresearcher has little, if any, control over the probe con-tent of the chip. These arrays do have the advantage ofproviding a platform that can be used to more easilycompare results between laboratories. However, at leastin the case of Affymetrix GeneChipsTM, the lack of pre-cise knowledge of the probe sequences used to representeach gene forces users to rely on the annotation provid-ed by the manufacturer.

After clone selection, amplification and purifica-tion, the probes are loaded in microtiter plates into anarraying robot and are mechanically spotted ontochemically modified glass slides. The robotic arrayersprovide a reproducible and precise mathematical mapfrom spots on the arrays to wells in the microtiterplates, and therefore to the cDNA clones and the genesthat they represent.

One important element for assuring data quality,and maximizing opportunities for comparison ofexpression data between experiments, is the establish-ment and use of an integrated laboratory informationmanagement system (LIMS) database to track allaspects of the process. Early in the process, various datamust be tracked, including clone selection, slide infor-mation and scanner settings. Although many arrayinglabs have developed their own internal relational data-bases, there are several commercially available products,as well as published prototypes, available from theNational Center for Biotechnology Information(NCBI)18 and Stanford19 (see links box at end).

Data collection and normalizationOnce a collection of microarray slides is printed, eachslide represents a potential experiment. The arrayedgenes are probes that can be used to query pooled, differ-entially labelled targets derived from RNA samples fromdifferent cellular phenotypes to determine the relativeexpression levels of each gene. The two RNA samplesfrom the tissues of interest are typically used to generatefirst-strand cDNA targets labelled with the fluorescentdyes Cy3 and Cy5. These are then purified, pooled andhybridized to the arrays. After hybridization, slides arescanned and independent images for the control andquery channels are generated. These images must thenbe analysed to identify the arrayed spots and to measurethe relative fluorescence intensities for each element.

Selecting the array probesThe first step in any microarray assay is starting with awell-characterized and annotated set of hybridizationprobes (the sequences that are arranged on the micro-array). For prokaryotes and simple eukaryotes, such asyeast, this is most easily accomplished by designing PCRprimers to amplify gene-specific probes directly fromgenomic DNA. In most eukaryotic genomes, the largenumber of genes, the existence of introns and the lack ofa complete genome sequence, makes direct amplificationimpractical. In these species, the EST data collected inthe public DNA sequence databases are a valuable repre-sentation of the transcribed portion of the genome, andthe cDNA clones from which the ESTs are derived havebecome the primary reagents for expression analysis.

However, clone selection is a significant challenge;there are more than three million human ESTs in thedbEST database, from which a single representativeclone needs to be selected for each gene included in thearray. There are several publicly available analyses ofhuman ESTs, including UniGene15 and TIGR Gene

Box 1 | Normalization

There are three widely used techniques that can be used to normalize gene-expressiondata from a single array hybridization. All of these assume that all (or most) of thegenes in the array, some subset of genes, or a set of exogenous controls that have been‘spiked’ into the RNA before labelling, should have an average expression ratio equal toone. The normalization factor is then used to adjust the data to compensate forexperimental variability and to ‘balance’ the fluorescence signals from the two samplesbeing compared.

Total intensity normalizationTotal intensity normalization data relies on the assumption that the quantity of initialmRNA is the same for both labelled samples. Furthermore, one assumes that somegenes are upregulated in the query sample relative to the control and that others aredownregulated. For the hundreds or thousands of genes in the array, these changesshould balance out so that the total quantity of RNA hybridizing to the array from eachsample is the same. Consequently, the total integrated intensity computed for all theelements in the array should be the same in both the Cy3 and Cy5 channels. Under thisassumption, a normalization factor can be calculated and used to re-scale the intensityfor each gene in the array.

Normalization using regression techniquesFor mRNA derived from closely related samples, a significant fraction of the assayedgenes would be expected to be expressed at similar levels. In a scatterplot of Cy5 versusCy3 intensities (or their logarithms), these genes would cluster along a straight line, theslope of which would be one if the labelling and detection efficiencies were the same forboth samples. Normalization of these data is equivalent to calculating the best-fit slopeusing regression techniques27 and adjusting the intensities so that the calculated slopeis one. In many experiments, the intensities are nonlinear, and local regressiontechniques are more suitable, such as LOWESS (LOcally WEighted ScatterplotSmoothing) regression29.

Normalization using ratio statisticsA third normalization option is a method based on the ratio statistics described byChen et al.20. They assume that although individual genes might be up- ordownregulated, in closely related cells, the total quantity of RNA produced isapproximately the same for essential genes, such as ‘housekeeping genes’. Using thisassumption, they develop an approximate probability density for the ratio T

k= R

k/G

k

(where Rk

and Gk

are, respectively, the measured red and green intensities for the ktharray element). They then describe how this can be used in an iterative process thatnormalizes the mean expression ratio to one and calculates confidence limits that canbe used to identify differentially expressed genes.



R E V I E W S

of 2, whereas those downregulated by the same factorhave an expression ratio of one-half (0.5) — downregu-lated genes are ‘squashed’between 1 and 0. By contrast, agene upregulated by a factor of 2 has a log

2(ratio) of 1,

whereas a gene downregulated by a factor of 2 has alog

2(ratio) of –1, and a gene expressed at a constant level

(with a ratio of 1) has a log2(ratio) of 0 (FIG. 1).

At this point in the analysis of a single experiment, wetypically look for genes that are differentially expressed.Most published studies have used a post-normalizationcut-off of twofold increase or decrease in measured levelto define differential expression, although there is nofirm theoretical basis for selecting this level as significant.Alternatively, the approach defined by Chen et al.20 pro-vides confidence intervals that can be used to identifydifferentially expressed genes.

It should be noted that there are disadvantages tousing only expression ratios for data analysis. Althoughratios can help to reveal some patterns in the data, theyremove all information about the absolute gene-expres-sion levels.Various parameters depend on the measuredintensity, including the confidence limits that are placedon any microarray measurement. Although most of thetechniques developed for analysis of microarray datause ratios, many of them can be adapted for use withmeasured intensities.

Most commercially available microarray scanner manu-facturers provide software that handles image process-ing; there are several additional image-processing pack-ages available (see links box at end).

After image processing, it is necessary to normalizethe relative fluorescence intensities in each of the twoscanned channels. Normalization adjusts for differ-ences in labelling and detection efficiencies for the flu-orescent labels and for differences in the quantity ofinitial RNA from the two samples examined in theassay. These problems can cause a shift in the averageratio of Cy5 to Cy3 and the intensities must be re-scaled before an experiment can be properly analysed.Three normalization algorithms are described in BOX 1.There are several sophisticated, nonlinear approachesto normalization that correct, for example, for varia-tion between the individual spotting pens and for non-linear relationships between the dye intensities.

After normalization, the data for each gene are typi-cally reported as an ‘expression ratio’ or as the logarithmof the expression ratio. The expression ratio is simply thenormalized value of the expression level for a particulargene in the query sample divided by its normalized valuefor the control. The advantage of using the logarithm ofthe expression ratio is simple to understand. Genes thatare upregulated by a factor of 2 have an expression ratio

Cy3 intensity

Log(Cy3 intensity)

Log(

Cy5

inte

nsity

)

0

500

1,000

1,500

2,000

2,500

3,000

Log2 (ratio)–2 –1.6 –1.2 –0.8 –0.4 0 0.4 0.8 1.2 1.6 2

Log2 (ratio)–2 –1.6 –1.2 –0.8 –0.4 0 0.4 0.8 1.2 1.6 2

Freq

uenc

y

0

500

1,000

1,500

2,000

2,500

3,000

Cy5

inte

nsity

0 1×106 2×106 3×106 4×106 5×106 6×106

y = 1.004xR2 = 0.9913y = 0.7982x R2 = 0.9913

0

1×106

2×106

3×106

4×106

5×106

6×106

1×104

1×105

1×106

1×107

1×104 1×105 1×106 1×107

a c

b d

y = 1.004xR2 = 0.9913y = 0.7912x R2 = 0.9913

Freq

uenc

y

Figure 1 | Data normalization. a | A histogram representing the distribution of log2(ratio) values for a ‘self–self’ hybridization, in which the measured Cy5 intensity is generally less than the measured Cy3 intensity. Consequently, the log2(ratio) histogram is centred to the left of zero (as are, indeed, the vast majority of the data). b | The same data set shown before (red) and after (blue) normalization to illustrate how the data are transformed. The normalized distribution, shown in blue, isshifted and centred about zero. The perceived change in the shape of the distribution is an artefact of the process of placing data into ‘bins’ when making thehistogram. Scatter plots before (red) and after (blue) normalization of the c | measured intensities and d | log(intensities) also illustrate the transformation of the data.



R E V I E W S

CLUSTER ANALYSIS

The term ‘cluster analysis’actually encompasses severaldifferent classificationalgorithms that can be used todevelop taxonomies (typicallyas part of exploratory dataanalysis). Note that in thisclassification, the higher thelevel of aggregation, the lesssimilar are members in therespective class.

expression, each experiment represents a separate, dis-tinct axis in space and the log

2(ratio) measured for that

gene in that experiment represents its geometric coor-dinate. For example, if we have three experiments, thelog

2(ratio) for a given gene in experiment 1 is its x-

coordinate, the log2(ratio) in experiment 2 is its y-

coordinate, and the log2(ratio) in experiment 3 is its z-

coordinate. So, we can represent all the information wehave about that gene by a point in x–y–z-expressionspace. A second gene, with nearly the same log

2(ratio)

values for each experiment will be represented by a(spatially) nearby point in expression space; a genewith a very different pattern of expression will be farfrom our original gene. The generalization to moreexperiments is straightforward (although harder todraw): the dimensionality of expression space grows tobe equal to the number of experiments. In this way,expression data can be represented in n-dimensionalexpression space, where n is the number of experi-ments, and where each gene-expression vector is rep-resented as a single point in that space.

Having been provided with a means of measuringdistance between genes, clustering algorithms sort thedata and group genes together on the basis of their sepa-ration in expression space. It should also be noted that ifwe are interested in clustering experiments, we couldrepresent each experiment as an ‘experiment vector’consisting of the expression values for each gene; thesedefine an ‘experiment space’, the dimensionality ofwhich is equal to the number of genes assayed in eachexperiment. Again, by defining distances appropriately,we could apply any of the clustering algorithms definedhere to analyse and group experiments.

To interpret the results from any analysis of multipleexperiments, it is helpful to have an intuitive visual rep-resentation. A commonly used approach relies on thecreation of an expression matrix in which each columnof the matrix represents a single experiment and eachrow represents the expression vector for a particulargene. Colouring each of the matrix elements on the basisof its expression value creates a visual representation ofgene-expression patterns across the collection of experi-ments. There are countless ways in which the expressionmatrix can be coloured and presented. The most com-monly used method colours genes on the basis of theirlog

2(ratio) in each experiment, with log

2(ratio) values

close to zero coloured black, those with log2(ratio) values

greater than zero coloured red, and those with negativevalues coloured green. For each element in the matrix,the relative intensity represents the relative expression,with brighter elements being more highly differentiallyexpressed. For any particular group of experiments, theexpression matrix generally appears without any appar-ent pattern or order. Programmes designed to clusterdata generally re-order the rows, or columns, or both,such that patterns of expression become visually appar-ent when presented in this fashion.

Before clustering the data, there are two further ques-tions that need to be considered: first, should the data beadjusted in some way to enhance certain relationships?And second, what distance measure should be used to

The true power of microarray analysis does notcome from the analysis of single experiments, butrather, from the analysis of many hybridizations to iden-tify common patterns of gene expression. Based on ourunderstanding of cellular processes, genes that are con-tained in a particular pathway, or that respond to a com-mon environmental challenge, should be co-regulatedand consequently, should show similar patterns ofexpression. Our goal then is to identify genes that showsimilar patterns of expression and there exists a largegroup of statistical methods, generally referred to as‘CLUSTER ANALYSIS’, that can be used to achieve this. Beforecomparing the clustering methods, I first discuss amathematical definition for what we mean by ‘similar’,in the context of gene expression.

Comparing expression dataFor expression data, we can begin to address the prob-lem of ‘similarity’ mathematically by defining an‘expression vector’ for each gene that represents itslocation in ‘expression space’. In this view of gene

Box 2 | Distance metrics

In any clustering algorithm, the calculation of a ‘distance’ between any two objects isfundamental to placing them into groups. Analysis of microarray data is no different inthat finding clusters of similar genes relies on finding and grouping those that are‘close’ to each other. To do this, we rely on defining a distance between each gene-expression vector. There are various methods for measuring distance; these typicallyfall into two general classes: metric and semi-metric.

Metric distancesTo be classified as ‘metric’, a distance measure d

ijbetween two vectors, i and j, must obey

several rules:• The distance must be positive definite, d

ij≥ 0 (that is, it must be zero or positive).

• The distance must be symmetric, dij

= dji, so that the distance from i to j is the same as

the distance from j to i.

• An object is zero distance from itself, dii

= 0.

• When considering three objects, i, j and k, the distance from i to k is always less thanor equal to the sum of the distance from i to j, and the distance from j to k, d

ik≤ d

ij

+ djk. This is sometimes called the ‘triangle’ rule.

The most common metric distance is Euclidean distance, which is a generalization ofthe familiar Pythagorean theorem. In a three-dimensional space, the Euclideandistance, d

12, between two points, (x

1,x

2,x

3) and (y

1,y

2,y

3) is given by EQN 1:

where (x1,x

2,x

3) are the usual Cartesian coordinates (x,y,z). The generalization of this to

higher-dimensional expression spaces is straightforward. For our n-dimensionalexpression vectors, the Euclidean distance is given by EQN 2:

where xi and yi are the measured expression values, respectively, for genes X and Y inexperiment i, and the summation runs over the n experiments under analysis.

Semi-metric distancesDistance measures that obey the first three consistency rules, but fail to maintain thetriangle rule are referred to as semi-metric. There are a large number of semi-metricdistance metrics and these are often used in expression analysis. Mathematicaldescriptions of various distance metrics are available as supplementary material (seesupplementary Box 1 online).

d12 = (x1 – y1)2 + (x2 – y2)2 + (x3 – y3)2 , (1)

d = (xi – yi)2 ,

i=1

n∑ (2)



R E V I E W S

Clustering algorithmsVarious clustering techniques have been applied to theidentification of patterns in gene-expression data. Mostcluster analysis techniques are hierarchical; the resultantclassification has an increasing number of nested classesand the result resembles a phylogenetic classification.Non-hierarchical clustering techniques also exist, suchas k-means clustering, which simply partition objectsinto different clusters without trying to specify the rela-tionship between individual elements. Clustering tech-niques can further be classified as divisive or agglomera-tive. A divisive method begins with all elements in onecluster that is gradually broken down into smaller andsmaller clusters. Agglomerative techniques start with(usually) single-member clusters and gradually fusethem together. Finally, clustering can be either super-vised or unsupervised. Supervised methods use existingbiological information about specific genes that arefunctionally related to ‘guide’ the clustering algorithm.However, most methods are unsupervised and these aredealt with first.

Although cluster analysis techniques are extremelypowerful, great care must be taken in applying this fam-ily of techniques. Even though the methods used areobjective in the sense that the algorithms are welldefined and reproducible, they are still subjective in thesense that selecting different algorithms, different nor-malizations, or different distance metrics, will place dif-ferent objects into different clusters. Furthermore, clus-tering unrelated data will still produce clusters, althoughthey might not be biologically meaningful. The chal-lenge is therefore to select the data and to apply the algo-rithms appropriately so that the classification that arisespartitions data sensibly.

Hierarchical clustering. Hierarchical clustering has theadvantage that it is simple and the result can be easilyvisualized13. It has become one of the most widelyused techniques for the analysis of gene-expressiondata. Hierarchical clustering is an agglomerativeapproach in which single expression profiles arejoined to form groups, which are further joined untilthe process has been carried to completion, forming asingle hierarchical tree. The process of hierarchicalclustering proceeds in a simple manner. First, thepairwise distance matrix is calculated for all of thegenes to be clustered. Second, the distance matrix issearched for the two most similar genes (see above) orclusters; initially each cluster consists of a single gene.This is the first true stage in the ‘clustering’ process. Ifseveral pairs have the same separation distance, a pre-determined rule is used to decide between alterna-tives. Third, the two selected clusters are merged toproduce a new cluster that now contains at least twoobjects. Fourth, the distances are calculated betweenthis new cluster and all other clusters. There is noneed to calculate all distances as only those involvingthe new cluster have changed. Last, steps 2–4 arerepeated until all objects are in one cluster. There areseveral variations on hierarchical clustering (BOX 3)

that differ in the rules governing how distances are

group together related genes? In many microarray exper-iments, the data analysis can be dominated by the vari-ables that have the largest values, obscuring other,important differences. One way to circumvent this prob-lem is to adjust or re-scale the data and there are severalmethods in common use with microarray data. Forexample, each vector can be re-scaled so that the averageexpression of each gene is zero — a process referred to as‘mean centring’. In this process, the basal expression levelof a gene is subtracted from each experimental measure-ment. This has the effect of enhancing the variation ofthe expression pattern of each gene across experiments,without regard to whether the gene is primarily up- ordownregulated. This is particularly useful for the analysisof time-course experiments, in which one might like tofind genes that show similar variation around their basalexpression level. The data can also be adjusted so that theminimum and maximum are ±1, or so that the ‘length’of each expression vector is one.

The manner in which we measure distance betweengene-expression vectors also has a profound effect onthe clusters that are produced. (Several distance metricsare reviewed in BOX 2 and supplementary Box 1 online;others have been proposed21.)

CENTROID

The centroid of a cluster is theweighted average point in themultidimensional space; in asense, it is the centre of gravityfor the respective cluster.

Box 3 | Hierarchical clustering algorithms

There are various hierarchical clustering algorithms30 that can be applied to microarraydata analysis5–9,21. These differ in the manner in which distances are calculated betweenthe growing clusters and the remaining members of the data set, including otherclusters. Clustering algorithms include, but are not limited to:• Single-linkage clustering. The distance between two clusters, i and j, is calculated as

the minimum distance between a member of cluster i and a member of cluster j.Consequently, this technique is also referred to as the minimum, or nearest-neighbour, method. This method tends to produce clusters that are ‘loose’ becauseclusters can be joined if any two members are close together. In particular, thismethod often results in ‘chaining’, or the sequential addition of single samples to anexisting cluster. This produces trees with many long, single-addition branchesrepresenting clusters that have grown by accretion.

• Complete-linkage clustering. Complete-linkage clustering is also known as themaximum or furthest-neighbour method. The distance between two clusters iscalculated as the greatest distance between members of the relevant clusters. Notsurprisingly, this method tends to produce very compact clusters of elements and theclusters are often very similar in size.

• Average-linkage clustering. The distance between clusters is calculated using averagevalues. There are, in fact, various methods for calculating averages. The mostcommon is the unweighted pair-group method average (UPGMA). The averagedistance is calculated from the distance between each point in a cluster and all otherpoints in another cluster. The two clusters with the lowest average distance are joinedtogether to form a new cluster. Related methods substitute the CENTROID or themedian for the average.

• Weighted pair-group average. This method is identical to UPGMA, except that in thecomputations, the size of the respective clusters (that is, the number of objectscontained in them) is used as a weight. This method (rather than UPGMA) should beused when the cluster sizes are suspected to be greatly uneven.

• Within-groups clustering. This is similar to UPGMA except that clusters are mergedand a cluster average is used for further calculations rather than the individual clusterelements. This tends to produce tighter clusters than UPGMA.

• Ward’s method. Cluster membership is determined by calculating the total sum ofsquared deviations from the mean of a cluster and joining clusters in such a mannerthat it produces the smallest possible increase in the sum of squared errors31.



R E V I E W S

DENDROGRAM

A branching ‘tree’ diagramrepresenting a hierarchy ofcategories on the basis of degreeof similarity or number ofshared characteristics, especiallyin biological taxonomy. Theresults of hierarchical clusteringare presented as dendrograms,in which the distance along thetree from one element to thenext represents their relativedegree of similarity.

NEURAL NETWORKS

Neural networks are analytictechniques modelled after the(proposed) processes oflearning in cognitive systemsand the neurological functionsof the brain. Neural networksuse a data ‘training set’ to buildrules capable of makingpredictions or classifications on data sets.

hierarchical methods22. In k-means clustering, objectsare partitioned into a fixed number (k) of clusters, suchthat the clusters are internally similar but externally dis-similar; no DENDROGRAMS are produced (but one coulduse hierarchical techniques on each of the data parti-tions after they are constructed). The process involvedin k-means clustering is conceptually simple, but can becomputationally intensive. First, all initial objects arerandomly assigned to one of k clusters (where k is speci-fied by the user). Second, an average expression vector isthen calculated for each cluster and this is used to com-pute the distances between clusters. Third, using an iter-ative method, objects are moved between clusters andintra- and inter-cluster distances are measured witheach move. Objects are allowed to remain in the newcluster only if they are closer to it than to their previouscluster. Fourth, after each move, the expression vectorsfor each cluster are recalculated. Last, the shuffling pro-ceeds until moving any more objects would make theclusters more variable, increasing intra-cluster distancesand decreasing inter-cluster dissimilarity.

Some implementations of k-means clustering allownot only the number of clusters, but also seed cases (orgenes) for each cluster, to be specified. This has thepotential to allow, for example, use of previous knowl-edge of the system to help define the cluster output. Forexample, an attempt to classify patients with two mor-phologically similar but clinically distinct diseases usingmicroarray expression patterns can be imagined. Byusing k-means clustering on experiments with k = 2, thedata will be partitioned into two groups. The challengethen faced is to determine whether there are really onlytwo distinct groups represented in the data or not. Inthis case, k-means clustering is particularly useful withother techniques, such as principal component analysis(PCA, described below). PCA allows visual estimationof the number of clusters represented in the data. Thiscan be used to specify k and to group genes (or experi-ments) into related clusters.

Self-organizing maps. A self-organizing map (SOM) is aNEURAL-NETWORK-based divisive clustering approach11. ASOM assigns genes to a series of partitions on the basis ofthe similarity of their expression vectors to reference vec-tors that are defined for each partition. It is the process ofdefining these reference vectors that distinguishes SOMsfrom k-means clustering. Before initiating the analysis,the user defines a geometric configuration for the parti-tions, typically a two-dimensional rectangular or hexag-onal grid. Random vectors are generated for each parti-tion, but before genes can be assigned to partitions, thevectors are first ‘trained’ using an iterative process thatcontinues until convergence so that the data are mosteffectively separated. First, random vectors are con-structed and assigned to each partition. Second, a gene ispicked at random and, using a selected distance metric,the reference vector that is closest to the gene is identi-fied. Third, the reference vector is then adjusted so that itis more similar to the vector of the assigned gene. Thereference vectors that are nearby on the two-dimensionalgrid are also adjusted so that they are more similar to the

measured between clusters as they are constructed.Each of these will produce slightly different results, aswill any of the algorithms if the distance metric ischanged. Typically for gene-expression data, average-linkage clustering gives acceptable results.

One potential problem with many hierarchical clus-tering methods is that, as clusters grow in size, theexpression vector that represents the cluster might nolonger represent any of the genes in the cluster.Consequently, as clustering progresses, the actualexpression patterns of the genes themselves become lessrelevant. Furthermore, if a bad assignment is madeearly in the process, it cannot be corrected. An alterna-tive, which can avoid these artefacts, is to use a divisiveclustering approach, such as k-means or self-organizingmaps, to partition data (either genes or experiments)into groups that have similar expression patterns.

k-means clustering. If there is advanced knowledgeabout the number of clusters that should be representedin the data, k-means clustering is a good alternative to

a

1 2 3 4 5 6 7 8 9 10

Experiment number

Log(

expr

essi

on r

atio

) ABCDEFGHJ

b

1 2 3 4 5 6 7 8 9 10

Experiment number

Log(

expr

essi

on r

atio

) ABCDEFGHJ

–4

–3

–2

–1

0

1

2

3

4

–4

–3

–2

–1

0

1

2

3

4

Figure 2 | A synthetic gene-expression data set. This data set provides an opportunity toevaluate how various clustering algorithms reveal different features of the data. a | Nine distinctgene-expression patterns were created with log2(ratio) expression measures defined for tenexperiments. b | For each expression pattern, 50 additional genes were generated,representing variations on the basic patterns.



R E V I E W S

FACTOR ANALYSIS

Factor analysis is a datareduction and exploratorymethod similar to pincipalcomponent analysis. Factoranalysis techniques seek toreduce the number of variablesand to detect structure in therelationships between elementsin an analysis.

Principal component analysis. An analysis of micro-array data is a search for genes that have similar, corre-lated patterns of expression. This indicates that some ofthe data might contain redundant information. Forexample, if a group of experiments were more closelyrelated than we had expected, we could ignore some ofthe redundant experiments, or use some average of theinformation without loss of information.

PCA (also called singular value decomposition) isa mathematical technique that exploits these factorsto pick out patterns in the data, while reducing theeffective dimensionality of gene-expression spacewithout significant loss of information23. PCA is oneof a family of related techniques that include FACTOR

ANALYSIS and PRINCIPAL COORDINATE ANALYSIS that provide a‘projection’ of complex data sets onto a reduced, easilyvisualized space.

Although the mathematics is complex, the basicprinciples are straightforward. Imagine taking a three-dimensional cloud of data points and rotating it so thatyou can view it from different perspectives. You mightimagine that certain views would allow you to betterseparate the data into groups than other views. PCAfinds those views that give you the best separation ofthe data. This technique can be applied to both genesand experiments as a means of classification.

In most implementations of PCA, it is difficult todefine accurately the precise boundaries of distinct clus-ters in the data, or to define genes (or experiments)belonging to each cluster. However, PCA is a powerfultechnique for the analysis of gene-expression data whenused with another classification technique, such as k-means clustering or SOMs, that requires the user tospecify the number of clusters.

Analysis of a demonstration data setThe performance of these varied algorithms, normaliza-tion strategies and distance metrics is best shown byexamining a demonstration data set (FIG. 2). Althoughthis sample data set does not reflect the complexity ofreal biological data, its analysis can help to provide anunderstanding of how the data are handled and inter-preted by the various methods.

The first analysis involves the three most com-monly used variations on hierarchical clusteringusing a Euclidean distance metric without any datafiltering (FIG. 3). Each performed quite well withrespect to grouping together genes from a singleexpression class, although the branch lengths pro-duced by each algorithm, and the structures of theindividual clusters, differ quite a bit. It should benoted that, relative to the other algorithms, single-linkage clustering places expression group ‘H’ differ-ently with respect to the other expression groups onthe tree. The difference is due to the manner in whichthe growing clusters are linked together. In single-linkage clustering, growing clusters are joined on thebasis of the distance between the closest members oftheir two respective clusters, whereas complete link-age uses the greatest distance between any two mem-bers of the groups and average linkage uses a group

vector of the assigned gene. Fourth, steps 2 and 3 are iter-ated several thousand times, decreasing the amount bywhich the reference vectors are adjusted and increasingthe stringency used to define closeness in each step. Asthe process continues, the reference vectors converge tofixed values. Last, the genes are mapped to the relevantpartitions depending on the reference vector to whichthey are most similar.

In choosing the geometric configuration for the clus-ters, the user is, effectively, specifying the number of par-titions into which the data is to be divided. As with k-means clustering, the user has to rely on some othersource of information, such as PCA, to determine thenumber of clusters that best represents the available data.

a b c

Figure 3 | Hierarchical clustering. Genes in the demonstration data set were subjected to a | average-linkage, b | complete-linkage and c | single-linkage hierarchical clustering using aEuclidean distance metric and gene-expression families (A–J) that were colour coded forcomparison. Genes that are upregulated appear in red, and those that are downregulatedappear in green, with the relative log2(ratio) reflected by the intensity of the colour. Thismethod of clustering groups genes by reordering the expression matrix allows patterns to be easily visualized.



R E V I E W S

PRINCIPAL COORDINATE

ANALYSIS

Like principal componentanalysis, principal coordinateanalysis seeks to reduce thedimensionality of a spatialrepresentation of a data set bycreating new coordinate axesthat are a combination of theoriginals, and projecting thedata onto those new axes.

data set is analysed by k-means clustering, using k = 5, asone might expect from the PCA results, gene-expressiongroups B, D and E form one cluster, C, F and G formanother, and A, H and J remain as distinct groups (seesupplementary Figure 1 online). The results from this k-means analysis are also consistent with the results ofhierarchical clustering in that genes that consistently up-or downregulated are grouped together, whereas thosethat vary around zero appear separately from these.

If, however, we were looking at time-course dataand wanted to identify genes with expression levels thatvaried in a time-regulated fashion, this analysis wouldnot have allowed such variations to be identified. Oneway to help identify coordinated fluctuations in thedata is to first ‘centre’ the gene-expression vectors bysubtracting the average across all experiments fromeach data point (FIG. 5). The results from both PCA andaverage-linkage clustering reflect this data ‘filtering’. Inthe hierarchical clustering, genes with similar changesrelative to their baseline expression patterns aregrouped. The ‘constant’ A, B and C genes are nowplaced in a single cluster even though, in the originaldata, B genes were generally upregulated and C geneswere generally downregulated. Similarly, the D and Ggenes cluster together, as do the E- and F-gene groups.Although the H and J groups appear next to each otherin the dendrogram, they remain as separate groups,distinct from the others.

Although this data set does not reflect the full com-plexity of the real data sets, this analysis helps to showsome of the complexity in the analysis of real microarrayexpression data. There often is no ‘correct’ way to analyseany data set; the application of various techniques,including algorithms and data filters, can help to revealdifferent features in the data. One must remember thatthe results of any analysis have to be evaluated in thecontext of other biological knowledge.

Supervised methodsThe techniques discussed so far are unsupervisedmethods for identifying patterns of gene expression.Supervised methods represent a powerful alternativethat can be applied if one has some previous informa-tion about which genes are expected to cluster togeth-er. One widely used example is the support vectormachine (SVM)24. SVMs use a training set in whichgenes known to be related by, for example function,are provided as positive examples and genes knownnot to be members of that class are negative examples.These are combined into a set of training examplesthat is used by the SVM to learn to distinguishbetween members and non-members of the class onthe basis of expression data. Having learned theexpression features of the class, the SVM can then beused to recognize and classify the genes in the data seton the basis of their expression. In this way, SVMs usebiological information to determine expression fea-tures that are characteristic of a group and to assigngenes to that group. The SVM can also identify genesin the training set that are outliers or that have beenpreviously assigned to the incorrect class.

average (BOX 3). Without a biological basis for inter-preting these results, there is no way to decide whichgrouping is right and which is wrong. Depending onthe actual experiment, any of the three approachesmight provide a ‘correct’ order. As average-linkageclustering is the most commonly used approach, andbecause it grouped genes together in expressiongroups as well as the others, we will use this methodas the basis for comparison with the non-hierarchicalclustering algorithms.

Average-linkage clustering and PCA applied to thesame data set are shown in FIG. 4. The nine groups ofgenes found in the hierarchical clustering can be clearlyseen in the PCA analysis, although without previousknowledge of the results of the hierarchical analysis, onemight argue that the data set only contains five distinctgroups of genes. Application of k-means clustering andSOM analysis to this data set with more than five clustersproduces five principal groups of genes with small num-bers of genes assigned to the remaining groups. If the

a b

Figure 4 | Principal component analysis. The same demonstration data set was analysedusing a | hierarchical (average-linkage) clustering and b | principal component analysis usingEuclidean distance, to show how each treats the data, with genes colour coded on the basisof hierarchical clustering results for comparison.



R E V I E W S

HYPERPLANE

A hyperplane is an N-dimensional analogy of a line orplane, which divides an ‘N + 1’dimensional space into two.

KERNEL FUNCTION

In support vector machines, thekernel function is ageneralization of the distancemetric; it measures the distancebetween two expression vectorsas the data are projected intohigher-dimensional space.

‘KERNEL FUNCTION’, and the data can then be separated intotwo classes. For some data sets, SVMs might not achieveclean separation, either because of errors in classifica-tion in the training set, or noise in the data, or animproperly chosen kernel function. For this reason,most implementations also allow users to specify a ‘softmargin’ that allows some training examples to fall onthe wrong side of the separating hyperplane.Completely specifying a SVM therefore requires specify-ing both the kernel function and the magnitude of thepenalty to be applied for violating the soft margin.

As with the other techniques described here, this isone of the challenges of using SVMs. It is often difficultto choose the best kernel function, parameters andpenalties. Different parameters often yield completelydifferent classifications. It is therefore often necessary tosuccessively increase kernel complexity until an appro-priate classification is achieved24.

SVMs are one of a group of supervised algorithmsthat have been applied to the classification of gene-expression patterns. Although they might be of use inthe identification of genes that share related expres-sion patterns, an application of potentially greaterimpact is the use of supervised methods for the classi-fication of samples25–27. If we measure gene-expressionpatterns using RNAs collected from various patientsfor which there is, for example, disease-stage classifi-cation or survival data, we can use the microarraydata to ‘train’ an algorithm that can then be applied tothe classification of other previously unclassified sam-ples. This approach could lead to the development of‘molecular expression fingerprinting’ for disease clas-sification. In cancer diagnosis, the ability to produce amolecular expression fingerprint of each tumourmight prove to be extremely important as histologi-cally similar tumours might in fact be the result ofsubstantially different genetic changes, which mightprofoundly influence the progression of the tumourand its response to treatment.

Discussion and conclusionsMicroarray expression analysis offers an opportunityto generate functional data on a genome-wide scaleand consequently, should provide much-needed datafor the biological interpretation of genes and theirfunctions. It has also shown promise for classifyingphysiological and disease states. As the discussion pre-sented here should show, the careful handling andinterpretation of microarray expression data is not yetan exact science.

The hypothesis behind using clustering techniquesis that genes in a cluster must share some commonfunction or regulatory elements. However, classifica-tions based on clustering algorithms are dependent onthe particular methods used, the manner in which thedata are normalized within and across experiments,and the manner in which we measure similarity; anyand all of these factors can have a tremendous effecton the outcome of any analysis. Consequently, there isno such thing as a single correct classification,although different techniques might be more or less

As discussed previously, gene-expression data can bethought of as an m-dimensional space, in which expres-sion vectors are represented as points in that space. AnSVM is a binary classifier that attempts to separate genesinto two classes (in the positive training set, or outsideit) by defining an optimal HYPERPLANE separating classmembers from non-members. However, for most realexamples, there is no simple solution to this problem inexpression space. SVM solves the problem by mappingthe gene-expression vectors from expression space intoa higher-dimensional ‘feature space’, in which distance ismeasured using a mathematical function known as a

51 2 3 4 5 6 7 8 9 10

a

c

b

Log(

expr

essi

on r

atio

)

Gene-expression families

Experiment number

BCDEFGHJ

A

–4

–3

–2

–1

0

1

2

3

4

Figure 5 | The effect of data filtering. Application of various data filters or changes in thedistance metric can change the results derived from any clustering algorithm. a | Meancentring of the data removes ‘constant’ expression, which reveals changes in expressionpatterns for the nine gene families across the ten experiments. The changes can be seen in theresults of b | principal component analysis and c | average-linkage hierarchical clustering.



R E V I E W S

Finally, new means for studying gene and proteinexpression are continuing to be developed, as are thetools for data analysis and its applications. Many of thetechniques applied to the analysis of microarray datawill be applicable to the data sets provided by these newtechniques and will allow the interactions represented inthose data to be explored. As with microarray analysis,these explorations will provide hypotheses that can betested in the laboratory using more standard biologicaland biochemical methods.

appropriate for different data sets. Furthermore, theapplication of more than one technique to the anal-ysis of a particular data set might illuminate differentrelationships between the data. For example, a tech-nique that allows us to find cell-cycle-regulated genesmight obscure the expression response to whatevertechnique was used to synchronize the cells. As withexperimental design, analysis techniques must beselected and tuned to best show the relationships inthe data. Cluster analysis does not give absoluteanswers. Instead, these are data-mining techniquesthat allow relationships in the data to be explored.Some of the most exciting and promising applicationsare those that classify human disease states using patterns of gene expression25–27.

The tools and techniques described here are by nomeans comprehensive and many new algorithms andsoftware tools are under development. As the analysispresented here has shown, the ultimate guide to the useand applicability of any laboratory technique or dataanalysis method must be our biological understanding.If an analysis provides insight into the data that is con-sistent with our understanding of the system understudy, then any extension it provides is more likely to bevalid (for example, by classifying novel genes that mightbe involved in a known pathway).

Links

PUBLIC EST SEQUENCES dbEST CDNA DATABASES UniGene | TIGR Gene Indices | STACK| DoTSIMAGE-PROCESSING SOFTWARE Axon | BioDiscovery |Imaging Research | NHGRI Microarray Project | TIGRsoftware tools | Eisen labDATA ANALYSIS TOOLS BioDiscovery | EuropeanBioinformatics Institute (EBI) Expression Profiler | Eisenlab | Silicon Genetics | Spotfire | X Cluster | TIGRsoftware tools | J-expressMETA-LISTS OF OTHER AVAILABLE SOFTWARE EBI |National Center for Genome Resources | Rockefeller |École Normale Supérieure | Stanford

1. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W.Serial analysis of gene expression. Science 270, 484–487(1995).

2. Lockhart, D. J. et al. Expression monitoring byhybridization to high-density oligonucleotide arrays. NatureBiotechnol. 14, 1675–1680 (1996).

3. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O.Quantitative monitoring of gene expression patterns withcomplementary DNA microarray. Science 270, 467–470(1995).

4. Schena, M. et al. Parallel human genome analysis:microarray-based expression monitoring of 1000 genes.Proc. Natl Acad. Sci. USA 93, 10614–10619 (1996).

5. Wen, X. et al. Large-scale temporal gene expressionmapping of central nervous system development. Proc.Natl Acad. Sci. USA 95, 334–339 (1998).This is one of the first analyses of large-scale geneexpression — in this case, RT–PCR data — usingclustering and data-mining techniques. It elegantlyshows how integrating the results derived usingvarious distance metrics can reveal different butmeaningful patterns in the data.

6. Michaels, G. S. et al. Cluster analysis and data visualizationof large-scale gene expression data. Pacific Symp.Biocomput. 1998, 42–53 (1998).

7. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D.Cluster analysis and display of genome-wide expressionpatterns. Proc. Natl Acad. Sci. USA 95, 14863–14868(1998).This is an excellent demonstration of the power ofhierarchical clustering to the analysis of microarraydata. The authors also provide software — Clusterand Treeview — which became the standard foranalysing microarray data.

8. Weinstein, J. N. et al. An information-intensive approach tothe molecular pharmacology of cancer. Science 275,343–349 (1997).Weinstein and colleagues present one of the firstand most elegant applications of hierarchicalclustering and other data-mining and visualization

techniques to the analysis of large-scale data inmolecular biology.

9. Sokal, R. R. & Michener, C. D. A statistical method forevaluating systematic relationships. Univ. Kans. Sci. Bull.38, 1409–1438 (1958).

10. Shannon, C. C. & Weaver, W. The Mathematical Theory ofCommunication (Illinois Univ. Press, Illinois, 1963).

11. Kohonen, T. Self Organizing Maps (Springer, Berlin, 1995).12. Tamayo, P. et al. Interpreting patterns of gene expression

with self-organizing maps: methods and application tohematopoietic differentiation. Proc. Natl Acad. Sci. USA96, 2907–2912 (1999).Tamayo and colleagues use self-organizing maps(SOMs) to explore patterns of gene expressiongenerated using Affymetrix arrays, and provide theGENECLUSTER implementation of SOMs.

13. Eisen, M. B. & Brown, P. O. DNA arrays for analysis ofgene expression. Meth. Enzymol. 303, 179–205 (1999).

14. Hegde, P. et al. A concise guide to microarray analysis.Biotechniques 29, 548–560 (2000).

15. Boguski, M. S. & Schuler, G. D. ESTablishing a humantranscript map. Nature Genet. 10, 369–371 (1995).

16. Quackenbush, J. et al. The TIGR gene indices: analysis ofgene transcript sequences in highly sampled eukaryoticspecies. Nucleic Acids Res. 29, 159–164 (2001).

17. Burke, J., Wang, H., Hide, W. & Davison, D. B. Alternativegene form discovery and candidate gene selection fromgene indexing projects. Genome Res. 8, 276–290 (1998).

18. Ermolaeva, O. et al. Data management and analysis forgene expression arrays. Nature Genet. 20, 19–23 (1998).

19. Sherlock, G. et al. The Stanford Microarray Database.Nucleic Acids Res. 29, 152–155 (2001).

20. Chen, Y., Dougherty, E. R. & Bittner, M. Ratio-baseddecisions and the quantitative analysis of cDNA microarrayimages. J. Biomed. Opt. 2, 364–374 (1997).

21. Heyer, L. J., Kruglyak, L. & Yooseph, S. Exploringexpression data: identification and analysis of coexpressedgenes. Genome Res. 9, 1106–1115 (1999).

22. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. &Church, G. M. Systematic determination of genetic

network architecture. Nature Genet. 22, 281–285 (1999).23. Raychaudhuri, S., Stuart, J. M. & Altman, R. B. Principal

components analysis to summarize microarrayexperiments: application to sporulation time series. Pac.Symp. Biocomput. 2000, 455–466 (2000).

24. Brown, M. P. et al. Knowledge-based analysis ofmicroarray gene expression data by using support vectormachines. Proc. Natl Acad. Sci. USA 97, 262–267 (2000).This paper shows the power of supervisedtechniques, in this case support vector machines, toprovide additional insight into gene expression andfunction.

25. Golub, T. R. Molecular classification of cancer: classdiscovery and class prediction by gene expressionmonitoring. Science 286, 531–537 (1999).

26. Furey, T. S. et al. Support vector machine classification andvalidation of cancer tissue samples using microarrayexpression data. Bioinformatics 16, 906–914 (2000).

27. Hedenfalk, I. et al. Gene-expression profiles in hereditarybreast cancer. N. Engl. J. Med. 344, 539–548 (2001).

28. Chatterjee, S. & Price, B. Regression Analysis by Example(John Wiley and Sons, New York, 1991).

29. Cleveland, W. S. & Devlin, S. J. Locally weightedregression: an approach to regression analysis by localfitting. J. Am. Stat. Assoc. 83, 596–610 (1988).

30. Sokal, R. R. & Sneath, P. H. A. Principles of NumericalTaxonomy (W. H. Freeman & Co., San Francisco, 1963).

31. Ward, J. H. Hierarchical grouping to optimize an objectivefunction. J. Am. Stat. Assoc. 58, 236–244 (1963).

AcknowledgementsCluster analysis was done using the The Institute for GenomicResearch MeV software package developed by A. Sturn, A. I.Saeed and J.Q., which is available at http://pga.tigr.org/tools.shtml,along with the sample data set used here. The author also thanksA. Sturn, N. H. Lee, R. L. Malek and E. Snesrud for valuable dis-cussions and comments. This work is supported by grants fromthe US National Science Foundation, the US National CancerInstitute, and the US National Heart, Lung, and Blood Institute.


computational analysis of microarray data · more basic tools.although the focus here is on spotted...

Documents