
Page 1: Pictorial Demonstration

Pictorial Demonstration

Rescale features to minimize the LOO bound R2/M2

[Figure: two plots in the (x1, x2) plane. Left: enclosing radius R and margin M, with R^2/M^2 > 1. Right: after rescaling the features, R^2/M^2 = 1, i.e. M = R.]

Page 2: Pictorial Demonstration

SVM Functional


To the SVM classifier we add extra scaling parameters for feature selection:

where the parameters α and b are computed by maximizing the following functional, which is equivalent to maximizing the margin:
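The formulas themselves were slide images and are not reproduced in the transcript. A hedged reconstruction, assuming the feature-scaling SVM formulation (Weston et al., "Feature Selection for SVMs"), where σ is the vector of per-feature scaling parameters and σ ∗ x denotes the element-wise product:

f(x) = \operatorname{sign}\Big(\sum_{i=1}^{\ell} \alpha_i\, y_i\, K(\sigma \ast x_i,\ \sigma \ast x) + b\Big)

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j\, y_i y_j\, K(\sigma \ast x_i,\ \sigma \ast x_j),
\qquad \text{subject to } \sum_i \alpha_i y_i = 0,\ \alpha_i \ge 0.

Under this formulation, setting a component of σ to zero removes the corresponding feature, so minimizing the LOO bound over σ performs feature selection.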

Page 3: Pictorial Demonstration

Radius Margin Bound
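The bound itself is not reproduced in the transcript; a standard statement of the radius-margin bound used in these slides (for a hard-margin SVM) is:

\text{LOO errors} \;\le\; \frac{R^2}{M^2} \;=\; R^2\,\lVert w \rVert^2

where R is the radius of the smallest sphere enclosing the training points in feature space, M is the margin, and M = 1/\lVert w \rVert.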

Page 4: Pictorial Demonstration

Jaakkola-Haussler Bound

Page 5: Pictorial Demonstration

Span Bound

Page 6: Pictorial Demonstration

The Algorithm

Page 7: Pictorial Demonstration

Computing Gradients

Page 8: Pictorial Demonstration

Toy Data

Linear problem with 6 relevant dimensions out of 202

Nonlinear problem with 2 relevant dimensions out of 52

Page 9: Pictorial Demonstration

Face Detection

On the CMU test set consisting of 479 faces and 57,000,000 non-faces, we compare ROC curves obtained for different numbers of selected features. We see that using more than 60 features does not help.

Page 10: Pictorial Demonstration

Molecular Classification of Cancer


Dataset                        Total samples   Class 0        Class 1
Leukemia Morphology (train)    38              27 ALL         11 AML
Leukemia Morphology (test)     34              20 ALL         14 AML
Leukemia Lineage (ALL)         23              15 B-Cell      8 T-Cell
Leukemia Outcome (AML)         15              8 Low risk     7 High risk

Dataset                        Total samples   Class 0        Class 1
Lymphoma Morphology            77              19 FSC         58 DLCL
Lymphoma Outcome               58              20 Low risk    14 High risk
Brain Morphology               41              14 Glioma      27 MD
Brain Outcome                  50              38 Low risk    12 High risk

Page 11: Pictorial Demonstration

Morphology Classification

Dataset                                   Algorithm   Total samples   Total errors   Class 1 errors   Class 0 errors   Number of genes
Leukemia Morphology (test), AML vs ALL    SVM         35              0/35           0/21             0/14             40
                                          WV          35              2/35           1/21             1/14             50
                                          k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL), B vs T            SVM         23              0/23           0/15             0/8              10
                                          WV          23              0/23           0/15             0/8              9
                                          k-NN        23              0/23           0/15             0/8              10
Lymphoma, FS vs DLCL                      SVM         77              4/77           2/32             2/35             200
                                          WV          77              6/77           1/32             5/35             30
                                          k-NN        77              3/77           1/32             2/35             250
Brain, MD vs Glioma                       SVM         41              1/41           1/27             0/14             100
                                          WV          41              1/41           1/27             0/14             3
                                          k-NN        41              0/41           0/27             0/14             5

Page 12: Pictorial Demonstration

Outcome Classification

Dataset                           Algorithm   Total samples   Total errors   Class 1 errors   Class 0 errors   Number of genes
Lymphoma, LBC treatment outcome   SVM         58              13/58          3/32             10/26            100
                                  WV          58              15/58          5/32             10/26            12
                                  k-NN        58              15/58          8/32             7/26             15
Brain, MD treatment outcome       SVM         50              7/50           6/12             1/38             50
                                  WV          50              13/50          6/12             7/38             6
                                  k-NN        50              10/50          6/12             4/38             5

Page 13: Pictorial Demonstration

Outcome Classification

Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.

[Figure: Kaplan-Meier survival curves for the two outcome predictions. Left: Lymphoma, p-val = 0.0015. Right: Medulloblastoma, p-val = 0.00039.]

Page 14: Pictorial Demonstration

Part 4

Clustering Algorithms: Hierarchical Clustering

Page 15: Pictorial Demonstration

Hierarchical clustering

Step 1: Transform the genes x experiments matrix into a genes x genes distance matrix

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains

genes x experiments matrix:
          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

genes x genes distance matrix:
          Gene A   Gene B   Gene C
Gene A    0
Gene B    ?        0
Gene C    ?        ?        0

Page 16: Pictorial Demonstration

Hierarchical clustering (continued)

To transform the genes x experiments matrix into a genes x genes matrix, use a gene similarity metric (Eisen et al. 1998, PNAS 95:14863-14868).

For any two genes X and Y observed over a series of N conditions, Gi is the (log-transformed) primary datum for gene G in condition i, and Goffset is set to 0, corresponding to a fluorescence ratio of 1.0.

This is exactly the same as Pearson's correlation, except for the underlined term: Goffset is used in place of the mean.
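The metric formula was an image on the slide; as a reconstruction of the Eisen et al. similarity score (an uncentered correlation) with Goffset = 0:

S(X, Y) = \frac{1}{N}\sum_{i=1}^{N}
\left(\frac{X_i - X_{\text{offset}}}{\Phi_X}\right)
\left(\frac{Y_i - Y_{\text{offset}}}{\Phi_Y}\right),
\qquad
\Phi_G = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(G_i - G_{\text{offset}}\bigr)^2}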

Page 17: Pictorial Demonstration

Hierarchical clustering (continued)

What if genome expression is clustered based on negative correlation?

Pearson's correlation example

Page 18: Pictorial Demonstration

Hierarchical clustering (continued)

Initial distance matrix:
        G1    G2    G3    G4    G5
G1       0
G2       2     0
G3       6     5     0
G4      10     9     4     0
G5       9     8     5     3     0

After merging G1 and G2 into G(12):
        G(12)  G3    G4    G5
G(12)     0
G3        6     0
G4       10     4     0
G5        9     5     3     0

After merging G4 and G5 into G(45):
        G(12)  G3    G(45)
G(12)     0
G3        6     0
G(45)    10     5     0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]

[Dendrogram over genes 1-5 corresponding to the stages above]
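A minimal SciPy sketch of the same worked example; the distance matrix is the one above, and complete linkage is assumed since it reproduces the merge distances shown:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Distance matrix from the worked example above (genes G1-G5).
D = np.array([
    [0,  2,  6, 10,  9],
    [2,  0,  5,  9,  8],
    [6,  5,  0,  4,  5],
    [10, 9,  4,  0,  3],
    [9,  8,  5,  3,  0],
], dtype=float)

# Complete linkage reproduces the merge order P5 -> P1 shown above:
# (G1, G2) at distance 2, (G4, G5) at 3, (G3, (G4 G5)) at 5, all at 10.
Z = linkage(squareform(D), method="complete")
print(Z)  # each row: cluster a, cluster b, merge distance, new cluster size

dendrogram(Z, labels=["G1", "G2", "G3", "G4", "G5"])
plt.show()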

Page 19: Pictorial Demonstration

Part 5

Clustering Algorithms: k-means Clustering

Page 20: Pictorial Demonstration

K-means clustering

This method differs from hierarchical clustering in many ways. In particular,

- There is no hierarchy; the data are simply partitioned. You will be presented only with the final cluster membership for each case.

- There is no role for the dendrogram in k-means clustering.

- You must supply the number of clusters (k) into which the data are to be grouped.

Page 21: Pictorial Demonstration

K-means clustering(continued)

Step 1: Transform the n (genes) x m (experiments) matrix into an n (genes) x n (genes) distance matrix

Step 2: Cluster genes based on a k-means clustering algorithm

genes x experiments matrix:
          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

genes x genes distance matrix:
          Gene A   Gene B   Gene C
Gene A    0
Gene B    ?        0
Gene C    ?        ?        0

Page 22: Pictorial Demonstration

K-means clustering(continued)

To transform the n x m matrix into an n x n matrix, use a similarity (distance) metric.

(Tavazoie et al. Nature Genetics. 1999 Jul;22(3):281-5)

Euclidean distance, for any two genes X and Y observed over a series of M conditions:
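The formula itself is missing from the transcript; the standard Euclidean distance it describes is:

d(X, Y) = \sqrt{\sum_{i=1}^{M} (X_i - Y_i)^2}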

Page 23: Pictorial Demonstration

K-means clustering(continued)

[Example: a 4 x 4 gene distance matrix in which Gene 1 and Gene 2 are close to each other and Gene 3 and Gene 4 are close to each other, giving two clusters: {Gene 1, Gene 2} and {Gene 3, Gene 4}.]

Page 24: Pictorial Demonstration

K-means clustering algorithm

Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on the distance matrix.

Step 2: The first cluster center (red) is chosen randomly, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k = 3.

Page 25: Pictorial Demonstration

K-means clustering algorithm(continued)

Step 3: Each point is assigned to the cluster associated with the closest representative center.

Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, compute a new cluster representative.

Page 26: Pictorial Demonstration

K-means clustering algorithm(continued)

Step 5: Repeat steps 3 and 4 with the new representatives.

Run steps 3, 4 and 5 until no further changes occur.
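A compact NumPy sketch of Steps 2-5, assuming Euclidean distance; X stands for a hypothetical genes x conditions expression matrix:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick the first center at random, then repeatedly take the point
    # farthest from the centers chosen so far (farthest-point seeding).
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Step 3: assign each point to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points,
        # which minimizes the within-cluster sum of squared distances.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # Step 5: stop when nothing changes.
            break
        centers = new_centers
    return labels, centers

# Example usage: labels, centers = kmeans(expression_matrix, k=3)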

Page 27: Pictorial Demonstration

Part 6

Clustering Algorithms: Principal Component Analysis

Page 28: Pictorial Demonstration

Principal component analysis (PCA)

PCA is a variable reduction procedure. It is useful when you have obtained data on a large number of variables and believe that there is some redundancy in those variables.

Page 29: Pictorial Demonstration

PCA (continued)

Page 30: Pictorial Demonstration

PCA (continued)

Page 31: Pictorial Demonstration

PCA (continued)

- Items 1-4 are collapsed into a single new variable that reflects the employees’ satisfaction with supervision, and items 5-7 are collapsed into a single new variable that reflects satisfaction with pay.

- General form of the formula to compute scores on the first component:

  C1 = b11(X1) + b12(X2) + ... + b1p(Xp)

  where
  C1  = the subject's score on principal component 1
  b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1
  Xp  = the subject's score on observed variable p

Page 32: Pictorial Demonstration

PCA (continued)

For example, you could determine each subject's score on principal component 1 (satisfaction with supervision) and principal component 2 (satisfaction with pay) by

  C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)
  C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)

These weights can be calculated using a special type of equation called an eigenequation.
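A small NumPy sketch of the idea: standardize the items, solve the eigenequation of their correlation matrix, and use the leading eigenvectors as the component weights. Here data is a hypothetical subjects x items array (e.g. the 7 survey items above):

import numpy as np

def pca_scores(data, n_components=2):
    Z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize each item
    R = np.corrcoef(Z, rowvar=False)                   # item correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # solve the eigenequation
    order = np.argsort(eigvals)[::-1]                  # components by decreasing variance
    weights = eigvecs[:, order[:n_components]]         # the b coefficients per component
    scores = Z @ weights                               # C = b1*X1 + b2*X2 + ... + bp*Xp
    return scores, weights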

Page 33: Pictorial Demonstration

PCA (continued)

(Alter et al., PNAS, 2000, 97(18) 10101-10106)

Page 34: Pictorial Demonstration

PCA (continued)

Page 35: Pictorial Demonstration

Part 7

Clustering Algorithms: Self-Organizing Maps

Page 36: Pictorial Demonstration

Clustering

Goals

• Find natural classes in the data

• Identify new classes / gene correlations

• Refine existing taxonomies

• Support biological analysis / discovery

• Different methods – hierarchical clustering, SOMs, etc.

Page 37: Pictorial Demonstration

Self organizing maps (SOM)

- A data visualization technique invented by Professor Teuvo Kohonen which reduces the dimensions of data through the use of self-organizing neural networks.

- A method for producing ordered low-dimensional representations of an input data space.

- Typically such input data is complex and high-dimensional with data elements being related to each other in a nonlinear fashion.

Page 38: Pictorial Demonstration

SOM (continued)

Page 39: Pictorial Demonstration

SOM (continued)

- The cerebral cortex of the brain is arranged as a two-dimensional plane of neurons, and spatial mappings are used to model complex data structures.

- Topological relationships in external stimuli are preserved and complex multi-dimensional data can be represented in a lower (usually two) dimensional space.

Page 40: Pictorial Demonstration

SOM (continued)

- One chooses a geometry of "nodes", for example, a 3 × 2 grid.

- The nodes are mapped into k-dimensional space, initially at random, and then iteratively adjusted.

- Each iteration involves randomly selecting a data point P and moving the nodes in the direction of P.

(Tamayo et al., 1999 PNAS 96:2907-2912)

Page 41: Pictorial Demonstration

SOM (continued)

- The closest node NP is moved the most, whereas other nodes are moved by smaller amounts depending on their distance from NP in the initial geometry.

- In this fashion, neighboring points in the initial geometry tend to be mapped to nearby points in k-dimensional space. The process continues for 20,000-50,000 iterations.
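A minimal NumPy sketch of this update rule; the Gaussian neighborhood function and the decay schedules are illustrative assumptions, not details taken from the slides:

import numpy as np

def som(data, grid=(3, 2), n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    nodes = rng.standard_normal((rows * cols, data.shape[1]))  # random start in k-dim space
    for t in range(n_iter):
        lr = 0.1 * (1.0 - t / n_iter)                 # learning rate decays over time
        radius = max(rows, cols) * (1.0 - t / n_iter) + 0.5
        p = data[rng.integers(len(data))]             # pick a random data point P
        winner = np.argmin(np.linalg.norm(nodes - p, axis=1))  # node N_P closest to P
        grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
        h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))      # neighbors move less
        nodes += lr * h[:, None] * (p - nodes)        # move nodes toward P
    return nodes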

Page 42: Pictorial Demonstration

SOM (continued)

Yeast Cell Cycle SOM

- The 828 genes that passed the variation filter were grouped into 30 clusters.

Page 43: Pictorial Demonstration

SOM analysis of data of yeast gene expression during diauxic shift [2]. Data were analyzed by a prototype of GenePoint software.

• a: Genes with a similar expression profile are clustered in the same neuron of a 16 x 16 matrix SOM, and genes with closely related profiles are in neighboring neurons. Neurons contain between 10 and 49 genes.

• b: Magnification of four neurons similarly colored in a. The bar graph in each neuron displays the average expression of genes within the neuron at 2-h intervals during the diauxic shift.

• c: SOM modified with Sammon's mapping algorithm. The distance between two neurons corresponds to the difference in gene expression pattern between the two neurons, and the circle size to the number of genes included in the neuron. Neurons marked in green, yellow (upper left corner), red and blue are similarly colored in a and b.

Page 44: Pictorial Demonstration

Result of SOM clustering of Dictyostelium expression data with a 6 x 4 structure of centroids. The 6 x 4 = 24 clusters are the minimum number of centroids needed to resolve the three clusters revealed by percolation clustering (encircled, from top to bottom: down-regulated genes, early upregulated genes, and late upregulated genes). The remaining 21 clusters are formed by forceful partitioning of the remaining non-informative noisy data. Similarity of expression within these 21 clusters is random and biologically meaningless.

Page 45: Pictorial Demonstration

SOM clustering

• SOM: self-organizing maps

• Preprocessing
  – filter away genes with insufficient biological variation
  – normalize gene expression (across samples) to mean 0, standard deviation 1, for each gene separately

• Run SOM for many iterations

• Plot the results

Page 46: Pictorial Demonstration

SOM results: large grid, 10x10 cells

Page 47: Pictorial Demonstration

Clustering visualization

Page 48: Pictorial Demonstration

2D SOM visualization

Page 49: Pictorial Demonstration

SOM output visualization

Page 50: Pictorial Demonstration

The Y-Cluster

Page 51: Pictorial Demonstration

Part 8

Beyond Clustering

Page 52: Pictorial Demonstration

Support vector machines

Used for classification of genes according to function:
1) Choose positive and negative examples (label +/-)
2) Transform input space to feature space
3) Construct maximum margin hyperplane
4) Classify new genes as members / non-members
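A rough scikit-learn illustration of steps 1-4; the expression matrix X and the +/- labels y here are random placeholders, not real data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 79))            # hypothetical: 100 genes x 79 expression measurements
y = rng.choice([-1, 1], size=100)    # step 1: positive / negative example labels

clf = SVC(kernel="rbf", C=1.0)       # steps 2-3: kernel feature space, maximum-margin hyperplane
clf.fit(X, y)

new_genes = rng.random((5, 79))
print(clf.predict(new_genes))        # step 4: classify new genes as members / non-members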

Page 53: Pictorial Demonstration

Support vector machines (continued)

- Using the class definitions made by the MIPS yeast genome database, SVMs were trained to recognize six functional classes: tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones, and helix-turn-helix proteins.

(Brown et al., 2000 PNAS 97(1), 262-267)

Page 54: Pictorial Demonstration

Support vector machines (continued)

Examples of predicted functional classifications for previously unannotated genes by the SVMs

Class Gene Locus Comments

TCA YHR188C Conserved in worm, Schizosaccharomyces pombe, human

YKL039W PTM1 Major transport facilitator family; likely integral membrane protein.

Resp YKR016W Not highly conserved, possible homolog in S. pombe

YKR046C No convincing homologs

Ribo YKL056C Homolog of translationally controlled tumor protein, abundant, fingers

YNL053W MSG5 Protein-tyrosine phosphatase, bypasses growth arrest by mating factor

Prot YDR330W Ubiquitin regulatory domain protein, S. pombe homolog

YJL036W Member of sorting nexin family

YDL053C No convincing homologs

YLR387C Three C2H2 zinc fingers, similar YBR267W not coregulated

Page 55: Pictorial Demonstration

Automatic discovery of regulatory patterns in promoter region

All 6269 ORFs: 200 bp upstream and downstream.

5097 ORFs: 500 bp upstream.

From SGD

DNA chip: 91 data sets. These data sets consist of the 500 bp upstream regions and the red-green ratios.

(Jensen and Knudsen, 2000, Bioinformatics 16:326-333)

Page 56: Pictorial Demonstration

Automatic discovery of regulatory patterns in promoter region (continued)

- Sequence patterns correlated to whole-cell expression data were found by Kolmogorov-Smirnov tests.

- Regulatory elements were identified by systematic calculations of the significance of correlation between words found in functional annotation of genes and DNA words occurring in their promoter regions.

Page 57: Pictorial Demonstration

Bayesian networks analysis

- Graph-based model of joint multivariate probability distributions
- The model can capture properties of conditional independence between variables
- Can describe complex stochastic processes
- Provides clear methodologies for learning from (noisy) observations

(Friedman et al. 2000 J. Comp. Biol., 7:601-620)
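The joint distribution a Bayesian network encodes factorizes over the graph (standard form, not shown in the transcript):

P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

where Pa(X_i) denotes the parents of X_i in the directed acyclic graph.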

Page 58: Pictorial Demonstration

Bayesian networks analysis (continued)

Page 59: Pictorial Demonstration

Bayesian networks analysis (continued)

- 76 gene expression measurements of 6177 yeast ORFs
- 800 genes whose expression varied over cell-cycle stages were selected
- Learned networks whose variables were the expression levels of each of these 800 genes

Page 60: Pictorial Demonstration

Movie

http://www.dkfz-heidelberg.de/abt0840/whuber/mamovie.html

Page 61: Pictorial Demonstration

Part 9

Concluding Remarks

Page 62: Pictorial Demonstration

Future directions

• Algorithms optimized for small samples (the no. of samples will remain small for many tasks)

• Integration with other data
  – biological networks
  – medical text
  – protein data

• Cost-sensitive classification algorithms
  – error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.

Page 63: Pictorial Demonstration

Summary

• Microarray Data Analysis -- a revolution in life sciences!

• Beware of false positives

• Principled methodology can produce good results