
Page 1: Pictorial Demonstration

Pictorial Demonstration

Rescale features to minimize the LOO bound R2/M2

[Figure: two plots in the (x1, x2) plane. Left: enclosing radius R and margin M, with R^2/M^2 > 1. Right: after rescaling the features, R^2/M^2 = 1, i.e. M = R.]

Page 2: Pictorial Demonstration

SVM Functional


To the SVM classifier we add extra scaling parameters for feature selection:

where the parameters α and b are computed by maximizing the following functional, which is equivalent to maximizing the margin:
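The formulas themselves were slide images and are not reproduced in the transcript. A hedged reconstruction, assuming the feature-scaling SVM formulation (Weston et al., "Feature Selection for SVMs"), where σ is the vector of per-feature scaling parameters and σ ∗ x denotes the element-wise product:

f(x) = \operatorname{sign}\Big(\sum_{i=1}^{\ell} \alpha_i\, y_i\, K(\sigma \ast x_i,\ \sigma \ast x) + b\Big)

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j\, y_i y_j\, K(\sigma \ast x_i,\ \sigma \ast x_j),
\qquad \text{subject to } \sum_i \alpha_i y_i = 0,\ \alpha_i \ge 0.

Under this formulation, setting a component of σ to zero removes the corresponding feature, so minimizing the LOO bound over σ performs feature selection.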

Page 3: Pictorial Demonstration

Radius Margin Bound
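The bound itself is not reproduced in the transcript; a standard statement of the radius-margin bound used in these slides (for a hard-margin SVM) is:

\text{LOO errors} \;\le\; \frac{R^2}{M^2} \;=\; R^2\,\lVert w \rVert^2

where R is the radius of the smallest sphere enclosing the training points in feature space, M is the margin, and M = 1/\lVert w \rVert.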

Page 4: Pictorial Demonstration

Jaakkola-Haussler Bound

Page 5: Pictorial Demonstration

Span Bound

Page 6: Pictorial Demonstration

The Algorithm

Page 7: Pictorial Demonstration

Computing Gradients

Page 8: Pictorial Demonstration

Toy Data

Linear problem with 6 relevant dimensions out of 202

Nonlinear problem with 2 relevant dimensions out of 52

Page 9: Pictorial Demonstration

Face Detection

On the CMU test set consisting of 479 faces and 57,000,000 non-faces, we compare ROC curves obtained for different numbers of selected features. We see that using more than 60 features does not help.

Page 10: Pictorial Demonstration

Molecular Classification of Cancer


Dataset                        Total samples   Class 0        Class 1
Leukemia Morphology (train)    38              27 ALL         11 AML
Leukemia Morphology (test)     34              20 ALL         14 AML
Leukemia Lineage (ALL)         23              15 B-Cell      8 T-Cell
Leukemia Outcome (AML)         15              8 Low risk     7 High risk

Dataset                        Total samples   Class 0        Class 1
Lymphoma Morphology            77              19 FSC         58 DLCL
Lymphoma Outcome               58              20 Low risk    14 High risk
Brain Morphology               41              14 Glioma      27 MD
Brain Outcome                  50              38 Low risk    12 High risk

Page 11: Pictorial Demonstration

Morphology Classification

Dataset                                   Algorithm   Total samples   Total errors   Class 1 errors   Class 0 errors   Number of genes
Leukemia Morphology (test), AML vs ALL    SVM         35              0/35           0/21             0/14             40
                                          WV          35              2/35           1/21             1/14             50
                                          k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL), B vs T            SVM         23              0/23           0/15             0/8              10
                                          WV          23              0/23           0/15             0/8              9
                                          k-NN        23              0/23           0/15             0/8              10
Lymphoma, FS vs DLCL                      SVM         77              4/77           2/32             2/35             200
                                          WV          77              6/77           1/32             5/35             30
                                          k-NN        77              3/77           1/32             2/35             250
Brain, MD vs Glioma                       SVM         41              1/41           1/27             0/14             100
                                          WV          41              1/41           1/27             0/14             3
                                          k-NN        41              0/41           0/27             0/14             5

Page 12: Pictorial Demonstration

Outcome Classification

Dataset                           Algorithm   Total samples   Total errors   Class 1 errors   Class 0 errors   Number of genes
Lymphoma, LBC treatment outcome   SVM         58              13/58          3/32             10/26            100
                                  WV          58              15/58          5/32             10/26            12
                                  k-NN        58              15/58          8/32             7/26             15
Brain, MD treatment outcome       SVM         50              7/50           6/12             1/38             50
                                  WV          50              13/50          6/12             7/38             6
                                  k-NN        50              10/50          6/12             4/38             5

Page 13: Pictorial Demonstration

Outcome Classification

Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.

[Figure: Kaplan-Meier survival curves for the two outcome predictions. Left: Lymphoma, p-val = 0.0015. Right: Medulloblastoma, p-val = 0.00039.]

Page 14: Pictorial Demonstration

Part 4

Clustering Algorithms: Hierarchical Clustering

Page 15: Pictorial Demonstration

Hierarchical clustering

Step 1: Transform the genes x experiments matrix into a genes x genes distance matrix

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains

genes x experiments matrix:
          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

genes x genes distance matrix:
          Gene A   Gene B   Gene C
Gene A    0
Gene B    ?        0
Gene C    ?        ?        0

Page 16: Pictorial Demonstration

Hierarchical clustering (continued)

To transform the genes x experiments matrix into a genes x genes matrix, use a gene similarity metric (Eisen et al. 1998, PNAS 95:14863-14868).

For any two genes X and Y observed over a series of N conditions, Gi is the (log-transformed) primary datum for gene G in condition i, and Goffset is set to 0, corresponding to a fluorescence ratio of 1.0.

This is exactly the same as Pearson's correlation, except for the underlined term: Goffset is used in place of the mean.
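The metric formula was an image on the slide; as a reconstruction of the Eisen et al. similarity score (an uncentered correlation) with Goffset = 0:

S(X, Y) = \frac{1}{N}\sum_{i=1}^{N}
\left(\frac{X_i - X_{\text{offset}}}{\Phi_X}\right)
\left(\frac{Y_i - Y_{\text{offset}}}{\Phi_Y}\right),
\qquad
\Phi_G = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(G_i - G_{\text{offset}}\bigr)^2}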

Page 17: Pictorial Demonstration

Hierarchical clustering (continued)

What if genome expression is clustered based on negative correlation?

Pearson's correlation example

Page 18: Pictorial Demonstration

Hierarchical clustering (continued)

Initial distance matrix:
        G1    G2    G3    G4    G5
G1       0
G2       2     0
G3       6     5     0
G4      10     9     4     0
G5       9     8     5     3     0

After merging G1 and G2 into G(12):
        G(12)  G3    G4    G5
G(12)     0
G3        6     0
G4       10     4     0
G5        9     5     3     0

After merging G4 and G5 into G(45):
        G(12)  G3    G(45)
G(12)     0
G3        6     0
G(45)    10     5     0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]

[Dendrogram over genes 1-5 corresponding to the stages above]
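A minimal SciPy sketch of the same worked example; the distance matrix is the one above, and complete linkage is assumed since it reproduces the merge distances shown:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Distance matrix from the worked example above (genes G1-G5).
D = np.array([
    [0,  2,  6, 10,  9],
    [2,  0,  5,  9,  8],
    [6,  5,  0,  4,  5],
    [10, 9,  4,  0,  3],
    [9,  8,  5,  3,  0],
], dtype=float)

# Complete linkage reproduces the merge order P5 -> P1 shown above:
# (G1, G2) at distance 2, (G4, G5) at 3, (G3, (G4 G5)) at 5, all at 10.
Z = linkage(squareform(D), method="complete")
print(Z)  # each row: cluster a, cluster b, merge distance, new cluster size

dendrogram(Z, labels=["G1", "G2", "G3", "G4", "G5"])
plt.show()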

Page 19: Pictorial Demonstration

Part 5

Clustering Algorithms: k-means Clustering

Page 20: Pictorial Demonstration

K-means clustering

This method differs from hierarchical clustering in many ways. In particular,

- There is no hierarchy; the data are simply partitioned. You will be presented only with the final cluster membership for each case.

- There is no role for the dendrogram in k-means clustering.

- You must supply the number of clusters (k) into which the data are to be grouped.

Page 21: Pictorial Demonstration

K-means clustering(continued)

Step 1: Transform the n (genes) x m (experiments) matrix into an n (genes) x n (genes) distance matrix

Step 2: Cluster genes based on a k-means clustering algorithm

genes x experiments matrix:
          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

genes x genes distance matrix:
          Gene A   Gene B   Gene C
Gene A    0
Gene B    ?        0
Gene C    ?        ?        0

Page 22: Pictorial Demonstration

K-means clustering(continued)

To transform the n x m matrix into an n x n matrix, use a similarity (distance) metric.

(Tavazoie et al. Nature Genetics. 1999 Jul;22(3):281-5)

Euclidean distance, for any two genes X and Y observed over a series of M conditions:
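The formula itself is missing from the transcript; the standard Euclidean distance it describes is:

d(X, Y) = \sqrt{\sum_{i=1}^{M} (X_i - Y_i)^2}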

Page 23: Pictorial Demonstration

K-means clustering(continued)

[Example: a 4 x 4 gene distance matrix in which Gene 1 and Gene 2 are close to each other and Gene 3 and Gene 4 are close to each other, giving two clusters: {Gene 1, Gene 2} and {Gene 3, Gene 4}.]

Page 24: Pictorial Demonstration

K-means clustering algorithm

Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on the distance matrix.

Step 2: The first cluster center (red) is chosen randomly, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k = 3.

Page 25: Pictorial Demonstration

K-means clustering algorithm(continued)

Step 3: Each point is assigned to the cluster associated with the closest representative center.

Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, compute a new cluster representative.

Page 26: Pictorial Demonstration

K-means clustering algorithm(continued)

Step 5: Repeat steps 3 and 4 with the new representatives.

Run steps 3, 4 and 5 until no further changes occur.
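A compact NumPy sketch of Steps 2-5, assuming Euclidean distance; X stands for a hypothetical genes x conditions expression matrix:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick the first center at random, then repeatedly take the point
    # farthest from the centers chosen so far (farthest-point seeding).
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Step 3: assign each point to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points,
        # which minimizes the within-cluster sum of squared distances.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # Step 5: stop when nothing changes.
            break
        centers = new_centers
    return labels, centers

# Example usage: labels, centers = kmeans(expression_matrix, k=3)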

Page 27: Pictorial Demonstration

Part 6

Clustering Algorithms: Principal Component Analysis

Page 28: Pictorial Demonstration

Principal component analysis (PCA)

PCA is a variable reduction procedure. It is useful when you have obtained data on a large number of variables and believe that there is some redundancy in those variables.

Page 29: Pictorial Demonstration

PCA (continued)

Page 30: Pictorial Demonstration

PCA (continued)

Page 31: Pictorial Demonstration

PCA (continued)

- Items 1-4 are collapsed into a single new variable that reflects the employees’ satisfaction with supervision, and items 5-7 are collapsed into a single new variable that reflects satisfaction with pay.

- General form of the formula to compute scores on the first component:

  C1 = b11(X1) + b12(X2) + ... + b1p(Xp)

  where
  C1  = the subject's score on principal component 1
  b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1
  Xp  = the subject's score on observed variable p

Page 32: Pictorial Demonstration

PCA (continued)

For example, you could determine each subject's score on principal component 1 (satisfaction with supervision) and principal component 2 (satisfaction with pay) by

  C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)
  C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)

These weights can be calculated using a special type of equation called an eigenequation.
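A small NumPy sketch of the idea: standardize the items, solve the eigenequation of their correlation matrix, and use the leading eigenvectors as the component weights. Here data is a hypothetical subjects x items array (e.g. the 7 survey items above):

import numpy as np

def pca_scores(data, n_components=2):
    Z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize each item
    R = np.corrcoef(Z, rowvar=False)                   # item correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # solve the eigenequation
    order = np.argsort(eigvals)[::-1]                  # components by decreasing variance
    weights = eigvecs[:, order[:n_components]]         # the b coefficients per component
    scores = Z @ weights                               # C = b1*X1 + b2*X2 + ... + bp*Xp
    return scores, weights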

Page 33: Pictorial Demonstration

PCA (continued)

(Alter et al., PNAS, 2000, 97(18) 10101-10106)

Page 34: Pictorial Demonstration

PCA (continued)

Page 35: Pictorial Demonstration

Part 7

Clustering Algorithms: Self-Organizing Maps

Page 36: Pictorial Demonstration

Clustering

Goals

• Find natural classes in the data

• Identify new classes / gene correlations

• Refine existing taxonomies

• Support biological analysis / discovery

• Different methods – hierarchical clustering, SOMs, etc.

Page 37: Pictorial Demonstration

Self organizing maps (SOM)

- A data visualization technique invented by Professor Teuvo Kohonen which reduces the dimensions of data through the use of self-organizing neural networks.

- A method for producing ordered low-dimensional representations of an input data space.

- Typically such input data is complex and high-dimensional with data elements being related to each other in a nonlinear fashion.

Page 38: Pictorial Demonstration

SOM (continued)

Page 39: Pictorial Demonstration

SOM (continued)

- The cerebral cortex of the brain is arranged as a two-dimensional plane of neurons, and spatial mappings are used to model complex data structures.

- Topological relationships in external stimuli are preserved and complex multi-dimensional data can be represented in a lower (usually two) dimensional space.

Page 40: Pictorial Demonstration

SOM (continued)

- One chooses a geometry of "nodes", for example, a 3 × 2 grid.

- The nodes are mapped into k-dimensional space, initially at random, and then iteratively adjusted.

- Each iteration involves randomly selecting a data point P and moving the nodes in the direction of P.

(Tamayo et al., 1999 PNAS 96:2907-2912)

Page 41: Pictorial Demonstration

SOM (continued)

- The closest node NP is moved the most, whereas other nodes are moved by smaller amounts depending on their distance from NP in the initial geometry.

- In this fashion, neighboring points in the initial geometry tend to be mapped to nearby points in k-dimensional space. The process continues for 20,000-50,000 iterations.
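A minimal NumPy sketch of this update rule; the Gaussian neighborhood function and the decay schedules are illustrative assumptions, not details taken from the slides:

import numpy as np

def som(data, grid=(3, 2), n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    nodes = rng.standard_normal((rows * cols, data.shape[1]))  # random start in k-dim space
    for t in range(n_iter):
        lr = 0.1 * (1.0 - t / n_iter)                 # learning rate decays over time
        radius = max(rows, cols) * (1.0 - t / n_iter) + 0.5
        p = data[rng.integers(len(data))]             # pick a random data point P
        winner = np.argmin(np.linalg.norm(nodes - p, axis=1))  # node N_P closest to P
        grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
        h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))      # neighbors move less
        nodes += lr * h[:, None] * (p - nodes)        # move nodes toward P
    return nodes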

Page 42: Pictorial Demonstration

SOM (continued)

Yeast Cell Cycle SOM

- The 828 genes that passed the variation filter were grouped into 30 clusters.

Page 43: Pictorial Demonstration

SOM analysis of data of yeast gene expression during diauxic shift [2]. Data were analyzed by a prototype of GenePoint software.

• a: Genes with a similar expression profile are clustered in the same neuron of a 16 x 16 matrix SOM, and genes with closely related profiles are in neighboring neurons. Neurons contain between 10 and 49 genes.

• b: Magnification of four neurons similarly colored in a. The bar graph in each neuron displays the average expression of genes within the neuron at 2-h intervals during the diauxic shift.

• c: SOM modified with Sammon's mapping algorithm. The distance between two neurons corresponds to the difference in gene expression pattern between the two neurons, and the circle size to the number of genes included in the neuron. Neurons marked in green, yellow (upper left corner), red and blue are similarly colored in a and b.

Page 44: Pictorial Demonstration

Result of SOM clustering of Dictyostelium expression data with a 6 x 4 structure of centroids. The 6 x 4 = 24 clusters are the minimum number of centroids needed to resolve the three clusters revealed by percolation clustering (encircled, from top to bottom: down-regulated genes, early upregulated genes, and late upregulated genes). The remaining 21 clusters are formed by forceful partitioning of the remaining non-informative noisy data. Similarity of expression within these 21 clusters is random and biologically meaningless.

Page 45: Pictorial Demonstration

SOM clustering

• SOM: self-organizing maps

• Preprocessing
  – filter away genes with insufficient biological variation
  – normalize gene expression (across samples) to mean 0, standard deviation 1, for each gene separately

• Run SOM for many iterations

• Plot the results

Page 46: Pictorial Demonstration

SOM results: large grid, 10x10 cells

Page 47: Pictorial Demonstration

Clustering visualization

Page 48: Pictorial Demonstration

2D SOM visualization

Page 49: Pictorial Demonstration

SOM output visualization

Page 50: Pictorial Demonstration

The Y-Cluster

Page 51: Pictorial Demonstration

Part 8

Beyond Clustering

Page 52: Pictorial Demonstration

Support vector machines

Used for classification of genes according to function:
1) Choose positive and negative examples (label +/-)
2) Transform input space to feature space
3) Construct maximum margin hyperplane
4) Classify new genes as members / non-members
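A rough scikit-learn illustration of steps 1-4; the expression matrix X and the +/- labels y here are random placeholders, not real data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 79))            # hypothetical: 100 genes x 79 expression measurements
y = rng.choice([-1, 1], size=100)    # step 1: positive / negative example labels

clf = SVC(kernel="rbf", C=1.0)       # steps 2-3: kernel feature space, maximum-margin hyperplane
clf.fit(X, y)

new_genes = rng.random((5, 79))
print(clf.predict(new_genes))        # step 4: classify new genes as members / non-members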

Page 53: Pictorial Demonstration

Support vector machines (continued)

- Using the class definitions made by the MIPS yeast genome database, SVMs were trained to recognize six functional classes: tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones, and helix-turn-helix proteins.

(Brown et al., 2000 PNAS 97(1), 262-267)

Page 54: Pictorial Demonstration

Support vector machines (continued)

Examples of predicted functional classifications for previously unannotated genes by the SVMs

Class Gene Locus Comments

TCA YHR188C Conserved in worm, Schizosaccharomyces pombe, human

YKL039W PTM1 Major transport facilitator family; likely integral membrane protein.

Resp YKR016W Not highly conserved, possible homolog in S. pombe

YKR046C No convincing homologs

Ribo YKL056C Homolog of translationally controlled tumor protein, abundant, fingers

YNL053W MSG5 Protein-tyrosine phosphatase, bypasses growth arrest by mating factor

Prot YDR330W Ubiquitin regulatory domain protein, S. pombe homolog

YJL036W Member of sorting nexin family

YDL053C No convincing homologs

YLR387C Three C2H2 zinc fingers, similar YBR267W not coregulated

Page 55: Pictorial Demonstration

Automatic discovery of regulatory patterns in promoter region

All 6269 ORFs: 200 bp upstream and downstream.

5097 ORFs: 500 bp upstream.

From SGD

DNA chip: 91 data sets. These data sets consist of the 500 bp upstream regions and the red-green ratios.

(Jensen and Knudsen, 2000, Bioinformatics 16:326-333)

Page 56: Pictorial Demonstration

Automatic discovery of regulatory patterns in promoter region (continued)

- Sequence patterns correlated to whole-cell expression data were found by Kolmogorov-Smirnov tests.

- Regulatory elements were identified by systematic calculations of the significance of correlation between words found in functional annotation of genes and DNA words occurring in their promoter regions.

Page 57: Pictorial Demonstration

Bayesian networks analysis

- Graph-based model of joint multivariate probability distributions
- The model can capture properties of conditional independence between variables
- Can describe complex stochastic processes
- Provides clear methodologies for learning from (noisy) observations

(Friedman et al. 2000 J. Comp. Biol., 7:601-620)
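The joint distribution a Bayesian network encodes factorizes over the graph (standard form, not shown in the transcript):

P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

where Pa(X_i) denotes the parents of X_i in the directed acyclic graph.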

Page 58: Pictorial Demonstration

Bayesian networks analysis (continued)

Page 59: Pictorial Demonstration

Bayesian networks analysis (continued)

- 76 gene expression measurements of 6177 yeast ORFs
- 800 genes whose expression varied over cell-cycle stages were selected
- Learned networks whose variables were the expression levels of each of these 800 genes

Page 60: Pictorial Demonstration

Movie

http://www.dkfz-heidelberg.de/abt0840/whuber/mamovie.html

Page 61: Pictorial Demonstration

Part 9

Concluding Remarks

Page 62: Pictorial Demonstration

Future directions

• Algorithms optimized for small samples (the no. of samples will remain small for many tasks)

• Integration with other data
  – biological networks
  – medical text
  – protein data

• Cost-sensitive classification algorithms
  – error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.

Page 63: Pictorial Demonstration

Summary

• Microarray Data Analysis -- a revolution in life sciences!

• Beware of false positives

• Principled methodology can produce good results