learn from chips: microarray data analysis and clustering cs 374 yu bai nov. 16, 2004
TRANSCRIPT
Learn from chips: Microarray data analysis and clustering
CS 374
Yu Bai, Nov. 16, 2004
Outline
• Background & motivation
• Algorithms overview
• Fuzzy k-means clustering (1st paper)
• Independent component analysis (2nd paper)
CHIP-ing away at medical questions
• Why does cancer occur?
• Molecular-level understanding
• Diagnosis
• Treatment
• Drug design
• Snapshot of gene expression
“(DNA) Microarray”: spot your genes

[Figure: known gene sequences are spotted onto a glass slide (the chip); RNA is isolated from cancer cells and normal cells and labeled with the Cy3 and Cy5 dyes]
Matrix of expression

[Figure: the expression matrix, with genes 1…N as rows and experiments Exp 1–3 as columns; each entry is the expression level of one gene in one experiment]
Why care about “clustering”?

[Figure: rows of the expression matrix (E1 E2 E3) reordered so that genes with similar profiles are grouped together]

• Discover functional relations: similar expression suggests functionally related genes
• Assign function to unknown genes
• Find which genes control which other genes
A review: microarray data analysis
• Supervised (classification)
• Unsupervised (clustering)
  - “Heuristic” methods: hierarchical clustering, k-means clustering, self-organizing maps, others
  - Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others
Heuristic methods: distance metrics

1. Euclidean distance: D(X,Y) = sqrt[(x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2]

2. (Pearson) correlation coefficient: R(X,Y) = (1/n)·∑i [(xi-E(x))/σx]·[(yi-E(y))/σy],
   where σx = sqrt(E(x^2) - E(x)^2) and E(x) is the expected value of x.
   R = 1 if x = y; R = 0 if E(xy) = E(x)·E(y).

3. Other choices of distance…
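The two metrics above can be written out directly; a minimal pure-Python sketch (the function names are my own):

```python
import math

def euclidean(x, y):
    # D(X,Y) = sqrt[(x1-y1)^2 + ... + (xn-yn)^2]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # R(X,Y) = (1/n) * sum_i [(xi - E(x))/sigma_x] * [(yi - E(y))/sigma_y]
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum(a * a for a in x) / n - ex ** 2)
    sy = math.sqrt(sum(b * b for b in y) / n - ey ** 2)
    return sum((a - ex) * (b - ey) for a, b in zip(x, y)) / (n * sx * sy)
```

Pearson correlation is 1 for identical (or merely rescaled) profiles and 0 in expectation for uncorrelated ones, which is why it is often preferred when only the shape of the expression profile matters.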
Hierarchical clustering

[Figure: genes of the expression matrix (E1 E2 E3) joined step by step into a tree]

• Easy
• Result depends on where the grouping starts
• The “tree” structure is hard to interpret
K-means clustering

• How many clusters (k)?
• How to initialize?
• Local minima
• Overall optimization

Generally, heuristic methods have no established means to determine the “correct” number of clusters or to choose the “best” algorithm.
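To make the choice-of-k and initialization issues concrete, here is a minimal NumPy k-means sketch (my own illustration, not code from the lecture): k and the starting centers must be supplied up front, and different seeds can land in different local minima.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest center,
    then recompute each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distances from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```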
Probability-based methods: principal component analysis (PCA)

• Pearson 1901; Everitt 1992; Basilevsky 1994
• Common use: reduce dimension & filter noise
• Goal: find (linear) “uncorrelated” components that account for as much of the variance of the initial variables as possible. “Uncorrelated”: E[xy] = E[x]·E[y] for x ≠ y

PCA algorithm

• “Column-centered” matrix: A
• Covariance matrix: A^T·A
• Eigenvalue decomposition: A^T·A = U·Λ·U^T, where U holds the eigenvectors (principal components) and Λ the eigenvalues
• Digest the principal components
• Gaussian assumption
[Figure: the expression matrix X (genes by experiments 1–n) decomposes into eigenarrays U and eigenvalues Λ]
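The eigendecomposition above is a few lines of NumPy; a sketch following the slide's conventions (column-centered A, covariance A^T·A):

```python
import numpy as np

def pca(X):
    """PCA: eigendecomposition of the covariance matrix of the
    column-centered data. Returns eigenvalues (variance per component,
    largest first) and eigenvectors (the principal components)."""
    A = X - X.mean(axis=0)                 # "column-centered" matrix
    cov = A.T @ A / len(A)                 # covariance matrix
    evals, evecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(evals)[::-1]        # sort by decreasing variance
    return evals[order], evecs[:, order]
```

The eigenvalues sum to the total variance of the data, so truncating to the top few components is what "reduce dimension & filter noise" means in practice.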
Are biologists satisfied?

• Biological processes are non-Gaussian
• “Faithful” vs. “meaningful”

[Figure: the expression level of each gene (genes 1–5, …) sums contributions from biological regulators such as ribosome biogenesis and the energy pathway; the resulting distribution follows a super-Gaussian model]

This is equivalent to “source separation”:

[Figure: two sources combine into an observed mixture; the task is to recover the sources]
Independent vs. uncorrelated

Independent: E[g(x)·f(y)] = E[g(x)]·E[f(y)] for x ≠ y and any functions g, f.
The fact that the sources are independent is stronger than their being uncorrelated.
[Figure: two sources x1, x2 and the two mixtures y1 = 2·x1 + 3·x2, y2 = 4·x1 + x2; the scatter plot of (y1, y2) shows that the independent components differ from the principal components]
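The slide's two mixtures are easy to check numerically: the independent sources stay (empirically) uncorrelated, while the mixtures y1 = 2·x1 + 3·x2 and y2 = 4·x1 + x2 are strongly correlated. A small NumPy sketch (uniform sources are my own choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 10000)      # independent source 1
x2 = rng.uniform(-1, 1, 10000)      # independent source 2

y1 = 2 * x1 + 3 * x2                # mixture 1
y2 = 4 * x1 + x2                    # mixture 2

r_sources = np.corrcoef(x1, x2)[0, 1]   # close to 0
r_mixtures = np.corrcoef(y1, y2)[0, 1]  # close to 11/sqrt(13*17)
```

For unit-variance sources, Cov(y1, y2) = 2·4 + 3·1 = 11 while Var(y1) = 13 and Var(y2) = 17, so the mixture correlation is about 0.73.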
Independent component analysis (ICA)

x(t) = A·s(t):

  [x1(t)]   [a11 … a1n] [s1(t)]
  [  ⋮  ] = [ ⋮      ⋮ ] [  ⋮  ]
  [xm(t)]   [am1 … amn] [sn(t)]

Simplified notation: x = A·s

  xi(t) = ai1·s1(t) + ai2·s2(t) + … + ain·sn(t)

Find the “unmixing” matrix (the inverse of A) which makes s1, …, sn as independent as possible.
(Linear) ICA algorithm

• “Likelihood function” = log(probability of observation)

  y = W·x, i.e. yi = ∑j wij·xj

  p(x) = |det W|·p(y),  p(y) = Πi pi(yi)

  L(y, W) = log p(x) = log|det W| + ∑i log pi(yi)

• Gradient ascent on L:

  ΔW ∝ ∂L(y, W)/∂W = (W^T)^-1 + φ(y)·x^T,

  where φ(y) = (p1′(y1)/p1(y1), …, pn′(yn)/pn(yn))^T.
(Linear) ICA algorithm

• Find the W that maximizes L(y, W)
• Super-Gaussian model for the source densities pi
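A runnable NumPy sketch of ICA. Instead of the raw gradient ascent above, this uses the closely related FastICA fixed-point iteration with a tanh nonlinearity (a standard choice; the whole implementation is my own illustration, not the lecture's code):

```python
import numpy as np

def ica(X, n_iter=200, seed=0):
    """Symmetric FastICA sketch: center, whiten, then iterate the
    fixed-point update w <- E[z*g(w^T z)] - E[g'(w^T z)]*w with g = tanh,
    re-orthonormalizing W each step. Rows of X are the observed mixtures."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    Z = E @ np.diag(d ** -0.5) @ E.T @ X          # whitened: E[zz^T] = I
    W = rng.normal(size=(X.shape[0], X.shape[0]))
    for _ in range(n_iter):
        Y = np.tanh(W @ Z)
        W = (Y @ Z.T) / Z.shape[1] - np.diag((1 - Y ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W)               # symmetric decorrelation:
        W = U @ Vt                                # W <- (W W^T)^(-1/2) W
    return W @ Z                                   # estimated sources
```

On the slide's kind of problem (two non-Gaussian sources mixed linearly), this recovers the sources up to permutation, sign, and scale.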
First paper:
Gasch et al. (2002) Genome Biology, 3, 1–59
Improve the detection of conditional coregulation in gene expression by fuzzy k-means clustering
Biology is “fuzzy”

• Many genes are conditionally co-regulated
• k-means clustering vs. fuzzy k-means:
  Xi: expression of the ith gene; Vj: the jth cluster center
FuzzyK flowchart

• 1st cycle: initialize the Vj with PCA eigenvectors
• Update the centers: Vj′ = ∑i m²XiVj·WXi·Xi / ∑i m²XiVj·WXi, where the weight WXi evaluates the correlation of Xi with the other genes
• 2nd cycle: remove correlated genes (r > 0.7) and re-cluster
• 3rd cycle: repeat
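A minimal fuzzy k-means (fuzzy c-means) sketch in NumPy, without the paper's PCA initialization or the gene weights WXi; the fuzziness exponent m is the standard one (m → 1 recovers hard k-means):

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0):
    """Fuzzy k-means: each point gets a graded membership u[i, j] in every
    cluster j; centers are membership-weighted means of all points."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # u[i, j] = 1 / sum_l (d[i, j] / d[i, l])^(2/(m-1))
        u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        w = u ** m
        # V[j] = sum_i w[i, j] * X[i] / sum_i w[i, j]
        V = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, V
```

Because memberships are graded rather than all-or-nothing, a conditionally co-regulated gene can belong strongly to more than one cluster, which is the point of the paper.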
FuzzyK performance

• k is more “definitive”
• Uncovers new gene clusters (cell-wall and secretion factors)
• Reveals new promoter sequences
• Recovers the clusters found by classical methods
Second paper: ICA is so new…
Lee et al. (2003) Genome Biology, 4, R76
Systematic evaluation of ICA against other clustering methods (PCA, k-means)
From linear to non-linear

Linear ICA: X = A·S, where X is the expression matrix (N conditions × K genes), si is an independent vector of K gene levels, and xj = ∑i aji·si. Or:

Non-linear ICA: X = f(A·S)
How to do non-linear ICA?

• Construct a feature space F
• Map X to Ψ in F
• Run ICA on Ψ

Input space R^n, feature space R^L; normally L > N.

Kernel function: k(xi, xj) = Φ(xi)·Φ(xj), where xi, xj in R^n (columns of X) are mapped to Φ(xi), Φ(xj) in the feature space.
Kernel trick

Constructing F = constructing a basis ΦV = {Φ(v1), Φ(v2), …, Φ(vL)} of F, i.e. rank(ΦV^T·ΦV) = L.

  ΦV^T·ΦV = [ k(v1,v1) … k(v1,vL) ]
            [    ⋮           ⋮    ]
            [ k(vL,v1) … k(vL,vL) ]

Choose the vectors {v1 … vL} from the columns {xi}.

Mapped points in F:

  Ψ[xi] = (ΦV^T·ΦV)^(-1/2) · [ k(v1,xi) ]
                             [    ⋮     ]
                             [ k(vL,xi) ]
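The construction can be checked numerically: with Ψ[x] = (ΦV^T·ΦV)^(-1/2)·[k(v1,x), …, k(vL,x)]^T, dot products of mapped basis points reproduce kernel values, Ψ[vi]·Ψ[vj] = k(vi,vj). A NumPy sketch with a Gaussian RBF kernel (my own choice of kernel for illustration):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def feature_map(points, V, gamma=0.5):
    """Map each point x to Psi[x] = K^(-1/2) [k(v1,x), ..., k(vL,x)]^T,
    where K is the kernel matrix of the chosen basis vectors V."""
    K = np.array([[rbf(vi, vj, gamma) for vj in V] for vi in V])
    d, E = np.linalg.eigh(K)
    K_inv_sqrt = E @ np.diag(d ** -0.5) @ E.T
    kx = np.array([[rbf(v, x, gamma) for x in points] for v in V])
    return K_inv_sqrt @ kx      # column i is Psi[points[i]]
```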
ICA-based clustering

• Independent components yi = (yi1, yi2, …, yiK), i = 1, …, M
• “Load”: the jth entry of yi is the load of the jth gene
• Two clusters per component:
  Clusteri,1 = {gene j | yij is among the (C% × K) largest loads in yi}
  Clusteri,2 = {gene j | yij is among the (C% × K) smallest loads in yi}
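The two clusters per component reduce to a single argsort; a small sketch (with C given as a fraction rather than a percentage):

```python
import numpy as np

def clusters_from_component(y, c=0.1):
    """Split one independent component y (the loads of K genes) into the
    genes with the c*K largest loads and the genes with the c*K smallest."""
    k = max(1, int(round(c * len(y))))
    order = np.argsort(y)                    # indices, ascending by load
    return set(order[-k:].tolist()), set(order[:k].tolist())
```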
Evaluate biological significance

[Figure: clusters 1…n derived from the independent components are matched against functional (GO) classes 1…m]

For each pair (cluster i, GO class j), calculate the p-value: the probability that they share that many genes by chance.
Evaluate biological significance
[Figure: Venn diagram of the g genes on the microarray, a functional class of f genes, and a cluster of n genes, sharing k genes]

“P-value”: p = 1 − ∑i=0…k−1 C(f, i)·C(g−f, n−i) / C(g, n)
(the probability of sharing at least k genes by chance)

True positive rate = k/n
Sensitivity = k/f
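The tail probability above is a direct sum of hypergeometric terms; a stdlib-only sketch (g genes total, f in the functional class, n in the cluster, k shared):

```python
from math import comb

def enrichment_p(g, f, n, k):
    """p = 1 - sum_{i=0}^{k-1} C(f, i) * C(g-f, n-i) / C(g, n):
    the probability that a random n-gene cluster shares at least
    k genes with an f-gene functional class, out of g genes total."""
    return 1.0 - sum(comb(f, i) * comb(g - f, n - i) for i in range(k)) / comb(g, n)
```

For example, with g = 10, f = 4, n = 5, and k = 2 shared genes, p = 1 − (C(4,0)·C(6,5) + C(4,1)·C(6,4)) / C(10,5) = 186/252.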
Who is better?

Conclusion: ICA-based clustering is generally better.
References
• Su-In Lee (2002), group talk: “Microarray data analysis using ICA”
• Altman et al. (2001), “Whole-genome expression analysis: challenges beyond clustering”, Curr. Opin. Struct. Biol., 11, 340
• Hyvärinen et al. (1999), “Survey on independent component analysis”, Neural Computing Surveys, 2, 94–128
• Alter et al. (2000), “Singular value decomposition for genome-wide expression data processing and modeling”, PNAS, 97, 10101
• Harmeling et al., “Kernel feature spaces & nonlinear blind source separation”