Learn from chips: Microarray data analysis and clustering
CS 374
Yu Bai
Nov. 16, 2004


Page 1: Learn from chips: Microarray data analysis and clustering CS 374 Yu Bai Nov. 16, 2004

Learn from chips: Microarray data analysis and clustering

CS 374
Yu Bai
Nov. 16, 2004

Page 2:

Outline

• Background & motivation

• Algorithms overview

• Fuzzy k-means clustering (1st paper)

• Independent component analysis (2nd paper)

Page 3:

CHIP-ing away at medical questions

• Why does cancer occur?

• Molecular-level understanding

• Diagnosis

• Treatment

• Drug design

• Snapshot of gene expression

“(DNA) Microarray”

Page 4:

Spot your genes

Known gene sequences

Glass slide (chip)

Cancer cell

Normal cell

Isolate RNA

Cy3 dye

Cy5 dye

Page 5:

Matrix of expression

[Table: genes (Gene 1 … Gene N) × experiments (Exp 1, Exp 2, Exp 3); each entry Eij is the expression level of gene i in experiment j]

Page 6:

Why care about “clustering”?

[Figure: the genes (Gene 1 … Gene N) × experiments (E1–E3) matrix, before and after the rows are reordered so that similarly expressed genes are adjacent]

• Discover functional relations: similar expression → functionally related

• Assign function to unknown genes

• Find which genes control which other genes

Page 7:

A review: microarray data analysis

• Supervised (Classification)

• Un-supervised (Clustering)

• “Heuristic” methods:
  - Hierarchical clustering
  - k-means clustering
  - Self-organizing maps
  - Others

• Probability-based methods:
  - Principal component analysis (PCA)
  - Independent component analysis (ICA)
  - Others

Page 8:

Heuristic methods: distance metrics

1. Euclidean distance:
   D(X,Y) = sqrt[(x1−y1)² + (x2−y2)² + … + (xn−yn)²]

2. (Pearson) correlation coefficient:
   R(X,Y) = (1/n) · Σi [(xi − E(x))/σx] · [(yi − E(y))/σy]
   σx = sqrt(E(x²) − E(x)²);  E(x) = expected value of x
   R = 1 if x = y;  R = 0 if E(xy) = E(x)·E(y)

3. Other choices of distance…
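As a concrete illustration (not from the slides), the two distance metrics can be written directly from the formulas; the function names are mine:

```python
import math

def euclidean(x, y):
    # D(X,Y) = sqrt[(x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # R(X,Y) = (1/n) * sum[(xi - E(x))/sigma_x * (yi - E(y))/sigma_y]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
```

Note that Pearson correlation is scale-invariant (R = 1 for y = 2x), whereas Euclidean distance is not; the choice of metric therefore changes which expression profiles count as “similar”.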

Page 9:

Hierarchical clustering

• Easy

• Depends on where to start the grouping

• The “tree” structure can be hard to interpret

Page 10:

K-means clustering

• How many clusters (k)?

• How to initialize?

• Local minima

• Overall optimization

Generally, heuristic methods have no established means to determine the “correct” number of clusters or to choose the “best” algorithm.
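A minimal Lloyd's-algorithm sketch of k-means (illustrative only; the slides do not prescribe an implementation, and the random initialization below is exactly the “how to initialize” / “local minima” weakness the slide points out):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: assign each point to its nearest center,
    # then move each center to the mean of its cluster.
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # random initialization (a known weakness)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers, clusters
```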

Page 11:

Probability-based methods: Principal component analysis (PCA)

• Pearson 1901; Everitt 1992; Basilevsky 1994

• Common use: reduce dimension & filter noise

• Goal: find “uncorrelated” (linear) components that account for as much of the variance of the initial variables as possible
  “Uncorrelated”: E[xy] = E[x]·E[y] for x ≠ y

Page 12:

PCA algorithm

• “Column-centered” matrix: A

• Covariance matrix: AᵀA

• Eigenvalue decomposition: AᵀA = U Λ Uᵀ
  U: eigenvectors (principal components)
  Λ: eigenvalues

• Digest principal components

• Gaussian assumption

[Figure: the genes × experiments (Exp 1–n) matrix decomposed into eigenarrays via U and Λ]
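The steps above can be sketched with NumPy (a minimal illustration; `pca` and its interface are my naming, not the slides'):

```python
import numpy as np

def pca(X, n_components=2):
    # "Column-center" the matrix, form the covariance A^T A,
    # then eigendecompose: A^T A = U Lambda U^T.
    A = X - X.mean(axis=0)
    cov = A.T @ A
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # largest variance first
    U = eigvecs[:, order[:n_components]]
    # Project the centered data onto the leading principal components.
    return A @ U, eigvals[order[:n_components]]
```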

Page 13:

Are biologists satisfied?

• Biological processes are non-Gaussian

• “Faithful” vs. “meaningful”

[Figure: expression levels of Gene 1–Gene 5 decomposed into biological regulators (ribosome biogenesis, energy pathway) under a super-Gaussian model]

Page 14:

Equivalent to “source separation”

[Figure: two sources (Source 1, Source 2) combined into an observed mixture (Mixture 1); the task is to recover the sources]

Page 15:

Independent vs. uncorrelated

Independent: E[g(x)·f(y)] = E[g(x)]·E[f(y)] for x ≠ y and any functions g, f

The fact that the sources are independent is stronger than their being uncorrelated.

Example: sources x1, x2 and two mixtures:
y1 = 2·x1 + 3·x2
y2 = 4·x1 + x2

[Figure: scatter plots of (y1, y2), with the independent components vs. the principal components shown as recovered axes]

Page 16:

Independent component analysis (ICA)

x(t) = A s(t), i.e.

[x1(t)]   [a11 … a1n] [s1(t)]
[  :  ] = [ :      : ] [  :  ]
[xm(t)]   [am1 … amn] [sn(t)]

Simplified notation: x = As

xi(t) = ai1 s1(t) + ai2 s2(t) + … + ain sn(t)

Find the “unmixing” matrix W (≈ A⁻¹) which makes the recovered s1, …, sn as independent as possible

Page 17:

(Linear) ICA algorithm

• “Likelihood function” = log(probability of observation)

Y = WX, i.e.

[y1]   [w11 … w1m] [x1]
[ :] = [ :      : ] [ :]
[yn]   [wn1 … wnm] [xm]

p(x) = |det W| p(y),  p(y) = Π pi(yi)

L(y,W) = log p(x) = log|det W| + Σ log pi(yi)

Page 18:

(Linear) ICA algorithm

Find W that maximizes L(y,W): gradient ascent

W ← W + η·ΔW,  ΔW ∝ ∂L(y,W)/∂W

φ(y) = (φ1(y1), …, φn(yn)),  φi(yi) = −pi′(yi)/pi(yi)

ΔW = (Wᵀ)⁻¹ − φ(y)·xᵀ

Super-Gaussian model
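A minimal sketch of this algorithm, under two labeled assumptions: the prior is the common super-Gaussian choice pi(y) ∝ 1/cosh(y), so φ(y) = tanh(y), and the update uses the standard natural-gradient form ΔW = (I − φ(y)yᵀ)W, which avoids inverting Wᵀ (the slides give the plain gradient):

```python
import numpy as np

def ica_natural_gradient(X, lr=0.01, iters=2000, seed=0):
    # Maximize L(y, W) = log|det W| + sum_i log p_i(y_i) by gradient ascent.
    # Assumed super-Gaussian prior p_i(y) ~ 1/cosh(y)  =>  phi(y) = tanh(y).
    # Natural-gradient update: W <- W + lr * (I - phi(Y) Y^T / T) W.
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
    for _ in range(iters):
        Y = W @ X
        W += lr * (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
    return W
```

Recovered components come back in arbitrary order and scale, which is the usual ICA ambiguity.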

Page 19:

First paper:

Gasch, et al. (2002) Genome Biology, 3, 1-59

Improve the detection of conditional coregulation in gene expression by fuzzy k-means clustering

Page 20:

Biology is “fuzzy”

• Many genes are conditionally co-regulated

• k-means clustering vs. fuzzy k-means:

Xi: expression of the ith gene;  Vj: jth cluster center

Page 21:

FuzzyK flowchart

Initial Vj = PCA eigenvectors → 1st cycle → remove correlated genes (> 0.7) → 2nd cycle → 3rd cycle

Centroid update:
Vj′ = Σi m²XiVj · WXi · Xi / Σi m²XiVj · WXi

The weight WXi evaluates the correlation of Xi with the other genes; mXiVj is the membership of gene Xi in cluster Vj.
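The weighted centroid update can be sketched as follows. This is an illustration of the update formula only: the membership rule used here (a softmax of negative squared distance) is a hypothetical stand-in, since Gasch & Eisen derive memberships from Pearson correlation:

```python
import numpy as np

def fuzzy_centroid_update(X, V, w):
    # V_j' = sum_i m_ij^2 * w_i * x_i  /  sum_i m_ij^2 * w_i
    # X: (genes, conditions), V: (clusters, conditions), w: per-gene weights.
    # Hypothetical membership m_ij: softmax over negative squared distance.
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # (genes, clusters)
    m = np.exp(-d2)
    m /= m.sum(axis=1, keepdims=True)
    num = (m ** 2 * w[:, None]).T @ X                      # (clusters, conditions)
    den = (m ** 2 * w[:, None]).sum(axis=0)[:, None]
    return num / den
```

Because memberships are fractional, a conditionally co-regulated gene can contribute to several centroids at once, which is the point of the fuzzy variant.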

Page 22:

FuzzyK performance

• k is more “definitive”

• Uncover new gene clusters

Cell wall and secretion factors

• Reveal new promoter sequences

• Recover the clusters found by classical methods

Page 23:

Second paper: ICA is so new…

Lee, et al. (2003) Genome Biology, 4, R76

Systematic evaluation of ICA against other clustering methods (PCA, k-means)

Page 24:

From linear to non-linear

Linear ICA: X = AS
X: expression matrix (N conditions × K genes)
si: independent vector of K gene levels
xj = Σi aji si

Or non-linear ICA: X = f(AS)

Page 25:

How to do non-linear ICA?

• Construct feature space F

• Map X to Ψ in F

• ICA of Ψ

Input space ℝᴺ → feature space ℝᴸ (normally L > N)

Page 26:

Kernel trick

Kernel function: k(xi, xj) = Φ(xi)·Φ(xj), where xi, xj ∈ ℝᴺ (columns of X) are mapped to Φ(xi), Φ(xj) in feature space

Construct F = construct a basis ΦV = {Φ(v1), Φ(v2), … Φ(vL)} of F, i.e. rank(ΦVᵀΦV) = L; choose the vectors {v1 … vL} from the {xi}

          [k(v1,v1) … k(v1,vL)]
ΦVᵀΦV  =  [   :            :  ]
          [k(vL,v1) … k(vL,vL)]

Mapped points in F:

          [k(v1,v1) … k(v1,vL)]^(−1/2)  [k(v1,xi)]
Ψ[xi]  =  [   :            :  ]         [   :    ]
          [k(vL,v1) … k(vL,vL)]         [k(vL,xi)]
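The mapping can be sketched as below (a minimal illustration assuming the exponent is −1/2, i.e. the basis is orthonormalized; `feature_map` and the RBF kernel are my choices, not the slides'):

```python
import numpy as np

def feature_map(X, V, kernel):
    # K_VV[a, b] = k(v_a, v_b): Gram matrix of the chosen basis points.
    K_VV = np.array([[kernel(a, b) for b in V] for a in V])
    # Inverse square root via eigendecomposition (K_VV is symmetric
    # positive definite for distinct points and an RBF kernel).
    lam, U = np.linalg.eigh(K_VV)
    K_inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T
    # Psi[x_i] = K_VV^(-1/2) [k(v_1, x_i), ..., k(v_L, x_i)]^T
    K_VX = np.array([[kernel(v, x) for x in X] for v in V])
    return K_inv_sqrt @ K_VX   # column i is Psi[x_i]

# illustrative RBF kernel (the slides do not fix a kernel)
rbf = lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2))
```

Only kernel evaluations are needed: the feature space is never constructed explicitly, and inner products of mapped points reproduce the kernel on the span of the basis.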

Page 27:

ICA-based clustering

• Independent components yi = (yi1, yi2, … yiK), i = 1, … M

• “Load”: the jth entry of yi is the load of the jth gene

• Two clusters per component:

Clusteri,1 = {gene j | yij is among the (C% × K) largest loads in yi}

Clusteri,2 = {gene j | yij is among the (C% × K) smallest loads in yi}

Page 28:

Evaluate biological significance

[Figure: bipartite matching between the clusters derived from independent components (Cluster 1 … Cluster n) and the functional classes (GO 1 … GO m); each pair (Cluster i, GO j) is scored]

Calculate the p-value for each pair: the probability that they share that many genes by chance

Page 29:

Evaluate biological significance

g genes on the microarray; f genes in the functional class; n genes in the cluster; k genes shared by both

“P-value”:  p = 1 − Σ(i=0…k−1) C(f,i)·C(g−f, n−i) / C(g,n)

where C(f,i)·C(g−f, n−i) / C(g,n) is the probability of sharing exactly i genes

True positive rate = k/n;  Sensitivity = k/f
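This hypergeometric tail can be computed directly with `math.comb` (a sketch under the reading above that g, f, n, k are the total, class, cluster, and shared gene counts):

```python
from math import comb

def enrichment_p(g, f, n, k):
    # p = 1 - sum_{i=0}^{k-1} C(f, i) * C(g - f, n - i) / C(g, n)
    # g: genes on the chip, f: genes in the functional class,
    # n: genes in the cluster, k: genes shared by class and cluster.
    total = comb(g, n)
    return 1 - sum(comb(f, i) * comb(g - f, n - i) for i in range(k)) / total
```

A small p-value means the overlap between a cluster and a functional class is unlikely under random gene assignment.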

Page 30:

Who is better?

Conclusion: ICA-based clustering is generally better

Page 31:

References

• Lee, S.-I. (2002) group talk: “Microarray data analysis using ICA”

• Altman, et al. (2001) “Whole-genome expression analysis: challenges beyond clustering”, Curr. Opin. Struct. Biol. 11, 340

• Hyvärinen, et al. (1999) “Survey on independent component analysis”, Neural Comput. Surv. 2, 94-128

• Alter, et al. (2000) “Singular value decomposition for genome-wide expression data processing and modeling”, PNAS 97, 10101

• Harmeling, et al. “Kernel feature spaces and nonlinear blind source separation”