learn from chips: microarray data analysis and clustering cs 374 yu bai nov. 16, 2004
TRANSCRIPT
Learn from chips: Microarray data analysis and clustering
CS 374
Yu Bai, Nov. 16, 2004
Outline
• Background & motivation
• Algorithms overview
• Fuzzy k-means clustering (1st paper)
• Independent component analysis (2nd paper)
CHIP-ing away at medical questions
• Why does cancer occur?
• Molecular-level understanding
• Diagnosis
• Treatment
• Drug design
• Snapshot of gene expression
“(DNA) Microarray”: spot your genes

[Figure: known gene sequences are spotted onto a glass slide (the chip); RNA is isolated from cancer cells and normal cells and labeled with the Cy3 and Cy5 dyes]
Matrix of expression

[Figure: the expression matrix, with genes 1…N as rows and experiments Exp 1–3 as columns; each entry is the expression level of one gene in one experiment]
Why care about “clustering”?

[Figure: rows of the expression matrix (E1 E2 E3) reordered so that genes with similar profiles are grouped together]

• Discover functional relations: similar expression suggests functionally related genes
• Assign function to unknown genes
• Find which genes control which other genes
A review: microarray data analysis
• Supervised (classification)
• Unsupervised (clustering)
  - “Heuristic” methods: hierarchical clustering, k-means clustering, self-organizing maps, others
  - Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others
Heuristic methods: distance metrics

1. Euclidean distance: D(X,Y) = sqrt[(x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2]

2. (Pearson) correlation coefficient: R(X,Y) = (1/n)·∑i [(xi-E(x))/σx]·[(yi-E(y))/σy],
   where σx = sqrt(E(x^2) - E(x)^2) and E(x) is the expected value of x.
   R = 1 if x = y; R = 0 if E(xy) = E(x)·E(y).

3. Other choices of distance…
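The two metrics above can be written out directly; a minimal pure-Python sketch (the function names are my own):

```python
import math

def euclidean(x, y):
    # D(X,Y) = sqrt[(x1-y1)^2 + ... + (xn-yn)^2]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # R(X,Y) = (1/n) * sum_i [(xi - E(x))/sigma_x] * [(yi - E(y))/sigma_y]
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum(a * a for a in x) / n - ex ** 2)
    sy = math.sqrt(sum(b * b for b in y) / n - ey ** 2)
    return sum((a - ex) * (b - ey) for a, b in zip(x, y)) / (n * sx * sy)
```

Pearson correlation is 1 for identical (or merely rescaled) profiles and 0 in expectation for uncorrelated ones, which is why it is often preferred when only the shape of the expression profile matters.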
Hierarchical clustering

[Figure: genes of the expression matrix (E1 E2 E3) joined step by step into a tree]

• Easy
• Result depends on where the grouping starts
• The “tree” structure is hard to interpret
K-means clustering

• How many clusters (k)?
• How to initialize?
• Local minima
• Overall optimization

Generally, heuristic methods have no established means to determine the “correct” number of clusters or to choose the “best” algorithm.
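To make the choice-of-k and initialization issues concrete, here is a minimal NumPy k-means sketch (my own illustration, not code from the lecture): k and the starting centers must be supplied up front, and different seeds can land in different local minima.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest center,
    then recompute each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distances from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```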
Probability-based methods: principal component analysis (PCA)

• Pearson 1901; Everitt 1992; Basilevsky 1994
• Common use: reduce dimension & filter noise
• Goal: find (linear) “uncorrelated” components that account for as much of the variance of the initial variables as possible. “Uncorrelated”: E[xy] = E[x]·E[y] for x ≠ y

PCA algorithm

• “Column-centered” matrix: A
• Covariance matrix: A^T·A
• Eigenvalue decomposition: A^T·A = U·Λ·U^T, where U holds the eigenvectors (principal components) and Λ the eigenvalues
• Digest the principal components
• Gaussian assumption
[Figure: the expression matrix X (genes by experiments 1–n) decomposes into eigenarrays U and eigenvalues Λ]
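The eigendecomposition above is a few lines of NumPy; a sketch following the slide's conventions (column-centered A, covariance A^T·A):

```python
import numpy as np

def pca(X):
    """PCA: eigendecomposition of the covariance matrix of the
    column-centered data. Returns eigenvalues (variance per component,
    largest first) and eigenvectors (the principal components)."""
    A = X - X.mean(axis=0)                 # "column-centered" matrix
    cov = A.T @ A / len(A)                 # covariance matrix
    evals, evecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(evals)[::-1]        # sort by decreasing variance
    return evals[order], evecs[:, order]
```

The eigenvalues sum to the total variance of the data, so truncating to the top few components is what "reduce dimension & filter noise" means in practice.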
Are biologists satisfied?

• Biological processes are non-Gaussian
• “Faithful” vs. “meaningful”

[Figure: the expression level of each gene (genes 1–5, …) sums contributions from biological regulators such as ribosome biogenesis and the energy pathway; the resulting distribution follows a super-Gaussian model]

This is equivalent to “source separation”:

[Figure: two sources combine into an observed mixture; the task is to recover the sources]
Independent vs. uncorrelated

Independent: E[g(x)·f(y)] = E[g(x)]·E[f(y)] for x ≠ y and any functions g, f.
The fact that the sources are independent is stronger than their being uncorrelated.
[Figure: two sources x1, x2 and the two mixtures y1 = 2·x1 + 3·x2, y2 = 4·x1 + x2; the scatter plot of (y1, y2) shows that the independent components differ from the principal components]
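The slide's two mixtures are easy to check numerically: the independent sources stay (empirically) uncorrelated, while the mixtures y1 = 2·x1 + 3·x2 and y2 = 4·x1 + x2 are strongly correlated. A small NumPy sketch (uniform sources are my own choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 10000)      # independent source 1
x2 = rng.uniform(-1, 1, 10000)      # independent source 2

y1 = 2 * x1 + 3 * x2                # mixture 1
y2 = 4 * x1 + x2                    # mixture 2

r_sources = np.corrcoef(x1, x2)[0, 1]   # close to 0
r_mixtures = np.corrcoef(y1, y2)[0, 1]  # close to 11/sqrt(13*17)
```

For unit-variance sources, Cov(y1, y2) = 2·4 + 3·1 = 11 while Var(y1) = 13 and Var(y2) = 17, so the mixture correlation is about 0.73.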
Independent component analysis (ICA)

x(t) = A·s(t):

  [x1(t)]   [a11 … a1n] [s1(t)]
  [  ⋮  ] = [ ⋮      ⋮ ] [  ⋮  ]
  [xm(t)]   [am1 … amn] [sn(t)]

Simplified notation: x = A·s

  xi(t) = ai1·s1(t) + ai2·s2(t) + … + ain·sn(t)

Find the “unmixing” matrix (the inverse of A) which makes s1, …, sn as independent as possible.
(Linear) ICA algorithm

• “Likelihood function” = log(probability of observation)

  y = W·x, i.e. yi = ∑j wij·xj

  p(x) = |det W|·p(y),  p(y) = Πi pi(yi)

  L(y, W) = log p(x) = log|det W| + ∑i log pi(yi)

• Gradient ascent on L:

  ΔW ∝ ∂L(y, W)/∂W = (W^T)^-1 + φ(y)·x^T,

  where φ(y) = (p1′(y1)/p1(y1), …, pn′(yn)/pn(yn))^T.
(Linear) ICA algorithm

• Find the W that maximizes L(y, W)
• Super-Gaussian model for the source densities pi
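A runnable NumPy sketch of ICA. Instead of the raw gradient ascent above, this uses the closely related FastICA fixed-point iteration with a tanh nonlinearity (a standard choice; the whole implementation is my own illustration, not the lecture's code):

```python
import numpy as np

def ica(X, n_iter=200, seed=0):
    """Symmetric FastICA sketch: center, whiten, then iterate the
    fixed-point update w <- E[z*g(w^T z)] - E[g'(w^T z)]*w with g = tanh,
    re-orthonormalizing W each step. Rows of X are the observed mixtures."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    Z = E @ np.diag(d ** -0.5) @ E.T @ X          # whitened: E[zz^T] = I
    W = rng.normal(size=(X.shape[0], X.shape[0]))
    for _ in range(n_iter):
        Y = np.tanh(W @ Z)
        W = (Y @ Z.T) / Z.shape[1] - np.diag((1 - Y ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W)               # symmetric decorrelation:
        W = U @ Vt                                # W <- (W W^T)^(-1/2) W
    return W @ Z                                   # estimated sources
```

On the slide's kind of problem (two non-Gaussian sources mixed linearly), this recovers the sources up to permutation, sign, and scale.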
First paper:
Gasch et al. (2002) Genome Biology, 3, 1–59
Improve the detection of conditional coregulation in gene expression by fuzzy k-means clustering
Biology is “fuzzy”

• Many genes are conditionally co-regulated
• k-means clustering vs. fuzzy k-means:
  Xi: expression of the ith gene; Vj: the jth cluster center
FuzzyK flowchart

• 1st cycle: initialize the Vj with PCA eigenvectors
• Update the centers: Vj′ = ∑i m²XiVj·WXi·Xi / ∑i m²XiVj·WXi, where the weight WXi evaluates the correlation of Xi with the other genes
• 2nd cycle: remove correlated genes (r > 0.7) and re-cluster
• 3rd cycle: repeat
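A minimal fuzzy k-means (fuzzy c-means) sketch in NumPy, without the paper's PCA initialization or the gene weights WXi; the fuzziness exponent m is the standard one (m → 1 recovers hard k-means):

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0):
    """Fuzzy k-means: each point gets a graded membership u[i, j] in every
    cluster j; centers are membership-weighted means of all points."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # u[i, j] = 1 / sum_l (d[i, j] / d[i, l])^(2/(m-1))
        u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        w = u ** m
        # V[j] = sum_i w[i, j] * X[i] / sum_i w[i, j]
        V = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, V
```

Because memberships are graded rather than all-or-nothing, a conditionally co-regulated gene can belong strongly to more than one cluster, which is the point of the paper.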
FuzzyK performance

• k is more “definitive”
• Uncovers new gene clusters (cell-wall and secretion factors)
• Reveals new promoter sequences
• Recovers the clusters found by classical methods
Second paper: ICA is so new…
Lee et al. (2003) Genome Biology, 4, R76
Systematic evaluation of ICA against other clustering methods (PCA, k-means)
From linear to non-linear

Linear ICA: X = A·S, where X is the expression matrix (N conditions × K genes), si is an independent vector of K gene levels, and xj = ∑i aji·si. Or:

Non-linear ICA: X = f(A·S)
How to do non-linear ICA?

• Construct a feature space F
• Map X to Ψ in F
• Run ICA on Ψ

Input space R^n, feature space R^L; normally L > N.

Kernel function: k(xi, xj) = Φ(xi)·Φ(xj), where xi, xj in R^n (columns of X) are mapped to Φ(xi), Φ(xj) in the feature space.
Kernel trick

Constructing F = constructing a basis ΦV = {Φ(v1), Φ(v2), …, Φ(vL)} of F, i.e. rank(ΦV^T·ΦV) = L.

  ΦV^T·ΦV = [ k(v1,v1) … k(v1,vL) ]
            [    ⋮           ⋮    ]
            [ k(vL,v1) … k(vL,vL) ]

Choose the vectors {v1 … vL} from the columns {xi}.

Mapped points in F:

  Ψ[xi] = (ΦV^T·ΦV)^(-1/2) · [ k(v1,xi) ]
                             [    ⋮     ]
                             [ k(vL,xi) ]
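The construction can be checked numerically: with Ψ[x] = (ΦV^T·ΦV)^(-1/2)·[k(v1,x), …, k(vL,x)]^T, dot products of mapped basis points reproduce kernel values, Ψ[vi]·Ψ[vj] = k(vi,vj). A NumPy sketch with a Gaussian RBF kernel (my own choice of kernel for illustration):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def feature_map(points, V, gamma=0.5):
    """Map each point x to Psi[x] = K^(-1/2) [k(v1,x), ..., k(vL,x)]^T,
    where K is the kernel matrix of the chosen basis vectors V."""
    K = np.array([[rbf(vi, vj, gamma) for vj in V] for vi in V])
    d, E = np.linalg.eigh(K)
    K_inv_sqrt = E @ np.diag(d ** -0.5) @ E.T
    kx = np.array([[rbf(v, x, gamma) for x in points] for v in V])
    return K_inv_sqrt @ kx      # column i is Psi[points[i]]
```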
ICA-based clustering

• Independent components yi = (yi1, yi2, …, yiK), i = 1, …, M
• “Load”: the jth entry of yi is the load of the jth gene
• Two clusters per component:
  Clusteri,1 = {gene j | yij is among the (C% × K) largest loads in yi}
  Clusteri,2 = {gene j | yij is among the (C% × K) smallest loads in yi}
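The two clusters per component reduce to a single argsort; a small sketch (with C given as a fraction rather than a percentage):

```python
import numpy as np

def clusters_from_component(y, c=0.1):
    """Split one independent component y (the loads of K genes) into the
    genes with the c*K largest loads and the genes with the c*K smallest."""
    k = max(1, int(round(c * len(y))))
    order = np.argsort(y)                    # indices, ascending by load
    return set(order[-k:].tolist()), set(order[:k].tolist())
```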
Evaluate biological significance

[Figure: clusters 1…n derived from the independent components are matched against functional (GO) classes 1…m]

For each pair (cluster i, GO class j), calculate the p-value: the probability that they share that many genes by chance.
Evaluate biological significance
[Figure: Venn diagram of the g genes on the microarray, a functional class of f genes, and a cluster of n genes, sharing k genes]

“P-value”: p = 1 − ∑i=0…k−1 C(f, i)·C(g−f, n−i) / C(g, n)
(the probability of sharing at least k genes by chance)

True positive rate = k/n
Sensitivity = k/f
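The tail probability above is a direct sum of hypergeometric terms; a stdlib-only sketch (g genes total, f in the functional class, n in the cluster, k shared):

```python
from math import comb

def enrichment_p(g, f, n, k):
    """p = 1 - sum_{i=0}^{k-1} C(f, i) * C(g-f, n-i) / C(g, n):
    the probability that a random n-gene cluster shares at least
    k genes with an f-gene functional class, out of g genes total."""
    return 1.0 - sum(comb(f, i) * comb(g - f, n - i) for i in range(k)) / comb(g, n)
```

For example, with g = 10, f = 4, n = 5, and k = 2 shared genes, p = 1 − (C(4,0)·C(6,5) + C(4,1)·C(6,4)) / C(10,5) = 186/252.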
Who is better?

Conclusion: ICA-based clustering is generally better.
References
• Su-In Lee (2002), group talk: “Microarray data analysis using ICA”
• Altman et al. (2001), “Whole-genome expression analysis: challenges beyond clustering”, Curr. Opin. Struct. Biol., 11, 340
• Hyvärinen et al. (1999), “Survey on independent component analysis”, Neural Computing Surveys, 2, 94–128
• Alter et al. (2000), “Singular value decomposition for genome-wide expression data processing and modeling”, PNAS, 97, 10101
• Harmeling et al., “Kernel feature spaces & nonlinear blind source separation”