Prof. Ron Shamir & Prof. Roded Sharan, School of Computer Science, Tel Aviv University
TRANSCRIPT
Lecture 7: DNA Chips and Clustering (4, 6/12/12)
Computational Genomics
Prof. Ron Shamir & Prof. Roded Sharan, School of Computer Science, Tel Aviv University
How Gene Expression Data Looks

"Raw Data": a matrix of expression levels, genes × conditions.

Entries of the Raw Data matrix:
• Ratio values
• Absolute values
• …
• Row = a gene's expression pattern / fingerprint vector
• Column = an experiment/condition's profile
Data Preprocessing

• Input: real-valued raw data matrix of expression levels (genes × conditions).
• Compute the similarity matrix (cosine angle / correlation / …)
• Alternatively: distances.

From the Raw Data matrix we compute the similarity matrix S; S_ij reflects the similarity of the expression patterns of gene i and gene j.
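As a small illustration (not from the slides), this preprocessing step can be sketched with NumPy; the toy raw-data matrix here is invented, and Pearson correlation stands in for the similarity measure:

```python
import numpy as np

# Hypothetical toy raw-data matrix: rows = genes, columns = conditions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))

# Pearson correlation of every pair of rows: S[i, j] is the similarity
# of the expression patterns of gene i and gene j.
S = np.corrcoef(X)

# One common conversion from similarities to distances: D = 1 - S.
D = 1.0 - S
```

Each gene is perfectly similar to itself, so the diagonal of S is all ones, and S is symmetric by construction.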
DNA chips: Applications
• Deducing functions of unknown genes (similar expression pattern ⇒ similar function)
• Deciphering regulatory mechanisms (co-expression ⇒ co-regulation)
• Identifying disease profiles
• Drug development
• …
Analysis requires clustering of genes/conditions.
Clustering: Objective

Group elements (genes) into clusters satisfying:
• Homogeneity: Elements inside a cluster are highly similar to each other.
• Separation: Elements from different clusters have low similarity to each other.
• Unsupervised.
• Most formulations are NP-hard.
An Alternative View
Instead of a partition into clusters, form a tree hierarchy of the input elements satisfying:
• More similar elements are placed closer together in the tree.
• Or: tree distances reflect the distances between the elements.
Hierarchical Representation
[Figure: dendrogram over elements 1, 3, 4, 2 with merge heights 2.8, 4.5, 5.0]

Dendrogram: a rooted tree, usually binary; all leaf-to-root distances are equal. Ordinates reflect the (average) distances between the corresponding subtrees.
Hierarchical Clustering: Average Linkage (Sokal & Michener '58, Lance & Williams '67)

• Input: distance matrix (D_ij)
• Iterative algorithm. Initially each element is a cluster; n_r = size of cluster r.
  – Find the minimum element D_rs in D; merge clusters r and s.
  – Delete elements r, s; add a new element t with D_it = D_ti = n_r/(n_r+n_s)·D_ir + n_s/(n_r+n_s)·D_is
  – Repeat.
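The iteration above can be sketched directly in Python (a minimal, unoptimized version for illustration; the function name and return format are my own choices, not from the slides):

```python
import numpy as np

def average_linkage(D):
    """Agglomerative average linkage on an n x n distance matrix D.

    Returns a list of merges (cluster_r, cluster_s, distance), using the
    slides' update D_it = n_r/(n_r+n_s)*D_ir + n_s/(n_r+n_s)*D_is.
    """
    D = D.astype(float).copy()
    n = D.shape[0]
    active = list(range(n))           # indices of current clusters
    sizes = {i: 1 for i in range(n)}  # n_r: size of cluster r
    merges = []
    while len(active) > 1:
        # find the closest pair (r, s) among active clusters
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                r, s = active[a], active[b]
                if best is None or D[r, s] < best[0]:
                    best = (D[r, s], r, s)
        d_rs, r, s = best
        merges.append((r, s, d_rs))
        # new cluster t reuses index r; update its distances to the rest
        for i in active:
            if i not in (r, s):
                w = sizes[r] + sizes[s]
                D[r, i] = D[i, r] = (sizes[r] * D[i, r] + sizes[s] * D[i, s]) / w
        sizes[r] += sizes[s]
        active.remove(s)
    return merges

# toy usage: three elements with d(0,1)=1, d(0,2)=4, d(1,2)=5
merges = average_linkage(np.array([[0, 1, 4], [1, 0, 5], [4, 5, 0]], dtype=float))
# the first merge joins the closest pair (0, 1) at distance 1; the merged
# cluster is then at average distance (4 + 5) / 2 = 4.5 from element 2
```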
Average Linkage (cont.)
• Claim: D_rs is the average distance between the elements of r and the elements of s.
• Proof by induction…
• Claim: the merging distance D_rs can only increase.
A General Framework (Lance & Williams '67)

• Find the minimum element D_rs; merge clusters r and s.
• Delete elements r, s; add a new element t with
  D_it = D_ti = α_r·D_ir + α_s·D_is + γ·|D_ir − D_is|
• Single linkage (α_r = α_s = 1/2, γ = −1/2): D_it = min{D_ir, D_is}
• Complete linkage (α_r = α_s = 1/2, γ = +1/2): D_it = max{D_ir, D_is}
• Note: there is an analogous formulation in terms of a similarity matrix (rather than distances).
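The general update can be sanity-checked in a few lines; the function name is my own, and the coefficient values below are the standard Lance-Williams parameterizations of single and complete linkage:

```python
# Lance-Williams update: D_it = alpha_r*D_ir + alpha_s*D_is + gamma*|D_ir - D_is|
def lw_update(d_ir, d_is, alpha_r, alpha_s, gamma):
    return alpha_r * d_ir + alpha_s * d_is + gamma * abs(d_ir - d_is)

# Single linkage: alpha_r = alpha_s = 1/2, gamma = -1/2  ->  min(D_ir, D_is)
assert lw_update(3.0, 7.0, 0.5, 0.5, -0.5) == min(3.0, 7.0)
# Complete linkage: alpha_r = alpha_s = 1/2, gamma = +1/2  ->  max(D_ir, D_is)
assert lw_update(3.0, 7.0, 0.5, 0.5, 0.5) == max(3.0, 7.0)
```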
Hierarchical clustering of GE data Eisen et al., PNAS 1998
• Growth response: Starved human fibroblast cells, added serum
• Monitored 8600 genes over 13 time-points
• tij - fluorescence level of gene i in condition j; rij – same for reference (time=0).
• sij= log(tij/rij)
• S_kl = (Σ_j s_kj·s_lj) / (|s_k|·|s_l|) (cosine of the angle)
• Applied average linkage method
• Ordered leaves by increasing element weight: average expression level, time of maximal induction, or other criteria
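The preprocessing steps above (log ratios against the time-0 reference, then cosine similarity) can be sketched as follows; the fluorescence values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.uniform(0.5, 2.0, size=(4, 13))  # hypothetical fluorescence, 13 time points
r = rng.uniform(0.5, 2.0, size=(4, 1))   # reference (time = 0) level per gene

s = np.log(t / r)                        # s_ij = log(t_ij / r_ij)

# Cosine similarity between gene rows: S_kl = sum_j s_kj*s_lj / (|s_k||s_l|)
norms = np.linalg.norm(s, axis=1, keepdims=True)
S = (s @ s.T) / (norms @ norms.T)
```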
"Eisengrams" for the same data randomly permuted within rows (1), within columns (2), and within both (3)
Comments
• Distinct measurements of same genes cluster together
• Genes of similar function cluster together
• Many cluster-function specific insights
• Interpretation is a REAL biological challenge
More on Hierarchical Methods

• Agglomerative vs. the "more natural" divisive.
• Advantages:
  – Gives a single coherent global picture.
  – Intuitive for biologists (from phylogeny).
• Disadvantages:
  – No single partition; no specific clusters.
  – Forces all elements to fit a tree hierarchy.
K-means (Lloyd '57, MacQueen '67)

• Input: a vector v_i for each element i; number of clusters = k.
• Define the centroid c_p of a cluster C_p as its average vector.
• Goal: minimize Σ_p Σ_{i∈C_p} d(v_i, c_p)
• Objective = homogeneity only (k is fixed).
• NP-hard already for k = 2.
K-means Algorithm

• Initialize an arbitrary partition P into k clusters.
• Repeat until convergence:
  – Update the centroids (optimize c, P fixed).
  – Assign each point to its closest centroid (optimize P, c fixed).
• Can be shown to have polynomial expected running time under various assumptions on the data distribution.
• A variant: perform a single best modification (that decreases the score the most).
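The two alternating steps can be sketched as a short Lloyd-style loop (a minimal illustration; initialization by sampling k points is one common choice, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate centroid updates and reassignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its closest centroid (P step)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update each centroid as its cluster's average vector (c step)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# toy usage: two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, 2)
```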
A Soft Version

• Based on a probabilistic model of the data as coming from a mixture of Gaussians:

  P(z_i = j) = π_j
  P(x_i | z_i = j) ~ N(μ_j, σ²I)

  L(θ) ∝ Π_i Σ_j π_j exp(−d²(x_i, μ_j) / 2σ²)

• Goal: evaluate the parameters θ (assume σ is known).
• Method: apply EM to maximize the likelihood of the data.
EM, Soft Version

• Iteratively, compute the soft assignment and use it to derive expectations of π, μ:

  w_ij = π_j exp(−d²(x_i, μ_j)/2σ²) / Σ_{j'} π_{j'} exp(−d²(x_i, μ_{j'})/2σ²)
  π_j ← (1/n) Σ_i w_ij,   μ_j ← Σ_i w_ij x_i / Σ_i w_ij
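The soft-assignment loop can be sketched as follows (a minimal version assuming a known, shared σ, matching the slides' setting; the function name and initialization are my own):

```python
import numpy as np

def soft_em(x, k, iters=50, sigma=1.0, seed=0):
    """EM for a k-component 1-D Gaussian mixture with known sigma.
    Returns mixing weights pi and means mu."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False).astype(float)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: soft assignment w_ij ∝ pi_j * exp(-(x_i - mu_j)^2 / (2 sigma^2))
        logw = np.log(pi)[None, :] - (x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: expected-count updates for pi and mu
        pi = w.mean(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return pi, mu

# toy usage: two well-separated components
rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(8, 1, 100)])
pi, mu = soft_em(data, 2)
```

For well-separated data like this, the recovered means land near the true component centers (0 and 8) and the mixing weights sum to 1.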
Soft vs. Hard K-means

• Soft EM optimizes: Σ_i log Σ_j π_j P(x_i | z_i = j, θ)
• Hard EM optimizes: Σ_i log P(x_i, z(i) | θ)
• If we use uniform mixture probabilities, then k-means is an application of hard EM, since

  log P(x_i, z(i) | θ) = const − d²(x_i, μ_{z(i)}) / 2σ²
The Probabilistic Setting

• Input: data x coming from a probabilistic model with hidden information y.
• Goal: learn the model's parameters so that the likelihood of the data is maximized.
• Example: a mixture of two Gaussians:

  P(y_i = 1) = p;   P(y_i = 2) = 1 − p
  P(x_i | y_i = j) = (1/√(2πσ²)) exp(−(x_i − μ_j)² / 2σ²)
The Likelihood Function

  P(y_i = 1) = p_1;   P(y_i = 2) = p_2 = 1 − p_1
  P(x_i | y_i = j) = (1/√(2πσ²)) exp(−(x_i − μ_j)² / 2σ²)

  L(θ) = Π_i P(x_i | θ) = Π_i Σ_j P(x_i, y_i = j | θ)

  log L(θ) = Σ_i log Σ_j p_j (1/√(2πσ²)) exp(−(x_i − μ_j)² / 2σ²)
The EM Algorithm

Goal: maximize log P(x|θ) = log (Σ_y P(x, y|θ)).
Assume we have a model θ^t which we wish to improve.
Note: P(x|θ) = P(x, y|θ) / P(y|x, θ).

Taking logs and averaging over y with weights P(y|x, θ^t):

  Σ_y P(y|x,θ^t) log P(x|θ) = Σ_y P(y|x,θ^t) log P(x,y|θ) − Σ_y P(y|x,θ^t) log P(y|x,θ)

  log P(x|θ) = Σ_y P(y|x,θ^t) log P(x,y|θ) − Σ_y P(y|x,θ^t) log P(y|x,θ)

  log P(x|θ) − log P(x|θ^t)
    = Q(θ|θ^t) − Q(θ^t|θ^t) + Σ_y P(y|x,θ^t) log [P(y|x,θ^t) / P(y|x,θ)]

Here Q(θ^t|θ^t) is a constant (independent of θ), and the last term is a relative entropy, hence ≥ 0.
The EM Algorithm (cont.)

Main component:

  Q(θ|θ^t) = Σ_y P(y|x, θ^t) log P(x, y|θ)

This is the expectation of log P(x, y|θ) over the distribution of y given by the current parameters θ^t.

The algorithm:
• E-step: calculate the Q function.
• M-step: maximize Q(θ|θ^t) with respect to θ.
• Stopping criterion: improvement in log likelihood ≤ ε.
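The guarantee derived above (each E/M cycle cannot decrease log P(x|θ)) can be checked numerically on the two-Gaussian mixture; the data and starting parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 80), rng.normal(3, 1, 80)])
sigma = 1.0
mu = np.array([-1.0, 1.0])
pi = np.array([0.5, 0.5])

def loglik(x, pi, mu, sigma):
    dens = pi[None, :] * np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2)) \
           / np.sqrt(2 * np.pi * sigma**2)
    return np.log(dens.sum(axis=1)).sum()

lls = []
for _ in range(20):
    lls.append(loglik(x, pi, mu, sigma))
    # E-step: posterior responsibilities
    w = pi[None, :] * np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    # M-step: exact maximizers of Q for pi and mu
    pi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

# the log likelihood never decreases across iterations
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```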
Application to the Mixture Model

  Q(θ|θ^t) = Σ_y P(y|x, θ^t) log P(x, y|θ)

Encode y by indicators y_ij, with y_ij = 1 if y_i = j and 0 otherwise. Then:

  P(x, y|θ) = Π_i P(x_i, y_i|θ) = Π_i Π_j P(x_i, y_i = j|θ)^{y_ij}

  log P(x, y|θ) = Σ_i Σ_j y_ij log P(x_i, y_i = j|θ)

  Q(θ|θ^t) = Σ_y P(y|x, θ^t) Σ_i Σ_j y_ij log P(x_i, y_i = j|θ)
           = Σ_i Σ_j [Σ_y P(y|x, θ^t) y_ij] log P(x_i, y_i = j|θ)
Application (cont.)

  Q(θ|θ^t) = Σ_i Σ_j P(y_ij = 1 | x, θ^t) log P(x_i, y_i = j|θ)

Define the responsibilities:

  w_ij^t := P(y_ij = 1 | x, θ^t) = P(x_i, y_i = j|θ^t) / Σ_{j'} P(x_i, y_i = j'|θ^t)

Then:

  Q(θ|θ^t) = Σ_i Σ_j w_ij^t [log p_j − log √(2πσ²) − (x_i − μ_j)² / 2σ²]
Baum-Welch: EM for HMMs

Here the hidden information is the state path, y = π; the log likelihood is

  log P(x|θ) = log Σ_π P(x, π|θ)

and the Q function is:

  Q(θ|θ^t) = Σ_π P(π|x, θ^t) log P(x, π|θ)
Baum-Welch (cont.)

  P(x, π|θ) = Π_{k=1..M} Π_b [e_k(b)]^{E_k(b,π)} · Π_{k=1..M} Π_{l=1..M} a_kl^{A_kl(π)}

where:
• e_k(b): emission probability of character b in state k
• a_kl: transition probability from state k to state l
• E_k(b,π): number of times we saw b emitted from state k along π
• A_kl(π): number of k→l transitions along π
Baum-Welch (cont.)

  Q(θ|θ^t) = Σ_π P(π|x, θ^t) [Σ_{k=1..M} Σ_b E_k(b,π) log e_k(b) + Σ_{k=1..M} Σ_{l=1..M} A_kl(π) log a_kl]
           = Σ_{k,b} [Σ_π P(π|x, θ^t) E_k(b,π)] log e_k(b) + Σ_{k,l} [Σ_π P(π|x, θ^t) A_kl(π)] log a_kl

Define the expected counts (each an expectation of a value over the posterior path probabilities):

  A_kl = Σ_π P(π|x, θ^t) A_kl(π)
  E_k(b) = Σ_π P(π|x, θ^t) E_k(b,π)
Baum-Welch (cont.)

• So we want to find a set of parameters θ^{t+1} that maximizes:

  Σ_{k,b} E_k(b) log e_k(b) + Σ_{k,l} A_kl log a_kl

• For maximization, select:

  a_kl = A_kl / Σ_{l'} A_{kl'},   e_k(b) = E_k(b) / Σ_{b'} E_k(b')

• E_k(b), A_kl can be computed using forward/backward:

  P(π_i = k, π_{i+1} = l | x, θ^t) = (1/P(x)) · f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1)

  A_kl = (1/P(x)) · Σ_i f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1)

  similarly, E_k(b) = (1/P(x)) · Σ_{i: x_i = b} f_k(i) · b_k(i)
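The forward/backward computation of the expected counts can be sketched on a toy HMM (the 2-state model and the observed sequence below are invented; f is the forward table, b the backward table, following the slides' notation):

```python
import numpy as np

# Hypothetical 2-state HMM over a binary alphabet {0, 1}.
a = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probs a_kl
e = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probs e_k(b)
init = np.array([0.5, 0.5])              # initial state distribution
x = np.array([0, 0, 1, 1, 0])

n, M = len(x), len(init)
f = np.zeros((n, M))
b = np.zeros((n, M))
f[0] = init * e[:, x[0]]
for i in range(1, n):                    # forward: f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl
    f[i] = e[:, x[i]] * (f[i - 1] @ a)
b[-1] = 1.0
for i in range(n - 2, -1, -1):           # backward: b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
    b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
Px = f[-1].sum()                         # P(x)

# Expected transition counts: A_kl = (1/P(x)) sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)
A = sum(np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) for i in range(n - 1)) * a / Px
# Expected emission counts: E_k(b) = (1/P(x)) sum_{i: x_i = b} f_k(i) b_k(i)
E = np.zeros((M, 2))
for bsym in (0, 1):
    E[:, bsym] = sum(f[i] * b[i] for i in range(n) if x[i] == bsym) / Px

# Re-estimated parameters (the M-step)
a_new = A / A.sum(axis=1, keepdims=True)
e_new = E / E.sum(axis=1, keepdims=True)
```

As a sanity check: the expected transition counts sum to n − 1 and the expected emission counts sum to n, since each position's posterior probabilities sum to one.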
Relative Entropy Is Positive

Using log(x) ≤ x − 1:

  −Σ_i P(x_i) log [Q(x_i)/P(x_i)] ≥ −Σ_i P(x_i) [Q(x_i)/P(x_i) − 1]
    = −Σ_i Q(x_i) + Σ_i P(x_i) = 0

Hence Σ_i P(x_i) log [P(x_i)/Q(x_i)] ≥ 0.
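A quick numerical check of this inequality (the two distributions are random, invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two hypothetical distributions on 6 outcomes
P = rng.random(6); P /= P.sum()
Q = rng.random(6); Q /= Q.sum()

# Relative entropy: sum_i P_i log(P_i / Q_i) >= 0, by log(x) <= x - 1
kl = np.sum(P * np.log(P / Q))
assert kl >= 0
# and it is exactly zero when the distributions coincide
assert np.isclose(np.sum(P * np.log(P / P)), 0.0)
```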
Baum-Welch: EM for HMMs (cont.)

Maximize:

  Σ_{k,b} E_k(b) log e_k(b) + Σ_{k,l} A_kl log a_kl

Chosen parameters:

  a_kl^chosen = A_kl / Σ_{l'} A_{kl'},   e_k(b)^chosen = E_k(b) / Σ_{b'} E_k(b')

Difference between the chosen set and some other set (multiplying and dividing by the same factor Σ_{l'} A_{kl'}):

  Σ_{k,l} A_kl log a_kl^chosen − Σ_{k,l} A_kl log a_kl^other
    = Σ_k (Σ_{l'} A_{kl'}) Σ_l a_kl^chosen log (a_kl^chosen / a_kl^other) ≥ 0

since each inner sum is a relative entropy, which is always positive (≥ 0). So the chosen parameters maximize the objective.