Machine Learning for Data Science (CS4786), Lecture 15
Review + Probabilistic Modeling
Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2017fa/
Announcements
• In-class Kaggle link is up
• Only one registration per group
• 5 submissions per day allowed
• Start early so you get more submissions
• Survey participation: 95.44%! Kudos!
[Bar chart: Lecture Speed survey. Options: Too fast, Somewhat fast, Right speed, Somewhat slow, Too slow.]
[Bar chart: Lecture Style/Clarity survey. Options: Too Mathy, Vague, Clear and Easy, Too easy/boring.]
[Bar chart: Assignment Load survey. Options: 1-2 hours, 2-5 hours, 5-9 hours, > 9 hours.]
TELL ME WHO YOUR FRIENDS ARE . . .
Cluster nodes in a graph. Analysis of social network data.

SPECTRAL CLUSTERING
Input: Similarity matrix A
A_{i,j} = A_{j,i} > 0 indicates the similarity between elements x_i and x_j
Example: A_{i,j} = exp(−β d(x_i, x_j)) for some scale β > 0, or A is the adjacency matrix of a graph
Steps
[Diagram: the n × n similarity matrix A is mapped to an n × K spectral embedding Y, then Cluster(Y).]
SPECTRAL CLUSTERING
Input: Similarity matrix A, as above.
Laplacian: L = D − A, where D is diagonal with D_{i,i} = ∑_{j=1}^n A_{i,j}
Spectral Embedding
• Nodes linked to each other are close in embedded space
SPECTRAL CLUSTERING ALGORITHM (UNNORMALIZED)
1. Given matrix A, calculate the diagonal matrix D s.t. D_{i,i} = ∑_{j=1}^n A_{i,j}
2. Calculate the Laplacian matrix L = D − A
3. Find eigenvectors v_1, . . . , v_n of L (in ascending order of eigenvalues)
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1, . . . , y_n ∈ R^K
5. Use the K-means clustering algorithm on y_1, . . . , y_n
y_1, . . . , y_n are called the spectral embedding.
This embeds the n nodes into effectively (K−1)-dimensional vectors: for a connected graph, the smallest eigenvalue is 0 with a constant eigenvector, which carries no cluster information.
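As a concrete sketch of these five steps (a minimal implementation of mine; the function name and the use of scikit-learn's KMeans are my choices, not the lecture's):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_unnormalized(A, K):
    """Sketch of the five steps above for a symmetric (n, n) similarity matrix A."""
    D = np.diag(A.sum(axis=1))            # step 1: D_{i,i} = sum_j A_{i,j}
    L = D - A                             # step 2: unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # step 3: eigh returns ascending eigenvalues
    Y = eigvecs[:, :K]                    # step 4: rows are y_1, ..., y_n in R^K
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(Y)  # step 5
```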
NORMALIZED CUT
Why is cut perhaps not a good measure?
Normalized cut: minimize the sum, over clusters, of the ratio of the number of edges cut per cluster to the number of edges within that cluster:
NCUT = ∑_j CUT(C_j) / Edges(C_j)
Example, K = 2:
NCUT = CUT(C_1, C_2) · (1/Edges(C_1) + 1/Edges(C_2))
Roughly: minimize CUT(C_1, C_2) s.t. Edges(C_1) = Edges(C_2).
[Figure: example graph on nodes 1-7 illustrating a cut between two groups.]
Exactly minimizing the normalized cut is an NP-hard problem!
. . . so relax
NORMALIZED SPECTRAL CLUSTERING
As before, we want to minimize ∑_{(i,j)∈E} (c_i − c_j)² = cᵀLc.
But we also want to weight the values of the c_i's by degree: we want high-degree nodes to have larger magnitude c_i.
That is, we want to simultaneously maximize ∑_{i=1}^n c_i² D_{i,i} = cᵀDc.
Find c so as to:
minimize cᵀLc / cᵀDc
≡ minimize cᵀLc subject to cᵀDc = 1
≡ minimize uᵀ D^{−1/2} L D^{−1/2} u subject to ‖u‖ = 1
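The substitution behind the last equivalence, spelled out (a standard step, added here for completeness): with $u = D^{1/2} c$,

$$c^\top L c = u^\top D^{-1/2} L D^{-1/2} u \quad \text{and} \quad c^\top D c = u^\top u = \|u\|^2,$$

so the constraint $c^\top D c = 1$ becomes $\|u\| = 1$.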
SPECTRAL CLUSTERING
Input: Similarity matrix A, as before.
Minimize cᵀL̃c s.t. c ⊥ 1
This approximately minimizes the normalized cut!
NORMALIZED CUT
First note that Edges(C_i) = ∑_{k : x_k ∈ C_i} D_{k,k}.
Set
c_i = √(Edges(C_2)/Edges(C_1)) if i ∈ C_1, and c_i = −√(Edges(C_1)/Edges(C_2)) otherwise.
Verify that cᵀLc = |E| × NCut and cᵀDc = |E| (and Dc ⊥ 1).
Hence we relax "minimize NCUT(C)" to:
minimize cᵀLc / cᵀDc s.t. Dc ⊥ 1
Solution: find the second-smallest eigenvector of L̃ = I − D^{−1/2} A D^{−1/2}.
SPECTRAL CLUSTERING ALGORITHM (NORMALIZED)
1. Given matrix A, calculate the diagonal matrix D s.t. D_{i,i} = ∑_{j=1}^n A_{i,j}
2. Calculate the normalized Laplacian matrix L̃ = I − D^{−1/2} A D^{−1/2}
3. Find eigenvectors v_1, . . . , v_n of L̃ (in ascending order of eigenvalues)
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1, . . . , y_n ∈ R^K
5. Use the K-means clustering algorithm on y_1, . . . , y_n
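Relative to the unnormalized sketch earlier, only step 2 changes. A minimal sketch (assuming no isolated nodes, so every degree is positive):

```python
import numpy as np

def normalized_laplacian(A):
    """L_tilde = I - D^{-1/2} A D^{-1/2} for a symmetric similarity matrix A."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # diagonal of D^{-1/2}; degrees must be > 0
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```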
Review

CLUSTERING
[Diagram: data matrix X with n rows (points) and d columns (features).]
K-means
• The K-means algorithm (wishful thinking) alternates two steps; see the sketch below:
• Fix the parameters (the K means) and compute new cluster assignments (or probabilities) for every point
• Fix the cluster assignments for all data points and re-evaluate the parameters (the K means)
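A minimal sketch of the two alternating steps (Lloyd's algorithm; initializing with K random data points is my choice):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate: assign points to the nearest mean, then recompute each mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # init with K data points
    for _ in range(n_iters):
        # Fix the means, compute cluster assignments for every point.
        labels = np.linalg.norm(X[:, None] - centers[None, :], axis=2).argmin(axis=1)
        # Fix the assignments, re-evaluate the means (keep the old center if a cluster empties).
        new_centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```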
Single-Link Clustering
• Start with every point as its own cluster
• Until we get K clusters, merge the two closest clusters (see the sketch below)
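A short sketch using SciPy's hierarchical-clustering routines (the library choice is mine; the lecture only states the merge rule):

```python
from scipy.cluster.hierarchy import linkage, fcluster

def single_link(X, K):
    """Start with singletons; repeatedly merge the two closest clusters until K remain."""
    Z = linkage(X, method='single')                 # single-link merge tree
    return fcluster(Z, t=K, criterion='maxclust')   # cut the tree into K clusters
```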
When to Use Single Link
• When we have dense sampling of points within each cluster
• When not to use: when we might have outliers
When to use K-means
• When we have nice spherical, roughly equal-size clusters, or the cluster masses are far apart
• Handles outliers better than single link
PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components.
The top K principal components are the eigenvectors with the K largest eigenvalues.
Projection = Data × top K eigenvectors
Reconstruction = Projection × transpose of the top K eigenvectors
Independently discovered by Pearson in 1901 and Hotelling in 1933.
1. Σ = cov(X)
2. W = eigs(Σ, K)
3. Y = (X − μ) × W
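A sketch of steps 1-3, plus the reconstruction mentioned above (names are mine):

```python
import numpy as np

def pca(X, K):
    """Steps 1-3: covariance, top-K eigenvectors, projection of centered data."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)        # step 1: covariance matrix
    _, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues come back in ascending order
    W = eigvecs[:, -K:][:, ::-1]           # step 2: eigenvectors of the K largest eigenvalues
    Y = (X - mu) @ W                       # step 3: projection
    X_hat = Y @ W.T + mu                   # reconstruction: projection x W^T
    return Y, W, X_hat
```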
When to use PCA
• Great when data is truly low dimensional (lies on a hyperplane, i.e., linear structure)
• Or approximately low dimensional (almost lies on a plane, e.g., a very flat ellipsoid)
• E.g., dimensionality reduction for face images, or as preprocessing for multiple biometric applications
CCA ALGORITHM
Write x̃_t = [x_t ; x′_t], the (d + d′)-dimensional concatenated vectors.
Calculate the covariance matrix of the joint data points:
Σ = [ Σ_{1,1}  Σ_{1,2} ; Σ_{2,1}  Σ_{2,2} ]
Calculate Σ_{1,1}^{−1} Σ_{1,2} Σ_{2,2}^{−1} Σ_{2,1}. The top K eigenvectors of this matrix give the projection matrix for view I.
Calculate Σ_{2,2}^{−1} Σ_{2,1} Σ_{1,1}^{−1} Σ_{1,2}. The top K eigenvectors of this matrix give the projection matrix for view II.
1. X = [X_1, X_2], where X_1 is n × d_1 and X_2 is n × d_2
2. Σ = cov(X) = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ]
3. W_1 = eigs(Σ_11^{−1} Σ_12 Σ_22^{−1} Σ_21, K)
4. Y_1 = (X_1 − μ_1) × W_1
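A sketch following these steps (the small ridge `reg` is my addition to keep the covariance inverses well defined; everything else mirrors the slide):

```python
import numpy as np

def cca(X1, X2, K, reg=1e-8):
    """CCA sketch: top-K eigenvectors of S11^{-1} S12 S22^{-1} S21 (and symmetrically)."""
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    n = len(X1)
    S11 = X1c.T @ X1c / n + reg * np.eye(X1.shape[1])   # Sigma_11 (+ ridge)
    S22 = X2c.T @ X2c / n + reg * np.eye(X2.shape[1])   # Sigma_22 (+ ridge)
    S12 = X1c.T @ X2c / n                               # Sigma_12
    # View I: eigenvectors of S11^{-1} S12 S22^{-1} S21.
    M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
    vals1, vecs1 = np.linalg.eig(M1)
    W1 = vecs1[:, np.argsort(-vals1.real)[:K]].real
    # View II: eigenvectors of S22^{-1} S21 S11^{-1} S12.
    M2 = np.linalg.solve(S22, S12.T) @ np.linalg.solve(S11, S12)
    vals2, vecs2 = np.linalg.eig(M2)
    W2 = vecs2[:, np.argsort(-vals2.real)[:K]].real
    return X1c @ W1, X2c @ W2
```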
When to use CCA?
• CCA applies to problems where the data can be split into 2 views, X = [X_1, X_2]
• CCA picks directions of projection (in each view) along which the data is maximally correlated
• It maximizes the correlation coefficient, not just covariance, so it is scale free
When to use CCA
• Scenario 1: You have two feature extraction techniques.
• One provides excellent features for dogs vs. cats and noise on the other classes
• The other provides excellent features for cars vs. bikes and noise on the other classes
• What do we do?
A. Use CCA to find one common representation
B. Concatenate the two extracted features
When to use CCA
• Scenario 2: You have two cameras capturing images of the same objects from different angles.
• You have a feature extraction technique that provides feature vectors from each camera.
• You want to extract good features for recognizing the object from the two cameras.
• What do we do?
A. Use CCA to find one common representation
B. Concatenate the two extracted features
PICK A RANDOM W
Y = X × W, where W is a d × K matrix of random ±1 entries, scaled by 1/√K:
W = (1/√K) [ +1 . . . −1 ; −1 . . . +1 ; +1 . . . −1 ; ⋮ ; +1 . . . −1 ]
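A minimal sketch of this random projection (function name and seed handling are mine):

```python
import numpy as np

def random_projection(X, K, seed=0):
    """Project with a random d x K sign matrix, scaled by 1/sqrt(K)."""
    rng = np.random.default_rng(seed)
    W = rng.choice([-1.0, 1.0], size=(X.shape[1], K)) / np.sqrt(K)
    return X @ W
```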
When to use RP?
• When data is huge and very high dimensional
• For PCA and CCA you typically think of K (the number of dimensions we reduce to) in double digits
• For RP, think of K typically as a 3-4 digit number
• RP approximately preserves inter-point distances (with high probability)
• RP, unlike PCA and CCA, does not project using unit vectors (what does this mean?)
KERNEL PCA
1. Compute the n × n centered kernel matrix K̃ from X using the kernel k
2. (Λ, A) = eigs(K̃, K)
3. P = [ A_1/√λ_1, . . . , A_K/√λ_K ]
4. Y = K̃ × P
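A sketch of steps 1-4 (the centering formula and the example RBF kernel are standard choices of mine; the slide leaves the kernel open):

```python
import numpy as np

def kernel_pca(X, K, kernel):
    """Kernel PCA sketch: center the Gram matrix, take top-K eigenpairs, embed."""
    n = len(X)
    G = kernel(X, X)                              # raw kernel (Gram) matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ G @ H                           # step 1: center in feature space
    eigvals, eigvecs = np.linalg.eigh(K_tilde)    # ascending order
    lam = eigvals[::-1][:K]                       # step 2: K largest eigenvalues
    A = eigvecs[:, ::-1][:, :K]
    P = A / np.sqrt(np.maximum(lam, 1e-12))       # step 3: scale columns by 1/sqrt(lambda)
    return K_tilde @ P                            # step 4: embedding Y

def rbf(X, Z, gamma=1.0):
    """Example RBF kernel: only nearby points have non-negligible values."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)
```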
When to use Kernel PCA
• When data lies on some non-linear, low-dimensional subspace
• The kernel function matters (e.g., with the RBF kernel, only points close to a given point have non-negligible kernel evaluations)
Spectral Clustering
• You want to cluster the nodes of a graph into groups based on connectivity
• Unnormalized spectral clustering: divide into groups so that as few edges as possible are cut between groups
SPECTRAL CLUSTERING ALGORITHM (UNNORMALIZED)
1. Given matrix A, calculate the diagonal matrix D s.t. D_{i,i} = ∑_{j=1}^n A_{i,j}
2. Calculate the Laplacian matrix L = D − A
3. Find eigenvectors v_1, . . . , v_n of L (in ascending order of eigenvalues)
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1, . . . , y_n ∈ R^K
5. Use the K-means clustering algorithm on y_1, . . . , y_n
Spectral Embedding
[Diagram: the n × n matrix A is mapped to the n × K embedding Y; use K-means on Y.]
Normalized Spectral Clustering
• Unnormalized spectral embedding encourages loner nodes to be pushed far away from the rest
• This is indeed the min-cut solution: cut off the loners
• Instead, form clusters that minimize the ratio of edges cut to the number of edges each cluster has
• (busy groups tend to form clusters)
• Algorithm: replace the Laplacian matrix by the normalized one
SPECTRAL CLUSTERING ALGORITHM (NORMALIZED)
1. Given matrix A, calculate the diagonal matrix D s.t. D_{i,i} = ∑_{j=1}^n A_{i,j}
2. Calculate the normalized Laplacian matrix L̃ = I − D^{−1/2} A D^{−1/2}
3. Find eigenvectors v_1, . . . , v_n of L̃ (in ascending order of eigenvalues)
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1, . . . , y_n ∈ R^K
5. Use the K-means clustering algorithm on y_1, . . . , y_n
When to use Spectral Clustering
• First, it even works with weighted graphs, where the weight of an edge represents similarity
• When how clusters should form is decided solely by the similarity between points, and there is no other prior knowledge
Probabilistic Modeling
PROBABILISTIC MODEL
Data: x_1, . . . , x_n
Model: θ ∈ Θ such that P_θ explains the data
PROBABILISTIC MODEL
[Figure: a mixture of three Gaussians with mixing weights π_1 = 0.5, π_2 = 0.25, π_3 = 0.25 and covariances Σ_1, Σ_2, Σ_3.]
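A sketch of how such a model generates data (the weights and the three-component structure are from the figure; the means and covariances are placeholders of mine):

```python
import numpy as np

def sample_mixture(n, pis, mus, Sigmas, seed=0):
    """Generative model: draw component z ~ pi, then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)          # latent cluster per point
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

pis = [0.5, 0.25, 0.25]                                          # pi_k from the figure
mus = [np.zeros(2), np.array([4.0, 0.0]), np.array([0.0, 4.0])]  # placeholder means
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]       # placeholder Sigma_k
X, z = sample_mixture(500, pis, mus, Sigmas)
```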
PROBABILISTIC MODELS
The set of models Θ consists of parameters such that P_θ, for each θ ∈ Θ, is a distribution over data.
Learning: estimate the θ* ∈ Θ that best models the given data.
MAXIMUM LIKELIHOOD PRINCIPLE
Pick the θ ∈ Θ that maximizes the probability of the observations.
Reasoning:
One of the models in Θ is the correct one.
Given data, we pick the one that best explains the observed data.
Equivalently, pick the maximum likelihood estimator
θ_MLE = argmax_{θ∈Θ} log P_θ(x_1, . . . , x_n)
Often referred to as the frequentist view.
The quantity log P_θ(x_1, . . . , x_n) being maximized is the (log-)likelihood. A priori all models are equally good; the data could have been generated by any one of them.
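A concrete instance (my own example, not from the slides): for data modeled as i.i.d. N(θ, 1), the log-likelihood is maximized at the sample mean.

```python
import numpy as np

# log P_theta(x_1, ..., x_n) = -(1/2) * sum_i (x_i - theta)^2 + const,
# so theta_MLE is the sample mean.
x = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=1000)
thetas = np.linspace(0.0, 4.0, 401)
loglik = np.array([-0.5 * ((x - t) ** 2).sum() for t in thetas])
print(thetas[loglik.argmax()], x.mean())   # both close to the true mean 2.0
```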
MAXIMUM A POSTERIORI
Pick the θ ∈ Θ that is most likely given the data.
Reasoning:
Models are abstractions that capture our beliefs.
We update our beliefs based on observed data.
Given data, we pick the model that we believe in the most.
Pick the θ that maximizes log P(θ | x_1, . . . , x_n).
I want to say: often referred to as the Bayesian view . . .
. . . but there are Bayesians, and then there are Bayesians.
Say you had a prior belief about models, provided by P(θ).
MAXIMUM A POSTERIORI
Pick the θ ∈ Θ that is most likely given the data.
Maximize the a posteriori probability of the model given the data:
θ_MAP = argmax_{θ∈Θ} P(θ | x_1, . . . , x_n)
      = argmax_{θ∈Θ} P(x_1, . . . , x_n | θ) P(θ) / ∑_{θ′∈Θ} P(x_1, . . . , x_n | θ′) P(θ′)   (Bayes rule)
      = argmax_{θ∈Θ} P(x_1, . . . , x_n | θ) P(θ) / P(x_1, . . . , x_n)
      = argmax_{θ∈Θ} P(x_1, . . . , x_n | θ) × P(θ)   (likelihood × prior; the denominator does not depend on θ)
      = argmax_{θ∈Θ} log P(x_1, . . . , x_n | θ) + log P(θ)
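A small worked instance (my own example, not from the slides): MAP for a coin's heads probability θ under a Beta(a, b) prior has a closed form that adds pseudo-counts to the MLE.

```python
# MAP for a Bernoulli parameter theta with a Beta(a, b) prior:
# argmax_theta [log P(data | theta) + log P(theta)] = (heads + a - 1) / (n + a + b - 2).
heads, n = 7, 10
a, b = 2.0, 2.0                                   # assumed prior pseudo-counts
theta_mle = heads / n                             # 0.700: likelihood alone
theta_map = (heads + a - 1) / (n + a + b - 2)     # 0.667: pulled toward the prior mean 0.5
print(theta_mle, theta_map)
```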
THE BAYESIAN CHOICE
Don't pick any single θ* ∈ Θ; a model is simply an abstraction.
We have a posterior distribution over models. Why pick one, if at the end of the day we only want cluster assignments?
For each point, find the probability of its cluster assignment by integrating over the posterior probability of the parameters θ:
P(X | data) = ∑_{θ∈Θ} P(X, θ | data) = ∑_{θ∈Θ} P(X | θ) P(θ | data)
We will come back to this later . . .