![Page 1: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/1.jpg)
Guaranteed Learning of Latent Variable Models
through Spectral and Tensor Methods
Anima Anandkumar
U.C. Irvine
Application 1: Clustering
Basic operation of grouping data points.
Hypothesis: each data point belongs to an unknown group.
Probabilistic/latent variable viewpoint
The groups represent different distributions (e.g. Gaussian).
Each data point is drawn from one of the given distributions (e.g. Gaussian mixtures).
Application 2: Topic Modeling
Document modeling
Observed: words in document corpus.
Hidden: topics.
Goal: carry out document summarization.
Application 3: Understanding Human Communities
Social Networks
Observed: network of social ties, e.g. friendships, co-authorships
Hidden: groups/communities of actors.
Application 4: Recommender Systems
Recommender System
Observed: ratings of users for various products, e.g. Yelp reviews.
Goal: Predict new recommendations.
Modeling: Find groups/communities of users and products.
Application 5: Feature Learning
Feature Engineering
Learn good features/representations for classification tasks, e.g. image and speech recognition.
Sparse representations, low dimensional hidden structures.
Application 6: Computational Biology
Observed: gene expression levels
Goal: discover gene groups
Hidden variables: regulators controlling gene groups
How to model hidden effects?
Basic Approach: mixtures/clusters
Hidden variable h is categorical.
Advanced: Probabilistic models
Hidden variable h has more general distributions.
Can model mixed memberships.
[Figure: graphical model with hidden nodes h1, h2, h3 and observed nodes x1 to x5.]
This talk: basic mixture model. Tomorrow: advanced models.
Challenges in Learning
Basic goal in all mentioned applications
Discover hidden structure in data: unsupervised learning.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)?
Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard.
Practice: EM, Variational Bayes have no consistency guarantees.
Efficient computational and sample complexities?
In this series: guaranteed and efficient learning through spectral methods
What this talk series will cover
This talk
Start with the most basic latent variable model: Gaussian mixtures.
Basic spectral approach: principal component analysis (PCA).
Beyond correlations: higher order moments.
Tensor notations, PCA extension to tensors.
Guaranteed learning of Gaussian mixtures.
Introduce topic models. Present a unified viewpoint.
Tomorrow: using tensor methods for learning latent variable models, and analysis of the tensor decomposition method.
Wednesday: implementation of tensor methods.
Resources for Today’s Talk
“A Method of Moments for Mixture Models and Hidden Markov Models” by A. Anandkumar, D. Hsu, and S.M. Kakade. Proc. of COLT, June 2012.
“Tensor Decompositions for Learning Latent Variable Models” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky, Oct. 2012.
Resources available at http://newport.eecs.uci.edu/anandkumar/MLSS.html
Outline
1 Introduction
2 Warm-up: PCA and Gaussian Mixtures
3 Higher order moments for Gaussian Mixtures
4 Tensor Factorization Algorithm
5 Learning Topic Models
6 Conclusion
Warm-up: PCA
Optimization problem
For (centered) points $x_i \in \mathbb{R}^d$, find a projection $P$ with $\mathrm{Rank}(P) = k$ that solves
$$\min_{P \in \mathbb{R}^{d \times d}} \frac{1}{n} \sum_{i \in [n]} \|x_i - P x_i\|^2.$$
Result: If $S = \mathrm{Cov}(X)$ and $S = U \Lambda U^\top$ is its eigendecomposition, then $P = U_{(k)} U_{(k)}^\top$, where the columns of $U_{(k)}$ are the top-$k$ eigenvectors.
Proof
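By the Pythagorean theorem: $\sum_i \|x_i - P x_i\|^2 = \sum_i \|x_i\|^2 - \sum_i \|P x_i\|^2$.
Maximize: $\frac{1}{n} \sum_i \|P x_i\|^2 = \frac{1}{n} \sum_i \mathrm{Tr}\left[P x_i x_i^\top P^\top\right] = \mathrm{Tr}[P S P^\top]$.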
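The PCA result can be checked numerically. The following is a minimal NumPy sketch, not code from the talk: the synthetic data, the helper name `pca_projection`, and the choice $k = 2$ are illustrative assumptions. It builds $P = U_{(k)} U_{(k)}^\top$ from the sample covariance and compares the mean squared residual with the sum of the discarded eigenvalues, which is what the proof predicts.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k orthogonal projection P = U_k U_k^T for centered rows of X (hypothetical helper)."""
    n = X.shape[0]
    S = X.T @ X / n                        # sample covariance of centered data
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    U_k = eigvecs[:, ::-1][:, :k]          # top-k eigenvectors
    return U_k @ U_k.T, eigvals[::-1]

rng = np.random.default_rng(0)
# Synthetic data with very different variances per coordinate (an arbitrary choice).
X = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
X -= X.mean(axis=0)

P, eigvals = pca_projection(X, k=2)
residual = np.mean(np.sum((X - X @ P) ** 2, axis=1))   # (1/n) sum_i ||x_i - P x_i||^2

# Compare with an arbitrary rank-2 orthogonal projection.
Q, _ = np.linalg.qr(rng.normal(size=(6, 2)))
residual_random = np.mean(np.sum((X - X @ (Q @ Q.T)) ** 2, axis=1))

print(residual, eigvals[2:].sum(), residual_random)
```

The residual of the PCA projection should match the sum of the eigenvalues outside the top $k$, and no other rank-$k$ orthogonal projection should do better.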
Review of linear algebra
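For a matrix $S$, $u$ is an eigenvector if $Su = \lambda u$, and $\lambda$ is the corresponding eigenvalue.
For symmetric $S \in \mathbb{R}^{d \times d}$, there are $d$ eigenvalues.
$S = \sum_{i \in [d]} \lambda_i u_i u_i^\top$. $U = [u_1 | \dots | u_d]$ is orthogonal.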
Rayleigh Quotient
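For a symmetric matrix $S$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ and corresponding eigenvectors $u_1, \dots, u_d$,
$$\max_{\|z\| = 1} z^\top S z = \lambda_1, \qquad \min_{\|z\| = 1} z^\top S z = \lambda_d,$$
and the optimizing vectors are $u_1$ and $u_d$.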
Optimal Projection
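$$\max_{P:\; P^2 = P = P^\top,\; \mathrm{Rank}(P) = k} \mathrm{Tr}(P^\top S P) = \lambda_1 + \lambda_2 + \dots + \lambda_k,$$
and the range of $P$ is spanned by $\{u_1, \dots, u_k\}$.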
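A quick numerical illustration of the Rayleigh quotient and the optimal projection, as a sketch under assumptions: the test matrix, the number of random trial vectors, and $k = 2$ are arbitrary choices made here, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
S = B @ B.T                                # symmetric PSD test matrix
eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues

def rayleigh(S, z):
    """z^T S z evaluated on the unit vector z / ||z||."""
    z = z / np.linalg.norm(z)
    return z @ S @ z

# Rayleigh quotient on many random directions stays between lambda_d and lambda_1.
values = np.array([rayleigh(S, z) for z in rng.normal(size=(1000, 5))])
top = rayleigh(S, eigvecs[:, -1])          # attained at u_1
bottom = rayleigh(S, eigvecs[:, 0])        # attained at u_d

# Trace of S restricted to the top-k eigenvectors equals the sum of the top-k eigenvalues.
k = 2
U_k = eigvecs[:, -k:]
P = U_k @ U_k.T
trace_top_k = np.trace(P.T @ S @ P)

print(values.min(), bottom, values.max(), top, trace_top_k)
```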
PCA on Gaussian Mixtures
$k$ Gaussians: each sample is $x = Ah + z$.
$h \in \{e_1, \dots, e_k\}$, the basis vectors. $\mathbb{E}[h] = w$.
$A \in \mathbb{R}^{d \times k}$: columns are component means.
Let $\mu := Aw$ be the mean.
$z \sim \mathcal{N}(0, \sigma^2 I)$ is white Gaussian noise.
$$\mathbb{E}[(x - \mu)(x - \mu)^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I.$$
How the above equation is obtained
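$$\begin{aligned}
\mathbb{E}[(x - \mu)(x - \mu)^\top] &= \mathbb{E}[(Ah - \mu)(Ah - \mu)^\top] + \mathbb{E}[z z^\top] \\
&= \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I.
\end{aligned}$$
The cross terms vanish because $z$ has zero mean and is independent of $h$.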
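The covariance formula can be compared against samples. This is a minimal sketch under assumed parameters: the dimensions, weights, noise level, and sample size below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, sigma, n = 4, 3, 0.5, 200_000        # illustrative values, not from the talk
A = rng.normal(size=(d, k))                # columns a_i are the component means
w = np.array([0.5, 0.3, 0.2])              # mixing weights, E[h] = w

# Sample x = A h + z, where h = e_i with probability w_i.
labels = rng.choice(k, size=n, p=w)
X = A[:, labels].T + sigma * rng.normal(size=(n, d))

mu = A @ w
centered_means = A - mu[:, None]           # columns a_i - mu
population_cov = (centered_means * w) @ centered_means.T + sigma**2 * np.eye(d)
empirical_cov = np.cov(X, rowvar=False, bias=True)

print(np.abs(empirical_cov - population_cov).max())
```

With this many samples the largest entrywise gap between the two matrices should be small, on the order of $1/\sqrt{n}$.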
The vectors $\{a_i - \mu\}$ are linearly dependent: $\sum_i w_i (a_i - \mu) = 0$. The PSD matrix $\sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top$ has rank $\le k - 1$.
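$(k-1)$-PCA on the covariance matrix, together with $\mu$, yields $\mathrm{span}(A)$: the top $k-1$ eigenvectors span $\{a_i - \mu\}$, and adding $\mu$ recovers each $a_i$.
The lowest eigenvalue of the covariance matrix yields $\sigma^2$ (this requires $d > k - 1$).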
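Both claims can be verified on the exact (population) covariance. The sketch below assumes illustrative values for $d$, $k$, $\sigma$ and random component means; it recovers $\sigma^2$ from the smallest eigenvalue and checks that the top $k-1$ eigenvectors together with $\mu$ contain every $a_i$ in their span.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, sigma = 6, 3, 0.7                    # illustrative values with d > k - 1
A = rng.normal(size=(d, k))
w = np.array([0.5, 0.3, 0.2])
mu = A @ w

C = A - mu[:, None]
cov = (C * w) @ C.T + sigma**2 * np.eye(d) # population covariance from the formula above

eigvals, eigvecs = np.linalg.eigh(cov)     # ascending order
sigma2_hat = eigvals[0]                    # lowest eigenvalue
top = eigvecs[:, -(k - 1):]                # top (k-1) principal directions

# Orthonormal basis for span(top-(k-1) eigenvectors together with mu).
basis, _ = np.linalg.qr(np.column_stack([top, mu]))
residual = A - basis @ (basis.T @ A)       # part of each a_i outside that span

print(sigma2_hat, np.abs(residual).max())
```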
How to Learn A?
Learning through Spectral Clustering
Learning A through Spectral Clustering
Project samples x onto span(A).
Distance-based clustering (e.g. k-means).
A series of works, e.g. Vempala & Wang.
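The two steps above can be sketched as follows. This is an illustration only, not the algorithm analyzed by Vempala & Wang: the `kmeans` helper is a plain Lloyd iteration with farthest-point initialization written for this example, and the component means are deliberately chosen to be well separated relative to the noise.

```python
import numpy as np

def kmeans(Y, k, n_iter=25):
    """Plain Lloyd iterations with farthest-point initialization (hypothetical helper)."""
    centers = [Y[0]]
    for _ in range(k - 1):
        dists = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((Y[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        centers = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

rng = np.random.default_rng(4)
d, k, sigma, n = 20, 3, 0.5, 3000          # illustrative values
A = np.zeros((d, k))
A[:k, :k] = 10.0 * np.eye(k)               # well-separated component means (assumption)
labels_true = rng.integers(k, size=n)      # uniform mixing weights
X = A[:, labels_true].T + sigma * rng.normal(size=(n, d))

# Step 1: estimate span(A) from (k-1)-PCA on the sample covariance plus the sample mean.
mu_hat = X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
basis, _ = np.linalg.qr(np.column_stack([eigvecs[:, -(k - 1):], mu_hat]))
Y = X @ basis                              # coordinates of the projected samples

# Step 2: distance-based clustering in the k-dimensional subspace.
centers, labels = kmeans(Y, k)
A_hat = basis @ centers.T                  # estimated means in R^d, columns in arbitrary order
print(np.round(A_hat[:k], 2))
```

Raising `sigma` or moving the means closer together makes the clusters overlap, which is the failure mode under large variance.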
Failure to cluster under large variance.
Learning Gaussian Mixtures Without Separation Constraints?
Beyond PCA: Spectral Methods on Tensors
How to learn the component means A (not just its span) without separation constraints?
PCA is a spectral method on (covariance) matrices.
Are higher order moments helpful?
What if the number of components is greater than the observed dimensionality, k > d?
Do higher order moments help to learn overcomplete models?
What if the data is not Gaussian?
Moment-based estimation of probabilistic latent variable models?
Tensor Notation for Higher Order Moments
Multi-variate higher order moments form tensors.
Are there spectral operations on tensors akin to PCA?
Matrix
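$\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor.
$\mathbb{E}[x \otimes x]_{i_1, i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$.
For matrices: $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$.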
Tensor
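$\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor.
$\mathbb{E}[x \otimes x \otimes x]_{i_1, i_2, i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.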
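In code, empirical moment tensors are averages of outer products. A minimal sketch with `numpy.einsum`, on arbitrary synthetic samples chosen here only to show the indexing:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))             # n samples in R^d with d = 3 (illustrative)
n = len(X)

second = np.einsum('ni,nj->ij', X, X) / n            # empirical E[x (x) x], shape (d, d)
third = np.einsum('ni,nj,nk->ijk', X, X, X) / n      # empirical E[x (x) x (x) x], shape (d, d, d)

# Entry (i1, i2, i3) of the third order moment is the average of x_i1 * x_i2 * x_i3.
entry = np.mean(X[:, 0] * X[:, 1] * X[:, 2])
print(third.shape, third[0, 1, 2], entry)
```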
Third order moment for Gaussian mixtures
Consider a mixture of $k$ Gaussians: each sample is $x = Ah + z$.
$h \in \{e_1, \dots, e_k\}$, the basis vectors. $\mathbb{E}[h] = w$.
$A \in \mathbb{R}^{d \times k}$: columns are component means. Let $\mu := Aw$ be the mean.
$z \sim \mathcal{N}(0, \sigma^2 I)$ is white Gaussian noise.
$$\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$$
Intuition behind equation
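$$\begin{aligned}
\mathbb{E}[x \otimes x \otimes x] &= \mathbb{E}[(Ah) \otimes (Ah) \otimes (Ah)] + \mathbb{E}[(Ah) \otimes z \otimes z] + \dots \\
&= \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i \mu \otimes e_i \otimes e_i + \dots
\end{aligned}$$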
How to recover parameters A and w from third order moment?
Simplifications
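$$\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$$
$\sigma^2$ is obtained from $\sigma_{\min}(\mathrm{Cov}(x))$.
$\mathbb{E}[x \otimes x] = \sum_i w_i\, a_i \otimes a_i + \sigma^2 I$.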
Can obtain
$$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i, \qquad M_2 = \sum_i w_i\, a_i \otimes a_i.$$
How to obtain parameters A and w from M2 and M3?
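The simplification step can be sketched on samples. One assumption is made explicit here: the terms elided as "…" on the slide are taken to be the other two placements of $\mu$, that is $e_i \otimes \mu \otimes e_i$ and $e_i \otimes e_i \otimes \mu$, which is what expanding $(Ah + z)^{\otimes 3}$ gives for zero-mean noise independent of $h$. The parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, sigma, n = 3, 2, 0.3, 400_000        # illustrative values with d > k - 1
A = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-0.6, 0.4]])                # columns are the component means
w = np.array([0.6, 0.4])

labels = rng.choice(k, size=n, p=w)
X = A[:, labels].T + sigma * rng.normal(size=(n, d))

mu_hat = X.mean(axis=0)
raw2 = X.T @ X / n                                     # empirical E[x (x) x]
raw3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n         # empirical E[x (x) x (x) x]

# sigma^2 from the smallest eigenvalue of the sample covariance.
sigma2_hat = np.linalg.eigvalsh(np.cov(X, rowvar=False))[0]

# Assumed symmetrized correction: sum_i (mu (x) e_i (x) e_i + e_i (x) mu (x) e_i + e_i (x) e_i (x) mu).
I = np.eye(d)
correction = (np.einsum('i,jk->ijk', mu_hat, I)
              + np.einsum('j,ik->ijk', mu_hat, I)
              + np.einsum('k,ij->ijk', mu_hat, I))

M2_hat = raw2 - sigma2_hat * I
M3_hat = raw3 - sigma2_hat * correction

# Ground truth for comparison.
M2 = (A * w) @ A.T
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)
print(np.abs(M2_hat - M2).max(), np.abs(M3_hat - M3).max())
```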
Tensor Slices
$$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i, \qquad M_2 = \sum_i w_i\, a_i \otimes a_i.$$
Multilinear transformation of tensor
$$M_3(B, C, D) := \sum_i w_i\, (B^\top a_i) \otimes (C^\top a_i) \otimes (D^\top a_i)$$
Slice of a tensor
$$M_3(I, I, r) = \sum_i w_i \langle a_i, r \rangle\, a_i \otimes a_i = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$$
$$M_2 = A\, \mathrm{Diag}(w)\, A^\top$$
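A small sketch of the multilinear map and the slice identity, assuming random illustrative parameters. The helper `multilinear` is a name introduced here; it contracts each mode of the tensor with the corresponding matrix.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 5, 3                                # illustrative sizes
A = rng.normal(size=(d, k))
w = np.array([0.5, 0.3, 0.2])
r = rng.normal(size=d)

M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)   # sum_i w_i a_i (x) a_i (x) a_i

def multilinear(T, B, C, D):
    """T(B, C, D): contract mode 1 with B, mode 2 with C, mode 3 with D (hypothetical helper)."""
    return np.einsum('abc,ai,bj,cl->ijl', T, B, C, D)

I = np.eye(d)
slice_r = multilinear(M3, I, I, r[:, None])[:, :, 0]     # M3(I, I, r), a d x d matrix
factored = A @ np.diag(w) @ np.diag(A.T @ r) @ A.T       # A Diag(w) Diag(A^T r) A^T

print(np.abs(slice_r - factored).max())
```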
Eigen-decomposition
$M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $\quad M_2 = A\, \mathrm{Diag}(w)\, A^\top$.
Assumption: $A \in \mathbb{R}^{d \times k}$ has full column rank.
Let $M_2 = U \Lambda U^\top$ be the eigendecomposition, with $U \in \mathbb{R}^{d \times k}$.
$U^\top M_2 U = \Lambda \in \mathbb{R}^{k \times k}$ is invertible.
$$X = \left(U^\top M_3(I, I, r)\, U\right) \left(U^\top M_2 U\right)^{-1} = V \cdot \mathrm{Diag}(\tilde{\lambda}) \cdot V^{-1}.$$
Substitution: $X = (U^\top A)\, \mathrm{Diag}(A^\top r)\, (U^\top A)^{-1}$.
We have $v_i \propto U^\top a_i$.
Technical Detail
r = Uθ and θ drawn uniformly from sphere to ensure eigen gap.
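The substitution step can be spelled out as follows. This is a sketch that uses only the two moment forms on this slide, plus two assumptions: that $U^\top A$ is a square, invertible $k \times k$ matrix (true when $A$ has full column rank and $U$ spans its column space), and that every $w_i > 0$.

```latex
\begin{aligned}
X &= \left(U^\top M_3(I,I,r)\,U\right)\left(U^\top M_2 U\right)^{-1} \\
  &= (U^\top A)\,\mathrm{Diag}(w)\,\mathrm{Diag}(A^\top r)\,(A^\top U)\,
     \left[(U^\top A)\,\mathrm{Diag}(w)\,(A^\top U)\right]^{-1} \\
  &= (U^\top A)\,\mathrm{Diag}(w)\,\mathrm{Diag}(A^\top r)\,(A^\top U)\,
     (A^\top U)^{-1}\,\mathrm{Diag}(w)^{-1}\,(U^\top A)^{-1} \\
  &= (U^\top A)\,\mathrm{Diag}(A^\top r)\,(U^\top A)^{-1}.
\end{aligned}
```

The last line uses the fact that diagonal matrices commute. It shows that the eigenvalues of $X$ are $\tilde\lambda_i = \langle a_i, r \rangle$ and that the eigenvectors are the columns of $U^\top A$, which is the statement $v_i \propto U^\top a_i$.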
![Page 58: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/58.jpg)
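Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices
$M_3(I, I, r) = A\,\mathrm{Diag}(w)\,\mathrm{Diag}(A^\top r)\,A^\top$, $\quad M_2 = A\,\mathrm{Diag}(w)\,A^\top$.
$\left(U^\top M_3(I, I, r)\,U\right)\left(U^\top M_2 U\right)^{-1} = V\,\mathrm{Diag}(\tilde\lambda)\,V^{-1}$.
Recovery of A (method 1)
Since $v_i \propto U^\top a_i$, recover $A$ up to scale: $a_i \propto U v_i$.
Recovery of A (method 2)
Let $\Theta \in \mathbb{R}^{k \times k}$ be a random rotation matrix.
Consider the $k$ slices $M_3(I, I, \theta_i)$ and find the eigen-decomposition above for each.
Let $\tilde\Lambda = [\tilde\lambda_1 | \dots | \tilde\lambda_k]$ be the matrix of eigenvalues of all slices.
Recover $A$ as: $A = U\,\Theta^{-1}\tilde\Lambda$.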
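Method 1 can be tried numerically. The following is a minimal sketch, not the authors' code: it builds exact population moments from a made-up `A` and `w` (both hypothetical), so it illustrates the algebra only and says nothing about behavior with sampled moments. It assumes MATLAB R2017b or later for `vecnorm` and implicit expansion.

```matlab
% Sketch: recover the columns of A (up to scale) from exact M2 and one slice of M3.
rng(0);                                   % fixed seed, for reproducibility only
d = 5; k = 3;
A = randn(d, k);                          % hypothetical component means
w = [0.5; 0.3; 0.2];                      % hypothetical mixing weights

M2 = A * diag(w) * A';                    % M2 = A Diag(w) A'
[Uall, ~, ~] = svd(M2);                   % M2 is symmetric PSD, so the SVD gives its eigenvectors
U = Uall(:, 1:k);                         % top-k eigenvectors span the column space of A

theta = randn(k, 1); theta = theta / norm(theta);   % uniform direction on the unit sphere
r = U * theta;
M3r = A * diag(w) * diag(A' * r) * A';    % the slice M3(I, I, r)

X = (U' * M3r * U) / (U' * M2 * U);       % X = (U' M3(I,I,r) U) (U' M2 U)^{-1}
[V, ~] = eig(X);                          % X is not symmetric, so V is not orthonormal
A_hat = real(U * V);                      % a_i is proportional to U v_i

% Each recovered column should be parallel to one true column (|cosine| close to 1).
C = abs((A_hat ./ vecnorm(A_hat))' * (A ./ vecnorm(A)));
disp(max(C, [], 2)');
```

The columns come back in arbitrary order and with arbitrary scale and sign, which is why the check compares absolute cosines rather than the vectors themselves.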
![Page 63: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/63.jpg)
Putting it together
Implications
Learn Gaussian mixtures through slices of third-order moment.
Guaranteed learning through eigen decomposition.
“A Method of Moments for Mixture Models and Hidden Markov Models” by A. Anandkumar, D. Hsu, and S. M. Kakade. Proc. of COLT, June 2012.
Shortcomings
The resulting product is not symmetric. The eigen-decomposition $V\,\mathrm{Diag}(\tilde\lambda)\,V^{-1}$ does not result in an orthonormal $V$. More involved in practice.
Requires a good eigen-gap in $\mathrm{Diag}(\tilde\lambda)$ for recovery. For $r = U\theta$, where $\theta$ is drawn uniformly from the unit sphere, the gap is $1/k^{2.5}$. Numerical instability in practice.
$M_3(I, I, r)$ is only a (random) slice of the tensor. Full information is not utilized.
More efficient learning methods using higher order moments?
![Page 64: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/64.jpg)
Outline
1 Introduction
2 Warm-up: PCA and Gaussian Mixtures
3 Higher order moments for Gaussian Mixtures
4 Tensor Factorization Algorithm
5 Learning Topic Models
6 Conclusion
![Page 67: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/67.jpg)
Tensor Factorization
Recover A and w from M2 and M3.
$M_2 = \sum_i w_i\, a_i \otimes a_i$, $\quad M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$.
[Figure: tensor $M_3$ = $w_1 \cdot a_1 \otimes a_1 \otimes a_1$ + $w_2 \cdot a_2 \otimes a_2 \otimes a_2$ + ...]
$a \otimes a \otimes a$ is a rank-1 tensor, since its $(i_1, i_2, i_3)$-th entry is $a_{i_1} a_{i_2} a_{i_3}$.
$M_3$ is a sum of rank-1 terms.
When is it the most compact representation? (Identifiability).
Can we recover the decomposition? (Algorithm?)
The most compact representation is known as the CP decomposition (CANDECOMP/PARAFAC).
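To make the rank-1 structure concrete, here is a small sketch that assembles $M_3$ from its rank-1 terms and checks one entry against the entrywise formula. The matrix `A` and weights `w` are made-up values chosen for illustration, and using `kron` plus `reshape` to form the outer product is one convenient choice among several.

```matlab
% Sketch: build M3 = sum_i w_i * a_i (x) a_i (x) a_i and check one entry.
d = 3;
A = [1 0; 1 1; 0 2];        % hypothetical components a_1, a_2 as columns
w = [0.4; 0.6];             % hypothetical weights

M3 = zeros(d, d, d);
for i = 1:size(A, 2)
    a = A(:, i);
    % reshape(kron(a, kron(a, a)), [d d d]) has (i1,i2,i3) entry a(i1)*a(i2)*a(i3)
    M3 = M3 + w(i) * reshape(kron(a, kron(a, a)), [d d d]);
end

% Entry (2,1,2) should equal sum_i w_i * a_i(2) * a_i(1) * a_i(2).
expected = sum(w' .* A(2, :) .* A(1, :) .* A(2, :));
fprintf('M3(2,1,2) = %g, expected %g\n', M3(2, 1, 2), expected);
```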
![Page 71: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/71.jpg)
Initial Ideas
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$.
What if $A$ has orthogonal (unit-norm) columns?
$M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1$.
$a_i$ are eigenvectors of the tensor $M_3$.
Analogous to matrix eigenvectors: $Mv = M(I, v) = \lambda v$.
Two Problems
How to find eigenvectors of a tensor?
A is not orthogonal in general.
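The eigenvector identity for the orthogonal case can be checked directly. This sketch assumes orthonormal columns (taken from a QR factorization of a random matrix) and made-up weights. The contraction $M_3(I, v, v)$ is computed by unfolding the tensor into a $d \times d^2$ matrix, which is an implementation choice for this example.

```matlab
% Sketch: for orthonormal columns, M3(I, a_1, a_1) = w_1 * a_1.
[Q, ~] = qr(randn(4));      % random 4x4 orthogonal matrix
A = Q(:, 1:3);              % hypothetical orthonormal components (d = 4, k = 3)
w = [0.5; 0.3; 0.2];        % hypothetical weights
d = size(A, 1);

M3 = zeros(d, d, d);
for i = 1:3
    a = A(:, i);
    M3 = M3 + w(i) * reshape(kron(a, kron(a, a)), [d d d]);
end

a1 = A(:, 1);
% Contract modes 2 and 3 with a1: unfold M3 to d x d^2 and multiply by kron(a1, a1).
lhs = reshape(M3, d, d^2) * kron(a1, a1);
fprintf('||M3(I,a1,a1) - w1*a1|| = %g\n', norm(lhs - w(1) * a1));
```

The printed norm should be at the level of floating-point round-off. If the columns were orthogonal but not unit-norm, the right-hand side would pick up a factor of $\|a_1\|^4$.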
![Page 72: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/72.jpg)
Whitening
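$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $\quad M_2 = \sum_i w_i\, a_i \otimes a_i$.
Find a whitening matrix $W$ s.t. $W^\top A = V$ is an orthogonal matrix.
When $A \in \mathbb{R}^{d \times k}$ has full column rank, it is an invertible transformation.
[Figure: $W$ maps the components $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$.]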
![Page 73: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/73.jpg)
Using Whitening to Obtain Orthogonal Tensor
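[Figure: multi-linear transform taking tensor $M_3$ to tensor $T$.]
$M_3 \in \mathbb{R}^{d \times d \times d}$ and $T \in \mathbb{R}^{k \times k \times k}$.
$T = M_3(W, W, W) = \sum_i w_i (W^\top a_i)^{\otimes 3}$.
$T = \sum_{i \in [k]} w_i \cdot v_i \otimes v_i \otimes v_i$ is orthogonal.
Dimensionality reduction when $k \ll d$.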
![Page 74: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/74.jpg)
How to Find Whitening Matrix?
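$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $\quad M_2 = \sum_i w_i\, a_i \otimes a_i$.
[Figure: $W$ maps the components $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$.]
Use pairwise moments $M_2$ to find $W$ s.t. $W^\top M_2 W = I$.
Eigen-decomposition $M_2 = U\,\mathrm{Diag}(\tilde\lambda)\,U^\top$, then $W = U\,\mathrm{Diag}(\tilde\lambda^{-1/2})$.
$V := W^\top A\,\mathrm{Diag}(w)^{1/2}$ is an orthogonal matrix.
$T = M_3(W, W, W) = \sum_i w_i^{-1/2} \left(W^\top a_i \sqrt{w_i}\right)^{\otimes 3} = \sum_i \lambda_i\, v_i \otimes v_i \otimes v_i$, $\quad \lambda_i := w_i^{-1/2}$.
$T$ is an orthogonal tensor.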
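The whitening construction can be sketched numerically as below. The inputs `A` and `w` are hypothetical and $M_2$ is the exact population moment, so this only checks the identities $W^\top M_2 W = I$ and $V^\top V = I$; with empirical moments the same steps would hold only approximately.

```matlab
% Sketch: whitening from an exact M2; all inputs below are hypothetical.
rng(1);                                   % fixed seed, for reproducibility only
d = 6; k = 3;
A = randn(d, k);                          % full column rank with probability 1
w = [0.2; 0.3; 0.5];                      % mixing weights

M2 = A * diag(w) * A';                    % M2 = sum_i w_i a_i a_i'
[Uall, Lall] = eig((M2 + M2') / 2);       % symmetrize to guard against round-off
[lam, idx] = sort(diag(Lall), 'descend');
U = Uall(:, idx(1:k));                    % top-k eigenvectors
W = U * diag(lam(1:k) .^ (-1/2));         % W = U Diag(lambda^{-1/2})

V = W' * A * diag(sqrt(w));               % V = W' A Diag(w)^{1/2}
fprintf('||W''*M2*W - I|| = %g\n', norm(W' * M2 * W - eye(k)));
fprintf('||V''*V - I||   = %g\n', norm(V' * V - eye(k)));
```

Both printed norms should be at round-off level. The second follows from the first because $V V^\top = W^\top A\,\mathrm{Diag}(w)\,A^\top W = W^\top M_2 W = I$ and $V$ is square.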
![Page 77: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/77.jpg)
Orthogonal Tensor Eigen Decomposition
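$T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, $\quad \langle v_i, v_j \rangle = \delta_{i,j}$ for all $i, j$.
$T(I, v_1, v_1) = \sum_i \lambda_i \langle v_i, v_1 \rangle^2 v_i = \lambda_1 v_1$.
$v_i$ are eigenvectors of the tensor $T$.
Tensor Power Method
Start from an initial vector $v$ and iterate $v \mapsto \dfrac{T(I, v, v)}{\|T(I, v, v)\|}$.
Canonical recovery method
Randomly initialize the power method. Run to convergence to obtain $v$ with eigenvalue $\lambda$.
Deflate: $T - \lambda\, v \otimes v \otimes v$ and repeat.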
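The power iteration and deflation steps can be sketched on an exactly orthogonal tensor as follows. The eigenvectors `Vtrue` and eigenvalues `lam_true` are made up for the example, and the iteration cap and tolerance are arbitrary choices. The analysis of noisy tensors is outside what this sketch shows.

```matlab
% Sketch: tensor power method with deflation on an exactly orthogonal tensor.
rng(2);                                   % fixed seed, for reproducibility only
k = 3;
[Vtrue, ~] = qr(randn(k));                % hypothetical orthonormal eigenvectors v_i
lam_true = [3; 2; 1];                     % hypothetical eigenvalues lambda_i

T = zeros(k, k, k);
for i = 1:k
    v = Vtrue(:, i);
    T = T + lam_true(i) * reshape(kron(v, kron(v, v)), [k k k]);
end

V = zeros(k, k); lam = zeros(k, 1);
for i = 1:k
    v = randn(k, 1); v = v / norm(v);     % random initialization
    for iter = 1:100
        u = reshape(T, k, k^2) * kron(v, v);    % u = T(I, v, v)
        v_next = u / norm(u);
        done = norm(v_next - v) < 1e-12;
        v = v_next;
        if done, break; end
    end
    V(:, i) = v;
    lam(i) = kron(v, kron(v, v))' * T(:);       % lambda = T(v, v, v)
    % Deflate: T <- T - lambda * v (x) v (x) v
    T = T - lam(i) * reshape(kron(v, kron(v, v)), [k k k]);
end

disp(sort(lam, 'descend')');              % should be close to 3 2 1
```

The order in which the eigenpairs are found depends on the random initialization, so the estimates are sorted before display.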
![Page 78: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/78.jpg)
Putting it together
Gaussian mixture: $x = Ah + z$, where $E[h] = w$ and $z \sim \mathcal{N}(0, \sigma^2 I)$.
$M_2 = \sum_i w_i\, a_i \otimes a_i$, $\quad M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$.
Obtain whitening matrix W from SVD of M2.
Use W for multilinear transform: T = M3(W,W,W ).
Find eigenvectors of T through power method and deflation.
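One step the summary leaves implicit is how to map the eigenpairs $(\lambda_i, v_i)$ of $T$ back to $A$ and $w$. A sketch, assuming the whitening convention $V = W^\top A\,\mathrm{Diag}(w)^{1/2}$ and $\lambda_i = w_i^{-1/2}$ from the whitening slide, with $(W^\top)^{\dagger}$ denoting the Moore-Penrose pseudo-inverse (notation introduced here, not in the slides):

```latex
\begin{aligned}
v_i = \sqrt{w_i}\; W^\top a_i
  \;&\Longrightarrow\; W^\top a_i = \lambda_i v_i, \\
a_i &= \lambda_i\, (W^\top)^{\dagger} v_i
  \quad \text{(since $a_i$ lies in the column space of $W$)}, \\
w_i &= \lambda_i^{-2}.
\end{aligned}
```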
What about learning other latent variable models?
![Page 80: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/80.jpg)
Topic Modeling
![Page 81: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/81.jpg)
Probabilistic Topic Models
Useful abstraction for automatic categorization of documents
Observed: words. Hidden: topics.
Bag of words: order of words does not matter
Graphical model representation
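$l$ words in a document: $x_1, \dots, x_l$.
$h$: proportions of topics in a document.
Word $x_i$ is generated from topic $y_i$.
Exchangeability: $x_1 \perp\!\!\!\perp x_2 \perp\!\!\!\perp \dots \mid h$
$A(i, j) := P[x_m = i \mid y_m = j]$: topic-word matrix.
[Figure: graphical model with topic mixture $h$, topics $y_1, \dots, y_5$, and words $x_1, \dots, x_5$; each word $x_m$ is generated from its topic $y_m$ through $A$.]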
![Page 82: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/82.jpg)
Distribution of the topic proportions vector h
If there are $k$ topics, $h$ has a distribution over the simplex $\Delta^{k-1}$:
$\Delta^{k-1} := \{h \in \mathbb{R}^k : h_i \in [0, 1],\ \sum_i h_i = 1\}$.
Simplification for today: there is only one topic in each document.
◮ $h \in \{e_1, \dots, e_k\}$: basis vectors.
Tomorrow: Latent Dirichlet allocation.
![Page 84: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/84.jpg)
Formulation as Linear Models
Distribution of the words x1, x2, . . .
Order the $d$ words in the vocabulary. If $x_1$ is the $j$-th word, set $x_1 = e_j \in \mathbb{R}^d$.
Distribution of each $x_i$: supported on the vertices of $\Delta^{d-1}$.
Properties
$\Pr[x_1 \mid \text{topic } h = i] = E[x_1 \mid \text{topic } h = i] = a_i$
Linear Model: $E[x_i \mid h] = Ah$.
Multiview model: $h$ is fixed and multiple words ($x_i$) are generated.
![Page 85: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/85.jpg)
Geometric Picture for Topic Models
[Figure: the topic proportions vector ($h$) of a document.]
![Page 87: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/87.jpg)
Geometric Picture for Topic Models
[Figure: a single topic ($h$) is mapped through $A$ to generate the words $x_1, x_2, x_3, \dots$]
![Page 88: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/88.jpg)
Gaussian Mixtures vs. Topic Models
(Spherical) mixture of Gaussians:
$k$ means: $a_1, \dots, a_k$
Component $h = i$ with prob. $w_i$
Observe $x$ with spherical noise: $x = a_i + z$, $z \sim \mathcal{N}(0, \sigma_i^2 I)$
(Single) topic models:
$k$ topics: $a_1, \dots, a_k$
Topic $h = i$ with prob. $w_i$
Observe $l$ (exchangeable) words $x_1, x_2, \dots, x_l$ i.i.d. from $a_i$
Unified Linear Model: $E[x \mid h] = Ah$
Gaussian mixture: single view, spherical noise.
Topic model: multi-view, heteroskedastic noise.
Three words per document suffice for learning.
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $\quad M_2 = \sum_i w_i\, a_i \otimes a_i$.
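To see why three words suffice, one can check that cross moments of distinct words have the stated form. A sketch, assuming the single-topic model above ($h = e_i$ with probability $w_i$), with $x_1, x_2, x_3$ conditionally independent given $h$ and $E[x_m \mid h] = Ah$:

```latex
\begin{aligned}
E[x_1 \otimes x_2]
  &= E\big[\,E[x_1 \mid h] \otimes E[x_2 \mid h]\,\big]
   = E[Ah \otimes Ah]
   = \sum_i w_i\, a_i \otimes a_i = M_2, \\
E[x_1 \otimes x_2 \otimes x_3]
  &= E[Ah \otimes Ah \otimes Ah]
   = \sum_i w_i\, a_i \otimes a_i \otimes a_i = M_3.
\end{aligned}
```

The first equality in each line uses conditional independence given $h$, which is why the moments must be formed from distinct words rather than from one word repeated.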
![Page 90: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/90.jpg)
Recap: Basic Tensor Decomposition Method
Toy Example in MATLAB
Simulated Samples: Exchangeable Model
Whiten The Samples
◮ Second Order Moments
◮ Matrix Decomposition
Orthogonal Tensor Eigen Decomposition
◮ Third Order Moments
◮ Power Iteration
![Page 92: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/92.jpg)
Simulated Samples: Exchangeable Model
Model Parameters
Hidden State: $h \in$ basis $\{e_1, \dots, e_k\}$, $k = 2$
Observed States: $x_i \in$ basis $\{e_1, \dots, e_d\}$, $d = 3$
Conditional Independency: $x_1 \perp\!\!\!\perp x_2 \perp\!\!\!\perp x_3 \mid h$
Transition Matrix: $A$
Exchangeability: $E[x_i \mid h] = Ah$, $\forall i \in \{1, 2, 3\}$
Generate Samples Snippet
% Assumes n, k, d and A_true (d-by-k, columns summing to 1) are already defined.
h = zeros(n, k); x1 = zeros(n, d); x2 = zeros(n, d); x3 = zeros(n, d);
for t = 1 : n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end
![Page 93: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/93.jpg)
Whiten The Samples
Second Order Moments
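$M_2 = \frac{1}{n} \sum_t x_1^t \otimes x_2^t$
Whitening Matrix
$W = U_w L_w^{-0.5}$, $\quad [U_w, L_w] = \text{k-svd}(M_2)$
Whiten Data
$y_1^t = W^\top x_1^t$
Orthogonal Basis
$V = W^\top A\,\mathrm{Diag}(w)^{1/2} \;\Rightarrow\; V^\top V = I$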
Whitening Snippet
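% Assumes x1, x2, x3 (n-by-d one-hot samples), n and k are already defined.
fprintf('The second order moment M2:\n');
M2 = x1' * x2 / n                           % no semicolon: display M2
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:\n'); Lw        % no semicolon: display Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));  % W = Uw * Lw^(-1/2) on the top-k part
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;      % whitened samples, one per row
[Figure: before whitening, $a_1 \not\perp a_2$; after whitening, $v_1 \perp v_2$.]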
![Page 94: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/94.jpg)
Orthogonal Tensor Eigen Decomposition
Third Order Moments
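$T = \frac{1}{n} \sum_{t \in [n]} y_1^t \otimes y_2^t \otimes y_3^t \approx \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, $\quad V^\top V = I$
Gradient Ascent
$T(I, v_1, v_1) = \frac{1}{n} \sum_{t \in [n]} \langle v_1, y_2^t \rangle \langle v_1, y_3^t \rangle\, y_1^t \approx \sum_i \lambda_i \langle v_i, v_1 \rangle^2 v_i = \lambda_1 v_1$.
$v_i$ are eigenvectors of the tensor $T$.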
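The sample-based formula for $T(I, v, v)$ avoids forming the $k \times k \times k$ tensor at all. The sketch below checks that it agrees with the explicit contraction. The matrices `y1`, `y2`, `y3` are random stand-ins for whitened samples (not drawn from the topic model), since only the algebraic identity is being illustrated.

```matlab
% Sketch: T(I, v, v) from samples, with and without forming T explicitly.
rng(3);                                   % fixed seed, for reproducibility only
n = 200; k = 2;
y1 = randn(n, k); y2 = randn(n, k); y3 = randn(n, k);   % stand-ins for whitened samples
v = randn(k, 1); v = v / norm(v);

% Implicit: (1/n) * sum_t <v, y2_t> <v, y3_t> y1_t, at O(n*k) cost per evaluation.
u_implicit = (y1' * ((y2 * v) .* (y3 * v))) / n;

% Explicit: build T = (1/n) * sum_t y1_t (x) y2_t (x) y3_t, then contract modes 2 and 3.
T = zeros(k, k, k);
for t = 1:n
    T = T + reshape(kron(y3(t, :)', kron(y2(t, :)', y1(t, :)')), [k k k]) / n;
end
u_explicit = reshape(T, k, k^2) * kron(v, v);

fprintf('difference = %g\n', norm(u_implicit - u_explicit));   % should be near round-off
```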
![Page 96: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/96.jpg)
Orthogonal Tensor Eigen Decomposition
$T \leftarrow T - \sum_j \lambda_j\, v_j^{\otimes 3}$, $\quad v \leftarrow \dfrac{T(I, v, v)}{\|T(I, v, v)\|}$
Power Iteration Snippet
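% Assumes y1, y2, y3 (n-by-k whitened samples), n, k, Maxiter and TOL are defined.
V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = v_old / norm(v_old);
    for iter = 1 : Maxiter
        % power update: T(I, v_old, v_old), computed directly from the samples
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;
        % deflation: subtract the components already found (empty loop when i = 1)
        for j = 1 : i-1
            v_new = v_new - Lambda(j) * (v_old' * V(:, j))^2 * V(:, j);
        end
        lambda = norm(v_new); v_new = v_new / lambda;
        converged = norm(v_old - v_new) < TOL;
        v_old = v_new;
        if converged
            fprintf('Converged at iteration %d.\n', iter);
            break;
        end
    end
    % store the estimate even if Maxiter was reached without convergence
    V(:, i) = v_old; Lambda(i, 1) = lambda;
end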
[Figure: estimates at iterations 0 to 5. Green: ground truth. Red: estimate at each iteration.]
![Page 97: Guaranteed Learning of Latent Variable Models through ...newport.eecs.uci.edu/anandkumar/pubs/MLSS-part1.pdf · Basic spectral approach: principal component analysis (PCA). Beyond](https://reader034.vdocuments.site/reader034/viewer/2022052004/6017b07d8f270b732f669b72/html5/thumbnails/97.jpg)
Conclusion
[Figure: graphical model with hidden variable $h$ and observations $x_1, x_2, x_3$.]
Gaussian mixtures and topic models. Unified linear representation.
Learning through higher order moments.
Tensor decomposition via whitening and power method.
Tomorrow’s lecture
Latent variable models and moments. Analysis of tensor power method.
Wednesday’s lecture: Implementation of tensor method.
Code snippet available at
http://newport.eecs.uci.edu/anandkumar/MLSS.html