An Introduction To Matrix Decomposition and Graphical Model
Lei Zhang, Lead Researcher, Microsoft Research Asia
2012-04-17
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval
  – Modeling Threaded Discussions
What Is Matrix Decomposition
• We wish to decompose the matrix A by writing it as a product of two or more matrices:
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
• Suppose A, B, C are column matrices
  – $A_{n \times m} = (a_1, a_2, \ldots, a_m)$: each $a_i$ is an n-dim data sample
  – $B_{n \times k} = (b_1, b_2, \ldots, b_k)$: each $b_j$ is an n-dim basis vector, and space B consists of k bases
  – $C_{k \times m} = (c_1, c_2, \ldots, c_m)$: each $c_i$ holds the k-dim coordinates of $a_i$ projected to space B
Why We Need Matrix Decomposition
• Given one data sample: $a_1 = B_{n \times k} c_1$
  $(a_{11}, a_{12}, \ldots, a_{1n})^T = (b_1, b_2, \ldots, b_k)(c_{11}, c_{12}, \ldots, c_{1k})^T$
• Another data sample: $a_2 = B_{n \times k} c_2$
• More data samples: $a_m = B_{n \times k} c_m$
• Together (m data samples): $(a_1, a_2, \ldots, a_m) = B_{n \times k}(c_1, c_2, \ldots, c_m)$
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
Why We Need Matrix Decomposition
$(a_1, a_2, \ldots, a_m) = B_{n \times k}(c_1, c_2, \ldots, c_m)$
$A_{n \times m} = B_{n \times k} C_{k \times m}$
• We wish to find a new set of basis vectors B to represent the data samples A; A becomes C in the new space
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples
• In PCA, B holds the eigenvectors
• In SVD, B holds the left (column) singular vectors
• In LDA, B holds the discriminant directions
• In NMF, B holds local features
PRINCIPAL COMPONENT ANALYSIS
Definition – Eigenvalue & Eigenvector
Given an m × m matrix C, if $Cw = \lambda w$ for some scalar λ and non-zero vector w,
then λ is called an eigenvalue and w is called an eigenvector of C.
Definition – Principal Component Analysis
  – Principal Component Analysis (PCA)
  – Karhunen-Loève transformation (KL transformation)
• Let A be an n × m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m × m): $C = A^T A$
Principal Component Analysis
• C can be decomposed as follows: $C = U \Lambda U^T$
• Λ is a diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; each $\lambda_i$ is an eigenvalue
• U is an orthogonal matrix; each column is an eigenvector
  $U^T U = I$, $U^{-1} = U^T$
Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• Projection of the data along w is Aw
• Variance: $\sigma_w^2 = (Aw)^T(Aw) = w^T A^T A w = w^T C w$
  where $C = A^T A$ is the covariance matrix of the data (A is centered)
• Task: maximize the variance subject to the constraint $w^T w = 1$
Optimization Problem
• Maximize $w^T C w - \lambda (w^T w - 1)$, where λ is the Lagrange multiplier
• Differentiating with respect to w yields $2Cw - 2\lambda w = 0$
• Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found
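The eigenvalue formulation above can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data (the data, seed and variable names are illustrative, not from the slides): it centers A, forms $C = A^T A$, and confirms that the top eigenvector satisfies $Cw = \lambda w$ and attains the largest projected variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples (rows) of 3 variables (columns), stretched along one axis.
A = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
A = A - A.mean(axis=0)             # center each column

C = A.T @ A                        # (unnormalized) covariance matrix, 3 x 3
eigvals, U = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]  # reorder to descending
eigvals, U = eigvals[order], U[:, order]

w1 = U[:, 0]                       # first principal direction, unit length
variance_along_w1 = w1 @ C @ w1    # w^T C w, equals the largest eigenvalue
```

The remaining columns of U are the subsequent principal directions; they are mutually orthogonal because C is symmetric.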
Property Data Decomposition
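• PCA can be treated as data decomposition
  $a = U U^T a$
  $= (u_1, u_2, \ldots, u_n)(u_1, u_2, \ldots, u_n)^T a$
  $= (u_1, u_2, \ldots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \ldots, \langle u_n, a \rangle)^T$
  $= (u_1, u_2, \ldots, u_n)(b_1, b_2, \ldots, b_n)^T$
  $= \sum_i b_i u_i$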
Face Recognition – Eigenface
• Turk, M. A., Pentland, A. P. Face recognition using eigenfaces. CVPR 1991 (citations: 2654)
• The eigenface approach
  – images are points in a vector space
  – use PCA to reduce dimensionality
  – face space
  – compare projections onto face space to recognize faces
PageRank – Power Iteration
• Column j has nonzero elements in positions corresponding to the outlinks of j ($N_j$ in total)
• Row i has nonzero elements in positions corresponding to the inlinks $I_i$
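Column-Stochastic & Irreducible
• Column-stochastic: all entries are non-negative and every column sums to one, i.e. $e^T A = e^T$
• where $e = (1, 1, \ldots, 1)^T$
• Irreducible: every page can be reached from every other page (the link graph is strongly connected)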
Iterative PageRank Calculation
• For k = 1, 2, …: $r_k = A r_{k-1}$
• Equivalently, the PageRank vector r satisfies $Ar = \lambda r$ (λ = 1; A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation $r_0$ in terms of the eigenvectors
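The convergence argument can be illustrated with a small sketch. If $r_0 = \sum_i c_i v_i$ in terms of the eigenvectors $v_i$ of A, then $A^k r_0 = \sum_i c_i \lambda_i^k v_i$, and the terms with $|\lambda_i| < 1$ die out, leaving the eigenvector for λ = 1. The 3-page link matrix below is a hypothetical example (not from the slides), chosen to be column-stochastic and irreducible.

```python
def power_iteration(A, num_iters=100):
    """Repeatedly apply r <- A r and renormalize so the entries sum to 1."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(num_iters):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(r)
        r = [x / total for x in r]
    return r

# Hypothetical 3-page web: column j holds the out-link probabilities of page j.
# Page 0 links to pages 1 and 2, page 1 links to pages 0 and 2, page 2 links to page 0.
A = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
r = power_iteration(A)  # converges to the stationary vector (4/9, 2/9, 3/9)
```

For this matrix the other eigenvalues have magnitude 0.5, so the error shrinks roughly by half per iteration.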
SINGULAR VALUE DECOMPOSITION
SVD - Definition
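• Any m × n matrix A with m ≥ n can be factorized as $A = U \Sigma V^T$
• where U (m × n) and V (n × n) have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: $A v_j = \sigma_j u_j$, $A^T u_j = \sigma_j v_j$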
Matrix approximation
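• Theorem: Let $U_k = (u_1, u_2, \ldots, u_k)$, $V_k = (v_1, v_2, \ldots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$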
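A short numerical check of the rank-k approximation theorem, as a sketch assuming NumPy and a random test matrix (sizes and seed are arbitrary choices): the spectral-norm error of the truncated SVD equals the first discarded singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Thin SVD: A = U @ diag(s) @ Vt, singular values s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation U_k Sigma_k V_k^T
err = np.linalg.norm(A - A_k, 2)              # spectral norm of the residual
# err matches s[k], i.e. sigma_{k+1} in the 1-based notation of the theorem
```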
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V holds the eigenvectors of $A^T A$
  – Each column of V is an eigenvector of $A^T A$, i.e. a principal direction of the row matrix A
  – We use V to approximate a row of A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors of $A A^T$
  – Each column of U is an eigenvector of $A A^T$, i.e. a principal direction of the column matrix A
  – We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
  – $U_k$: orthogonal basis that we use to approximate all the documents
  – $D_k$: column j holds the coordinates of document j in the new basis
  – $D_k$ is the projection of A onto the subspace spanned by $U_k$: $D_k = U_k^T A$
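The LSI steps above can be sketched on a toy corpus. Everything below (terms, counts, the query, and folding the query in via $U_k^T q$) is an illustrative assumption rather than material from the slides. The example shows a query for "car" matching a document that contains only "automobile" and "engine".

```python
import numpy as np

# Hypothetical term-by-document counts: rows = terms, columns = documents.
terms = ["car", "automobile", "engine", "cow", "sheep"]
A = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 1, 0, 0],   # automobile
    [1, 1, 0, 0, 0],   # engine
    [0, 0, 0, 1, 1],   # cow
    [0, 0, 0, 1, 1],   # sheep
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]                      # basis of the latent space
D_k = np.diag(s[:k]) @ Vt[:k, :]    # column j = coordinates of document j

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query containing only "car"
q_k = U_k.T @ q                              # project the query onto the same basis
scores = [cosine(q_k, D_k[:, j]) for j in range(A.shape[1])]
```

Document 1 never mentions "car", yet it scores close to 1 because it shares the latent "vehicle" direction; the animal documents score close to 0.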
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  – U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  – U holds the left (column) singular vectors
  – V holds the right (row) singular vectors
  – Λ holds the same eigenvalues (for a positive semi-definite A such as a covariance matrix)
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
  – SVD: A is directly the data, e.g. the term-by-document matrix
  – PCA: A is the covariance matrix, $A = X^T X$, where each row of X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  – Indexing: collecting terms
  – Use a stop list: eliminate "meaningless" words
  – Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
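• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
  $a_i = \sum_{j:\, j \to i} h_j$, $h_i = \sum_{j:\, i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a = L^T h$, $h = L a$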
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
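The link between HITS and SVD can be verified on a small graph. This is a sketch assuming NumPy and a hypothetical 4-page adjacency matrix (not from the slides): the alternating updates $a = L^T h$, $h = La$ with normalization converge to the first right and left singular vectors of L.

```python
import numpy as np

# Hypothetical 4-page graph: L[i, j] = 1 if page i links to page j.
L = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

h = np.ones(4)
for _ in range(200):
    a = L.T @ h                 # authority: sum of hub scores of pages linking in
    a /= np.linalg.norm(a)
    h = L @ a                   # hub: sum of authority scores of pages linked to
    h /= np.linalg.norm(h)

# Reference: first left/right singular vectors (their sign is arbitrary).
U, s, Vt = np.linalg.svd(L)
first_left, first_right = np.abs(U[:, 0]), np.abs(Vt[0])
```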
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k basis vectors; each column $W_i$ represents one basis
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
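• Cost function (Euclidean distance): $\|V - WH\|^2 = \sum_{ij} \left(V_{ij} - (WH)_{ij}\right)^2$
• Multiplicative update:
  $H_{a\mu} \leftarrow H_{a\mu} \dfrac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \dfrac{(V H^T)_{ia}}{(W H H^T)_{ia}}$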
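A minimal sketch of the multiplicative updates for the Euclidean cost, following the Lee and Seung update rules. The function name `nmf`, the small constant `eps` added to the denominators to avoid division by zero, and the synthetic input are assumptions made for this example.

```python
import numpy as np

def nmf(V, k, num_iters=200, eps=1e-9, seed=0):
    """Factorize non-negative V (n x m) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(num_iters):
        # Element-wise multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

rng = np.random.default_rng(1)
V = rng.random((8, 3)) @ rng.random((3, 10))   # synthetic non-negative matrix of rank 3
W, H, errors = nmf(V, k=3)
```

The recorded `errors` sequence is non-increasing, which is the property that makes these updates attractive despite their simplicity.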
Multiplicative Update Algorithm
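• Cost function (divergence): $D(A \| B) = \sum_{ij} \left( A_{ij} \log \dfrac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  – Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update
• PLSA is NMF with KL divergence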
NMF vs PCA
• n = 2429 faces, m = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians parameter estimation
• pLSA
  – Motivation
  – Derivation & geometry properties
  – Applications
• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta)$
• Predictive distribution: $p(x \mid D) \approx p(x \mid \theta_{ML})$
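IID – Independent Identically Distributed
• IID means $p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified, as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$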
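As a concrete instance of i.i.d. maximum likelihood, the sketch below assumes a one-dimensional Gaussian model with θ = (μ, σ²); the model choice and the sample values are illustrative. The log likelihood is a sum over samples, and the ML estimates are the sample mean and the (biased) sample variance.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """log L(theta) = sum_i log p(x_i | mu, sigma2) for i.i.d. Gaussian samples."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
        for x in xs
    )

xs = [2.1, 1.9, 2.4, 1.6, 2.0]

# Closed-form ML estimates for the Gaussian model.
mu_ml = sum(xs) / len(xs)
sigma2_ml = sum((x - mu_ml) ** 2 for x in xs) / len(xs)

best = gaussian_log_likelihood(xs, mu_ml, sigma2_ml)
```

Any other parameter setting, for example shifting the mean by 0.1, gives a lower value of `gaussian_log_likelihood` on the same data.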
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D
• Likelihood: $L(\theta) = p(D \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta L(\theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
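• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ:
  $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  $L(\theta) = \log \int q(X) \dfrac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \dfrac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$
• where H[q] is the entropy of q(X)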
The E and M Steps of EM
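• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
• E step: optimize w.r.t. the distribution over hidden variables, holding the parameters fixed:
  $q^{(k)} = \arg\max_q F(q, \theta^{(k-1)})$
• M step: maximize w.r.t. the parameters, holding the hidden distribution fixed:
  $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta)$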
The E Step
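• E step: for fixed θ,
  $F(q, \theta) = \int q(X) \log \dfrac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}\left(q(X) \,\|\, p(X \mid D, \theta)\right)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $q(X) = p(X \mid D, \theta)$
• So the E step simply sets $q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$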
The M Step
• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed:
  $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
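• The E and M steps together never decrease the log likelihood:
  $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
  where $q^{(k)}$ and $\theta^{(k)}$ are the values after the k-th E and M steps
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL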
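The monotonicity property can be observed directly. The sketch below runs EM on a two-component, one-dimensional Gaussian mixture; the data, the initialization and the variance floor are assumptions made for this example. The log likelihood is recorded at every iteration so that its non-decreasing behavior can be checked.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, num_iters=30):
    """EM for a 2-component 1-D Gaussian mixture; returns parameters and log likelihoods."""
    # Arbitrary starting point for theta = (pi, mu, var).
    pi, mu, var = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    log_liks = []
    for _ in range(num_iters):
        # E step: responsibilities q(z_n = k) = p(z_n = k | x_n, theta).
        resp, ll = [], 0.0
        for x in xs:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        log_liks.append(ll)   # log likelihood under the current parameters
        # M step: weighted ML estimates, treating responsibilities as soft counts.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return pi, mu, var, log_liks

xs = [-2.3, -1.9, -2.1, -1.7, -2.0, 2.9, 3.2, 3.0, 2.8, 3.1]
pi, mu, var, log_liks = em_gmm(xs)
```

On this well-separated data the component means settle near -2 and 3 after the first few iterations.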
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model is complex even with only a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world, but joint probability is hard to compute
  – Graphical models can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
  $p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general: $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same query varies due to personal style
• Latent semantic indexing
  – Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms
• Disadvantages
  – A statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • "Java" could mean "coffee" and also the "PL Java"
    • "Cricket" is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
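[Figure: pLSA graphical model. Left: plate notation d → z → w, with the inner plate repeated N_d times and the outer plate repeated M times. Right: the unrolled model, in which document d generates topics z1, z2, z3, …, zN and each topic zi generates word wi.]
• z1, …, zN are variables with zi ∈ [1, K], where K is the number of latent topics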
pLSA
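[Figure: the unrolled pLSA model for all documents. Each document d1, d2, …, dM generates its own topics z1, z2, …, z_Nd, and each topic generates the corresponding word w1, w2, …, w_Nd.]
• p(w | z=1), p(w | z=2), …, p(w | z=K) are shared by all documents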
Likelihood
Joint Probability vs Likelihood
• Joint probability: $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = \sum_z p(d)\, p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
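• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  $p(w \mid d) = Z_{V \times k}\, p(z \mid d)$, where column z of Z is the distribution $p(w \mid z)$ over the V vocabulary words
• With many documents, we hope to find latent topics as a common basis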
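pLSA – Objective Function
• pLSA tries to maximize the log likelihood
  $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)$ (up to a constant from the uniform p(d)), where n(d, w) is the count of word w in document d
• Due to the summation over z inside the log, we have to resort to EM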
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
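• The E-Step: $p(z \mid d, w) = \dfrac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$, $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$, where n(d, w) is the count of word w in document d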
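The E and M steps of pLSA fit in a few lines of NumPy. This is a sketch under stated assumptions: the function name `plsa`, the random initialization, the small constant `eps` guarding against division by zero, and the toy count matrix are all illustrative and not from the slides.

```python
import numpy as np

def plsa(N, K, num_iters=50, seed=0, eps=1e-12):
    """N[d, w] = count of word w in document d. Returns p(w|z), p(z|d), log likelihoods."""
    rng = np.random.default_rng(seed)
    D, V = N.shape
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    log_liks = []
    for _ in range(num_iters):
        # E step: p(z | d, w) is proportional to p(w | z) p(z | d); arrays have shape (D, K, V).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)                        # p(w | d), shape (D, V)
        log_liks.append(float((N * np.log(p_w_d + eps)).sum()))
        post = joint / (p_w_d[:, None, :] + eps)
        # M step: re-estimate both distributions from the expected counts n(d, w) p(z | d, w).
        weighted = N[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, log_liks

# Toy corpus: documents 0-1 mostly use words 0-2, documents 2-3 mostly use words 3-5.
N = np.array([
    [5, 4, 3, 0, 1, 0],
    [4, 5, 4, 1, 0, 0],
    [0, 1, 0, 5, 4, 4],
    [1, 0, 0, 4, 5, 3],
], dtype=float)
p_w_z, p_z_d, log_liks = plsa(N, K=2)
```

The product `p_z_d @ p_w_z` gives the reconstructed p(w | d) for every document, which is the matrix-decomposition view of the model.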
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In PLSA, by having K aspects
• Comparison to SVD
  – U matrix: related to P(z|d) (doc to aspect)
  – V matrix: related to P(w|z) (aspect to term)
  – Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious problems with over-fitting
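[Figure: the unrolled pLSA model. Each document d1, d2, …, dm has its own unconstrained topic distribution p(z|d1), p(z|d2), …, p(z|dm), from which its topics z1, z2, …, z_Nd and words w1, w2, …, w_Nd are generated.]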
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
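• Definition:
  $p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \dfrac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex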
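Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities on the simplex for α0 = 0.1, α0 = 1, α0 = 10]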
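The effect of the concentration can be seen by sampling. The sketch below assumes NumPy's `Generator.dirichlet`; the per-coordinate values 0.1, 1 and 10 are illustrative and are not claimed to be the exact settings used in the plots. Small values push samples toward the corners of the simplex, large values toward its center.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_max = {}
for a in (0.1, 1.0, 10.0):
    alpha = np.full(3, a)                        # equal alpha_i (illustrative values)
    samples = rng.dirichlet(alpha, size=2000)    # each row is a point on the 2-simplex
    # Average largest coordinate: near 1 means corners, near 1/3 means center.
    mean_max[a] = samples.max(axis=1).mean()
```

Comparing the entries of `mean_max` shows the largest coordinate shrinking toward 1/3 as the concentration grows.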
The LDA Model
[Figure: the LDA graphical model. For each document, topic proportions θ generate the topics z1, z2, z3, z4, which generate the words w1, w2, w3, w4; β is shared by all documents.]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    – Choose a topic zn ~ Multinomial(θ)
    – Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
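The generative process translates almost line by line into code. The following sketch assumes NumPy; the sizes K, V and N, the value of α, and the randomly drawn topic-word matrix β are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 8, 20                          # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)                     # Dirichlet parameter (illustrative)
beta = rng.dirichlet(np.ones(V), size=K)    # beta[k] = p(w | z = k), one row per topic

def generate_document(n_words):
    """Sample one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                          # topic proportions for this document
    z = rng.choice(K, size=n_words, p=theta)              # one topic per word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # one word per topic draw
    return theta, z, w

theta, z, w = generate_document(N)
```

Each call draws a fresh θ, which is what distinguishes LDA from pLSA: the per-document topic proportions are samples from a shared prior rather than free parameters.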
Joint Probability
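• Given the parameters α and β:
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where $p(z_n \mid \theta)$ is $\theta_i$ for the topic i selected by $z_n$
Likelihood
• Joint probability: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$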
Inference
• The likelihood can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is directly set to p(X | D, θ), which makes KL = 0
• In VBEM, p(X | D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  – E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  – M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  – β has a closed-form update
  – α can be estimated using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model, with topics z1 … z4 generating words w1 … w4 in each document and β shared by all documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
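• Incorporating category information
[Figure: plate diagram of the hierarchical model with variables π, z, x, θ and β, inner plate N_d and outer plate M]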
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – We need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
          Term1  Term2  Term3  Term4  …  TermN     (Dim = 1 million)
  Img1    1      2      0      0      …  2

  Topic Projection

          f1     f2     …      fM                  (Dim = 200)
  Img1    0.2    0.1    …      0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
  $p = Xw + \epsilon$
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
Orthogonal Decomposition
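  $p = Xw + \epsilon$
  $\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$
• X: base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
[Equation: the inner product $p^T q$ of two documents p and q, expressed through their low-dimensional vectors $w_p$, $w_q$ and their residuals, with base vectors X1, X2, X3, …, Xk]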
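To make the decomposition concrete, here is a sketch under a strong simplifying assumption: X has orthonormal columns and w is the orthogonal projection of p onto them. This is not claimed to be the construction used in the ICCV 2009 paper. Under that assumption the residual is orthogonal to the base vectors, so an inner product between two documents splits exactly into a low-dimensional part and a residual part.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 50, 5
# Hypothetical orthonormal base vectors, obtained from a QR factorization.
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))
p, q = rng.random(vocab_size), rng.random(vocab_size)   # two TF-IDF-like vectors

def decompose(v):
    """Split v into low-dimensional coordinates w and residual eps, with v = X w + eps."""
    w = X.T @ v
    eps = v - X @ w
    return w, eps

w_p, e_p = decompose(p)
w_q, e_q = decompose(q)

exact = p @ q
from_parts = w_p @ w_q + e_p @ e_q   # the cross terms vanish because X^T eps = 0
```

In a retrieval setting, only the few largest entries of each residual would be stored, which turns `from_parts` into an approximation of the exact similarity.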
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
  $p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional topic histogram + a few words) is looked up in the LSH index, which returns candidate documents (e.g. Doc 300, Doc 401, …). The candidates are then re-ranked using the document metadata (Doc Meta for Doc 1, Doc 2, …, Doc 300, Doc 401, …, Doc N, each holding its DS1, DS2, … entries).]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  – HITS
  – PageRank
  – …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrix methods
  – Graphical models: a good way to practice probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval
  – Modeling Threaded Discussions
What Is Matrix Decomposition
• We wish to decompose the matrix A by writing it as a product of two or more matrices:
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
• Suppose A, B, C are column matrices:
  – $A_{n \times m} = (a_1, a_2, \ldots, a_m)$; each $a_i$ is an n-dim data sample
  – $B_{n \times k} = (b_1, b_2, \ldots, b_k)$; each $b_j$ is an n-dim basis vector, and space B consists of k bases
  – $C_{k \times m} = (c_1, c_2, \ldots, c_m)$; each $c_i$ holds the k-dim coordinates of $a_i$ projected to space B
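A minimal sketch of the column view of A = BC, using a hypothetical toy example (n = 3, k = 2, m = 2; the numbers and the helper names `matmul` and `column` are illustrative, not from the slides). It shows that each column of A is the combination of the bases in B weighted by the matching column of C.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]

# Hypothetical bases: columns b1 = (1, 0, 1) and b2 = (0, 1, 1).
B = [[1, 0],
     [0, 1],
     [1, 1]]
# Hypothetical coordinates: columns c1 = (2, 3) and c2 = (0, 1).
C = [[2, 0],
     [3, 1]]

A = matmul(B, C)  # A = B C, shape 3 x 2

# The first sample a1 is 2 * b1 + 3 * b2, i.e. B times c1.
a1 = [2 * x + 3 * y for x, y in zip(column(B, 0), column(B, 1))]
```

Here `column(A, 0)` and `a1` are the same vector, which is the statement $a_1 = B c_1$ for this toy case.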
Why We Need Matrix Decomposition
• Given one data sample: $a_1 = B_{n \times k} c_1$
  $(a_{11}, a_{12}, \ldots, a_{1n})^T = (b_1, b_2, \ldots, b_k)(c_{11}, c_{12}, \ldots, c_{1k})^T$
• Another data sample: $a_2 = B_{n \times k} c_2$
• More data samples: $a_m = B_{n \times k} c_m$
• Together (m data samples): $(a_1, a_2, \ldots, a_m) = B_{n \times k}(c_1, c_2, \ldots, c_m)$
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
Why We Need Matrix Decomposition
  $(a_1, a_2, \ldots, a_m) = B_{n \times k}(c_1, c_2, \ldots, c_m)$
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
• We wish to find a new set of bases B to represent the data samples A; A becomes C in the new space
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples
• In PCA, B is the eigenvectors
• In SVD, B is the left (column) singular vectors
• In LDA, B is the discriminant directions
• In NMF, B is the local features
PRINCIPAL COMPONENT ANALYSIS
Definition – Eigenvalue & Eigenvector
Given an m × m matrix C, for any λ and w, if
  $Cw = \lambda w$
then λ is called an eigenvalue and w is called an eigenvector
Definition – Principal Component Analysis
  – Principal Component Analysis (PCA)
  – Karhunen-Loève transformation (KL transformation)
• Let A be an n × m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m × m): $C = A^T A$
Principal Component Analysis
• C can be decomposed as follows: $C = U \Lambda U^T$
• Λ is a diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; each $\lambda_i$ is an eigenvalue
• U is an orthogonal matrix; each column is an eigenvector
  $U^T U = I$, $U^{-1} = U^T$
Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• Projection of the data along w is Aw
• Variance: $\sigma_w^2 = (Aw)^T (Aw) = w^T A^T A w = w^T C w$
  where $C = A^T A$ is the covariance matrix of the data (A is centered)
• Task: maximize the variance subject to the constraint $w^T w = 1$
Optimization Problem
• Maximize $f(w) = w^T C w - \lambda (w^T w - 1)$
  where λ is the Lagrange multiplier
• Differentiating with respect to w yields $2Cw - 2\lambda w = 0$
• Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found
Property Data Decomposition
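• PCA can be treated as data decomposition:
  $a = U U^T a$
  $= (u_1, u_2, \ldots, u_n)(u_1, u_2, \ldots, u_n)^T a$
  $= (u_1, u_2, \ldots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \ldots, \langle u_n, a \rangle)^T$
  $= (u_1, u_2, \ldots, u_n)(b_1, b_2, \ldots, b_n)^T$
  $= \sum_i b_i u_i$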
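A small sketch of this decomposition in two dimensions. The orthonormal basis (the standard axes rotated by 30 degrees) and the sample `a` are assumptions chosen for illustration; in PCA the $u_i$ would be the eigenvectors of the covariance matrix.

```python
import math

def dot(x, y):
    """Inner product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical orthonormal basis of R^2: the standard axes rotated by 30 degrees.
t = math.radians(30)
u1 = (math.cos(t), math.sin(t))
u2 = (-math.sin(t), math.cos(t))

a = (3.0, 1.0)                       # a hypothetical data sample
b = [dot(u, a) for u in (u1, u2)]    # coordinates b_i = <u_i, a>

# Reconstruct a as the sum of b_i * u_i.
recon = [b[0] * u1[i] + b[1] * u2[i] for i in range(2)]
```

Keeping only the terms with the largest variance, instead of all of them, is what turns this exact decomposition into dimensionality reduction.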
Face Recognition ndash Eigenface
• Turk, M. A., Pentland, A. P. Face recognition using eigenfaces. CVPR 1991 (Citation: 2654)
• The eigenface approach
  – images are points in a vector space
  – use PCA to reduce dimensionality
  – face space
  – compare projections onto face space to recognize faces
PageRank ndash Power Iteration
• Column j has nonzero elements in the positions corresponding to the outlinks of j ($N_j$ in total)
• Row i has nonzero elements in the positions corresponding to the inlinks $I_i$
Column-Stochastic & Irreducible
• Column-stochastic: the entries of A are non-negative and every column sums to one, $e^T A = e^T$
• where $e = (1, 1, \ldots, 1)^T$
• Irreducible
Iterative PageRank Calculation
• For k = 1, 2, …: $r^{(k)} = A r^{(k-1)}$
• Equivalently, $r = A r$ (λ = 1; A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation $r^{(0)}$ in terms of the eigenvectors
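The power iteration can be sketched as follows. This is an illustration under stated assumptions, not the slides' exact formulation: the function name `pagerank`, the damping factor 0.85 (the random jump that makes the chain irreducible), the uniform handling of dangling pages, and the four-page link graph are all hypothetical.

```python
def pagerank(outlinks, damping=0.85, iters=100):
    """Power iteration r <- A r on a column-stochastic link matrix.

    outlinks[j] lists the pages that page j links to. Column j of the link
    matrix has the value 1/N_j in the rows of those pages.
    """
    n = len(outlinks)
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n          # random jump to any page
        for j, targets in enumerate(outlinks):
            if targets:
                share = damping * r[j] / len(targets)
                for i in targets:
                    new[i] += share
            else:                                 # dangling page: spread uniformly
                for i in range(n):
                    new[i] += damping * r[j] / n
        r = new
    return r

# Hypothetical graph: 0 -> 1, 2; 1 -> 2; 2 -> 0; 3 -> 2.
links = [[1, 2], [2], [0], [2]]
rank = pagerank(links)
```

Each step keeps the entries of `r` summing to one, so the iterate stays a probability vector and converges to the eigenvector with λ = 1. Page 2, which collects the most inlinks, ends up with the highest score, and page 3, which has no inlinks, with the lowest.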
SINGULAR VALUE DECOMPOSITION
SVD - Definition
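• Any m × n matrix A, with m ≥ n, can be factorized as
  $A = U \Sigma V^T$
• where U (m × n) and V (n × n) have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: $A v_j = \sigma_j u_j$, $A^T u_j = \sigma_j v_j$
Matrix approximation
• Theorem: Let $U_k = (u_1, u_2, \ldots, u_k)$, $V_k = (v_1, v_2, \ldots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define
  $A_k = U_k \Sigma_k V_k^T$
• Then
  $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k = U_k \Sigma_k V_k^T$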
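As a small worked check of the rank-k approximation theorem (the matrix is a hypothetical example, not from the slides), take a diagonal matrix whose SVD can be read off directly and approximate it with rank k = 1:

```latex
A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
U = V = I, \qquad \Sigma = \mathrm{diag}(3, 1)

A_1 = \sigma_1 u_1 v_1^T = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}

\|A - A_1\|_2 = \left\| \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \right\|_2 = 1 = \sigma_2
```

The approximation error equals the first discarded singular value.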
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix
• V is just the eigenvectors for A
  – Each column in V is an eigenvector of the row matrix A (that is, of $A^T A$)
  – We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U is just the eigenvectors for $A^T$
  – Each column in U is an eigenvector of the column matrix A (that is, of $A A^T$)
  – We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A \approx A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
  – $U_k$: orthogonal basis that we use to approximate all the documents
  – $D_k$: column j holds the coordinates of document j in the new basis
  – $D_k$ is the projection of A onto the subspace spanned by $U_k$
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  – U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  – U is the left (column) eigenvectors
  – V is the right (row) eigenvectors
  – Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
  – SVD: A is directly the data, e.g. the term-by-document matrix
  – PCA: A is the covariance matrix, $A = X^T X$; each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
   – Indexing: collecting terms
   – Use a stop list: eliminate "meaningless" words
   – Stemming
2. Construction of the term-by-document matrix; sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U \Sigma)(U \Sigma)^T$
  $U \Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V \Sigma)(V \Sigma)^T$
  $V \Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U \Sigma^{1/2})(V \Sigma^{1/2})^T$
  $U \Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V \Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
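A short derivation of why $U\Sigma$ and $V\Sigma$ are the projected coordinates, assuming the SVD $A = U \Sigma V^T$ with orthonormal columns in U and V:

```latex
A V = U \Sigma V^T V = U \Sigma
\qquad\text{(rows of } A \text{ expressed in the basis } V\text{)}

A^T U = V \Sigma U^T U = V \Sigma
\qquad\text{(columns of } A \text{ expressed in the basis } U\text{)}

(U \Sigma)(U \Sigma)^T = U \Sigma^2 U^T = A A^T, \qquad
(V \Sigma)(V \Sigma)^T = V \Sigma^2 V^T = A^T A
```

So inner products between the projected coordinates reproduce the term-to-term and document-to-document similarities exactly; truncating to the first k columns gives their rank-k approximations.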
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: Hubs, Authorities]
Power Iteration
• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
  $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}$, $h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now
  $a^{(k)} = L^T h^{(k-1)}$, $h^{(k)} = L a^{(k)}$
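A minimal sketch of the HITS iteration, assuming the convention that `L[i][j] = 1` when page i links to page j. The function name `hits`, the unit-length normalisation after each step, and the four-page graph are illustrative assumptions.

```python
import math

def hits(L, iters=50):
    """HITS power iteration; returns (hub scores, authority scores)."""
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # Authority update: a = L^T h (sum the hub scores of pages linking in).
        a = [sum(L[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hub update: h = L a (sum the authority scores of pages linked to).
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Normalise to unit length so the scores stay bounded.
        na = math.sqrt(sum(x * x for x in a))
        nh = math.sqrt(sum(x * x for x in h))
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return h, a

# Hypothetical graph: pages 0 and 1 link to pages 2 and 3; page 3 links to page 2.
L = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 1, 0]]
hub, auth = hits(L)
```

In this toy graph page 2 receives the largest authority score, pages 0 and 1 share the largest hub score, and page 2, which links nowhere, has hub score zero.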
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable, because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
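• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance
  $\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2$, with $A = V$ and $B = WH$
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
  $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  – Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence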
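The multiplicative updates for the Euclidean cost can be sketched in plain Python as below. This is an illustrative sketch rather than a reference implementation: the helper names, the random initialisation, the small `eps` guarding against division by zero, and the 3 × 3 matrix `V` are all assumptions.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frob_error(V, W, H):
    """Squared Euclidean distance between V and WH."""
    WH = matmul(W, H)
    return sum((v - a) ** 2 for rv, ra in zip(V, WH) for v, a in zip(rv, ra))

def nmf(V, k, iters=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [frob_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise, using the updated H
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(frob_error(V, W, H))
    return W, H, errors

# Hypothetical non-negative data matrix.
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 0.0, 1.0]]
W, H, errors = nmf(V, k=2)
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the recorded error does not increase from one iteration to the next.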
NMF vs PCA
• n = 2429 faces, m = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture Gaussian parameter estimation
• pLSA
  – Motivation
  – Derivation & geometry properties
  – Applications
• LDA
  – Motivation: why to add a hyper parameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point, and $D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model, with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood will find the θ that maximizes L(θ):
  $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \ldots, x_n \mid \theta)$
• Predictive distribution: $p(x \mid D) \approx p(x \mid \theta_{ML})$
IID – Independent Identically Distributed
• IID means $p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$
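As an illustration of maximum likelihood with i.i.d. samples, the sketch below fits a one-dimensional Gaussian. The choice of a Gaussian model, the five observations, and the function name `log_likelihood` are assumptions made for the example; for this model the maximizer of the log likelihood has a closed form (sample mean and mean squared deviation).

```python
import math

def log_likelihood(xs, mu, var):
    """Sum of log N(x | mu, var) over i.i.d. samples."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

xs = [2.1, 1.9, 2.4, 1.6, 2.0]                          # hypothetical observations

mu_ml = sum(xs) / len(xs)                               # ML estimate of the mean
var_ml = sum((x - mu_ml) ** 2 for x in xs) / len(xs)    # ML estimate of the variance
best = log_likelihood(xs, mu_ml, var_ml)
```

Evaluating `log_likelihood` at any other mean or variance gives a value no larger than `best`, which is the sense in which these parameters make the data "most likely".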
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA, LDA
More General
• Data set: $D = \{x_1, \ldots, x_n\}$
• Likelihood: $L(\theta) = p(D \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
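Jensen's Inequality
• For $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$ and any $x_i > 0$: $\log \left( \sum_i \alpha_i x_i \right) \ge \sum_i \alpha_i \log x_i$
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ:
  $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  $L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• where H[q] is the entropy of q(X)
The E and M Steps of EM
• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between:
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed:
  $q^{(k)}(X) = \arg\max_{q} F(q, \theta^{(k-1)})$
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed:
  $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta)$
The E Step
• E step: for fixed θ,
  $F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when $KL(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:
  $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood:
  $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• $F(q, \theta) \le L(\theta)$ by Jensen, or equivalently from the non-negativity of KL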
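The monotonicity property can be observed on a small example. The sketch below runs EM on a two-component, one-dimensional Gaussian mixture and records the log likelihood at every iteration. The mixture model, the data, the starting values, and the function names are assumptions chosen for illustration; the variance floor `1e-6` is a numerical safeguard that is not triggered on this data.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, mus, variances, weights, iters=30):
    """EM for a 1-D Gaussian mixture; returns parameters and the log-likelihood trace."""
    k = len(mus)
    trace = []
    for _ in range(iters):
        # E step: responsibilities q(z_n = j) = p(z_n = j | x_n, theta)
        resp = []
        loglik = 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        trace.append(loglik)
        # M step: re-estimate the parameters from the soft assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)
    return mus, variances, weights, trace

# Hypothetical data: two well separated clusters around -2 and +2.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
mus, variances, weights, trace = em_gmm(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The entries of `trace` never decrease, and the estimated means move from the starting values of -1 and 1 to the cluster centres near -2 and 2.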
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model is so complex, even with a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world. But joint probability is hard to compute
  – A graphical model can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
  $p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general:
  $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
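The factorization above can be checked numerically. In the sketch below the five variables are taken to be binary and every conditional probability table is made up for illustration; only the structure (which parents each node has) follows the formula $p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D)$.

```python
from itertools import product

# Hypothetical tables: each entry is P(variable = 1 | parents);
# P(variable = 0 | parents) is the complement.
p_a = 0.3
p_b = 0.6
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # keyed by (A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # keyed by (B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # keyed by (C, D)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) as a product of one factor per node given its parents."""
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
            * bern(p_d[(b, c)], d) * bern(p_e[(c, d)], e))

states = list(product((0, 1), repeat=5))
total = sum(joint(*s) for s in states)                # sums to 1
p_e1 = sum(joint(*s) for s in states if s[4] == 1)    # marginal P(E = 1)
```

The full joint over 5 binary variables would need 31 free numbers; the factorized form needs 1 + 1 + 4 + 4 + 4 = 14, which is the practical benefit of decoupling the joint into conditionals.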
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same queries vary due to personal styles
• Latent semantic indexing
  – Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
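[Figure: pLSA graphical model. Left: plate notation with nodes d, z, w, inner plate $N_d$ and outer plate M. Right: the same model unrolled for one document d, with topics $z_1, z_2, z_3, \ldots, z_N$ generating words $w_1, w_2, w_3, \ldots, w_N$]
$z_1, \ldots, z_N$ are variables, $z_i \in [1, K]$; K is the number of latent topics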
pLSA
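[Figure: the pLSA model unrolled for documents $d_1, d_2, \ldots, d_M$; document $d_i$ has topics $z_1, z_2, \ldots, z_{N_i}$ generating words $w_1, w_2, \ldots, w_{N_i}$]
$p(w \mid z=1), p(w \mid z=2), \ldots, p(w \mid z=K)$ are shared for all documents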
Likelihood
Joint Probability vs Likelihood
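• Joint probability: $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
  $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  $p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
  $L = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \left[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \right]$
  where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$, $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$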
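A compact sketch of pLSA fitted by EM, assuming the standard updates: the E step computes the posterior p(z|d,w) and the M step re-estimates p(w|z) and p(z|d) from count-weighted posteriors. The function name `plsa`, the random initialisation, and the tiny count matrix are hypothetical.

```python
import math
import random

def plsa(counts, K, iters=30, seed=0):
    """EM for pLSA on a document-by-word count matrix counts[d][w]."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])

    def normalise(row):
        s = sum(row)
        return [x / s for x in row]

    p_w_z = [normalise([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]  # p(w|z)
    p_z_d = [normalise([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]  # p(z|d)
    trace = []
    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        loglik = 0.0
        for d in range(D):
            for w in range(V):
                if counts[d][w] == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                total = sum(joint)                        # p(w|d)
                loglik += counts[d][w] * math.log(total)
                for z in range(K):
                    post = joint[z] / total               # E step: p(z|d,w)
                    new_wz[z][w] += counts[d][w] * post   # M step accumulators
                    new_zd[d][z] += counts[d][w] * post
        trace.append(loglik)
        p_w_z = [normalise(row) for row in new_wz]
        p_z_d = [normalise(row) for row in new_zd]
    return p_w_z, p_z_d, trace

# Hypothetical corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0],
          [3, 4, 0, 0],
          [0, 0, 4, 3],
          [0, 0, 3, 4]]
p_w_z, p_z_d, trace = plsa(counts, K=2)
```

`trace` holds the log likelihood $\sum_{d,w} n(d,w) \log p(w \mid d)$ (the constant p(d) term is dropped) and is non-decreasing across iterations, as EM guarantees.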
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In PLSA, by having K aspects
• Comparison to SVD
  – U matrix related to P(z|d) (doc to aspect)
  – V matrix related to P(w|z) (aspect to term)
  – Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z \mid d_i)$
• This easily leads to serious problems with over-fitting
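[Figure: the pLSA model unrolled for documents $d_1, d_2, \ldots, d_m$, each with its own topic distribution $p(z \mid d_1), p(z \mid d_2), \ldots, p(z \mid d_m)$ over topics $z_1, z_2, \ldots$ and words $w_1, w_2, \ldots$]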
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
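• Definition
  $p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{k} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex
• Various parameters α
[Figure: example densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]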
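A small sketch of the Dirichlet density and of sampling from it, using only the Python standard library. The function names are hypothetical; sampling by normalising independent Gamma variates is a standard construction and is used here as an assumption, since the slides do not describe a sampler.

```python
import math
import random

def dirichlet_pdf(x, alpha):
    """Density of Dirichlet(alpha) at a point x inside the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1) * math.log(xi) for a, xi in zip(alpha, x))
    return math.exp(log_norm + log_kernel)

def dirichlet_sample(alpha, rng):
    """Draw one sample by normalising independent Gamma(alpha_i, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(0)
theta = dirichlet_sample([6, 2, 2], rng)           # one set of mixture proportions
flat = dirichlet_pdf([0.2, 0.3, 0.5], [1, 1, 1])   # uniform case: density is Gamma(3) = 2
```

Every sample is a K-tuple of positive numbers summing to one, which is exactly the requirement for topic mixture proportions stated earlier.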
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
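• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: densities for $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$]
The LDA Model
[Figure: LDA graphical model unrolled over three documents, each with topics z1 … z4 and words w1 … w4, and a shared node b]
• For each document,
• Choose θ ~ Dirichlet(α)
• For each of the N words $w_n$:
  – Choose a topic $z_n$ ~ Multinomial(θ)
  – Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$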
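The generative process can be sketched as a sampler. Everything concrete here is hypothetical: the function names, the two-topic, four-word matrix `beta` (row z is the word distribution of topic z), the symmetric prior `alpha`, and the document length of 20.

```python
import random

def sample_discrete(probs, rng):
    """Draw an index i with probability probs[i]."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    theta = [v / total for v in g]            # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_discrete(theta, rng)       # z_n ~ Multinomial(theta)
        w = sample_discrete(beta[z], rng)     # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(1)
beta = [[0.7, 0.2, 0.1, 0.0],    # hypothetical topic 0: favours words 0 and 1
        [0.0, 0.1, 0.2, 0.7]]    # hypothetical topic 1: favours words 2 and 3
theta, topics, words = generate_document([0.5, 0.5], beta, 20, rng)
```

Unlike pLSA, the per-document proportions θ are drawn from a shared prior rather than being free parameters, so the number of model parameters does not grow with the number of documents.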
Joint Probability
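• Given the parameters α and β:
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where $p(z_n \mid \theta) = \theta_i$ for the topic i selected by $z_n$
Likelihood
• Joint probability:
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document:
  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents:
  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$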
Inference
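• The log likelihood of the corpus is computed by summing over the documents: $\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality in EM:
  $\log p(w \mid \alpha, \beta) \ge E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z)]$
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables:
  $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, $q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence:
  $\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta))$
• Maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), so that KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM):
• Lower bound $\log p(w \mid \alpha, \beta)$ by a function $L(\gamma, \phi; \alpha, \beta)$
• Repeat until convergence:
  – E: Maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to the variational parameters γ and φ
  – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference; repeat until convergence:
  $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$, $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
• M-Step: parameter estimation
  – β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$
  – α: can be found using the Newton-Raphson method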
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8000 documents and 15818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting. But the assumption might be too strong
[Figure: LDA graphical model unrolled over three documents, each with topics z1 … z4 and words w1 … w4, and a shared node b]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates $N_d$ and M]
Codebook
• 174 local image patches
• Detection:
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation:
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection. Img1 as a term vector over Term1, Term2, Term3, Term4, …, TermN = (1, 2, 0, 0, …, 2), Dim = 1 million, is projected to a feature vector (f1, f2, …, fM) = (0.2, 0.1, …, 0.03), Dim = 200]
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = (bar chart of the low dimensional feature vector) + a few words (10 words)]
Orthogonal Decomposition
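$p = Xw + \varepsilon$, written out for a vocabulary of size W and k base vectors:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
(X: base vectors; w: low dimensional representation; ε: residual)
[Figure: an image = (bar chart over the base vectors X1, X2, X3, …, Xk) + a few words (10 words)]
$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$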
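A toy check of the orthogonal decomposition, assuming the base vectors are orthonormal and each residual is orthogonal to all of them; under those assumptions the inner product of two documents splits into a low dimensional part plus a residual part. The base vectors, the two vectors `p` and `q`, and the function names are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Hypothetical orthonormal base vectors in a 4-dim vocabulary space (k = 2).
X = [(0.5, 0.5, 0.5, 0.5), (0.5, -0.5, 0.5, -0.5)]

def decompose(p):
    """Split p into coordinates w on the base vectors and a residual eps."""
    w = [dot(x, p) for x in X]                                  # low dimensional representation
    proj = [sum(wi * x[i] for wi, x in zip(w, X)) for i in range(len(p))]
    eps = [pi - pr for pi, pr in zip(p, proj)]                  # residual error
    return w, eps

p = (3.0, 1.0, 0.0, 2.0)
q = (1.0, 0.0, 2.0, 2.0)
wp, ep = decompose(p)
wq, eq = decompose(q)

full = dot(p, q)                       # similarity in vocabulary space
split = dot(wp, wq) + dot(ep, eq)      # low dimensional part + residual part
```

For these vectors both `full` and `split` equal 7, so ranking by the split form loses nothing as long as the residual is kept, which is the idea of preserving a few salient words alongside the topic vector.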
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query = (bar chart of the low dimensional feature vector) + a few words is looked up in an LSH index of document signatures (DS1, DS2, …); the candidates (e.g. Doc 300, Doc 401, …) are then re-ranked using the document metadata (Doc Meta) stored for Doc 1, Doc 2, …, Doc N]
Index: 10M images, 46GB
Search speed < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: Topics
Structure: Who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document Similarity
Topic Similarity
Structure Similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Project documents to topic space
SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                   Apple
         All Posts   Good Posts     All Posts   Good Posts
NP       0.021       0.012          0.289       0.239
RR       0.183       0.319          0.269       0.474
DS       0.463       0.643          0.409       0.628
LDA      0.465       0.644          0.410       0.648
SWB      0.463       0.644          0.410       0.641
SMSS     0.524       0.737          0.517       0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
…
Baselines
LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
PageRank: Benchmark nodal ranking method
HITS: Finds hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good practice to learn matrices
  – Graphical model: a good practice to learn probability
• The graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
-
What Is Matrix Decomposition
• We wish to decompose the matrix A by writing it as a product of two or more matrices:
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
• Suppose A, B, C are column matrices
  - $A_{n \times m} = (a_1, a_2, \ldots, a_m)$: each $a_i$ is an n-dim data sample
  - $B_{n \times k} = (b_1, b_2, \ldots, b_k)$: each $b_j$ is an n-dim basis vector, and space B consists of k bases
  - $C_{k \times m} = (c_1, c_2, \ldots, c_m)$: each $c_i$ is the k-dim coordinates of $a_i$ projected to space B
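As a small illustration of the shapes involved, the sketch below builds a data matrix A from a basis B and coordinates C. The numbers and the helper names `matmul` and `column` are made up for this example; they are not from the talk.

```python
def matmul(B, C):
    """Multiply an n x k matrix B by a k x m matrix C (both lists of rows)."""
    n, k, m = len(B), len(C), len(C[0])
    return [[sum(B[i][t] * C[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]

# Two basis vectors in 3-dim space, stored as the columns of B (n=3, k=2).
B = [[1, 0],
     [0, 1],
     [1, 1]]

# Coordinates of m=3 samples in that basis, one column per sample (k=2, m=3).
C = [[2, 0, 1],
     [0, 3, 1]]

A = matmul(B, C)  # n x m data matrix, one sample per column

# The third sample is 1*b1 + 1*b2.
a3 = column(A, 2)
```

Each column of A is a linear combination of the columns of B, with weights taken from the matching column of C.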
Why We Need Matrix Decomposition
• Given one data sample: $a_1 = B_{n \times k} c_1$
  $(a_{11}, a_{12}, \ldots, a_{1n})^T = (b_1, b_2, \ldots, b_k)(c_{11}, c_{12}, \ldots, c_{1k})^T$
• Another data sample: $a_2 = B_{n \times k} c_2$
• More data samples: $a_m = B_{n \times k} c_m$
• Together (m data samples): $(a_1, a_2, \ldots, a_m) = B_{n \times k} (c_1, c_2, \ldots, c_m)$
  $A_{n \times m} = B_{n \times k} C_{k \times m}$
Why We Need Matrix Decomposition
$(a_1, a_2, \ldots, a_m) = B_{n \times k} (c_1, c_2, \ldots, c_m)$
$A_{n \times m} = B_{n \times k} C_{k \times m}$
• We wish to find a new set of basis vectors B to represent the data samples A; A becomes C in the new space
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples
• In PCA: B is the eigenvectors
• In SVD: B is the left (column) singular vectors
• In LDA (linear discriminant analysis): B is the discriminant directions
• In NMF: B is local features
PRINCIPAL COMPONENT ANALYSIS
Definition - Eigenvalue & Eigenvector
Given an m × m matrix C, if for some scalar λ and nonzero vector w
$Cw = \lambda w$
then λ is called an eigenvalue and w is called an eigenvector of C
Definition - Principal Component Analysis
  - Principal Component Analysis (PCA)
  - Karhunen-Loeve transformation (KL transformation)
• Let A be an n × m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m × m): $C = A^T A$, up to a constant scaling factor
Principal Component Analysis
• C can be decomposed as follows: $C = U \Lambda U^T$
• Λ is a diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; each $\lambda_i$ is an eigenvalue
• U is an orthogonal matrix; each column is an eigenvector
  $U^T U = I$, $U^{-1} = U^T$
Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• Projection of the data along w is Aw
• Variance: $\sigma_w^2 = (Aw)^T (Aw) = w^T A^T A w = w^T C w$
  where $C = A^T A$ is the covariance matrix of the data (A is centered)
• Task: maximize the variance subject to the constraint $w^T w = 1$
Optimization Problem
• Maximize
  $w^T C w - \lambda (w^T w - 1)$
  where λ is the Lagrange multiplier
• Differentiating with respect to w yields
  $2Cw - 2\lambda w = 0$
• Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found
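The eigenvalue equation above can be solved numerically for the first principal component. The following is a minimal sketch using power iteration on $C = A^T A$; the function name `top_principal_component`, the starting vector, and the toy data are assumptions for illustration, and the data is assumed to be centered already.

```python
import math

def top_principal_component(A, iters=200):
    """Return (eigenvalue, unit eigenvector) of C = A^T A by power iteration.

    A is a list of rows (one sample per row) and is assumed to be centered.
    """
    n, m = len(A), len(A[0])
    # C = A^T A, an m x m matrix.
    C = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(m)]
         for i in range(m)]
    # Arbitrary start; it must not be orthogonal to the dominant eigenvector.
    w = [1.0] + [0.0] * (m - 1)
    for _ in range(iters):
        v = [sum(C[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]  # keep the constraint w^T w = 1
    Cw = [sum(C[i][j] * w[j] for j in range(m)) for i in range(m)]
    lam = sum(wi * ci for wi, ci in zip(w, Cw))  # variance w^T C w
    return lam, w

# Four centered 2-dim samples that spread mostly along the diagonal.
A = [[2.0, 1.0], [-2.0, -1.0], [1.0, 2.0], [-1.0, -2.0]]
lam, w = top_principal_component(A)
```

For this data $C = \begin{pmatrix} 10 & 8 \\ 8 & 10 \end{pmatrix}$, whose largest eigenvalue is 18 with eigenvector $(1, 1)/\sqrt{2}$, so the maximal-variance direction is the diagonal.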
Property Data Decomposition
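• PCA can be treated as data decomposition
  $a = U U^T a$
  $= (u_1, u_2, \ldots, u_n)(u_1, u_2, \ldots, u_n)^T a$
  $= (u_1, u_2, \ldots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \ldots, \langle u_n, a \rangle)^T$
  $= (u_1, u_2, \ldots, u_n)(b_1, b_2, \ldots, b_n)^T$
  $= \sum_i b_i u_i$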
Face Recognition - Eigenface
• Turk, M.A., Pentland, A.P. Face recognition using eigenfaces. CVPR 1991 (Citation: 2654)
• The eigenface approach
  - images are points in a vector space
  - use PCA to reduce dimensionality
  - face space
  - compare projections onto face space to recognize faces
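PageRank - Power Iteration
• Column j has nonzero elements in positions corresponding to outlinks of j ($N_j$ in total)
• Row i has nonzero elements in positions corresponding to inlinks $I_i$
Column-Stochastic & Irreducible
• Column-stochastic: all entries are non-negative and every column sums to one, $e^T A = e^T$
• where $e = (1, 1, \ldots, 1)^T$
• Irreducible: every page can be reached from every other page by following links
Iterative PageRank Calculation
• For k = 1, 2, …: $r^{(k)} = A r^{(k-1)}$
• Equivalently, solve $Ar = \lambda r$ (λ = 1; A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation $r_0$ in terms of the eigenvectors $v_i$ of A (assuming A has a full set of eigenvectors): $r_0 = \sum_i c_i v_i$, so $A^k r_0 = \sum_i c_i \lambda_i^k v_i$, which is dominated by the eigenvector with the largest $|\lambda_i|$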
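A minimal sketch of the iterative PageRank calculation on a three-page toy graph follows. It assumes the link matrix is column-stochastic and irreducible (and aperiodic, so the iteration settles), and it leaves out the random-jump term; the graph and the function name `pagerank` are made up for illustration.

```python
def pagerank(links, n, iters=200):
    """Power iteration r <- A r for a column-stochastic link matrix A.

    links[j] lists the pages that page j links to, so column j of A holds
    1 / N_j in the rows of those pages. Every page needs at least one outlink.
    """
    r = [1.0 / n] * n  # initial approximation r_0
    for _ in range(iters):
        new_r = [0.0] * n
        for j, outlinks in links.items():
            share = r[j] / len(outlinks)  # page j splits its score evenly
            for i in outlinks:
                new_r[i] += share
        r = new_r
    return r

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = {0: [1, 2], 1: [2], 2: [0]}
r = pagerank(links, n=3)
```

For this graph the other two eigenvalues of A have magnitude about 0.71, so their contribution to $A^k r_0$ vanishes quickly and r approaches the stationary vector (0.4, 0.2, 0.4).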
SINGULAR VALUE DECOMPOSITION
SVD - Definition
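• Any m × n matrix A, with m ≥ n, can be factorized as $A = U \Sigma V^T$
• U and V have orthonormal columns, and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD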
Matrix approximation
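• Theorem: Let $U_k = (u_1, u_2, \ldots, u_k)$, $V_k = (v_1, v_2, \ldots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define
  $A_k = U_k \Sigma_k V_k^T$
• Then
  $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$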
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix
• V holds the eigenvectors of $A^T A$
  - Each column in V is an eigenvector for the row matrix A
  - We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors of $A A^T$
  - Each column in U is an eigenvector for the column matrix A
  - We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
  - $U_k$: orthogonal basis that we use to approximate all the documents
  - $D_k$: column j holds the coordinates of document j in the new basis
  - $D_k$ is the projection of A onto the subspace spanned by $U_k$
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  - U and Λ are eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  - U is the left (column) eigenvectors
  - V is the right (row) eigenvectors
  - Λ is the same eigenvalues
• For symmetric A, the column eigenvectors are equal to the row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g. a term-by-document matrix
  - PCA: A is the covariance matrix, $A = X^T X$; each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use a stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
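• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
  $a_i \leftarrow \sum_{j \to i} h_j$, $h_i \leftarrow \sum_{i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a \leftarrow L^T h$, $h \leftarrow L a$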
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
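The hub and authority updates can be sketched as the power iteration below, with both score vectors rescaled to unit length after every step. The four-page graph and the function names are made up for illustration; a real system would run this on the query-specific subgraph.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hits(L, iters=100):
    """Return (hub scores h, authority scores a) for adjacency matrix L.

    L[i][j] = 1 if page i links to page j, else 0.
    """
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # a = L^T h: authority of i sums the hub scores of pages linking to i.
        a = normalize([sum(L[j][i] * h[j] for j in range(n)) for i in range(n)])
        # h = L a: hub score of i sums the authority of the pages i links to.
        h = normalize([sum(L[i][j] * a[j] for j in range(n)) for i in range(n)])
    return h, a

# Pages 0 and 1 act as hubs: 0 -> 2, 0 -> 3, 1 -> 2
L = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
h, a = hits(L)
```

Here the nonzero block of $L^T L$ is $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, so the authority scores of pages 2 and 3 converge to the ratio $1 : (\sqrt{5} - 1)/2$, the dominant eigenvector of that block.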
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable, because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
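• Cost function: Euclidean distance
  $\|V - WH\|^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2$
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
  $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  - A and B can then be regarded as normalized probability distributions
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence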
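The sketch below applies the multiplicative update for the Euclidean cost, in the form given by Lee and Seung, to a small positive matrix and records the cost after each iteration. The helper names, the random initialization and the toy matrix are assumptions for illustration; V is kept strictly positive so that no denominator becomes zero.

```python
import random

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def cost(V, W, H):
    """Squared Euclidean distance ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh))

def nmf(V, k, iters=200, seed=0):
    """Factorize a positive n x m matrix V as W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [cost(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / den[a][j] for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / den[i][a] for a in range(k)]
             for i in range(n)]
        errors.append(cost(V, W, H))
    return W, H, errors

V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 5.0, 7.0],
     [1.0, 1.0, 1.0]]
W, H, errors = nmf(V, k=2)
```

Because every update multiplies by a ratio of non-negative quantities, W and H stay non-negative without any explicit projection step, and the recorded cost does not increase from one iteration to the next.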
NMF vs PCA
• n = 2429 faces, m = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x) = p(x|\theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
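• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated by the model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ):
  $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \ldots, x_n | \theta)$
• Predictive distribution: $p(x | x_1, \ldots, x_n) \approx p(x | \theta_{ML})$
IID - Independent Identically Distributed
• IID means the samples are drawn independently from the same distribution:
  $p(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^{n} p(x_i | \theta)$
• The problem is considerably simplified as
  $L(\theta) = \prod_{i=1}^{n} p(x_i | \theta)$
• Usually the log likelihood is used:
  $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta)$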
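As a concrete case, the sketch below computes the log likelihood of i.i.d. samples under a one-dimensional Gaussian and its closed-form maximum likelihood estimate. The Gaussian model and the sample values are assumptions chosen for illustration; the talk does not fix a particular model at this point.

```python
import math

def log_likelihood(xs, mu, sigma2):
    """log L(theta) = sum_i log p(x_i | mu, sigma2) for a 1-dim Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2)
               for x in xs)

def gaussian_mle(xs):
    """Parameters that maximize the Gaussian log likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n  # ML estimate divides by n
    return mu, sigma2

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_ml, sigma2_ml = gaussian_mle(xs)
best = log_likelihood(xs, mu_ml, sigma2_ml)
```

For these samples the estimate is a mean of 5 and a variance of 4, and any other parameter pair gives a lower log likelihood.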
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian mixture model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_\theta L(\theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in values of latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
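Jensen's Inequality
• For a concave function such as log: $\log \mathrm{E}[x] \ge \mathrm{E}[\log x]$
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ:
  $L(\theta) = \log p(D|\theta) = \log \int p(X, D|\theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  $L(\theta) = \log \int q(X) \frac{p(X, D|\theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D|\theta)}{q(X)}\, dX = \int q(X) \log p(X, D|\theta)\, dX + H[q] = F(q, \theta)$
• where H[q] is the entropy of q(X)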
The E and M Steps of EM
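• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D|\theta)\, dX + H[q]$
• EM alternates between:
• E step: optimize wrt the distribution over hidden variables, holding the parameters fixed
  $q^{(k)} = \arg\max_q F(q, \theta^{(k-1)})$
• M step: maximize wrt the parameters, holding the hidden distribution fixed
  $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta)$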
The E Step
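• E step: for fixed θ,
  $F(q, \theta) = L(\theta) - KL(q(X) \,\|\, p(X|D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL(q(X) \,\|\, p(X|D, \theta)) = 0$
• So the E step simply sets $q(X) = p(X|D, \theta)$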
The M Step
• M step: maximize wrt the parameters, holding the hidden distribution q fixed:
  $\theta \leftarrow \arg\max_\theta F(q, \theta) = \arg\max_\theta \int q(X) \log p(X, D|\theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) wrt θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
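The monotonicity property can be checked on the Gaussian mixture model mentioned earlier. The following is a minimal sketch of EM for a two-component, one-dimensional mixture; the synthetic data, the initialization and the function names are assumptions made for this example.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, iters=50):
    """EM for a 2-component 1-dim Gaussian mixture.

    Returns (weights, means, variances, log-likelihood history).
    """
    pi = [0.5, 0.5]
    mu = [min(xs), max(xs)]  # crude but deterministic initialization
    var = [1.0, 1.0]
    history = []
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point.
        resp, ll = [], 0.0
        for x in xs:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            ll += math.log(total)
            resp.append([pk / total for pk in p])
        history.append(ll)  # log likelihood under the current parameters
        # M step: closed-form weighted updates of the parameters.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
    return pi, mu, var, history

rng = random.Random(0)
xs = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
      + [rng.gauss(3.0, 1.0) for _ in range(200)])
pi, mu, var, history = em_gmm(xs)
```

The recorded log likelihood should be non-decreasing across iterations, and the two estimated means should end up near the values used to generate the data.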
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex, even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:
  $p(A, B, C, D, E) = p(A)\, p(B)\, p(C|A, B)\, p(D|B, C)\, p(E|C, D)$
• In general:
  $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i | X_{pa(i)})$
• where pa(i) are the parents of node i
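To make the factorization concrete, the sketch below evaluates the five-variable joint distribution from the slide for binary variables. All conditional probability values are hypothetical numbers chosen for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# Each entry is the probability that the child equals 1 given its parents.
p_A = 0.3                                                       # P(A=1)
p_B = 0.6                                                       # P(B=1)
p_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}      # P(C=1 | A, B)
p_D = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}      # P(D=1 | B, C)
p_E = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}    # P(E=1 | C, D)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_A, a) * bern(p_B, b) * bern(p_C[(a, b)], c)
            * bern(p_D[(b, c)], d) * bern(p_E[(c, d)], e))

states = list(product([0, 1], repeat=5))
total = sum(joint(*s) for s in states)
marginal_A = sum(joint(*s) for s in states if s[0] == 1)
```

The factorized model needs 1 + 1 + 4 + 4 + 4 = 14 numbers, whereas an unrestricted joint table over five binary variables needs 31.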
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
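[Figure: pLSA in plate notation (d → z → w, inner plate $N_d$, outer plate M) and unrolled for one document (d → $z_1 \ldots z_N$ → $w_1 \ldots w_N$)]
$z_1, \ldots, z_N$ are variables with $z_i \in [1, K]$; K is the number of latent topics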
pLSA
[Figure: unrolled pLSA model for documents $d_1, d_2, \ldots, d_M$, each generating its own topics z and words w]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability:
  $p(d, z, w) = p(d)\, p(z|d)\, p(w|z)$
• Likelihood (only for observed variables):
  $p(d, w) = p(d) \sum_z p(z|d)\, p(w|z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
  $p(w|d) = \sum_z p(w|z)\, p(z|d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  $p(w|d) = Z_{V \times k}\, p(z|d)$, where column z of Z is the distribution p(w|z) over the V vocabulary words
• With many documents, we hope to find latent topics as a common basis
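pLSA - Objective Function
• pLSA tries to maximize the log likelihood
  $\sum_d \sum_w n(d, w) \log p(d, w) = \sum_d \sum_w n(d, w) \log \left[ p(d) \sum_z p(w|z)\, p(z|d) \right]$
  where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM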
EM Steps
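• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
  $p(z|d, w) = \frac{p(w|z)\, p(z|d)}{\sum_{z'} p(w|z')\, p(z'|d)}$
• The M-Step
  $p(w|z) \propto \sum_d n(d, w)\, p(z|d, w)$, $p(z|d) \propto \sum_w n(d, w)\, p(z|d, w)$
  where n(d, w) is the number of times word w occurs in document d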
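A minimal sketch of these EM steps on a toy count matrix is shown below. It uses the standard pLSA updates, where the E-step computes the posterior p(z|d,w) and the M-step re-estimates p(w|z) and p(z|d) from posterior-weighted counts. The function name `plsa`, the random initialization and the counts are assumptions for illustration, and the constant p(d) term is left out of the log likelihood.

```python
import math
import random

def normalize(v):
    s = sum(v)
    return [x / s for x in v] if s > 0 else list(v)

def plsa(counts, K, iters=100, seed=0):
    """counts[d][w] = n(d, w). Returns (p(w|z), p(z|d), log-likelihood history)."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    history = []
    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        ll = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(K)]
                total = sum(joint)           # p(w|d) under current parameters
                ll += n * math.log(total)
                for z in range(K):
                    post = joint[z] / total  # E-step: p(z|d,w)
                    new_wz[z][w] += n * post
                    new_zd[d][z] += n * post
        history.append(ll)
        # M-step: renormalize the posterior-weighted counts.
        p_w_z = [normalize(row) for row in new_wz]
        p_z_d = [normalize(row) for row in new_zd]
    return p_w_z, p_z_d, history

# Four documents over a four-word vocabulary with two loose word groups.
counts = [[4, 3, 0, 0],
          [3, 5, 0, 0],
          [0, 0, 4, 2],
          [0, 1, 3, 5]]
p_w_z, p_z_d, history = plsa(counts, K=2)
```

As with any EM procedure, the log likelihood recorded in `history` should not decrease, and each row of `p_z_d` remains a distribution over the K topics.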
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (doc to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z|d_i)$
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents $d_1, d_2, \ldots, d_m$, each with its own topic distribution $p(z|d_1), p(z|d_2), \ldots, p(z|d_m)$]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
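• Definition
  $p(x_1, \ldots, x_K | \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K - 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$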
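The effect of the parameters can be explored by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma variates, a standard construction; the function names and sample sizes are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha)."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def mean_and_variance_of_first(samples):
    """Empirical mean and variance of the first coordinate."""
    xs = [s[0] for s in samples]
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

rng = random.Random(0)

# Unequal parameters: the mean of x_i is alpha_i / alpha_0.
skewed = [sample_dirichlet([6.0, 2.0, 2.0], rng) for _ in range(5000)]
mean_skewed, _ = mean_and_variance_of_first(skewed)

# Equal alpha_i with different alpha_0: larger alpha_0 concentrates the
# samples around the center of the simplex.
loose = [sample_dirichlet([1.0 / 3] * 3, rng) for _ in range(2000)]   # alpha_0 = 1
tight = [sample_dirichlet([10.0 / 3] * 3, rng) for _ in range(2000)]  # alpha_0 = 10
_, var_loose = mean_and_variance_of_first(loose)
_, var_tight = mean_and_variance_of_first(tight)
```

With α = (6, 2, 2) the first coordinate averages about 0.6, and the spread of the equal-α samples shrinks as $\alpha_0$ grows from 1 to 10.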
The LDA Model
[Figure: LDA graphical model; in each document, topics $z_1 \ldots z_4$ generate words $w_1 \ldots w_4$, with the parameter β shared across documents]
• For each document:
  • Choose θ ~ Dirichlet(α)
  • For each of the N words $w_n$:
    - Choose a topic $z_n$ ~ Multinomial(θ)
    - Choose a word $w_n$ from $p(w_n | z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
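The generative process can be written out directly as sampling code. The following is a sketch under stated assumptions: `beta[z][w]` stands for p(w|z), the topic-word table and hyperparameters are made-up values, and the helper names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions theta ~ Dirichlet(alpha)."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs, rng):
    """Draw an index with the given probabilities (one multinomial trial)."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against rounding in the last bin

def generate_document(alpha, beta, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)       # per-document topic mixture
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)     # choose a topic z_n
        w = sample_categorical(beta[z], rng)   # choose a word w_n from p(w|z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Two topics over a four-word vocabulary with disjoint supports.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
rng = random.Random(0)
theta, topics, words = generate_document([0.5, 0.5], beta, n_words=20, rng=rng)
```

Because the two topics here use disjoint words, every generated word reveals its topic, which makes the sampled document easy to inspect.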
Joint Probability
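• Given parameters α and β:
  $p(\theta, z, w | \alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n | z_n, \beta)$
  where $p(\theta|\alpha)$ is the Dirichlet density and $p(z_n = i | \theta) = \theta_i$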
Likelihood
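• Joint probability:
  $p(\theta, z, w | \alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n | z_n, \beta)$
• Marginal distribution of a document:
  $p(w | \alpha, \beta) = \int p(\theta|\alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta)\, p(w_n | z_n, \beta) \right) d\theta$
• Likelihood over all the documents:
  $p(D | \alpha, \beta) = \prod_{d=1}^{M} p(w_d | \alpha, \beta)$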
Inference
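• The log likelihood can be computed by summing over the documents: $\log p(D | \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d | \alpha, \beta)$
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables:
  $p(\theta, z | w, \alpha, \beta) = \frac{p(\theta, z, w | \alpha, \beta)}{p(w | \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach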
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence:
  $\log p(w | \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL(q(\theta, z | \gamma, \phi) \,\|\, p(\theta, z | w, \alpha, \beta))$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), so that KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing $KL(q(X) \,\|\, p(X|D, \theta))$
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
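• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM):
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence:
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference - repeat until convergence
  $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$, $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
  where Ψ is the digamma function
• M-Step: parameter estimation
  - β: $\beta_{ij} \propto \sum_d \sum_n \phi_{dni}\, w_{dn}^j$, where $w_{dn}^j = 1$ if the n-th word of document d is vocabulary word j
  - α: can be implemented using the Newton-Raphson method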
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model; in each document, topics $z_1 \ldots z_4$ generate words $w_1 \ldots w_4$, with the parameter β shared across documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram of the model with variables π, z, x, θ, β and plates $N_d$ and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2        (Dim = 1 million)
        f1     f2     …      fM
  Img1  0.2    0.1    …      0.03               (Dim = 200)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$
  - $X = (x_1, x_2, x_3, \ldots, x_k)$: base vectors
  - w: low dimensional representation
  - ε: residual
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
[Equation: the inner product of two documents p and q expressed through their low-dimensional representations w]
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution (x = 0)
• a document specific distribution (x = 1)
• a background distribution (x = 2)
$p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k)\, p(z=k|d) + p(x=1|d)\, p_s(w|d) + p(x=2|d)\, p_b(w)$
where $p_s(w|d)$ denotes the document specific distribution and $p_b(w)$ the background distribution
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional histogram + a few words) is looked up in the LSH index (entries DS1, DS2, …), which returns candidate documents (Doc 300, Doc 401, …); the candidates are then re-ranked using the document metadata (Doc Meta)]
Index: 10M images, 46GB
Search speed: < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: topics
Structure: who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document similarity
Topic similarity
Structure similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Projects documents to topic space
SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods: HITS, PageRank, …
Baselines
LM: Formal Models for Expert Finding in Enterprise Corpora (SIGIR '06). Achieves stable performance in the expert finding task using a language model
PageRank: benchmark nodal ranking method
HITS: finds hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow (SIGIR '06). Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice for learning matrices
  - Graphical models: a good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Why We Need Matrix Decomposition
bull Given one data samplea1 = Bntimeskc1
(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T
bull Another data sample a2 = Bntimeskc2
bull More data sample am = Bntimeskcm
bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm)
Antimesm = BntimeskCktimesm
Why We Need Matrix Decomposition
(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm)
Antimesm = BntimeskCktimesm
bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space
bull In general B captures the common features in A while C carries specific characteristics of the original samples
bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features
PRINCIPAL COMPONENT ANALYSIS
Definition - Eigenvalue & Eigenvector
Given an m × m matrix C, if for a scalar λ and a nonzero vector w
  $C w = \lambda w$
then λ is called an eigenvalue and w is called an eigenvector of C
Definition - Principal Component Analysis
  - Principal Component Analysis (PCA)
  - Karhunen-Loève transformation (KL transformation)
• Let A be an n × m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m × m): $C = A^T A$
Principal Component Analysis
• C can be decomposed as follows: $C = U \Lambda U^T$
• Λ is a diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; each $\lambda_i$ is an eigenvalue
• U is an orthogonal matrix; each column is an eigenvector
  $U^T U = I$, $U^{-1} = U^T$
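As a rough check of the decomposition above, the sketch below centers a made-up data matrix, forms $C = A^T A$ as on the slide (without the usual 1/(n-1) scaling), and verifies both identities. Using `numpy.linalg.eigh` is an implementation choice here, not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))          # rows are samples, columns are variables
A = A - A.mean(axis=0)                 # center each column
C = A.T @ A                            # covariance matrix as defined on the slide (unnormalized)

eigvals, U = np.linalg.eigh(C)         # eigh is meant for symmetric matrices; ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # reorder so that lambda_1 >= lambda_2 >= ...
eigvals, U = eigvals[order], U[:, order]
Lam = np.diag(eigvals)
```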
Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• Projection of the data along w is Aw
• Variance: $\sigma_w^2 = (Aw)^T (Aw) = w^T A^T A w = w^T C w$
  where $C = A^T A$ is the covariance matrix of the data (A is centered)
• Task: maximize the variance subject to the constraint $w^T w = 1$
Optimization Problem
• Maximize $f(w, \lambda) = w^T C w - \lambda\,(w^T w - 1)$, where λ is the Lagrange multiplier
• Differentiating with respect to w yields $2 C w - 2 \lambda w = 0$
• Eigenvalue equation: $C w = \lambda w$, where $C = A^T A$
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found
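A small numerical sketch of the claim that the top eigenvector maximizes $w^T C w$ under $w^T w = 1$. The data and the comparison against random unit directions are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
A = A - A.mean(axis=0)                 # centered data
C = A.T @ A

eigvals, U = np.linalg.eigh(C)
w1, lam1 = U[:, -1], eigvals[-1]       # eigh returns ascending order, so the last pair is the largest

def variance_along(w):
    """Variance of the data projected on the unit vector w: w^T C w."""
    return float(w @ C @ w)

random_dirs = rng.normal(size=(1000, 4))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(variance_along(w) for w in random_dirs)
```

The variance along `w1` equals the largest eigenvalue, and no random unit direction exceeds it.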
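Property: Data Decomposition
• PCA can be treated as data decomposition
  $a = U U^T a$
  $\;\; = (u_1, u_2, \ldots, u_n)\,(u_1, u_2, \ldots, u_n)^T a$
  $\;\; = (u_1, u_2, \ldots, u_n)\,(\langle u_1, a \rangle, \langle u_2, a \rangle, \ldots, \langle u_n, a \rangle)^T$
  $\;\; = (u_1, u_2, \ldots, u_n)\,(b_1, b_2, \ldots, b_n)^T$
  $\;\; = \sum_{i=1}^{n} b_i u_i$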
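The expansion $a = \sum_i b_i u_i$ can be checked directly. In this sketch the orthonormal basis U comes from an arbitrary symmetric matrix (an assumption for illustration); truncating the sum to the first two components gives an approximation rather than an exact reconstruction.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
_, U = np.linalg.eigh(M + M.T)         # orthonormal eigenvectors of a symmetric matrix
a = rng.normal(size=4)

b = U.T @ a                            # b_i = <u_i, a>
a_rebuilt = sum(b[i] * U[:, i] for i in range(4))
a_top2 = U[:, :2] @ b[:2]              # keep only two components: an approximation of a
```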
Face Recognition - Eigenface
• Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. CVPR 1991 (Citations: 2654)
• The eigenface approach
  - images are points in a vector space
  - use PCA to reduce dimensionality
  - face space
  - compare projections onto face space to recognize faces
PageRank - Power Iteration
• Column j of the link matrix has nonzero elements in the positions corresponding to the outlinks of j ($N_j$ in total)
• Row i has nonzero elements in the positions corresponding to the inlinks $I_i$
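Column-Stochastic & Irreducible
• Column-stochastic: every column of A sums to one, $e^T A = e^T$
• where $e = (1, 1, \ldots, 1)^T$
• Irreducible: every page can be reached from every other page by following links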
Iterative PageRank Calculation
• For k = 1, 2, …: $r^{(k)} = A\, r^{(k-1)}$
• Equivalently, $A r = \lambda r$ (λ = 1; A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
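Convergence of the power iteration
• Expand the initial approximation $r_0$ in terms of the eigenvectors: $r_0 = \sum_i c_i v_i$, so $A^k r_0 = \sum_i c_i \lambda_i^k v_i$
• Since $\lambda_1 = 1$ and $|\lambda_i| < 1$ for the other eigenvalues, all terms except $c_1 v_1$ vanish as k grows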
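The iteration can be sketched on a tiny hypothetical web graph. The four pages, their links, and the damping value 0.85 are assumptions made for illustration; the random-jump term is added so that the matrix is irreducible, in the spirit of the random jump step mentioned later in the HITS vs PageRank comparison.

```python
import numpy as np

# Hypothetical 4-page web: links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)

A = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)      # column j spreads its score evenly over its N_j outlinks

# Random-jump step (assumed damping 0.85) keeps the matrix column-stochastic and irreducible.
alpha = 0.85
G = alpha * A + (1 - alpha) / n * np.ones((n, n))

r = np.full(n, 1.0 / n)                # initial approximation r_0
for _ in range(200):
    r = G @ r                          # r^(k) = G r^(k-1)
```

After enough iterations r stops changing, which means it is the eigenvector of G for eigenvalue 1.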
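SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m × n matrix A, with m ≥ n, can be factorized as $A = U \Sigma V^T$
• U and V have orthonormal columns, and Σ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$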
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD
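Matrix approximation
• Theorem: Let $U_k = (u_1, u_2, \ldots, u_k)$, $V_k = (v_1, v_2, \ldots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$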
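A quick numerical sketch of the theorem, on a random matrix chosen only for illustration: truncating the SVD to k terms leaves a residual whose spectral norm is the next singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds sigma_1 >= sigma_2 >= ...

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # A_k = U_k Sigma_k V_k^T
err_2 = np.linalg.norm(A - A_k, 2)                 # spectral norm of the residual
```

Note that `s[k]` is $\sigma_{k+1}$ because NumPy arrays are zero-indexed.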
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix
• V is just the eigenvectors for A
  - Each column in V is an eigenvector of the row matrix A (that is, of $A^T A$)
  - We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U is just the eigenvectors for $A^T$
  - Each column in U is an eigenvector of the column matrix A (that is, of $A A^T$)
  - We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A \approx A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
  - $U_k$: orthogonal basis that we use to approximate all the documents
  - $D_k$: column j holds the coordinates of document j in the new basis
  - $D_k$ is the projection of A onto the subspace spanned by $U_k$: $D_k = U_k^T A$
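The LSI steps above can be sketched on a toy term-by-document matrix. The vocabulary, the counts, the query, and the use of cosine similarity are all made up for illustration; the point is that a query containing only "car" still matches a document that contains only "automobile" and "engine".

```python
import numpy as np

# Toy term-by-document matrix (rows: terms, columns: documents); counts are made up.
terms = ["car", "automobile", "engine", "cow", "sheep"]
A = np.array([[2, 0, 1, 0],
              [0, 2, 1, 0],
              [1, 1, 2, 0],
              [0, 0, 0, 2],
              [0, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]
D_k = np.diag(s[:k]) @ Vt[:k, :]       # column j: coordinates of document j in the basis U_k

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query containing only "car"
q_k = U_k.T @ q                               # project the query onto the same subspace

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = [cosine(q_k, D_k[:, j]) for j in range(A.shape[1])]
```

In the original term space the query and the second document share no term (their inner product is zero), yet in the reduced space they point in the same direction.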
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  - U and Λ are eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  - U is the left (column) eigenvectors
  - V is the right (row) eigenvectors
  - Λ is the same eigenvalues
• For symmetric A, the column eigenvectors are equal to the row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g. a term-by-document matrix
  - PCA: A is the covariance matrix, $A = X^T X$; each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use a stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures (2)
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
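The three identities can be verified numerically. The matrix below is a random stand-in for a term-by-document matrix, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((6, 4))                 # hypothetical term-by-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

term_coords = U @ S                    # one row per term
doc_coords = Vt.T @ S                  # one row per document
term_half = U @ np.sqrt(S)             # square-root scaling for term-to-document similarity
doc_half = Vt.T @ np.sqrt(S)
```

Inner products between rows of these coordinate matrices reproduce the entries of $AA^T$, $A^TA$ and $A$ respectively.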
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs and authorities]
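Power Iteration
• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
  $a_i = \sum_{j:\, j \to i} h_j$, $\quad h_i = \sum_{j:\, i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a = L^T h$, $\quad h = L a$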
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
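The link between HITS and SVD can be sketched on a small hypothetical graph. The adjacency matrix below and the per-step normalization are assumptions made for illustration; the comparison with `numpy.linalg.svd` is up to sign, since singular vectors are only defined up to sign.

```python
import numpy as np

# Hypothetical directed graph: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(4)
h = np.ones(4)
for _ in range(200):
    a = L.T @ h                        # authority: sum of hub scores of the pages linking in
    h = L @ a                          # hub: sum of authority scores of the pages linked to
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

U, s, Vt = np.linalg.svd(L)
```

The authority vector matches the first right singular vector of L and the hub vector matches the first left singular vector.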
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k}\, H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k basis vectors; each column $W_i$ represents one basis vector
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k}\, h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
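Multiplicative Update Algorithm
• Cost function: Euclidean distance
  $\|V - WH\|^2 = \sum_{ij} \left(V_{ij} - (WH)_{ij}\right)^2$
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $\quad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm (2)
• Cost function: divergence
  $D(A \| B) = \sum_{ij} \left(A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij}\right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  - A and B can then be regarded as normalized probability distributions
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $\quad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence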
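A minimal sketch of the multiplicative updates for the Euclidean cost, following the rules from Lee and Seung's NIPS 2001 paper cited in the references. The function name `nmf`, the random initialization, the iteration count, and the small `eps` guard against division by zero are assumptions of this sketch.

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-12, seed=0):
    """Factor a non-negative V (n x m) into W (n x k) and H (k x m) by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # update coordinates, keeping them non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)     # update bases, keeping them non-negative
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

V = np.random.default_rng(1).random((6, 5))      # made-up non-negative data
W, H, errors = nmf(V, k=2)
```

Because every update multiplies by a non-negative ratio, W and H never leave the non-negative orthant, and the recorded squared error does not increase from one iteration to the next.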
NMF vs PCA
• n = 2429 faces, m = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians: parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theory
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
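Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
  $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \ldots, x_n \mid \theta)$
• Predictive distribution
  $p(x_{n+1} \mid x_1, \ldots, x_n) \approx p(x_{n+1} \mid \theta_{ML})$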
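IID - Independent Identically Distributed
• IID means $p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$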
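As a concrete instance of maximum likelihood under the i.i.d. assumption, the sketch below fits a one-dimensional Gaussian. The Gaussian model, the synthetic data, and the helper name `log_likelihood` are assumptions chosen for illustration; for this model the maximizer has the familiar closed form (sample mean and biased sample variance).

```python
import numpy as np

def log_likelihood(x, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over i.i.d. samples."""
    n = x.size
    return float(-0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)

mu_ml = x.mean()                               # closed-form ML estimate of the mean
sigma2_ml = np.mean((x - mu_ml) ** 2)          # closed-form ML estimate of the variance
best = log_likelihood(x, mu_ml, sigma2_ml)
```

Any other parameter setting, including the true generating parameters, gives a log likelihood no larger than `best` on this sample.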
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D (observed), with latent variables X
• Likelihood: $p(D \mid \theta) = \int p(D, X \mid \theta)\, dX$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
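Jensen's Inequality
• For the concave function log and weights $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$: $\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
  $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
  $L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$
• where H[q] is the entropy of q(X)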
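The E and M Steps of EM
• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
  $q^{(k)} = \arg\max_q F(q, \theta^{(k-1)})$
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
  $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta)$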
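The E Step
• E step: for fixed θ
  $F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL\left(q(X) \,\|\, p(X \mid D, \theta)\right)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL\left(q(X) \,\|\, p(X \mid D, \theta)\right) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$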
The M Step
• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
  $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
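EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
  $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ) (it touches the top of the bound)
• The M step maximizes F(q, θ) w.r.t. θ (it lifts the bound)
• $F(q, \theta) \le L(\theta)$ by Jensen, or equivalently from the non-negativity of KL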
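The E and M steps can be sketched for a two-component, one-dimensional Gaussian mixture, one of the latent variable models named earlier. The synthetic data, the initialization, and the function names are assumptions of this sketch; the log likelihood is recorded before each update so that the monotonic behavior can be inspected.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, iters=50):
    # Starting values for the parameters theta = (pi, mu, var); chosen arbitrarily.
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    log_liks = []
    for _ in range(iters):
        # E step: q(z_n = j) = p(z_n = j | x_n, theta), the posterior responsibilities.
        joint = pi * normal_pdf(x[:, None], mu, var)       # shape (N, 2)
        log_liks.append(float(np.sum(np.log(joint.sum(axis=1)))))
        q = joint / joint.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log likelihood in closed form.
        Nj = q.sum(axis=0)
        pi = Nj / x.size
        mu = (q * x[:, None]).sum(axis=0) / Nj
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return pi, mu, var, log_liks

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(3.0, 1.0, 250)])
pi, mu, var, log_liks = em_gmm(x)
```

The recorded sequence `log_liks` never decreases, and the estimated means end up close to the two generating means.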
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODELS
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution
  $p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general
  $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
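The five-variable factorization above can be sketched with binary variables. All the probability tables below are made-up numbers; the sketch only shows that multiplying the local conditionals yields a proper joint distribution.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

p_A, p_B = 0.3, 0.6                    # P(A = 1), P(B = 1): made-up values
p_C = rng.random((2, 2))               # P(C = 1 | A, B), indexed by (A, B)
p_D = rng.random((2, 2))               # P(D = 1 | B, C), indexed by (B, C)
p_E = rng.random((2, 2))               # P(E = 1 | C, D), indexed by (C, D)

def bern(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    # p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
    return (bern(p_A, a) * bern(p_B, b) * bern(p_C[a, b], c)
            * bern(p_D[b, c], d) * bern(p_E[c, d], e))

total = sum(joint(*v) for v in itertools.product([0, 1], repeat=5))
```

Storing the full joint over five binary variables needs 31 free numbers, while the factorized form here needs only 14.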
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
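pLSA
[Figure: plate diagram of pLSA (d, z, w with plates N_d and M) and its unrolled form, in which document d generates topics z_1, z_2, z_3, …, z_N and each z_n generates the word w_n]
• $z_1, \ldots, z_N$ are variables, $z_i \in [1, K]$; K is the number of latent topics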
pLSA (2)
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_M; document d_m generates topics z_1, …, z_{N_m} and words w_1, …, w_{N_m}]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability
  $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables)
  $p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
  $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  $p(w \mid d) = Z_{V \times k}\; p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
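pLSA - Objective Function
• pLSA tries to maximize the log likelihood
  $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM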
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
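EM Steps (2)
• The E-Step
  $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step
  $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$, $\quad p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$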
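A compact sketch of pLSA trained with EM on a toy count matrix. The counts, the number of topics, the function name `plsa`, and the tiny constant that guards against division by zero are assumptions of this sketch; p(d) is left out because it is assumed uniform.

```python
import numpy as np

def plsa(N, K, iters=50, seed=0):
    """N[d, w] = count of word w in document d. Returns p(w|z), p(z|d) and a log-likelihood trace."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_w_z = rng.random((K, W))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    trace = []
    for _ in range(iters):
        # E step: posterior p(z | d, w) for every (d, w) pair, shape (D, K, W).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1) + 1e-300              # p(w | d); tiny guard against 0
        trace.append(float(np.sum(N * np.log(p_w_d))))
        post = joint / p_w_d[:, None, :]
        # M step: re-estimate both distributions from the expected counts n(d, w) p(z | d, w).
        expected = N[:, None, :] * post                 # shape (D, K, W)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, trace

N = np.array([[4, 3, 0, 0],
              [5, 2, 0, 1],
              [0, 0, 3, 4],
              [0, 1, 4, 4]], dtype=float)
p_w_z, p_z_d, trace = plsa(N, K=2)
```

Each row of `p_w_z` is a topic (a distribution over words) and each row of `p_z_d` is a document's topic mixture; the log-likelihood trace is non-decreasing, as expected from EM.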
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA (2)
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA (2)
• There is no constraint on the distributions $p(z \mid d_i)$
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_m, each with its own unconstrained topic distribution: p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
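Dirichlet Distribution (2)
• Definition
  $p(x_1, x_2, \ldots, x_k \mid \alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, $\quad$ s.t. $\sum_{i=1}^{k} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex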
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3) (2)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$
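The behavior shown in the example plots can be explored by sampling. The sketch below uses NumPy's `Generator.dirichlet`; the sample sizes and the summary statistic (mean of the largest coordinate) are choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([6.0, 2.0, 2.0])
samples = rng.dirichlet(alpha, size=20000)            # each row is a point on the 2-simplex

# Equal alpha_i with a small versus a large total alpha_0 = sum_i alpha_i.
sparse = rng.dirichlet(np.full(3, 0.1 / 3), size=20000)   # alpha_0 = 0.1
dense = rng.dirichlet(np.full(3, 10.0 / 3), size=20000)   # alpha_0 = 10
```

Every sample is a valid mixture proportion; the empirical mean approaches $\alpha / \alpha_0$, and a small $\alpha_0$ pushes samples toward the corners of the simplex (one dominant component) while a large $\alpha_0$ concentrates them near the center.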
The LDA Model
[Figure: unrolled LDA model; for each document, topics z_1, z_2, z_3, z_4 generate words w_1, w_2, w_3, w_4, with β shared across documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words $w_n$
  - Choose a topic $z_n$ ~ Multinomial(θ)
  - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
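The generative process can be sketched as a sampler. The topic count, vocabulary size, α value, and the randomly drawn topic-word table β are all hypothetical; only the three sampling steps follow the slide.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet parameter, shape (K,); beta[k] = p(w | z = k), shape (K, V).
    """
    theta = rng.dirichlet(alpha)                           # theta ~ Dirichlet(alpha)
    z = rng.choice(len(alpha), size=n_words, p=theta)      # z_n ~ Multinomial(theta)
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # w_n ~ p(w | z_n, beta)
    return theta, z, w

rng = np.random.default_rng(0)
K, V = 3, 8                                                # toy numbers of topics and vocabulary words
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)                   # made-up topic-word distributions
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
```

Unlike pLSA, the per-document mixture θ is itself drawn from a distribution, so a new document needs no new parameters.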
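Joint Probability
• Given parameters α and β
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where $p(z_n \mid \theta)$ is simply $\theta_i$ for the topic i selected by $z_n$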
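Likelihood
• Joint probability
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents
  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$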
Inference
• The log likelihood can be computed by summing over the documents: $\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality, as in EM
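Inference (2)
• In the E-Step, we need to compute the posterior distribution of the hidden variables
  $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach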
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference (2)
• The difference between the lower bound and the likelihood is the KL divergence
  $\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\left(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\right)$
• Maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is directly set to $p(X \mid D, \theta)$, which makes KL = 0
• In VBEM, it is intractable to compute $p(X \mid D, \theta)$. Instead, it approximates $p(X \mid D, \theta)$ by a variational distribution q(X), by minimizing $KL\left(q(X) \,\|\, p(X \mid D, \theta)\right)$
• This is also equivalent to maximizing the lower bound with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound $\log p(w \mid \alpha, \beta)$ by a function $L(\gamma, \phi; \alpha, \beta)$
• Repeat until convergence
  - E: Maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to the variational parameters γ and φ
  - M: Maximize the bound with respect to the parameters α and β
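Parameter Estimation (2)
• E-Step: variational inference, repeat until convergence
  $\phi_{ni} \propto \beta_{i w_n} \exp\left(\Psi(\gamma_i) - \Psi\left(\textstyle\sum_j \gamma_j\right)\right)$, $\quad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
• M-Step: parameter estimation
  - β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$
  - α: can be implemented using the Newton-Raphson method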
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
  (a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA model; for each document, topics z_1, z_2, z_3, z_4 generate words w_1, w_2, w_3, w_4, with β shared across documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)
  - Term space (Dim = 1 million): Img1 = (Term1: 1, Term2: 2, Term3: 0, Term4: 0, …, TermN: 2)
  - Topic space (Dim = 200): Img1 = (f1: 0.2, f2: 0.1, …, fM: 0.03)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
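Orthogonal Decomposition
  $p = Xw + \varepsilon$
  $\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
  (columns of X: base vectors $X_1, X_2, X_3, \ldots, X_k$; w: low-dimensional representation; ε: residual)
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
• Similarity between two vectors p and q: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$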
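The decomposition into a low-dimensional part plus a residual can be sketched as follows. This sketch assumes the base vectors are orthonormal (obtained here from a QR factorization of a random matrix) and uses toy sizes; under that assumption the inner product of two vectors splits exactly into a low-dimensional term and a residual term.

```python
import numpy as np

rng = np.random.default_rng(0)
W_dim, k = 50, 5                                  # vocabulary size and number of base vectors (toy sizes)

X, _ = np.linalg.qr(rng.normal(size=(W_dim, k)))  # assumed: base vectors with orthonormal columns

def decompose(p):
    w = X.T @ p                                   # low-dimensional representation
    eps = p - X @ w                               # residual, orthogonal to every base vector
    return w, eps

p, q = rng.random(W_dim), rng.random(W_dim)
w_p, eps_p = decompose(p)
w_q, eps_q = decompose(q)
```

Keeping the residual alongside the low-dimensional vector is what lets the similarity be computed without losing the information discarded by the projection.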
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
  $p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p_{\mathrm{doc}}(w \mid d) + p(x{=}2 \mid d)\, p_{\mathrm{bg}}(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a topic histogram + a few words) is looked up in the LSH index, whose buckets hold entries DS1, DS2, …; the candidate documents returned (e.g. Doc 300, Doc 401, …) are re-ranked using the Doc Meta of Doc 1, Doc 2, …, Doc N]
• Index: 10M images, 46GB
• Search speed: less than 100 ms
Search Example
[Figure: query image and search results]
Search Example (2)
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  - HITS
  - PageRank
  - …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice for learning matrices
  - Graphical models: a good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Why We Need Matrix Decomposition
(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm)
Antimesm = BntimeskCktimesm
bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space
bull In general B captures the common features in A while C carries specific characteristics of the original samples
bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysis
bull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variance
bull The objective of the rotation transformation is to find the maximal variance
bull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCw
where C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problem
• Maximize $F = w^T C w - \lambda (w^T w - 1)$, where λ is the Lagrange multiplier.
• Differentiating with respect to w yields $2Cw - 2\lambda w = 0$.
• Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$.
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.
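The eigenvalue equation Cw = λw with C = AᵀA can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data matrix; the function name `pca` and the test data are illustrative and not from the slides.

```python
import numpy as np

def pca(A, k):
    """Return the top-k eigenvalues, principal directions, and projected data."""
    A = A - A.mean(axis=0)               # center each column (variable)
    C = A.T @ A                          # m x m (unnormalized) covariance matrix
    eigvals, U = np.linalg.eigh(C)       # eigh: C is symmetric, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]    # reorder by decreasing variance
    eigvals, U = eigvals[order], U[:, order]
    W = U[:, :k]                         # principal directions (columns)
    return eigvals[:k], W, A @ W         # A @ W: coordinates in the new basis

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])  # synthetic data
eigvals, W, Z = pca(A, 2)
```

The variance of the data projected onto the first direction, `(Z[:, 0] ** 2).sum()`, equals the largest eigenvalue, which is the statement σ_w² = wᵀCw = λ for a unit eigenvector w.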
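Property: Data Decomposition
• PCA can be treated as data decomposition:
$a = U U^T a$
$= (u_1, u_2, \ldots, u_n)(u_1, u_2, \ldots, u_n)^T a$
$= (u_1, u_2, \ldots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \ldots, \langle u_n, a \rangle)^T$
$= (u_1, u_2, \ldots, u_n)(b_1, b_2, \ldots, b_n)^T$
$= \sum_i b_i u_i$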
Face Recognition - Eigenface
• M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. CVPR 1991. (Citations: 2654)
• The eigenface approach:
  - images are points in a vector space
  - use PCA to reduce dimensionality
  - face space
  - compare projections onto face space to recognize faces
PageRank - Power Iteration
• Column j has nonzero elements in the positions corresponding to the outlinks of j ($N_j$ in total).
• Row i has nonzero elements in the positions corresponding to the inlinks $I_i$.
Column-Stochastic & Irreducible
• Column-stochastic
• where
• Irreducible
Iterative PageRank Calculation
• For k = 1, 2, …
• Equivalently (λ = 1; A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation $r_0$ in terms of the eigenvectors.
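The iteration can be sketched as follows. This is a minimal illustration, assuming a column-stochastic, irreducible matrix A built from a hypothetical 3-page link graph with a damping factor of 0.85; the graph, the damping value, and the function name are assumptions, not taken from the slides.

```python
import numpy as np

def power_iteration(A, num_iters=200, tol=1e-12):
    """Repeatedly apply A to a start vector until it stops changing."""
    n = A.shape[0]
    r = np.full(n, 1.0 / n)              # uniform initial approximation r0
    for _ in range(num_iters):
        r_next = A @ r
        r_next /= r_next.sum()           # keep the entries summing to 1
        converged = np.abs(r_next - r).sum() < tol
        r = r_next
        if converged:
            break
    return r

# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j holds 1/N_j in the rows of the pages that j links to.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
alpha = 0.85                              # assumed damping factor
n = P.shape[0]
A = alpha * P + (1 - alpha) / n * np.ones((n, n))
r = power_iteration(A)
```

Because the components of r0 along the non-dominant eigenvectors shrink by a factor |λ_i| < 1 at every step, the iterate converges to the eigenvector with λ = 1, so `A @ r` reproduces `r`.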
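SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m × n matrix A with m ≥ n can be factorized as $A = U \Sigma V^T$, where the columns of U (m × n) and V (n × n) are orthonormal and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$.
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A.
• The columns of U and V are the left singular vectors and right singular vectors, respectively.
• Equivalent form of SVD: $A v_j = \sigma_j u_j, \quad A^T u_j = \sigma_j v_j$
Matrix approximation
• Theorem: Let $U_k = (u_1, u_2, \ldots, u_k)$, $V_k = (v_1, v_2, \ldots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$.
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$.
• It means that the best approximation of rank k for the matrix A is $A_k$.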
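The rank-k approximation theorem can be verified on a small example. This sketch assumes NumPy and a random 8 × 5 matrix chosen purely for illustration; it checks that the spectral-norm error of the truncated SVD equals the (k+1)-th singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))                        # illustrative data, m >= n

# Thin SVD: U is 8x5, s holds the singular values (descending), Vt is 5x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # A_k = U_k Sigma_k V_k^T

err = np.linalg.norm(A - A_k, 2)                   # spectral norm of the residual
# s[k] is sigma_{k+1} because Python indices start at 0.
```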
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$.
• Remember that in PCA we treat A as a row matrix.
• V is just the eigenvectors for A:
  - Each column in V is an eigenvector of $A^T A$, with A treated as a row matrix
  - We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$.
• U is just the eigenvectors for $A^T$:
  - Each column in U is an eigenvector of $A A^T$, with A treated as a column matrix
  - We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A.
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$:
  - $U_k$: orthogonal basis that we use to approximate all the documents
  - $D_k$: column j holds the coordinates of document j in the new basis
  - $D_k$ is the projection of A onto the subspace spanned by $U_k$
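The LSI steps above can be sketched on a toy corpus. The term list, the 5 × 5 term-by-document matrix, and the query below are made up for illustration; the query is folded into the latent space with U_kᵀq, which is one common convention and an assumption here.

```python
import numpy as np

terms = ["car", "automobile", "engine", "cow", "sheep"]
# Rows are terms, columns are documents (hypothetical counts).
A = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 1, 0, 0],   # automobile
    [1, 1, 0, 0, 0],   # engine
    [0, 0, 0, 1, 1],   # cow
    [0, 0, 0, 1, 1],   # sheep
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]                          # basis for the latent space
D_k = np.diag(s[:k]) @ Vt[:k, :]        # document coordinates, equals U_k.T @ A

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query containing only "car"
q_k = U_k.T @ q                               # query coordinates in the latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(q_k, D_k[:, j]) for j in range(A.shape[1])]
```

Document 1 contains "automobile" and "engine" but not "car", yet its similarity to the query is close to 1 in the latent space, while the animal documents (3 and 4) score close to 0.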
SVD and PCA
• For symmetric A, SVD is closely related to PCA.
• PCA: $A = U \Lambda U^T$
  - U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  - U is the left (column) eigenvectors
  - V is the right (row) eigenvectors
  - Λ is the same eigenvalues
• For symmetric A, the column eigenvectors are equal to the row eigenvectors.
• Note the difference of A in PCA and SVD:
  - SVD: A is directly the data, e.g. a term-by-document matrix
  - PCA: A is the covariance matrix, $A = X^T X$; each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use a stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix; sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g., "car" and "automobile" occur in similar documents, as do "cows" and "sheep".
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  UΣ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.
[Figure: hubs pointing to authorities]
Power Iteration
• Each page i has both a hub score $h_i$ and an authority score $a_i$.
• HITS successively refines these scores by computing
  $a_i = \sum_{j \to i} h_j, \quad h_i = \sum_{i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise.
• Now $a = L^T h$ and $h = L a$.
HITS and SVD
• L: rows are outlinks, columns are inlinks.
• a will be the dominant eigenvector of the authority matrix $L^T L$.
• h will be the dominant eigenvector of the hub matrix $L L^T$.
• h and a are in fact the first left and right singular vectors of L, respectively.
• We are in fact running SVD on the adjacency matrix.
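The link between HITS and SVD can be checked on a small graph. This is a minimal sketch, assuming a hypothetical 4-page adjacency matrix and unit-norm rescaling after each update (the normalization is an assumption; the slides only state that the scores are refined iteratively).

```python
import numpy as np

def hits(L, num_iters=100):
    """Alternate authority and hub updates, rescaling to unit length each time."""
    h = np.ones(L.shape[0])
    for _ in range(num_iters):
        a = L.T @ h                      # authority: sum of hub scores of in-linking pages
        a /= np.linalg.norm(a)
        h = L @ a                        # hub: sum of authority scores of linked pages
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical graph: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

h, a = hits(L)
U, s, Vt = np.linalg.svd(L)
# Singular vectors are defined up to sign, so compare absolute values.
first_left, first_right = np.abs(U[:, 0]), np.abs(Vt[0])
```

On this graph the hub scores match the first left singular vector of L and the authority scores match the first right singular vector.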
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable because of its random jump step.
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications.
• Probability is also non-negative.
• An additive model captures local structure.
Multiplicative Update Algorithm
• Cost function: Euclidean distance
• Multiplicative update
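A minimal sketch of the multiplicative updates for the Euclidean cost ‖V − WH‖², following the rules in the cited Lee and Seung paper: H ← H ∘ (WᵀV) / (WᵀWH) and W ← W ∘ (VHᵀ) / (WHHᵀ), with element-wise product and division. The small constant `eps`, the random initialization, and the synthetic data are assumptions added for numerical safety and illustration.

```python
import numpy as np

def nmf(V, k, num_iters=200, eps=1e-9, seed=0):
    """Factor a non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps         # positive random initialization
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(num_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update the coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update the bases
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))   # synthetic non-negative rank-3 data
W, H, errors = nmf(V, k=3)
```

Because each update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the reconstruction error does not increase from one iteration to the next.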
Multiplicative Update Algorithm
• Cost function: divergence
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that A and B can be regarded as normalized probability distributions
• Multiplicative update
• PLSA is NMF with KL divergence.
NMF vs PCA
• n = 2429 faces, m = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians: parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x|\theta) = p(x|\theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data.
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model.
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$.
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta)$
• Predictive distribution
IID - Independent, Identically Distributed
• IID means $p(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^n p(x_i | \theta)$
• The problem is considerably simplified as $L(\theta) = \prod_{i=1}^n p(x_i | \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^n \log p(x_i | \theta)$
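As a worked instance of maximum likelihood under the i.i.d. assumption, the sketch below fits a one-dimensional Gaussian. The Gaussian model and the synthetic samples are assumptions for illustration; for this model the log likelihood is maximized in closed form by the sample mean and the (biased) sample standard deviation.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma^2) over i.i.d. samples x."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - (x - mu) ** 2 / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # synthetic i.i.d. samples

mu_ml = x.mean()          # ML estimate of the mean
sigma_ml = x.std()        # ML estimate of the std (divides by n, not n - 1)

best = gaussian_log_likelihood(x, mu_ml, sigma_ml)
shifted = gaussian_log_likelihood(x, mu_ml + 0.1, sigma_ml)
widened = gaussian_log_likelihood(x, mu_ml, sigma_ml * 1.1)
```

Moving either parameter away from its ML estimate lowers the log likelihood, so `best` is larger than both `shifted` and `widened`.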
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1-2, slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008.
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.
• Why do we need latent variables?
  • To describe a complex model: Gaussian mixture model
  • To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values.
• The maximum likelihood procedure finds parameters θ such that
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize.
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
  • E step: fill in values of the latent variables according to the posterior given the data.
  • M step: maximize the likelihood as if the latent variables were not hidden.
• It decomposes difficult problems into a series of tractable steps.
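Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ.
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ:
$L(\theta) = \log p(D|\theta) = \log \int p(X, D|\theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$L(\theta) = \log \int q(X) \frac{p(X, D|\theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D|\theta)}{q(X)}\, dX = F(q, \theta)$
$F(q, \theta) = \int q(X) \log p(X, D|\theta)\, dX + H[q]$
• where H[q] is the entropy of q(X).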
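The E and M Steps of EM
• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, D|\theta)\, dX + H[q]$
• EM alternates between:
  • E step: optimize F with respect to the distribution over hidden variables, holding the parameters fixed:
    $q^{(k)}(X) = \arg\max_{q} F(q, \theta^{(k-1)})$
  • M step: maximize F with respect to the parameters, holding the hidden distribution fixed:
    $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta)$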
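The E Step
• E step: for fixed θ,
$F(q, \theta) = L(\theta) - \mathrm{KL}(q(X) \,\|\, p(X|D, \theta))$
• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $\mathrm{KL}(q(X) \,\|\, p(X|D, \theta)) = 0$.
• So the E step simply sets $q^{(k)}(X) = p(X|D, \theta^{(k-1)})$.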
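The M Step
• M step: maximize F with respect to the parameters, holding the hidden distribution q fixed:
$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D|\theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ.
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically.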
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood:
$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step maximizes F(q, θ) with respect to θ.
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL.
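The monotonicity property can be observed on a concrete latent variable model. The sketch below runs EM on a one-dimensional mixture of two Gaussians; the initialization, the synthetic data, and the function name are assumptions for illustration. The log likelihood is recorded at each iteration so that the never-decreasing behavior can be checked.

```python
import numpy as np

def em_gmm_1d(x, num_iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns parameters and log-likelihood trace."""
    pi = np.array([0.5, 0.5])                    # mixing weights
    mu = np.array([x.min(), x.max()])            # crude but deterministic initialization
    var = np.array([x.var(), x.var()])
    log_liks = []
    for _ in range(num_iters):
        # E step: responsibilities q(z_n = k) = p(z_n = k | x_n, theta)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        log_liks.append(float(np.log(dens.sum(axis=1)).sum()))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted maximum likelihood estimates
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, log_liks

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
pi, mu, var, log_liks = em_gmm_1d(x)
```

Each entry of `log_liks` is the log likelihood under the parameters at the start of that iteration, so the sequence should be non-decreasing.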
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5, slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODELS
Why Do We Need Graphical Models
• Cons
  - A graphical model is so complex, even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C|A, B)\, p(D|B, C)\, p(E|C, D)$
• In general:
$p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i | x_{pa(i)})$
• where pa(i) are the parents of node i.
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively:
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing:
  - Creates this "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms.
• Disadvantages:
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve:
  - Polysemy
    • "Java" could mean "coffee" and also the "PL Java"
    • "Cricket" is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
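pLSA
[Figure: pLSA graphical model in plate notation (d → z → w, with plates N_d and M) and its unrolled form (d → z_1 … z_N → w_1 … w_N)]
• $z_1, \ldots, z_N$ are variables, $z_i \in [1, K]$; K is the number of latent topics.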
pLSA
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_M, each with its own topic variables z and words w]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents.
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform.
Document Decomposition
• Each document can be decomposed as
$p(w|d) = \sum_{z} p(w|z)\, p(z|d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
$p(w|d) = Z_{V \times k}\, p(z|d)$
• With many documents, we hope to find latent topics as a common basis.
pLSA - Objective Function
• pLSA tries to maximize the log likelihood.
• Due to the summation over z inside the log, we have to resort to EM.
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
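The two steps can be sketched with the standard pLSA updates from Hofmann's formulation, parameterized by p(w|z) and p(z|d) as in the document decomposition. E-step: p(z|d,w) ∝ p(w|z) p(z|d). M-step: p(w|z) ∝ Σ_d n(d,w) p(z|d,w) and p(z|d) ∝ Σ_w n(d,w) p(z|d,w), where n(d,w) is the count of word w in document d. The count matrix, the random initialization, and the small smoothing constant are assumptions for illustration.

```python
import numpy as np

def plsa(N, K, num_iters=100, seed=0):
    """N: (D, V) document-term counts. Returns p(w|z) as (K, V) and p(z|d) as (D, K)."""
    rng = np.random.default_rng(seed)
    D, V = N.shape
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    log_liks = []
    for _ in range(num_iters):
        # E-step: posterior p(z | d, w), array of shape (D, K, V)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)                       # p(w | d), shape (D, V)
        log_liks.append(float(np.sum(N * np.log(p_w_d + 1e-12))))
        post = joint / (p_w_d[:, None, :] + 1e-12)
        # M-step: re-estimate both distributions from expected counts
        weighted = N[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, log_liks

# Hypothetical corpus: documents 0-1 use words 0-2, documents 2-3 use words 3-5.
N = np.array([[4, 3, 2, 0, 0, 0],
              [3, 5, 1, 0, 1, 0],
              [0, 0, 0, 2, 4, 3],
              [0, 1, 0, 5, 2, 4]], dtype=float)
p_w_z, p_z_d, log_liks = plsa(N, K=2)
```

As with any EM procedure, the recorded log likelihood should not decrease between iterations, and the result depends on the random initialization.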
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction:
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD:
  - U matrix: related to P(z|d) (doc to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done.
• PLSA generates a model (aspect model) and maximizes its predictive power.
• Selecting the proper value of K is heuristic in LSA.
• Model selection in statistics can determine the optimal K in PLSA.
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion.
• The number of parameters in the model grows linearly with M (the number of documents in the training set).
Problems in pLSA
• There is no constraint on the distributions $p(z|d_i)$.
• This easily leads to serious problems with over-fitting.
[Figure: unrolled pLSA model; each document d_1, d_2, …, d_m has its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.
• Requirements for such a distribution:
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials.
  - Easy to optimize
• The Dirichlet distribution is one such distribution.
• The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex.
Dirichlet Distribution
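• Definition:
$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{k} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex.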
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$
The LDA Model
[Figure: unrolled LDA graphical model; for each document, topics z_1 … z_4 generate words w_1 … w_4, with β shared across documents]
• For each document:
  • Choose θ ~ Dirichlet(α)
  • For each of the N words $w_n$:
    - Choose a topic $z_n$ ~ Multinomial(θ)
    - Choose a word $w_n$ from $p(w_n|z_n)$, a multinomial probability conditioned on the topic $z_n$
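The generative process can be written out directly as sampling code. This is a minimal sketch, assuming NumPy, a hypothetical vocabulary of 6 words, and hand-picked values for α and for the topic-word matrix β (rows are p(w|z)); none of these values come from the slides.

```python
import numpy as np

def generate_document(alpha, beta, num_words, rng):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)                         # topic proportions for this document
    z = rng.choice(len(alpha), size=num_words, p=theta)  # one topic per word position
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # word given its topic
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])                             # assumed Dirichlet parameter, K = 2
beta = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],         # topic 0 uses words 0-2
                 [0.0, 0.0, 0.0, 0.3, 0.3, 0.4]])        # topic 1 uses words 3-5
theta, z, w = generate_document(alpha, beta, num_words=20, rng=rng)
```

Unlike pLSA, the topic proportions θ are themselves drawn from a distribution, so a new document does not require new free parameters.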
Joint Probability
• Given the parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
Inference
• The likelihood can be computed by summing over each document.
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables.
• Unfortunately, this distribution is intractable to compute in general.
• We have to resort to a variational approach.
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence.
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.
VBEM vs EM
• They differ only in the E-step.
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, VBEM approximates p(X|D, θ) by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ)).
• This is also equivalent to maximizing the lower bound L(θ).
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.
• Strategy (variational EM):
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence:
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ, φ
    - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference; repeat until convergence
• M-step: parameter estimation
  - β
  - α can be computed using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.
[Figure: unrolled LDA graphical model; for each document, topics z_1 … z_4 generate words w_1 … w_4, with β shared across documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection:
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation:
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords:
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection. Img1 in term space (Term1 = 1, Term2 = 2, Term3 = 0, Term4 = 0, …, TermN = 2; Dim = 1 million) is projected to topic space (f1 = 0.2, f2 = 0.1, …, fM = 0.03; Dim = 200)]
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$
[Equation: element-wise form of the decomposition, with labels "Base vector" (X = (x_1, …, x_k)), "Low dimensional representation" (w), and "Residual" (ε)]
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
[Equation: inner product $p^T q$ of two documents expressed through their low-dimensional vectors w and residuals]
X1 X2 X3 … Xk
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from:
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k)\, p(z=k|d) + p(x=1|d)\, p'(w|d) + p(x=2|d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
Search (Online)
[Figure: online search pipeline: a query (topic histogram + a few words) → LSH index (DS1, DS2, …) → candidates (Doc 300, Doc 401, …) → re-ranking with Doc Meta (Doc 1, Doc 2, …, Doc N)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image and search results]
Search Example
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space.
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space.
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods:
  - HITS
  - PageRank
  - …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  - Matrix decomposition: a good practice for learning matrices
  - Graphical model: a good practice for learning probability
• A graphical model is a good tool to analyze problems.
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.
• It is more adaptable for various applications than matrix decomposition.
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysis
bull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variance
bull The objective of the rotation transformation is to find the maximal variance
bull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCw
where C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problem
bull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decomposition
bull PCA can be treated as data decomposition
a=UUTa
=(u1u2hellipun) (u1u2hellipun)T a
=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenface
bull Turk MA Pentland AP Face recognition using eigenfaces CVPR 1991 (Citation 2654)
bull The eigenface approachndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreducible
bull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculation
bull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definition
bull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V holds the eigenvectors of $A^T A$
  - Each column of V is an eigenvector for the row matrix A
  - We use V to approximate a row of A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors of $A A^T$
  - Each column of U is an eigenvector for the column matrix A
  - We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$
  - $U_k$: orthogonal basis that we use to approximate all the documents
  - $D_k$: column j holds the coordinates of document j in the new basis
  - $D_k$ is the projection of A onto the subspace spanned by $U_k$
SVD and PCA
• For symmetric positive semidefinite A (such as a covariance matrix), SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  - U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  - U holds the left (column) eigenvectors
  - V holds the right (row) eigenvectors
  - Λ holds the same eigenvalues
• For such a symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g. the term-by-document matrix
  - PCA: A is the covariance matrix, $A = X^T X$, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use a stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix; sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U \Sigma)(U \Sigma)^T$
  $U \Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V \Sigma)(V \Sigma)^T$
  $V \Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U \Sigma^{1/2})(V \Sigma^{1/2})^T$
  $U \Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V \Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs and authorities]
Power Iteration
• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
• Define the adjacency matrix L of the directed web graph
• Now
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
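The iteration can be sketched as follows, assuming the standard HITS updates $a = L^T h$ and $h = L a$ with normalization after each round (the update formulas themselves did not survive on the slide, so this form is an assumption). The 4-page graph is hypothetical.

```python
import math

def hits(L, iters=50):
    """Power iteration for hub scores h and authority scores a.

    L[i][j] = 1 if page i links to page j (rows: outlinks, columns: inlinks).
    """
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # authority update a = L^T h: sum the hub scores of pages pointing to j
        a = [sum(L[i][j] * h[i] for i in range(n)) for j in range(n)]
        # hub update h = L a: sum the authority scores of pages that i points to
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_h = math.sqrt(sum(x * x for x in h))
        a = [x / norm_a for x in a]
        h = [x / norm_h for x in h]
    return h, a

# Hypothetical graph: pages 0 and 3 link to pages 1 and 2; page 1 links to page 2.
L = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

h, a = hits(L)
```

Here page 2 ends up as the strongest authority and pages 0 and 3 as equally strong hubs, matching the intuition on the slide.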
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance
• Multiplicative update
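A minimal sketch of the Euclidean case, assuming the Lee and Seung update rules $H \leftarrow H \circ (W^T V) / (W^T W H)$ and $W \leftarrow W \circ (V H^T) / (W H H^T)$, where the products and quotients are element-wise. The matrices `V`, `W`, `H` below are hypothetical, and the small `eps` in the denominators is an added safeguard against division by zero.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def squared_error(V, W, H):
    """Euclidean cost ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

def nmf_step(V, W, H, eps=1e-12):
    """One round of multiplicative updates: first H, then W."""
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(len(H[0]))]
         for a in range(len(H))]
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(len(W[0]))]
         for i in range(len(W))]
    return W, H

# Hypothetical non-negative data (4 x 3) and positive starting factors with k = 2.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
W = [[0.5, 0.2], [0.3, 0.8], [0.9, 0.1], [0.4, 0.6]]
H = [[0.7, 0.1, 0.5], [0.2, 0.9, 0.3]]

errors = [squared_error(V, W, H)]
for _ in range(200):
    W, H = nmf_step(V, W, H)
    errors.append(squared_error(V, W, H))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative without any explicit projection, and the cost never increases from one round to the next.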
Multiplicative Update Algorithm
• Cost function: divergence
  - Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions
• Multiplicative update
• pLSA is NMF with KL divergence
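For reference, the divergence cost and its multiplicative updates as given in the Lee and Seung NIPS paper cited below, written here with A = V and B = WH (the index names are assumptions chosen to match that paper):

```latex
D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right),
\qquad
H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}},
\qquad
W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}.
```

When $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$ the last two terms cancel and the cost is exactly the Kullback-Leibler divergence.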
NMF vs PCA
• m = 2429 faces, n = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians: parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
• Predictive distribution
IID - Independent Identically Distributed
• IID means
• The problem is considerably simplified as
• Usually the log likelihood is used
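Under the i.i.d. assumption the likelihood factorizes, $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$, so the log likelihood is a sum, $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$. As an illustration, the sketch below assumes a one-dimensional Gaussian model with parameters (mu, sigma2) and uses its closed-form ML estimates; the model choice and the data values are hypothetical.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for i.i.d. samples."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma2))

def gaussian_mle(xs):
    """Closed-form maximizers of the Gaussian log likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n   # ML estimate divides by n
    return mu, sigma2

xs = [2.1, 1.9, 2.4, 2.0, 1.6]                    # hypothetical samples
mu_ml, sigma2_ml = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_ml, sigma2_ml)
```

Any other parameter setting, for example shifting the mean or doubling the variance, gives a lower value than `best`.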
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian mixture model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data $\mathcal{D} = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• where H[q] is the entropy of q(X)
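The bound referred to above is the standard one; a sketch of the derivation, using the names L(θ) for the log likelihood and F(q, θ) for the lower bound as in the following slides:

```latex
\begin{aligned}
L(\theta) &= \log p(\mathcal{D} \mid \theta) = \log \int p(X, \mathcal{D} \mid \theta)\, dX \\
          &= \log \int q(X)\, \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX \\
          &\ge \int q(X) \log \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX
             \qquad \text{(Jensen's inequality, since $\log$ is concave)} \\
          &= \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q] \;=:\; F(q, \theta).
\end{aligned}
```

The gap between the two sides is $KL(q(X) \,\|\, p(X \mid \mathcal{D}, \theta))$, which is why the bound is tight exactly when q equals the posterior.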
The E and M Steps of EM
• The lower bound on the log likelihood is given by
• EM alternates between
• E step: optimize with respect to the distribution over hidden variables, holding the parameters fixed
• M step: maximize with respect to the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)) = 0$
• So the E step simply sets $q(X) = p(X \mid \mathcal{D}, \theta)$
The M Step
• M step: maximize with respect to the parameters, holding the hidden distribution q fixed
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ) (pushes the bound to the top)
• The M step maximizes F(q, θ) with respect to θ (lifts the bound)
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
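The monotonicity claim can be observed on a small example. The sketch below assumes a one-dimensional mixture of two Gaussians (the "Gaussian Mixture Model" mentioned earlier); the data, the starting values and the variance floor `1e-6` are hypothetical choices made for illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """L(theta) = sum_n log sum_k pi_k N(x_n | mu_k, var_k)."""
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances)))
               for x in xs)

def em_step(xs, pis, mus, variances):
    K = len(pis)
    # E step: set q(z_n = k) to the posterior p(z_n = k | x_n, theta)
    resp = []
    for x in xs:
        w = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M step: maximize the bound with q fixed (weighted ML estimates)
    Nk = [sum(r[k] for r in resp) for k in range(K)]
    pis = [Nk[k] / len(xs) for k in range(K)]
    mus = [sum(r[k] * x for r, x in zip(resp, xs)) / Nk[k] for k in range(K)]
    variances = [max(sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / Nk[k],
                     1e-6)  # hypothetical floor to avoid a collapsed component
                 for k in range(K)]
    return pis, mus, variances

xs = [-2.1, -1.9, -2.4, -1.6, -2.0, 1.8, 2.2, 2.5, 1.5, 2.0]  # hypothetical data
pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]     # arbitrary start

lls = [log_likelihood(xs, pis, mus, variances)]
for _ in range(20):
    pis, mus, variances = em_step(xs, pis, mus, variances)
    lls.append(log_likelihood(xs, pis, mus, variances))
```

The recorded values in `lls` form a non-decreasing sequence, and the two means settle near the two clusters in the data.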
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex even with a few circles...
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but the joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
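The factorization above can be made concrete with binary variables. In the sketch below all conditional probability tables are hypothetical numbers; the point is only that multiplying the local conditionals yields a valid joint distribution with far fewer parameters than a full table.

```python
from itertools import product

# Hypothetical tables: probability that each binary variable equals 1
# given the values of its parents.
P_A = 0.3
P_B = 0.6
P_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # keyed by (A, B)
P_D = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # keyed by (B, C)
P_E = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # keyed by (C, D)

def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bernoulli(P_A, a) * bernoulli(P_B, b)
            * bernoulli(P_C[(a, b)], c)
            * bernoulli(P_D[(b, c)], d)
            * bernoulli(P_E[(c, d)], e))

total = sum(joint(*values) for values in product((0, 1), repeat=5))

n_params_factored = 1 + 1 + len(P_C) + len(P_D) + len(P_E)
n_params_full = 2 ** 5 - 1
```

The factored model needs 14 numbers, whereas an unrestricted joint table over five binary variables needs 31.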
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal styles
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the documents share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the programming language Java
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
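pLSA
[Figure: pLSA graphical model in plate notation (nodes d, z, w; plates $N_d$ and M), next to its unrolled form in which a document d generates topics z1 ... zN and words w1 ... wN]
z1 ... zN are variables, $z_i \in [1, K]$; K is the number of latent topics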
pLSA
[Figure: unrolled pLSA model for documents d1, d2, ..., dM; document $d_i$ generates topics z1 ... $z_{N_i}$ and words w1 ... $w_{N_i}$]
p(w|z=1), p(w|z=2), ..., p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to the matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
pLSA - Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
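For reference, the quantities named on the last few slides are commonly written as follows for the aspect model (following Hofmann's formulation; the count notation n(d, w), the number of times word w occurs in document d, is an assumption introduced here):

```latex
\begin{aligned}
p(d, z, w) &= p(d)\, p(z \mid d)\, p(w \mid z) \\
p(d, w)    &= p(d) \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d) \\
\mathcal{L} &= \sum_{d} \sum_{w} n(d, w) \log p(d, w)
\end{aligned}
```

With p(d) uniform, maximizing $\mathcal{L}$ is the same as maximizing $\sum_d \sum_w n(d, w) \log \sum_z p(w \mid z)\, p(z \mid d)$, and the sum over z inside the logarithm is what prevents a closed-form solution.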
EM Steps
• E-step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-step
• The M-step
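The two steps can be sketched for the parameterization p(w|z), p(z|d) used above, assuming the usual pLSA updates: the E-step computes $p(z \mid d, w) \propto p(w \mid z)\, p(z \mid d)$, and the M-step re-estimates $p(w \mid z) \propto \sum_d n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_w n(d, w)\, p(z \mid d, w)$. The count matrix and the starting values are hypothetical.

```python
import math

def plsa_em(counts, p_w_z, p_z_d, iters=20):
    """EM for pLSA.

    counts[d][w] = n(d, w); p_w_z[k][w] = p(w | z=k); p_z_d[d][k] = p(z=k | d).
    Returns the fitted tables and the log likelihood seen at each iteration.
    """
    D, V, K = len(counts), len(counts[0]), len(p_w_z)
    history = []
    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        ll = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(K)]
                s = sum(joint)                 # p(w | d) under current parameters
                ll += n * math.log(s)
                for k in range(K):
                    post = joint[k] / s        # E-step: p(z=k | d, w)
                    new_wz[k][w] += n * post   # expected counts for the M-step
                    new_zd[d][k] += n * post
        history.append(ll)
        # M-step: normalize the expected counts into distributions
        p_w_z = [[x / sum(row) for x in row] for row in new_wz]
        p_z_d = [[x / sum(row) for x in row] for row in new_zd]
    return p_w_z, p_z_d, history

# Hypothetical corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
init_w_z = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
init_z_d = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]

p_w_z, p_z_d, history = plsa_em(counts, init_w_z, init_z_d)
```

On this toy corpus the two topics separate cleanly, and `history` is non-decreasing, as the EM argument above predicts.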
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix: related to P(z|d) (doc to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman and X. Munoz, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman, Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z \mid d_i)$
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d1, d2, ..., dm, each with its own topic distribution p(z|d1), p(z|d2), ..., p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
• Definition
• The density is zero outside this open (K-1)-dimensional simplex
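$p(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{k} x_i = 1$, $x_i > 0$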
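Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: four Dirichlet densities on the simplex, for α = (6, 2, 2), (3, 7, 5), (2, 3, 4) and (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: three panels, $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$]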
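To get a feel for these densities, one can draw samples. The sketch below uses the standard construction of a Dirichlet sample from independent Gamma variates (a technique assumed here, not described on the slides) with the parameter α = (6, 2, 2) from the first panel; the sample size and the seed are arbitrary.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex: normalize independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)                 # fixed seed so the run is repeatable
alpha = [6.0, 2.0, 2.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]

# The empirical mean should be close to alpha_i / alpha_0 = (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
```

Scaling all $\alpha_i$ up while keeping their ratios concentrates the samples around the same mean, while values below 1 push the mass towards the corners of the simplex, which is what the $\alpha_0$ panels illustrate.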
The LDA Model
[Figure: LDA graphical model unrolled for three documents, each with topics z1 ... z4 and words w1 ... w4, and the shared parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words $w_n$
    - Choose a topic $z_n$ ~ Multinomial(θ)
    - Choose a word $w_n$ from $p(w_n \mid z_n)$, a multinomial probability conditioned on the topic $z_n$
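The generative process can be sketched directly. In the code below, `alpha` and the topic-word table `beta` (with `beta[k][v]` standing for p(w = v | z = k)) are hypothetical values for a 2-topic, 4-word vocabulary, and the Dirichlet draw uses normalized Gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1   # guard against rounding in the cumulative sum

def generate_document(alpha, beta, n_words, rng):
    """One pass of the LDA generative process for a single document."""
    theta = sample_dirichlet(alpha, rng)          # topic proportions of this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # choose a topic z_n
        w = sample_categorical(beta[z], rng)      # choose a word w_n given z_n
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(1)
alpha = [0.5, 0.5]                                # hypothetical Dirichlet parameter
beta = [[0.5, 0.5, 0.0, 0.0],                     # topic 0 emits only words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]                     # topic 1 emits only words 2 and 3
theta, topics, words = generate_document(alpha, beta, 20, rng)
```

Unlike pLSA, the per-document proportions `theta` are themselves drawn from a distribution, so a new document needs no new parameters.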
Joint Probability
• Given the parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
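For reference, these three quantities as they appear in Blei, Ng and Jordan (2003), cited at the end of this part. Here $\mathbf{w} = (w_1, \dots, w_N)$ are the words of one document, $\mathbf{z}$ the corresponding topics, and M the number of documents:

```latex
\begin{aligned}
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta \\
p(D \mid \alpha, \beta)
  &= \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
\end{aligned}
```

The integral over θ couples all the topic assignments of a document, which is the source of the intractability discussed next.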
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, VBEM approximates p(X|D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence
• M-step: parameter estimation
  - β
  - α can be estimated using the Newton-Raphson method
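For reference, the update equations from Blei, Ng and Jordan (2003) that these two steps refer to. Here Ψ is the digamma function, $\phi_{ni}$ is the variational probability that word n belongs to topic i, and $w_{dn}^{j}$ is 1 if the n-th word of document d is vocabulary item j and 0 otherwise:

```latex
\begin{aligned}
\text{E-step:}\quad
  \phi_{ni} &\propto \beta_{i w_n} \exp\{\Psi(\gamma_i)\},
  \qquad
  \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni} \\
\text{M-step:}\quad
  \beta_{ij} &\propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
\end{aligned}
```

The E-step equations are iterated per document until γ and φ stop changing; the update for α has no closed form, hence the Newton-Raphson step.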
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents, each with topics z1 ... z4 and words w1 ... w4, and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model in plate notation with nodes π, z, x, θ, β and plates M and $N_d$]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes, E. Sudderth, A. Torralba, W. Freeman and A. Willsky, NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection. Img1 in term space, (1, 2, 0, 0, ..., 2) over Term1 ... TermN with Dim = 1 million, is projected to feature space, (0.2, 0.1, ..., 0.03) over f1 ... fM with Dim = 200]
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low dimensional histogram (bar chart) + a few words (10 words)]
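Orthogonal Decomposition
$p = Xw + \varepsilon$
  - X: base vectors $(x_1, x_2, \dots, x_k)$
  - w: low dimensional representation
  - ε: residual
[Figure: an image = a low dimensional histogram (bar chart) + a few words (10 words)]
[Equations: stepwise orthogonal decomposition of p over the base vectors $x_1, \dots, x_k$, and the similarity $p^T q$ of two documents expressed through their low dimensional representations and residuals]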
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a low dimensional histogram + a few words) is looked up in the LSH index to retrieve candidates (Doc 300, Doc 401, ...), which are then re-ranked using the document metadata (Doc 1 ... Doc N, each with entries DS1, DS2, ...)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image]
Search Example
[Figure: query image]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-Jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• ...
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice for learning matrices
  - Graphical model: a good practice for learning probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• Graphical model is more adaptable to various applications than matrix decomposition
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysis
bull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variance
bull The objective of the rotation transformation is to find the maximal variance
bull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCw
where C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problem
bull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decomposition
bull PCA can be treated as data decomposition
a=UUTa
=(u1u2hellipun) (u1u2hellipun)T a
=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenface
bull Turk MA Pentland AP Face recognition using eigenfaces CVPR 1991 (Citation 2654)
bull The eigenface approachndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreducible
bull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculation
bull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definition
bull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCA
bull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI
bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basis
ndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvalues
bull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)
1 Document file preparation preprocessingndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
$U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
$V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
$U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
$V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– authorities contain high-quality information
– hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: Hubs and Authorities]
Power Iteration
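• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
$a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}, \qquad h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and $L_{ij} = 0$ otherwise
• Now $a^{(k)} = L^T h^{(k-1)}, \qquad h^{(k)} = L\, a^{(k)}$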
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
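The link between HITS and SVD can be seen on a tiny graph. This is a minimal sketch, assuming NumPy, a made-up four-page adjacency matrix, and the convention `L[i, j] = 1` when page i links to page j; the function name `hits` and the per-step normalization are choices made for the example.

```python
import numpy as np

def hits(L, n_iter=200):
    """Power iteration for hub scores h and authority scores a (unit 2-norm)."""
    n = L.shape[0]
    h = np.ones(n) / np.sqrt(n)
    a = np.ones(n) / np.sqrt(n)
    for _ in range(n_iter):
        a = L.T @ h                 # authorities collect the scores of hubs linking to them
        a /= np.linalg.norm(a)
        h = L @ a                   # hubs collect the scores of authorities they link to
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical web graph: rows are outlinks, columns are inlinks.
L = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

h, a = hits(L)

# Compare with the first left/right singular vectors (defined up to sign).
U, s, Vt = np.linalg.svd(L)
u1, v1 = np.abs(U[:, 0]), np.abs(Vt[0, :])
print(h, u1)
print(a, v1)
```

Because singular vectors are only defined up to sign, the comparison uses absolute values; the HITS scores themselves are non-negative.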
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
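• Cost function: Euclidean distance
$\|V - WH\|^2 = \sum_{ij} \left( V_{ij} - (WH)_{ij} \right)^2$
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$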
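A minimal sketch of the multiplicative updates for the Euclidean cost, in the form given by Lee and Seung, assuming NumPy. The function name `nmf_euclidean`, the random initialization, the small `eps` added to the denominators, and the synthetic test matrix are all choices made for this example.

```python
import numpy as np

def nmf_euclidean(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor V (n x m, non-negative) as W (n x k) times H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Element-wise updates; eps only guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

# Synthetic non-negative data with an exact rank-3 structure (assumption).
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 15))
W, H, errors = nmf_euclidean(V, k=3)
print(errors[0], errors[-1])
```

Since every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection step, and the recorded cost does not increase from one iteration to the next.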
Multiplicative Update Algorithm
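• Cost function: divergence
$D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
– Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
– A and B can then be regarded as normalized probability distributions
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence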
NMF vs PCA
• n = 2429 faces, m = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture of Gaussians: parameter estimation
• pLSA
– Motivation
– Derivation & geometric properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
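• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
$\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \dots, x_n \mid \theta)$
• Predictive distribution
$p(x \mid x_1, \dots, x_n) \approx p(x \mid \theta_{ML})$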
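IID – Independent Identically Distributed
• IID means
$p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as
$\theta_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used
$\theta_{ML} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta)$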
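For a concrete case of maximum likelihood under the IID assumption, consider a one-dimensional Gaussian, where the maximizer of the log likelihood is available in closed form (sample mean and biased sample variance). The sketch below assumes NumPy and synthetic data; the function name is hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over IID samples x."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data (assumption)

# Closed-form ML estimates for a Gaussian.
mu_ml = x.mean()
sigma2_ml = np.mean((x - mu_ml) ** 2)

best = gaussian_log_likelihood(x, mu_ml, sigma2_ml)
print(mu_ml, sigma2_ml, best)
```

Any other parameter setting, including the true generating values, gives a log likelihood no higher than `best` on this sample.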
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA, LDA
More General
• Data set D
• Likelihood $p(D \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
$\theta_{ML} = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta \log p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
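Jensen's Inequality
• For weights $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$ and any $x_i > 0$: $\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$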
Lower Bounding the Log Likelihood
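• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• where H[q] is the entropy of q(X)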
The E and M Steps of EM
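• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
$q^{(k)}(X) = \arg\max_{q(X)} F(q(X), \theta^{(k-1)})$
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
$\theta^{(k)} = \arg\max_\theta F(q^{(k)}(X), \theta)$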
The E Step
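• E step: for fixed θ
$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL\left(q(X) \,\|\, p(X \mid D, \theta)\right)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL\left(q(X) \,\|\, p(X \mid D, \theta)\right) = 0$, i.e. when $q(X) = p(X \mid D, \theta)$
• So the E step simply sets $q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$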
The M Step
• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
$\theta^{(k)} = \arg\max_\theta F(q^{(k)}(X), \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
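• The E and M steps together never decrease the log likelihood
$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ) (the equality)
• The M step maximizes F(q, θ) w.r.t. θ (the first inequality)
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL (the second inequality)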
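The monotonicity property can be observed on a one-dimensional Gaussian mixture, the example named earlier in the deck. This is a sketch under several assumptions: NumPy, synthetic two-cluster data, and a quantile-based initialization chosen here only to make the run deterministic (the slides allow arbitrary starting values). The names `normal_pdf` and `em_gmm_1d` are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=50):
    """EM for a 1-D mixture of k Gaussians; returns parameters and the log-likelihood trace."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic init (assumption)
    var = np.full(k, x.var())
    log_liks = []
    for _ in range(n_iter):
        # E step: posterior responsibility of each component for each point.
        joint = pi * normal_pdf(x[:, None], mu, var)          # shape (n, k)
        log_liks.append(np.sum(np.log(joint.sum(axis=1))))    # log likelihood at current params
        r = joint / joint.sum(axis=1, keepdims=True)
        # M step: weighted ML estimates, as if the assignments were observed.
        nk = r.sum(axis=0)
        pi = nk / x.size
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, log_liks

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, var, log_liks = em_gmm_1d(x)
print(mu, log_liks[0], log_liks[-1])
```

The recorded log likelihoods form a non-decreasing sequence, which is the chain of inequalities above seen numerically.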
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model looks complex even with a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general
$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
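To make the factorization concrete, the sketch below builds the five-variable joint from the slide out of small conditional tables for binary variables and checks that it is a valid distribution. The probability values are random placeholders and the helper names are invented; only NumPy and the standard library are assumed.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """P(child = 1 | parents) for every configuration of binary parents (placeholder values)."""
    return rng.random((2,) * n_parents)

p_a, p_b = 0.3, 0.6          # P(A = 1), P(B = 1)
p_c = random_cpt(2)          # indexed by (a, b)
p_d = random_cpt(2)          # indexed by (b, c)
p_e = random_cpt(2)          # indexed by (c, d)

def bern(p1, value):
    """Probability of a binary value given P(value = 1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    # p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[a, b], c)
            * bern(p_d[b, c], d) * bern(p_e[c, d], e))

total = sum(joint(*v) for v in itertools.product([0, 1], repeat=5))
n_params_factored = 1 + 1 + 4 + 4 + 4    # free parameters in the factored form
n_params_full = 2 ** 5 - 1               # free parameters in an unrestricted joint table
print(total, n_params_factored, n_params_full)
```

The factored form needs 14 numbers instead of 31 for the full table, which is the decoupling benefit mentioned among the "Pros".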
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural-language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies due to personal style
• Latent semantic indexing
– Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • Java could mean "coffee" and also the "PL Java"
  • Cricket is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
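[Figure: pLSA graphical model in plate notation (nodes d, z, w; plates N_d and M) and its unrolled form for one document (d; z_1 … z_N; w_1 … w_N)]
$z_1, \dots, z_N$ are variables with $z_i \in [1, K]$; K is the number of latent topics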
pLSA
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_M; document d_i has topics z_1 … z_{N_i} and words w_1 … w_{N_i}]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
$p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables)
$p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
$p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\; p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
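pLSA – Objective Function
• pLSA tries to maximize the log likelihood
$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \left[ p(d) \sum_{z} p(z \mid d)\, p(w \mid z) \right]$
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM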
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
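• The E-Step
$p(z \mid d, w) = \frac{p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(z' \mid d)\, p(w \mid z')}$
• The M-Step
$p(w \mid z) = \frac{\sum_{d} n(d, w)\, p(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, p(z \mid d, w')}, \qquad p(z \mid d) = \frac{\sum_{w} n(d, w)\, p(z \mid d, w)}{\sum_{w} n(d, w)}$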
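The two steps can be sketched in a few lines for the standard pLSA parameterization with p(z|d) and p(w|z). This is an illustrative implementation, assuming NumPy and a tiny invented count matrix; the function name `plsa`, the random initialization, and the `1e-12` guards are choices made here. The constant p(d) term is dropped from the tracked log likelihood because it does not depend on the parameters being learned.

```python
import numpy as np

def plsa(N, K, n_iter=100, seed=0):
    """N: document-by-word count matrix (M x V). Returns p(z|d), p(w|z) and a log-likelihood trace."""
    rng = np.random.default_rng(seed)
    M, V = N.shape
    p_z_d = rng.random((M, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    log_liks = []
    for _ in range(n_iter):
        # E step: posterior p(z | d, w), stored with shape (M, V, K).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_w_d = joint.sum(axis=2)
        log_liks.append(np.sum(N * np.log(p_w_d + 1e-12)))
        post = joint / (p_w_d[:, :, None] + 1e-12)
        # M step: re-estimate both distributions from expected counts n(d,w) p(z|d,w).
        weighted = N[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z, log_liks

# Toy corpus (assumption): two documents use words 0-2, two use words 3-5.
N = np.array([
    [4, 3, 5, 0, 0, 0],
    [3, 5, 4, 0, 0, 0],
    [0, 0, 0, 5, 4, 3],
    [0, 0, 0, 3, 4, 5],
], dtype=float)
p_z_d, p_w_z, log_liks = plsa(N, K=2)
print(np.round(p_z_d, 3))
print(np.round(p_w_z, 3))
```

The posterior array has size M × V × K, so this dense form is only suitable for small examples; a practical implementation would iterate over the non-zero counts.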
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U matrix: related to P(z|d) (doc to aspect)
– V matrix: related to P(w|z) (aspect to term)
– Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman and X. Munoz, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman, Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_m, each with its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
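• Definition
$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex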
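Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities for α_0 = 0.1, α_0 = 1, α_0 = 10]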
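The effect of α_0 with equal α_i can also be seen by sampling. The sketch below, assuming NumPy's `Generator.dirichlet`, draws samples for the three α_0 values on the slide and reports the average size of the largest component; the sample size and this summary statistic are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
mean_max = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = np.full(K, alpha0 / K)             # equal alpha_i summing to alpha0
    samples = rng.dirichlet(alpha, size=2000)  # each row lies on the (K-1)-simplex
    assert np.allclose(samples.sum(axis=1), 1.0)
    # Average largest coordinate: close to 1 means samples sit near a corner.
    mean_max[alpha0] = samples.max(axis=1).mean()
    print(alpha0, round(float(mean_max[alpha0]), 3))
```

Small α_0 pushes samples toward the corners of the simplex (nearly one dominant topic), while large α_0 concentrates them around the uniform mixture.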
The LDA Model
[Figure: LDA graphical model; each document has topics z_1 … z_4 and words w_1 … w_4, and β is shared by all documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
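The generative process above translates almost line by line into a sampler. This is a sketch assuming NumPy; the function name `generate_document`, the two-topic α, and the 2 × 5 topic-word matrix β are invented for illustration.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)                           # theta ~ Dirichlet(alpha)
    z = rng.choice(len(alpha), size=n_words, p=theta)      # z_n ~ Multinomial(theta)
    # w_n ~ p(w | z_n, beta): row z_n of beta is that topic's word distribution.
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])            # hypothetical Dirichlet parameter, K = 2 topics
beta = np.array([                       # hypothetical topic-word probabilities, V = 5 words
    [0.4, 0.4, 0.1, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.4, 0.4],
])
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
print(np.round(theta, 3))
print(z)
print(w)
```

Each call draws a fresh θ, so different documents get different topic proportions while sharing the same β.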
Joint Probability
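• Given parameters α and β
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(z_n \mid \theta) = \theta_i$ for the topic i selected by $z_n$
Likelihood
• Joint probability
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents
$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$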
Inference
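• The log likelihood can be computed by summing over the documents: $\log p(D \mid \alpha, \beta) = \sum_{d} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality, as in EM, gives a lower bound for each document
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach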
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
$\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\left(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\right)$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound F(q, θ) on the log likelihood L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower-bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
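• E-Step: variational inference, repeat until convergence
$\phi_{ni} \propto \beta_{i w_n} \exp\left\{ \Psi(\gamma_i) - \Psi\left(\textstyle\sum_{j} \gamma_j\right) \right\}, \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
where Ψ is the digamma function
• M-Step: parameter estimation
$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, [w_{dn} = j]$
The update of α can be implemented using the Newton-Raphson method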
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model; each document has topics z_1 … z_4 and words w_1 … w_4, and β is shared by all documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11×11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky, NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection of Img1 from term space (Dim = 1 million; Term1 = 1, Term2 = 2, Term3 = 0, Term4 = 0, …, TermN = 2) to topic space (Dim = 200; f1 = 0.2, f2 = 0.1, …, fM = 0.03)]
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a topic histogram + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$, with $X = (x_1\ x_2\ \dots\ x_k)$
– X: base vectors
– w: low-dimensional representation
– ε: residual
[Figure: an image = a topic histogram + a few words (10 words)]
[Equation: inner product $p^T q$ of two documents p and q expressed through their low-dimensional representations $w_p$ and $w_q$]
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search. A query (a topic histogram + a few words) is looked up in the LSH index (entries DS1, DS2, …), which returns candidates such as Doc 300 and Doc 401; the candidates are then re-ranked using the doc meta store (Doc 1, Doc 2, …, Doc N)]
Index: 10M images, 46GB
Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Reply reconstruction → Network construction → Expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in the expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
– Finds the most influential node
Evaluation
• Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori)       0.674   0.362   0.243
EABIF(rec)       0.742   0.318   0.281
PageRank(ori)    0.675   0.377   0.263
PageRank(rec)    0.743   0.321   0.266
HITS(ori)        0.906   0.832   0.900
HITS(rec)        0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: a good practice to learn matrices
– Graphical model: a good practice to learn probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCA
bull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI
bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basis
ndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvalues
bull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)
1 Document file preparation preprocessingndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space V
VΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column $w_i$ represents one basis vector
• H: coordinates of V projected onto W
  $v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
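Multiplicative Update Algorithm
• Cost function: Euclidean distance
  $\|V - WH\|^2 = \sum_{ij} \left(V_{ij} - (WH)_{ij}\right)^2$
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^TV)_{a\mu}}{(W^TWH)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(VH^T)_{ia}}{(WHH^T)_{ia}}$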
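A minimal sketch of the Euclidean multiplicative updates, assuming the Lee and Seung rules $H \leftarrow H \odot (W^TV) / (W^TWH)$ and $W \leftarrow W \odot (VH^T) / (WHH^T)$. The helper names, the small constant `eps` guarding against division by zero, and the toy matrix `toy_v` are illustrative assumptions rather than part of the slides.

```python
import random


def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def squared_error(v, w, h):
    """Squared Euclidean distance between V and WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # strictly positive random initialization keeps all factors non-negative
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(v, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(v, w, h))
    return w, h, errors


# Hypothetical non-negative data: 4-dimensional samples stored as 3 columns
toy_v = [[1, 2, 3], [2, 4, 6], [1, 0, 1], [3, 2, 5]]
w, h, errors = nmf(toy_v, k=2)
```

Because every update multiplies by a ratio of non-negative quantities, W and H stay non-negative without any explicit projection step, and the recorded error should not increase from one iteration to the next.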
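Multiplicative Update Algorithm
• Cost function: divergence
  $D(A \| B) = \sum_{ij} \left(A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij}\right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  - A and B can then be regarded as normalized probability distributions
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• pLSA is NMF with KL divergence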
NMF vs PCA
• m = 2429 faces, n = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture Gaussian parameter estimation
• pLSA
  - Motivation
  - Derivation & geometry properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
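• Maximum likelihood finds the model parameters that make the observed data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
  $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \ldots, x_n \mid \theta)$
• Predictive distribution
  $p(x \mid x_1, \ldots, x_n) \approx p(x \mid \theta_{ML})$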
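IID - Independent Identically Distributed
• IID means
  $p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta)$
• The problem is considerably simplified as
  $\theta_{ML} = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta)$
• Usually the log likelihood is used
  $\theta_{ML} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$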
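As a small worked example of maximum likelihood under the i.i.d. assumption, the sketch below evaluates the log likelihood of coin flips on a grid of parameter values. The Bernoulli model, the sample list, and the grid search are illustrative assumptions chosen for brevity; they are not taken from the slides.

```python
import math


def log_likelihood(theta, samples):
    """Log likelihood of i.i.d. Bernoulli samples (0 or 1) under parameter theta.

    Independence turns the product of probabilities into a sum of logs.
    """
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in samples)


# Hypothetical coin flips: 7 heads (1) and 3 tails (0)
samples = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Evaluate the log likelihood on a grid and keep the best parameter
grid = [i / 100 for i in range(1, 100)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, samples))

# For a Bernoulli model the maximizer is also available in closed form
theta_closed_form = sum(samples) / len(samples)
```

Both routes give the fraction of heads, which matches the intuition that the parameter making the data most likely is the observed frequency.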
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe complex models: Gaussian Mixture Model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
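• Data set: $\mathcal{D}$
• Likelihood: $L(\theta) = p(\mathcal{D} \mid \theta) = \int p(X, \mathcal{D} \mid \theta)\, dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_\theta L(\theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize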
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  - E step: fill in values of latent variables according to the posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
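For reference, the form of Jensen's inequality used in EM is the one for the concave logarithm. The weights $\lambda_i$ and points $x_i$ below are generic symbols introduced for this statement.

```latex
\begin{align*}
&\text{For } \lambda_i \ge 0,\ \sum_i \lambda_i = 1,\ x_i > 0: &
\log\Big(\sum_i \lambda_i x_i\Big) &\ge \sum_i \lambda_i \log x_i, \\
&\text{or, for a positive random variable } X: &
\log \mathbb{E}[X] &\ge \mathbb{E}[\log X].
\end{align*}
```

Equality holds when all $x_i$ with nonzero weight are equal, which is why the bound can be made tight by a suitable choice of the averaging distribution.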
Lower Bounding the Log Likelihood
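• Observed data $\mathcal{D} = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ
  $L(\theta) = \log p(\mathcal{D} \mid \theta) = \log \int p(X, \mathcal{D} \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
  $L(\theta) = \log \int q(X) \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX = F(q, \theta)$
  $F(q, \theta) = \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q]$
• where H[q] is the entropy of q(X)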
The E and M Steps of EM
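• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q]$
• EM alternates between
  - E step: optimize F wrt the distribution over hidden variables, holding the parameters fixed
    $q^{(k+1)}(X) = \arg\max_{q(X)} F(q(X), \theta^{(k)})$
  - M step: maximize F wrt the parameters, holding the hidden distribution fixed
    $\theta^{(k+1)} = \arg\max_\theta F(q^{(k+1)}(X), \theta)$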
The E Step
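• E step: for fixed θ
  $F(q, \theta) = \int q(X) \log \frac{p(X \mid \mathcal{D}, \theta)\, p(\mathcal{D} \mid \theta)}{q(X)}\, dX = L(\theta) - KL\big(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when $KL\big(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)\big) = 0$
• So the E step simply sets $q^{(k+1)}(X) = p(X \mid \mathcal{D}, \theta^{(k)})$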
The M Step
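• M step: maximize F wrt the parameters, holding the hidden distribution q fixed
  $\theta^{(k+1)} = \arg\max_\theta F(q^{(k+1)}(X), \theta) = \arg\max_\theta \int q^{(k+1)}(X) \log p(X, \mathcal{D} \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically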
EM Never Decreases the Likelihood
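• The E and M steps together never decrease the log likelihood
  $L(\theta^{(k)}) = F(q^{(k+1)}, \theta^{(k)}) \le F(q^{(k+1)}, \theta^{(k+1)}) \le L(\theta^{(k+1)})$
• The E step brings F(q, θ) up to the likelihood L(θ) (the bound touches the likelihood)
• The M step maximizes F(q, θ) wrt θ (the bound is lifted)
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL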
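The monotonicity claim can be checked numerically on a small latent variable model. The sketch below runs EM on a one-dimensional mixture of two Gaussians and records the log likelihood after every iteration. The synthetic data, the initial means, and all function names are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log likelihood of the data under the current mixture parameters."""
    return sum(math.log(sum(w * normal_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)


def em_gmm(data, initial_means, iterations=30):
    means = list(initial_means)
    k = len(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        # E step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate parameters from the responsibility-weighted data
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Hypothetical data: two well-separated clusters around -2 and 3
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
weights, means, variances, history = em_gmm(data, initial_means=[-1.0, 1.0])
```

The `history` list is the quantity of interest here: each entry should be at least as large as the previous one, up to floating-point rounding.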
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is so complex, even with a few circles...
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - Graphical models can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
  $p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general
  $p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
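To make the factorization concrete, the sketch below encodes the five-node example with binary variables and checks that the product of local conditionals is a valid joint distribution. All probability values in the tables are hypothetical numbers chosen for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# Each entry is the probability that the child equals 1 given its parents.
p_a = 0.3
p_b = 0.6
p_c_given_ab = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}
p_d_given_bc = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}
p_e_given_cd = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}


def bernoulli(p_one, value):
    """Probability of a binary value when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bernoulli(p_a, a)
            * bernoulli(p_b, b)
            * bernoulli(p_c_given_ab[(a, b)], c)
            * bernoulli(p_d_given_bc[(b, c)], d)
            * bernoulli(p_e_given_cd[(c, d)], e))


# The factorized joint sums to one over all 2^5 assignments
total = sum(joint(*assignment) for assignment in product([0, 1], repeat=5))

# Marginal p(E = 1), obtained by summing the joint over the other variables
p_e1 = sum(joint(a, b, c, d, 1) for a, b, c, d in product([0, 1], repeat=4))
```

The factorized form needs 1 + 1 + 4 + 4 + 4 = 14 numbers, whereas an unrestricted joint table over five binary variables needs 31, which is the saving the decoupling into conditionals provides.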
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal styles
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    - Java could mean "coffee" and also the programming language Java
    - Cricket is a "game" and also an "insect"
  - Synonymy
    - "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
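pLSA
[Figure: plate notation of pLSA (d → z → w, with plates $N_d$ and M) next to its unrolled form (d → $z_1, \ldots, z_N$ → $w_1, \ldots, w_N$)]
• $z_1, \ldots, z_N$ are variables, $z_i \in [1, K]$, where K is the number of latent topics
pLSA
[Figure: unrolled pLSA model for documents $d_1, d_2, \ldots, d_M$, each with its own topics $z_1, \ldots, z_{N_m}$ and words $w_1, \ldots, w_{N_m}$]
• $p(w \mid z=1), p(w \mid z=2), \ldots, p(w \mid z=K)$ are shared by all documents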
Joint Probability vs Likelihood
• Joint probability
  $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables)
  $p(d, w) = p(d) \sum_z p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
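• Each document can be decomposed as
  $p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  $p(w \mid d) = Z_{V \times K}\, p(z \mid d)$
  where column k of Z is the topic distribution $p(w \mid z = k)$ and V is the vocabulary size
• With many documents, we hope to find latent topics as a common basis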
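pLSA - Objective Function
• pLSA tries to maximize the log likelihood
  $\mathcal{L} = \sum_d \sum_w n(d, w) \log p(d, w) = \sum_d \sum_w n(d, w) \log \Big[ p(d) \sum_z p(w \mid z)\, p(z \mid d) \Big]$
  where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM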
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
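• The E-Step
  $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step
  $p(w \mid z) = \frac{\sum_d n(d, w)\, p(z \mid d, w)}{\sum_{w'} \sum_d n(d, w')\, p(z \mid d, w')}, \qquad p(z \mid d) = \frac{\sum_w n(d, w)\, p(z \mid d, w)}{\sum_w n(d, w)}$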
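The two steps can be sketched in plain Python on a tiny count matrix. This assumes the standard pLSA updates: the E step computes $p(z \mid d, w) \propto p(w \mid z)\, p(z \mid d)$ and the M step renormalizes the expected counts $n(d, w)\, p(z \mid d, w)$. The function names and the toy counts are hypothetical, and the uniform p(d) is dropped because it does not affect the updates.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, num_topics, iterations=50, seed=0):
    """EM for pLSA. counts[d][w] is n(d, w), the count of word w in document d."""
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    # p_w_z[z][w] = p(w|z) and p_z_d[d][z] = p(z|d), randomly initialized
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * num_words for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        loglik = 0.0
        for d in range(num_docs):
            for w in range(num_words):
                if counts[d][w] == 0:
                    continue
                # E step: p(z|d,w) is proportional to p(w|z) p(z|d)
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(num_topics)]
                p_w_d = sum(joint)
                loglik += counts[d][w] * math.log(p_w_d)
                for z in range(num_topics):
                    expected = counts[d][w] * joint[z] / p_w_d
                    # accumulate expected counts for the M step
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        # log likelihood of the parameters used in this E step
        history.append(loglik)
        # M step: renormalize expected counts into probabilities
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history


# Hypothetical corpus: 4 documents over a 6-word vocabulary
toy_counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 0, 0, 2, 4, 3],
    [0, 0, 1, 3, 3, 4],
]
p_w_z, p_z_d, history = plsa(toy_counts, num_topics=2)
```

Each row of `p_w_z` is one latent topic (a distribution over words) and each row of `p_z_d` holds the coordinates of a document in that topic basis, mirroring the decomposition $p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$. Like any EM procedure, the result depends on the random initialization.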
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (doc to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z \mid d_i)$
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model for documents $d_1, d_2, \ldots, d_M$, each with its own unconstrained topic distribution $p(z \mid d_1), p(z \mid d_2), \ldots, p(z \mid d_M)$]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
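Dirichlet Distribution
• Definition
  $p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^k \alpha_i\right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^k x_i = 1,\ x_i > 0$
• The density is zero outside this open (K-1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α
  [Figure: density plots for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^k \alpha_i$
  [Figure: density plots for $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$]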
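One way to get a feel for these densities is to draw samples. The sketch below uses the standard construction of a Dirichlet sample as normalized independent Gamma draws, built on `random.gammavariate` from the Python standard library. The helper name `sample_dirichlet` and the sample size are illustrative choices.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]  # one of the parameter settings shown for K = 3
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# Every sample is a point on the simplex: non-negative entries that sum to one
empirical_mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]

# The theoretical mean of component i is alpha_i / alpha_0
alpha_0 = sum(alpha)
expected_mean = [a / alpha_0 for a in alpha]
```

With α = (6, 2, 2) the samples concentrate near the first corner of the simplex, and the empirical mean should be close to (0.6, 0.2, 0.2). Scaling all $\alpha_i$ by the same factor leaves the mean unchanged but makes the samples more or less spread out.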
The LDA Model
[Figure: the LDA graphical model, unrolled for three documents with topics $z_1, \ldots, z_4$, words $w_1, \ldots, w_4$, and the shared parameter β]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words $w_n$
    - Choose a topic $z_n$ ~ Multinomial(θ)
    - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
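The generative process can be simulated directly. The sketch below assumes the standard LDA reading of the steps: a per-document topic mixture θ drawn from Dirichlet(α), a topic per word drawn from θ, and a word drawn from that topic's word distribution in β. The two-topic, four-word `beta` and all function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, num_words, rng):
    # Choose the topic mixture theta ~ Dirichlet(alpha) for this document
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(num_words):
        # Choose a topic z_n ~ Multinomial(theta)
        z = sample_categorical(theta, rng)
        # Choose a word w_n from p(w | z_n, beta)
        w = sample_categorical(beta[z], rng)
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
alpha = [1.0, 1.0]
# Hypothetical topics: topic 0 only emits words 0 and 1, topic 1 only words 2 and 3
beta = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
doc_theta, doc_topics, doc_words = generate_document(alpha, beta, num_words=20, rng=rng)
```

Unlike pLSA, the topic mixture of a new document is not a free parameter here: it is a draw from the shared Dirichlet prior, so the number of model parameters does not grow with the number of documents.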
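Joint Probability
• Given the parameters α and β
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^N p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where $p(z_n = i \mid \theta) = \theta_i$ and $p(w_n = j \mid z_n = i, \beta) = \beta_{ij}$
Likelihood
• Joint probability
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^N p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^N \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big)\, d\theta$
• Likelihood over all the documents
  $p(\mathcal{D} \mid \alpha, \beta) = \prod_{d=1}^M p(w_d \mid \alpha, \beta)$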
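Inference
• The log likelihood of the corpus can be computed by summing over the documents
  $\log p(\mathcal{D} \mid \alpha, \beta) = \sum_{d=1}^M \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
  $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach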
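Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
  $q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^N q(z_n \mid \phi_n)$
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
  $\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\big(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence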
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to $p(X \mid \mathcal{D}, \theta)$, which makes KL = 0
• In VBEM, it is intractable to compute $p(X \mid \mathcal{D}, \theta)$. Instead, it approximates $p(X \mid \mathcal{D}, \theta)$ by a variational distribution q(X), obtained by minimizing $KL\big(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)\big)$
• This is also equivalent to maximizing the lower bound on L(θ)
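Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence
  $\phi_{ni} \propto \beta_{i w_n} \exp\Big(\Psi(\gamma_i) - \Psi\big(\textstyle\sum_j \gamma_j\big)\Big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^N \phi_{ni}$
  where Ψ is the digamma function
• M-step: parameter estimation
  - β: $\beta_{ij} \propto \sum_{d=1}^M \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^j$
  - α: can be implemented using the Newton-Raphson method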
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
  [Figure: classification results for (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model, unrolled for three documents with topics $z_1, \ldots, z_4$, words $w_1, \ldots, w_4$, and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram of the model with variables π, z, x, θ, β and plates $N_d$ and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
  - Before (Dim = 1 million):
            Term1  Term2  Term3  Term4  ...  TermN
      Img1    1      2      0      0    ...    2
  - After (Dim = 200):
             f1     f2    ...    fM
      Img1   0.2    0.1   ...   0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = a low dimensional topic histogram + a few words (10 words)]
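Orthogonal Decomposition
• $p = Xw + \varepsilon$
  - $X = (x_1, x_2, x_3, \ldots, x_k)$: base vectors
  - w: low dimensional representation
  - ε: residual
[Equations: step-by-step orthogonal decomposition of p onto the base vectors $x_1, \ldots, x_k$, and the inner product of two documents p and q expressed through their low dimensional representations and residuals]
[Figure: an image = a low dimensional topic histogram + a few words (10 words)]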
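A small numeric sketch can illustrate why keeping the residual preserves similarity. It assumes the base vectors are orthonormal and the residual is whatever the projection leaves over; under that assumption the inner product of two vectors splits exactly into a low dimensional part and a residual part. The basis, the vectors p and q, and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decompose(p, basis):
    """Split p into coordinates on an orthonormal basis plus a residual."""
    # w_k = <x_k, p> is the coordinate of p along base vector x_k
    w = [dot(p, x) for x in basis]
    projection = [sum(w[k] * basis[k][i] for k in range(len(basis)))
                  for i in range(len(p))]
    # the residual is the part of p that the basis cannot represent
    residual = [pi - ri for pi, ri in zip(p, projection)]
    return w, residual


# Hypothetical orthonormal base vectors in a 4-dimensional vocabulary space
basis = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.0],
]
p = [2.0, 1.0, 0.0, 3.0]
q = [1.0, 0.0, 2.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

# With an orthonormal basis: <p, q> = <w_p, w_q> + <eps_p, eps_q>
similarity_full = dot(p, q)
similarity_split = dot(w_p, w_q) + dot(eps_p, eps_q)
```

Here `similarity_full` and `similarity_split` agree, so a search system could compare the short vectors w first and use the sparse residuals to refine the score, which is one way to read the idea of dimension reduction with residual error preservation.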
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  - a topic specific distribution (x = 0)
  - a document specific distribution (x = 1)
  - a background distribution (x = 2)
  $p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^K p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (topic histogram + a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, ...); the candidates are then re-ranked using the doc meta, which stores the document specific words (DS1, DS2, ...) for Doc 1 ... Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image and search results]
Search Example
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
  - Methods: HITS, PageRank, ...
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrix
  - Graphical model: a good practice to learn probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space V
VΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
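• Maximum likelihood finds the model parameters under which the observed data are "most likely" to have been generated
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta p(x_1, \dots, x_n \mid \theta)$
• Predictive distribution: $p(x \mid \theta_{ML})$
IID – Independent Identically Distributed
• IID means the samples are drawn independently from the same distribution: $p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$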
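As a small worked case of maximum likelihood under the i.i.d. assumption, the sketch below fits a one-dimensional Gaussian. The Gaussian model, the synthetic data, and the helper name `gaussian_log_likelihood` are assumptions chosen for illustration; for this model the ML estimates are the sample mean and the (biased) sample standard deviation.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma^2) over i.i.d. samples x."""
    n = x.size
    return float(-0.5 * n * np.log(2 * np.pi * sigma**2)
                 - np.sum((x - mu) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data

# Closed-form ML estimates for a Gaussian
mu_ml = x.mean()
sigma_ml = x.std()          # ddof=0, i.e. the ML (not the unbiased) estimate

ll_ml = gaussian_log_likelihood(x, mu_ml, sigma_ml)
ll_other = gaussian_log_likelihood(x, mu_ml + 0.3, sigma_ml * 1.2)
```

Any other parameter setting, such as the perturbed one used for `ll_other`, gives a lower log likelihood on the same data.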
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008.
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
– To describe complex models: Gaussian mixture model
– To discover the intrinsic structure inside a data set: topic models such as pLSA, LDA
More General
• Data set: D
• Likelihood: $p(D \mid \theta) = \int p(X, D \mid \theta)\, dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps
– E step: fill in values of the latent variables according to their posterior given the data
– M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes a difficult problem into a series of tractable steps
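The two steps can be made concrete with a one-dimensional Gaussian mixture, the example of a complex model mentioned above. This is a minimal sketch: the function name `em_gmm_1d`, the quantile-based initialization, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=50):
    # Deterministic start from data quantiles (an arbitrary choice)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iter):
        # E step: posterior responsibility of each component for each point
        weighted = pi * normal_pdf(x[:, None], mu, var)        # shape (n, k)
        log_liks.append(float(np.log(weighted.sum(axis=1)).sum()))
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: weighted ML estimates, as if the assignments were observed
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, log_liks

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(4, 1, 300)])
pi, mu, var, log_liks = em_gmm_1d(x)
```

The recorded `log_liks` sequence is non-decreasing, which is the property the later slides derive from the lower bound.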
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound F(q, θ) on the log likelihood using Jensen's inequality
• Here H[q] denotes the entropy of q(X), which appears as one term of the bound
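The bound can be written out as follows. This is the standard derivation in the notation of this slide (D observed, X latent), given here as a reconstruction rather than a copy of the original slide equation; the name F(q, θ) for the bound follows its use in the later slides.

```latex
\begin{align*}
L(\theta) = \log p(D \mid \theta)
  &= \log \int p(X, D \mid \theta)\, dX
   = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX \\
  &\ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX
   \qquad \text{(Jensen's inequality, since $\log$ is concave)} \\
  &= \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
   \;=\; F(q, \theta),
\end{align*}
\qquad H[q] = -\int q(X) \log q(X)\, dX .
```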
The E and M Steps of EM
• The lower bound on the log likelihood is given by F(q, θ)
• EM alternates between
– E step: optimize F with respect to the distribution over hidden variables, holding the parameters fixed
– M step: maximize F with respect to the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ, $F(q, \theta) = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that for fixed θ, F is bounded above by L, and achieves that bound when $KL(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$
The M Step
• M step: maximize F with respect to the parameters, holding the hidden distribution q fixed: $\theta \leftarrow \arg\max_\theta F(q, \theta) = \arg\max_\theta \int q(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) with respect to θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model looks complex, even with only a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– Graphical models can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model we can decouple a joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general, $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
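The factorization for the five-node example can be checked numerically. In this sketch all variables are assumed binary and the conditional probability tables are filled with made-up random values; only the parent structure comes from the slide.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Parent sets matching p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["B", "C"], "E": ["C", "D"]}

# cpt[node][parent_values] = P(node = 1 | parents); values are illustrative
cpt = {
    node: {vals: rng.uniform(0.1, 0.9)
           for vals in itertools.product([0, 1], repeat=len(pa))}
    for node, pa in parents.items()
}

def joint(assignment):
    """Joint probability as the product of p(x_i | x_pa(i)) over all nodes."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

nodes = list(parents)
total = sum(joint(dict(zip(nodes, vals)))
            for vals in itertools.product([0, 1], repeat=len(nodes)))
n_params = sum(len(table) for table in cpt.values())
```

The factorized model needs 14 numbers here, compared with 31 free parameters for an unrestricted joint distribution over five binary variables.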
Directed Graphs for Statistical Models: Plate Notation
bull A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies due to personal style
• Latent semantic indexing
– Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as they share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • "Java" could mean "coffee" and also the programming language Java
  • "Cricket" is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
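pLSA
[Figure: pLSA graphical model in plate notation (d, z, w with plates N_d and M) and its unrolled form for one document d with topic-word pairs (z_1, w_1), (z_2, w_2), (z_3, w_3), …, (z_N, w_N)]
• z_1 … z_N are variables, z_i ∈ [1, K], where K is the number of latent topics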
pLSA
[Figure: unrolled pLSA model for M documents d_1, d_2, …, d_M; document d_m has topic-word pairs (z_1, w_1), (z_2, w_2), …, (z_{N_m}, w_{N_m})]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
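Joint Probability vs Likelihood
• Joint probability: $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
where the columns of Z are the topic distributions p(w|z) over a vocabulary of size V
• With many documents, we hope to find latent topics as a common basis
pLSA – Objective Function
• pLSA tries to maximize the log likelihood $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w)$, where n(d, w) is the count of word w in document d
• Due to the summation over z inside the log, we have to resort to EM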
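EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step: $p(z \mid d, w) = \dfrac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$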
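A compact version of these EM steps is sketched below on a toy count matrix. It assumes the standard pLSA updates from Hofmann's paper (posterior p(z|d,w) in the E step, count-weighted renormalization in the M step); the function name `plsa`, the toy counts, and the `eps` guard against division by zero are assumptions of this sketch.

```python
import numpy as np

def plsa(counts, k, n_iter=50, seed=0, eps=1e-12):
    """counts: (M documents, V words) matrix of term counts n(d, w)."""
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    p_w_z = rng.random((k, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)      # rows are p(w | z)
    p_z_d = rng.random((M, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)      # rows are p(z | d)
    log_liks = []
    for _ in range(n_iter):
        # E step: p(z | d, w) proportional to p(w | z) p(z | d)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (M, k, V)
        p_w_d = joint.sum(axis=1)                              # (M, V)
        log_liks.append(float((counts * np.log(p_w_d + eps)).sum()))
        post = joint / (p_w_d[:, None, :] + eps)
        # M step: re-estimate both distributions from expected counts
        weighted = counts[:, None, :] * post                   # (M, k, V)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, log_liks

counts = np.array([[5, 4, 0, 0],
                   [4, 6, 1, 0],
                   [0, 1, 5, 4],
                   [0, 0, 6, 5]], dtype=float)
p_w_z, p_z_d, log_liks = plsa(counts, k=2)
p_w_d = p_z_d @ p_w_z      # matrix form of the document decomposition
```

The last line is the matrix view of the model: each row of `p_w_d` is a document's word distribution expressed in the common topic basis.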
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U matrix is related to P(z|d) (doc to aspect)
– V matrix is related to P(w|z) (aspect to term)
– E matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_m, each with its own unconstrained topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
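• Definition
$$p(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1} \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex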
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
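The effect of the parameters can be explored by sampling. The sketch below uses NumPy's `Generator.dirichlet`; the sample sizes and the choice of comparing α_0 = 1 with α_0 = 10 (equal α_i) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unequal parameters: the mean of component i is alpha_i / alpha_0
alpha = np.array([6.0, 2.0, 2.0])
samples = rng.dirichlet(alpha, size=5000)
empirical_mean = samples.mean(axis=0)
expected_mean = alpha / alpha.sum()

# Equal alpha_i with different totals alpha_0
sparse = rng.dirichlet(np.full(3, 1.0 / 3.0), size=5000)    # alpha_0 = 1
dense = rng.dirichlet(np.full(3, 10.0 / 3.0), size=5000)    # alpha_0 = 10

# Average size of the largest component: large when mass sits near a corner
sparse_peak = sparse.max(axis=1).mean()
dense_peak = dense.max(axis=1).mean()
```

With a small α_0 the samples concentrate near the corners of the simplex (one dominant component), while a large α_0 pulls them toward the center.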
The LDA Model
[Figure: unrolled LDA graphical model for three documents, each with topic assignments z_1 … z_4 and words w_1 … w_4; β is shared across documents]
• For each document
– Choose θ ~ Dirichlet(α)
– For each of the N words w_n
  • Choose a topic z_n ~ Multinomial(θ)
  • Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
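The generative process can be simulated directly. In this sketch the vocabulary size, the number of topics, and the values of `alpha` and `beta` are made up for illustration; `generate_document` is a hypothetical helper name.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """alpha: (K,) Dirichlet parameter; beta: (K, V), row k is p(w | z = k)."""
    theta = rng.dirichlet(alpha)                          # topic mixture of this document
    z = rng.choice(len(alpha), size=n_words, p=theta)     # one topic per word
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([[0.7, 0.2, 0.1, 0.0, 0.0],
                 [0.0, 0.1, 0.8, 0.1, 0.0],
                 [0.0, 0.0, 0.1, 0.2, 0.7]])
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
```

Unlike pLSA, the per-document mixture θ is itself drawn from a distribution, so a new document needs no new parameters.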
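Joint Probability
• Given parameters α and β: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(\theta \mid \alpha)$ is the Dirichlet density, $p(z_n \mid \theta) = \theta_{z_n}$ and $p(w_n \mid z_n, \beta) = \beta_{z_n, w_n}$
Likelihood
• Joint probability: $p(\theta, z, w \mid \alpha, \beta)$
• Marginal distribution of a document: $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality is used as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables: $p(\theta, z \mid w, \alpha, \beta) = \dfrac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach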
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
– Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
– Repeat until convergence
  • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
– β has a closed-form update
– α can be estimated using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram of the model with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– We need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
– Term space (Dim = 1 million)
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2
– After topic projection (Dim = 200)
        f1    f2    …  fM
  Img1  0.2   0.1   …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a bar chart of its low-dimensional representation + a few words (10 words)]
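Orthogonal Decomposition
$$p = Xw + \varepsilon: \quad \begin{pmatrix} p^1 \\ p^2 \\ \vdots \\ p^W \end{pmatrix} = \begin{pmatrix} x_1^1 & \cdots & x_k^1 \\ x_1^2 & \cdots & x_k^2 \\ \vdots & & \vdots \\ x_1^W & \cdots & x_k^W \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon^1 \\ \varepsilon^2 \\ \vdots \\ \varepsilon^W \end{pmatrix}$$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a bar chart of its low-dimensional representation + a few words (10 words)]
• Similarity of two documents p and q: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$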
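The reason the residual is worth keeping can be seen numerically: when the base vectors are orthonormal, the inner product of two original vectors splits exactly into a low-dimensional part and a residual part. The sketch below assumes an orthonormal basis built by QR factorization of a random matrix and random vectors p and q; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 50, 5

# Hypothetical orthonormal base vectors (columns of X)
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))

def decompose(p):
    """Split p into coordinates w in the basis X and a residual orthogonal to X."""
    w = X.T @ p
    residual = p - X @ w
    return w, residual

p = rng.random(vocab_size)
q = rng.random(vocab_size)
w_p, eps_p = decompose(p)
w_q, eps_q = decompose(q)

full_similarity = p @ q
split_similarity = w_p @ w_q + eps_p @ eps_q
```

The cross terms vanish because each residual is orthogonal to the span of X, so no similarity information is lost by the split.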
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
– a topic-specific distribution
– a document-specific distribution
– a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
Search (Online)
[Figure: online search pipeline. A query (a bar chart of its low-dimensional representation + a few words) is looked up in the LSH index (entries DS1, DS2, …); the retrieved candidates (e.g. Doc 300, Doc 401, …) are re-ranked using the document metadata (Doc Meta) stored for Doc 1 … Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image and search results]
Search Example
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method  Slashdot                Apple
        All Posts  Good Posts   All Posts  Good Posts
NP      0.021      0.012        0.289      0.239
RR      0.183      0.319        0.269      0.474
DS      0.463      0.643        0.409      0.628
LDA     0.465      0.644        0.410      0.648
SWB     0.463      0.644        0.410      0.641
SMSS    0.524      0.737        0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
– HITS
– PageRank
– …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in the expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
– Finds the most influential nodes
Evaluation
• Bayesian estimate
Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: good practice for learning matrices
– Graphical models: good practice for learning probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is a good tool for analyzing problems
– It is more adaptable to various applications than matrix decomposition
Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• The projection of the data along w is Aw
• Variance: $\sigma_w^2 = (Aw)^T (Aw) = w^T A^T A w = w^T C w$
where $C = A^T A$ is the covariance matrix of the data (A is centered)
• Task: maximize the variance subject to the constraint $w^T w = 1$
Optimization Problem
• Maximize $F = w^T C w - \lambda (w^T w - 1)$, where λ is the Lagrange multiplier
• Differentiating with respect to w yields $2Cw - 2\lambda w = 0$
• Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found
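The eigenvalue formulation can be checked on synthetic data. In this sketch the data matrix, its per-axis scales, and the use of `numpy.linalg.eigh` (valid because C is symmetric) are assumptions for illustration; rows of A are samples, matching the convention of this slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 samples (rows) in 3 dimensions with very different spreads per axis
A = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
A = A - A.mean(axis=0)                 # center the data

C = A.T @ A                            # covariance matrix up to a 1/n factor
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]                     # first principal component
variance_along_w1 = (A @ w1) @ (A @ w1)
```

The variance of the projection onto `w1` equals the largest eigenvalue, which is exactly what the Lagrangian condition Cw = λw implies under the unit-norm constraint.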
Property: Data Decomposition
• PCA can be treated as data decomposition
$a = U U^T a$
$= (u_1, u_2, \dots, u_n)(u_1, u_2, \dots, u_n)^T a$
$= (u_1, u_2, \dots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \dots, \langle u_n, a \rangle)^T$
$= (u_1, u_2, \dots, u_n)(b_1, b_2, \dots, b_n)^T$
$= \sum_i b_i u_i$
Face Recognition – Eigenface
• M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. CVPR 1991. (Citations: 2654)
• The eigenface approach
– Images are points in a vector space
– Use PCA to reduce dimensionality
– Face space
– Compare projections onto face space to recognize faces
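PageRank – Power Iteration
• Column j has nonzero elements in positions corresponding to the outlinks of j (N_j in total)
• Row i has nonzero elements in positions corresponding to the inlinks I_i
Column-Stochastic & Irreducible
• Column-stochastic: every column sums to one, $e^T A = e^T$
• where $e = (1, 1, \dots, 1)^T$
• Irreducible: every page can be reached from every other page
Iterative PageRank Calculation
• For k = 1, 2, …: $r^{(k)} = A r^{(k-1)}$
• Equivalently, $Ar = \lambda r$ (λ = 1, A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation r_0 in terms of the eigenvectors: $r_0 = \sum_i c_i u_i$, so $A^k r_0 = \sum_i c_i \lambda_i^k u_i$, which is dominated by the eigenvector with the largest eigenvalue as k grows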
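A small power-iteration sketch for PageRank follows. The four-page link graph, the damping value `alpha = 0.85` for the random jump, and the uniform treatment of dangling pages are assumptions of this example, not values taken from the slides.

```python
import numpy as np

def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """links[j] lists the pages that page j links to."""
    n = len(links)
    Q = np.zeros((n, n))
    for j, outlinks in enumerate(links):
        if outlinks:
            Q[outlinks, j] = 1.0 / len(outlinks)   # column j spreads over its outlinks
        else:
            Q[:, j] = 1.0 / n                      # dangling page: jump anywhere
    # Random jump makes the matrix column-stochastic and irreducible
    A = alpha * Q + (1 - alpha) / n * np.ones((n, n))
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = A @ r
        r_next /= r_next.sum()
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next, A

links = [[1, 2], [2], [0], [0, 2]]     # page 3 has no inlinks
r, A = pagerank(links)
```

The result is the eigenvector of A for eigenvalue 1; page 3, which nobody links to, receives only the random-jump share.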
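SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m x n matrix A with m ≥ n can be factorized as $A = U \Sigma V^T$
• where U (m x n) and V (n x n) have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$
Singular Values And Singular Vectors
• The diagonal elements σ_j of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: $A v_j = \sigma_j u_j$ and $A^T u_j = \sigma_j v_j$
Matrix approximation
• Theorem: Let $U_k = (u_1, u_2, \dots, u_k)$, $V_k = (v_1, v_2, \dots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$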
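The approximation theorem is easy to verify with NumPy. The random test matrix and the choice k = 2 are assumptions for illustration; `numpy.linalg.svd` returns the singular values in descending order, so `s[k]` is σ_{k+1} in the one-based notation of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))                       # m x n with m >= n

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U diag(s) Vt

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k truncation

spectral_error = np.linalg.norm(A - A_k, 2)       # matrix 2-norm of the residual
```

The spectral-norm error of the truncation equals the first discarded singular value.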
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V is just the eigenvectors of $A^T A$
– Each column in V is an eigenvector for the row matrix A
– We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U is just the eigenvectors of $A A^T$
– Each column in U is an eigenvector for the column matrix A
– We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$
– U_k: orthogonal basis that we use to approximate all the documents
– D_k: column j holds the coordinates of document j in the new basis
– D_k is the projection of A onto the subspace spanned by U_k
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
– U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
– U holds the left (column) eigenvectors
– V holds the right (row) eigenvectors
– Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
– SVD: A is directly the data, e.g. the term-by-document matrix
– PCA: A is the covariance matrix, $A = X^T X$, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
– Indexing: collecting terms
– Use a stop list: eliminate "meaningless" words
– Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U \Sigma)(U \Sigma)^T$
– $U \Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V \Sigma)(V \Sigma)^T$
– $V \Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U \Sigma^{1/2})(V \Sigma^{1/2})^T$
– $U \Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
– $V \Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– Authorities contain high-quality information
– Hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing $a_i = \sum_{j \to i} h_j$ and $h_i = \sum_{i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a = L^T h$ and $h = L a$
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• They are in fact the first right and left singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
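The connection between the HITS iteration and the SVD can be verified on a tiny graph. The four-page adjacency matrix below is made up for illustration, and each step normalizes the scores to unit length, which is one common convention rather than something fixed by the slides.

```python
import numpy as np

# L[i, j] = 1 if page i links to page j (toy graph)
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

a = np.ones(4)   # authority scores
h = np.ones(4)   # hub scores
for _ in range(200):
    a = L.T @ h                  # authorities are pointed to by good hubs
    a /= np.linalg.norm(a)
    h = L @ a                    # hubs point to good authorities
    h /= np.linalg.norm(h)

U, s, Vt = np.linalg.svd(L)
u1 = np.abs(U[:, 0])             # first left singular vector (sign removed)
v1 = np.abs(Vt[0])               # first right singular vector (sign removed)
```

After convergence the hub scores match the first left singular vector and the authority scores match the first right singular vector, up to sign.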
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column w_i represents one basis vector
• H: coordinates of V projected onto W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance $\|V - WH\|^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2$
• Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \dfrac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$ and $W_{ia} \leftarrow W_{ia} \dfrac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
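The Euclidean-cost updates can be sketched in a few lines of NumPy. This assumes the Lee and Seung multiplicative rules cited in the references; the function name `nmf_euclidean`, the random initialization, and the `eps` term that guards against division by zero are assumptions of this sketch.

```python
import numpy as np

def nmf_euclidean(V, k, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative V (n x m) as W (n x k) times H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1   # strictly positive start (assumption)
    H = rng.random((k, m)) + 0.1
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)    # update coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)    # update bases
        errors.append(float(np.linalg.norm(V - W @ H) ** 2))
    return W, H, errors

V = np.random.default_rng(1).random((6, 8))     # toy non-negative data
W, H, errors = nmf_euclidean(V, k=3)
```

The squared reconstruction error recorded in `errors` does not increase from one iteration to the next, and both factors remain non-negative throughout.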
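Multiplicative Update Algorithm
• Cost function: divergence $D(A \| B) = \sum_{ij} \left( A_{ij} \log \dfrac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
– Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions, i.e. $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
• Multiplicative update
• PLSA is NMF with KL divergence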
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
• E step: for fixed θ
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of KL
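The relation between F, L and the KL divergence can be checked numerically on a tiny discrete model. The numbers below are hypothetical (a latent variable with three states and made-up joint probabilities); only the identities being checked come from the slides.

```python
import numpy as np

# Hypothetical toy model: latent X with 3 states, one fixed observation D.
# joint[x] = p(X = x, D | theta) for the current theta.
joint = np.array([0.10, 0.25, 0.05])

log_lik = np.log(joint.sum())            # L(theta) = log p(D | theta)
posterior = joint / joint.sum()          # p(X | D, theta)

def free_energy(q):
    """F(q, theta) = E_q[log p(X, D | theta)] + H[q]."""
    return float(np.sum(q * np.log(joint)) - np.sum(q * np.log(q)))

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for strictly positive distributions."""
    return float(np.sum(q * np.log(q / p)))

q_arbitrary = np.array([0.6, 0.2, 0.2])
gap = log_lik - free_energy(q_arbitrary)  # equals KL(q || posterior)
```

Setting q to the posterior closes the gap, which is exactly what the E step does.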
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - Graphical models look complex, even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - Graphical models can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
• In general: p(x_1, …, x_n) = Π_i p(x_i | x_pa(i))
• where pa(i) are the parents of node i
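To make the factorization concrete, the sketch below builds the five-node example with binary variables and checks that the product of local conditionals is a valid joint distribution. The conditional probability tables are random placeholders, not values from the slides.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """p(node = 1 | parents) for every configuration of binary parents."""
    return rng.uniform(0.05, 0.95, size=(2,) * n_parents)

# Hypothetical conditional probability tables for the five-node example.
pA, pB = 0.3, 0.6
pC = random_cpt(2)   # indexed by (a, b)
pD = random_cpt(2)   # indexed by (b, c)
pE = random_cpt(2)   # indexed by (c, d)

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    # p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
    return (bern(pA, a) * bern(pB, b) * bern(pC[a, b], c)
            * bern(pD[b, c], d) * bern(pE[c, d], e))

total = sum(joint(*v) for v in itertools.product([0, 1], repeat=5))
n_params = 2 + pC.size + pD.size + pE.size   # versus 2**5 - 1 = 31 for a full table
```

The factorized model needs 14 numbers instead of 31, which is the practical benefit of decoupling the joint into conditionals.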
Directed Graphs for Statistical Models: Plate Notation
bull A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query may be phrased differently due to personal style
• Latent semantic indexing
  - Creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
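[Figure: pLSA graphical model with nodes d, z, w, shown in plate notation (plates N_d and M) and unrolled for one document (z_1 … z_N, w_1 … w_N)]
z_1 … z_N are variables, z_i ∈ [1, K]; K is the number of latent topics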
pLSA
[Figure: unrolled pLSA model for documents d_1 … d_M; document d_i has topics z_1 … z_{N_i} and words w_1 … w_{N_i}]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
pLSA - Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
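The slide equations for the two steps are not preserved in this transcript, so the sketch below follows the standard pLSA updates (posterior p(z|d,w) in the E step, re-normalized expected counts in the M step). The function name `plsa`, the random initialization, and the small count matrix are assumptions made for illustration.

```python
import numpy as np

def plsa(counts, K, n_iter=50, seed=0):
    """counts: (M documents, V words) term-count matrix n(d, w)."""
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)           # each row is p(w | z)
    p_z_d = rng.random((M, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)           # each row is p(z | d)
    log_liks = []
    for _ in range(n_iter):
        # E step: p(z | d, w) is proportional to p(w | z) p(z | d).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]   # (M, K, V)
        p_w_d = joint.sum(axis=1)                       # (M, V)
        log_liks.append(float((counts * np.log(p_w_d)).sum()))
        post = joint / p_w_d[:, None, :]
        # M step: re-estimate both multinomials from expected counts.
        expected = counts[:, None, :] * post            # (M, K, V)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, np.array(log_liks)

# Hypothetical corpus: two documents about words 0-1, two about words 2-3.
counts = np.array([[5, 4, 1, 1],
                   [4, 6, 1, 1],
                   [1, 1, 5, 5],
                   [1, 1, 4, 6]], dtype=float)
p_w_z, p_z_d, log_liks = plsa(counts, K=2)
```

The dense (M, K, V) arrays keep the code short but would not scale to a real vocabulary; a practical version would iterate over non-zero counts only.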
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
• Bosch A, Zisserman A and Munoz X, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d_1 … d_m, each with its own topic distribution]
p(z|d1), p(z|d2), …, p(z|dm)
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
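• Definition
$$p(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\; x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex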
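A quick way to get a feel for the Dirichlet distribution is to sample from it. The sketch below uses NumPy's `Generator.dirichlet`; the helper name `sample_proportions` and the specific α values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_proportions(alpha, n):
    """Draw n topic-mixture vectors from Dirichlet(alpha)."""
    return rng.dirichlet(alpha, size=n)

alpha = np.array([6.0, 2.0, 2.0])
samples = sample_proportions(alpha, 20000)

# The mean of a Dirichlet is alpha / alpha_0, where alpha_0 = sum(alpha).
expected_mean = alpha / alpha.sum()

# Small equal alpha_i push samples toward the corners of the simplex (sparse mixtures);
# large equal alpha_i pull them toward the center.
sparse = sample_proportions(np.full(3, 0.1), 20000)
dense = sample_proportions(np.full(3, 10.0), 20000)
```

Every sample is a non-negative vector that sums to one, so it can be used directly as the topic proportions of a document.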
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0 = Σ_{i=1}^{k} α_i: α_0 = 0.1, α_0 = 1, α_0 = 10
The LDA Model
[Figure: LDA graphical model unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing the word distributions β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    - Choose a topic z_n ~ Multinomial(θ)
    - Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
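The generative process can be sketched directly as sampling code. The function name `generate_document`, the three-topic α, and the 3 × 5 matrix `beta` of per-topic word distributions are hypothetical values chosen for illustration.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """alpha: (K,) Dirichlet parameter; beta: (K, V) with beta[k] = p(w | z = k)."""
    theta = rng.dirichlet(alpha)                               # topic proportions of this document
    topics = rng.choice(len(alpha), size=n_words, p=theta)     # z_n ~ Multinomial(theta)
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([[0.7, 0.2, 0.1, 0.0, 0.0],
                 [0.0, 0.1, 0.2, 0.7, 0.0],
                 [0.1, 0.0, 0.0, 0.2, 0.7]])
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
```

Unlike pLSA, the per-document proportions θ are drawn from a shared prior rather than being free parameters, which is what keeps the number of parameters from growing with the corpus.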
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
• The likelihood can be computed by summing the log likelihood of each document
• Jensen's inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), chosen by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound F(q, θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing the word distributions β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
[Figure: graphical model in plate notation with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11x11 gray values, 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
bull Dimension reduction
        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2       (Dim = 1 million)
Topic Projection:
        f1     f2     …      fM
Img1    0.2    0.1    …      0.03              (Dim = 200)
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = bar chart of its low-dimensional representation + a few words (10 words)]
Orthogonal Decomposition
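p = Xw + ε
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
X: base vectors; w: low-dimensional representation; ε: residual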
[Figure: an image = bar chart of its low-dimensional representation + a few words (10 words)]
$$\langle p, q \rangle = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$
Base vectors: X_1, X_2, X_3, …, X_k
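Splitting a vector p into a projection Xw plus a residual can be sketched with an orthonormal basis. The setup below is an assumption for illustration: X is obtained from a QR factorization of a random matrix rather than from topic learning, and the residual is kept in full, whereas the slides suggest keeping only a few words of it.

```python
import numpy as np

rng = np.random.default_rng(0)
W, k = 50, 5                                   # vocabulary size, number of base vectors

# Hypothetical orthonormal basis X (W x k), obtained here by QR of a random matrix.
X, _ = np.linalg.qr(rng.normal(size=(W, k)))

def decompose(p):
    """Return the low-dimensional coordinates w and the residual eps of p."""
    w = X.T @ p                                # low-dimensional representation
    eps = p - X @ w                            # residual, orthogonal to span(X)
    return w, eps

p, q = rng.random(W), rng.random(W)
w_p, eps_p = decompose(p)
w_q, eps_q = decompose(q)

# With an orthonormal X the inner product splits into two independent parts.
similarity = w_p @ w_q + eps_p @ eps_q
```

Because the two parts are orthogonal, the similarity of two documents can be computed from the compact representation first and then refined with the residual terms.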
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline. The query (low-dimensional vector + a few words) is looked up in an LSH index whose entries are DS1, DS2, …; the candidates (e.g. Doc 300, Doc 401) are re-ranked using the document metadata (Doc 1 … Doc N)]
Index: 10M images, 46GB
Search speed: < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: Topics
Structure: Who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document Similarity
Topic Similarity
Structure Similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Project documents to topic space
SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
…
Baselines
LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  Achieves stable performance in expert finding task using a language model
PageRank
  Benchmark nodal ranking method
HITS
  Find hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  Find most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrix
  - Graphical model: a good practice to learn probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Optimization Problem
bull Maximize
  λ is the Lagrange multiplier
• Differentiating with respect to w yields
• Eigenvalue equation: Cw = λw, where C = A^T A
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found
Property Data Decomposition
bull PCA can be treated as data decomposition
a = UU^T a
  = (u_1, u_2, …, u_n)(u_1, u_2, …, u_n)^T a
  = (u_1, u_2, …, u_n)(<u_1,a>, <u_2,a>, …, <u_n,a>)^T
  = (u_1, u_2, …, u_n)(b_1, b_2, …, b_n)^T
  = Σ b_i u_i
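The eigenvalue equation and the decomposition a = Σ b_i u_i can be checked numerically. The sketch below assumes that rows of A are centered samples (the slides switch between row and column conventions) and uses synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows are samples here (an assumption); the mixing matrix just creates correlated features.
A = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
A = A - A.mean(axis=0)                 # center the data

C = A.T @ A                            # the matrix in Cw = λw
eigvals, U = np.linalg.eigh(C)         # eigh returns ascending eigenvalues, orthonormal columns
order = np.argsort(eigvals)[::-1]      # sort so that the first column is the first principal component
eigvals, U = eigvals[order], U[:, order]

a = A[0]
b = U.T @ a                            # coordinates b_i = <u_i, a>
reconstruction = U @ b                 # a = sum_i b_i u_i
var_first = np.var(A @ U[:, 0])        # variance captured along the first component
var_second = np.var(A @ U[:, 1])
```

Keeping only the first few columns of U gives the usual dimensionality reduction, at the cost of an approximate reconstruction.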
Face Recognition - Eigenface
• Turk MA, Pentland AP, Face recognition using eigenfaces, CVPR 1991 (Citation 2654)
• The eigenface approach
  - images are points in a vector space
  - use PCA to reduce dimensionality
  - face space
  - compare projections onto face space to recognize faces
PageRank - Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreducible
bull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculation
• For k = 1, 2, …
• Equivalently (λ = 1, A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
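A minimal power-iteration sketch for PageRank is given below. The slide formulas are not preserved here, so the details are assumptions: a damping factor of 0.85 for the random jump, uniform handling of dangling pages, and a four-page toy graph.

```python
import numpy as np

def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """links[j] lists the pages that page j links to (its outlinks)."""
    n = len(links)
    A = np.zeros((n, n))
    for j, outs in enumerate(links):
        if outs:
            A[outs, j] = 1.0 / len(outs)    # column j spreads rank over its N_j outlinks
        else:
            A[:, j] = 1.0 / n               # dangling page: jump anywhere (assumption)
    G = damping * A + (1.0 - damping) / n   # random jump keeps the chain irreducible
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = G @ r                      # one power-iteration step
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next, G

links = [[1, 2], [2], [0], [0, 2]]          # hypothetical 4-page web graph
rank, G = pagerank(links)
```

Because G is column-stochastic, the iterate stays a probability vector, and the fixed point is the eigenvector of G with eigenvalue 1.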
SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m × n matrix A, with m ≥ n, can be factorized as A = UΣV^T
Singular Values And Singular Vectors
• The diagonal elements σ_j of Σ are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
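The low-rank approximation property can be verified with NumPy. The sketch below states the theorem in its usual Eckart-Young form (the spectral-norm error of the rank-k truncation equals the next singular value), which is an assumption about what the missing slide formula said; the random test matrix is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

def rank_k_approx(k):
    """Truncated SVD: keep the k largest singular values and vectors."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 2
A_k = rank_k_approx(k)
spectral_error = np.linalg.norm(A - A_k, ord=2)     # largest singular value of A - A_k
frobenius_error = np.linalg.norm(A - A_k, ord="fro")
```

The same truncation is what LSI applies to the term-by-document matrix.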
SVD and PCA
• We can write
• Remember that in PCA we treat A as a row matrix
• V is just eigenvectors for A
  - Each column in V is an eigenvector of row matrix A
  - We use V to approximate a row in A
• Equivalently, we can write
• U is just eigenvectors for A^T
  - Each column in U is an eigenvector of column matrix A
  - We use U to approximate a column in A
Example - LSI
bull Build a term-by-document matrix A
• Compute the SVD of A: A = UΣV^T
• Approximate A by
  - U_k: orthogonal basis that we use to approximate all the documents
  - D_k: column j holds the coordinates of document j in the new basis
  - D_k is the projection of A onto the subspace spanned by U_k
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: A = UΛU^T
  - U and Λ are eigenvectors and eigenvalues
• SVD: A = UΛV^T
  - U is left (column) eigenvectors
  - V is right (row) eigenvectors
  - Λ is the same eigenvalues
• For symmetric A, column eigenvectors are equal to row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g. term-by-document matrix
  - PCA: A is the covariance matrix, A = X^T X, each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: AA^T = UΣ^2U^T = (UΣ)(UΣ)^T
  UΣ are the coordinates of A (rows) projected to space V
• Document to document: A^TA = VΣ^2V^T = (VΣ)(VΣ)^T
  VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: A = UΣV^T = (UΣ^½)(VΣ^½)^T
  UΣ^½ are the coordinates of A (rows) projected to space V
  VΣ^½ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
Hubs Authorities
Power Iteration
• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix LL^T
• They are in fact the first right and left singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
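The link between HITS and the SVD can be checked on a small graph. The adjacency matrix below is a made-up four-page example, and the fixed iteration count is an arbitrary choice; the update rule follows the mutual-reinforcement description on the slides.

```python
import numpy as np

def hits(L, n_iter=200):
    """L[i, j] = 1 if page i links to page j. Returns (hub, authority) scores."""
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(n_iter):
        a = L.T @ h                 # good authorities are pointed to by good hubs
        h = L @ a                   # good hubs point to good authorities
        a /= np.linalg.norm(a)      # normalize so the scores do not blow up
        h /= np.linalg.norm(h)
    return h, a

L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
h, a = hits(L)

# Compare with the leading singular vectors of L (defined up to sign).
U, s, Vt = np.linalg.svd(L)
```

In this example page 2 receives the most links and ends up with the largest authority score, while page 0, which links to two authorities, is the strongest hub.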
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
• HITS takes the query into account; PageRank doesn't
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a nonnegative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column W_i represents one basis
• H: coordinates of V projected to W
v_j ≈ W_{n×k} h_j
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
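The multiplicative updates for the Euclidean cost can be sketched as below, following the rules from the Lee and Seung paper cited in this deck. The small constant `eps` in the denominators, the random initialization, and the synthetic rank-3 matrix are assumptions added for numerical safety and illustration.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative updates for the cost ||V - WH||^2 (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # update coordinates, stays non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)     # update bases, stays non-negative
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, np.array(errors)

rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))     # non-negative matrix of rank 3
W, H, errors = nmf(V, k=3)
```

Since every update multiplies by a non-negative ratio, W and H never leave the non-negative orthant, so no explicit projection step is needed.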
Multiplicative Update Algorithm
bull Cost function Divergence
  - Reduces to the Kullback-Leibler divergence when
  - A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D D Lee and H S Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D D Lee and H S Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (Highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian Inference
  - Expectation-Maximization
  - Mixture Gaussian parameter estimation
• pLSA
  - Motivation
  - Derivation & geometry properties
  - Applications
• LDA
  - Motivation: why to add a hyperparameter
  - Dirichlet Distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x_1, x_2, …, x_D)^T denote a data point, and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x^(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model, with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x_1, x_2, …, x_n)
• Maximum likelihood will find the θ that maximizes L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
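[Figure: pLSA in plate notation (d → z → w, with plates N_d and M), and the same model unrolled for one document (d → z_1 … z_N, each z_n → w_n)]
z_1, …, z_N are latent variables with z_i ∈ [1, K], where K is the number of latent topics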
pLSA
[Figure: pLSA unrolled over documents d_1, d_2, …, d_M; document d_m generates topics z_1 … z_{N_m} and words w_1 … w_{N_m}]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
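• Joint probability
  $p(d, w, z) = p(d)\,p(z \mid d)\,p(w \mid z)$
• Likelihood (only for observed variables)
  $p(d, w) = \sum_z p(d, w, z) = p(d) \sum_z p(w \mid z)\,p(z \mid d)$
• p(d) is assumed to be uniform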
Document Decomposition
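• Each document can be decomposed as
  $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\,p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  $p(w \mid d) = Z_{V \times K}\; p(z \mid d)$
  where column k of Z is the distribution p(w|z=k) and V is the vocabulary size
• With many documents, we hope to find latent topics as a common basis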
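pLSA – Objective Function
• pLSA tries to maximize the log likelihood
  $\mathcal{L} = \sum_d \sum_w n(d,w) \log p(d,w) = \sum_d \sum_w n(d,w) \log \Big[ p(d) \sum_z p(w \mid z)\,p(z \mid d) \Big]$
  where n(d, w) is the number of times word w occurs in document d
• Because of the summation over z inside the log, we have to resort to EM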
EM Steps
• E-step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
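• The E-step
  $p(z \mid d, w) = \dfrac{p(w \mid z)\,p(z \mid d)}{\sum_{z'} p(w \mid z')\,p(z' \mid d)}$
• The M-step
  $p(w \mid z) \propto \sum_d n(d,w)\,p(z \mid d,w), \qquad p(z \mid d) \propto \sum_w n(d,w)\,p(z \mid d,w)$
  where n(d, w) is the number of times word w occurs in document d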
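The two EM steps for pLSA can be sketched in a few lines. This is a minimal illustration, assuming a small dense count matrix `counts[d][w]`, random initialization, and a tiny smoothing constant to avoid division by zero; the function name `plsa` and the toy data are made up for the example. The E-step computes the posterior p(z|d,w) proportional to p(w|z)p(z|d), and the M-step re-estimates p(w|z) and p(z|d) from the expected counts.

```python
import math
import random

def normalize(row):
    s = sum(row)
    return [x / s for x in row]

def plsa(counts, K, iters=50, seed=0):
    """Fit pLSA by EM. Returns (p_w_z, p_z_d, trace of sum n(d,w) log p(w|d))."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])
    # p_w_z[k][w] = p(w | z=k); p_z_d[d][k] = p(z=k | d). Random positive start.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    trace = []
    for _ in range(iters):
        new_w_z = [[1e-12] * V for _ in range(K)]   # smoothed expected counts
        new_z_d = [[1e-12] * K for _ in range(D)]
        loglik = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: p(z | d, w) is proportional to p(w|z) p(z|d)
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(K)]
                p_w_d = sum(joint)
                loglik += n * math.log(p_w_d)
                for k in range(K):
                    r = n * joint[k] / p_w_d
                    new_w_z[k][w] += r   # accumulates n(d,w) p(z|d,w) over d
                    new_z_d[d][k] += r   # accumulates n(d,w) p(z|d,w) over w
        trace.append(loglik)
        # M-step: normalize the expected counts
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, trace

# Toy corpus: two documents use words 0-1, two documents use words 2-3.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 6]]
p_w_z, p_z_d, trace = plsa(counts, K=2, iters=30)
```

The trace omits the constant term contributed by the uniform p(d). It should never decrease from one iteration to the next, which is a convenient sanity check for an EM implementation.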
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – U matrix: related to P(z|d) (document to aspect)
  – V matrix: related to P(w|z) (aspect to term)
  – Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA builds a generative model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• In pLSA, statistical model selection can be used to determine a suitable K
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: pLSA unrolled over documents d_1, d_2, …, d_m; each document has its own unconstrained topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
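• Definition
  $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \dfrac{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\; x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex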
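The Dirichlet density and a sampler can be sketched with the standard library alone. This is an illustrative sketch, not part of the original slides: the function names are made up, and the sampler relies on the standard fact that normalizing independent Gamma(α_i, 1) variates yields a Dirichlet(α) sample.

```python
import math
import random

def dirichlet_pdf(x, alpha):
    """Density of Dirichlet(alpha) at a point x on the open simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return math.exp(log_norm + sum((a - 1) * math.log(xi) for a, xi in zip(alpha, x)))

def dirichlet_sample(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(0)
theta = dirichlet_sample([2.0, 3.0, 4.0], rng)                   # a K = 3 topic mixture
flat_density = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])  # alpha = (1,1,1) is uniform
```

With α = (1, 1, 1) the density is the constant Γ(3) = 2 over the whole simplex, which gives a quick check of the normalizing term.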
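Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0, where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: Dirichlet densities for α_0 = 0.1, α_0 = 1, α_0 = 10]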
The LDA Model
[Figure: LDA unrolled for three documents; in each document the topics z_1 … z_4 generate the words w_1 … w_4 through the shared parameter β]
• For each document
  – Choose θ ~ Dirichlet(α)
  – For each of the N words w_n
    • Choose a topic z_n ~ Multinomial(θ)
    • Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
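The generative process can be simulated directly. The sketch below is an illustration under made-up assumptions: a four-word vocabulary, two topics with hand-written word distributions `beta`, and a symmetric α; none of these values come from the slides.

```python
import random

def sample_dirichlet(alpha, rng):
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def generate_document(alpha, beta, vocab, n_words, rng):
    """beta[k][v] = p(word v | topic k). Returns (theta, topics, words)."""
    theta = sample_dirichlet(alpha, rng)                        # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]              # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

vocab = ["goal", "match", "stock", "market"]
beta = [
    [0.5, 0.5, 0.0, 0.0],   # hypothetical topic 0: sports words only
    [0.0, 0.0, 0.5, 0.5],   # hypothetical topic 1: finance words only
]
theta, topics, words = generate_document([0.5, 0.5], beta, vocab, 8, random.Random(1))
```

Unlike pLSA, the per-document mixture θ is itself drawn from a distribution, so a new document needs no new parameters.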
Joint Probability
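• Given parameters α and β
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta)$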
where
Likelihood
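• Joint probability
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta) \Big)\, d\theta$
• Likelihood over all the documents
  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$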
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
• Jensen's inequality in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
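The relation between the bound and the KL divergence can be written out explicitly. The notation below follows Blei et al. (2003), listed in the references of this part; the factorized form of q is the assumption made there, not something visible in the transcript.

```latex
% Variational family: one Dirichlet per document, one multinomial per word.
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

% Lower bound obtained from Jensen's inequality.
L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q\big[\log p(\theta, z, w \mid \alpha, \beta)\big]
  - \mathbb{E}_q\big[\log q(\theta, z \mid \gamma, \phi)\big]

% The gap between the log likelihood and the bound is exactly the KL divergence.
\log p(w \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \big)
```

Because the left-hand side does not depend on γ and φ, raising the bound and shrinking the KL term are the same operation.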
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D,θ), which makes KL = 0
• In VBEM, p(X|D,θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X) chosen to minimize KL(q(X) || p(X|D,θ))
• This is equivalent to maximizing the lower bound with respect to q(X)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  – Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  – Repeat until convergence
    • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
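• E-step: variational inference, repeat until convergence (updates as given in Blei et al., 2003)
  $\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
• M-step: parameter estimation
  – β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$
  – α: can be implemented using the Newton-Raphson method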
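The per-document E-step can be sketched as a fixed-point iteration. The updates used here, φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni, are the ones given in Blei et al. (2003). Everything else is an assumption for illustration: the function names, the toy `beta`, and the hand-written digamma approximation (recurrence plus asymptotic series), which is used only because the standard library has no digamma function.

```python
import math

def digamma(x):
    """Psi(x) for x > 0: shift x up with Psi(x) = Psi(x+1) - 1/x, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def variational_e_step(doc, alpha, beta, iters=50):
    """doc: list of word ids; beta[k][v] = p(v | topic k). Returns (gamma, phi)."""
    K, N = len(alpha), len(doc)
    gamma = [a + N / K for a in alpha]
    phi = [[1.0 / K] * K for _ in doc]
    for _ in range(iters):
        for n, w in enumerate(doc):
            # phi_ni is proportional to beta_{i, w_n} * exp(Psi(gamma_i))
            weights = [beta[k][w] * math.exp(digamma(gamma[k])) for k in range(K)]
            total = sum(weights)
            phi[n] = [x / total for x in weights]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[k] + sum(phi[n][k] for n in range(N)) for k in range(K)]
    return gamma, phi

beta = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]   # made-up topics
gamma, phi = variational_e_step([0, 1, 0, 2], [0.5, 0.5], beta)
```

A fixed number of sweeps stands in for a convergence test here; a fuller implementation would stop when γ changes by less than a tolerance.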
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA unrolled for three documents, topics z_1 … z_4 and words w_1 … w_4 with the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with variables π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – We need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)

        Term1  Term2  Term3  Term4  …  TermN     (Dim = 1 million)
  Img1  1      2      0      0      …  2

        f1     f2     …      fM                  (Dim = 200)
  Img1  0.2    0.1    …      0.03
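The intersection and union problem shows up even on a toy corpus. This is a small sketch with made-up documents and a four-term query; `build_inverted_index` is an illustrative helper, not part of the system described in the talk.

```python
def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "red car road", 2: "blue car sky", 3: "green tree road sky"}
index = build_inverted_index(docs)

query = ["red", "car", "sky", "tree"]
lists = [index.get(term, set()) for term in query]   # one inverted list per query term
intersection = set.intersection(*lists)              # documents containing every term
union = set.union(*lists)                            # documents containing any term
```

With only four query terms the intersection is already empty while the union covers the whole corpus, so neither gives a useful candidate set. This is the motivation for projecting long queries into a low-dimensional topic space.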
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = a topic histogram + a few words (10 words)]
Orthogonal Decomposition
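  $p = Xw + \varepsilon: \quad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• The columns X_1, X_2, X_3, …, X_k of X are the base vectors
• w is the low-dimensional representation
• ε is the residual
[Figure: an image = a topic histogram + a few words (10 words)]
• Similarity of two documents p and q
  $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$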
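A small numerical check of the decomposition: the sketch below assumes that the base vectors are orthonormal and that the residual is what remains after projecting onto them. Those assumptions, the toy vectors and the helper names are illustrative and not taken from the slides. Under them, the inner product of two original vectors splits exactly into a low-dimensional part and a residual part.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and the residual eps."""
    w = [dot(p, x) for x in basis]
    projection = [sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, projection)]
    return w, eps

s = 1 / math.sqrt(2)
basis = [[s, s, 0, 0], [0, 0, s, s]]   # two orthonormal base vectors in a 4-word vocabulary
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 1.0, 4.0, 0.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)                              # similarity in the original space
split = dot(w_p, w_q) + dot(eps_p, eps_q)      # low-dimensional part + residual part
```

Keeping only `dot(w_p, w_q)` would report 8 instead of the true similarity 4 for these vectors, which is why the residual is preserved alongside the low-dimensional code.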
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  – a topic-specific distribution (x = 0)
  – a document-specific distribution (x = 1)
  – a background distribution (x = 2)
  $p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\,p(z=k \mid d) + p(x=1 \mid d)\,p'(w \mid d) + p(x=2 \mid d)\,p''(w)$
  where p'(w|d) is the document-specific word distribution and p''(w) is the background word distribution
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
Search (Online)
[Figure: online search pipeline. A query = a topic histogram + a few words; the LSH index over documents Doc 1 … Doc N returns candidates (Doc 300, Doc 401, …), which are re-ranked using the document-specific words (DS1, DS2, …) stored in the document metadata]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  – HITS
  – PageRank
  – …
Baselines
• LM
  – Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF
  – Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential nodes
Evaluation
• Bayesian estimate

Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF (ori)      0.674   0.362   0.243
EABIF (rec)      0.742   0.318   0.281
PageRank (ori)   0.675   0.377   0.263
PageRank (rec)   0.743   0.321   0.266
HITS (ori)       0.906   0.832   0.900
HITS (rec)       0.938   0.822   0.906
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrix methods
  – Graphical models: a good way to practice probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
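Property: Data Decomposition
• PCA can be treated as data decomposition
  $a = U U^T a$
  $\;\; = (u_1, u_2, \dots, u_n)(u_1, u_2, \dots, u_n)^T a$
  $\;\; = (u_1, u_2, \dots, u_n)(\langle u_1, a\rangle, \langle u_2, a\rangle, \dots, \langle u_n, a\rangle)^T$
  $\;\; = (u_1, u_2, \dots, u_n)(b_1, b_2, \dots, b_n)^T$
  $\;\; = \sum_i b_i u_i$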
Face Recognition – Eigenface
• M. A. Turk and A. P. Pentland. Face Recognition Using Eigenfaces. CVPR 1991 (citations: 2654)
• The eigenface approach
  – Images are points in a vector space
  – Use PCA to reduce dimensionality
  – Face space
  – Compare projections onto face space to recognize faces
PageRank – Power Iteration
• Column j has nonzero elements in the positions corresponding to the outlinks of j (N_j in total)
• Row i has nonzero elements in the positions corresponding to the inlinks I_i
Column-Stochastic & Irreducible
• Column-stochastic
• where
• Irreducible
Iterative PageRank Calculation
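• For k = 1, 2, …
  $r_k = A\, r_{k-1}$
• Equivalently (λ = 1, A is a Markov chain transition matrix)
  $\lambda r = A r$
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation r_0 in terms of the eigenvectors
  $r_0 = \sum_i c_i u_i \;\Rightarrow\; A^k r_0 = \sum_i c_i \lambda_i^k u_i$
  so the component along the eigenvector with the largest |λ_i| dominates as k grows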
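Power iteration on a column-stochastic link matrix takes only a few lines. The three-page web below is a made-up example, and no random-jump (damping) term is included, so the sketch assumes the chain is irreducible and aperiodic, which holds for this graph.

```python
def power_iteration(A, iters=100):
    """Repeatedly apply r <- A r and rescale so the entries sum to one."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(r)
        r = [x / total for x in r]
    return r

# Hypothetical web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j holds 1/N_j in the rows of the pages that j links to.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
rank = power_iteration(A)
```

For this graph the stationary vector is (0.4, 0.2, 0.4): page 1 receives only half of page 0's vote, while pages 0 and 2 feed each other.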
SINGULAR VALUE DECOMPOSITION
SVD - Definition
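• Any m × n matrix A with m ≥ n can be factorized as
  $A = U \Sigma V^T$
  where U (m × n) and V (n × n) have orthonormal columns and Σ = diag(σ_1, σ_2, …, σ_n) with σ_1 ≥ σ_2 ≥ … ≥ σ_n ≥ 0
Singular Values And Singular Vectors
• The diagonal elements σ_j of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD
  $A v_j = \sigma_j u_j, \qquad A^T u_j = \sigma_j v_j$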
Matrix approximation
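• Theorem: Let U_k = (u_1, u_2, …, u_k), V_k = (v_1, v_2, …, v_k) and Σ_k = diag(σ_1, σ_2, …, σ_k), and define
  $A_k = U_k \Sigma_k V_k^T$
• Then
  $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is
  $A_k = U_k \Sigma_k V_k^T$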
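The rank-1 case of the theorem can be verified without a linear algebra library. The sketch below is an illustration with made-up helper names: it finds the leading singular triplet by power iteration on A^T A (a swapped-in technique for the example, not a full SVD routine) and measures the error of the resulting rank-1 approximation.

```python
import math

def top_singular_triplet(A, iters=200):
    """Power iteration on A^T A. Returns (sigma_1, u_1, v_1)."""
    m, n = len(A), len(A[0])
    v = [1.0] * n
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]   # A^T (A v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

A = [[1.0, 1.0], [0.0, 1.0]]
sigma1, u1, v1 = top_singular_triplet(A)
A1 = [[sigma1 * u1[i] * v1[j] for j in range(2)] for i in range(2)]      # rank-1 approximation
error = math.sqrt(sum((A[i][j] - A1[i][j]) ** 2 for i in range(2) for j in range(2)))
```

For this matrix the singular values are (1 + √5)/2 and (√5 − 1)/2, and the error of the rank-1 approximation equals the second singular value, as the theorem states for k = 1. The residual has rank one here, so its Frobenius norm and 2-norm coincide.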
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix
• V is just the eigenvectors of A^T A
  – Each column of V is an eigenvector for the row matrix A
  – We use V to approximate a row of A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U is just the eigenvectors of A A^T
  – Each column of U is an eigenvector for the column matrix A
  – We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$
  – U_k: orthogonal basis that we use to approximate all the documents
  – D_k: column j holds the coordinates of document j in the new basis
  – D_k is the projection of A onto the subspace spanned by U_k
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  – U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  – U holds the left (column) eigenvectors
  – V holds the right (row) eigenvectors
  – Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors are equal to the row eigenvectors
• Note the difference of A in PCA and SVD
  – SVD: A is directly the data, e.g. the term-by-document matrix
  – PCA: A is the covariance matrix, A = X^T X, where each row of X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
   – Indexing: collecting terms
   – Use a stop list: eliminate "meaningless" words
   – Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  – UΣ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  – VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  – UΣ^{1/2} are the coordinates of A (rows) projected to space V
  – VΣ^{1/2} are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  – Authorities contain high-quality information
  – Hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
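• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing
  $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}, \qquad h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise
• Now
  $a^{(k)} = L^T h^{(k-1)}, \qquad h^{(k)} = L\, a^{(k)}$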
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• They are in fact the first right (a) and left (h) singular vectors of L
• We are in fact running SVD on the adjacency matrix
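The HITS iteration can be sketched as alternating matrix-vector products with normalization. The five-page graph and the function name `hits` are made up for illustration; scores are scaled to unit Euclidean norm at each step, which is one common choice rather than the only one.

```python
import math

def hits(L, iters=100):
    """L[i][j] = 1 if page i links to page j. Returns (hub, authority) scores."""
    n = len(L)
    hub = [1.0] * n
    auth = [0.0] * n
    for _ in range(iters):
        auth = [sum(L[i][j] * hub[i] for i in range(n)) for j in range(n)]   # a = L^T h
        hub = [sum(L[i][j] * auth[j] for j in range(n)) for i in range(n)]   # h = L a
        na = math.sqrt(sum(x * x for x in auth))
        nh = math.sqrt(sum(x * x for x in hub))
        auth = [x / na for x in auth]
        hub = [x / nh for x in hub]
    return hub, auth

# Hypothetical graph: pages 0 and 1 both link to pages 2 and 3; page 1 also links to page 4.
L = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
hub, auth = hits(L)
```

Pages 2 and 3 end up as the strongest authorities because both hubs point to them, and page 1 is the stronger hub because it points to one more authority.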
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column W_i represents one basis vector
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k}\, h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
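• Cost function: Euclidean distance
  $\|V - WH\|^2 = \sum_{ij} \big(V_{ij} - (WH)_{ij}\big)^2$
• Multiplicative update (Lee and Seung)
  $H_{a\mu} \leftarrow H_{a\mu} \dfrac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \dfrac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
  $D(A \| B) = \sum_{ij} \Big( A_{ij} \log \dfrac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \Big)$
  – Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update (Lee and Seung)
  $H_{a\mu} \leftarrow H_{a\mu} \dfrac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \dfrac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• pLSA is NMF with KL divergence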
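The Euclidean multiplicative update of Lee and Seung can be sketched with plain lists. This is an illustrative sketch with made-up helper names, a toy 3 × 3 matrix, random positive initialization, and a small `eps` in the denominators to avoid division by zero.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Euclidean distance ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iters=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [frob_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(frob_error(V, W, H))
    return W, H, errors

# Toy data: V is an exact product of two non-negative factors with k = 2.
V = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0]]
W, H, errors = nmf(V, k=2)
```

Because each update only multiplies by non-negative ratios, W and H stay non-negative without any explicit projection, and the recorded error should never increase.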
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for Non-negative Matrix Factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended).
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians: parameter estimation
• pLSA
  – Motivation
  – Derivation & geometric properties
  – Applications
• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x_1, x_2, …, x_D)^T denote a data point, and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x^(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
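• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x_1, x_2, …, x_n)
• Maximum likelihood finds the θ that maximizes L(θ)
  $\hat{\theta}_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, \dots, x_n \mid \theta)$
• Predictive distribution
  $p(x \mid x_1, \dots, x_n) \approx p(x \mid \hat{\theta}_{ML})$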
IID ndash Independent Identically Distributed
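• IID means that the samples are drawn independently from the same distribution p(x|θ)
• The problem is considerably simplified as
  $L(\theta) = p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used
  $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$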
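As a concrete case, the i.i.d. log likelihood of a Gaussian model has closed-form maximizers. The sketch below uses five made-up samples; the function name is illustrative.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """log L(mu, sigma2) = sum_i log N(x_i | mu, sigma2) for i.i.d. samples."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

xs = [2.1, 1.9, 2.4, 1.6, 2.0]
mu_ml = sum(xs) / len(xs)                                  # ML estimate of the mean
sigma2_ml = sum((x - mu_ml) ** 2 for x in xs) / len(xs)    # ML estimate of the variance (divides by n)
best = gaussian_log_likelihood(xs, mu_ml, sigma2_ml)
```

Any other choice of mean or variance gives a lower log likelihood on the same data, which is what "maximum likelihood" means operationally.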
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides).
• Gregor Heinrich. Parameter Estimation for Text Analysis. Technical Note, 2005-2008.
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  – To describe complex models: Gaussian mixture model
  – To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D, with latent variables X
• Likelihood: $p(D \mid \theta) = \int p(D, X \mid \theta)\, dX$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  – E step: fill in values of the latent variables according to their posterior given the data
  – M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes difficult problems into a series of tractable steps
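The two steps can be sketched for a one-dimensional Gaussian mixture, the example latent variable model mentioned in this part. The data, the starting values and the variance floor below are made-up choices for illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, mus, variances, weights, iters=50):
    """EM for a 1-D Gaussian mixture. Updates the parameter lists in place and returns the log-likelihood trace."""
    K = len(mus)
    trace = []
    for _ in range(iters):
        # E step: responsibilities r[n][k] = p(z_n = k | x_n, current parameters)
        resp, loglik = [], 0.0
        for x in xs:
            joint = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(loglik)
        # M step: weighted ML estimates, as if the soft assignments were observed
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # floor guards against a collapsing component
    return trace

xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.8, 3.1, 2.9]
mus, variances, weights = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]
trace = em_gmm(xs, mus, variances, weights)
```

On this well-separated data the means move to the two cluster centers within a few iterations, and the log-likelihood trace rises monotonically.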
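Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
  $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
  $L(\theta) = \log \int q(X) \dfrac{p(X, D \mid \theta)}{q(X)}\, dX \;\ge\; \int q(X) \log \dfrac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• where H[q] is the entropy of q(X)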
The E and M Steps of EM
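• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  – E step: optimize F(q, θ) w.r.t. the distribution over hidden variables, holding the parameters fixed
    $q^{(k)}(X) = \arg\max_{q} F(q, \theta^{(k-1)})$
  – M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution fixed
    $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta)$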
The E Step
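• E step: for fixed θ
  $F(q, \theta) = \int q(X) \log \dfrac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X|D,θ)) = 0
• So the E step simply sets q(X) = p(X|D,θ)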
The M Step
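• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution q fixed
  $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically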
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
  $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
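The bound can be checked numerically for a discrete latent variable. The three joint probabilities below are made up; the sketch evaluates F(q, θ) for a uniform q and for the exact posterior, and compares the gap with the KL divergence.

```python
import math

# Hypothetical joint p(X = x, D | theta) for one observation and a 3-valued latent X.
joint = [0.10, 0.25, 0.05]
log_likelihood = math.log(sum(joint))          # L(theta) = log sum_x p(x, D | theta)

def free_energy(q):
    """F(q, theta) = sum_x q(x) log( p(x, D | theta) / q(x) )."""
    return sum(qx * math.log(px / qx) for qx, px in zip(q, joint) if qx > 0)

posterior = [p / sum(joint) for p in joint]    # p(X | D, theta)
uniform = [1 / 3, 1 / 3, 1 / 3]

F_uniform = free_energy(uniform)               # strictly below L(theta)
F_posterior = free_energy(posterior)           # touches L(theta)
gap = log_likelihood - F_uniform
kl = sum(q * math.log(q / p) for q, p in zip(uniform, posterior))
```

The gap equals the KL divergence between q and the posterior, and it vanishes when q is the posterior, which is exactly what the E step exploits.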
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides).
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
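The requirement that every sample is a K-tuple of non-negative numbers summing to one can be checked numerically. The sketch below is only an illustration, not part of the original slides: it draws Dirichlet samples by normalizing independent Gamma draws (a standard construction), and the parameter vector (6, 2, 2) and the helper name `sample_dirichlet` are assumptions chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)            # fixed seed so the run is repeatable
alpha = [6.0, 2.0, 2.0]           # illustrative parameters for K = 3
theta = sample_dirichlet(alpha, rng)

# Average many samples; the mean of Dirichlet(alpha) is alpha_i / sum(alpha).
samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]
mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]

print("one sample:", theta, "sum =", sum(theta))
print("empirical mean:", mean)
```

Each sample lies on the 2-simplex, and the empirical mean should be close to (0.6, 0.2, 0.2).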
Dirichlet Distribution
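• Definition
  $$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex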
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
The LDA Model
[Figure: LDA graphical model unrolled for three documents, each with topics z1, …, z4 and words w1, …, w4, sharing the topic-word parameter β]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words w_n
    Choose a topic z_n ~ Multinomial(θ)
    Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
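The generative process above can be sketched directly in code. This is a toy illustration under stated assumptions rather than the slides' implementation: the vocabulary, the two topic-word distributions in `beta`, the hyperparameter `alpha`, and the helper names are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)               # topic mixture of this document
    topics = list(range(len(alpha)))
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]      # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]     # w_n ~ p(w | z_n, beta)
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

rng = random.Random(1)
vocab = ["goal", "match", "vote", "party"]             # hypothetical vocabulary
beta = [[0.5, 0.4, 0.05, 0.05],                        # topic 0: sports-like words
        [0.05, 0.05, 0.5, 0.4]]                        # topic 1: politics-like words
alpha = [0.8, 0.8]                                     # hypothetical Dirichlet parameter
theta, assignments, words = generate_document(alpha, beta, vocab, 10, rng)
print(theta, words)
```

Inference in LDA runs this process in reverse: only `words` is observed, and θ and the topic assignments must be inferred.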
Joint Probability
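• Given the parameters α and β, the joint distribution of a topic mixture θ, the N topics z, and the N words w is
  $$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
  where $p(z_n = i \mid \theta) = \theta_i$
Likelihood
• Joint probability
  $$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
• Marginal distribution of a document
  $$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$
• Likelihood over all M documents
  $$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$$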
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
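• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
  $$\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + \mathrm{KL}\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence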
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
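The relation between the lower bound, the KL divergence, and the log likelihood can be verified on a tiny discrete model. The sketch below is illustrative only: the three joint probabilities and the trial distribution `q` are made-up numbers, and the hidden variable takes just three values so every sum is exact.

```python
import math

# Toy model with made-up numbers: hidden X takes 3 values, the data D is fixed.
joint = [0.10, 0.25, 0.05]                  # p(X = x, D | theta) for x = 0, 1, 2
evidence = sum(joint)                       # p(D | theta)
posterior = [j / evidence for j in joint]   # p(X | D, theta)

def lower_bound(q):
    """Lower bound on log p(D | theta): sum_x q(x) log( p(x, D | theta) / q(x) )."""
    return sum(qx * math.log(jx / qx) for qx, jx in zip(q, joint) if qx > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

q = [0.2, 0.5, 0.3]                          # an arbitrary variational distribution
gap = math.log(evidence) - lower_bound(q)

print("log evidence          :", math.log(evidence))
print("bound at arbitrary q  :", lower_bound(q))
print("gap vs KL(q||post)    :", gap, kl(q, posterior))
print("bound at the posterior:", lower_bound(posterior))
```

The gap equals KL(q || posterior), and it closes when q is set to the exact posterior, which is what the standard EM E-step does. VBEM instead searches a restricted family of q for the smallest gap.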
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence (Ψ is the digamma function)
  $$\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$
• M-step: parameter estimation
  - β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$
  - α: can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents, each with topics z1, …, z4 and words w1, …, w4, sharing the topic-word parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with variables π, θ, z, x, and β, and plates N_d and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
  - Vocabulary space (Dim = 1 million)
          Term1  Term2  Term3  Term4  …  TermN
    Img1    1      2      0      0    …    2
  - Topic space (Dim = 200)
           f1    f2    …    fM
    Img1   0.2   0.1   …   0.03
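The long query problem is easy to reproduce on a toy inverted index. In the sketch below the index contents, the document ids, and the function names are all hypothetical; the point is only that intersecting many posting lists tends toward the empty set while their union tends toward the whole corpus.

```python
# Hypothetical inverted index: term -> set of ids of the documents containing it.
index = {
    "sunset": {1, 2, 5},
    "beach":  {2, 3, 5},
    "palm":   {2, 5, 7},
    "surf":   {4, 8},
}

def posting_lists(query, index):
    """One inverted list per query term (empty if the term is unknown)."""
    return [index.get(term, set()) for term in query]

def match_all(query, index):
    """Documents containing every query term (intersection of the lists)."""
    lists = posting_lists(query, index)
    return set.intersection(*lists) if lists else set()

def match_any(query, index):
    """Documents containing at least one query term (union of the lists)."""
    return set().union(*posting_lists(query, index))

short_query = ["sunset", "beach"]
long_query = ["sunset", "beach", "palm", "surf"]

print(match_all(short_query, index))   # {2, 5}
print(match_all(long_query, index))    # set(): the intersection is already empty
print(match_any(long_query, index))    # {1, 2, 3, 4, 5, 7, 8}: every indexed document
```

With only four terms the intersection is already empty and the union covers every indexed document, which is why the talk turns to a low-dimensional representation instead.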
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a bar chart of its low-dimensional feature vector + a few words (10 words)]
Orthogonal Decomposition
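p = Xw + ε
[Equations: the step-by-step decomposition is not recoverable from the extraction; the slide labels its parts as base vector, low-dimensional representation, and residual.]
[Figure: an image = a bar chart of its low-dimensional feature vector + a few words (10 words)]
• Similarity of two documents p and q
  $$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$
[Figure: base vectors X1, X2, X3, …, Xk]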
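A small numeric check shows why keeping the residual preserves similarity. The sketch assumes the base vectors are orthonormal and that each residual is exactly what is left after projection; under those assumptions the inner product of two original vectors equals the inner product of their low-dimensional coordinates plus the inner product of their residuals. The basis and the two vectors are made-up numbers.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps."""
    w = [dot(x, p) for x in basis]
    projection = [sum(wi * x[i] for wi, x in zip(w, basis)) for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, projection)]
    return w, eps

# Two orthonormal base vectors in a 4-dimensional "vocabulary" space (illustrative).
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, -0.5, 0.5, -0.5]]

p = [3.0, 1.0, 2.0, 0.0]
q = [1.0, 0.0, 4.0, 1.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)
from_parts = dot(w_p, w_q) + dot(eps_p, eps_q)
print(exact, from_parts)   # the two values agree up to rounding
```

In the talk only a few residual words per document are kept, so in practice the second term is an approximation rather than an exact correction.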
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution (x = 0)
• a document-specific distribution (x = 1)
• a background distribution (x = 2)
$$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
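The switch-variable mixture can be evaluated by hand for a toy document. All numbers below are invented for illustration: a three-word vocabulary, two topics, and switch probabilities for the topic, document-specific, and background routes. The sketch only shows how the three routes combine into one word distribution.

```python
vocab = ["beach", "sunset", "the"]            # hypothetical vocabulary

p_switch = [0.6, 0.3, 0.1]                    # p(x=0|d), p(x=1|d), p(x=2|d)
p_topic_given_doc = [0.7, 0.3]                # p(z=k|d) for K = 2 topics
p_word_given_topic = [[0.8, 0.1, 0.1],        # p(w|z=0)
                      [0.2, 0.7, 0.1]]        # p(w|z=1)
p_doc_specific = [0.1, 0.8, 0.1]              # document-specific word distribution
p_background = [0.05, 0.05, 0.9]              # background word distribution

def word_prob(w):
    """p(w|d): topic route + document-specific route + background route."""
    topic_part = sum(p_word_given_topic[k][w] * p_topic_given_doc[k]
                     for k in range(len(p_topic_given_doc)))
    return (p_switch[0] * topic_part
            + p_switch[1] * p_doc_specific[w]
            + p_switch[2] * p_background[w])

probs = [word_prob(w) for w in range(len(vocab))]
print(dict(zip(vocab, probs)), sum(probs))
```

Because each route is itself a distribution over words and the switch probabilities sum to one, the result is again a proper distribution.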
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional feature vector + a few words) is looked up in the LSH index built over the documents Doc 1, Doc 2, …, Doc N and their entries DS1, DS2, …; the index returns candidates such as Doc 300 and Doc 401, which are then re-ranked using the document metadata (Doc Meta).]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Steps: reply reconstruction, network construction, expert finding
• Methods
  - HITS
  - PageRank
  - …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
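For readers unfamiliar with the three columns, the sketch below computes them for a single ranked list using their usual textbook definitions: reciprocal rank (averaged over queries to give MRR), average precision (averaged over queries to give MAP), and precision at k. Whether the talk's evaluation used exactly these variants is an assumption, and the user ids are invented.

```python
def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant item, or 0 if none is retrieved."""
    for pos, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k items that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at the positions of the relevant items."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0

# Hypothetical expert ranking for one query, and the true experts for it.
ranking = ["u3", "u1", "u7", "u2"]
experts = {"u1", "u2"}

rr = reciprocal_rank(ranking, experts)        # first expert is at position 2
p_at_2 = precision_at_k(ranking, experts, 2)  # one expert in the top 2
ap = average_precision(ranking, experts)      # (1/2 + 2/4) / 2
print(rr, p_at_2, ap)
```

MRR and MAP in the table would then be the means of `rr` and `ap` over all test queries.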
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition - Eigenvalue & Eigenvector
- Definition - Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition - Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Face Recognition - Eigenface
• M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. CVPR 1991 (Citation: 2654)
• The eigenface approach
  - images are points in a vector space
  - use PCA to reduce dimensionality
  - face space
  - compare projections onto face space to recognize faces
PageRank - Power Iteration
• Column j has nonzero elements in the positions corresponding to the outlinks of j (N_j in total)
• Row i has nonzero elements in the positions corresponding to the inlinks I_i
Column-Stochastic & Irreducible
• Column-stochastic: $e^{T} A = e^{T}$
• where $e = (1, 1, \dots, 1)^{T}$
• Irreducible
Iterative PageRank Calculation
• For k = 1, 2, …: $r^{(k)} = A\, r^{(k-1)}$
• Equivalently, $r = A r$ (λ = 1; A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation r_0 in terms of the eigenvectors
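Power iteration can be tried on a three-page web graph. This is a minimal sketch under stated assumptions: the link structure is invented (page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links to page 0), the matrix is column-stochastic as described above, and no random-jump term is added, so convergence relies on this particular graph being irreducible and aperiodic.

```python
def power_iteration(A, tol=1e-12, max_iter=1000):
    """Repeat r <- A r (renormalized to sum to 1) until r stops changing."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [v / total for v in nxt]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Column j holds 1/N_j in the rows of the pages that page j links to.
A = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

rank = power_iteration(A)
print(rank)   # approximately [0.4, 0.2, 0.4]
```

The result satisfies r = A r, so it is the eigenvector for λ = 1. The other eigenvalues of this matrix have modulus below one, which is why their contribution to r_0 dies out as the iteration proceeds.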
SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m × n matrix A, with m ≥ n, can be factorized as $A = U \Sigma V^{T}$
Singular Values And Singular Vectors
• The diagonal elements σ_j of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: $A v_j = \sigma_j u_j$, $A^{T} u_j = \sigma_j v_j$
Matrix approximation
• Theorem: Let U_k = (u_1 u_2 … u_k), V_k = (v_1 v_2 … v_k), and Σ_k = diag(σ_1, σ_2, …, σ_k), and define
  $$A_k = U_k \Sigma_k V_k^{T}$$
• Then
  $$\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$$
• It means that the best rank-k approximation of the matrix A is A_k
SVD and PCA
• We can write $A^{T} A = V \Sigma^{2} V^{T}$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V holds the eigenvectors of $A^{T} A$
  - Each column of V is an eigenvector for the row matrix A
  - We use V to approximate a row of A
• Equivalently, we can write $A A^{T} = U \Sigma^{2} U^{T}$
• U holds the eigenvectors of $A A^{T}$
  - Each column of U is an eigenvector for the column matrix A
  - We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^{T}$
• Approximate A by $A \approx A_k = U_k \Sigma_k V_k^{T} = U_k D_k$, with $D_k = \Sigma_k V_k^{T}$
  - U_k: orthogonal basis that we use to approximate all the documents
  - D_k: column j holds the coordinates of document j in the new basis
  - D_k is the projection of A onto the subspace spanned by U_k
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^{T}$
  - U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^{T}$
  - U is the left (column) eigenvectors
  - V is the right (row) eigenvectors
  - Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g., the term-by-document matrix
  - PCA: A is the covariance matrix, $A = X^{T} X$, where each row of X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
   - Indexing: collecting terms
   - Use a stop list: eliminate "meaningless" words
   - Stemming
2. Construction of the term-by-document matrix; sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g., car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^{T} = U \Sigma^{2} U^{T} = (U\Sigma)(U\Sigma)^{T}$
  - UΣ are the coordinates of A (rows) projected to space V
• Document to document: $A^{T} A = V \Sigma^{2} V^{T} = (V\Sigma)(V\Sigma)^{T}$
  - VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^{T} = (U\Sigma^{1/2})(V\Sigma^{1/2})^{T}$
  - $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  - $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs and authorities]
Power Iteration
• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing
  $$a_i = \sum_{j:\, j \to i} h_j, \qquad h_i = \sum_{j:\, i \to j} a_j$$
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise
• Now $h = L a$ and $a = L^{T} h$
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^{T} L$
• h will be the dominant eigenvector of the hub matrix $L L^{T}$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
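The HITS iteration can be sketched on a four-page graph. The graph is invented for illustration (page 0 links to pages 2 and 3, pages 1 and 2 link to page 3), and the fixed iteration count and L2 normalization are assumptions of this sketch rather than details taken from the slides.

```python
import math

def hits(L, iters=50):
    """Alternate a = L^T h and h = L a, normalizing both after each round."""
    n = len(L)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]   # a = L^T h
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]   # h = L a
        norm_a = math.sqrt(sum(v * v for v in a))
        norm_h = math.sqrt(sum(v * v for v in h))
        a = [v / norm_a for v in a]
        h = [v / norm_h for v in h]
    return a, h

# L[i][j] = 1 if page i links to page j (rows are outlinks, columns are inlinks).
L = [[0, 0, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]

authority, hub = hits(L)
print("authority:", authority)   # page 3 scores highest
print("hub      :", hub)         # page 0 scores highest
```

For this graph the authority vector converges to the dominant eigenvector of L^T L, whose nonzero entries are in the ratio 1 : (1 + √2) for pages 2 and 3.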
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $$V_{n \times m} \approx W_{n \times k} H_{k \times m}$$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column w_i represents one basis vector
• H: coordinates of V projected to W
  $$v_j \approx W_{n \times k}\, h_j$$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• An additive model captures local structure
Multiplicative Update Algorithm
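• Cost function: Euclidean distance
  $$\|V - WH\|^{2} = \sum_{ij} \big(V_{ij} - (WH)_{ij}\big)^{2}$$
• Multiplicative update
  $$H_{aj} \leftarrow H_{aj} \frac{(W^{T} V)_{aj}}{(W^{T} W H)_{aj}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^{T})_{ia}}{(W H H^{T})_{ia}}$$
Multiplicative Update Algorithm
• Cost function: divergence
  $$D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  - A and B can then be regarded as normalized probability distributions
• Multiplicative update
  $$H_{aj} \leftarrow H_{aj} \frac{\sum_{i} W_{ia} V_{ij} / (WH)_{ij}}{\sum_{i} W_{ia}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_{j} H_{aj} V_{ij} / (WH)_{ij}}{\sum_{j} H_{aj}}$$
• PLSA is NMF with KL divergence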
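The Euclidean multiplicative update is short enough to write out in plain Python. The sketch below is a toy illustration, not a tuned implementation: the 4×5 matrix `V`, the rank k = 2, the random initialization, and the small `eps` added to denominators to avoid division by zero are all choices made for this example.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iters=200, seed=0, eps=1e-12):
    """Multiplicative updates for the Euclidean cost ||V - WH||^2."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(V, W, H)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors

# A small non-negative matrix of rank 2 (row 2 = 2 * row 1, row 4 = row 1 + row 3).
V = [[1, 2, 0, 1, 3],
     [2, 4, 0, 2, 6],
     [0, 1, 3, 0, 1],
     [1, 3, 3, 1, 4]]

W, H, errors = nmf(V, k=2)
print("initial error:", errors[0], "final error:", errors[-1])
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the recorded error should never increase from one iteration to the next.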
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP, and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x_1, x_2, …, x_D)^T denote a data point and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x^(N+1)?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = f_θ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
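• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x_1, x_2, …, x_n)
• Maximum likelihood finds the θ that maximizes L(θ)
  $$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta}\, p(x_1, x_2, \dots, x_n \mid \theta)$$
• Predictive distribution
IID - Independent Identically Distributed
• IID means
  $$p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
• The problem is considerably simplified as
  $$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
• Usually the log likelihood is used
  $$\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$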
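As a worked example of maximum likelihood under the i.i.d. assumption, consider coin flips modeled as Bernoulli(θ). The data below are invented; for this model the maximizer of the log likelihood is the sample mean, which the sketch checks against a grid of alternative values.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]    # hypothetical i.i.d. coin flips (1 = heads)

def log_likelihood(theta, xs):
    """Sum of log p(x_i | theta) for a Bernoulli model; valid for 0 < theta < 1."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in xs)

# For a Bernoulli model the ML estimate is the fraction of ones.
theta_ml = sum(data) / len(data)

grid = [t / 100 for t in range(1, 100)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_ml, best_on_grid, log_likelihood(theta_ml, data))
```

Working with the sum of logs instead of the product of probabilities avoids numerical underflow when n is large and turns the maximization into a sum of per-sample terms.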
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe complex models: Gaussian mixture model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
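• Data set: D
• Likelihood: $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize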
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  - E step: fill in values of the latent variables according to the posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
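• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ
  $$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
  $$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$$
• where H[q] is the entropy of q(X)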
The E and M Steps of EM
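• The lower bound on the log likelihood is given by
  $$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$$
• EM alternates between
  - E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
    $$q^{(k)}(X) = \arg\max_{q(X)} F\big(q(X), \theta^{(k-1)}\big)$$
  - M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
    $$\theta^{(k)} = \arg\max_{\theta} F\big(q^{(k)}(X), \theta\big)$$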
The E Step
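• E step: for fixed θ,
  $$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)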
The M Step
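• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed
  $$\theta^{(k)} = \arg\max_{\theta} F\big(q^{(k)}(X), \theta\big) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically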
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
  $$L\big(\theta^{(k-1)}\big) = F\big(q^{(k)}, \theta^{(k-1)}\big) \le F\big(q^{(k)}, \theta^{(k)}\big) \le L\big(\theta^{(k)}\big)$$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
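The monotonicity claim can be observed on a small mixture model. The sketch assumes a two-component Gaussian mixture with unit variances, so only the two means and the mixing weight are learned; the data points and starting values are invented for the example.

```python
import math

def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(data, pi, mu1, mu2):
    return sum(math.log(pi * normal_pdf(x, mu1) + (1 - pi) * normal_pdf(x, mu2))
               for x in data)

def em(data, pi=0.5, mu1=-1.0, mu2=1.0, iters=30):
    history = [log_likelihood(data, pi, mu1, mu2)]
    for _ in range(iters):
        # E step: posterior responsibility of component 1 for each point.
        resp = [pi * normal_pdf(x, mu1)
                / (pi * normal_pdf(x, mu1) + (1 - pi) * normal_pdf(x, mu2))
                for x in data]
        # M step: re-estimate the parameters from the responsibilities.
        n1 = sum(resp)
        n2 = len(data) - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        pi = n1 / len(data)
        history.append(log_likelihood(data, pi, mu1, mu2))
    return pi, mu1, mu2, history

data = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.5, 1.5]   # two made-up clusters
pi, mu1, mu2, history = em(data)
print(pi, mu1, mu2)
print(history[0], history[-1])
```

The recorded log likelihood should be non-decreasing from one iteration to the next, as the inequality chain above predicts.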
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is so complex, even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
  $$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid x_{pa(i)}\big)$$
• where pa(i) are the parents of node i
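The five-variable factorization can be made concrete with binary variables and small conditional probability tables. Every number below is invented; the sketch shows that multiplying the local conditionals yields a valid joint distribution, which can then be summed to obtain marginals.

```python
from itertools import product

# Made-up conditional probability tables for binary variables.
p_a = 0.3                                                     # p(A=1)
p_b = 0.6                                                     # p(B=1)
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # p(C=1 | A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # p(D=1 | B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # p(E=1 | C, D)

def bern(p_one, value):
    """Probability that a binary variable takes `value` when p(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
            * bern(p_d[(b, c)], d) * bern(p_e[(c, d)], e))

states = list(product([0, 1], repeat=5))
total = sum(joint(*s) for s in states)
marginal_a = sum(joint(*s) for s in states if s[0] == 1)
print(total, marginal_a)
```

The full joint over five binary variables has 31 free parameters, while the factorized form above needs only 14 (1 + 1 + 4 + 4 + 4), which is the practical benefit of the decoupling.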
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    Java could mean "coffee" and also the "PL Java"
    Cricket is a "game" and also an "insect"
  - Synonymy
    "computer", "pc", and "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
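[Figure: pLSA graphical model in plate notation (d, z, w, with plates N_d and M), and the same model unrolled for one document d with topic-word pairs (z1, w1), (z2, w2), (z3, w3), …, (zN, wN)]
z1, …, zN are variables with zi ∈ [1, K]; K is the number of latent topics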
pLSA
[Figure: pLSA unrolled over all documents d1, d2, …, dM; document di generates the topic-word pairs (z1, w1), (z2, w2), …, (zNi, wNi)]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
  $$p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)$$
• Likelihood (only for observed variables)
  $$p(d, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)$$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
  $$p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  p(w|d) = Z_{V×K} p(z|d), where column k of Z is the distribution p(w|z=k)
• With many documents, we hope to find latent topics as a common basis
pLSA ndash Objective Function
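• pLSA tries to maximize the log likelihood
  $$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$$
  where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM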
EM Steps
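• E-step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-step
  $$p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$$
• The M-step
  $$p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w), \qquad p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$$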
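The two pLSA steps fit in a few lines of plain Python. This sketch uses the standard pLSA updates: the E-step computes the posterior p(z|d,w) for each observed document-word pair, and the M-step renormalizes the expected counts into p(w|z) and p(z|d). The 4×4 count matrix, the number of topics, and the random initialization are assumptions made for the example, and the constant p(d) term is dropped from the recorded log likelihood.

```python
import math
import random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def plsa(counts, K, iters=50, seed=0):
    """EM for pLSA on a document-by-word count matrix."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    history = []
    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        loglik = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(K)]
                total = sum(joint)                # p(w | d) under current parameters
                loglik += n * math.log(total)
                for k in range(K):
                    r = n * joint[k] / total      # E-step: n(d, w) * p(z=k | d, w)
                    new_wz[k][w] += r
                    new_zd[d][k] += r
        history.append(loglik)                    # likelihood before this update
        p_w_z = [normalize(row) for row in new_wz]   # M-step
        p_z_d = [normalize(row) for row in new_zd]
    return p_w_z, p_z_d, history

# Made-up counts: documents 0-1 mostly use words 0-1, documents 2-3 words 2-3.
counts = [[4, 3, 0, 0],
          [5, 2, 1, 0],
          [0, 0, 3, 4],
          [0, 1, 2, 5]]

p_w_z, p_z_d, history = plsa(counts, K=2)
print(p_z_d)
print(history[0], history[-1])
```

Each row of `p_z_d` is one document's coordinates on the shared basis `p_w_z`, which is the matrix-decomposition reading of pLSA. The recorded log likelihood should not decrease between iterations.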
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates M, Nd]
Codebook
• 174 Local Image Patches
• Detection
– Evenly Sampled Grid
– Random Sampling
– Saliency Detector
– Lowe's DoG Detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman and A. Willsky, Describing Visual Scenes using Transformed Dirichlet Processes, NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
Term space (Dim = 1 million):
      Term1  Term2  Term3  Term4  …  TermN
Img1  1      2      0      0      …  2
Topic Projection
Topic space (Dim = 200):
      f1    f2    …  fM
Img1  0.2   0.1   …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: An image = topic histogram (low dimensional representation) + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$
• X = (X1, X2, X3, …, Xk): base vectors
• w: low dimensional representation
• ε: residual
[Figure: An image = topic histogram (low dimensional representation) + a few words (10 words)]
• With orthonormal base vectors and a residual orthogonal to them, the inner product of two documents splits into two parts: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
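The decomposition into a low dimensional part and a residual can be checked numerically. The sketch below is not the paper's implementation: it assumes the base vectors are orthonormalized with Gram-Schmidt, and the vectors `p`, `q` and the two raw base vectors are made-up toy data.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis

def decompose(p, basis):
    """Split p into coordinates w on the basis and a residual eps orthogonal to it."""
    w = [dot(p, b) for b in basis]
    recon = [sum(wk * b[i] for wk, b in zip(w, basis)) for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, recon)]
    return w, eps

# Toy 4-dimensional "vocabulary space" with k = 2 base vectors (hypothetical data).
basis = gram_schmidt([[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]])
p = [2.0, 1.0, 0.0, 3.0]
q = [1.0, 0.0, 1.0, 2.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

lhs = dot(p, q)                               # similarity in the original space
rhs = dot(w_p, w_q) + dot(eps_p, eps_q)       # low dimensional part + residual part
```

Here `lhs` and `rhs` agree up to rounding, which is why keeping the residual (in practice only its few largest entries, an approximation) preserves similarity that the low dimensional vector alone would lose.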
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline. Components shown: a query (topic histogram + a few words), the LSH index, candidate documents (Doc 300, Doc 401, …), document-specific words (DS1, DS2, …) stored as doc meta for Doc 1 … Doc N, and a re-ranking step]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image]
Search Example
[Figure: query image]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                Apple
         All Posts  Good Posts   All Posts  Good Posts
NP       0.021      0.012        0.289      0.239
RR       0.183      0.319        0.269      0.474
DS       0.463      0.643        0.409      0.628
LDA      0.465      0.644        0.410      0.648
SWB      0.463      0.644        0.410      0.641
SMSS     0.524      0.737        0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in the expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
– Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
– Matrix decomposition: a good way to practice matrix methods
– Graphical models: a good way to practice probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is a good tool for analyzing problems
• A graphical model is more adaptable to various applications than matrix decomposition
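PageRank – Power Iteration
• Column j has nonzero elements in positions corresponding to the outlinks of j ($N_j$ in total)
• Row i has nonzero elements in positions corresponding to the inlinks $I_i$
Column-Stochastic & Irreducible
• Column-stochastic: $e^T A = e^T$
• where $e = (1, 1, \dots, 1)^T$
• Irreducible: every page can be reached from every other page by following links
Iterative PageRank Calculation
• For k = 1, 2, …: $r^{(k)} = A r^{(k-1)}$
• Equivalently, $\lambda r = A r$ with λ = 1 (A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation $r^{(0)}$ in terms of the eigenvectors: $r^{(0)} = c_1 v_1 + c_2 v_2 + \dots + c_n v_n$, so that $A^k r^{(0)} = c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \dots + c_n \lambda_n^k v_n$. Since $\lambda_1 = 1$ is the largest eigenvalue in magnitude, the other terms die out as k grows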
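The iteration above fits in a few lines of plain Python. This is a minimal sketch on a made-up three-page web; it assumes A is already column-stochastic and irreducible and leaves out the random-jump (damping) term used in practice.

```python
def pagerank_power_iteration(A, num_iters=200, tol=1e-12):
    """Power iteration r <- A r for a column-stochastic matrix A (given as a list of rows)."""
    n = len(A)
    r = [1.0 / n] * n  # uniform start vector that sums to one
    for _ in range(num_iters):
        new_r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(new_r)  # equals 1 for column-stochastic A; renormalize against rounding
        new_r = [x / total for x in new_r]
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r

# Hypothetical web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j holds 1/N_j in the rows of the pages that j links to.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
r = pagerank_power_iteration(A)
```

For this graph the stationary vector can be worked out by hand as (0.4, 0.2, 0.4), and the iteration converges to it because the remaining eigenvalues have magnitude below one.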
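SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m x n matrix A with m ≥ n can be factorized as $A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T$
• where $U \in R^{m \times m}$ and $V \in R^{n \times n}$ are orthogonal and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors respectively
• Equivalent form of SVD: $A v_j = \sigma_j u_j$, $A^T u_j = \sigma_j v_j$
Matrix approximation
• Theorem: Let $U_k = (u_1, u_2, \dots, u_k)$, $V_k = (v_1, v_2, \dots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$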
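The theorem can be tried out for k = 1 without any linear algebra library. The sketch below finds the leading singular triple by power iteration on $A^T A$ (one simple way to get it, not a general SVD routine) and measures the error of the rank-1 approximation. The matrix is a made-up example whose singular values are √6 and 2.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def leading_singular_triple(A, num_iters=200):
    """Return (sigma_1, u_1, v_1) using power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] + [0.0] * (len(A[0]) - 1)  # start vector; must not be orthogonal to v_1
    for _ in range(num_iters):
        v = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

A = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sigma, u, v = leading_singular_triple(A)

# Rank-1 approximation A_1 = sigma_1 * u_1 * v_1^T and its squared Frobenius error.
A1 = [[sigma * ui * vj for vj in v] for ui in u]
residual_sq = sum((a - b) ** 2 for ra, rb in zip(A, A1) for a, b in zip(ra, rb))
```

The residual here has a single nonzero singular value, so its Frobenius norm and its 2-norm coincide: `residual_sq` comes out as $\sigma_2^2 = 4$, matching $\|A - A_1\|_2 = \sigma_2$.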
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V is just the eigenvectors of $A^T A$
– Each column in V is an eigenvector of $A^T A$, built from the rows of A
– We use V to approximate a row in A
• Equivalently we can write $A A^T = U \Sigma^2 U^T$
• U is just the eigenvectors of $A A^T$
– Each column in U is an eigenvector of $A A^T$, built from the columns of A
– We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$
– $U_k$: orthogonal basis that we use to approximate all the documents
– $D_k$: column j holds the coordinates of document j in the new basis
– $D_k$ is the projection of A onto the subspace spanned by $U_k$
SVD and PCA
• For symmetric positive semidefinite A (such as a covariance matrix), SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
– U and Λ are eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
– U is the left (column) singular vectors
– V is the right (row) singular vectors
– Λ holds the singular values, which equal the eigenvalues in this case
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
– SVD: A is directly the data, e.g. the term-by-document matrix
– PCA: A is the covariance matrix $A = X^T X$, where each row in X is a (centered) sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
– Indexing: collecting terms
– Use stop list: eliminate "meaningless" words
– Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
– $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
– $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
– $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
– $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– authorities contain high-quality information
– hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: Hubs and Authorities]
Power Iteration
• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}$ and $h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a^{(k)} = L^T h^{(k-1)}$ and $h^{(k)} = L a^{(k)}$
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
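The link between HITS and the dominant eigenvector of $L^T L$ can be seen on a small graph. The sketch below runs the two updates with normalization after each step; the four-page graph is hypothetical.

```python
import math

def hits(L, num_iters=100):
    """Alternate a <- L^T h and h <- L a, normalizing each vector to unit length."""
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(num_iters):
        # Authority score: sum of hub scores of the pages that link to i.
        a = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]
        norm_a = math.sqrt(sum(x * x for x in a))
        a = [x / norm_a for x in a]
        # Hub score: sum of authority scores of the pages that i links to.
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm_h = math.sqrt(sum(x * x for x in h))
        h = [x / norm_h for x in h]
    return h, a

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2 (row i lists the outlinks of page i).
L = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
h, a = hits(L)
```

Page 2 ends up as the strongest authority and page 0 as the strongest hub. The vector `a` satisfies $L^T L a = \lambda a$ with $\lambda = 2 + \sqrt{2}$, the largest eigenvalue of $L^T L$ for this graph, so $\sqrt{\lambda}$ is the largest singular value of L.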
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
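NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function (Euclidean distance): $\|V - WH\|^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2$
• Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function (divergence): $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
– Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
– A and B can then be regarded as normalized probability distributions
• Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence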
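The Euclidean-cost updates of Lee and Seung can be sketched in plain Python. Only the update rules come from the cited paper; the helper names, the small constant `eps` that guards against division by zero, the random initialization and the toy matrix are assumptions of this sketch.

```python
import random

def matmul(X, Y):
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def sq_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))

def nmf(V, k, num_iters=200, seed=0, eps=1e-12):
    """Multiplicative updates for min ||V - WH||^2 subject to W, H >= 0."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [sq_error(V, W, H)]
    for _ in range(num_iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(sq_error(V, W, H))
    return W, H, errors

# Toy non-negative data matrix (4 features x 3 samples).
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
W, H, errors = nmf(V, k=2)
```

Because each update multiplies by a non-negative ratio, W and H stay non-negative without any projection step, and the recorded `errors` never increase (up to rounding).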
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian Inference
– Expectation-Maximization
– Mixture Gaussian parameter estimation
• pLSA
– Motivation
– Derivation & geometry properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet Distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters under which the observed data are "most likely" to have been generated
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, \dots, x_n \mid \theta)$
• Predictive distribution: $p(\tilde{x} \mid x_1, \dots, x_n) \approx p(\tilde{x} \mid \theta_{ML})$
IID – Independent Identically Distributed
• IID means $p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$
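A worked case makes the i.i.d. log likelihood concrete. The sketch below assumes a one-dimensional Gaussian model, for which the maximizer has a closed form (sample mean and the variance with divisor n); the five data points are made up.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for i.i.d. samples."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def gaussian_mle(xs):
    """Closed-form maximizer of the Gaussian log likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n  # divisor n, not n - 1
    return mu, sigma2

xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_ml, sigma2_ml = gaussian_mle(xs)            # mean 2.0, variance 0.068
best = gaussian_log_likelihood(xs, mu_ml, sigma2_ml)
```

Any other parameter setting, for example `gaussian_log_likelihood(xs, 2.1, sigma2_ml)`, gives a lower value than `best`, which is what "most likely" means in the ML sense.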
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set D
• Likelihood $L(\theta) = p(D \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta \int p(D, X \mid \theta)\, dX$, where X denotes the latent variables
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of the latent variables according to their posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
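Jensen's Inequality
• For the concave function log and any distribution q: $\log E_q[f(X)] \ge E_q[\log f(X)]$
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$, latent variables $X = \{x_n\}$, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ: $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality: $L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
• $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$, where H[q] is the entropy of q(X)
The E and M Steps of EM
• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
• E step: optimize wrt the distribution over hidden variables, holding the parameters fixed: $q^{(k)} = \arg\max_q F(q, \theta^{(k-1)})$
• M step: maximize wrt the parameters, holding the hidden distribution fixed: $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta)$
The E Step
• E step: for fixed θ, $F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that for fixed θ, F is bounded above by L and achieves that bound when $\mathrm{KL}(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$
The M Step
• M step: maximize wrt the parameters, holding the hidden distribution q fixed: $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood: $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) wrt θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL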
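The monotonicity claim can be observed on the Gaussian mixture mentioned earlier as a motivating model. The sketch below runs EM on a one-dimensional, two-component mixture and records the log likelihood after every iteration; the data, the starting means and the small variance floor are assumptions made for the illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, weights, mus, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)))
        for x in xs
    )

def em_gmm(xs, init_mus, num_iters=50):
    """EM for a 1-D Gaussian mixture; returns parameters and the log-likelihood history."""
    mus = list(init_mus)
    k = len(mus)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = [log_likelihood(xs, weights, mus, variances)]
    for _ in range(num_iters):
        # E step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: weighted maximum likelihood estimates.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(xs)
            mus[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - mus[c]) ** 2 for r, x in zip(resp, xs)) / nc
            variances[c] = max(var, 1e-6)  # floor to avoid a collapsed component
        history.append(log_likelihood(xs, weights, mus, variances))
    return weights, mus, variances, history

xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
weights, mus, variances, history = em_gmm(xs, init_mus=[0.0, 6.0])
```

The entries of `history` never decrease, and the two means settle near 1.0 and 5.0, the centers of the two clusters in the toy data.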
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 Slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model is so complex, even with a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general: $p(X_1, \dots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies due to personal style
• Latent semantic indexing
– Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  – Java could mean "coffee" and also the "PL Java"
  – Cricket is a "game" and also an "insect"
– Synonymy
  – "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
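pLSA
[Figure: plate notation d → z → w with plates Nd and M, and the same model unrolled for one document: d generates topics z1, z2, z3, …, zN, and each zn generates a word wn]
z1 … zN are variables, zi ∈ [1, K], where K is the number of latent topics
pLSA
[Figure: the model unrolled for documents d1, d2, …, dM, each with its own topics z and words w; panel label: Likelihood]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents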
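Joint Probability vs Likelihood
• Joint probability: $p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$, where column z of Z is the distribution p(w|z) over a vocabulary of size V
• With many documents, we hope to find latent topics as a common basis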
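pLSA – Objective Function
• pLSA tries to maximize the log likelihood $\sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \left[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \right]$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$, $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$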
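The two steps can be combined into a short loop. The following is a sketch of the standard pLSA updates in plain Python rather than the code used for the experiments in these slides; the count matrix, the random initialization and the variable names are assumptions. The constant p(d) term is left out of the recorded log likelihood because it does not affect the updates.

```python
import math
import random

def normalize(row):
    s = sum(row)
    return [x / s for x in row]

def plsa(counts, k, num_iters=50, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood history."""
    rng = random.Random(seed)
    num_docs, vocab = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(vocab)]) for _ in range(k)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(num_docs)]
    history = []
    for _ in range(num_iters):
        new_w_z = [[0.0] * vocab for _ in range(k)]
        new_z_d = [[0.0] * k for _ in range(num_docs)]
        loglik = 0.0
        for d in range(num_docs):
            for w in range(vocab):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(k)]
                p_w_d = sum(joint)                 # p(w|d) = sum_z p(w|z) p(z|d)
                loglik += n_dw * math.log(p_w_d)   # likelihood of the current parameters
                for z in range(k):
                    post = joint[z] / p_w_d        # E step: p(z|d,w)
                    new_w_z[z][w] += n_dw * post   # M step accumulators
                    new_z_d[d][z] += n_dw * post
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(loglik)
    return p_w_z, p_z_d, history

# Toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 4]]
p_w_z, p_z_d, history = plsa(counts, k=2)
```

Each row of `p_w_z` is one latent topic (a distribution over words) and each row of `p_z_d` is a document's mixture over topics, which is the decomposition $p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$ in matrix form.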
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U matrix: related to P(z|d) (doc to aspect)
– V matrix: related to P(w|z) (aspect to term)
– Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done: LSA minimizes the squared error of a low rank approximation, whereas PLSA fits a probabilistic model by maximum likelihood
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman and X. Munoz, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman, Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious problems with over-fitting
[Figure: the pLSA model unrolled for documents d1, d2, …, dm, each with its own unconstrained topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
• Definition: $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$
The LDA Model
[Figure: the LDA graphical model, shown unrolled for three documents with topics z1 … z4 and words w1 … w4]
• For each document
– Choose θ ~ Dirichlet(α)
– For each of the N words wn
  – Choose a topic zn ~ Multinomial(θ)
  – Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
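The generative process reads almost directly as code. The sketch below uses only the standard library and draws the Dirichlet sample by normalizing independent Gamma variates, a standard construction. The values of `alpha` and the topic-word table `beta` are toy assumptions chosen so that each topic owns a disjoint pair of words.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    u = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against rounding in the cumulative sum

def generate_document(alpha, beta, num_words, rng):
    """Generate one document: theta, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)        # topic proportions for this document
    topics, words = [], []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)      # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)    # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
alpha = [0.5, 0.5]                   # two topics
beta = [[0.5, 0.5, 0.0, 0.0],        # topic 0 emits words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]        # topic 1 emits words 2 and 3
theta, topics, words = generate_document(alpha, beta, 20, rng)
```

Unlike pLSA, where p(z|d) is a free parameter per document, here `theta` is itself a random draw governed by the shared α, so the number of parameters does not grow with the number of documents.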
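Joint Probability
• Given the parameters α and β: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(\theta \mid \alpha)$ is the Dirichlet density, $p(z_n \mid \theta)$ is multinomial with parameter θ, and $p(w_n \mid z_n, \beta)$ is the word distribution of topic $z_n$
Likelihood
• Joint probability: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$
Inference
• The log likelihood can be computed by summing over documents: $\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality gives a lower bound, as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables: $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach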
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
– Matrix decomposition: good practice for learning matrix methods
– Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
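Column-Stochastic & Irreducible
• Column-stochastic: each column of A sums to one, $\sum_i A_{ij} = 1$ for every j
• where $A_{ij} \ge 0$
• Irreducible: every page can be reached from every other page, i.e. the link graph is strongly connected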
Iterative PageRank Calculation
• For k = 1, 2, …: $r_k = A r_{k-1}$
• Equivalently, at convergence $A r = \lambda r$ with $\lambda = 1$ (A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
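Convergence of the power iteration
• Expand the initial approximation $r_0$ in terms of the eigenvectors $x_1, \ldots, x_n$ of A: $r_0 = \sum_i c_i x_i$
• Then $A^k r_0 = \sum_i c_i \lambda_i^k x_i$. Assuming $\lambda_1 = 1$ is strictly dominant ($|\lambda_i| < 1$ for $i \ge 2$), every term except $c_1 x_1$ vanishes as k grows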
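A minimal sketch of the power iteration follows, assuming a small column-stochastic, irreducible matrix. The 3-page link matrix `A` and the function name `power_iteration` are hypothetical and chosen only for illustration; the renormalization step is a guard against rounding drift rather than part of the method itself.

```python
def power_iteration(A, num_iters=200, tol=1e-12):
    """Iterate r_k = A r_{k-1} for a column-stochastic matrix A (list of rows)."""
    n = len(A)
    r = [1.0 / n] * n                       # r_0: uniform starting vector
    for _ in range(num_iters):
        nxt = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]      # keep the entries summing to one
        diff = sum(abs(a - b) for a, b in zip(nxt, r))
        r = nxt
        if diff < tol:
            break
    return r

# Hypothetical web of 3 pages; column j holds the out-link probabilities of page j.
A = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
r = power_iteration(A)
print(r)
```

For this matrix the fixed point of $r = Ar$ can be solved by hand as $(4/9, 2/9, 3/9)$, which gives a direct check on the iteration.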
SINGULAR VALUE DECOMPOSITION
SVD - Definition
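• Any m × n matrix A with m ≥ n can be factorized as $A = U \Sigma V^T$
• where U (m × n) has orthonormal columns, V (n × n) is orthogonal, and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$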
Singular Values And Singular Vectors
• The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: $A = \sum_{j=1}^{n} \sigma_j u_j v_j^T$
Matrix approximation
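• Theorem: Let $U_k = (u_1, u_2, \ldots, u_k)$, $V_k = (v_1, v_2, \ldots, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$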
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V holds the eigenvectors for A
– Each column of V is an eigenvector of $A^T A$, with A viewed as a row matrix
– We use V to approximate a row of A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors for $A^T$
– Each column of U is an eigenvector of $A A^T$, with A viewed as a column matrix
– We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
– $U_k$: orthogonal basis that we use to approximate all the documents
– $D_k$: column j holds the coordinates of document j in the new basis
– $D_k$ is the projection of A onto the subspace spanned by $U_k$
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
– U and Λ are eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
– U is the left (column) eigenvectors
– V is the right (row) eigenvectors
– Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
– SVD: A is directly the data, e.g. a term-by-document matrix
– PCA: A is the covariance matrix, $A = X^T X$, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
– Indexing: collecting terms
– Use a stop list: eliminate "meaningless" words
– Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
– $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
– $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
– $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
– $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– authorities contain high-quality information
– hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[figure: hubs pointing to authorities]
Power Iteration
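• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing $a_i = \sum_{j \to i} h_j$ and $h_i = \sum_{i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a = L^T h$ and $h = L a$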
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
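The HITS iteration can be sketched in a few lines, assuming the convention that row i of L lists the outlinks of page i. The 5-page graph and the function name `hits` are hypothetical; scores are rescaled to unit length after each round so they stay bounded.

```python
import math

def hits(L, num_iters=100):
    """Alternate a = L^T h and h = L a, normalizing both vectors each round."""
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(num_iters):
        a = [sum(L[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = L^T h
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = L a
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_h = math.sqrt(sum(x * x for x in h))
        a = [x / norm_a for x in a]
        h = [x / norm_h for x in h]
    return a, h

# Hypothetical graph: pages 0 and 1 link to pages 2 and 3; page 1 also links to page 4.
L = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
a, h = hits(L)
print("authority:", a)
print("hub:", h)
```

In this graph pages 2 and 3 receive links from both hubs and end up with equal, highest authority, while page 1 links to more authorities than page 0 and gets the higher hub score.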
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is an n-dimensional data sample
• W: k bases; each column $w_i$ represents one basis vector
• H: coordinates of V projected onto W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• An additive model captures local structure
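Multiplicative Update Algorithm
• Cost function: Euclidean distance
$\|V - WH\|^2 = \sum_{ij} \left( V_{ij} - (WH)_{ij} \right)^2$
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
$D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
– Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
– A and B can then be regarded as normalized probability distributions
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence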
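The Euclidean-distance variant of the multiplicative update can be sketched as below. This is an illustration, not a tuned implementation: the matrix `V`, the random initialization, the small `eps` added to denominators to avoid division by zero, and all function names are assumptions made for the example.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Euclidean cost ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, num_iters=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(V, W, H)]
    for _ in range(num_iters):
        Wt = transpose(W)
        num = matmul(Wt, V)                 # W^T V
        den = matmul(matmul(Wt, W), H)      # W^T W H
        H = [[H[a][u] * num[a][u] / (den[a][u] + eps) for u in range(m)]
             for a in range(k)]
        Ht = transpose(H)
        num = matmul(V, Ht)                 # V H^T
        den = matmul(W, matmul(H, Ht))      # W H H^T
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors

# Hypothetical non-negative data: 4-dimensional samples in 3 columns, rank 2.
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 0.0, 1.0],
     [3.0, 2.0, 5.0]]
W, H, errors = nmf(V, k=2)
print(errors[0], errors[-1])
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative without any explicit projection step.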
NMF vs PCA
• m = 2429 faces, n = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Gaussian mixture parameter estimation
• pLSA
– Motivation
– Derivation & geometric properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
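• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ): $\hat{\theta}_{ML} = \arg\max_\theta p(x_1, x_2, \ldots, x_n \mid \theta)$
• Predictive distribution: $p(x_{new} \mid x_1, \ldots, x_n) \approx p(x_{new} \mid \hat{\theta}_{ML})$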
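IID – Independent Identically Distributed
• IID means $p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified, as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$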
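As a worked instance of maximum likelihood under the IID assumption, the sketch below fits a one-dimensional Gaussian, for which the maximizer of the log likelihood is available in closed form (sample mean and the variance with divisor n). The data values and function names are hypothetical.

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """log L(theta) = sum_i log N(x_i | mu, var) for IID samples."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * var)
            - sum((x - mu) ** 2 for x in data) / (2 * var))

def gaussian_mle(data):
    """Closed-form maximizer of the Gaussian log likelihood."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var

data = [2.1, 1.9, 2.4, 2.0, 1.6]      # hypothetical samples
mu_hat, var_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, var_hat)
print(mu_hat, var_hat, best)
```

Evaluating `gaussian_log_likelihood` at any other `(mu, var)` pair gives a value no larger than `best`, which is what "most likely parameters" means here.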
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008.
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: $\mathcal{D}$
• Likelihood: $L(\theta) = p(\mathcal{D} \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta \log p(\mathcal{D} \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
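Jensen's Inequality
• For the concave function log, with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$: $\log \sum_i \pi_i x_i \ge \sum_i \pi_i \log x_i$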
Lower Bounding the Log Likelihood
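• Observed data $\mathcal{D} = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ
$L(\theta) = \log p(\mathcal{D} \mid \theta) = \log \int p(X, \mathcal{D} \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$L(\theta) = \log \int q(X) \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX = F(q, \theta)$
$F(q, \theta) = \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q]$
• where H[q] is the entropy of q(X)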
The E and M Steps of EM
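• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q]$
• EM alternates between
• E step: optimize with respect to the distribution over hidden variables, holding the parameters fixed
$q^{[k]}(X) = \arg\max_{q} F(q, \theta^{[k-1]})$
• M step: maximize with respect to the parameters, holding the hidden distribution fixed
$\theta^{[k]} = \arg\max_{\theta} F(q^{[k]}, \theta)$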
The E Step
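• E step: for fixed θ
$F(q, \theta) = \int q(X) \log \frac{p(X \mid \mathcal{D}, \theta)\, p(\mathcal{D} \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}\left( q(X) \,\|\, p(X \mid \mathcal{D}, \theta) \right)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $\mathrm{KL}\left( q(X) \,\|\, p(X \mid \mathcal{D}, \theta) \right) = 0$
• So the E step simply sets $q^{[k]}(X) = p(X \mid \mathcal{D}, \theta^{[k-1]})$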
The M Step
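• M step: maximize with respect to the parameters, holding the hidden distribution q fixed
$\theta^{[k]} = \arg\max_\theta F(q^{[k]}, \theta) = \arg\max_\theta \int q^{[k]}(X) \log p(X, \mathcal{D} \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically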
EM Never Decreases the Likelihood
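• The E and M steps together never decrease the log likelihood
$L(\theta^{[k-1]}) = F(q^{[k]}, \theta^{[k-1]}) \le F(q^{[k]}, \theta^{[k]}) \le L(\theta^{[k]})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) with respect to θ, which raises the bound
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of the KL divergence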
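The monotonicity property can be observed on a small latent variable model. The sketch below runs EM on a two-component one-dimensional Gaussian mixture and records the log likelihood before every update; the data, the initial guesses and the function names are hypothetical, and no safeguards against collapsing variances are included.

```python
import math

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, num_iters=50):
    """EM for a 2-component 1-D Gaussian mixture; returns parameters and log-likelihood trace."""
    pi = [0.5, 0.5]                       # arbitrary starting values
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    history = []
    for _ in range(num_iters):
        # E step: q(x_n = k) is set to the posterior p(x_n = k | y_n, theta)
        resp = []
        log_lik = 0.0
        for y in data:
            joint = [pi[k] * normal_pdf(y, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_lik)           # L(theta) at the current parameters
        # M step: closed-form maximizers of the expected complete-data log likelihood
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * y for r, y in zip(resp, data)) / nk
            var[k] = sum(r[k] * (y - mu[k]) ** 2 for r, y in zip(resp, data)) / nk
    return pi, mu, var, history

data = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.5, 1.5]   # two hypothetical clusters
pi, mu, var, history = em_gmm(data)
print(mu, history[0], history[-1])
```

The recorded values in `history` form a non-decreasing sequence, matching the chain of inequalities above.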
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model is so complex, even with a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general
$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
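To make the factorization concrete, the sketch below evaluates the five-variable joint from its conditional tables, assuming binary variables. All probability values are made-up numbers and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical tables for binary variables; each entry is p(child = 1 | parents).
p_A = {0: 0.7, 1: 0.3}
p_B = {0: 0.6, 1: 0.4}
p_C1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}    # keyed by (A, B)
p_D1 = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.8}    # keyed by (B, C)
p_E1 = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.95}  # keyed by (C, D)

def bernoulli(p_one, value):
    """Probability of `value` for a binary variable with p(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (p_A[a] * p_B[b]
            * bernoulli(p_C1[(a, b)], c)
            * bernoulli(p_D1[(b, c)], d)
            * bernoulli(p_E1[(c, d)], e))

total = sum(joint(*values) for values in product([0, 1], repeat=5))
print(total, joint(1, 1, 1, 1, 1))
```

The factored form needs 1 + 1 + 4 + 4 + 4 = 14 numbers, whereas an unstructured joint table over five binary variables needs 31, which is the practical benefit of decoupling the joint into conditionals.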
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
[figure: plate diagram]
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same queries vary due to personal styles
• Latent semantic indexing
– Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • Java could mean "coffee" and also the "PL Java"
  • Cricket is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
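pLSA
[figure: plate notation d → z → w, with a plate over the $N_d$ words and a plate over the M documents; unrolled form with d pointing to $z_1, z_2, z_3, \ldots, z_N$ and each $z_n$ pointing to $w_n$]
$z_1, \ldots, z_N$ are variables, $z_i \in [1, K]$; K is the number of latent topics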
pLSA
[figure: unrolled model for documents $d_1, d_2, \ldots, d_M$; document $d_m$ points to its own topics $z_1, \ldots, z_{N_m}$, which generate its words $w_1, \ldots, w_{N_m}$]
$p(w \mid z=1), p(w \mid z=2), \ldots, p(w \mid z=K)$ are shared for all documents
Joint Probability vs Likelihood
• Joint probability: $p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
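• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$, where column z of Z is the distribution $p(w \mid z)$ over the V vocabulary words
• With many documents, we hope to find latent topics as a common basis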
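pLSA – Objective Function
• pLSA tries to maximize the log likelihood
$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \left[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \right]$
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM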
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
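• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$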
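A compact sketch of the pLSA EM loop is given below, assuming the standard updates: the E step computes the posterior p(z | d, w) for every observed (document, word) pair, and the M step re-estimates p(w | z) and p(z | d) from the count-weighted posteriors. The toy count matrix, the random initialization and the function name `plsa` are hypothetical; the recorded objective drops the constant p(d) term because p(d) is uniform.

```python
import math
import random

def plsa(counts, K, num_iters=50, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood trace."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    trace = []
    for _ in range(num_iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        log_lik = 0.0
        for d in range(D):
            for w in range(V):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                total = sum(joint)                    # p(w | d)
                log_lik += n_dw * math.log(total)
                for z in range(K):
                    post = joint[z] / total           # E step: p(z | d, w)
                    new_wz[z][w] += n_dw * post
                    new_zd[d][z] += n_dw * post
        trace.append(log_lik)
        p_w_z = [normalize(row) for row in new_wz]    # M step
        p_z_d = [normalize(row) for row in new_zd]
    return p_w_z, p_z_d, trace

# Hypothetical corpus: 4 documents over a 4-word vocabulary, with two word groups.
counts = [[4, 3, 0, 0],
          [5, 2, 0, 1],
          [0, 0, 3, 4],
          [0, 1, 4, 3]]
p_w_z, p_z_d, trace = plsa(counts, K=2)
print(trace[0], trace[-1])
```

Each row of `p_z_d` is the low-dimensional representation of one document, and the rows of `p_w_z` play the role of the shared basis.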
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U matrix is related to P(z|d) (doc to aspect)
– V matrix is related to P(w|z) (aspect to term)
– Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z \mid d_i)$
• This easily leads to serious problems with over-fitting
[figure: unrolled pLSA model for documents $d_1, d_2, \ldots, d_m$, each with its own topic distribution $p(z \mid d_1), p(z \mid d_2), \ldots, p(z \mid d_m)$]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
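Dirichlet Distribution
• Definition
$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{k} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
[figure: density plots over the simplex]
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$
[figure: density plots over the simplex]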
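One way to get a feel for these parameters is to draw samples. The sketch below uses the standard construction of a Dirichlet sample from independent Gamma variables (draw $g_i \sim \mathrm{Gamma}(\alpha_i, 1)$ and normalize), relying only on Python's `random.gammavariate`. The function name `sample_dirichlet`, the seed and the sample size are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# The empirical mean should be close to alpha_i / alpha_0 = (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
print(mean)
```

Every sample is a K-tuple of non-negative numbers that sums to one, so each one can serve directly as a topic mixture proportion.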
The LDA Model
[figure: unrolled graphical model; each document has its own θ drawn from the shared α, topics $z_1, \ldots, z_4$ generating words $w_1, \ldots, w_4$, with β shared by all documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words $w_n$
– Choose a topic $z_n$ ~ Multinomial(θ)
– Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
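The generative process can be sketched directly in code: draw θ from a Dirichlet, then for each word draw a topic from θ and a word from that topic's distribution. The topic-word matrix `beta`, the value of `alpha`, and all function names below are hypothetical choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    u = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1          # guard against rounding in the running sum

def generate_document(alpha, beta, N, rng):
    theta = sample_dirichlet(alpha, rng)            # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(N):
        z = sample_categorical(theta, rng)          # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)        # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(1)
alpha = [0.5, 0.5]                                  # hypothetical: 2 topics
beta = [[0.5, 0.5, 0.0, 0.0],                       # topic 0 only emits words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]                       # topic 1 only emits words 2 and 3
theta, topics, words = generate_document(alpha, beta, N=20, rng=rng)
print(theta, words)
```

Unlike pLSA, the per-document mixture θ is itself a random draw from a shared prior, so the number of model parameters does not grow with the number of documents.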
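Joint Probability
• Given parameters α and β
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(z_n \mid \theta) = \theta_i$ for the topic i with $z_n = i$
Likelihood
• Joint probability: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$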
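Inference
• The log likelihood can be computed by summing over the documents: $\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality, as in EM, gives a lower bound
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables: $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach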
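Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
$\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + \mathrm{KL}\left( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \right)$
• Maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to γ and φ is equivalent to minimizing the KL divergence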
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
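Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound $\log p(w \mid \alpha, \beta)$ by a function $L(\gamma, \phi; \alpha, \beta)$
• Repeat until convergence
– E: maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
$\phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\left( \sum_{j} \gamma_j \right) \right)$, $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$, where Ψ is the digamma function
• M-Step: parameter estimation
– β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$, where $w_{dn}^{j} = 1$ if the n-th word of document d is word j
– α: can be implemented using the Newton-Raphson method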
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[figure: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[figure: unrolled LDA graphical model with topics $z_1, \ldots, z_4$, words $w_1, \ldots, w_4$ and shared β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[figure: plate diagram with nodes π, z, x, θ, β and plates over $N_d$ and M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
Iterative PageRank Calculation
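• For k = 1, 2, …: $q_k = A r_{k-1}$, then normalize $r_k = q_k / \|q_k\|_1$
• Equivalently (λ = 1, A is a Markov chain transition matrix): $r_k = A r_{k-1}$
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation r0 in terms of the eigenvectors $v_1, \dots, v_n$ of A: $r_0 = c_1 v_1 + c_2 v_2 + \dots + c_n v_n$
• Then $A^k r_0 = \lambda_1^k \left( c_1 v_1 + \sum_{j \ge 2} c_j (\lambda_j / \lambda_1)^k v_j \right)$, and since $|\lambda_j / \lambda_1| < 1$ for $j \ge 2$, the iterate converges to the direction of $v_1$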
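The iteration can be sketched in a few lines of plain Python. This is a minimal illustration rather than the deck's own code: the function name `power_iteration` and the 3-page matrix `A` are hypothetical, and `A` is assumed to be column-stochastic and irreducible, so its dominant eigenvalue is 1.

```python
def power_iteration(A, iters=100):
    """Repeat r <- A r, renormalising so that the entries sum to 1."""
    n = len(A)
    r = [1.0 / n] * n                      # uniform start vector r0
    for _ in range(iters):
        q = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(q)                     # 1-norm (all entries are non-negative)
        r = [v / total for v in q]
    return r

# Hypothetical 3-page web: column j holds the out-link probabilities of page j.
A = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
ranks = power_iteration(A)
print(ranks)
```

For this matrix the stationary vector can be checked by hand from r = A r: it is (4/9, 2/9, 3/9), and the other eigenvalues have magnitude 0.5, so the iteration converges quickly.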
SINGULAR VALUE DECOMPOSITION
SVD - Definition
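• Any m × n matrix A with m ≥ n can be factorized as $A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T$
• Here $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$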
Singular Values And Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD: $A v_j = \sigma_j u_j$, $A^T u_j = \sigma_j v_j$
Matrix approximation
• Theorem: Let Uk = (u1 u2 … uk), Vk = (v1 v2 … vk) and Σk = diag(σ1, σ2, …, σk), and define $A_k = U_k \Sigma_k V_k^T$
• Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best approximation of rank k for the matrix A is $A_k$
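The theorem can be checked numerically. The sketch below assumes NumPy is available and uses an arbitrary random matrix; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))            # arbitrary m x n matrix with m >= n

# Thin SVD: A = U @ diag(s) @ Vt, singular values in s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation U_k Sigma_k V_k^T

err = np.linalg.norm(A - A_k, 2)           # spectral norm of the residual
print(err, s[k])                           # both equal sigma_{k+1}
```

Note that `s[k]` is σ(k+1) in the 1-based notation of the slide, because NumPy arrays are 0-indexed.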
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V holds the eigenvectors of $A^T A$
  - Each column in V is an eigenvector for the row matrix A
  - We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors of $A A^T$
  - Each column in U is an eigenvector for the column matrix A
  - We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$
  - Uk: orthogonal basis that we use to approximate all the documents
  - Dk: column j holds the coordinates of document j in the new basis
  - Dk is the projection of A onto the subspace spanned by Uk: $D_k = U_k^T A$
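A small LSI sketch, assuming NumPy and a made-up term-by-document matrix (the terms, counts, and the helper `cosine` are hypothetical). It projects documents and a query onto Uk and compares them by cosine similarity.

```python
import numpy as np

# Hypothetical term-by-document counts: rows = terms, columns = documents.
terms = ["car", "automobile", "engine", "cow", "sheep"]
A = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]
D_k = np.diag(s[:k]) @ Vt[:k, :]           # document coordinates, shape k x n_docs

# Project a query that only mentions "car" into the same k-dimensional space.
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_k = U_k.T @ q

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = [cosine(q_k, D_k[:, j]) for j in range(A.shape[1])]
print(scores)
```

Document 1 never contains the word "car", yet it scores high for the query because "automobile" and "engine" share a latent direction with "car"; document 3 (cows and sheep) scores near zero.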
SVD and PCA
• For symmetric (positive semidefinite) A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  - U and Λ are eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  - U holds the left (column) eigenvectors
  - V holds the right (row) eigenvectors
  - Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors are equal to the row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g. the term-by-document matrix
  - PCA: A is the covariance matrix $A = X^T X$, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use a stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  UΣ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
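• Each page i has both a hub score hi and an authority score ai
• HITS successively refines these scores by computing $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}$ and $h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and $L_{ij} = 0$ otherwise
• Now $a^{(k)} = L^T h^{(k-1)}$ and $h^{(k)} = L a^{(k)}$, so $a^{(k)} = L^T L\, a^{(k-1)}$ and $h^{(k)} = L L^T h^{(k-1)}$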
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
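The link between HITS and SVD can be illustrated as follows. This is a sketch assuming NumPy; the 4-page graph `L` is invented for the example, and scores are normalised to unit Euclidean length so they can be compared with singular vectors.

```python
import numpy as np

# Hypothetical 4-page web graph: L[i, j] = 1 if page i links to page j.
L = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

a = np.ones(4)                             # authority scores
h = np.ones(4)                             # hub scores
for _ in range(100):
    a = L.T @ h                            # good authorities are pointed to by good hubs
    a /= np.linalg.norm(a)
    h = L @ a                              # good hubs point to good authorities
    h /= np.linalg.norm(h)

# The first singular vectors are defined up to sign, so compare absolute values.
U, s, Vt = np.linalg.svd(L)
h_svd = np.abs(U[:, 0])
a_svd = np.abs(Vt[0])
print(a, a_svd)
print(h, h_svd)
```

Here page 3 ends up as the strongest authority (every other page links to it) and page 0 as the strongest hub.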
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
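• Cost function: Euclidean distance $\|V - WH\|^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2$
• Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  - A and B can then be regarded as normalized probability distributions
• Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence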
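A minimal sketch of the multiplicative update for the Euclidean cost, following the Lee and Seung rules cited in the reference slide. It assumes NumPy; the function name `nmf`, the random initialisation, and the small `eps` guarding against division by zero are choices made for this example.

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-12, seed=0):
    """Factorise non-negative V (n x m) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1           # strictly positive start values
    H = rng.random((k, m)) + 0.1
    errors = []
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

V = np.random.default_rng(1).random((8, 6))    # arbitrary non-negative data
W, H, errors = nmf(V, k=3)
print(errors[0], errors[-1])
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the squared error does not increase from one iteration to the next.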
NMF vs PCA
• m = 2429 faces, n = 19 × 19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
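• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \dots, x_n \mid \theta)$
• Predictive distribution: $p(x_{n+1} \mid x_1, \dots, x_n) \approx p(x_{n+1} \mid \theta_{ML})$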
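IID - Independent Identically Distributed
• IID means $p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as $\theta_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$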
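As a concrete instance, the sketch below fits a one-dimensional Gaussian to i.i.d. samples by maximum likelihood. The Gaussian model, the sample values, and the function names are assumptions made for illustration; the closed-form estimates are the sample mean and the (biased) sample variance.

```python
import math

def gaussian_log_likelihood(xs, mu, var):
    """log L(theta) = sum_i log p(x_i | mu, var) for i.i.d. Gaussian samples."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * var)
            - sum((x - mu) ** 2 for x in xs) / (2 * var))

def gaussian_mle(xs):
    """Closed-form maximiser of the Gaussian log likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

xs = [2.1, 1.9, 2.4, 2.0, 1.6]             # hypothetical observations
mu_ml, var_ml = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_ml, var_ml)
print(mu_ml, var_ml, best)
```

Any other parameter setting, for example a shifted mean or a larger variance, gives a lower log likelihood on the same data.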
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe complex models: Gaussian Mixture Model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
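• Data set: observed data D together with latent (unobserved) variables X
• Likelihood: $p(D \mid \theta) = \int p(X, D \mid \theta)\, dX$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize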
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
  - E step: fill in values of latent variables according to the posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes difficult problems into a series of tractable steps
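Jensen's Inequality
• For weights $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$ and any $x_i > 0$: $\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$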
Lower Bounding the Log Likelihood
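• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ: $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$
• where $H[q] = -\int q(X) \log q(X)\, dX$ is the entropy of q(X)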
The E and M Steps of EM
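• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  - E step: optimize wrt the distribution over hidden variables, holding the parameters fixed: $q^{[k]}(X) = \arg\max_{q} F(q, \theta^{[k-1]})$
  - M step: maximize wrt the parameters, holding the hidden distribution fixed: $\theta^{[k]} = \arg\max_\theta F(q^{[k]}, \theta)$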
The E Step
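• E step: for fixed θ, $F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that for fixed θ, F is bounded above by L, and achieves that bound when $q(X) = p(X \mid D, \theta)$
• So the E step simply sets $q^{[k]}(X) = p(X \mid D, \theta^{[k-1]})$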
The M Step
• M step: maximize wrt the parameters, holding the hidden distribution q fixed: $\theta^{[k]} = \arg\max_\theta F(q^{[k]}, \theta) = \arg\max_\theta \int q^{[k]}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
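• The E and M steps together never decrease the log likelihood: $L(\theta^{[k-1]}) = F(q^{[k]}, \theta^{[k-1]}) \le F(q^{[k]}, \theta^{[k]}) \le L(\theta^{[k]})$
• The E step brings F(q, θ) up to the likelihood L(θ), so the bound touches the top
• The M step maximizes F(q, θ) wrt θ, which lifts the bound
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL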
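The monotonicity property can be observed on a two-component Gaussian mixture in one dimension. The sketch below is an illustration under assumptions: the data are synthetic, the function names and the initialisation (means at the data extremes, shared initial variance) are choices for this example, and only the standard library is used.

```python
import math
import random

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(ys, iters=50):
    """EM for a 2-component 1-D Gaussian mixture; returns parameters and log likelihoods."""
    n = len(ys)
    mean = sum(ys) / n
    var0 = sum((y - mean) ** 2 for y in ys) / n
    pi, mu, var = [0.5, 0.5], [min(ys), max(ys)], [var0, var0]
    history = []
    for _ in range(iters):
        # E step: q(x_n) = p(x_n | y_n, theta), the posterior responsibility of each component.
        resp, loglik = [], 0.0
        for y in ys:
            joint = [pi[k] * normal_pdf(y, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)             # log likelihood at the current parameters
        # M step: weighted maximum likelihood estimates.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / n
            mu[k] = sum(r[k] * y for r, y in zip(resp, ys)) / nk
            var[k] = sum(r[k] * (y - mu[k]) ** 2 for r, y in zip(resp, ys)) / nk
    return pi, mu, var, history

rng = random.Random(0)
ys = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
pi, mu, var, history = em_gmm(ys)
print(mu, history[0], history[-1])
```

The recorded log likelihood values form a non-decreasing sequence, as the inequality chain on the slide predicts.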
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model can be complex even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general $p(X_1, \dots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
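The factorization can be made concrete with binary variables. In the sketch below all conditional probability tables are invented numbers; only the graph structure (A and B are parents of C, B and C of D, C and D of E) comes from the slide.

```python
from itertools import product

def bern(p_true, value):
    """Probability of a binary value given P(value = 1) = p_true."""
    return p_true if value else 1.0 - p_true

# Hypothetical conditional probability tables: P(node = 1 | parents).
P_A = 0.3
P_B = 0.6
P_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # keyed by (A, B)
P_D = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # keyed by (B, C)
P_E = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # keyed by (C, D)

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(P_A, a) * bern(P_B, b) * bern(P_C[(a, b)], c)
            * bern(P_D[(b, c)], d) * bern(P_E[(c, d)], e))

states = list(product([0, 1], repeat=5))
total = sum(joint(*x) for x in states)              # must be 1
p_c = sum(joint(*x) for x in states if x[2] == 1)   # marginal P(C = 1)
print(total, p_c)
```

The five small tables hold 1 + 1 + 4 + 4 + 4 = 14 numbers, compared with 2^5 - 1 = 31 for an unstructured joint table over the same variables.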
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the documents share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
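[Figure: pLSA in plate notation (d → z → w, with plate Nd over words and plate M over documents) and its unrolled form for one document (d → z1 … zN → w1 … wN)]
z1 … zN are variables, zi ∈ [1, K], where K is the number of latent topics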
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, each with its own topic nodes z1 … zNm and word nodes w1 … wNm]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
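• Joint probability: $p(d, z, w) = p(d) \prod_{n=1}^{N_d} p(z_n \mid d)\, p(w_n \mid z_n)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \prod_{n=1}^{N_d} \sum_{z=1}^{K} p(z \mid d)\, p(w_n \mid z)$
• p(d) is assumed to be uniform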
Document Decomposition
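• Each document can be decomposed as $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$, where column z of $Z_{V \times k}$ is the distribution p(w|z) over a vocabulary of V words
• With many documents, we hope to find latent topics as a common basis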
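pLSA - Objective Function
• pLSA tries to maximize the log likelihood $L = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \left[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \right]$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM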
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
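• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$, with n(d, w) the count of word w in document d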
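A compact sketch of pLSA fitted by EM, assuming NumPy. The function name `plsa`, the toy count matrix `N` (rows are words, columns are documents), the random initialisation, and the small `eps` in the E-step denominator are assumptions for this example, not details from the slides.

```python
import numpy as np

def plsa(N, K, iters=100, eps=1e-12, seed=0):
    """N: V x M word-by-document counts. Returns p(w|z) (V x K) and p(z|d) (K x M)."""
    rng = np.random.default_rng(seed)
    V, M = N.shape
    p_w_z = rng.random((V, K))
    p_w_z /= p_w_z.sum(axis=0)             # each topic is a distribution over words
    p_z_d = rng.random((K, M))
    p_z_d /= p_z_d.sum(axis=0)             # each document is a mixture over topics
    for _ in range(iters):
        # E step: p(z|d,w) proportional to p(w|z) p(z|d); array of shape V x K x M.
        post = p_w_z[:, :, None] * p_z_d[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + eps
        # M step: re-estimate both distributions from the expected counts n(d,w) p(z|d,w).
        weighted = N[:, None, :] * post
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0)
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0)
    return p_w_z, p_z_d

# Hypothetical corpus: words 0-2 occur only in documents 0-1, words 3-5 only in documents 2-3.
N = np.array([
    [4, 3, 0, 0],
    [3, 4, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 2],
    [0, 0, 2, 3],
], dtype=float)
p_w_z, p_z_d = plsa(N, K=2)
print(np.round(p_z_d, 3))
```

On this block-structured toy corpus the two topics typically separate the two groups of documents, so the columns of `p_z_d` for documents 0 and 3 put their mass on different topics.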
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
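• Definition: $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex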
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α0 = 0.1, α0 = 1, α0 = 10
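Samples from a Dirichlet distribution can be drawn by normalising independent Gamma variates, which is a standard construction. The sketch below uses only the Python standard library; the helper name `sample_dirichlet` is an assumption, and α = (6, 2, 2) is one of the example parameters listed above.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex: normalised Gamma(alpha_i, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
alpha = (6.0, 2.0, 2.0)
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]

# The mean of a Dirichlet is alpha_i / alpha_0, here (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
print(mean)
```

Every sample is a K-tuple of non-negative numbers summing to one, so each can be used directly as the topic mixture proportions of a document.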
The LDA Model
[Figure: unrolled LDA graphical model for three documents, each with topic nodes z1 … z4 generating word nodes w1 … w4]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    • Choose a topic zn ~ Multinomial(θ)
    • Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
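The generative process can be simulated directly. In the sketch below the vocabulary, the topic-word matrix `beta`, the value of `alpha`, and the function names are all hypothetical; only the sampling order (θ, then a topic per word, then the word) follows the description above.

```python
import random

def sample_dirichlet(alpha, rng):
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Sample theta ~ Dirichlet(alpha), then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)
    topics = rng.choices(range(len(alpha)), weights=theta, k=n_words)   # z_n ~ Multinomial(theta)
    words = [rng.choices(vocab, weights=beta[z])[0] for z in topics]    # w_n ~ p(w | z_n, beta)
    return theta, topics, words

vocab = ["gene", "dna", "cell", "ball", "team", "goal"]
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],        # topic 0: biology words
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],        # topic 1: sports words
]
alpha = [0.5, 0.5]
rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, vocab, 8, rng)
print(theta, topics, words)
```

Unlike pLSA, the per-document proportions θ are themselves drawn from a shared prior, so a new document does not add new parameters to the model.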
Joint Probability
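• Given parameters α and β: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(z_n \mid \theta)$ is simply $\theta_i$ for the topic i selected by $z_n$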
Likelihood
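• Joint probability: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$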
Inference
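• The log likelihood can be computed by summing over documents: $\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality, as in EM, gives a lower bound for each document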
Inference
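• In the E-step, we need to compute the posterior distribution of the hidden variables: $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach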
Variational Inference
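• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, $q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence: $\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta))$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence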
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
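• E-Step: variational inference, repeat until convergence: $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$, $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$, where Ψ is the digamma function
• M-Step: parameter estimation
  - β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$, where $w_{dn}^{j} = 1$ if the n-th word of document d is vocabulary word j and 0 otherwise
  - α: can be implemented using the Newton-Raphson method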
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA graphical model for three documents, each with topic nodes z1 … z4 and word nodes w1 … w4]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model of the scene category model, with labels M, Nd, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
  - Vocabulary space (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  - Topic space (Dim = 200):
          f1   f2   …  fM
    Img1  0.2  0.1  …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low dimensional topic vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
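$p = Xw + \varepsilon$, written out over a vocabulary of W terms and k base vectors:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• The columns X1, X2, X3, …, Xk of X are the base vectors
• w is the low dimensional representation
• ε is the residual
[Figure: an image = a low dimensional topic vector (bar chart) + a few words (10 words)]
• Similarity of two documents p and q: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$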
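Why the inner product splits into a low dimensional part and a residual part can be checked numerically. The sketch below assumes NumPy and assumes that X has orthonormal columns with the residual orthogonal to them, which is one reading of "orthogonal decomposition"; the dimensions, the random data, and the helper `decompose` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W, k = 50, 5                               # vocabulary size and number of base vectors

# Assumed setting: X has orthonormal columns (X^T X = I), built here via QR.
X, _ = np.linalg.qr(rng.standard_normal((W, k)))

def decompose(p):
    """Split p into its low dimensional coordinates w and the residual eps."""
    w = X.T @ p                            # projection onto the base vectors
    eps = p - X @ w                        # what the base vectors cannot explain
    return w, eps

p = rng.random(W)                          # two hypothetical TF-IDF vectors
q = rng.random(W)
w_p, eps_p = decompose(p)
w_q, eps_q = decompose(q)

lhs = p @ q
rhs = w_p @ w_q + eps_p @ eps_q
print(lhs, rhs)
```

The cross terms vanish because each residual is orthogonal to the columns of X, so the similarity in the original vocabulary space is preserved exactly by the pair (w, ε).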
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
Search (Online)
[Figure: online search pipeline: a query (low dimensional topic vector + a few words) → LSH index (buckets DS1, DS2, …) → candidates (Doc 300, Doc 401, …) → re-ranking with Doc Meta (Doc 1, Doc 2, …, Doc N) → results (Doc 401, …)]
• Index: 10M images, 46GB
• Search speed < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  - HITS
  - PageRank
  - …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate
Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrices
  - Graphical models: a good practice to learn probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definition
bull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCA
bull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI
bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basis
ndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvalues
bull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)
1 Document file preparation preprocessingndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space V
VΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– authorities contain high-quality information
– hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs and authorities]
Power Iteration
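• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing a_i ← \sum_{j: j→i} h_j and h_i ← \sum_{j: i→j} a_j
• Define the adjacency matrix L of the directed web graph: L_{ij} = 1 if page i links to page j, and 0 otherwise
• Now: a ← L^T h and h ← L a, normalizing after each step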
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
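A small sketch of the HITS power iteration and its link to the SVD. The 4-page link graph and the function name `hits` are made up for illustration; normalization uses the Euclidean norm, which is one common choice.

```python
import numpy as np

# L[i, j] = 1 if page i links to page j (rows: outlinks, columns: inlinks).
L = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def hits(L, iters=200):
    """Power iteration for authority scores a and hub scores h."""
    h = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h                  # good authorities are pointed to by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                    # good hubs point to good authorities
        h /= np.linalg.norm(h)
    return a, h

a, h = hits(L)

# Compare with the first singular vectors of L (defined only up to sign).
U, s, Vt = np.linalg.svd(L)
authority_alignment = abs(float(a @ Vt[0]))    # a vs first right singular vector
hub_alignment = abs(float(h @ U[:, 0]))        # h vs first left singular vector
```

In this graph page 2 receives the most links and comes out as the top authority, while page 0, which links to pages 1 and 2, is the top hub.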
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix; each column is an n-dimensional data sample
• W: k bases; each column w_i represents one basis
• H: coordinates of V projected onto W
v_j ≈ W_{n×k} h_j
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
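Multiplicative Update Algorithm
• Cost function: Euclidean distance
||V − WH||^2 = \sum_{ij} (V_{ij} − (WH)_{ij})^2
• Multiplicative update
H_{aμ} ← H_{aμ} (W^T V)_{aμ} / (W^T W H)_{aμ},  W_{ia} ← W_{ia} (V H^T)_{ia} / (W H H^T)_{ia}
Multiplicative Update Algorithm
• Cost function: divergence
D(A||B) = \sum_{ij} (A_{ij} \log(A_{ij}/B_{ij}) − A_{ij} + B_{ij}), with A = V and B = WH
– Reduces to the Kullback-Leibler divergence when \sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1, i.e. when A and B can be regarded as normalized probability distributions
• Multiplicative update
H_{aμ} ← H_{aμ} [\sum_i W_{ia} V_{iμ}/(WH)_{iμ}] / \sum_k W_{ka},  W_{ia} ← W_{ia} [\sum_μ H_{aμ} V_{iμ}/(WH)_{iμ}] / \sum_ν H_{aν}
• pLSA is NMF with KL divergence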
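A minimal sketch of the multiplicative updates for the Euclidean cost, following the Lee and Seung form. The function name `nmf`, the random initialization, the small constant `eps` added to the denominators, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9, seed=0):
    """Factor non-negative V (n x m) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(iters):
        # Element-wise multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Synthetic non-negative data with an exact rank-3 factorization.
rng = np.random.default_rng(1)
V = rng.random((6, 3)) @ rng.random((3, 8))
W, H, errors = nmf(V, k=3)
```

Because each update only multiplies by a non-negative ratio, no projection step is needed to keep the factors non-negative, and the reconstruction error does not increase from one iteration to the next.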
NMF vs PCA
• m = 2429 faces, n = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture of Gaussians: parameter estimation
• pLSA
– Motivation
– Derivation & geometric properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, ..., xD)^T denote a data point and D = {x(1), x(2), ..., x(N)} a data set. D is sometimes associated with desired outputs y1, y2, ...
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = f_θ(x) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
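• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated by the model
• Suppose we are given n data samples (x1, x2, ..., xn)
• Maximum likelihood finds the θ that maximizes L(θ): θ_ML = \arg\max_θ L(θ) = \arg\max_θ p(x1, ..., xn | θ)
• Predictive distribution: p(x | x1, ..., xn) ≈ p(x | θ_ML)
IID – Independent Identically Distributed
• IID means the samples are drawn independently from the same distribution: p(x1, ..., xn | θ) = \prod_{i=1}^{n} p(x_i | θ)
• The problem is considerably simplified as L(θ) = \prod_{i=1}^{n} p(x_i | θ)
• Usually the log likelihood is used: \log L(θ) = \sum_{i=1}^{n} \log p(x_i | θ)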
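As a concrete instance, the sketch below fits a 1-D Gaussian to i.i.d. samples by maximum likelihood. The Gaussian model, the synthetic data, and the function name `gaussian_log_likelihood` are assumptions chosen for illustration; for this model the ML estimates have the closed form used in the code.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """log L(theta) = sum_i log p(x_i | mu, sigma) for i.i.d. Gaussian samples."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - (x - mu) ** 2 / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Closed-form maximum likelihood estimates for a Gaussian.
mu_ml = x.mean()
sigma_ml = x.std()          # ddof=0 gives the ML (biased) estimate

ll_ml = gaussian_log_likelihood(x, mu_ml, sigma_ml)
```

Any other parameter setting, e.g. shifting the mean or inflating the standard deviation, yields a lower log likelihood on the same data.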
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian mixture model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D
• Likelihood: L(θ) = p(D|θ) = \int p(X, D|θ) dX, where X are the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that θ_ML = \arg\max_θ L(θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
• For a concave function such as log: \log \sum_i α_i x_i ≥ \sum_i α_i \log x_i, for α_i ≥ 0 and \sum_i α_i = 1
Lower Bounding the Log Likelihood
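• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ: L(θ) = \log p(D|θ) = \log \int p(X, D|θ) dX
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
L(θ) = \log \int q(X) \frac{p(X, D|θ)}{q(X)} dX ≥ \int q(X) \log \frac{p(X, D|θ)}{q(X)} dX = \int q(X) \log p(X, D|θ) dX + H[q] = F(q, θ)
• where H[q] is the entropy of q(X)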
The E and M Steps of EM
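• The lower bound on the log likelihood is given by F(q, θ) = \int q(X) \log p(X, D|θ) dX + H[q]
• EM alternates between
• E step: optimize F with respect to the distribution over hidden variables, holding the parameters fixed: q^{(k+1)} = \arg\max_q F(q, θ^{(k)})
• M step: maximize F with respect to the parameters, holding the hidden distribution fixed: θ^{(k+1)} = \arg\max_θ F(q^{(k+1)}, θ)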
The E Step
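• E step: for fixed θ,
F(q, θ) = \int q(X) \log \frac{p(X|D, θ) p(D|θ)}{q(X)} dX = L(θ) − KL(q(X) || p(X|D, θ))
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when q(X) = p(X|D, θ)
• So the E step simply sets q^{(k+1)}(X) = p(X|D, θ^{(k)})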
The M Step
• M step: maximize with respect to the parameters, holding the hidden distribution q fixed:
θ^{(k+1)} = \arg\max_θ F(q^{(k+1)}, θ) = \arg\max_θ \int q^{(k+1)}(X) \log p(X, D|θ) dX
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically
EM Never Decreases the Likelihood
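• The E and M steps together never decrease the log likelihood:
L(θ^{(k)}) = F(q^{(k+1)}, θ^{(k)}) ≤ F(q^{(k+1)}, θ^{(k+1)}) ≤ L(θ^{(k+1)})
• The E step brings F(q, θ) up to the likelihood L(θ) (the bound touches the top)
• The M step maximizes F(q, θ) with respect to θ (lifting the bound)
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL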
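The monotonicity property can be observed on a small example. The sketch below runs EM on a 1-D mixture of two Gaussians; the mixture model, the synthetic data, the deterministic initialization, and the names `normal_pdf` and `em_gmm` are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, k=2, iters=100):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log likelihoods."""
    pi = np.full(k, 1.0 / k)
    mu = np.linspace(x.min(), x.max(), k)     # simple deterministic initialization
    var = np.full(k, x.var())
    log_liks = []
    for _ in range(iters):
        # E step: q(z_n = j) = p(z_n = j | x_n, theta), the posterior over latent labels.
        joint = pi * normal_pdf(x[:, None], mu, var)          # shape (n, k)
        log_liks.append(float(np.log(joint.sum(axis=1)).sum()))
        q = joint / joint.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log likelihood under q.
        nk = q.sum(axis=0)
        pi = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, log_liks

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, var, log_liks = em_gmm(x)
```

The recorded log likelihoods form a non-decreasing sequence, and on this well-separated data the estimated means end up near -3 and 3.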
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model is complex even with a few circles...
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general: p(x_1, ..., x_n) = \prod_{i=1}^{n} p(x_i | x_{pa(i)})
• where pa(i) are the parents of node i
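The factorization above can be made concrete with binary variables. In the sketch below the conditional probability tables are random placeholders (an assumption for illustration); the code builds the joint from the five factors and counts how many parameters the factored form needs compared with a full joint table.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """P(node = 1 | parents), one entry per configuration of binary parents."""
    return rng.random((2,) * n_parents)

p_A, p_B = 0.3, 0.6            # P(A = 1), P(B = 1): made-up values
cpt_C = random_cpt(2)          # indexed by (a, b)
cpt_D = random_cpt(2)          # indexed by (b, c)
cpt_E = random_cpt(2)          # indexed by (c, d)

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_A, a) * bern(p_B, b) * bern(cpt_C[a, b], c)
            * bern(cpt_D[b, c], d) * bern(cpt_E[c, d], e))

total = sum(joint(*v) for v in itertools.product([0, 1], repeat=5))
n_params_factored = 1 + 1 + 4 + 4 + 4     # one number per parent configuration
n_params_full = 2 ** 5 - 1                # unrestricted joint over 5 binary variables
```

The factored model needs 14 numbers instead of 31, and the gap grows quickly with the number of variables.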
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies due to personal style
• Latent semantic indexing
– Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the documents share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • "Java" could mean "coffee" and also the programming language Java
  • "Cricket" is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
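[Figure: pLSA graphical model. Left: plate notation, d → z → w, with plates N_d and M. Right: unrolled form for one document, d → z1, z2, z3, ..., zN → w1, w2, w3, ..., wN]
• z1, ..., zN are variables with z_i ∈ [1, K]; K is the number of latent topics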
pLSA
[Figure: unrolled pLSA model for documents d1, d2, ..., dM; document di generates topics z1, ..., zNi and words w1, ..., wNi]
• p(w|z=1), p(w|z=2), ..., p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
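• Joint probability: p(d, z, w) = p(d) \prod_{n=1}^{N_d} p(z_n|d) p(w_n|z_n)
• Likelihood (only for observed variables): p(d, w) = p(d) \prod_{n=1}^{N_d} \sum_{z} p(z|d) p(w_n|z)
• p(d) is assumed to be uniform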
Document Decomposition
• Each document can be decomposed as p(w|d) = \sum_{z} p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d), where column z of Z is the topic distribution p(w|z) over a vocabulary of size V
• With many documents, we hope to find latent topics as a common basis
pLSA ndash Objective Function
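• pLSA tries to maximize the log likelihood
L = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} p(w|z) p(z|d), where n(d, w) is the count of word w in document d
• Due to the summation over z inside the log, we have to resort to EM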
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
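• The E-Step: p(z|d, w) = p(w|z) p(z|d) / \sum_{z'} p(w|z') p(z'|d)
• The M-Step: p(w|z) ∝ \sum_{d} n(d, w) p(z|d, w) and p(z|d) ∝ \sum_{w} n(d, w) p(z|d, w)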
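A compact sketch of pLSA trained with EM on a document-word count matrix. The function name `plsa`, the random initialization, and the toy counts are assumptions for illustration; the toy matrix has strictly positive counts so that every logarithm is finite.

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """EM for pLSA. N is an (M docs x V words) count matrix n(d, w)."""
    rng = np.random.default_rng(seed)
    M, V = N.shape
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)       # rows: p(w | z)
    p_z_d = rng.random((M, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)       # rows: p(z | d)
    log_liks = []
    for _ in range(iters):
        # E step: p(z | d, w) proportional to p(w | z) p(z | d); shape (M, K, V).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = post.sum(axis=1)                    # p(w | d) = sum_z p(w|z) p(z|d)
        log_liks.append(float((N * np.log(p_w_d)).sum()))
        post = post / p_w_d[:, None, :]
        # M step: re-estimate both distributions from expected counts.
        weighted = N[:, None, :] * post             # n(d, w) p(z | d, w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, log_liks

# Toy corpus: documents 0-1 favor words 0-1, documents 2-3 favor words 2-3.
N = np.array([[5, 4, 1, 1],
              [4, 5, 1, 1],
              [1, 1, 5, 4],
              [1, 1, 4, 5]], dtype=float)
p_w_z, p_z_d, log_liks = plsa(N, K=2)
p_w_d = p_z_d @ p_w_z       # document decomposition: mixture of shared topics
```

The last line is the document decomposition in matrix form: each row of `p_w_d` is a mixture of the shared topic distributions.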
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
• Comparison to SVD
– U matrix is related to P(z|d) (doc to aspect)
– V matrix is related to P(w|z) (aspect to term)
– Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model for documents d1, d2, ..., dm, each with its own unconstrained topic mixture p(z|d1), p(z|d2), ..., p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
• Definition
p(x_1, ..., x_k | α_1, ..., α_k) = \frac{Γ(\sum_{i=1}^{k} α_i)}{\prod_{i=1}^{k} Γ(α_i)} \prod_{i=1}^{k} x_i^{α_i − 1},  s.t. \sum_{i=1}^{k} x_i = 1, x_i > 0
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0 = \sum_{i=1}^{k} α_i: α_0 = 0.1, α_0 = 1, α_0 = 10
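The effect of the parameters can be explored by sampling. The sketch below draws from NumPy's Dirichlet sampler with one of the α settings listed on the slide and with symmetric settings of different total mass α_0; the sample sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples are points on the (K-1)-simplex: non-negative and summing to one.
alpha = np.array([6.0, 2.0, 2.0])
samples = rng.dirichlet(alpha, size=20000)
empirical_mean = samples.mean(axis=0)
expected_mean = alpha / alpha.sum()          # E[x_i] = alpha_i / alpha_0

# Equal alpha_i with different total mass alpha_0 = sum_i alpha_i.
sparse = rng.dirichlet(np.full(3, 0.1 / 3), size=2000)   # alpha_0 = 0.1
smooth = rng.dirichlet(np.full(3, 10.0 / 3), size=2000)  # alpha_0 = 10

# Small alpha_0 pushes samples toward the corners of the simplex,
# large alpha_0 concentrates them near the center.
sparse_peak = sparse.max(axis=1).mean()
smooth_peak = smooth.max(axis=1).mean()
```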
The LDA Model
[Figure: LDA graphical model. Each document draws its own θ; topics z1, ..., z4 generate words w1, ..., w4 using the shared parameter β]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
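The generative process can be sketched directly. The function name `generate_document`, the hyperparameter values, and the toy topic-word matrix `beta` (row k is the word distribution of topic k) are assumptions made for illustration.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)                              # topic proportions
    z = rng.choice(len(alpha), size=n_words, p=theta)         # topic of each word
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # word ids
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
# Three topics over a six-word vocabulary; topic k only emits words 2k and 2k+1.
beta = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
])
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
```

Unlike pLSA, the per-document proportions `theta` are themselves drawn from a shared distribution, so the number of model parameters does not grow with the number of documents.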
Joint Probability
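• Given the parameters α and β: p(θ, z, w | α, β) = p(θ|α) \prod_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β)
where p(z_n|θ) is simply θ_i for the topic i selected by z_n
Likelihood
• Joint probability: p(θ, z, w | α, β) = p(θ|α) \prod_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β)
• Marginal distribution of a document: p(w | α, β) = \int p(θ|α) \prod_{n=1}^{N} \sum_{z_n} p(z_n|θ) p(w_n|z_n, β) dθ
• Likelihood over all the documents: p(D | α, β) = \prod_{d=1}^{M} p(w_d | α, β)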
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables: p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound \log p(w | α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
φ_{ni} ∝ β_{i w_n} \exp(Ψ(γ_i)),  γ_i = α_i + \sum_{n=1}^{N} φ_{ni}
• M-Step: parameter estimation
β_{ij} ∝ \sum_{d=1}^{M} \sum_{n=1}^{N_d} φ_{dni} w_{dn}^{j}
– α can be found using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA model with topics z1, ..., z4 and words w1, ..., w4 per document]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram over the variables π, z, x with parameters θ, β and plates Nd, M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes, E. Sudderth, A. Torralba, W. Freeman, and A. Willsky, NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  ...  TermN
Img1    1      2      0      0      ...  2
Topic projection (Dim = 200):
        f1    f2    ...  fM
Img1    0.2   0.1   ...  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = low-dimensional topic histogram + a few words (10 words)]
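Orthogonal Decomposition
p = Xw + ε, written out over the W vocabulary dimensions and the k base vectors:
\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} ε_1 \\ ε_2 \\ \vdots \\ ε_W \end{pmatrix}
– X = (X1, X2, X3, ..., Xk): base vectors
– w: low-dimensional representation
– ε: residual
[Figure: an image = low-dimensional topic histogram + a few words (10 words)]
• Similarity of two vectors p and q: p · q = w_p^T w_q + ε_p^T ε_q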
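A numerical sketch of the decomposition p = Xw + ε. It assumes X has orthonormal columns (obtained here by QR on a random matrix) and keeps the full residual rather than only a few words; the dimensions and the helper name `decompose` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 50, 5

# Orthonormal base vectors X1, ..., Xk as the columns of X.
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))

def decompose(p):
    """Split p into a low-dimensional part w and a residual eps orthogonal to X."""
    w = X.T @ p
    eps = p - X @ w
    return w, eps

p = rng.random(vocab_size)      # stand-ins for two TF-IDF vectors
q = rng.random(vocab_size)
w_p, eps_p = decompose(p)
w_q, eps_q = decompose(q)

# With orthonormal X the cross terms vanish, so the inner product splits
# into a low-dimensional part and a residual part.
full_similarity = float(p @ q)
split_similarity = float(w_p @ w_q + eps_p @ eps_q)
```

Under these assumptions the split similarity equals the full inner product exactly; truncating the residual to its largest entries would turn it into an approximation.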
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k) p(z=k|d) + p(x=1|d) p'(w|d) + p(x=2|d) p''(w)
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (low-dimensional topic histogram + a few words) is looked up in the LSH index over entries DS1, DS2, ...; candidate documents (e.g. Doc 300, Doc 401) are fetched from the document metadata (Doc 1, Doc 2, ..., Doc N) and re-ranked]
• Index: 10M images, 46GB
• Search speed: < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; project documents to topic space
• SWB: Special Words Topic Model with Background distribution; project documents to topic and junk topic space
Evaluation
method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, ...
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in the expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
– Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: good practice for learning matrices
– Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
SINGULAR VALUE DECOMPOSITION
SVD - Definition
bull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCA
bull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI
bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basis
ndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvalues
bull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)
1 Document file preparation preprocessingndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space V
VΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
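• Maximum likelihood finds the model parameters under which the observed data are "most likely" to have been generated
• Suppose we are given n data samples (x1, x2, ..., xn)
• Maximum likelihood finds the θ that maximizes L(θ):
$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} p(x_1, x_2, \ldots, x_n \mid \theta)$
• Predictive distribution:
$p(x \mid x_1, \ldots, x_n) \approx p(x \mid \hat{\theta}_{ML})$
IID - Independent Identically Distributed
• IID means that the samples are drawn independently from the same distribution:
$p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as
$\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used:
$\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$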
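As a concrete illustration of the i.i.d. log likelihood, the sketch below fits a one-dimensional Gaussian by maximum likelihood. The Gaussian model, the synthetic data and the helper name `gaussian_log_likelihood` are assumptions made for this example; they do not come from the slides.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for i.i.d. samples."""
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma2)
            - 0.5 * np.sum((x - mu) ** 2) / sigma2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data, true mean 2.0

# Closed-form maximizers of the Gaussian log likelihood.
mu_ml = x.mean()
sigma2_ml = x.var()  # the ML variance divides by n, not n - 1

# Any other parameter value gives a lower log likelihood.
ll_at_ml = gaussian_log_likelihood(x, mu_ml, sigma2_ml)
ll_shifted = gaussian_log_likelihood(x, mu_ml + 0.3, sigma2_ml)
ll_wider = gaussian_log_likelihood(x, mu_ml, 1.5 * sigma2_ml)
print(ll_at_ml, ll_shifted, ll_wider)
```

Working with the sum of logs rather than the product of densities also avoids numerical underflow when n is large.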
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008.
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models, e.g. the Gaussian mixture model
• To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
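• Data set: $D = \{y_1, \ldots, y_N\}$, with latent variables X
• Likelihood: $p(D \mid \theta) = \int p(X, D \mid \theta)\, dX$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
$\theta_{ML} = \arg\max_{\theta} p(D \mid \theta)$
• Because of the integral (or sum) over the latent variables, the likelihood can be a very complicated function that is hard to optimize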
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in values of the latent variables according to their posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• This decomposes a difficult problem into a series of tractable steps
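Jensen's Inequality
• For a concave function such as log, and weights α_i ≥ 0 with Σ_i α_i = 1:
$\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$
Lower Bounding the Log Likelihood
• Observed data D = {y_n}, latent variables X = {x_n}, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• where $H[q] = -\int q(X) \log q(X)\, dX$ is the entropy of q(X)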
The E and M Steps of EM
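• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
• E step: optimize w.r.t. the distribution over hidden variables, holding the parameters fixed
$q^{(k+1)} = \arg\max_{q} F(q, \theta^{(k)})$
• M step: maximize w.r.t. the parameters, holding the hidden distribution fixed
$\theta^{(k+1)} = \arg\max_{\theta} F(q^{(k+1)}, \theta)$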
The E Step
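• E step: for fixed θ,
$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q^{(k+1)}(X) = p(X \mid D, \theta^{(k)})$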
The M Step
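• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
$\theta^{(k+1)} = \arg\max_{\theta} F(q^{(k+1)}, \theta) = \arg\max_{\theta} \int q^{(k+1)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically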
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood:
$L(\theta^{(k)}) = F(q^{(k+1)}, \theta^{(k)}) \le F(q^{(k+1)}, \theta^{(k+1)}) \le L(\theta^{(k+1)})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of KL
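The monotonicity property can be checked numerically. The following is a minimal sketch of EM for a one-dimensional mixture of Gaussians, the example named in the outline; the function names (`normal_pdf`, `em_gmm_1d`), the synthetic data and the initialization scheme are assumptions made for illustration only.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture; returns parameters and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    weights = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)  # initialize means at random data points
    variances = np.full(k, x.var())
    log_liks = []
    for _ in range(n_iter):
        # E step: q(z_n = j) = p(z_n = j | x_n, theta), the posterior responsibilities.
        joint = weights * normal_pdf(x[:, None], means, variances)  # shape (n, k)
        marginal = joint.sum(axis=1)
        log_liks.append(np.log(marginal).sum())  # log likelihood at the current theta
        resp = joint / marginal[:, None]
        # M step: closed-form maximizers given the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, np.array(log_liks)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
weights, means, variances, log_liks = em_gmm_1d(x)
print(np.sort(means), log_liks[0], log_liks[-1])
```

The recorded log likelihoods form a non-decreasing sequence, as the bound argument predicts, although EM may still stop at a local maximum.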
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge. (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model looks complex even with just a few circles...
  - We have to make many assumptions
• Pros
  - We do need probability to explain our world, but a joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model we can factor a joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general:
$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
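The five-variable factorization can be written out directly. In this sketch all variables are binary and the conditional probability tables are filled with random numbers; both choices, along with the helper names `random_cpt` and `bern`, are assumptions made for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Table of p(child = 1 | parents) for binary variables, indexed by parent values."""
    return rng.uniform(0.1, 0.9, size=(2,) * n_parents)

def bern(p_one, value):
    """Probability of a binary value when p(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

p_a, p_b = 0.3, 0.6      # p(A = 1), p(B = 1)
cpt_c = random_cpt(2)    # p(C = 1 | A, B)
cpt_d = random_cpt(2)    # p(D = 1 | B, C)
cpt_e = random_cpt(2)    # p(E = 1 | C, D)

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_a, a) * bern(p_b, b) * bern(cpt_c[a, b], c)
            * bern(cpt_d[b, c], d) * bern(cpt_e[c, d], e))

# The factorized joint is a valid distribution over all 2^5 configurations.
total = sum(joint(*values) for values in itertools.product([0, 1], repeat=5))

factored_params = 1 + 1 + cpt_c.size + cpt_d.size + cpt_e.size  # free parameters used
full_joint_params = 2 ** 5 - 1                                  # unrestricted joint table
print(total, factored_params, full_joint_params)
```

The factorized model needs 14 free parameters here, against 31 for an unrestricted joint table over five binary variables, which is one way to see why the conditional form is easier to work with.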
Directed Graphs for Statistical Models: Plate Notation
bull A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies with personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they have no words in common, provided they share frequently co-occurring terms
• Disadvantage
  - A statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to address
  - Polysemy
    • "Java" could mean coffee or the programming language Java
    • "Cricket" is a game and also an insect
  - Synonymy
    • "computer", "pc" and "desktop" could all mean the same thing
• Has a better statistical foundation than LSA
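pLSA
[Figure: plate notation of pLSA (document d, topic z, word w; inner plate over the Nd words of a document, outer plate over the M documents), shown next to its unrolled form with topics z1 ... zN and words w1 ... wN]
z1 ... zN are latent variables with zi ∈ [1, K], where K is the number of latent topics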
pLSA
[Figure: unrolled pLSA model for documents d1, d2, ..., dM, each with its own topic variables z1 ... zNi and words w1 ... wNi]
p(w|z=1), p(w|z=2), ..., p(w|z=K) are shared by all documents
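Joint Probability vs Likelihood
• Joint probability (including the latent topic):
$p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables):
$p(d, w) = \sum_{z=1}^{K} p(d, z, w) = p(d) \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform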
Document Decomposition
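• Each document can be decomposed as
$p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
p(w|d) = Z_{V×k} p(z|d), where the columns of Z are the topic distributions p(w|z) over a vocabulary of size V
• With many documents, we hope to find the latent topics as a common basis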
pLSA ndash Objective Function
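• pLSA tries to maximize the log likelihood
$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$
where n(d, w) is the number of times word w occurs in document d
• Because of the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function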
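Lower Bounding the Log Likelihood
EM Steps
• The E-Step
$p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step
$p(w \mid z) = \frac{\sum_{d} n(d, w)\, p(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, p(z \mid d, w')}, \qquad p(z \mid d) = \frac{\sum_{w} n(d, w)\, p(z \mid d, w)}{\sum_{w} n(d, w)}$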
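The two pLSA steps translate almost line by line into array code. The sketch below is an illustrative implementation under stated assumptions: the function name `plsa`, the tiny count matrix and the random initialization are invented for the example, and the dense (documents × topics × words) posterior array is only practical for toy-sized data.

```python
import numpy as np

def plsa(counts, k, n_iter=100, seed=0):
    """EM for pLSA. counts is the (M documents x V words) matrix n(d, w)."""
    rng = np.random.default_rng(seed)
    m, v = counts.shape
    p_z_d = rng.random((m, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # rows are p(z|d)
    p_w_z = rng.random((k, v))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # rows are p(w|z)
    log_liks = []
    for _ in range(n_iter):
        # p(w|d) = sum_z p(w|z) p(z|d); p(d) is uniform, so it only adds a constant.
        p_w_d = p_z_d @ p_w_z
        log_liks.append(np.sum(counts * np.log(p_w_d + 1e-300)))
        # E step: p(z|d,w) proportional to p(z|d) p(w|z), shape (m, k, v).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-300
        # M step: re-estimate both distributions from expected counts n(d,w) p(z|d,w).
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z, np.array(log_liks)

# Toy corpus: documents 0-1 mostly use words 0-2, documents 2-3 mostly use words 3-5.
counts = np.array([[4, 3, 5, 0, 0, 0],
                   [5, 4, 3, 0, 0, 1],
                   [0, 0, 1, 4, 5, 3],
                   [0, 0, 0, 3, 4, 5]], dtype=float)
p_z_d, p_w_z, log_liks = plsa(counts, k=2)
print(np.round(p_z_d, 2))
```

Note that `p_z_d` has one row per training document, so the number of parameters grows with the size of the corpus.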
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (document to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA builds a generative model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• In pLSA, statistical model selection can be used to determine K
Applications
• Text mining: topic discovery
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting
[Figure: unrolled pLSA model for documents d1, d2, ..., dM, each with its own unconstrained topic distribution p(z|d1), p(z|d2), ..., p(z|dM)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
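• Definition
$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i \ge 0$
• The density is zero outside this open (K - 1)-dimensional simplex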
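Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4) and (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities on the simplex for α0 = 0.1, α0 = 1 and α0 = 10]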
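The effect of the concentration can also be seen by sampling. This sketch draws from symmetric Dirichlet distributions with NumPy; the per-component values 0.2, 1 and 10, the sample size and the helper name `dirichlet_summary` are assumptions chosen for the illustration, not the settings used in the slides' figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_summary(alpha, n=5000):
    """Draw n samples; return them, their mean, and the average largest coordinate."""
    samples = rng.dirichlet(alpha, size=n)  # each row is a K-tuple on the simplex
    return samples, samples.mean(axis=0), samples.max(axis=1).mean()

samples_sparse, mean_sparse, peak_sparse = dirichlet_summary([0.2, 0.2, 0.2])
samples_flat, mean_flat, peak_flat = dirichlet_summary([1.0, 1.0, 1.0])
samples_dense, mean_dense, peak_dense = dirichlet_summary([10.0, 10.0, 10.0])

# Every sample is a valid set of mixture proportions, and E[x_i] = alpha_i / alpha_0.
print(samples_sparse.sum(axis=1)[:3], mean_sparse)
# Small alpha pushes mass to the corners of the simplex; large alpha concentrates it
# around the center, so the largest coordinate shrinks as alpha grows.
print(peak_sparse, peak_flat, peak_dense)
```

With equal αi the mean stays at the center of the simplex in all three cases; only the spread around it changes.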
The LDA Model
[Figure: unrolled LDA model; each document has its own topic proportions θ drawn from a shared Dirichlet(α), followed by topics z1 ... z4 and words w1 ... w4, with the topic-word parameters β shared by all documents]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
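The generative process can be simulated directly. In the sketch below the function name `generate_corpus`, the two toy topics in `beta` and the value of `alpha` are hypothetical choices made for the example.

```python
import numpy as np

def generate_corpus(alpha, beta, n_docs, n_words, seed=0):
    """Sample documents from the LDA generative process.

    alpha: Dirichlet parameter, shape (K,)
    beta:  topic-word probabilities p(w|z), shape (K, V), rows sum to one
    """
    rng = np.random.default_rng(seed)
    k, v = beta.shape
    docs, thetas = [], []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)                  # theta ~ Dirichlet(alpha)
        z = rng.choice(k, size=n_words, p=theta)      # z_n ~ Multinomial(theta)
        w = np.array([rng.choice(v, p=beta[t]) for t in z])  # w_n ~ p(w | z_n, beta)
        docs.append(w)
        thetas.append(theta)
    return docs, np.array(thetas)

# Two toy topics over a six-word vocabulary.
beta = np.array([[0.4, 0.4, 0.2, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.2, 0.4, 0.4]])
alpha = np.array([0.5, 0.5])
docs, thetas = generate_corpus(alpha, beta, n_docs=5, n_words=20)
for theta, words in zip(thetas, docs):
    print(np.round(theta, 2), words)
```

Unlike pLSA, the per-document proportions θ are random draws from a shared prior rather than free parameters, so the number of model parameters no longer depends on the number of documents.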
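Joint Probability
• Given the parameters α and β:
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where p(θ|α) is the Dirichlet density and p(zn|θ) is the component of θ selected by topic zn
Likelihood
• Joint probability
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents
$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$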
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of its documents
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
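• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
$\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\big(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence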
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
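• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
$\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
where Ψ is the digamma function
• M-step: parameter estimation
  - β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, [w_{dn} = j]$
  - α: has no closed form; the update can be implemented using the Newton-Raphson method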
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA model with topics z1 ... z4, words w1 ... w4 and shared parameters β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate notation with nodes π, z, x and parameters θ, β; plates over the Nd patches of an image and the M images]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated topic model, NIPS 2005
• Hierarchical Dirichlet process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - maximum margin discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - We need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection maps an image's term vector (Term1 ... TermN, dimension = 1 million, e.g. Img1 = 1, 2, 0, 0, ..., 2) to a topic feature vector (f1 ... fM, dimension = 200, e.g. Img1 = 0.2, 0.1, ..., 0.03)]
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
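Orthogonal Decomposition
p = Xw + ε
• X = (x1, x2, ..., xk): base vectors
• w: low-dimensional representation
• ε: residual
[Equation lost in extraction: step-by-step orthogonal decomposition of p over the base vectors x1, ..., xk]
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
• Similarity between two documents p and q:
$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
which holds when the columns of X are orthonormal and the residuals are orthogonal to them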
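The role of the residual can be verified numerically. This sketch assumes an orthonormal basis X (obtained here from a QR factorization of random data, purely for illustration) and a hypothetical helper `decompose`; it is not the indexing system described in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5  # vocabulary size and reduced dimension (toy values)

# Orthonormal base vectors x1..xk as the columns of X.
X, _ = np.linalg.qr(rng.normal(size=(n, k)))

def decompose(p, X):
    """Split p into its coordinates w in span(X) and the orthogonal residual."""
    w = X.T @ p
    residual = p - X @ w
    return w, residual

p = rng.random(n)
q = rng.random(n)
w_p, e_p = decompose(p, X)
w_q, e_q = decompose(q, X)

# The inner product in the original space splits into a low-dimensional part
# and a residual part, so keeping the residual preserves the similarity.
lhs = p @ q
rhs = w_p @ w_q + e_p @ e_q
print(lhs, rhs, w_p @ w_q)
```

Dropping the residual term leaves only `w_p @ w_q`, which is the approximation that plain dimension reduction would use.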
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
where p'(w|d) is the document-specific word distribution and p''(w) is the background distribution
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
Search (Online)
[Figure: online search pipeline. A query (a topic histogram plus a few words) is looked up in the LSH index; candidate documents (e.g. Doc 300, Doc 401, ...) are retrieved and then re-ranked using the document-specific words (DS1, DS2, ...) kept in the Doc Meta store]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image and search results]
Search Example
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to the topic and junk-topic space
Evaluation
Method  Slashdot                Apple
        All Posts  Good Posts   All Posts  Good Posts
NP      0.021      0.012        0.289      0.239
RR      0.183      0.319        0.269      0.474
DS      0.463      0.643        0.409      0.628
LDA     0.465      0.644        0.410      0.648
SWB     0.463      0.644        0.410      0.641
SMSS    0.524      0.737        0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods: HITS, PageRank, ...
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
SVD - Definition
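• Any m x n matrix A with m ≥ n can be factorized as
$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T$
• where U (m x m) and V (n x n) are orthogonal and Σ = diag(σ1, σ2, ..., σn) with σ1 ≥ σ2 ≥ ... ≥ σn ≥ 0
Singular Values And Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD:
$A v_j = \sigma_j u_j, \qquad A^T u_j = \sigma_j v_j$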
Matrix approximation
• Theorem: let Uk = (u1, u2, ..., uk), Vk = (v1, v2, ..., vk) and Σk = diag(σ1, σ2, ..., σk), and define
$A_k = U_k \Sigma_k V_k^T$
• Then
$\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• This means that the best rank-k approximation of the matrix A is A_k
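The theorem is easy to check with NumPy's SVD routine. The random test matrix and the choice k = 2 below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))  # any m x n matrix with m >= n

# Thin SVD: U is 8x5, s holds the singular values in descending order, Vt is 5x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # A_k = U_k Sigma_k V_k^T

# The spectral-norm error of the rank-k truncation equals sigma_{k+1}
# (s[k] with zero-based indexing).
spectral_error = np.linalg.norm(A - A_k, ord=2)
print(spectral_error, s[k])
```

The same truncation is what LSI applies to the term-by-document matrix.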
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix (each row is a sample)
• V holds the eigenvectors of A^T A
  - Each column of V is an eigenvector for the row matrix A
  - We use V to approximate a row of A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors of A A^T
  - Each column of U is an eigenvector for the column matrix A
  - We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: A = UΣV^T
• Approximate A by
$A \approx A_k = U_k \Sigma_k V_k^T = U_k D_k, \qquad D_k = \Sigma_k V_k^T$
  - Uk: orthogonal basis that we use to approximate all the documents
  - Dk: column j holds the coordinates of document j in the new basis
  - Dk is the projection of A onto the subspace spanned by Uk
SVD and PCA
• For symmetric (positive semi-definite) A, SVD is closely related to PCA
• PCA: A = UΛU^T
  - U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛV^T
  - U holds the left (column) eigenvectors
  - V holds the right (row) eigenvectors
  - Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors are equal to the row eigenvectors
• Note the difference of A in PCA and SVD
  - SVD: A is directly the data, e.g. the term-by-document matrix
  - PCA: A is the covariance matrix, A = X^T X, where each row of X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  - Indexing: collecting terms
  - Use a stop list: eliminate "meaningless" words
  - Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: AA^T = UΣ^2U^T = (UΣ)(UΣ)^T
UΣ are the coordinates of A (rows) projected to space V
• Document to document: A^TA = VΣ^2V^T = (VΣ)(VΣ)^T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T
UΣ^{1/2} are the coordinates of A (rows) projected to space V
VΣ^{1/2} are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
• Each page i has both a hub score hi and an authority score ai
• HITS successively refines these scores by computing
$a_i = \sum_{j:\, j \to i} h_j, \qquad h_i = \sum_{j:\, i \to j} a_j$
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise
• Now
$h = L a, \qquad a = L^T h$
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• They are in fact the first right and left singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
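The link between the HITS iteration and the SVD can be checked on a small graph. The four-page adjacency matrix and the function name `hits` below are made up for the illustration.

```python
import numpy as np

# L[i, j] = 1 if page i links to page j (rows are outlinks, columns are inlinks).
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def hits(L, n_iter=100):
    """Power iteration for the authority scores a and hub scores h."""
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(n_iter):
        a = L.T @ h              # good authorities are pointed to by good hubs
        h = L @ a                # good hubs point to good authorities
        a /= np.linalg.norm(a)   # normalize to keep the scores bounded
        h /= np.linalg.norm(h)
    return a, h

a, h = hits(L)

# The same vectors from the SVD: first right and first left singular vectors.
U, s, Vt = np.linalg.svd(L)
a_svd = np.abs(Vt[0])    # singular vectors are defined up to sign
h_svd = np.abs(U[:, 0])
print(np.round(a, 4), np.round(a_svd, 4))
print(np.round(h, 4), np.round(h_svd, 4))
```

Page 2 receives links from three of the four pages and ends up with the largest authority score.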
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column wi represents one basis vector
• H: coordinates of V projected onto W
vj ≈ W_{n×k} hj
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
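• Cost function: Euclidean distance
$\|V - WH\|^2 = \sum_{ij} \big(V_{ij} - (WH)_{ij}\big)^2$
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
$D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  - A and B can then be regarded as normalized probability distributions
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• pLSA is NMF with KL divergence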
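The multiplicative updates for the Euclidean cost can be sketched in a few lines. The function name `nmf`, the small constant `eps` added to the denominators to avoid division by zero, and the synthetic low-rank data are assumptions made for this example.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-12):
    """NMF with multiplicative updates for the Euclidean cost ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Each factor is multiplied by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, np.array(errors)

# Synthetic non-negative data of rank 4: 30 samples (columns) of dimension 20.
rng = np.random.default_rng(1)
V = rng.random((20, 4)) @ rng.random((4, 30))
W, H, errors = nmf(V, k=4)
print(errors[0], errors[-1])
```

Because the updates only rescale entries, no projection step is needed to enforce non-negativity, and the reconstruction error does not increase from one iteration to the next.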
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies with personal style
• Latent semantic indexing
– Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • Java could mean "coffee" and also the programming language Java
  • Cricket is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
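[Figure: pLSA in plate notation (d → z → w, with an inner plate over the Nd words of a document and an outer plate over the M documents), and the same model unrolled for one document as d → (z1, w1), (z2, w2), (z3, w3), …, (zN, wN)]
• z1 … zN are variables, zi ∈ [1, K], where K is the number of latent topics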
pLSA
[Figure: pLSA unrolled over all documents d1, d2, …, dM; document dm generates the pairs (z1, w1), (z2, w2), …, (zNm, wNm)]
• p(w | z=1), p(w | z=2), …, p(w | z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability: p(d, z, w) = p(d) p(z | d) p(w | z)
• Likelihood (only for observed variables): p(d, w) = p(d) Σz p(z | d) p(w | z)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
p(w | d) = Σz p(w | z) p(z | d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
p(w | d) = Z_{V×k} p(z | d)
where the columns of Z are the k topic distributions p(w | z) over a vocabulary of V words
• With many documents, we hope to find latent topics as a common basis
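pLSA – Objective Function
• pLSA tries to maximize the log likelihood
L = Σd Σw n(d, w) log p(d, w) = Σd Σw n(d, w) log [ p(d) Σz p(w | z) p(z | d) ]
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM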
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
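• The E-Step
p(z | d, w) = p(w | z) p(z | d) / Σz' p(w | z') p(z' | d)
• The M-Step
p(w | z) ∝ Σd n(d, w) p(z | d, w)
p(z | d) ∝ Σw n(d, w) p(z | d, w)
where n(d, w) is the number of times word w occurs in document d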
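The two pLSA steps can be sketched in a few lines. This is a minimal sketch, assuming the standard updates from Hofmann's paper: the E step computes the posterior p(z | d, w) ∝ p(w | z) p(z | d), and the M step re-estimates p(w | z) and p(z | d) from the expected counts n(d, w) p(z | d, w). The function name `plsa`, the random initialization, and the toy count matrix are illustrative assumptions.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, n_topics, n_iter=50, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood history."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    history = []
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        loglik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                # Log likelihood of the current parameters (the constant p(d) term is dropped).
                loglik += counts[d][w] * math.log(p_w_d)
                for z in range(n_topics):
                    posterior = joint[z] / p_w_d  # E step: p(z | d, w)
                    new_w_z[z][w] += counts[d][w] * posterior
                    new_z_d[d][z] += counts[d][w] * posterior
        # M step: normalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(loglik)
    return p_w_z, p_z_d, history

# Toy corpus (made up): documents 0-1 use words 0-2, documents 2-3 use words 3-5.
counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 5, 1, 0, 0, 0],
    [0, 0, 0, 2, 4, 3],
    [0, 0, 0, 5, 2, 2],
]
p_w_z, p_z_d, history = plsa(counts, n_topics=2)
```

Every document in the toy corpus contains at least one word, so no row of expected counts is empty. On a corpus with this block structure the two topics typically separate the two word groups, although EM only guarantees a local maximum.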
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
• Comparison to SVD
– U matrix is related to P(z | d) (doc to aspect)
– V matrix is related to P(w | z) (aspect to term)
– Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each doc has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z | di)
• This easily leads to serious over-fitting
[Figure: pLSA unrolled over documents d1, d2, …, dm, each generating (z1, w1), (z2, w2), …, (zNm, wNm) from its own topic distribution p(z | d1), p(z | d2), …, p(z | dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions of each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
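The requirement that samples are non-negative K-tuples summing to one can be seen by drawing from a Dirichlet directly. This is a minimal sketch that uses the standard construction of normalizing independent Gamma(αi, 1) draws; the helper name `sample_dirichlet` and the parameter value (6, 2, 2) are chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# The empirical mean should approach alpha_i / sum(alpha) = (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]
```

Every sample is a point on the 2-simplex, that is, a valid topic mixture for K = 3.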
Dirichlet Distribution
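• Definition
$$ p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{K} x_i = 1, \; x_i > 0 $$
• The density is zero outside this open (K − 1)-dimensional simplex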
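Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: density plots on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), and (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σi αi (sum over i = 1 … k)
[Figure: density plots for α0 = 0.1, α0 = 1, and α0 = 10]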
The LDA Model
[Figure: the LDA model unrolled over documents; each document generates topic/word pairs (z1, w1), …, (z4, w4), and all documents share the topic-word parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    – Choose a topic zn ~ Multinomial(θ)
    – Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
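The generative process above can be sketched directly as sampling code. This is a minimal sketch, assuming θ ~ Dirichlet(α), zn ~ Multinomial(θ), and wn drawn from row zn of a topic-word matrix β; the helper names and the values of α and β are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    # Inverse-CDF sampling of one index from a discrete distribution.
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1

def generate_document(alpha, beta, n_words, rng):
    theta = sample_dirichlet(alpha, rng)      # per-document topic mixture
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)    # topic of the n-th word
        w = sample_categorical(beta[z], rng)  # word drawn from p(w | z)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(1)
alpha = [0.5, 0.5]
# Hypothetical topics over a 4-word vocabulary: topic 0 uses words 0-1, topic 1 uses words 2-3.
beta = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document(alpha, beta, 20, rng)
```

Unlike pLSA, the only per-document quantity here is the sampled θ; the parameters α and β are shared by the whole corpus.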
Joint Probability
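• Given parameters α and β:
p(θ, z, w | α, β) = p(θ | α) ∏n p(zn | θ) p(wn | zn, β)
where p(zn | θ) is simply θi for the topic i selected by zn
Likelihood
• Joint probability
p(θ, z, w | α, β) = p(θ | α) ∏n p(zn | θ) p(wn | zn, β)
• Marginal distribution of a document
p(w | α, β) = ∫ p(θ | α) ( ∏n Σzn p(zn | θ) p(wn | zn, β) ) dθ
• Likelihood over all the documents
p(D | α, β) = ∏d p(wd | α, β)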
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
• Jensen's inequality, as in EM, gives a lower bound on it
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β)
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational distribution q(θ, z | γ, φ) and the posterior distribution p(θ, z | w, α, β)
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence:
log p(w | α, β) = L(γ, φ; α, β) + KL(q(θ, z | γ, φ) || p(θ, z | w, α, β))
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), which makes KL = 0
• In VBEM, p(X | D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
φni ∝ βi,wn exp(Ψ(γi))
γi = αi + Σn φni
where Ψ is the digamma function
• M-Step: parameter estimation
– β: βij ∝ Σd Σn φdni [wdn = j]
– α: can be computed with the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA model unrolled over documents, with topic/word pairs (z1, w1), …, (z4, w4) per document and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
        Term1  Term2  Term3  Term4  …  TermN    (Dim = 1 million)
Img1    1      2      0      0      …  2
        f1     f2     …      fM                 (Dim = 200)
Img1    0.2    0.1    …      0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional histogram of topic weights + a few words (10 words)]
Orthogonal Decomposition
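$$ p = Xw + \varepsilon, \qquad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix} $$
• X: base vectors (columns X1, X2, X3, …, Xk)
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional histogram of topic weights + a few words (10 words)]
• Similarity of two documents p and q:
$$ p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q $$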
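The decomposition p = Xw + ε can be illustrated with a small numeric sketch. It assumes that the columns of X are orthonormal and that w is the projection of p onto them, so the residual is orthogonal to the basis and the inner product of two documents splits into a low-dimensional part and a residual part. The basis and the two vectors below are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """basis: list of k orthonormal vectors (the columns of X). Returns (w, residual)."""
    w = [dot(x, p) for x in basis]  # low-dimensional coordinates
    projection = [sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))]
    residual = [pi - qi for pi, qi in zip(p, projection)]
    return w, residual

# Hypothetical orthonormal basis with k = 2 in a 4-word vocabulary space.
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 4.0, 1.0]

w_p, e_p = decompose(p, basis)
w_q, e_q = decompose(q, basis)

lhs = dot(p, q)                          # similarity in the original space
rhs = dot(w_p, w_q) + dot(e_p, e_q)      # topic part + residual part
```

With the full residual kept, `lhs` and `rhs` agree exactly. If only the few largest residual entries are stored, as the "a few words" label on the slide suggests, the identity becomes an approximation.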
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
– a topic-specific distribution
– a document-specific distribution
– a background distribution
$$ p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w) $$
where p'(w | d) is the document-specific word distribution and p''(w) is the background distribution
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query is represented as a topic histogram plus a few words; the LSH index (entries DS1, DS2, …) returns candidate documents (e.g., Doc 300, Doc 401) out of Doc 1 … Doc N, which are then re-ranked using the document metadata]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in the expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
– Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
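For readers unfamiliar with the column headers, the three measures can be sketched as follows. These are the standard textbook definitions of reciprocal rank, precision at k, and average precision; it is an assumption that the evaluation on the slide used exactly these forms, and the ranked list and relevance set below are made up.

```python
def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant item; MRR is the mean of this over queries.
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are relevant (P@10 uses k = 10).
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    # Mean of the precision values at each relevant hit; MAP is the mean over queries.
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking of four users, two of whom are true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}

rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
```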
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: a good way to practice working with matrices
– Graphical models: a good way to practice working with probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Singular Values And Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of SVD:
A vj = σj uj,  A^T uj = σj vj
Matrix approximation
• Theorem: Let Uk = (u1 u2 … uk), Vk = (v1 v2 … vk) and Σk = diag(σ1, σ2, …, σk), and define
Ak = Uk Σk Vk^T
• Then
min over rank(B) ≤ k of ||A − B||2 = ||A − Ak||2 = σk+1
• It means that the best rank-k approximation of the matrix A is Ak
SVD and PCA
• We can write A^T A = V Σ² V^T
• Remember that in PCA we treat A as a row matrix
• V is just the eigenvectors for A
– Each column in V is an eigenvector of A^T A, with A as a row matrix
– We use V to approximate a row in A
• Equivalently, we can write A A^T = U Σ² U^T
• U is just the eigenvectors for A^T
– Each column in U is an eigenvector of A A^T, with A as a column matrix
– We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: A = U Σ V^T
• Approximate A by Ak = Uk Σk Vk^T = Uk Dk, where Dk = Σk Vk^T
– Uk: orthogonal basis that we use to approximate all the documents
– Dk: column j holds the coordinates of document j in the new basis
– Dk is the projection of A onto the subspace spanned by Uk
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: A = U Λ U^T
– U and Λ are the eigenvectors and eigenvalues
• SVD: A = U Λ V^T
– U is the left (column) eigenvectors
– V is the right (row) eigenvectors
– Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
– SVD: A is directly the data, e.g. a term-by-document matrix
– PCA: A is the covariance matrix, A = X^T X, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
– Indexing: collecting terms
– Use a stop list: eliminate "meaningless" words
– Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: A A^T = U Σ² U^T = (UΣ)(UΣ)^T
– UΣ are the coordinates of A (rows) projected to space V
• Document to document: A^T A = V Σ² V^T = (VΣ)(VΣ)^T
– VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: A = U Σ V^T = (U Σ^{1/2})(V Σ^{1/2})^T
– U Σ^{1/2} are the coordinates of A (rows) projected to space V
– V Σ^{1/2} are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– authorities contain high-quality information
– hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
• Each page i has both a hub score hi and an authority score ai
• HITS successively refines these scores by computing
ai ← Σ hj over pages j that link to i,  hi ← Σ aj over pages j that i links to
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise
• Now
a ← L^T h,  h ← L a
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
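The link between HITS and the dominant eigenvectors can be checked on a tiny graph. This is a minimal sketch of the power iteration a ← L^T h, h ← L a with normalization after each step; the 4-page adjacency matrix is made up for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hits(adjacency, n_iter=100):
    """adjacency[i][j] = 1 if page i links to page j. Returns (hub scores, authority scores)."""
    n = len(adjacency)
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(n_iter):
        # a = L^T h: a page is a good authority if good hubs point to it.
        auth = normalize([sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)])
        # h = L a: a page is a good hub if it points to good authorities.
        hubs = normalize([sum(adjacency[i][j] * auth[j] for j in range(n)) for i in range(n)])
    return hubs, auth

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hubs, auth = hits(L)
```

For this graph, L^T L restricted to pages 2 and 3 is [[2, 1], [1, 1]], whose dominant eigenvector is proportional to ((1 + √5)/2, 1), so the ratio auth[2] / auth[3] should converge to the golden ratio.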
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column wi represents one basis vector
• H: coordinates of V projected to W
vj ≈ W_{n×k} hj
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance
||V − WH||² = Σij (Vij − (WH)ij)²
• Multiplicative update
Haj ← Haj (W^T V)aj / (W^T W H)aj
Wia ← Wia (V H^T)ia / (W H H^T)ia
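The multiplicative update for the Euclidean cost can be sketched in plain Python. This assumes the Lee and Seung rules H ← H ∘ (W^T V) / (W^T W H) and W ← W ∘ (V H^T) / (W H H^T), applied elementwise; the helper names, the small constant `eps` that guards against division by zero, and the toy matrix are illustrative assumptions.

```python
import random

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, n_iter=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(v, w, h))
    return w, h, errors

# Toy non-negative matrix of rank 2 (made up): rows 0-1 and rows 2-3 are proportional.
V = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 1.0, 0.0],
    [6.0, 2.0, 0.0],
]
W, H, errors = nmf(V, k=2)
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the squared error does not increase from one iteration to the next.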
Multiplicative Update Algorithm
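• Cost function: divergence
D(A || B) = Σij ( Aij log(Aij / Bij) − Aij + Bij )
– Reduces to the Kullback-Leibler divergence when Σij Aij = Σij Bij = 1
– A and B can then be regarded as normalized probability distributions
• Multiplicative update
Haj ← Haj [ Σi Wia Vij / (WH)ij ] / Σi Wia
Wia ← Wia [ Σj Haj Vij / (WH)ij ] / Σj Haj
• pLSA is NMF with KL divergence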
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture of Gaussians: parameter estimation
• pLSA
– Motivation
– Derivation & geometric properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = fθ(x) = p(x | θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ):
θML = argmaxθ L(θ) = argmaxθ p(x1, x2, …, xn | θ)
• Predictive distribution:
p(x | x1, …, xn) ≈ p(x | θML)
IID – Independent Identically Distributed
• IID means
p(x1, x2, …, xn | θ) = ∏i p(xi | θ)
• The problem is considerably simplified as
θML = argmaxθ ∏i p(xi | θ)
• Usually the log likelihood is used
θML = argmaxθ Σi log p(xi | θ)
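Under the i.i.d. assumption the log likelihood is a plain sum over samples, which makes the maximum easy to find for simple models. The following is a minimal sketch, assuming a one-dimensional Gaussian model (the slides do not fix a model), for which the maximum likelihood estimates are the sample mean and the (biased) sample variance; the data are synthetic.

```python
import math
import random

def gaussian_log_likelihood(data, mean, var):
    # sum_i log p(x_i | mean, var) under the i.i.d. assumption
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)
               for x in data)

def gaussian_mle(data):
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # ML estimate divides by n, not n - 1
    return mean, var

# Synthetic data (an assumption for the demo): 1000 draws from N(5, 2^2).
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(1000)]

mean_ml, var_ml = gaussian_mle(data)
best = gaussian_log_likelihood(data, mean_ml, var_ml)
```

Moving either parameter away from `mean_ml` or `var_ml` can only lower the log likelihood, which is what "the parameters that make the data most likely" means in practice.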
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D = {yn}, with latent variables X = {xn}
• Likelihood: p(D | θ) = ∫ p(X, D | θ) dX
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
θML = argmaxθ p(D | θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {yn}, latent variables X = {xn}, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ
L(θ) = log p(D | θ) = log ∫ p(X, D | θ) dX
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
L(θ) = log ∫ q(X) [p(X, D | θ) / q(X)] dX ≥ ∫ q(X) log [p(X, D | θ) / q(X)] dX = ∫ q(X) log p(X, D | θ) dX + H[q]
• where H[q] is the entropy of q(X)
The E and M Steps of EM
• The lower bound on the log likelihood is given by
F(q, θ) = ∫ q(X) log p(X, D | θ) dX + H[q]
• EM alternates between
• E step: optimize F(q, θ) with respect to the distribution over hidden variables, holding the parameters fixed
• M step: maximize F(q, θ) with respect to the parameters, holding the hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
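• Definition
  $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex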
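Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4) and (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities on the simplex for α0 = 0.1, α0 = 1 and α0 = 10]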
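The effect of α on the samples can be explored numerically. The following is a minimal sketch, not taken from the slides: it assumes the standard construction of a Dirichlet sample as normalized Gamma(α_i, 1) draws, and the helper name `sample_dirichlet` is hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
# Small alpha_i pushes samples toward the corners of the simplex,
# large equal alpha_i concentrates them near the center.
for alpha in [(6, 2, 2), (0.1, 0.1, 0.1), (10, 10, 10)]:
    x = sample_dirichlet(alpha, rng)
    print(alpha, [round(v, 3) for v in x])
```

Every sample is a K-tuple of non-negative numbers summing to one, so it can be used directly as the topic mixture of a document.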
The LDA Model
[Figure: LDA graphical model. For each document a topic mixture generates topics z1 … z4 and words w1 … w4; the topic-word parameter β is shared across documents.]
• For each document
  – Choose θ ~ Dirichlet(α)
  – For each of the N words w_n
    · Choose a topic z_n ~ Multinomial(θ)
    · Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
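The generative process above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the vocabulary, the two topic-word distributions in `beta`, the value of `alpha`, and the function names are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, n_words, rng):
    """Follow the LDA generative story for one document."""
    theta = sample_dirichlet(alpha, rng)          # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)      # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = ["gene", "dna", "cell", "ball", "team", "goal"]
beta = [[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.3, 0.4, 0.3]]
alpha = [0.5, 0.5]

rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, 8, rng)
print([round(t, 3) for t in theta], [vocab[w] for w in words])
```

Inference in LDA runs this story backwards: given only the words, it estimates θ and the topic assignments z.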
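Joint Probability
• Given the parameters α and β, the joint distribution of a topic mixture θ, N topics z and N words w is
  $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where $p(z_n = i \mid \theta) = \theta_i$
Likelihood
• Joint probability
  $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
  $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents
  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$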
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
• Jensen's inequality gives a lower bound, as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute; instead, it is approximated by a variational distribution q(X) that minimizes KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  – Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  – Repeat until convergence
    · E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    · M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  – β has a closed-form update
  – α can be estimated using the Newton-Raphson method
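The update equations on these two slides did not survive in the transcript. As a hedged reconstruction, the following are the standard variational EM updates from the cited paper by Blei, Ng and Jordan (2003). The notation is assumed from that paper rather than taken from the slides: φ and γ are the variational parameters, Ψ is the digamma function, and $w_{dn}^{j} = 1$ when the n-th word of document d is vocabulary item j.

```latex
% E-step (per document), iterated until convergence:
\phi_{ni} \propto \beta_{i w_n} \exp\!\big(\Psi(\gamma_i)\big), \qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step (over the corpus of M documents), closed form for beta:
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
```

The update for α has no closed form, which is why a Newton-Raphson iteration is used for it.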
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model, with topics z1 … z4, words w1 … w4 and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, θ, z, x and β, and plates M and Nd]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  Term space (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  After topic projection (Dim = 200):
          f1     f2     …  fM
    Img1  0.2    0.1    …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
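Orthogonal Decomposition
  $p = Xw + \varepsilon$
  $\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X = (X1 X2 X3 … Xk): base vectors
• w: low dimensional representation
• ε: residual
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
• Similarity of two documents p and q
  $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$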
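The inner-product split can be checked numerically. The sketch below assumes that the base vectors are orthonormal and that the residual is what remains after projecting onto them; under those assumptions the cross terms vanish and the identity is exact. Keeping only the largest residual entries (the hypothetical `keep_top` helper) is one possible reading of "a few words" and is an assumption, not the paper's stated procedure.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps."""
    w = [dot(x, p) for x in basis]                       # w = X^T p
    proj = [sum(wj * x[i] for wj, x in zip(w, basis))    # X w
            for i in range(len(p))]
    eps = [pi - qi for pi, qi in zip(p, proj)]           # eps = p - X w
    return w, eps

def keep_top(vec, m):
    """Keep the m largest-magnitude entries of vec and zero the rest."""
    top = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:m])
    return [v if i in top else 0.0 for i, v in enumerate(vec)]

s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0, 0.0, 0.0],      # two orthonormal base vectors (toy values)
         [0.0, 0.0, 1.0, 0.0, 0.0]]
p = [2.0, 1.0, 0.5, 3.0, 0.1]
q = [1.0, 0.0, 2.0, 1.5, 0.2]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)
exact = dot(p, q)
split = dot(w_p, w_q) + dot(eps_p, eps_q)            # equals exact
approx = dot(w_p, w_q) + dot(keep_top(eps_p, 2), keep_top(eps_q, 2))
print(exact, split, approx)
```

The low-dimensional part can be indexed compactly, while the sparse residual keeps the document-specific detail that a pure projection would discard.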
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
  $p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional topic histogram + a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, …); the candidates are then re-ranked using the per-document entries DS1, DS2, … stored in the doc meta for Doc 1 … Doc N.]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrix methods
  – Graphical models: a good way to practice probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is a good tool for analyzing problems
• It is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition – Eigenvalue & Eigenvector
- Definition – Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank – Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF – Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
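Matrix approximation
• Theorem: Let U_k = (u1 u2 … uk), V_k = (v1 v2 … vk) and Σ_k = diag(σ1, σ2, …, σk), and define
  $A_k = U_k \Sigma_k V_k^T$
• Then
  $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$
• It means that the best rank-k approximation of the matrix A is
  $A_k = U_k \Sigma_k V_k^T$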
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix
• V is just the eigenvectors of A^T A
  – Each column of V is an eigenvector for the row matrix A
  – We use V to approximate a row of A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U is just the eigenvectors of A A^T
  – Each column of U is an eigenvector for the column matrix A
  – We use U to approximate a column of A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: A = UΣV^T
• Approximate A by $A \approx A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
  – U_k: orthogonal basis that we use to approximate all the documents
  – D_k: column j holds the coordinates of document j in the new basis
  – D_k is the projection of A onto the subspace spanned by U_k
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛV^T
  – U holds the left (column) eigenvectors
  – V holds the right (row) eigenvectors
  – Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
  – SVD: A is directly the data, e.g. the term-by-document matrix
  – PCA: A is the covariance matrix, A = X^T X, where each row of X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
  – Indexing: collecting terms
  – Use a stop list: eliminate "meaningless" words
  – Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T
  – UΣ are the coordinates of A (rows) projected to space V
• Document to document: A^TA = VΣ²V^T = (VΣ)(VΣ)^T
  – VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T
  – UΣ^{1/2} are the coordinates of A (rows) projected to space V
  – VΣ^{1/2} are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing
  $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}$, $h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph
  $L_{ij} = 1$ if page i links to page j, and $L_{ij} = 0$ otherwise
• Now
  $a^{(k)} = L^T h^{(k-1)}$, $h^{(k)} = L a^{(k)}$
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
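The iteration above can be sketched with plain lists. This is a minimal illustration under assumptions: the four-page link graph is invented for the example, scores are renormalized to unit length after every step, and the function name `hits` is hypothetical.

```python
import math

def hits(L, iters=100):
    """Power iteration for HITS on adjacency matrix L (L[i][j] = 1 if i links to j)."""
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        a = [sum(L[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = L^T h
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = L a
        norm_a = math.sqrt(sum(v * v for v in a))
        norm_h = math.sqrt(sum(v * v for v in h))
        a = [v / norm_a for v in a]
        h = [v / norm_h for v in h]
    return h, a

# Toy graph: pages 0 and 1 link to pages 2 and 3; page 2 links to page 3.
L = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]

hubs, auths = hits(L)
print([round(v, 3) for v in hubs])
print([round(v, 3) for v in auths])
```

After convergence, `auths` is a dominant eigenvector of L^T L and `hubs` of L L^T, which is the connection to the SVD of L stated above.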
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V (n × m), find non-negative matrix factors W (n × k) and H (k × m) such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column w_i represents one basis vector
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k} h_j$
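Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance
  $\|V - WH\|^2 = \sum_{ij} \left(V_{ij} - (WH)_{ij}\right)^2$
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
  $D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  – Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update
  $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• pLSA is NMF with KL divergence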
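As a small illustration of the Euclidean-distance variant, the sketch below applies the Lee and Seung multiplicative updates to a toy matrix using plain Python lists. The matrix `V`, the random initialization, the iteration count and the small constant `eps` added to the denominators are assumptions made for the example.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def frob_err(V, W, H):
    """Squared Euclidean distance ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iters=200, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = []
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        # H <- H * (W^T V) / (W^T W H), element-wise
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        # W <- W * (V H^T) / (W H H^T), element-wise
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(frob_err(V, W, H))
    return W, H, errors

# Toy non-negative matrix of rank 2 (4 dimensions, 3 samples).
V = [[1.0, 2.0, 0.0],
     [2.0, 4.0, 0.0],
     [0.0, 1.0, 3.0],
     [1.0, 3.0, 3.0]]
W, H, errors = nmf(V, k=2)
print(round(errors[0], 4), round(errors[-1], 6))
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection step.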
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians: parameter estimation
• pLSA
  – Motivation
  – Derivation & geometric properties
  – Applications
• LDA
  – Motivation: why add a hyper-parameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x^(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = f_θ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ)
  $\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} p(x_1, \dots, x_n \mid \theta)$
• Predictive distribution
  $p(x \mid x_1, \dots, x_n) \approx p(x \mid \theta_{ML})$
IID – Independent Identically Distributed
• IID means
  $p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as
  $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used
  $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$
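As a concrete instance of maximum likelihood under the i.i.d. assumption, the sketch below fits a one-dimensional Gaussian, for which the maximizer of the log likelihood is available in closed form (sample mean and biased sample variance). The data values and function names are invented for illustration.

```python
import math

def gaussian_log_likelihood(data, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for i.i.d. data."""
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2.0 * sigma2))

def gaussian_mle(data):
    """Closed-form maximizer of the Gaussian log likelihood."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    return mu, sigma2

data = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_ml, sigma2_ml = gaussian_mle(data)
print(mu_ml, sigma2_ml, gaussian_log_likelihood(data, mu_ml, sigma2_ml))
```

Any other choice of (mu, sigma2) gives a lower log likelihood on the same data, which is what "most likely" means here.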
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  – To describe complex models: Gaussian mixture model
  – To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D
• Likelihood: $p(D \mid \theta) = \int p(D, X \mid \theta)\, dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_{\theta} p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps
  – E step: fill in values of the latent variables according to their posterior given the data
  – M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes difficult problems into a series of tractable steps
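Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}, latent variables X = {x_n}, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
  $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
  $L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• where H[q] is the entropy of q(X)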
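The E and M Steps of EM
• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  – E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
    $q^{(k)}(X) = \arg\max_{q(X)} F(q(X), \theta^{(k-1)})$
  – M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
    $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}(X), \theta)$
The E Step
• E step: for fixed θ
  $F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0
• So the E step simply sets
  $q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$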
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed
  $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}(X), \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
  $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
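The monotonicity property can be observed on the Gaussian mixture model mentioned earlier as a motivating latent variable model. The sketch below is a minimal EM loop for a two-component one-dimensional mixture; the synthetic data, the initialization and the variance floor are assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm(data, iters=50):
    """EM for a 2-component 1-D Gaussian mixture; returns parameters and log likelihoods."""
    pi = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    history = []
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point.
        resp, ll = [], 0.0
        for x in data:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(joint)
            ll += math.log(s)
            resp.append([j / s for j in joint])
        history.append(ll)  # log likelihood under the current parameters
        # M step: weighted maximum likelihood estimates.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk, 1e-6)
    return pi, mu, var, history

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
pi, mu, var, history = em_gmm(data)
print([round(m, 2) for m in mu], round(history[0], 1), round(history[-1], 1))
```

The recorded log likelihoods form a non-decreasing sequence, as the chain of inequalities above predicts.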
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model looks complex even with a few circles…
  – We have to make many assumptions
• Pros
  – We do need probability to explain our world, but joint probability is hard to compute
  – A graphical model can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
  $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
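To make the factorization concrete, the sketch below evaluates the five-variable example with binary variables. All probability values in the tables are invented for illustration; only the graph structure comes from the slide.

```python
from itertools import product

# P(X = 1 | parents); every variable is binary. Values are hypothetical.
p_a = 0.3
p_b = 0.6
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # P(C=1 | A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # P(D=1 | B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # P(E=1 | C, D)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
            * bern(p_d[(b, c)], d) * bern(p_e[(c, d)], e))

total = sum(joint(*v) for v in product((0, 1), repeat=5))
marginal_a = sum(joint(1, b, c, d, e) for b, c, d, e in product((0, 1), repeat=4))
print(total, marginal_a)
```

The factored model needs 1 + 1 + 4 + 4 + 4 = 14 numbers, whereas an unrestricted joint table over five binary variables needs 31.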
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same query varies due to personal style
• Latent semantic indexing
  – Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the documents share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    · "Java" could mean "coffee" and also the programming language Java
    · "Cricket" is a "game" and also an "insect"
  – Synonymy
    · "computer", "pc" and "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA in plate notation (document d, topic z, word w; plates Nd and M), and unrolled for one document d with topics z1 … zN and words w1 … wN]
z1 … zN are variables, z_i ∈ [1, K]; K is the number of latent topics
pLSA
[Figure: pLSA unrolled over documents d1, d2, …, dM; document di generates topics z1 … zNi and words w1 … wNi]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
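Joint Probability vs Likelihood
• Joint probability
  $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables)
  $p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
  $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  $p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
  $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$
  where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM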
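EM Steps
• E-step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-step
  $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-step, with n(d, w) the count of word w in document d
  $p(w \mid z) = \frac{\sum_{d} n(d, w)\, p(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, p(z \mid d, w')}$, $p(z \mid d) = \frac{\sum_{w} n(d, w)\, p(z \mid d, w)}{\sum_{w} n(d, w)}$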
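The two steps can be sketched as a short EM loop over a document-word count matrix. This is an illustrative toy rather than the implementation behind the slides: the count matrix, the random initialization and the function name `plsa` are assumptions, and the constant p(d) term is dropped from the tracked log likelihood because p(d) is taken to be uniform.

```python
import math
import random

def normalize(values):
    s = sum(values)
    return [v / s for v in values]

def plsa(counts, K, iters=50, seed=0):
    """EM for pLSA on counts[d][w]; returns p(w|z), p(z|d) and log likelihoods."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.01 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.01 for _ in range(K)]) for _ in range(D)]
    history = []
    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        ll = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                # E step: p(z|d,w) is proportional to p(w|z) p(z|d)
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                s = sum(joint)
                ll += n * math.log(s)
                for z in range(K):
                    r = n * joint[z] / s
                    new_wz[z][w] += r   # accumulates n(d,w) p(z|d,w) over d
                    new_zd[d][z] += r   # accumulates n(d,w) p(z|d,w) over w
        history.append(ll)
        # M step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_wz]
        p_z_d = [normalize(row) for row in new_zd]
    return p_w_z, p_z_d, history

# Toy corpus: 4 documents over a 6-word vocabulary with two visible themes.
counts = [[4, 3, 2, 0, 0, 0],
          [3, 4, 3, 0, 0, 0],
          [0, 0, 0, 5, 2, 3],
          [0, 1, 0, 3, 4, 2]]
p_w_z, p_z_d, history = plsa(counts, K=2)
print([[round(v, 2) for v in row] for row in p_z_d])
```

Each row of `p_z_d` holds the coordinates of one document in the latent topic basis given by `p_w_z`, which mirrors the document decomposition view.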
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
Vocabulary space (Dim = 1 million):
      Term1  Term2  Term3  Term4  …  TermN
Img1  1      2      0      0      …  2
Topic space (Dim = 200):
      f1   f2   …  fM
Img1  0.2  0.1  …  0.03
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
An image = [Figure: low-dimensional topic histogram] + a few words (10 words)
Orthogonal Decomposition
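$p = Xw + \varepsilon$, written out over the $W$ vocabulary entries and $k$ base vectors:
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• $X = (X_1, X_2, X_3, \dots, X_k)$: base vectors
• $w$: low dimensional representation
• $\varepsilon$: residual
An image = [Figure: low-dimensional topic histogram] + a few words (10 words)
Similarity between two images p and q:
$$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$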
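The decomposition into a low dimensional part and a residual can be sketched as below. This is a minimal illustration, not the method of the cited paper: it assumes the base vectors in X are orthonormal, builds a hypothetical X by QR factorization of random data, and keeps the full residual rather than only a few words. The names `decompose`, `vocab_size` and `k` are introduced here for illustration.

```python
import numpy as np

def decompose(p, X):
    """Split p into a low-dimensional code w and a residual orthogonal to span(X).

    Assumes X has orthonormal columns (X.T @ X = I).
    """
    w = X.T @ p            # coordinates of p in the k-dimensional space
    residual = p - X @ w   # the part of p that the base vectors cannot explain
    return w, residual

rng = np.random.default_rng(0)
vocab_size, k = 50, 5
# Hypothetical orthonormal basis from a QR factorization of random data.
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))
p = rng.random(vocab_size)
q = rng.random(vocab_size)

w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)

# The inner product splits into a low-dimensional part and a residual part.
exact = p @ q
split = w_p @ w_q + eps_p @ eps_q
```

Under these assumptions the cross terms vanish because each residual is orthogonal to every base vector, so `exact` and `split` agree up to rounding. Truncating the residual to its largest entries would make the identity approximate.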
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$$p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k)\, p(z=k|d) + p(x=1|d)\, p'(w|d) + p(x=2|d)\, p''(w)$$
Here $p'(w|d)$ is the document specific distribution and $p''(w)$ the background distribution.
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (topic histogram + a few words) is looked up in an LSH index whose buckets hold entries DS1, DS2, …; candidate documents such as Doc 300 and Doc 401 are retrieved from Doc 1 … Doc N and re-ranked using the doc meta data]
Index: 10M images, 46GB
Search speed < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method  Slashdot                Apple
        All Posts  Good Posts   All Posts  Good Posts
NP      0.021      0.012        0.289      0.239
RR      0.183      0.319        0.269      0.474
DS      0.463      0.643        0.409      0.628
LDA     0.465      0.644        0.410      0.648
SWB     0.463      0.644        0.410      0.641
SMSS    0.524      0.737        0.517      0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Find most influential node
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – a good practice to learn matrix
– Graphical model – a good practice to learn probability
• Graphical model is a good tool to analyze problems
– It is more adaptable for various applications than matrix decomposition
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
SVD and PCA
• We can write $A^T A = V \Sigma^2 V^T$
• Remember that in PCA we treat A as a row matrix
• V holds the eigenvectors of $A^T A$
– Each column in V is an eigenvector for the row matrix A
– We use V to approximate a row in A
• Equivalently, we can write $A A^T = U \Sigma^2 U^T$
• U holds the eigenvectors of $A A^T$
– Each column in U is an eigenvector for the column matrix A
– We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by the rank-k truncation $A_k = U_k \Sigma_k V_k^T = U_k D_k$, with $D_k = \Sigma_k V_k^T$
– $U_k$: orthogonal basis that we use to approximate all the documents
– $D_k$: column j holds the coordinates of document j in the new basis
– $D_k$ is the projection of A onto the subspace spanned by $U_k$
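The LSI steps above can be sketched with NumPy's SVD. The toy term-by-document matrix and the helper `fold_in` are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

# Toy term-by-document matrix A: rows are terms, columns are documents.
A = np.array([
    [1, 1, 0, 0],   # car
    [1, 0, 0, 0],   # automobile
    [0, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # cow
    [0, 0, 1, 0],   # sheep
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]                      # orthonormal basis used for all documents
D_k = np.diag(s[:k]) @ Vt[:k, :]    # column j: coordinates of document j
A_k = U_k @ D_k                     # rank-k approximation of A

# D_k is the projection of A onto the subspace spanned by U_k.
projection = U_k.T @ A

def fold_in(query):
    """Project a term-space query vector into the k-dimensional latent space."""
    return U_k.T @ query
```

A query is compared with documents by projecting it with `fold_in` and measuring, for example, the cosine between the result and the columns of `D_k`.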
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
– U and Λ are eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
– U is the left (column) eigenvectors
– V is the right (row) eigenvectors
– Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
– SVD: A is directly the data, e.g. a term-by-document matrix
– PCA: A is the covariance matrix $A = X^T X$, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
– Indexing: collecting terms
– Use stop list: eliminate "meaningless" words
– Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
– $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
– $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
– $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
– $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
– authorities contain high-quality information
– hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs linking to authorities]
Power Iteration
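• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}$ and $h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a^{(k)} = L^T h^{(k-1)}$ and $h^{(k)} = L a^{(k)}$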
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
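The link between HITS and SVD can be checked numerically. The sketch below runs the power iteration on a small hypothetical web graph (the 4-page matrix `L` is made up for illustration) and compares the result with the first singular vectors returned by `np.linalg.svd`.

```python
import numpy as np

def hits(L, iterations=100):
    """Power iteration for HITS; L[i, j] = 1 if page i links to page j."""
    h = np.ones(L.shape[0])
    for _ in range(iterations):
        a = L.T @ h                 # authorities are pointed to by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                   # hubs point to good authorities
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical graph: pages 0, 1 and 3 link out; pages 2 and 3 receive links.
L = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

h, a = hits(L)
U, s, Vt = np.linalg.svd(L)
first_left = np.abs(U[:, 0])    # singular vectors are defined up to sign
first_right = np.abs(Vt[0])
```

For this graph the hub scores `h` match `first_left` and the authority scores `a` match `first_right`. The iteration assumes the largest singular value is unique, which holds here.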
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column $w_i$ represents one basis vector
• H: coordinates of V projected to W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
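• Cost function: Euclidean distance $\|V - WH\|^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2$
• Multiplicative update (Lee and Seung): $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$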
Multiplicative Update Algorithm
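• Cost function: divergence $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
– Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
– A and B can then be regarded as normalized probability distributions
• Multiplicative update (Lee and Seung): $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence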
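A minimal sketch of the multiplicative update for the Euclidean cost, following the Lee and Seung rules cited in the reference slide. The function name `nmf`, the small constant `eps` that guards against division by zero, and the synthetic rank-3 data are assumptions made for illustration.

```python
import numpy as np

def nmf(V, k, iterations=200, eps=1e-9, seed=0):
    """Factor a non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(iterations):
        # Element-wise updates keep W and H non-negative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 15))   # non-negative data of rank 3
W, H, errors = nmf(V, k=3)
```

Because each update multiplies by a non-negative ratio, no projection step is needed to keep the factors non-negative. The result depends on the random initialization, so different seeds may give different factors.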
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture Gaussian parameter estimation
• pLSA
– Motivation
– Derivation & geometry properties
– Applications
• LDA
– Motivation: why to add a hyper parameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x) = p(x|\theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
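• Maximum likelihood will find the best model parameters, those that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood will find the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta p(x_1, x_2, \dots, x_n | \theta)$
• Predictive distribution: $p(x_{n+1} | x_1, \dots, x_n) \approx p(x_{n+1} | \theta_{ML})$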
IID ndash Independent Identically Distributed
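• IID means the samples are drawn independently from the same distribution: $p(x_1, x_2, \dots, x_n | \theta) = \prod_{i=1}^{n} p(x_i | \theta)$
• The problem is considerably simplified as $\theta_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i | \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta)$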
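As a worked example of maximum likelihood under the IID assumption, the sketch below fits a one-dimensional Gaussian, for which the ML estimates have a closed form (sample mean and biased sample standard deviation). The Gaussian model and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma^2) over IID samples x."""
    n = x.size
    return (-0.5 * n * np.log(2 * np.pi * sigma ** 2)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Closed-form ML estimates for a Gaussian.
mu_ml = x.mean()
sigma_ml = np.sqrt(np.mean((x - mu_ml) ** 2))

best = log_likelihood(x, mu_ml, sigma_ml)
```

Any other parameter setting, including the values used to generate the data, gives a log likelihood no larger than `best` on this sample.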
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set D, latent variables X
• Likelihood: $p(D|\theta) = \int p(D, X|\theta)\, dX$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D|\theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of latent variables according to the posterior given data
• M step: maximize likelihood as if latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
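• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ: $L(\theta) = \log p(D|\theta) = \log \int p(X, D|\theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$$L(\theta) = \log \int q(X) \frac{p(X, D|\theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D|\theta)}{q(X)}\, dX = \int q(X) \log p(X, D|\theta)\, dX + H[q] = F(q, \theta)$$
• where H[q] is the entropy of q(X)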
The E and M Steps of EM
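• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(X, D|\theta)\, dX + H[q]$
• EM alternates between
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed: $q^{(k)} = \arg\max_q F(q, \theta^{(k-1)})$
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta)$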
The E Step
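• E step: for fixed θ, $F(q, \theta) = L(\theta) - KL(q(X) \,\|\, p(X|D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $q(X) = p(X|D, \theta)$
• So the E step simply sets $q^{(k)}(X) = p(X|D, \theta^{(k-1)})$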
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed: $\theta^{(k)} = \arg\max_\theta F(q^{(k)}, \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D|\theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood: $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
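The monotonic behaviour of EM can be observed on a small example. The sketch below runs EM for a two-component one-dimensional Gaussian mixture and records the log likelihood before each update. The mixture model, the initialization from the data range, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, iterations=50):
    """EM for a two-component 1-D Gaussian mixture; also returns the log-likelihood trace."""
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    trace = []
    for _ in range(iterations):
        # E step: posterior responsibility of each component for each point.
        joint = pi * normal_pdf(x[:, None], mu, var)       # shape (n, 2)
        trace.append(np.log(joint.sum(axis=1)).sum())       # log likelihood at current parameters
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the responsibility-weighted data.
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, trace

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
pi, mu, var, trace = em_gmm(x)
```

The values in `trace` form a non-decreasing sequence, which is the property stated on the slide. EM only guarantees a local maximum, so a poor initialization can still end in a poor solution.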
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– Graphical model is so complex, even with a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– Graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C|A, B)\, p(D|B, C)\, p(E|C, D)$
• In general $p(X_1, \dots, X_n) = \prod_{i=1}^{n} p(X_i | X_{pa(i)})$
• where pa(i) are the parents of node i
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same queries vary due to personal styles
• Latent semantic indexing
– Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • Java could mean "coffee" and also the "PL Java"
  • Cricket is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA plate diagram d → z → w with plates Nd and M, and its unrolled form d → z1 … zN → w1 … wN]
z1 … zN are variables, $z_i \in [1, K]$; K is the number of latent topics
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, each with topics z1 … zN and words w1 … wN, together with its likelihood]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability: $p(d, z, w) = p(d)\, p(z|d)\, p(w|z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_z p(z|d)\, p(w|z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w|d) = \sum_z p(w|z)\, p(z|d)$
• This is similar to the matrix decomposition if we consider each discrete distribution as a vector
$p(w|d) = Z_{V \times k}\, p(z|d)$, where column z of Z is p(w|z)
• With many documents, we hope to find latent topics as a common basis
pLSA ndash Objective Function
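• pLSA tries to maximize the log likelihood $L = \sum_d \sum_w n(d, w) \log p(d, w) = \sum_d \sum_w n(d, w) \log \left[ p(d) \sum_z p(w|z)\, p(z|d) \right]$, where n(d, w) is the count of word w in document d
• Due to the summation over z inside the log, we have to resort to EM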
EM Steps
• E-Step
– Expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
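• The E-Step: $p(z|d, w) = \frac{p(w|z)\, p(z|d)}{\sum_{z'} p(w|z')\, p(z'|d)}$
• The M-Step: $p(w|z) \propto \sum_d n(d, w)\, p(z|d, w)$, $p(z|d) \propto \sum_w n(d, w)\, p(z|d, w)$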
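The E and M steps of pLSA can be sketched as follows, using the standard updates from Hofmann's paper cited in the reference slide. The function name `plsa`, the toy count matrix, and the small constant added to avoid division by zero are assumptions made for illustration.

```python
import numpy as np

def plsa(counts, k, iterations=100, seed=0, eps=1e-12):
    """Fit pLSA by EM. counts[d, w] = n(d, w).

    Returns p(w|z) with shape (k, V), p(z|d) with shape (D, k), and the log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((k, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    trace = []
    for _ in range(iterations):
        # E step: p(z|d,w) is proportional to p(w|z) p(z|d); arrays have shape (D, k, V).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)                        # p(w|d), shape (D, V)
        trace.append(np.sum(counts * np.log(p_w_d + eps)))
        post = joint / (p_w_d[:, None, :] + eps)
        # M step: weight the posteriors by the observed counts and renormalize.
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, trace

# Toy corpus: documents 0-1 mostly use words 0-1, documents 2-3 mostly use words 2-3.
counts = np.array([
    [4, 3, 0, 0],
    [5, 2, 0, 1],
    [0, 0, 3, 4],
    [0, 1, 5, 2],
], dtype=float)
p_w_z, p_z_d, trace = plsa(counts, k=2)
```

The trace drops the constant term involving p(d), which is assumed uniform and does not affect the updates. The dense (D, k, V) array is only practical for small corpora; a sparse implementation would loop over the non-zero counts.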
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U matrix related to P(z|d) (doc to aspect)
– V matrix related to P(w|z) (aspect to term)
– Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X., Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T., Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z|d_i)$
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
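• Definition
$$p(x_1, x_2, \dots, x_k \,|\, \alpha_1, \alpha_2, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1} \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i \ge 0$$
• The density is zero outside this open (K − 1)-dimensional simplex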
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$
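The effect of the parameters can be explored by sampling. The sketch below uses NumPy's `Generator.dirichlet` with K = 3. The helper `sample_dirichlet`, the sample size, and the choice of equal α_i = α_0 / 3 for the symmetric cases are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet(alpha, size):
    """Draw `size` points on the (K-1)-simplex from Dirichlet(alpha)."""
    return rng.dirichlet(alpha, size=size)

peaked = sample_dirichlet([6, 2, 2], 5000)        # mean pulled toward the first corner
sparse = sample_dirichlet([0.1 / 3] * 3, 5000)    # alpha_0 = 0.1: samples sit near the corners
smooth = sample_dirichlet([10 / 3] * 3, 5000)     # alpha_0 = 10: samples gather near the center

mean_peaked = peaked.mean(axis=0)                 # approaches alpha / alpha_0
largest_sparse = sparse.max(axis=1).mean()        # average size of the largest coordinate
largest_smooth = smooth.max(axis=1).mean()
```

Each sample is a valid topic mixture (non-negative, summing to one). A small α_0 gives documents dominated by a single topic, while a large α_0 gives nearly uniform mixtures.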
The LDA Model
[Figure: LDA model with topic nodes z1–z4 and word nodes w1–w4 for each document]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words $w_n$
    – Choose a topic $z_n$ ~ Multinomial(θ)
    – Choose a word $w_n$ from $p(w_n | z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
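The generative process can be sketched directly. In the code below, `alpha` is the Dirichlet parameter and `beta` is a topic-by-word probability table. The function name `generate_document` and the toy values (three topics over a six-word vocabulary) are assumptions made for illustration.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet parameter, shape (K,)
    beta:  topic-word probabilities, shape (K, V); row k is p(w | z = k)
    """
    theta = rng.dirichlet(alpha)                        # topic mixture of this document
    z = rng.choice(len(alpha), size=n_words, p=theta)   # one topic per word position
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
# Each toy topic puts all its mass on its own pair of words.
beta = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
])
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
```

Because the toy topics do not overlap, every word index reveals its topic (`w // 2` equals `z`). With real topics the assignments are hidden, which is why inference is needed.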
Joint Probability
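• Given parameters α and β: $p(\theta, z, w \,|\, \alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n, \beta)$
where θ is the topic mixture of the document, z its topic assignments and w its N words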
Likelihood
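• Joint probability: $p(\theta, z, w \,|\, \alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n, \beta)$
• Marginal distribution of a document: $p(w \,|\, \alpha, \beta) = \int p(\theta|\alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta)\, p(w_n|z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \,|\, \alpha, \beta) = \prod_{d=1}^{M} p(w_d \,|\, \alpha, \beta)$
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality, as in EM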
Inference
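• In the E-Step, we need to compute the posterior distribution of the hidden variables: $p(\theta, z \,|\, w, \alpha, \beta) = \frac{p(\theta, z, w \,|\, \alpha, \beta)}{p(w \,|\, \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach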
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
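• E-Step: variational inference, repeat until convergence (updates as in Blei et al.): $\phi_{ni} \propto \beta_{i w_n} \exp\left(\Psi(\gamma_i) - \Psi\left(\sum_j \gamma_j\right)\right)$, $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
• M-Step: parameter estimation
– β: $\beta_{ij} \propto \sum_d \sum_n \phi_{dni}\, w_{dn}^j$
– α: can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus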
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A: $A = U \Sigma V^T$
• Approximate A by the rank-k truncation $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$
  - $U_k$: orthogonal basis that we use to approximate all the documents
  - $D_k$: column j holds the coordinates of document j in the new basis
  - $D_k = U_k^T A$ is the projection of A onto the subspace spanned by $U_k$
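The three steps above can be sketched with NumPy on a small made-up term-by-document matrix. The matrix values and the helper name `lsi_truncate` are assumptions for illustration only; they do not come from the slides.

```python
import numpy as np

def lsi_truncate(A, k):
    """Rank-k LSI factors: A is approximated by U_k @ D_k, with D_k = Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]                      # orthonormal basis for the documents
    D_k = s[:k, None] * Vt[:k, :]       # column j: coordinates of document j
    return U_k, D_k, s

# Hypothetical 5-term x 4-document count matrix.
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
U_k, D_k, s = lsi_truncate(A, k=2)
A_k = U_k @ D_k                         # rank-2 approximation of A
```

The Frobenius error of `A_k` equals the root of the sum of the squared discarded singular values, which is why keeping the largest k singular values gives the best rank-k approximation.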
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: $A = U \Lambda U^T$
  - U and Λ are the eigenvectors and eigenvalues
• SVD: $A = U \Lambda V^T$
  - U holds the left (column) singular vectors
  - V holds the right (row) singular vectors
  - Λ holds the singular values, which equal the eigenvalues when A is symmetric positive semidefinite
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference in what A is in PCA and in SVD
  - SVD: A is the data itself, e.g. the term-by-document matrix
  - PCA: A is the covariance matrix $A = X^T X$, where each row of X is a (mean-centered) sample
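A quick numerical check of this relationship, as a sketch on synthetic data (the random matrix `X` is an assumption for illustration): the eigenvalues of $X^T X$ should match the squared singular values of X, and the eigenvectors should match the right singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # hypothetical data: 50 samples, 4 features
X = X - X.mean(axis=0)                  # center so X^T X is a scaled covariance

# PCA route: eigendecomposition of the symmetric matrix X^T X.
evals, evecs = np.linalg.eigh(X.T @ X)  # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]

# SVD route: decompose the data matrix directly.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
```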
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing
   - Indexing: collecting terms
   - Use a stop list: eliminate "meaningless" words
   - Stemming
2. Construction of the term-by-document matrix, sparse matrix storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep"
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD
Similarity Measures
• Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  - $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  - $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  - $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V, rescaled by $\Sigma^{-1/2}$
  - $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U, rescaled by $\Sigma^{-1/2}$
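The three similarity identities can be verified numerically. The sketch below uses a small made-up 4-term by 3-document matrix; the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical term-by-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

term_coords = U * s                 # U Sigma: one row per term
doc_coords = Vt.T * s               # V Sigma: one row per document
term_half = U * np.sqrt(s)          # U Sigma^(1/2)
doc_half = Vt.T * np.sqrt(s)        # V Sigma^(1/2)

term_term = term_coords @ term_coords.T   # should reproduce A A^T
doc_doc = doc_coords @ doc_coords.T       # should reproduce A^T A
term_doc = term_half @ doc_half.T         # should reproduce A
```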
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
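• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
$a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}, \qquad h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise
• Now $a^{(k)} = L^T h^{(k-1)}, \qquad h^{(k)} = L a^{(k)}$, with the scores normalized after each step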
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
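The link between HITS and SVD can be checked on a toy graph. This is a minimal sketch, assuming the convention that `L[i, j] = 1` when page i links to page j; the 4-page graph and the function name `hits` are made up for illustration.

```python
import numpy as np

def hits(L, iters=200):
    """Power iteration for HITS: returns (hub scores, authority scores), unit norm."""
    n = L.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        a = L.T @ h                 # authorities collect scores from hubs linking to them
        a /= np.linalg.norm(a)
        h = L @ a                   # hubs collect scores from the authorities they link to
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical 4-page web graph: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
h, a = hits(L)
U, s, Vt = np.linalg.svd(L)         # first singular vectors, for comparison
```

Singular vectors are only defined up to sign, so the comparison with `U[:, 0]` and `Vt[0]` has to be made on absolute values.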
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k basis vectors; each column $w_i$ represents one basis vector
• H: coordinates of V projected onto W
$v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
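• Cost function: Euclidean distance
$\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2$
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
$D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, i.e. when A and B can be regarded as normalized probability distributions
• Multiplicative update
$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• PLSA is NMF with KL divergence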
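A minimal sketch of the multiplicative updates for the Euclidean cost, following the Lee and Seung rules cited in this deck. The function name `nmf_euclidean`, the small constant `eps` added to the denominators to avoid division by zero, and the synthetic matrix `V` are assumptions of this example.

```python
import numpy as np

def nmf_euclidean(V, k, iters=200, seed=0, eps=1e-12):
    """Multiplicative updates minimizing ||V - W H||^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # update coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)      # update basis vectors
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

# Synthetic non-negative data of rank at most 3.
rng = np.random.default_rng(1)
V = rng.random((6, 3)) @ rng.random((3, 8))
W, H, errors = nmf_euclidean(V, k=3)
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the recorded error should not increase from one iteration to the next.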
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
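• Maximum likelihood finds the model parameters under which the observed data is "most likely" to have been generated
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ): $\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} p(x_1, \ldots, x_n \mid \theta)$
• Predictive distribution: $p(x \mid x_1, \ldots, x_n) \approx p(x \mid \hat{\theta}_{ML})$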
IID ndash Independent Identically Distributed
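• IID means that the samples are drawn independently from the same distribution: $p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified, as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$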
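As a small worked case of maximum likelihood under the i.i.d. assumption, the sketch below fits a univariate Gaussian, for which the ML estimates have a closed form (sample mean and the standard deviation that divides by n). The Gaussian model, the synthetic samples, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """log L(mu, sigma) = sum_i log N(x_i | mu, sigma^2) for i.i.d. samples."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic samples

mu_ml = x.mean()       # closed-form ML estimate of the mean
sigma_ml = x.std()     # ML estimate of the standard deviation (divides by n)
best = gaussian_log_likelihood(x, mu_ml, sigma_ml)
```

Any other parameter setting, including the values used to generate the data, gives a log likelihood no larger than `best` on this sample.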
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe a complex model: Gaussian Mixture Model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: $\mathcal{D} = \{x_1, \ldots, x_n\}$
• Likelihood: $p(\mathcal{D} \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_{\theta} p(\mathcal{D} \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps:
  - E step: fill in values of the latent variables according to their posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
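Jensen's Inequality
• For $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$ and any $x_i > 0$: $\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$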
Lower Bounding the Log Likelihood
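• Observed data $\mathcal{D} = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ:
$L(\theta) = \log p(\mathcal{D} \mid \theta) = \log \int p(X, \mathcal{D} \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$L(\theta) = \log \int q(X) \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, \mathcal{D} \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q] \equiv F(q, \theta)$
• where H[q] is the entropy of q(X)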
The E and M Steps of EM
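• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, \mathcal{D} \mid \theta)\, dX + H[q]$
• EM alternates between
  - E step: optimize $F(q, \theta)$ with respect to the distribution over hidden variables, holding the parameters fixed: $q^{(k)} = \arg\max_{q} F(q, \theta^{(k-1)})$
  - M step: maximize $F(q, \theta)$ with respect to the parameters, holding the hidden distribution fixed: $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta)$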
The E Step
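• E step: for fixed θ,
$F(q, \theta) = \int q(X) \log \frac{p(X \mid \mathcal{D}, \theta)\, p(\mathcal{D} \mid \theta)}{q(X)}\, dX = L(\theta) - KL\big(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL\big(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)\big) = 0$
• So the E step simply sets $q^{(k)}(X) = p(X \mid \mathcal{D}, \theta^{(k-1)})$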
The M Step
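• M step: maximize $F(q, \theta)$ with respect to the parameters, holding the hidden distribution q fixed:
$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, \mathcal{D} \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically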
EM Never Decreases the Likelihood
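• The E and M steps together never decrease the log likelihood:
$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings $F(q, \theta)$ up to the likelihood $L(\theta)$
• The M step maximizes $F(q, \theta)$ with respect to θ
• $F(q, \theta) \le L(\theta)$ by Jensen, or equivalently from the non-negativity of KL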
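The monotonicity property can be observed on the Gaussian mixture model mentioned earlier as a motivating latent variable model. The following is a minimal sketch for a one-dimensional mixture; the function name `em_gmm_1d`, the initialization scheme, and the synthetic data are assumptions of this example rather than anything prescribed by the slides.

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100, seed=0):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log-likelihoods."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)   # initialize means at random data points
    var = np.full(k, x.var())
    log_liks = []
    for _ in range(iters):
        # pi_j * N(x_i | mu_j, var_j) for every sample i and component j, shape (n, k)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        log_liks.append(float(np.log(dens.sum(axis=1)).sum()))
        # E step: posterior responsibility of each component for each sample
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form maximizers given the responsibilities
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, log_liks

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
pi, mu, var, log_liks = em_gmm_1d(x)
```

The list `log_liks` holds $L(\theta)$ at the start of each iteration, so it should be non-decreasing up to floating-point error.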
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is so complex, even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general: $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
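To make the factorization concrete, the sketch below builds the five-node network above with binary variables and random conditional probability tables, then checks that the product of conditionals defines a valid joint distribution. The binary domain and the random table values are assumptions for illustration.

```python
import itertools
import numpy as np

# Parent sets taken from p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["B", "C"], "E": ["C", "D"]}

rng = np.random.default_rng(0)
# cpt[node][parent_values] = P(node = 1 | parents = parent_values), random for illustration
cpt = {node: {pv: rng.random() for pv in itertools.product([0, 1], repeat=len(pa))}
       for node, pa in parents.items()}

def joint(assign):
    """Joint probability of a full assignment as a product of local conditionals."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assign[q] for q in pa)]
        p *= p_one if assign[node] == 1 else 1.0 - p_one
    return p

nodes = list(parents)
total = sum(joint(dict(zip(nodes, vals)))
            for vals in itertools.product([0, 1], repeat=len(nodes)))
n_params = sum(len(table) for table in cpt.values())
```

The factorized model needs 14 free parameters here, against 31 for an unrestricted joint table over five binary variables, which illustrates why decoupling into conditionals is easier to work with.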
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the documents share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • "Java" could mean "coffee" and also the programming language Java
    • "Cricket" is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
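[Figure: pLSA graphical model. Plate notation with nodes d, z, w, repeated over the N_d words of each of the M documents; unrolled form with d, topics z1, z2, z3, …, zN and words w1, w2, w3, …, wN]
• $z_1, \ldots, z_N$ are variables with $z_i \in \{1, \ldots, K\}$, where K is the number of latent topics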
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, where document di has topics z1, z2, …, zNi and words w1, w2, …, wNi]
• $p(w \mid z=1), p(w \mid z=2), \ldots, p(w \mid z=K)$ are shared by all documents
Joint Probability vs Likelihood
• Joint probability: $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• p(d) is assumed to be uniform
Document Decomposition
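• Each document can be decomposed as $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
$p(w \mid d) = Z_{V \times K}\, p(z \mid d)$, where column z of Z is the distribution $p(w \mid z)$ over the V vocabulary words
• With many documents, we hope to find latent topics as a common basis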
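pLSA - Objective Function
• pLSA tries to maximize the log likelihood
$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d) \Big]$
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM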
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
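• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w), \qquad p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$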
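The pLSA E and M steps can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: the count matrix is stored as words by documents, p(d) is dropped because it is constant, the log likelihood is guarded against log(0) with a tiny offset, and the toy counts and the name `plsa_em` are made up.

```python
import numpy as np

def plsa_em(N, K, iters=100, seed=0):
    """EM for pLSA. N is a (V, M) matrix of counts n(d, w).

    Returns p(w|z) with shape (V, K), p(z|d) with shape (K, M), and the
    log-likelihood (without the constant p(d) term) at each iteration.
    """
    rng = np.random.default_rng(seed)
    V, M = N.shape
    p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0)   # columns: p(w|z)
    p_z_d = rng.random((K, M)); p_z_d /= p_z_d.sum(axis=0)   # columns: p(z|d)
    log_liks = []
    for _ in range(iters):
        p_w_d = p_w_z @ p_z_d + 1e-300                       # p(w|d), guarded against log(0)
        log_liks.append(float(np.sum(N * np.log(p_w_d))))
        # E step: posterior p(z|d,w), stored with shape (V, K, M)
        post = p_w_z[:, :, None] * p_z_d[None, :, :] / p_w_d[:, None, :]
        # M step: re-estimate both factors from expected counts n(d,w) p(z|d,w)
        expected = N[:, None, :] * post
        p_w_z = expected.sum(axis=2); p_w_z /= p_w_z.sum(axis=0)
        p_z_d = expected.sum(axis=0); p_z_d /= p_z_d.sum(axis=0)
    return p_w_z, p_z_d, log_liks

# Hypothetical counts: 5 words x 4 documents, with two rough word groups.
N = np.array([[4.0, 5.0, 0.0, 0.0],
              [5.0, 3.0, 0.0, 1.0],
              [0.0, 0.0, 6.0, 4.0],
              [1.0, 0.0, 4.0, 5.0],
              [2.0, 2.0, 2.0, 2.0]])
p_w_z, p_z_d, log_liks = plsa_em(N, K=2)
```

The product `p_w_z @ p_z_d` is the matrix view of the document decomposition: each column of p(w|d) is a mixture of the shared topic columns.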
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix: related to P(z|d) (doc to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions $p(z \mid d_i)$
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model; each document d1, d2, …, dm has its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
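• Definition
$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{K} x_i = 1,\; x_i > 0$
• The density is zero outside this open (K - 1)-dimensional simplex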
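Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: density plots on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: density plots on the simplex for $\alpha_0$ = 0.1, 1, 10]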
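The Dirichlet properties listed above can be explored numerically. This sketch uses NumPy's Dirichlet sampler and a hand-written log density; the helper name `dirichlet_logpdf` and the particular concentration values for the "sparse" and "smooth" cases are assumptions chosen for illustration.

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x in the open simplex."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_norm = lgamma(alpha.sum()) - sum(lgamma(a) for a in alpha)
    return log_norm + float(np.sum((alpha - 1.0) * np.log(x)))

rng = np.random.default_rng(0)
samples = rng.dirichlet([6.0, 2.0, 2.0], size=20000)   # one of the alpha settings shown
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=5000)     # small equal alpha_i
smooth = rng.dirichlet([10.0, 10.0, 10.0], size=5000)  # large equal alpha_i
```

Every sample is a K-tuple of non-negative numbers summing to one, and its mean is $\alpha / \alpha_0$. With small equal $\alpha_i$ the mass concentrates near the corners of the simplex (one dominant topic); with large equal $\alpha_i$ it concentrates near the center.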
The LDA Model
[Figure: unrolled LDA model for three documents, each with topics z1, …, z4 and words w1, …, w4, sharing the topic-word parameter β]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words $w_n$
    • Choose a topic $z_n$ ~ Multinomial(θ)
    • Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
The LDA Model
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words $w_n$
    • Choose a topic $z_n$ ~ Multinomial(θ)
    • Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
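The generative process can be written down almost line by line. The sketch below samples one document; the two-topic, four-word setup, the values of `alpha` and `beta`, and the function name `generate_document` are assumptions for illustration. `beta[k]` plays the role of $p(w \mid z = k)$.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)                            # topic proportions for this document
    z = rng.choice(len(alpha), size=n_words, p=theta)       # one topic per word position
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # word given its topic
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
# Hypothetical topics with disjoint vocabularies: topic 0 emits words 0-1, topic 1 emits words 2-3.
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5]])
theta, z, w = generate_document(alpha, beta, n_words=50, rng=rng)
```

Because the two toy topics have disjoint vocabularies, each sampled word reveals its topic, which makes the sampler easy to sanity-check.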
Joint Probability
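• Given parameters α and β:
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(\theta \mid \alpha)$ is the Dirichlet density, $p(z_n \mid \theta) = \theta_{z_n}$ and $p(w_n \mid z_n, \beta) = \beta_{z_n w_n}$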
Likelihood
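• Joint probability: $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(\mathcal{D} \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$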
Inference
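• The log likelihood of the corpus can be computed by summing over the documents: $\log p(\mathcal{D} \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality, as in EM, gives a lower bound on each term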
Inference
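• In the E-Step, we need to compute the posterior distribution of the hidden variables: $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach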
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence:
$\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\big(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)$
• Maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is set directly to $p(X \mid \mathcal{D}, \theta)$, which makes KL = 0
• In VBEM, it is intractable to compute $p(X \mid \mathcal{D}, \theta)$. Instead, VBEM approximates $p(X \mid \mathcal{D}, \theta)$ with a variational distribution q(X) by minimizing $KL\big(q(X) \,\|\, p(X \mid \mathcal{D}, \theta)\big)$
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound $\log p(w \mid \alpha, \beta)$ by a function $L(\gamma, \phi; \alpha, \beta)$
  - Repeat until convergence
    • E: maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to the variational parameters γ and φ
    • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
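• E-Step: variational inference, repeat until convergence
$\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
where Ψ is the digamma function
• M-Step: parameter estimation
$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$
where $w_{dn}^{j} = 1$ if the n-th word of document d is vocabulary word j, and 0 otherwise
• The update for α can be implemented using the Newton-Raphson method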
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA model for three documents, each with topics z1, …, z4 and words w1, …, w4, sharing the topic-word parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram over M images with N_d patches each; nodes π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
Original representation (Dim = 1 million):
        Term1   Term2   Term3   Term4   …   TermN
Img1    1       2       0       0       …   2
After topic projection (Dim = 200):
        f1      f2      …   fM
Img1    0.2     0.1     …   0.03
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a bar chart of low-dimensional topic features + a few words (10 words)]
Orthogonal Decomposition
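$p = Xw + \varepsilon$, written out element-wise:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X = (X1, X2, X3, …, Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a bar chart of low-dimensional topic features + a few words (10 words)]
• Similarity of two documents p and q, for orthonormal base vectors: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$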
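The decomposition can be illustrated with a small numerical sketch. It assumes that the base vectors are orthonormal and that w is obtained by orthogonal projection; under those assumptions the inner product of two documents splits exactly into a low-dimensional part and a residual part. The random base matrix and vectors are synthetic, and the helper name `decompose` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W, k = 30, 5                                   # vocabulary size, reduced dimension
X, _ = np.linalg.qr(rng.normal(size=(W, k)))   # orthonormal base vectors (assumption)

def decompose(p, X):
    """Split p into a low-dimensional part X @ w and an orthogonal residual."""
    w = X.T @ p            # low-dimensional representation
    eps = p - X @ w        # residual, orthogonal to every column of X
    return w, eps

p, q = rng.random(W), rng.random(W)            # two synthetic document vectors
w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)
similarity = w_p @ w_q + eps_p @ eps_q         # reconstructs p^T q
```

Keeping only the few largest entries of each residual (the "few words") would turn the last identity into an approximation, trading accuracy for a much smaller index.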
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  - a topic-specific distribution
  - a document-specific distribution
  - a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query is represented as low-dimensional topic features + a few words; the LSH index returns candidate documents (e.g. Doc 300, Doc 401, …), which are then re-ranked using the document metadata]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
1. Reply reconstruction
2. Network construction
3. Expert finding
• Methods: HITS, PageRank, …
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvalues
bull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)
1 Document file preparation preprocessingndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space V
VΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
• PageRank can be computed once, offline; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that
Vn×m ≈ Wn×k Hk×m
• V: column matrix, each column is a data sample (n-dimensional)
• W: k bases, each column w_i represents one basis vector
• H: coordinates of V projected to W
v_j ≈ Wn×k h_j
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
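• Cost function: Euclidean distance ‖V − WH‖² = Σ_ij (V_ij − (WH)_ij)²
• Multiplicative update: H ← H ∘ (WᵀV) / (WᵀWH), W ← W ∘ (VHᵀ) / (WHHᵀ), where ∘ and / are element-wise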
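A minimal sketch of the Lee and Seung multiplicative updates for the Euclidean cost ‖V − WH‖², namely H ← H ∘ (WᵀV)/(WᵀWH) and W ← W ∘ (VHᵀ)/(WHHᵀ) with element-wise product and division. The helper names, the small constant `eps` that guards against division by zero, the random initialization and the toy matrix are assumptions made for illustration.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix V into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = []
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors

# Toy non-negative matrix (4 dimensions, 3 samples) of rank 2.
V = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 1.0, 0.0],
    [6.0, 2.0, 0.0],
]
W, H, errors = nmf(V, k=2)
print("first error:", errors[0], "last error:", errors[-1])
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative without any explicit projection step.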
Multiplicative Update Algorithm
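• Cost function: divergence D(A‖B) = Σ_ij (A_ij log(A_ij / B_ij) − A_ij + B_ij), with A = V and B = WH
– Reduces to the Kullback-Leibler divergence when Σ_ij A_ij = Σ_ij B_ij = 1
– A and B can then be regarded as normalized probability distributions
• Multiplicative update: H_aj ← H_aj (Σ_i W_ia V_ij / (WH)_ij) / Σ_i W_ia, W_ia ← W_ia (Σ_j H_aj V_ij / (WH)_ij) / Σ_j H_aj
• pLSA is NMF with KL divergence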
NMF vs PCA
• m = 2429 faces, n = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture Gaussian parameter estimation
• pLSA
– Motivation
– Derivation & geometry properties
– Applications
• LDA
– Motivation: why add a hyper-parameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)ᵀ denote a data point and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = f_θ(x) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ): θ_ML = argmax_θ p(x1, x2, …, xn | θ)
• Predictive distribution: p(x | x1, …, xn) ≈ p(x | θ_ML)
IID ndash Independent Identically Distributed
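• IID means the samples are drawn independently from the same distribution: p(x1, x2, …, xn | θ) = Π_i p(xi | θ)
• The problem is considerably simplified as L(θ) = Π_i p(xi | θ)
• Usually the log likelihood is used: log L(θ) = Σ_i log p(xi | θ)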
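As a small worked example of maximum likelihood under the IID assumption, the sketch below fits a one-dimensional Gaussian. The Gaussian model, the function names and the data values are illustrative assumptions; the closed-form estimates (sample mean and the 1/n sample variance) are the standard maximizers of the Gaussian log likelihood.

```python
import math

def gaussian_log_likelihood(data, mu, sigma2):
    """log L(theta) = sum_i log p(x_i | mu, sigma2) for IID Gaussian samples."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma2))

def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    return mu, sigma2

data = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up samples
mu_ml, sigma2_ml = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_ml, sigma2_ml)
print("mu_ML =", mu_ml, "sigma2_ML =", sigma2_ml, "log-likelihood =", best)

# Any other parameter setting gives a log likelihood that is no larger.
print(gaussian_log_likelihood(data, mu_ml + 0.3, sigma2_ml) <= best)
print(gaussian_log_likelihood(data, mu_ml, 2.0 * sigma2_ml) <= best)
```

Working with the log turns the product over samples into a sum, which is both easier to differentiate and numerically safer than multiplying many small densities.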
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
– To describe complex models: Gaussian Mixture Model
– To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D = {x1, …, xn}
• Likelihood: L(θ) = p(D | θ)
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that θ_ML = argmax_θ L(θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated and hard-to-optimize function
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
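For the concave logarithm, Jensen's inequality takes the following standard form, which is what the lower bound on the log likelihood relies on (the weights α_i and values x_i are generic symbols, not notation from the deck):

```latex
\log \sum_i \alpha_i x_i \;\ge\; \sum_i \alpha_i \log x_i,
\qquad \alpha_i \ge 0,\quad \sum_i \alpha_i = 1,\quad x_i > 0 .
```

Equality holds when all x_i with non-zero weight are equal.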
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• where H[q] is the entropy of q(X)
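Written out, the bound described on this slide is the standard free-energy decomposition. The following LaTeX reconstruction uses the symbols F(q, θ) and L(θ) that appear on the later EM slides and is a sketch of the usual derivation:

```latex
\begin{aligned}
\mathcal{L}(\theta) = \log p(D \mid \theta)
  &= \log \int q(X)\,\frac{p(X, D \mid \theta)}{q(X)}\,dX \\
  &\ge \int q(X)\,\log \frac{p(X, D \mid \theta)}{q(X)}\,dX \;=\; F(q, \theta),\\
F(q, \theta) &= \int q(X)\,\log p(X, D \mid \theta)\,dX + H[q].
\end{aligned}
```

For discrete latent variables the integrals become sums.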
The E and M Steps of EM
• The lower bound on the log likelihood is given by F(q, θ)
• EM alternates between
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
The E Step
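• E step: for fixed θ, F(q, θ) = L(θ) − KL(q(X) ‖ p(X | D, θ))
• The second term is the Kullback-Leibler divergence
• This means that for fixed θ, F is bounded above by L, and achieves that bound when q(X) = p(X | D, θ)
• So the E step simply sets q(X) to the posterior p(X | D, θ) under the current parameters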
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed: θ ← argmax_θ F(q, θ) = argmax_θ ∫ q(X) log p(X, D | θ) dX
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
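The E and M steps can be made concrete with a two-component Gaussian mixture in one dimension. This is a minimal sketch under assumptions: the data, the initial parameters, the function names and the small variance floor are all invented for the example. The recorded log likelihood illustrates that EM never decreases it.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_gmm(data, weights, means, variances, iterations=30):
    weights, means, variances = list(weights), list(means), list(variances)
    k = len(means)
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        # E step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate parameters as if the assignments were observed
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history

# Made-up data with two well-separated groups around -2 and 3.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.9, 3.1, 2.8]
weights, means, variances, history = em_gmm(
    data, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0]
)
print("means:", [round(m, 3) for m in means])
print("log likelihood, first and last:", history[0], history[-1])
```

Here the E step computes the posterior q(X) exactly, which is possible because the latent variable is a single discrete component label per data point.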
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model is so complex, even with a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general p(X1, …, Xn) = Π_i p(Xi | X_pa(i))
• where pa(i) are the parents of node i
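The factorization p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D) can be checked numerically. The sketch below assumes binary variables and uses made-up conditional probability tables; only the parent structure comes from the slide.

```python
from itertools import product

# Parent sets taken from the factorization on the slide.
parents = {
    "A": (),
    "B": (),
    "C": ("A", "B"),
    "D": ("B", "C"),
    "E": ("C", "D"),
}

# cpt[node][parent values] = P(node = 1 | parents); the numbers are illustrative.
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "D": {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8},
    "E": {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95},
}

def joint(assignment):
    """Joint probability as the product of each node's conditional given its parents."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

names = "ABCDE"
total = sum(joint(dict(zip(names, values))) for values in product((0, 1), repeat=5))
print("sum over all 32 assignments:", total)
print("p(A=1, B=1, C=1, D=0, E=1):", joint({"A": 1, "B": 1, "C": 1, "D": 0, "E": 1}))
```

The full joint table over five binary variables has 31 free parameters, while the factorized form above needs only 1 + 1 + 4 + 4 + 4 = 14, which is the practical payoff of the decomposition.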
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same queries vary due to personal styles
• Latent semantic indexing
– Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
• Java could mean "coffee" and also the programming language Java
• Cricket is a "game" and also an "insect"
– Synonymy
• "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
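pLSA
[Figure: pLSA in plate notation (d → z → w, with plates Nd and M) and the unrolled model for one document (d → z1, z2, z3, …, zN → w1, w2, w3, …, wN)]
• z1, …, zN are variables, zi ∈ [1, K]; K is the number of latent topics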
pLSA
[Figure: unrolled pLSA models for documents d1, d2, …, dM; document di has topics z1, z2, …, zNi and words w1, w2, …, wNi]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability: p(d, z1, …, zN, w1, …, wN) = p(d) Π_n p(zn|d) p(wn|zn)
• Likelihood (only for observed variables): p(d, w1, …, wN) = p(d) Π_n Σ_z p(z|d) p(wn|z)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d), where column z of Z_{V×k} is the topic distribution p(w|z) over the V vocabulary words
• With many documents we hope to find latent topics as a common basis
pLSA ndash Objective Function
• pLSA tries to maximize the log likelihood L = Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d), where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
– Expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step: p(z|d, w) = p(w|z) p(z|d) / Σ_z' p(w|z') p(z'|d)
• The M-Step: p(w|z) ∝ Σ_d n(d, w) p(z|d, w), p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
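A compact sketch of pLSA trained with EM on a tiny term-count matrix. It uses the standard updates: the E step computes the posterior p(z|d, w) ∝ p(w|z) p(z|d), and the M step re-estimates p(w|z) and p(z|d) from counts weighted by that posterior. The count matrix, the function name `plsa`, the random initialization and the number of topics are assumptions for the example.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, K, iterations=50, seed=0):
    """counts[d][w] = number of times word w occurs in document d."""
    rng = random.Random(seed)
    M, V = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(M)]
    history = []  # log likelihood under the parameters at the start of each iteration
    for _ in range(iterations):
        new_w_z = [[0.0] * V for _ in range(K)]
        new_z_d = [[0.0] * K for _ in range(M)]
        log_lik = 0.0
        for d in range(M):
            for w in range(V):
                if counts[d][w] == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                total = sum(joint)  # p(w|d) under the current parameters
                log_lik += counts[d][w] * math.log(total)
                for z in range(K):
                    posterior = joint[z] / total  # E step: p(z|d,w)
                    new_w_z[z][w] += counts[d][w] * posterior
                    new_z_d[d][z] += counts[d][w] * posterior
        history.append(log_lik)
        # M step: normalize the expected counts
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history

# Made-up corpus: documents 0-1 mostly use words 0-1, documents 2-3 mostly use words 2-3.
counts = [
    [4, 3, 0, 0],
    [5, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 5],
]
p_w_z, p_z_d, history = plsa(counts, K=2)
print("p(w|z):", [[round(p, 3) for p in row] for row in p_w_z])
print("p(z|d):", [[round(p, 3) for p in row] for row in p_z_d])
```

The number of p(z|d) parameters grows with the number of documents, since every document gets its own row of mixture weights.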
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
• Comparison to SVD
– U matrix related to P(z|d) (doc to aspect)
– V matrix related to P(w|z) (aspect to term)
– Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint for distributions p(z|di)
• Easy to lead to serious problems with over-fitting
[Figure: unrolled pLSA models for documents d1, d2, …, dm, each with its own topic mixture p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
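• Definition
$$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{k} x_i = 1,\; x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex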
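Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σ_{i=1}^{k} αi
[Figure: Dirichlet densities for α0 = 0.1, α0 = 1, α0 = 10]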
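Samples from a Dirichlet distribution can be drawn with the standard gamma construction: draw g_i ~ Gamma(α_i, 1) independently and normalize. The sketch below uses only the Python standard library; the function name, the seed and the sample size are assumptions, and α = (6, 2, 2) is one of the parameter settings shown on the example slide.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex: normalize independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
alpha = (6.0, 2.0, 2.0)
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# Every sample is a valid set of mixture proportions.
print(samples[0], sum(samples[0]))

# The empirical mean approaches alpha_i / alpha_0, here (0.6, 0.2, 0.2).
means = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]
print("empirical mean:", [round(m, 3) for m in means])
```

Larger values of α0 concentrate the samples around the mean, while values of α0 well below 1 push the mass toward the corners of the simplex.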
The LDA Model
[Figure: unrolled LDA model; in each document the topics z1, z2, z3, z4 generate the words w1, w2, w3, w4, and β is shared across documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
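The generative process of LDA can be simulated directly: draw a topic mixture θ from a Dirichlet with parameter α, then for each word draw a topic from θ and a word from that topic's word distribution (row of β). In this sketch the vocabulary, the topic-word matrix `beta`, the value of `alpha` and all function names are invented for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against rounding at the upper end

def generate_document(alpha, beta, n_words, rng):
    theta = sample_dirichlet(alpha, rng)        # per-document topic mixture
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)      # topic of this word
        w = sample_categorical(beta[z], rng)    # word drawn from p(w | z, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

vocabulary = ["goal", "match", "stock", "market"]  # hypothetical vocabulary
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits "goal" and "match"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "stock" and "market"
]
alpha = [1.0, 1.0]

rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, n_words=12, rng=rng)
print("theta:", [round(t, 3) for t in theta])
print("words:", [vocabulary[w] for w in words])
```

Unlike pLSA, the per-document mixture θ is itself a random draw, so a new document gets a mixture from the same prior instead of a new set of free parameters.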
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
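The three quantities named on this slide have the following standard form in Blei, Ng and Jordan's LDA paper (listed in the Reference slide). This LaTeX block is a reconstruction from that paper, with θ the topic mixture, z the topic assignments, w the words, N_d the length of document d and M the number of documents:

```latex
\begin{aligned}
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta),\\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta,\\
p(D \mid \alpha, \beta)
  &= \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d .
\end{aligned}
```

The integral over θ couples all the words of a document, which is what makes exact inference hard.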
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
• M-Step: parameter estimation
– β
– α can be implemented using the Newton-Raphson method
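For reference, the update equations for these two steps as given in Blei, Ng and Jordan's LDA paper are reproduced below in LaTeX. Here Ψ is the digamma function, φ_{ni} is the variational probability that word n belongs to topic i, γ is the variational Dirichlet parameter, and w_{dn}^{j} = 1 when word n of document d is vocabulary word j. This is a reconstruction from the paper, not a copy of the slide:

```latex
\begin{aligned}
\text{E-step:}\quad & \phi_{ni} \propto \beta_{i w_n} \exp\!\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k} \gamma_j\big) \Big),
  \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni},\\
\text{M-step:}\quad & \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}.
\end{aligned}
```

The update for α has no closed form, which is why an iterative Newton-Raphson method is used for it.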
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA model; in each document the topics z1, z2, z3, z4 generate the words w1, w2, w3, w4, and β is shared across documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate notation of the model with variables π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
        Term1  Term2  Term3  Term4  …  TermN    (Dim = 1 million)
Img1    1      2      0      0      …  2
        ↓ Topic Projection
        f1     f2     …      fM                 (Dim = 200)
Img1    0.2    0.1    …      0.03
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
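Orthogonal Decomposition
p = Xw + ε
[Equations: orthogonal decomposition of p over the base vectors X1, X2, X3, …, Xk, annotated with "Base vector" (X), "Low dimensional representation" (w) and "Residual" (ε); the inner product pᵀq of two documents p and q is then expressed through their low-dimensional representations w and their residuals]
[Figure: an image = a low-dimensional histogram + a few words (10 words)]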
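The idea of keeping both the low-dimensional part and the residual can be illustrated numerically. The sketch below assumes that the base vectors are orthonormal and that w is obtained by projection, so the residual is orthogonal to the basis; under those assumptions the inner product of two documents splits exactly into a low-dimensional term plus a residual term. The vectors, names and dimensions are made up, and this is not the indexing method of the ICCV 2009 paper itself.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps = p - Xw."""
    w = [dot(x, p) for x in basis]
    reconstruction = [
        sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))
    ]
    eps = [pi - ri for pi, ri in zip(p, reconstruction)]
    return w, eps

s = 1.0 / math.sqrt(2.0)
basis = [              # two orthonormal base vectors in a 4-word vocabulary space
    [s, s, 0.0, 0.0],
    [0.0, 0.0, s, s],
]
p = [3.0, 1.0, 0.0, 2.0]  # hypothetical document vectors
q = [1.0, 1.0, 4.0, 0.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)
from_parts = dot(w_p, w_q) + dot(eps_p, eps_q)
print("p.q =", exact, " w_p.w_q + eps_p.eps_q =", from_parts)
```

Dropping the residual term would keep only the topic-level similarity; retaining a sparse approximation of the residual (a few words) is what preserves document-specific detail.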
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional histogram + a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, …); the candidates are re-ranked using the DS1, DS2, … entries stored in the doc meta of Doc 1, Doc 2, …, Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: Topics
Structure: Who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document Similarity
Topic Similarity
Structure Similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                Apple
         All Posts  Good Posts   All Posts  Good Posts
NP       0.021      0.012        0.289      0.239
RR       0.183      0.319        0.269      0.474
DS       0.463      0.643        0.409      0.628
LDA      0.465      0.644        0.410      0.648
SWB      0.463      0.644        0.410      0.641
SMSS     0.524      0.737        0.517      0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
…
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Find the most influential node
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: a good practice to learn matrix
– Graphical model: a good practice to learn probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
Latent Semantic Indexing (LSI)
1 Document file preparation preprocessingndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexing
bull Assumption there is some underlying latent semantic structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space V
VΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
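• The E and M steps together never decrease the log likelihood:
  $L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step maximizes F(q, θ) with respect to θ.
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of the KL divergence.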
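The bound can be checked numerically on a toy case. The sketch below assumes a single discrete latent variable with three states and made-up joint values; `free_energy` and the numbers in `joint` are hypothetical, chosen only to illustrate that F(q, θ) never exceeds L(θ) and meets it when q is the posterior.

```python
import numpy as np

def free_energy(q, joint):
    """F(q, theta) = sum_x q(x) log p(x, D | theta) + H[q] for a discrete latent x."""
    q = np.asarray(q, dtype=float)
    mask = q > 0                      # the term 0 * log 0 is taken as 0
    return float(np.sum(q[mask] * (np.log(joint[mask]) - np.log(q[mask]))))

# Hypothetical values of p(x, D | theta) for x = 0, 1, 2 at some fixed theta
joint = np.array([0.10, 0.02, 0.08])
log_lik = float(np.log(joint.sum()))   # L(theta) = log p(D | theta)
posterior = joint / joint.sum()        # p(x | D, theta)

rng = np.random.default_rng(0)
random_q = rng.dirichlet(np.ones(3))   # an arbitrary distribution over x

gap_random = log_lik - free_energy(random_q, joint)      # equals KL(q || posterior)
gap_posterior = log_lik - free_energy(posterior, joint)  # zero up to rounding
```

Here `gap_random` is the KL divergence between the arbitrary q and the posterior, which is why it cannot be negative.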
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides).
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex even with only a few circles…
  - We have to make too many assumptions.
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute.
  - A graphical model can help us analyze and understand our problems.
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables.
  - With a graphical model we can decompose a joint probability into conditional probabilities, which are usually easier to handle.
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general:
  $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i.
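A small sketch can make the factorization concrete. It assumes all five variables are binary and fills the conditional probability tables with random numbers; the helper names `random_cpt` and `joint` and the table values are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Table of p(child = 1 | parents), indexed by the binary parent values."""
    return np.asarray(rng.uniform(0.1, 0.9, size=(2,) * n_parents))

# Parent sets of the network p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D)
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}
cpts = {name: random_cpt(len(pa)) for name, pa in parents.items()}

def joint(assignment):
    """Joint probability as the product of one local conditional per node."""
    prob = 1.0
    for name, pa in parents.items():
        p_one = cpts[name][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[name] == 1 else 1.0 - p_one
    return float(prob)

# Summing the factorized joint over all 2^5 assignments should give 1
total = sum(joint(dict(zip(parents, values)))
            for values in itertools.product((0, 1), repeat=5))

# Free parameters: 1 + 1 + 4 + 4 + 4 for the factorization vs 2^5 - 1 for a full table
n_factorized = sum(int(np.size(c)) for c in cpts.values())
n_full = 2 ** 5 - 1
```

The parameter counts show why the decomposition into conditionals is easier to work with than the full joint table.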
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively:
  - Terms are ambiguous.
  - The same query varies with personal style.
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning).
• LSI puts documents together even if they have no words in common, as long as they share frequently co-occurring terms.
• Disadvantage
  - A statistical foundation is missing.
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval.
• Identification of latent classes using an Expectation Maximization (EM) algorithm.
• Shown to address
  - Polysemy
    - "Java" could mean coffee and also the programming language Java.
    - "Cricket" is a game and also an insect.
  - Synonymy
    - "computer", "pc", and "desktop" could all mean the same thing.
• Has a better statistical foundation than LSA.
pLSA
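[Figure: pLSA graphical model in plate notation (nodes d, z, w; plates Nd and M), next to its unrolled form with one document d and topic-word pairs (z1, w1), (z2, w2), (z3, w3), …, (zN, wN).]
z1 … zN are variables with zi ∈ [1, K], where K is the number of latent topics.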
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM; document di has topic-word pairs (z1, w1), (z2, w2), …, (zNi, wNi).]
The topic-word distributions p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents.
Likelihood
Joint Probability vs Likelihood
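• Joint probability (observed and latent variables):
  $p(d, z_1, \ldots, z_N, w_1, \ldots, w_N) = p(d) \prod_{n=1}^{N} p(z_n \mid d)\, p(w_n \mid z_n)$
• Likelihood (only for observed variables):
  $p(d, w_1, \ldots, w_N) = p(d) \prod_{n=1}^{N} \sum_{k=1}^{K} p(z_n = k \mid d)\, p(w_n \mid z_n = k)$
• p(d) is assumed to be uniform.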
Document Decomposition
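• Each document can be decomposed as
  $p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  $p(w \mid d) = Z_{V \times K}\, p(z \mid d)$
  where the k-th column of Z is p(w|z = k) and V is the vocabulary size.
• With many documents, we hope to find latent topics as a common basis.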
pLSA - Objective Function
• pLSA tries to maximize the log likelihood (up to a constant from the uniform p(d))
  $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)$
  where n(d, w) is the number of times word w occurs in document d.
• Due to the summation over z inside the log, we have to resort to EM.
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values.
• M-Step
  - Update the parameters with the calculated posterior probabilities.
  - Find the parameters that maximize the likelihood function.
Lower Bounding the Log Likelihood
EM Steps
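• The E-Step:
  $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step:
  $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w), \qquad p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$
  where n(d, w) is the count of word w in document d.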
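A compact NumPy sketch of pLSA trained by EM is given below. It is an illustration under stated assumptions, not the implementation behind the slides: the function name `plsa`, the random Dirichlet initialization, and the tiny count matrix are hypothetical, and the dense (documents × topics × words) posterior array only suits small problems.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM (illustrative sketch).

    counts[d, w] is the number of times word w occurs in document d.
    Returns p(w|z) with shape (K, V), p(z|d) with shape (D, K), and the
    log likelihood evaluated at the start of every iteration.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)   # (K, V)
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)    # (D, K)
    observed = counts > 0
    trace = []
    for _ in range(n_iter):
        p_w_d = p_z_d @ p_w_z                                 # (D, V) mixture p(w|d)
        trace.append(float(np.sum(counts[observed] * np.log(p_w_d[observed]))))
        # E step: posterior p(z | d, w), shape (D, K, V)
        post = p_z_d[:, :, None] * p_w_z[None, :, :] / p_w_d[:, None, :]
        # M step: re-estimate both distributions from expected counts n(d,w) p(z|d,w)
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, np.array(trace)

# Toy corpus: 4 documents over a 4-word vocabulary (made-up counts)
counts = np.array([[4, 3, 0, 0],
                   [5, 2, 0, 1],
                   [0, 0, 3, 4],
                   [0, 1, 4, 5]], dtype=float)
p_w_z, p_z_d, trace = plsa(counts, n_topics=2)
```

The product `p_z_d @ p_w_z` is the document decomposition in matrix form: every row is a document's word distribution written as a mixture of the shared topic distributions.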
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction:
  - in LSA, by keeping only K singular values;
  - in pLSA, by having K aspects.
• Comparison to SVD:
  - the U matrix is related to P(z|d) (document to aspect);
  - the V matrix is related to P(w|z) (aspect to term);
  - the Σ matrix is related to P(z) (aspect strength).
pLSA vs LSA
• The main difference is the way the approximation is done.
• pLSA generates a model (the aspect model) and maximizes its predictive power.
• Selecting the proper value of K is heuristic in LSA.
• Model selection in statistics can determine the optimal K in pLSA.
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic Latent Semantic Analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006).
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportions.
• The number of parameters in the model grows linearly with M (the number of documents in the training set).
Problems in pLSA
• There is no constraint on the distributions p(z|di).
• This easily leads to serious over-fitting problems.
[Figure: unrolled pLSA model for documents d1, d2, …, dm; each document has its own topic distribution p(z|d1), p(z|d2), …, p(z|dm).]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions of each document are assumed to follow some distribution.
• Requirements for such a distribution:
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials.
  - It is easy to optimize.
• The Dirichlet distribution is one such distribution.
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex.
Dirichlet Distribution
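• Definition:
  $p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex.
• Various parameters α
[Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), and (6, 2, 6).]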
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: Dirichlet densities for α0 = 0.1, α0 = 1, and α0 = 10.]
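The effect of α0 can be seen by sampling. The sketch below assumes equal αi = α0 / K with K = 3 and uses NumPy's `Generator.dirichlet`; the sample size and the "peakiness" summary (mean of the largest coordinate) are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
samples = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = np.full(K, alpha0 / K)              # equal alpha_i summing to alpha_0
    samples[alpha0] = rng.dirichlet(alpha, size=2000)

# Every sample is a point on the (K-1)-simplex: non-negative and summing to one.
# A small alpha_0 pushes samples toward the corners (one dominant topic),
# a large alpha_0 pulls them toward the centre (evenly mixed topics).
peakiness = {a0: float(s.max(axis=1).mean()) for a0, s in samples.items()}
```

In topic-model terms, a small α0 corresponds to documents dominated by a single topic, and a large α0 to documents that mix all topics evenly.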
The LDA Model
[Figure: unrolled LDA model; each document has topic assignments z1 … z4 generating words w1 … w4, with the topic-word parameter β shared by all documents.]
• For each document:
  - Choose θ ~ Dirichlet(α).
  - For each of the N words wn:
    - Choose a topic zn ~ Multinomial(θ).
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.
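The generative process can be written down almost line for line. The following is a sketch under stated assumptions: the function name `generate_corpus`, the two-topic `beta`, and the fixed document length are hypothetical values chosen for illustration.

```python
import numpy as np

def generate_corpus(alpha, beta, n_docs, n_words, seed=0):
    """Sample documents from the LDA generative process (illustrative sketch).

    alpha: Dirichlet parameter, shape (K,)
    beta:  topic-word probabilities, shape (K, V); row k is p(w | z = k)
    """
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = beta.shape
    docs, thetas = [], []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)                        # topic proportions of this document
        z = rng.choice(n_topics, size=n_words, p=theta)     # one topic per word position
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])  # word given its topic
        docs.append(w)
        thetas.append(theta)
    return docs, np.array(thetas)

alpha = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.5, 0.0, 0.0],     # topic 0 only emits words 0 and 1
                 [0.0, 0.0, 0.5, 0.5]])    # topic 1 only emits words 2 and 3
docs, thetas = generate_corpus(alpha, beta, n_docs=5, n_words=20)
```

Unlike pLSA, the per-document proportions `theta` are themselves drawn from a shared distribution, so the number of model parameters does not grow with the number of documents.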
Joint Probability
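• Given the parameters α and β, the joint distribution of a topic mixture θ, N topics z, and N words w is
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where p(zn|θ) is simply θi for the unique i such that zn = i.
Likelihood
• Joint probability:
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document:
  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents:
  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$
Inference
• The log likelihood of the corpus can be computed by summing over the documents:
  $\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality is applied as in EM.
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables:
  $p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general.
• We have to resort to a variational approach.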
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence:
  $\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\big(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.
VBEM vs EM
• They differ only in the E-step.
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0.
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X) that minimizes KL(q(X) ‖ p(X|D, θ)).
• This is also equivalent to maximizing the lower bound on L(θ) with respect to q.
Parameter Estimation
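• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data.
• Strategy (variational EM):
  - Lower bound log p(w|α, β) by a function L(γ, φ; α, β).
  - Repeat until convergence:
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ.
    - M: maximize the bound with respect to the parameters α and β.
Parameter Estimation
• E-Step: variational inference, repeated until convergence:
  $\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
  where Ψ is the digamma function.
• M-Step: parameter estimation:
  - β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, [w_{dn} = j]$
  - α: can be implemented using the Newton-Raphson method.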
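The per-document E-step can be sketched in a few lines, following the mean-field updates of Blei et al. (2003). This is an illustration under stated assumptions: `digamma` here is a small hand-written approximation for positive arguments (a library routine would normally be used), and `variational_e_step`, the toy `beta`, and the example document are hypothetical.

```python
import numpy as np

def digamma(x):
    """Digamma function for positive inputs: recurrence plus an asymptotic series."""
    x = np.atleast_1d(np.array(x, dtype=float))
    result = np.zeros_like(x)
    while np.any(x < 6.0):                 # shift small arguments upward
        small = x < 6.0
        result[small] -= 1.0 / x[small]    # psi(x) = psi(x + 1) - 1/x
        x[small] += 1.0
    inv2 = 1.0 / (x * x)
    result += np.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def variational_e_step(word_ids, alpha, beta, n_iter=100):
    """Mean-field updates for one document; returns gamma (K,) and phi (N, K)."""
    n_topics = len(alpha)
    n_words = len(word_ids)
    gamma = alpha + n_words / n_topics               # common initialization
    for _ in range(n_iter):
        # phi_ni proportional to beta[i, w_n] * exp(digamma(gamma_i));
        # the digamma(sum(gamma)) term is constant in i and cancels on normalization
        phi = beta[:, word_ids].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)              # gamma_i = alpha_i + sum_n phi_ni
    return gamma, phi

alpha = np.array([0.5, 0.5])
beta = np.array([[0.4, 0.4, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
word_ids = np.array([0, 1, 0, 1, 2])                 # a document mostly about topic 0
gamma, phi = variational_e_step(word_ids, alpha, beta)
psi_one = float(digamma(1.0)[0])                     # should be close to minus Euler's constant
```

The resulting `gamma` plays the role of a document-specific Dirichlet posterior over topic proportions, and `phi` gives a soft topic assignment for every word position.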
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN.]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong.
[Figure: unrolled LDA model; each document has topic assignments z1 … z4 generating words w1 … w4, with the topic-word parameter β shared by all documents.]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates Nd and M.]
Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11x11 gray values, 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes, E. Sudderth, A. Torralba, W. Freeman, and A. Willsky, NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords:
  - we need to access 1000 inverted lists;
  - the intersection of 1000 inverted lists may be empty;
  - the union of 1000 inverted lists may be the whole corpus.
• Dimension reduction
  Term vector (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  After topic projection (Dim = 200):
          f1    f2    …  fM
    Img1  0.2   0.1   …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = a low-dimensional feature vector (bar chart) + a few words (10 words).]
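Orthogonal Decomposition
  $p = Xw + \varepsilon$
  $\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X = (X1, X2, X3, …, Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional feature vector (bar chart) + a few words (10 words).]
• Similarity of two documents p and q:
  $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$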
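The decomposition and the resulting similarity split can be checked numerically. The sketch below assumes orthonormal base vectors (obtained here from a QR factorization of a random matrix) and a least-squares projection; the sizes, the random vectors, and the helper names `decompose` and `keep_top` are hypothetical, and keeping the 10 largest residual entries mirrors the "a few words (10 words)" idea only loosely.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 50, 5
# Orthonormal base vectors as columns of X (assumption: X^T X = I)
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))

def decompose(p, X):
    """Split p into a low-dimensional part X w and a residual orthogonal to X."""
    w = X.T @ p            # low-dimensional representation
    eps = p - X @ w        # residual; X^T eps = 0 by construction
    return w, eps

def keep_top(eps, n_keep=10):
    """Keep only the n_keep largest-magnitude residual entries (a few 'words')."""
    sparse = np.zeros_like(eps)
    idx = np.argsort(np.abs(eps))[-n_keep:]
    sparse[idx] = eps[idx]
    return sparse

p = rng.random(vocab_size)
q = rng.random(vocab_size)
w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)

exact = float(p @ q)                                   # similarity in the original space
split = float(w_p @ w_q + eps_p @ eps_q)               # same value from the two parts
approx = float(w_p @ w_q + keep_top(eps_p) @ keep_top(eps_q))  # with truncated residuals
```

With orthonormal base vectors the cross terms vanish, so `split` reproduces `exact`; truncating the residual trades some accuracy for a much smaller index entry.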
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
  $p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline. A query is decomposed into a low-dimensional feature vector (bar chart) plus a few words; the LSH index (buckets DS1, DS2, …) returns candidate documents (e.g. Doc 300, Doc 401, …), which are re-ranked using the document metadata (Doc 1, Doc 2, …, Doc N).]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image]
Search Example
[Figure: query image]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to the topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction, network construction, expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06. Achieves stable performance in the expert finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06. Finds the most influential nodes.
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision:
  - Matrix decomposition: good practice for learning matrix methods
  - Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems.
• The essence of decomposition is to discover a set of mid-level features to describe the original documents or images.
• A graphical model is more adaptable to various applications than matrix decomposition.
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep".
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.
Similarity Measures
• Term to term: $AA^T = U\Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  UΣ are the coordinates of A (rows) projected to space V.
• Document to document: $A^T A = V\Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  VΣ are the coordinates of A (columns) projected to space U.
Similarity Measures
• Term to document: $A = U\Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V.
  $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U.
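These three identities are easy to verify with NumPy's SVD. The term-by-document matrix below is a made-up toy example, and the variable names are illustrative.

```python
import numpy as np

# Toy term-by-document matrix A: 3 terms (rows) x 4 documents (columns)
A = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

term_coords = U * s              # rows of U Sigma: term coordinates
doc_coords = V * s               # rows of V Sigma: document coordinates

term_term = term_coords @ term_coords.T          # should equal A A^T
doc_doc = doc_coords @ doc_coords.T              # should equal A^T A
term_doc = (U * np.sqrt(s)) @ (V * np.sqrt(s)).T # should equal A

# U Sigma is also A projected onto V, and V Sigma is A^T projected onto U
rows_on_V = A @ V
cols_on_U = A.T @ U
```

In LSI only the leading k columns of `term_coords` and `doc_coords` would be kept, which turns the equalities into low-rank approximations.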
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  - authorities contain high-quality information;
  - hubs are comprehensive lists of links to authorities.
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.
[Figure: hubs pointing to authorities.]
Power Iteration
• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores by computing
  $a_i^{(k)} = \sum_{j:\, j \to i} h_j^{(k-1)}, \qquad h_i^{(k)} = \sum_{j:\, i \to j} a_j^{(k)}$
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise.
• Now
  $a^{(k)} = L^T h^{(k-1)}, \qquad h^{(k)} = L\, a^{(k)}$
HITS and SVD
• L: rows are outlinks, columns are inlinks.
• a will be the dominant eigenvector of the authority matrix $L^T L$.
• h will be the dominant eigenvector of the hub matrix $L L^T$.
• a and h are in fact the first right and left singular vectors of L, respectively.
• We are in fact running SVD on the adjacency matrix.
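The link between the power iteration and the SVD can be checked on a small graph. The sketch below assumes a made-up four-page graph and normalizes the scores to unit length after every step; the function name `hits` and the adjacency matrix are hypothetical.

```python
import numpy as np

def hits(L, n_iter=100):
    """HITS power iteration (illustrative sketch).

    L[i, j] = 1 if page i links to page j. Returns unit-norm authority
    and hub score vectors.
    """
    n = L.shape[0]
    h = np.ones(n) / np.sqrt(n)
    for _ in range(n_iter):
        a = L.T @ h                 # authority: sum of hub scores of pages linking in
        a /= np.linalg.norm(a)
        h = L @ a                   # hub: sum of authority scores of pages linked to
        h /= np.linalg.norm(h)
    return a, h

# Made-up web graph with 4 pages
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a, h = hits(L)
U, s, Vt = np.linalg.svd(L)
first_right = np.abs(Vt[0])         # singular vectors are defined up to sign
first_left = np.abs(U[:, 0])
best_authority = int(np.argmax(a))
```

In this graph page 2 receives links from three of the four pages, so it ends up with the highest authority score.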
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable because of its random jump step.
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column wi represents one base
• H: coordinates of V projected to W
  $v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications.
• Probability is also non-negative.
• An additive model captures local structure.
Multiplicative Update Algorithm
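• Cost function: Euclidean distance
  $\|V - WH\|^2 = \sum_{ij} \big(V_{ij} - (WH)_{ij}\big)^2$
• Multiplicative update:
  $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$
Multiplicative Update Algorithm
• Cost function: divergence
  $D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$
  - Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that A and B can be regarded as normalized probability distributions.
• Multiplicative update:
  $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
• pLSA is NMF with KL divergence.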
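A short NumPy sketch of the Euclidean-cost multiplicative updates of Lee and Seung follows. It is an illustration under stated assumptions: the function name `nmf`, the random initialization, the small constant `eps` added to the denominators to avoid division by zero, and the synthetic rank-3 data are all hypothetical choices.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-12):
    """NMF with multiplicative updates for the cost ||V - WH||^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1        # strictly positive starting point
    H = rng.random((k, m)) + 0.1
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled element-wise, so non-negativity is preserved
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H) ** 2))
    return W, H, np.array(errors)

# Synthetic non-negative data with an exact rank-3 factorization
data_rng = np.random.default_rng(1)
V = data_rng.random((6, 3)) @ data_rng.random((3, 8))
W, H, errors = nmf(V, k=3)
```

Because the updates only multiply by non-negative ratios, no explicit projection step is needed to keep W and H non-negative.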
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001.
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended).
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP, and Bayesian inference
  - Expectation-Maximization
  - Gaussian mixture parameter estimation
• pLSA
  - Motivation
  - Derivation and geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about x(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = fθ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data.
Maximum Likelihood (ML)
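• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model.
• Suppose we are given n data samples (x1, x2, …, xn).
• Maximum likelihood finds the θ that maximizes L(θ):
  $\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} p(x_1, \ldots, x_n \mid \theta)$
• Predictive distribution:
  $p(x_{n+1} \mid x_1, \ldots, x_n) \approx p(x_{n+1} \mid \theta_{ML})$
IID - Independent Identically Distributed
• IID means
  $p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as
  $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used:
  $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$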
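As a worked example of maximizing an i.i.d. log likelihood, the sketch below assumes a one-dimensional Gaussian model, for which the ML estimates are the sample mean and the 1/n sample standard deviation. The synthetic data, the function name `gaussian_log_likelihood`, and the comparison points are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """log L(theta) = sum_i log N(x_i | mu, sigma^2) for i.i.d. samples."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - (x - mu) ** 2 / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic samples

# Closed-form ML estimates for a Gaussian
mu_ml = float(x.mean())
sigma_ml = float(x.std())                        # np.std divides by n, as ML requires

ll_ml = gaussian_log_likelihood(x, mu_ml, sigma_ml)
ll_true = gaussian_log_likelihood(x, 2.0, 1.5)               # the generating parameters
ll_other = gaussian_log_likelihood(x, mu_ml + 0.3, sigma_ml * 1.2)
```

The ML estimate scores at least as high as any other parameter setting on the observed data, including the parameters that actually generated it.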
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides).
• Gregor Heinrich, Parameter Estimation for Text Analysis, Technical Note, 2005-2008.
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models.
• Why do we need latent variables?
  - To describe complex models: Gaussian mixture model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D = {y_n}
• Likelihood: $L(\theta) = p(D \mid \theta) = \int p(X, D \mid \theta)\, dX$, where X = {x_n} are the latent variables
• Goal: learn maximum likelihood (ML) parameter values.
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_{\theta} L(\theta)$
• Because of the integral (or sum) over the latent variables, the likelihood can be a very complicated function that is hard to optimize.
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
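• Given parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
where $p(z_n = i \mid \theta) = \theta_i$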
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
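Assuming the slide follows the cited LDA paper (Blei et al., 2003), the three quantities can be written as below, with θ the per-document topic mixture, M the number of documents and N_d the length of document d. The notation is an assumption taken from that paper rather than from the transcript.

```latex
% Joint probability of one document's hidden and observed variables
p(\theta, z, w \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out z
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all M documents of the corpus D
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```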
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality, as in EM, gives a lower bound
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
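The relation between the bound and the KL divergence can be written out explicitly. The block below assumes the notation of Blei et al. (2003), where q(θ, z | γ, φ) is the variational distribution; these symbols are an assumption, since the transcript does not preserve them.

```latex
% Lower bound obtained from Jensen's inequality
L(\gamma, \phi; \alpha, \beta)
  = E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z \mid \gamma, \phi)]

% The gap between the log likelihood and the bound is exactly the KL divergence
\log p(w \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + KL\big( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \big)
```

Because the left-hand side does not depend on γ and φ, raising L with respect to them must lower the KL term by the same amount.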
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), so that KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X) that minimizes KL(q(X) || p(X|D, θ))
• This is equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower-bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  – E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  – M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  – β: updated in closed form
  – α: can be estimated using the Newton-Raphson method
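For reference, the update equations of this scheme as given in Blei et al. (2003) are sketched below. It is an assumption that the slide shows the same forms. Here Ψ is the digamma function, and $w_{dn}^{j} = 1$ if the n-th word of document d is vocabulary item j and 0 otherwise.

```latex
% E-step (per document), iterated until convergence
\phi_{ni} \propto \beta_{i w_n} \exp\{\Psi(\gamma_i)\}, \qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step, closed-form update for beta (rows normalized to sum to one)
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
```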
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model drawn as three groups of topic nodes z1, …, z4 and word nodes w1, …, w4, with a shared node β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates M and N_d]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)

        Term1  Term2  Term3  Term4  …  TermN     (Dim = 1 million)
  Img1  1      2      0      0      …  2

        f1     f2     …      fM                  (Dim = 200)
  Img1  0.2    0.1    …      0.03
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \epsilon$$
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
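$$p = Xw + \epsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional representation (bar chart) + a few words (10 words)]
• Similarity between two documents p and q
$$\langle p, q \rangle = w_p^T w_q + \epsilon_p^T \epsilon_q$$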
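The decomposition of the inner product can be checked numerically. The sketch below assumes that the base vectors are orthonormal and that the residual is what remains after projecting onto them; the toy basis and vectors are hypothetical, and `decompose` is an illustrative helper rather than the paper's implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and the residual eps = p - Xw."""
    w = [dot(p, x) for x in basis]
    recon = [sum(wk * x[i] for wk, x in zip(w, basis)) for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, recon)]
    return w, eps

# Hypothetical orthonormal basis (k = 2) in a 4-dimensional vocabulary space
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, -0.5, -0.5]]
p = [1.0, 2.0, 0.0, 2.0]
q = [0.0, 1.0, 3.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

full = dot(p, q)                             # similarity in the original space
split = dot(w_p, w_q) + dot(eps_p, eps_q)    # low-dimensional part + residual part
print(full, split)
```

Under these assumptions the cross terms vanish because each residual is orthogonal to the span of the basis, so the two numbers agree.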
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. The query (a low-dimensional topic vector + a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, …); the candidates are then re-ranked using the doc meta (DS1, DS2, … stored for Doc 1, Doc 2, …, Doc N)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method  Slashdot                 Apple
        All Posts   Good Posts   All Posts   Good Posts
NP      0.021       0.012        0.289       0.239
RR      0.183       0.319        0.269       0.474
DS      0.463       0.643        0.409       0.628
LDA     0.465       0.644        0.410       0.648
SWB     0.463       0.644        0.410       0.641
SMSS    0.524       0.737        0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential node
Evaluation
• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: good practice for learning matrices
  – Graphical models: good practice for learning probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is a good tool for analyzing problems
  – It is more adaptable to various applications than matrix decomposition
Similarity Measures
• Term to term: $AA^T = U\Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$
  – $U\Sigma$ are the coordinates of A (rows) projected to space V
• Document to document: $A^T A = V\Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$
  – $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures
• Term to document: $A = U\Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$
  – $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V
  – $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  – Authorities contain high-quality information
  – Hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: hubs pointing to authorities]
Power Iteration
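• Each page i has both a hub score $h_i$ and an authority score $a_i$
• HITS successively refines these scores by computing
$$a_i = \sum_{j:\, j \to i} h_j, \qquad h_i = \sum_{j:\, i \to j} a_j$$
• Define the adjacency matrix L of the directed web graph
$$L_{ij} = \begin{cases} 1 & \text{if page } i \text{ links to page } j \\ 0 & \text{otherwise} \end{cases}$$
• Now
$$a = L^T h, \qquad h = L a$$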
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
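The iteration can be sketched in plain Python. This is a minimal illustration, assuming L[i][j] = 1 when page i links to page j and normalizing both score vectors to unit length after each round; the 4-page graph is a hypothetical example.

```python
import math

def hits(L, iterations=50):
    """Power iteration for HITS: alternate a = L^T h and h = L a, then normalize."""
    n = len(L)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to i
        a = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]
        # Hub score: sum of authority scores of the pages i links to
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm_a = math.sqrt(sum(v * v for v in a)) or 1.0
        norm_h = math.sqrt(sum(v * v for v in h)) or 1.0
        a = [v / norm_a for v in a]
        h = [v / norm_h for v in h]
    return a, h

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 3
L = [[0, 0, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(L)
print([round(v, 3) for v in a], [round(v, 3) for v in h])
```

In this toy graph page 3 ends up with the highest authority score and page 0 with the highest hub score, which matches the intuition on the earlier slide.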
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
$$V_{n \times m} \approx W_{n \times k} H_{k \times m}$$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column $W_i$ represents one basis vector
• H: coordinates of V projected onto W
$$v_j \approx W_{n \times k} h_j$$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• An additive model captures local structure
Multiplicative Update Algorithm
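• Cost function: Euclidean distance
$$\|V - WH\|^2 = \sum_{ij} \left( V_{ij} - (WH)_{ij} \right)^2$$
• Multiplicative update
$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$$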
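A minimal pure-Python sketch of the multiplicative updates for the Euclidean cost, following the rules of Lee and Seung: H is scaled elementwise by (WᵀV)/(WᵀWH) and W by (VHᵀ)/(WHHᵀ). The random initialization, the small constant `tiny` added to the denominators, and the toy matrix are assumptions made for this illustration.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def sq_error(V, W, H):
    """Squared Euclidean distance ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

def nmf(V, k, iterations=200, seed=0, tiny=1e-12):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [sq_error(V, W, H)]
    for _ in range(iterations):
        Wt = transpose(W)
        num = matmul(Wt, V)               # k x m
        den = matmul(matmul(Wt, W), H)    # k x m
        H = [[H[a][j] * num[a][j] / (den[a][j] + tiny) for j in range(m)]
             for a in range(k)]
        Ht = transpose(H)
        num = matmul(V, Ht)               # n x k
        den = matmul(W, matmul(H, Ht))    # n x k
        W = [[W[i][a] * num[i][a] / (den[i][a] + tiny) for a in range(k)]
             for i in range(n)]
        errors.append(sq_error(V, W, H))
    return W, H, errors

# Hypothetical non-negative data matrix (4 x 3)
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 2.0, 1.0],
     [1.0, 1.0, 1.0]]
W, H, errors = nmf(V, k=2)
print(errors[0], errors[-1])
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection step.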
Multiplicative Update Algorithm
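• Cost function: divergence
$$D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$$
  – Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update
$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$$
• PLSA is NMF with KL divergence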
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Gaussian mixture parameter estimation
• pLSA
  – Motivation
  – Derivation & geometric properties
  – Applications
• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
$$L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$$
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
• Predictive distribution
$$p(x \mid D) \approx p(x \mid \theta_{ML})$$
IID ndash Independent Identically Distributed
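• IID means
$$p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
• The problem is considerably simplified as
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
• Usually the log likelihood is used
$$\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$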
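As a small worked example of the i.i.d. log likelihood, the sketch below assumes a Gaussian model with known unit variance, for which the ML estimate of the mean is the sample mean. The data values are made up for illustration.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma=1.0):
    """Log likelihood of i.i.d. samples under N(mu, sigma^2): a sum of per-sample log densities."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

xs = [2.1, 1.9, 2.4, 1.6, 2.0]   # hypothetical observations
mu_ml = sum(xs) / len(xs)        # closed-form ML estimate of the mean

# The log likelihood at mu_ml is higher than at nearby candidate values
candidates = [mu_ml - 0.5, mu_ml, mu_ml + 0.5]
scores = [gaussian_log_likelihood(xs, mu) for mu in candidates]
print(mu_ml, scores)
```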
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models, e.g. the Gaussian mixture model
• To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
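• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e., ML learning) w.r.t. θ
$$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] \equiv F(q, \theta)$$
• where H[q] is the entropy of q(X)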
The E and M Steps of EM
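• The lower bound on the log likelihood is given by
$$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$$
• EM alternates between
• E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
$$q^{(k)}(X) = \arg\max_{q(X)} F(q(X), \theta^{(k-1)})$$
• M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
$$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}(X), \theta)$$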
The E Step
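• E step: for fixed θ
$$F(q, \theta) = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$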
The M Step
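• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
$$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}(X), \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically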
EM Never Decreases the Likelihood
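• The E and M steps together never decrease the log likelihood
$$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$$
• The E step brings F(q, θ) up to the likelihood L(θ) (the bound touches the top)
• The M step maximizes F(q, θ) w.r.t. θ (the bound is lifted)
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL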
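The monotonicity property can be observed on a small example. Below is a minimal sketch of EM for a one-dimensional mixture of two Gaussians; the data, the initial means, and the variance floor of 1e-6 are assumptions chosen for this illustration, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pi, mus, variances):
    """Log likelihood of the data under the mixture (latent assignments summed out)."""
    return sum(math.log(sum(pi[k] * normal_pdf(x, mus[k], variances[k])
                            for k in range(len(pi)))) for x in xs)

def em_gmm(xs, init_mus, iterations=30):
    K = len(init_mus)
    mus = list(init_mus)
    pi = [1.0 / K] * K
    variances = [1.0] * K
    history = [log_likelihood(xs, pi, mus, variances)]
    for _ in range(iterations):
        # E step: posterior responsibility of each component for each point
        resp = []
        for x in xs:
            weights = [pi[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M step: re-estimate the parameters as if the assignments were observed
        for k in range(K):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
        history.append(log_likelihood(xs, pi, mus, variances))
    return pi, mus, variances, history

# Hypothetical data drawn around -2 and +2
xs = [-2.2, -1.9, -2.1, -1.8, -2.0, 1.9, 2.2, 2.0, 1.8, 2.1]
pi, mus, variances, history = em_gmm(xs, init_mus=[-1.0, 1.0])
print(mus, history[0], history[-1])
```

The recorded `history` gives the log likelihood after every iteration, so the never-decreasing behavior can be inspected directly.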
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model is so complex, even with a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world, but joint probability is hard to compute
  – A graphical model can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
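• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$$
• In general
$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$$
• where pa(i) are the parents of node i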
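The factorization above can be made concrete for binary variables. In the sketch below, all conditional probability tables are made-up numbers used only to illustrate how the joint is assembled from the five factors.

```python
from itertools import product

# Hypothetical conditional probability tables; each entry is p(variable = 1 | parents)
p_A = 0.3
p_B = 0.6
p_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # keyed by (A, B)
p_D = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.8}  # keyed by (B, C)
p_E = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.7}  # keyed by (C, D)

def bern(p1, value):
    """Probability of a binary value given p(value = 1)."""
    return p1 if value == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_A, a) * bern(p_B, b) * bern(p_C[(a, b)], c)
            * bern(p_D[(b, c)], d) * bern(p_E[(c, d)], e))

# Summing the factorized joint over all 2^5 configurations gives a valid distribution
total = sum(joint(*values) for values in product((0, 1), repeat=5))
print(total, joint(1, 1, 1, 1, 1))
```

With binary variables the factorized model needs 1 + 1 + 4 + 4 + 4 = 14 numbers, compared with 31 free parameters for an unrestricted joint table over five binary variables.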
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same query varies due to personal styles
• Latent semantic indexing
  – Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the documents share frequently co-occurring terms
• Disadvantages
  – A statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
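[Figure: pLSA plate diagram d → z → w with plates N_d and M, and the same model unrolled for one document d with topic-word pairs (z1, w1), (z2, w2), (z3, w3), …, (zN, wN)]
$z_1, \dots, z_N$ are variables with $z_i \in [1, K]$, where K is the number of latent topics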
pLSA
[Figure: pLSA unrolled over documents d1, d2, …, dM; document di generates the pairs (z1, w1), (z2, w2), …, (zNi, wNi)]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
$$p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$$
• Likelihood (only for observed variables)
$$p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
$$p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$$
• With many documents, we hope to find latent topics as a common basis
pLSA ndash Objective Function
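• pLSA tries to maximize the log likelihood
$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(z \mid d)\, p(w \mid z) \Big]$$
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM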
EM Steps
• E-step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-step
• The M-step
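For the parameterization p(w|d) = Σ_z p(w|z) p(z|d), the two steps take the standard form below, following Hofmann (1999). It is an assumption that the slide uses the same form. Here n(d, w) is the count of word w in document d and n(d) = Σ_w n(d, w).

```latex
% E-step: posterior of the latent topic for each (document, word) pair
p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}

% M-step: re-estimate the parameters from expected counts
p(w \mid z) = \frac{\sum_{d} n(d, w)\, p(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, p(z \mid d, w')},
\qquad
p(z \mid d) = \frac{\sum_{w} n(d, w)\, p(z \mid d, w)}{n(d)}
```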
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In PLSA, by having K aspects
• Comparison to SVD
  – U matrix: related to P(z|d) (doc to aspect)
  – V matrix: related to P(w|z) (aspect to term)
  – Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
Similarity Measures
• Term to document: A = UΣV^T = (UΣ^½)(VΣ^½)^T
  - UΣ^½ are the coordinates of A (rows) projected to space V
  - VΣ^½ are the coordinates of A (columns) projected to space U
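A small numerical check of this split may help. The sketch below uses NumPy; the matrix A and the helper name split_svd are made up for illustration and do not come from the slides.

```python
import numpy as np

def split_svd(A, k):
    """Return rank-k term and document coordinates (U S^1/2 and V S^1/2)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    root = np.sqrt(s[:k])
    term_coords = U[:, :k] * root      # one row per row of A (terms)
    doc_coords = Vt[:k, :].T * root    # one row per column of A (documents)
    return term_coords, doc_coords

# Hypothetical 4-term x 3-document count matrix.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])
T, D = split_svd(A, k=3)
# With all singular values kept, T @ D.T reproduces A; a smaller k gives a low-rank approximation.
```

A term-to-document similarity is then the inner product of a row of T with a row of D.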
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages
  - authorities contain high-quality information
  - hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
[Figure: Hubs and Authorities]
Power Iteration
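• Each page i has both a hub score h_i and an authority score a_i
• HITS successively refines these scores by computing
  a_i = Σ_{j: j→i} h_j,   h_i = Σ_{j: i→j} a_j
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise
• Now, in matrix form: a = L^T h, h = L a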
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix L^T L
• h will be the dominant eigenvector of the hub matrix L L^T
• h and a are in fact the first left and right singular vectors of L
• We are in fact running SVD on the adjacency matrix
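The link between the HITS iteration and SVD can be checked numerically. This is a minimal sketch assuming NumPy and a made-up 4-page graph; the function name hits and the iteration count are illustrative choices, not part of the original slides.

```python
import numpy as np

def hits(L, iters=100):
    """Power iteration for HITS; L[i, j] = 1 if page i links to page j."""
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = L.T @ h               # authority: sum of hub scores of pages linking in
        h = L @ a                 # hub: sum of authority scores of pages linked to
        a /= np.linalg.norm(a)    # normalize so the scores do not blow up
        h /= np.linalg.norm(h)
    return h, a

# Hypothetical web graph with 4 pages.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
h, a = hits(L)
U, s, Vt = np.linalg.svd(L)
# Up to sign, U[:, 0] matches h and Vt[0] matches a.
```

The comparison uses absolute values because singular vectors are only defined up to sign.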
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V (n×m), find non-negative matrix factors W (n×k) and H (k×m) such that
  V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column W_i represents one basis vector
• H: coordinates of V projected to W
  v_j ≈ W_{n×k} h_j
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• An additive model captures local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance, ||V − WH||² = Σ_ij (V_ij − (WH)_ij)²
• Multiplicative update
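For the Euclidean cost, the multiplicative updates from the Lee and Seung NIPS paper are H ← H ∘ (W^T V) / (W^T W H) and W ← W ∘ (V H^T) / (W H H^T), with element-wise product and division. The sketch below is one way to write them in NumPy; the small eps added to the denominators, the random initialization, and the function name nmf are assumptions made for this illustration.

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimizing ||V - WH|| (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)    # update coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)    # update bases
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Hypothetical non-negative data: 6-dimensional samples, 5 samples.
V = np.abs(np.random.default_rng(1).normal(size=(6, 5)))
W, H, errors = nmf(V, k=2)
```

Because every factor in the update is non-negative, W and H stay non-negative without any explicit projection step.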
Multiplicative Update Algorithm
• Cost function: divergence
  - Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions
• Multiplicative update
• pLSA is NMF with KL divergence
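For reference, the divergence and its update rules as given in the Lee and Seung NIPS paper cited in the reference slide can be written as follows. The index names (i, a, μ, k, ν) follow that paper rather than the slides.

```latex
% Divergence between non-negative matrices A and B
D(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)

% It reduces to the KL divergence when \sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1.

% Multiplicative updates for D(V \,\|\, WH)
H_{a\mu} \leftarrow H_{a\mu}
  \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}},
\qquad
W_{ia} \leftarrow W_{ia}
  \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}
```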
NMF vs PCA
• m = 2429 faces, n = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x_1, x_2, ..., x_D)^T denote a data point and D = {x^(1), x^(2), ..., x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, ...
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x^(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = f_θ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters under which the observed data are "most likely" to have been generated by the model
• Suppose we are given n data samples (x_1, x_2, …, x_n)
• Maximum likelihood finds the θ that maximizes L(θ): θ_ML = argmax_θ p(x_1, …, x_n | θ)
• Predictive distribution: p(x | x_1, …, x_n) ≈ p(x | θ_ML)
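IID - Independent Identically Distributed
• IID means that the samples are drawn independently from the same distribution: p(x_1, x_2, …, x_n | θ) = Π_i p(x_i | θ)
• The problem is considerably simplified, as L(θ) = Π_{i=1..n} p(x_i | θ)
• Usually the log likelihood is used: log L(θ) = Σ_{i=1..n} log p(x_i | θ)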
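As a concrete instance of maximum likelihood under the IID assumption, consider fitting a one-dimensional Gaussian. The sketch below uses only the Python standard library; the sample values and function names are made up for illustration.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma):
    """Sum of log N(x | mu, sigma^2) over IID samples."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def gaussian_mle(xs):
    """Closed-form ML estimates: sample mean and (biased) sample std."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

xs = [2.1, 1.9, 2.4, 2.0, 1.6]   # hypothetical observations
mu_hat, sigma_hat = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_hat, sigma_hat)
```

Any other (mu, sigma) pair gives a log likelihood no larger than best, which is what "maximum likelihood" means here.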
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe complex models: Gaussian mixture model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: observed data D, latent variables X
• Likelihood: p(D|θ) = ∫ p(X, D|θ) dX
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that θ_ML = argmax_θ p(D|θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps:
  - E step: fill in values of the latent variables according to their posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes a difficult problem into a series of tractable steps
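The two steps can be illustrated with a one-dimensional Gaussian mixture, the example model named earlier in the talk. This is a minimal sketch using only the standard library; the data, the initial values, and the function names are assumptions for illustration, and no safeguards against degenerate components are included.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, iters=30):
    """EM for a 1-D Gaussian mixture. Returns the fitted parameters and the
    log likelihood measured at the start of each iteration."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    log_likelihoods = []
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point.
        resp = []
        ll = 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        log_likelihoods.append(ll)
        # M step: weighted maximum-likelihood updates of the parameters.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
    return mus, variances, weights, log_likelihoods

# Hypothetical data drawn around -2 and +2.
xs = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
mus, variances, weights, lls = em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The recorded log likelihoods form a non-decreasing sequence, which is the property EM is known for.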
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}, latent variables X = {x_n}, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  L(θ) = log ∫ p(X, D|θ) dX ≥ ∫ q(X) log p(X, D|θ) dX + H[q] = F(q, θ)
• where H[q] is the entropy of q(X)
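The bound follows in a few lines once q(X) is introduced inside the integral. The derivation below is the standard one; the symbols L, F and H[q] are the ones used on the surrounding slides.

```latex
% Jensen's inequality for the concave function log:
%   \log \mathbb{E}_q[f(X)] \ge \mathbb{E}_q[\log f(X)]
\begin{aligned}
L(\theta) &= \log p(D \mid \theta)
           = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX \\
          &\ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX \\
          &= \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
           \;=\; F(q, \theta),
\qquad H[q] = -\int q(X) \log q(X)\, dX .
\end{aligned}
```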
The E and M Steps of EM
• The lower bound on the log likelihood is given by F(q, θ)
• EM alternates between
  - E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
  - M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ,
  F(q, θ) = L(θ) − KL(q(X) || p(X|D, θ))
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:
  θ ← argmax_θ F(q, θ) = argmax_θ ∫ q(X) log p(X, D|θ) dX
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of the KL divergence
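Putting the three facts together gives the usual monotonicity argument. The iteration superscripts (t) are notation introduced here for illustration.

```latex
% Gap between the log likelihood and the bound
L(\theta) - F(q, \theta) = \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big) \ge 0

% One EM iteration: E step sets q^{(t+1)} = p(X \mid D, \theta^{(t)}),
% M step sets \theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta)
L(\theta^{(t)})
  \;=\; F(q^{(t+1)}, \theta^{(t)})      % E step closes the gap
  \;\le\; F(q^{(t+1)}, \theta^{(t+1)})  % M step increases F
  \;\le\; L(\theta^{(t+1)})             % F is a lower bound
```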
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model looks complex, even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general: p(X_1, …, X_n) = Π_i p(X_i | X_pa(i))
• where pa(i) are the parents of node i
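To see the factorization at work, the sketch below builds the joint distribution of five binary variables from local conditionals with the same parent structure as the example above. All probability values are invented for illustration.

```python
from itertools import product

def bernoulli(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D) with made-up tables."""
    return (bernoulli(0.3, a)
            * bernoulli(0.6, b)
            * bernoulli(0.1 + 0.4 * a + 0.3 * b, c)   # C depends on A and B
            * bernoulli(0.2 + 0.5 * b * c, d)         # D depends on B and C
            * bernoulli(0.05 + 0.9 * c * d, e))       # E depends on C and D

# Summing the product of local conditionals over all 2^5 states gives 1.
total = sum(joint(*values) for values in product([0, 1], repeat=5))
```

A full table over five binary variables needs 31 free parameters, while this factorization needs 1 + 1 + 4 + 4 + 4 = 14, which is the practical benefit of the graph structure.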
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies with personal style
• Latent semantic indexing
  - Creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms
• Disadvantages
  - The statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to address
  - Polysemy
    - "Java" could mean "coffee" and also the programming language Java
    - "Cricket" is a "game" and also an "insect"
  - Synonymy
    - "computer", "pc" and "desktop" could all mean the same thing
• Has a better statistical foundation than LSA
pLSA
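[Figure: pLSA graphical model. Left: plate notation with d → z → w, inner plate N_d and outer plate M. Right: unrolled form, d → z_1, z_2, z_3, …, z_N, each z_i → w_i]
• z_1 … z_N are variables, z_i ∈ [1, K]; K is the number of latent topics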
pLSA
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_M; document d_i has topics z_1 … z_{N_i} generating words w_1 … w_{N_i}]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability: p(d, z, w) = p(d) p(z|d) p(w|z)
• Likelihood (only for observed variables): p(d, w) = p(d) Σ_z p(w|z) p(z|d)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
pLSA ndash Objective Function
• pLSA tries to maximize the log likelihood: L = Σ_d Σ_w n(d, w) log p(d, w), where n(d, w) is the count of word w in document d
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step: p(z|d, w) = p(z|d) p(w|z) / Σ_z' p(z'|d) p(w|z')
• The M-Step: p(w|z) ∝ Σ_d n(d, w) p(z|d, w);  p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
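The pLSA EM iteration can be sketched compactly in NumPy. The version below follows the standard Hofmann updates (posterior p(z|d,w) in the E step, re-normalized expected counts in the M step); the count matrix, the eps guard, the random initialization, and the function name plsa are assumptions for illustration, and the sketch assumes no document or word has a zero total count.

```python
import numpy as np

def plsa(N, K, iters=50, seed=0, eps=1e-12):
    """N: (M docs x V words) count matrix.
    Returns p(z|d) as (M, K), p(w|z) as (K, V), and the log likelihood per iteration."""
    rng = np.random.default_rng(seed)
    M, V = N.shape
    p_z_d = rng.random((M, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    lls = []
    for _ in range(iters):
        # E step: p(z|d,w) is proportional to p(z|d) p(w|z); arrays have shape (M, K, V).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)                      # p(w|d), shape (M, V)
        lls.append(float((N * np.log(p_w_d + eps)).sum()))
        post = joint / (p_w_d[:, None, :] + eps)
        # M step: re-estimate both distributions from expected counts.
        weighted = N[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z, lls

# Hypothetical corpus: 4 documents over a 4-word vocabulary, two obvious themes.
N = np.array([[4, 3, 0, 0],
              [3, 4, 1, 0],
              [0, 0, 5, 4],
              [0, 1, 3, 5]], dtype=float)
p_z_d, p_w_z, lls = plsa(N, K=2)
```

The product p_z_d @ p_w_z gives the reconstructed p(w|d) for every document, which is the matrix-decomposition view of pLSA.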
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix: related to P(z|d) (doc to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model; each document d_1, d_2, …, d_m has its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
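• Definition
  $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)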
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0: α_0 = 0.1, α_0 = 1, α_0 = 10
  where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
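The effect of α_0 is easy to explore by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma draws, a standard construction; it uses only the standard library, and the seed and helper name are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
samples = {}
for alpha0 in (0.1, 1.0, 10.0):
    # Equal alpha_i with K = 3, so each alpha_i = alpha0 / 3.
    samples[alpha0] = [sample_dirichlet([alpha0 / 3] * 3, rng) for _ in range(5)]
```

Printing samples shows the qualitative pattern behind the plots: small α_0 tends to push mass toward the corners of the simplex, while large α_0 concentrates samples near the uniform mixture.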
The LDA Model
[Figure: LDA graphical model; each document has a chain of topics z_1 … z_4 generating words w_1 … w_4, with a shared parameter β]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words w_n
    - Choose a topic z_n ~ Multinomial(θ)
    - Choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on the topic z_n
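The generative process reads almost directly as code. The sketch below samples one document; the two-topic β table, the hyperparameters, and the function names are invented for illustration, and words are represented by integer ids.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, [(z_n, w_n), ...]) following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)                    # topic mixture of this document
    topics = list(range(len(alpha)))
    vocab = list(range(len(beta[0])))
    doc = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]           # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]          # w_n ~ p(w | z_n, beta)
        doc.append((z, w))
    return theta, doc

# Hypothetical model: 2 topics over a 4-word vocabulary.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
theta, doc = generate_document([0.5, 0.5], beta, n_words=20, rng=random.Random(0))
```

Inference in LDA runs this process in reverse: given only the words, it estimates θ and the topic assignments.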
Joint Probability
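• Given parameters α and β, the joint probability of a topic mixture θ, topics z and words w is
  p(θ, z, w | α, β) = p(θ|α) Π_{n=1..N} p(z_n|θ) p(w_n|z_n, β)
  where p(θ|α) is the Dirichlet density
Likelihood
• Joint probability: p(θ, z, w | α, β)
• Marginal distribution of a document: p(w | α, β) = ∫ p(θ|α) Π_{n=1..N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) dθ
• Likelihood over all the documents: p(D | α, β) = Π_{d=1..M} p(w_d | α, β)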
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality is applied as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  - β is re-estimated from the variational parameters
  - α can be updated using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model with topic chains z_1 … z_4, words w_1 … w_4, and shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates N_d, M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11×11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)
  Before (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  After (Dim = 200):
          f1    f2    …  fM
    Img1  0.2   0.1   …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
  p = Xw + ε
[Figure: an image = a low-dimensional representation (bar chart) + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional representation (bar chart) + a few words (10 words)]
$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
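The reason for keeping the residual can be checked numerically: if the base vectors are orthonormal and the residual is what is left after projection, inner products between original vectors split exactly into a low-dimensional part and a residual part. The sketch below assumes NumPy, random vectors, and an orthonormal basis obtained by QR; the paper's actual construction of X is not shown in the transcript, so this is only an illustration of the identity.

```python
import numpy as np

def decompose(p, X):
    """Split p into coordinates w on the orthonormal columns of X and a residual."""
    w = X.T @ p
    residual = p - X @ w
    return w, residual

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(8, 3)))   # 8-dim space, 3 orthonormal base vectors
p = rng.normal(size=8)
q = rng.normal(size=8)
w_p, e_p = decompose(p, X)
w_q, e_q = decompose(q, X)
# p . q equals w_p . w_q + e_p . e_q because the residuals are orthogonal to X.
```

In a retrieval setting the residual is typically sparse enough to be summarized by a few words, which matches the "topic vector + a few words" picture in the slides.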
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  - a topic-specific distribution
  - a document-specific distribution
  - a background distribution
  $p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)$
  where p'(w|d) is the document-specific distribution and p''(w) is the background distribution
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query, represented as a low-dimensional vector (bar chart) plus a few words, is looked up in the LSH index over Doc 1, Doc 2, …, Doc N, each stored with its DS1, DS2, … entries as doc meta; the retrieved candidates (Doc 300, Doc 401, …) are then re-ranked]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image and search results]
Search Example
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                Apple
         All Posts  Good Posts   All Posts  Good Posts
NP       0.021      0.012        0.289      0.239
RR       0.183      0.319        0.269      0.474
DS       0.463      0.643        0.409      0.628
LDA      0.465      0.644        0.410      0.648
SWB      0.463      0.644        0.410      0.641
SMSS     0.524      0.737        0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  - HITS
  - PageRank
  - …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to practice working with matrices
  - Graphical model: a good way to practice working with probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture Gaussian parameter estimation
• pLSA
– Motivation
– Derivation & geometry properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x | \theta) = p(x | \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
Maximum Likelihood (ML)
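• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \ldots, x_n | \theta)$
• Predictive distribution: $p(x_{n+1} | x_1, \ldots, x_n) \approx p(x_{n+1} | \theta_{ML})$
IID – Independent Identically Distributed
• IID means $p(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^{n} p(x_i | \theta)$
• The problem is considerably simplified as $\theta_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i | \theta)$
• Usually the log likelihood is used: $\theta_{ML} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i | \theta)$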
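As a small worked example of maximum likelihood under the i.i.d. assumption, the sketch below fits a Bernoulli parameter to coin flips. The data, the grid search, and the function name `log_likelihood` are illustrative assumptions; the point is only that the sum of per-sample log probabilities peaks at the sample mean.

```python
import math

def log_likelihood(theta, xs):
    """Log likelihood of i.i.d. Bernoulli samples xs (values 0 or 1) under theta."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in xs)

# Hypothetical coin flips: 7 heads out of 10.
xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Brute-force search over a grid of candidate parameters in (0, 1).
grid = [i / 100 for i in range(1, 100)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))

# For a Bernoulli model the maximizer is known in closed form: the sample mean.
theta_closed_form = sum(xs) / len(xs)
```

Working with the log turns the product over samples into a sum, which is both easier to differentiate and numerically safer than multiplying many small probabilities.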
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
– To describe complex models: Gaussian Mixture Model
– To discover the intrinsic structure inside a data set: topic models such as pLSA, LDA
More General
• Data set: observed data D, latent variables X
• Likelihood: $p(D | \theta) = \int p(X, D | \theta) \, dX$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D | \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
– E step: fill in values of latent variables according to the posterior given the data
– M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
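The two steps can be sketched for a two-component, one-dimensional Gaussian mixture. This is an illustrative sketch under stated assumptions: the synthetic data, the initialization at the data extremes, the variance floor of 1e-6, and all names are choices made for the example, not part of the original slides.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

def em_gmm(xs, iters=30):
    """EM for a two-component 1-D Gaussian mixture."""
    pis, mus, variances = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    history = [log_likelihood(xs, pis, mus, variances)]
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point
        resp = []
        for x in xs:
            w = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M step: weighted maximum likelihood estimates, as if assignments were known
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pis[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsed component
        history.append(log_likelihood(xs, pis, mus, variances))
    return pis, mus, variances, history

# Hypothetical data: 100 points around 0 and 100 points around 5.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(5.0, 1.0) for _ in range(100)]
pis, mus, variances, history = em_gmm(xs)
```

The `history` list records the log likelihood before the first iteration and after every M step, which makes it easy to check that the value never goes down.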
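Jensen's Inequality
• For the concave function log: $\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$, where $x_i > 0$, $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ: $L(\theta) = \log p(D | \theta) = \log \int p(X, D | \theta) \, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality: $L(\theta) = \log \int q(X) \frac{p(X, D | \theta)}{q(X)} \, dX \ge \int q(X) \log \frac{p(X, D | \theta)}{q(X)} \, dX = \int q(X) \log p(X, D | \theta) \, dX + H[q] = F(q, \theta)$
• where H[q] is the entropy of q(X)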
The E and M Steps of EM
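• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(X, D | \theta) \, dX + H[q]$
• EM alternates between
– E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed: $q^{(t+1)} = \arg\max_q F(q, \theta^{(t)})$
– M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: $\theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta)$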
The E Step
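• E step: for fixed θ, $F(q, \theta) = \int q(X) \log \frac{p(X | D, \theta) \, p(D | \theta)}{q(X)} \, dX = L(\theta) - KL(q(X) \| p(X | D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by the log likelihood L(θ) and achieves that bound when $q(X) = p(X | D, \theta)$
• So the E step simply sets $q^{(t+1)}(X) = p(X | D, \theta^{(t)})$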
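The relation between the bound, the log likelihood and the KL divergence can be checked numerically on a tiny discrete model. The numbers in `joint` are made up for illustration, and `free_energy` and `kl` are hypothetical helper names.

```python
import math

def free_energy(q, joint):
    """F(q, theta) = sum_x q(x) log p(x, D | theta) + H[q] for a discrete latent x."""
    return sum(qx * math.log(px / qx) for qx, px in zip(q, joint) if qx > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions over the same states."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Hypothetical values of p(x, D | theta) for the two latent states x = 0 and x = 1.
joint = [0.12, 0.28]
evidence = sum(joint)                       # p(D | theta)
log_lik = math.log(evidence)                # log likelihood
posterior = [j / evidence for j in joint]   # p(x | D, theta)

q_arbitrary = [0.5, 0.5]
gap = log_lik - free_energy(q_arbitrary, joint)   # equals KL(q || posterior)
bound_at_posterior = free_energy(posterior, joint)  # equals the log likelihood
```

With an arbitrary q the bound sits strictly below the log likelihood, and the gap is exactly the KL divergence to the posterior. Setting q to the posterior closes the gap, which is what the E step does.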
The M Step
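• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed: $\theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta) = \arg\max_\theta \int q^{(t+1)}(X) \log p(X, D | \theta) \, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically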
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood: $L(\theta^{(t)}) = F(q^{(t+1)}, \theta^{(t)}) \le F(q^{(t+1)}, \theta^{(t+1)}) \le L(\theta^{(t+1)})$
• The E step brings F(q, θ) up to the log likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model looks complex, even with only a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$p(A, B, C, D, E) = p(A) \, p(B) \, p(C | A, B) \, p(D | B, C) \, p(E | C, D)$
• In general: $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i | X_{pa(i)})$
• where pa(i) are the parents of node i
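The five-variable factorization can be written out directly for binary variables. All probability values below are invented for illustration; only the structure (which variables condition on which parents) follows the factorization p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).

```python
from itertools import product

# Hypothetical conditional probability tables; each entry is P(node = 1 | parents).
p_a = 0.3
p_b = 0.6
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}   # keyed by (A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}   # keyed by (B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # keyed by (C, D)

def bernoulli(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) as the product of each node's probability given its parents."""
    return (
        bernoulli(p_a, a)
        * bernoulli(p_b, b)
        * bernoulli(p_c[(a, b)], c)
        * bernoulli(p_d[(b, c)], d)
        * bernoulli(p_e[(c, d)], e)
    )

# The factorized joint is a valid distribution over all 32 configurations.
total = sum(joint(*state) for state in product([0, 1], repeat=5))

# Any marginal follows by summing out the other variables, e.g. P(E = 1).
p_e_is_one = sum(joint(a, b, c, d, 1) for a, b, c, d in product([0, 1], repeat=4))
```

The factorized model here needs 1 + 1 + 4 + 4 + 4 = 14 numbers, whereas an unstructured joint table over five binary variables needs 31.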
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same queries vary due to personal styles
• Latent semantic indexing
– Creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • Java could mean "coffee" and also the programming language "Java"
  • Cricket is a "game" and also an "insect"
– Synonymy
  • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
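pLSA
[Figure: pLSA graphical model in plate notation (document d, topic z, word w; plates N_d and M), next to the unrolled model for one document d with topics z_1 … z_N and words w_1 … w_N]
• z_1 … z_N are variables, z_i ∈ [1, K]; K is the number of latent topics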
pLSA
[Figure: unrolled pLSA model for M documents d_1 … d_M, each with its own topics z and words w]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability: $p(d, z, w) = p(d) \, p(z | d) \, p(w | z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_z p(z | d) \, p(w | z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w | d) = \sum_z p(w | z) \, p(z | d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$p(w | d) = Z_{V \times k} \, p(z | d)$, where the columns of Z are the topic distributions p(w|z)
• With many documents, we hope to find latent topics as a common basis
pLSA ndash Objective Function
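• pLSA tries to maximize the log likelihood $\sum_d \sum_w n(d, w) \log \left[ p(d) \sum_z p(w | z) \, p(z | d) \right]$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM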
EM Steps
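• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step: $p(z | d, w) = \frac{p(w | z) \, p(z | d)}{\sum_{z'} p(w | z') \, p(z' | d)}$
• The M-Step: $p(w | z) \propto \sum_d n(d, w) \, p(z | d, w)$ and $p(z | d) \propto \sum_w n(d, w) \, p(z | d, w)$, where n(d, w) is the number of times word w occurs in document d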
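A compact sketch of pLSA trained with EM on a document-word count matrix, assuming the standard updates: the E step computes p(z|d,w) proportional to p(w|z) p(z|d), and the M step re-estimates p(w|z) and p(z|d) from the expected counts. The function name `plsa`, the random initialization, and the toy corpus are assumptions for illustration.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, K, iters=30, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood history."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(n_docs)]
    history = []
    for _ in range(iters):
        acc_wz = [[0.0] * n_words for _ in range(K)]
        acc_zd = [[0.0] * K for _ in range(n_docs)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E step: p(z | d, w) is proportional to p(w | z) p(z | d)
                weights = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                s = sum(weights)
                log_lik += n * math.log(s)  # log p(w | d) under the current parameters
                for z in range(K):
                    expected = n * weights[z] / s
                    acc_wz[z][w] += expected
                    acc_zd[d][z] += expected
        history.append(log_lik)
        # M step: normalize the expected counts into probability distributions
        p_w_z = [normalize(row) for row in acc_wz]
        p_z_d = [normalize(row) for row in acc_zd]
    return p_w_z, p_z_d, history

# Hypothetical corpus: documents 0-1 use words 0-2, documents 2-3 use words 3-5.
counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 0, 0, 2, 4, 3],
    [0, 0, 0, 3, 2, 4],
]
p_w_z, p_z_d, history = plsa(counts, K=2)
```

The uniform p(d) term is dropped from the recorded log likelihood because it is a constant that does not affect the updates.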
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U matrix related to P(z|d) (doc to aspect)
– V matrix related to P(w|z) (aspect to term)
– Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d_1 … d_m, each with its own unconstrained topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
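• Definition: $p(x_1, \ldots, x_k | \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{k} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex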
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal $\alpha_i$, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: $\alpha_0 = 0.1$, $\alpha_0 = 1$, $\alpha_0 = 10$
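One standard way to draw Dirichlet samples, sketched here under the assumption that only Python's standard library is available, is to normalize independent Gamma draws. The function name `sample_dirichlet` and the sample size are illustrative; the parameter (6, 2, 2) is one of the examples listed above.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# The empirical mean should approach alpha_i / alpha_0, here (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]
```

Every sample is a point on the simplex: its components are non-negative and sum to one, so it can be used directly as a topic mixture proportion.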
The LDA Model
[Figure: LDA graphical model; each document has topics z_1 … z_4 and words w_1 … w_4, and the word distributions β are shared across documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
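The generative process can be simulated directly. The sketch below assumes a hypothetical two-topic model over a four-word vocabulary; `alpha`, `beta`, `vocab` and the helper names are made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(alpha, beta, n_words, rng):
    theta = sample_dirichlet(alpha, rng)          # topic proportions for this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # choose a topic for this word
        w = sample_categorical(beta[z], rng)      # choose a word from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical model: beta[z][w] = p(word w | topic z).
vocab = ["ball", "game", "code", "java"]
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.4, 0.1, 0.0],   # topic 0: mostly "ball", "game"
    [0.0, 0.1, 0.4, 0.5],   # topic 1: mostly "code", "java"
]
rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
document = [vocab[w] for w in words]
```

Unlike pLSA, the per-document proportions `theta` are themselves drawn from a distribution, so a new document does not require new free parameters.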
Joint Probability
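• Given parameters α and β: $p(\theta, z, w | \alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) \, p(w_n | z_n, \beta)$
where $p(z_n | \theta) = \theta_i$ for the topic i with $z_n = i$
Likelihood
• Joint probability: $p(\theta, z, w | \alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) \, p(w_n | z_n, \beta)$
• Marginal distribution of a document: $p(w | \alpha, \beta) = \int p(\theta | \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n | \theta) \, p(w_n | z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D | \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d | \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} | \theta_d) \, p(w_{dn} | z_{dn}, \beta) \right) d\theta_d$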
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables: $p(\theta, z | w, \alpha, \beta) = \frac{p(\theta, z, w | \alpha, \beta)}{p(w | \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence: $\log p(w | \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL(q(\theta, z | \gamma, \phi) \| p(\theta, z | w, \alpha, \beta))$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
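• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step (variational inference), repeat until convergence: $\phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi(\sum_j \gamma_j) \right)$, $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$, where Ψ is the digamma function
• M-step (parameter estimation): $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni} \, w_{dn}^j$, where $w_{dn}^j = 1$ if the n-th word of document d is word j and 0 otherwise
• The update of α can be implemented using the Newton-Raphson method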
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model, with topics z_1 … z_4 and words w_1 … w_4 per document and shared word distributions β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model in plate notation with variables π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11x11 gray values, 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
      Term1  Term2  Term3  Term4  …  TermN
Img1  1      2      0      0      …  2        (Dim = 1 million)
Topic projection:
      f1   f2   …  fM
Img1  0.2  0.1  …  0.03                       (Dim = 200)
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a topic histogram + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$, where $X = (x_1, x_2, \ldots, x_k)$ holds the base vectors, w is the low-dimensional representation, and ε is the residual
[Figure: an image = a topic histogram + a few words (10 words)]
Similarity of two documents p and q: $p \cdot q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
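The decomposition of a document vector into a low-dimensional part plus a residual can be illustrated with a small numeric sketch. It assumes the base vectors are orthonormal, which is an assumption of this example rather than something the slides state in full; the vectors and the helper name `decompose` are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, bases):
    """Split p into coordinates w on orthonormal base vectors and a residual."""
    w = [dot(x, p) for x in bases]
    projection = [sum(w[k] * bases[k][i] for k in range(len(bases))) for i in range(len(p))]
    residual = [pi - qi for pi, qi in zip(p, projection)]
    return w, residual

# Hypothetical orthonormal base vectors spanning the first two dimensions.
bases = [
    [0.6, 0.8, 0.0, 0.0],
    [-0.8, 0.6, 0.0, 0.0],
]
p = [1.0, 2.0, 3.0, 4.0]
q = [2.0, 0.0, 1.0, 5.0]

w_p, eps_p = decompose(p, bases)
w_q, eps_q = decompose(q, bases)

# With orthonormal bases, the inner product splits into a low-dimensional
# part and a residual part, so similarity can be computed from both pieces.
similarity_full = dot(p, q)
similarity_split = dot(w_p, w_q) + dot(eps_p, eps_q)
```

Keeping the residual means that no similarity information is lost by the dimension reduction, as long as the residual is sparse enough to store as "a few words".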
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w | d) = p(x = 0 | d) \sum_{k=1}^{K} p(w | z = k) \, p(z = k | d) + p(x = 1 | d) \, p'(w | d) + p(x = 2 | d) \, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
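The switch variable amounts to a three-way mixture for each word. The sketch below evaluates that mixture for one document; every probability value and name is hypothetical and chosen only so that the pieces are easy to follow.

```python
def word_distribution(p_x, p_z_d, p_w_z, p_w_doc, p_w_bg):
    """p(w|d) as a mixture of topic, document-specific and background parts.

    p_x     : [p(x=0|d), p(x=1|d), p(x=2|d)], the switch probabilities
    p_z_d   : p(z=k|d) for each topic k
    p_w_z   : p_w_z[k][w] = p(w|z=k)
    p_w_doc : document-specific word distribution
    p_w_bg  : background word distribution
    """
    result = []
    for w in range(len(p_w_bg)):
        topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
        result.append(p_x[0] * topic_part + p_x[1] * p_w_doc[w] + p_x[2] * p_w_bg[w])
    return result

# Hypothetical model with 2 topics and a 4-word vocabulary.
p_x = [0.6, 0.3, 0.1]
p_z_d = [0.7, 0.3]
p_w_z = [
    [0.5, 0.3, 0.2, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]
p_w_doc = [0.0, 0.0, 0.0, 1.0]      # this document has one special word
p_w_bg = [0.25, 0.25, 0.25, 0.25]   # uniform background

p_w_d = word_distribution(p_x, p_z_d, p_w_z, p_w_doc, p_w_bg)
```

Here the document-specific part gives extra mass to the last word, which a pure topic mixture would otherwise have to explain through its topics.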
Search (Online)
[Figure: online search pipeline. A query is represented as a topic histogram plus a few words; the LSH index over the documents (Doc 1 … Doc N) returns candidates such as Doc 300 and Doc 401, which are then re-ranked using the document metadata]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method  Slashdot               Apple
        All Posts  Good Posts  All Posts  Good Posts
NP      0.021      0.012       0.289      0.239
RR      0.183      0.319       0.269      0.474
DS      0.463      0.643       0.409      0.628
LDA     0.465      0.644       0.410      0.648
SWB     0.463      0.644       0.410      0.641
SMSS    0.524      0.737       0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: a good practice to learn matrix
– Graphical model: a good practice to learn probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRank
bull PageRank may be computed once HITS is computed per query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
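Document Decomposition
• Each document can be decomposed as a mixture of topics:
  $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\,p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  $p(w \mid d) = Z_{V \times k}\; p(z \mid d)$, where column z of Z is the topic distribution p(w|z)
• With many documents we hope to find latent topics as a common basis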
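A tiny numeric sketch of this matrix view, with made-up numbers (a vocabulary of V = 4 words and k = 2 topics). Multiplying the topic matrix by a document's topic mixture yields that document's word distribution.

```python
# Hypothetical toy values: column j of Z is the word distribution p(w | z = j).
Z = [[0.5, 0.0],
     [0.3, 0.1],
     [0.2, 0.3],
     [0.0, 0.6]]
p_z_given_d = [0.25, 0.75]  # topic mixture p(z | d) of one document

# p(w | d) = Z p(z | d): a matrix-vector product written out explicitly.
p_w_given_d = [sum(Z[w][z] * p_z_given_d[z] for z in range(2)) for w in range(4)]
```

Because every column of Z and the mixture vector each sum to one, the result is again a probability distribution over the vocabulary.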
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
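The E and M steps for pLSA can be sketched as below. This is a minimal pure-Python illustration assuming the standard updates (E-step: p(z|d,w) ∝ p(w|z) p(z|d); M-step: re-normalize the expected counts n(d,w) p(z|d,w)), not necessarily the exact form shown on the slides. The function name `plsa`, the toy count matrix and the random initialization are all hypothetical.

```python
import math
import random


def plsa(counts, K, iters=30, seed=0):
    """Fit pLSA by EM on a document-by-word count matrix (list of lists)."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # Random positive initialization of p(w|z) and p(z|d).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    history = []  # log likelihood of the parameters at the start of each iteration

    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        loglik = 0.0
        for d in range(D):
            for w in range(V):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior p(z | d, w) is joint[z] / p_w_d.
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                p_w_d = sum(joint)
                loglik += counts[d][w] * math.log(p_w_d)
                for z in range(K):
                    expected = counts[d][w] * joint[z] / p_w_d
                    new_wz[z][w] += expected
                    new_zd[d][z] += expected
        history.append(loglik)
        # M-step: re-normalize the expected counts.
        p_w_z = [normalize(row) for row in new_wz]
        p_z_d = [normalize(row) for row in new_zd]
    return p_w_z, p_z_d, history


# Toy corpus: 4 documents over 4 words, two loosely separated groups.
counts = [[4, 3, 0, 1],
          [5, 2, 1, 0],
          [0, 1, 4, 3],
          [1, 0, 3, 5]]
p_w_z, p_z_d, history = plsa(counts, K=2)
```

The recorded log likelihood should never decrease from one iteration to the next, which is a convenient sanity check for any EM implementation.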
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In PLSA, by having K aspects
• Comparison to SVD
  – U matrix is related to P(z|d) (doc to aspect)
  – V matrix is related to P(w|z) (aspect to term)
  – Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level; each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: the unrolled pLSA model for documents d_1, d_2, …, d_M, each with its own unconstrained topic distribution p(z|d_1), p(z|d_2), …, p(z|d_M)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
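Dirichlet Distribution
• Definition
  $p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\; x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex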
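The Dirichlet density, Γ(Σ α_i) / Π Γ(α_i) · Π x_i^(α_i − 1) on the open simplex, can be evaluated directly. The sketch below works in log space with `math.lgamma` for numerical stability; the function name `dirichlet_pdf` and the tolerance used for the sum-to-one check are choices made here for illustration.

```python
import math


def dirichlet_pdf(x, alpha):
    """Density of Dirichlet(alpha) at x; zero outside the open simplex."""
    on_simplex = (len(x) == len(alpha)
                  and all(xi > 0 for xi in x)
                  and abs(sum(x) - 1.0) < 1e-9)
    if not on_simplex:
        return 0.0
    # log of the normalizer Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # log of the kernel prod x_i^(alpha_i - 1)
    log_kernel = sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return math.exp(log_norm + log_kernel)


uniform_density = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
peaked_density = dirichlet_pdf([1 / 3, 1 / 3, 1 / 3], [2.0, 2.0, 2.0])
outside_density = dirichlet_pdf([0.5, 0.6, -0.1], [1.0, 1.0, 1.0])
```

For α = (1, 1, 1) the density is constant on the simplex and equals Γ(3) = 2, which makes a quick check of the normalizer.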
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
The LDA Model
[Figure: the LDA graphical model unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing the parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
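The generative process for one document (draw a topic mixture from a Dirichlet, then draw a topic and a word for each position) can be simulated with the standard library. This is a sketch under assumptions: the Dirichlet draw is obtained by normalizing independent gamma draws, `beta[z]` is taken to be the word distribution of topic z, and all names and numbers are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, n_words, rng):
    """Generate the topics and words of one document under the LDA model."""
    theta = sample_dirichlet(alpha, rng)  # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        # z_n ~ Multinomial(theta)
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        # w_n ~ p(w | z_n, beta)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]                       # 3 topics
beta = [[0.7, 0.2, 0.1, 0.0],                 # topic 0 over a 4-word vocabulary
        [0.0, 0.1, 0.2, 0.7],                 # topic 1
        [0.25, 0.25, 0.25, 0.25]]             # topic 2
theta, topics, words = generate_document(alpha, beta, 20, rng)
```

Unlike pLSA, the per-document mixture θ is itself a random draw, so a new document needs no new parameters.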
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
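The three quantities listed above can be written as follows. This is a sketch assuming the notation of Blei, Ng and Jordan (2003): θ is the topic mixture of a document, z_n and w_n are the topic and word at position n, and M is the number of documents.

```latex
% Joint probability of the topic mixture, topics and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all M documents of the corpus
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```

The integral over θ coupled with the sum over topics is what makes exact inference difficult.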
Inference
• The log likelihood of the corpus is computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM it is intractable to compute p(X|D, θ); instead, p(X|D, θ) is approximated by a variational distribution q(X) obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  – β has a closed-form update
  – α can be estimated using the Newton-Raphson method
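For reference, the update equations of variational EM for LDA can be sketched as below. They are written in the notation of Blei, Ng and Jordan (2003), where φ_{ni} is the variational topic probability of word n, γ is the variational Dirichlet parameter, Ψ is the digamma function, and w_{dn}^j = 1 when word n of document d is vocabulary item j; treat them as an assumption about what the slides displayed.

```latex
% E-step (per document), iterated until convergence
\phi_{ni} \propto \beta_{i w_n} \exp\!\Big( \Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{K} \gamma_j\big) \Big)
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step: closed-form update of the topic-word probabilities
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
```

There is no closed-form update for α, which is why an iterative method such as Newton-Raphson is used for it.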
Topic Examples in a 100-Topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing the parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with variables π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection. A term vector over Term1 … TermN (Dim = 1 million), e.g. Img1 = (1, 2, 0, 0, …, 2), is projected to a feature vector over f1 … fM (Dim = 200), e.g. Img1 = (0.2, 0.1, …, 0.03)]
Key Idea: Dimension Reduction + Residual Error Preservation
  $p = Xw + \varepsilon$
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
[Figure: an image = a bar chart of the low-dimensional representation + a few words (10 words)]
Orthogonal Decomposition
  $p = Xw + \varepsilon$, with $X = (x_1, x_2, \ldots, x_k)$
• X: base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a bar chart of the low-dimensional representation + a few words (10 words)]
• Similarity of two documents p and q:
  $p^T q \approx w_p^T w_q + \varepsilon_p^T \varepsilon_q$
[Figure: base vectors X1, X2, X3, …, Xk]
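A small numeric sketch of an orthogonal decomposition p = Xw + ε: when the base vectors are orthonormal and the residual is what remains after projection, the inner product of two documents splits into a low-dimensional part and a residual part. The base vectors, the vectors p and q, and the helper names are made up for illustration; in the retrieval setting described above only a few residual entries are kept, so the identity becomes an approximation there.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two orthonormal base vectors in a 4-dimensional "vocabulary" space (hypothetical).
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, -0.5, -0.5]]


def decompose(p, basis):
    """Return (w, eps) with p = sum_j w[j] * basis[j] + eps and eps orthogonal to the basis."""
    w = [dot(x, p) for x in basis]
    reconstruction = [sum(w[j] * basis[j][i] for j in range(len(basis)))
                      for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, reconstruction)]
    return w, eps


p = [3.0, 1.0, 2.0, 0.0]
q = [1.0, 2.0, 0.0, 1.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

similarity_full = dot(p, q)
similarity_split = dot(w_p, w_q) + dot(eps_p, eps_q)
```

Here the full-space similarity and the split similarity agree exactly, because no residual entries were discarded.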
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
  $p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\,p(z{=}k \mid d) + p(x{=}1 \mid d)\,p'(w \mid d) + p(x{=}2 \mid d)\,p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
Search (Online)
[Figure: online search pipeline. A query is decomposed into a low-dimensional representation plus a few words; the LSH index (buckets of entries DS1, DS2, …) returns candidate documents (e.g. Doc 300, Doc 401, …), which are re-ranked using the document metadata (Doc 1, Doc 2, …, Doc 300, Doc 401, …, Doc N)]
Index: 10M images, 46GB; search speed < 100 ms
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  – HITS
  – PageRank
  – …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrix methods
  – Graphical models: a good way to practice probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix $L^T L$
• h will be the dominant eigenvector of the hub matrix $L L^T$
• h and a are in fact the first left and right singular vectors of L, respectively
• We are in fact running SVD on the adjacency matrix
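The link between HITS and the eigenvectors of LᵀL and LLᵀ can be checked numerically with a plain power iteration. The 4-node adjacency matrix below is a made-up example (row i lists the outlinks of node i), and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hits(L, iters=100):
    """Power iteration for HITS: authority a = L^T h, hub h = L a."""
    n = len(L)
    h = normalize([1.0] * n)
    a = h
    for _ in range(iters):
        a = normalize([sum(L[i][j] * h[i] for i in range(n)) for j in range(n)])
        h = normalize([sum(L[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h


# Hypothetical link matrix: L[i][j] = 1 if page i links to page j.
L = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]
a, h = hits(L)

# Authority matrix L^T L and the Rayleigh quotient of a.
authority = [[sum(L[i][r] * L[i][c] for i in range(4)) for c in range(4)]
             for r in range(4)]
authority_a = [sum(authority[r][c] * a[c] for c in range(4)) for r in range(4)]
eigenvalue = sum(x * y for x, y in zip(a, authority_a))
```

After convergence, multiplying the authority matrix by a only rescales a, and the scale factor is the largest eigenvalue of LᵀL, that is, the square of the largest singular value of L.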
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that
  $V_{n \times m} \approx W_{n \times k} H_{k \times m}$
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column $w_i$ represents one basis vector
• H: coordinates of V projected onto W
  $v_j \approx W_{n \times k} h_j$
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance
• Multiplicative update
Multiplicative Update Algorithm
• Cost function: divergence
  – Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions
• Multiplicative update
• PLSA is NMF with KL divergence
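The multiplicative update for the Euclidean cost can be sketched in pure Python. This assumes the update rules of Lee and Seung (NIPS 2001), H ← H ∘ (WᵀV) / (WᵀWH) and W ← W ∘ (VHᵀ) / (WHHᵀ), with a tiny constant added to the denominators to avoid division by zero; the helper names, the toy matrix and the iteration count are choices made here.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(V, W, H):
    """Squared Euclidean distance between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, k, iters=200, seed=0, eps=1e-12):
    """Non-negative factorization V ~ W H by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors


# Toy non-negative data: 4-dimensional samples in 3 columns.
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 1.0, 0.0],
     [6.0, 2.0, 0.0]]
W, H, errors = nmf(V, k=2)
```

Because each update only multiplies by non-negative ratios, W and H stay non-negative without any explicit projection step.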
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for Non-negative Matrix Factorization. NIPS 2001.
• D. D. Lee and H. S. Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture of Gaussians parameter estimation
• pLSA
  – Motivation
  – Derivation & geometric properties
  – Applications
• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. The data set is sometimes associated with desired outputs $y_1, y_2, \ldots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given the data set, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given the data, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
  $L(\theta) = f_\theta(x) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
• Predictive distribution
IID – Independent Identically Distributed
• IID means
• The problem is considerably simplified as
• Usually the log likelihood is used
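As a worked example of maximum likelihood under the i.i.d. assumption, where the likelihood factorizes as p(x_1, …, x_n | θ) = Π p(x_i | θ) and the log likelihood becomes a sum, the sketch below fits a one-dimensional Gaussian. The Gaussian model, the data values and the function names are assumptions chosen for illustration.

```python
import math


def gaussian_log_likelihood(data, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for i.i.d. data."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma2))


def gaussian_mle(data):
    """Closed-form maximizer of the Gaussian log likelihood."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n  # ML estimate divides by n
    return mu, sigma2


data = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat, sigma2_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, sigma2_hat)
```

Any other parameter setting, for example a shifted mean or a larger variance, gives a log likelihood no higher than `best`.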
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter Estimation for Text Analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  • To describe complex models, e.g. the Gaussian mixture model
  • To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
  $\theta_{ML} = \arg\max_\theta\, p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
  • E step: fill in values of latent variables according to the posterior given the data
  • M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• where H[q] is the entropy of q(X)
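The bound referred to above can be written out as follows. This is a sketch of the standard derivation, using D for the observed data, X for the latent variables and F(q, θ) as the name of the lower bound.

```latex
L(\theta) = \log p(D \mid \theta)
          = \log \int p(X, D \mid \theta)\, dX
          = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX

% Jensen's inequality: the log of an average is at least the average of the log
L(\theta) \;\ge\; \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX
          = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
          \;=:\; F(q, \theta)
```

The inequality holds for any distribution q(X), which is what allows EM to choose q freely in the E step.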
The E and M Steps of EM
• The lower bound on the log likelihood is given by F(q, θ)
• EM alternates between
  • E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
  • M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ,
  $F(q, \theta) = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $\mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed
  $\theta \leftarrow \arg\max_\theta F(q, \theta) = \arg\max_\theta \int q(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
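The monotonicity property can be observed on a small example. The sketch below runs EM on a two-component, one-dimensional Gaussian mixture and records the log likelihood at every iteration. The mixture model, the data, the crude initialization and the variance floor are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm(data, iters=30):
    """EM for a mixture of two 1-D Gaussians; returns parameters and log-likelihood history."""
    # Crude initialization (an arbitrary choice): one component at each extreme.
    pi, mu, var = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    history = []
    for _ in range(iters):
        # E step: responsibilities are the posterior over the hidden component.
        resp, loglik = [], 0.0
        for x in data:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)
        # M step: weighted maximum likelihood estimates.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            weighted_sq = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data))
            var[k] = max(weighted_sq / nk, 1e-6)  # floor guards against collapse
    return pi, mu, var, history


data = [0.1, -0.3, 0.4, 0.0, -0.2, 4.8, 5.2, 5.1, 4.7, 5.3]
pi, mu, var, history = em_gmm(data)
```

On this well-separated data the recorded log likelihood rises quickly and then flattens, and the two means settle near the centers of the two groups.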
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model looks complex even with only a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world, but the joint probability is hard to compute
  – Graphical models can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model we can decouple the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
Original representation (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2
After topic projection (Dim = 200):
        f1     f2     …  fM
Img1    0.2    0.1    …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a histogram of low-dimensional features + a few words (10 words)]
Orthogonal Decomposition
p = Xw + ε
• X = (x1, x2, …, xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a histogram of low-dimensional features + a few words (10 words)]
p^T q = w_p^T w_q + ε_p^T ε_q
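A minimal sketch of why keeping the residual preserves similarity, assuming the base vectors are orthonormal (the slide title suggests this, but the transcript does not state it). With w = X^T p and ε = p - Xw, the residual is orthogonal to every base vector, so the inner product of two documents splits into a low-dimensional part and a residual part. The vectors and the helper names `dot` and `decompose` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual.

    Assumption: the vectors in `basis` are orthonormal.
    """
    w = [dot(x, p) for x in basis]
    projection = [sum(wi * x[j] for wi, x in zip(w, basis)) for j in range(len(p))]
    residual = [pj - sj for pj, sj in zip(p, projection)]
    return w, residual


# Two orthonormal base vectors in a 4-dimensional "vocabulary" space.
basis = [[0.6, 0.8, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
p = [3.0, 1.0, 2.0, 5.0]
q = [1.0, 2.0, 0.0, 4.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

full = dot(p, q)                              # similarity in the original space
split = dot(w_p, w_q) + dot(eps_p, eps_q)     # low-dimensional part + residual part
print(full, split)
```

In this reading, the low-dimensional part can be served by an approximate index and the residual part by the few remaining words.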
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k) p(z=k|d) + p(x=1|d) p'(w|d) + p(x=2|d) p''(w)
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (low-dimensional features + a few words) is looked up in the LSH index over DS1, DS2, …, which returns candidates such as Doc 300, Doc 401, …; the candidates are re-ranked using the document metadata (Doc Meta) of Doc 1 … Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image]
Search Example
[Figure: query image]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Find the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrix
  - Graphical model: a good practice to learn probability
• Graphical model is a good tool to analyze problems
  - It is more adaptable for various applications than matrix decomposition
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable because of its random jump step
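A minimal sketch of the random jump step mentioned above, written as a query-independent power iteration. The damping factor 0.85, the handling of nodes without out-links, and the toy graph are illustrative assumptions rather than values from the slides.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration with a random jump.

    links[i] is the list of nodes that node i points to.
    With probability `damping` the surfer follows a link; otherwise
    it jumps to a node chosen uniformly at random.
    """
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = [(1.0 - damping) / n] * n      # mass from the random jump
        for i, outs in enumerate(links):
            if outs:
                share = damping * rank[i] / len(outs)
                for j in outs:
                    new_rank[j] += share
            else:
                # Node without out-links: spread its mass evenly (assumption).
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
        rank = new_rank
    return rank


# Hypothetical 4-node graph: 0 -> 1, 2; 1 -> 2; 2 -> 0; 3 -> 2
graph = [[1, 2], [2], [0], [2]]
ranks = pagerank(graph)
print(ranks)
```

Node 3 has no incoming links, so its score comes entirely from the random jump; that floor is one reason the scores react less sharply to small changes in the graph.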
NMF - NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix; each column is an n-dimensional data sample
• W: k bases; each column w_i represents one basis
• H: coordinates of V projected onto W
v_j ≈ W_{n×k} h_j
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance ||V - WH||^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2
• Multiplicative update
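A minimal pure-Python sketch of the multiplicative update for the Euclidean cost, following the rules in Lee and Seung (NIPS 2001): H ← H ∘ (WᵀV) / (WᵀWH) and W ← W ∘ (VHᵀ) / (WHHᵀ), applied element-wise. The small constant `eps` in the denominators, the random initialisation, and the toy matrix are assumptions made for illustration.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(V, W, H):
    """Euclidean cost ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, k, iters=200, seed=0, eps=1e-12):
    """Factor a non-negative n x m matrix V into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors


# Hypothetical non-negative data matrix (4 features x 3 samples).
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0], [3.0, 5.0, 7.0]]
W, H, errors = nmf(V, k=2)
print(errors[0], errors[-1])
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection, and the cost does not increase from one iteration to the next.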
Multiplicative Update Algorithm
• Cost function: divergence
  - Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions
• Multiplicative update
• pLSA is NMF with KL divergence
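For reference, the divergence cost and its multiplicative updates as given in Lee and Seung (NIPS 2001), written here with A = V and B = WH. The index names are a notational choice for this sketch.

```latex
D(V \,\|\, WH) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)

H_{aj} \leftarrow H_{aj} \, \frac{\sum_{i} W_{ia} \, V_{ij} / (WH)_{ij}}{\sum_{i} W_{ia}}
\qquad
W_{ia} \leftarrow W_{ia} \, \frac{\sum_{j} H_{aj} \, V_{ij} / (WH)_{ij}}{\sum_{j} H_{aj}}
```

When the entries of V and of WH each sum to one, the last two terms of the cost cancel and D becomes the Kullback-Leibler divergence.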
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture of Gaussians parameter estimation
• pLSA
  - Motivation
  - Derivation & geometric properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = f_θ(x) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ): θ_ML = argmax_θ p(x1, x2, …, xn | θ)
• Predictive distribution: p(x | x1, …, xn) ≈ p(x | θ_ML)
IID - Independent Identically Distributed
• IID means: p(x1, x2, …, xn | θ) = \prod_{i=1}^{n} p(x_i | θ)
• The problem is considerably simplified as: L(θ) = \prod_{i=1}^{n} p(x_i | θ)
• Usually the log likelihood is used: log L(θ) = \sum_{i=1}^{n} log p(x_i | θ)
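A small worked example of maximum likelihood under the IID assumption, using a one-dimensional Gaussian as the model. The choice of model and the sample values are assumptions made for illustration; for a Gaussian the maximizing parameters are the sample mean and the (biased) sample variance.

```python
import math


def gaussian_log_likelihood(data, mu, var):
    """Sum of log p(x_i | mu, var) over IID samples."""
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * var)
            - sum((x - mu) ** 2 for x in data) / (2.0 * var))


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of mean and variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var


samples = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat, var_hat = gaussian_mle(samples)
best = gaussian_log_likelihood(samples, mu_hat, var_hat)

# Any other parameter setting gives a log likelihood that is no larger.
shifted = gaussian_log_likelihood(samples, mu_hat + 0.1, var_hat)
wider = gaussian_log_likelihood(samples, mu_hat, var_hat * 1.5)
print(mu_hat, var_hat, best, shifted, wider)
```

Working with the sum of logs rather than the product of densities avoids numerical underflow when n is large.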
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe complex models: Gaussian mixture model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that the likelihood of the observed data is maximized
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
  - E step: fill in values of latent variables according to the posterior given data
  - M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
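A minimal sketch of the two steps for a one-dimensional mixture of Gaussians, the complex-model example named on the "Why We Need EM" slide. The data, the initial means, the equal starting weights, and the small variance floor are illustrative assumptions. The log likelihood is recorded after every iteration so its behaviour can be inspected.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_gmm(data, means, iters=30):
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iters):
        # E step: posterior probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: re-estimate parameters as if the assignments were observed.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, 1e-6)   # floor to avoid a zero variance (assumption)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Hypothetical data drawn around -2 and around 3.
data = [-2.1, -1.9, -2.3, -1.7, -2.0, 3.0, 3.2, 2.8, 3.1, 2.9]
weights, means, variances, history = em_gmm(data, means=[-1.0, 1.0])
print(means, weights)
```

On this toy data the recorded log likelihood never goes down between iterations, and the two means settle near the centres of the two groups.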
Jensen's Inequality
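The slide body is not preserved in the transcript. For reference, the standard statement for the concave logarithm, which is the form used to bound the log likelihood:

```latex
\log \Big( \sum_{i} \lambda_i x_i \Big) \;\ge\; \sum_{i} \lambda_i \log x_i,
\qquad \lambda_i \ge 0,\ \ \sum_{i} \lambda_i = 1,\ \ x_i > 0

\text{equivalently}\quad \log \mathbb{E}[x] \;\ge\; \mathbb{E}[\log x]
```

Equality holds when all x_i with non-zero weight are equal.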
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• In the bound, H[q] is the entropy of q(X)
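The equations on this slide are not preserved in the transcript. A sketch of the standard derivation in the notation above, with L(θ) for the log likelihood and F(q, θ) for the bound:

```latex
L(\theta) = \log p(D \mid \theta)
          = \log \int q(X) \, \frac{p(X, D \mid \theta)}{q(X)} \, dX
          \;\ge\; \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)} \, dX

F(q, \theta) \;=\; \int q(X) \log p(X, D \mid \theta) \, dX \;+\; H[q],
\qquad H[q] = - \int q(X) \log q(X) \, dX
```

The inequality is Jensen's inequality applied to the expectation under q(X); for discrete latent variables the integrals become sums.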
The E and M Steps of EM
• The lower bound on the log likelihood is given by F(q, θ)
• EM alternates between
  - E step: optimize F(q, θ) w.r.t. the distribution over hidden variables, holding the parameters fixed
  - M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ, F(q, θ) = L(θ) - KL(q(X) || p(X | D, θ))
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X | D, θ)) = 0
• So the E step simply sets q(X) = p(X | D, θ)
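A short derivation of why the gap between the log likelihood and the bound is a KL divergence, sketched in the same notation (the slide's own equations are not preserved):

```latex
F(q, \theta)
  = \int q(X) \log \frac{p(X \mid D, \theta) \, p(D \mid \theta)}{q(X)} \, dX
  = \log p(D \mid \theta) - \int q(X) \log \frac{q(X)}{p(X \mid D, \theta)} \, dX
  = L(\theta) - \mathrm{KL}\big( q(X) \,\|\, p(X \mid D, \theta) \big)
```

Since the KL divergence is non-negative and is zero only when the two distributions coincide, the bound is tight exactly when q(X) is the posterior over the latent variables.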
The M Step
• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution q fixed
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - Graphical models can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general, p(x_1, …, x_n) = \prod_{i=1}^{n} p(x_i | x_{pa(i)}), where pa(i) are the parents of node i
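A small sketch of the factorization for the five-node example above, with binary variables. The parent sets come from the slide; the conditional probability rule in `p_one` is a made-up placeholder, used only so that the joint can be evaluated and checked to sum to one.

```python
from itertools import product

# Parent sets read off p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}


def p_one(node, parent_values):
    """Hypothetical conditional probability that `node` equals 1.

    The rule ignores the node name: it is more likely the more parents are 1.
    """
    return (1 + sum(parent_values)) / (2 + len(parent_values))


def joint(assignment):
    """Joint probability as a product of one local conditional per node."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_one(node, [assignment[p] for p in pa])
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


all_ones = joint({"A": 1, "B": 1, "C": 1, "D": 1, "E": 1})
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
print(all_ones, total)
```

Each factor needs at most two parents here, so five small tables replace one table over all 32 joint configurations.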
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    - Java could mean "coffee" and also the "PL Java"
    - Cricket is a "game" and also an "insect"
  - Synonymy
    - "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA plate diagram d → z → w with plates Nd and M, and its unrolled form in which document d generates topics z1 … zN and words w1 … wN]
• z1 … zN are variables, zi ∈ [1, K], where K is the number of latent topics
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, each with its own topics z1 … zN and words w1 … wN]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as p(w|d) = \sum_{z} p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d), where the columns of Z_{V×k} are the topic distributions p(w|z)
• With many documents, we hope to find latent topics as a common basis
pLSA - Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
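The slide equations are not preserved, so the following is a minimal sketch of the standard pLSA EM updates under the parameterization p(w|d) = Σ_z p(w|z) p(z|d) with p(d) uniform. The E-step computes the posterior p(z|d,w) ∝ p(w|z) p(z|d); the M-step re-estimates p(w|z) and p(z|d) from the expected counts n(d,w) p(z|d,w). The count matrix, the random initialisation, and all function names are illustrative assumptions.

```python
import math
import random


def normalize(values):
    total = sum(values)
    if total == 0.0:                      # guard for an all-zero vector
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum over (d, w) of n(d, w) * log p(w|d); the constant from p(d) is dropped."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p_w_d = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                total += n * math.log(p_w_d)
    return total


def plsa(counts, k, iters=50, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood history."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(k)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(n_docs)]
    history = [log_likelihood(counts, p_w_z, p_z_d)]
    for _ in range(iters):
        new_w_z = [[0.0] * n_words for _ in range(k)]
        new_z_d = [[0.0] * k for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior over topics for this (document, word) pair.
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(k)])
                # Accumulate expected counts for the M-step.
                for z in range(k):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: normalize expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_likelihood(counts, p_w_z, p_z_d))
    return p_w_z, p_z_d, history


# Hypothetical term counts: 4 documents x 4 words, two rough themes.
counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 1, 3, 5]]
p_w_z, p_z_d, history = plsa(counts, k=2)
print(history[0], history[-1])
```

The rows of `p_w_z` play the role of the shared basis, and each row of `p_z_d` holds one document's coordinates on that basis.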
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting
[Figure: unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
• Definition
p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0
• The density is zero outside this open (K - 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0 = \sum_{i=1}^{k} \alpha_i: α_0 = 0.1, α_0 = 1, α_0 = 10
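A short sketch of drawing samples from a Dirichlet distribution with the standard library, using the usual construction: draw independent Gamma(α_i, 1) variables and normalize them. The parameter (6, 2, 2) is one of the examples above; the seed and the number of samples are arbitrary choices.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]

theta = sample_dirichlet(alpha, rng)            # one set of mixture proportions

# The average of many samples approaches alpha_i / alpha_0.
n_samples = 5000
sums = [0.0] * len(alpha)
for _ in range(n_samples):
    for i, value in enumerate(sample_dirichlet(alpha, rng)):
        sums[i] += value
mean = [s / n_samples for s in sums]
print(theta, mean)
```

Every sample is a K-tuple of non-negative numbers summing to one, which is the requirement stated for the distribution over topic mixture proportions.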
The LDA Model
[Figure: LDA graphical model; in each document, topics z1 … z4 generate words w1 … w4, with a shared parameter b]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
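The generative process can be sketched as a short sampler. The topic-word table `beta`, the prior `alpha`, the document length, and the helper names are hypothetical values chosen for illustration; only the three sampling steps follow the description above.

```python
import random


def sample_dirichlet(alpha, rng):
    """Normalize independent Gamma(alpha_i, 1) draws to get a point on the simplex."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1        # guard against rounding at the upper end


def generate_document(alpha, beta, n_words, rng):
    theta = sample_dirichlet(alpha, rng)             # topic proportions for this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)           # choose a topic
        w = sample_categorical(beta[z], rng)         # choose a word given the topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(1)
alpha = [0.5, 0.5]                                   # hypothetical Dirichlet prior, 2 topics
beta = [[0.7, 0.2, 0.05, 0.05],                      # hypothetical p(w | z=0), 4-word vocabulary
        [0.05, 0.05, 0.3, 0.6]]                      # hypothetical p(w | z=1)
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
print(theta, topics, words)
```

Unlike pLSA, the per-document proportions are drawn from a shared prior rather than being free parameters for every training document.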
Joint Probability
• Given parameters α and β, the joint distribution of the topic mixture θ, the topics z and the words w
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
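The three quantities are not preserved in the transcript. For reference, their standard forms as given in Blei, Ng and Jordan (JMLR 2003), for a document of N words and a corpus D of M documents:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta

p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```

The marginal is obtained from the joint by integrating over θ and summing over z; the corpus likelihood is the product of the document marginals.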
Inference
• The likelihood can be computed by summing over each document
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
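The relation between the bound and the KL divergence, written in the notation of Blei, Ng and Jordan (JMLR 2003), where q is the variational distribution with parameters γ and φ:

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

The left-hand side does not depend on γ or φ, so raising the bound and lowering the KL divergence are the same operation.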
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X | D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X | D, θ). Instead, it approximates p(X | D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound F(q, θ) with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  - β
  - α: can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian inference
– Expectation-Maximization
– Mixture of Gaussians parameter estimation
• pLSA
– Motivation
– Derivation & geometric properties
– Applications
• LDA
– Motivation: why add a hyperparameter
– Dirichlet distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = f_θ(x) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
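Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters under which the observed data are most likely to have been generated by the model
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
$\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \dots, x_n \mid \theta)$
• Predictive distribution
$p(x \mid x_1, \dots, x_n) \approx p(x \mid \theta_{ML})$
IID – Independent Identically Distributed
• IID means
$p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified as
$\theta_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used
$\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$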
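As a concrete case of maximum likelihood under the i.i.d. assumption, the sketch below fits a one-dimensional Gaussian, for which the ML estimates are the sample mean and the (biased) sample standard deviation. The choice of a Gaussian model, the function name `gaussian_log_likelihood`, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma^2) over i.i.d. samples x."""
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
            - np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data

# Closed-form ML estimates for a Gaussian.
mu_ml = x.mean()
sigma_ml = x.std()          # ddof=0, i.e. the ML (biased) estimate

ll_ml = gaussian_log_likelihood(x, mu_ml, sigma_ml)
ll_other = gaussian_log_likelihood(x, mu_ml + 0.3, sigma_ml * 1.2)
print(f"log likelihood at ML estimate:     {ll_ml:.2f}")
print(f"log likelihood at perturbed value: {ll_other:.2f}")
```

Working with the sum of logs rather than the product of densities also avoids numerical underflow when n is large.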
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
– To describe complex models, e.g. the Gaussian mixture model
– To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
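More General
• Data set: $D = \{y_1, \dots, y_N\}$
• Likelihood: $p(D \mid \theta) = \int p(X, D \mid \theta)\, dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
$\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize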
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps
– E step: fill in values of the latent variables according to their posterior given the data
– M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
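Jensen's Inequality
• For a concave function such as log, with $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$
$\log \sum_i \alpha_i x_i \ge \sum_i \alpha_i \log x_i$
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$
• where H[q] is the entropy of q(X)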
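The E and M Steps of EM
• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
– E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
$q^{(k)}(X) = \arg\max_{q(X)} F(q(X), \theta^{(k-1)})$
– M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
$\theta^{(k)} = \arg\max_\theta F(q^{(k)}(X), \theta)$
The E Step
• E step: for fixed θ
$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by the log likelihood $L(\theta) = \log p(D \mid \theta)$ and achieves that bound when $KL(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets
$q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$
The M Step
• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
$\theta^{(k)} = \arg\max_\theta F(q^{(k)}(X), \theta) = \arg\max_\theta \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically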
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the log likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
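The monotonicity property can be checked numerically on the Gaussian mixture model mentioned earlier as a motivating latent variable model. The following is a minimal sketch under stated assumptions: a one-dimensional mixture, a hypothetical function name `em_gmm_1d`, initial means picked at random from the data, synthetic two-cluster data, and no safeguard against a component collapsing onto a single point.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """EM for a 1-D mixture of k Gaussians. Returns the fitted parameters
    and the log likelihood measured at the start of every iteration."""
    rng = np.random.default_rng(seed)
    weights = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, x.var())
    log_liks = []
    for _ in range(n_iter):
        # Weighted component densities, shape (n_samples, k).
        dens = (weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2.0 * np.pi * variances))
        log_liks.append(np.log(dens.sum(axis=1)).sum())
        # E step: q(X) is set to the posterior over component assignments.
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form maximizers given the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, np.array(log_liks)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
weights, means, variances, log_liks = em_gmm_1d(x)
print("log likelihood never decreases:", bool(np.all(np.diff(log_liks) >= -1e-8)))
```

The tolerance in the final check only absorbs floating-point rounding; the argument on the slide guarantees the exact sequence is non-decreasing.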
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model looks complex even with only a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
$p(X_1, \dots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
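The five-node factorization above can be made concrete with binary variables. In this sketch the conditional probability tables are random numbers chosen purely for illustration, and the helper names `random_cpt`, `bern` and `joint` are hypothetical; the point is only that the product of local conditionals is a valid joint distribution that needs far fewer parameters than a full table.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """P(child = 1 | parents), one entry per configuration of binary parents."""
    return rng.random((2,) * n_parents)

# Root nodes have no parents; the other tables are indexed by parent values.
p_a1, p_b1 = 0.3, 0.6
cpt_c = random_cpt(2)   # indexed [a, b]
cpt_d = random_cpt(2)   # indexed [b, c]
cpt_e = random_cpt(2)   # indexed [c, d]

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_a1, a) * bern(p_b1, b) * bern(cpt_c[a, b], c)
            * bern(cpt_d[b, c], d) * bern(cpt_e[c, d], e))

total = sum(joint(*values) for values in itertools.product([0, 1], repeat=5))
n_params_factored = 1 + 1 + 4 + 4 + 4    # one number per parent configuration
n_params_full = 2 ** 5 - 1               # unrestricted joint table
print(f"sum over all 32 states: {total:.12f}")
print(f"parameters: factored = {n_params_factored}, full table = {n_params_full}")
```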
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies due to personal style
• Latent semantic indexing
– Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • Java could mean "coffee" and also the programming language Java
  • Cricket is a "game" and also an "insect"
– Synonymy
  • "computer", "pc" and "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA in plate notation (d → z → w, with plates N_d and M) and unrolled for one document (d → z1 … zN, each zi → wi)]
• z1, …, zN are variables with z_i ∈ [1, K], where K is the number of latent topics
pLSA
[Figure: the unrolled model for documents d1, d2, …, dM; document dj generates topics z1 … zNj and words w1 … wNj]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
$p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables)
$p(d, w) = p(d) \sum_z p(w \mid z)\, p(z \mid d)$
• p(d) is assumed to be uniform
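Document Decomposition
• Each document can be decomposed as
$p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$
• This is similar to the matrix decomposition if we consider each discrete distribution as a vector
$p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
– Column j of $Z_{V \times k}$ is the distribution p(w|z=j) over a vocabulary of V words
• With many documents, we hope to find latent topics as a common basis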
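pLSA – Objective Function
• pLSA tries to maximize the log likelihood
$L = \sum_d \sum_w n(d, w) \log p(d, w) = \sum_d \sum_w n(d, w) \log \left[ p(d) \sum_z p(w \mid z)\, p(z \mid d) \right]$
– n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM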
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
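EM Steps
• The E-Step
$p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step, where n(d, w) is the count of word w in document d
$p(w \mid z) \propto \sum_d n(d, w)\, p(z \mid d, w), \quad p(z \mid d) \propto \sum_w n(d, w)\, p(z \mid d, w)$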
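A compact way to see the two steps at work is to run them on a small document-word count matrix. The sketch below assumes the standard pLSA updates (posterior p(z|d,w) in the E step, count-weighted re-normalization in the M step); the function name `plsa`, the random initialization, the tiny smoothing constant and the toy counts are illustrative assumptions, and the uniform p(d) is dropped from the log likelihood because it is a constant.

```python
import numpy as np

def plsa(counts, k, n_iter=100, seed=0):
    """counts: (M docs x V words) matrix n(d, w).
    Returns p(w|z) as (k x V), p(z|d) as (M x k), and the log likelihood trace."""
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    p_w_z = rng.random((k, V))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((M, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    log_liks = []
    for _ in range(n_iter):
        # p(z|d) p(w|z) for every (d, z, w); shape (M, k, V).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)                      # p(w|d), shape (M, V)
        log_liks.append(np.sum(counts * np.log(p_w_d + 1e-12)))
        # E step: posterior p(z|d,w).
        post = joint / (p_w_d[:, None, :] + 1e-12)
        # M step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post           # n(d,w) p(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, log_liks

# Toy corpus: two documents favor words 0-2, two favor words 3-5.
counts = np.array([[5, 4, 3, 0, 0, 0],
                   [4, 5, 2, 0, 1, 0],
                   [0, 0, 1, 4, 5, 3],
                   [0, 1, 0, 3, 4, 5]], dtype=float)
p_w_z, p_z_d, log_liks = plsa(counts, k=2)
p_w_d = p_z_d @ p_w_z      # document decomposition as a matrix product
print(np.round(p_z_d, 2))
```

The last line of the example is the document decomposition written as a matrix product: each row of `p_w_d` is a mixture of the shared topic rows of `p_w_z`.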
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
• Comparison to SVD
– U matrix related to P(z|d) (doc to aspect)
– V matrix related to P(w|z) (aspect to term)
– Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done: LSA minimizes the squared (Frobenius-norm) reconstruction error, whereas pLSA maximizes the likelihood of the observed counts
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
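For comparison, the LSA side of this contrast, keeping only K singular values, can be sketched with a truncated SVD. The helper name `truncated_svd` and the toy term-document matrix are assumptions for illustration; the check at the end uses the standard fact that the Frobenius error of the rank-K truncation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Keep the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :], s

# Toy matrix with documents as rows and terms as columns.
A = np.array([[5, 4, 3, 0, 0, 0],
              [4, 5, 2, 0, 1, 0],
              [0, 0, 1, 4, 5, 3],
              [0, 1, 0, 3, 4, 5]], dtype=float)

U_k, s_k, Vt_k, s_all = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k            # rank-2 approximation of A
frob_error = np.linalg.norm(A - A_k)       # Frobenius norm by default
print(f"rank of approximation: {np.linalg.matrix_rank(A_k)}")
print(f"reconstruction error:  {frob_error:.4f}")
```

Unlike the pLSA factors, the entries of `U_k` and `Vt_k` can be negative, so they cannot be read directly as probabilities.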
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: the unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions of each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
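Dirichlet Distribution
• Definition
$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\; x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex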
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
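The effect of the concentration α_0 can be reproduced by sampling. This is a small sketch using NumPy's Dirichlet sampler; the helper name `sample_symmetric_dirichlet`, the sample size, and the "mean largest component" summary are choices made for this illustration rather than anything taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_symmetric_dirichlet(alpha0, k=3, size=2000):
    """Draw `size` points from Dirichlet(alpha0/k, ..., alpha0/k)."""
    alpha = np.full(k, alpha0 / k)
    return rng.dirichlet(alpha, size=size)

samples = {a0: sample_symmetric_dirichlet(a0) for a0 in (0.1, 1.0, 10.0)}

# Mean size of the largest component: close to 1 means the samples sit near
# a corner of the simplex, close to 1/3 means they sit near its center.
peakedness = {a0: s.max(axis=1).mean() for a0, s in samples.items()}
for a0, value in peakedness.items():
    print(f"alpha0 = {a0:>4}: mean largest component = {value:.3f}")
```

Small α_0 therefore favors documents dominated by a single topic, while large α_0 favors documents that mix all topics evenly.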
The LDA Model
[Figure: the model unrolled for three documents, each with topics z1 … z4 and words w1 … w4; β is shared by all documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
– Choose a topic z_n ~ Multinomial(θ)
– Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
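The generative process can be written almost line for line as sampling code. In this sketch the function name `generate_document`, the sizes K and V, the value of α, and the randomly drawn topic-word matrix β are all assumptions made for illustration; in a real model β would be estimated from a corpus.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.
    alpha: (K,) Dirichlet parameter; beta: (K x V) rows are p(w | z)."""
    theta = rng.dirichlet(alpha)                        # topic proportions
    z = rng.choice(len(alpha), size=n_words, p=theta)   # one topic per word
    w = np.array([rng.choice(beta.shape[1], p=beta[topic]) for topic in z])
    return theta, z, w

rng = np.random.default_rng(0)
K, V = 3, 8
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)   # illustrative topic-word distributions
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
print("theta:", np.round(theta, 2))
print("topics:", z)
print("words: ", w)
```

Unlike pLSA, the per-document proportions θ are themselves drawn from a distribution, so a new document can be generated without adding parameters.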
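Joint Probability
• Given parameters α and β
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(\theta \mid \alpha)$ is the Dirichlet density and $p(z_n \mid \theta)$ is simply $\theta_i$ for the topic i with $z_n = i$
Likelihood
• Joint probability
$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents
$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$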
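Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
$\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$
• Jensen's inequality, as in EM, gives a lower bound for each document
$\log p(w \mid \alpha, \beta) \ge E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z)]$
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach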
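Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$
$(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} KL(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta))$
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
$\log p(w \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta))$
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence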
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound F(q, θ) on the log likelihood L(θ) with respect to q
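Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence
$\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i)), \quad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$
– Ψ is the digamma function
• M-step: parameter estimation
– β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$, where $w_{dn}^{j} = 1$ if the n-th word of document d is word j and 0 otherwise
– α: can be implemented using the Newton-Raphson method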
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA model unrolled for three documents, each with topics z1 … z4 and words w1 … w4, sharing β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate notation of the model, with plates M and N_d and nodes π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11×11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)
– Term space (Dim = 1 million): Img1 = (Term1, Term2, Term3, Term4, …, TermN) = (1, 2, 0, 0, …, 2)
– Topic space (Dim = 200): Img1 = (f1, f2, …, fM) = (0.2, 0.1, …, 0.03)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a histogram over topics + a few words (10 words)]
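Orthogonal Decomposition
• p = Xw + ε, written out over a vocabulary of W terms and k base vectors
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
– X = (X1, X2, X3, …, Xk): base vectors
– w: low dimensional representation
– ε: residual
[Figure: an image = a histogram over topics + a few words (10 words)]
• With orthonormal base vectors and residuals orthogonal to them, the inner product of two images p and q splits into a low dimensional part and a residual part
$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$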
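The decomposition into a low dimensional part and a residual can be checked numerically. This sketch rests on two assumptions that are not stated in the extracted slide text: the base vectors in X are orthonormal (built here by a QR factorization of a random matrix) and w is obtained by projection, so the residual is orthogonal to X. The helper name `decompose` and the random vectors standing in for TF-IDF vectors are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 50, 5

# Orthonormal base vectors: the Q factor of a random (vocab_size x k) matrix.
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))

def decompose(p, X):
    """Split p into a low dimensional vector w and a residual eps
    such that p = X @ w + eps and X.T @ eps = 0."""
    w = X.T @ p
    eps = p - X @ w
    return w, eps

p = rng.random(vocab_size)   # stand-ins for two TF-IDF vectors
q = rng.random(vocab_size)
w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)

full_similarity = p @ q
split_similarity = w_p @ w_q + eps_p @ eps_q
print(f"p.q = {full_similarity:.6f}")
print(f"w_p.w_q + eps_p.eps_q = {split_similarity:.6f}")
```

If only the few largest residual entries are kept per document, as the "a few words (10 words)" illustration suggests, the second term becomes an approximation rather than an exact identity.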
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a histogram over topics + a few words) is looked up in the LSH index over Doc 1, Doc 2, …, Doc N; candidate documents such as Doc 300 and Doc 401 are re-ranked using the document-specific words (DS1, DS2, …) stored as doc meta]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: topics
Structure: who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document similarity
Topic similarity
Structure similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Projects documents to topic space
SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods: HITS, PageRank, …
Baselines
LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
Achieves stable performance in the expert finding task using a language model
PageRank
Benchmark nodal ranking method
HITS
Finds hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: good practice for learning matrices
– Graphical model: good practice for learning probability
• The graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is more adaptable to various applications than matrix decomposition
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivation
bull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithm
bull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (doc to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting
[Figure: the unrolled pLSA model for documents d1, d2, …, dm, each with its own unconstrained topic distribution p(z|d1), p(z|d2), …, p(z|dm).]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
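Dirichlet Distribution
• Definition
$$p(x_1, x_2, \ldots, x_K \mid \alpha_1, \alpha_2, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex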
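Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4) and (6, 2, 6).]
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities for α0 = 0.1, α0 = 1 and α0 = 10.]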
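To get a feel for how α shapes the samples, one can draw from a Dirichlet distribution with the Python standard library. The sketch below uses the well-known construction of normalizing independent Gamma(αi, 1) draws; the helper name `sample_dirichlet` and the symmetric parameter (0.5, 0.5, 0.5) are assumptions for illustration, while the other two parameter vectors are taken from the example slide.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
for alpha in [(6, 2, 2), (3, 7, 5), (0.5, 0.5, 0.5)]:
    x = sample_dirichlet(alpha, rng)
    # Each sample is a point on the 2-simplex: non-negative, summing to one.
    print(alpha, [round(v, 3) for v in x])
```

Averaging many samples should approach αi / α0 for each component, and a small α0 tends to push samples toward the corners of the simplex.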
The LDA Model
[Figure: the LDA graphical model. Each document has its own topics z1 … z4 generating the words w1 … w4; all documents share the parameter β.]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
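The generative process above can be sketched directly as a sampler. This is an illustration under stated assumptions, not the authors' code: `generate_document`, the two-topic, three-word matrix `beta` and the symmetric `alpha` are hypothetical, and θ is drawn by normalizing Gamma draws.

```python
import random


def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def generate_document(alpha, beta, N, rng):
    """Generate one document of N words.

    alpha: K Dirichlet parameters; beta[k][v] = p(w = v | z = k).
    """
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]           # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(N):
        z = sample_categorical(theta, rng)        # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)      # w_n ~ p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # hypothetical topic-word distributions
theta, topics, words = generate_document([0.5, 0.5], beta, 20, rng)
print([round(t, 3) for t in theta], words)
```

Unlike pLSA, the per-document mixture θ is itself a random draw, so a new document needs no new parameters.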
Joint Probability
• Given the parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
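The three quantities listed above have standard forms in Blei, Ng and Jordan (2003), reproduced here as a sketch because the slide's equations were lost. The notation (θ for the topic proportions, z for the topic assignments, w for the N words of a document, M documents in the corpus D) follows that paper and is an assumption about what the slide showed.

```latex
% Joint probability of the topic proportions, topic assignments and words
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out z
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all M documents
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```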
Inference
• The likelihood of the corpus is computed from the per-document likelihoods: the log likelihood is a sum over documents
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
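The relation between the bound and the KL divergence can be stated in one line. The identity below is the one given in Blei, Ng and Jordan (2003); the symbols γ and φ for the variational parameters follow that paper and are assumed to match the slide.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

The left-hand side does not depend on γ or φ, so raising L by the same amount lowers the KL term.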
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  - β has a closed-form update
  - α can be updated using the Newton-Raphson method
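The variational E-step for a single document can be sketched as a short fixed-point iteration. The two updates used here, φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni, are the ones given in Blei, Ng and Jordan (2003); whether the slide showed exactly this form is an assumption. The function names, the hand-written `digamma` approximation, the fixed iteration count (instead of a convergence test) and the toy `beta` are all choices made for this illustration.

```python
import math


def digamma(x):
    """Digamma function for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def e_step(words, alpha, beta, iterations=50):
    """Variational E-step for one document.

    words: list of word indices; alpha: K Dirichlet parameters;
    beta[i][v] = p(w = v | z = i), assumed strictly positive.
    Returns (gamma, phi) with phi[n][i] the responsibility of topic i for word n.
    """
    K, N = len(alpha), len(words)
    phi = [[1.0 / K] * K for _ in range(N)]
    gamma = [a + N / K for a in alpha]
    for _ in range(iterations):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(words):
            row = [beta[i][w] * weights[i] for i in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi


beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # hypothetical topic-word distributions
gamma, phi = e_step([0, 0, 1, 0], [0.5, 0.5], beta)
print([round(g, 3) for g in gamma])
```

Because every row of φ sums to one, the entries of γ always add up to the sum of α plus the document length, which makes a cheap correctness check.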
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN.]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model, with per-document topics z1 … z4 and words w1 … w4 sharing the parameter β.]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model in plate notation with nodes π, z, x, θ and β, and plates labeled M and Nd.]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
  Original space (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  Topic space (Dim = 200):
          f1     f2     …  fM
    Img1  0.2    0.1    …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \epsilon$$
[Figure: an image = a bar chart of its low-dimensional topic features + a few words (10 words).]
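Orthogonal Decomposition
$$p = Xw + \epsilon: \quad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• X = (X1 X2 X3 … Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a bar chart of its low-dimensional topic features + a few words (10 words).]
$$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$$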
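The reason for keeping the residual can be seen in a small numeric example. The sketch below assumes that the base vectors are orthonormal and that w is obtained by projecting p onto them, so the residual is orthogonal to every base vector; under that assumption the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The 4-dimensional vectors and the two base vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on orthonormal base vectors and a residual.

    basis: list of k orthonormal vectors (the columns X1 … Xk of X).
    Returns (w, residual) with p = X w + residual.
    """
    w = [dot(x, p) for x in basis]
    recon = [sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))]
    residual = [pi - ri for pi, ri in zip(p, recon)]
    return w, residual


# Two orthonormal base vectors in a 4-word vocabulary (hypothetical).
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 2.0, 0.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)
split = dot(w_p, w_q) + dot(eps_p, eps_q)
print(exact, dot(w_p, w_q), dot(eps_p, eps_q))
```

Here the low-dimensional part alone overestimates the similarity (4 instead of 3), and the residual term supplies the correction, which is the motivation for preserving a few residual words per document.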
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
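The three-way mixture controlled by the switch variable is easy to evaluate once the component distributions are known. The following sketch uses made-up numbers for a three-word vocabulary and two topics; every distribution, the switch probabilities and the function name `word_probability` are hypothetical.

```python
def word_probability(w, switch, p_w_z, p_z_d, p_w_doc, p_w_bg):
    """p(w | d) as a mixture of topic, document-specific and background parts.

    switch = [p(x=0|d), p(x=1|d), p(x=2|d)]; p_w_z[k][w] = p(w | z = k);
    p_z_d[k] = p(z = k | d); p_w_doc and p_w_bg are distributions over words.
    """
    topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
    return switch[0] * topic_part + switch[1] * p_w_doc[w] + switch[2] * p_w_bg[w]


p_w_z = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # topic-specific distributions
p_z_d = [0.4, 0.6]                           # topic mixture of this document
p_w_doc = [0.0, 1.0, 0.0]                    # document-specific "special words"
p_w_bg = [0.3, 0.4, 0.3]                     # corpus-wide background
switch = [0.6, 0.1, 0.3]                     # p(x = 0, 1, 2 | d)

dist = [word_probability(w, switch, p_w_z, p_z_d, p_w_doc, p_w_bg) for w in range(3)]
print([round(v, 3) for v in dist])
```

Since each component is a proper distribution and the switch probabilities sum to one, the mixture is again a distribution over the vocabulary.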
Search (Online)
[Figure: online search pipeline with the labels: a query (a bar chart of topic features + a few words), LSH Index, candidate documents (Doc 300, Doc 401, …) among Doc 1, Doc 2, …, Doc N, per-document entries DS1, DS2, …, Doc Meta, Re-ranking.]
• Index: 10M images, 46GB
• Search speed < 100ms
Search Example
[Figure: query image and search results.]
Search Example
[Figure: query image and search results.]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                Apple
         All Posts  Good Posts   All Posts  Good Posts
NP       0.021      0.012        0.289      0.239
RR       0.183      0.319        0.269      0.474
DS       0.463      0.643        0.409      0.628
LDA      0.465      0.644        0.410      0.648
SWB      0.463      0.644        0.410      0.641
SMSS     0.524      0.737        0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  - HITS
  - PageRank
  - …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
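For readers unfamiliar with the three column headers, the per-query quantities behind MRR, MAP and P@10 can be sketched as follows. These are the usual textbook definitions, assumed rather than taken from the paper; MRR and MAP are the means of the reciprocal rank and the average precision over all queries. The ranked list and the relevant set are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranked = ["u3", "u1", "u7", "u2"]   # hypothetical ranking of users for one query
relevant = {"u1", "u2"}             # hypothetical ground-truth experts

rr = reciprocal_rank(ranked, relevant)
p2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(rr, p2, ap)
```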
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical model: good practice for learning probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is a good tool for analyzing problems
• It is more adaptable to various applications than matrix decomposition
Motivation
• Non-negativity is natural in many applications
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function: Euclidean distance
• Multiplicative update
Multiplicative Update Algorithm
• Cost function: divergence
  - Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions
• Multiplicative update
• pLSA is NMF with KL divergence
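The multiplicative update for the Euclidean cost can be sketched in plain Python. The rules used here are the ones from Lee and Seung (NIPS 2001), written in their notation V ≈ WH: H ← H ∘ (WᵀV) / (WᵀWH) and W ← W ∘ (VHᵀ) / (WHHᵀ). The slide's own equations were lost, so treating these as the intended updates is an assumption; the helper names, the small constant `eps` that guards against division by zero, and the toy matrix are illustrative only.

```python
import random


def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2 for i in range(len(V)) for j in range(len(V[0])))


def nmf(V, k, iterations=200, seed=0, eps=1e-12):
    """Factor a non-negative n x m matrix V into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = []
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors


# Hypothetical non-negative data with two obvious "parts".
V = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 1.0, 3.0], [0.0, 2.0, 6.0]]
W, H, errors = nmf(V, k=2)
print(round(errors[0], 4), round(errors[-1], 4))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative without any explicit projection step.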
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture Gaussian parameter estimation
• pLSA
  - Motivation
  - Derivation & geometry properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = fθ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ)
• Predictive distribution
IID – Independent Identically Distributed
• IID means
• The problem is considerably simplified as
• Usually the log likelihood is used
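The i.i.d. simplification can be written out explicitly. The equations on the slide were lost in extraction, so the block below is a sketch of the standard statement in the slide's notation (n samples x1 … xn, parameters θ) rather than a reproduction.

```latex
% IID: the samples are independent and share the same distribution
p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)

% The likelihood therefore factorizes, and its logarithm becomes a sum
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta),
\qquad
\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)

% Maximum likelihood estimate
\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)
```

Taking the log does not move the maximizer, and it turns a product of many small numbers into a sum that is easier to differentiate and numerically safer.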
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  • To describe complex models: Gaussian Mixture Model
  • To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps
  • E step: fill in values of the latent variables according to their posterior given the data
  • M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
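The two steps can be made concrete with the Gaussian mixture model mentioned earlier as a typical latent variable model. The sketch below fits a two-component, one-dimensional mixture; the synthetic data, the initialization at the smallest and largest data point, the fixed iteration count and the small variance floor are all assumptions made for this illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm(data, iterations=50):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, variances, trace), where trace holds the
    log likelihood evaluated at the start of each iteration.
    """
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    trace = []
    for _ in range(iterations):
        # E step: posterior responsibility of each component for each point
        resp, loglik = [], 0.0
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(loglik)
        # M step: weighted maximum likelihood, as if the assignments were observed
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
    return weights, means, variances, trace


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
weights, means, variances, trace = em_gmm(data)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

On this well-separated data the estimated means should land near the true values 0 and 5, and the log likelihood trace should never go down.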
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {yn}, latent variables X = {xn}, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• where H[q] is the entropy of q(X)
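The bound referred to above can be derived in a few lines. This is the standard derivation (as in the Ghahramani lecture notes cited in this section), written as a sketch because the slide's equation did not survive; the name F(q, θ) for the bound matches the notation used on the following slides.

```latex
L(\theta) = \log p(D \mid \theta)
          = \log \int p(X, D \mid \theta)\, dX
          = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX

% Jensen's inequality: the log of an average is at least the average of the log
\geq \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX
 = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
 \;=:\; F(q, \theta)

% with the entropy
H[q] = -\int q(X) \log q(X)\, dX
```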
The E and M Steps of EM
• The lower bound on the log likelihood is given by
• EM alternates between
  • E step: optimize with respect to the distribution over hidden variables, holding the parameters fixed
  • M step: maximize with respect to the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when q(X) = p(X|D, θ)
• So the E step simply sets q(X) to the posterior p(X|D, θ)
The M Step
• M step: maximize with respect to the parameters, holding the hidden distribution q fixed
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) with respect to θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of the KL divergence
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODELS
Why Do We Need Graphical Models
• Cons
  - A graphical model can look complex even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
  $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\mathrm{pa}(i)})$
• where pa(i) are the parents of node i
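The five-variable factorization above can be turned into a tiny program. In this sketch all variables are binary and every number in the conditional probability tables is made up; only the graph structure (C depends on A and B, D on B and C, E on C and D) comes from the slide.

```python
from itertools import product

# Hypothetical tables: probability that each variable equals 1.
p_A1 = 0.3
p_B1 = 0.6
p_C1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # keyed by (A, B)
p_D1 = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}  # keyed by (B, C)
p_E1 = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # keyed by (C, D)


def bernoulli(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bernoulli(p_A1, a) * bernoulli(p_B1, b)
            * bernoulli(p_C1[(a, b)], c)
            * bernoulli(p_D1[(b, c)], d)
            * bernoulli(p_E1[(c, d)], e))


total = sum(joint(*values) for values in product([0, 1], repeat=5))
# Marginal p(E = 1): sum the joint over all other variables.
p_e1 = sum(joint(a, b, c, d, 1) for a, b, c, d in product([0, 1], repeat=4))
print(round(total, 6), round(p_e1, 4))
```

The factorized model needs 1 + 1 + 4 + 4 + 4 = 14 numbers, whereas a full joint table over five binary variables needs 2^5 − 1 = 31, which is the saving that the conditional independence structure buys.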
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal styles
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI places documents close together even if they don't share any words, as long as they share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk-topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Find the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: good practice for learning matrix methods
  – Graphical models: good practice for learning probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Multiplicative Update Algorithm
• Cost function: Euclidean distance
• Multiplicative update
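The update formulas on this slide did not survive extraction. As a hedged sketch, the code below uses the multiplicative updates from the Lee and Seung NIPS 2001 paper cited in the reference slide, written in the deck's A ≈ BC notation: C ← C ∘ (BᵀA) / (BᵀBC) and B ← B ∘ (ACᵀ) / (BCCᵀ), with element-wise product and division. The function names, the small constant `tiny` guarding against division by zero, and the toy matrix `A` are assumptions for illustration.

```python
import random

def matmul(P, Q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def sq_error(A, B, C):
    """Euclidean cost ||A - BC||^2."""
    BC = matmul(B, C)
    return sum((a - r) ** 2 for ra, rr in zip(A, BC) for a, r in zip(ra, rr))

def nmf(A, k, iters=200, seed=0, tiny=1e-12):
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    # Strictly positive random start so every entry can be rescaled.
    B = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    C = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [sq_error(A, B, C)]
    for _ in range(iters):
        Bt = transpose(B)
        num, den = matmul(Bt, A), matmul(matmul(Bt, B), C)
        C = [[C[i][j] * num[i][j] / (den[i][j] + tiny) for j in range(m)]
             for i in range(k)]
        Ct = transpose(C)
        num, den = matmul(A, Ct), matmul(B, matmul(C, Ct))
        B = [[B[i][j] * num[i][j] / (den[i][j] + tiny) for j in range(k)]
             for i in range(n)]
        errors.append(sq_error(A, B, C))
    return B, C, errors

# Hypothetical non-negative data: 4 samples-by-3 features, rank 2.
A = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
B, C, errors = nmf(A, k=2)
print(errors[0], errors[-1])
```

Because each update only multiplies by a non-negative ratio, B and C stay non-negative without any explicit projection, and the recorded cost does not increase from one iteration to the next.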
Multiplicative Update Algorithm
• Cost function: divergence
  – Reduces to the Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions
• Multiplicative update
• pLSA is NMF with KL divergence
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: parts-based representation
• PCA: holistic representations
Reference
• D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 2001
• D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture Gaussian parameter estimation
• pLSA
  – Motivation
  – Derivation & geometry properties
  – Applications
• LDA
  – Motivation: why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x_1, x_2, …, x_D)^T denote a data point and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x^(N+1)?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ)
$$\hat\theta_{ML} = \arg\max_\theta L(\theta)$$
• Predictive distribution
$$p(x_{n+1} \mid x_1, \dots, x_n) \approx p(x_{n+1} \mid \hat\theta_{ML})$$
IID – Independent Identically Distributed
• IID means
$$p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
• The problem is considerably simplified as
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
• Usually the log likelihood is used
$$\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
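As a small worked example of maximizing an IID log likelihood, the sketch below fits a one-dimensional Gaussian, for which the maximizer has a closed form (sample mean and biased sample variance). The Gaussian model, the data `xs`, and the function names are assumptions chosen for illustration; the slides do not fix a particular model here.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """log L(theta) = sum_i log p(x_i | mu, sigma2) under the IID assumption."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def gaussian_mle(xs):
    """Closed-form maximizer of the Gaussian log likelihood."""
    mu = sum(xs) / len(xs)
    sigma2 = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, sigma2

xs = [2.1, 1.9, 2.4, 2.0, 1.6]            # hypothetical observations
mu_ml, sigma2_ml = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_ml, sigma2_ml)

# Any other parameter setting gives a log likelihood that is no higher.
worse = gaussian_log_likelihood(xs, mu_ml + 0.3, sigma2_ml * 2.0)
print(mu_ml, sigma2_ml, best, worse)
```

Working with the sum of logs rather than the product of densities avoids numerical underflow when n is large, which is one practical reason the log likelihood is preferred.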
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  – To describe complex models: Gaussian Mixture Model
  – To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
$$\theta_{ML} = \arg\max_\theta L(\theta)$$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  – E step: fill in values of latent variables according to the posterior given the data
  – M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
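The two steps can be sketched on the Gaussian mixture model mentioned earlier. This is a minimal illustration under assumptions: a one-dimensional mixture, hypothetical data `xs`, hand-picked starting means, and a variance floor `min_var` added only to avoid degenerate components (the floor is not part of textbook EM and is not triggered on this data).

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, vars_):
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, vars_)))
               for x in xs)

def em_gmm(xs, mus, iters=50, min_var=1e-3):
    k = len(mus)
    mus = list(mus)
    pis, vars_ = [1.0 / k] * k, [1.0] * k
    history = [log_likelihood(xs, pis, mus, vars_)]
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            joint = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: re-estimate parameters as if the soft assignments were observed.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            pis[c] = nc / len(xs)
            mus[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - mus[c]) ** 2 for r, x in zip(resp, xs)) / nc
            vars_[c] = max(var, min_var)
        history.append(log_likelihood(xs, pis, mus, vars_))
    return pis, mus, vars_, history

# Hypothetical data drawn around -2 and +2.
xs = [-2.3, -1.9, -2.1, -1.7, -2.0, 1.8, 2.2, 2.0, 2.4, 1.6]
pis, mus, vars_, history = em_gmm(xs, mus=[-1.0, 1.0])
print(mus, history[0], history[-1])
```

The `history` list records the log likelihood after every iteration, which makes it easy to see that the value never goes down as the two steps alternate.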
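Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
$$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$$
• where H[q] is the entropy of q(X)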
The E and M Steps of EM
• The lower bound on the log likelihood is given by
$$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$$
• EM alternates between
  – E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
  – M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ
$$F(q, \theta) = L(\theta) - KL\big(q(X) \,\|\, p(X \mid D, \theta)\big)$$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)
The M Step
• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
$$\theta \leftarrow \arg\max_\theta F(q, \theta) = \arg\max_\theta \int q(X) \log p(X, D \mid \theta)\, dX$$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model is so complex, even with a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world. But joint probability is hard to compute
  – A graphical model can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid x_{pa(i)}\big)$$
• where pa(i) are the parents of node i
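The five-variable factorization above can be turned into a tiny calculator. In this sketch all variables are binary and every number in the conditional probability tables is invented for illustration; only the graph structure (A and B are roots, C depends on A and B, D on B and C, E on C and D) comes from the slide.

```python
from itertools import product

# Hypothetical tables: each entry is P(variable = 1 | parent values).
p_a = 0.3
p_b = 0.6
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # keyed by (A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # keyed by (B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # keyed by (C, D)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
            * bern(p_d[(b, c)], d) * bern(p_e[(c, d)], e))

# The factorized joint sums to one over all 2^5 assignments.
total = sum(joint(*v) for v in product((0, 1), repeat=5))
# A marginal is obtained by summing out the other variables.
p_e1 = sum(joint(a, b, c, d, 1) for a, b, c, d in product((0, 1), repeat=4))
print(total, p_e1)
```

The full joint table over five binary variables has 31 free parameters, while the factorized form above needs only 1 + 1 + 4 + 4 + 4 = 14, which illustrates why decoupling into conditionals is easier.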
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same queries vary due to personal styles
• Latent semantic indexing
  – Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • Java could mean "coffee" and also the programming language Java
    • Cricket is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA in plate notation (d → z → w, inner plate Nd, outer plate M) and its unrolled form (d → z1 … zN → w1 … wN, plate M)]
z1 … zN are variables, zi ∈ [1, K]; K is the number of latent topics
pLSA
[Figure: unrolled pLSA for documents d1, d2, …, dM; document dm has topics z1 … zNm and words w1 … wNm]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
$$p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  – Expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
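The E-step and M-step formulas for pLSA were lost from these slides. The sketch below uses the standard updates from Hofmann's paper (cited in the reference slide) as an assumption: the E step computes the posterior p(z|d,w) ∝ p(w|z) p(z|d), and the M step sets p(w|z) ∝ Σ_d n(d,w) p(z|d,w) and p(z|d) ∝ Σ_w n(d,w) p(z|d,w), where n(d,w) is the count of word w in document d. The count matrix, the function names, and the random initialization are hypothetical.

```python
import math
import random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def plsa(counts, K, iters=50, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood trace."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]

    def loglik():
        # sum_d sum_w n(d,w) log sum_z p(w|z) p(z|d); p(d) is treated as constant.
        return sum(counts[d][w]
                   * math.log(sum(p_w_z[z][w] * p_z_d[d][z] for z in range(K)))
                   for d in range(D) for w in range(V) if counts[d][w] > 0)

    trace = [loglik()]
    for _ in range(iters):
        new_wz = [[0.0] * V for _ in range(K)]
        new_zd = [[0.0] * K for _ in range(D)]
        for d in range(D):
            for w in range(V):
                if counts[d][w] == 0:
                    continue
                # E step: posterior over topics for this (document, word) pair.
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(K)])
                # Accumulate expected counts used by the M step.
                for z in range(K):
                    new_wz[z][w] += counts[d][w] * post[z]
                    new_zd[d][z] += counts[d][w] * post[z]
        # M step: renormalize expected counts into distributions.
        p_w_z = [normalize(row) for row in new_wz]
        p_z_d = [normalize(row) for row in new_zd]
        trace.append(loglik())
    return p_w_z, p_z_d, trace

# Hypothetical corpus: 4 documents over a 6-word vocabulary, two rough themes.
counts = [[4, 3, 5, 0, 0, 0],
          [5, 4, 3, 0, 0, 1],
          [0, 0, 1, 4, 5, 3],
          [0, 0, 0, 3, 4, 5]]
p_w_z, p_z_d, trace = plsa(counts, K=2)
print(trace[0], trace[-1])
```

Each row of `p_w_z` is one latent topic (a basis vector over the vocabulary), and each row of `p_z_d` holds a document's coordinates on those topics, mirroring the B and C factors of the matrix view.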
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – U matrix related to P(z|d) (doc to aspect)
  – V matrix related to P(w|z) (aspect to term)
  – Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz, Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
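The requirement that samples are non-negative K-tuples summing to one can be seen directly by drawing from a Dirichlet. The sketch below uses a standard construction, normalizing independent Gamma(α_i, 1) draws, with Python's `random.gammavariate`. The parameter vector (6, 2, 2), the sample size, and the function name are choices made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# Every sample lies on the 2-simplex: non-negative entries that sum to one.
on_simplex = all(abs(sum(s) - 1.0) < 1e-9 and min(s) >= 0.0 for s in samples)

# The empirical mean approaches alpha_i / sum(alpha) = (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
print(on_simplex, mean)
```

Scaling all α_i up concentrates the samples around the mean, while values below one push mass toward the corners of the simplex, which is the behaviour the example plots on the following slides depict.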
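Dirichlet Distribution
• Definition
$$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: Dirichlet densities for α0 = 0.1, α0 = 1, α0 = 10]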
The LDA Model
[Figure: unrolled LDA. Each document has its own chain of topics z1 … z4 generating words w1 … w4; β is shared by all documents]
• For each document
  – Choose θ ~ Dirichlet(α)
  – For each of the N words wn
    • Choose a topic zn ~ Multinomial(θ)
    • Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
The LDA Model
• For each document
  – Choose θ ~ Dirichlet(α)
  – For each of the N words wn
    • Choose a topic zn ~ Multinomial(θ)
    • Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
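The generative process above can be simulated in a few lines. This is a sketch under assumptions: the vocabulary, the topic-word table `beta` (each row is p(w|z) for one topic), the Dirichlet parameter `alpha`, and the helper names are all hypothetical, and the Dirichlet draw uses normalized Gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1   # guard against rounding in the cumulative sum

def generate_document(alpha, beta, n_words, rng):
    theta = sample_dirichlet(alpha, rng)          # topic proportions for this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # choose a topic z_n
        w = sample_categorical(beta[z], rng)      # choose a word from p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words

vocab = ["matrix", "vector", "basis", "prior", "posterior", "sample"]
beta = [[0.4, 0.4, 0.2, 0.0, 0.0, 0.0],           # hypothetical "algebra" topic
        [0.0, 0.0, 0.0, 0.3, 0.3, 0.4]]           # hypothetical "probability" topic
rng = random.Random(1)
theta, topics, words = generate_document([0.5, 0.5], beta, n_words=8, rng=rng)
print(theta, [vocab[w] for w in words])
```

Unlike pLSA, the per-document proportions θ are themselves drawn from a shared prior, so a new document needs no new parameters: only α and β describe the model.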
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
Inference
• The likelihood can be computed by summing over each document
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• Only different in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  – Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  – Repeat until convergence
    • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence
• M-step: parameter estimation
  – β
  – α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting. But the assumption might be too strong
[Figure: unrolled LDA with topics z1 … z4, words w1 … w4, and shared β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate notation with nodes π, z, x, θ, β and plates Nd, M]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
Multiplicative Update Algorithm
bull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps:
  - E step: fill in values of the latent variables according to their posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
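The two steps can be sketched for a one-dimensional Gaussian mixture, the first latent variable model named above. This is a minimal sketch under stated assumptions: the two-component setup, the starting values, the variance floor and all function names are hypothetical choices for the example.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def em_gmm(xs, weights, means, variances, n_iter=30):
    """EM for a 1-D Gaussian mixture; returns parameters and the log-likelihood trace."""
    weights, means, variances = list(weights), list(means), list(variances)
    k = len(means)
    trace = [log_likelihood(xs, weights, means, variances)]
    for _ in range(n_iter):
        # E step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: weighted ML estimates, as if the assignments were observed.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor guards against a collapsed component
        trace.append(log_likelihood(xs, weights, means, variances))
    return weights, means, variances, trace

# Synthetic data (hypothetical): two well-separated clusters.
rng = random.Random(1)
xs = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
weights, means, variances, trace = em_gmm(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The `trace` list records the log likelihood after every iteration, so it can be inspected to see that the value does not go down from one iteration to the next.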
Jensen's Inequality
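For reference, Jensen's inequality in the form that the EM lower bound relies on can be written for the concave logarithm as follows. The symbols q (any distribution over X) and f (any positive function) are generic placeholders introduced for this statement.

```latex
\log \mathbb{E}_{q}[f(X)]
  = \log \int q(X)\, f(X)\, dX
  \;\ge\; \int q(X) \log f(X)\, dX
  = \mathbb{E}_{q}[\log f(X)],
\qquad f(X) > 0 .
```

Equality holds when f(X) is constant wherever q(X) > 0.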
Lower Bounding the Log Likelihood
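• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ, where here L(θ) denotes the log likelihood: $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] \equiv F(q, \theta)$
• where H[q] is the entropy of q(X)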
The E and M Steps of EM
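• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  - E step: optimize F(q, θ) w.r.t. the distribution over hidden variables, holding the parameters fixed: $q^{(t)}(X) = \arg\max_{q} F(q, \theta^{(t-1)})$
  - M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution fixed: $\theta^{(t)} = \arg\max_{\theta} F(q^{(t)}, \theta)$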
The E Step
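• E step: for fixed θ,
$F(q, \theta) = \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \log p(D \mid \theta) - KL\big(q(X) \,\|\, p(X \mid D, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)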
The M Step
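• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed:
$\theta^{(t)} = \arg\max_{\theta} F(q^{(t)}, \theta) = \arg\max_{\theta} \int q^{(t)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically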
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the log likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of KL
• Together: $L(\theta^{(t-1)}) = F(q^{(t)}, \theta^{(t-1)}) \le F(q^{(t)}, \theta^{(t)}) \le L(\theta^{(t)})$
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 Slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model can look complex even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general, $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
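The factorization p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D) can be made concrete with binary variables. In the sketch below every probability value is made up for illustration; only the shape of the factorization comes from the slide.

```python
from itertools import product

# Hypothetical conditional probability tables: each entry is P(child = 1 | parents).
P_A1 = 0.3
P_B1 = 0.6
P_C1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}    # keyed by (a, b)
P_D1 = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # keyed by (b, c)
P_E1 = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}  # keyed by (c, d)

def bernoulli(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) as the product of one factor per node given its parents."""
    return (bernoulli(P_A1, a)
            * bernoulli(P_B1, b)
            * bernoulli(P_C1[(a, b)], c)
            * bernoulli(P_D1[(b, c)], d)
            * bernoulli(P_E1[(c, d)], e))

states = list(product((0, 1), repeat=5))
total = sum(joint(*s) for s in states)                    # sums to 1 over all 32 states
p_e1 = sum(joint(*s) for s in states if s[4] == 1)        # marginal P(E = 1)

factored_params = 1 + 1 + 4 + 4 + 4   # free parameters in the five tables
full_table_params = 2 ** 5 - 1        # free parameters of an unconstrained joint table
```

With five binary variables the factored model needs 14 numbers instead of 31, and the gap grows quickly as more variables are added.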
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal style
• Latent semantic indexing
  - Creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to address
  - Polysemy
    • Java could mean "coffee" and also the programming language Java
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
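[Figure: plate diagram of pLSA with nodes d → z → w, inner plate N_d and outer plate M; beside it, the unrolled form for one document d with z_1 → w_1, z_2 → w_2, z_3 → w_3, …, z_N → w_N]
z_1, …, z_N are variables with z_i ∈ [1, K]; K is the number of latent topics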
pLSA
[Figure: unrolled pLSA for M documents: d_1 with z_1 → w_1, …, z_{N_1} → w_{N_1}; d_2 with z_1 → w_1, …, z_{N_2} → w_{N_2}; …; d_M with z_1 → w_1, …, z_{N_M} → w_{N_M}]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability: p(d, z, w) = p(d) p(z|d) p(w|z)
• Likelihood (only for observed variables): $p(d, w) = \sum_{z} p(d, z, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
p(w|d) = Z_{V×k} p(z|d), where the columns of Z are the topic distributions p(w|z)
• With many documents, we hope to find latent topics as a common basis
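The vector form p(w|d) = Z p(z|d) can be checked numerically on a toy case. The vocabulary size (4 words), the number of topics (2) and all probabilities below are invented for illustration.

```python
# Columns of Z are topic-word distributions p(w|z): 4 words (rows), 2 topics (columns).
Z = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
p_z_given_d = [0.25, 0.75]   # topic mixture of one document

def mix(basis, coefficients):
    """Matrix-vector product: a convex combination of the basis columns."""
    return [sum(row[j] * coefficients[j] for j in range(len(coefficients)))
            for row in basis]

p_w_given_d = mix(Z, p_z_given_d)
```

Because each column of `Z` sums to one and the mixture weights sum to one, the result is again a distribution over words, so the document is a point inside the simplex spanned by the topics.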
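pLSA - Objective Function
• pLSA tries to maximize the log likelihood $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM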
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
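• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$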
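A compact sketch of pLSA training with EM, following the standard updates from Hofmann (1999): the E-step computes the posterior over topics for every (document, word) pair, and the M-step re-estimates p(w|z) and p(z|d) from the expected counts. The function names, the random initialization and the toy count matrix are assumptions for the example; p(d) is treated as uniform and left out of the log likelihood.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, k, n_iter=50, seed=0):
    """Fit pLSA to a document-word count matrix; returns p(w|z), p(z|d) and a log-likelihood trace."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(k)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(n_docs)]
    trace = []
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(k)]
        new_z_d = [[0.0] * k for _ in range(n_docs)]
        ll = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E step: p(z|d,w) is proportional to p(w|z) p(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(k)]
                total = sum(joint)               # p(w|d) under the current parameters
                ll += n_dw * math.log(total)
                for z in range(k):
                    expected = n_dw * joint[z] / total
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        # M step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        trace.append(ll)   # log likelihood of the parameters used in this E step
    return p_w_z, p_z_d, trace

# Toy corpus (hypothetical): 4 documents over 6 words, with two visible word groups.
counts = [
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 1, 0],
    [0, 0, 1, 5, 4, 3],
    [0, 1, 0, 4, 5, 4],
]
p_w_z, p_z_d, trace = plsa(counts, k=2)
```

Each row of `p_z_d` is the low-dimensional representation of one document, which plays the same role as the coefficient matrix in a matrix decomposition.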
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (doc to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA fits a generative model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine an optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X., Scene Classification via pLSA, Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T., Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
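The linear growth can be made concrete by counting table entries, as Blei et al. (2003) do when comparing the two models: pLSA stores K·V topic-word probabilities plus K·M per-document mixtures, while LDA stores K·V plus the K entries of α. The function names and the example sizes are hypothetical, and normalization constraints are ignored for simplicity.

```python
def plsa_param_count(k, vocab_size, n_docs):
    """K*V entries for p(w|z) plus K*M entries for p(z|d)."""
    return k * vocab_size + k * n_docs

def lda_param_count(k, vocab_size):
    """K*V entries for beta plus K entries for alpha; independent of the corpus size."""
    return k * vocab_size + k

small = plsa_param_count(k=100, vocab_size=10000, n_docs=1000)
large = plsa_param_count(k=100, vocab_size=10000, n_docs=100000)
lda = lda_param_count(k=100, vocab_size=10000)
```

Going from 1,000 to 100,000 training documents adds 9.9 million pLSA parameters, while the LDA count stays the same.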
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA for documents d_1, d_2, …, d_m, each with its own chain z_1 → w_1, …, z_{N_i} → w_{N_i}]
p(z|d_1), p(z|d_2), …, p(z|d_m)
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
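• Definition
$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex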
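Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
[Figure: density over the simplex for each α]
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
[Figure: density over the simplex for each α_0]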
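The effect of α_0 can be explored by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma draws (a standard construction) and measures how concentrated each sample is on a single component. The helper names and the "mean largest component" summary are choices made for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_largest_component(alpha0, k=3, n_samples=2000, seed=0):
    """Average of max_i x_i over samples from a symmetric Dirichlet with sum(alpha) = alpha0."""
    rng = random.Random(seed)
    alpha = [alpha0 / k] * k
    samples = [sample_dirichlet(alpha, rng) for _ in range(n_samples)]
    return sum(max(s) for s in samples) / n_samples

peakedness = {a0: mean_largest_component(a0) for a0 in (0.1, 1.0, 10.0)}
example = sample_dirichlet([6.0, 2.0, 2.0], random.Random(0))
```

A small α_0 pushes samples toward the corners of the simplex (one dominant topic), while a large α_0 pulls them toward the center (evenly mixed topics).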
The LDA Model
[Figure: unrolled LDA for three documents, each with topics z_1 … z_4 and words w_1 … w_4; the parameter β is shared by all documents]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words w_n
    • Choose a topic z_n ~ Multinomial(θ)
    • Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
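The generative process can be written out directly as a sampler. In this sketch `beta[z]` holds the word distribution of topic z; the two-topic, four-word setup and all names are hypothetical, and the Dirichlet draw uses the usual normalized-Gamma construction.

```python
import random

def sample_dirichlet(alpha, rng):
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    u = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1   # guard against rounding in the cumulative sum

def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, topics, words) for one document from the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)       # per-document topic proportions
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)     # topic of this word position
        w = sample_categorical(beta[z], rng)   # word drawn from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical model: topic 0 only emits words 0-1, topic 1 only emits words 2-3.
alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=random.Random(0))
```

Unlike pLSA, the per-document mixture `theta` is itself drawn from a distribution, so the model also describes how a previously unseen document is generated.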
Joint Probability
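• Given the parameters α and β,
$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where p(θ|α) is the Dirichlet density, p(z_n|θ) is the probability of topic z_n under θ, and p(w_n|z_n, β) is the probability of word w_n under topic z_n
Likelihood
• Joint probability: p(θ, z, w | α, β), as defined for a single document
• Marginal distribution of a document:
$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents:
$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$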
Inference
• The log likelihood of the corpus is obtained by summing the log likelihood of each document
• Jensen's inequality is applied, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound F(q, θ) with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  - β: updated in closed form
  - α: can be updated using the Newton-Raphson method
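For orientation, the variational EM updates given by Blei et al. (2003) take the following form. The notation follows that paper and is an assumption here: Ψ is the digamma function, φ_{ni} is the variational probability that word n belongs to topic i, γ is the variational Dirichlet parameter of a document, and w_{dn}^{j} = 1 when word n of document d is vocabulary item j.

```latex
\begin{align*}
\text{E-step:}\quad & \phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big),
  \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni} \\
\text{M-step:}\quad & \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
\end{align*}
```

The two E-step equations are iterated per document until γ and φ stop changing; α has no closed-form update, which is why an iterative method such as Newton-Raphson is used for it.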
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful for avoiding over-fitting, but the assumption might be too strong
[Figure: unrolled LDA for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection maps Img1 from a sparse term vector over Term1 … TermN (e.g. 1, 2, 0, 0, …, 2; Dim = 1 million) to a dense feature vector f1 … fM (e.g. 0.2, 0.1, …, 0.03; Dim = 200)]
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
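p = Xw + ε, written out over the W vocabulary dimensions and k basis vectors:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X: base vectors (columns X_1, X_2, X_3, …, X_k)
• w: low dimensional representation
• ε: residual
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
Similarity of two documents p and q: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$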
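When the base vectors are orthonormal and the residual is what remains after projecting onto them, the inner product of two documents splits exactly into a low-dimensional part and a residual part. The sketch below checks this on a tiny example; the basis, the vectors and the function names are made up, and orthonormality of the basis is an assumption of the sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Return (w, eps) with p = sum_j w[j] * basis[j] + eps, for an orthonormal basis."""
    w = [dot(x, p) for x in basis]
    reconstruction = [sum(w[j] * basis[j][i] for j in range(len(basis)))
                      for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, reconstruction)]
    return w, eps

# Two orthonormal base vectors in a 4-word vocabulary (hypothetical).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 2.0, 2.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

full_similarity = dot(p, q)
split_similarity = dot(w_p, w_q) + dot(eps_p, eps_q)
```

Here `w_p` is `[3.0, 1.0]`, the residual of `p` is `[1.0, -1.0, -1.0, 1.0]`, and both similarity values equal 7, so keeping the short vector w plus the few largest residual entries preserves most of the original similarity.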
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search. A query (topic vector + a few words) is looked up in the LSH index built over Doc 1, Doc 2, …, Doc N (each with entries DS1, DS2, …); the candidates (Doc 300, Doc 401, …) are then re-ranked (Doc 401, …, Doc 300). Labels in the figure: LSH Index, Doc Meta, Re-ranking]
Index: 10M images, 46GB
Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot               Apple
         All Posts  Good Posts  All Posts  Good Posts
NP       0.021      0.012       0.289      0.239
RR       0.183      0.319       0.269      0.474
DS       0.463      0.643       0.409      0.628
LDA      0.465      0.644       0.410      0.648
SWB      0.463      0.644       0.410      0.641
SMSS     0.524      0.737       0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
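For readers unfamiliar with the three column headers, the sketch below computes reciprocal rank, average precision and precision at k for a single ranked list, using their usual definitions; MRR and MAP are the means of the first two over all queries. Reading "P10" as precision at rank 10 is an assumption, and the user identifiers are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical ranking of users for one expert-finding query.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
```

In this toy case the first expert appears at rank 2 and the second at rank 4, so all three values come out to 0.5.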
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to practice working with matrices
  - Graphical model: a good way to practice working with probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• It is more adaptable to various applications than matrix decomposition
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative
values with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), which makes KL = 0
• In VBEM, p(X | D, θ) is intractable to compute. Instead, VBEM approximates p(X | D, θ) with a variational distribution q(X) by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  - Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: Variational inference, repeat until convergence
• M-Step: Parameter estimation
  - β: closed-form update
  - α: can be implemented using the Newton-Raphson method
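As a rough illustration of the E-step for a single document, the sketch below iterates the coordinate updates given in Blei et al. (2003): $\phi_{ni} \propto \beta_{i, w_n} \exp(\psi(\gamma_i))$ and $\gamma_i = \alpha_i + \sum_n \phi_{ni}$, where ψ is the digamma function. Everything here is an assumption made for the example: the function names, the layout `beta[i][w]` = p(w | z=i), strictly positive word probabilities, and the hand-written digamma approximation (the standard library has none).

```python
import math

def digamma(x):
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def variational_e_step(word_ids, alpha, beta, n_iter=50):
    """Return variational parameters (gamma, phi) for one document."""
    k, n = len(alpha), len(word_ids)
    gamma = [a + n / k for a in alpha]
    phi = [[1.0 / k] * k for _ in range(n)]
    for _ in range(n_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        for pos, w in enumerate(word_ids):
            row = [beta[i][w] * weights[i] for i in range(k)]
            total = sum(row)
            phi[pos] = [r / total for r in row]  # phi_n sums to one over topics
        gamma = [alpha[i] + sum(phi[pos][i] for pos in range(n)) for i in range(k)]
    return gamma, phi

if __name__ == "__main__":
    beta = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
    gamma, phi = variational_e_step([0, 0, 0, 2], [0.5, 0.5], beta)
    print([round(g, 3) for g in gamma])
```

A fixed iteration count stands in for a proper convergence test to keep the sketch short.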
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset, containing 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model with topics z1 ... z4, words w1 ... w4 and shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ and β, and labels M and Nd]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  ...  TermN
  Img1  1      2      0      0      ...  2
  Topic projection (Dim = 200):
        f1   f2   ...  fM
  Img1  0.2  0.1  ...  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \epsilon$
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
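Orthogonal Decomposition
$$p = Xw + \epsilon: \quad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• X = (X1 X2 X3 ... Xk): base vectors
• w: low dimensional representation
• ε: residual
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
• Inner product of two documents p and q: $p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$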
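A small numerical check of the decomposition may help. The sketch assumes that the base vectors are orthonormal and that the residual is what remains after projecting onto them; under those assumptions the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The names `decompose` and `top_residual` are hypothetical, and `top_residual` only illustrates the idea of keeping "a few words".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual orthogonal to it."""
    w = [dot(x, p) for x in basis]
    recon = [sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))]
    residual = [pi - ri for pi, ri in zip(p, recon)]
    return w, residual

def top_residual(residual, n_keep):
    """Keep only the n_keep largest-magnitude residual entries (the 'few words')."""
    keep = sorted(range(len(residual)), key=lambda i: -abs(residual[i]))[:n_keep]
    return {i: residual[i] for i in keep}

if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    basis = [[s, s, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # two orthonormal base vectors
    p, q = [3.0, 1.0, 2.0, 5.0], [1.0, 2.0, 0.0, 4.0]
    wp, ep = decompose(p, basis)
    wq, eq = decompose(q, basis)
    print(dot(p, q), dot(wp, wq) + dot(ep, eq))
    print(top_residual(ep, 1))
```

Truncating the residual to its largest entries makes the identity approximate rather than exact, which is the trade-off between index size and accuracy.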
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
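The switch-variable mixture can be evaluated directly. The following sketch is illustrative only: the argument names and the toy numbers are assumptions, with `switch[x]` standing for p(x | d), `topic_word[k][w]` for p(w | z=k), `doc_word[w]` for the document-specific distribution and `background[w]` for the background distribution.

```python
def word_probability(w, switch, topic_mix, topic_word, doc_word, background):
    """p(w | d) as a three-way mixture selected by the switch variable x."""
    topic_part = sum(topic_word[k][w] * topic_mix[k] for k in range(len(topic_mix)))
    return switch[0] * topic_part + switch[1] * doc_word[w] + switch[2] * background[w]

if __name__ == "__main__":
    switch = [0.6, 0.3, 0.1]                          # p(x=0|d), p(x=1|d), p(x=2|d)
    topic_mix = [0.5, 0.5]                            # p(z=k | d)
    topic_word = [[0.7, 0.3, 0.0], [0.0, 0.3, 0.7]]   # p(w | z=k)
    doc_word = [0.0, 1.0, 0.0]                        # document-specific words
    background = [1.0 / 3] * 3                        # corpus-wide background
    probs = [word_probability(w, switch, topic_mix, topic_word, doc_word, background)
             for w in range(3)]
    print(probs, sum(probs))
```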
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional histogram + a few words) is looked up in an LSH index over entries DS1, DS2, ..., which returns candidates such as Doc 300 and Doc 401; the candidates are then re-ranked using Doc Meta]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
[Figure: query image and search results]
Search Example
[Figure: query image and search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction, network construction, expert finding
• Methods
  - HITS
  - PageRank
  - ...
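To make the last stage of the pipeline concrete, here is a minimal sketch of HITS by power iteration on a reply network. The edge convention (replier, author replied to) and the function name `hits` are assumptions made for this example; the slides do not specify how the network is encoded.

```python
import math

def hits(edges, n_nodes, n_iter=50):
    """Return (hub, authority) scores for a directed graph given as (source, target) pairs."""
    hub = [1.0] * n_nodes
    auth = [1.0] * n_nodes
    for _ in range(n_iter):
        # Authority update: sum the hub scores of everyone pointing at the node
        auth = [0.0] * n_nodes
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(a * a for a in auth))
        auth = [a / norm for a in auth]
        # Hub update: sum the authority scores of everyone the node points at
        hub = [0.0] * n_nodes
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(h * h for h in hub))
        hub = [h / norm for h in hub]
    return hub, auth

if __name__ == "__main__":
    # Users 1, 2 and 3 reply to user 0; user 3 also replies to user 2
    replies = [(1, 0), (2, 0), (3, 0), (3, 2)]
    hub, auth = hits(replies, 4)
    print([round(a, 3) for a in auth])
```

In this toy network the user who receives the most replies gets the highest authority score, which is the intuition behind ranking experts on a reconstructed reply graph.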
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrix
  - Graphical model: a good practice to learn probability
• The graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is more adaptable to various applications than matrix decomposition
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS 2001
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999)
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian Inference
  - Expectation-Maximization
  - Mixture Gaussian parameter estimation
• pLSA
  - Motivation
  - Derivation & geometry properties
  - Applications
• LDA
  - Motivation: why add a hyper-parameter
  - Dirichlet Distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data
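A coin-flip example shows the two readings of the same formula. The sketch below is an illustration that is not in the slides: it fixes the observed data and evaluates a Bernoulli likelihood at several values of θ, so that L(θ) is a function of the parameter rather than of the data. The function name is hypothetical.

```python
def bernoulli_likelihood(theta, data):
    """L(theta) = p(data | theta) for independent coin flips with P(heads) = theta."""
    likelihood = 1.0
    for x in data:
        likelihood *= theta if x == 1 else 1.0 - theta
    return likelihood

if __name__ == "__main__":
    observed = [1, 1, 1, 0]  # three heads, one tail
    for theta in (0.1, 0.5, 0.75):
        print(theta, bernoulli_likelihood(theta, observed))
```

Among the three candidates, θ = 0.75 gives the largest likelihood, matching the observed fraction of heads.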
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$
• Maximum likelihood finds the θ that maximizes L(θ)
• Predictive distribution
IID - Independent Identically Distributed
• IID means that the samples are drawn independently from the same distribution
• The problem is then considerably simplified, because the likelihood factorizes over the samples
• Usually the log likelihood is used
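In symbols, the i.i.d. simplification and the resulting maximum likelihood estimate can be written as follows. These are the standard textbook forms, stated here with the notation of the surrounding slides ($x_1, \dots, x_n$ for the samples, θ for the parameters).

```latex
% i.i.d.: the joint density factorizes over the samples
p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)

% Log likelihood: the product becomes a sum
\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)

% Maximum likelihood estimate
\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)
```

Taking the logarithm does not move the maximizer, and it replaces a product of many small numbers by a sum that is easier to differentiate and numerically more stable.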
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe a complex model: Gaussian Mixture Model
  - To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds the parameters θ that maximize the likelihood
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  - E step: Fill in values of latent variables according to the posterior given the data
  - M step: Maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
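The two steps can be seen at work on a one-dimensional Gaussian mixture, the first latent variable model mentioned above. The sketch is a minimal illustration with assumed names and a fixed iteration count; the latent variable is the component that generated each point, and the E step fills it in softly with posterior responsibilities.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, init_means, n_iter=100):
    """Fit a 1-D Gaussian mixture by EM. Returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E step: posterior probability of each component for each point
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M step: weighted maximum likelihood, as if the assignments were observed
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

if __name__ == "__main__":
    data = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.9, 3.1, 3.0, 3.2, 2.8]
    print(em_gmm_1d(data, [-1.0, 1.0]))
```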
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• In the resulting bound, H[q] denotes the entropy of q(X)
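The bound can be derived in three lines. The derivation below is the standard one (as in Ghahramani's lecture slides cited in this part of the talk), written with the symbols used on the surrounding slides; the name F(q, θ) for the bound matches the later slides on the E and M steps.

```latex
% Log likelihood with the latent variables integrated out
L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX

% Multiply and divide by q(X), then apply Jensen's inequality (log is concave)
L(\theta) = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX
  \;\ge\; \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX

% Split the logarithm: expected complete-data log likelihood plus entropy
F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q],
\qquad H[q] = -\int q(X) \log q(X)\, dX
```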
The E and M Steps of EM
• The lower bound on the log likelihood is denoted F(q, θ)
• EM alternates between
  - E step: optimize F(q, θ) w.r.t. the distribution over hidden variables, holding the parameters fixed
  - M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ, $F(q, \theta) = L(\theta) - \mathrm{KL}(q(X) \,\|\, p(X \mid D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when $\mathrm{KL}(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$
The M Step
• M step: maximize F(q, θ) w.r.t. the parameters, holding the hidden distribution q fixed
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 Slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is so complex, even with a few circles...
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)
• In general, $p(X_1, \dots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
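The five-variable factorization above can be checked on binary variables. In the sketch below every probability table is invented purely for illustration; the point is only that multiplying one conditional per node, each conditioned on its parents, yields a valid joint distribution.

```python
from itertools import product

def bern(p_one, value):
    """Probability that a binary variable equals `value` when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Made-up conditional probability tables, each storing P(child = 1 | parents)
P_A = 0.3
P_B = 0.6
P_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(C=1 | A, B)
P_D = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}  # P(D=1 | B, C)
P_E = {(0, 0): 0.5, (0, 1): 0.6, (1, 0): 0.1, (1, 1): 0.9}  # P(E=1 | C, D)

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(P_A, a) * bern(P_B, b) * bern(P_C[(a, b)], c)
            * bern(P_D[(b, c)], d) * bern(P_E[(c, d)], e))

def marginal_a(value):
    """Sum the joint over all other variables to recover p(A = value)."""
    return sum(joint(value, b, c, d, e) for b, c, d, e in product((0, 1), repeat=4))

if __name__ == "__main__":
    print(sum(joint(*v) for v in product((0, 1), repeat=5)))
    print(marginal_a(1))
```

The factorized form needs 1 + 1 + 4 + 4 + 4 = 14 numbers, against 31 for an unconstrained joint table over five binary variables.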
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    - Java could mean "coffee" and also the "PL Java"
    - Cricket is a "game" and also an "insect"
  - Synonymy
    - "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
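pLSA
[Figure: pLSA graphical model in plate notation (document d, topic z, word w, with labels Nd and M), and the same model unrolled for one document with topics z1 ... zN and words w1 ... wN]
• z1 ... zN are variables, $z_i \in [1, K]$; K is the number of latent topics
pLSA
[Figure: unrolled pLSA model for documents d1, d2, ..., dM, each with its own topics and words]
• p(w|z=1), p(w|z=2), ..., p(w|z=K) are shared for all documents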
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector: $p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
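The matrix view of the decomposition amounts to one matrix-vector product per document. In this sketch the names are hypothetical and the topics are stored row-wise, so `topic_word[z][w]` holds p(w | z); the columns of the slide's matrix Z correspond to the rows here.

```python
def mix_topics(topic_word, doc_topic):
    """p(w | d) = sum_z p(w | z) p(z | d), i.e. the product Z p(z | d)."""
    vocab_size = len(topic_word[0])
    return [sum(topic_word[z][w] * doc_topic[z] for z in range(len(doc_topic)))
            for w in range(vocab_size)]

if __name__ == "__main__":
    topic_word = [[0.5, 0.5, 0.0, 0.0],   # topic 0 uses the first two words
                  [0.0, 0.0, 0.5, 0.5]]   # topic 1 uses the last two words
    doc_topic = [0.8, 0.2]                # p(z | d): coordinates of d in topic space
    print(mix_topics(topic_word, doc_topic))
```

The topics play the role of the basis B in the matrix decomposition A = BC, and p(z | d) plays the role of the coordinates C.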
pLSA - Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
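For concreteness, the two steps can be sketched with the standard pLSA updates from Hofmann (1999): the E step computes $p(z \mid d, w) \propto p(w \mid z)\, p(z \mid d)$, and the M step re-estimates p(w | z) and p(z | d) from the counts n(d, w) weighted by that posterior. The function names, the random initialization and the fixed iteration count are assumptions made for this example.

```python
import math
import random

def normalize(values):
    total = sum(values) or 1.0  # leave an all-zero row unchanged instead of dividing by zero
    return [v / total for v in values]

def plsa(counts, k, n_iter=100, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z) as topic_word[z][w] and p(z|d) as doc_topic[d][z]."""
    rng = random.Random(seed)
    n_docs, vocab = len(counts), len(counts[0])
    topic_word = [normalize([rng.random() + 0.1 for _ in range(vocab)]) for _ in range(k)]
    doc_topic = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(n_docs)]
    for _ in range(n_iter):
        tw_acc = [[0.0] * vocab for _ in range(k)]
        dt_acc = [[0.0] * k for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(vocab):
                if counts[d][w] == 0:
                    continue
                # E step: posterior over topics for this (document, word) pair
                post = normalize([topic_word[z][w] * doc_topic[d][z] for z in range(k)])
                for z in range(k):
                    tw_acc[z][w] += counts[d][w] * post[z]
                    dt_acc[d][z] += counts[d][w] * post[z]
        # M step: renormalize the expected counts
        topic_word = [normalize(row) for row in tw_acc]
        doc_topic = [normalize(row) for row in dt_acc]
    return topic_word, doc_topic

def log_likelihood(counts, topic_word, doc_topic):
    """Sum over observed pairs of n(d, w) log p(w | d)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(topic_word[z][w] * doc_topic[d][z] for z in range(len(topic_word)))
                total += n * math.log(p)
    return total

if __name__ == "__main__":
    counts = [[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]]
    topic_word, doc_topic = plsa(counts, k=2)
    print(log_likelihood(counts, topic_word, doc_topic))
```

Because EM never decreases the log likelihood, running more iterations from the same initialization cannot make `log_likelihood` smaller.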
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model for documents d1, d2, ..., dm, each with its own topic distribution p(z|d1), p(z|d2), ..., p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability
bull Graphical model is a good tool to analyze problems
bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages
bull It is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline
• Basic concepts
  - Likelihood, i.i.d.
  - ML, MAP and Bayesian inference
  - Expectation-Maximization
  - Mixture Gaussian parameter estimation
• pLSA
  - Motivation
  - Derivation & geometry properties
  - Applications
• LDA
  - Motivation: why add a hyperparameter
  - Dirichlet distribution
  - Variational EM
  - Relations with other topic models
  - Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the observed data "most likely" to have been generated by the model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ)
    θ_ML = argmax_θ L(θ) = argmax_θ p(x1, x2, …, xn | θ)
• Predictive distribution
    p(x | x1, …, xn) ≈ p(x | θ_ML)
IID ndash Independent Identically Distributed
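• IID means the samples are drawn independently from the same distribution, so the joint density factorizes
    p(x1, x2, …, xn | θ) = ∏_{i=1..n} p(xi | θ)
• The problem is considerably simplified as
    L(θ) = ∏_{i=1..n} p(xi | θ)
• Usually the log likelihood is used
    log L(θ) = Σ_{i=1..n} log p(xi | θ)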
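As a minimal sketch of maximum likelihood under the i.i.d. assumption, the snippet below fits a one-dimensional Gaussian. The choice of a Gaussian model, the function names and the five data values are assumptions made for illustration; they are not from the slides.

```python
import math

def gaussian_log_likelihood(samples, mu, sigma):
    """Log-likelihood of i.i.d. samples under N(mu, sigma^2): a sum of per-sample log densities."""
    n = len(samples)
    squared_error = sum((x - mu) ** 2 for x in samples)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - squared_error / (2.0 * sigma ** 2))

def gaussian_mle(samples):
    """Closed-form maximum likelihood estimates (mean, standard deviation)."""
    n = len(samples)
    mu = sum(samples) / n
    variance = sum((x - mu) ** 2 for x in samples) / n  # ML divides by n, not n - 1
    return mu, math.sqrt(variance)

samples = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up data for illustration
mu_hat, sigma_hat = gaussian_mle(samples)
best = gaussian_log_likelihood(samples, mu_hat, sigma_hat)
```

Evaluating `gaussian_log_likelihood` at any other `(mu, sigma)` gives a value no larger than `best`, which is what "maximum likelihood" means for this model.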
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models: Gaussian Mixture Model
• To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set: D
• Likelihood: L(θ) = p(D|θ) = ∫ p(X, D|θ) dX, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
    θ_ML = argmax_θ L(θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of the latent variables according to their posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• This decomposes a difficult problem into a series of tractable steps
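The two steps can be sketched for a two-component, one-dimensional Gaussian mixture. This is an illustrative sketch, not the slides' own code: the synthetic data, the starting parameters and the small floor on the standard deviation are all assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, means, sigmas):
    return sum(math.log(sum(w * normal_pdf(x, m, s)
                            for w, m, s in zip(weights, means, sigmas)))
               for x in data)

def em_step(data, weights, means, sigmas):
    k = len(weights)
    # E step: posterior responsibility of each component for each point.
    resp = []
    for x in data:
        joint = [weights[j] * normal_pdf(x, means[j], sigmas[j]) for j in range(k)]
        total = sum(joint)
        resp.append([v / total for v in joint])
    # M step: weighted maximum likelihood, using responsibilities as soft counts.
    new_w, new_m, new_s = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mu)
        new_s.append(max(math.sqrt(var), 1e-3))  # assumed floor against a collapsed component
    return new_w, new_m, new_s

rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.5) for _ in range(100)]
        + [rng.gauss(3.0, 1.0) for _ in range(100)])   # synthetic two-cluster data
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])         # arbitrary starting point
history = [log_likelihood(data, *params)]
for _ in range(30):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
```

`history` records the log likelihood after every iteration; it should never go down, and the fitted means in `params[1]` should move toward the two cluster centers.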
Jensen's Inequality
Lower Bounding the Log Likelihood
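• Observed data D = {yn}; latent variables X = {xn}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
    L(θ) = log p(D|θ) = log ∫ p(X, D|θ) dX
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
    L(θ) = log ∫ q(X) [p(X, D|θ) / q(X)] dX ≥ ∫ q(X) log [p(X, D|θ) / q(X)] dX = ∫ q(X) log p(X, D|θ) dX + H[q] = F(q, θ)
• where H[q] is the entropy of q(X)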
The E and M Steps of EM
• The lower bound on the log likelihood is given by
    F(q, θ) = ∫ q(X) log p(X, D|θ) dX + H[q]
• EM alternates between
  • E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
  • M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ,
    F(q, θ) = L(θ) − KL(q(X) || p(X|D, θ))
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)
The M Step
• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed
    θ_new = argmax_θ F(q, θ) = argmax_θ ∫ q(X) log p(X, D|θ) dX
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
    L(θ_old) = F(q, θ_old) ≤ F(q, θ_new) ≤ L(θ_new)
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 Slides)
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model looks complex, even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - Graphical models can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
    p(X1, …, Xn) = ∏_{i=1..n} p(Xi | X_pa(i))
• where pa(i) are the parents of node i
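A small sketch of the factorization for the five-node example, with binary variables. The parent sets follow p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D); the conditional probability values produced by `prob_true` are invented purely for illustration.

```python
from itertools import product

# Parents of each node, matching p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}

def prob_true(parent_values):
    """Hypothetical conditional table: P(node = 1 | parents)."""
    if not parent_values:
        return 0.3
    return 0.2 + 0.6 * sum(parent_values) / len(parent_values)

def joint(assignment):
    """Joint probability as a product of one local conditional per node."""
    p = 1.0
    for node, pa in parents.items():
        p_one = prob_true([assignment[q] for q in pa])
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

nodes = list(parents)
total = sum(joint(dict(zip(nodes, values)))
            for values in product([0, 1], repeat=len(nodes)))
```

Summing `joint` over all 32 assignments gives 1 (up to rounding), which confirms that the product of local conditionals defines a valid joint distribution.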
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies with personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA graphical model d → z → w in plate notation (plates Nd and M), and the same model unrolled as d → z1 … zN → w1 … wN]
z1 … zN are variables, zi ∈ [1, K]; K is the number of latent topics
pLSA
[Figure: pLSA unrolled over documents d1, d2, …, dM; each document generates its own topics z1 … zN and words w1 … wN]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
    p(d, z, w) = p(d) p(z|d) p(w|z)
• Likelihood (only for observed variables)
    p(d, w) = p(d) Σ_z p(z|d) p(w|z)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
    p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
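The matrix view can be checked numerically on a toy case. Here the V×k matrix Z has one topic-word distribution p(w|z) per column; the sizes (V = 4 words, k = 2 topics) and all numbers are invented for illustration.

```python
# Each column of Z is a topic-word distribution p(w|z); rows index the V = 4 words.
Z = [[0.5, 0.0],
     [0.3, 0.1],
     [0.2, 0.3],
     [0.0, 0.6]]
p_z_given_d = [0.25, 0.75]  # coordinates of one document in topic space

def mix(topic_matrix, coords):
    """Matrix-vector product: p(w|d) = sum_z p(w|z) p(z|d)."""
    return [sum(row[j] * coords[j] for j in range(len(coords)))
            for row in topic_matrix]

p_w_given_d = mix(Z, p_z_given_d)
```

Because every column of Z and the coordinate vector each sum to one, the result is again a distribution over words: the document is a convex combination of the topic basis vectors.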
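pLSA - Objective Function
• pLSA tries to maximize the log likelihood
    L = Σ_d Σ_w n(d, w) log p(d, w) = Σ_d Σ_w n(d, w) log [ p(d) Σ_z p(z|d) p(w|z) ]
  where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM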
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
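• The E-Step
    p(z|d, w) = p(w|z) p(z|d) / Σ_z' p(w|z') p(z'|d)
• The M-Step (n(d, w): count of word w in document d)
    p(w|z) ∝ Σ_d n(d, w) p(z|d, w)
    p(z|d) ∝ Σ_w n(d, w) p(z|d, w)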
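A compact sketch of these EM steps for pLSA on a toy count matrix. It assumes the standard updates from Hofmann (1999): the E step computes the posterior p(z|d,w), and the M step renormalizes the expected counts into p(w|z) and p(z|d). The counts, the random initialization and the function names are illustrative assumptions.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa_log_likelihood(counts, p_w_z, p_z_d):
    """Sum over (d, w) of n(d, w) * log sum_z p(w|z) p(z|d); the uniform p(d) is dropped."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                ll += n * math.log(sum(p_w_z[z][w] * p_z_d[d][z]
                                       for z in range(len(p_w_z))))
    return ll

def plsa_em_step(counts, p_w_z, p_z_d):
    K, M, V = len(p_w_z), len(counts), len(counts[0])
    new_w_z = [[0.0] * V for _ in range(K)]
    new_z_d = [[0.0] * K for _ in range(M)]
    for d in range(M):
        for w in range(V):
            if counts[d][w] == 0:
                continue
            # E step: posterior p(z | d, w) under the current parameters.
            post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(K)])
            # Accumulate expected counts for the M step.
            for z in range(K):
                new_w_z[z][w] += counts[d][w] * post[z]
                new_z_d[d][z] += counts[d][w] * post[z]
    # M step: renormalize the expected counts into distributions.
    return [normalize(r) for r in new_w_z], [normalize(r) for r in new_z_d]

counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]  # toy n(d, w)
rng = random.Random(1)
K = 2
p_w_z = [normalize([rng.random() + 0.1 for _ in range(4)]) for _ in range(K)]
p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(4)]
history = [plsa_log_likelihood(counts, p_w_z, p_z_d)]
for _ in range(30):
    p_w_z, p_z_d = plsa_em_step(counts, p_w_z, p_z_d)
    history.append(plsa_log_likelihood(counts, p_w_z, p_z_d))
```

The sketch assumes every document contains at least one word; an empty row in `counts` would need special handling before normalization.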
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann, Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X., Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T., Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
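The growth can be made concrete by counting table entries. The counts below follow the comparison in Blei et al. (2003): kV + kM for pLSA versus k + kV for LDA. They ignore sum-to-one constraints, and the example sizes are assumptions for illustration.

```python
def plsa_param_count(num_topics, vocab_size, num_docs):
    # k topic-word distributions of size V, plus one k-dim mixture per training document.
    return num_topics * vocab_size + num_topics * num_docs

def lda_param_count(num_topics, vocab_size):
    # k-dim Dirichlet parameter alpha, plus the k x V topic-word matrix beta.
    return num_topics + num_topics * vocab_size

for num_docs in (1_000, 10_000, 100_000):
    print(num_docs,
          plsa_param_count(100, 10_000, num_docs),
          lda_param_count(100, 10_000))
```

Only the pLSA count changes as documents are added, which is the source of the over-fitting concern.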
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: pLSA unrolled over documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
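• Definition
\[ p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0 \]
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σ_{i=1..K} αi: α0 = 0.1, α0 = 1, α0 = 10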
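The effect of the total concentration α0 can be explored by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma draws, a standard construction. The three α0 values mirror the slide under the assumption that its decimal points were lost in extraction; the "mean largest coordinate" summary is an illustrative choice.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
K = 3
mean_largest = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = [alpha0 / K] * K               # equal alpha_i with sum alpha0
    samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]
    # Average size of the largest coordinate: near 1 means samples sit at the corners.
    mean_largest[alpha0] = sum(max(s) for s in samples) / len(samples)
```

Small α0 pushes samples toward the corners of the simplex (one dominant topic), while large α0 concentrates them near the center (evenly mixed topics).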
The LDA Model
[Figure: LDA graphical model for three documents, each with topics z1 … z4 generating words w1 … w4, and shared β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
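The generative process can be sketched directly as sampling code. The notation θ, α, β follows Blei et al. (2003); the two-topic, four-word `beta` matrix and the helper names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw one index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, num_words, rng):
    theta = sample_dirichlet(alpha, rng)       # topic mixture for this document
    topics, words = [], []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)     # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)   # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical topic-word matrix: topic 0 emits words 0/1, topic 1 emits words 2/3.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
rng = random.Random(0)
theta, topics, words = generate_document([1.0, 1.0], beta, 20, rng)
```

Unlike pLSA, the per-document mixture `theta` is itself a random draw governed by `alpha`, so a new document needs no new parameters.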
The LDA Model
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
Joint Probability
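• Given parameters α and β, the joint distribution of the topic mixture θ, the topics z and the words w is
    p(θ, z, w | α, β) = p(θ|α) ∏_{n=1..N} p(zn|θ) p(wn|zn, β)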
where
Likelihood
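• Joint probability
    p(θ, z, w | α, β) = p(θ|α) ∏_{n=1..N} p(zn|θ) p(wn|zn, β)
• Marginal distribution of a document
    p(w | α, β) = ∫ p(θ|α) ( ∏_{n=1..N} Σ_zn p(zn|θ) p(wn|zn, β) ) dθ
• Likelihood over all the documents
    p(D | α, β) = ∏_{d=1..M} p(wd | α, β)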
Inference
• The log likelihood of the corpus is the sum of the per-document log likelihoods
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  - β
  - α can be implemented using the Newton-Raphson method
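For reference, the standard variational EM updates can be written as below. They follow Blei et al. (2003), so the symbols are taken from that paper rather than from the slide: φ holds per-word topic responsibilities, γ is the variational Dirichlet parameter, Ψ is the digamma function, and w_{dn}^j = 1 when the n-th word of document d is vocabulary item j.

```latex
\begin{aligned}
\text{E-step (per document):}\quad
  \phi_{ni} &\propto \beta_{i w_n} \exp\!\Big(\Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{K} \gamma_j\big)\Big), \\
  \gamma_i &= \alpha_i + \sum_{n=1}^{N} \phi_{ni}, \\[4pt]
\text{M-step:}\quad
  \beta_{ij} &\propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^{*}_{dni}\, w_{dn}^{j}.
\end{aligned}
```

The two E-step equations depend on each other, which is why they are iterated to convergence for each document; α has no closed-form update, hence the Newton-Raphson step.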
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model for three documents, each with topics z1 … z4 generating words w1 … w4, and shared β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with variables π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
Term space (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2
Topic projection
Feature space (Dim = 200):
        f1     f2     …      fM
Img1    0.2    0.1    …      0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
An image = [Figure: bar chart of the low-dimensional representation] + a few words (10 words)
Orthogonal Decomposition
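\[ p = Xw + \varepsilon: \quad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix} \]
• X = (X1, X2, X3, …, Xk): base vectors
• w: low-dimensional representation
• ε: residual
An image = [Figure: bar chart of the low-dimensional representation] + a few words (10 words)
\[ p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q \]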
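A numerical sketch of the decomposition, under the assumption that the base vectors are orthonormal and the residual is whatever the projection leaves over. The ICCV 2009 paper uses a probabilistic implementation, so this is only a linear-algebra illustration; the basis, the vectors `p` and `q`, and the `keep` parameter are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis, keep):
    """Split p into coordinates on an orthonormal basis plus a residual."""
    w = [dot(p, x) for x in basis]                       # w = X^T p
    projection = [sum(w[j] * basis[j][i] for j in range(len(basis)))
                  for i in range(len(p))]
    residual = [a - b for a, b in zip(p, projection)]    # eps = p - X w
    # Keep only the largest-magnitude residual entries ("a few words").
    top = sorted(range(len(p)), key=lambda i: abs(residual[i]), reverse=True)[:keep]
    sparse = {i: residual[i] for i in top}
    return w, residual, sparse

# Two orthonormal base vectors in a 4-word vocabulary (hypothetical values).
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, -0.5, -0.5]]
p = [3.0, 1.0, 2.0, 0.0]
q = [1.0, 2.0, 0.0, 1.0]
w_p, eps_p, sparse_p = decompose(p, basis, keep=2)
w_q, eps_q, sparse_q = decompose(q, basis, keep=2)
similarity = dot(w_p, w_q) + dot(eps_p, eps_q)
```

With the full residuals, `similarity` equals the original inner product `dot(p, q)` exactly; keeping only a few residual entries, as in `sparse_p`, trades a little accuracy for a much smaller index.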
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
\[ p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w) \]
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
Search (Online)
[Figure: online search pipeline with an LSH index, candidate documents (Doc 300, Doc 401, …), document metadata (Doc Meta) and a re-ranking stage; a query = low-dimensional representation + a few words]
• Index: 10M images, 46GB
• Search speed: < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
NP       0.021                  0.012                   0.289               0.239
RR       0.183                  0.319                   0.269               0.474
DS       0.463                  0.643                   0.409               0.628
LDA      0.465                  0.644                   0.410               0.648
SWB      0.463                  0.644                   0.410               0.641
SMSS     0.524                  0.737                   0.517               0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
…
Baselines
LM
• Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
• Achieves stable performance in the expert finding task using a language model
PageRank
• Benchmark nodal ranking method
HITS
• Finds hub nodes and authority nodes
EABIF
• Personalized Recommendation Driven by Information Flow, SIGIR '06
• Finds the most influential nodes
Evaluation
• Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori)       0.674   0.362   0.243
EABIF(rec)       0.742   0.318   0.281
PageRank(ori)    0.675   0.377   0.263
PageRank(rec)    0.743   0.321   0.266
HITS(ori)        0.906   0.832   0.900
HITS(rec)        0.938   0.822   0.906
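The three column headings are standard ranking metrics. As a sketch of how they are usually computed for a single query (MRR and MAP are the means of the first and third quantities over all queries), assuming "P10" denotes precision at 10; the ranked list and relevant set below are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

ranked = ["u3", "u1", "u7", "u2"]   # hypothetical ranked list of candidate experts
relevant = {"u1", "u2"}             # hypothetical ground-truth experts
rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
```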
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to practice matrices
  - Graphical models: a good way to practice probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Outline
bull Basic conceptsndash Likelihood iidndash ML MAP and Bayesian Inferencendash Expectation-Maximizationndash Mixture Gaussian Parameter estimation
bull pLSAndash Motivationndash Derivation amp Geometry propertiesndash Applications
bull LDAndash Motivation - Why to add a hyper parameterndash Dirichlet Distributionndash Variational EMndash Relations with other topic modalsndash Incorporating category information
bull Summary
Not Included
bull General graphical model theoriesbull Markov random field (belief propagation)bull Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model. Each document d1, d2, …, dm generates its topics z1, z2, …, zN and words w1, w2, …, wN from its own distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
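• Definition
$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \dfrac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex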
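Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: Dirichlet distributions for α0 = 0.1, α0 = 1, α0 = 10]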
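The effect of the total concentration α0 can be explored numerically. The sketch below draws samples from a symmetric Dirichlet by normalising Gamma draws, which is a standard construction. The helper names, the sample size, and the summary statistic (the average largest component) are choices made for this illustration only.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def mean_largest_component(alpha0, k=3, n_samples=500, seed=0):
    """Average of max_i x_i over samples from a symmetric Dirichlet with sum(alpha) = alpha0."""
    rng = random.Random(seed)
    alpha = [alpha0 / k] * k
    samples = [sample_dirichlet(alpha, rng) for _ in range(n_samples)]
    return sum(max(x) for x in samples) / n_samples


# Small alpha0 pushes samples towards the corners of the simplex (one dominant
# component); large alpha0 concentrates them around the centre (1/3, 1/3, 1/3).
for alpha0 in (0.1, 1.0, 10.0):
    print(alpha0, round(mean_largest_component(alpha0), 3))
```

For topic models this means a small α0 favours documents dominated by a single topic, while a large α0 favours documents that mix all topics evenly.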
The LDA Model
[Figure: LDA graphical model. In each document the topics z1, …, z4 generate the words w1, …, w4; the word distributions β are shared across documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
  – Choose a topic zn ~ Multinomial(θ)
  – Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
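The generative process above can be simulated directly. The following is a minimal sketch under stated assumptions: `alpha` and `beta` are made-up toy parameters (K = 2 topics, 4 words), and `generate_document` is a hypothetical helper name, not part of any library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        # z_n ~ Multinomial(theta)
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # w_n ~ p(w | z_n, beta), the word distribution of the chosen topic
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical toy setting: K = 2 topics over a vocabulary of 4 words.
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits words 2 and 3
]
theta, topics, words = generate_document(alpha, beta, n_words=10, rng=random.Random(0))
print(theta, topics, words)
```

Unlike pLSA, the per-document mixture `theta` is itself a random draw, so new documents can be generated without adding parameters.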
Joint Probability
• Given the parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
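For reference, the three quantities listed above can be written out following the formulation of Blei, Ng and Jordan (2003). The notation here (θ for the topic mixture, z and w for the topics and words of one document with N words, M for the number of documents) is taken from that paper and is an assumption about what the slides displayed.

```latex
% Joint distribution of the topic mixture, topics and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all M documents of the corpus D
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```

The integral over θ couples all the topic assignments of a document, which is why exact inference becomes intractable.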
Inference
• The log likelihood of the corpus can be computed by summing the log likelihood of each document
• Jensen’s inequality is used to lower-bound it, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
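The relation between the bound and the KL divergence can be stated in one line. The form below follows Blei, Ng and Jordan (2003), where the variational distribution factorises as q(θ, z | γ, φ) = q(θ | γ) ∏ q(z_n | φ_n); this factorisation is an assumption carried over from that paper.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi)
      \,\big\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

The left-hand side does not depend on γ or φ, so raising the bound L and shrinking the KL term are the same operation.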
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  – β has a closed-form update
  – α can be updated using the Newton-Raphson method
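The variational E-step for a single document can be sketched as a fixed-point iteration. The updates below are the ones given in Blei, Ng and Jordan (2003): φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni. Everything else is an assumption for illustration: the function names, the fixed iteration count used in place of a convergence test, the toy parameters, and the hand-written `digamma` approximation (the standard library has none).

```python
import math


def digamma(x):
    """Digamma function for x > 0: upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def variational_e_step(doc, alpha, beta, iters=50):
    """Return (gamma, phi) for one document given as a list of word ids."""
    K, N = len(alpha), len(doc)
    gamma = [a + N / K for a in alpha]        # initial gamma_i = alpha_i + N / K
    phi = [[1.0 / K] * K for _ in doc]        # initial phi_ni = 1 / K
    for _ in range(iters):
        for n, w in enumerate(doc):
            # phi_ni is proportional to beta[i][w_n] * exp(digamma(gamma_i))
            weights = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(K)]
            total = sum(weights)
            phi[n] = [v / total for v in weights]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi


# Hypothetical toy parameters: 2 topics, 4 words, strictly positive beta.
beta = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
gamma, phi = variational_e_step([0, 1, 0, 1, 2], [0.5, 0.5], beta)
print(gamma)
```

In this toy document four of the five words are typical of topic 0, so the first component of `gamma` ends up larger than the second.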
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model with topics z1, …, z4, words w1, …, w4 and shared β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with variables π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe’s DoG detector
• Representation: normalized 11x11 gray values, 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for “long query” retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for “Long Query” Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
        Term1  Term2  Term3  Term4  …  TermN    (Dim = 1 million)
  Img1  1      2      0      0      …  2
        f1     f2     …      fM                 (Dim = 200)
  Img1  0.2    0.1    …      0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
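Orthogonal Decomposition
$p = Xw + \varepsilon$, written out over a vocabulary of W terms and k base vectors:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X = (X1 X2 X3 … Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional topic histogram + a few words (10 words)]
• Similarity of two vectors p and q: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$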
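The decomposition and the resulting similarity identity can be verified on a small example. The sketch assumes that the base vectors are orthonormal and that the residual is whatever remains after projecting onto them; under those assumptions the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The basis and the vectors `p` and `q` are made-up values.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Return (w, residual) with p = sum_j w[j] * basis[j] + residual.

    Assumes the base vectors are orthonormal, so w[j] is simply <p, basis[j]>.
    """
    w = [dot(p, x) for x in basis]
    reconstruction = [
        sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))
    ]
    residual = [pi - ri for pi, ri in zip(p, reconstruction)]
    return w, residual


# Hypothetical orthonormal basis (k = 2) in a 4-term vocabulary space.
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
p = [1.0, 2.0, 0.0, 0.0]
q = [0.0, 1.0, 3.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

# The full inner product equals the low-dimensional part plus the residual part.
print(dot(p, q), dot(w_p, w_q) + dot(eps_p, eps_q))
```

This is what makes it possible to index the short vector w and keep only a few large residual entries per document while still approximating the original similarity.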
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution (x = 0)
• a document-specific distribution (x = 1)
• a background distribution (x = 2)
$p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k)\,p(z=k|d) + p(x=1|d)\,p'(w|d) + p(x=2|d)\,p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (topic histogram + a few words) is looked up in the LSH index, which returns candidate documents (e.g. Doc 300, Doc 401, …); the candidates are then re-ranked using the DS1, DS2, … entries stored for each document (Doc 1, Doc 2, …, Doc N) in the doc meta]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: topics
Structure: who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document similarity
Topic similarity
Structure similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation; project documents to topic space
SWB: Special Words Topic Model with Background distribution; project documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods: HITS, PageRank, …
Baselines
LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR ’06
  – Achieves stable performance in the expert finding task using a language model
PageRank
  – Benchmark nodal ranking method
HITS
  – Finds hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow, SIGIR ’06
  – Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  – Matrix decomposition: good practice for learning matrices
  – Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning
Data
• Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \dots$
Predictions
• We are generally interested in predicting something based on the observed data set
• Given $\mathcal{D}$, what can we say about $x^{(N+1)}$?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data $\mathcal{D}$, we learn the model parameters θ, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Inversely, given the observed data and a model of interest, the likelihood function is defined as
$L(\theta) = f_\theta(x) = p(x|\theta)$
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data
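Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data “most likely” to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ): $\hat{\theta}_{ML} = \arg\max_{\theta} p(x_1, x_2, \dots, x_n | \theta)$
• Predictive distribution: $p(\tilde{x} | x_1, \dots, x_n) \approx p(\tilde{x} | \hat{\theta}_{ML})$
IID – Independent Identically Distributed
• IID means that the samples are drawn independently from the same distribution
• The problem is considerably simplified as $L(\theta) = \prod_{i=1}^{n} p(x_i | \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta)$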
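As a concrete instance of maximum likelihood under the IID assumption, consider a one-dimensional Gaussian, where the maximizer of the log likelihood is available in closed form (the sample mean and the biased sample variance). The data values and function names below are made up for this illustration.

```python
import math


def gaussian_log_likelihood(data, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for IID samples."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n  # ML estimate divides by n, not n - 1
    return mu, sigma2


data = [2.1, 1.9, 2.4, 1.6, 2.0]  # hypothetical observations
mu_ml, sigma2_ml = gaussian_mle(data)

# Any other parameter setting gives a log likelihood that is no higher.
print(gaussian_log_likelihood(data, mu_ml, sigma2_ml))
print(gaussian_log_likelihood(data, mu_ml + 0.1, sigma2_ml))
```

Working with the log turns the product over samples into a sum, which is both easier to differentiate and numerically safer.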
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models, e.g. the Gaussian mixture model
• To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
• Data set: D
• Likelihood: $p(D|\theta) = \int p(D, X|\theta)\,dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\hat{\theta}_{ML} = \arg\max_{\theta} p(D|\theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: Fill in values of latent variables according to the posterior given the data
• M step: Maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
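The two steps can be illustrated with the Gaussian mixture model mentioned above. The sketch fits a two-component, one-dimensional mixture. The initialisation at the data extremes, the small variance floor, the fixed iteration count, and the toy data are assumptions made for this example.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, iters=50):
    """EM for a 2-component 1-D Gaussian mixture. Returns (pi, mu, var, log_liks)."""
    mu = [min(data), max(data)]  # hypothetical initialisation at the data extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    log_liks = []
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point.
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        log_liks.append(log_lik)
        # M step: weighted ML estimates, as if the assignments were observed.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            weighted_sq = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data))
            var[k] = max(weighted_sq / nk, 1e-6)  # floor avoids a collapsed component
            pi[k] = nk / len(data)
    return pi, mu, var, log_liks


data = [-0.2, 0.1, 0.3, -0.1, 4.8, 5.1, 5.3, 4.9]  # two well-separated groups
pi, mu, var, log_liks = em_gmm_1d(data)
print(mu, log_liks[0], log_liks[-1])
```

The recorded `log_liks` sequence makes it easy to observe that the log likelihood does not decrease from one iteration to the next.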
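Jensen’s Inequality
• For a concave function such as log: $\log \mathbb{E}[x] \ge \mathbb{E}[\log x]$
Lower Bounding the Log Likelihood
• Observed data D = {yn}; latent variables X = {xn}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ: $L(\theta) = \log p(D|\theta) = \log \int p(D, X|\theta)\,dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen’s inequality: $L(\theta) \ge \int q(X) \log p(D, X|\theta)\,dX + H[q] \equiv F(q, \theta)$
• where H[q] is the entropy of q(X)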
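The bound follows in three short steps, sketched here with the same symbols as above (D observed data, X latent variables, q(X) any distribution over X). The name F(q, θ) for the bound follows the later slides.

```latex
\begin{align*}
L(\theta) = \log p(D \mid \theta)
  &= \log \int q(X)\, \frac{p(D, X \mid \theta)}{q(X)}\, dX
     && \text{multiply and divide by } q(X) \\
  &\ge \int q(X) \log \frac{p(D, X \mid \theta)}{q(X)}\, dX
     && \text{Jensen's inequality (log is concave)} \\
  &= \int q(X) \log p(D, X \mid \theta)\, dX \;-\; \int q(X) \log q(X)\, dX \\
  &= \int q(X) \log p(D, X \mid \theta)\, dX + H[q] \;\equiv\; F(q, \theta)
\end{align*}
```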
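The E and M Steps of EM
• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(D, X|\theta)\,dX + H[q]$
• EM alternates between
• E step: optimize F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed: $q^{(k)}(X) = \arg\max_{q} F(q, \theta^{(k-1)})$
• M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution fixed: $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta)$
The E Step
• E step: for fixed θ, $F(q, \theta) = L(\theta) - \mathrm{KL}(q(X) \,\|\, p(X|D, \theta))$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)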
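The decomposition used in the E step can be derived by factoring the joint as p(D, X | θ) = p(X | D, θ) p(D | θ). The derivation below is a sketch in the notation of the preceding slides.

```latex
\begin{align*}
F(q, \theta)
  &= \int q(X) \log \frac{p(D, X \mid \theta)}{q(X)}\, dX \\
  &= \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX \\
  &= \log p(D \mid \theta) \int q(X)\, dX
     \;-\; \int q(X) \log \frac{q(X)}{p(X \mid D, \theta)}\, dX \\
  &= L(\theta) - \mathrm{KL}\big( q(X) \,\|\, p(X \mid D, \theta) \big)
\end{align*}
```

Since the KL divergence is non-negative and vanishes only when the two distributions coincide, choosing q(X) = p(X | D, θ) closes the gap between F and L.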
The M Step
• M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution q fixed: $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(D, X|\theta)\,dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) wrt θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model is complex even with only a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world, but the joint probability is hard to compute
  – A graphical model can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general: $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i | x_{pa(i)})$
• where pa(i) are the parents of node i
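The five-variable factorization above can be turned into a small program. The sketch assumes binary variables and uses made-up conditional probability tables; only the graph structure (A and B are roots, C depends on A and B, D on B and C, E on C and D) comes from the slide.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# Each entry is the probability that the variable equals 1 given its parents.
p_a = 0.3
p_b = 0.6
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # p(C=1 | A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}  # p(D=1 | B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # p(E=1 | C, D)


def bern(p_one, value):
    """Probability of a binary value under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (
        bern(p_a, a)
        * bern(p_b, b)
        * bern(p_c[(a, b)], c)
        * bern(p_d[(b, c)], d)
        * bern(p_e[(c, d)], e)
    )


# The factorized joint is a valid distribution over all 32 assignments.
total = sum(joint(*values) for values in product((0, 1), repeat=5))
print(total)
```

The full joint table would need 31 free numbers, whereas the factorized form needs only 1 + 1 + 4 + 4 + 4 = 14, which is the decoupling benefit listed under "Pros".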
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same query varies due to personal style
• Latent semantic indexing
  – Creates a “latent semantic space” (hidden meaning)
• LSI puts documents together even if they don’t have common words, as long as the docs share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    · Java could mean “coffee” and also the “PL Java”
    · Cricket is a “game” and also an “insect”
  – Synonymy
    · “computer”, “pc”, “desktop” all could mean the same
• Has a better statistical foundation than LSA
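pLSA
[Figure: plate notation of pLSA (d → z → w, with plates Nd and M) and its unrolled form, in which a document d generates topics z1, z2, z3, …, zN and words w1, w2, w3, …, wN]
• z1 … zN are latent variables with zi ∈ [1, K], where K is the number of latent topics
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, each with its own topics z and words w]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents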
BASIC CONCEPTS
What Is Machine Learning
Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x(1), x(2), …, x(N)} a data set. D is sometimes associated with desired outputs y1, y2, …
Predictions
• We are generally interested in predicting something based on the observed data set
• Given D, what can we say about x(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ
• Given data D, we learn the model parameters, from which we can predict new data points
• The model can often be expressed as a probability distribution over data points
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
L(θ) = f_θ(x) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ)
• Predictive distribution
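The formulas on this slide did not survive extraction. A standard way to write the two statements, using the slide's own symbols and introducing \tilde{x} (an assumed name for a new data point), is sketched below. The second line is the usual plug-in approximation, which may differ in detail from what the slide showed.

```latex
\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta}\, p(x_1, x_2, \dots, x_n \mid \theta)

p(\tilde{x} \mid x_1, \dots, x_n) \approx p(\tilde{x} \mid \theta_{ML})
```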
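IID – Independent Identically Distributed
• IID means that the samples x1, …, xn are drawn independently from the same distribution p(x|θ)
• The problem is considerably simplified as
L(θ) = p(x1, …, xn|θ) = ∏_{i=1}^{n} p(xi|θ)
• Usually the log likelihood is used
log L(θ) = Σ_{i=1}^{n} log p(xi|θ)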
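As a small illustration of maximum likelihood under the i.i.d. assumption, the sketch below fits a one-dimensional Gaussian, where the log likelihood is a sum of per-sample log densities and the maximizer has a closed form. The choice of a Gaussian model and the sample values are assumptions made for this example only.

```python
import math


def gaussian_log_likelihood(xs, mu, sigma):
    """Sum over i.i.d. samples of log N(x | mu, sigma^2)."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - squared_error / (2.0 * sigma ** 2)


def gaussian_mle(xs):
    """Closed-form maximizer of the log likelihood above."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma


xs = [2.1, 1.9, 2.4, 2.0, 1.6]          # hypothetical observations
mu_ml, sigma_ml = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_ml, sigma_ml)
```

Any other parameter setting gives a log likelihood no larger than `best`, which is what "most likely" means on the slide.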
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  – To describe complex models: Gaussian Mixture Model
  – To discover the intrinsic structure inside a data set: topic models such as pLSA and LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that the likelihood is maximized
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
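The formulas for this slide were lost in extraction. Borrowing the notation that the later slides use (observed data D, latent variables X, parameters θ), a plausible reading is the marginal likelihood below. It is a reconstruction under that assumption, not a copy of the slide.

```latex
p(D \mid \theta) = \int p(D, X \mid \theta)\, dX
\qquad
\theta_{ML} = \arg\max_{\theta}\, p(D \mid \theta)
```

The integral becomes a sum when the latent variables are discrete, as in mixture models and topic models.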
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  – E step: fill in values of latent variables according to the posterior given the data
  – M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
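The two steps can be sketched on the Gaussian mixture model mentioned earlier. The example below is a minimal one-dimensional version; the data, the initial means, unit initial variances and equal initial weights are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, mus, variances):
    """Log likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)))
        for x in xs
    )


def em_gmm(xs, init_mus, n_iter=30):
    k = len(init_mus)
    mus = list(init_mus)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = []
    for _ in range(n_iter):
        # E step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: re-estimate parameters as if the soft assignments were observed.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
        history.append(log_likelihood(xs, weights, mus, variances))
    return weights, mus, variances, history


xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.9, 3.1, 2.8]   # two hypothetical clusters
weights, mus, variances, history = em_gmm(xs, init_mus=[-1.0, 1.0])
```

The `history` list records the log likelihood after every iteration, so it can be used to check that the value never decreases.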
Jensen's Inequality
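The body of this slide (probably a figure) is missing from the transcript. For reference, the form of Jensen's inequality that EM relies on is the one for the concave logarithm, stated here with assumed symbols λ_i for the weights and x_i for the points.

```latex
\log \sum_{i} \lambda_i x_i \;\ge\; \sum_{i} \lambda_i \log x_i,
\qquad \lambda_i \ge 0,\quad \sum_i \lambda_i = 1,\quad x_i > 0

\text{or, in expectation form:}\quad \log \mathbb{E}[x] \;\ge\; \mathbb{E}[\log x]
```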
Lower Bounding the Log Likelihood
• Observed data D = {yn}; latent variables X = {xn}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ
L(θ) = log p(D|θ) = log ∫ p(X, D|θ) dX
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
L(θ) = log ∫ q(X) [p(X, D|θ) / q(X)] dX ≥ ∫ q(X) log [p(X, D|θ) / q(X)] dX = ∫ q(X) log p(X, D|θ) dX + H[q]
• where H[q] is the entropy of q(X)
The E and M Steps of EM
• The lower bound on the log likelihood is given by
F(q, θ) = ∫ q(X) log p(X, D|θ) dX + H[q]
• EM alternates between
  – E step: optimize F(q, θ) with respect to the distribution over hidden variables, holding the parameters fixed
  – M step: maximize F(q, θ) with respect to the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ
F(q, θ) = L(θ) − KL(q(X) ‖ p(X|D, θ))
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)
The M Step
• M step: maximize F(q, θ) with respect to the parameters θ, holding the hidden distribution q fixed
θ ← argmax_θ F(q, θ) = argmax_θ ∫ q(X) log p(X, D|θ) dX
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
L(θ_old) = F(q_new, θ_old) ≤ F(q_new, θ_new) ≤ L(θ_new)
• The E step brings F(q, θ) up to the likelihood L(θ) (the equality)
• The M step maximizes F(q, θ) with respect to θ (the first inequality)
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL (the second inequality)
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 Slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model is so complex, even with a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world, but joint probability is hard to compute
  – A graphical model can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
p(x1, …, xn) = ∏_{i=1}^{n} p(xi | x_pa(i))
• where pa(i) are the parents of node i
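The five-node factorization above can be checked numerically. The sketch below assumes binary variables and uses made-up conditional probability tables; only the graph structure (A and B are roots, C depends on A and B, D on B and C, E on C and D) comes from the slide.

```python
from itertools import product

# Hypothetical conditional probability tables; each entry is p(node = 1 | parents).
p_a = 0.3
p_b = 0.6
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}     # keyed by (A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}     # keyed by (B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}   # keyed by (C, D)


def bern(p_true, value):
    """Probability that a binary variable takes `value` when p(value = 1) = p_true."""
    return p_true if value == 1 else 1.0 - p_true


def joint(a, b, c, d, e):
    """p(A, B, C, D, E) as the product of each node given its parents."""
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
            * bern(p_d[(b, c)], d) * bern(p_e[(c, d)], e))


assignments = list(product((0, 1), repeat=5))
total = sum(joint(*v) for v in assignments)                 # should be 1
p_e_is_1 = sum(joint(*v) for v in assignments if v[4] == 1)  # marginal p(E = 1)
```

The model needs 2 + 4 + 4 + 4 = 14 numbers here instead of the 31 required for an unrestricted joint table over five binary variables, which is the practical benefit of the factorization.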
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same queries vary due to personal styles
• Latent semantic indexing
  – Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    · Java could mean "coffee" and also the "PL Java"
    · Cricket is a "game" and also an "insect"
  – Synonymy
    · "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
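pLSA
[Figure: pLSA in plate notation (d → z → w, with plates Nd and M), and the same model unrolled for one document: d → z1, z2, z3, …, zN → w1, w2, w3, …, wN]
z1 … zN are variables, zi ∈ [1, K], where K is the number of latent topics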
pLSA
[Figure: pLSA unrolled over M documents d1, d2, …, dM, each with its own topics z1, z2, … and words w1, w2, …]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
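The two formulas are missing from the transcript. For a single document-word pair, the asymmetric pLSA model described in the cited Hofmann paper gives the following, written here with the slide's d, z and w. The exact layout on the slide is unknown, so this is a sketch.

```latex
\text{Joint probability:}\quad p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)

\text{Likelihood:}\quad p(d, w) = \sum_{z} p(d, z, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)
```

The likelihood sums out the latent topic z because only d and w are observed.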
Document Decomposition
• Each document can be decomposed as
p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
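The matrix view of the decomposition can be made concrete with a tiny example. The sketch below assumes a vocabulary of V = 4 words and k = 2 topics, with made-up probabilities; `Z` plays the role of the V×k basis and `p_z_given_d` the coordinates of one document.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Z[w][z] = p(w | z): each column is one topic's distribution over the vocabulary.
Z = [
    [0.5, 0.0],
    [0.4, 0.1],
    [0.1, 0.3],
    [0.0, 0.6],
]
p_z_given_d = [0.25, 0.75]               # topic mixture of one document

p_w_given_d = mat_vec(Z, p_z_given_d)    # word distribution of that document
```

Because every column of `Z` and the mixture vector each sum to one, the result is again a probability distribution over words.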
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
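The objective itself is missing from the transcript. In the usual formulation, with n(d, w) denoting the number of times word w occurs in document d (a symbol assumed here, not shown on the slide), it reads:

```latex
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w)
            = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]
```

The inner sum over z sits inside the logarithm, which is why no closed-form maximizer exists.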
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
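The update formulas for the two steps are missing from the transcript. The sketch below implements the standard pLSA updates under that caveat: the E step computes p(z|d, w) ∝ p(w|z) p(z|d), and the M step renormalizes the expected counts n(d, w) p(z|d, w). The count matrix, random initialization and function names are illustrative assumptions.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum over (d, w) of n(d, w) * log sum_z p(w|z) p(z|d)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                ll += n * math.log(sum(pw[w] * pz for pw, pz in zip(p_w_z, p_z_d[d])))
    return ll


def plsa(counts, n_topics, n_iter=100, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood trace."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    trace = []
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E step: posterior over topics for this (document, word) pair.
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)])
                # Accumulate expected counts for the M step.
                for z in range(n_topics):
                    new_w_z[z][w] += counts[d][w] * post[z]
                    new_z_d[d][z] += counts[d][w] * post[z]
        # M step: renormalize expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        trace.append(log_likelihood(counts, p_w_z, p_z_d))
    return p_w_z, p_z_d, trace


# Hypothetical corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_w_z, p_z_d, trace = plsa(counts, n_topics=2)
```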
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – U matrix is related to P(z|d) (doc to aspect)
  – V matrix is related to P(w|z) (aspect to term)
  – Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious problems with over-fitting
[Figure: pLSA unrolled over m documents d1, d2, …, dm, each with its own topics z1, z2, … and words w1, w2, …]
p(z|d1), p(z|d2), …, p(z|dm)
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
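Dirichlet Distribution
• Definition
$$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σ_{i=1}^{K} αi: α0 = 0.1, α0 = 1, α0 = 10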
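The effect of the parameters can be explored by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma draws, a standard construction; the sample sizes and the "average largest component" summary are choices made for this illustration, not something shown on the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def empirical_mean(alpha, n_samples, rng):
    """Component-wise average of n_samples draws."""
    sums = [0.0] * len(alpha)
    for _ in range(n_samples):
        for i, x in enumerate(sample_dirichlet(alpha, rng)):
            sums[i] += x
    return [s / n_samples for s in sums]


def mean_largest_component(alpha, n_samples, rng):
    """Average of max_i x_i; values near 1 mean samples sit near the simplex corners."""
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_samples)) / n_samples


rng = random.Random(0)
one_sample = sample_dirichlet([6, 2, 2], rng)
mean_622 = empirical_mean([6, 2, 2], 5000, rng)               # close to alpha / alpha_0
peaked = mean_largest_component([0.1 / 3] * 3, 500, rng)      # equal alpha_i, alpha_0 = 0.1
spread = mean_largest_component([10.0 / 3] * 3, 500, rng)     # equal alpha_i, alpha_0 = 10
```

With equal αi, a small α0 pushes samples toward the corners of the simplex (one dominant topic), while a large α0 concentrates them around the uniform mixture.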
The LDA Model
[Figure: LDA graphical model unrolled for three documents, each with topics z1 … z4 and words w1 … w4, sharing the parameter β]
• For each document
  – Choose θ ~ Dirichlet(α)
  – For each of the N words wn
    · Choose a topic zn ~ Multinomial(θ)
    · Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
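The generative process can be simulated directly. The following sketch assumes two topics over a four-word vocabulary with made-up parameters; `alpha`, `beta` and the helper names are illustrative.

```python
import random


def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, topics, words) for one document following the LDA process."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]           # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)      # w_n ~ p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words


alpha = [0.5, 0.5]                                # hypothetical Dirichlet parameter
beta = [[0.5, 0.5, 0.0, 0.0],                     # topic 0 only emits words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]                     # topic 1 only emits words 2 and 3
rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
```

Unlike pLSA, the per-document mixture `theta` is itself drawn from a distribution, so the model can generate documents that were not in the training set.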
Joint Probability
• Given the parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
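The equations on these two slides were lost. The corresponding expressions in Blei, Ng and Jordan (2003), which the talk lists as its reference for LDA, are reproduced below as a sketch of what the slides most likely showed, for a document of N words and a corpus D of M documents.

```latex
\text{Joint probability:}\quad
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

\text{Marginal distribution of a document:}\quad
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big)\, d\theta

\text{Likelihood over all the documents:}\quad
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```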
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-Step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
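The relation between the bound and the KL divergence can be written out as follows. The notation (variational parameters γ and φ, factorized distribution q) follows Blei, Ng and Jordan (2003); the formulas themselves are missing from the transcript, so this is a reconstruction.

```latex
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

Since the left-hand side does not depend on γ or φ, raising the bound L necessarily lowers the KL term.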
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  – Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  – Repeat until convergence
    · E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    · M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
• M-Step: parameter estimation
  – β has a closed-form update
  – α can be estimated using the Newton-Raphson method
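The update equations are missing from the transcript. The ones given in Blei, Ng and Jordan (2003) are listed below as a sketch; Ψ is the digamma function, w_n is the n-th word of a document, and w_{dn}^{j} is 1 when the n-th word of document d is vocabulary item j and 0 otherwise.

```latex
\text{E-step (per document, iterate until convergence):}
\qquad \phi_{ni} \propto \beta_{i w_n} \exp\Big\{ \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^{K} \gamma_j\Big) \Big\},
\qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

\text{M-step:}
\qquad \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
```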
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents, each with topics z1 … z4 and words w1 … w4, sharing the parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

Term vector (Dim = 1 million)
        Term1   Term2   Term3   Term4   …   TermN
Img1    1       2       0       0       …   2

Topic projection (Dim = 200)
        f1      f2      …   fM
Img1    0.2     0.1     …   0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = low-dimensional histogram + a few words (10 words)]
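Orthogonal Decomposition
p = Xw + ε
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• X = (X1 X2 X3 … Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = low-dimensional histogram + a few words (10 words)]
$$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$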
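The decomposition and the resulting similarity can be sketched in a few lines. The example assumes that the base vectors are orthonormal and that the "few words" kept from the residual are its largest-magnitude entries; the slides state neither explicitly, and the vectors below are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decompose(p, basis, n_keep):
    """Split p into coordinates w on an orthonormal basis plus a sparse residual.

    basis is a list of k orthonormal vectors (the columns of X). Only the n_keep
    largest-magnitude residual entries are stored, as {word_index: value}.
    """
    w = [dot(x, p) for x in basis]                                   # w = X^T p
    recon = [sum(wi * x[i] for wi, x in zip(w, basis)) for i in range(len(p))]
    residual = [pi - ri for pi, ri in zip(p, recon)]                 # eps = p - X w
    top = sorted(range(len(p)), key=lambda i: abs(residual[i]), reverse=True)[:n_keep]
    return w, {i: residual[i] for i in top}


def similarity(code_p, code_q):
    """Approximate p^T q by w_p^T w_q plus the overlap of the stored residual entries."""
    (wp, ep), (wq, eq) = code_p, code_q
    return dot(wp, wq) + sum(v * eq[i] for i, v in ep.items() if i in eq)


basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]   # two orthonormal base vectors
p = [1.0, 2.0, 0.0, 3.0]
q = [2.0, 0.0, 1.0, 1.0]

exact = dot(p, q)
full = similarity(decompose(p, basis, 4), decompose(q, basis, 4))       # whole residual kept
truncated = similarity(decompose(p, basis, 2), decompose(q, basis, 2))  # 2 entries kept each
```

Keeping the whole residual reproduces the exact inner product; keeping only a few entries trades accuracy for a much smaller index entry per document.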
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline: a query (low-dimensional histogram + a few words) → LSH index (entries DS1, DS2, …) → candidate documents (Doc 300, Doc 401, …) → re-ranking with doc meta]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
What Is Machine Learning
Databull Let x = (x1 x2 xD) T denote a data point and D = x(1) x(2) x(N) a
data set D is sometimes associated with desired outputs y1 y2
Predictionsbull We are generally interested in predicting something based on the observed
data setbull Given D what can we say about x(N+1)
Modelbull To make predictions we need to make some assumptions We can often
express these assumptions in the form of a model with some parameters θbull Given data D we learn the model parameters from which we can predict
new data pointsbull The model can often be expressed as a probability distribution over data
points
Likelihood Function
bull Given a set of parameter values probability density function (PDF) will show that some data are more probable than other data
bull Inversely given the observed data and a model of interest Likelihood function is defined as
L(θ) = fθ(x|θ) = p(x|θ)
bull That is likelihood function L(θ) will show that some parameters are more likely to have produced the data
Maximum Likelihood (ML)
bull Maximum likelihood will find the best model parameters that make the data ldquomost likelyrdquo generated from this model
bull Suppose we are given n data samples (x1 x2 hellip xn)
bull Maximum likelihood will find θ that maximize L(θ)
bull Predictive distribution
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting
[Figure: unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
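• Definition
  $p(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K-1)-dimensional simplex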
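Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities for α0 = 0.1, α0 = 1, α0 = 10]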
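The effect of the total concentration α0 can be reproduced numerically. The sketch below draws Dirichlet samples by normalising Gamma draws from the standard library; the helper `sample_dirichlet` and the summary statistic `mean_largest` are names introduced here, and the equal split αi = α0/3 is an assumption about how the example panels were parameterised.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
mean_largest = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = [alpha0 / 3.0] * 3  # equal alpha_i that sum to alpha0
    samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]
    # Average size of the largest component: close to 1 means samples sit near a corner.
    mean_largest[alpha0] = sum(max(s) for s in samples) / len(samples)

print(mean_largest)
```

Small α0 pushes samples towards the corners of the simplex (one topic dominates), while large α0 concentrates them around the centre.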
The LDA Model
[Figure: LDA graphical model unrolled for three documents (z1 … z4 and w1 … w4 each) with a shared node β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
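The generative process reads almost directly as code. Below is a minimal sketch that samples one document; the vocabulary, the values of `alpha` and `beta`, and the function names are hypothetical and chosen only to make the steps concrete.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, vocabulary, num_words, rng):
    """Sample one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topics, words = [], []
    for _ in range(num_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # z_n ~ Multinomial(theta)
        w = rng.choices(vocabulary, weights=beta[z])[0]       # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical two-topic model over a six-word vocabulary.
vocabulary = ["goal", "match", "team", "vote", "party", "law"]
alpha = [0.5, 0.5]
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0 only emits the first three words
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # topic 1 only emits the last three words
]
theta, topics, words = generate_document(alpha, beta, vocabulary, 8, random.Random(1))
print(theta, topics, words)
```

Unlike pLSA, θ is itself drawn from a distribution, so a new document does not require a new set of free parameters.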
Joint Probability
• Given parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
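The three expressions named on this slide are not preserved in the transcript. As a sketch following Blei, Ng and Jordan (2003), with N words per document and M documents in the corpus D, they are:

```latex
% Joint probability of topic proportions, topics and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big) d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big) d\theta_d
```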
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
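The identity behind these two bullets is not preserved in the transcript. In the notation of Blei, Ng and Jordan (2003), where the factorised variational distribution q has a Dirichlet parameter γ and multinomial parameters φ, it reads:

```latex
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

The left-hand side does not depend on γ or φ, so raising the bound L necessarily lowers the KL term by the same amount.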
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), chosen by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence
• M-step: parameter estimation
  - β
  - α: can be implemented using the Newton-Raphson method
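The E-step updates themselves are missing from the transcript. The sketch below follows the per-document coordinate updates given by Blei, Ng and Jordan (2003): φ_ni ∝ β_{i,w_n} exp(ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni. The function names, the toy `beta`, and the fixed iteration count (instead of a convergence test) are assumptions made here; `digamma` is a small stand-in for a library routine.

```python
import math


def digamma(x):
    """Digamma function for x > 0: upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def variational_e_step(word_ids, alpha, beta, num_iters=50):
    """Return variational parameters (gamma, phi) for one document."""
    k, n = len(alpha), len(word_ids)
    phi = [[1.0 / k] * k for _ in word_ids]
    gamma = [a + n / k for a in alpha]
    for _ in range(num_iters):
        for pos, w in enumerate(word_ids):
            # phi_ni is proportional to beta[i][w_n] * exp(digamma(gamma_i)).
            unnorm = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(k)]
            total = sum(unnorm)
            phi[pos] = [u / total for u in unnorm]
        # gamma_i = alpha_i + expected number of words assigned to topic i.
        gamma = [alpha[i] + sum(row[i] for row in phi) for i in range(k)]
    return gamma, phi


# Hypothetical model: 2 topics over a 4-word vocabulary, one 5-word document.
alpha = [0.5, 0.5]
beta = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]]
gamma, phi = variational_e_step([3, 3, 3, 2, 0], alpha, beta)
print(gamma)
```

Since each row of `phi` sums to one, the entries of `gamma` always add up to the sum of `alpha` plus the document length.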
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents (z1 … z4 and w1 … w4 each) with a shared node β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model in plate notation with nodes π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2
  Topic projection
  Topic space (Dim = 200):
        f1   f2   …  fM
  Img1  0.2  0.1  …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
  p = Xw + ε
[Figure: an image = [bar chart] + a few words (10 words)]
Orthogonal Decomposition
• p = Xw + ε, with base vectors X = (x1, x2, x3, …, xk), low dimensional representation w, and residual ε
[Figure: an image = [bar chart] + a few words (10 words)]
• Inner product of two documents p and q: $p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
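The point of keeping the residual can be illustrated numerically. The sketch below assumes the base vectors are orthonormal and that the residual is whatever remains after projecting onto them (an assumption suggested by the slide title, not confirmed by the transcript); the vectors and the helper `decompose` are made up for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical orthonormal base vectors x1, x2 in a 3-dimensional "vocabulary" space.
basis = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)],
]


def decompose(p, basis):
    """Split p into coordinates w on the basis and a residual orthogonal to the basis."""
    w = [dot(x, p) for x in basis]
    projection = [sum(w_j * x[i] for w_j, x in zip(w, basis)) for i in range(len(p))]
    residual = [p_i - r_i for p_i, r_i in zip(p, projection)]
    return w, residual


p = [1.0, 2.0, 0.0]
q = [0.5, 0.0, 3.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)
low_dim_only = dot(w_p, w_q)
with_residual = dot(w_p, w_q) + dot(eps_p, eps_q)
print(exact, low_dim_only, with_residual)
```

In this toy case the low dimensional part alone overestimates the similarity, and adding the residual term recovers the exact inner product.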
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  - a topic specific distribution
  - a document specific distribution
  - a background distribution
  $p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a bar chart + a few words) is looked up in the LSH index (entries DS1, DS2, …); the candidate documents (e.g. Doc 300, Doc 401, …) are then re-ranked using the document metadata (Doc Meta) of Doc 1 … Doc N]
• Index: 10M images, 46GB
• Search speed < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
  - Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Find the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
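For readers unfamiliar with the column headers, the sketch below shows the standard per-query quantities behind MRR (mean reciprocal rank), MAP (mean average precision) and precision at k, assuming "P10" denotes precision at rank 10. The ranked list and relevance set are made up; this is not the evaluation code behind the table.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant items appear."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical query: users ranked by an expert-finding method, two true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}
print(reciprocal_rank(ranked, relevant))
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

MRR and MAP are obtained by averaging `reciprocal_rank` and `average_precision` over all queries.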
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to practice matrix methods
  - Graphical model: a good way to practice probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data
• Conversely, given the observed data and a model of interest, the likelihood function is defined as
  L(θ) = f_θ(x|θ) = p(x|θ)
• That is, the likelihood function L(θ) shows that some parameter values are more likely than others to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters under which the data are "most likely" to have been generated from this model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ)
• Predictive distribution
IID - Independent Identically Distributed
• IID means
• The problem is considerably simplified as
• Usually the log likelihood is used
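Under the IID assumption the likelihood factorizes over the samples, so the log likelihood becomes a sum of per-sample terms, log L(θ) = Σ_i log p(xi|θ). A minimal sketch, assuming a one-dimensional Gaussian model (the model choice, the data, and the function name are illustrative and not taken from the slides):

```python
import math


def gaussian_log_likelihood(samples, mean, variance):
    """Sum of log N(x_i | mean, variance) over i.i.d. samples."""
    n = len(samples)
    squared_error = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2.0 * math.pi * variance) - squared_error / (2.0 * variance)


samples = [2.1, 1.9, 2.4, 2.0, 1.6]

# Closed-form maximum likelihood estimates for a Gaussian.
mle_mean = sum(samples) / len(samples)
mle_variance = sum((x - mle_mean) ** 2 for x in samples) / len(samples)

best = gaussian_log_likelihood(samples, mle_mean, mle_variance)
print(mle_mean, mle_variance, best)
```

Any other choice of mean or variance gives a lower log likelihood on the same data, which is what "maximum likelihood" means here.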
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  • To describe complex models: Gaussian Mixture Model
  • To discover the intrinsic structure inside a data set: topic models such as pLSA, LDA
More General
• Data set
• Likelihood
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  - E step: Fill in values of the latent variables according to the posterior given the data
  - M step: Maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
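The two steps can be made concrete on the Gaussian mixture model mentioned earlier as an example of a latent variable model. This is a minimal sketch for a two-component, one-dimensional mixture; the function names, the synthetic data, the starting values, and the small variance floor are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, num_iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    weights = [0.5, 0.5]
    means = [min(data), max(data)]  # arbitrary starting values
    variances = [1.0, 1.0]
    log_likelihoods = []
    for _ in range(num_iters):
        # E step: posterior responsibility of each component for each point.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        log_likelihoods.append(log_lik)
        # M step: weighted maximum likelihood, as if the assignments were observed.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk + 1e-6
    return weights, means, variances, log_likelihoods


# Synthetic data from two well-separated Gaussians (hypothetical example).
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
weights, means, variances, log_likelihoods = em_gmm_1d(data)
print(weights, means, variances)
```

Each M step has a closed form here, which is the sense in which EM replaces one hard optimization by a series of tractable ones.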
Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {yn}; latent variables X = {xn}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
• where H[q] is the entropy of q(X)
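The bound itself is missing from the transcript. A standard way to write it, consistent with the symbols F, L and H[q] used on the surrounding slides, is:

```latex
L(\theta) = \log p(D \mid \theta)
          = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX
          \;\ge\; \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX
          = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
          \;=:\; F(q, \theta)
```

The inequality is Jensen's inequality applied to the concave logarithm.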
The E and M Steps of EM
• The lower bound on the log likelihood is given by
• EM alternates between
  - E step: optimize wrt the distribution over hidden variables, holding the parameters fixed
  - M step: maximize wrt the parameters, holding the hidden distribution fixed
The E Step
• E step: for fixed θ, F(q, θ) = L(θ) - KL(q(X) || p(X|D, θ))
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when q(X) = p(X|D, θ)
• So the E step simply sets q(X) to the posterior p(X|D, θ)
The M Step
• M step: maximize wrt the parameters, holding the hidden distribution q fixed
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) wrt θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
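The relation between F and L can be verified on a tiny discrete model. The sketch below uses one binary latent variable x and a single fixed observation; the probabilities and the function name `free_energy` are made up for illustration.

```python
import math

# Hypothetical model: p(x) = [0.3, 0.7] and p(y_observed | x) = [0.9, 0.2].
joint = [0.3 * 0.9, 0.7 * 0.2]  # p(x, y_observed) for x = 0, 1


def free_energy(q, joint):
    """F(q) = sum_x q(x) log(p(x, y) / q(x)), the lower bound on log p(y)."""
    return sum(q_x * math.log(p_xy / q_x) for q_x, p_xy in zip(q, joint) if q_x > 0)


log_likelihood = math.log(sum(joint))
posterior = [p_xy / sum(joint) for p_xy in joint]

print(free_energy([0.5, 0.5], joint), free_energy(posterior, joint), log_likelihood)
```

An arbitrary q gives a value strictly below the log likelihood, while setting q to the posterior closes the gap, which is what the E step does.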
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 Slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
  p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
  $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{pa(i)})$
• where pa(i) are the parents of node i
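The five-node factorization can be spelled out for binary variables. In the sketch below every conditional probability table is hypothetical; only the graph structure (which variable conditions on which) follows the example above.

```python
from itertools import product

# Hypothetical tables: each entry is the probability that the variable equals 1.
p_a = 0.6
p_b = 0.3
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(C=1 | A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}  # P(D=1 | B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # P(E=1 | C, D)


def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (
        bernoulli(p_a, a)
        * bernoulli(p_b, b)
        * bernoulli(p_c[(a, b)], c)
        * bernoulli(p_d[(b, c)], d)
        * bernoulli(p_e[(c, d)], e)
    )


total = sum(joint(*assignment) for assignment in product((0, 1), repeat=5))
num_parameters = 1 + 1 + len(p_c) + len(p_d) + len(p_e)
print(total, num_parameters)
```

The factorized model needs 14 numbers, whereas a full joint table over five binary variables needs 31, which illustrates why decoupling into conditionals is easier.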
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the documents share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition - Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters under which the data are "most likely" to have been generated by the model
• Suppose we are given n data samples (x1, x2, …, xn)
• Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta)$
• Predictive distribution: $p(x \mid \theta_{ML})$, i.e. new data are predicted with the fitted parameters
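IID - Independent Identically Distributed
• IID means that the samples are drawn independently from the same distribution: $p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• The problem is considerably simplified, as $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
• Usually the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$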
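As a concrete illustration of the IID log likelihood, here is a minimal sketch that assumes a one-dimensional Gaussian model (the slides do not fix a particular model). The function names are hypothetical. For this model the maximizer has a closed form: the sample mean and the 1/n sample standard deviation.

```python
import numpy as np


def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma^2) over IID samples x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
            - np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))


def gaussian_ml_estimate(x):
    """Closed-form ML estimate of (mu, sigma) for a Gaussian."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = np.sqrt(np.mean((x - mu) ** 2))  # ML uses 1/n, not 1/(n-1)
    return mu, sigma
```

Evaluating `gaussian_log_likelihood` at the ML estimate and at any other parameter pair shows that the estimate is never worse.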
Reference
• Zoubin Ghahramani. Machine Learning (4F13), Cambridge, 2006 (Introduction to Machine Learning, Lectures 1-2, slides)
• Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models
• Why do we need latent variables?
  - To describe complex models, e.g. the Gaussian Mixture Model
  - To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
• Data set: D
• Likelihood: $p(D \mid \theta)$
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary parameter values and iterates two steps:
  - E step: fill in the values of the latent variables according to their posterior given the data
  - M step: maximize the likelihood as if the latent variables were not hidden
• It decomposes a difficult problem into a series of tractable steps
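The two steps can be made concrete with a small sketch. It assumes a two-component one-dimensional Gaussian mixture, which is one of the latent variable models the slides mention; the function name, the initialization and the variance floor are illustrative choices, not part of the original deck.

```python
import numpy as np


def em_gmm_1d(x, n_iter=50, seed=0):
    """EM for a two-component 1-D Gaussian mixture.

    Returns mixing weights, means, variances and the log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    pi = np.array([0.5, 0.5])                    # mixing weights
    mu = rng.choice(x, size=2, replace=False)    # arbitrary starting means
    var = np.array([x.var(), x.var()])
    trace = []
    for _ in range(n_iter):
        # Joint density p(x_i, z_i = k | theta), shape (n, 2)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
        trace.append(float(np.log(dens.sum(axis=1)).sum()))
        # E step: posterior of the latent component for each sample
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted ML estimates, as if the components were observed
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-6)              # guard against a collapsed component
    return pi, mu, var, trace
```

The returned `trace` holds the log likelihood before each update, so it can be used to watch the likelihood rise from iteration to iteration.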
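Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) with respect to θ: $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$ (in the EM slides, L denotes the log likelihood)
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality: $L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$
• $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$, where H[q] is the entropy of q(X)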
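The E and M Steps of EM
• The lower bound on the log likelihood is given by $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  - E step: optimize F with respect to the distribution over hidden variables, holding the parameters fixed: $q^{(t+1)} = \arg\max_q F(q, \theta^{(t)})$
  - M step: maximize F with respect to the parameters, holding the hidden distribution fixed: $\theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta)$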
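The E Step
• E step: for fixed θ, $F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL(q(X) \,\|\, p(X \mid D, \theta))$, where $L(\theta) = \log p(D \mid \theta)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $KL(q(X) \,\|\, p(X \mid D, \theta)) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$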
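The relation between the bound F, the log likelihood L and the KL divergence can be checked numerically. The sketch below assumes a discrete latent variable with three states and a made-up joint table p(X = k, D | θ); all numbers and names are hypothetical.

```python
import numpy as np


def lower_bound(q, joint):
    """F(q, theta) = sum_X q(X) log[p(X, D | theta) / q(X)] for a discrete latent X."""
    q = np.asarray(q, dtype=float)
    joint = np.asarray(joint, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(joint[mask] / q[mask])))


joint = np.array([0.10, 0.25, 0.05])   # hypothetical p(X = k, D | theta), k = 0, 1, 2
log_lik = np.log(joint.sum())          # L(theta) = log p(D | theta)
posterior = joint / joint.sum()        # p(X | D, theta)
uniform = np.full(3, 1.0 / 3.0)        # an arbitrary choice of q(X)

# KL(q || posterior) for the uniform q
kl = float(np.sum(uniform * np.log(uniform / posterior)))
```

With q set to the posterior the bound is tight; with the uniform q the gap equals the KL divergence.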
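The M Step
• M step: maximize F with respect to the parameters, holding the hidden distribution q fixed: $\theta^{(t+1)} = \arg\max_\theta F(q, \theta) = \arg\max_\theta \int q(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically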
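EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
  - The E step brings F(q, θ) up to the log likelihood L(θ) (it pushes the bound to the top)
  - The M step maximizes F(q, θ) with respect to θ (it lifts the bound)
  - F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently by the non-negativity of the KL divergence
• Hence $L(\theta^{(t)}) = F(q^{(t+1)}, \theta^{(t)}) \le F(q^{(t+1)}, \theta^{(t+1)}) \le L(\theta^{(t+1)})$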
Reference
• Zoubin Ghahramani. Machine Learning (4F13), Cambridge, 2006 (Unsupervised learning, Lecture 5, slides)
• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model looks complex even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probabilities are hard to compute
  - Graphical models help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model we can decompose the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution:
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general: $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
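To see the factorization at work, the following sketch builds the joint distribution of a three-node fragment (A and B are parents of C) from its conditional probability tables. The tables are invented for illustration and all variables are assumed binary.

```python
import itertools

# Hypothetical conditional probability tables for binary variables A, B, C
p_a = {0: 0.7, 1: 0.3}
p_b = {0: 0.6, 1: 0.4}
p_c1_given_ab = {  # p(C = 1 | A, B)
    (0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9,
}


def joint(a, b, c):
    """p(A, B, C) = p(A) p(B) p(C | A, B)."""
    p_c1 = p_c1_given_ab[(a, b)]
    return p_a[a] * p_b[b] * (p_c1 if c == 1 else 1.0 - p_c1)


# The factorized joint sums to one over all configurations
total = sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
# Marginal p(C = 1), obtained by summing out the parents
p_c1_marginal = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
```

Only 1 + 1 + 4 numbers are stored here, instead of the 7 free parameters of an unrestricted joint table over three binary variables.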
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Terms are ambiguous
  - The same query varies with personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they have no words in common, provided they share frequently co-occurring terms
• Disadvantages
  - A statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to handle
  - Polysemy
    • "Java" could mean "coffee" and also the programming language Java
    • "Cricket" is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc" and "desktop" could all mean the same
• Has a better statistical foundation than LSA
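pLSA
[Figure: plate notation of pLSA with nodes d, z, w and plates N_d and M, next to the unrolled model for one document d with topics z1, z2, z3, …, zN and words w1, w2, w3, …, wN]
• z1, …, zN are variables with z_i ∈ [1, K], where K is the number of latent topics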
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, each with its own topics z1, z2, … and words w1, w2, …]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability: $p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find the latent topics as a common basis
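The matrix view of the document decomposition can be written out directly. The sketch below uses a made-up vocabulary of V = 4 words, K = 2 topics and M = 3 documents; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical topic-word matrix Z (V = 4, K = 2); column k is p(w | z = k)
Z = np.array([[0.5, 0.0],
              [0.3, 0.1],
              [0.2, 0.3],
              [0.0, 0.6]])

# Hypothetical topic proportions p(z | d) for M = 3 documents, one column per document
P_z_d = np.array([[1.0, 0.5, 0.2],
                  [0.0, 0.5, 0.8]])

# Column j of the product is the word distribution p(w | d_j)
P_w_d = Z @ P_z_d
```

Because every column of `Z` and of `P_z_d` sums to one, every column of `P_w_d` is again a valid distribution over words, which is the analogue of A = BC with B as the shared basis.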
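pLSA - Objective Function
• pLSA tries to maximize the log likelihood $\sum_{d} \sum_{w} n(d, w) \log p(d, w)$, where n(d, w) is the number of times word w occurs in document d and $p(d, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)$
• Because of the summation over z inside the log, we have to resort to EM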
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
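EM Steps
• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$, where n(d, w) is the count of word w in document d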
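A compact sketch of these two steps is given below. It assumes the parameterization with p(w|z) and p(z|d), a dense count matrix n(d, w) in which every document and every word has at least one count, and random initialization; the function name and defaults are hypothetical, and the uniform p(d) is dropped from the log likelihood because it is a constant.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. counts[d, w] = n(d, w).

    Returns p(w | z) with shape (K, V), p(z | d) with shape (M, K),
    and the log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    observed = counts > 0
    trace = []
    for _ in range(n_iter):
        # p(w | d) = sum_z p(z | d) p(w | z), shape (M, V)
        p_w_d = p_z_d @ p_w_z
        trace.append(float(np.sum(counts[observed] * np.log(p_w_d[observed]))))
        # E step: posterior p(z | d, w), shape (M, K, V)
        post = p_z_d[:, :, None] * p_w_z[None, :, :] / p_w_d[:, None, :]
        # M step: renormalize the expected counts n(d, w) p(z | d, w)
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d, trace
```

The dense (M, K, V) posterior array keeps the code short but is only practical for small corpora; a real implementation would iterate over the non-zero counts.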
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - The U matrix is related to P(z|d) (document to aspect)
  - The V matrix is related to P(w|z) (aspect to term)
  - The Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA defines a generative model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• In pLSA, statistical model selection can determine the optimal K
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions of each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one, i.e. the samples are multinomials
  - It is easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
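Dirichlet Distribution
• Definition: $p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, subject to $\sum_{i=1}^{k} x_i = 1$ and $x_i > 0$
• The density is zero outside this open (K - 1)-dimensional simplex
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)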
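A quick way to get a feel for the Dirichlet distribution is to sample from it. The sketch below uses NumPy's `Generator.dirichlet` with the parameter (6, 2, 2) from the list above; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([6.0, 2.0, 2.0])
samples = rng.dirichlet(alpha, size=1000)   # shape (1000, 3), each row lies on the simplex

empirical_mean = samples.mean(axis=0)
expected_mean = alpha / alpha.sum()         # E[x_i] = alpha_i / sum_j alpha_j
```

Every row of `samples` is a non-negative 3-tuple that sums to one, so it can serve directly as a topic mixture proportion, and the empirical mean approaches `expected_mean` as the number of samples grows.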
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α0 = 0.1, α0 = 1, α0 = 10
The LDA Model
[Figure: graphical model of LDA with per-document topics z1, …, z4, words w1, …, w4 and the shared topic-word parameter β]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words w_n
    • Choose a topic z_n ~ Multinomial(θ)
    • Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
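The generative process above can be sketched in a few lines. The function name, the argument layout (β stored as a K×V matrix whose row k is the word distribution of topic k) and the example values are assumptions made for illustration.

```python
import numpy as np


def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha: (K,) Dirichlet parameter.
    beta:  (K, V) matrix, beta[k] = p(w | z = k).
    """
    theta = rng.dirichlet(alpha)                              # topic proportions of this document
    topics = rng.choice(len(alpha), size=n_words, p=theta)    # z_n ~ Multinomial(theta)
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, topics, words


# Hypothetical example: two topics over a four-word vocabulary
example_alpha = np.array([0.5, 0.5])
example_beta = np.array([[0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.5, 0.5]])
theta, topics, words = generate_document(example_alpha, example_beta, 20, np.random.default_rng(0))
```

In this toy setup topic 0 only emits words 0 and 1 and topic 1 only emits words 2 and 3, so the sampled words reveal which topic generated them.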
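Joint Probability
• Given the parameters α and β: $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(z_n = i \mid \theta) = \theta_i$
Likelihood
• Joint probability: $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$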
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality is applied as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), so that KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, VBEM approximates p(X|D, θ) by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound with respect to q
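Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    • M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence (Blei et al. 2003): $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$ and $\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$, where Ψ is the digamma function
• M-Step: parameter estimation
  - β has a closed-form update: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$
  - α can be estimated using the Newton-Raphson method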
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: graphical model of LDA with per-document topics z1, …, z4, words w1, …, w4 and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - We need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection of Img1 from a term vector (Term1 … TermN = 1, 2, 0, 0, …, 2; Dim = 1 million) to a topic vector (f1 … fM = 0.2, 0.1, …, 0.03; Dim = 200)]
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \epsilon = \sum_{i=1}^{k} x_i w_i + \epsilon$
• X = (x_1, x_2, x_3, …, x_k): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
• For two vectors p and q: $p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$
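The inner-product identity holds when the base vectors are orthonormal and each residual is orthogonal to them. The sketch below checks this numerically; it substitutes a random orthonormal basis from a QR factorization for the learned projection matrix, so it illustrates the algebra only and is not the indexing method of the paper. The sizes V and k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 50, 5                                         # hypothetical vocabulary size and reduced dimension
X, _ = np.linalg.qr(rng.standard_normal((V, k)))     # columns are orthonormal base vectors


def decompose(p, X):
    """Split p into a low-dimensional code w and a residual orthogonal to span(X)."""
    w = X.T @ p
    residual = p - X @ w
    return w, residual


p, q = rng.random(V), rng.random(V)
w_p, e_p = decompose(p, X)
w_q, e_q = decompose(q, X)

# Similarity in the original space versus low-dimensional part plus residual part
full_similarity = p @ q
split_similarity = w_p @ w_q + e_p @ e_q
```

Keeping the residual is what makes the reduced representation lossless for similarity: dropping `e_p @ e_q` would leave only the topic-level part of the inner product.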
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search. A query (a low-dimensional histogram + a few words) is looked up in the LSH index; the returned documents (Doc 300, Doc 401, …) are re-ranked using the per-document entries (DS1, DS2, …) and Doc Meta]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
IID ndash Independent Identically Distributed
bull IID means
bull The problem is considerably simplified as
bull Usually log likehood is used
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Introduction to Machine Learning Lectures 1-2 Slides)
bull Gregor Heinrich Parameter estimation for text analysis Technical Note 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
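Dirichlet Distribution
• Definition
\[
p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k)
  = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}
    \prod_{i=1}^{k} x_i^{\alpha_i - 1},
  \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0
\]
• The density is zero outside this open (K − 1)-dimensional simplex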
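The Dirichlet density is usually evaluated in log space to avoid overflow in the Gamma functions. The sketch below assumes the standard form Γ(Σαi)/ΠΓ(αi) · Π xi^(αi−1) on the open simplex; the function name and the tolerance used for the sum-to-one check are illustrative choices.

```python
import math

def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x of the open simplex."""
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        return float("-inf")  # the density is zero outside the simplex
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

# With alpha = (1, 1, 1) the density is uniform on the simplex and equals Gamma(3) = 2.
print(math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```

The uniform case is a convenient sanity check because the product term vanishes and only the normalizing constant remains.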
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = \sum_{i=1}^{k} \alpha_i: α0 = 0.1, α0 = 1, α0 = 10
The LDA Model
[Figure: LDA graphical model. In each document, topics z1 … z4 generate words w1 … w4; the parameter β is shared by all documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
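The generative process can be sketched in a few lines: draw θ from a Dirichlet with parameter α, then for each word draw a topic from θ and a word from that topic's word distribution. This is a toy illustration, not the authors' code; the helper names, the 2-topic and 4-word setup, and all numeric values are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, num_words, rng):
    """Generate one document: theta ~ Dirichlet(alpha), then (z_n, w_n) for each word."""
    theta = sample_dirichlet(alpha, rng)            # per-document topic proportions
    topics = list(range(len(alpha)))
    vocab = list(range(len(beta[0])))
    doc = []
    for _ in range(num_words):
        z = rng.choices(topics, weights=theta)[0]   # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]  # w_n ~ p(w | z_n, beta)
        doc.append((z, w))
    return theta, doc

# Toy setup (illustrative values): 2 topics over a 4-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1, 0.0],   # p(w | z = 0)
        [0.0, 0.1, 0.2, 0.7]]   # p(w | z = 1)
theta, doc = generate_document(alpha, beta, num_words=8, rng=random.Random(1))
print(theta, doc)
```

Unlike pLSA, θ is itself a random draw, so a new document needs no new parameters: only α and β are learned.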
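Joint Probability
• Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z and a set of N words w is
\[
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
\]
where p(zn|θ) is simply θi for the topic i selected by zn
Likelihood
• Joint probability
\[
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
\]
• Marginal distribution of a document
\[
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
\]
• Likelihood over all the documents
\[
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
\]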
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
• Jensen's inequality is applied as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
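Written out, the relationship between the bound and the KL divergence is the identity below. It follows the notation of Blei et al. (2003), which is assumed here: q(θ, z | γ, φ) is the variational distribution with parameters γ and φ.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\bigl( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \bigr),
\qquad
L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q\bigl[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\bigr]
  - \mathbb{E}_q\bigl[\log q(\theta, \mathbf{z})\bigr]
```

Since the left-hand side does not depend on γ or φ, raising L necessarily lowers the KL term by the same amount.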
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), so that KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeat until convergence
• M-step: parameter estimation
– β: closed-form update
– α: can be implemented using the Newton-Raphson method
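For one document, the variational E-step can be sketched as coordinate ascent. The updates used below are the ones given in Blei et al. (2003) and are assumed here: φ_ni ∝ β_{i,wn} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni. The `digamma` helper is a hand-written approximation, a fixed iteration count stands in for a convergence test, and all names and numbers are illustrative.

```python
import math

def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def e_step(words, alpha, beta, num_iters=50):
    """Coordinate-ascent updates of (gamma, phi) for a single document."""
    k, n = len(alpha), len(words)
    gamma = [a + n / k for a in alpha]           # standard initialization
    phi = [[1.0 / k] * k for _ in words]
    for _ in range(num_iters):
        for i, w in enumerate(words):
            weights = [beta[t][w] * math.exp(digamma(gamma[t])) for t in range(k)]
            norm = sum(weights)
            phi[i] = [v / norm for v in weights]  # phi_n is a distribution over topics
        gamma = [alpha[t] + sum(p[t] for p in phi) for t in range(k)]
    return gamma, phi

# Toy model (illustrative values): 2 topics, 4 words, a document dominated by word 0.
beta = [[0.6, 0.3, 0.05, 0.05], [0.05, 0.05, 0.3, 0.6]]
gamma, phi = e_step(words=[0, 0, 1, 0], alpha=[0.5, 0.5], beta=beta)
print(gamma)
```

Because every row of φ sums to one, the entries of γ always add up to Σα plus the document length; here most of that mass ends up on topic 0.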
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results. (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model, with topics z1 … z4 generating words w1 … w4 and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with variables π, z, x, θ and β, and plates M and Nd]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
– Term space (Dim = 1 million): Img1 = (Term1: 1, Term2: 2, Term3: 0, Term4: 0, …, TermN: 2)
– Topic projection
– Topic space (Dim = 200): Img1 = (f1: 0.2, f2: 0.1, …, fM: 0.03)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
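Orthogonal Decomposition
\[
p = Xw + \varepsilon:
\quad
\begin{pmatrix} p^1 \\ p^2 \\ \vdots \\ p^W \end{pmatrix}
=
\begin{pmatrix}
  x_1^1 & \cdots & x_k^1 \\
  x_1^2 & \cdots & x_k^2 \\
  \vdots &       & \vdots \\
  x_1^W & \cdots & x_k^W
\end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix}
+
\begin{pmatrix} \varepsilon^1 \\ \varepsilon^2 \\ \vdots \\ \varepsilon^W \end{pmatrix}
\]
• X = (X1, X2, X3, …, Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
\[
p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q
\]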
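A small numeric check shows why keeping the residual preserves similarity. The sketch assumes the base vectors are orthonormal and that the residual is what remains after projecting onto them; under those assumptions the inner product of two original vectors equals the inner product of their low-dimensional coordinates plus the inner product of their residuals. All names and numbers are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps."""
    w = [dot(p, x) for x in basis]                       # w = X^T p
    proj = [sum(wj * x[i] for wj, x in zip(w, basis)) for i in range(len(p))]
    eps = [pi - qi for pi, qi in zip(p, proj)]           # eps = p - Xw
    return w, eps

# Two orthonormal base vectors in a 4-word vocabulary (illustrative values).
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
p = [1.0, 2.0, 0.0, 3.0]
q = [2.0, 0.0, 1.0, 1.0]
wp, ep = decompose(p, basis)
wq, eq = decompose(q, basis)
print(dot(p, q), dot(wp, wq) + dot(ep, eq))  # the two numbers agree
```

In a real index only the largest residual entries (the "few words") would be kept, so the second number becomes an approximation of the first.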
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
\[
p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d)
            + p(x{=}1 \mid d)\, p'(w \mid d)
            + p(x{=}2 \mid d)\, p''(w)
\]
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
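The switch-variable mixture can be evaluated directly once the three component distributions are known. The sketch below assumes the three-way form from Chemudugunta et al.: a topic mixture for x = 0, a document-specific distribution for x = 1 and a background distribution for x = 2. The function name and every probability value are made up for illustration.

```python
def word_prob(w, switch, topic_word, doc_topic, doc_specific, background):
    """p(w|d) as a three-way mixture selected by the switch variable x."""
    topic_part = sum(topic_word[k][w] * doc_topic[k] for k in range(len(doc_topic)))
    return (switch[0] * topic_part          # x = 0: topic-specific distribution
            + switch[1] * doc_specific[w]   # x = 1: document-specific distribution
            + switch[2] * background[w])    # x = 2: background distribution

# Illustrative 3-word vocabulary with 2 topics.
topic_word = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]   # p(w | z = k)
doc_topic = [0.6, 0.4]                            # p(z = k | d)
doc_specific = [0.0, 1.0, 0.0]                    # p'(w | d)
background = [0.4, 0.4, 0.2]                      # p''(w)
switch = [0.7, 0.1, 0.2]                          # p(x | d)
probs = [word_prob(w, switch, topic_word, doc_topic, doc_specific, background)
         for w in range(3)]
print(probs, sum(probs))
```

The topic part corresponds to the low-dimensional representation, while the document-specific part plays the role of the residual words.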
Search (Online)
[Figure: online search pipeline. A query (low-dimensional topic vector + a few words) is looked up in an LSH index whose buckets hold entries DS1, DS2, …; the candidate documents (Doc 300, Doc 401, …) are then re-ranked using the document metadata (Doc Meta)]
• Index: 10M images, 46GB
• Search speed: < 100ms
Search Example
[Figures: query images and their search results]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: good practice for learning matrices
– Graphical model: good practice for learning probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Reference
• Zoubin Ghahramani. Machine Learning (4F13), Cambridge, 2006 (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich. Parameter Estimation for Text Analysis. Technical Note, 2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation–Maximization (EM) algorithm is a method for maximum likelihood (ML) learning of parameters in latent variable models
• Why do we need latent variables?
• To describe complex models, e.g. the Gaussian Mixture Model
• To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
• Data set: D = {y_1, …, y_N}
• Likelihood: p(D|θ) = ∫ p(D, X|θ) dX, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that
\[
\theta_{ML} = \arg\max_{\theta}\, p(D \mid \theta)
\]
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
• E step: fill in values of the latent variables according to the posterior given the data
• M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
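Jensen's Inequality
• For the concave log function, with α_i ≥ 0 and Σ_i α_i = 1
\[
\log\Bigl(\sum_i \alpha_i x_i\Bigr) \ge \sum_i \alpha_i \log x_i
\]
Lower Bounding the Log Likelihood
• Observed data D = {y_n}, latent variables X = {x_n}, parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ
\[
L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX
\]
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
\[
L(\theta) = \log \int q(X)\, \frac{p(X, D \mid \theta)}{q(X)}\, dX
  \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)
\]
\[
F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
\]
• where H[q] is the entropy of q(X)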
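The E and M Steps of EM
• The lower bound on the log likelihood is given by
\[
F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]
\]
• EM alternates between
• E step: optimize F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed
\[
q^{(k)}(X) = \arg\max_{q(X)} F\bigl(q(X), \theta^{(k-1)}\bigr)
\]
• M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution fixed
\[
\theta^{(k)} = \arg\max_{\theta} F\bigl(q^{(k)}(X), \theta\bigr)
\]
The E Step
• E step: for fixed θ
\[
F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX
  = L(\theta) - \mathrm{KL}\bigl(q(X) \,\|\, p(X \mid D, \theta)\bigr)
\]
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) || p(X|D, θ)) = 0
• So the E step simply sets q^{(k)}(X) = p(X|D, θ^{(k−1)})
The M Step
• M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution q fixed
\[
\theta^{(k)} = \arg\max_{\theta} F\bigl(q^{(k)}(X), \theta\bigr)
  = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX
\]
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically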
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
\[
L\bigl(\theta^{(k-1)}\bigr) = F\bigl(q^{(k)}, \theta^{(k-1)}\bigr)
  \le F\bigl(q^{(k)}, \theta^{(k)}\bigr) \le L\bigl(\theta^{(k)}\bigr)
\]
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) wrt θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
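The monotonicity claim can be observed on a small example. The sketch below runs EM on a mixture of two 1-D Gaussians with unit variance (a simplifying assumption so that the M step stays a two-line closed form) and records the log likelihood after every iteration; the data, the initial values and the function names are all illustrative.

```python
import math

def normal_pdf(y, mean):
    """Density of a unit-variance Gaussian."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, pi, mu):
    return sum(math.log(sum(pi[k] * normal_pdf(y, mu[k]) for k in range(2)))
               for y in data)

def em(data, pi, mu, num_iters=20):
    history = [log_likelihood(data, pi, mu)]
    for _ in range(num_iters):
        # E step: responsibilities q(x_n = k) = p(x_n = k | y_n, theta)
        resp = []
        for y in data:
            joint = [pi[k] * normal_pdf(y, mu[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: maximize the expected complete-data log likelihood
        nk = [sum(r[k] for r in resp) for k in range(2)]
        pi = [nk[k] / len(data) for k in range(2)]
        mu = [sum(r[k] * y for r, y in zip(resp, data)) / nk[k] for k in range(2)]
        history.append(log_likelihood(data, pi, mu))
    return pi, mu, history

data = [-2.1, -1.9, -2.4, 1.8, 2.2, 2.0, 2.5]
pi, mu, history = em(data, pi=[0.5, 0.5], mu=[-1.0, 1.0])
print(mu, history[0], history[-1])
```

The recorded log likelihoods form a non-decreasing sequence, and the two means move toward the two clusters in the data.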
Reference
• Zoubin Ghahramani. Machine Learning (4F13), Cambridge, 2006 (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model is complex, even with just a few circles…
– We have to make too many assumptions
• Pros
– We do need probability to explain our world, but joint probability is hard to compute
– Graphical models can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier to compute
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
\[
p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p\bigl(X_i \mid X_{pa(i)}\bigr)
\]
• where pa(i) are the parents of node i
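The five-variable factorization can be turned into a tiny executable example. Assuming binary variables and made-up conditional probability tables, the sketch below builds the joint as a product of the five factors and checks that it sums to one over all 32 assignments.

```python
from itertools import product

# All probability values below are illustrative assumptions.
p_a = {0: 0.6, 1: 0.4}
p_b = {0: 0.7, 1: 0.3}

def p_c(c, a, b):   # p(C = c | A = a, B = b)
    p1 = 0.1 + 0.4 * a + 0.3 * b
    return p1 if c == 1 else 1.0 - p1

def p_d(d, b, c):   # p(D = d | B = b, C = c)
    p1 = 0.2 + 0.5 * b * c
    return p1 if d == 1 else 1.0 - p1

def p_e(e, c, d):   # p(E = e | C = c, D = d)
    p1 = 0.5 if c == d else 0.9
    return p1 if e == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return p_a[a] * p_b[b] * p_c(c, a, b) * p_d(d, b, c) * p_e(e, c, d)

total = sum(joint(*values) for values in product([0, 1], repeat=5))
print(total)
```

For binary variables the factored form needs 1 + 1 + 4 + 4 + 4 = 14 free parameters, compared with 2^5 − 1 = 31 for an unrestricted joint table.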
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies due to personal style
• Latent semantic indexing
– Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
  • "Java" could mean "coffee" and also the programming language "Java"
  • "Cricket" is a "game" and also an "insect"
– Synonymy
  • "computer", "pc" and "desktop" could all mean the same
• Has a better statistical foundation than LSA
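pLSA
[Figure: pLSA in plate notation (d → z → w, with plates Nd and M), and the same model unrolled for one document d with topics z1, z2, z3, …, zN and words w1, w2, w3, …, wN]
• z1 … zN are variables, zi ∈ [1, K], where K is the number of latent topics
pLSA
[Figure: pLSA unrolled over documents d1, d2, …, dM; document di has topics z1, z2, …, zNi and words w1, w2, …, wNi]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents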
Joint Probability vs Likelihood
• Joint probability
\[
p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)
\]
• Likelihood (only for observed variables)
\[
p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)
\]
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
\[
p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)
\]
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = Z_{V×k} p(z|d), where the columns of Z are the distributions p(w|z)
• With many documents, we hope to find latent topics as a common basis
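Read as linear algebra, the decomposition is a matrix-vector product: the topic distributions p(w|z) form the basis and p(z|d) holds the document's coordinates. A minimal sketch with made-up numbers follows; topics are stored as rows here purely for convenience.

```python
def doc_word_dist(topic_word, doc_topic):
    """p(w|d) = sum_z p(w|z) p(z|d), i.e. the product of Z with the vector p(z|d)."""
    vocab_size = len(topic_word[0])
    return [sum(topic_word[z][w] * doc_topic[z] for z in range(len(doc_topic)))
            for w in range(vocab_size)]

# Two topics over a 4-word vocabulary (illustrative values).
topic_word = [[0.5, 0.5, 0.0, 0.0],   # p(w | z = 0)
              [0.0, 0.0, 0.5, 0.5]]   # p(w | z = 1)
doc_topic = [0.8, 0.2]                # p(z | d)
p_w_d = doc_word_dist(topic_word, doc_topic)
print(p_w_d)
```

Because both factors are probability distributions, the result is again a distribution over the vocabulary, which is the main structural difference from an unconstrained factorization such as SVD.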
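pLSA – Objective Function
• pLSA tries to maximize the log likelihood
\[
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w)
  = \sum_{d} \sum_{w} n(d, w) \log \Bigl[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Bigr]
\]
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM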
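EM Steps
• E-step
– The expectation of the likelihood function is calculated with the current parameter values
• M-step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-step
\[
p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}
\]
• The M-step, where n(d, w) is the count of word w in document d
\[
p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w),
\qquad
p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)
\]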
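A compact sketch of pLSA training, assuming the standard updates: the E step computes the posterior of the topic for each (document, word) pair from p(w|z) p(z|d), and the M step re-normalizes the expected counts. The count matrix, the initial values and the fixed iteration count are illustrative; a real implementation would also monitor the log likelihood.

```python
def plsa_em(counts, p_w_z, p_z_d, num_iters=30):
    """EM for pLSA. counts[d][w] = n(d, w); p_w_z[z][w] = p(w|z); p_z_d[d][z] = p(z|d)."""
    num_docs, vocab, k = len(counts), len(counts[0]), len(p_w_z)
    for _ in range(num_iters):
        new_wz = [[0.0] * vocab for _ in range(k)]
        new_zd = [[0.0] * k for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(vocab):
                if counts[d][w] == 0:
                    continue
                # E step: posterior p(z | d, w) is joint[z] / total
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(k)]
                total = sum(joint)
                for z in range(k):
                    r = counts[d][w] * joint[z] / total
                    new_wz[z][w] += r   # expected count of word w under topic z
                    new_zd[d][z] += r   # expected count of topic z in document d
        # M step: normalize the expected counts into distributions
        p_w_z = [[v / sum(row) for v in row] for row in new_wz]
        p_z_d = [[v / sum(row) for v in row] for row in new_zd]
    return p_w_z, p_z_d

# Toy corpus: documents 0-1 mostly use words 0-1, documents 2-3 mostly use words 2-3.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
init_w_z = [[0.3, 0.3, 0.2, 0.2], [0.2, 0.2, 0.3, 0.3]]
init_z_d = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]
p_w_z, p_z_d = plsa_em(counts, init_w_z, init_z_d)
print(p_z_d)
```

Note that p(z|d) has one row per training document, which is exactly the per-document parameter growth that LDA later addresses.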
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
EXPECTATION MAXIMIZATION
Why We Need EM
• The Expectation-Maximization (EM) algorithm is a method for maximum likelihood (ML) learning of parameters in latent variable models
• Why do we need latent variables?
  • To describe complex models, e.g. the Gaussian Mixture Model
  • To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA
More General
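• Data set: $D = \{y_1, \dots, y_N\}$
• Likelihood: $p(D \mid \theta) = \int p(D, X \mid \theta)\, dX$, where X denotes the latent variables
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that $\theta_{ML} = \arg\max_{\theta} p(D \mid \theta)$
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize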
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of the likelihood of a latent variable model. It starts from arbitrary values of the parameters and iterates two steps:
  • E step: fill in values of the latent variables according to their posterior given the data
  • M step: maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
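The two steps can be sketched for the Gaussian Mixture Model mentioned earlier. This is a minimal illustration, not the deck's own code: the function names, the synthetic data, the starting values and the fixed iteration count are all assumptions made for the example.

```python
import math
import random

def gaussian_pdf(y, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_log_likelihood(data, weights, means, variances):
    """Log likelihood with the hidden component labels summed out."""
    return sum(
        math.log(sum(w * gaussian_pdf(y, m, v) for w, m, v in zip(weights, means, variances)))
        for y in data
    )

def em_gmm(data, weights, means, variances, n_iter=30):
    """EM for a 1-D Gaussian mixture; returns the parameters and the log-likelihood trace."""
    weights, means, variances = list(weights), list(means), list(variances)
    history = [gmm_log_likelihood(data, weights, means, variances)]
    for _ in range(n_iter):
        # E step: posterior over the hidden component of every data point
        resp = []
        for y in data:
            joint = [w * gaussian_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: weighted ML estimates, as if the soft labels were observed
        for k in range(len(weights)):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * y for r, y in zip(resp, data)) / nk
            variances[k] = sum(r[k] * (y - means[k]) ** 2 for r, y in zip(resp, data)) / nk
        history.append(gmm_log_likelihood(data, weights, means, variances))
    return weights, means, variances, history

# Hypothetical data: two well-separated clusters
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
fitted_weights, fitted_means, fitted_variances, history = em_gmm(
    data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
)
```

The `history` list records the log likelihood after every iteration, which makes it easy to check that the value never goes down.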
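Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
  $L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
  $L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$
• where H[q] is the entropy of q(X)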
The E and M Steps of EM
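• The lower bound on the log likelihood is given by
  $F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  • E step: optimize w.r.t. the distribution over hidden variables, holding the parameters fixed: $q^{(k+1)} = \arg\max_{q} F(q, \theta^{(k)})$
  • M step: maximize w.r.t. the parameters, holding the hidden distribution fixed: $\theta^{(k+1)} = \arg\max_{\theta} F(q^{(k+1)}, \theta)$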
The E Step
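• E step: for fixed θ
  $F(q, \theta) = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $q(X) = p(X \mid D, \theta)$
• So the E step simply sets $q^{(k+1)}(X) = p(X \mid D, \theta^{(k)})$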
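The relation between the bound, the log likelihood and the KL divergence can be checked numerically on a tiny discrete model. The numbers below are made up for illustration, and the helper names are assumptions; the only thing taken from the slides is the definition of the lower bound as the expected log joint plus the entropy of q.

```python
import math

def log_lik(joint):
    """L(theta) = log sum_x p(X = x, D | theta)."""
    return math.log(sum(joint))

def free_energy(q, joint):
    """F(q, theta) = sum_x q(x) log p(x, D | theta) + H[q]."""
    return sum(qx * math.log(px / qx) for qx, px in zip(q, joint) if qx > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Hypothetical values of p(X = x, D | theta) for x = 0, 1, 2 and one fixed data set D
joint = [0.10, 0.02, 0.08]
posterior = [j / sum(joint) for j in joint]   # p(X | D, theta)
q = [0.2, 0.5, 0.3]                           # an arbitrary distribution over X

gap = log_lik(joint) - free_energy(q, joint)  # equals KL(q || posterior)
```

Setting `q` to `posterior` closes the gap, which is exactly what the E step does.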
The M Step
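• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
  $\theta^{(k+1)} = \arg\max_{\theta} F(q^{(k+1)}, \theta) = \arg\max_{\theta} \int q^{(k+1)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically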
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
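The three bullets above can be chained into a single line. The following is a sketch of that argument; the iteration superscripts on θ and q are notation assumed here to distinguish successive iterates.

```latex
L(\theta^{(k)})
  \overset{\text{E step}}{=} F\big(q^{(k+1)}, \theta^{(k)}\big)
  \overset{\text{M step}}{\le} F\big(q^{(k+1)}, \theta^{(k+1)}\big)
  \overset{\text{Jensen}}{\le} L(\theta^{(k+1)})
```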
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides).
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model looks complex even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
  $p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$
• In general
  $p(X_1, \dots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
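As a concrete instance of this factorization, the five-node network can be coded with one conditional probability table per node. The sketch below assumes binary variables and uses made-up table entries; only the parent structure comes from the slide.

```python
from itertools import product

# Parent sets of the example network: p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}

# Hypothetical tables: P(node = 1 | parent values); P(node = 0 | ...) is the complement
p_one = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "D": {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.7},
    "E": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.8},
}

def joint(assignment):
    """Joint probability as the product of each node's conditional given its parents."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_one[node][tuple(assignment[p] for p in pa)]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob

# Summing the joint over all 2^5 assignments must give 1
total = sum(joint(dict(zip("ABCDE", values))) for values in product((0, 1), repeat=5))
all_ones = joint({"A": 1, "B": 1, "C": 1, "D": 1, "E": 1})
```

The full joint table would need 2^5 - 1 = 31 free numbers, while the factorized form above stores only 1 + 1 + 4 + 4 + 4 = 14.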
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the documents share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • "Java" could mean "coffee" and also the programming language Java
    • "Cricket" is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc" and "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
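[Figure: pLSA in plate notation, d → z → w, with a plate of size N_d over the words and a plate of size M over the documents; next to it the unrolled model, in which document d generates the pairs (z_1, w_1), (z_2, w_2), (z_3, w_3), …, (z_N, w_N)]
• z_1 … z_N are variables with z_i ∈ [1, K], where K is the number of latent topics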
pLSA
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_M; document d_m generates the pairs (z_1, w_1), (z_2, w_2), …, (z_{N_m}, w_{N_m})]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
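Under the generative story d → z → w, the two quantities named above can be written as follows. This is a sketch in the asymmetric parameterization of Hofmann's paper; the count n(d, w), the number of occurrences of word w in document d, is notation assumed here.

```latex
% joint probability of a document, a latent topic and a word
p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)

% likelihood of an observed (document, word) pair: the latent topic is summed out
p(d, w) = \sum_{z} p(d, z, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)

% likelihood of the whole corpus
\prod_{d} \prod_{w} p(d, w)^{\, n(d, w)}
```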
Document Decomposition
• Each document can be decomposed as
  $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  $p(w \mid d) = Z_{V \times k}\; p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
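The matrix view can be made concrete with a toy vocabulary of V = 4 words and k = 2 topics. The numbers are invented for the example; the point is only that multiplying a column-stochastic topic matrix by a topic mixture yields another distribution over words.

```python
# Column z of Z holds p(w | z); rows are indexed by word, columns by topic (V = 4, k = 2)
Z = [
    [0.50, 0.00],
    [0.50, 0.00],
    [0.00, 0.25],
    [0.00, 0.75],
]
p_z_given_d = [0.5, 0.5]   # hypothetical topic mixture of one document

def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

p_w_given_d = mat_vec(Z, p_z_given_d)   # the document's word distribution
```

Here `p_w_given_d` plays the role of one column of the data matrix, `Z` the shared basis, and `p_z_given_d` the coordinates of the document in that basis.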
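pLSA - Objective Function
• pLSA tries to maximize the log likelihood
  $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM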
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
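A compact way to see both steps is to run them on a small count matrix. The sketch below uses the standard pLSA updates from Hofmann's paper: the E step computes p(z|d,w) ∝ p(w|z) p(z|d), and the M step re-estimates p(w|z) and p(z|d) from the counts n(d,w) weighted by that posterior. The function names, the random initialization and the toy counts are assumptions for illustration.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa_log_likelihood(counts, p_w_z, p_z_d):
    """sum_d sum_w n(d,w) log sum_z p(w|z) p(z|d); the constant p(d) term is dropped."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                ll += n * math.log(sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z))))
    return ll

def plsa_em(counts, n_topics, n_iter=30, seed=0):
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    history = [plsa_log_likelihood(counts, p_w_z, p_z_d)]
    for _ in range(n_iter):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E step: posterior p(z | d, w) under the current parameters
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)])
                for z in range(n_topics):
                    acc_w_z[z][w] += n * post[z]
                    acc_z_d[d][z] += n * post[z]
        # M step: renormalize the expected counts
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
        history.append(plsa_log_likelihood(counts, p_w_z, p_z_d))
    return p_w_z, p_z_d, history

# Hypothetical 4-document, 4-word count matrix with two visible groups of words
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
topic_word, doc_topic, history = plsa_em(counts, n_topics=2)
```

Each row of `topic_word` is one p(w|z) and each row of `doc_topic` is one p(z|d), so the number of `doc_topic` rows grows with the number of training documents.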
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix: related to P(z|d) (doc to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
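One way to make the correspondence concrete is the symmetric parameterization used in Hofmann's paper, sketched below. The hatted matrix names are notation assumed here, and the exact analogue of U in this form is P(d|z), which is tied to P(z|d) through Bayes' rule.

```latex
P(d, w) = \sum_{z} P(d \mid z)\, P(z)\, P(w \mid z)
\quad\Longleftrightarrow\quad
P = \hat{U}\, \hat{\Sigma}\, \hat{V}^{T},
\qquad
\hat{U} = \big(P(d_i \mid z_k)\big)_{i,k},\;
\hat{\Sigma} = \mathrm{diag}\big(P(z_k)\big)_{k},\;
\hat{V} = \big(P(w_j \mid z_k)\big)_{j,k}
```

Unlike the SVD factors, these matrices are non-negative and normalized, and their columns are not required to be orthogonal.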
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
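The growth can be made explicit by counting table entries. The sketch below follows the counts given in the LDA paper by Blei et al. (kV + kM for pLSA, k + kV for LDA) and ignores the small reduction due to normalization constraints; the function names and the example sizes are assumptions.

```python
def plsa_parameter_count(n_topics, vocab_size, n_docs):
    """k topic-word distributions of size V plus one k-dim topic mixture per training document."""
    return n_topics * vocab_size + n_topics * n_docs

def lda_parameter_count(n_topics, vocab_size):
    """One k-dim Dirichlet parameter plus k topic-word distributions; independent of M."""
    return n_topics + n_topics * vocab_size

# Hypothetical sizes: 100 topics, 10,000 words, growing corpus
for n_docs in (1_000, 100_000):
    print(n_docs, plsa_parameter_count(100, 10_000, n_docs), lda_parameter_count(100, 10_000))
```

Only the pLSA count changes between the two printed rows.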
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_m; each document has its own topic mixture p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
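• Definition
  $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K - 1)-dimensional simplex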
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
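The effect of the parameters can be explored by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma draws, a standard construction, using only Python's `random` module. The helper names and the "mean largest component" summary are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_max_component(alpha0, k, n_samples, rng):
    """Average size of the largest coordinate for a symmetric Dirichlet with sum alpha0."""
    alpha = [alpha0 / k] * k
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_samples)) / n_samples

rng = random.Random(0)
sample = sample_dirichlet([6.0, 2.0, 2.0], rng)          # one point on the 2-simplex
sparse = mean_max_component(0.1, 3, 2000, random.Random(1))   # mass piles up near the corners
smooth = mean_max_component(10.0, 3, 2000, random.Random(1))  # mass concentrates near the center
```

A small α_0 favors samples dominated by a single component, while a large α_0 favors samples close to the uniform mixture.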
The LDA Model
[Figure: LDA graphical model drawn for three documents, each with topics z_1 … z_4 generating words w_1 … w_4]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    - Choose a topic z_n ~ Multinomial(θ)
    - Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
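The generative process above can be sketched directly as sampling code. The names `alpha`, `beta` and `generate_document` are assumptions, as are the toy topic-word tables; the Dirichlet draw uses the standard Gamma-normalization construction.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def generate_document(alpha, beta, n_words, rng):
    """Sample one document: topic mixture, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)                      # theta ~ Dirichlet(alpha)
    topics = rng.choices(range(len(alpha)), weights=theta, k=n_words)   # z_n ~ Multinomial(theta)
    words = [
        rng.choices(range(len(beta[z])), weights=beta[z], k=1)[0]       # w_n ~ p(w | z_n)
        for z in topics
    ]
    return theta, topics, words

# Hypothetical model: 2 topics over a 4-word vocabulary; row z of beta is p(w | z)
alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=random.Random(0))
```

Only `words` would be observed in practice; `theta` and `topics` are the hidden variables that inference has to recover.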
Joint Probability
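• Given the parameters α and β, the joint distribution of a topic mixture θ, the topics z and the words w is
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
  where $p(z_n \mid \theta)$ is $\theta_i$ for the topic i selected by $z_n$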
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
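The three quantities listed above can be written out as follows, in the notation of the LDA paper by Blei et al. The document index d and the per-document length N_d in the last line are notation assumed here.

```latex
% joint probability of the topic mixture, the topics and the words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% marginal distribution of a document: integrate out \theta and sum out every z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% likelihood over all M documents
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```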
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
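In the notation of the LDA paper by Blei et al., the simplified model and the resulting decomposition can be sketched as follows. The symbols γ (a Dirichlet parameter) and φ_n (multinomial parameters) are the free variational parameters.

```latex
% fully factorized variational distribution
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

% the gap between the log likelihood and the bound is a KL divergence
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

Since the left-hand side does not depend on γ or φ, raising the bound and shrinking the KL term are the same operation.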
VBEM vs EM
• The only difference is in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, VBEM approximates p(X|D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference - repeat until convergence
• M-Step: parameter estimation
  - β: has a closed-form update
  - α: can be implemented using the Newton-Raphson method
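For reference, the updates in the LDA paper by Blei et al. take the following form. Here Ψ is the digamma function, i indexes topics, j indexes vocabulary words, and the bracket is 1 when the condition holds and 0 otherwise; the bracket notation is an assumption of this sketch.

```latex
% E-step, per document: iterate the two updates until convergence
\phi_{ni} \propto \beta_{i w_n} \exp\big( \Psi(\gamma_i) \big),
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step for \beta: closed form, accumulated over all documents d and word positions n
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, [\, w_{dn} = j \,]
```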
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset - contains 8000 documents and 15818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful for avoiding over-fitting, but the assumption might be too strong
[Figure: LDA graphical model drawn for three documents, each with topics z_1 … z_4 generating words w_1 … w_4]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with the variables π, z, x, θ and β, and plates labeled N_d and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  Term space (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  Topic projection to feature space (Dim = 200):
          f1     f2     …  fM
    Img1  0.2    0.1    …  0.03
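The behaviour of inverted lists under a long query can be reproduced on a toy corpus. The sketch below is an illustration only; the document contents and the helper names are assumptions.

```python
def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in set(terms):
            index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index, query_terms):
    """Return the intersection and the union of the inverted lists touched by the query."""
    lists = [index.get(term, set()) for term in query_terms]
    if not lists:
        return set(), set()
    return set.intersection(*lists), set.union(*lists)

# Hypothetical corpus: every document shares a term with its neighbor only
docs = {1: ["a", "b"], 2: ["b", "c"], 3: ["c", "d"]}
index = build_inverted_index(docs)
strict, loose = lookup(index, ["a", "b", "c", "d"])   # a "long" query over the whole vocabulary
```

With this query `strict` is empty and `loose` contains every document, which is the dilemma that motivates projecting documents and queries into a low-dimensional space first.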
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional feature vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
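$p = Xw + \varepsilon$, i.e.
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low dimensional representation
• ε: residual
[Figure: an image = a low-dimensional feature vector (bar chart) + a few words (10 words)]
$p^T q \approx w_p^T w_q + \varepsilon_p^T \varepsilon_q$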
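The decomposition and the similarity identity can be checked on a tiny example. The sketch assumes that the base vectors are orthonormal and that the full residual is kept, in which case the inner product splits exactly; the basis, the vectors and the helper names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual orthogonal to it."""
    w = [dot(p, x) for x in basis]
    reconstruction = [sum(wi * x[j] for wi, x in zip(w, basis)) for j in range(len(p))]
    residual = [pj - rj for pj, rj in zip(p, reconstruction)]
    return w, residual

# Hypothetical orthonormal base vectors in a 4-word vocabulary (k = 2)
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [4.0, 2.0, 1.0, 1.0]
q = [3.0, 1.0, 0.0, 2.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)
full_similarity = dot(p, q)
split_similarity = dot(w_p, w_q) + dot(eps_p, eps_q)
```

If only the few largest residual entries are stored, as the "a few words" part of the figure suggests, the split becomes an approximation rather than an identity.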
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  • a topic-specific distribution
  • a document-specific distribution
  • a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
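The switch variable turns the word distribution of a document into a three-way mixture. The sketch below evaluates that mixture for a toy vocabulary; every number and every argument name is an assumption made for illustration.

```python
def word_probability(w, switch, topic_mix, topic_word, doc_word, background):
    """p(w|d) as a mixture of topic-specific, document-specific and background parts."""
    topic_part = sum(topic_word[k][w] * topic_mix[k] for k in range(len(topic_mix)))
    return switch[0] * topic_part + switch[1] * doc_word[w] + switch[2] * background[w]

# Hypothetical document over a 3-word vocabulary
switch = [0.5, 0.25, 0.25]                         # p(x=0|d), p(x=1|d), p(x=2|d)
topic_mix = [0.5, 0.5]                             # p(z=k|d)
topic_word = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]    # p(w|z=k)
doc_word = [1.0, 0.0, 0.0]                         # document-specific distribution
background = [0.25, 0.5, 0.25]                     # background distribution

distribution = [
    word_probability(w, switch, topic_mix, topic_word, doc_word, background)
    for w in range(3)
]
```

The topic part corresponds to the low-dimensional representation, while the document-specific part plays the role of the residual that carries a few characteristic words.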
Search (Online)
[Figure: online search. A query (a feature vector plus a few words) is looked up in an LSH index whose buckets hold DS1, DS2, … entries; the candidates returned (Doc 300, Doc 401, …) are re-ranked against the Doc Meta store (Doc 1, Doc 2, …, Doc N)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
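The three simplest baselines can be sketched in a few lines. The sketch rests on assumptions: posts are token lists in chronological order, "nearest post" is read as the immediately preceding post, and document similarity is the cosine between raw term-count vectors. None of these details are confirmed by the slides.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the term-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    numerator = sum(ca[t] * cb[t] for t in ca)
    denominator = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return numerator / denominator if denominator else 0.0

def reply_to_nearest(i):
    """NP: assume post i replies to the post right before it."""
    return i - 1

def reply_to_root(i):
    """RR: assume every post replies to the first post of the thread."""
    return 0

def reply_by_similarity(posts, i):
    """DS: assume post i replies to the most similar earlier post."""
    return max(range(i), key=lambda j: cosine(posts[i], posts[j]))

# Hypothetical thread of three posts
posts = [
    ["lda", "topic", "model"],
    ["pagerank", "graph"],
    ["topic", "model", "inference"],
]
predicted = (reply_to_nearest(2), reply_to_root(2), reply_by_similarity(posts, 2))
```

The LDA and SWB baselines follow the same pattern as `reply_by_similarity`, with the similarity computed in topic space instead of term space.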
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
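Once the reply structure is known, ranking users reduces to a link-analysis problem. The sketch below runs the HITS iteration on a toy reply network. It assumes that an edge points from the replier to the author being replied to and that the authority score is read as expertise; both are illustrative choices, not details taken from the slides.

```python
import math

def hits(edges, n_iter=50):
    """Power iteration for HITS hub and authority scores on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(n_iter):
        # Authority: sum of hub scores of the nodes pointing at n
        auth = {n: sum(hub[src] for src, dst in edges if dst == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values()))
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the nodes n points at
        hub = {n: sum(auth[dst] for src, dst in edges if src == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values()))
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical reply network: (replier, author replied to)
edges = [("b", "a"), ("c", "a"), ("d", "a"), ("d", "b")]
hub, auth = hits(edges)
top_expert = max(auth, key=auth.get)
```

In this network user `a` receives replies from everyone else and therefore ends up with the highest authority score.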
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to learn matrices
  - Graphical models: a good way to learn probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is a good tool for analyzing problems
  - It is more adaptable to various applications than matrix decomposition
Why We Need EM
bull The ExpectationndashMaximization (EM) algorithm is a method for ML learning of parameters in latent variable models
bull Why we need latent variables
bull To describe complex model Gaussian Mixture Model
bull To discover the intrinsic structure inside a data set Topic Models such as pLSA LDA
More General
bull Data set bull Likelihood
bull Goal learn maximum likelihood (ML) parameter values
bull The maximum likelihood procedure finds parameters θ such that
bull Because of the integral (or sum) over latent variables the likelihood can be a very complicated and hard to optimize function
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  - Vocabulary space (Dim = 1 million)
          Term1  Term2  Term3  Term4  …  TermN
    Img1      1      2      0      0  …      2
  - After topic projection (Dim = 200)
            f1    f2   …    fM
    Img1   0.2   0.1   …  0.03
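The inverted-list problem above can be reproduced on a toy corpus. This is a minimal sketch, not the system from the talk: the three "images", their terms, and the helper names `build_inverted_index` and `query_lists` are all made up for illustration.

```python
from functools import reduce


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def query_lists(index, query_terms):
    """Return (intersection, union) of the inverted lists touched by a query."""
    lists = [index.get(term, set()) for term in query_terms]
    if not lists:
        return set(), set()
    return reduce(set.intersection, lists), reduce(set.union, lists)


# Hypothetical toy corpus: three images described by a few terms each.
docs = {
    "img1": ["sky", "sea", "boat"],
    "img2": ["sky", "tree", "grass"],
    "img3": ["road", "car", "tree"],
}
index = build_inverted_index(docs)

# A "long" query touches one inverted list per term.
both, either = query_lists(index, ["sky", "sea", "tree", "car"])
print("intersection:", both)
print("union:", either)
```

Even with four query terms, the intersection is already empty while the union covers the whole toy corpus, which is the motivation for projecting to a low-dimensional space instead.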
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
An image = [plot: low-dimensional representation] + a few words (10 words)
Orthogonal Decomposition
p = Xw + ε
• X = (X1, X2, X3, …, Xk): base vectors
• w: low-dimensional representation
• ε: residual
An image = [plot: low-dimensional representation] + a few words (10 words)
$$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$
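The decomposition p = Xw + ε can be sketched in a few lines. The code below assumes the columns of X are orthonormal (an assumption made here so that w is a plain projection), and the helper name `decompose`, the toy vectors, and the `keep` parameter are hypothetical. It also checks numerically that the inner product of two vectors splits into a low-dimensional part plus a residual part.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis, keep=10):
    """Split p into coordinates w on an orthonormal basis plus a residual.

    basis: list of k orthonormal vectors (assumed, not taken from the slides).
    keep:  number of largest-magnitude residual entries to store sparsely.
    """
    w = [dot(x, p) for x in basis]
    recon = [sum(w[j] * basis[j][i] for j in range(len(basis)))
             for i in range(len(p))]
    residual = [pi - ri for pi, ri in zip(p, recon)]
    top = sorted(range(len(p)), key=lambda i: abs(residual[i]), reverse=True)[:keep]
    sparse_residual = {i: residual[i] for i in top if residual[i] != 0.0}
    return w, residual, sparse_residual


# Two orthonormal base vectors in a 4-term vocabulary (toy values).
s = 1.0 / math.sqrt(2.0)
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, s, s, 0.0]]

p = [3.0, 1.0, 2.0, 4.0]
q = [1.0, 2.0, 0.0, 5.0]
w_p, e_p, sparse_p = decompose(p, basis, keep=1)
w_q, e_q, _ = decompose(q, basis)

# Similarity in the original space equals low-dimensional part + residual part.
lhs = dot(p, q)
rhs = dot(w_p, w_q) + dot(e_p, e_q)
print(lhs, rhs, sparse_p)
```

Keeping only the largest residual entries (the "few words") makes the residual term approximate rather than exact; the identity holds exactly only when the full residual is kept.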
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
where p'(w | d) is the document-specific distribution and p''(w) is the background distribution
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
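A small numeric sketch of the switch-variable mixture may help. All distributions below are invented toy values, and the function name `word_probability` is hypothetical; the point is only that the three sources, weighted by p(x | d), still form a proper distribution over words.

```python
def word_probability(word, switch, topic_mix, topic_word, doc_word, background):
    """p(w|d) as a three-way mixture selected by the switch variable x.

    switch[0], switch[1], switch[2] are p(x=0|d), p(x=1|d), p(x=2|d).
    """
    from_topics = sum(topic_mix[k] * topic_word[k].get(word, 0.0)
                      for k in range(len(topic_mix)))
    return (switch[0] * from_topics
            + switch[1] * doc_word.get(word, 0.0)
            + switch[2] * background.get(word, 0.0))


vocab = ["beach", "sea", "the", "eiffel"]
topic_word = [                       # p(w | z = k), toy values
    {"beach": 0.6, "sea": 0.4},
    {"sea": 0.2, "the": 0.3, "eiffel": 0.5},
]
topic_mix = [0.7, 0.3]               # p(z = k | d)
doc_word = {"eiffel": 1.0}           # document-specific words
background = {"beach": 0.1, "sea": 0.1, "the": 0.7, "eiffel": 0.1}
switch = [0.6, 0.1, 0.3]             # p(x | d)

probs = {w: word_probability(w, switch, topic_mix, topic_word, doc_word, background)
         for w in vocab}
total = sum(probs.values())
print(probs, total)
```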
Search (Online)
[Figure: a query is represented as a low-dimensional vector plus a few words; the LSH index returns candidate documents (e.g. Doc 300, Doc 401, …), which are then re-ranked using the per-document entries DS1, DS2, … stored in Doc Meta]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
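The three simplest baselines (NP, RR, DS) can be written down directly. This is a rough sketch under assumptions: posts are given in chronological order with post 0 as the root, similarity is cosine over raw term counts, and the function name `predict_parents` and the toy thread are invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def predict_parents(posts, method):
    """Predict the parent of each post 1..n-1 (post 0 is the thread root)."""
    bags = [Counter(p) for p in posts]
    parents = []
    for i in range(1, len(posts)):
        if method == "NP":        # reply to the nearest (previous) post
            parents.append(i - 1)
        elif method == "RR":      # reply to the root
            parents.append(0)
        elif method == "DS":      # reply to the most similar earlier post
            parents.append(max(range(i), key=lambda j: cosine(bags[i], bags[j])))
        else:
            raise ValueError("unknown method: " + method)
    return parents


thread = [
    "which laptop should i buy".split(),
    "buy a macbook".split(),
    "macbook battery is great".split(),
    "which laptop budget".split(),
]
np_parents = predict_parents(thread, "NP")
rr_parents = predict_parents(thread, "RR")
ds_parents = predict_parents(thread, "DS")
print(np_parents, rr_parents, ds_parents)
```

The LDA and SWB baselines follow the same pattern as DS, with the bag-of-words vectors replaced by topic-space projections.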
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: Reply reconstruction → Network construction → Expert finding
• Methods
  - HITS
  - PageRank
  - …
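To make the "network construction → expert finding" step concrete, here is a minimal HITS sketch on a reply network. The edge convention is an assumption made for this example: an edge (asker, replier) means the replier answered the asker's post, so authority accumulates on users who answer. User names and the function name `hits` are hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Run HITS by power iteration on a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes pointing at n.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub score: sum of authority scores of nodes that n points at.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Toy reply network: alice answered bob, carol and dave; carol answered dave.
edges = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"), ("dave", "carol")]
hub, auth = hits(edges)
top_expert = max(auth, key=auth.get)
print(top_expert, auth)
```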
Baselines
• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
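The three columns of the table are standard ranking metrics. The sketch below shows how they are usually computed for a single ranked list, under the common definitions (reciprocal rank of the first relevant item, average precision normalized by the number of relevant items, precision at cutoff k); averaging over queries gives MRR and MAP. The user ids are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant items appear."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# One hypothetical query: a ranked list of users and the true experts.
ranked = ["u3", "u1", "u7", "u2"]
experts = {"u1", "u2"}
rr = reciprocal_rank(ranked, experts)
ap = average_precision(ranked, experts)
p2 = precision_at_k(ranked, experts, k=2)
mrr = mean([rr])   # MRR and MAP average rr and ap over all queries
print(rr, ap, p2, mrr)
```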
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice for learning matrices
  - Graphical model: a good practice for learning probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
• A graphical model is more adaptable for various applications than matrix decomposition
More General
• Data set: D = {y_1, …, y_N}
• Likelihood: L(θ) = log p(D | θ) = log ∫ p(X, D | θ) dX
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such that θ_ML = argmax_θ L(θ)
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps
  - E step: Fill in values of latent variables according to the posterior given the data
  - M step: Maximize the likelihood as if the latent variables were not hidden
• Decomposes difficult problems into a series of tractable steps
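As a concrete instance of the two steps, here is a minimal EM sketch for a one-dimensional mixture of Gaussians, where the latent variable is the component that generated each point. The model choice, the toy data, and the function names are assumptions made for illustration, not an example from the talk.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(math.log(sum(w * normal_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)


def em_gmm(data, init_means, iterations=20):
    """EM for a 1-D Gaussian mixture; returns parameters and log-likelihood history."""
    k = len(init_means)
    means = list(init_means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        # E step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: re-estimate parameters as if the responsibilities were observed.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6)  # small floor to avoid a degenerate zero variance
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


data = [0.8, 1.1, 1.3, 0.9, 1.0, 4.9, 5.2, 5.1, 4.7, 5.3]
weights, means, variances, history = em_gmm(data, [0.0, 6.0])
print(means, history[0], history[-1])
```

The recorded `history` is the log likelihood after each iteration; it should never go down, which is the property the later slides derive.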
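Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ
$$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] = F(q, \theta)$$
• where H[q] is the entropy of q(X)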
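The E and M Steps of EM
• The lower bound on the log likelihood is given by
$$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$$
• EM alternates between
  - E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed
$$q^{(k+1)} = \arg\max_{q} F(q, \theta^{(k)})$$
  - M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed
$$\theta^{(k+1)} = \arg\max_{\theta} F(q^{(k+1)}, \theta)$$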
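The E Step
• E step: for fixed θ, with L(θ) = log p(D | θ),
$$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when q(X) = p(X | D, θ)
• So the E step simply sets q^{(k+1)}(X) = p(X | D, θ^{(k)})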
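The M Step
• M step: maximize w.r.t. the parameters, holding the hidden distribution q fixed
$$\theta^{(k+1)} = \arg\max_{\theta} F(q^{(k+1)}, \theta) = \arg\max_{\theta} \int q^{(k+1)}(X) \log p(X, D \mid \theta)\, dX$$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically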
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
$$L(\theta^{(k)}) = F(q^{(k+1)}, \theta^{(k)}) \le F(q^{(k+1)}, \theta^{(k+1)}) \le L(\theta^{(k+1)})$$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ, which raises the bound
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex, even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)
• In general
$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p\big(X_i \mid X_{\mathrm{pa}(i)}\big)$$
• where pa(i) are the parents of node i
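The factorization for the five-node example can be checked numerically. In the sketch below every variable is binary and all conditional probability values are made up; only the structure p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D) comes from the slide.

```python
from itertools import product

# Hypothetical tables: each entry is P(variable = True | parent values).
p_a = 0.3
p_b = 0.6
p_c = {(a, b): 0.1 + 0.4 * a + 0.3 * b for a, b in product([False, True], repeat=2)}
p_d = {(b, c): 0.2 + 0.5 * b + 0.2 * c for b, c in product([False, True], repeat=2)}
p_e = {(c, d): 0.05 + 0.3 * c + 0.6 * d for c, d in product([False, True], repeat=2)}


def bern(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[(a, b)], c)
            * bern(p_d[(b, c)], d) * bern(p_e[(c, d)], e))


assignments = list(product([False, True], repeat=5))
total = sum(joint(*values) for values in assignments)
marginal_a = sum(joint(*values) for values in assignments if values[0])
print(total, marginal_a)
```

The full joint over five binary variables has 2^5 - 1 = 31 free parameters, while the factorized form above needs only 1 + 1 + 4 + 4 + 4 = 14, which is the practical benefit of the decomposition.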
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same queries vary due to personal styles
• Latent semantic indexing
  - Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    - Java could mean "coffee" and also the programming language "Java"
    - Cricket is a "game" and also an "insect"
  - Synonymy
    - "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
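pLSA
[Figure: pLSA graphical model in plate notation (d → z → w, with plates Nd and M), and the same model unrolled for one document: d generates topics z1, z2, z3, …, zN, and each zn generates a word wn]
• z1, …, zN are variables with zi ∈ [1, K], where K is the number of latent topics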
pLSA
[Figure: the model unrolled for all documents d1, d2, …, dM; document dm generates topics z1, …, zNm and words w1, …, wNm]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability
$$p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$$
• Likelihood (only for observed variables)
$$p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$$
• p(d) is assumed to be uniform
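Document Decomposition
• Each document can be decomposed as
$$p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$$p(w \mid d) = Z_{V \times k}\; p(z \mid d)$$
where the columns of Z are the topic distributions p(w | z) over a vocabulary of size V
• With many documents, we hope to find latent topics as a common basis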
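The decomposition is just a matrix-vector product between the topic matrix and the document's topic mixture. A tiny sketch with invented numbers (two topics, four vocabulary words; the function name `mix_topics` is hypothetical):

```python
def mix_topics(topic_word, doc_topic):
    """Return p(w|d) = sum_z p(w|z) p(z|d).

    topic_word[z][w] holds p(w|z); doc_topic[z] holds p(z|d).
    """
    vocab_size = len(topic_word[0])
    return [sum(doc_topic[z] * topic_word[z][w] for z in range(len(doc_topic)))
            for w in range(vocab_size)]


topic_word = [
    [0.5, 0.4, 0.1, 0.0],   # p(w | z = 1)
    [0.0, 0.1, 0.3, 0.6],   # p(w | z = 2)
]
doc_topic = [0.25, 0.75]    # p(z | d): the document's coordinates in topic space
p_w_given_d = mix_topics(topic_word, doc_topic)
print(p_w_given_d)
```

In the notation of the matrix decomposition part of the talk, the topic distributions play the role of the basis B and the mixture p(z | d) plays the role of the coordinates c.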
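pLSA - Objective Function
• pLSA tries to maximize the log likelihood
$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$$
where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM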
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
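EM Steps
• The E-Step
$$p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$$
• The M-Step
$$p(w \mid z) = \frac{\sum_{d} n(d, w)\, p(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, p(z \mid d, w')}, \qquad p(z \mid d) = \frac{\sum_{w} n(d, w)\, p(z \mid d, w)}{\sum_{w} n(d, w)}$$
where n(d, w) is the number of times word w occurs in document d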
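The pLSA E and M steps fit in a short loop. The sketch below is a plain, unoptimized version under assumptions: counts are a dense document-by-word table in which every document has at least one word, initialization is random, and the constant p(d) term is dropped from the log likelihood. The function name `plsa` and the toy counts are invented.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, num_topics, iterations=30, seed=0):
    """counts[d][w] = n(d, w). Returns p(w|z), p(z|d) and the log-likelihood history."""
    rng = random.Random(seed)
    num_docs, vocab = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(vocab)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * vocab for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        loglik = 0.0
        for d in range(num_docs):
            for w in range(vocab):
                if counts[d][w] == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(num_topics)]
                p_w_d = sum(joint)
                loglik += counts[d][w] * math.log(p_w_d)
                for z in range(num_topics):
                    post = joint[z] / p_w_d              # E step: p(z | d, w)
                    new_w_z[z][w] += counts[d][w] * post
                    new_z_d[d][z] += counts[d][w] * post
        history.append(loglik)  # log likelihood of the parameters before this update
        # M step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history


counts = [
    [4, 3, 0, 0],
    [3, 5, 1, 0],
    [0, 0, 4, 4],
    [0, 1, 3, 5],
]
p_w_z, p_z_d, history = plsa(counts, num_topics=2)
print(history[0], history[-1])
```

Note that p(z | d) is a separate parameter vector for every training document, so the parameter count grows with the number of documents.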
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006)
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious problems with over-fitting
[Figure: pLSA unrolled for documents d1, d2, …, dm; each document has its own unconstrained topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
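Dirichlet Distribution
• Definition
$$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex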
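The Dirichlet density is easy to evaluate in log space with the standard library's log-gamma function. This is a small sketch; the function name `dirichlet_log_pdf` and the tolerance used to test membership in the simplex are choices made for this example.

```python
import math


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at x; -inf outside the open simplex."""
    if (len(x) != len(alpha)
            or any(xi <= 0.0 for xi in x)
            or abs(sum(x) - 1.0) > 1e-9):
        return float("-inf")
    # log of Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


uniform_case = dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
peaked_center = dirichlet_log_pdf([1 / 3, 1 / 3, 1 / 3], [6.0, 6.0, 6.0])
peaked_corner = dirichlet_log_pdf([0.9, 0.05, 0.05], [6.0, 6.0, 6.0])
outside = dirichlet_log_pdf([0.5, 0.6, 0.1], [1.0, 1.0, 1.0])
print(uniform_case, peaked_center, peaked_corner, outside)
```

With all α_i = 1 the density is constant on the simplex (equal to Γ(3) = 2 for K = 3), and with larger equal α_i the mass concentrates around the center.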
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0 = Σ_{i=1}^{k} α_i: α_0 = 0.1, α_0 = 1, α_0 = 10
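The effect of α_0 can be seen by sampling. The sketch below draws Dirichlet samples by normalizing independent Gamma variates (a standard construction) and compares how concentrated the samples are for a small and a large α_0 with equal α_i. The sample size, seed, and helper names are arbitrary choices for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:          # guard against numerical underflow for tiny alpha
            return [d / total for d in draws]


def mean_largest_component(alpha0, k, rng, num_samples=200):
    """Average size of the largest coordinate for equal alpha_i = alpha0 / k."""
    alpha = [alpha0 / k] * k
    return sum(max(sample_dirichlet(alpha, rng))
               for _ in range(num_samples)) / num_samples


rng = random.Random(0)
one_sample = sample_dirichlet([6.0, 2.0, 2.0], rng)
sparse = mean_largest_component(0.1, 3, rng)    # mass pushed toward the corners
smooth = mean_largest_component(10.0, 3, rng)   # mass pulled toward the center
print(one_sample, sparse, smooth)
```

A small α_0 yields samples that put nearly all mass on one topic, while a large α_0 yields near-uniform mixtures.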
The LDA Model
[Figure: LDA graphical model; for each document, topics z1 … z4 generate words w1 … w4]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
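The generative process can be simulated directly. The following is a minimal sketch with made-up parameters: `alpha` is the Dirichlet parameter, `beta[z]` is the word distribution of topic z over a four-word vocabulary, and the helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma variates."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:
            return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs (assumed to sum to 1)."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1


def generate_document(alpha, beta, num_words, rng):
    """Follow the LDA generative story for a single document."""
    theta = sample_dirichlet(alpha, rng)          # topic mixture of this document
    topics, words = [], []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)        # choose a topic
        w = sample_categorical(beta[z], rng)      # choose a word from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0 only emits words 0 and 1
    [0.0, 0.0, 0.5, 0.5],   # topic 1 only emits words 2 and 3
]
rng = random.Random(1)
theta, topics, words = generate_document(alpha, beta, num_words=8, rng=rng)
print(theta, topics, words)
```

Unlike pLSA, the per-document mixture θ is itself drawn from a shared prior, so a new document does not add new parameters to the model.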
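Joint Probability
• Given parameters α and β
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
where p(θ | α) is the Dirichlet density
Likelihood
• Joint probability
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
• Marginal distribution of a document
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big) d\theta$$
• Likelihood over all the documents
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big) d\theta_d$$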
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
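The relation stated in the two bullets above can be written out explicitly. The block below uses the notation of Blei, Ng and Jordan (2003), which this deck cites; the fully factorized form of q is the one used in that paper and is an assumption about what the lost slide figure showed.

```latex
\begin{align}
\log p(\mathbf{w} \mid \alpha, \beta)
  &= L(\gamma, \phi; \alpha, \beta)
   + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big) \\
q(\theta, \mathbf{z} \mid \gamma, \phi)
  &= q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n) \\
(\gamma^{*}, \phi^{*})
  &= \arg\min_{\gamma, \phi} \mathrm{KL}\big( q \,\|\, p \big)
   = \arg\max_{\gamma, \phi} L(\gamma, \phi; \alpha, \beta)
\end{align}
```

Because the left-hand side does not depend on γ or φ, any increase in the bound L is exactly matched by a decrease in the KL term.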
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is directly set to p(X | D, θ), so that KL = 0
• In VBEM, it is intractable to compute p(X | D, θ). Instead, it approximates p(X | D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound F(q, θ)
The Expectation Maximization (EM) Algorithm
bull The EM algorithm finds a (local) maximum of a latent variable model likelihood It starts from arbitrary values of the parameters and iterates two steps
bull E step Fill in values of latent variables according to posterior given data
bull M step Maximize likelihood as if latent variables were not hidden
bull Decomposes difficult problems into series of tractable steps
Jensenrsquos Inequality
Lower Bounding the Log Likelihood
bull Observed data D = yn Latent variables X = xn Parameters θbull Goal Maximize the log likelihood (ie ML learning) wrt θ
bull Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensenrsquos inequality
bull where H[q] is the entropy of q(X)
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
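• Each document can be decomposed as $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
$p(w \mid d) = Z_{V \times K}\, p(z \mid d)$
where column k of Z is the topic distribution p(w|z=k) over a vocabulary of V words
• With many documents, we hope to find latent topics as a common basis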
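To make the matrix view concrete, here is a small sketch of the product $Z_{V \times K}\, p(z \mid d)$. The topic matrix `Z` and the mixture `p_z_given_d` are invented toy values (V = 4 words, K = 2 topics), chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical toy model: V = 4 words, K = 2 topics (values are illustrative).
# Z[v][k] = p(w = v | z = k); every column sums to 1.
Z = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
p_z_given_d = [0.25, 0.75]  # topic mixture p(z|d) of one document


def word_distribution(topic_matrix, mixture):
    """p(w|d) = sum_k p(w|z=k) p(z=k|d), i.e. the matrix-vector product Z p(z|d)."""
    return [sum(z_vk * m_k for z_vk, m_k in zip(row, mixture))
            for row in topic_matrix]


p_w_given_d = word_distribution(Z, p_z_given_d)
print(p_w_given_d)
```

Because every column of `Z` and the mixture itself sum to one, the result is again a distribution over the vocabulary.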
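pLSA – Objective Function
• pLSA tries to maximize the log likelihood $\sum_{d} \sum_{w} n(d, w) \log p(d, w)$, where n(d, w) is the number of times word w occurs in document d
• Because of the summation over z inside the log, there is no closed-form maximizer, so we resort to EM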
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
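• The E-Step: $p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$
• The M-Step: $p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w)$ and $p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$, where n(d, w) is the count of word w in document d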
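The two steps can be sketched in a few lines of plain Python. This is a minimal illustration of the standard pLSA updates (posterior p(z|d,w) in the E-step, renormalised expected counts in the M-step), not the implementation used for the results in these slides. The random initialisation, the `1e-12` smoothing constant and the toy count matrix are assumptions of the sketch.

```python
import math
import random


def plsa_em(counts, K, iters=50, seed=0):
    """Fit pLSA by EM on a document-word count matrix counts[d][w]."""
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])

    def normalized(row):
        s = sum(row)
        return [x / s for x in row]

    # Random initialisation of p(w|z) and p(z|d) (an assumption of this sketch).
    p_w_z = [normalized([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalized([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    history = []
    for _ in range(iters):
        new_w_z = [[1e-12] * V for _ in range(K)]  # tiny constant avoids empty rows
        new_z_d = [[1e-12] * K for _ in range(D)]
        loglik = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior p(z|d,w) is proportional to p(w|z) p(z|d).
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(K)]
                p_w_d = sum(joint)
                loglik += n * math.log(p_w_d)
                for k in range(K):
                    r = n * joint[k] / p_w_d   # n(d,w) p(z=k|d,w)
                    new_w_z[k][w] += r
                    new_z_d[d][k] += r
        # M-step: renormalise the expected counts.
        p_w_z = [normalized(row) for row in new_w_z]
        p_z_d = [normalized(row) for row in new_z_d]
        history.append(loglik)  # log likelihood under the pre-update parameters
    return p_w_z, p_z_d, history


# Toy corpus: 4 documents over a 4-word vocabulary, two obvious word groups.
counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_w_z, p_z_d, history = plsa_em(counts, K=2)
print(history[0], history[-1])
```

The recorded log likelihood omits the constant contribution of the uniform p(d), which does not affect the updates.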
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – U matrix is related to P(z|d) (document to aspect)
  – V matrix is related to P(w|z) (aspect to term)
  – Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA fits a generative model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
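The linear growth is easy to see by counting table entries. The sketch below counts K·V topic-word values plus K values per training document for pLSA, and compares this with a model that replaces the per-document mixtures with a single K-dimensional Dirichlet parameter. The function names and the example sizes are illustrative assumptions.

```python
def plsa_parameter_count(K, V, M):
    """K topic-word distributions (K*V values) plus one topic mixture per document (K*M values)."""
    return K * V + K * M


def lda_parameter_count(K, V):
    """K*V topic-word values plus a single K-dimensional Dirichlet parameter alpha."""
    return K * V + K


# Illustrative sizes: 100 topics, 10,000-word vocabulary.
for M in (1000, 2000, 4000):
    print(M, plsa_parameter_count(100, 10000, M), lda_parameter_count(100, 10000))
```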
[Figure: pLSA unrolled over documents d_1 … d_m, each with its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
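• Definition
$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex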
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
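The effect of α_0 can be explored by sampling. The sketch below draws Dirichlet samples by normalising independent Gamma variates (a standard construction) and reports the average size of the largest component for three values of α_0 with equal α_i. The sample size, the seed and the "average maximum" summary are choices of this sketch, not of the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


rng = random.Random(0)
avg_max = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = [alpha0 / 3.0] * 3          # equal alpha_i that sum to alpha0
    samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]
    # A value near 1 means samples sit close to a corner of the simplex.
    avg_max[alpha0] = sum(max(s) for s in samples) / len(samples)
    print(alpha0, round(avg_max[alpha0], 3))
```

Small α_0 pushes the mass toward the corners (sparse mixtures), while large α_0 concentrates samples around the centre of the simplex.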
The LDA Model
[Figure: LDA graphical model; each document has topics z_1 … z_4 and words w_1 … w_4, and the topic-word parameter β is shared by all documents]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
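The generative process can be simulated directly. In the sketch below the vocabulary, the topic-word matrix `beta` and the prior `alpha` are invented toy values, and the Dirichlet draw uses the normalised-Gamma construction.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Sample one document from the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)                      # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]            # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical two-topic model over a six-word vocabulary.
vocab = ["goal", "match", "team", "stock", "market", "price"]
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # topic 0
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],   # topic 1
]
theta, topics, words = generate_document([0.5, 0.5], beta, vocab, 8, random.Random(1))
print(theta, list(zip(topics, words)))
```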
Joint Probability
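• Given parameters α and β: $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where θ is the topic mixture of the document, z are the topic assignments and w are the N observed words
Likelihood
• Joint probability: $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document: $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta$
• Likelihood over all the documents: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$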
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
• Jensen's inequality in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on the log likelihood with respect to q(X)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  – β has a closed-form update
  – α can be updated using the Newton-Raphson method
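As an illustration of the E-step, the sketch below iterates the per-document updates given in the LDA paper listed in the references (Blei, Ng and Jordan): $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$ and $\gamma_i = \alpha_i + \sum_n \phi_{ni}$. Whether the slides showed exactly these formulas is an assumption. The `digamma` helper is a hand-written approximation (recurrence plus asymptotic series), and the toy `alpha`, `beta` and `word_ids` are invented.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (shift x up by recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def variational_e_step(word_ids, alpha, beta, iters=50):
    """Update (gamma, phi) for one document; beta[k][v] = p(w = v | z = k)."""
    K, N = len(alpha), len(word_ids)
    gamma = [a + N / K for a in alpha]
    phi = [[1.0 / K] * K for _ in range(N)]
    for _ in range(iters):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, v in enumerate(word_ids):
            row = [beta[k][v] * weights[k] for k in range(K)]
            s = sum(row)
            phi[n] = [r / s for r in row]                    # phi_nk normalised over k
        gamma = [alpha[k] + sum(phi[n][k] for n in range(N)) for k in range(K)]
    return gamma, phi


# Hypothetical two-topic model over a four-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
word_ids = [0, 1, 0, 1, 2]   # mostly words favoured by topic 0
gamma, phi = variational_e_step(word_ids, alpha, beta)
print(gamma)
```

A fixed number of iterations stands in for a convergence test here; a fuller implementation would monitor the change in γ or in the bound.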
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model with per-document topics z_1 … z_4, words w_1 … w_4 and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with node and plate labels M, N_d, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection. The term vector of Img1 over Term1 … TermN (1, 2, 0, 0, …, 2; Dim = 1 million) is projected to a topic vector over f1 … fM (0.2, 0.1, …, 0.03; Dim = 200)]
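The two failure modes of a long keyword query can be reproduced with a tiny inverted index. The index contents and the four-term query below are made up for illustration; a real long query would touch far more lists.

```python
from functools import reduce

# Hypothetical inverted index: term -> set of ids of documents containing it.
index = {
    "matrix": {1, 2, 5},
    "topic": {2, 3, 5, 8},
    "image": {4, 5, 9},
    "query": {6, 7},
}
query = ["matrix", "topic", "image", "query"]

postings = [index[term] for term in query]          # one inverted list per query term
strict_and = reduce(set.intersection, postings)     # documents containing every term
loose_or = reduce(set.union, postings)              # documents containing any term

print(len(strict_and), len(loose_or))
```

With every added term the intersection can only shrink and the union can only grow, which is why neither is a useful candidate set for very long queries.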
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional bar chart of topic weights + a few words (10 words)]
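A minimal sketch of this split follows, assuming the base vectors are orthonormal so that the low-dimensional coordinates are simply w = Xᵀp. Only the largest residual entries are kept, mirroring the "a few words" in the figure. The function name `decompose`, the `keep` parameter and the toy vectors are assumptions of the sketch.

```python
def decompose(p, basis, keep=2):
    """Split p into coordinates w on `basis` and a sparse residual.

    `basis` is a list of orthonormal base vectors (the columns of X).
    Only the `keep` largest-magnitude residual entries are stored.
    """
    w = [sum(x_i * p_i for x_i, p_i in zip(x, p)) for x in basis]       # w = X^T p
    recon = [sum(w_k * x[i] for w_k, x in zip(w, basis)) for i in range(len(p))]
    residual = [p_i - r_i for p_i, r_i in zip(p, recon)]                # eps = p - X w
    top = sorted(range(len(p)), key=lambda i: -abs(residual[i]))[:keep]
    sparse_residual = {i: residual[i] for i in sorted(top)}
    return w, sparse_residual


# Toy example: 4-word vocabulary, 2 orthonormal base vectors.
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [4.0, 1.0, 0.0, 2.0]
w, sparse_residual = decompose(p, basis)
print(w, sparse_residual)
```

Storing the residual as a small word-to-value map keeps the index compact while preserving the document-specific terms that the topic projection loses.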
Orthogonal Decomposition
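$p = Xw + \varepsilon$, written out element by element for a vocabulary of W words and k base vectors:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional bar chart of topic weights + a few words (10 words)]
• Similarity of two documents p and q (base vectors orthonormal, residuals orthogonal to them): $p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$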
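When the base vectors are orthonormal and each residual is orthogonal to them, the inner product of two documents splits exactly into a low-dimensional part and a residual part. The sketch below checks this identity on toy vectors; the helper names and data are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split(p, basis):
    """Return (w, eps) with p = X w + eps, for orthonormal base vectors in `basis`."""
    w = [dot(x, p) for x in basis]
    recon = [sum(w_k * x[i] for w_k, x in zip(w, basis)) for i in range(len(p))]
    eps = [p_i - r_i for p_i, r_i in zip(p, recon)]
    return w, eps


basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [4.0, 1.0, 0.0, 2.0]
q = [1.0, 3.0, 2.0, 0.0]
w_p, eps_p = split(p, basis)
w_q, eps_q = split(q, basis)

lhs = dot(p, q)                                  # similarity in vocabulary space
rhs = dot(w_p, w_q) + dot(eps_p, eps_q)          # topic part + residual part
print(lhs, rhs)
```

If only a few residual entries are stored, the second term becomes an approximation rather than an exact correction.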
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  – a topic-specific distribution
  – a document-specific distribution
  – a background distribution
$p(w \mid d) = p(x{=}0 \mid d) \sum_{k=1}^{K} p(w \mid z{=}k)\, p(z{=}k \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)$
where p'(w|d) is the document-specific distribution and p''(w) is the background distribution
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
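The three-way mixture can be evaluated directly once its pieces are given. The sketch below uses invented toy distributions over a three-word vocabulary; the argument names (`switch`, `p_doc`, `p_bg`) are labels chosen for this example.

```python
def swb_word_prob(w, switch, p_w_z, p_z_d, p_doc, p_bg):
    """p(w|d) as a mixture of topic, document-specific and background parts.

    switch = [p(x=0|d), p(x=1|d), p(x=2|d)] selects which part generates the word.
    """
    topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
    return switch[0] * topic_part + switch[1] * p_doc[w] + switch[2] * p_bg[w]


# Hypothetical model: 3 words, 2 topics.
p_w_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # p(w|z=k)
p_z_d = [0.5, 0.5]                           # p(z=k|d)
p_doc = [0.0, 1.0, 0.0]                      # document-specific: only word 1
p_bg = [0.2, 0.3, 0.5]                       # background distribution
switch = [0.6, 0.3, 0.1]

probs = [swb_word_prob(w, switch, p_w_z, p_z_d, p_doc, p_bg) for w in range(3)]
print(probs)
```

Word 1 receives extra mass from the document-specific part, which is the behaviour that lets such a model keep rare, document-defining words outside the shared topics.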
Search (Online)
[Figure: online search. A query (a low-dimensional vector plus a few words) is looked up in the LSH index, which returns candidates such as Doc 300 and Doc 401; the candidates are then re-ranked using the doc meta (entries DS1, DS2, … for Doc 1 … Doc N)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
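The two-stage lookup can be sketched with a simple hash-then-re-rank loop. The slides only name an "LSH Index", so the random-hyperplane hash family used here, the single hash table and the cosine re-ranking are all assumptions made for illustration, as are the toy vectors.

```python
import math
import random
from collections import defaultdict


def make_signature(dim, n_bits, seed=0):
    """Return a function mapping a vector to an n_bits sign pattern (random hyperplanes)."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def signature(v):
        return tuple(1 if sum(a * b for a, b in zip(plane, v)) >= 0 else 0
                     for plane in planes)
    return signature


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical low-dimensional document vectors.
docs = {300: [0.9, 0.1, 0.0], 401: [0.8, 0.2, 0.1], 7: [0.0, 0.1, 0.9]}
signature = make_signature(dim=3, n_bits=4)

buckets = defaultdict(list)            # offline: hash every document into a bucket
for doc_id, vec in docs.items():
    buckets[signature(vec)].append(doc_id)


def search(query):
    """Online: fetch the query's bucket, then re-rank the candidates by cosine similarity."""
    candidates = buckets.get(signature(query), [])
    return sorted(candidates, key=lambda d: -cosine(query, docs[d]))


print(search(docs[300]))
```

A production index would typically use several hash tables to raise recall, and the re-ranking step could additionally use the stored residual words.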
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
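The three simplest baselines can be written down in a few lines. This sketch interprets NP as "reply to the immediately preceding post" and DS as "reply to the earlier post with the highest bag-of-words cosine similarity"; both readings, the whitespace tokenisation and the toy thread are assumptions, not details given in the slides.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict_parents(posts, method):
    """For every non-root post i, return the index of the post it is predicted to reply to."""
    bags = [Counter(p.lower().split()) for p in posts]
    parents = {}
    for i in range(1, len(posts)):
        if method == "NP":      # reply to nearest (immediately preceding) post
            parents[i] = i - 1
        elif method == "RR":    # reply to root
            parents[i] = 0
        elif method == "DS":    # reply to the most similar earlier post
            parents[i] = max(range(i), key=lambda j: cosine(bags[i], bags[j]))
        else:
            raise ValueError(method)
    return parents


# Hypothetical thread; post 0 is the root.
posts = [
    "which laptop has the best battery",
    "the new macbook battery lasts ten hours",
    "linux support on that macbook is poor",
    "ten hours of battery is plenty for travel",
]
for method in ("NP", "RR", "DS"):
    print(method, predict_parents(posts, method))
```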
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  – HITS
  – PageRank
  – …
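Once a reply network has been built, a link-analysis method can rank users. The sketch below runs the HITS power iteration on a small invented reply graph and picks the node with the highest authority score. Treating an edge (u, v) as "user u replied to user v" is an assumption of this example.

```python
def hits(edges, n, iters=50):
    """Power iteration for HITS on a directed graph with nodes 0..n-1."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority: sum of hub scores of the nodes pointing at i.
        auth = [sum(hub[u] for u, v in edges if v == i) for i in range(n)]
        norm = sum(a * a for a in auth) ** 0.5
        auth = [a / norm for a in auth]
        # Hub: sum of authority scores of the nodes i points at.
        hub = [sum(auth[v] for u, v in edges if u == i) for i in range(n)]
        norm = sum(h * h for h in hub) ** 0.5
        hub = [h / norm for h in hub]
    return hub, auth


# Hypothetical reply network: an edge (u, v) means user u replied to user v.
edges = [(1, 0), (2, 0), (3, 0), (3, 1), (2, 1), (0, 4)]
hub, auth = hits(edges, n=5)
expert = max(range(5), key=lambda i: auth[i])
print(expert, [round(a, 3) for a in auth])
```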
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
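For reference, the three column metrics can be computed per query as below, using their usual definitions. MRR and MAP are then the means of the per-query reciprocal rank and average precision. Reading the third column as precision at 10, and the toy ranking, are assumptions of this sketch.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant items appear."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


# Hypothetical ranking of users for one query; u1 and u2 are the true experts.
ranked = ["u3", "u1", "u7", "u2", "u9"]
relevant = {"u1", "u2"}
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
p_at_5 = precision_at_k(ranked, relevant, 5)
print(rr, ap, p_at_5)
```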
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: good practice for learning matrix methods
  – Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition – Eigenvalue & Eigenvector
- Definition – Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition – Eigenface
- Slide 14
- Slide 15
- PageRank – Power Iteration
- Column-Stochastic & Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF – Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
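Jensen's Inequality
Lower Bounding the Log Likelihood
• Observed data D = {y_n}; latent variables X = {x_n}; parameters θ
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ:
$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = \int q(X) \log p(X, D \mid \theta)\, dX + H[q] \equiv F(q, \theta)$
• where H[q] is the entropy of q(X)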
The E and M Steps of EM
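• The lower bound on the log likelihood is given by
$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$
• EM alternates between
  • E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed: $q^{(k)}(X) = \arg\max_{q} F(q, \theta^{(k-1)})$
  • M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: $\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta)$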
The E Step
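• E step: for fixed θ,
$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0
• So the E step simply sets q(X) = p(X|D, θ)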
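The relation between the bound, the log likelihood and the KL divergence can be verified numerically for a discrete latent variable. In this sketch the three joint values p(X = x, D | θ) and the trial distribution `q` are arbitrary illustrative numbers.

```python
import math

# Hypothetical joint p(X = x, D | theta) for a latent X with three states.
# D is fixed to the observed data, so the joint is just three numbers.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(D | theta)
log_lik = math.log(evidence)                 # L(theta)
posterior = [j / evidence for j in joint]    # p(X | D, theta)


def free_energy(q):
    """F(q, theta) = sum_x q(x) log p(x, D | theta) + H[q]."""
    return sum(qx * (math.log(jx) - math.log(qx))
               for qx, jx in zip(q, joint) if qx > 0)


def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


q = [0.5, 0.3, 0.2]                          # an arbitrary trial distribution
gap = log_lik - free_energy(q)               # equals KL(q || posterior)
print(gap, kl(q, posterior), free_energy(posterior) - log_lik)
```

Setting q to the posterior closes the gap, which is exactly what the E step does.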
The M Step
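• M step: maximize F w.r.t. the parameters, holding the hidden distribution q fixed:
$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}, \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum w.r.t. θ can be found analytically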
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood:
$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) w.r.t. θ
• F(q, θ) ≤ L(θ) by Jensen, or equivalently from the non-negativity of KL
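The monotonicity claim can be observed on a small example. The sketch below runs EM on a two-component one-dimensional Gaussian mixture with unit variances (a model chosen only because its M step is available in closed form) and records the log likelihood at each iteration. The data and starting values are invented.

```python
import math


def em_gmm_1d(data, mu, pi, iters=20):
    """EM for a two-component 1-D Gaussian mixture with unit variances."""
    def pdf(x, m):
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

    log_liks = []
    for _ in range(iters):
        # E step: responsibilities q(x_n = first component) = p(x_n | y_n, theta).
        resp, ll = [], 0.0
        for y in data:
            a = pi * pdf(y, mu[0])
            b = (1 - pi) * pdf(y, mu[1])
            resp.append(a / (a + b))
            ll += math.log(a + b)
        log_liks.append(ll)   # log likelihood under the current parameters
        # M step: closed-form maximisers of the expected complete-data log likelihood.
        n0 = sum(resp)
        n1 = len(data) - n0
        mu = [sum(r * y for r, y in zip(resp, data)) / n0,
              sum((1 - r) * y for r, y in zip(resp, data)) / n1]
        pi = n0 / len(data)
    return mu, pi, log_liks


data = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.5, 1.7, 2.0]
mu, pi, log_liks = em_gmm_1d(data, mu=[-0.5, 0.5], pi=0.5)
print(mu, pi, log_liks[0], log_liks[-1])
```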
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  – A graphical model looks complex even with only a few circles…
  – We have to make too many assumptions
• Pros
  – We do need probability to explain our world, but joint probability is hard to compute
  – Graphical models can help us analyze and understand our problems
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  – With a graphical model we can decouple a joint probability into conditional probabilities, which are usually easier to handle
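Lower Bounding the Log Likelihood
• Observed data $D = \{y_n\}$; latent variables $X = \{x_n\}$; parameters $\theta$
• Goal: maximize the log likelihood (i.e. ML learning) w.r.t. $\theta$
$$L(\theta) = \log p(D \mid \theta) = \log \int p(X, D \mid \theta)\, dX$$
• Any distribution $q(X)$ over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality
$$L(\theta) = \log \int q(X) \frac{p(X, D \mid \theta)}{q(X)}\, dX \ge \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX = F(q, \theta)$$
$$F(q, \theta) = \int q(X) \log p(X, D \mid \theta)\, dX + H[q]$$
• where $H[q]$ is the entropy of $q(X)$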
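The E and M Steps of EM
• The lower bound on the log likelihood is given by
$$F(q, \theta) = \int q(X) \log \frac{p(X, D \mid \theta)}{q(X)}\, dX$$
• EM alternates between
  - E step: optimize $F(q, \theta)$ w.r.t. the distribution over hidden variables, holding the parameters fixed
$$q^{(k)}(X) = \arg\max_{q(X)} F(q(X), \theta^{(k-1)})$$
  - M step: maximize $F(q, \theta)$ w.r.t. the parameters, holding the hidden distribution fixed
$$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}(X), \theta)$$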
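The E Step
• E step: for fixed $\theta$
$$F(q, \theta) = \int q(X) \log \frac{p(X \mid D, \theta)\, p(D \mid \theta)}{q(X)}\, dX = L(\theta) - KL\big(q(X) \,\|\, p(X \mid D, \theta)\big), \qquad L(\theta) = \log p(D \mid \theta)$$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed $\theta$, $F$ is bounded above by $L$ and achieves that bound when $KL\big(q(X) \,\|\, p(X \mid D, \theta)\big) = 0$
• So the E step simply sets $q^{(k)}(X) = p(X \mid D, \theta^{(k-1)})$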
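The M Step
• M step: maximize $F(q, \theta)$ w.r.t. the parameters, holding the hidden distribution $q$ fixed
$$\theta^{(k)} = \arg\max_{\theta} F(q^{(k)}(X), \theta) = \arg\max_{\theta} \int q^{(k)}(X) \log p(X, D \mid \theta)\, dX$$
• The second equality comes from the fact that the entropy of $q(X)$ does not depend directly on $\theta$
• The specific form of the M step depends on the model. Often the maximum w.r.t. $\theta$ can be found analytically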
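EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
$$L(\theta^{(k-1)}) = F(q^{(k)}, \theta^{(k-1)}) \le F(q^{(k)}, \theta^{(k)}) \le L(\theta^{(k)})$$
• The E step brings $F(q, \theta)$ up to the likelihood $L(\theta)$ (the bound touches the top)
• The M step maximizes $F(q, \theta)$ w.r.t. $\theta$ (the bound is lifted)
• $F(q, \theta) \le L(\theta)$ by Jensen's inequality, or equivalently from the non-negativity of KL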
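The monotonicity claim can be checked numerically. The sketch below is only an illustration and is not taken from the slides: it assumes a two-component Gaussian mixture with unit variances, and the names `em_step`, `run_em` and the synthetic data are made up for this example. It records the log likelihood after every EM iteration so that the non-decreasing behaviour can be inspected.

```python
import math
import random

def log_likelihood(data, pi, mu0, mu1):
    """Log likelihood L(theta) of a two-component, unit-variance Gaussian mixture."""
    total = 0.0
    for y in data:
        p0 = (1 - pi) * math.exp(-0.5 * (y - mu0) ** 2)
        p1 = pi * math.exp(-0.5 * (y - mu1) ** 2)
        total += math.log((p0 + p1) / math.sqrt(2 * math.pi))
    return total

def em_step(data, pi, mu0, mu1):
    # E step: set q(x_n) to the posterior p(x_n | y_n, theta) of component 1.
    resp = []
    for y in data:
        p0 = (1 - pi) * math.exp(-0.5 * (y - mu0) ** 2)
        p1 = pi * math.exp(-0.5 * (y - mu1) ** 2)
        resp.append(p1 / (p0 + p1))
    # M step: closed-form maximizer of the expected complete-data log likelihood.
    n1 = sum(resp)
    n0 = len(data) - n1
    new_mu1 = sum(r * y for r, y in zip(resp, data)) / n1
    new_mu0 = sum((1 - r) * y for r, y in zip(resp, data)) / n0
    return n1 / len(data), new_mu0, new_mu1

def run_em(data, iterations=30):
    # Assumed initialization: equal weights, means at the data extremes.
    pi, mu0, mu1 = 0.5, min(data), max(data)
    history = [log_likelihood(data, pi, mu0, mu1)]
    for _ in range(iterations):
        pi, mu0, mu1 = em_step(data, pi, mu0, mu1)
        history.append(log_likelihood(data, pi, mu0, mu1))
    return (pi, mu0, mu1), history

# Synthetic data (an assumption for the demo): two clusters around -2 and 3.
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
params, history = run_em(data)
```

Each entry of `history` should be at least as large as the previous one, up to floating-point rounding, which is the chain of inequalities above evaluated on a concrete model.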
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model looks complex even with only a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but the joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model we can decouple the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
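• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
$$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$$
• In general
$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$$
• where $pa(i)$ are the parents of node $i$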
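To make the factorization concrete, here is a small sketch for the five-node network above. It assumes binary variables and fills the conditional probability tables with random numbers, both of which are choices made for this illustration only; `random_cpt` and `joint` are hypothetical helper names. The check at the end sums the factorized joint over all 32 assignments.

```python
import itertools
import random

rng = random.Random(1)

def random_cpt(num_parents):
    """Return P(node = 1 | parents) for every binary parent assignment."""
    return {parents: rng.random()
            for parents in itertools.product((0, 1), repeat=num_parents)}

# Parents pa(i) of each node in p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D).
PARENTS = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}
CPTS = {node: random_cpt(len(pa)) for node, pa in PARENTS.items()}

def joint(assignment):
    """p(A,B,C,D,E) as the product of p(X_i | X_pa(i)) over all nodes."""
    prob = 1.0
    for node, pa in PARENTS.items():
        p_one = CPTS[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

names = list(PARENTS)
total = sum(joint(dict(zip(names, values)))
            for values in itertools.product((0, 1), repeat=len(names)))
num_params = sum(len(cpt) for cpt in CPTS.values())
```

Here `total` comes out as 1 up to rounding, and `num_params` counts 14 table entries, compared with the 31 free parameters an unrestricted joint over five binary variables would need.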
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    · "Java" could mean "coffee" and also the programming language Java
    · "Cricket" is a "game" and also an "insect"
  - Synonymy
    · "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
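pLSA
[Figure: the pLSA graphical model d → z → w in plate notation (plates N_d and M), next to its unrolled form d → z_1 … z_N → w_1 … w_N]
• $z_1, \ldots, z_N$ are variables with $z_i \in [1, K]$, where $K$ is the number of latent topics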
pLSA
[Figure: the unrolled pLSA model for documents d_1, d_2, …, d_M, where document d_m generates topics z_1 … z_{N_m} and words w_1 … w_{N_m}]
• $p(w \mid z = 1), p(w \mid z = 2), \ldots, p(w \mid z = K)$ are shared by all documents
Likelihood
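Joint Probability vs Likelihood
• Joint probability
$$p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$$
• Likelihood (only for observed variables)
$$p(d, w) = p(d) \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z)$$
• $p(d)$ is assumed to be uniform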
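Document Decomposition
• Each document can be decomposed as
$$p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$$p(w \mid d) = Z_{V \times K}\, p(z \mid d)$$
  where column $k$ of $Z$ is the topic distribution $p(w \mid z = k)$ over the $V$ vocabulary words
• With many documents, we hope to find latent topics as a common basis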
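The matrix view of the decomposition can be written out directly. The numbers below are invented for illustration (a 4-word vocabulary and 2 topics), and `decompose` is a hypothetical helper name; the point is only that multiplying the topic matrix by a document's topic proportions yields a valid word distribution.

```python
# Topic-word matrix Z (V x K): column k holds p(w | z = k) over a V-word vocabulary.
Z = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
# Topic proportions p(z | d) for one document (length K).
p_z_given_d = [0.25, 0.75]

def decompose(Z, p_z_given_d):
    """Return p(w | d) = sum_z p(w | z) p(z | d), i.e. the matrix-vector product Z p(z | d)."""
    return [sum(row[k] * p_z_given_d[k] for k in range(len(p_z_given_d))) for row in Z]

p_w_given_d = decompose(Z, p_z_given_d)
```

Because every column of `Z` and the vector `p_z_given_d` each sum to one, the result `p_w_given_d` also sums to one, so the document is a convex combination of the topic columns.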
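pLSA - Objective Function
• pLSA tries to maximize the log likelihood
$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w) = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]$$
  where $n(d, w)$ is the number of times word $w$ occurs in document $d$
• Due to the summation over $z$ inside the log, we have to resort to EM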
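EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
$$p(z \mid d, w) = \frac{p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(z' \mid d)\, p(w \mid z')}$$
• The M-Step, where $n(d, w)$ is the count of word $w$ in document $d$
$$p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w), \qquad p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)$$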
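A compact sketch of these two steps is given below. It follows the standard pLSA updates rather than any code from the talk; the function name `plsa`, the random initialization and the toy count matrix are assumptions for illustration. The tracked quantity is the log likelihood without the constant p(d) term.

```python
import math
import random

def plsa(counts, num_topics, iterations=50, seed=0):
    """Fit p(w|z) and p(z|d) to a document-term count matrix n(d, w) with EM."""
    rng = random.Random(seed)
    D, V, K = len(counts), len(counts[0]), num_topics

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]  # K x V
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]  # D x K
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * V for _ in range(K)]
        new_z_d = [[0.0] * K for _ in range(D)]
        loglik = 0.0
        for d in range(D):
            for w in range(V):
                if counts[d][w] == 0:
                    continue
                # E step: posterior p(z | d, w) is proportional to p(z | d) p(w | z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(K)]
                p_w_d = sum(joint)
                loglik += counts[d][w] * math.log(p_w_d)
                for z in range(K):
                    weighted = counts[d][w] * joint[z] / p_w_d
                    new_w_z[z][w] += weighted  # sum over d of n(d,w) p(z|d,w)
                    new_z_d[d][z] += weighted  # sum over w of n(d,w) p(z|d,w)
        history.append(loglik)  # log likelihood under the parameters before this update
        # M step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history

# Toy counts (made up): documents 0-1 mostly use words 0-1, documents 2-3 words 2-3.
counts = [
    [4, 3, 0, 0],
    [5, 2, 1, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 5],
]
p_w_z, p_z_d, history = plsa(counts, num_topics=2)
```

The rows of `p_w_z` play the role of the shared topics, the rows of `p_z_d` are the per-document coordinates, and `history` should be non-decreasing because each iteration is one E step followed by one M step.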
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (doc to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: the unrolled pLSA model for documents d_1, d_2, …, d_M, each with its own unconstrained topic distribution p(z|d_1), p(z|d_2), …, p(z|d_M)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
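• Definition
$$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{K} x_i = 1,\; x_i > 0$$
• The density is zero outside this open (K - 1)-dimensional simplex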
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
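The effect of α_0 can be explored by sampling. The sketch below uses the standard construction of a Dirichlet sample from independent Gamma draws; the helper names and the summary statistic (the average largest coordinate of a sample) are choices made for this illustration and do not come from the slides.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def mean_largest_coordinate(alpha0, k=3, num_samples=2000, seed=0):
    """Average of max_i x_i over samples with equal alpha_i = alpha0 / k."""
    rng = random.Random(seed)
    alpha = [alpha0 / k] * k
    samples = [sample_dirichlet(alpha, rng) for _ in range(num_samples)]
    return sum(max(s) for s in samples) / num_samples

one_sample = sample_dirichlet([6.0, 2.0, 2.0], random.Random(0))
results = {alpha0: mean_largest_coordinate(alpha0) for alpha0 in (0.1, 1.0, 10.0)}
for alpha0, value in results.items():
    print(f"alpha_0 = {alpha0}: average largest coordinate = {value:.2f}")
```

Every sample is a point on the simplex. A small α_0 pushes samples towards the corners (one coordinate close to 1), while a large α_0 concentrates them near the centre, which is the behaviour the example plots illustrate.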
The LDA Model
[Figure: the LDA graphical model with hyperparameters α and β; each document has a topic mixture θ that generates topics z_1 … z_4 and words w_1 … w_4]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words w_n
    · Choose a topic z_n ~ Multinomial(θ)
    · Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
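The generative process can be simulated in a few lines. This is a sketch under assumed toy parameters: the values of `alpha` and `beta`, the 6-word vocabulary and the function names are invented for illustration, and words are represented by their vocabulary indices.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, num_words, rng):
    """theta ~ Dirichlet(alpha); then for each word z_n ~ Multinomial(theta), w_n ~ p(w | z_n, beta)."""
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(num_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Assumed toy parameters: K = 3 topics over a 6-word vocabulary; row k of beta is p(w | z = k).
alpha = [0.5, 0.5, 0.5]
beta = [
    [0.4, 0.4, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.4, 0.4, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.4, 0.4],
]
theta, topics, words = generate_document(alpha, beta, num_words=20, rng=random.Random(0))
```

Unlike pLSA, the per-document mixture `theta` is itself a random draw governed by `alpha`, so a new document needs no new parameters.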
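Joint Probability
• Given the parameters α and β
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
  where $p(z_n \mid \theta)$ is $\theta_i$ for the topic $i$ selected by $z_n$
Likelihood
• Joint probability
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
• Marginal distribution of a document
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big) d\theta$$
• Likelihood over all the documents
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big) d\theta_d$$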
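Inference
• The log likelihood of the corpus can be computed by summing over the documents
$$\log p(D \mid \alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)$$
• Jensen's inequality, as in EM, gives a lower bound for each document
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables, $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach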
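Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
$$\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$$
• Maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to γ and φ is equivalent to minimizing the KL divergence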
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, VBEM approximates p(X|D, θ) by a variational distribution q(X), chosen by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound on L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    · E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    · M: maximize the bound with respect to the parameters α and β
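Parameter Estimation
• E-Step: variational inference, repeated until convergence
$$\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$
  where Ψ is the digamma function
• M-Step: parameter estimation
$$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$$
  where $w_{dn}^{j} = 1$ if the n-th word of document d is vocabulary word j, and 0 otherwise
  - The update for α can be implemented using the Newton-Raphson method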
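The E-step for a single document can be sketched as a fixed-point iteration on φ and γ, following the update equations of Blei, Ng and Jordan (2003). Everything concrete here is an assumption for illustration: the toy `alpha`, `beta` and word list, the function names, and the small series approximation of the digamma function that stands in for a library call.

```python
import math

def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (adequate for x > 0)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def variational_e_step(words, alpha, beta, iterations=50):
    """Alternate the phi and gamma updates for one document; words are vocabulary indices."""
    K, N = len(alpha), len(words)
    phi = [[1.0 / K] * K for _ in range(N)]
    gamma = [a + N / K for a in alpha]
    for _ in range(iterations):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(words):
            # phi_ni is proportional to beta_{i, w_n} * exp(digamma(gamma_i)).
            row = [beta[i][w] * weights[i] for i in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]
        # gamma_i = alpha_i + sum_n phi_ni.
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi

# Assumed toy model: 2 topics over a 3-word vocabulary; row i of beta is p(w | z = i).
alpha = [0.5, 0.5]
beta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
words = [0, 0, 1, 0, 2, 0]
gamma, phi = variational_e_step(words, alpha, beta)
```

Since each row of `phi` sums to one, `gamma` always sums to the total of `alpha` plus the document length, so it can be read as prior pseudo-counts plus the expected number of words assigned to each topic.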
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model with topics z_1 … z_4 and words w_1 … w_4 per document and word parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with labels M, N_d, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - We need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection

        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2        (Dim = 1 million)

        f1     f2     …      fM
  Img1  0.2    0.1    …      0.03               (Dim = 200)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = a low-dimensional feature histogram + a few words (10 words)]
Orthogonal Decomposition
$$p = Xw + \varepsilon$$
• Columns $x_1, x_2, x_3, \ldots, x_k$ of X: base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional feature histogram + a few words (10 words)]
• The inner product of two documents p and q splits into a low-dimensional part and a residual part
$$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$
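The inner-product identity can be verified on a small example. The sketch assumes that the base vectors are orthonormal and that the residual is whatever remains after projecting onto them; this is one reading of the slide, not a confirmed detail of the paper. The vectors and the helper names `dot` and `decompose` are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Base vectors x_1, x_2 (the columns of X), stored as rows here; assumed orthonormal.
X = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]

def decompose(p, basis):
    """Split p into w = X^T p and the residual eps = p - X w."""
    w = [dot(x, p) for x in basis]
    projection = [sum(w[k] * basis[k][i] for k in range(len(basis)))
                  for i in range(len(p))]
    eps = [pi - qi for pi, qi in zip(p, projection)]
    return w, eps

p = [3.0, 0.0, 1.0, 2.0]
q = [1.0, 2.0, 0.0, 4.0]
w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)

exact = dot(p, q)
reconstructed = dot(w_p, w_q) + dot(eps_p, eps_q)
```

Under these assumptions `exact` and `reconstructed` agree, because each residual is orthogonal to every base vector and the cross terms vanish. Keeping a sparse residual next to the low-dimensional vector therefore preserves the original similarity.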
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
  - a topic-specific distribution
  - a document-specific distribution
  - a background distribution
$$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$$
  where $p'(w \mid d)$ is the document-specific distribution and $p''(w)$ is the background distribution
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional feature histogram + a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, …); the candidates are then re-ranked using the per-document meta data (entries labeled DS1, DS2, …)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: topics
Structure: who replies to whom
Optimize them together
Model semantics
Model structure
Reply reconstruction
Document similarity
Topic similarity
Structure similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Projects documents to topic space
SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
          Slashdot                 Apple
Method    All Posts  Good Posts    All Posts  Good Posts
NP        0.021      0.012         0.289      0.239
RR        0.183      0.319         0.269      0.474
DS        0.463      0.643         0.409      0.628
LDA       0.465      0.644         0.410      0.648
SWB       0.463      0.644         0.410      0.641
SMSS      0.524      0.737         0.517      0.772
Expert finding
• Steps: reply reconstruction, network construction, expert finding
• Methods: HITS, PageRank, …
Baselines
LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  Achieves stable performance in the expert finding task using a language model
PageRank
  Benchmark nodal ranking method
HITS
  Finds hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
The E and M Steps of EM
bull The lower bound on the log likelihood is given by
bull EM alternates betweenbull E step optimize wrt distribution over hidden variables
holding params fixed
bull M step maximize wrt parameters holding hidden distribution fixed
The E Step
bull E step for fixed θ
bull The second term is the Kullback-Leibler divergencebull This means that for fixed θ F is bounded above by L and achieves
that bound whenbull So the E step simply sets
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)
  Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2
  Topic space after topic projection (Dim = 200):
        f1     f2     …      fM
  Img1  0.2    0.1    …      0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional histogram (horizontal axis 0 to 1, vertical axis 0 to 16) + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$
  - X: base vectors
  - w: low dimensional representation
  - ε: residual
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$
[Figure: base vectors X1, X2, X3, …, Xk]
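A minimal sketch of the orthogonal decomposition, assuming the base vectors are orthonormal and the residual is whatever is left after projecting onto them. The 4-dimensional vectors, the basis, and the helper names (`dot`, `decompose`) are hypothetical toy values, not taken from the paper. Under these assumptions the inner product of two original vectors equals the inner product of their low dimensional parts plus the inner product of their residuals.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual."""
    w = [dot(p, x) for x in basis]
    projection = [sum(wk * x[i] for wk, x in zip(w, basis)) for i in range(len(p))]
    residual = [pi - qi for pi, qi in zip(p, projection)]
    return w, residual

# Hypothetical orthonormal basis spanning a 2-D subspace of a 4-D vocabulary space.
s = 1 / math.sqrt(2)
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, s, s]]

p = [1.0, 2.0, 0.0, 3.0]
q = [2.0, 0.0, 1.0, 1.0]
w_p, e_p = decompose(p, basis)
w_q, e_q = decompose(q, basis)

exact = dot(p, q)                          # similarity in the original space
rebuilt = dot(w_p, w_q) + dot(e_p, e_q)    # low dimensional part + residual part
```

Keeping only a few large residual entries (the "few words") would make `rebuilt` an approximation rather than an exact value.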
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
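The switch-variable mixture can be sketched as follows. This is only an illustration of the three-way mixture: the function name, the three-word vocabulary, and all probability values are hypothetical, and the distributions are assumed to be already estimated.

```python
def word_probability(word, switch, topic_mix, topic_word, doc_word, background):
    """p(w|d) as a mixture chosen by the switch variable x (0, 1 or 2)."""
    from_topics = sum(
        topic_mix[k] * topic_word[k].get(word, 0.0) for k in range(len(topic_mix))
    )
    return (
        switch[0] * from_topics                  # x = 0: topic specific
        + switch[1] * doc_word.get(word, 0.0)    # x = 1: document specific
        + switch[2] * background.get(word, 0.0)  # x = 2: background
    )

# Hypothetical toy model with K = 2 topics.
vocabulary = ["cat", "dog", "the"]
topic_word = [
    {"cat": 0.7, "dog": 0.2, "the": 0.1},
    {"cat": 0.1, "dog": 0.8, "the": 0.1},
]
topic_mix = [0.5, 0.5]                    # p(z = k | d)
doc_word = {"cat": 1.0}                   # document specific distribution
background = {"cat": 0.1, "dog": 0.1, "the": 0.8}
switch = [0.6, 0.1, 0.3]                  # p(x = 0|d), p(x = 1|d), p(x = 2|d)

probabilities = {
    w: word_probability(w, switch, topic_mix, topic_word, doc_word, background)
    for w in vocabulary
}
```

Because each component is a distribution over the vocabulary and the switch probabilities sum to one, the mixture is again a distribution.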
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional histogram + a few words) is looked up in the LSH index, which returns candidates such as Doc 300, Doc 401, …; the candidates are re-ranked using the document metadata (Doc Meta: DS1, DS2, …) stored for Doc 1, Doc 2, …, Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori)       0.674   0.362   0.243
EABIF(rec)       0.742   0.318   0.281
PageRank(ori)    0.675   0.377   0.263
PageRank(rec)    0.743   0.321   0.266
HITS(ori)        0.906   0.832   0.900
HITS(rec)        0.938   0.822   0.906
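For readers unfamiliar with the three column headers, here is a minimal sketch of the usual per-query definitions, assuming "P10" denotes precision at rank 10. The function names and the toy ranking are hypothetical. MRR and MAP would be the means of `reciprocal_rank` and `average_precision` over all queries.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking of users for one query, with two true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}

rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```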
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrix
  - Graphical model: a good practice to learn probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
The E Step
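• E step: for fixed θ, maximize F(q, θ) with respect to q, where
  $F(q, \theta) = L(\theta) - \mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big)$
• The second term is the Kullback-Leibler divergence
• This means that, for fixed θ, F is bounded above by L and achieves that bound when $\mathrm{KL}\big(q(X) \,\|\, p(X \mid D, \theta)\big) = 0$
• So the E step simply sets $q(X) = p(X \mid D, \theta)$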
The M Step
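• M step: maximize F(q, θ) with respect to the parameters θ, holding the hidden distribution q fixed
  $\theta \leftarrow \arg\max_{\theta} F(q, \theta) = \arg\max_{\theta} \int q(X) \log p(X, D \mid \theta)\, dX$
• The second equality comes from the fact that the entropy of q(X) does not depend directly on θ
• The specific form of the M step depends on the model. Often the maximum with respect to θ can be found analytically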
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E step brings F(q, θ) up to the likelihood L(θ)
• The M step maximizes F(q, θ) with respect to θ
• F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently from the non-negativity of KL
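The monotonicity claim can be checked numerically. The sketch below runs EM on a hypothetical mixture of two biased coins: each data point is (heads, flips) from one coin of unknown identity. The model, data, starting values, and function names are illustrative assumptions, not from the slides. The binomial coefficient is omitted because it does not depend on the parameters.

```python
import math

def log_likelihood(data, pi, p_a, p_b):
    """Log likelihood L(theta), up to a constant, with the coin identity summed out."""
    total = 0.0
    for heads, n in data:
        like_a = p_a ** heads * (1 - p_a) ** (n - heads)
        like_b = p_b ** heads * (1 - p_b) ** (n - heads)
        total += math.log(pi * like_a + (1 - pi) * like_b)
    return total

def em_step(data, pi, p_a, p_b):
    # E step: set q(X) to the posterior over the hidden coin identity.
    resp = []
    for heads, n in data:
        a = pi * p_a ** heads * (1 - p_a) ** (n - heads)
        b = (1 - pi) * p_b ** heads * (1 - p_b) ** (n - heads)
        resp.append(a / (a + b))
    # M step: maximize the expected complete-data log likelihood analytically.
    heads_a = sum(r * h for r, (h, _) in zip(resp, data))
    flips_a = sum(r * n for r, (_, n) in zip(resp, data))
    heads_b = sum((1 - r) * h for r, (h, _) in zip(resp, data))
    flips_b = sum((1 - r) * n for r, (_, n) in zip(resp, data))
    return sum(resp) / len(data), heads_a / flips_a, heads_b / flips_b

data = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
theta = (0.5, 0.6, 0.5)          # initial (pi, p_a, p_b)
history = [log_likelihood(data, *theta)]
for _ in range(20):
    theta = em_step(data, *theta)
    history.append(log_likelihood(data, *theta))
```

The values in `history` should form a non-decreasing sequence, which is the property stated above.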
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides).
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
  - A graphical model is so complex, even with a few circles…
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model, we can decouple the joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
• In general, $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid x_{\mathrm{pa}(i)}\big)$
• where pa(i) are the parents of node i
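A small sketch of this factorization for the five-node example, assuming binary variables. The conditional probability tables are hypothetical numbers chosen only for illustration. The joint is built as the product of one factor per node given its parents, and summing it over all 32 assignments gives one.

```python
from itertools import product

# Hypothetical conditional probability tables for binary A, B, C, D, E.
P_A = 0.3                                                     # p(A = 1)
P_B = 0.6                                                     # p(B = 1)
P_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # p(C = 1 | A, B)
P_D = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}    # p(D = 1 | B, C)
P_E = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}  # p(E = 1 | C, D)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (
        bernoulli(P_A, a)
        * bernoulli(P_B, b)
        * bernoulli(P_C[(a, b)], c)
        * bernoulli(P_D[(b, c)], d)
        * bernoulli(P_E[(c, d)], e)
    )

total = sum(joint(*values) for values in product((0, 1), repeat=5))
num_parameters = 1 + 1 + len(P_C) + len(P_D) + len(P_E)
```

The factorized model here needs 14 numbers, whereas a full table over five binary variables needs 31.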
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal styles
• Latent semantic indexing
  - Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    - Java could mean "coffee" and also the programming language Java
    - Cricket is a "game" and also an "insect"
  - Synonymy
    - "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
pLSA
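[Figure: pLSA plate diagram (d, z, w with plates Nd and M) and its unrolled form for one document d with topics z1, z2, z3, …, zN and words w1, w2, w3, …, wN]
z1, …, zN are variables with zi ∈ [1, K], where K is the number of latent topics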
pLSA
[Figure: unrolled pLSA model for documents d1, d2, …, dM, where document dm has topics z1, …, zNm and words w1, …, wNm]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector: $p(w \mid d) = Z_{V \times k}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
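The document decomposition can be sketched as a matrix-vector product. The topic-word table (the columns of Z), the topic mixture, and the function name `mix` are hypothetical toy values with a 4-word vocabulary and k = 2 topics.

```python
# Each row is one topic's word distribution p(w | z), i.e. one column of Z.
topic_word = [
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]
doc_topic = [0.25, 0.75]  # p(z | d): coordinates of the document in topic space

def mix(topic_word, doc_topic):
    """p(w | d) = sum over z of p(w | z) p(z | d)."""
    vocab_size = len(topic_word[0])
    return [
        sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
        for w in range(vocab_size)
    ]

p_w_given_d = mix(topic_word, doc_topic)
```

The result is a convex combination of the topic distributions, so it is itself a distribution over the vocabulary.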
pLSA ndash Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - Expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
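A minimal sketch of pLSA fitted by EM, assuming the standard updates: the E step computes the posterior p(z | d, w) proportional to p(w | z) p(z | d), and the M step re-estimates p(w | z) and p(z | d) from the expected counts n(d, w) p(z | d, w). The count matrix, the random initialization, and all names are hypothetical. The log likelihood recorded here is the sum over d and w of n(d, w) log p(w | d), which drops the constant p(d) term.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, num_topics, iterations=30, seed=0):
    """Fit p(w|z) and p(z|d) to a document-word count matrix with EM."""
    rng = random.Random(seed)
    num_docs, vocab = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(vocab)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * vocab for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        loglik = 0.0
        for d in range(num_docs):
            for w in range(vocab):
                if counts[d][w] == 0:
                    continue
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(num_topics)]
                total = sum(joint)
                loglik += counts[d][w] * math.log(total)
                for k in range(num_topics):
                    # E step: expected count n(d, w) * p(z = k | d, w)
                    expected = counts[d][w] * joint[k] / total
                    new_w_z[k][w] += expected
                    new_z_d[d][k] += expected
        history.append(loglik)  # log likelihood before this iteration's update
        # M step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history

# Hypothetical counts: 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 1], [5, 2, 1, 0], [0, 1, 4, 3], [1, 0, 3, 5]]
p_w_z, p_z_d, history = plsa(counts, num_topics=2)
```

Since this is ordinary EM, the recorded log likelihood should never decrease from one iteration to the next.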
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - E matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d1, d2, …, dm, each with its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
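• Definition
  $p(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{k} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex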
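Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities on the simplex for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different α0, where $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities on the simplex for α0 = 0.1, α0 = 1, α0 = 10]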
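A small sketch of drawing from a Dirichlet distribution with the standard library, using the well-known construction of normalizing independent Gamma(αi, 1) draws. The parameter α = (6, 2, 2) is one of the examples listed above; the sample size, seed, and function name are arbitrary choices. Each draw lies on the simplex, and the sample mean should be close to αi / α0.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

alpha_0 = sum(alpha)
expected_mean = [a / alpha_0 for a in alpha]
sample_mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]
```

With a small α0 the draws concentrate near the corners of the simplex, and with a large α0 they concentrate near the mean.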
The LDA Model
[Figure: LDA graphical model for three documents, each with topic nodes z1 to z4 and word nodes w1 to w4, plus a node labeled b]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
The LDA Model
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
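The generative process can be sketched directly, assuming θ denotes the per-document topic proportions, α the Dirichlet parameter, and β the table of per-topic word distributions. The vocabulary, the numbers in `beta`, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_index(probabilities, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def generate_document(alpha, beta, vocabulary, num_words, rng):
    theta = sample_dirichlet(alpha, rng)      # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(num_words):
        z = sample_index(theta, rng)          # z_n ~ Multinomial(theta)
        w = sample_index(beta[z], rng)        # w_n drawn from p(w | z_n)
        topics.append(z)
        words.append(vocabulary[w])
    return theta, topics, words

# Hypothetical two-topic model over a four-word vocabulary.
vocabulary = ["goal", "match", "vote", "party"]
beta = [
    [0.5, 0.4, 0.05, 0.05],   # topic 0: mostly sports words
    [0.05, 0.05, 0.5, 0.4],   # topic 1: mostly politics words
]
alpha = [0.5, 0.5]

rng = random.Random(1)
theta, topics, words = generate_document(alpha, beta, vocabulary, 8, rng)
```

Unlike pLSA, the topic proportions of a new document are drawn from a shared prior rather than being a free parameter per document.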
Joint Probability
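• Given parameters α and β
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
where $p(z_n = i \mid \theta) = \theta_i$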
Likelihood
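• Joint probability
  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
• Marginal distribution of a document
  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
• Likelihood over all the documents
  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$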
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-Step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• The only difference is in the E-Step
• In standard EM, q(X) is set directly to p(X|D,θ), which makes KL = 0
• In VBEM, p(X|D,θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X) that minimizes KL(q(X) || p(X|D,θ))
• This is also equivalent to maximizing the lower bound F(q, θ) with respect to q
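The relation between the lower bound, the log likelihood, and the KL divergence can be checked on a tiny discrete model. In this sketch the hidden variable X takes two values and the joint p(X, D) for the observed D is a pair of hypothetical numbers. For any q the gap between log p(D) and the bound equals KL(q || posterior), and the gap vanishes when q is the exact posterior.

```python
import math

# Hypothetical joint p(X = x, D) for the observed data D, with X in {0, 1}.
joint = {0: 0.12, 1: 0.03}
evidence = sum(joint.values())                        # p(D)
posterior = {x: p / evidence for x, p in joint.items()}

def lower_bound(q):
    """F(q) = sum over x of q(x) log(p(x, D) / q(x))."""
    return sum(q[x] * math.log(joint[x] / q[x]) for x in q if q[x] > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(q[x] * math.log(q[x] / p[x]) for x in q if q[x] > 0)

q = {0: 0.5, 1: 0.5}                                  # a deliberately poor q
gap = math.log(evidence) - lower_bound(q)
gap_at_posterior = math.log(evidence) - lower_bound(posterior)
```

In VBEM the family of q is restricted, so the gap is made as small as the family allows instead of being driven to zero.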
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  - Lower bound log p(w|α,β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
• M-Step: parameter estimation
  - β
  - α can be computed using the Newton-Raphson method
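For reference, the coordinate-ascent updates given in Blei et al. (2003) take the following form, where γ and φ are the variational parameters for one document, Ψ is the digamma function, and $w_{dn}^{j} = 1$ when the n-th word of document d is vocabulary item j. The notation follows that paper and may differ from the symbols used on the original slide.

```latex
% E step, repeated for each document until convergence
\phi_{ni} \propto \beta_{i w_n} \exp\big( \Psi(\gamma_i) \big), \qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M step for the topic-word parameters
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^{*}_{dni}\, w_{dn}^{j}
```

The update for α has no closed form, which is why an iterative method such as Newton-Raphson is used for it.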
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
The M Step
bull M step maximize wrt parameters holding hidden distribution q fixed
bull The second equality comes from fact that entropy of q(X) does not depend directly on θ
bull The specific form of the M step depends on the model Often the maximum wrt θ can be found analytically
EM Never Decreases the Likelihood
bull The E and M steps together never decrease the log likelihood
bull The E step brings F(q θ) to the likelihood L(θ) 顶bull The M-step maximizes F(q θ) wrt θ 抬bull F(q θ) lt L(θ) by Jensen ndash or equivalently from the non-
negativity of KL
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
bull Joint probability
bull Likelihood (only for observed variables)
bull p(d) is assumed to be uniform
Document Decomposition
bull Each document can be decomposed as
bull This is similar to the matrix decomposition if we consider each discrete distribution as a vector
p(w|d) = ZVtimesk p(z|d)
bull With many documents we hope to find latent topics as common basis
pLSA ndash Objective Function
bull pLSA tries to maximize the log likelihood
bull Due to the summation over z inside log we have to resort to EM
EM Steps
bull E-Stepndash Expectation of the likelihood function is calculated with the current
parameter values
bull M-Stepndash Update the parameters with the calculated posterior probabilitiesndash Find the parameters that maximizes the likelihood function
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
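Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different α0, where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: Dirichlet densities for α0 = 0.1, α0 = 1, α0 = 10]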
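The effect of α0 can be explored numerically. The following is a minimal sketch, not taken from the slides: it draws Dirichlet samples by normalizing independent Gamma draws (a standard construction), using only the Python standard library. The helper name `sample_dirichlet` and the seed are illustrative assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:  # guard against numerical underflow for very small alpha_i
            return [d / total for d in draws]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
for alpha0 in (0.1, 1.0, 10.0):
    alpha = [alpha0 / 3.0] * 3  # equal alpha_i that sum to alpha0 (K = 3)
    sample = sample_dirichlet(alpha, rng)
    # Every sample is a point on the 2-simplex: non-negative and summing to one.
    print(alpha0, [round(x, 3) for x in sample])
```

Small α0 tends to give samples near the corners of the simplex, while large α0 concentrates them near the centre.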
The LDA Model
[Figure: LDA graphical model unrolled for three documents, each with topics z1, …, z4 generating words w1, …, w4; β is shared by all documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
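The generative process above can be sketched in a few lines of Python. This is an illustrative sketch only: the two-topic, four-word `beta`, the symmetric `alpha`, and all function names are assumptions made for the example, not values from the talk.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # fallback for floating-point round-off

def generate_document(alpha, beta, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)      # topic proportions of this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)    # choose a topic z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)  # choose a word w_n from p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy model: topic 0 emits words 0-1, topic 1 emits words 2-3.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document([1.0, 1.0], beta, 8, random.Random(0))
print(topics, words)
```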
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
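The equations on these slides did not survive in the transcript. As a hedged reconstruction, the standard formulation from Blei, Ng and Jordan (2003), which the talk cites, gives the three quantities as follows; here w is the word sequence of one document, z its topic assignments, θ its topic proportions, and D a corpus of M documents.

```latex
% Joint probability of one document, given \alpha and \beta
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```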
Inference
• The log likelihood of the corpus is the sum of the log likelihoods of the individual documents
• Jensen’s inequality in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
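The relationship between the bound and the KL divergence can be written out explicitly. The block below follows the notation of Blei, Ng and Jordan (2003) and is a reconstruction under that assumption, since the slide's own equation is not preserved: q is the factorized variational distribution with free parameters γ (Dirichlet) and φ (multinomial).

```latex
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)
```

Because the left-hand side does not depend on γ or φ, raising L necessarily lowers the KL term by the same amount.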
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) ‖ p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
– β
– α: can be estimated using the Newton-Raphson method
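As a hedged illustration of the E-step, the sketch below implements the per-document coordinate-ascent updates given in Blei, Ng and Jordan (2003): φ_ni ∝ β_{i,w_n} exp(ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni. The update equations on the slide are not preserved, so this follows the cited paper rather than the slide. The toy `alpha`, `beta`, the fixed iteration count (instead of a convergence test), and the hand-written `digamma` approximation are all assumptions of the example.

```python
import math

def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def variational_e_step(doc, alpha, beta, iterations=50):
    """Update the variational parameters (gamma, phi) of one document.

    doc   : list of word indices
    alpha : Dirichlet parameter, one entry per topic
    beta  : beta[i][w] = p(word w | topic i), assumed strictly positive
    """
    n_topics, n_words = len(alpha), len(doc)
    phi = [[1.0 / n_topics] * n_topics for _ in doc]
    gamma = [a + n_words / n_topics for a in alpha]
    for _ in range(iterations):
        for n, w in enumerate(doc):
            # phi_ni is proportional to beta_{i,w_n} * exp(digamma(gamma_i))
            weights = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(n_topics)]
            norm = sum(weights)
            phi[n] = [x / norm for x in weights]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(n_words)) for i in range(n_topics)]
    return gamma, phi

# Hypothetical toy model: 2 topics, 3 vocabulary words.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]
gamma, phi = variational_e_step([0, 0, 1, 2], alpha, beta)
print(gamma)
```

The M-step would then re-estimate β from the φ values of all documents and update α, for example with Newton-Raphson; that part is omitted here.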
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents, each with topics z1, …, z4 generating words w1, …, w4; β is shared by all documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with nodes π, z, x, θ, β and plate labels M and Nd]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe’s DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for “long query” retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for “Long Query” Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection. A term vector over Term1 … TermN (Dim = 1 million), e.g. Img1 = (1, 2, 0, 0, …, 2), is projected to a topic vector over f1 … fM (Dim = 200), e.g. Img1 = (0.2, 0.1, …, 0.03)]
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = a topic histogram + a few words (10 words)]
Orthogonal Decomposition
$$p = Xw + \varepsilon$$
– X = (x_1, x_2, x_3, …, x_k): base vectors
– w: low dimensional representation
– ε: residual
[Figure: an image = a topic histogram + a few words (10 words)]
• Similarity between two documents p and q
$$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$
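A small numerical check makes the decomposition concrete. The sketch below assumes the base vectors are orthonormal, in which case w = Xᵀp, the residual ε = p − Xw is orthogonal to every base vector, and the inner product of two documents splits into a low-dimensional part plus a residual part. The vectors and the helper names are made up for illustration and are not from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual.

    basis: list of orthonormal base vectors x_1..x_k (an assumption of this sketch).
    """
    w = [dot(x, p) for x in basis]
    projection = [sum(w[j] * basis[j][i] for j in range(len(basis)))
                  for i in range(len(p))]
    residual = [pi - qi for pi, qi in zip(p, projection)]
    return w, residual

s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0, 0.0],
         [s, -s, 0.0, 0.0]]   # two orthonormal base vectors in a 4-dim vocabulary
p = [0.5, 0.2, 0.3, 0.0]
q = [0.1, 0.4, 0.2, 0.5]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

# With an orthonormal basis: p.q == w_p.w_q + eps_p.eps_q
print(dot(p, q), dot(w_p, w_q) + dot(eps_p, eps_q))
```

Keeping only the few largest residual entries (the "few words") would make the second term an approximation rather than an exact identity.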
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
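The switch-variable mixture can be sketched as a weighted sum of three word distributions: the topic mixture, a document-specific distribution, and a corpus-wide background. All numbers and names below are hypothetical and only illustrate that the result is again a valid distribution over the vocabulary.

```python
def word_distribution(switch, topic_mix, topic_word, doc_word, background):
    """Mix three word distributions according to the switch probabilities.

    switch     : [p(x=0|d), p(x=1|d), p(x=2|d)]
    topic_mix  : p(z=k|d) for each topic k
    topic_word : topic_word[k][w] = p(w|z=k)
    doc_word   : document-specific word distribution
    background : background word distribution
    """
    result = []
    for w in range(len(background)):
        topic_part = sum(topic_word[k][w] * topic_mix[k] for k in range(len(topic_mix)))
        result.append(switch[0] * topic_part
                      + switch[1] * doc_word[w]
                      + switch[2] * background[w])
    return result

# Hypothetical values: 2 topics, 3 vocabulary words.
p_w_d = word_distribution(
    switch=[0.7, 0.2, 0.1],
    topic_mix=[0.6, 0.4],
    topic_word=[[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
    doc_word=[0.0, 1.0, 0.0],
    background=[0.4, 0.3, 0.3],
)
print(p_w_d)
```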
Search (Online)
[Figure: online search pipeline. A query is decomposed into a topic histogram plus a few words; the LSH index returns candidate documents (Doc 300, Doc 401, …); the candidates are re-ranked using the document metadata (Doc Meta: DS1, DS2, … for Doc 1, Doc 2, …, Doc N)]
• Index: 10M images, 46GB
• Search speed < 100ms
Search Example
[Figure: query image]
Search Example
[Figure: query image]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & Structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR ’06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR ’06. Finds the most influential nodes
Evaluation
• Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori)       0.674   0.362   0.243
EABIF(rec)       0.742   0.318   0.281
PageRank(ori)    0.675   0.377   0.263
PageRank(rec)    0.743   0.321   0.266
HITS(ori)        0.906   0.832   0.900
HITS(rec)        0.938   0.822   0.906
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
– Matrix decomposition: a good way to practice matrix methods
– Graphical models: a good way to practice probability
• The essence of decomposition is to discover a set of mid-level features that describe the original documents or images
• A graphical model is a good tool for analyzing problems
• Graphical models are more adaptable to various applications than matrix decomposition
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood
• The E-step brings F(q, θ) up to the likelihood L(θ)
• The M-step maximizes F(q, θ) with respect to θ
• F(q, θ) ≤ L(θ) by Jensen’s inequality, or equivalently from the non-negativity of the KL divergence
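The monotonicity claim can be checked empirically. The sketch below runs EM on a toy mixture of two biased coins and records the log likelihood after every iteration; the data, the initial parameters, and the function names are assumptions chosen for illustration. The binomial coefficient is dropped because it does not depend on the parameters.

```python
import math

def component_likelihood(heads, flips, bias):
    # Binomial likelihood up to the binomial coefficient (constant in the parameters).
    return bias ** heads * (1.0 - bias) ** (flips - heads)

def log_likelihood(data, flips, weights, biases):
    return sum(
        math.log(sum(w * component_likelihood(h, flips, b)
                     for w, b in zip(weights, biases)))
        for h in data
    )

def em(data, flips, weights, biases, iterations=25):
    """EM for a mixture of coins; returns the parameters and the log-likelihood history."""
    weights, biases = list(weights), list(biases)
    history = [log_likelihood(data, flips, weights, biases)]
    for _ in range(iterations):
        # E-step: set q(z) to the posterior p(z | x, theta), which makes the bound tight.
        posteriors = []
        for h in data:
            joint = [w * component_likelihood(h, flips, b)
                     for w, b in zip(weights, biases)]
            total = sum(joint)
            posteriors.append([j / total for j in joint])
        # M-step: maximize the bound with respect to the parameters.
        for k in range(len(weights)):
            mass = sum(post[k] for post in posteriors)
            weights[k] = mass / len(data)
            biases[k] = sum(post[k] * h for post, h in zip(posteriors, data)) / (mass * flips)
        history.append(log_likelihood(data, flips, weights, biases))
    return weights, biases, history

# Head counts out of 10 flips per trial (made-up data).
data = [9, 8, 7, 2, 1, 3, 8, 2]
weights, biases, history = em(data, 10, [0.5, 0.5], [0.6, 0.4])
print(history[0], history[-1])
```

Each entry of `history` should be at least as large as the previous one, up to floating-point round-off.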
Reference
• Zoubin Ghahramani. Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
• Cons
– A graphical model looks complex even with only a few circles…
– We have to make many assumptions
• Pros
– We do need probability to explain our world, but the joint probability is hard to compute
– A graphical model can help us analyze and understand our problems
– Graphs are an intuitive way of representing and visualizing the relationships between many variables
– With a graphical model we can decompose the joint probability into conditional probabilities, which are usually easier to handle
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D)
• In general
$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\mathrm{pa}(i)})$$
• where pa(i) are the parents of node i
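The five-node factorization above can be checked with a small example. The sketch below assumes binary variables and made-up conditional probability tables; it multiplies the local conditionals to obtain the joint and verifies that the joint sums to one over all 32 assignments.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# Each entry is the probability that the child equals 1 given its parents.
p_a = 0.3                                                      # p(A=1)
p_b = 0.6                                                      # p(B=1)
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}     # p(C=1 | A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}     # p(D=1 | B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}   # p(E=1 | C, D)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)."""
    return (bernoulli(p_a, a)
            * bernoulli(p_b, b)
            * bernoulli(p_c[(a, b)], c)
            * bernoulli(p_d[(b, c)], d)
            * bernoulli(p_e[(c, d)], e))

total = sum(joint(*assignment) for assignment in product((0, 1), repeat=5))
print(total)
```

The factorized model needs 1 + 1 + 4 + 4 + 4 = 14 numbers, compared with 2^5 − 1 = 31 for an unrestricted joint table over five binary variables.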
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
– Ambiguous terms
– The same query varies with personal style
• Latent semantic indexing
– Creates a ‘latent semantic space’ (hidden meaning)
• LSI puts documents together even if they have no words in common, provided they share frequently co-occurring terms
• Disadvantages
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
– Polysemy
• Java could mean “coffee” and also the “PL Java”
• Cricket is a “game” and also an “insect”
– Synonymy
• “computer”, “pc”, “desktop” all could mean the same
• Has a better statistical foundation than LSA
pLSA
[Figure: pLSA in plate notation (d → z → w, with plates Nd and M), and unrolled for one document d with topics z1, z2, z3, …, zN and words w1, w2, w3, …, wN]
• z1, …, zN are variables with zi ∈ [1, K], where K is the number of latent topics
pLSA
[Figure: pLSA unrolled for documents d1, d2, …, dM; document di has topics z1, …, zNi and words w1, …, wNi]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
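The equations for this slide are not preserved in the transcript. Under the pLSA graphical model d → z → w described earlier, a hedged reconstruction in the usual notation (with n(d, w) the count of word w in document d, a symbol introduced here) would be:

```latex
% Joint probability of a document, a latent topic and a word
p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)

% Likelihood of the observed pair (d, w): the latent z is summed out
p(d, w) = p(d) \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z)

% Log likelihood of the whole corpus
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w)
```

With p(d) uniform, the p(d) factor only adds a constant to the log likelihood.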
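Document Decomposition
• Each document can be decomposed as
$$p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
$$p(w \mid d) = Z_{V \times K}\, p(z \mid d)$$
where the k-th column of Z is the distribution p(w | z = k) over the V vocabulary words
• With many documents, we hope to find latent topics as a common basis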
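Read as a matrix-vector product, the decomposition maps a K-dimensional topic mixture to a V-dimensional word distribution. The sketch below uses made-up numbers (V = 4 words, K = 2 topics); the names `topic_matrix` and `mix_topics` are illustrative.

```python
def mix_topics(topic_matrix, topic_mix):
    """Compute p(w|d) = Z p(z|d), where column k of Z is p(w | z=k)."""
    n_words = len(topic_matrix)
    n_topics = len(topic_mix)
    return [sum(topic_matrix[w][k] * topic_mix[k] for k in range(n_topics))
            for w in range(n_words)]

# Rows are words, columns are topics; every column sums to one.
topic_matrix = [[0.6, 0.0],
                [0.3, 0.1],
                [0.1, 0.2],
                [0.0, 0.7]]
topic_mix = [0.25, 0.75]          # p(z|d) for one document

p_w_given_d = mix_topics(topic_matrix, topic_mix)
print(p_w_given_d)
```

Because each column of Z and the mixture vector both sum to one, the result is again a probability distribution over words.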
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Because of the summation over z inside the log, we have to resort to EM
EM Steps
• E-step
– The expectation of the likelihood function is calculated with the current parameter values
• M-step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-step
• The M-step
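The E and M steps for pLSA can be sketched as follows. The update equations on the slides are not preserved, so this uses the standard pLSA updates (Hofmann, 1999): the E-step computes p(z|d, w) ∝ p(w|z) p(z|d), and the M-step re-estimates p(w|z) and p(z|d) from the expected counts n(d, w) p(z|d, w). The count matrix, the random initialization, and the fixed iteration count are assumptions of the example.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, n_topics, iterations=30, seed=0):
    """Fit pLSA with EM on a document-by-word count matrix.

    Returns p(w|z), p(z|d), and the log likelihood measured at the start of each iteration
    (without the constant p(d) term).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        loglik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                loglik += n * math.log(p_w_d)
                for z in range(n_topics):
                    posterior = joint[z] / p_w_d      # E-step: p(z | d, w)
                    new_w_z[z][w] += n * posterior    # expected counts for p(w|z)
                    new_z_d[d][z] += n * posterior    # expected counts for p(z|d)
        history.append(loglik)
        # M-step: normalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history

# Made-up counts: documents 0-1 mostly use words 0-1, documents 2-3 mostly use words 2-3.
counts = [[4, 3, 0, 0],
          [5, 2, 1, 0],
          [0, 0, 3, 4],
          [0, 1, 2, 5]]
p_w_z, p_z_d, history = plsa(counts, n_topics=2)
print(history[0], history[-1])
```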
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
• Comparison to SVD
– U matrix is related to P(z|d) (doc to aspect)
– V matrix is related to P(w|z) (aspect to term)
– Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA defines a generative model (the aspect model) and maximizes its predictive power
Reference
bull Zoubin Ghahramani Machine Learning (4F13) 2006 Cambridge(Unsupervised learning Lecture 5 Slides)
bull Christopher M Bishop (2006) Pattern Recognition and Machine Learning Springer
WHY DO WE NEED GRAPHICAL MODEL
Why Do We Need Graphical Models
bull Consndash Graphical model is so complex even with a few circleshellipndash We have to make too many assumptions
bull Prosndash We do need probability to explain our world But joint probability is
hard to computendash Graphical model can help us analyze and understand our problemsndash Graphs are an intuitive way of representing and visualizing the
relationships between many variablesndash With a graphical model we can decouple joint probability to
conditional probabilities which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
bull A DAG Model Bayesian network corresponds to a factorization of the joint probability distribution
p(ABCDE) = p(A)p(B)p(C|AB)p(D|BC)p(E|CD)
bull In general
bull where pa(i) are the parents of node i
Directed Graphs for Statistical ModelsPlate Notation
bull A data set of N points generated from a Gaussian
PLSA ndash PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
bull For natural Language Queries simple term matching does not work effectivelyndash Ambiguous terms ndash Same Queries vary due to personal styles
bull Latent semantic indexingndash Creates this lsquolatent semantic spacersquo (hidden meaning)
bull LSI puts documents together even if they donrsquot have common words if the docs share frequently co-occurring terms
bull Disadvantagesndash Statistical foundation is missing
pLSA ndash Probabilistic Latent Semantic Analysis
bull Automated Document Indexing and Information retrievalbull Identification of Latent Classes using an Expectation
Maximization (EM) Algorithmbull Shown to solve
ndash Polysemybull Java could mean ldquocoffeerdquo and also the ldquoPL Javardquobull Cricket is a ldquogamerdquo and also an ldquoinsectrdquo
ndash Synonymybull ldquocomputerrdquo ldquopcrdquo ldquodesktoprdquo all could mean the same
bull Has a better statistical foundation than LSA
pLSA
M
Nd
d
z
w
M
d
z1
w1
z2
w2
z3
w3
zN
wN
hellip
z1 hellip zN are variables ziє[1K]K is the number of latent topics
pLSA
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dM
z1
w1
z2
w2
zNm
wNm
hellip
p(w|z=1) p(w|z=2) p(w | z=Nm) are shared for all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability: $p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)$
• Likelihood (only for observed variables): $p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)$
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
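The matrix view of the decomposition can be illustrated with a small sketch. It assumes toy numbers (V = 4 words, k = 2 topics, all values hypothetical), with column j of Z holding the topic-word distribution p(w|z=j).

```python
# Hypothetical toy numbers: V = 4 words, k = 2 topics.
# Column j of Z is p(w | z=j); each column sums to one.
Z = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
p_z_given_d = [0.25, 0.75]  # topic mixture p(z|d) of one document


def mix(Z, weights):
    """Return p(w|d) = sum_z p(w|z) p(z|d), i.e. the matrix-vector product Z p(z|d)."""
    return [sum(row[j] * weights[j] for j in range(len(weights))) for row in Z]


p_w_given_d = mix(Z, p_z_given_d)
```

Because every column of Z and the mixture vector each sum to one, the resulting word distribution also sums to one. The columns of Z play the role of the shared basis, and p(z|d) is the coordinate vector of the document.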
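pLSA – Objective Function
• pLSA tries to maximize the log likelihood $\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w)$, where n(d, w) is the number of times word w occurs in document d
• Due to the summation over z inside the log, we have to resort to EM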
EM Steps
• E-Step
– The expectation of the likelihood function is calculated with the current parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
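The two steps can be sketched in code. This is a minimal illustration, assuming the standard pLSA updates from Hofmann (1999): the E-step computes the posterior p(z|d,w) proportional to p(w|z) p(z|d), and the M-step re-estimates p(w|z) and p(z|d) from the expected counts n(d,w) p(z|d,w). The function name `plsa_em`, the toy count matrix, and the random initialization are assumptions made for the example. The constant term from the uniform p(d) is left out of the reported log likelihood.

```python
import math
import random


def normalize(values):
    """Scale a list of non-negative numbers so that it sums to one."""
    total = sum(values)
    return [v / total for v in values]


def plsa_em(counts, K, iterations=50, seed=0):
    """Fit p(w|z) and p(z|d) to a document-word count matrix with EM.

    counts[d][w] is n(d, w); every document is assumed to contain at least one word.
    Returns (p_w_z, p_z_d, history), where history holds the log likelihood
    measured at the start of each iteration.
    """
    rng = random.Random(seed)
    M, V = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(M)]
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * V for _ in range(K)]
        new_z_d = [[0.0] * K for _ in range(M)]
        log_lik = 0.0
        for d in range(M):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior p(z | d, w) is proportional to p(w|z) p(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
                p_w_d = sum(joint)
                log_lik += n * math.log(p_w_d)
                for z in range(K):
                    expected = n * joint[z] / p_w_d
                    # M-step accumulators: expected counts n(d, w) p(z | d, w).
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        history.append(log_lik)
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d, history


# Hypothetical corpus: 4 documents over a 4-word vocabulary.
counts = [
    [4, 3, 0, 0],
    [5, 2, 1, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 5],
]
p_w_z, p_z_d, history = plsa_em(counts, K=2)
```

The recorded log likelihood should never decrease from one iteration to the next, which is the usual sanity check for an EM implementation.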
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In pLSA, by having K aspects
• Comparison to SVD
– U matrix related to P(z|d) (doc to aspect)
– V matrix related to P(w|z) (aspect to term)
– Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: pLSA unrolled for documents d_1, d_2, …, d_m, each with its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
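Dirichlet Distribution
• Definition
$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K − 1)-dimensional simplex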
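Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: density panels for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0, where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: density panels for α_0 = 0.1, α_0 = 1, α_0 = 10]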
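The behaviour shown in the example panels can be explored by sampling. A minimal sketch, assuming the standard construction of a Dirichlet draw as normalized independent Gamma draws (the helper `sample_dirichlet` is a name chosen for this example) and using only the Python standard library:

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
alpha = (6.0, 2.0, 2.0)  # one of the parameter settings in the example panels
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]

# Every sample is a point on the simplex: non-negative entries that sum to one.
on_simplex = all(
    abs(sum(x) - 1.0) < 1e-9 and min(x) >= 0.0 for x in samples
)

# The empirical mean should approach alpha_i / alpha_0.
alpha0 = sum(alpha)
mean = [sum(x[i] for x in samples) / len(samples) for i in range(len(alpha))]
expected_mean = [a / alpha0 for a in alpha]
```

With equal α_i, a small α_0 pushes samples toward the corners of the simplex and a large α_0 concentrates them near the center; replacing `alpha` with, for example, three equal values that sum to 0.1 or to 10 shows this.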
The LDA Model
[Figure: LDA unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, all sharing the parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
The LDA Model
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
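The generative process can be written as a short sampler. This is a sketch under stated assumptions: α is a list of K positive numbers, β is a K × V table whose row k is the word distribution of topic k, and the helper names (`sample_dirichlet`, `sample_index`, `generate_document`) and toy values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding


def generate_document(alpha, beta, n_words, rng):
    """Sample one document: theta, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)       # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(theta, rng)           # z_n ~ Multinomial(theta)
        w = sample_index(beta[z], rng)         # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Toy setup: K = 2 topics over V = 4 words with disjoint supports,
# so each word reveals which topic produced it.
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
theta, topics, words = generate_document(alpha, beta, 20, random.Random(0))
```

Unlike pLSA, the per-document mixture θ is itself drawn from a distribution, so the sampler can produce documents that were not in any training set.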
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
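The three quantities named above can be written out as follows. This is a sketch that assumes the slides follow the notation of Blei, Ng, and Jordan (2003): θ is the topic mixture of a document, z_n and w_n are the topic and word at position n, and D is a corpus of M documents with N_d words each.

```latex
% Joint probability of one document's hidden and observed variables
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out z
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```

The integral over θ coupled with the sum over topics is what makes the marginal hard to evaluate exactly.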
Inference
• The likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-Step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
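The relation between the bound and the KL divergence can be stated as one identity. This is a sketch assuming the fully factorized variational family of Blei, Ng, and Jordan (2003), with γ the Dirichlet parameter for θ and φ_n the multinomial parameters for z_n.

```latex
% Simplified (factorized) variational distribution
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

% Lower bound obtained from Jensen's inequality
L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
  - \mathbb{E}_q\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]

% The gap between the log likelihood and the bound is the KL divergence
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)
```

Since the left-hand side does not depend on γ or φ, raising the bound with respect to them must lower the KL term by the same amount.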
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
• M-Step: parameter estimation
– β
– α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8,000 documents and 15,818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: LDA unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, all sharing the parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate notation with nodes π, z, x, θ, β and plates N_d and M]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
– Topic projection from the term space (Dim = 1 million) to the feature space (Dim = 200)

        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2

        f1     f2     …    fM
Img1    0.2    0.1    …    0.03
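The inverted-list problem can be seen even on a tiny example. A minimal sketch with a made-up index of four terms (the terms and document ids are hypothetical):

```python
# Hypothetical toy inverted index: term -> set of ids of documents containing it.
index = {
    "sky": {1, 2, 3},
    "sea": {2, 3, 4},
    "sand": {3, 5},
    "rock": {6},
}
corpus = {1, 2, 3, 4, 5, 6}

query = ["sky", "sea", "sand", "rock"]
lists = [index[term] for term in query]

# AND semantics: documents containing every query term.
intersection = set.intersection(*lists)

# OR semantics: documents containing at least one query term.
union = set.union(*lists)
```

With only four terms the intersection is already empty and the union already covers the whole corpus, so neither gives a useful candidate set. A long query makes both effects worse, which motivates comparing documents in a low-dimensional space instead.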
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
Orthogonal Decomposition
p = Xw + ε
• Base vectors: X_1, X_2, X_3, …, X_k
• Low-dimensional representation: w
• Residual: ε
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
$\langle p, q \rangle = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
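A small numeric check shows why keeping the residual preserves similarity. This is a sketch under stated assumptions: the base vectors are orthonormal, the residual is whatever remains after projecting onto them, and the vectors and helper names (`dot`, `decompose`) are made up for the example. Under those assumptions the inner product of two vectors splits into a low-dimensional part plus a residual part.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual orthogonal to it."""
    w = [dot(p, x) for x in basis]
    projection = [
        sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))
    ]
    residual = [pi - qi for pi, qi in zip(p, projection)]
    return w, residual


# Hypothetical orthonormal basis of a 2-dimensional subspace of R^4.
s = 0.5 ** 0.5
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, s, s]]

p = [1.0, 2.0, 0.0, 2.0]
q = [3.0, 0.0, 1.0, 1.0]
w_p, e_p = decompose(p, basis)
w_q, e_q = decompose(q, basis)

full_similarity = dot(p, q)
split_similarity = dot(w_p, w_q) + dot(e_p, e_q)
```

Dropping the residual term would change the similarity in this example, which is the motivation for storing a few residual words next to the low-dimensional vector.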
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
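The switch variable turns the word distribution of a document into a three-way mixture. A minimal sketch with made-up numbers (all probabilities and the function name `word_probability` are hypothetical), where x = 0 selects the topic route, x = 1 the document-specific distribution, and x = 2 the background distribution:

```python
def word_probability(w, p_x, p_w_z, p_z_d, p_doc, p_bg):
    """p(w|d) as a mixture of topic, document-specific, and background parts.

    p_x   : [p(x=0|d), p(x=1|d), p(x=2|d)]
    p_w_z : p_w_z[k][w] = p(w | z=k)
    p_z_d : p_z_d[k] = p(z=k | d)
    p_doc : document-specific word distribution
    p_bg  : background word distribution
    """
    topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
    return p_x[0] * topic_part + p_x[1] * p_doc[w] + p_x[2] * p_bg[w]


# Hypothetical toy values: V = 4 words, K = 2 topics.
p_x = [0.6, 0.3, 0.1]
p_w_z = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
p_z_d = [0.25, 0.75]
p_doc = [0.0, 1.0, 0.0, 0.0]   # this document has one special word
p_bg = [0.25, 0.25, 0.25, 0.25]

p_w_d = [word_probability(w, p_x, p_w_z, p_z_d, p_doc, p_bg) for w in range(4)]
```

The topic route corresponds to the low-dimensional part of the decomposition, while the document-specific route plays the role of the residual words.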
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional histogram + a few words) is looked up in the LSH index, which returns candidates (Doc 300, Doc 401, …); the candidates are re-ranked using the per-document metadata (Doc Meta: DS1, DS2, … for Doc 1, Doc 2, …, Doc N)]
• Index: 10M images, 46GB
• Search speed < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
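The two simplest baselines need no content at all. The sketch below assumes that posts in a thread are numbered in time order starting from the root (post 0), that "nearest post" means the immediately preceding post, and that accuracy is the fraction of non-root posts whose parent is predicted correctly. These readings, the function names, and the toy thread are assumptions, not taken from the slides.

```python
def reply_to_nearest_post(n_posts):
    """NP baseline: every post replies to the post just before it."""
    return [None] + [i - 1 for i in range(1, n_posts)]


def reply_to_root(n_posts):
    """RR baseline: every post replies to the root post (index 0)."""
    return [None] + [0] * (n_posts - 1)


def accuracy(predicted, truth):
    """Fraction of non-root posts whose predicted parent matches the true parent."""
    pairs = list(zip(predicted[1:], truth[1:]))
    return sum(1 for p, t in pairs if p == t) / len(pairs)


# Hypothetical thread of five posts; truth[i] is the post that post i replies to.
truth = [None, 0, 1, 0, 3]

np_accuracy = accuracy(reply_to_nearest_post(len(truth)), truth)
rr_accuracy = accuracy(reply_to_root(len(truth)), truth)
```

The content-based methods replace these fixed rules with a similarity score between each post and its candidate parents.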
Evaluation
method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• …
Baselines
• LM
– Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in the expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Finds hub nodes and authority nodes
• EABIF
– Personalized Recommendation Driven by Information Flow, SIGIR '06
– Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
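The three column headers are standard ranking metrics. The sketch below uses the usual information-retrieval definitions for a single query, assuming these are what the table reports; MRR and MAP are then the means of the per-query reciprocal rank and average precision. The ranked list and relevant set are made up.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values measured at each relevant item's rank."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


# Hypothetical ranking of users for one query, with two true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}

rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
```

Averaging `rr` and `ap` over all queries gives MRR and MAP, and `precision_at_k` with k = 10 gives P@10.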
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – a good practice to learn matrix
– Graphical model – a good practice to learn probability
• The graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is more adaptable to various applications than matrix decomposition
Why Do We Need Graphical Models
• Cons
  - A graphical model is complex, even with only a few circles...
  - We have to make too many assumptions
• Pros
  - We do need probability to explain our world, but joint probability is hard to compute
  - A graphical model can help us analyze and understand our problems
  - Graphs are an intuitive way of representing and visualizing the relationships between many variables
  - With a graphical model we can decouple a joint probability into conditional probabilities, which are usually easier
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model (Bayesian network) corresponds to a factorization of the joint probability distribution
  $p(A, B, C, D, E) = p(A)\,p(B)\,p(C \mid A, B)\,p(D \mid B, C)\,p(E \mid C, D)$
• In general
  $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$
• where pa(i) are the parents of node i
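The factorization above can be sketched in a few lines. This is a minimal illustration, assuming all five variables are binary; the conditional probability tables are made-up values, not taken from the slides.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables A..E,
# following the DAG p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D). All numbers are made up.
p_a = 0.3                                                      # p(A=1)
p_b = 0.6                                                      # p(B=1)
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}     # p(C=1 | A, B)
p_d = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8}     # p(D=1 | B, C)
p_e = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}   # p(E=1 | C, D)


def bernoulli(p_one, value):
    """Probability that a binary variable equals `value` when p(var=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """p(A,B,C,D,E) as the product of each node's conditional given its parents."""
    return (bernoulli(p_a, a)
            * bernoulli(p_b, b)
            * bernoulli(p_c[(a, b)], c)
            * bernoulli(p_d[(b, c)], d)
            * bernoulli(p_e[(c, d)], e))


# The factorized joint is a valid distribution: it sums to one over all 2^5 states.
total = sum(joint(*state) for state in product([0, 1], repeat=5))
print(round(total, 10))
```

Only 1 + 1 + 4 + 4 + 4 = 14 numbers are stored here, instead of the 31 free parameters of an unconstrained joint table over five binary variables.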
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  - Ambiguous terms
  - The same query varies due to personal style
• Latent semantic indexing
  - Creates a "latent semantic space" (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms
• Disadvantages
  - Statistical foundation is missing
pLSA - Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  - Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  - Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
pLSA
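[Figure: pLSA graphical model in plate notation (nodes d, z, w; plates Nd and M), and the same model unrolled over the words of one document (d; z1, z2, z3, ..., zN; w1, w2, w3, ..., wN)]
• z1 ... zN are variables, zi ∈ [1, K], where K is the number of latent topics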
pLSA
[Figure: pLSA unrolled over documents d1, d2, ..., dM; document di has topics z1, z2, ..., zNi and words w1, w2, ..., wNi; the figure is labeled "Likelihood"]
• p(w|z=1), p(w|z=2), ..., p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as
  $p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  $p(w \mid d) = Z_{V \times K}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
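The matrix view of document decomposition can be checked numerically. This is a small sketch with a hypothetical vocabulary of V = 4 words and K = 2 topics; the matrix `Z` and the mixture `p_z_given_d` are made-up values.

```python
# Hypothetical topic-word matrix Z (V=4 words, K=2 topics): column z holds p(w|z).
Z = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
p_z_given_d = [0.25, 0.75]   # hypothetical topic mixture p(z|d) of one document


def mat_vec(matrix, vector):
    """Plain matrix-vector product, computed row by row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# p(w|d) = Z p(z|d): the document's word distribution in the original vocabulary space.
p_w_given_d = mat_vec(Z, p_z_given_d)
print(p_w_given_d, sum(p_w_given_d))
```

Because every column of `Z` and the mixture vector each sum to one, the resulting p(w|d) is again a distribution over words.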
pLSA - Objective Function
• pLSA tries to maximize the log likelihood
• Because of the summation over z inside the log, we have to resort to EM
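For orientation, the joint probability, the likelihood of the observed pair, and the log likelihood objective can be written out under the standard aspect model of Hofmann (1999). This is a sketch, not a copy of the slide equations; the count symbol n(d, w), the number of times word w occurs in document d, is an assumed notation.

```latex
% Joint probability of one document, one word occurrence and its latent topic
p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)

% Likelihood of the observed pair (d, w): the latent topic is summed out
p(d, w) = p(d) \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z)

% Log likelihood over the corpus; n(d, w) is an assumed symbol for word counts
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log p(d, w)
            = \sum_{d} \sum_{w} n(d, w) \Big[ \log p(d) + \log \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z) \Big]
```

With p(d) uniform, the first term inside the brackets is a constant, so only the sum over z inside the log matters for optimization.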
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
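The two steps can be sketched for pLSA as follows. This is a minimal illustration of the textbook updates, p(z|d,w) ∝ p(w|z) p(z|d) in the E-step and renormalized expected counts in the M-step; the function names, the toy count matrix and the random initialization are assumptions, not the presenter's code.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def log_likelihood(counts, p_w_z, p_z_d):
    """sum_d sum_w n(d,w) * log sum_z p(w|z) p(z|d)."""
    K = len(p_w_z)
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                ll += n * math.log(sum(p_w_z[z][w] * p_z_d[d][z] for z in range(K)))
    return ll


def plsa_em(counts, K, iterations=50, seed=0):
    """Fit p(w|z) and p(z|d) to a document-by-word count matrix with EM."""
    rng = random.Random(seed)
    M, V = len(counts), len(counts[0])
    # Random, strictly positive initialization (an arbitrary choice).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(M)]
    history = [log_likelihood(counts, p_w_z, p_z_d)]
    for _ in range(iterations):
        new_w_z = [[0.0] * V for _ in range(K)]
        new_z_d = [[0.0] * K for _ in range(M)]
        for d in range(M):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior p(z | d, w) under the current parameters.
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(K)])
                # Accumulate expected counts for the M-step.
                for z in range(K):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_likelihood(counts, p_w_z, p_z_d))
    return p_w_z, p_z_d, history


# Toy corpus: 4 documents over a 4-word vocabulary (made-up counts).
counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_w_z, p_z_d, history = plsa_em(counts, K=2, iterations=30)
print(history[0], history[-1])
```

The `history` list records the log likelihood after each iteration, which makes it easy to observe that EM never decreases it.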
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix related to P(z|d) (doc to aspect)
  - V matrix related to P(w|z) (aspect to term)
  - Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: pLSA unrolled over documents d1, d2, ..., dm, each with its own topic distribution p(z|d1), p(z|d2), ..., p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
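• Definition
  $$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K - 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α
  [Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
  [Figure: Dirichlet densities for α0 = 0.1, α0 = 1, α0 = 10]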
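The Dirichlet density and the "samples sum to one" requirement can be tried out with the standard library alone. This is a sketch: the helper names are hypothetical, and sampling by normalizing independent Gamma draws is the usual textbook construction, named here as an assumption rather than something stated on the slides.

```python
import math
import random


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x of the open simplex."""
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        return float("-inf")   # the density is zero outside the open simplex
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
theta = sample_dirichlet([6.0, 2.0, 2.0], rng)   # one of the parameter settings above
print(theta, sum(theta))

# With alpha = (1, 1, 1) the density is flat over the simplex.
flat = math.exp(dirichlet_log_pdf([1 / 3, 1 / 3, 1 / 3], [1.0, 1.0, 1.0]))
print(flat)
```

For α = (1, 1, 1) the normalizer is Γ(3) / Γ(1)³ = 2, so the flat density equals 2 everywhere on the simplex.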
The LDA Model
[Figure: LDA unrolled over three documents, each with topics z1 ... z4 and words w1 ... w4, sharing the parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
The LDA Model
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
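The generative process above can be simulated directly. This is a minimal sketch following the notation of Blei et al. (2003): θ is the per-document topic mixture, α the Dirichlet parameter and β the K x V topic-word matrix. The concrete values and function names are made up for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, n_words, rng):
    """Generate one document: its topic mixture, per-word topics and word ids."""
    theta = sample_dirichlet(alpha, rng)        # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)      # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)    # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical model: K=2 topics over a V=4 word vocabulary.
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0 only emits words 0 and 1
    [0.0, 0.0, 0.5, 0.5],   # topic 1 only emits words 2 and 3
]
rng = random.Random(1)
theta, topics, words = generate_document(alpha, beta, 8, rng)
print(theta, topics, words)
```

Unlike pLSA, the topic mixture of each document is drawn from a shared prior instead of being a free parameter per document.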
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
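The three quantities named above can be written out in the notation of Blei et al. (2003), the LDA paper listed in the Reference slide. This is a sketch of the standard forms, not a copy of the slide equations; M denotes the number of documents and N the number of words in a document.

```latex
% Joint probability of the topic mixture, topics and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all M documents of the corpus D
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```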
Inference
• The likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-Step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
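The relation between the lower bound and the KL divergence can be checked on a tiny discrete example. This is a sketch with a hidden variable X of three states and made-up joint probabilities; it verifies that log p(D) = L(q) + KL(q || p(X|D)) for any q, and that the bound is tight when q is the exact posterior, as in standard EM.

```python
import math

# Hypothetical joint p(X = x, D) for one fixed observation D and three hidden states.
joint = [0.10, 0.20, 0.05]
evidence = sum(joint)                       # p(D)
posterior = [j / evidence for j in joint]   # p(X | D)


def lower_bound(q):
    """L(q) = sum_x q(x) * log( p(x, D) / q(x) )."""
    return sum(qx * math.log(j / qx) for qx, j in zip(q, joint) if qx > 0)


def kl(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


q = [0.5, 0.3, 0.2]   # an arbitrary variational distribution

# log p(D) splits into the lower bound plus the KL divergence to the posterior.
print(math.log(evidence), lower_bound(q) + kl(q, posterior))

# Standard EM sets q to the exact posterior, so KL = 0 and the bound is tight.
print(lower_bound(posterior), kl(posterior, posterior))
```

Since log p(D) does not depend on q, raising the lower bound and shrinking the KL divergence are the same operation.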
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeat until convergence
• M-Step: parameter estimation
  - β
  - The update for α can be implemented using the Newton-Raphson method
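For reference, the updates usually associated with these two steps are those derived in Blei et al. (2003). They are reproduced here as a sketch in that paper's notation; whether the slide used exactly this form is an assumption. Ψ is the digamma function, φ and γ are the variational parameters of one document, and w_{dn}^j = 1 when the n-th word of document d is vocabulary item j.

```latex
% E-step, iterated per document until convergence
\phi_{ni} \propto \beta_{i w_n} \exp\!\big( \Psi(\gamma_i) \big)
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step for beta: expected topic-word counts, normalized over the vocabulary
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
```

There is no closed form for α, which is why an iterative method such as Newton-Raphson is needed for that update.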
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA unrolled over three documents, each with topics z1 ... z4 and words w1 ... w4, sharing the parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates Nd and M]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

        Term1  Term2  Term3  Term4  ...  TermN
  Img1  1      2      0      0      ...  2        (Dim = 1 million)

  Topic projection:

        f1     f2     ...    fM
  Img1  0.2    0.1    ...    0.03                 (Dim = 200)
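The inverted-list problem described above shows up even on a toy index. This is a sketch with a hypothetical four-term index over six documents: the intersection of the lists for a long query is empty, while their union is the whole corpus.

```python
# Hypothetical toy inverted index: term -> set of ids of documents containing it.
inverted_index = {
    "sky": {1, 2, 3},
    "sea": {2, 3, 4},
    "sand": {3, 5},
    "tree": {1, 6},
}
corpus = {1, 2, 3, 4, 5, 6}


def lookup(query_terms, index):
    """Return the documents matching all query terms and those matching any term."""
    lists = [index.get(term, set()) for term in query_terms]
    if not lists:
        return set(), set()
    return set.intersection(*lists), set.union(*lists)


match_all, match_any = lookup(["sky", "sea", "sand", "tree"], inverted_index)
print(match_all, match_any == corpus)
```

Neither result is useful for ranking, which motivates projecting the query to a low dimensional vector first.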
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
  $p = Xw + \varepsilon$
[Figure: an image = a low dimensional representation (bar chart) + a few words (10 words)]
Orthogonal Decomposition
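$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}, \quad \text{i.e. } p = Xw + \varepsilon$$
• Base vectors: X1, X2, X3, ..., Xk (the columns of X)
• Low dimensional representation: w
• Residual: ε
[Figure: an image = a low dimensional representation (bar chart) + a few words (10 words)]
$$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$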
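The inner-product identity behind the orthogonal decomposition can be verified numerically. This sketch assumes that the base vectors are orthonormal and that w is the orthogonal projection w = Xᵀp, so the residual is orthogonal to every base vector; the vectors and names below are made up.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual."""
    w = [dot(x, p) for x in basis]                                   # w = X^T p
    projection = [sum(wi * x[j] for wi, x in zip(w, basis)) for j in range(len(p))]
    residual = [pj - rj for pj, rj in zip(p, projection)]            # p - X w
    return w, residual


s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]   # hypothetical orthonormal base vectors X1, X2
p = [3.0, 1.0, 2.0]
q = [1.0, 2.0, 4.0]

w_p, e_p = decompose(p, basis)
w_q, e_q = decompose(q, basis)

# The similarity in the original space splits into a low dimensional part
# and a residual part.
print(dot(p, q), dot(w_p, w_q) + dot(e_p, e_q))
```

Keeping only a few large entries of the residual then gives an approximation of the second term, in the spirit of the "a few words" part of the figure.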
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution (x = 0)
• a document-specific distribution (x = 1)
• a background distribution (x = 2)
$$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
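The switch-variable mixture can be evaluated for a toy document. This is a sketch with a three-word vocabulary and two topics; all probabilities are made-up values and the function name is hypothetical.

```python
def word_distribution(p_x, p_w_z, p_z_d, p_w_doc, p_w_bg):
    """p(w|d) as a mixture of topic, document-specific and background parts.

    p_x     : [p(x=0|d), p(x=1|d), p(x=2|d)], the switch probabilities
    p_w_z   : p_w_z[k][w] = p(w | z=k), topic-specific word distributions
    p_z_d   : p_z_d[k] = p(z=k | d), topic mixture of the document
    p_w_doc : document-specific word distribution
    p_w_bg  : background word distribution
    """
    result = []
    for w in range(len(p_w_bg)):
        topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
        result.append(p_x[0] * topic_part + p_x[1] * p_w_doc[w] + p_x[2] * p_w_bg[w])
    return result


# Hypothetical values for a 3-word vocabulary and K=2 topics.
p_x = [0.6, 0.3, 0.1]
p_w_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
p_z_d = [0.5, 0.5]
p_w_doc = [0.0, 0.0, 1.0]    # this document has one very specific word
p_w_bg = [0.4, 0.4, 0.2]

p_w_d = word_distribution(p_x, p_w_z, p_z_d, p_w_doc, p_w_bg)
print(p_w_d, sum(p_w_d))
```

The document-specific part plays the role of the residual: it carries the few words that the topics alone cannot explain.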
Search (Online)
[Figure: online search pipeline. A query (low dimensional representation + a few words) is looked up in the LSH index (entries DS1, DS2, ...), which returns candidate documents (Doc 300, Doc 401, ...); the candidates are re-ranked using the doc meta data of Doc 1, Doc 2, ..., Doc N (result: Doc 401, ..., Doc 300)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: Topics
Structure: Who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document Similarity
Topic Similarity
Structure Similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Projects documents to topic space
SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
  HITS
  PageRank
  ...
Baselines
LM
  Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  Achieves stable performance in the expert finding task using a language model
PageRank
  Benchmark nodal ranking method
HITS
  Finds hub nodes and authority nodes
EABIF
  Personalized Recommendation Driven by Information Flow, SIGIR '06
  Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice to learn matrices
  - Graphical model: a good practice to learn probability
• A graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Directed Graphs for Statistical Models: Plate Notation
• A data set of N points generated from a Gaussian
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural-language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same query can be phrased differently depending on personal style
• Latent semantic indexing
  – Creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they have no words in common, provided they share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • "Java" could mean "coffee" and also the programming language Java
    • "Cricket" is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" could all mean the same
• Has a better statistical foundation than LSA
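pLSA
[Figure: pLSA graphical model in plate notation (plates M and N_d; nodes d, z, w), and the same model unrolled for one document d with topics z_1 … z_N and words w_1 … w_N]
z_1 … z_N are variables with z_i ∈ [1, K], where K is the number of latent topics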
pLSA
[Figure: pLSA model unrolled for documents d_1, d_2, …, d_M, each with its own topics z and words w]
The topic-conditional word distributions p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
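For reference, a sketch of the two quantities in the standard pLSA aspect model (Hofmann, 1999). The exact notation used on the slide is an assumption; here a single observed pair of document d and word w has a hidden topic z.

```latex
% Joint probability of document d, word w, and hidden topic z
p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)

% Likelihood of the observed pair (d, w): the hidden topic z is summed out
p(d, w) = p(d) \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z)
```

With p(d) uniform, p(d) is a constant factor, so only p(z|d) and p(w|z) need to be estimated.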
Document Decomposition
• Each document can be decomposed as a mixture of topics: $p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
  $p(w \mid d) = Z_{V \times K}\, p(z \mid d)$
• With many documents, we hope to find latent topics as a common basis
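The matrix view of this decomposition can be checked on a toy example. The sketch below is illustrative only: the vocabulary size (V = 4), the number of topics (K = 2), all probabilities, and the helper name `mix_topics` are made up for the example.

```python
# Toy check that p(w|d) = sum_z p(w|z) p(z|d) is a matrix-vector product.
# All numbers are invented for illustration.

# Z is V x K: column k holds the word distribution p(w | z=k).
Z = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.4],
    [0.0, 0.5],
]

# p(z | d) for one document: a K-dimensional vector of topic proportions.
p_z_given_d = [0.25, 0.75]


def mix_topics(topic_matrix, topic_weights):
    """Return p(w|d) as the V-dimensional vector topic_matrix @ topic_weights."""
    return [
        sum(z_wk * t_k for z_wk, t_k in zip(row, topic_weights))
        for row in topic_matrix
    ]


p_w_given_d = mix_topics(Z, p_z_given_d)
print(p_w_given_d)
print(sum(p_w_given_d))  # a mixture of distributions is again a distribution
```

Because every column of Z and the weight vector each sum to one, the result also sums to one, which is what makes the columns of Z usable as a common basis of topics.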
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
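A sketch of the objective in its usual form, assuming the standard pLSA notation in which n(d, w) is the number of times word w occurs in document d (this symbol does not appear in the transcript):

```latex
\mathcal{L} = \sum_{d=1}^{M} \sum_{w=1}^{V} n(d, w)\,
              \log \left[ p(d) \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d) \right]
```

The sum over z sits inside the logarithm, so setting the derivatives to zero gives no closed-form solution; this is the reason for using EM.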
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
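The two steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: it uses the standard pLSA updates (E-step: p(z|d,w) ∝ p(w|z) p(z|d); M-step: p(w|z) ∝ Σ_d n(d,w) p(z|d,w) and p(z|d) ∝ Σ_w n(d,w) p(z|d,w)), drops the constant p(d) term from the log likelihood, and assumes every document contains at least one word. The function name `plsa_em` and the toy count matrix are hypothetical.

```python
import math
import random


def plsa_em(counts, K, iters=50, seed=0):
    """Fit pLSA by EM on a D x V count matrix given as a list of lists.

    Returns (p_w_z, p_z_d, history) where p_w_z[k][w] = p(w | z=k),
    p_z_d[d][k] = p(z=k | d), and history holds the log likelihood
    (without the constant p(d) term) measured at the start of each iteration.
    Assumes every document has at least one non-zero count.
    """
    rng = random.Random(seed)
    D, V = len(counts), len(counts[0])

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Random positive initialization of both parameter sets.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    history = []

    for _ in range(iters):
        new_w_z = [[0.0] * V for _ in range(K)]
        new_z_d = [[0.0] * K for _ in range(D)]
        log_lik = 0.0
        for d in range(D):
            for w in range(V):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior p(z | d, w) is proportional to p(w|z) p(z|d).
                joint = [p_w_z[k][w] * p_z_d[d][k] for k in range(K)]
                total = sum(joint)
                log_lik += n * math.log(total)
                for k in range(K):
                    expected = n * joint[k] / total
                    new_w_z[k][w] += expected
                    new_z_d[d][k] += expected
        # M-step: re-normalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_lik)
    return p_w_z, p_z_d, history


# Toy corpus: documents 0-1 mostly use words 0-1, documents 2-3 mostly use words 2-3.
counts = [
    [4, 3, 0, 0],
    [5, 2, 1, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 5],
]
p_w_z, p_z_d, history = plsa_em(counts, K=2)
print(history[0], history[-1])  # the log likelihood should not decrease
```

The recorded log likelihood is non-decreasing across iterations, which matches the general EM guarantee; the result still depends on the random initialization because EM only finds a local maximum.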
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – U matrix is related to P(z|d) (doc to aspect)
  – V matrix is related to P(w|z) (aspect to term)
  – Σ matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level; each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: pLSA model unrolled for documents d_1, d_2, …, d_M, with a separate, unconstrained topic distribution p(z|d_1), p(z|d_2), …, p(z|d_M) for each document]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
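Dirichlet Distribution
• Definition
  $p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$
• The density is zero outside this open (K-1)-dimensional simplex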
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
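The effect of the parameter α can be explored by sampling. The sketch below is an illustration, not part of the talk: it draws Dirichlet samples by normalizing independent Gamma(α_i, 1) draws, a standard construction, using only the Python standard library. The helper name `sample_dirichlet`, the seed, and the sample size are arbitrary choices.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
alpha = [6.0, 2.0, 2.0]  # one of the parameter settings listed above
samples = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# Every sample is a point on the 2-simplex: non-negative and summing to one.
first = samples[0]
print(first, sum(first))

# The sample mean should be close to alpha_i / alpha_0 = (0.6, 0.2, 0.2).
mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]
print([round(m, 2) for m in mean])
```

Scaling all α_i by the same factor keeps the mean fixed and changes only the concentration: a small α_0 pushes samples toward the corners of the simplex, a large α_0 pulls them toward the mean.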
The LDA Model
[Figure: LDA graphical model unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing the parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
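The generative process above can be simulated directly. The following is a minimal sketch under assumptions: the topic-word matrix `beta`, the Dirichlet parameter `alpha`, the document length, and the function names are all invented for illustration, and words and topics are represented by integer ids.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, n_words, rng):
    """Generate one document following the LDA generative process.

    alpha: K-dimensional Dirichlet parameter.
    beta:  K x V matrix, beta[k][v] = p(w = v | z = k).
    Returns (theta, topics, words).
    """
    topic_ids = range(len(alpha))
    word_ids = range(len(beta[0]))
    theta = sample_dirichlet(alpha, rng)                   # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]       # z_n ~ Multinomial(theta)
        w = rng.choices(word_ids, weights=beta[z])[0]      # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Toy setting: 2 topics over a 4-word vocabulary with disjoint supports,
# so each generated word reveals which topic produced it.
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
rng = random.Random(1)
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
print(theta)
print(topics)
print(words)
```

Unlike pLSA, the per-document proportions θ are not free parameters here; they are drawn from a shared Dirichlet prior, so the number of model parameters does not grow with the number of documents.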
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
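For reference, a sketch of these three quantities as they are usually written for LDA (following Blei, Ng, and Jordan, 2003). That the slide used exactly this notation is an assumption; w denotes the N words of one document and D a corpus of M documents.

```latex
% Joint probability of the topic proportions, topics, and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```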
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
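A sketch of the relation behind these two statements, assuming the fully factorized variational family used in the LDA paper (a Dirichlet q(θ|γ) and one multinomial q(z_n|φ_n) per word):

```latex
% Simplified (variational) model: theta and the z_n are decoupled
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

% Log likelihood = lower bound + KL divergence
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi)
      \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

The left-hand side does not depend on γ or φ, and the KL term is non-negative, so raising L with respect to γ and φ lowers the KL divergence by the same amount.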
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), so that KL = 0
• In VBEM, p(X|D, θ) is intractable to compute; instead, it is approximated by a variational distribution q(X) that minimizes KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower-bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference – repeat until convergence
• M-step: parameter estimation
  – β
  – α: can be computed using the Newton-Raphson method
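For reference, a sketch of the update equations in the form given in the LDA paper (Blei, Ng, and Jordan, 2003); whether the slide showed exactly these formulas is an assumption. Ψ is the digamma function, i indexes topics, j indexes vocabulary words, and w_{dn}^j = 1 when the n-th word of document d is word j.

```latex
% E-step, per document: iterate the two updates until convergence
\phi_{ni} \propto \beta_{i w_n} \exp\!\Big( \Psi(\gamma_i) - \Psi\big( \textstyle\sum_{j=1}^{K} \gamma_j \big) \Big)
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step for beta: closed form, normalized over the vocabulary
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
```

The update for α has no closed form, which is why an iterative method such as Newton-Raphson is used for it.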
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution helps avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model unrolled for three documents, each with topics z_1 … z_4 and words w_1 … w_4, sharing the parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model in plate notation with plates M and N_d and nodes π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
  Term vector (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  Topic vector (Dim = 200):
          f1     f2     …  fM
    Img1  0.2    0.1    …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \epsilon$
An image = [Figure: bar chart of the low-dimensional feature vector] + a few words (10 words)
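Orthogonal Decomposition
$p = Xw + \epsilon$, written out for a vocabulary of size W and k base vectors:
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$
• X = (X_1 X_2 X_3 … X_k): base vectors
• w: low-dimensional representation
• ε: residual
An image = [Figure: bar chart of the low-dimensional representation] + a few words (10 words)
$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$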
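The similarity identity can be verified numerically on a small example. The sketch below assumes that the base vectors are orthonormal and that the residual is what remains after projecting onto them; under those assumptions the inner product of two documents splits exactly into a low-dimensional part and a residual part. The basis, the vectors p and q, and the helper names are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual.

    Returns (w, residual) with p = sum_k w[k] * basis[k] + residual,
    where the residual is orthogonal to every base vector.
    """
    w = [dot(x, p) for x in basis]
    projection = [
        sum(w_k * x[i] for w_k, x in zip(w, basis)) for i in range(len(p))
    ]
    residual = [p_i - r_i for p_i, r_i in zip(p, projection)]
    return w, residual


# Two orthonormal base vectors in a 4-word vocabulary (k = 2, W = 4).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
p = [1.0, 2.0, 0.0, 2.0]
q = [0.0, 1.0, 3.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

full_similarity = dot(p, q)
split_similarity = dot(w_p, w_q) + dot(eps_p, eps_q)
print(full_similarity, split_similarity)
```

In the retrieval setting the residual is kept only as a few large-magnitude words, so the second term becomes an approximation rather than an exact value; the exact split shown here relies on storing the full residual.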
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
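The three-way mixture can be illustrated with a small numeric sketch. Everything below is hypothetical: the function name `word_probability`, the switch probabilities, and the three component distributions are invented to show how the switch variable x weights the topic, document-specific, and background parts.

```python
def word_probability(w, p_x, p_w_z, p_z_d, p_w_doc, p_w_bg):
    """Return p(w | d) under the switch-variable mixture.

    p_x:     [p(x=0|d), p(x=1|d), p(x=2|d)], the switch probabilities.
    p_w_z:   K x V matrix of topic word distributions p(w | z=k).
    p_z_d:   K-dimensional topic proportions p(z=k | d).
    p_w_doc: V-dimensional document-specific word distribution.
    p_w_bg:  V-dimensional background word distribution.
    """
    topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
    return p_x[0] * topic_part + p_x[1] * p_w_doc[w] + p_x[2] * p_w_bg[w]


# Toy setting with 2 topics and a 3-word vocabulary.
p_x = [0.6, 0.3, 0.1]
p_w_z = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
p_z_d = [0.5, 0.5]
p_w_doc = [0.0, 1.0, 0.0]   # this document has one very specific word
p_w_bg = [0.2, 0.2, 0.6]    # common words shared by the whole corpus

p_w_d = [
    word_probability(w, p_x, p_w_z, p_z_d, p_w_doc, p_w_bg) for w in range(3)
]
print(p_w_d, sum(p_w_d))
```

The document-specific component plays the role of the residual in the orthogonal decomposition: it captures words that the shared topics cannot explain.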
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional vector plus a few words) is looked up in an LSH index built over Doc 1 … Doc N (entries DS1, DS2, …); the candidates (Doc 300, Doc 401, …) are then re-ranked using the doc meta]
Index: 10M images, 46GB; search speed < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
Evaluation
Method  Slashdot                 Apple
        All Posts  Good Posts    All Posts  Good Posts
NP      0.021      0.012         0.289      0.239
RR      0.183      0.319         0.269      0.474
DS      0.463      0.643         0.409      0.628
LDA     0.465      0.644         0.410      0.648
SWB     0.463      0.644         0.410      0.641
SMSS    0.524      0.737         0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
  – HITS
  – PageRank
  – …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  – Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition: a good practice for learning matrices
  – Graphical models: a good practice for learning probability
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is a good tool for analyzing problems
  – It is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition – Eigenvalue & Eigenvector
- Definition – Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition – Eigenface
- Slide 14
- Slide 15
- PageRank – Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF – Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not work effectively
  – Ambiguous terms
  – Same queries vary due to personal styles
• Latent semantic indexing
  – Creates this 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms
• Disadvantages
  – Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • Java could mean "coffee" and also the "PL Java"
    • Cricket is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA
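pLSA
[Figure: pLSA graphical model in plate notation (d, z, w with plates N_d and M), and its unrolled form for one document d with topics z_1 … z_N and words w_1 … w_N]
• z_1 … z_N are variables, z_i ∈ [1, K]; K is the number of latent topics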
pLSA
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_M; document d_m has topics z_1 … z_{N_m} and words w_1 … w_{N_m}; the figure is labeled "Likelihood"]
• p(w|z=1), p(w|z=2), …, p(w|z=K) are shared for all documents
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
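The formulas for these bullets did not survive in the transcript. As a hedged reconstruction, under the standard asymmetric pLSA formulation of Hofmann (1999), with $n(d, w)$ denoting the count of word $w$ in document $d$ (notation assumed here), the joint probability over all variables and the likelihood over the observed pairs would read:

```latex
% Joint probability of a document, a latent topic, and a word
p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)

% Likelihood: the latent topic z is summed out, only (d, w) is observed
p(d, w) = p(d) \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d),
\qquad
L = \prod_{d} \prod_{w} p(d, w)^{\,n(d, w)}
```

With $p(d)$ uniform, the factor $p(d)$ contributes only a constant to $\log L$, so it does not affect the estimates of $p(w \mid z)$ and $p(z \mid d)$.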
Document Decomposition
• Each document can be decomposed as
$p(w|d) = \sum_{z=1}^{K} p(w|z)\, p(z|d)$
• This is similar to the matrix decomposition if we consider each discrete distribution as a vector
$p(w|d) = Z_{V \times k}\, p(z|d)$
• With many documents we hope to find latent topics as common basis
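A tiny numeric sketch of this decomposition may help. The two topics, the four-word vocabulary, and all probabilities below are made up for illustration; `topic_word[z][w]` plays the role of the matrix Z (stored topic by topic) and `doc_topic[z]` the role of p(z|d).

```python
def decompose(topic_word, doc_topic):
    """p(w|d) = sum_z p(w|z) p(z|d), computed as a matrix-vector product."""
    vocab_size = len(topic_word[0])
    return [
        sum(doc_topic[z] * topic_word[z][w] for z in range(len(doc_topic)))
        for w in range(vocab_size)
    ]


# Hypothetical topics over a 4-word vocabulary (each row sums to one).
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # p(w | z = 1)
    [0.0, 0.0, 0.5, 0.5],  # p(w | z = 2)
]
doc_topic = [0.8, 0.2]      # p(z | d) for one document
p_w_given_d = decompose(topic_word, doc_topic)
print(p_w_given_d)
```

Because every column of Z and the vector p(z|d) are themselves distributions, the product is again a distribution over the vocabulary.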
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  – Expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
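The update formulas for the two steps are missing from the transcript. The sketch below uses the standard pLSA updates from Hofmann (1999), which may differ in notation from the original slide: the E-step computes the posterior p(z|d,w), and the M-step re-estimates p(w|z) and p(z|d) from count-weighted posteriors. The function name, the tiny smoothing constant, and the toy count matrix are assumptions.

```python
import math
import random


def plsa_em(counts, num_topics, iterations=50, seed=0):
    """Fit pLSA by EM; counts[d][w] is the count of word w in document d."""
    rng = random.Random(seed)
    num_docs, vocab = len(counts), len(counts[0])

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    p_w_z = [normalized([rng.random() + 0.1 for _ in range(vocab)])
             for _ in range(num_topics)]
    p_z_d = [normalized([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    history = []  # log likelihood under the parameters before each update
    for _ in range(iterations):
        # Accumulators start at a tiny constant to avoid division by zero.
        new_w_z = [[1e-12] * vocab for _ in range(num_topics)]
        new_z_d = [[1e-12] * num_topics for _ in range(num_docs)]
        log_lik = 0.0
        for d in range(num_docs):
            for w in range(vocab):
                if counts[d][w] == 0:
                    continue
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(num_topics)]
                p_w_d = sum(joint)
                log_lik += counts[d][w] * math.log(p_w_d)
                for z in range(num_topics):
                    posterior = joint[z] / p_w_d              # E-step: p(z|d,w)
                    new_w_z[z][w] += counts[d][w] * posterior  # M-step sums
                    new_z_d[d][z] += counts[d][w] * posterior
        history.append(log_lik)
        p_w_z = [normalized(row) for row in new_w_z]
        p_z_d = [normalized(row) for row in new_z_d]
    return p_w_z, p_z_d, history


# Hypothetical corpus: two documents use words 0-1, two use words 2-3.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 5, 2], [0, 0, 2, 5]]
p_w_z, p_z_d, history = plsa_em(counts, num_topics=2)
print(history[0], history[-1])
```

The recorded log likelihood should not decrease from one iteration to the next (up to the tiny smoothing constant), which is the usual sanity check for an EM implementation.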
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In PLSA, by having K aspects
• Comparison to SVD
  – U matrix related to P(z|d) (doc to aspect)
  – V matrix related to P(w|z) (aspect to term)
  – Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999.
• Bosch, A., Zisserman, A., and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
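The linear growth can be made concrete with a little arithmetic. The counts below follow the usual accounting in Blei et al. (2003): K·V + K·M table entries for pLSA versus K + K·V for LDA. The sizes chosen for K, V, and M are arbitrary illustration values.

```python
def plsa_parameter_count(num_topics, vocab_size, num_docs):
    # K*V entries for p(w|z) plus K*M entries for p(z|d).
    return num_topics * vocab_size + num_topics * num_docs


def lda_parameter_count(num_topics, vocab_size):
    # K entries for alpha plus K*V entries for beta; independent of M.
    return num_topics + num_topics * vocab_size


K, V = 100, 10_000
small_corpus = plsa_parameter_count(K, V, num_docs=1_000)
large_corpus = plsa_parameter_count(K, V, num_docs=100_000)
lda_size = lda_parameter_count(K, V)
print(small_corpus, large_corpus, lda_size)
```

With these numbers the pLSA model grows from 1.1 million to 11 million entries as the corpus grows, while the LDA count stays fixed.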
Problems in pLSA
• There is no constraint for distributions p(z|d_i)
• Easy to lead to serious problems with over-fitting
[Figure: unrolled pLSA model for documents d_1, d_2, …, d_m, each with its own topic distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirement for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
  – Easy to optimize
• Dirichlet Distribution is one of such distributions
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
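Dirichlet Distribution
• Definition
$$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex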
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
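To get a feel for these parameter settings, the density and a sampler can be sketched with the standard library alone. Sampling by normalizing independent Gamma draws is a textbook construction, not something stated in the slides; the function names and the seed are hypothetical.

```python
import math
import random


def dirichlet_pdf(x, alpha):
    """Dirichlet density at x; zero outside the open simplex."""
    if any(v <= 0 for v in x) or abs(sum(x) - 1.0) > 1e-9:
        return 0.0
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1) * math.log(v) for a, v in zip(alpha, x))
    return math.exp(log_norm + log_kernel)


def dirichlet_sample(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
sample = dirichlet_sample([6, 2, 2], rng)     # mass concentrated on x_1
flat_density = dirichlet_pdf([0.2, 0.3, 0.5], [1, 1, 1])
outside = dirichlet_pdf([0.5, 0.6, -0.1], [1, 1, 1])
print(sample, flat_density, outside)
```

For α = (1, 1, 1) the density is constant over the simplex and equals Γ(3) = 2, which gives a quick check of the normalizing constant.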
The LDA Model
[Figure: LDA graphical model; each document has topics z1 … z4 generating words w1 … w4; label b (likely β)]
• For each document
  – Choose θ ~ Dirichlet(α)
  – For each of the N words w_n
    • Choose a topic z_n ~ Multinomial(θ)
    • Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
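The generative process above can be sketched directly as a sampler. The names `alpha` and `beta` follow the notation of Blei et al. (2003), with `beta[z][w]` standing for p(w|z); the helper functions, the toy topics, and the seed are assumptions for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities (one multinomial trial)."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def generate_document(alpha, beta, num_words, rng):
    """Sample one document: theta, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)    # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)  # w_n ~ p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical model: 2 topics over a 4-word vocabulary.
beta = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
theta, topics, words = generate_document([0.5, 0.5], beta, 10, random.Random(1))
print(theta, topics, words)
```

Unlike pLSA, the per-document mixture θ is itself drawn from a distribution here, so a new document does not require new parameters.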
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
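The equations for these three bullets are missing from the transcript. As a hedged reconstruction, the corresponding expressions in Blei et al. (2003), for a document of N words and a corpus D of M documents, are:

```latex
% Joint probability of the topic mixture, topics, and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out z
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```

The integral over θ couples all the topic assignments of a document, which is why the marginal cannot be evaluated in closed form.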
Inference
• The likelihood can be computed by summing over each document
• Jensen's inequality in EM
Inference
• In the E-Step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
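The relation between the bound and the KL divergence can be checked numerically on a toy model with one discrete hidden variable. Everything here (the joint table, the choice of q, the function name) is made up for illustration; the identity being verified is log p(x) = L(q) + KL(q || p(z|x)).

```python
import math


def bound_and_kl(joint, q):
    """joint[z] = p(x, z) for a fixed observation x; q[z] is the variational distribution."""
    evidence = sum(joint)                       # p(x)
    posterior = [p / evidence for p in joint]   # p(z | x)
    lower_bound = sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)
    kl = sum(qz * math.log(qz / post) for qz, post in zip(q, posterior) if qz > 0)
    return math.log(evidence), lower_bound, kl


joint = [0.10, 0.30, 0.05]   # hypothetical p(x, z) for z = 1, 2, 3
log_evidence, lower_bound, kl = bound_and_kl(joint, [0.2, 0.5, 0.3])

# Setting q to the exact posterior closes the gap (KL = 0).
exact_posterior = [p / sum(joint) for p in joint]
_, tight_bound, tight_kl = bound_and_kl(joint, exact_posterior)
print(log_evidence, lower_bound, kl, tight_bound, tight_kl)
```

Since log p(x) does not depend on q, any increase in the lower bound is matched by an equal decrease in the KL divergence.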
VBEM vs EM
• Only different in the E-Step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: Variational inference – repeat until convergence
• M-Step: Parameter estimation
  – β
  – α can be implemented using the Newton-Raphson method
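The E-step update formulas are missing from the transcript. The sketch below uses the coordinate-ascent updates given in Blei et al. (2003), φ_ni ∝ β_{i,w_n} exp(ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni, which may not match the slide exactly. The digamma approximation, the function names, and the toy model are assumptions.

```python
import math


def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def variational_e_step(words, alpha, beta, iterations=50):
    """Fit gamma and phi for one document; beta[i][w] stands for p(w | z = i)."""
    k = len(alpha)
    gamma = [a + len(words) / k for a in alpha]
    phi = [[1.0 / k] * k for _ in words]
    for _ in range(iterations):
        for n, w in enumerate(words):
            weights = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(k)]
            total = sum(weights)
            phi[n] = [v / total for v in weights]
        gamma = [alpha[i] + sum(row[i] for row in phi) for i in range(k)]
    return gamma, phi


# Hypothetical model: 2 topics, 4-word vocabulary, one short document.
beta = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
alpha = [0.5, 0.5]
words = [0, 0, 1, 3]
gamma, phi = variational_e_step(words, alpha, beta)
print(gamma)
```

In a full variational EM loop, the fitted φ and γ of every document would then feed the M-step updates of β and α.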
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8,000 documents and 15,818 words
[Figure: classification results, (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• Dirichlet Distribution is helpful to avoid over-fitting. But the assumption might be too strong
[Figure: LDA graphical model; each document has topics z1 … z4 generating words w1 … w4; label b (likely β)]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with plates M and N_d and variables π, z, x, θ, β]
Codebook
• 174 Local Image Patches
• Detection
  – Evenly Sampled Grid
  – Random Sampling
  – Saliency Detector
  – Lowe's DoG Detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  Vocabulary space (Dim = 1 million):
          Term1  Term2  Term3  Term4  …  TermN
    Img1  1      2      0      0      …  2
  Topic projection (Dim = 200):
          f1    f2    …  fM
    Img1  0.2   0.1   …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$p = Xw + \epsilon$
An image = [Figure: bar chart of the low dimensional vector; horizontal axis 0 to 1, vertical axis 0 to 16] + a few words (10 words)
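Orthogonal Decomposition
$$p = Xw + \epsilon: \quad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low dimensional representation
• ε: residual
An image = [Figure: bar chart of the low dimensional vector; horizontal axis 0 to 1, vertical axis 0 to 16] + a few words (10 words)
$$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$$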
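The similarity identity can be verified on a small example. The sketch assumes the base vectors are orthonormal, so that w = Xᵀp and the residual is orthogonal to every base vector; the basis, the two vectors, and the function names are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual."""
    w = [dot(b, p) for b in basis]
    projection = [
        sum(w[j] * basis[j][i] for j in range(len(basis)))
        for i in range(len(p))
    ]
    residual = [pi - proj for pi, proj in zip(p, projection)]
    return w, residual


# Hypothetical orthonormal base vectors in a 4-word vocabulary space.
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 2.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)
full_similarity = dot(p, q)
split_similarity = dot(w_p, w_q) + dot(eps_p, eps_q)
print(full_similarity, split_similarity)
```

Keeping only a few large residual entries (the "a few words" in the slide) would turn the second term into an approximation, trading accuracy for a much smaller index.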
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$$p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k)\, p(z=k|d) + p(x=1|d)\, p'(w|d) + p(x=2|d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
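The three-way mixture can be sketched as a single function. The argument names, the three-word vocabulary, and all probabilities are hypothetical; `switch` holds p(x=0|d), p(x=1|d), and p(x=2|d) for one document.

```python
def word_probability(w, switch, topic_word, doc_topic, doc_specific, background):
    """p(w|d) as a mixture of topic, document specific, and background parts."""
    topic_part = sum(
        doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic))
    )
    return (
        switch[0] * topic_part          # x = 0: topic specific distribution
        + switch[1] * doc_specific[w]   # x = 1: document specific distribution
        + switch[2] * background[w]     # x = 2: background distribution
    )


# Hypothetical model over a 3-word vocabulary with 2 topics.
topic_word = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
doc_topic = [0.5, 0.5]
doc_specific = [0.0, 0.0, 1.0]   # this document's "special" word is word 2
background = [0.4, 0.4, 0.2]
switch = [0.6, 0.1, 0.3]

distribution = [
    word_probability(w, switch, topic_word, doc_topic, doc_specific, background)
    for w in range(3)
]
print(distribution)
```

In the decomposition view, the topic part plays the role of Xw and the document specific part the role of the residual.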
Search (Online)
[Figure: online search pipeline. A query = low dimensional vector (bar chart, horizontal axis 0 to 1, vertical axis 0 to 16) + a few words. Labels: LSH Index; candidates Doc 300, Doc 401, …; Re-ranking; Doc Meta with entries DS1, DS2, … for Doc 1, Doc 2, …, Doc 300, Doc 401, …, Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-Jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition - Eigenvalue & Eigenvector
- Definition - Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition - Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic & Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
pLSA
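[Figure: plate notation of pLSA (d → z → w, with plates Nd and M), and the same model unrolled for one document as d → z1, …, zN → w1, …, wN]
z1, …, zN are variables with zi ∈ [1, K], where K is the number of latent topics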
pLSA
[Figure: pLSA unrolled over M documents d1, d2, …, dM; each document generates topic-word pairs (z1, w1), (z2, w2), …]
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents
Likelihood
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
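The formulas for this slide did not survive extraction. For the standard pLSA aspect model (Hofmann, 1999), with document d, latent topic z and word w, the two quantities would read as follows. This is a reconstruction from the standard model, not a copy of the slide:

$$
p(d, z, w) = p(d)\, p(z \mid d)\, p(w \mid z)
$$

$$
p(d, w) = \sum_{z=1}^{K} p(d, z, w) = p(d) \sum_{z=1}^{K} p(z \mid d)\, p(w \mid z)
$$

The joint probability includes the hidden topic z, while the likelihood sums z out and so depends only on the observed pair (d, w).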
Document Decomposition
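• Each document can be decomposed as
$$ p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d) $$
• This is similar to matrix decomposition if we consider each discrete distribution as a vector:
$$ p(w \mid d) = Z_{V \times k}\, p(z \mid d) $$
where column z of Z holds the distribution p(w|z)
• With many documents, we hope to find latent topics as a common basis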
pLSA ndash Objective Function
• pLSA tries to maximize the log likelihood
• Because of the summation over z inside the log, we have to resort to EM
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
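The update formulas for the two steps are missing from the transcript. A minimal sketch of the standard pLSA updates follows, under these assumptions: the E-step computes the posterior p(z|d,w) ∝ p(z|d) p(w|z), and the M-step renormalises the expected counts n(d,w) p(z|d,w). The function names, the toy count matrix and the fixed iteration count are illustrative choices, not the slide's code:

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to one."""
    total = sum(row)
    return [x / total for x in row]


def plsa_em(counts, num_topics, iters=50, seed=0):
    """Fit pLSA by EM; counts[d][w] is how often word w occurs in document d.

    Assumes every document and every word has at least one occurrence.
    """
    rng = random.Random(seed)
    num_docs, vocab = len(counts), len(counts[0])
    # Random positive initialisation of p(w|z) and p(z|d).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(vocab)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    for _ in range(iters):
        new_w_z = [[0.0] * vocab for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(vocab):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior p(z|d,w), proportional to p(z|d) p(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(num_topics)])
                # Accumulate expected counts n(d,w) p(z|d,w) for the M-step.
                for z in range(num_topics):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: renormalise the expected counts into probabilities.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum of n(d,w) log p(w|d); the constant p(d) term is left out."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p_wd = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                total += n * math.log(p_wd)
    return total


# Toy corpus: 4 documents over a 4-word vocabulary (hypothetical counts).
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_w_z, p_z_d = plsa_em(counts, num_topics=2)
print(log_likelihood(counts, p_w_z, p_z_d))
```

Running more iterations from the same initialisation should never lower the log likelihood, which is the property stated earlier for EM.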
Latent Subspace
pLSA vs LSA
• LSA and PLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In PLSA, by having K aspects
• Comparison to SVD
  - U matrix is related to P(z|d) (doc to aspect)
  - V matrix is related to P(w|z) (aspect to term)
  - E matrix is related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in PLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting
[Figure: pLSA unrolled over documents d1, d2, …, dm; each document has its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
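• Definition
$$
p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\; x_i > 0
$$
• The density is zero outside this open (K - 1)-dimensional simplex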
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σ_{i=1..k} αi: α0 = 0.1, α0 = 1, α0 = 10
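To get a feel for how the total concentration α0 shapes the samples, here is a small sketch using only the Python standard library. It draws Dirichlet samples by normalising independent Gamma(αi, 1) variates, a standard construction; the helper name `sample_dirichlet`, the sample size and the "average largest component" summary are illustrative assumptions:

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


rng = random.Random(0)
mean_max = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = [alpha0 / 3] * 3  # equal alpha_i with K = 3, summing to alpha0
    samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]
    # Average size of the largest component: close to 1 means samples sit
    # near a corner of the simplex, close to 1/3 means near the centre.
    mean_max[alpha0] = sum(max(s) for s in samples) / len(samples)

for alpha0, value in mean_max.items():
    print(alpha0, round(value, 3))
```

Small α0 should push the samples toward the corners of the simplex (sparse topic mixtures), while large α0 pulls them toward the uniform mixture.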
The LDA Model
[Figure: graphical model of LDA, showing topics z1, …, z4 and words w1, …, w4 for each of three documents]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
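The generative process can be simulated directly. The sketch below assumes the usual LDA notation (θ for the per-document topic proportions, α for the Dirichlet parameter, β for the topic-word probabilities); the helper names and the toy values of α and β are made up for illustration:

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Return index i with probability probs[i]."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point rounding


def generate_document(alpha, beta, num_words, rng):
    """Generate one document as a list of (topic, word) index pairs."""
    theta = sample_dirichlet(alpha, rng)        # theta ~ Dirichlet(alpha)
    doc = []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)      # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)    # w_n ~ p(w | z_n, beta)
        doc.append((z, w))
    return theta, doc


# Hypothetical model: K = 2 topics over a 4-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.6, 0.3, 0.1, 0.0],
        [0.0, 0.1, 0.3, 0.6]]
rng = random.Random(0)
theta, doc = generate_document(alpha, beta, num_words=8, rng=rng)
print(theta, doc)
```

Unlike pLSA, the topic proportions θ are drawn from a shared prior rather than being free parameters per document, so the same code generates documents that were never seen in training.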
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
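The three formulas named on this slide are missing from the transcript. In the notation of Blei, Ng and Jordan (2003), which the slides appear to follow, they would read as below. The symbols θ, z, w, N, M and D (the corpus) are taken from that paper and are an assumption about what the slide showed:

$$
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$

$$
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
$$

$$
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
$$

The marginal is obtained from the joint by summing over the topics z and integrating over θ; the corpus likelihood is the product of the per-document marginals.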
Inference
• The log likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality is used as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
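Written out, the relation between the bound and the KL divergence is the following identity. The notation (variational parameters γ and φ, factorised distribution q) follows Blei, Ng and Jordan (2003) and is an assumption about the symbols lost from the slide:

$$
\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
$$

$$
L(\gamma, \phi; \alpha, \beta) = \mathbb{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)], \qquad q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
$$

Because the left-hand side does not depend on γ or φ and the KL term is non-negative, raising L necessarily lowers the KL divergence by the same amount.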
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), which makes KL = 0
• In VBEM, p(X | D, θ) is intractable to compute; instead it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  - β
  - The update of α can be implemented using the Newton-Raphson method
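The update equations for the two steps are not preserved in the transcript. For reference, the standard updates from Blei, Ng and Jordan (2003) are sketched below; Ψ denotes the digamma function, and w_{dn}^{j} = 1 when the n-th word of document d is vocabulary item j. Whether the slide used exactly this notation is an assumption:

$$
\text{E-step:} \quad \phi_{ni} \propto \beta_{i w_n} \exp\big( \Psi(\gamma_i) \big), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
$$

$$
\text{M-step:} \quad \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}
$$

The two E-step equations are iterated for each document until γ and φ stop changing. β has the closed-form update shown, whereas α has no closed form, which is why a Newton-Raphson iteration on the bound is used for it.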
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: (a) EARN vs. NOT EARN, (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: graphical model of LDA, showing topics z1, …, z4 and words w1, …, w4 for each of three documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with labels M, Nd, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector
• Representation: normalized 11x11 gray values, 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)
        Term1   Term2   Term3   Term4   …   TermN
Img1    1       2       0       0       …   2        (Dim = 1 million)
        f1      f2      …       fM
Img1    0.2     0.1     …       0.03                 (Dim = 200)
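The two failure modes of long queries are easy to reproduce on a toy inverted index. The sketch below is only an illustration; the four documents, their terms and the five-term query are made up:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


# Hypothetical corpus of four tiny documents.
docs = {
    1: {"sky", "sea", "boat"},
    2: {"sky", "tree"},
    3: {"sea", "sand"},
    4: {"tree", "road"},
}
index = build_inverted_index(docs)

# A "long" query touches one inverted list per term.
query = ["sky", "sea", "tree", "sand", "road"]
lists = [index[term] for term in query]

intersection = set.intersection(*lists)  # documents containing every term
union = set.union(*lists)                # documents containing any term
print(intersection, union)
```

Even with five terms the intersection is already empty while the union covers every document, so neither gives a useful candidate set. This is the motivation for projecting the query to a low-dimensional topic vector first.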
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$ p = Xw + \varepsilon $$
An image = [Figure: bar chart of the low-dimensional feature vector] + a few words (10 words)
Orthogonal Decomposition
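$$
p = Xw + \varepsilon: \qquad
\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix}
=
\begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}
$$
• Columns of X: base vectors
• w: low-dimensional representation
• ε: residual
An image = [Figure: bar chart of the low-dimensional feature vector] + a few words (10 words)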
$$ p \cdot q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q $$
Base vectors: X1, X2, X3, …, Xk
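A small numerical check may make the decomposition concrete. The sketch assumes the base vectors are orthonormal and that the residual is whatever remains after projecting onto them; under those assumptions the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The basis, the vectors p and q, and the helper names are made up, and the paper itself obtains the decomposition from a topic model rather than from this explicit projection:

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps."""
    w = [dot(x, p) for x in basis]  # w = X^T p
    projection = [sum(w[j] * basis[j][i] for j in range(len(basis)))
                  for i in range(len(p))]  # X w
    eps = [pi - qi for pi, qi in zip(p, projection)]  # eps = p - X w
    return w, eps


# Two orthonormal base vectors in a 4-dimensional vocabulary space.
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, -0.5, -0.5]]
p = [1.0, 2.0, 0.0, 3.0]
q = [2.0, 0.0, 1.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

# The similarity in the original space equals the similarity of the
# low-dimensional vectors plus the similarity of the residuals.
print(dot(p, q), dot(w_p, w_q) + dot(eps_p, eps_q))
```

This is why keeping a few residual words alongside the low-dimensional vector can preserve the original similarity: the residual term is the part that dimension reduction alone would throw away.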
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$
p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)
$$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
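As a rough illustration of the switch variable, the sketch below evaluates the word probability as a three-way mixture: x = 0 selects the topic mixture, x = 1 a document-specific distribution, and x = 2 a background distribution. All names and numbers are hypothetical; only the mixture structure comes from the slide:

```python
def word_prob(w, p_x, p_w_z, p_z_d, p_w_doc, p_w_bg):
    """p(w|d) as a mixture over the switch variable x in {0, 1, 2}."""
    # x = 0: topic-specific part, sum over k of p(w|z=k) p(z=k|d).
    topic_part = sum(p_w_z[k][w] * p_z_d[k] for k in range(len(p_z_d)))
    # x = 1: document-specific distribution; x = 2: background distribution.
    return p_x[0] * topic_part + p_x[1] * p_w_doc[w] + p_x[2] * p_w_bg[w]


# Hypothetical parameters for one document over a 3-word vocabulary.
p_x = [0.6, 0.3, 0.1]        # p(x=0|d), p(x=1|d), p(x=2|d)
p_w_z = [[0.7, 0.2, 0.1],    # p(w|z=0)
         [0.1, 0.3, 0.6]]    # p(w|z=1)
p_z_d = [0.5, 0.5]           # p(z|d)
p_w_doc = [0.0, 1.0, 0.0]    # document-specific words
p_w_bg = [0.2, 0.3, 0.5]     # background words

probs = [word_prob(w, p_x, p_w_z, p_z_d, p_w_doc, p_w_bg) for w in range(3)]
print(probs)
```

Because each component is itself a distribution over words and the switch probabilities sum to one, the mixture is again a valid distribution. The document-specific component plays the role of the residual: it absorbs words that the shared topics explain poorly.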
Search (Online)
[Figure: online search. A query (low-dimensional vector + a few words) is looked up in the LSH index (entries DS1, DS2, …), returning candidates Doc 300, Doc 401, …, which are re-ranked against the Doc Meta store (Doc 1, Doc 2, …, Doc N)]
Index: 10M images, 46GB; search speed < 100 ms
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Joint Probability vs Likelihood
• Joint probability
• Likelihood (only for observed variables)
• p(d) is assumed to be uniform
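As a sketch in the usual aspect-model notation of Hofmann (1999), and assuming the slide follows that parameterization, the two quantities can be written as below. The joint probability includes the latent topic z; the likelihood keeps only the observed pair (d, w) by summing z out.

```latex
% Joint probability over document d, word w and latent topic z
p(d, w, z) = p(d)\, p(z \mid d)\, p(w \mid z)

% Likelihood of the observed pair (d, w): marginalize the latent topic
p(d, w) = p(d) \sum_{z} p(z \mid d)\, p(w \mid z)
```

With p(d) uniform, the factor p(d) is a constant and does not affect the maximization.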
Document Decomposition
• Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d)
• This is similar to matrix decomposition if we consider each discrete distribution as a vector
  p(w|d) = Z_{V×k} p(z|d)
• With many documents, we hope to find latent topics as a common basis
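The matrix view of this decomposition can be checked numerically. The following is a minimal sketch with random toy distributions; the sizes and the names `Z`, `C` and `P` are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k, M = 6, 2, 4  # vocabulary size, number of topics, number of documents (toy sizes)

# Z[:, j] = p(w | z=j): each column is a distribution over the vocabulary
Z = rng.random((V, k))
Z /= Z.sum(axis=0, keepdims=True)

# C[:, d] = p(z | d): each column is the topic mixture of one document
C = rng.random((k, M))
C /= C.sum(axis=0, keepdims=True)

# P[:, d] = p(w | d) = sum_z p(w | z) p(z | d), i.e. one matrix product
P = Z @ C

# Every column of P is again a distribution over the vocabulary
print(P.sum(axis=0))
```

The columns of `Z` play the role of the common basis (the latent topics), and each column of `C` gives the coordinates of one document in that basis.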
pLSA – Objective Function
• pLSA tries to maximize the log likelihood
• Due to the summation over z inside the log, we have to resort to EM
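In the standard formulation (Hofmann 1999), the objective can be sketched as follows. The symbol n(d, w), the number of times word w occurs in document d, is an assumed notation here.

```latex
\mathcal{L} = \sum_{d} \sum_{w} n(d, w)\, \log p(d, w)
            = \sum_{d} \sum_{w} n(d, w)\, \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]
```

The sum over z sits inside the logarithm, so the log does not split into separate terms for p(w|z) and p(z|d); this is what prevents a closed-form maximizer.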
EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
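The two steps can be sketched in a few lines of NumPy. This is a minimal illustration of the standard pLSA updates (posterior p(z|d,w) in the E-step, normalized expected counts in the M-step), not the presenter's implementation; the function name `plsa_em` and the toy count matrix are assumptions.

```python
import numpy as np

def plsa_em(N, k, n_iter=50, seed=0):
    """Fit pLSA by EM to a (V x M) term-document count matrix N."""
    rng = np.random.default_rng(seed)
    V, M = N.shape
    p_w_z = rng.random((V, k))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)   # columns: p(w | z)
    p_z_d = rng.random((k, M))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)   # columns: p(z | d)
    history = []
    for _ in range(n_iter):
        # E-step: posterior p(z | d, w) proportional to p(w | z) p(z | d); shape (V, k, M)
        post = p_w_z[:, :, None] * p_z_d[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts n(d, w) p(z | d, w)
        expected = N[:, None, :] * post
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
        # Log likelihood (up to the constant contributed by uniform p(d))
        history.append(float((N * np.log(p_w_z @ p_z_d + 1e-12)).sum()))
    return p_w_z, p_z_d, history

# Toy corpus: words 0-2 mostly occur in documents 0-1, words 3-5 in documents 2-3
N = np.array([[4, 3, 0, 0],
              [5, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 0, 5, 4],
              [0, 1, 3, 5],
              [1, 0, 4, 3]], dtype=float)
p_w_z, p_z_d, history = plsa_em(N, k=2)
```

The recorded `history` should never decrease from one iteration to the next, which is the usual sanity check for an EM implementation.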
Latent Subspace
pLSA vs LSA
• LSA and pLSA perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – U matrix related to P(z|d) (doc to aspect)
  – V matrix related to P(w|z) (aspect to term)
  – Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
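A rough count makes the linear growth concrete. The sketch below ignores the normalization constraints, and the sizes k, V and M in the example are made-up illustrative values.

```latex
% k topics, V vocabulary words, M training documents
\#\text{params}_{\text{pLSA}} = \underbrace{kV}_{p(w \mid z)} + \underbrace{kM}_{p(z \mid d)}

% Example: k = 100, V = 10^4, M = 10^5
100 \cdot 10^4 + 100 \cdot 10^5 = 10^6 + 10^7 = 1.1 \times 10^7
```

The kM term comes from storing a separate mixture p(z|d) for every training document, so in this example it already dominates the topic-word table.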
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting problems
[Figure: documents d_1, d_2, …, d_m, each generating its topic-word pairs (z_1, w_1), (z_2, w_2), …, (z_{N_i}, w_{N_i}) from its own distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  – Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
Dirichlet Distribution
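• Definition
  p(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{k} x_i = 1,\ x_i > 0
• The density is zero outside this open (K − 1)-dimensional simplex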
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0 = \sum_{i=1}^{k} \alpha_i: α_0 = 0.1, α_0 = 1, α_0 = 10
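The effect of the parameters can be explored by sampling. The sketch below evaluates the Dirichlet log density with the standard formula and draws samples with NumPy for K = 3; the helper name `dirichlet_logpdf`, the sample size and the "mean largest coordinate" summary are illustrative choices, not part of the slides.

```python
import math
import numpy as np

def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x of the open simplex."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    return log_norm + float(((alpha - 1.0) * np.log(x)).sum())

rng = np.random.default_rng(0)

# Unequal parameters: the sample mean approaches alpha / alpha_0
alpha = np.array([6.0, 2.0, 2.0])
mean_estimate = rng.dirichlet(alpha, size=20000).mean(axis=0)

# Equal alpha_i with different totals alpha_0 = sum_i alpha_i
spread = {}
for alpha0 in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(np.full(3, alpha0 / 3), size=2000)
    # Mean of the largest coordinate: close to 1 when samples sit near the corners
    spread[alpha0] = float(samples.max(axis=1).mean())
print(spread)
```

A small α_0 pushes samples toward the corners of the simplex (sparse mixtures), while a large α_0 concentrates them around the center.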
The LDA Model
[Figure: unrolled LDA graphical model; in each document, topics z1 … z4 generate words w1 … w4, and the topic-word parameter β is shared across documents]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
The LDA Model
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    – Choose a topic z_n ~ Multinomial(θ)
    – Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
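The generative process above can be sketched directly as a sampler. The names follow the usual LDA notation of Blei et al. (2003), which is assumed here: `theta` is the per-document topic mixture, `alpha` the Dirichlet parameter, and `beta[t]` the word distribution of topic t. The toy values are made up.

```python
import numpy as np

def generate_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    k, V = beta.shape
    docs = []
    for n_words in doc_lengths:
        theta = rng.dirichlet(alpha)                    # theta ~ Dirichlet(alpha)
        z = rng.choice(k, size=n_words, p=theta)        # z_n ~ Multinomial(theta)
        # w_n ~ p(w | z_n): draw from the word distribution of the chosen topic
        w = np.array([rng.choice(V, p=beta[t]) for t in z], dtype=int)
        docs.append((theta, z, w))
    return docs

# Toy setup: 3 topics over a 5-word vocabulary (illustrative values)
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([[0.6, 0.3, 0.1, 0.0, 0.0],
                 [0.0, 0.1, 0.6, 0.3, 0.0],
                 [0.1, 0.0, 0.0, 0.3, 0.6]])
docs = generate_corpus(alpha, beta, doc_lengths=[8, 5, 12])
```

Unlike pLSA, the per-document mixture `theta` is itself a random draw, so a previously unseen document is generated in exactly the same way as a training document.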
Joint Probability
• Given parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
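For reference, the three quantities can be written as in Blei, Ng and Jordan (2003), assuming the slides use the same notation: θ is the topic mixture of a document, z_n the topic of word w_n, N the document length and M the number of documents.

```latex
% Joint probability of a topic mixture, topics and words, given alpha and beta
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big)\, d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big)\, d\theta_d
```

Here p(θ|α) is the Dirichlet density and p(z_n|θ) is simply the component of θ selected by z_n.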
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-Step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
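The relation between the lower bound, the KL divergence and the log likelihood can be verified on a tiny discrete model. In the sketch below the latent variable X takes three values and the joint probabilities p(X, D | θ) for the observed D are made-up numbers; `lower_bound` and `kl` are illustrative helper names.

```python
import numpy as np

# Toy model: joint[x] = p(X=x, D | theta) for the observed data D
joint = np.array([0.10, 0.25, 0.05])
log_evidence = float(np.log(joint.sum()))   # log p(D | theta)
posterior = joint / joint.sum()             # p(X | D, theta)

def lower_bound(q):
    """L(q) = E_q[log p(X, D | theta)] - E_q[log q(X)]."""
    return float((q * (np.log(joint) - np.log(q))).sum())

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float((q * (np.log(q) - np.log(p))).sum())

q = np.array([0.5, 0.3, 0.2])               # an arbitrary variational distribution

# log p(D | theta) = L(q) + KL(q || p(X | D, theta))
gap = log_evidence - lower_bound(q)
print(gap, kl(q, posterior))
```

Setting `q` to the exact posterior closes the gap, which is the standard EM case; any other `q` leaves a positive KL term, which is the VBEM case.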
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  – β: updated in closed form
  – α: can be implemented using the Newton-Raphson method
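A minimal sketch of these updates, assuming the standard variational equations of Blei et al. (2003): φ_ni ∝ β_{i,w_n} exp(ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni in the E-step, and β_ij ∝ Σ_d Σ_n φ_dni [w_dn = j] in the M-step. The helper `digamma` is a small series approximation written here to keep the example self-contained; all function names and toy values are assumptions, and the Newton-Raphson update for α is omitted.

```python
import math
import numpy as np

def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def e_step(words, alpha, beta, n_iter=50):
    """Variational inference for one document; words is an array of word ids."""
    k, n_words = len(alpha), len(words)
    gamma = alpha + n_words / k
    phi = np.full((n_words, k), 1.0 / k)
    for _ in range(n_iter):
        weights = np.exp([digamma(g) for g in gamma])   # exp(psi(gamma_i))
        phi = beta[:, words].T * weights                # phi_ni proportional to beta_{i,w_n} exp(psi(gamma_i))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)                 # gamma_i = alpha_i + sum_n phi_ni
    return gamma, phi

def m_step_beta(docs, phis, k, V):
    """Closed-form update of beta from the per-document phi matrices."""
    counts = np.zeros((V, k))
    for words, phi in zip(docs, phis):
        np.add.at(counts, words, phi)                   # counts[w_n, i] += phi_ni
    beta = counts.T
    return beta / beta.sum(axis=1, keepdims=True)

# Toy example: 2 topics, 4-word vocabulary, strictly positive beta
alpha = np.array([0.5, 0.5])
beta = np.array([[0.4, 0.4, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
words = np.array([0, 1, 1, 3])
gamma, phi = e_step(words, alpha, beta)
new_beta = m_step_beta([words], [phi], k=2, V=4)
```

The ψ(Σ_j γ_j) term of the full update is the same for every topic, so it cancels when each row of `phi` is normalized and is left out here.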
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: unrolled LDA graphical model; topics z1 … z4 and words w1 … w4 in each document, with shared β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram of the model (labels M, N, d; variables π, z, x, θ, β)]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2        (Dim = 1 million)
        ↓ Topic Projection
        f1     f2     …      fM
  Img1  0.2    0.1    …      0.03               (Dim = 200)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
p = Xw + ε
[Figure: an image = a low dimensional feature vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
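p = Xw + ε
\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low dimensional representation
• ε: residual
[Figure: an image = a low dimensional feature vector (bar chart) + a few words (10 words)]
p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q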
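The decomposition can be illustrated numerically. The sketch below assumes that the base vectors are orthonormal (built here by a QR factorization of a random matrix) and that the residual is whatever the projection leaves over; under those assumptions the inner product of two documents splits exactly into a low-dimensional part and a residual part. All sizes and names are toy choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W, k = 50, 5   # vocabulary size and reduced dimension (toy sizes)

# X: W x k matrix with orthonormal columns (assumption), obtained here by QR
X, _ = np.linalg.qr(rng.standard_normal((W, k)))

def decompose(p):
    """Split p into a low-dimensional code w and a residual orthogonal to X."""
    w = X.T @ p          # low dimensional representation
    eps = p - X @ w      # residual error
    return w, eps

p, q = rng.random(W), rng.random(W)
(w_p, eps_p), (w_q, eps_q) = decompose(p), decompose(q)

exact = p @ q                          # similarity in the original space
split = w_p @ w_q + eps_p @ eps_q      # similarity from the two parts
print(exact, split)
```

Because the cross terms vanish, a similarity search can score the low-dimensional codes and the residuals separately and add the results.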
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
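The three-way switch can be sketched as a plain mixture of word distributions. Everything in the block below (sizes, random distributions, switch probabilities, variable names) is a made-up illustration of the mixture structure, not the model's fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 8, 3   # vocabulary size and number of topics (toy sizes)

def normalize(a, axis=0):
    return a / a.sum(axis=axis, keepdims=True)

p_w_z = normalize(rng.random((V, K)))   # topic-specific: p(w | z=k)
p_z_d = normalize(rng.random(K))        # topic mixture of the document: p(z=k | d)
p_w_doc = normalize(rng.random(V))      # document-specific word distribution
p_w_bg = normalize(rng.random(V))       # background word distribution
p_x_d = np.array([0.6, 0.3, 0.1])       # switch probabilities p(x=0|d), p(x=1|d), p(x=2|d)

# Mixture over the switch variable x
p_w_d = (p_x_d[0] * (p_w_z @ p_z_d)
         + p_x_d[1] * p_w_doc
         + p_x_d[2] * p_w_bg)
print(p_w_d.sum())
```

The topic part corresponds to the low-dimensional representation, while the document-specific part plays the role of the residual words of that document.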
Search (Online)
[Figure: online search pipeline with the labels: a query (a low dimensional feature vector + a few words); LSH Index over document signatures (DS1, DS2, …); candidate documents (Doc 300, Doc 401, …); Doc Meta (Doc 1, Doc 2, …, Doc 300, Doc 401, …, Doc N); Re-ranking (Doc 401, …)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk topic space
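The three simplest baselines can be sketched as follows. The reading of NP as "reply to the immediately preceding post" and of DS as "reply to the earlier post with the highest bag-of-words cosine similarity" is an assumption; the function names and the toy thread are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_parents(posts, method):
    """posts: list of token lists in posting order; post 0 is the thread root."""
    bows = [Counter(p) for p in posts]
    parents = [None]                      # the root replies to nothing
    for i in range(1, len(posts)):
        if method == "NP":                # reply to the nearest (preceding) post
            parents.append(i - 1)
        elif method == "RR":              # reply to the root
            parents.append(0)
        elif method == "DS":              # reply to the most similar earlier post
            parents.append(max(range(i), key=lambda j: cosine(bows[i], bows[j])))
        else:
            raise ValueError(f"unknown method: {method}")
    return parents

thread = [["lda", "topic", "model"],
          ["pagerank", "link", "graph"],
          ["topic", "model", "inference"],
          ["graph", "link", "hits"]]
for method in ("NP", "RR", "DS"):
    print(method, predict_parents(thread, method))
```

The LDA and SWB baselines follow the same pattern as DS, with the bag-of-words vectors replaced by each post's projection into topic space.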
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods
  – HITS
  – PageRank
  – …
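The pipeline can be sketched end to end on a toy reply network: reconstructed replies become directed edges between users, and HITS power iteration ranks the users. Treating high-authority users as the experts is an assumption of this sketch, and the edge list is made up.

```python
import numpy as np

def build_network(replies, n_users):
    """adj[i, j] = 1 if user i replied to a post written by user j."""
    adj = np.zeros((n_users, n_users))
    for replier, author in replies:
        adj[replier, author] = 1.0
    return adj

def hits(adj, n_iter=100):
    """HITS by power iteration; returns (hub, authority) score vectors."""
    n = adj.shape[0]
    hub, auth = np.ones(n), np.ones(n)
    for _ in range(n_iter):
        auth = adj.T @ hub                 # authorities are replied to by good hubs
        auth /= np.linalg.norm(auth)
        hub = adj @ auth                   # hubs reply to good authorities
        hub /= np.linalg.norm(hub)
    return hub, auth

# Toy network: users 1, 2 and 3 reply to user 0; user 3 also replies to user 1
replies = [(1, 0), (2, 0), (3, 0), (3, 1)]
adj = build_network(replies, n_users=4)
hub, auth = hits(adj)
print(np.argsort(-auth))   # users ordered by authority score
```

PageRank can be run on the same adjacency matrix, so the quality of the reconstructed reply structure directly affects both ranking methods.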
Baselines
• LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR '06
  – Achieves stable performance in the expert finding task using a language model
• PageRank
  – Benchmark nodal ranking method
• HITS
  – Finds hub nodes and authority nodes
• EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR '06
  – Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
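For readers unfamiliar with the three column headers, the per-query quantities behind them can be sketched as below, reading P10 as precision at rank 10. MRR and MAP are the means of the first two quantities over all queries. The function names and the toy ranking are illustrative assumptions, not the paper's evaluation code.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Toy ranking of users for one query; u1 and u2 are the true experts
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}
print(reciprocal_rank(ranked, relevant),
      average_precision(ranked, relevant),
      precision_at_k(ranked, relevant, k=2))
```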
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  – Matrix decomposition: a good way to practice matrices
  – Graphical models: a good way to practice probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
pLSA - Objective Function
• pLSA tries to maximize the log-likelihood
• Because of the summation over z inside the logarithm, we have to resort to EM
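The objective on this slide is not preserved in the transcript. As a reference point, it is usually written as follows in Hofmann's aspect model; the notation n(d, w) for the number of times word w occurs in document d is an assumption made here, not taken from the slide:

```latex
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \, \log P(d, w),
\qquad
P(d, w) = P(d) \sum_{z} P(w \mid z) \, P(z \mid d)
```

The sum over z sits inside the logarithm, which is why setting the derivatives to zero does not give a closed-form solution.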
EM Steps
• E-Step
  - The expectation of the likelihood function is calculated with the current parameter values
• M-Step
  - Update the parameters with the calculated posterior probabilities
  - Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step
• The M-Step
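The update equations for the two steps are not preserved in the transcript. The following is a minimal sketch of the standard pLSA EM updates, assuming a dense document-word count matrix; the function names `plsa_em` and `log_likelihood`, the random initialization, and the toy counts are illustrative choices, not the author's implementation.

```python
import math
import random


def normalize(row):
    """Scale a non-negative row to sum to one (uniform if it is all zero)."""
    total = sum(row)
    if total <= 0.0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def plsa_em(counts, num_topics, num_iters=50, seed=0):
    """Estimate P(w|z) and P(z|d) from counts[d][w] with EM."""
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    # Random positive initialization, one distribution per row.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    for _ in range(num_iters):
        new_w_z = [[0.0] * num_words for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(num_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(num_topics)])
                # Accumulate expected counts n(d,w) P(z|d,w) for the M-step.
                for z in range(num_topics):
                    new_w_z[z][w] += counts[d][w] * post[z]
                    new_z_d[d][z] += counts[d][w] * post[z]
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum over (d, w) of n(d,w) log sum_z P(w|z) P(z|d); P(d) is dropped."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                mix = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                total += n * math.log(mix)
    return total


# Hypothetical toy corpus: 4 documents over a 4-word vocabulary.
toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
topic_word, doc_topic = plsa_em(toy_counts, num_topics=2, num_iters=30)
```

Running more iterations from the same seed should never lower `log_likelihood`, which is the property stated earlier in the deck for EM in general.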
Latent Subspace
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix: related to P(z|d) (document to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
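The correspondence can be made explicit with the symmetric parameterization of the aspect model used in Hofmann's paper. The matrix names with hats are introduced here only for illustration. Note that in this form the document factor is P(d|z), which is related to the P(z|d) listed above through Bayes' rule:

```latex
P(d, w) = \sum_{z} P(z) \, P(d \mid z) \, P(w \mid z)
\quad \Longleftrightarrow \quad
\mathbf{P} = \hat{U} \, \hat{\Sigma} \, \hat{V}^{\top},
\qquad
\hat{U}_{dz} = P(d \mid z), \quad
\hat{\Sigma} = \mathrm{diag}\big(P(z)\big), \quad
\hat{V}_{wz} = P(w \mid z)
```

Unlike the SVD factors, these matrices are non-negative and normalized, and they are not orthogonal in general.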
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• In pLSA, statistical model selection can be used to determine a suitable K
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
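A small worked count makes the growth concrete. The sketch below uses the parameter counts given in the LDA paper by Blei et al. (K·V + K·M for pLSA, K + K·V for LDA); the function names and the corpus sizes are hypothetical.

```python
def plsa_param_count(num_topics, vocab_size, num_docs):
    """K*V entries of P(w|z) plus K*M entries of P(z|d)."""
    return num_topics * vocab_size + num_topics * num_docs


def lda_param_count(num_topics, vocab_size):
    """K entries of alpha plus K*V entries of beta; independent of M."""
    return num_topics + num_topics * vocab_size


# Hypothetical sizes: 100 topics, 10,000 vocabulary words.
small = plsa_param_count(100, 10_000, 1_000)
large = plsa_param_count(100, 10_000, 100_000)
fixed = lda_param_count(100, 10_000)
```

With these sizes, going from 1,000 to 100,000 training documents adds 9.9 million pLSA parameters, while the LDA count does not change.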
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: pLSA graphical model. Each document d_1, d_2, …, d_m generates its topics z_1, z_2, …, z_N and words w_1, w_2, …, w_N from its own distribution p(z|d_1), p(z|d_2), …, p(z|d_m)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
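• Definition
$$p(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{k} x_i = 1, \; x_i > 0$$
• The density is zero outside this open (K - 1)-dimensional simplex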
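Example Dirichlet Distributions (K=3)
• Various parameter vectors α
[Figure: Dirichlet densities for α = (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{k} \alpha_i$
[Figure: Dirichlet densities for α_0 = 0.1, α_0 = 1, α_0 = 10]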
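To get a feel for these densities without the plots, the sketch below draws a sample and evaluates the log-density using only the Python standard library. Sampling by normalizing independent Gamma draws is a standard construction, not something stated on the slides; the function names are illustrative.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def dirichlet_log_density(x, alpha):
    """Log of the Dirichlet density at a point x in the open simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


rng = random.Random(0)
# One of the parameter vectors from the example slide.
theta = sample_dirichlet([6.0, 2.0, 2.0], rng)
# With alpha = (1, 1, 1) the density is flat over the simplex.
flat = dirichlet_log_density([1 / 3, 1 / 3, 1 / 3], [1.0, 1.0, 1.0])
```

Each sample is a K-tuple of positive numbers that sums to one, so it can be used directly as a topic mixture.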
The LDA Model
[Figure: LDA graphical model with topic nodes z_1 … z_4 and word nodes w_1 … w_4 for each of three documents]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
  - Choose a topic z_n ~ Multinomial(θ)
  - Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
The LDA Model
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
  - Choose a topic z_n ~ Multinomial(θ)
  - Choose a word w_n from p(w_n|z_n), a multinomial probability conditioned on the topic z_n
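The generative steps above can be sketched in a few lines of standard-library Python. The names `generate_document` and `sample_categorical`, the matrix `beta` holding one word distribution per topic, and the toy values are assumptions made for illustration.

```python
import random


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, num_words, rng):
    """Generate the topic mixture, topics, and words of one document."""
    # theta ~ Dirichlet(alpha), drawn via normalized Gamma variates.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    theta = [g / total for g in draws]
    topics, words = [], []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)    # z_n ~ Multinomial(theta)
        w = sample_categorical(beta[z], rng)  # w_n ~ p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(1)
# Hypothetical setup: 2 topics over a 4-word vocabulary.
alpha = [2.0, 2.0]
beta = [[0.5, 0.5, 0.0, 0.0],   # topic 0 only emits words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]   # topic 1 only emits words 2 and 3
theta, topics, words = generate_document(alpha, beta, 20, rng)
```

Because θ is drawn once per document and z_n once per word, a single document can mix several topics.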
Joint Probability
• Given the parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
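The three equations on this slide are not preserved in the transcript. In the notation of the LDA paper by Blei, Ng, and Jordan (listed in the references of this part), they read as follows; the document index d and the per-document length N_d are written out explicitly here for clarity:

```latex
\begin{aligned}
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta \\
p(D \mid \alpha, \beta)
  &= \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
\end{aligned}
```

The integral over θ coupled with the sum over topics is what makes exact inference intractable.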
Inference
• The log-likelihood of the corpus can be computed by summing the log-likelihoods of the individual documents
• Jensen's inequality, as in EM
Inference
• In the E-Step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log-likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
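Written out in the notation of the LDA paper by Blei et al., with the factorized variational distribution q and its parameters γ (Dirichlet) and φ (multinomial), the relation between the bound and the KL divergence is:

```latex
\begin{aligned}
q(\theta, \mathbf{z} \mid \gamma, \phi) &= q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n) \\
\log p(\mathbf{w} \mid \alpha, \beta)
  &= L(\gamma, \phi; \alpha, \beta)
   + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big) \\
L(\gamma, \phi; \alpha, \beta)
  &= \mathbb{E}_q\big[ \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \big]
   - \mathbb{E}_q\big[ \log q(\theta, \mathbf{z}) \big]
\end{aligned}
```

The left-hand side does not depend on γ and φ, so raising L necessarily lowers the KL term by the same amount.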
VBEM vs EM
• They differ only in the E-Step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
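The equivalence can be checked numerically on a tiny discrete model. The sketch below assumes a hidden variable X with two states and a made-up joint table p(X, D) for one fixed observation D; the function names are illustrative.

```python
import math


def elbo(q, joint):
    """Lower bound: sum over x of q(x) log(p(x, D) / q(x))."""
    return sum(qx * math.log(px / qx) for qx, px in zip(q, joint) if qx > 0)


def kl_to_posterior(q, joint):
    """KL(q(X) || p(X | D)), where p(X | D) = p(X, D) / p(D)."""
    evidence = sum(joint)
    return sum(qx * math.log(qx / (px / evidence))
               for qx, px in zip(q, joint) if qx > 0)


# Hypothetical joint probabilities p(X = x, D) for x in {0, 1}.
joint = [0.1, 0.3]
log_evidence = math.log(sum(joint))
posterior = [p / sum(joint) for p in joint]   # what standard EM uses as q
approx_q = [0.5, 0.5]                         # a restricted choice of q
```

For any q, `elbo(q, joint) + kl_to_posterior(q, joint)` equals `log_evidence`; choosing the exact posterior makes the KL term vanish and the bound tight, which is the standard EM case.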
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: variational inference, repeated until convergence
• M-Step: parameter estimation
  - β
  - α: can be implemented using the Newton-Raphson method
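The update equations on this slide are not preserved in the transcript. For reference, the updates given in the LDA paper by Blei et al. are reproduced below, where Ψ is the digamma function, w_n is the vocabulary index of the n-th word, and w_{dn}^j is 1 when the n-th word of document d is vocabulary item j and 0 otherwise:

```latex
\begin{aligned}
\text{E-step (per document):} \quad
  & \phi_{ni} \propto \beta_{i w_n} \exp\!\big( \Psi(\gamma_i) \big),
  \qquad
  \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni} \\
\text{M-step:} \quad
  & \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni} \, w_{dn}^{j}
\end{aligned}
```

The two E-step updates are alternated until γ and φ stop changing; α has no closed-form update, which is why an iterative method such as Newton-Raphson is needed.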
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model with topic nodes z_1 … z_4 and word nodes w_1 … w_4 for each of three documents]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model in plate notation with nodes π, z, x, θ, β (plate labels: M, Nd)]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

Vocabulary space (Dim = 1 million):
        Term1   Term2   Term3   Term4   …   TermN
Img1    1       2       0       0       …   2

Topic projection

Topic space (Dim = 200):
        f1      f2      …   fM
Img1    0.2     0.1     …   0.03
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = a low dimensional feature vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
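$$p = Xw + \varepsilon: \quad \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• X = (X_1, X_2, X_3, …, X_k): base vectors
• w: low dimensional representation
• ε: residual
[Figure: an image = a low dimensional feature vector (bar chart) + a few words (10 words)]
$$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$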
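The inner-product identity can be verified on a small example. The sketch below assumes that the base vectors are orthonormal and that the residual is what remains after projecting onto them, which is what makes the cross terms vanish; the basis and the two vectors are made-up values.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual."""
    w = [dot(x, p) for x in basis]                        # w = X^T p
    projection = [sum(w[k] * basis[k][i] for k in range(len(basis)))
                  for i in range(len(p))]                 # Xw
    residual = [pi - ri for pi, ri in zip(p, projection)]  # p - Xw
    return w, residual


# Hypothetical orthonormal base vectors X_1, X_2 over a 4-word vocabulary.
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 2.0, 2.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)
similarity = dot(w_p, w_q) + dot(eps_p, eps_q)
```

Here `similarity` matches `dot(p, q)`, so the full-vocabulary inner product can be recovered from the short vector w plus the residual, which is the point of keeping the residual alongside the reduced representation.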
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k) \, p(z = k \mid d) + p(x = 1 \mid d) \, p'(w \mid d) + p(x = 2 \mid d) \, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
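The mixture can be evaluated directly once the three component distributions are known. The sketch below is only an illustration of the formula; the function name, argument names, and all probability values are hypothetical.

```python
def word_probability(w, switch, topic_mix, topic_word, doc_word, background):
    """p(w|d) as a three-way mixture selected by the switch variable x."""
    topic_part = sum(topic_mix[k] * topic_word[k][w]
                     for k in range(len(topic_mix)))
    return (switch[0] * topic_part        # x = 0: topic-specific words
            + switch[1] * doc_word[w]     # x = 1: document-specific words
            + switch[2] * background[w])  # x = 2: background words


# Hypothetical values for one document over a 3-word vocabulary.
switch = [0.6, 0.3, 0.1]                  # p(x = 0|d), p(x = 1|d), p(x = 2|d)
topic_mix = [0.5, 0.5]                    # p(z = k|d)
topic_word = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]   # p(w|z = k)
doc_word = [0.0, 1.0, 0.0]                # document-specific distribution
background = [0.2, 0.3, 0.5]              # background distribution
probs = [word_probability(w, switch, topic_mix, topic_word, doc_word, background)
         for w in range(3)]
```

Since every component is a distribution and the switch probabilities sum to one, the resulting `probs` is again a distribution over the vocabulary.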
Search (Online)
[Figure: online search pipeline. A query (a low dimensional feature vector plus a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, …); the candidates are re-ranked using the per-document entries (DS1, DS2, …) stored in the Doc Meta table for Doc 1 … Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to practice matrix methods
  - Graphical models: a good way to practice probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
Lower Bounding the Log Likelihood
EM Steps
bull The E-Step
bull The M-Step
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
Example Dirichlet Distributions (K=3)
• Various parameters α
(6, 2, 2)  (3, 7, 5)
(2, 3, 4)  (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0, where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
α_0 = 0.1   α_0 = 1   α_0 = 10
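To see how the total concentration α_0 changes the shape of the distribution, the sketch below draws Dirichlet samples by normalizing independent Gamma draws (a standard construction) and reports the average size of the largest component. The helper names, the sample count and the seed are assumptions for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:  # guard against all draws underflowing to zero
            return [d / total for d in draws]


def mean_largest_component(alpha0, k=3, n=2000, seed=0):
    """Average of max_i x_i over n samples with equal alpha_i = alpha0 / k."""
    rng = random.Random(seed)
    alpha = [alpha0 / k] * k
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n)) / n


if __name__ == "__main__":
    for alpha0 in (0.1, 1.0, 10.0):
        print(alpha0, round(mean_largest_component(alpha0), 3))
```

Small α_0 pushes samples toward the corners of the simplex (one component dominates), while large α_0 concentrates them near the center.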
The LDA Model
[Figure: three example documents, each with topic assignments z1, …, z4 generating words w1, …, w4, and a shared parameter β]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n
  - Choose a topic z_n ~ Multinomial(θ)
  - Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
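The generative process can be sketched in a few lines of standard-library Python. Here `beta` is assumed to be a list of K rows, each a word distribution p(w | z = k), and words are represented by integer ids; these representations and all function names are choices made for this example only.

```python
import random


def sample_dirichlet(alpha, rng):
    """theta ~ Dirichlet(alpha), via normalized Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:
            return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # fallback for floating-point rounding


def generate_document(alpha, beta, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)       # topic proportions
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)     # topic of this word
        w = sample_categorical(beta[z], rng)   # word drawn from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


if __name__ == "__main__":
    rng = random.Random(0)
    beta = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
    print(generate_document([0.5, 0.5], beta, 8, rng))
```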
Joint Probability
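• Given parameters α and β
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
where p(θ | α) is the Dirichlet density, and p(z_n | θ) and p(w_n | z_n, β) are multinomial probabilities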
Likelihood
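• Joint probability
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
• Marginal distribution of a document
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$
• Likelihood over all the documents
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$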
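The marginal distribution of a document can be read as an expectation over θ ~ Dirichlet(α) of the product, over words, of Σ_k θ_k β[k][w_n]. The sketch below estimates it by plain Monte Carlo sampling, which is an illustrative substitute and not the inference method used in the talk. It is only practical for short documents; the function names, sample count and seed are assumptions.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:
            return [d / total for d in draws]


def doc_likelihood_mc(words, alpha, beta, n_samples=5000, seed=0):
    """Monte Carlo estimate of p(w | alpha, beta) for one document."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        prod = 1.0
        for w in words:
            # sum over topics: p(z = k | theta) * p(w | z = k, beta)
            prod *= sum(t * row[w] for t, row in zip(theta, beta))
        total += prod
    return total / n_samples


def corpus_log_likelihood_mc(docs, alpha, beta):
    """Documents are independent, so their log-likelihoods add up."""
    return sum(math.log(doc_likelihood_mc(doc, alpha, beta)) for doc in docs)


if __name__ == "__main__":
    beta = [[0.9, 0.1], [0.2, 0.8]]
    print(doc_likelihood_mc([0, 0, 1], [1.0, 1.0], beta))
```

For a one-word document the exact value is Σ_k (α_k / α_0) β[k][w], which gives a quick check of the estimator.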
Inference
• The log-likelihood of the corpus can be computed by summing over the documents
• Apply Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log-likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
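The relation between the lower bound and the KL divergence can be checked numerically on a toy model with a single discrete hidden variable X: log p(D) = L(q) + KL(q || p(X | D)) for any distribution q. The numbers and function names below are made up for illustration; `joint[x]` is assumed to hold p(X = x, D).

```python
import math


def log_evidence(joint):
    """log p(D) = log sum_x p(X = x, D)."""
    return math.log(sum(joint))


def lower_bound(q, joint):
    """L(q) = sum_x q(x) * [log p(x, D) - log q(x)]."""
    return sum(qx * (math.log(jx) - math.log(qx))
               for qx, jx in zip(q, joint) if qx > 0.0)


def kl_divergence(q, p):
    """KL(q || p) = sum_x q(x) * log(q(x) / p(x))."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0.0)


if __name__ == "__main__":
    joint = [0.10, 0.05, 0.15]                 # toy values of p(X = x, D)
    posterior = [j / sum(joint) for j in joint]
    q = [0.6, 0.3, 0.1]                        # an arbitrary variational q
    print(log_evidence(joint), lower_bound(q, joint) + kl_divergence(q, posterior))
```

Since log p(D) does not depend on q, raising the bound necessarily lowers the KL term; setting q to the exact posterior makes the KL term zero, which is the standard EM case.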
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), so that KL = 0
• In VBEM, p(X | D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X) chosen to minimize KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound L(θ) with respect to q
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference (repeat until convergence)
• M-step: parameter estimation
  - β: closed-form update from the expected word-topic counts
  - α: can be implemented using the Newton-Raphson method
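A hedged sketch of the E-step for a single document, following the coordinate updates given in Blei, Ng and Jordan (2003): φ_{n,i} ∝ β[i][w_n] · exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_{n,i}. The digamma function Ψ is not in the Python standard library, so a simple approximation (recurrence plus asymptotic series) is included; that helper, the fixed iteration count and all names are assumptions of this example. It also assumes every word has non-zero probability under at least one topic.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: recurrence, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def variational_e_step(words, alpha, beta, n_iter=50):
    """Return (gamma, phi) for one document given alpha and beta."""
    k, n = len(alpha), len(words)
    gamma = [a + n / k for a in alpha]          # initialization
    phi = [[1.0 / k] * k for _ in words]
    for _ in range(n_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        for idx, w in enumerate(words):
            row = [beta[i][w] * weights[i] for i in range(k)]
            total = sum(row)
            phi[idx] = [r / total for r in row]  # normalize over topics
        gamma = [alpha[i] + sum(p[i] for p in phi) for i in range(k)]
    return gamma, phi


if __name__ == "__main__":
    beta = [[0.9, 0.1], [0.2, 0.8]]
    print(variational_e_step([0, 0, 0, 1], [0.5, 0.5], beta)[0])
```

In the M-step, β would then be re-estimated from the φ values accumulated over all documents, and α by Newton-Raphson; neither step is shown here.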
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure panels: (a) EARN vs. NOT EARN, (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: three example documents, each with topic assignments z1, …, z4 generating words w1, …, w4, and a shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate-notation graphical model with variables π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

        Term1  Term2  Term3  Term4  …  TermN      (Dim = 1 million)
Img1    1      2      0      0      …  2

        Topic Projection

        f1     f2     …      fM                   (Dim = 200)
Img1    0.2    0.1    …      0.03
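The long-query problem can be reproduced on a toy inverted index: as the number of query terms grows, the intersection of their posting lists shrinks toward empty while the union grows toward the whole corpus. The tiny corpus and the function names below are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def intersect_and_union(index, query_terms):
    """Return (documents matching all terms, documents matching any term)."""
    lists = [index.get(term, set()) for term in query_terms]
    if not lists:
        return set(), set()
    return set.intersection(*lists), set.union(*lists)


if __name__ == "__main__":
    docs = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
    index = build_inverted_index(docs)
    print(intersect_and_union(index, ["b"]))
    print(intersect_and_union(index, ["a", "b", "c", "d"]))
```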
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
An image = [figure: bar chart] + a few words (10 words)
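A minimal sketch of the idea "low-dimensional vector plus a few words". It assumes the columns of X are orthonormal, so that w can be obtained as w = X^T p, and it keeps only the largest residual entries as the "few words". Both choices are assumptions of this example and not necessarily how the ICCV 2009 system computes the decomposition.

```python
def project(p, basis):
    """w_j = <X_j, p> for each basis column X_j (assumed orthonormal)."""
    return [sum(x * v for x, v in zip(col, p)) for col in basis]


def reconstruct(w, basis, dim):
    """Compute Xw as a vector in the original vocabulary space."""
    return [sum(wj * col[i] for wj, col in zip(w, basis)) for i in range(dim)]


def decompose(p, basis, n_keep):
    """Return (w, full residual, sparse residual with the n_keep largest entries)."""
    w = project(p, basis)
    approx = reconstruct(w, basis, len(p))
    residual = [pi - ai for pi, ai in zip(p, approx)]
    top = sorted(range(len(p)), key=lambda i: abs(residual[i]), reverse=True)[:n_keep]
    sparse = {i: residual[i] for i in sorted(top)}
    return w, residual, sparse


if __name__ == "__main__":
    basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # columns of X
    print(decompose([3.0, 4.0, 5.0, 1.0], basis, n_keep=1))
```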
Orthogonal Decomposition
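$$p = Xw + \varepsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• X: base vectors (columns X_1, X_2, X_3, …, X_k)
• w: low-dimensional representation
• ε: residual
An image = [figure: bar chart] + a few words (10 words)
$$\langle p, q \rangle = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$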
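The similarity identity for an orthogonal decomposition can be verified numerically. The sketch assumes X has orthonormal columns and w = X^T p, so that each residual is orthogonal to every column of X and the cross terms vanish: <p, q> = <w_p, w_q> + <ε_p, ε_q>. The vectors and helper names are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split(p, basis):
    """Return (w, eps) with w = X^T p and eps = p - Xw."""
    w = [dot(col, p) for col in basis]
    approx = [sum(wj * col[i] for wj, col in zip(w, basis)) for i in range(len(p))]
    eps = [pi - ai for pi, ai in zip(p, approx)]
    return w, eps


if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]   # two orthonormal columns of X
    p, q = [1.0, 2.0, 3.0], [4.0, 0.0, 1.0]
    w_p, eps_p = split(p, basis)
    w_q, eps_q = split(q, basis)
    print(dot(p, q), dot(w_p, w_q) + dot(eps_p, eps_q))
```

This is what allows the similarity to be computed from the compact representation (w plus a sparse residual) instead of the full vocabulary-sized vectors.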
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
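The switch-variable mixture can be written out directly: the word probability is a weighted sum of a topic component, a document-specific component and a background component, with weights p(x | d). All probability tables below are made-up toy values, and the function and argument names are assumptions of this sketch.

```python
def word_prob(w, switch, topic_word, doc_topic, doc_special, background):
    """p(w | d) for the three-way mixture selected by the switch variable x.

    switch      -- [p(x=0|d), p(x=1|d), p(x=2|d)]
    topic_word  -- topic_word[k][w] = p(w | z = k)
    doc_topic   -- doc_topic[k] = p(z = k | d)
    doc_special -- document-specific word distribution p'(w | d)
    background  -- corpus-wide background distribution p''(w)
    """
    topic_part = sum(topic_word[k][w] * doc_topic[k] for k in range(len(doc_topic)))
    return (switch[0] * topic_part
            + switch[1] * doc_special[w]
            + switch[2] * background[w])


if __name__ == "__main__":
    topic_word = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
    doc_topic = [0.4, 0.6]
    switch = [0.5, 0.3, 0.2]
    doc_special = [0.0, 1.0, 0.0]
    background = [1 / 3, 1 / 3, 1 / 3]
    print([word_prob(w, switch, topic_word, doc_topic, doc_special, background)
           for w in range(3)])
```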
Search (Online)
[Figure: online search pipeline. A query = [figure: bar chart] + a few words -> LSH Index -> candidate documents (Doc 300, Doc 401, …) -> Re-ranking against the Doc Meta entries (Doc 1, Doc 2, …, Doc 300, …, Doc 401, …, Doc N, each stored with DS1, DS2, …) -> ranked results (Doc 401, …, Doc 300)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: topics
Structure: who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document similarity
Topic similarity
Structure similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Project documents to topic space
SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
Reply reconstruction -> Network construction -> Expert finding
Methods: HITS, PageRank, …
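As an illustration of ranking users on a reply network, the sketch below runs a basic HITS iteration. An edge (u, v) is taken to mean "user u replied to user v"; this edge direction, the toy graph and the function name are assumptions of the example, not details from the paper.

```python
import math


def hits(edges, n_nodes, n_iter=50):
    """Return (hub, authority) scores after n_iter HITS iterations."""
    hub = [1.0] * n_nodes
    auth = [1.0] * n_nodes
    for _ in range(n_iter):
        # Authority: sum of hub scores of the nodes pointing to it.
        new_auth = [0.0] * n_nodes
        for u, v in edges:
            new_auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in new_auth)) or 1.0
        auth = [a / norm for a in new_auth]
        # Hub: sum of authority scores of the nodes it points to.
        new_hub = [0.0] * n_nodes
        for u, v in edges:
            new_hub[u] += auth[v]
        norm = math.sqrt(sum(h * h for h in new_hub)) or 1.0
        hub = [h / norm for h in new_hub]
    return hub, auth


if __name__ == "__main__":
    edges = [(1, 0), (2, 0), (3, 0), (3, 1)]  # toy reply network
    hub, auth = hits(edges, 4)
    print(auth)
```

In this toy graph the user who receives the most replies ends up with the highest authority score.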
Baselines
LM
  Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  Achieves stable performance in the expert finding task using a language model
PageRank
  Benchmark nodal ranking method
HITS
  Finds hub nodes and authority nodes
EABIF
  Personalized Recommendation Driven by Information Flow, SIGIR '06
  Finds the most influential nodes
Evaluation
• Bayesian estimate

Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori)       0.674   0.362   0.243
EABIF(rec)       0.742   0.318   0.281
PageRank(ori)    0.675   0.377   0.263
PageRank(rec)    0.743   0.321   0.266
HITS(ori)        0.906   0.832   0.900
HITS(rec)        0.938   0.822   0.906
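For reference, the three metrics in the table can be computed per query as below; MRR and MAP are then the means of the reciprocal rank and the average precision over all queries. This uses the standard textbook definitions, and the function names and toy ranking are assumptions of this sketch.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    relevant = {"b", "d"}
    print(reciprocal_rank(ranked, relevant),
          average_precision(ranked, relevant),
          precision_at_k(ranked, relevant, k=2))
```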
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical model: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
Latent Subspace
pLSA vs LSA
bull LSA and PLSA perform dimensionality reductionndash In LSA by keeping only K singular valuesndash In PLSA by having K aspects
bull Comparison to SVDndash U Matrix related to P(z|d) (doc to aspect)ndash V Matrix related to P(w|z) (aspect to term)ndash E Matrix related to P(z) (aspect strength)
pLSA vs LSA
bull The main difference is the way the approximation is done
bull PLSA generates a model (aspect model) and maximizes its predictive power
bull Selecting the proper value of K is heuristic in LSA
bull Model selection in statistics can determine optimal K in PLSA
Applications
bull Text mining topic discovering
bull Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are You Really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction (topic projection)

  Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2

  Topic space (Dim = 200):
        f1   f2   …  fM
  Img1  0.2  0.1  …  0.03
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \epsilon$$
[Figure: an image = a histogram (the low-dimensional part) + a few words (10 words)]
Orthogonal Decomposition
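$$p = Xw + \epsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• X = (X1 X2 X3 … Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a histogram (the low-dimensional part) + a few words (10 words)]
Inner product of two documents p and q:
$$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$$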
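The inner-product identity can be verified on a toy example. The sketch below assumes the base vectors are orthonormal and takes w = X^T p, so the residual is orthogonal to every base vector; the slides do not say how w is actually computed, and the vectors and the helper names `decompose` and `keep_largest` are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(X_cols, p):
    """Return (w, eps) with p = X w + eps, assuming the columns of X are orthonormal."""
    w = [dot(col, p) for col in X_cols]
    proj = [sum(wi * col[t] for wi, col in zip(w, X_cols)) for t in range(len(p))]
    eps = [pt - rt for pt, rt in zip(p, proj)]
    return w, eps

def keep_largest(eps, m):
    """Zero all but the m largest-magnitude residual entries (the 'few words')."""
    keep = set(sorted(range(len(eps)), key=lambda t: abs(eps[t]), reverse=True)[:m])
    return [e if t in keep else 0.0 for t, e in enumerate(eps)]

# Hypothetical data: vocabulary of 6 terms, k = 2 orthonormal base vectors.
X_cols = [[0.5, 0.5, 0.5, 0.5, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]]
p = [3.0, 1.0, 1.0, 1.0, 2.0, 4.0]
q = [0.0, 2.0, 0.0, 1.0, 1.0, 3.0]

w_p, eps_p = decompose(X_cols, p)
w_q, eps_q = decompose(X_cols, q)

exact = dot(p, q)
split = dot(w_p, w_q) + dot(eps_p, eps_q)   # equals `exact` under the assumptions
approx = dot(w_p, w_q) + dot(keep_largest(eps_p, 2), keep_largest(eps_q, 2))
print(exact, split, approx)
```

Keeping only a few residual entries makes `approx` differ from `exact`; that is the trade-off between index size and fidelity that the "a few words" part of the figure refers to.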
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w|d) = p(x=0|d) \sum_{k=1}^{K} p(w|z=k)\, p(z=k|d) + p(x=1|d)\, p'(w|d) + p(x=2|d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
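The three-way mixture can be written as a small function. This is a minimal sketch with made-up numbers; the argument names (`switch`, `topic_word`, `doc_topic`, `doc_special`, `background`) are hypothetical and stand for p(x|d), p(w|z), p(z|d), the document-specific distribution, and the background distribution.

```python
def word_prob(word, switch, topic_word, doc_topic, doc_special, background):
    """p(w|d) as a mixture of topic, document-specific and background parts."""
    topic_part = sum(doc_topic[k] * topic_word[k].get(word, 0.0)
                     for k in range(len(doc_topic)))
    return (switch[0] * topic_part
            + switch[1] * doc_special.get(word, 0.0)
            + switch[2] * background.get(word, 0.0))

vocab = ["matrix", "topic", "the"]
topic_word = [{"matrix": 0.7, "topic": 0.2, "the": 0.1},   # p(w|z=0)
              {"matrix": 0.1, "topic": 0.8, "the": 0.1}]   # p(w|z=1)
doc_topic = [0.25, 0.75]                                   # p(z|d)
doc_special = {"matrix": 1.0}                              # document-specific words
background = {"matrix": 0.1, "topic": 0.1, "the": 0.8}     # corpus-wide words
switch = [0.6, 0.1, 0.3]                                   # p(x=0|d), p(x=1|d), p(x=2|d)

probs = {w: word_prob(w, switch, topic_word, doc_topic, doc_special, background)
         for w in vocab}
print(probs)
```

Since each component distribution sums to one and the switch probabilities sum to one, the mixture is again a proper distribution over the vocabulary.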
Search (Online)
[Figure: online search. A query (a histogram + a few words) is looked up in the LSH index (buckets listing DS1, DS2, …), which returns candidate documents (Doc 300, Doc 401, …); the candidates are re-ranked using Doc Meta (Doc 1, Doc 2, …, Doc N) to give the final results (Doc 401, …)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & Structure
• Semantic: topics
• Structure: who replies to whom
Optimize Them Together
• Model semantics
• Model structure
Reply Reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation; projects documents to topic space
• SWB: Special Words Topic Model with Background distribution; projects documents to topic and junk-topic space
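The three simplest baselines can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: "nearest post" is taken to mean the immediately preceding post in time, DS is taken to be cosine similarity over raw term counts, and the toy thread is made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count bags."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def predict_parents(posts, method):
    """posts: chronologically ordered token lists.
    Returns the predicted parent index for posts 1..n-1."""
    bags = [Counter(p) for p in posts]
    parents = []
    for i in range(1, len(posts)):
        if method == "NP":      # reply to the nearest (previous) post
            parents.append(i - 1)
        elif method == "RR":    # reply to the root post
            parents.append(0)
        elif method == "DS":    # reply to the most similar earlier post
            parents.append(max(range(i), key=lambda j: cosine(bags[i], bags[j])))
        else:
            raise ValueError(method)
    return parents

thread = [["lda", "topic", "model"],
          ["pca", "eigenvector"],
          ["topic", "model", "inference"],
          ["eigenvector", "svd"]]

for method in ("NP", "RR", "DS"):
    print(method, predict_parents(thread, method))
```

The LDA and SWB baselines follow the same pattern as DS, with the term-count bags replaced by each post's representation in topic space.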
Evaluation

Method  Slashdot                 Apple
        All Posts  Good Posts    All Posts  Good Posts
NP      0.021      0.012         0.289      0.239
RR      0.183      0.319         0.269      0.474
DS      0.463      0.643         0.409      0.628
LDA     0.465      0.644         0.410      0.648
SWB     0.463      0.644         0.410      0.641
SMSS    0.524      0.737         0.517      0.772
Expert Finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
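As a rough illustration of the last stage, the sketch below runs HITS by power iteration on a small reply network. It assumes an edge u → v means "user u replied to user v" and that a high authority score marks a likely expert; both the edge convention and the toy graph are assumptions, not details taken from the slides.

```python
import math

def hits(edges, n, iters=50):
    """Power-iteration HITS on a directed graph with nodes 0..n-1."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority: sum of hub scores of the nodes pointing at v.
        new_auth = [0.0] * n
        for u, v in edges:
            new_auth[v] += hub[u]
        # Hub: sum of the (new) authority scores of the nodes u points at.
        new_hub = [0.0] * n
        for u, v in edges:
            new_hub[u] += new_auth[v]
        # Normalize to unit length so the scores stay bounded.
        norm_a = math.sqrt(sum(a * a for a in new_auth)) or 1.0
        norm_h = math.sqrt(sum(h * h for h in new_hub)) or 1.0
        auth = [a / norm_a for a in new_auth]
        hub = [h / norm_h for h in new_hub]
    return hub, auth

# Hypothetical reply network: users 1, 2, 3 reply to user 0; user 3 also replies to user 1.
edges = [(1, 0), (2, 0), (3, 0), (3, 1)]
hub, auth = hits(edges, n=4)
expert = max(range(4), key=lambda i: auth[i])
print(expert, [round(a, 3) for a in auth])
```

The quality of the ranking depends on the reply edges, which is why the reply-reconstruction step comes first in the pipeline.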
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes
Evaluation
• Bayesian estimate

Method          MRR    MAP    P@10
LM              0.821  0.698  0.800
EABIF(ori)      0.674  0.362  0.243
EABIF(rec)      0.742  0.318  0.281
PageRank(ori)   0.675  0.377  0.263
PageRank(rec)   0.743  0.321  0.266
HITS(ori)       0.906  0.832  0.900
HITS(rec)       0.938  0.822  0.906
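For readers unfamiliar with the three column headers, the sketch below computes the per-query quantities behind them using their standard definitions; MRR and MAP are the means of the first two over all queries. The ranked list and the relevant set are made up, and nothing here reproduces the evaluation protocol of the paper.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical query: a ranked list of users and the set of true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}

rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
p2 = precision_at_k(ranked, relevant, 2)
print(rr, ap, p2)
```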
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical models: good practice for learning probability
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is a good tool for analyzing problems
• A graphical model is more adaptable to various applications than matrix decomposition
pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  - In LSA, by keeping only K singular values
  - In pLSA, by having K aspects
• Comparison to SVD
  - U matrix: related to P(d|z) (document to aspect)
  - V matrix: related to P(w|z) (aspect to term)
  - Σ matrix: related to P(z) (aspect strength)
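The correspondence is easiest to see by writing the two factorizations side by side. The block below uses the symmetric parameterization of the aspect model from Hofmann's paper (cited on the Reference slide); A denotes the document-term matrix, and the hatted matrices are notation introduced here for illustration.

```latex
\begin{align*}
\text{LSA (truncated SVD):}\quad & A \approx U_K \, \Sigma_K \, V_K^{T} \\
\text{pLSA (aspect model):}\quad & P(d, w) = \sum_{k=1}^{K} P(d \mid z_k)\, P(z_k)\, P(w \mid z_k)
  \;=\; \big[\hat{U} \, \hat{\Sigma} \, \hat{V}^{T}\big]_{d,w} \\
& \hat{U} = \big(P(d_i \mid z_k)\big)_{i,k}, \quad
  \hat{\Sigma} = \operatorname{diag}\big(P(z_k)\big)_{k}, \quad
  \hat{V} = \big(P(w_j \mid z_k)\big)_{j,k}
\end{align*}
```

Unlike the SVD factors, the pLSA factors are non-negative and normalized, so each entry can be read as a probability.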
pLSA vs LSA
• The main difference is the way the approximation is done
• pLSA generates a model (the aspect model) and maximizes its predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine the optimal K in pLSA
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting
[Figure: documents d1, d2, …, dm each generate topic-word pairs (z1, w1), (z2, w2), …, (zN, wN) from their own distributions p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomials
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
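The first requirement can be seen directly by drawing samples. The sketch below uses the standard construction of a Dirichlet sample as independent Gamma(α_i, 1) draws divided by their sum; the parameter vector (6, 2, 2) and the helper name `sample_dirichlet` are chosen here for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)          # fixed seed so the run is repeatable
alpha = [6.0, 2.0, 2.0]         # K = 3
samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]

# Every sample is a point on the 2-simplex: non-negative entries summing to one.
for s in samples[:3]:
    print([round(x, 3) for x in s], round(sum(s), 6))

# The sample mean approaches alpha_i / sum(alpha).
mean_first = sum(s[0] for s in samples) / len(samples)
print(round(mean_first, 2))
```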
Dirichlet Distribution
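• Definition
$$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{K} x_i = 1,\; x_i > 0$$
• The density is zero outside this open (K-1)-dimensional simplex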
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α_0 = 0.1, α_0 = 1, α_0 = 10
The LDA Model
[Figure: LDA graphical model for three documents; in each, topic nodes z1 … z4 generate word nodes w1 … w4]
• For each document
  - Choose θ ~ Dirichlet(α)
  - For each of the N words wn
    - Choose a topic zn ~ Multinomial(θ)
    - Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
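The generative process can be sketched as a short sampler. Everything concrete below is an assumption made for illustration: the vocabulary, the two topics in `beta`, the value of `alpha`, and the helper names. The Dirichlet draw uses normalized Gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) via normalized Gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1   # guard against floating-point round-off

def generate_document(alpha, beta, vocab, n_words, rng):
    theta = sample_dirichlet(alpha, rng)      # topic mixture of this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(theta, rng)          # choose a topic z_n ~ Multinomial(theta)
        w = sample_index(beta[z], rng)        # choose a word w_n from p(w_n | z_n, beta)
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

vocab = ["matrix", "eigenvector", "topic", "document"]
beta = [[0.5, 0.4, 0.05, 0.05],   # topic 0: linear-algebra words
        [0.05, 0.05, 0.5, 0.4]]   # topic 1: topic-model words
alpha = [1.0, 1.0]

rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, vocab, n_words=8, rng=rng)
print([round(t, 3) for t in theta], topics, words)
```

Only `alpha` and `beta` are shared across documents; θ is redrawn for each document, which is what keeps the parameter count independent of the corpus size.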
Joint Probability
• Given the parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
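The equations for these three quantities were not captured in the transcript. As given in Blei et al. (2003), the paper cited on the Reference slide, they read as follows, where θ is the topic mixture of a document, z its topic assignments, w its N words, and D a corpus of M documents.

```latex
\begin{align*}
\text{Joint probability:}\quad &
  p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \\
\text{Marginal of a document:}\quad &
  p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta \\
\text{Likelihood of the corpus:}\quad &
  p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
\end{align*}
```

The integral over θ couples all the words of a document, which is the source of the intractability discussed on the Inference slides.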
Inference
• The log-likelihood of the corpus can be computed by summing over the documents
• Jensen's inequality, as in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log-likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
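For readers unfamiliar with the three column headers, the per-query quantities behind MRR, MAP and P@10 can be sketched as follows. These are the standard textbook definitions. The ranking and the relevant set are made-up toy data, and the reported table values would be means of these quantities over all queries.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item (0 if none is found)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Toy example: users ranked as experts, two of them truly experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}

rr = reciprocal_rank(ranked, relevant)       # first hit at rank 2
ap = average_precision(ranked, relevant)     # (1/2 + 2/4) / 2
p2 = precision_at_k(ranked, relevant, k=2)   # one hit in the top 2
```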
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  - Matrix decomposition: a good way to practice matrix methods
  - Graphical models: a good way to practice probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
Applications
• Text mining: topic discovery
• Scene classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999.
• A. Bosch, A. Zisserman, and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
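A rough count makes the linear growth concrete. The sketch below counts table entries only and ignores the sum-to-one constraints. It assumes K topics, a vocabulary of V words and M training documents. The values K = 100, V = 15,818 and M = 8,000 are illustrative, borrowing the Reuters-21578 sizes quoted elsewhere in this talk. pLSA stores p(w|z) and one p(z|d) per document, while LDA stores p(w|z) and a single K-dimensional Dirichlet parameter.

```latex
\begin{align*}
\text{pLSA:}\quad & KV + KM = 100 \cdot 15818 + 100 \cdot 8000 = 2{,}381{,}800 \\
\text{LDA:}\quad  & K + KV = 100 + 100 \cdot 15818 = 1{,}581{,}900
\end{align*}
```

Doubling M adds another 800,000 entries to the pLSA count and leaves the LDA count unchanged.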
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Diagram: each document d1, d2, ..., dm has its own topic distribution p(z|d1), p(z|d2), ..., p(z|dm), with topics z1, z2, ..., zNi and words w1, w2, ..., wNi for document i]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - Its samples (mixture proportions) are K-tuples of non-negative numbers that sum to one; that is, each sample is a multinomial
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
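Dirichlet Distribution
• Definition
$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, s.t. $\sum_{i=1}^{K} x_i = 1$, $x_i > 0$
• The density is zero outside this open (K - 1)-dimensional simplex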
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0: α_0 = 0.1, α_0 = 1, α_0 = 10
$\alpha_0 = \sum_{i=1}^{K} \alpha_i$
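The effect of α can be explored numerically. The sketch below draws Dirichlet samples by normalizing independent Gamma draws, a standard construction that needs only the Python standard library. The helper name `sample_dirichlet` and the seed are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample: normalize independent Gamma(alpha_i, 1) variables."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)

# alpha = (6, 2, 2): samples concentrate around the mean (0.6, 0.2, 0.2).
samples = [sample_dirichlet([6, 2, 2], rng) for _ in range(2000)]
mean_first = sum(s[0] for s in samples) / len(samples)

# Equal alpha_i with a large sum (alpha_0 = 30) give samples near the
# centre of the simplex; smaller alpha_0 pushes samples towards the corners.
concentrated = sample_dirichlet([10, 10, 10], rng)
```

Every sample is a point on the simplex, so it can be used directly as a topic mixture.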
The LDA Model
[Diagram: three documents, each with topics z1, ..., z4 generating words w1, ..., w4, and a shared parameter β]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    - Choose a topic z_n ~ Multinomial(θ)
    - Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
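The generative process above translates almost line by line into code. This is a sketch under stated assumptions: `beta` is stored as one word-probability list per topic, the vocabulary and parameter values are toy data, and the helper names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete (multinomial, one trial) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, vocab, n_words, rng):
    theta = sample_dirichlet(alpha, rng)      # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = sample_index(theta, rng)          # z_n ~ Multinomial(theta)
        w = sample_index(beta[z], rng)        # w_n ~ p(w | z_n, beta)
        words.append(vocab[w])
    return theta, words

vocab = ["goal", "match", "vote", "party"]
alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0],   # topic 0: sports words
        [0.0, 0.0, 0.5, 0.5]]   # topic 1: politics words

rng = random.Random(1)
theta, words = generate_document(alpha, beta, vocab, n_words=8, rng=rng)
```

Unlike pLSA, the per-document mixture θ is itself drawn from a distribution, so a new document needs no new parameters.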
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
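The formulas for these three quantities did not survive in the transcript. As a reference, the versions below follow Blei, Ng and Jordan (2003), the paper this part of the talk is based on. The notation (θ for the topic mixture, z_n and w_n for the topic and word at position n, M documents, N_d words in document d) is taken from that paper and may differ in detail from the original slides.

```latex
\begin{align*}
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha)
     \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta \\
p(D \mid \alpha, \beta)
  &= \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
     \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
\end{align*}
```

The first line is the joint probability, the second marginalizes out θ and z for one document, and the third multiplies the document marginals over the corpus.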
Inference
• The log likelihood of the corpus is obtained by summing the log likelihoods of the individual documents
• Jensen's inequality in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
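The relationship stated in these two bullets can be written out explicitly. The identity below is the standard one from Blei, Ng and Jordan (2003). The symbols γ and φ for the variational parameters follow that paper and are assumed to match the slides, where the Greek letters were lost.

```latex
\begin{align*}
\log p(\mathbf{w} \mid \alpha, \beta)
  &= L(\gamma, \phi; \alpha, \beta)
   + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big) \\
L(\gamma, \phi; \alpha, \beta)
  &= \mathbb{E}_q\big[ \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \big]
   - \mathbb{E}_q\big[ \log q(\theta, \mathbf{z} \mid \gamma, \phi) \big]
\end{align*}
```

The left-hand side does not depend on γ or φ, so raising L by changing them must lower the KL term by the same amount.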
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, VBEM approximates p(X|D, θ) with a variational distribution q(X) by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
  • Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  - β: closed-form update
  - α can be updated using the Newton-Raphson method
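The E-step for a single document can be sketched as below. The update rules (φ_ni proportional to β_{i,w_n} exp(Ψ(γ_i)), and γ_i = α_i + Σ_n φ_ni) and the initialization are the ones given by Blei, Ng and Jordan (2003), not something recoverable from the transcript. The `digamma` helper is a standard series approximation written out here to avoid external dependencies, and the toy parameters are assumptions.

```python
import math

def digamma(x):
    """Approximate the digamma function Psi(x) for x > 0
    (recurrence up to x >= 6, then an asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def e_step(doc, alpha, beta, n_iter=50):
    """Variational inference for one document.

    doc   : list of word ids
    alpha : Dirichlet parameter, one entry per topic
    beta  : beta[i][w] = p(word w | topic i), assumed strictly positive here
    Returns gamma (Dirichlet posterior parameters) and phi (per-word topic
    responsibilities).
    """
    K, N = len(alpha), len(doc)
    gamma = [a + N / K for a in alpha]
    phi = [[1.0 / K] * K for _ in doc]
    for _ in range(n_iter):
        for n, w in enumerate(doc):
            weights = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(K)]
            total = sum(weights)
            phi[n] = [x / total for x in weights]
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi

alpha = [0.5, 0.5]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
doc = [0, 0, 0, 2]          # three occurrences of word 0, one of word 2
gamma, phi = e_step(doc, alpha, beta)
```

A fixed iteration count stands in for a convergence test. In the M-step, the φ values from all documents would be accumulated to re-estimate β.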
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure panels: (a) EARN vs NOT EARN, (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Diagram: the LDA model, with topics z1, ..., z4 generating words w1, ..., w4 in each document and a shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Diagram: graphical model with nodes π, z, x, θ, β and labels M, N, d]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic Topic Models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes Pachinko Allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• ...
Are You Really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
        Term1   Term2   Term3   Term4   ...   TermN
Img1    1       2       0       0       ...   2        (Dim = 1 million)
  Topic projection:
        f1      f2      ...     fM
Img1    0.2     0.1     ...     0.03                   (Dim = 200)
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
[Figure: an image = low-dimensional histogram + a few words (10 words)]
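Orthogonal Decomposition
$p = Xw + \varepsilon$, that is,
$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$
• X: base vectors
• w: low-dimensional representation
• ε: residual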
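A small numerical sketch shows why keeping the residual is useful. It assumes the base vectors are orthonormal and obtains w by projection, in which case the inner product of two original vectors splits exactly into a low-dimensional part and a residual part. The paper itself uses a probabilistic model for the decomposition, so the projection step, the helper names, and the toy vectors here are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps."""
    w = [dot(x, p) for x in basis]
    recon = [sum(w[j] * basis[j][i] for j in range(len(basis)))
             for i in range(len(p))]
    eps = [pi - ri for pi, ri in zip(p, recon)]
    return w, eps

def top_residual(eps, n):
    """Keep only the n largest-magnitude residual entries ("a few words")."""
    keep = sorted(range(len(eps)), key=lambda i: abs(eps[i]), reverse=True)[:n]
    return {i: eps[i] for i in keep}

# Two orthonormal base vectors in a 4-word vocabulary (toy values).
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, -0.5, 0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 2.0, 1.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)
split = dot(w_p, w_q) + dot(eps_p, eps_q)
few_words = top_residual(eps_p, 2)
```

With the full residual the split is exact. Truncating ε to its few largest entries, as `top_residual` does, turns it into an approximation that is much cheaper to store and index.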
[Figure: an image = low-dimensional histogram + a few words (10 words)]
Scene Classification
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA - LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
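To make the growth concrete, here is a small sketch that counts stored parameters. It assumes the usual parameterization with K topics over a vocabulary of V words and ignores sum-to-one constraints: pLSA keeps K·V topic-word probabilities plus K·M document-topic proportions, while a model with a shared Dirichlet prior replaces the per-document block with K hyperparameters. The function names and corpus sizes are illustrative only.

```python
def plsa_param_count(num_topics, vocab_size, num_docs):
    """p(w|z) has K*V entries and p(z|d) has K*M entries."""
    return num_topics * vocab_size + num_topics * num_docs


def lda_param_count(num_topics, vocab_size):
    """beta has K*V entries and alpha has K entries, independent of M."""
    return num_topics * vocab_size + num_topics


# Hypothetical sizes: 100 topics, 20,000-word vocabulary, growing corpus.
for num_docs in (1_000, 10_000, 100_000):
    print(num_docs,
          plsa_param_count(100, 20_000, num_docs),
          lda_param_count(100, 20_000))
```

Only the pLSA count changes as documents are added, which is the behavior the bullet above describes.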
Problems in pLSA
• There is no constraint on the distributions p(z|d_i)
• This easily leads to serious over-fitting
[Figure: documents d1, d2, ..., dm, each generating its own chain of topic-word pairs (z1, w1), (z2, w2), ..., (zNi, wNi) from its own unconstrained distribution p(z|d1), p(z|d2), ..., p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
  - The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, each sample is itself a multinomial
  - Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
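• Definition
$$p(x_1, \ldots, x_K \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\; x_i > 0$$
• The density is zero outside this open (K - 1)-dimensional simplex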
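A minimal sketch of this density, assuming NumPy is available. The helper name `dirichlet_log_pdf` is made up for illustration; it evaluates the log of the standard Dirichlet density and returns minus infinity off the open simplex. Sampling uses NumPy's own `Generator.dirichlet`.

```python
import math
import numpy as np


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the open simplex."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    if np.any(x <= 0) or not np.isclose(x.sum(), 1.0):
        return -math.inf  # the density is zero outside the open simplex
    # log of Gamma(sum alpha_i) / prod Gamma(alpha_i)
    log_norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    return log_norm + float(np.sum((alpha - 1.0) * np.log(x)))


rng = np.random.default_rng(0)
samples = rng.dirichlet([6.0, 2.0, 2.0], size=5)  # each row lies on the 2-simplex
print(samples.sum(axis=1))                         # every row sums to one
uniform_density = math.exp(dirichlet_log_pdf([1 / 3, 1 / 3, 1 / 3], [1.0, 1.0, 1.0]))
print(uniform_density)
```

With all α_i = 1 the density is constant on the simplex and equals Γ(3) = 2 for K = 3, which gives a quick sanity check on the normalizing constant.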
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal α_i, different α_0 = \sum_{i=1}^{K} α_i: α_0 = 0.1, α_0 = 1, α_0 = 10
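The effect of α_0 can be checked by sampling. The sketch below assumes equal α_i = α_0 / K, as in the second example slide, and reports the average size of the largest component over 2000 draws. The sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
mean_largest = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = np.full(K, alpha0 / K)            # equal alpha_i that sum to alpha0
    theta = rng.dirichlet(alpha, size=2000)   # 2000 draws on the simplex
    # Average of the largest coordinate: near 1 means mass sits at a corner.
    mean_largest[alpha0] = float(theta.max(axis=1).mean())
    print(alpha0, round(mean_largest[alpha0], 3))
```

A small α_0 pushes samples toward the corners of the simplex (one dominant topic), while a large α_0 concentrates them around the uniform mixture.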
The LDA Model
[Figure: graphical model in which, for each document, topics z1 ... z4 generate words w1 ... w4; the topic-word parameter β is shared by all documents]
• For each document
  • Choose θ ~ Dirichlet(α)
  • For each of the N words w_n
    - Choose a topic z_n ~ Multinomial(θ)
    - Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
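The generative process above can be sketched in a few lines. This is an illustration only: the function name `generate_document`, the toy sizes (3 topics, 8 vocabulary words, 20 words per document), and the randomly drawn topic-word matrix are assumptions, not values from the talk.

```python
import numpy as np


def generate_document(alpha, beta, num_words, rng):
    """Sample theta ~ Dir(alpha), then z_n ~ Mult(theta) and w_n ~ Mult(beta[z_n])."""
    theta = rng.dirichlet(alpha)                              # topic proportions
    topics = rng.choice(len(alpha), size=num_words, p=theta)  # one topic per word
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, topics, words


rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])            # K = 3 topics (toy value)
beta = rng.dirichlet(np.ones(8), size=3)     # K x V topic-word matrix, V = 8
theta, topics, words = generate_document(alpha, beta, num_words=20, rng=rng)
print(theta)
print(topics)
print(words)
```

Each call draws a fresh θ, so different documents get different topic mixtures while sharing the same α and β.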
Joint Probability
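• Given parameters α and β, the joint distribution of the topic mixture θ, the topics z, and the words w is
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
where p(θ | α) is the Dirichlet density, p(z_n | θ) = θ_{z_n}, and p(w_n | z_n, β) = β_{z_n, w_n}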
Likelihood
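• Joint probability
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
• Marginal distribution of a document
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$
• Likelihood over all the documents
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$$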
Inference
• The log likelihood of the corpus is computed by summing the log likelihood of each document
• Jensen's inequality, as used in EM, gives a lower bound on each document's log likelihood
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with its own variational parameters, and minimize the KL divergence between the variational distribution and the true posterior distribution
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence between the variational distribution and the posterior
• Maximizing the lower bound L(γ, φ; α, β) with respect to the variational parameters γ and φ is equivalent to minimizing the KL divergence
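For reference, the identity behind both bullets can be written out. The notation (γ and φ for the variational parameters, q for the variational distribution) follows Blei et al. (2003), which this part of the talk cites; treat it as an assumed notation rather than the slide's own.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = \underbrace{\mathbb{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
      - \mathbb{E}_q\!\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]}_{L(\gamma, \phi;\, \alpha, \beta)}
  + \mathrm{KL}\!\left(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\middle\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\right)
```

The left-hand side does not depend on γ or φ, so any increase in L must be matched by an equal decrease in the KL term.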
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), so that KL = 0
• In VBEM, p(X | D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X | D, θ))
• This is equivalent to maximizing the lower bound L(q, θ) with respect to q
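A toy numeric check of this relationship, under the assumption of a single hidden variable X with three states and a fixed observation D. The joint probabilities and the imperfect `q_approx` are made-up numbers; the point is only that the bound plus the KL term always equals log p(D), and that the KL term vanishes when q is the exact posterior.

```python
import numpy as np

joint = np.array([0.10, 0.25, 0.05])   # toy p(X = x, D) for a fixed observation D
log_evidence = float(np.log(joint.sum()))   # log p(D)
posterior = joint / joint.sum()             # p(X | D), tractable in this toy case


def lower_bound(q):
    """L(q) = sum_x q(x) * [log p(x, D) - log q(x)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))


def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


q_approx = np.array([0.3, 0.4, 0.3])   # an imperfect variational distribution
for name, q in (("exact E-step", posterior), ("approximate E-step", q_approx)):
    print(name, lower_bound(q), kl(q, posterior), lower_bound(q) + kl(q, posterior))
print("log p(D)", log_evidence)
```

In the exact case the bound touches log p(D); in the approximate case it sits below by exactly the KL divergence.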
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
  • Lower bound log p(w | α, β) by a function L(γ, φ; α, β)
  • Repeat until convergence
    - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    - M: Maximize the bound with respect to the model parameters α and β
Parameter Estimation
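• E-Step: variational inference; repeat until convergence
$$\phi_{ni} \propto \beta_{i w_n} \exp\{\Psi(\gamma_i)\}, \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$
• M-Step: parameter estimation
  - β: $\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}$, where $w_{dn}^{j} = 1$ if word n of document d is vocabulary item j and 0 otherwise
  - α: has no closed-form update; it can be fitted using the Newton-Raphson method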
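A minimal sketch of the E-step for one document, assuming the coordinate-ascent updates from Blei et al. (2003): φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni. The digamma function Ψ is implemented by hand with the standard recurrence and asymptotic series so that the block needs only NumPy. The fixed iteration count, the function names, and the toy inputs are assumptions for illustration; a real implementation would test for convergence.

```python
import math
import numpy as np


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x, then an asymptotic series for x >= 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step(word_ids, alpha, beta, num_iters=50):
    """Coordinate ascent on (gamma, phi) for a single document."""
    num_topics, num_words = len(alpha), len(word_ids)
    gamma = alpha + num_words / num_topics             # common initialization
    phi = np.full((num_words, num_topics), 1.0 / num_topics)
    for _ in range(num_iters):
        weights = np.exp([digamma(g) for g in gamma])  # exp(Psi(gamma_i))
        phi = beta[:, word_ids].T * weights            # N x K, unnormalized
        phi /= phi.sum(axis=1, keepdims=True)          # each row sums to one
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi


rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(6), size=2)   # toy topic-word matrix: K = 2, V = 6
alpha = np.array([0.5, 0.5])
gamma, phi = e_step(np.array([0, 3, 3, 5]), alpha, beta)
print(gamma)
print(phi)
```

Because every row of φ sums to one, the entries of γ always sum to Σ α_i plus the document length, which is a handy invariant when debugging.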
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure panels: (a) EARN vs. NOT EARN; (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: the LDA graphical model, with topics z1 ... z4 generating words w1 ... w4 in each document and a shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with nodes π, z, x, θ, β and plates labeled M and Nd]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction: topic projection from the term space (Dim = 1 million) to the topic space (Dim = 200)

        Term1  Term2  Term3  Term4  ...  TermN
  Img1  1      2      0      0      ...  2

        f1     f2     ...  fM
  Img1  0.2    0.1    ...  0.03
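The problem with long queries can be reproduced on a synthetic corpus. The sketch below is illustrative only: the corpus (500 documents of 30 random terms each over a 2000-term vocabulary) and all variable names are made up, and a plain dictionary of sets stands in for a real inverted index.

```python
import random
from collections import defaultdict

random.seed(0)
vocab = [f"t{i}" for i in range(2000)]
docs = {doc_id: set(random.sample(vocab, 30)) for doc_id in range(500)}

# Build the inverted index: term -> set of documents containing it.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

query = random.sample(vocab, 1000)                      # a long query of 1000 keywords
lists = [inverted.get(term, set()) for term in query]   # one inverted list per keyword
intersection = set.intersection(*lists)
union = set.union(*lists)
print(len(intersection), len(union), len(docs))
```

No document can contain all 1000 query terms, so the intersection is empty, while almost every document shares at least one term with the query, so the union is close to the whole corpus. Neither set is useful for ranking, which motivates projecting to a low-dimensional space first.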
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
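$$p = Xw + \varepsilon$$
$$\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{bmatrix}$$
• X = (X1 X2 X3 ... Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional topic vector (bar chart) + a few words (10 words)]
$$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$$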
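The inner-product identity can be verified numerically. This sketch assumes the base vectors are orthonormal and uses a random orthonormal basis from a QR factorization as a stand-in for the learned projection; the dimensions (50-word vocabulary, k = 5) and the helper name `decompose` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 50, 5
# Orthonormal columns, so X.T @ X is the identity (stand-in for the learned basis).
X, _ = np.linalg.qr(rng.normal(size=(vocab_size, k)))


def decompose(p):
    """Split p into a low-dimensional part X @ w and an orthogonal residual."""
    w = X.T @ p          # low-dimensional representation
    eps = p - X @ w      # residual, orthogonal to every column of X
    return w, eps


p, q = rng.random(vocab_size), rng.random(vocab_size)
(w_p, e_p), (w_q, e_q) = decompose(p), decompose(q)
print(p @ q, w_p @ w_q + e_p @ e_q)
```

Because the residual is orthogonal to the basis, the cross terms vanish and the original similarity splits cleanly into a low-dimensional part and a residual part.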
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x = 0 \mid d) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) + p(x = 1 \mid d)\, p'(w \mid d) + p(x = 2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.
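As a small illustration of the three-way mixture, the sketch below builds p(w|d) from randomly drawn component distributions. All sizes and probabilities (4 topics, 10 words, switch probabilities 0.6 / 0.3 / 0.1) are made-up values, and the variable names are not taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 10
p_w_given_z = rng.dirichlet(np.ones(V), size=K)   # topic-specific distributions, K x V
p_z_given_d = rng.dirichlet(np.ones(K))           # topic proportions of document d
p_w_doc = rng.dirichlet(np.ones(V))               # document-specific word distribution
p_w_background = rng.dirichlet(np.ones(V))        # corpus-wide background distribution
p_x_given_d = np.array([0.6, 0.3, 0.1])           # switch probabilities for x = 0, 1, 2

# Mix the three sources according to the switch variable x.
p_w_given_d = (p_x_given_d[0] * (p_z_given_d @ p_w_given_z)
               + p_x_given_d[1] * p_w_doc
               + p_x_given_d[2] * p_w_background)
print(p_w_given_d.sum())
```

Since each component is a proper distribution over words and the switch probabilities sum to one, the mixture is again a proper distribution.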
Search (Online)
[Figure: online search pipeline. A query is represented as a topic vector (bar chart) plus a few words; an LSH index returns candidate documents (Doc 300, Doc 401, ...), which are then re-ranked using the per-document metadata (DS1, DS2, ...) stored for Doc 1 ... Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang
SIGIR 2009
Semantic & Structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to the topic and junk-topic space
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Pipeline: reply reconstruction, then network construction, then expert finding
• Methods
  - HITS
  - PageRank
  - ...
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential nodes
Evaluation
• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
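For readers unfamiliar with the three metrics, here is a sketch using their standard textbook definitions for a single query; MRR and MAP are the means of the first two over all queries. The exact evaluation protocol behind the table is not given in the slides, so the function names and the toy ranking below are assumptions.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k


ranked = ["u3", "u1", "u7", "u2"]   # hypothetical ranked list of users
relevant = {"u1", "u2"}             # hypothetical ground-truth experts
print(reciprocal_rank(ranked, relevant),
      average_precision(ranked, relevant),
      precision_at_k(ranked, relevant, k=2))
```

In this toy ranking the first expert appears at rank 2, so all three values come out to 0.5.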
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical model: good practice for learning probability
• A graphical model is a good tool for analyzing problems
  - It adapts to more applications than matrix decomposition
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
-
Classification Result
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability
bull Graphical model is a good tool to analyze problems
bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages
bull It is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
-
Reference
bull Thomas Hofmann Probabilistic latent semantic analysis In Proc of Uncertainty in Artificial Intelligence UAI99 Stockholm 1999
bull Bosch A Zisserman A and Munoz X Scene Classification via pLSAProceedings of the European Conference on Computer Vision (2006)
bull Sivic J Russell B C Efros A A Zisserman A and Freeman W T Discovering Object Categories in Image Collections MIT AI Lab Memo AIM-2005-005 February 2005
LDA ndash LATENT DIRICHILET ALLOCATION
Problems in pLSA
bull pLSA provides no probabilistic model at the document level Each doc has its own topic mixture proportion
bull The number of parameters in the model grows linearly with M (the number of documents in the training set)
Problems in pLSA
bull There is no constraint for distributions p(z|di)
bull Easy to lead to serious problems with over-fitting
d1
z1
w1
z2
w2
zN1
wN1
hellip
d2
z1
w1
z2
w2
zN2
wN2
hellip
dm
z1
w1
z2
w2
zNm
wNm
hellip
p(z|d1) p(z|d2) p(z|dm)
Dirichlet Distribution
bull In the LDA model the topic mixture proportions for each document are assumed to follow some distribution
bull Requirement for such a distributionndash The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one That is the samples are multinormials
ndash Easy to optimize
bull Dirichlet Distribution is one of such distributions
bull The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2         (Dim = 1 million)

Topic Projection

        f1     f2     …      fM
Img1    0.2    0.1    …      0.03                (Dim = 200)
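The long query problem can be reproduced on a toy inverted index. The sketch below is only an illustration with made-up documents and hypothetical function names: an AND query over many terms tends to return nothing, while an OR query tends to return everything.

```python
def build_inverted_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: set of doc_ids}."""
    index = {}
    for doc_id, terms in docs.items():
        for term in set(terms):
            index.setdefault(term, set()).add(doc_id)
    return index

def and_query(index, terms):
    """Intersection of the inverted lists of all query terms."""
    lists = [index.get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()

def or_query(index, terms):
    """Union of the inverted lists of all query terms."""
    return set().union(*(index.get(t, set()) for t in terms))

docs = {1: ["a", "b"], 2: ["b", "c"], 3: ["c", "d"]}
index = build_inverted_index(docs)
long_query = ["a", "b", "c", "d"]
print(and_query(index, long_query))  # no document contains every term
print(or_query(index, long_query))   # every document contains some term
```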
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = bar chart (low dimensional representation) + a few words (10 words)]
Orthogonal Decomposition
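$$p = Xw + \varepsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• X: base vectors X1, X2, X3, …, Xk
• w: low dimensional representation
• ε: residual
[Figure: an image = bar chart (low dimensional representation) + a few words (10 words)]
$$\langle p, q \rangle = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$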
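A small numeric check may make the decomposition concrete. The sketch below assumes the base vectors are orthonormal and that the residual is what remains after projecting onto them; under those assumptions the inner product of two vectors splits exactly into a low dimensional part and a residual part. The function names and the `keep_largest` truncation (standing in for "a few words") are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """basis: list of orthonormal vectors (the columns of X). Returns (w, residual)."""
    w = [dot(x, p) for x in basis]                      # coordinates in the subspace
    recon = [sum(w[j] * basis[j][i] for j in range(len(basis)))
             for i in range(len(p))]                    # X w
    residual = [pi - ri for pi, ri in zip(p, recon)]    # p - X w
    return w, residual

def keep_largest(residual, n):
    """Keep only the n largest-magnitude residual entries as {index: value}."""
    order = sorted(range(len(residual)), key=lambda i: abs(residual[i]), reverse=True)
    return {i: residual[i] for i in order[:n]}

basis = [[0.6, 0.8, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
p, q = [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]
w_p, e_p = decompose(p, basis)
w_q, e_q = decompose(q, basis)
print(dot(p, q), dot(w_p, w_q) + dot(e_p, e_q))  # the two values agree
```

Truncating the residual with `keep_largest` makes the identity approximate rather than exact, which is the trade-off between index size and accuracy.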
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
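The three-way mixture can be evaluated directly once the component distributions are known. The sketch below is a minimal illustration with made-up numbers; the argument names are assumptions, and the switch probabilities are passed as a tuple (p(x=0|d), p(x=1|d), p(x=2|d)).

```python
def word_prob(word, switch, topic_mix, topic_word, doc_word, background):
    """Probability of `word` in a document under the switch-variable mixture.

    switch:     (p(x=0|d), p(x=1|d), p(x=2|d))
    topic_mix:  list of p(z=k|d)
    topic_word: list of dicts, topic_word[k][w] = p(w|z=k)
    doc_word:   dict, document specific distribution over words
    background: dict, background distribution over words
    """
    topic_part = sum(topic_mix[k] * topic_word[k].get(word, 0.0)
                     for k in range(len(topic_mix)))
    return (switch[0] * topic_part
            + switch[1] * doc_word.get(word, 0.0)
            + switch[2] * background.get(word, 0.0))

switch = (0.6, 0.3, 0.1)
topic_mix = [0.5, 0.5]
topic_word = [{"a": 1.0}, {"b": 1.0}]
doc_word = {"c": 1.0}
background = {"a": 0.5, "b": 0.5}
probs = {w: word_prob(w, switch, topic_mix, topic_word, doc_word, background)
         for w in ["a", "b", "c"]}
print(probs)
```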
Search (Online)
[Figure: a query (low dimensional representation + a few words) is looked up in the LSH index to retrieve candidate documents (Doc 300, Doc 401, …), which are then re-ranked using the words stored per document (DS1, DS2, …) in the doc meta for Doc 1 … Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
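The two-stage search (LSH lookup, then re-ranking) can be sketched as follows. This is an assumption-heavy toy: it uses a single table of random-hyperplane hashes, which is one common LSH family and not necessarily the one used in the paper, and it scores candidates by the sum of the topic-space inner product and the overlap of the stored words. All names are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, rng):
    """Random Gaussian hyperplanes for sign-based hashing."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if sum(a * b for a, b in zip(vec, h)) >= 0 else 0 for h in planes)

def build_index(topic_vecs, planes):
    index = {}
    for doc_id, vec in topic_vecs.items():
        index.setdefault(lsh_key(vec, planes), []).append(doc_id)
    return index

def search(query_vec, query_words, index, planes, topic_vecs, doc_words, top_n=10):
    """Fetch the query's bucket, then re-rank by topic similarity plus word overlap."""
    candidates = index.get(lsh_key(query_vec, planes), [])

    def score(doc_id):
        topic_sim = sum(a * b for a, b in zip(query_vec, topic_vecs[doc_id]))
        word_sim = sum(wt * doc_words[doc_id].get(t, 0.0) for t, wt in query_words.items())
        return topic_sim + word_sim

    return sorted(candidates, key=score, reverse=True)[:top_n]
```

A production index would typically use several hash tables to raise recall; a single table keeps the sketch short.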
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
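The three simplest baselines are easy to state in code. The sketch below assumes a thread is a chronologically ordered list of posts, each a bag of words stored as a dict, and that a post can only reply to an earlier post; the function names are hypothetical and cosine similarity is an assumed choice for the DS baseline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    den = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return num / den if den else 0.0

def reply_to_nearest(i):
    """NP: post i replies to the post immediately before it (i >= 1)."""
    return i - 1

def reply_to_root(i):
    """RR: every post replies to the first post of the thread."""
    return 0

def reply_by_similarity(posts, i):
    """DS: post i replies to the most similar earlier post (i >= 1)."""
    return max(range(i), key=lambda j: cosine(posts[i], posts[j]))

thread = [
    {"linux": 1.0, "kernel": 1.0},
    {"apple": 1.0, "mac": 1.0},
    {"kernel": 1.0, "patch": 1.0},
]
print(reply_to_nearest(2), reply_to_root(2), reply_by_similarity(thread, 2))
```

The LDA and SWB baselines follow the same pattern as DS, with the posts first projected to topic space.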
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods
– HITS
– PageRank
– …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in expert finding task using a language model
• PageRank: Benchmark nodal ranking method
• HITS: Find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Find most influential node
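As an illustration of how hub and authority nodes can be found on a reply network, the sketch below runs HITS by power iteration. It assumes an edge (u, v) means "user u replied to user v", so a user who receives many replies gains authority; this edge convention and the names are assumptions for the example.

```python
import math

def hits(edges, n_iter=50):
    """edges: list of (u, v) pairs. Returns (hub, authority) score dicts."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalise(scores):
        norm = math.sqrt(sum(s * s for s in scores.values()))
        return {n: (s / norm if norm else 0.0) for n, s in scores.items()}

    for _ in range(n_iter):
        # Authority of v: sum of hub scores of nodes pointing to v.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        auth = normalise(new_auth)
        # Hub score of u: sum of authority scores of nodes u points to.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = normalise(new_hub)
    return hub, auth

replies = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b")]
hub, auth = hits(replies)
print(max(auth, key=auth.get))  # the user who receives the most replies
```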
Evaluation
• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
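For readers unfamiliar with the three metrics, the sketch below gives their usual per-query definitions; MRR and MAP are the means of the reciprocal rank and the average precision over all queries. The function names are hypothetical, and the slides do not say which exact variants were used.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(reciprocal_rank(ranked, relevant),
      average_precision(ranked, relevant),
      precision_at_k(ranked, relevant, k=2))
```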
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – a good practice to learn matrix
– Graphical model – a good practice to learn probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
• Graphical model is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition – Eigenvalue & Eigenvector
- Definition – Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition – Eigenface
- Slide 14
- Slide 15
- PageRank – Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF – Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
LDA – LATENT DIRICHLET ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each document has its own topic mixture proportion
• The number of parameters in the model grows linearly with M (the number of documents in the training set)
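The growth in parameters can be made concrete with the counts given in Blei et al. (2003): pLSA has K·V + K·M parameters for K topics, a vocabulary of V words, and M training documents, whereas LDA has K + K·V. The helper names below are hypothetical, and the example sizes are only illustrative.

```python
def plsa_param_count(K, V, M):
    """K topic-word distributions plus one K-dim mixture per training document."""
    return K * V + K * M

def lda_param_count(K, V):
    """K Dirichlet parameters plus K topic-word distributions; independent of M."""
    return K + K * V

K, V = 100, 15818
for M in (1000, 8000, 100000):
    print(M, plsa_param_count(K, V, M), lda_param_count(K, V))
```

Only the pLSA count changes as M grows, which is the source of the over-fitting risk.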
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: documents d1, d2, …, dm, each generating topics z1, z2, …, zN and words w1, w2, …, wN under its own distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K-1)-simplex
Dirichlet Distribution
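• Definition
$$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \quad \text{s.t. } \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex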
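The Dirichlet density is straightforward to evaluate in log space. The sketch below is a minimal version using `math.lgamma`; the function name and the tolerance used to test membership in the simplex are assumptions.

```python
import math

def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at x; -inf outside the open simplex."""
    on_simplex = (len(x) == len(alpha)
                  and all(xi > 0 for xi in x)
                  and abs(sum(x) - 1.0) < 1e-9)
    if not on_simplex:
        return float("-inf")
    # log of Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

# With alpha = (1, 1, 1) the density is uniform on the simplex and equals Gamma(3) = 2.
print(math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```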
Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different $\alpha_0 = \sum_{i=1}^{K} \alpha_i$: α0 = 0.1, α0 = 1, α0 = 10
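The effect of α0 can be seen by sampling. The sketch below draws from a symmetric Dirichlet by normalising Gamma variates (a standard construction) and reports the average size of the largest component: small α0 pushes samples toward the corners of the simplex, large α0 toward the centre. The helper names, sample size, and seed are arbitrary choices for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(alpha0, K, n_samples, rng):
    """Average of max(x) over samples from a symmetric Dirichlet with sum alpha0."""
    alpha = [alpha0 / K] * K
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_samples)) / n_samples

rng = random.Random(0)
for alpha0 in (0.1, 1.0, 10.0):
    print(alpha0, round(mean_largest_component(alpha0, 3, 500, rng), 3))
```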
The LDA Model
[Figure: graphical model with topic nodes z1–z4 and word nodes w1–w4, repeated for three documents]
• For each document
– Choose θ ~ Dirichlet(α)
– For each of the N words wn
  · Choose a topic zn ~ Multinomial(θ)
  · Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
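The generative process above can be sketched as a small sampler. The names are hypothetical; `beta[k]` is assumed to hold the word distribution of topic k, and θ is drawn by normalising Gamma variates.

```python
import random

def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, topics, words) for one document under the LDA model.

    alpha: list of K Dirichlet parameters
    beta:  beta[k][v] = p(word v | topic k)
    """
    # theta ~ Dirichlet(alpha)
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    theta = [d / total for d in draws]

    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]       # z_n ~ Multinomial(theta)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]   # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
beta = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
theta, topics, words = generate_document([0.5, 0.5], beta, 20, rng)
print(theta, words)
```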
Joint Probability
• Given parameters α and β
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
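For reference, the three quantities named above take the following standard form in Blei et al. (2003), with θ the topic mixture of a document, z its N topic assignments, w its N words, and D a corpus of M documents. The notation is that of the paper and is assumed to match the slides.

```latex
% Joint probability of one document's latent and observed variables
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out z
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all the documents
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```

The integral over θ couples the topics of all words in a document, which is why the marginal cannot be computed in closed form.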
Inference
• The log likelihood can be computed by summing over the documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• Only different in the E-step
• In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) by a variational distribution q(X), obtained by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
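The relationship between the lower bound, the KL divergence, and the log likelihood can be checked numerically on a toy model with one discrete hidden variable. The sketch below is an illustration only: `joint[z]` stands for p(x, z) at a fixed observation x, and the names are hypothetical.

```python
import math

def bound_and_kl(joint, q):
    """Return (log p(x), lower bound, KL(q || p(z|x))) for a discrete hidden z.

    joint[z] = p(x, z) for the observed x; q[z] = variational distribution.
    """
    evidence = sum(joint)                           # p(x) = sum_z p(x, z)
    posterior = [j / evidence for j in joint]       # p(z | x)
    bound = sum(qz * (math.log(jz) - math.log(qz))
                for qz, jz in zip(q, joint) if qz > 0)
    kl = sum(qz * (math.log(qz) - math.log(pz))
             for qz, pz in zip(q, posterior) if qz > 0)
    return math.log(evidence), bound, kl

joint = [0.1, 0.3]
print(bound_and_kl(joint, [0.5, 0.5]))    # bound + KL equals log p(x)
print(bound_and_kl(joint, [0.25, 0.75]))  # q = posterior: KL = 0, bound is tight
```

Setting q to the exact posterior corresponds to the standard EM E-step; any other q leaves a gap equal to the KL divergence, which is what VBEM minimizes.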
Problems in pLSA
• There is no constraint on the distributions p(z|di)
• This easily leads to serious over-fitting problems
[Figure: each document d1, d2, …, dm has its own chain of topics z1, z2, …, zN and words w1, w2, …, wN, and its own topic distribution p(z|d1), p(z|d2), …, p(z|dm)]
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution
• Requirements for such a distribution
– The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinomials
– Easy to optimize
• The Dirichlet distribution is one such distribution
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex
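Dirichlet Distribution
• Definition
$$p(x_1, \dots, x_K \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \quad \text{s.t.} \quad \sum_{i=1}^{K} x_i = 1,\ x_i > 0$$
• The density is zero outside this open (K − 1)-dimensional simplex
Example Dirichlet Distributions (K=3)
• Various parameters α
[Figure: four density plots over the simplex, for α = (6, 2, 2), (3, 7, 5), (2, 3, 4) and (6, 2, 6)]
Example Dirichlet Distributions (K=3)
• Equal αi, different α0, where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$
[Figure: three density plots, for α0 = 0.1, α0 = 1 and α0 = 10]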
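The effect of the concentration parameters can be checked numerically. The following is a minimal sketch, not part of the original slides: it draws Dirichlet samples by normalizing Gamma variates, using only the Python standard library. The helper names and the symmetric parameter values (each αi set to 0.1 or 10) are illustrative assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:  # guard against underflow when every alpha_i is tiny
            return [d / total for d in draws]

def mean_largest_component(alpha, n_samples, rng):
    """Average size of the largest coordinate; values near 1 mean samples sit near a corner."""
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_samples)) / n_samples

rng = random.Random(0)
x = sample_dirichlet([6.0, 2.0, 2.0], rng)   # one point on the 2-simplex (K = 3)

# Hypothetical symmetric settings: small alpha_i pushes mass to the corners,
# large alpha_i concentrates it around the centre of the simplex.
sparse = mean_largest_component([0.1, 0.1, 0.1], 2000, rng)
dense = mean_largest_component([10.0, 10.0, 10.0], 2000, rng)
```

Comparing `sparse` with `dense` shows the qualitative behaviour the example plots illustrate: small parameters favour nearly one-hot mixtures, large parameters favour nearly uniform ones.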
The LDA Model
[Figure: LDA graphical model; each document has its own chain of topics z1 … z4 generating words w1 … w4, and all documents share the parameter β]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn), a multinomial probability conditioned on the topic zn
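The generative steps above can be sketched in a few lines. This is an illustrative sketch rather than the deck's own code: the function name, the toy values of α, and the 2 x 4 topic-word matrix `beta` are assumptions chosen for the example.

```python
import random

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha: list of K Dirichlet parameters.
    beta:  K x V matrix, beta[k][v] = p(word v | topic k).
    Returns (theta, topics, words).
    """
    # theta ~ Dirichlet(alpha), drawn by normalizing Gamma variates
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]

    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]        # z_n ~ Multinomial(theta)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]    # w_n ~ p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy setup: K = 2 topics over a V = 4 word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document(alpha, beta, n_words=8, rng=random.Random(1))
```

In this toy setup topic 0 can only emit words 0 and 1 and topic 1 only words 2 and 3, so the sampled words reveal which topic generated them.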
Joint Probability
• Given parameters α and β
[Equation not preserved in the transcript]
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
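The formulas for these three quantities did not survive in the transcript. As a hedged reconstruction, assuming the deck follows the notation of Blei, Ng and Jordan (2003) listed in its references, they are usually written as follows, where M is the number of documents, N the number of words in a document, and w_d the words of document d:

```latex
% Joint probability of the topic mixture, topics and words of one document
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all the documents of the corpus D
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```

The integral over θ in the second formula is what makes exact inference intractable.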
Inference
• The log-likelihood of the corpus is the sum of the log-likelihoods of the individual documents
• Jensen's inequality is applied as in EM
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log-likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
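The relation between the bound and the KL divergence can be written out explicitly. The identity below is a sketch in the notation of Blei et al. (2003); the symbols γ and φ for the variational parameters are an assumption, since the transcript dropped the Greek letters.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)

L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q\big[ \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \big]
  - \mathbb{E}_q\big[ \log q(\theta, \mathbf{z} \mid \gamma, \phi) \big]
```

Because the left-hand side does not depend on γ or φ and the KL term is non-negative, raising L necessarily lowers the KL divergence.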
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), so that KL = 0
• In VBEM, p(X|D, θ) is intractable to compute. Instead, VBEM approximates p(X|D, θ) by a variational distribution q(X), chosen by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ) with respect to q(X)
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence [update equations not preserved in the transcript]
• M-step: parameter estimation
– β [update equation not preserved in the transcript]
– The update for α can be implemented using the Newton-Raphson method
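The E-step updates themselves are missing from the transcript. The sketch below assumes the standard updates from Blei et al. (2003), φ_ni ∝ β_{i,wn} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni, for a single document. The function names, the hand-rolled `digamma` approximation and the toy inputs are illustrative assumptions, not the deck's implementation.

```python
import math

def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:               # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def e_step(doc, alpha, beta, n_iter=50):
    """Variational E-step for one document.

    doc:   list of word ids.
    alpha: list of K Dirichlet parameters.
    beta:  K x V matrix of topic-word probabilities (assumed strictly positive).
    Returns (gamma, phi) with phi[n][i] approximating q(z_n = i).
    """
    K, N = len(alpha), len(doc)
    gamma = [a + N / K for a in alpha]          # a common initialization
    phi = [[1.0 / K] * K for _ in doc]
    for _ in range(n_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(doc):
            row = [beta[i][w] * weights[i] for i in range(K)]
            s = sum(row)
            phi[n] = [r / s for r in row]       # phi_ni proportional to beta_{i,w_n} exp(psi(gamma_i))
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi

# Hypothetical toy input: 2 topics, 4 words, a document dominated by words 0 and 1.
alpha = [0.5, 0.5]
beta = [[0.45, 0.45, 0.05, 0.05],
        [0.05, 0.05, 0.45, 0.45]]
gamma, phi = e_step([0, 1, 0, 1, 2], alpha, beta)
```

A fixed iteration count stands in for the "repeat until convergence" test to keep the sketch short; a real implementation would monitor the change in γ or in the bound.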
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs. NOT EARN and (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model with per-document topic chains z1 … z4, words w1 … w4 and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with labels M, N, d, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
– Evenly sampled grid
– Random sampling
– Saliency detector
– Lowe's DoG detector
• Representation
– Normalized 11x11 gray values
– 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA: Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman and A. Willsky. NIPS, Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– We need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection maps the term vector of an image (Term1 … TermN, e.g. Img1 = 1 2 0 0 … 2, Dim = 1 million) to a topic vector (f1 … fM, e.g. Img1 = 0.2 0.1 … 0.03, Dim = 200)]
Key Idea Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
Orthogonal Decomposition
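$$p = Xw + \varepsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
– X = (X1 X2 X3 … Xk): base vectors
– w: low-dimensional representation
– ε: residual
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
$$\langle p, q \rangle = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$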
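To make the decomposition concrete, here is a minimal sketch that assumes the base vectors are orthonormal and the residual is whatever the basis cannot explain. Under that assumption the inner product of two documents splits into a low-dimensional part and a residual part. The toy vectors and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis plus a residual eps.

    basis: list of k orthonormal vectors (the columns X1 ... Xk of X).
    Returns (w, eps) with p = X w + eps and eps orthogonal to every base vector.
    """
    w = [dot(x, p) for x in basis]                           # w = X^T p
    recon = [sum(w[j] * basis[j][i] for j in range(len(basis)))
             for i in range(len(p))]                         # X w
    eps = [pi - ri for pi, ri in zip(p, recon)]
    return w, eps

# Hypothetical toy data: W = 4 vocabulary terms, k = 2 orthonormal base vectors.
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, -0.5, 0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 2.0, 2.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

full = dot(p, q)                                 # similarity in vocabulary space
split = dot(w_p, w_q) + dot(eps_p, eps_q)        # low-dimensional part + residual part
```

The two quantities agree because the cross terms between Xw and ε vanish when ε is orthogonal to the basis, which is why keeping only a few residual words can preserve most of the original similarity.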
A Probabilistic Implementation
• x is a switch variable. It controls whether a word is generated from
– a topic-specific distribution
– a document-specific distribution
– a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
• C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
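The switch-variable mixture can be evaluated directly once the component distributions are known. The sketch below is illustrative only: the function name, argument layout and toy probabilities are assumptions, with x = 0, 1, 2 selecting the topic, document-specific and background components respectively.

```python
def word_probability(w, switch, topic_word, doc_topic, doc_word, background):
    """p(w|d) as a three-way mixture controlled by the switch variable x.

    switch:     [p(x=0|d), p(x=1|d), p(x=2|d)]
    topic_word: K x V matrix of p(w|z=k)
    doc_topic:  length-K list of p(z=k|d)
    doc_word:   length-V document-specific word distribution
    background: length-V corpus-wide word distribution
    """
    topic_part = sum(topic_word[k][w] * doc_topic[k] for k in range(len(doc_topic)))
    return (switch[0] * topic_part
            + switch[1] * doc_word[w]
            + switch[2] * background[w])

# Hypothetical toy values: K = 2 topics, V = 3 words.
switch = [0.6, 0.3, 0.1]
topic_word = [[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]]
doc_topic = [0.5, 0.5]
doc_word = [0.0, 0.0, 1.0]          # this document has one very specific word
background = [0.4, 0.4, 0.2]
probs = [word_probability(w, switch, topic_word, doc_topic, doc_word, background)
         for w in range(3)]
```

Since every component is a proper distribution and the switch probabilities sum to one, the resulting values again form a distribution over the vocabulary.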
Search (Online)
[Figure: online search pipeline. Labels: a query = a low-dimensional histogram + a few words; LSH Index with entries DS1, DS2, …; candidates Doc 300, Doc 401, …; Re-ranking over Doc Meta (Doc 1, Doc 2, …, Doc 300, Doc 401, …, Doc N)]
• Index: 10M images, 46GB
• Search speed: < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Pipeline: reply reconstruction, then network construction, then expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
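The slides report MRR, MAP and P@10 without defining them. The following is a minimal sketch of the usual definitions for a single query, assuming a ranked list of candidate experts and a set of true experts. The user ids are hypothetical and this is not the evaluation code behind the table.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant items occur."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical query: four ranked users, two of whom are true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
p_at_2 = precision_at(ranked, relevant, k=2)
```

MRR and MAP are then the means of `rr` and `ap` over all queries.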
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition: good practice for learning matrices
– Graphical models: good practice for learning probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
Dirichlet Distribution
bull Definition
bull The density is zero outside this open (K minus 1)-dimensional simplex
k
i ii
ki ik
i i
ki i
kk
xx
xxxxp i
1
11
1
12121
1 0 st
)Γ(
)(Γ)|(
bull Various parameter α
(6 2 2) (3 7 5)
(2 3 4) (6 2 6)
Example Dirichlet Distributions (K=3)
Example Dirichlet Distributions (K=3)
bull Equal αi different
α0=01 α0=1 α0=10
k
i i10
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
bull The likelihood can be computed by summing each documentbull Jansenrsquos inequality in EM
Inference
bull In E-Step we need to compute the posterior distribution of the hidden variables
bull Unfortunately this distribution is intractable to compute in general
bull We have to resort to variational approach
Variational Inference
bull In variational inference we consider a simplified graphical model with variational parameters and minimize the KL Divergence between the variational and posterior distributions
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for "long query" retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
Term vector (Dim = 1 million):
      Term1  Term2  Term3  Term4  …  TermN
Img1  1      2      0      0      …  2
Topic projection (Dim = 200):
      f1   f2   …  fM
Img1  0.2  0.1  …  0.03
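The inverted-list problem above can be illustrated with a small sketch. The three-document toy corpus and the helper name `build_inverted_index` are hypothetical and only serve the illustration; they are not taken from the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

# Hypothetical toy corpus: document id -> terms.
docs = {
    1: ["cat", "dog"],
    2: ["dog", "fish"],
    3: ["fish", "bird"],
}
index = build_inverted_index(docs)

# A query that is "long" relative to this tiny corpus.
query = ["cat", "dog", "fish", "bird"]
posting_lists = [index[t] for t in query]   # one inverted list per query term

intersection = set.intersection(*posting_lists)  # documents containing every term
union = set.union(*posting_lists)                # documents containing any term
```

Here the intersection is empty while the union is the whole corpus, which is the situation the slide describes for a 1000-keyword query.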
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
An image = [figure: bar chart of the low dimensional representation] + a few words (10 words)
Orthogonal Decomposition
$$p = Xw + \varepsilon, \qquad X = (x_1, x_2, \ldots, x_k)$$
• X: base vectors
• w: low dimensional representation
• ε: residual
An image = [figure: bar chart of the low dimensional representation] + a few words (10 words)
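Similarity between two documents p and q, with residuals ε_p and ε_q:
$$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$$
Base vectors: X1 X2 X3 … Xk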
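A minimal sketch of the decomposition, under the assumption that the base vectors are orthonormal and the residual is orthogonal to them (the slide title suggests this, but how the talk actually builds X is not recoverable from the transcript). The helper names `gram_schmidt` and `decompose` and the 5-dimensional toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthonormalise a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        basis.append([ri / norm for ri in r])
    return basis

def decompose(p, basis):
    """Return (w, eps) with p = sum_j w[j] * basis[j] + eps, eps orthogonal to the basis."""
    w = [dot(p, b) for b in basis]
    proj = [sum(w[j] * basis[j][i] for j in range(len(basis))) for i in range(len(p))]
    eps = [pi - qi for pi, qi in zip(p, proj)]
    return w, eps

# Hypothetical 5-dim "vocabulary" space with k = 2 base vectors.
basis = gram_schmidt([[1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0, 0.0]])
p = [1.0, 2.0, 0.0, 0.0, 2.0]
q = [0.0, 1.0, 3.0, 1.0, 0.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)                                    # similarity in the original space
reconstructed = dot(w_p, w_q) + dot(eps_p, eps_q)    # low-dimensional part + residual part
```

With an orthonormal basis the cross terms vanish, so the two similarity values agree up to floating-point error; keeping only a few large entries of the residual (the "few words") would turn this into an approximation.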
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic specific distribution
• a document specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
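The switch-variable mixture can be sketched as follows. All numbers, the three-word vocabulary and the function name `word_probability` are hypothetical; the document-specific and background distributions are written here as `doc_word` and `background`.

```python
def word_probability(w, switch, topic_mix, topic_word, doc_word, background):
    """p(w|d) as a three-way mixture selected by the switch variable x."""
    # x = 0: the word comes from one of the K topics.
    topic_part = sum(topic_mix[k] * topic_word[k].get(w, 0.0)
                     for k in range(len(topic_mix)))
    # x = 1: document-specific distribution; x = 2: background distribution.
    return (switch[0] * topic_part
            + switch[1] * doc_word.get(w, 0.0)
            + switch[2] * background.get(w, 0.0))

# Hypothetical numbers for one document d over a 3-word vocabulary.
vocab = ["matrix", "topic", "the"]
switch = [0.6, 0.1, 0.3]        # p(x=0|d), p(x=1|d), p(x=2|d)
topic_mix = [0.7, 0.3]          # p(z=k|d)
topic_word = [                  # p(w|z=k)
    {"matrix": 0.8, "topic": 0.1, "the": 0.1},
    {"matrix": 0.1, "topic": 0.8, "the": 0.1},
]
doc_word = {"matrix": 1.0}                                  # document-specific words
background = {"the": 0.9, "matrix": 0.05, "topic": 0.05}    # background words

probs = {w: word_probability(w, switch, topic_mix, topic_word, doc_word, background)
         for w in vocab}
```

Because each component distribution and the switch probabilities sum to one, the resulting `probs` is again a distribution over the vocabulary.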
Search (Online)
[Figure: online search pipeline. Query (low dimensional representation + a few words) → LSH Index → candidate documents (Doc 300, Doc 401, …) → Re-ranking with Doc Meta (DS1, DS2, … stored for Doc 1, Doc 2, …, Doc N) → ranked results (Doc 401, …)]
• Index: 10M Images, 46GB
• Search Speed: < 100ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
Semantic: Topics
Structure: Who replies to whom
Optimize them together
Model semantic
Model structure
Reply reconstruction
Document Similarity
Topic Similarity
Structure Similarity
Baselines
NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation. Project documents to topic space
SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk topic space
Evaluation
Method  Slashdot                Apple
        All Posts  Good Posts   All Posts  Good Posts
NP      0.021      0.012        0.289      0.239
RR      0.183      0.319        0.269      0.474
DS      0.463      0.643        0.409      0.628
LDA     0.465      0.644        0.410      0.648
SWB     0.463      0.644        0.410      0.641
SMSS    0.524      0.737        0.517      0.772
Expert finding
• Pipeline: Reply reconstruction → Network construction → Expert finding
• Methods: HITS, PageRank, …
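The last stage of this pipeline can be sketched with a plain HITS power iteration on a reply network. The edge direction (replier → author being replied to), the reading of the top authority as the "expert", the fixed iteration count, and the toy user names are all assumptions made for this illustration.

```python
import math

def hits(edges, iterations=50):
    """Power iteration for HITS on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of nodes pointing to the node.
        auth = {n: 0.0 for n in nodes}
        for s, t in edges:
            auth[t] += hub[s]
        # Hub score: sum of the authority scores of the nodes it points to.
        new_hub = {n: 0.0 for n in nodes}
        for s, t in edges:
            new_hub[s] += auth[t]
        hub = new_hub
        # L2-normalise both score vectors to keep them bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical reply network: (replier, author being replied to).
replies = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"),
           ("alice", "bob"), ("dave", "bob")]
hub, auth = hits(replies)
expert = max(auth, key=auth.get)   # user with the highest authority score
```

In this toy network most replies go to `alice`, so she ends up with the highest authority score.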
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
– Achieves stable performance in expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
– Find most influential node
Evaluation
• Bayesian estimate
Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – good practice for learning matrices
– Graphical model – good practice for learning probability
• The graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition – Eigenvalue & Eigenvector
- Definition – Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition – Eigenface
- Slide 14
- Slide 15
- PageRank – Power Iteration
- Column-Stochastic & Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF – Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID – Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA – Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA – Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA – Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA – Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Example Dirichlet Distributions (K=3)
• Various parameter α
(6, 2, 2) (3, 7, 5)
(2, 3, 4) (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different α0
α0 = 0.1, α0 = 1, α0 = 10
$$\alpha_0 = \sum_{i=1}^{k} \alpha_i$$
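The effect of the total concentration α0 can be checked numerically. The sketch below samples from a Dirichlet with equal αi = α0/K by normalising Gamma draws (a standard construction); the sample size, the seed and the "average largest component" summary are choices made here for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
K = 3
average_max = {}
for alpha0 in (0.1, 1.0, 10.0):
    alpha = [alpha0 / K] * K          # equal alpha_i that sum to alpha0
    samples = [sample_dirichlet(alpha, rng) for _ in range(2000)]
    # The largest component measures how close a sample sits to a corner of the simplex.
    average_max[alpha0] = sum(max(s) for s in samples) / len(samples)
```

A small α0 pushes samples toward the corners of the simplex (one component close to 1), while a large α0 concentrates them around the uniform point (1/3, 1/3, 1/3).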
The LDA Model
[Figure: graphical model with topic nodes z1–z4 and word nodes w1–w4 per document]
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
The LDA Model
• For each document
• Choose θ ~ Dirichlet(α)
• For each of the N words wn
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn
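The generative process above can be sketched directly as a sampler. The two-topic model, the four-word vocabulary, and the function names are hypothetical; `beta[z]` holds the word distribution of topic z.

```python
import random

def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Sample one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)                        # theta ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]    # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]              # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
vocab = ["gene", "dna", "ball", "team"]
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0 only emits "gene" / "dna"
    [0.0, 0.0, 0.5, 0.5],   # topic 1 only emits "ball" / "team"
]
theta, topics, words = generate_document(alpha, beta, vocab, n_words=8, rng=rng)
```

Each word is tied to its own topic draw, so a single document can mix several topics in proportions given by θ.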
Joint Probability
• Given parameters α and β
where
Likelihood
• Joint Probability
• Marginal distribution of a document
• Likelihood over all the documents
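The three quantities named on these slides can be written out as a sketch in the notation of Blei et al. (2003), which the deck cites. The symbols N (words per document), M (number of documents) and D (the corpus) are taken from that paper and are assumptions with respect to the slide text.

```latex
% Joint distribution of the topic mixture, topic assignments and words of one document,
% where p(z_n = i \mid \theta) = \theta_i
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document: integrate out \theta and sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Likelihood over all M documents of the corpus D
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```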
Inference
• The log-likelihood can be computed by summing over each document
• Jensen's inequality in EM
Inference
• In the E-Step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
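As a sketch in the notation of Blei et al. (2003), with the factorised variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏ q(z_n | φ_n) assumed from that paper, the relation between the bound and the KL divergence reads:

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)

L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q\big[ \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \big]
  - \mathbb{E}_q\big[ \log q(\theta, \mathbf{z}) \big]
```

Since the left-hand side does not depend on γ and φ, raising L necessarily lowers the KL term by the same amount.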
VBEM vs EM
• Only different in the E-Step
• In standard EM, q(X) is directly set to p(X|D,θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D,θ). Instead, it approximates p(X|D,θ) by a variational distribution q(X), by minimizing KL(q(X) || p(X|D,θ))
• This is also equivalent to maximizing the lower bound L(θ) with respect to q(X)
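The difference between the two E-steps can be checked numerically on a tiny discrete example. The joint probabilities below are hypothetical, and the uniform `q_approx` merely stands in for whatever a restricted variational family would produce.

```python
import math

# Hypothetical joint p(X = x, D | theta) for a hidden variable X with three states,
# evaluated at the observed data D.
joint = [0.10, 0.25, 0.05]

evidence = sum(joint)                          # p(D | theta)
posterior = [j / evidence for j in joint]      # p(X | D, theta)

def lower_bound(q):
    """L(q) = sum_x q(x) * log( p(x, D | theta) / q(x) )."""
    return sum(qx * math.log(jx / qx) for qx, jx in zip(q, joint) if qx > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

q_exact = posterior                  # standard EM: q(X) = p(X | D, theta)
q_approx = [1 / 3, 1 / 3, 1 / 3]     # some other q, as in VBEM

log_evidence = math.log(evidence)
gap_exact = log_evidence - lower_bound(q_exact)     # equals KL(q_exact || posterior) = 0
gap_approx = log_evidence - lower_bound(q_approx)   # equals KL(q_approx || posterior) > 0
```

The gap between the log-likelihood and the bound is exactly the KL divergence, so it vanishes only when q matches the true posterior.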
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
– E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
– M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: Variational Inference – repeat until convergence
• M-Step: Parameter estimation
– β
– α: the update can be implemented using the Newton-Raphson method
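A minimal sketch of the E-step for a single document, using the update equations from Blei et al. (2003): φ_ni ∝ β_{i,w_n} exp(ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni. The slide's own formulas are not in the transcript, so this follows the cited paper. The hand-written `digamma`, the fixed iteration count in place of a convergence test, and the toy model are assumptions; the M-step is omitted.

```python
import math

def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic expansion."""
    result = 0.0
    while x < 6.0:                # shift x upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def e_step(doc, alpha, beta, iterations=50):
    """Variational inference for one document; returns (gamma, phi)."""
    K, N = len(alpha), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]
    gamma = [alpha[i] + N / K for i in range(K)]
    for _ in range(iterations):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(doc):
            row = [beta[i][w] * weights[i] for i in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]          # normalise over topics
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi

# Hypothetical 2-topic model over a 4-word vocabulary; a document is a list of word ids.
alpha = [0.5, 0.5]
beta = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
doc = [0, 1, 0, 2]
gamma, phi = e_step(doc, alpha, beta)
```

Three of the four words are typical of topic 0, so the resulting γ puts more mass on topic 0 than on topic 1.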
The LDA Model
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
The LDA Model
bull For each documentbull Choose ~Dirichlet()bull For each of the N words wn
ndash Choose a topic znraquo Multinomial()
ndash Choose a word wn from p(wn|zn) a multinomial probability conditioned on the topic zn
Joint Probability
bull Given parameter α and β
where
Likelihood
bull Joint Probability
bull Marginal distribution of a document
bull Likelihood over all the documents
Inference
• The log-likelihood of the corpus is the sum of the log-likelihoods of the individual documents
• Jensen's inequality, as in EM, is used to lower-bound it
Inference
• In the E-step we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log-likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
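The relation between the bound and the KL divergence can be written out as below. This is a sketch in the notation of Blei, Ng and Jordan (2003), where γ and φ are the variational parameters and q is the factorized variational distribution; these symbols are assumed from that paper.

```latex
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

L(\gamma, \phi; \alpha, \beta)
  = \mathbb{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
  - \mathbb{E}_q\!\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]

\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)
```

Since the left-hand side does not depend on γ or φ, raising the bound necessarily lowers the KL term by the same amount.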
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X | D, θ), so that KL = 0
• In VBEM, p(X | D, θ) is intractable to compute. Instead, it is approximated by a variational distribution q(X), found by minimizing KL(q(X) || p(X | D, θ))
• This is also equivalent to maximizing the lower bound L(θ) with respect to q
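A toy numerical check can make the comparison concrete. The sketch below uses a hidden variable X with three states and a single observation D; all numbers are illustrative assumptions. It shows that the lower bound plus the KL divergence equals log p(D | θ) for any q, and that the KL term vanishes when q is the exact posterior, which is the standard EM choice.

```python
import math

# Toy model (all numbers are illustrative assumptions).
prior = [0.5, 0.3, 0.2]            # p(X = x | theta)
likelihood = [0.10, 0.40, 0.70]    # p(D | X = x, theta)

joint = [p * l for p, l in zip(prior, likelihood)]   # p(X = x, D | theta)
evidence = sum(joint)                                # p(D | theta)
posterior = [j / evidence for j in joint]            # p(X = x | D, theta)


def lower_bound(q):
    """L(q) = sum_x q(x) * log(p(x, D | theta) / q(x))."""
    return sum(qx * math.log(jx / qx) for qx, jx in zip(q, joint) if qx > 0)


def kl(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


q_arbitrary = [1 / 3, 1 / 3, 1 / 3]
for name, q in [("arbitrary q", q_arbitrary), ("q = posterior", posterior)]:
    bound, gap = lower_bound(q), kl(q, posterior)
    print(f"{name}: bound={bound:.4f}  KL={gap:.4f}  bound+KL={bound + gap:.4f}")
print(f"log p(D | theta) = {math.log(evidence):.4f}")
```

In VBEM the posterior is not available in closed form, so q is restricted to a tractable family and the KL term generally stays above zero.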
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data
• Strategy (variational EM)
• Lower-bound log p(w | α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  – E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  – M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference – repeat until convergence
[Equations: updates of the variational parameters]
• M-step: parameter estimation
[Equation: update of β]
  – The update of α can be implemented using the Newton-Raphson method
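The E-step can be sketched for a single document as below. The update rules are the ones given in Blei, Ng and Jordan (2003), namely φ_nk ∝ β_k,wn · exp(ψ(γ_k)) and γ_k = α_k + Σ_n φ_nk; this is assumed to be what the slide shows. The function names, the fixed iteration count (used in place of a convergence test) and the toy inputs are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0: recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step(doc, alpha, beta, n_iter=50):
    """Variational E-step for one document.

    doc:   list of word ids
    alpha: K Dirichlet parameters
    beta:  K x V matrix, beta[k][v] = p(word v | topic k), assumed strictly positive
    Returns (gamma, phi).
    """
    K, N = len(alpha), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]
    gamma = [a + N / K for a in alpha]
    for _ in range(n_iter):                      # fixed count instead of a convergence test
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(doc):
            row = [beta[k][w] * weights[k] for k in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]    # normalize over topics
        gamma = [alpha[k] + sum(phi[n][k] for n in range(N)) for k in range(K)]
    return gamma, phi


alpha = [0.1, 0.1]
beta = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
doc = [0, 0, 0, 2]                               # three occurrences of word 0, one of word 2
gamma, phi = e_step(doc, alpha, beta)
print("gamma:", [round(g, 3) for g in gamma])
```

The M-step would then re-estimate β from the φ values accumulated over all documents, which is not shown here.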
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8,000 documents and 15,818 words
[Figure: classification results for (a) EARN vs NOT EARN and (b) GRAIN vs NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: LDA graphical model with topic nodes z1 to z4, word nodes w1 to w4, and the shared parameter β]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: graphical model with variables π, z, x, θ and β, and plate labels M and Nd]
Codebook
• 174 local image patches
• Detection
  – Evenly sampled grid
  – Random sampling
  – Saliency detector
  – Lowe's DoG detector
• Representation
  – Normalized 11x11 gray values
  – 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  – Need to access 1000 inverted lists
  – The intersection of 1000 inverted lists may be empty
  – The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection from the term space (Dim = 1 million) to the topic space (Dim = 200)]
        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2
        f1     f2     …      fM
Img1    0.2    0.1    …      0.03
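The inverted-list problem can be illustrated on a toy corpus. The sketch below is an assumption-laden miniature: the four documents, their terms and the five-keyword query are hypothetical, chosen so that no document contains every keyword while every document contains at least one.

```python
from functools import reduce

# Hypothetical toy corpus: document id -> set of terms.
docs = {
    1: {"cat", "dog", "tree"},
    2: {"dog", "car"},
    3: {"tree", "sky", "car"},
    4: {"cat", "sky"},
}

# Build the inverted index: term -> set of document ids.
inverted = {}
for doc_id, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc_id)

query = ["cat", "dog", "tree", "car", "sky"]      # a "long" query
lists = [inverted.get(t, set()) for t in query]   # one list access per keyword

intersection = reduce(set.intersection, lists)    # documents containing every keyword
union = reduce(set.union, lists)                  # documents containing any keyword

print("lists accessed:", len(lists))
print("intersection:", sorted(intersection))
print("union:", sorted(union))
```

With the intersection too strict and the union too loose, a keyword index gives little ranking signal for long queries, which motivates projecting to a low-dimensional space first.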
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \varepsilon$$
[Figure: an image = a bar chart (low-dimensional representation) + a few words (10 words)]
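Orthogonal Decomposition
$$p = Xw + \varepsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• X = (X1, X2, X3, …, Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a bar chart (low-dimensional representation) + a few words (10 words)]
[Equation: inner product of two vectors p and q, expressed through their low-dimensional representations and residuals]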
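A small numerical sketch of the decomposition, under a simplifying assumption that is not stated on the slide: the base vectors are taken to be orthonormal, so that w is obtained as the projection of p onto them and the residual is orthogonal to every base vector. In that case the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The basis, the vectors `p` and `q`, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis plus a residual.

    basis: list of k orthonormal vectors (the columns X1..Xk);
    orthonormality is an assumption of this sketch.
    """
    w = [dot(p, x) for x in basis]                            # w = X^T p
    recon = [sum(wj * x[i] for wj, x in zip(w, basis)) for i in range(len(p))]
    residual = [pi - ri for pi, ri in zip(p, recon)]          # eps = p - X w
    return w, residual


s = 1 / math.sqrt(2)
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, s, s]]   # k = 2 base vectors, 4-term vocabulary
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 1.0, 4.0, 0.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

exact = dot(p, q)
from_parts = dot(w_p, w_q) + dot(eps_p, eps_q)
print("p . q              =", exact)
print("w_p.w_q + eps.eps  =", round(from_parts, 9))
```

Keeping only w would lose the residual term of the inner product; storing a few large residual entries alongside w is one way to retain part of it.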
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
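The three-way mixture controlled by the switch variable can be evaluated directly. In the sketch below all probability tables are made-up values for a 4-word vocabulary and K = 2 topics, and the function name `word_probability` is hypothetical.

```python
def word_probability(w, switch, topic_word, doc_topic, doc_word, background):
    """p(w | d) as a mixture of topic, document-specific and background parts.

    switch:     (p(x=0 | d), p(x=1 | d), p(x=2 | d))
    topic_word: K x V matrix, topic_word[k][w] = p(w | z=k)
    doc_topic:  K values, doc_topic[k] = p(z=k | d)
    doc_word:   V values, document-specific word distribution
    background: V values, corpus-wide background word distribution
    """
    x0, x1, x2 = switch
    topic_part = sum(topic_word[k][w] * doc_topic[k] for k in range(len(doc_topic)))
    return x0 * topic_part + x1 * doc_word[w] + x2 * background[w]


# Illustrative values only.
switch = (0.6, 0.3, 0.1)
topic_word = [[0.5, 0.4, 0.1, 0.0], [0.0, 0.1, 0.4, 0.5]]
doc_topic = [0.7, 0.3]
doc_word = [0.0, 0.0, 0.0, 1.0]          # this document favors word 3
background = [0.25, 0.25, 0.25, 0.25]

probs = [
    word_probability(w, switch, topic_word, doc_topic, doc_word, background)
    for w in range(4)
]
print([round(p, 4) for p in probs])
```

Because each component is a proper distribution over the vocabulary and the switch probabilities sum to one, the mixture is again a distribution over words.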
Search (Online)
[Figure: online search pipeline. A query = a bar chart (low-dimensional representation) + a few words. The LSH index returns candidate documents (Doc 300, Doc 401, …), which are then re-ranked using the document meta data (DS1, DS2, … stored for Doc 1, Doc 2, …, Doc N)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to the topic and junk topic space
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  – Matrix decomposition – a good way to learn matrix algebra
  – Graphical model – a good way to learn probability
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is a good tool for analyzing problems
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Likelihood
• Joint probability
• Marginal distribution of a document
• Likelihood over all the documents
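For reference, in the notation of Blei et al. (2003), these three quantities are usually written as below. Here θ is the per-document topic mixture, z_n is the topic of the n-th word w_n, N is the document length, M is the number of documents, and α, β are the model parameters. This follows that paper's notation and may differ from the slide's own rendering.

```latex
\begin{align*}
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha)
     \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta \\
p(D \mid \alpha, \beta)
  &= \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
     \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
\end{align*}
```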
Inference
• The log-likelihood of the corpus is computed by summing the log-likelihoods of the individual documents
• Jensen's inequality in EM
Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables
• Unfortunately, this distribution is intractable to compute in general
• We have to resort to a variational approach
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the log-likelihood and the lower bound is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
VBEM vs EM
• The two differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D,θ), which makes KL = 0
• In VBEM, p(X|D,θ) is intractable to compute. Instead, it approximates p(X|D,θ) by a variational distribution q(X), found by minimizing KL(q(X) || p(X|D,θ))
• This is also equivalent to maximizing the lower bound L(θ) with respect to q(X)
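The relationship between the lower bound, the KL divergence, and the log-likelihood can be checked numerically on a toy model. The sketch below assumes a hidden variable X with three values and a fixed observation D. The numbers in `joint` are made up for illustration and the function names are not from the slides.

```python
import math

def log_evidence(joint):
    """log p(D), where joint[x] = p(X=x, D) for the fixed observation D."""
    return math.log(sum(joint))

def lower_bound(joint, q):
    """L(q) = sum_x q(x) * log(p(x, D) / q(x)); terms with q(x) = 0 contribute 0."""
    return sum(qx * math.log(px / qx) for px, qx in zip(joint, q) if qx > 0)

def kl_to_posterior(joint, q):
    """KL(q(X) || p(X|D)), using p(x|D) = p(x, D) / p(D)."""
    evidence = sum(joint)
    return sum(qx * math.log(qx * evidence / px) for px, qx in zip(joint, q) if qx > 0)

# Hypothetical toy model: the three entries are p(X=x, D) for x = 0, 1, 2.
joint = [0.10, 0.25, 0.05]
posterior = [p / sum(joint) for p in joint]
q = [0.5, 0.3, 0.2]  # an arbitrary variational distribution

for name, dist in (("arbitrary q", q), ("exact posterior", posterior)):
    print(name, lower_bound(joint, dist), kl_to_posterior(joint, dist), log_evidence(joint))
```

For any q, the bound plus the KL term equals log p(D). With q set to the exact posterior (the standard EM choice) the KL term vanishes and the bound is tight.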
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (variational EM)
  - Lower bound log p(w|α,β) by a function L(γ, φ; α, β)
  - Repeat until convergence
    E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
    M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  - β
  - α: can be implemented using the Newton-Raphson method
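The E-step for a single document can be sketched as follows. It assumes the coordinate-ascent updates given by Blei et al. (2003): φ_nk ∝ β_{k,w_n} exp(Ψ(γ_k)) and γ_k = α_k + Σ_n φ_nk. The helper `digamma`, the function name `e_step`, and the toy values are illustrative assumptions. The M-step is not shown.

```python
import math

def digamma(x):
    """Digamma function for x > 0, via the recurrence psi(x) = psi(x + 1) - 1/x
    followed by an asymptotic series once x >= 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def e_step(word_ids, alpha, beta, max_iter=200, tol=1e-8):
    """Variational E-step for one document.

    word_ids: list of word indices appearing in the document
    alpha:    list of K Dirichlet parameters
    beta:     K x V nested list, beta[k][v] = p(word v | topic k), all entries > 0
    Returns (gamma, phi): the variational Dirichlet and per-word topic parameters.
    """
    K, N = len(alpha), len(word_ids)
    gamma = [a + N / K for a in alpha]
    phi = [[1.0 / K] * K for _ in range(N)]
    for _ in range(max_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(word_ids):
            row = [beta[k][w] * weights[k] for k in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]  # normalize over topics
        new_gamma = [alpha[k] + sum(phi[n][k] for n in range(N)) for k in range(K)]
        converged = max(abs(a - b) for a, b in zip(new_gamma, gamma)) < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, phi

# Hypothetical two-topic, two-word model; the document uses word 0 three times.
alpha = [0.5, 0.5]
beta = [[0.9, 0.1], [0.1, 0.9]]
word_ids = [0, 0, 0, 1]
gamma, phi = e_step(word_ids, alpha, beta)
print(gamma)
```

Because each row of φ sums to one, the entries of γ always sum to Σα plus the document length, which is a convenient sanity check.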
Topic Examples in a 100-topic LDA Model
• 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8000 documents and 15818 words
[Figure: classification results. Panels: (a) EARN vs. NOT EARN, (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: three graphical-model diagrams over topics z1 … z4 and words w1 … w4, one of them with an additional node b]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with variables π, z, x, θ, β and plates labeled M and Nd]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2

Topic projection

Topic space (Dim = 200):
        f1    f2    …  fM
  Img1  0.2   0.1   …  0.03
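The two failure modes of a long query can be seen on a very small inverted index. This is a minimal sketch with a hypothetical three-document corpus. The function names and the data are illustrative, not taken from the paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def candidates(index, query_terms):
    """Return (intersection, union) of the inverted lists of the query terms."""
    lists = [index.get(term, set()) for term in query_terms]
    if not lists:
        return set(), set()
    return set.intersection(*lists), set.union(*lists)

# Hypothetical corpus: each document is a set of terms.
docs = {
    1: {"sky", "sea", "boat"},
    2: {"sky", "tree"},
    3: {"sea", "tree", "dog"},
}
index = build_inverted_index(docs)
both, either = candidates(index, ["sky", "sea", "tree"])
print(both)    # no document contains every query term
print(either)  # every document contains at least one query term
```

Even with three query terms the intersection is already empty and the union is the whole corpus, so neither set is a useful candidate list. Each query term also costs one inverted-list access.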
Key Idea: Dimension Reduction + Residual Error Preservation
• $p$: original TF-IDF vector in vocabulary space
• $X$: projection matrix for dimension reduction
• $w$: low-dimensional feature vector
• $\varepsilon$: residual error
$p = Xw + \varepsilon$
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \varepsilon$
• Base vectors: X1, X2, X3, …, Xk
• $w$: low-dimensional representation
• $\varepsilon$: residual
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
$p^T q = w_p^T w_q + \varepsilon_p^T \varepsilon_q$
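A minimal sketch of why keeping the residual preserves similarity. It assumes the base vectors are orthonormal and the residual is what remains of p after projecting onto them. Under those assumptions the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The basis and vectors below are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual.

    basis: list of k orthonormal vectors (assumption), each of the same length as p.
    Returns (w, residual) such that p = sum_i w[i] * basis[i] + residual.
    """
    w = [dot(x, p) for x in basis]
    projection = [sum(w_i * x[j] for w_i, x in zip(w, basis)) for j in range(len(p))]
    residual = [p_j - r_j for p_j, r_j in zip(p, projection)]
    return w, residual

# Hypothetical 4-dimensional "vocabulary" with a 2-dimensional orthonormal basis.
basis = [[0.6, 0.8, 0.0, 0.0], [-0.8, 0.6, 0.0, 0.0]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 2.0, 1.0, 4.0]

w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)
print(dot(p, q), dot(w_p, w_q) + dot(eps_p, eps_q))
```

In this sketch the residual is kept in full. Keeping only a few of its largest entries, as "a few words" suggests, would make the second term an approximation.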
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
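The switch-variable mixture can be written out directly: the probability of a word in a document is the switch-weighted sum of a topic part, a document-specific part, and a background part. The sketch below uses made-up distributions over a three-word vocabulary. All names and numbers are illustrative assumptions.

```python
def word_prob(w, switch, topic_word, doc_topic, doc_specific, background):
    """p(w|d) under the three-way switch model.

    switch:       [p(x=0|d), p(x=1|d), p(x=2|d)]
    topic_word:   topic_word[k][w] = p(w|z=k)
    doc_topic:    doc_topic[k] = p(z=k|d)
    doc_specific: doc_specific[w] = document-specific word distribution
    background:   background[w] = corpus-wide background word distribution
    """
    topic_part = sum(topic_word[k][w] * doc_topic[k] for k in range(len(doc_topic)))
    return (switch[0] * topic_part
            + switch[1] * doc_specific[w]
            + switch[2] * background[w])

# Hypothetical model with K = 2 topics and a 3-word vocabulary.
switch = [0.6, 0.3, 0.1]
topic_word = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
doc_topic = [0.8, 0.2]
doc_specific = [0.0, 0.0, 1.0]   # this document has one special word
background = [0.4, 0.4, 0.2]

probs = [word_prob(w, switch, topic_word, doc_topic, doc_specific, background)
         for w in range(3)]
print(probs)
```

Because every component is a proper distribution and the switch probabilities sum to one, the resulting word probabilities also sum to one.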
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional histogram + a few words) is looked up in the LSH index, which returns candidate documents (Doc 300, Doc 401, …). The candidates are then re-ranked using the per-document metadata (Doc Meta: DS1, DS2, …).]
• Index: 10M images, 46GB
• Search speed: < 100 ms
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
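The three simplest baselines can be sketched in a few lines. Posts are assumed to be in chronological order with the root at index 0, and each function returns the index of the predicted parent for post i (i ≥ 1). Cosine similarity over raw term counts stands in for the DS baseline here. That choice and the toy thread are assumptions, since the slides do not specify the weighting.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return shared / (norm_a * norm_b) if norm_a and norm_b else 0.0

def reply_to_nearest(posts, i):
    """NP: every post replies to the post immediately before it."""
    return i - 1

def reply_to_root(posts, i):
    """RR: every post replies to the first post of the thread."""
    return 0

def reply_by_similarity(posts, i):
    """DS: a post replies to the most similar earlier post."""
    vectors = [Counter(post.split()) for post in posts]
    return max(range(i), key=lambda j: cosine(vectors[i], vectors[j]))

# Hypothetical four-post thread.
posts = [
    "new phone battery drains fast",
    "try lowering screen brightness",
    "battery drains fast after update too",
    "brightness tip worked thanks",
]
for i in range(1, len(posts)):
    print(i, reply_to_nearest(posts, i), reply_to_root(posts, i), reply_by_similarity(posts, i))
```

On this thread the three baselines disagree for the last post: NP picks post 2, RR picks post 0, and DS picks post 1, the only earlier post sharing a term with it.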
Evaluation
Method  Slashdot                  Apple
        All Posts   Good Posts    All Posts   Good Posts
NP      0.021       0.012         0.289       0.239
RR      0.183       0.319         0.269       0.474
DS      0.463       0.643         0.409       0.628
LDA     0.465       0.644         0.410       0.648
SWB     0.463       0.644         0.410       0.641
SMSS    0.524       0.737         0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes
Evaluation
• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
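The three reported metrics have standard per-query definitions, sketched below. MRR and MAP are the means of reciprocal rank and average precision over all queries. How relevance was judged in the experiment is not stated on the slide, so the ranked list and relevant set here are made up.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant items occur."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical ranking of users for one query, with two true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}
print(reciprocal_rank(ranked, relevant))
print(average_precision(ranked, relevant))
print(precision_at_k(ranked, relevant, k=2))
```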
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good way to practice matrix methods
  - Graphical model: a good way to practice probability
• A graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions
Variational Inference
• The difference between the lower bound and the log likelihood is the KL divergence
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence
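The relationship between the lower bound and the KL divergence can be written out explicitly. The block below is a sketch in the notation of Blei et al. (2003); the symbols are an assumption here, since they did not survive in the slide text: γ and φ are the variational parameters, α and β are the model parameters, θ and z are the latent variables.

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi)
      \,\big\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
```

The left-hand side does not depend on γ or φ, and the KL term is never negative, so any increase in L obtained by adjusting γ and φ lowers the KL divergence by the same amount.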
VBEM vs EM
• They differ only in the E-step
• In standard EM, q(X) is set directly to p(X|D, θ), which makes KL = 0
• In VBEM, it is intractable to compute p(X|D, θ). Instead, it approximates p(X|D, θ) with a variational distribution q(X) by minimizing KL(q(X) || p(X|D, θ))
• This is also equivalent to maximizing the lower bound L(θ)
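The difference between the two E-steps can be checked numerically on a toy model. The sketch below is only an illustration, assuming a discrete latent variable with three states and made-up probabilities; the function and variable names are hypothetical and not from the slides.

```python
import math

def lower_bound(q, joint):
    """L(q) = sum_x q(x) * log(p(x, D) / q(x)) for a discrete latent X."""
    return sum(qx * math.log(jx / qx) for qx, jx in zip(q, joint) if qx > 0)

def kl_divergence(q, p):
    """KL(q || p) = sum_x q(x) * log(q(x) / p(x))."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Hypothetical toy model: three latent states and one fixed observation D.
prior = [0.5, 0.3, 0.2]          # p(X)
likelihood = [0.10, 0.40, 0.25]  # p(D | X)
joint = [p * l for p, l in zip(prior, likelihood)]  # p(X, D)
evidence = sum(joint)                               # p(D)
posterior = [j / evidence for j in joint]           # p(X | D)
log_evidence = math.log(evidence)

# Standard EM E-step: q is the exact posterior, so the bound is tight.
gap_exact = log_evidence - lower_bound(posterior, joint)

# VBEM-style E-step: q only approximates the posterior, so a gap remains,
# and that gap equals KL(q || posterior).
q_approx = [0.4, 0.4, 0.2]
gap_approx = log_evidence - lower_bound(q_approx, joint)
kl_approx = kl_divergence(q_approx, posterior)
```

Here `gap_exact` is zero up to rounding, while `gap_approx` matches `kl_approx`, which is why minimizing the KL divergence and maximizing the lower bound are the same operation.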
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data
• Strategy (Variational EM)
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence
  - E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ and φ
  - M: Maximize the bound with respect to the parameters α and β
Parameter Estimation
• E-Step: Variational inference - repeat until convergence
• M-Step: Parameter estimation
  - β has a closed-form update
  - α can be implemented using the Newton-Raphson method
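The E-step can be sketched for a single document as follows. This is a minimal illustration, not the slide's own code: it assumes the standard coordinate updates from Blei et al. (2003), φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α_i + Σ_n φ_ni, runs a fixed number of iterations instead of testing convergence, and uses a hand-written `digamma` approximation so that it needs only the standard library. All names and the toy values are hypothetical; the M-step is not shown.

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x, then an
    asymptotic series (accurate enough for this illustration)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def e_step(doc, alpha, beta, iterations=50):
    """Variational E-step for one document.
    doc: list of word ids; alpha: K Dirichlet parameters;
    beta[i][v]: probability of word v under topic i."""
    k = len(alpha)
    gamma = [a + len(doc) / k for a in alpha]
    phi = [[1.0 / k] * k for _ in doc]
    for _ in range(iterations):
        for n, word in enumerate(doc):
            # phi_ni is proportional to beta_{i, w_n} * exp(digamma(gamma_i))
            weights = [beta[i][word] * math.exp(digamma(gamma[i])) for i in range(k)]
            total = sum(weights)
            phi[n] = [wt / total for wt in weights]
        # gamma_i = alpha_i + sum over words of phi_ni
        gamma = [alpha[i] + sum(row[i] for row in phi) for i in range(k)]
    return gamma, phi

# Hypothetical two-topic, two-word example.
alpha = [0.5, 0.5]
beta = [[0.9, 0.1], [0.1, 0.9]]
doc = [0, 0, 0, 1]
gamma, phi = e_step(doc, alpha, beta)
```

Because the document is mostly word 0, which topic 0 favors, `gamma[0]` ends up larger than `gamma[1]`.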
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset - contains 8,000 documents and 15,818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong
[Figure: three graphical-model variants over topics z1 to z4 and words w1 to w4]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram with labels M, Nd, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• ...
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
  Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  ...  TermN
  Img1  1      2      0      0      ...  2
  Topic projection
  Topic space (Dim = 200):
        f1     f2     ...  fM
  Img1  0.2    0.1    ...  0.03
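The inverted-list problem can be reproduced at small scale. The sketch below uses a hypothetical four-document corpus and a five-term query; the corpus, the query, and the helper name `build_inverted_index` are illustrative assumptions, not data from the talk.

```python
from functools import reduce

# Hypothetical toy corpus: document id -> set of terms.
corpus = {
    "img1": {"sky", "sea", "boat"},
    "img2": {"sky", "tree", "road"},
    "img3": {"sea", "sand", "shell"},
    "img4": {"car", "road", "sign"},
}

def build_inverted_index(docs):
    """Map each term to the set of documents that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_inverted_index(corpus)

# A "long" query touches one inverted list per keyword.
query = ["sky", "sea", "road", "sand", "car"]
lists = [index.get(term, set()) for term in query]
accessed = len(lists)

# AND semantics: no document contains every query term.
intersection = reduce(set.intersection, lists)
# OR semantics: every document contains at least one query term.
union = reduce(set.union, lists)
```

Even with five keywords the intersection is empty and the union is the whole corpus, which is the behavior the slide describes for 1000 keywords.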
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$$p = Xw + \epsilon$$
[Figure: an image = a low-dimensional feature histogram + a few words (10 words)]
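Orthogonal Decomposition
$$p = Xw + \epsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• X = (X1, X2, X3, ..., Xk): base vectors
• w: low-dimensional representation
• ε: residual
[Figure: an image = a low-dimensional feature histogram + a few words (10 words)]
$$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$$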
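The decomposition and its effect on the inner product can be checked with a small numeric example. This is a sketch under stated assumptions: the base vectors are taken to be orthonormal, w is obtained by projecting p onto them, and the full residual is kept (the slide's "a few words" suggests the residual is truncated in practice, which would make the identity approximate). The basis and the vectors `p` and `q` are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical orthonormal basis: k = 2 base vectors over a 4-word vocabulary.
s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, s, s]]  # each row is one base vector X_j

def decompose(p):
    """Return (w, residual) such that p = Xw + residual, residual orthogonal to X."""
    w = [dot(x, p) for x in basis]
    projection = [sum(w[j] * basis[j][i] for j in range(len(basis)))
                  for i in range(len(p))]
    residual = [pi - pr for pi, pr in zip(p, projection)]
    return w, residual

p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 2.0, 4.0, 0.0]
w_p, e_p = decompose(p)
w_q, e_q = decompose(q)

# Similarity in the original space versus similarity from the two parts.
full_similarity = dot(p, q)
split_similarity = dot(w_p, w_q) + dot(e_p, e_q)
```

The cross terms vanish because each residual is orthogonal to every base vector, so the two similarities agree.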
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
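The three-way mixture controlled by the switch variable can be sketched directly. The numbers below are invented for illustration, and the function and argument names are hypothetical; only the structure (topic part, document-specific part, background part, weighted by the switch probabilities) follows the slide.

```python
def word_probability(word, switch, topic_given_doc, word_given_topic,
                     word_given_doc, background):
    """p(w|d) as a mixture selected by the switch variable x in {0, 1, 2}."""
    # x = 0: the word comes from one of the K topics.
    topic_part = sum(p_z * word_given_topic[k].get(word, 0.0)
                     for k, p_z in enumerate(topic_given_doc))
    # x = 1: document-specific distribution; x = 2: background distribution.
    return (switch[0] * topic_part
            + switch[1] * word_given_doc.get(word, 0.0)
            + switch[2] * background.get(word, 0.0))

# Hypothetical values for one document d.
vocab = ["beach", "sea", "the", "zanzibar"]
switch = [0.6, 0.1, 0.3]                # p(x=0|d), p(x=1|d), p(x=2|d)
topic_given_doc = [0.7, 0.3]            # p(z=k|d)
word_given_topic = [
    {"beach": 0.5, "sea": 0.5},         # p(w|z=0)
    {"sea": 0.2, "the": 0.8},           # p(w|z=1)
]
word_given_doc = {"zanzibar": 1.0}      # document-specific words
background = {"the": 0.9, "sea": 0.1}   # corpus-wide background words

probs = {w: word_probability(w, switch, topic_given_doc, word_given_topic,
                             word_given_doc, background) for w in vocab}
```

A rare word such as "zanzibar" gets its probability only through the document-specific branch, which is the part of a document that topic projection alone would lose.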
Search (Online)
[Figure: online search pipeline. A query (a low-dimensional feature histogram + a few words) is looked up in an LSH index over Doc 1, Doc 2, ..., Doc N; candidates such as Doc 300 and Doc 401 are retrieved and then re-ranked using the per-document entries DS1, DS2, ... stored in Doc Meta]
• Index: 10M images, 46GB
• Search speed: < 100ms
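A two-stage lookup of this kind (hash to find candidates, then re-rank) can be sketched as follows. The slide does not say which LSH family or which re-ranking score is used, so the random-hyperplane hashing and the cosine re-ranking here are assumptions chosen for illustration, as are all names and vectors.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Random hyperplanes for sign-based hashing (an assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: the side of the plane the vector falls on."""
    return tuple(sum(h * v for h, v in zip(plane, vector)) >= 0.0
                 for plane in hyperplanes)

def build_index(docs, hyperplanes):
    buckets = {}
    for doc_id, vector in docs.items():
        buckets.setdefault(signature(vector, hyperplanes), []).append(doc_id)
    return buckets

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def search(query, docs, buckets, hyperplanes):
    # Stage 1: candidates are the documents in the query's bucket.
    candidates = buckets.get(signature(query, hyperplanes), [])
    # Stage 2: re-rank the candidates by an exact similarity score.
    return sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)

# Hypothetical low-dimensional document vectors.
docs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc300": [0.2, 0.7, 0.1],
    "doc401": [0.25, 0.65, 0.1],
}
planes = make_hyperplanes(num_bits=4, dim=3)
buckets = build_index(docs, planes)
results = search(docs["doc300"], docs, buckets, planes)
```

Only the documents in the matching bucket are scored exactly, which is what keeps the online cost low for a large index.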
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk-topic space
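The three simplest baselines can be sketched on a toy thread. This is an illustration under assumptions: "nearest post" is read as the immediately preceding post, document similarity is taken to be cosine similarity over raw term counts, and the thread text and function names are invented. The LDA and SWB baselines would replace the term-count vectors with topic-space vectors and are not shown.

```python
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def reply_to_nearest_post(posts):
    """NP: each post is assumed to reply to the post just before it."""
    return {i: i - 1 for i in range(1, len(posts))}

def reply_to_root(posts):
    """RR: each post is assumed to reply to the first post of the thread."""
    return {i: 0 for i in range(1, len(posts))}

def document_similarity(posts):
    """DS: each post is assumed to reply to its most similar earlier post."""
    vectors = [Counter(text.lower().split()) for text in posts]
    return {i: max(range(i), key=lambda j: cosine(vectors[i], vectors[j]))
            for i in range(1, len(posts))}

# Hypothetical thread, in posting order; index 0 is the root post.
thread = [
    "which laptop battery lasts longest",
    "my phone screen cracked yesterday",
    "laptop battery life depends on screen brightness",
    "a cracked phone screen can be replaced cheaply",
]
np_parents = reply_to_nearest_post(thread)
rr_parents = reply_to_root(thread)
ds_parents = document_similarity(thread)
```

Each function returns a mapping from a post index to its predicted parent index, which can then be compared against the true reply links.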
Evaluation
Method  Slashdot                Apple
        All Posts  Good Posts   All Posts  Good Posts
NP      0.021      0.012        0.289      0.239
RR      0.183      0.319        0.269      0.474
DS      0.463      0.643        0.409      0.628
LDA     0.465      0.644        0.410      0.648
SWB     0.463      0.644        0.410      0.641
SMSS    0.524      0.737        0.517      0.772
Expert finding
• Reply reconstruction
• Network construction
• Expert finding
Methods
• HITS
• PageRank
• ...
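The last two steps can be sketched by building a reply network and ranking its nodes with PageRank. The edge direction (from the replier to the author being replied to), the damping factor of 0.85, the fixed iteration count, and the user names are all assumptions made for this illustration; the slides do not specify them.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical reconstructed replies: (replier, author being replied to).
replies = [
    ("bob", "alice"), ("carol", "alice"), ("dave", "alice"),
    ("alice", "bob"), ("dave", "bob"),
]
scores = pagerank(replies)
expert = max(scores, key=scores.get)
```

With these edges the user who receives the most replies, "alice", gets the highest score.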
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06
  - Achieves stable performance in the expert finding task using a language model
• PageRank
  - Benchmark nodal ranking method
• HITS
  - Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06
  - Finds the most influential node
Evaluation
• Bayesian estimate
Method         MRR    MAP    P@10
LM             0.821  0.698  0.800
EABIF(ori)     0.674  0.362  0.243
EABIF(rec)     0.742  0.318  0.281
PageRank(ori)  0.675  0.377  0.263
PageRank(rec)  0.743  0.321  0.266
HITS(ori)      0.906  0.832  0.900
HITS(rec)      0.938  0.822  0.906
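For readers unfamiliar with the three column headings, the sketch below computes the per-query quantities behind them, assuming the usual definitions: reciprocal rank, average precision, and precision at k (reading "P10" as precision at 10 is an assumption). MRR and MAP are the means of the first two over all queries. The ranked list and relevance set are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical ranking of candidate experts for one query.
ranked = ["u3", "u1", "u7", "u2", "u9"]
relevant = {"u1", "u2"}
rr = reciprocal_rank(ranked, relevant)      # first relevant item is at rank 2
ap = average_precision(ranked, relevant)    # (1/2 + 2/4) / 2
p5 = precision_at_k(ranked, relevant, k=5)  # 2 relevant items in the top 5
```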
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrix methods
  - Graphical models: good practice for learning probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features that describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
-
Variantional Inference
bull The difference between the lower bound and the likelihood is the KL divergence
bull Maximizing the lower bound L() with respect to and is equivalent to minimizing the KL divergence
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability
bull Graphical model is a good tool to analyze problems
bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages
bull It is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA ndash Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA ndash Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA ndash Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA ndash Latent Dirichilet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variantional Inference
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model)
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic amp structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
-
VBEM vs EM
bull Only different in the E-Step
bull In standard EM q(X) is directly set to p(X|Dθ) and let KL=0bull In VBEM it is intractable to compute p(X|Dθ) Instead it
approximates p(X|Dθ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D θ)
bull This is also equivalent to maximizing the lower bound L(θ)
Parameter Estimation
bull Given a corpus of documents we would like to find the parameters and which maximize the likelihood of the observed data
bull Strategy (Variational EM)
bull Lower bound log p(w|) by a function L()bull Repeat until convergence
ndash E Maximize L() with respect to the variational parameters ndash M Maximize the bound with respect to parameters and
Parameter Estimation
bull E-Step Variational Inference ndash repeat until convergence
bull M-Step Parameter estimation
β
α can be implemented using the Newton-Raphson method
Topic Examples in a 100-topic LDA Model)bull 16000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability
bull Graphical model is a good tool to analyze problems
bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages
bull It is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF ndash Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID ndash Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensenrsquos Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Parameter Estimation
• Given a corpus of documents, we would like to find the parameters $\alpha$ and $\beta$ that maximize the likelihood of the observed data.
• Strategy (variational EM):
  - Lower bound $\log p(\mathbf{w} \mid \alpha, \beta)$ by a function $L(\gamma, \phi; \alpha, \beta)$.
  - Repeat until convergence:
    - E-step: maximize $L$ with respect to the variational parameters $\gamma$ and $\phi$.
    - M-step: maximize the bound with respect to the parameters $\alpha$ and $\beta$.
Parameter Estimation
• E-step: variational inference, repeated until convergence
• M-step: parameter estimation
  - $\beta$
  - $\alpha$: the update can be implemented using the Newton-Raphson method
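The update equations for these two slides did not survive in the transcript. As a hedged reconstruction, the standard variational EM updates from Blei, Ng and Jordan (2003), which the deck cites, are sketched below. The indexing (document $d$, word position $n$, topic $i$, vocabulary term $j$) follows that paper and is an assumption about what the slide showed.

```latex
% Bound used by variational EM (q is the variational distribution):
\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
  + \mathrm{KL}\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)

% E-step, per document, iterated until convergence (\Psi is the digamma function):
\phi_{ni} \propto \beta_{i w_n} \exp\big(\Psi(\gamma_i)\big),
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

% M-step, closed form for \beta (w_{dn}^{j} = 1 if word n of document d is term j):
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}

% M-step for \alpha: no closed form; maximize L over \alpha with Newton-Raphson.
```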
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words
[Figure panels: (a) EARN vs. NOT EARN, (b) GRAIN vs. NOT GRAIN]
Problems in LDA
• The Dirichlet distribution helps to avoid over-fitting, but the assumption might be too strong
[Figure: three graphical-model diagrams over topic variables z1 to z4 and words w1 to w4; one diagram has an additional node b]
A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information
[Figure: plate diagram; surviving labels: M, N, d, π, z, x, θ, β]
Codebook
• 174 local image patches
• Detection
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
• Representation
  - Normalized 11x11 gray values
  - 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction
[Figure: topic projection from term space to topic space]
        Term1  Term2  Term3  Term4  …  TermN
  Img1  1      2      0      0      …  2         (Dim = 1 million)
        f1     f2     …      fM
  Img1  0.2    0.1    …      0.03                (Dim = 200)
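The long query problem above can be illustrated with a toy inverted index. This is a minimal sketch, not the system from the talk: the three documents, their terms, and the helper names (`build_inverted_index`, `boolean_and`, `boolean_or`) are made up for illustration. It shows how a query with many terms touches one inverted list per term, how the intersection quickly becomes empty, and how the union grows to the whole corpus.

```python
from functools import reduce


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in set(terms):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, query_terms):
    """Documents containing every query term (intersection of inverted lists)."""
    lists = [index.get(t, set()) for t in query_terms]
    return reduce(set.intersection, lists) if lists else set()


def boolean_or(index, query_terms):
    """Documents containing at least one query term (union of inverted lists)."""
    lists = [index.get(t, set()) for t in query_terms]
    return set().union(*lists)


# Hypothetical toy corpus: three "images" described by visual words.
docs = {
    "d1": ["sky", "sea", "sand"],
    "d2": ["sky", "tree", "road"],
    "d3": ["sea", "boat", "road"],
}
index = build_inverted_index(docs)

long_query = ["sky", "sea", "sand", "tree", "road", "boat"]
and_hits = boolean_and(index, long_query)  # one list per term must be read
or_hits = boolean_or(index, long_query)

print("lists accessed:", len(long_query))
print("AND:", sorted(and_hits))
print("OR:", sorted(or_hits))
```

With only six query terms the intersection is already empty and the union covers every document, which is the motivation for projecting to a low-dimensional topic space first.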
Key Idea: Dimension Reduction + Residual Error Preservation
• $p$: original TF-IDF vector in vocabulary space
• $X$: projection matrix for dimension reduction
• $w$: low-dimensional feature vector
• $\epsilon$: residual error
$$p = Xw + \epsilon$$
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
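Orthogonal Decomposition
$$p = Xw + \epsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
• Base vectors: the columns $X_1, X_2, X_3, \ldots, X_k$ of $X$
• Low-dimensional representation: $w$
• Residual: $\epsilon$
[Figure: an image = a low-dimensional histogram + a few words (10 words)]
$$p^T q = w_p^T w_q + \epsilon_p^T \epsilon_q$$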
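The decomposition $p = Xw + \epsilon$ can be checked numerically. The sketch below assumes the base vectors are orthonormal and that $w$ is obtained by projecting $p$ onto them, so the residual is orthogonal to every base vector; under those assumptions the inner product of two vectors splits exactly into a low-dimensional part and a residual part. The toy vocabulary of four words, the vectors, and the helper names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps.

    basis is a list of k orthonormal vectors (the columns of X).
    """
    w = [dot(p, x) for x in basis]                      # w = X^T p
    projection = [sum(wj * x[i] for wj, x in zip(w, basis))
                  for i in range(len(p))]               # X w
    eps = [pi - qi for pi, qi in zip(p, projection)]    # eps = p - X w
    return w, eps


def keep_largest(eps, n):
    """Keep only the n largest-magnitude residual entries (the 'few words')."""
    top = sorted(range(len(eps)), key=lambda i: abs(eps[i]), reverse=True)[:n]
    return {i: eps[i] for i in top}


# Toy example: vocabulary of 4 words, k = 2 orthonormal base vectors.
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.6, 0.8, 0.0]]
p = [2.0, 1.0, 0.0, 3.0]
q = [1.0, 0.0, 2.0, 1.0]

w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)

exact = dot(p, q)
split = dot(w_p, w_q) + dot(eps_p, eps_q)
print(exact, split)
print(keep_largest(eps_p, 1))
```

Keeping only the largest residual entries, as `keep_largest` does, turns the exact identity into an approximation, which matches the idea of storing a low-dimensional vector plus a few words per image.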
A Probabilistic Implementation
$x$ is a switch variable. It controls whether a word is generated from
• a topic-specific distribution,
• a document-specific distribution, or
• a background distribution.
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
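The three-way mixture can be evaluated directly once the component distributions are known. The sketch below only computes the word probability for given parameters; it does not learn them. All numbers and names (`word_probability`, the two toy topics, the switch probabilities) are invented for illustration.

```python
def word_probability(word, p_switch, p_topic_given_doc, p_word_given_topic,
                     p_word_doc_specific, p_word_background):
    """Mixture of topic, document-specific, and background word distributions.

    p_switch[x] is p(x | d) for x = 0 (topics), 1 (document-specific),
    2 (background).
    """
    topic_part = sum(
        p_word_given_topic[k].get(word, 0.0) * p_topic_given_doc[k]
        for k in range(len(p_topic_given_doc))
    )
    return (p_switch[0] * topic_part
            + p_switch[1] * p_word_doc_specific.get(word, 0.0)
            + p_switch[2] * p_word_background.get(word, 0.0))


# Hypothetical parameters for one document with K = 2 topics.
vocabulary = ["sky", "sea", "road", "boat"]
p_switch = [0.6, 0.3, 0.1]
p_topic_given_doc = [0.5, 0.5]
p_word_given_topic = [{"sky": 0.5, "sea": 0.5}, {"road": 1.0}]
p_word_doc_specific = {"boat": 1.0}
p_word_background = {w: 0.25 for w in vocabulary}

probs = {
    w: word_probability(w, p_switch, p_topic_given_doc, p_word_given_topic,
                        p_word_doc_specific, p_word_background)
    for w in vocabulary
}
print(probs)
```

Because each component is a proper distribution and the switch probabilities sum to one, the resulting word probabilities also sum to one. Words explained mostly by the document-specific component ("boat" here) are the natural candidates for the "few words" kept as residual.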
Search (Online)
[Figure: online search pipeline. Labels: a query (a low-dimensional histogram + a few words); LSH Index with entries DS1, DS2, …; candidates Doc 300, Doc 401, …; Re-ranking; Doc Meta for Doc 1, Doc 2, …, Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
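The two-stage search in the figure (LSH lookup, then re-ranking) can be sketched as follows. The slides do not say which LSH family is used, so random-hyperplane hashing is an assumption here, as are the scoring rule (topic inner product plus overlap of the stored residual words), the toy documents, and all function names.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed LSH family: sign of a projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(w, planes):
    """Hash a low-dimensional vector to a bit tuple, one bit per hyperplane."""
    return tuple(int(sum(a * b for a, b in zip(w, h)) >= 0.0) for h in planes)


def build_index(docs, planes):
    """docs maps doc id -> (w, residual words); bucket doc ids by the hash of w."""
    buckets = {}
    for doc_id, (w, _residual) in docs.items():
        buckets.setdefault(lsh_key(w, planes), []).append(doc_id)
    return buckets


def score(query, doc):
    """Topic inner product plus overlap of the sparse residual words."""
    (w_q, eps_q), (w_d, eps_d) = query, doc
    topic = sum(a * b for a, b in zip(w_q, w_d))
    words = sum(v * eps_d[t] for t, v in eps_q.items() if t in eps_d)
    return topic + words


def search(query, docs, buckets, planes, top_n=2):
    """Look up the query's bucket, then re-rank the candidates by score."""
    candidates = buckets.get(lsh_key(query[0], planes), [])
    ranked = sorted(candidates, key=lambda d: score(query, docs[d]), reverse=True)
    return ranked[:top_n]


# Hypothetical documents: (topic vector w, a few residual words).
docs = {
    "doc1": ([0.9, 0.1, 0.0], {"beach": 0.5}),
    "doc300": ([0.1, 0.8, 0.1], {"tower": 0.7}),
    "doc401": ([0.1, 0.7, 0.2], {"bridge": 0.6}),
}
planes = make_hyperplanes(dim=3, n_bits=4)
buckets = build_index(docs, planes)

query = ([0.1, 0.8, 0.1], {"tower": 0.4})
results = search(query, docs, buckets, planes)
print(results)
```

A production index would use several hash tables to raise recall; a single table is used here to keep the sketch short.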
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to topic and junk topic space
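The three simplest baselines can be written in a few lines. This is a sketch under assumptions: posts are given in chronological order, "nearest post" is read as the immediately preceding post, and document similarity is cosine similarity over raw term counts. The four-post thread and its ground-truth reply structure are invented, so the printed numbers say nothing about the real datasets.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def predict_parents(posts, method):
    """Predict, for every post after the root (index 0), which post it replies to."""
    bags = [Counter(text.lower().split()) for text in posts]
    parents = {}
    for i in range(1, len(posts)):
        if method == "NP":      # reply to nearest (previous) post
            parents[i] = i - 1
        elif method == "RR":    # reply to root
            parents[i] = 0
        elif method == "DS":    # reply to the most similar earlier post
            parents[i] = max(range(i), key=lambda j: cosine(bags[i], bags[j]))
        else:
            raise ValueError("unknown method: " + method)
    return parents


def accuracy(predicted, truth):
    """Fraction of posts whose predicted parent matches the true parent."""
    return sum(predicted[i] == truth[i] for i in truth) / len(truth)


# Hypothetical thread and its true reply structure.
posts = [
    "which laptop battery lasts longest",
    "the new model battery lasts ten hours",
    "which laptop has the best screen",
    "ten hours only with the screen dimmed",
]
truth = {1: 0, 2: 0, 3: 1}

for method in ("NP", "RR", "DS"):
    print(method, accuracy(predict_parents(posts, method), truth))
```

The LDA and SWB baselines follow the same pattern as DS, with the term-count vectors replaced by each post's representation in topic space.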
Evaluation
Method   Slashdot                 Apple
         All Posts   Good Posts   All Posts   Good Posts
NP       0.021       0.012        0.289       0.239
RR       0.183       0.319        0.269       0.474
DS       0.463       0.643        0.409       0.628
LDA      0.465       0.644        0.410       0.648
SWB      0.463       0.644        0.410       0.641
SMSS     0.524       0.737        0.517       0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
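Once the reply structure is reconstructed, a user network can be built and ranked. The sketch below runs HITS on a small hypothetical network. The edge direction is an assumption: an edge `(asker, replier)` points from the author of a post to the user who replied to it, so users who answer many people accumulate authority. The user names and edges are made up.

```python
import math


def hits(edges, iterations=50):
    """Run HITS on a directed graph given as a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of nodes pointing to this node.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: sum of authority scores of the nodes this node points to.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Hypothetical network: (asker, replier) pairs from reconstructed replies.
edges = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("dave", "carol"),
]
hub_scores, authority_scores = hits(edges)
experts = sorted(authority_scores, key=authority_scores.get, reverse=True)
print(experts)
```

PageRank could be swapped in at the same point of the pipeline; only the ranking function changes, not the network construction.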
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: benchmark nodal ranking method
• HITS: finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
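For readers unfamiliar with the three column headings, the sketch below shows the usual definitions of reciprocal rank, average precision, and precision at k for a single ranked list; MRR and MAP are the means of the first two over all queries. It assumes the table's P@10 is precision at rank 10. The ranked list and relevance judgments are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k items that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k


# Hypothetical ranking of users for one query and the true experts.
ranked = ["u3", "u1", "u7", "u2"]
relevant = {"u1", "u2"}

rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
print(rr, ap, p_at_2)
```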
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice for learning matrices
  - Graphical model: a good practice for learning probability
• The graphical model is a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• The graphical model is more adaptable to various applications than matrix decomposition
Classification (50-topic LDA + SVM)
bull Reuters-21578 dataset ndash contains 8000 documents and 15818 words
(a) EARN vs NOT EARN (b) GRAIN vs NOT GRAIN
Problems in LDA
bull Dirichlet Distribution is helpful to avoid over-fitting But the assumption might be too strong
z4z3z2z1
w4w3w2w1
b
z4z3z2z1
w4w3w2w1
z4z3z2z1
w4w3w2w1
A Bayesian Hierarchical Model for Learning Natural Scene Categories
bull Incorporating category information
MNd
π
z
x
θ
β
Codebookbull 174 Local Image Patches
bull DetectionEvenly Sampled GridRandom SamplingSaliency DetectorLowersquos DoG Detector
bull RepresentationNormalized 11x11 gray values128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
bull Dynamic topic models ICML 2006bull Correlated Topic Model NIPS 2005bull Hierarchical Dirichlet Process Journal of the American
Statistical Association 2003bull Nonparametric Bayes pachinko allocation UAI 2007bull Supervised LDA NIPS 2007bull MedLDA ndash Maximum Margin Discrimant LDA ICML 2009bull hellip
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, 2003
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for “long query” retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for “Long Query” Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction

Vocabulary space (Dim = 1 million):
        Term1  Term2  Term3  Term4  …  TermN
Img1    1      2      0      0      …  2

Topic Projection

Topic space (Dim = 200):
        f1    f2    …  fM
Img1    0.2   0.1   …  0.03
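The two failure modes of a long keyword query can be seen on a toy inverted index. The sketch below is only an illustration: the corpus, the term names, and the helper functions `build_inverted_index`, `query_intersection`, and `query_union` are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def query_intersection(index, query_terms):
    """Documents that contain every query term (AND semantics)."""
    postings = [index.get(t, set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()


def query_union(index, query_terms):
    """Documents that contain at least one query term (OR semantics)."""
    result = set()
    for t in query_terms:
        result |= index.get(t, set())
    return result


# Toy corpus (hypothetical data): three images described by a few terms each.
docs = {
    "img1": {"sky", "sea", "boat"},
    "img2": {"sky", "tree", "road"},
    "img3": {"sea", "sand", "tree"},
}
index = build_inverted_index(docs)
long_query = ["sky", "sea", "tree", "road", "sand"]

# One inverted list is read per query term.
print(query_intersection(index, long_query))  # no image contains all five terms
print(query_union(index, long_query))         # every image contains at least one
```

Even with five terms the intersection is already empty while the union covers the whole toy corpus, which is the motivation for searching in a low-dimensional topic space instead.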
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
$p = Xw + \varepsilon$
An image = [Figure: bar chart of the low-dimensional features] + a few words (10 words)
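Orthogonal Decomposition
$p = Xw + \varepsilon$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_W \end{pmatrix}$$
• X: base vectors (columns $X_1, X_2, X_3, \ldots, X_k$)
• w: low-dimensional representation
• ε: residual
An image = [Figure: bar chart of the low-dimensional features] + a few words (10 words)
For two documents p and q:
$p^{T} q = w_p^{T} w_q + \varepsilon_p^{T} \varepsilon_q$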
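The inner-product identity for a decomposition of the form p = Xw + ε can be checked numerically. This is a minimal sketch under two stated assumptions that the slide does not spell out: the columns of X are orthonormal, and w is obtained by projecting p onto them, so the residual is orthogonal to every base vector. The vectors and the helper names `dot` and `decompose` are made up for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and a residual eps.

    `basis` is a list of k orthonormal vectors (the columns of X).
    Returns (w, eps) with p = X w + eps and eps orthogonal to each base vector.
    """
    w = [dot(p, x) for x in basis]
    projection = [
        sum(w_j * x[i] for w_j, x in zip(w, basis)) for i in range(len(p))
    ]
    eps = [p_i - r_i for p_i, r_i in zip(p, projection)]
    return w, eps


# Two orthonormal base vectors in a 4-dimensional toy "vocabulary" space.
X = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
p = [1.0, 2.0, 0.0, 3.0]
q = [2.0, 0.0, 1.0, 1.0]

w_p, eps_p = decompose(p, X)
w_q, eps_q = decompose(q, X)

exact = dot(p, q)                                # similarity in the original space
from_parts = dot(w_p, w_q) + dot(eps_p, eps_q)   # low-dimensional part + residual part
print(exact, from_parts)
```

Because the cross terms vanish under these assumptions, the similarity in the original space is recovered exactly from the low-dimensional coordinates plus the residuals, which is why keeping a sparse residual ("a few words") alongside the topic features loses no information about the inner product.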
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
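The switch-variable mixture can be evaluated directly once the component distributions are given. The sketch below only computes p(w|d) for fixed parameters; it does not perform inference or learning. All numbers, the three-word vocabulary, and the function name `word_probability` are hypothetical.

```python
def word_probability(word, switch, topic_mix, topic_word, doc_word, background):
    """p(w|d) as a three-way mixture selected by the switch variable x.

    switch     : [p(x=0|d), p(x=1|d), p(x=2|d)], sums to 1
    topic_mix  : p(z=k|d) for each of the K topics
    topic_word : list of K dicts, topic_word[k][w] = p(w|z=k)
    doc_word   : dict, document-specific distribution over words
    background : dict, corpus-wide background distribution over words
    """
    from_topics = sum(
        p_z * topic_word[k].get(word, 0.0) for k, p_z in enumerate(topic_mix)
    )
    return (
        switch[0] * from_topics
        + switch[1] * doc_word.get(word, 0.0)
        + switch[2] * background.get(word, 0.0)
    )


# Toy parameters over a three-word vocabulary (hypothetical values).
vocab = ["beach", "the", "zhang"]
topic_word = [
    {"beach": 0.8, "the": 0.1, "zhang": 0.1},
    {"beach": 0.2, "the": 0.6, "zhang": 0.2},
]
topic_mix = [0.5, 0.5]
doc_word = {"beach": 0.0, "the": 0.0, "zhang": 1.0}
background = {"beach": 0.1, "the": 0.8, "zhang": 0.1}
switch = [0.6, 0.1, 0.3]

probs = {
    w: word_probability(w, switch, topic_mix, topic_word, doc_word, background)
    for w in vocab
}
print(probs)
```

Since each component is a proper distribution and the switch probabilities sum to one, the mixture is again a distribution over the vocabulary. In this toy setup the document-specific component is what lifts the probability of the rare word "zhang".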
Search (Online)
[Figure: online search flow. A query = low-dimensional features + a few words; the LSH index returns candidate documents (Doc 300, Doc 401, …), which are then re-ranked using per-document data (Doc Meta: DS1, DS2, …)]
• Index: 10M images, 46GB
• Search speed: < 100 ms
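A two-stage search of this kind, an LSH lookup followed by re-ranking, can be sketched as follows. The slide does not say which LSH family or re-ranking score is used, so the random-hyperplane hashing and the cosine re-ranking below are assumptions chosen for illustration, as are the document ids, the vectors, and all function names.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Random Gaussian hyperplanes; one hash bit per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vec, hyperplanes):
    """Sign pattern of vec against each hyperplane, used as the bucket key."""
    return tuple(
        sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0 for h in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group document ids by their LSH bucket key."""
    buckets = {}
    for doc_id, vec in vectors.items():
        buckets.setdefault(lsh_key(vec, hyperplanes), []).append(doc_id)
    return buckets


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def search(query, vectors, buckets, hyperplanes, top_k=3):
    """Stage 1: fetch the query's bucket. Stage 2: re-rank candidates exactly."""
    candidates = buckets.get(lsh_key(query, hyperplanes), [])
    ranked = sorted(candidates, key=lambda d: cosine(query, vectors[d]), reverse=True)
    return ranked[:top_k]


# Toy low-dimensional feature vectors (hypothetical values).
vectors = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc300": [0.2, 0.1, 0.7],
    "doc401": [0.25, 0.05, 0.7],
}
hyperplanes = make_hyperplanes(num_bits=4, dim=3)
buckets = build_index(vectors, hyperplanes)

query = [0.2, 0.1, 0.7]
print(search(query, vectors, buckets, hyperplanes))
```

Only the documents that share the query's bucket are scored exactly, which is what keeps the online cost low. A production index would typically use several hash tables to reduce missed neighbours, and the re-ranking stage is where the residual words could be taken into account.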
Search Example
[Figure: query image]
Search Example (2)
[Figure: query image]
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: Topics
• Structure: Who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document Similarity
• Topic Similarity
• Structure Similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Project documents to topic space
• SWB: Special Words Topic Model with Background distribution. Project documents to topic and junk-topic space
Evaluation

Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
• Reply reconstruction → Network construction → Expert finding
• Methods
– HITS
– PageRank
– …
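Once a reply network has been constructed, a link-analysis method such as HITS can rank its users. The sketch below runs HITS by power iteration on a tiny graph. The user names, the edge convention (an edge points from the replier to the user being replied to), and the reading of a high authority score as a sign of expertise are assumptions made for illustration.

```python
import math


def hits(edges, iterations=50):
    """Hub and authority scores by power iteration.

    edges: list of (src, dst) pairs, read here as "src replied to dst".
    Returns (hub, authority) dicts with L2-normalised scores.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalise(scores):
        norm = math.sqrt(sum(v * v for v in scores.values()))
        return {n: (v / norm if norm else 0.0) for n, v in scores.items()}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes pointing at you.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = normalise(auth)
        # Hub update: sum of authority scores of the nodes you point at.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = normalise(hub)
    return hub, auth


# Toy reply network (hypothetical users): three users reply to dana,
# and dana replies once to alice.
edges = [
    ("alice", "dana"),
    ("bob", "dana"),
    ("carol", "dana"),
    ("dana", "alice"),
]
hub, auth = hits(edges)
ranking = sorted(auth, key=auth.get, reverse=True)
print(ranking)
```

The quality of such a ranking depends on the edges, which is why a more accurately reconstructed reply structure can change the expert-finding results.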
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR ’06
– Achieves stable performance in expert finding task using a language model
• PageRank
– Benchmark nodal ranking method
• HITS
– Find hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR ’06
– Find most influential node
Evaluation
• Bayesian estimate

Method          MRR     MAP     P@10
LM              0.821   0.698   0.800
EABIF(ori)      0.674   0.362   0.243
EABIF(rec)      0.742   0.318   0.281
PageRank(ori)   0.675   0.377   0.263
PageRank(rec)   0.743   0.321   0.266
HITS(ori)       0.906   0.832   0.900
HITS(rec)       0.938   0.822   0.906
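For readers unfamiliar with the three column headers, the sketch below computes MRR, MAP, and precision at k from ranked lists using their common textbook definitions. The slides do not state the exact variants used in the experiments, so the normalisation of average precision by the number of relevant items is an assumption, and the rankings and user ids are made up.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values taken at each relevant item's rank."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


# Two toy queries: (ranked list of candidate experts, set of true experts).
runs = [
    (["u3", "u1", "u2"], {"u1"}),
    (["u5", "u4", "u6"], {"u5", "u6"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
mean_ap = sum(average_precision(r, rel) for r, rel in runs) / len(runs)
p_at_3 = sum(precision_at_k(r, rel, 3) for r, rel in runs) / len(runs)
print(mrr, mean_ap, p_at_3)
```

MRR rewards placing the first correct expert early, MAP rewards placing all correct experts early, and P@10 (P@3 in this toy run) looks only at the top of the list.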
Summary
• Matrix and probability are fundamental mathematics in information retrieval and computer vision
– Matrix decomposition – a good practice to learn matrix
– Graphical model – a good practice to learn probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• A graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition - Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic & Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American Statistical Association, 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA - Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University of Cambridge, 1998
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005
Outline
• Matrix Decomposition
  - PCA, SVD, NMF
  - LDA, ICA, Sparse Coding, etc.
• Graphical Model
  - Basic concepts in probabilistic machine learning
  - EM
  - pLSA
  - LDA
• Two Applications
  - Document decomposition for "long query" retrieval, ICCV 2009
  - Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords
  - Need to access 1000 inverted lists
  - The intersection of 1000 inverted lists may be empty
  - The union of 1000 inverted lists may be the whole corpus
• Dimension reduction by topic projection
  - Term space (Dim = 1 million): Term1 Term2 Term3 Term4 … TermN
    Img1: 1 2 0 0 … 2
  - Topic space (Dim = 200): f1 f2 … fM
    Img1: 0.2 0.1 … 0.03
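A toy inverted index makes the problem concrete: even with a four-term query, the intersection of the posting lists is already empty while their union covers every document. The corpus, the query and the helper name `build_inverted_index` are made up for this illustration.

```python
from functools import reduce

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical four-document corpus.
docs = {
    1: "red car on road",
    2: "blue car near sea",
    3: "red flower in garden",
    4: "blue sea and sky",
}
index = build_inverted_index(docs)

query = "red car blue sea".split()
lists = [index.get(term, set()) for term in query]  # one inverted list per keyword
intersection = reduce(set.intersection, lists)       # documents matching ALL keywords
union = reduce(set.union, lists)                     # documents matching ANY keyword

print(len(lists), intersection, union)
```

With 1000 keywords both effects get worse, which is the motivation for projecting documents and queries into a low-dimensional topic space instead.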
Key Idea: Dimension Reduction + Residual Error Preservation
• $p$: original TF-IDF vector in vocabulary space
• $X$: projection matrix for dimension reduction
• $w$: low-dimensional feature vector
• $\epsilon$: residual error

$p = Xw + \epsilon$

[Figure: an image = low-dimensional feature vector (bar chart) + a few words (10 words)]
Orthogonal Decomposition
$p = Xw + \epsilon$, with $X = (x_1, x_2, x_3, \ldots, x_k)$
• $X$: base vectors
• $w$: low-dimensional representation
• $\epsilon$: residual

[Figure: an image = low-dimensional feature vector (bar chart) + a few words (10 words)]

$\langle p, q \rangle = w_p^T w_q + \epsilon_p^T \epsilon_q$
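The reason the residual is worth keeping can be checked numerically. Assuming the base vectors are orthonormal and the coordinates are obtained by projection (w = X^T p, so the residual p - Xw is orthogonal to every base vector), the inner product of two documents splits exactly into a low-dimensional part and a residual part. The vectors and the helper names below are invented for the illustration; the slides do not state how X is actually estimated.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(p, basis):
    """Split p into coordinates w on an orthonormal basis and the residual eps = p - Xw."""
    w = [dot(p, x) for x in basis]
    projection = [sum(wi * x[j] for wi, x in zip(w, basis)) for j in range(len(p))]
    eps = [pj - rj for pj, rj in zip(p, projection)]
    return w, eps

# Hypothetical orthonormal basis with k = 2 base vectors in a 4-term vocabulary.
s = 1 / math.sqrt(2)
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

p = [3.0, 1.0, 2.0, 1.0]
q = [1.0, 2.0, 0.0, 4.0]
w_p, eps_p = decompose(p, basis)
w_q, eps_q = decompose(q, basis)

lhs = dot(p, q)                             # similarity in the full vocabulary space
rhs = dot(w_p, w_q) + dot(eps_p, eps_q)     # low-dimensional part + residual part
print(lhs, rhs)
```

If the residual is sparse (a few words per document), the second term is cheap to store and evaluate, while the first term works in the low-dimensional space.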
A Probabilistic Implementation
$x$ is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution

$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$

C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006
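The word probability under this kind of switch-variable mixture is a weighted sum of three components: a topic mixture (x = 0), a document-specific distribution (x = 1) and a background distribution (x = 2). The sketch below only evaluates that sum for a hand-made model; the dictionary layout, the function name `word_prob` and all probabilities are assumptions, and no parameter estimation is shown.

```python
def word_prob(word, doc, model):
    """p(w|d) as a mixture of topic, document-specific and background components."""
    switch = model["switch"][doc]  # [p(x=0|d), p(x=1|d), p(x=2|d)]
    topic_part = sum(
        topic.get(word, 0.0) * model["doc_topic"][doc][k]  # p(w|z=k) * p(z=k|d)
        for k, topic in enumerate(model["topic_word"])
    )
    doc_part = model["doc_word"][doc].get(word, 0.0)        # document-specific words
    background_part = model["background"].get(word, 0.0)    # corpus-wide background words
    return switch[0] * topic_part + switch[1] * doc_part + switch[2] * background_part

# Hypothetical toy model with K = 2 topics and one document "d0".
model = {
    "topic_word": [{"cat": 0.7, "dog": 0.3}, {"cat": 0.2, "dog": 0.8}],
    "doc_topic": {"d0": [0.9, 0.1]},
    "doc_word": {"d0": {"felix": 1.0}},
    "background": {"the": 1.0},
    "switch": {"d0": [0.6, 0.1, 0.3]},
}
vocab = ["cat", "dog", "felix", "the"]
probs = {w: word_prob(w, "d0", model) for w in vocab}
print(probs)
```

Because each component is a proper distribution and the switch weights sum to one, the resulting probabilities over the vocabulary also sum to one.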
Search (Online)
[Figure: a query (low-dimensional feature vector + a few words) is looked up in the LSH index to retrieve candidate documents (Doc 300, Doc 401, …); the candidates are then re-ranked using the per-document metadata (DS1, DS2, …) stored for Doc 1 … Doc N]
• Index: 10M images, 46GB
• Search speed: < 100 ms
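The two-stage search can be sketched as follows: hash the low-dimensional vectors into buckets, fetch the query's bucket as the candidate set, then re-rank the candidates with the few words kept per document. Everything specific here is an assumption made for illustration: the random-hyperplane hash family, the single hash table, the word-overlap re-ranking score and all data. The slides only state that an LSH index is followed by re-ranking.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_signature(dim, bits, seed=0):
    """Random-hyperplane LSH: one sign bit per random hyperplane."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
    return lambda v: tuple(dot(v, plane) >= 0 for plane in planes)

def build_index(topic_vectors, signature):
    """Group document ids by the hash signature of their low-dimensional vector."""
    buckets = {}
    for doc_id, vec in topic_vectors.items():
        buckets.setdefault(signature(vec), []).append(doc_id)
    return buckets

def search(query_vec, query_words, buckets, signature, doc_words, top=10):
    """Fetch the query's bucket, then re-rank candidates by shared specific words."""
    candidates = buckets.get(signature(query_vec), [])
    ranked = sorted(candidates, key=lambda d: len(query_words & doc_words[d]), reverse=True)
    return ranked[:top]

# Hypothetical documents: low-dimensional vectors plus a few specific words each.
topic_vectors = {1: [0.9, 0.05, 0.05], 300: [0.2, 0.1, 0.7], 401: [0.2, 0.1, 0.7]}
doc_words = {1: {"stock", "market"}, 300: {"beach", "sunset"}, 401: {"beach", "palm", "surf"}}

signature = make_signature(dim=3, bits=4)
buckets = build_index(topic_vectors, signature)
results = search([0.2, 0.1, 0.7], {"palm", "surf", "beach"}, buckets, signature, doc_words)
print(results)
```

A production index would typically use several hash tables to raise recall; a single table keeps the sketch short.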
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: topics
• Structure: who replies to whom
Optimize them together
• Model semantics
• Model structure
Reply reconstruction
Are you really into Graphical Models
bull Describing Visual Scenes using Transformed Dirichlet Processes E Sudderth A Torralba W Freeman and A Willsky NIPS Dec 2005
Reference
bull David M Blei Andrew Y Ng Michael I Jordan Latent Dirichlet Allocation Journal of Machine Learning Research (JMLR) 2003
bull Matthew J Beal Variational Algorithms for Approximate Bayesian Inference PhD Thesis University of Cambridge 1998
bull L Fei-Fei and P Perona A Bayesian Hierarchical Model for Learning Natural Scene Categories CVPR 2005
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
bull Graphical Modelndash Basic concepts in probabilistic machine learningndash EMndash pLSAndash LDA
bull Two Applicationsndash Document decomposition for ldquolong queryrdquo retrieval ICCV 2009ndash Modeling Threaded Discussions SIGIR 2009
Large-Scale Indexing for ldquoLong Queryrdquo Retrieval (Similarity Search)
Xiao Zhang Zhiwei Li Lei Zhang Wei-Ying Ma and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantic: Topics
• Structure: Who replies to whom
Optimize them together
• Model semantic
• Model structure
Reply reconstruction
• Document Similarity
• Topic Similarity
• Structure Similarity
Baselines
• NP: Reply to Nearest Post
• RR: Reply to Root
• DS: Document Similarity
• LDA: Latent Dirichlet Allocation. Projects documents to the topic space
• SWB: Special Words Topic Model with Background distribution. Projects documents to the topic and junk-topic space
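The three simplest baselines can be sketched as follows. Posts are assumed to be listed in chronological order with post 0 as the thread root, and each function predicts the index of the parent of every later post. The bag-of-words cosine used for DS is an assumption, since the slides do not say which similarity is used; the thread and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_post(n_posts):
    """NP: every post replies to the post immediately before it."""
    return [i - 1 for i in range(1, n_posts)]


def reply_to_root(n_posts):
    """RR: every post replies to the root post."""
    return [0 for _ in range(1, n_posts)]


def document_similarity(posts):
    """DS: every post replies to the most similar earlier post."""
    bags = [Counter(text.lower().split()) for text in posts]
    parents = []
    for i in range(1, len(bags)):
        parents.append(max(range(i), key=lambda j: cosine(bags[i], bags[j])))
    return parents


# Hypothetical four-post thread.
posts = [
    "how to install python on mac",
    "use homebrew to install python",
    "thanks homebrew worked",
    "which python version",
]
np_parents = nearest_post(len(posts))
rr_parents = reply_to_root(len(posts))
ds_parents = document_similarity(posts)
```

The LDA and SWB baselines follow the same pattern as DS, with the similarity computed in topic space instead of word space.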
Evaluation
Method   Slashdot                  Apple
         All Posts   Good Posts    All Posts   Good Posts
NP       0.021       0.012         0.289       0.239
RR       0.183       0.319         0.269       0.474
DS       0.463       0.643         0.409       0.628
LDA      0.465       0.644         0.410       0.648
SWB      0.463       0.644         0.410       0.641
SMSS     0.524       0.737         0.517       0.772
Expert finding
Reply reconstruction → Network construction → Expert finding
Methods: HITS, PageRank, …
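A minimal sketch of the network-construction and ranking steps, using HITS as the ranking method. It assumes the network has one directed edge from the replier to the author being replied to, which the slides do not spell out. The user names, edge list and function name `hits` are hypothetical.

```python
def hits(edges, iterations=50):
    """Power-iteration HITS on a directed edge list; returns (hub, authority)."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of users who replied to this user.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of users this user replied to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        # L2-normalise both score vectors so they stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return hub, auth


# Hypothetical reply pairs (replier, author replied to) from reply reconstruction.
edges = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("alice", "bob"),
]
hub, auth = hits(edges)
experts = sorted(auth, key=auth.get, reverse=True)
```

Users are ranked by authority score here; PageRank could be substituted on the same edge list.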
Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert finding task using a language model
• PageRank: Benchmark nodal ranking method
• HITS: Finds hub nodes and authority nodes
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential node
Evaluation
• Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF(ori)       0.674   0.362   0.243
EABIF(rec)       0.742   0.318   0.281
PageRank(ori)    0.675   0.377   0.263
PageRank(rec)    0.743   0.321   0.266
HITS(ori)        0.906   0.832   0.900
HITS(rec)        0.938   0.822   0.906
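For readers unfamiliar with the three columns, the sketch below computes MRR, MAP and precision at k using their standard definitions. The slides do not describe the ground truth or the queries, so the rankings and expert sets here are hypothetical toy data.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical ranked user lists for two queries, with hypothetical expert sets.
runs = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["x", "y", "z", "w"], {"x"}),
]
mrr = mean(reciprocal_rank(r, rel) for r, rel in runs)
map_score = mean(average_precision(r, rel) for r, rel in runs)
p_at_2 = mean(precision_at_k(r, rel, k=2) for r, rel in runs)
```

MRR and MAP are the per-query values averaged over all queries; P@10 is the same as `precision_at_k` with k = 10.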
Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision
  - Matrix decomposition: a good practice for learning matrices
  - Graphical model: a good practice for learning probability
• Graphical model is a good tool to analyze problems
• The essence of decomposition is to discover a set of mid-level features to describe original documents/images
• Graphical model is more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition - Eigenvalue & Eigenvector
- Definition - Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic & Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
The Long Query Problem
bull If a query contains 1000 keywordsndash Need to access 1000 inverted listsndash The intersection of 1000 inverted lists may be emptyndash The union of 1000 inverted list may be the whole corpus
bull Dimension reduction
Term1 Term2 Term3 Term4 hellip TermN
Img1 1 2 0 0 hellip 2
f1 f2 hellip fM
Img1 02 01 hellip 003
Topic Projection
Dim = 1 million
Dim = 200
Key Idea Dimension Reduction + Residual Error Preservation
bull p original TF-IDF vector in vocabulary spacebull X projection matrix for dimension reductionbull w low dimensional feature vectorbull residual error
p Xw
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
Orthogonal Decomposition
p Xw 1 11 1 1
2 21 2 1 2
1 1
1
k
k
W k W
W W Wk W
p x x
p x x w
p w
p x x
Base vector
Low dimensional representation
Residual
An image = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words(10 words)
T Tp q p q
p q
w w
X1 X2 X3 hellip Xk
A Probabilistic Implementation
x is a switch variable It controls a word generated from
bull a topic specific distribution
bull a document specific distribution
bull a background distribution
( | )p w d 1
( 0 | ) ( | ) ( | )K
kp x d p w z k p z k d
( 1| ) ( | )p x d p w d ( 2 | ) ( | )p x d p w
C Chemudugunta etal Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model NIPS 2006
Search (Online)
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
DS1 DS2 hellip
hellip
hellip
hellip
hellip
hellip
LSH Index
Doc 300 Doc 401 hellip
A query = 0 01 02 03 04 05 06 07 08 09 1
0
2
4
6
8
10
12
14
16
+ a few words
Re-rankingDoc 401 hellip
Doc 1
Doc 2
Doc 300
Doc 401
Doc N
Doc 300
Index 10M Images 46GBSearch Speed lt 100ms
Doc Meta
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
• Matrices and probability are the fundamental mathematics of information retrieval and computer vision
  - Matrix decomposition: good practice for learning matrices
  - Graphical models: good practice for learning probability
• Graphical models are a good tool for analyzing problems
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images
• Graphical models are more adaptable to various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principal Component Analysis
- Definition - Eigenvalue & Eigenvector
- Definition - Principal Component Analysis
- Principal Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition - Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Key Idea: Dimension Reduction + Residual Error Preservation
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low dimensional feature vector
• ε: residual error
$$p = Xw + \epsilon$$
An image = [figure] + a few words (10 words)
Orthogonal Decomposition
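$$p = Xw + \epsilon$$
$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_W \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ x_{21} & \cdots & x_{2k} \\ \vdots & & \vdots \\ x_{W1} & \cdots & x_{Wk} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_W \end{pmatrix}$$
X: base vectors
w: low dimensional representation
ε: residual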
An image = [figure] + a few words (10 words)
$$\langle p, q \rangle = w_p^T w_q + \epsilon_p^T \epsilon_q$$
X1 X2 X3 … Xk
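A small numeric sketch of the decomposition may help. It assumes the base vectors (the columns of X) are orthonormal, so the low dimensional coordinates are plain inner products and the residual is orthogonal to the base vectors. Under that assumption the inner product of two documents splits into a low dimensional part and a residual part. The basis and the vectors below are made up for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(p, basis):
    """Return (w, residual) such that p = sum_i w[i] * basis[i] + residual.

    Assumes the basis vectors are orthonormal, so w[i] is simply <p, basis[i]>.
    """
    w = [dot(p, x) for x in basis]
    projection = [sum(wi * x[j] for wi, x in zip(w, basis)) for j in range(len(p))]
    residual = [pj - rj for pj, rj in zip(p, projection)]
    return w, residual


# Hypothetical orthonormal basis (k = 2) in a 4-word vocabulary.
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
p = [3.0, 1.0, 0.0, 2.0]
q = [1.0, 0.0, 0.0, 1.0]

w_p, e_p = decompose(p, basis)
w_q, e_q = decompose(q, basis)

# With an orthonormal basis the similarity splits into two parts.
full = dot(p, q)
split = dot(w_p, w_q) + dot(e_p, e_q)
print(full, split)
```

Keeping the residual alongside w is what lets the similarity be recovered instead of only approximated by the low dimensional part.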
A Probabilistic Implementation
x is a switch variable. It controls whether a word is generated from
• a topic-specific distribution
• a document-specific distribution
• a background distribution
$$p(w \mid d) = p(x=0 \mid d) \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d) + p(x=1 \mid d)\, p'(w \mid d) + p(x=2 \mid d)\, p''(w)$$
C. Chemudugunta et al., Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
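To make the role of the switch variable concrete, the sketch below evaluates the three-way mixture for a single word: the topic part, the document-specific part and the background part, each weighted by the probability of the corresponding switch value. All distributions and numbers are invented for illustration, and the function and argument names are assumptions, not the paper's notation.

```python
def word_probability(word, switch, topic_word, doc_topic, doc_word, background):
    """p(w|d) as a mixture over the switch values x = 0 (topics), 1 (document), 2 (background)."""
    # x = 0: the word comes from one of the K topics, weighted by p(z = k | d).
    topic_part = sum(
        doc_topic[k] * topic_word[k].get(word, 0.0) for k in range(len(doc_topic))
    )
    return (
        switch[0] * topic_part
        + switch[1] * doc_word.get(word, 0.0)      # x = 1: document-specific words
        + switch[2] * background.get(word, 0.0)    # x = 2: background words
    )


# Hypothetical parameters for one document with K = 2 topics.
switch = [0.5, 0.3, 0.2]                      # p(x = 0 | d), p(x = 1 | d), p(x = 2 | d)
topic_word = [{"matrix": 0.6, "svd": 0.4}, {"topic": 0.7, "matrix": 0.3}]
doc_topic = [0.25, 0.75]                      # p(z = k | d)
doc_word = {"zhang": 1.0}                     # document-specific distribution
background = {"the": 0.5, "matrix": 0.1, "svd": 0.1, "topic": 0.1, "zhang": 0.2}

vocabulary = ["the", "matrix", "svd", "topic", "zhang"]
probs = {w: word_probability(w, switch, topic_word, doc_topic, doc_word, background)
         for w in vocabulary}
print(probs)
```

Because each component is a proper distribution and the switch probabilities sum to one, the mixture is again a distribution over the vocabulary.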
Search (Online)
DS1 DS2 …
LSH Index
Doc 300 Doc 401 …
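The general shape of an LSH lookup followed by re-ranking can be sketched as follows. This is a generic random-hyperplane LSH with cosine re-ranking, chosen for illustration only. The slides do not say which hash family or re-ranking score the system uses, and the sketch ignores the extra "few words" kept with each document. All names and vectors are hypothetical.

```python
import math
import random


def signature(vector, planes):
    """One bit per random hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(v * p for v, p in zip(vector, plane)) >= 0 for plane in planes)


def build_index(docs, planes):
    """Map each signature (bucket) to the ids of the documents hashed into it."""
    index = {}
    for doc_id, vector in docs.items():
        index.setdefault(signature(vector, planes), []).append(doc_id)
    return index


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, docs, index, planes):
    """Fetch the query's bucket, then re-rank its candidates by cosine similarity."""
    candidates = index.get(signature(query, planes), [])
    return sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)


rng = random.Random(0)  # fixed seed so the hyperplanes are reproducible
planes = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
docs = {"Doc 1": [0.9, 0.1, 0.0], "Doc 300": [0.1, 0.8, 0.1], "Doc 401": [0.2, 0.7, 0.1]}
index = build_index(docs, planes)
print(search(docs["Doc 300"], docs, index, planes))
```

Only the documents in the query's bucket are scored, which is what keeps the online step fast on a large index.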
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and Structure of Threaded Discussions A Sparse
Coding Approach and Its Applications
Chen LIN Jiang-Ming YANG Rui CAI Xin-jing WANG Wei WANG Lei ZHANG
SIGIR 2009
123
Semantic amp structure
124
SemanticTopics
StructureWho reply to who
Optimize them together
Model semantic
Model structure
125
Reply reconstruction
126
DocumentSimilarity
TopicSimilarity
StructureSimilarity
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability
bull Graphical model is a good tool to analyze problems
bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages
bull It is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank - Power Iteration
- Column-Stochastic & Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary
Baselines
NP Reply to Nearest Post
RR Reply to Root
DSDocument Similarity
LDA Latent Dirichlet Allocation Project documents to topic space
SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space
127
Evaluation
method Slashdot Apple
All Posts Good Posts All Posts Good Posts
NP 0021 0012 0289 0239
RR 0183 0319 0269 0474
DS 0463 0643 0409 0628
LDA 0465 0644 0410 0648
SWB 0463 0644 0410 0641
SMSS 0524 0737 0517 0772
128
Expert finding
Reply reconstruction
Network construction
Expert finding
Methods
HITS
PageRank
hellip
129
Baselines
LMFormal Models for Expert Finding in Enterprise Corpora SIGIR
06Achieves stable performance in expert finding task using a
language modelPageRank
Benchmark nodal ranking methodHITS
Find hub nodes and authority nodeEABIF
Personalized Recommendation Driven by Information Flow SIGIR rsquo06
Find most influential node130
Evaluation
131
bull Bayesian estimate
Method MRR MAP P10
LM 0821 0698 0800
EABIF(ori) 0674 0362 0243
EABIF(rec) 0742 0318 0281
PageRank(ori) 0675 0377 0263
PageRank(rec) 0743 0321 0266
HITS(ori) 0906 0832 0900
HITS(rec) 0938 0822 0906
Summary
bull Matrix and probability are fundamental mathematics in information retrieval and computer visionndash Matrix decomposition ndash a good practice to learn matrixndash Graphical model ndash a good practice to learn probability
bull Graphical model is a good tool to analyze problems
bull The essence of decomposition is to discover a set of mid-level features to describe original documentsimages
bull It is more adaptable for various applications than matrix decomposition
- An Introduction To Matrix Decomposition and Graphical Model
- Outline
- What Is Matrix Decomposition
- Why We Need Matrix Decomposition
- Why We Need Matrix Decomposition (2)
- Principle Component Analysis
- Definition ndash Eigenvalue amp Eigenvector
- Definition ndash Principle Component Analysis
- Principle Component Analysis (2)
- Maximizing Variance
- Optimization Problem
- Property Data Decomposition
- Face Recognition ndash Eigenface
- Slide 14
- Slide 15
- PageRank ndash Power Iteration
- Column-Stochastic amp Irreducible
- Iterative PageRank Calculation
- Convergence of the power iteration
- Singular value decomposition
- SVD - Definition
- Singular Values And Singular Vectors
- Matrix approximation
- SVD and PCA
- Example - LSI
- SVD and PCA (2)
- Latent Semantic Indexing (LSI)
- Latent Semantic Indexing
- Similarity Measures
- Similarity Measures (2)
- HITS (Hyperlink Induced Topic Search)
- Power Iteration
- HITS and SVD
- HITS vs PageRank
- NMF - Non-Negative Matrix Factorization
- Definition
- Motivation
- Multiplicative Update Algorithm
- Multiplicative Update Algorithm (2)
- NMF vs PCA
- Reference
- Major Reference
- Outline (2)
- Not Included
- Basic Concepts
- What Is Machine Learning
- Likelihood Function
- Maximum Likelihood (ML)
- IID - Independent Identically Distributed
- Reference (2)
- Expectation Maximization
- Why We Need EM
- More General
- The Expectation Maximization (EM) Algorithm
- Jensen's Inequality
- Lower Bounding the Log Likelihood
- The E and M Steps of EM
- The E Step
- The M Step
- EM Never Decreases the Likelihood
- Reference (3)
- Why Do We Need Graphical Model
- Why Do We Need Graphical Models
- Directed Acyclic Graphical Models (Bayesian Networks)
- Directed Graphs for Statistical Models Plate Notation
- pLSA - Probabilistic Latent Semantic Analysis
- Latent Semantic Indexing (LSI) Review
- pLSA - Probabilistic Latent Semantic Analysis (2)
- pLSA
- pLSA (2)
- Joint Probability vs Likelihood
- Document Decomposition
- pLSA - Objective Function
- EM Steps
- Lower Bounding the Log Likelihood (2)
- EM Steps (2)
- Latent Subspace
- pLSA vs LSA
- pLSA vs LSA (2)
- Applications
- Text Mining
- Scene Classification
- Classification Result
- Reference (4)
- LDA - Latent Dirichlet Allocation
- Problems in pLSA
- Problems in pLSA (2)
- Dirichlet Distribution
- Dirichlet Distribution (2)
- Example Dirichlet Distributions (K=3)
- Example Dirichlet Distributions (K=3) (2)
- The LDA Model
- The LDA Model (2)
- Joint Probability
- Likelihood
- Inference
- Inference (2)
- Variational Inference
- Variational Inference (2)
- VBEM vs EM
- Parameter Estimation
- Parameter Estimation (2)
- Topic Examples in a 100-topic LDA Model
- Classification (50-topic LDA + SVM)
- Problems in LDA
- A Bayesian Hierarchical Model for Learning Natural Scene Catego
- Codebook
- Topic Distribution in Different Categories
- Topic Hierarchical Clustering
- More Topic Models
- Are you really into Graphical Models
- Reference (5)
- Outline (3)
- Slide 115
- The Long Query Problem
- Key Idea Dimension Reduction + Residual Error Preservation
- Orthogonal Decomposition
- A Probabilistic Implementation
- Search (Online)
- Search Example
- Search Example (2)
- Simultaneously Modeling Semantics and Structure of Threaded Dis
- Semantic & structure
- Optimize them together
- Reply reconstruction
- Baselines
- Evaluation
- Expert finding
- Baselines (2)
- Evaluation (2)
- Summary