Author-Topic Models
6th June 2005 Research in Algorithms for the InterNet 1
Modeling Documents
Amruta Joshi, Department of Computer Science
Stanford University
Research in Algorithms for the InterNet, Amruta Joshi, Stanford Univ.
Outline
Topic Models: Topic Extraction, Author Information, Modeling Topics, Modeling Authors, Author-Topic Model, Inference
Integrating topics and syntax: Probabilistic Models, Composite Model, Inference
Motivation
Identifying the content of a document
Identifying its latent structure
More specifically: given a collection of documents, we want to create a model that captures information about authors, topics, and syntactic constructs.
Topics & Authors
Why model topics? To observe topic trends, see how documents relate to one another, and tag abstracts.
Why model authors' interests? To identify what an author writes about, identify authors with similar interests, attribute authorship, create reviewer lists, and find unusual work by an author.
Topic Extraction: Overview
Supervised learning techniques learn from a labeled document collection.
But documents are often unlabeled, and fields change rapidly (Yang 1998).
Example: "In floods, the banks of a river overflow" (label: rivers)
Topic Extraction: Overview
Dimensionality reduction
Represent documents in a vector space of terms, then map to a low-dimensional space
Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)
Linear projection: LSI (Berry, Dumais, O'Brien 1995)
Regions represent topics
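As a rough illustration of the first step only (representing documents in a vector space of terms), the sketch below builds term-count vectors and compares them with cosine similarity. It does not perform the LSI or WEBSOM projection. The toy corpus and the helper names `term_vector` and `cosine` are assumptions made for this example, not part of the talk.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity of two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, not taken from the talk.
docs = [
    "river stream bank river",
    "stream river bank",
    "money loan bank money",
]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

# The two river documents should look more alike than a river and a money document.
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A dimensionality-reduction method would then map these vectors to fewer dimensions, so that nearby points share a topic.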
Topic Extraction: Overview
Cluster documents on semantic content: typically, each cluster has just one topic
Aspect model: each topic is modeled as a distribution over words, and documents are generated from multiple topics
Author Information: Overview
As doth the lion in the Capitol, A man no mightier than thyself or me …
Analyzing text using
Stylometry: statistical analysis of literary style, frequency of word usage, etc.
Semantics: content of the document
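One simple stylometric feature is the relative frequency of common function words. The sketch below computes such a profile for the quoted line. The word list `FUNCTION_WORDS` and the function `style_profile` are hypothetical choices for illustration, and real stylometric studies use much larger feature sets.

```python
from collections import Counter

# Hypothetical, very short function-word list chosen only for this example.
FUNCTION_WORDS = ["the", "of", "and", "in", "than", "or"]

def style_profile(text):
    """Relative frequency of each function word among all tokens in the text."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in FUNCTION_WORDS}

profile = style_profile(
    "As doth the lion in the Capitol, A man no mightier than thyself or me"
)
print(profile)
```

Profiles like this can be compared across texts to judge whether they are likely to share an author.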
Author Information: Overview
Graph-based models
Build interactive ReferralWeb using citations (Kautz, Selman, Shah 1997)
Build co-author graphs (White & Smith), with PageRank for analysis
[Figure: graph with nodes D1, D2, D3, D4]
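To make the PageRank-on-a-co-author-graph idea concrete, here is a minimal power-iteration sketch. The graph, the damping factor of 0.85, and the function name `pagerank` are assumptions for illustration. It is not the specific method of White & Smith.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as {node: [neighbors]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor shares its current rank equally among its own neighbors.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical co-author graph: an edge means two people wrote a paper together.
coauthors = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
}
ranks = pagerank(coauthors)
print(ranks)
```

In this toy graph the author with the most collaborators receives the highest score, which is the kind of signal such an analysis would use.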
The Big Idea
Topic Model: model topics as distributions over words
Author Model: model authors as distributions over words
Author-Topic Model: a probabilistic model for both, with topics as distributions over words and authors as distributions over topics
Bayesian Networks
Nodes = random variables; edges = direct probabilistic influence
Topology captures independence: XRay is conditionally independent of Pneumonia given Lung Infiltrates
[Figure: Bayesian network over Pneumonia, Tuberculosis, Lung Infiltrates, XRay, and Sputum Smear]
Slide Credit: Lise Getoor, UMD College Park
Bayesian Networks
Associated with each node Xi there is a conditional probability distribution P(Xi | Pai): a distribution over Xi for each assignment to its parents
If variables are discrete, P is usually multinomial; P can also be linear Gaussian, a mixture of Gaussians, …
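[Figure: conditional probability table P(I | P, T) for the Lung Infiltrates node, with one row per assignment of Pneumonia (P) and Tuberculosis (T); extracted entries: 0.7 / 0.3, 0.6 / 0.4, 0.01 / 0.99, 0.2 / 0.8; shown next to the network over Pneumonia, Tuberculosis, Lung Infiltrates, XRay, and Sputum Smear]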
Slide Credit: Lise Getoor, UMD College Park
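The sketch below shows how per-node conditional distributions multiply into a joint distribution, and checks the stated independence numerically. It assumes a reduced network in which Pneumonia and Tuberculosis are the parents of Lung Infiltrates and XRay depends only on Lung Infiltrates (Sputum Smear is left out). Every probability value here is hypothetical and is not read off the slide.

```python
from itertools import product

# All numbers below are hypothetical and chosen only for illustration.
P_PNEUMONIA = 0.1                      # P(P = true)
P_TB = 0.05                            # P(T = true)
P_INFILTRATES = {                      # P(I = true | P, T)
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.02,
}
P_XRAY = {True: 0.9, False: 0.05}      # P(X = true | I)

def bernoulli(p_true, value):
    """Probability of a boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(p, t, i, x):
    """Joint probability as the product of each node's CPT entry given its parents."""
    return (bernoulli(P_PNEUMONIA, p) * bernoulli(P_TB, t)
            * bernoulli(P_INFILTRATES[(p, t)], i) * bernoulli(P_XRAY[i], x))

def prob(query, evidence):
    """P(query | evidence) by brute-force enumeration over all assignments."""
    names = ("p", "t", "i", "x")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            weight = joint(**world)
            denominator += weight
            if all(world[k] == v for k, v in query.items()):
                numerator += weight
    return numerator / denominator

with_pneumonia = prob({"x": True}, {"i": True, "p": True})
without_pneumonia = prob({"x": True}, {"i": True})
print(with_pneumonia, without_pneumonia)
```

Both conditionals agree: once Lung Infiltrates is known, learning Pneumonia does not change the belief about XRay.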
BN Learning
BN models can be learned from empirical data: parameter estimation via numerical optimization, structure learning via combinatorial search.
[Figure: data fed to an inducer, which outputs a network over X, I, S, T, P]
Slide Credit: Lise Getoor, UMD College Park
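For the special case of fully observed discrete data, maximum-likelihood parameter estimation reduces to counting relative frequencies, so no numerical optimizer is needed. The sketch below illustrates that case only. The records and the function name `estimate_cpt` are hypothetical.

```python
from collections import Counter

def estimate_cpt(samples, child, parents):
    """Maximum-likelihood estimate of P(child = True | parents) from complete samples."""
    parent_counts = Counter()
    child_true_counts = Counter()
    for sample in samples:
        key = tuple(sample[name] for name in parents)
        parent_counts[key] += 1
        if sample[child]:
            child_true_counts[key] += 1
    return {key: child_true_counts[key] / n for key, n in parent_counts.items()}

# Hypothetical fully observed records over Lung Infiltrates (I) and XRay (X).
data = [
    {"I": True, "X": True},
    {"I": True, "X": True},
    {"I": True, "X": False},
    {"I": False, "X": False},
]
cpt = estimate_cpt(data, child="X", parents=["I"])
print(cpt)
```

With missing values or hidden variables, counting no longer suffices and iterative methods are used instead.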
Generative Model
Probabilistic generative process; statistical inference
Bayesian approach: use priors
Mixture weights ~ Dirichlet(α)
Mixture components ~ Dirichlet(β)
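A minimal sketch of such a generative process, assuming symmetric Dirichlet priors: mixture components (one word distribution per topic) and mixture weights (one topic distribution per document) are drawn from Dirichlet priors, then each word is drawn by first picking a topic. The vocabulary, the prior values, and the function names are assumptions for illustration.

```python
import random

def dirichlet(concentration, size, rng):
    """Sample from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(weights, components, vocabulary, length, rng):
    """For each word: pick a topic from the mixture weights, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(weights)), weights=weights)[0]
        words.append(rng.choices(vocabulary, weights=components[topic])[0])
    return words

rng = random.Random(0)
vocabulary = ["river", "stream", "bank", "money", "loan"]
num_topics, alpha, beta = 2, 0.5, 0.5   # hypothetical settings
components = [dirichlet(beta, len(vocabulary), rng) for _ in range(num_topics)]
weights = dirichlet(alpha, num_topics, rng)
document = generate_document(weights, components, vocabulary, length=10, rng=rng)
print(document)
```

Statistical inference runs this process in reverse: given only the words, it estimates the weights and components.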
Bayesian Network for modeling document generation
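[Figure: for Doc 1, a topic variable Z ranging over topics T1 … TT and a word variable W ranging over words w1 … wV, with an edge from Z to W]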
Topic Model: Plate Notation
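[Figure: plate diagram. Plate D (documents) holds the document-specific distribution over topics; plate Nd (words in document d) holds the topic z and the word w; plate T (topics) holds each topic's distribution over words]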
Topic Model: Geometric Representation
Modeling Authors with words
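[Figure: plate diagram. Plate D (documents) holds the author set ad; plate Nd (words in document d) holds the author x, drawn from a uniform distribution over the authors of the document, and the word w; plate A (authors) holds each author's distribution over words]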
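Author-Topic Model
[Figure: plate diagram. Plate D (documents) holds the author set ad; plate Nd (words in document d) holds the author x, drawn uniformly from the document's authors, the topic z, and the word w; plate A (authors) holds each author's distribution over topics; plate T (topics) holds each topic's distribution over words]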
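The generative story of the author-topic model can be sketched as follows: for each word, choose an author uniformly from the document's authors, choose a topic from that author's distribution over topics, then choose a word from that topic's distribution over words. The author names, the probability tables, and the function name are hypothetical.

```python
import random

def generate_author_topic_document(doc_authors, author_topics, topic_words,
                                   vocabulary, length, rng):
    """Generate (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(length):
        # x: author chosen uniformly from the authors of this document
        author = rng.choice(doc_authors)
        # z: topic drawn from that author's distribution over topics
        topics = author_topics[author]
        topic = rng.choices(range(len(topics)), weights=topics)[0]
        # w: word drawn from that topic's distribution over words
        word = rng.choices(vocabulary, weights=topic_words[topic])[0]
        tokens.append((author, topic, word))
    return tokens

# Hypothetical parameters: topic 0 is about rivers, topic 1 about finance.
vocabulary = ["river", "stream", "bank", "money", "loan"]
topic_words = [
    [0.4, 0.4, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.3, 0.3],
]
author_topics = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}

rng = random.Random(0)
tokens = generate_author_topic_document(
    ["alice", "bob"], author_topics, topic_words, vocabulary, length=12, rng=rng
)
print(tokens)
```

Because each hypothetical author here writes about a single topic, every generated word can be traced back to one author's interests. In learned models the author distributions are usually mixed.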
Inference
Expectation Maximization: poor results (local maxima)
Gibbs Sampling
Parameters: the distributions over topics and the topic distributions over words
Start with an initial random assignment
Update each parameter using the other parameters
Converges after n iterations (the burn-in time)
Inference and Learning for Documents
[Equation: probability that the ith word token is assigned to topic j, keeping the other topic assignments unchanged, computed from two counts]
mj: number of times word m is assigned to topic j
dj: number of times topic j has occurred in document d
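The slide's equation did not survive extraction. The sketch below uses the standard collapsed Gibbs update for topic models, as given in Griffiths & Steyvers (2004), which is assumed to be what the slide showed: P(z_i = j | other assignments) is proportional to (C_mj + β) / (Σ_m' C_m'j + Vβ) × (C_dj + α) / (Σ_j' C_dj' + Tα), with all counts excluding token i. The function name, toy data, and hyperparameter values are hypothetical.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, iterations, rng):
    """Collapsed Gibbs sampling for a topic model; docs are lists of word ids."""
    word_topic = [[0] * num_topics for _ in range(vocab_size)]  # C_mj
    doc_topic = [[0] * num_topics for _ in docs]                # C_dj
    topic_total = [0] * num_topics                              # sum over m of C_mj
    assignments = []
    for d, doc in enumerate(docs):          # initial random assignment
        z_doc = []
        for m in doc:
            j = rng.randrange(num_topics)
            word_topic[m][j] += 1
            doc_topic[d][j] += 1
            topic_total[j] += 1
            z_doc.append(j)
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, m in enumerate(doc):
                j = assignments[d][i]       # take token i out of the counts
                word_topic[m][j] -= 1
                doc_topic[d][j] -= 1
                topic_total[j] -= 1
                # The document-side denominator is the same for every topic,
                # so it is dropped from the unnormalized weights.
                weights = [
                    (word_topic[m][k] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                j = rng.choices(range(num_topics), weights=weights)[0]
                word_topic[m][j] += 1       # put token i back under its new topic
                doc_topic[d][j] += 1
                topic_total[j] += 1
                assignments[d][i] = j
    return assignments, word_topic, doc_topic

# Hypothetical toy corpus: word ids 0..4 stand for river, stream, bank, money, loan.
docs = [[0, 1, 2, 0], [2, 3, 4, 3]]
assignments, word_topic, doc_topic = gibbs_lda(
    docs, num_topics=2, vocab_size=5, alpha=0.5, beta=0.1,
    iterations=20, rng=random.Random(1),
)
print(assignments)
```

After burn-in, the counts `word_topic` and `doc_topic` can be normalized to estimate the topic word distributions and the document topic mixtures.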
Matrix Factorization
Topic Model: Inference
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan]
Can we recover the original topics and topic mixtures from this data?
Slide Credit: Padhraic Smyth, UC Irvine
Example of Gibbs Sampling
Assign word tokens randomly to topics (marker color shows topic 1 or topic 2)
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan, marked by topic assignment]
Slide Credit: Padhraic Smyth, UC Irvine
After 1 iteration
Apply sampling equation to each word token
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan, marked by topic assignment]
Slide Credit: Padhraic Smyth, UC Irvine
After 4 iterations
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan, marked by topic assignment]
Slide Credit: Padhraic Smyth, UC Irvine
After 32 iterations
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan, marked by topic assignment]
topic 1: stream .40, bank .35, river .25
topic 2: bank .39, money .32, loan .29
Slide Credit: Padhraic Smyth, UC Irvine
Results
Tested on scientific papers
NIPS dataset: V = 13,649; D = 1,740; K = 2,037; #topics = 100; #tokens = 2,301,375
CiteSeer dataset: V = 30,799; D = 162,489; K = 85,465; #topics = 300; #tokens = 11,685,514
Evaluating Predictive Power
Perplexity: indicates the ability to predict words in new, unseen documents
Lower is better
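Perplexity is commonly defined as the exponential of the negative average log-probability that the model assigns to each held-out word. The sketch below assumes the per-word probabilities have already been computed by some model. The function name and inputs are hypothetical.

```python
import math

def perplexity(word_probabilities):
    """exp of the negative mean log-probability assigned to each held-out word."""
    if not word_probabilities:
        raise ValueError("need at least one word probability")
    log_likelihood = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_likelihood / len(word_probabilities))

# A model that spreads probability uniformly over 5 words has perplexity 5.
uniform = perplexity([1 / 5] * 8)
# A model that assigns higher probability to the observed words scores lower.
confident = perplexity([0.5, 0.5])
print(uniform, confident)
```

Intuitively, a perplexity of k means the model is as uncertain as if it were choosing uniformly among k words.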
Results: Perplexity
Recap
First: Author Model, Topic Model
Then: Author-Topic Model
Next: Integrating Topics & Syntax
Integrating topics & syntax
Probabilistic models
Short-range dependencies: syntactic constraints, represented as distinct syntactic classes (HMM, probabilistic CFGs)
Long-range dependencies: semantic constraints, represented as probability distributions (Bayes model, topic model)
New idea: use both
How to integrate these?
Mixture of models: each word exhibits either short-range or long-range dependencies
Product of models: each word exhibits both short-range and long-range dependencies
Composite model: asymmetric; all words exhibit short-range dependencies, and a subset of words exhibit long-range dependencies
The Composite Model 1
Capturing asymmetry
Replace one class's probability distribution over words with the semantic model
Syntactic model chooses when to emit a content word
Semantic model chooses which word to emit
Methods: the syntactic component is an HMM; the semantic component is a topic model
Generating phrases
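[Figure: composite model with a semantic class holding three topics and two syntactic classes, linked by transition probabilities]
Topics: (network, neural, output, networks, ...); (image, images, object, objects, ...); (kernel, support, svm, vector, ...)
Syntactic classes: (in, with, for, on, ...); (used, trained, obtained, described, ...)
Probabilities shown in the figure: 0.5, 0.4, 0.1; 0.9, 0.2, 0.7, 0.9
Generated phrases: network used for images; image obtained with kernel; output described with objects; neural network trained with svm images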
The Composite Model 2 (Graphical)
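[Figure: graphical model for four words. The document's distribution over topics generates the topics z1 … z4; the classes c1 … c4 form a chain; each word wi depends on its topic zi and its class ci]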
The Composite Model 3
θ(d): document's distribution over topics
Transitions between classes ci-1 and ci follow distribution π(ci-1)
A document is generated as:
For each word wi in document d
Draw zi from θ(d)
Draw ci from π(ci-1)
If ci = 1, then draw wi from φ(zi), else draw wi from φ(ci)
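The generation steps above can be sketched directly. Here `theta` stands for the document's distribution over topics, `pi` for the class transition table, `phi_topics` for the topic word distributions, and `phi_classes` for the word distributions of the non-semantic classes. Class 0 is used as a start marker and class 1 as the semantic class. All names and numbers are hypothetical.

```python
import random

def generate_composite(theta, pi, phi_topics, phi_classes, vocabulary, length, rng):
    """Generate (class, word) pairs with an HMM over classes and a topic model in class 1."""
    output = []
    previous = 0                      # hypothetical start-marker class
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]                 # topic z_i
        c = rng.choices(range(len(pi[previous])), weights=pi[previous])[0]   # class c_i
        if c == 1:
            # Semantic class: the word comes from the chosen topic.
            word = rng.choices(vocabulary, weights=phi_topics[z])[0]
        else:
            # Syntactic class: the word comes from the class's own distribution.
            word = rng.choices(vocabulary, weights=phi_classes[c])[0]
        output.append((c, word))
        previous = c
    return output

vocabulary = ["network", "image", "kernel", "with", "for"]
theta = [0.6, 0.4]
phi_topics = [
    [0.7, 0.2, 0.1, 0.0, 0.0],
    [0.1, 0.6, 0.3, 0.0, 0.0],
]
phi_classes = {2: [0.0, 0.0, 0.0, 0.5, 0.5]}   # class 2 emits function words
pi = [
    [0.0, 0.8, 0.2],   # from the start marker
    [0.0, 0.3, 0.7],   # from the semantic class
    [0.0, 0.9, 0.1],   # from the function-word class
]

phrase = generate_composite(theta, pi, phi_topics, phi_classes, vocabulary,
                            length=8, rng=random.Random(0))
print(" ".join(word for _, word in phrase))
```

The class chain decides where content words appear, while the topic draw decides which content word is used, which mirrors the asymmetry described for the composite model.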
Results
Tested on: Brown corpus (tagged with word types); concatenated Brown & TASA corpus
HMM & topic model
20 classes: a start/end markers class + 19 classes
T = 200
Results
Identifying syntactic classes & semantic topics: clean separation observed
Identifying function words & content words: "control" can be a plain verb (syntax) or a semantic word
Part-of-speech tagging: identifying the syntactic class
Document classification: Brown corpus, 500 docs => 15 groups; results similar to the plain topic model
Extensions to Topic Model
Integrating link information (Cohn, Hofmann 2001)
Learning topic hierarchies
Integrating syntax & topics
Integrating authorship info with content (author-topic model)
Grade-of-membership models
Random sentence generation
Conclusion
Identifying a document's latent structure
Document content is modeled for:
Semantic associations: topic model
Authorship: author-topic model
Syntactic constructs: HMM
Acknowledgements
Prof. Rajeev Motwani: advice and guidance regarding topic selection
T. K. Satish Kumar: help on probabilistic models
Thank you!
References
Primary
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
Steyvers, M., & Griffiths, T. Probabilistic Topic Models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17.
Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.