Author-Topic Models


Page 1: Author-Topic Models

6th June 2005, Research in Algorithms for the InterNet

Modeling Documents

Amruta Joshi
Department of Computer Science
Stanford University

Page 2: Author-Topic Models

Outline
- Topic Models
  - Topic Extraction
  - Author Information
  - Modeling Topics
  - Modeling Authors
  - Author-Topic Model
  - Inference
- Integrating topics and syntax
  - Probabilistic Models
  - Composite Model
  - Inference

Page 3: Author-Topic Models

Motivation
- Identifying the content of a document
- Identifying its latent structure
- More specifically: given a collection of documents, we want to create a model that collects information about
  - Authors
  - Topics
  - Syntactic constructs

Page 4: Author-Topic Models

Topics & Authors
- Why model topics?
  - Observe topic trends
  - See how documents relate to one another
  - Tag abstracts
- Why model authors' interests?
  - Identifying what an author writes about
  - Identifying authors with similar interests
  - Authorship attribution
  - Creating reviewer lists
  - Finding unusual work by an author

Page 5: Author-Topic Models

Topic Extraction: Overview
- Supervised learning techniques
  - Learn from a labeled document collection
  - But: unlabeled documents, rapidly changing fields (Yang 1998)
[Figure: example sentence "In floods, the banks of a river overflow" with the label "rivers"]

Page 6: Author-Topic Models

Topic Extraction: Overview
- Dimensionality reduction
  - Represent documents in a vector space of terms
  - Map to a low-dimensional space
    - Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)
    - Linear projection: LSI (Berry, Dumais, O'Brien 1995)
  - Regions represent topics
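The linear-projection idea behind LSI can be sketched with a truncated SVD of a term-document matrix. This is a minimal illustration, not the setup used in the cited work: the toy counts, the word list, and the choice of k = 2 latent dimensions are all assumptions.

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
# The counts are made up for illustration.
terms = ["river", "stream", "bank", "money", "loan"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 1, 2],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # assumed number of latent dimensions

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Each document represented as a k-dimensional vector in the latent space.
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T

# Rank-k reconstruction of the original matrix.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
```

Documents that share vocabulary end up close together in `doc_coords`, which is the sense in which regions of the reduced space correspond to topics.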

Page 7: Author-Topic Models

Topic Extraction: Overview
- Cluster documents on semantic content
  - Typically, each cluster has just one topic
- Aspect Model
  - Topic modeled as a distribution over words
  - Documents generated from multiple topics

Page 8: Author-Topic Models

Author Information: Overview
- Analyzing text using
  - Stylometry: statistical analysis using literary style, frequency of word usage, etc.
  - Semantics: content of the document
[Figure: sample text "As doth the lion in the Capitol, A man no mightier than thyself or me …"]

Page 9: Author-Topic Models

Author Information: Overview
- Graph-based models
  - Build interactive ReferralWeb using citations (Kautz, Selman, Shah 1997)
  - Build co-author graphs (White & Smith)
  - PageRank for analysis
[Figure: graph over documents D1, D2, D3, D4]

Page 10: Author-Topic Models

The Big Idea
- Topic Model
  - Model topics as distributions over words
- Author Model
  - Model authors as distributions over words
- Author-Topic Model
  - Probabilistic model for both
  - Model topics as distributions over words
  - Model authors as distributions over topics

Page 11: Author-Topic Models

Bayesian Networks
- nodes = random variables
- edges = direct probabilistic influence
- Topology captures independence: XRay is conditionally independent of Pneumonia given Infiltrates
[Figure: Bayesian network over Pneumonia, Tuberculosis, Lung Infiltrates, XRay and Sputum Smear]

Slide Credit: Lise Getoor, UMD College Park
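The conditional-independence claim can be checked numerically on a small chain Pneumonia -> Infiltrates -> XRay. The sketch below is only an illustration: it uses three of the five nodes, and every probability in it is a hypothetical value, not a number from the slide.

```python
# Hypothetical conditional probability tables for the chain
# Pneumonia -> Infiltrates -> XRay (all numbers are made up).
p_pneumonia = {True: 0.1, False: 0.9}                 # P(P)
p_infiltrates = {True: {True: 0.8, False: 0.2},       # P(I | P = True)
                 False: {True: 0.05, False: 0.95}}    # P(I | P = False)
p_xray = {True: {True: 0.9, False: 0.1},              # P(X | I = True)
          False: {True: 0.02, False: 0.98}}           # P(X | I = False)


def joint(p, i, x):
    """Joint probability factorized along the edges of the network."""
    return p_pneumonia[p] * p_infiltrates[p][i] * p_xray[i][x]


def prob_xray_given(i, p):
    """P(XRay = True | Infiltrates = i, Pneumonia = p), computed from the joint."""
    numerator = joint(p, i, True)
    denominator = sum(joint(p, i, x) for x in (True, False))
    return numerator / denominator


# Once Infiltrates is known, the value of Pneumonia no longer changes P(XRay).
with_pneumonia = prob_xray_given(True, True)
without_pneumonia = prob_xray_given(True, False)
```

Both quantities equal P(X = True | I = True), because the factors involving Pneumonia cancel in the ratio.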

Page 12: Author-Topic Models

Bayesian Networks
- Associated with each node Xi there is a conditional probability distribution P(Xi | Pai): a distribution over Xi for each assignment to its parents
- If variables are discrete, P is usually multinomial
- P can be linear Gaussian, mixture of Gaussians, …
[Figure: Bayesian network over Pneumonia, Tuberculosis, Lung Infiltrates, XRay and Sputum Smear, with the conditional probability table P(I | P, T) for Lung Infiltrates; row values: 0.7/0.3, 0.6/0.4, 0.01/0.99, 0.2/0.8]

Slide Credit: Lise Getoor, UMD College Park

Page 13: Author-Topic Models

BN Learning
- BN models can be learned from empirical data
  - parameter estimation via numerical optimization
  - structure learning via combinatorial search
[Figure: Data -> Inducer -> Bayesian network over nodes X, I, S, T, P]

Slide Credit: Lise Getoor, UMD College Park

Page 14: Author-Topic Models

Generative Model
- Probabilistic generative process
- Statistical inference
- Bayesian approach: use priors
  - Mixture weights ~ Dirichlet(α)
  - Mixture components ~ Dirichlet(β)
[Figure: mixture components and mixture weights]
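Written out for a topic model, the two priors and the sampling steps they feed into can be summarized as follows. The symbols are assumptions chosen to match common topic-model notation: θ for a document's mixture weights, φ for the mixture components (topics), z_i for the topic of token i, and w_i for the word at token i.

```latex
\begin{align*}
\theta^{(d)} &\sim \mathrm{Dirichlet}(\alpha)
  && \text{mixture weights of document } d \\
\phi^{(j)} &\sim \mathrm{Dirichlet}(\beta)
  && \text{mixture component (topic) } j \\
z_i \mid \theta^{(d)} &\sim \mathrm{Multinomial}\bigl(\theta^{(d)}\bigr)
  && \text{topic of token } i \\
w_i \mid z_i, \phi &\sim \mathrm{Multinomial}\bigl(\phi^{(z_i)}\bigr)
  && \text{word at token } i
\end{align*}
```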

Page 15: Author-Topic Models

Bayesian Network for modeling document generation
[Figure: Doc 1 -> topics T1, T2, …, TT (topic variable Z) -> words w1, w2, …, wv (word variable W)]

Page 16: Author-Topic Models

Topic Model: Plate Notation
[Figure: plate diagram with a document-specific distribution over topics (plate D, one per document), a topic distribution over words (plate T, one per topic), and, for each of the Nd word positions in a document, a topic z and a word w]
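The generative process that the plate diagram encodes can be sketched in a few lines. This is an illustration under assumed settings: the vocabulary, the number of topics and documents, the document length, and the Dirichlet hyperparameters `alpha` and `beta` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["river", "stream", "bank", "money", "loan"]
T, V, D, N_d = 2, len(vocab), 3, 8   # assumed: topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1               # assumed Dirichlet hyperparameters

# One distribution over words per topic (T x V).
phi = rng.dirichlet(np.full(V, beta), size=T)
# One distribution over topics per document (D x T).
theta = rng.dirichlet(np.full(T, alpha), size=D)


def generate_document(d):
    """For each word position: draw a topic z from theta[d], then a word w from phi[z]."""
    z = rng.choice(T, size=N_d, p=theta[d])
    w = [int(rng.choice(V, p=phi[j])) for j in z]
    return z, [vocab[i] for i in w]


docs = [generate_document(d) for d in range(D)]
```

Inference runs this process in reverse: given only the words, it estimates the topic assignments and the two sets of distributions.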

Page 17: Author-Topic Models

Topic Model: Geometric Representation

Page 18: Author-Topic Models

Modeling Authors with words
[Figure: plate diagram. ad: the authors of document d; x: author, drawn from a uniform distribution over the authors of the doc; w: word, drawn from the chosen author's distribution over words (plate A, one per author); inner plate Nd (word positions), outer plate D (documents)]

Page 19: Author-Topic Models

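Author-Topic Model
[Figure: plate diagram. ad: the authors of document d; x: author, drawn from a uniform distribution over the authors of the document; z: topic, drawn from the chosen author's distribution over topics (plate A, one per author); w: word, drawn from the chosen topic's distribution over words (plate T, one per topic); inner plate Nd (word positions), outer plate D (documents)]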
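The generative story of the author-topic model can be sketched as follows: for each word, pick one of the document's authors uniformly, pick a topic from that author's distribution over topics, then pick a word from that topic's distribution over words. All sizes, hyperparameters, and the vocabulary below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["river", "stream", "bank", "money", "loan"]
A, T, V = 3, 2, len(vocab)   # assumed: number of authors, topics, vocabulary size

# Each author's distribution over topics (A x T) and each topic's
# distribution over words (T x V), drawn from assumed Dirichlet priors.
theta = rng.dirichlet(np.full(T, 0.5), size=A)
phi = rng.dirichlet(np.full(V, 0.1), size=T)


def generate_document(authors, n_words):
    """Generate (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        x = int(rng.choice(authors))         # author: uniform over the document's authors
        z = int(rng.choice(T, p=theta[x]))   # topic: from that author's topic distribution
        w = int(rng.choice(V, p=phi[z]))     # word: from that topic's word distribution
        tokens.append((x, z, vocab[w]))
    return tokens


doc = generate_document(authors=[0, 2], n_words=10)
```

Unlike the plain topic model, the mixture over topics here belongs to an author rather than to a document, so a document's topic mixture is determined by who wrote it.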

Page 20: Author-Topic Models

Inference
- Expectation Maximization
  - But poor results (local maxima)
- Gibbs Sampling
  - Parameters: φ, θ
  - Start with an initial random assignment
  - Update each parameter using the other parameters
  - Converges after 'n' iterations
  - Burn-in time

Page 21: Author-Topic Models

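Inference and Learning for Documents

$$P(z_i = j \mid \mathbf{z}_{-i}, w_i = m, d_i = d) \;\propto\; \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}$$

- C^{WT}_{mj}: # of times word m is assigned to topic j (not counting token i)
- C^{DT}_{dj}: # of times topic j has occurred in document d (not counting token i)
- Left-hand side: prob. that the ith token is assigned to topic j, keeping the other topic assignments unchanged
- α, β: Dirichlet hyperparameters; V: vocabulary size; T: number of topics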
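The update described on this slide (resample one token's topic from the word-topic and document-topic counts while all other assignments stay fixed) can be sketched as a small collapsed Gibbs sampler. This is a toy illustration, not the implementation behind the reported results: the function name `gibbs_lda`, the hyperparameter values, the iteration count, and the tiny corpus are all assumptions.

```python
import numpy as np


def gibbs_lda(docs, V, T, alpha=1.0, beta=1.0, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for a topic model.

    docs: list of documents, each a list of word ids in [0, V).
    Returns the topic assignments and the two count matrices.
    """
    rng = np.random.default_rng(seed)
    C_WT = np.zeros((V, T), dtype=int)          # word-topic counts
    C_DT = np.zeros((len(docs), T), dtype=int)  # document-topic counts

    # Start with a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(T, size=len(doc))
        for m, j in zip(doc, z_d):
            C_WT[m, j] += 1
            C_DT[d, j] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, m in enumerate(doc):
                # Remove token i from the counts.
                j_old = z[d][i]
                C_WT[m, j_old] -= 1
                C_DT[d, j_old] -= 1
                # Unnormalized probability of each topic for this token.
                # The document-side denominator is the same for every topic,
                # so it is dropped before normalizing.
                p = (C_WT[m] + beta) / (C_WT.sum(axis=0) + V * beta) * (C_DT[d] + alpha)
                j_new = rng.choice(T, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = j_new
                C_WT[m, j_new] += 1
                C_DT[d, j_new] += 1
    return z, C_WT, C_DT


# Toy corpus: 0=river, 1=stream, 2=bank, 3=money, 4=loan
docs = [[0, 1, 2, 0, 1, 2], [3, 4, 2, 3, 4, 2], [0, 1, 2, 3, 4, 2]]
z, C_WT, C_DT = gibbs_lda(docs, V=5, T=2)
```

After sampling, normalizing the columns of `C_WT` (plus `beta`) gives an estimate of each topic's distribution over words, and normalizing the rows of `C_DT` (plus `alpha`) gives each document's mixture over topics.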

Page 22: Author-Topic Models

Matrix Factorization

Page 23: Author-Topic Models

Topic Model: Inference
[Figure: word tokens in 16 documents (rows 1 to 16) over the words River, Stream, Bank, Money, Loan]

Can we recover the original topics and topic mixtures from this data?

Slide Credit: Padhraic Smyth, UC Irvine

Page 24: Author-Topic Models

Example of Gibbs Sampling
- Assign word tokens randomly to topics (each token is marked ● as either topic 1 or topic 2)
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan]

Slide Credit: Padhraic Smyth, UC Irvine

Page 25: Author-Topic Models

After 1 iteration
- Apply sampling equation to each word token
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan]

Slide Credit: Padhraic Smyth, UC Irvine

Page 26: Author-Topic Models

After 4 iterations
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan]

Slide Credit: Padhraic Smyth, UC Irvine

Page 27: Author-Topic Models

After 32 iterations
- topic 1: stream .40, bank .35, river .25
- topic 2: bank .39, money .32, loan .29
[Figure: word tokens in documents 1 to 16 over the words River, Stream, Bank, Money, Loan]

Slide Credit: Padhraic Smyth, UC Irvine

Page 28: Author-Topic Models

Results
- Tested on scientific papers
- NIPS dataset: V = 13,649 (vocabulary size), D = 1,740 (documents), K = 2,037 (authors), #topics = 100, #tokens = 2,301,375
- CiteSeer dataset: V = 30,799 (vocabulary size), D = 162,489 (documents), K = 85,465 (authors), #topics = 300, #tokens = 11,685,514

Page 29: Author-Topic Models

Evaluating Predictive Power
- Perplexity
  - Indicates ability to predict words on new, unseen documents
  - The lower, the better
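Perplexity is the exponential of the negative average log-probability that the model assigns to held-out word tokens. The sketch below assumes the per-token probabilities have already been computed by some model; the function name `perplexity` and the example values are illustrative only.

```python
import math


def perplexity(token_probs):
    """Perplexity of held-out tokens, given the probability the model
    assigned to each one: exp(-(1/N) * sum(log p_i))."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


# A model that spreads its probability uniformly over 5 words is as
# uncertain as a fair 5-way choice on every token.
uniform_model = perplexity([0.2] * 10)

# A model that assigns higher probability to the observed words scores lower.
better_model = perplexity([0.5] * 10)
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k words.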

Page 30: Author-Topic Models

Results: Perplexity

Page 31: Author-Topic Models

Recap
- First: Author Model, Topic Model
- Then: Author-Topic Model
- Next: Integrating Topics & Syntax

Page 32: Author-Topic Models

Integrating topics & syntax
- Probabilistic models
  - Short-range dependencies
    - Syntactic constraints
    - Represented as distinct syntactic classes
    - HMM, probabilistic CFGs
  - Long-range dependencies
    - Semantic constraints
    - Represented as a probabilistic distribution
    - Bayes model, topic model
- New idea! Use both

Page 33: Author-Topic Models

How to integrate these?
- Mixture of Models
  - Each word exhibits either short- or long-range dependencies
- Product of Models
  - Each word exhibits both short- and long-range dependencies
- Composite Model
  - Asymmetric
  - All words exhibit short-range dependencies
  - A subset of words exhibit long-range dependencies

Page 34: Author-Topic Models

The Composite Model 1
- Capturing asymmetry
  - Replace probability distribution over words with semantic model
  - Syntactic model chooses when to emit a content word
  - Semantic model chooses which word to emit
- Methods
  - Syntactic component is an HMM
  - Semantic component is a topic model

Page 35: Author-Topic Models

Generating phrases
[Figure: composite model with three topic word lists, two syntactic word classes, and probabilities 0.5, 0.4, 0.1, 0.9, 0.2, 0.7, 0.9 on the diagram]
- Topic word lists: {network, neural, output, networks, ...}, {image, images, object, objects, ...}, {kernel, support, svm, vector, ...}
- Syntactic word classes: {in, with, for, on, ...}, {used, trained, obtained, described, ...}
- Generated phrases: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm images"

Page 36: Author-Topic Models

The Composite Model 2 (Graphical)
[Figure: graphical model with the doc's distribution over topics, topics z1 to z4, words w1 to w4, and classes c1 to c4]

Page 37: Author-Topic Models

The Composite Model 3
- θ(d): document's distribution over topics
- Transitions between classes ci-1 and ci follow distribution π(ci-1)
- A document is generated as: for each word wi in document d
  - Draw zi from θ(d)
  - Draw ci from π(ci-1)
  - If ci = 1, then draw wi from φ(zi), else draw wi from φ(ci)
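The three generative steps above can be sketched directly. This is an illustration under assumed settings: the vocabulary, the number of topics and classes, the randomly drawn distributions, and the starting class are all hypothetical. Class 1 plays the role of the semantic class, as on the slide.

```python
import numpy as np

rng = np.random.default_rng(2)

vocab = ["network", "image", "kernel", "in", "with", "for", "used", "trained"]
V, T, C = len(vocab), 3, 3   # assumed: vocabulary size, topics, syntactic classes
SEMANTIC = 1                 # class 1 hands word choice over to the topic model

theta_d = rng.dirichlet(np.ones(T))            # document's distribution over topics
pi = rng.dirichlet(np.ones(C), size=C)         # pi[c_prev]: distribution over the next class
phi_topic = rng.dirichlet(np.ones(V), size=T)  # word distribution of each topic
phi_class = rng.dirichlet(np.ones(V), size=C)  # word distribution of each class (row SEMANTIC unused)


def generate(n_words, c_prev=0):
    """Generate (class, topic, word) triples; c_prev=0 is an assumed start class."""
    out = []
    for _ in range(n_words):
        z = int(rng.choice(T, p=theta_d))     # draw z_i from the document's topic distribution
        c = int(rng.choice(C, p=pi[c_prev]))  # draw c_i given the previous class
        if c == SEMANTIC:
            w = int(rng.choice(V, p=phi_topic[z]))  # content word from topic z_i
        else:
            w = int(rng.choice(V, p=phi_class[c]))  # function word from class c_i
        out.append((c, z, vocab[w]))
        c_prev = c
    return out


sentence = generate(12)
```

The class sequence is a Markov chain (short-range structure), while the topic draws depend only on the document (long-range structure), which is the asymmetry the composite model is built around.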

Page 38: Author-Topic Models

Results
- Tested on
  - Brown corpus (tagged with word types)
  - Concatenated Brown & TASA corpus
- HMM & Topic Model
  - 20 classes: start/end markers class + 19 classes
  - T = 200

Page 39: Author-Topic Models

Results
- Identifying syntactic classes & semantic topics
  - Clean separation observed
- Identifying function words & content words
  - "control": plain verb (syntax) or semantic word
- Part-of-speech tagging
  - Identifying syntactic class
- Document classification
  - Brown corpus: 500 docs => 15 groups
  - Results similar to plain topic model

Page 40: Author-Topic Models

Extensions to Topic Model
- Integrating link information (Cohn, Hofmann 2001)
- Learning topic hierarchies
- Integrating syntax & topics
- Integrating authorship info with content (author-topic model)
- Grade-of-membership models
- Random sentence generation

Page 41: Author-Topic Models

Conclusion
- Identifying a document's latent structure
- Document content is modeled for
  - Semantic associations: topic model
  - Authorship: author-topic model
  - Syntactic constructs: HMM

Page 42: Author-Topic Models

Acknowledgements
- Prof. Rajeev Motwani: advice and guidance regarding topic selection
- T. K. Satish Kumar: help on probabilistic models

Page 43: Author-Topic Models

Thank you!

Page 44: Author-Topic Models

References

Primary

Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.

Steyvers, M. & Griffiths, T. Probabilistic topic models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.

Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17.

Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.