Text Mining Using LDA with Context

Christoph Kling, Steffen Staab

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany &
Web and Internet Science Group · ECS · University of Southampton, UK

Text Mining Documents

Documents are PDFs, emails, tweets, Flickr photo tags, CVs, ...

Documents consist of
- a bag of words
- metadata: author(s), timestamp, geolocation, publisher, booktitle, device, ...

Objective: cluster, categorize, and explain

[Figure: three example topics with their top words. Chinese food: dim sum, duck, eggs, ...; Vegan food: vegan, tofu, ...; Breakfast: eggs, ham, ...]


Latent Dirichlet Allocation (LDA)

- Document-topic distributions
- Topic-word distributions
- K topics
- M documents
- Each document m ∈ {1, ..., M} has length N_m
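The quantities on this slide (K topics, M documents, document lengths N_m, document-topic and topic-word distributions) can be made concrete with a minimal sketch of the standard LDA generative process. The vocabulary size `V`, the symmetric hyperparameters `alpha` and `beta`, and the function name are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np

def generate_lda_corpus(K, M, V, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process.

    K: number of topics, M: number of documents, V: vocabulary size,
    doc_lengths: one length N_m per document.
    alpha, beta: symmetric Dirichlet hyperparameters (assumed values).
    """
    rng = np.random.default_rng(seed)
    # Topic-word distributions: one distribution over the vocabulary per topic.
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    # Document-topic distributions: one distribution over topics per document.
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)
    docs = []
    for m in range(M):
        # Draw a topic for every word position, then a word from that topic.
        z = rng.choice(K, size=doc_lengths[m], p=theta[m])
        w = np.array([rng.choice(V, p=phi[k]) for k in z], dtype=int)
        docs.append(w)
    return theta, phi, docs

theta, phi, docs = generate_lda_corpus(K=3, M=5, V=20,
                                       doc_lengths=[8, 10, 12, 9, 11])
```

Inference runs in the opposite direction: given only `docs`, it recovers estimates of `theta` and `phi`.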




Use Metadata to Help Topic Prediction

Improve topic detection
→ Morning times may help to improve the breakfast topic

Describe dependencies: metadata ↔ topics
→ The breakfast topic occurs during morning hours

Usage
- Autocompletion
  → From words to words
- Prediction of search queries
  → From metadata to words
  → From words to metadata

[Figure: the three example topics (Chinese food, Vegan food, Breakfast) with their top words]


Structures of Metadata Spaces

- Nominal
- Ordinal
- Cyclic
- Spherical
- Networked

[Figure: one illustration per structure; the network example has nodes labeled Nejdl, Staab, Kling]


Challenges for Using Metadata for Text Mining

Generalizing the Text Mining Model
Creating a special text mining model for every dataset with its own kind of metadata spaces is impractical
→ we need flexible models!

Efficiency of the Text Mining Model
Rich metadata → complex models → complex inference, slow convergence of samplers
→ analysis of big datasets impossible

Explaining the Result
Importance of metadata
→ learn how to weight metadata
→ exclude irrelevant metadata (improves efficiency!)
Complex dependencies & complex probability functions
→ learned parameters incomprehensible
→ reduced usefulness for data analysis / visualisation
→ no sanity checks on parameters


Topic Models for Arbitrary Metadata

Predict document-topic distributions using metadata
→ Gaussian Process Regression Topic Model (Agovic & Banerjee, 2012)
→ Dirichlet-Multinomial Regression Topic Model (Mimno & McCallum, 2012)
→ Structural Topic Model (logistic normal regression) (Roberts et al., 2013)

Regression input: metadata
Regression output: topic distribution
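To illustrate "regression input: metadata, regression output: topic distribution", here is a minimal sketch of the Dirichlet-multinomial regression prior, in which a document's Dirichlet parameters are a log-linear function of its metadata features. The feature layout (`bias`, `is_morning`), the two topics, and the weight values are hypothetical and chosen only to mirror the breakfast example.

```python
import numpy as np

def dmr_topic_prior(x, lam):
    """alpha[d, k] = exp(x[d] . lam[k]): metadata features -> Dirichlet parameters."""
    return np.exp(x @ lam.T)

def expected_topic_distribution(alpha):
    """Mean of a Dirichlet with parameters alpha (each row sums to 1)."""
    return alpha / alpha.sum(axis=1, keepdims=True)

# Hypothetical features: [bias, is_morning]; two topics: [breakfast, other].
x = np.array([[1.0, 1.0],     # a morning document
              [1.0, 0.0]])    # an evening document
lam = np.array([[0.0, 2.0],   # breakfast topic: boosted by is_morning
                [0.0, 0.0]])  # other topic: unaffected by the metadata
alpha = dmr_topic_prior(x, lam)
pred = expected_topic_distribution(alpha)
```

With these assumed weights the morning document gets a higher prior probability for the breakfast topic than the evening document, whose prior stays uniform.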



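[Figure: graphical model with Dirichlet-multinomial regression on the metadata]

[Figure: graphical model with Gaussian process regression on the metadata]

[Figure: graphical model with logistic normal regression on the metadata]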




Alternating inference:
1) Estimate topics
2) Estimate regression model
3) Use prediction for re-estimating topics
4) Re-estimate regression model with new topics
5) ...

→ slow convergence



- Applicable to a wide range of metadata!
- Estimation of regression parameters is relatively expensive
- Learned parameters have no natural interpretation
- Alternating process of parameter estimation is expensive



Dirichlet-multinomial and logistic-normal regression do not support complex input data (e.g. geographical data, temporal cycles, ...)

Gaussian process regression topic models are very powerful with the right kernel function...
...but require expert knowledge for kernel selection and efficient inference!


Hierarchical Multi-Dirichlet Process Topic Models

The Idea


Topic Prediction

[Figure: topic probability (y-axis) of documents, e.g. emails, plotted against metadata (x-axis, e.g. time)]

Dirichlet-Multinomial Regression

[Figure: topic probability over the metadata, predicted by Dirichlet-multinomial regression]

Gaussian Process Regression

[Figure: topic probability over the metadata, predicted by Gaussian process regression]

Cluster-Based Prediction

[Figure sequence: topic probability over the metadata, predicted per cluster of documents]


Idea

Two-step model:
1) Cluster similar documents
2) Learn topics for clusters and documents simultaneously

▪ Learn topic distributions of document clusters
▪ Use cluster-topic distributions for topic prediction
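As a rough illustration of cluster-topic distributions, the sketch below averages the topic distributions of the documents in each cluster. It is a simplification: it assumes the document-topic distributions are already known, whereas the model described here learns cluster and document topics simultaneously. The function name and the uniform fallback for empty clusters are assumptions.

```python
import numpy as np

def cluster_topic_distributions(doc_topics, cluster_ids, n_clusters):
    """Average the document-topic distributions of each cluster's documents."""
    K = doc_topics.shape[1]
    sums = np.zeros((n_clusters, K))
    np.add.at(sums, cluster_ids, doc_topics)            # per-cluster sums
    counts = np.bincount(cluster_ids, minlength=n_clusters)
    # Assumption: empty clusters fall back to a uniform distribution.
    out = np.full((n_clusters, K), 1.0 / K)
    nonempty = counts > 0
    out[nonempty] = sums[nonempty] / counts[nonempty, None]
    return out

doc_topics = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])
cluster_ids = np.array([0, 0, 1, 1])   # e.g. two time intervals
cluster_topics = cluster_topic_distributions(doc_topics, cluster_ids, n_clusters=3)
```

The row for a document's cluster then serves as the topic prediction for any new document that falls into the same cluster.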


Performance, Complex Metadata

Cluster documents for each metadata

+ nominal, ordinal, cyclic, spherical data
+ any data which can be clustered!
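To show that spherical metadata can be clustered like any other, here is a minimal sketch that assigns geolocations to the nearest of some given cluster centroids on the unit sphere. It is one assignment step in the style of spherical k-means, not necessarily the clustering used in the talk; the coordinates and centroids are made-up examples.

```python
import numpy as np

def to_unit_vectors(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D unit vectors."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

def assign_spherical_clusters(points, centroids):
    """Nearest centroid on the sphere = largest dot product (cosine similarity)."""
    return np.argmax(points @ centroids.T, axis=1)

# Hypothetical document locations (lat, lon) and two hypothetical centroids.
docs = to_unit_vectors([50.36, 50.91, 52.52], [7.59, -1.40, 13.40])
centroids = to_unit_vectors([51.0, 52.0], [10.0, -1.0])
labels = assign_spherical_clusters(docs, centroids)
```

Working with unit vectors avoids the wrap-around problems that raw longitude values would cause for an ordinary Euclidean clustering.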



Mixture of Metadata Predictions

Metadata clusters are associated with topics

[Figure: metadata clusters with example topic labels: German Beer, Party]

The topic prediction for a single document is a mixture of the predictions of its metadata clusters
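The mixture can be sketched as a weighted sum of the topic distributions of the clusters a document belongs to, one cluster per metadata. The metadata names, cluster-topic values, and mixing weights below are hypothetical, and the weights are assumed to sum to 1.

```python
import numpy as np

def predict_document_topics(cluster_topics, doc_clusters, weights):
    """Mix the topic distributions of a document's metadata clusters.

    cluster_topics: dict metadata name -> array of shape (n_clusters, K)
    doc_clusters:   dict metadata name -> cluster index of this document
    weights:        dict metadata name -> mixing weight (assumed to sum to 1)
    """
    K = next(iter(cluster_topics.values())).shape[1]
    mix = np.zeros(K)
    for name, w in weights.items():
        mix += w * cluster_topics[name][doc_clusters[name]]
    return mix

# Hypothetical cluster-topic distributions for two metadata and two topics.
cluster_topics = {
    "time":  np.array([[0.9, 0.1], [0.2, 0.8]]),
    "place": np.array([[0.5, 0.5], [0.1, 0.9]]),
}
prediction = predict_document_topics(
    cluster_topics,
    doc_clusters={"time": 1, "place": 0},
    weights={"time": 0.75, "place": 0.25},
)
```

Because each component is a probability distribution and the weights sum to 1, the result is again a distribution over topics.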


Smoothing of HMDP


Cluster-Based Prediction vs. Outliers and Noisy Data

[Figure: cluster-based prediction of topic probability in the presence of outliers and noisy data]



Adjacency Smoothing

Naive approach: Smoothed value of a cluster is the mean of the cluster and its adjacent clusters

Repeat n times
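The naive approach can be sketched directly: each cluster's value becomes the mean over itself and its adjacent clusters, and the step is repeated n times. The adjacency dictionary and the example values are assumptions for illustration.

```python
import numpy as np

def adjacency_smooth(values, neighbors, n_iter=1):
    """Replace each cluster's value by the mean over itself and its neighbors."""
    values = np.asarray(values, dtype=float)
    for _ in range(n_iter):
        # Every new value is computed from the values of the previous round.
        values = np.array([
            np.mean(values[[i] + list(neighbors[i])], axis=0)
            for i in range(len(values))
        ])
    return values

# Ordinal example: five clusters on a line; cluster 2 holds an outlier.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
smoothed = adjacency_smooth([0.0, 0.0, 1.0, 0.0, 0.0], neighbors, n_iter=1)
```

The same function works for rows of topic distributions (a 2D array) and for other metadata structures: for cyclic metadata the first and last cluster are listed as neighbors of each other, and for networked metadata the neighbor lists follow the graph edges.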


Smoothing topics associated with metadata clusters

Documents receive topics from their own and neighboring metadata clusters

[Figure: smoothing illustrated for nominal, ordinal, cyclic, spherical, and networked metadata]


Smoothing

Smoothing strength is learned during inference
- Similar clusters → stronger smoothing
- Dissimilar clusters → weaker smoothing

Alternatively, the smoothing strength can be predefined by the user


Metadata Weighting in HMDPs


Feature Weighting

One variable, η, governs the influence of a metadata's clusters on documents

If η < threshold, ignore that metadata.


Metadata Weighting

Importance of metadata is learned during inference, answering the question:

What percentage of the topics is explained by a given metadata? (e.g. time, geographical coordinates, ...)

→ Interpretable parameter!
→ Metadata with a low weight can be removed during inference
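A minimal sketch of the pruning rule: metadata whose learned weight falls below a threshold are dropped. The threshold value, the renormalization of the remaining weights, and the example weights are assumptions for illustration, not results from the talk.

```python
def prune_metadata(weights, threshold=0.05):
    """Drop metadata whose weight is below the threshold and renormalize the rest."""
    kept = {name: w for name, w in weights.items() if w >= threshold}
    if not kept:
        return {}
    total = sum(kept.values())
    # Assumption: remaining weights are rescaled so that they sum to 1 again.
    return {name: w / total for name, w in kept.items()}

# Hypothetical learned weights (share of topics explained by each metadata).
weights = {"timeline": 0.50, "daily cycle": 0.30,
           "mailing list": 0.18, "yearly cycle": 0.02}
pruned = prune_metadata(weights)
```

Dropping a metadata during inference also removes its clusters from every later update, which is where the efficiency gain comes from.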


Example Application


Dataset

Linux Kernel Mailing List: 3,400,000 emails with timestamps and mailing list ID

Metadata:
- Timeline
- Yearly cycle
- Weekly cycle
- Daily cycle
- Mailing list
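The four time-based metadata can all be derived from a single email timestamp. The sketch below shows one possible mapping; the granularity (year, month, weekday, hour), the use of UTC, and the key names are assumptions, since the talk does not specify how the cycles were discretized.

```python
from datetime import datetime, timezone

def time_metadata(ts):
    """Map a UNIX timestamp to four time-based metadata values (assumed granularity)."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "timeline": t.year,            # ordinal: position on the overall timeline
        "yearly_cycle": t.month,       # cyclic: 1..12, December is adjacent to January
        "weekly_cycle": t.weekday(),   # cyclic: 0 = Monday .. 6 = Sunday
        "daily_cycle": t.hour,         # cyclic: 0..23, hour 23 is adjacent to hour 0
    }

meta = time_metadata(1_000_000_000)   # 2001-09-09 01:46:40 UTC, a Sunday
```

Each value can then be used as a cluster index for its metadata, with the mailing list ID serving as a fifth, nominal metadata.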


Topics

[Figure: example topics, grouped into professional topics and hobbyist topics]

[Figure: learned metadata weighting; metadata with a low weight can be removed during inference]


Efficient Inference in HMDP


Hierarchical Multi-Dirichlet Process Topic Model (HMDP)

[Figure: graphical model of the HMDP, with cluster-topic distributions attached to the metadata]

Inference: nearly completely collapsed inference!

We only need to learn
- Global topic distribution
- Topic assignments to words
- Dirichlet parameters

Approximations:
- Variational
- Practical
- Stochastic

→ low memory consumption
→ online inference
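Why a stochastic approximation gives low memory consumption and online inference can be seen from the generic update used in stochastic variational inference: a global statistic is blended with an estimate computed from one mini-batch, using a decaying step size. This is a generic sketch, not the HMDP's specific update equations; the parameter names `tau0` and `kappa` and their values are assumptions.

```python
import numpy as np

def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t)^(-kappa), with kappa in (0.5, 1]."""
    return (tau0 + t) ** (-kappa)

def stochastic_update(global_stat, batch_stat, batch_size, corpus_size, t):
    """Blend the running global statistic with a rescaled mini-batch estimate."""
    rho = step_size(t)
    # Scale the mini-batch statistic up to the size of the whole corpus.
    estimate = batch_stat * (corpus_size / batch_size)
    return (1.0 - rho) * global_stat + rho * estimate

# Toy example: topic counts from a mini-batch of 2 documents out of 10.
updated = stochastic_update(np.array([10.0, 10.0]), np.array([1.0, 3.0]),
                            batch_size=2, corpus_size=10, t=0)
```

Only the current mini-batch and the global statistic need to be held in memory, and new documents can be processed as they arrive.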


Parameters of HMDP

Cluster-topic distributions:
How many documents of a cluster contain topic x?

Metadata weights:
How many of the topics of documents are explained by metadata x?

Dirichlet process scaling parameters:
How many pseudo-counts do we add to the topic distributions?
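The pseudo-count reading of a scaling parameter can be illustrated with the posterior mean of a Dirichlet-smoothed distribution: the scaling parameter says how many pseudo-counts, spread according to a base distribution, are added to the observed topic counts. The names `base` and `scale` and the example numbers are assumptions for illustration.

```python
import numpy as np

def smoothed_topic_distribution(counts, base, scale):
    """Posterior mean under a Dirichlet prior with parameters scale * base.

    counts: observed topic counts; base: prior topic distribution (sums to 1);
    scale: total number of pseudo-counts added.
    """
    counts = np.asarray(counts, dtype=float)
    base = np.asarray(base, dtype=float)
    return (counts + scale * base) / (counts.sum() + scale)

# 10 observed topic assignments plus 3 pseudo-counts from a uniform base.
dist = smoothed_topic_distribution([8, 2, 0], base=[1/3, 1/3, 1/3], scale=3.0)
```

A large scaling parameter pulls the distribution toward the base distribution, while a small one lets the observed counts dominate.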


Properties of HMDP

- Interpretable parameters
- Simultaneous inference of topics and metadata-topic dependencies
- Efficient online inference


Comparison of Topic Models for Arbitrary Metadata


Comparison

Gaussian Process Topic Model
The “perfect” model:

- Can cope with arbitrary metadata
- Models dependencies between metadata
- Parameter learning is very expensive
- Kernel selection and inference require expert knowledge
- Parameters of Gaussian processes are hard to interpret


Comparison

Multinomial Regression Topic Model
The “straightforward” model:

- Can cope with many metadata
- Parameter learning is cheaper than for Gaussian processes but still expensive (due to alternating inference and repeated distance calculations)
- Cannot cope with complex metadata (e.g. geographical, cyclic, ...)
- Does not model dependencies between metadata
- Regression weights of Dirichlet-multinomial regression are hard to interpret


Comparison

Hierarchical Multi-Dirichlet Process Topic Model
The “fast” model:

- Can cope with arbitrary metadata
- Fast inference (simultaneously for topics and topic predictions)
- All parameters have natural interpretations as probabilities or pseudo-counts
- Requires a (simple) pre-clustering of documents
- Does not model dependencies between metadata


THANK YOU FOR YOUR ATTENTION!