
TRANSCRIPT

  • Discovering Topics from Unstructured Text

    Deep Tech Summit, NPC
    C. Bhattacharyya

    Machine Learning Lab, Department of CSA, IISc

    26th Oct, 2016

  • Information retrieval from Unstructured text

    What is IR? (Manning, Raghavan & Schütze, 2008)
    Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


  • Challenges in handling Unstructured Text Corpora

    How do we build automatic indexing for large corpora?
    NLP-based methodologies will not scale.

  • What are Topics

    run, inning, hit, season, game


  • What are Topics

    Sport     Cooking      Healthcare   Computers
    run       cup          patient      computer
    inning    minutes      drug         software
    hit       add          doctor       system
    season    tablespoon   cancer       microsoft
    game      oil          medical      company

  • Models for discovering themes

    Topic models attempt to discover themes in document collections.
    Themes can be used for annotating documents.
    Can be useful for organizing and searching large document corpora.
    Do not require supervision.

  • Visualizing Topics: Browsing Wikipedia

    Wikipedia topics: Allison Chaney's TMVE

    https://github.com/ajbc/tmve-original

  • Outline

    What are topics

    Latent Semantic Indexing

    Probabilistic Topic Models: LDA

    Learning Topics from a finite number of Samples


  • Information Retrieval

    Corpus: a collection of documents. Document: a collection of words.

    IR revisited: given a document, find similar documents in a corpus.

  • Corpus is a matrix


  • SMART Information retrieval system

    Pioneered by G. Salton¹ in 1975. Given a query q, find the closest documents.

    score(q, d) = (A_q · A_d) / (||A_q|| ||A_d||)

    Representations and scoring systems were developed.
    ¹G. Salton, A. Wong, and C. S. Yang (1975), A Vector Space Model for Automatic Indexing, Communications of the ACM
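    A minimal sketch of this vector-space scoring on a toy term-document count matrix; the vocabulary, documents, and query below are illustrative, not from the talk.

        import numpy as np

        # Toy term-document matrix A: rows = words, columns = documents (raw counts).
        # Column A[:, j] is the vector-space representation A_d of document j.
        A = np.array([
            [2, 0, 1],   # "run"
            [1, 0, 0],   # "inning"
            [0, 3, 0],   # "drug"
            [0, 1, 2],   # "software"
        ], dtype=float)

        def cosine_score(q, d):
            # score(q, d) = (A_q . A_d) / (||A_q|| ||A_d||)
            return q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

        # Query represented in the same word space (here: one occurrence of "run").
        q = np.array([1.0, 0.0, 0.0, 0.0])
        scores = [cosine_score(q, A[:, j]) for j in range(A.shape[1])]
        print(np.argsort(scores)[::-1])   # documents ranked by similarity; document 0 ranks first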

  • Keyword search

    Query: Who won the Turing award in 2015?

    Retrieval given query q:

    A_ij = 1 if the i-th word is present in the j-th document, 0 otherwise

    Return document j if A_j^T q is high.
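    A small sketch of this keyword (Boolean incidence) retrieval; the toy documents and query are illustrative.

        import numpy as np

        docs = [
            "whitfield diffie and martin hellman won the turing award in 2015",
            "the turing award is given by the acm",
            "baseball season opens with a big game",
        ]
        vocab = sorted({w for d in docs for w in d.split()})
        index = {w: i for i, w in enumerate(vocab)}

        # Incidence matrix: A[i, j] = 1 if word i is present in document j, 0 otherwise.
        A = np.zeros((len(vocab), len(docs)))
        for j, d in enumerate(docs):
            for w in set(d.split()):
                A[index[w], j] = 1.0

        query = "who won the turing award in 2015"
        q = np.zeros(len(vocab))
        for w in query.split():
            if w in index:
                q[index[w]] = 1.0

        # Return documents j whose column has a high overlap A_j^T q with the query.
        scores = A.T @ q
        print(np.argsort(scores)[::-1])   # document 0 should rank first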

  • Polysemy, Synonymy, Term Dependence

    Polysemy: words which have more than one meaning, e.g. cricket.
    Polysemous words in queries can reduce precision.

    Synonymy: different words have the same meaning, e.g. automobile and car.
    Queries with synonymous words can be a problem.

    Term Dependence: terms are not orthogonal; this misses themes.
    Often certain groups of words occur together.


  • Latent Semantic Indexing¹

    d: number of words, n: number of documents

    SVD: Singular Value Decomposition

    A = [A_1, . . . , A_n]

    A_{d×n} = M_{d×r} D_{r×r} S_{r×n}

    ¹Deerwester, S., et al., Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.
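    A minimal numerical sketch of the LSI factorization A ≈ M D S via a truncated SVD; the random count matrix and the rank r are illustrative choices, not part of the talk.

        import numpy as np

        rng = np.random.default_rng(0)
        d, n, r = 1000, 200, 10          # words, documents, number of latent dimensions
        A = rng.poisson(0.05, size=(d, n)).astype(float)   # toy word-count matrix

        # Full SVD, then keep the top-r singular triplets: A ~= M D S
        U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
        M = U[:, :r]                     # d x r  (word-to-concept directions)
        D = np.diag(sigma[:r])           # r x r  (singular values)
        S = Vt[:r, :]                    # r x n  (concept representation of each document)

        print(np.linalg.norm(A - M @ D @ S) / np.linalg.norm(A))   # relative rank-r error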

  • Retrieving documents with LSI

    A_i = M D S_i

    q~ = D^{-1} M^T q

    sim_LSI(q, A_i) = q~^T S_i

    LSI outperformed keyword search: sim_LSI(q, A_i) outperformed q^T A_i.

  • Retrieving documents

    Project the query onto the columns of M and find documents closest to the projected query.

    Works well, but why? Maybe M encodes semantics.
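    A self-contained sketch of LSI retrieval in the same spirit as the SVD sketch above: project the query into the latent space, q~ = D^{-1} M^T q, and rank documents by q~^T S_i. The data and the choice of query are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.poisson(0.05, size=(1000, 200)).astype(float)   # toy word-document count matrix
        U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
        r = 10
        M, D, S = U[:, :r], np.diag(sigma[:r]), Vt[:r, :]        # A ~= M D S as on the previous slide

        q = A[:, 0] + A[:, 1]                     # illustrative query vector in word space
        q_proj = np.linalg.inv(D) @ (M.T @ q)     # project the query: q~ = D^{-1} M^T q
        sims = q_proj @ S                         # sim_LSI(q, A_i) = q~^T S_i for each document i
        print(np.argsort(sims)[::-1][:5])         # documents 0 and 1 should rank near the top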

  • LSI: A probabilistic analysis (Papadimitriou et al. 2000)

    Each document has only one topic.
    Each topic, a column of M, has some primary words.
    Probability mass on the primary words is very high.
    Could mathematically explain the superior performance of LSI.

  • What are Topics

    A topic is a probability distribution over words.

    Each document: m i.i.d. draws from a topic.

  • LSI and Information retrieval

    When does LSI work? If the corpus is pure and each topic has some primary words, then

    S_i^T S_j ≥ c whenever S_i and S_j share the same topic.

    Primary words of a topic: a group of words with a significant fraction of the probability mass within a topic.
    They should be disjoint across topics, e.g. run, inning, hit, season, game.

  • LSI is not the answer

    Topics: Computer Science, Arts

  • Outline

    What are topics

    Latent Semantic Indexing

    Probabilistic Topic Models: LDA

    Learning Topics from a finite number of Samples

  • Probabilistic Topic Models: Latent Dirichlet Allocation (LDA)

    Unsupervised corpus analysis.
    Generative model for documents.
    Topic defined by a p.m.f. over words.
    Learn topics inherent in the corpus.

  • LDA: Generative model

    Document: p.m.f. over topics, θ

    Topic: p.m.f. over words, indexed by z

    Process: pick θ, then a topic z, and then choose a word from topic z.

    Generative example: θ over {sports, cooking, movies}; z: sport; w: runs
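    A minimal sketch of the generative process just described: draw the document's topic proportions θ from a Dirichlet, then for each word draw a topic z from θ and a word from that topic. The vocabulary, topic distributions, and hyperparameters are made up for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        vocab = ["run", "inning", "hit", "cup", "oil", "doctor", "drug", "software"]

        # Topics: each row is a p.m.f. over the vocabulary (illustrative numbers).
        topics = np.array([
            [0.35, 0.25, 0.25, 0.05, 0.05, 0.02, 0.02, 0.01],   # "sport"
            [0.02, 0.02, 0.02, 0.40, 0.40, 0.05, 0.05, 0.04],   # "cooking"
            [0.02, 0.02, 0.02, 0.04, 0.05, 0.40, 0.40, 0.05],   # "healthcare"
        ])

        alpha, m = 0.5, 20                                       # Dirichlet parameter, words per document
        theta = rng.dirichlet(alpha * np.ones(len(topics)))      # document's p.m.f. over topics
        doc = []
        for _ in range(m):
            z = rng.choice(len(topics), p=theta)                 # pick a topic z ~ theta
            w = rng.choice(len(vocab), p=topics[z])              # pick a word from topic z
            doc.append(vocab[w])
        print(theta.round(2), doc)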

  • Example: Dynamic Topic Model of Science

    75-topic dynamic topic model of the journal Science (1880-2002).

    Words in topics evolve over time.

    Source: http://topics.cs.princeton.edu/Science/

  • Topic Model of Science: Example Topics

  • Dynamic Topic Model of Science: Example I

  • Dynamic Topic Model of Science: Example II

  • Resource scarce languages: Multilingual topics

    Training: English-Hindi-Bengali Wikipedia, 3.3K document triplets.
    Test: EN-HI-BN news from FIRE; EN 14K, HI 15K, BN 12K articles.

    Example topic, top words per language (Hindi and Bengali words shown by their English glosses):
    English: film, films, award, disney, awards, hitchcock, simpsons, chaplin, movie, academy
    Hindi: chaplin, film, the, jerry, tom, film, pitt, best, actor, and
    Bengali: film, prize, do, film, one, the, him, cyrus, film

  • Insights into LDA

    LDA works well for large documents and big corpora.

    Applies to dyadic data: videos, software code, cross-lingual retrieval.

  • When does LDA work?

    Theorem (Tang et al. 2014¹)
    W.h.p., if log n ≤ m, then d(Ĝ, G) ≤ C ε, where

    ε = log n / n + log m / m + log m / n

    n = number of documents, m = number of words in a document.

    n very large ⇒ ε ≈ log m / m: not good for short messages.
    m very large ⇒ ε ≈ log n / n: not good for a small corpus.

    ¹Jian Tang et al. (2014), Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis, ICML 2014

  • LDA: Observations

    Inference is NP-hard. Learning the parameters from a corpus is also NP-hard.
    Requires MCMC techniques or variational techniques.
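    In practice one fits LDA with such approximate inference. A hedged sketch using the gensim library's variational LDA (gensim and the toy corpus are my illustrative choices, not part of the talk).

        from gensim import corpora, models

        texts = [
            ["run", "inning", "hit", "season", "game"],
            ["cup", "minutes", "add", "tablespoon", "oil"],
            ["patient", "drug", "doctor", "cancer", "medical"],
            ["run", "hit", "game", "season", "inning"],
        ]

        dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
        bow_corpus = [dictionary.doc2bow(t) for t in texts]    # each doc as (word_id, count) pairs

        # Variational-Bayes LDA; num_topics and passes are illustrative settings.
        lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)
        for topic_id, words in lda.print_topics(num_words=5):
            print(topic_id, words)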

  • Topic Simplex: Three topics can be viewed as a triangle in 2-D.

  • Topic Simplex: Documents put weights on the vertices of the triangle.

  • Topic Simplex: Documents are points inside the triangle!

  • Outline

    What are topics

    Latent Semantic Indexing

    Probabilistic Topic Models: LDA

    Learning Topics from a finite number of Samples

  • General model for probabilistic topic models

    Let each column of M be a topic: a probability distribution over words.

    Randomly choose w_ℓ, a weight for each topic ℓ; the weights should sum to 1.

    Sample m words from Σ_{ℓ=1}^{k} M_{·,ℓ} w_ℓ to create a document.
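    A minimal sketch of this generative view: each column of M is a topic, each document draws topic weights summing to 1, and its m words are i.i.d. samples from the mixture M w. The symbols mirror the slide; the concrete sizes and Dirichlet priors are illustrative.

        import numpy as np

        rng = np.random.default_rng(1)
        d, k, s, m = 50, 3, 200, 100                     # vocabulary, topics, documents, words/doc

        M = rng.dirichlet(np.ones(d) * 0.1, size=k).T    # d x k: each column a p.m.f. over words
        W = rng.dirichlet(np.ones(k), size=s).T          # k x s: each column sums to 1 (topic weights)

        A = np.zeros((d, s))
        for j in range(s):
            mixture = M @ W[:, j]                         # word distribution of document j
            words = rng.choice(d, size=m, p=mixture)      # m i.i.d. word draws
            A[:, j] = np.bincount(words, minlength=d) / m # empirical word frequencies

        print(np.abs(A - M @ W).mean())                   # A concentrates around M W as m grows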

  • Fit multiple topics to a single document, provably

    Question: How many documents do I need to recover M from A?

    A recent breakthrough (Arora et al. 2012¹) gave guarantees: polynomial-time algorithms.

    Separability: each topic t has an anchor word w with M_{w,t} ≥ p₀ and M_{w,t'} = 0 for every other topic t'.

    ¹Arora et al., Learning Topic Models: Going beyond SVD, FOCS 2012
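    A small sketch of what the separability assumption asks of the topic matrix M: every topic must own an anchor word that has probability at least p₀ in that topic and probability zero in every other topic. The helper below simply checks this property on a given M; the toy matrix and names are illustrative.

        import numpy as np

        def find_anchor_words(M, p0):
            # For each topic t, return a word index that anchors t, or None if there is no anchor.
            d, k = M.shape
            anchors = []
            for t in range(k):
                others = np.delete(M, t, axis=1)                              # columns of all other topics
                candidates = np.where((M[:, t] >= p0) & np.all(others == 0, axis=1))[0]
                anchors.append(int(candidates[0]) if len(candidates) else None)
            return anchors

        # Toy topic matrix: word 0 anchors topic 0, word 3 anchors topic 1.
        M = np.array([
            [0.4, 0.0],
            [0.3, 0.3],
            [0.1, 0.3],
            [0.0, 0.2],
            [0.2, 0.2],
        ])
        print(find_anchor_words(M, p0=0.1))   # -> [0, 3]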

  • Fitting topics using separability

    Theorem
    If all topics have anchor words, there is a polynomial-time algorithm that returns an M̂ such that, with high probability,

    Σ_{ℓ=1}^{k} Σ_{i=1}^{d} |M_{iℓ} - M̂_{iℓ}| ≤ ε, provided

    s ≥ max{ O( d² k⁶ log d / (a⁴ γ² p₀⁶ ε² m) ), O( k⁴ / (γ² a² ε²) ) },

    where γ is the condition number of E(WWᵀ), a is the minimum expected weight of a topic, and m is the number of words in each document.

  • Fitting topics using separability

    s ≥ max{ O( d² k⁶ log d / (a⁴ γ² p₀⁶ ε² m) ), O( k⁴ / (γ² a² ε²) ) }

    The dependence of s on the parameter p₀ is 1/p₀⁶.
    For the topic baseball, the word "run" may be an anchor word with p₀ = 0.1.
    Then the requirement is that every 10th word in a document on this topic is "run" (too strong).
    More realistic to ask that a set of words like run, hit, score together have frequency 0.1.

  • Our Assumptions: Dominant Topics

    Dominant Admixture assumption:
    Every document has a dominant topic: one topic has weight significantly higher than the others.
    For every topic, there is a small fraction of documents which are nearly purely on that topic.

    Formally, let α, β, ρ, δ, ε₀ be non-negative reals satisfying mild constraints spelled out in the paper.

    For j ∈ [s], document j has a dominant topic ℓ(j), s.t. W_{ℓ(j),j} ≥ α and W_{ℓ,j} ≤ β for ℓ ≠ ℓ(j).

    For each topic ℓ, there are at least ε₀ w₀ s documents for which topic ℓ has weight at least 1 - δ.
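    A small sketch of how the dominant-topic part of this assumption can be checked empirically on a document-topic weight matrix W (k x s), as in the empirical table later in the talk: the fraction of documents whose largest topic weight is at least a threshold α. The random W here is purely illustrative.

        import numpy as np

        def dominant_topic_fraction(W, alpha):
            # Fraction of documents (columns of W) whose largest topic weight is at least alpha.
            return float((W.max(axis=0) >= alpha).mean())

        rng = np.random.default_rng(0)
        W = rng.dirichlet(0.1 * np.ones(50), size=30000).T    # toy weights: k=50 topics, s=30000 documents
        for alpha in (0.4, 0.8, 0.9):
            print(alpha, round(dominant_topic_fraction(W, alpha), 3))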

  • Our Assumptions: Catchwords

    Catchwords of a topic: a group of words such that
    each word occurs strictly more frequently in the topic than in other topics, and
    together they have high frequency.

    Formally: there are disjoint sets of words S_ℓ, ℓ ∈ {1, . . . , k}, such that for all i ∈ S_ℓ and all ℓ' ≠ ℓ:

    M_{iℓ} ≥ ρ M_{iℓ'},   Σ_{i ∈ S_ℓ} M_{iℓ} ≥ p₀,   and   m δ² M_{iℓ} ≥ 8 ln(20/w₀).
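    A small sketch, under the definition above, of extracting candidate catchword sets from a topic matrix M: per topic, keep words whose probability beats every other topic by a factor ρ, and accept the set only if it carries at least p₀ mass. The ρ and p₀ values and the random M are illustrative.

        import numpy as np

        def catchwords(M, rho=1.1, p0=0.05):
            # Per topic: indices of words dominating all other topics by a factor rho,
            # kept only if their joint probability mass is at least p0 (else an empty set).
            d, k = M.shape
            out = []
            for t in range(k):
                others = np.delete(M, t, axis=1).max(axis=1)      # strongest competing topic, per word
                S = np.where(M[:, t] >= rho * others)[0]
                out.append(S if M[S, t].sum() >= p0 else np.array([], dtype=int))
            return out

        rng = np.random.default_rng(0)
        M = rng.dirichlet(0.05 * np.ones(2000), size=20).T        # toy topic matrix, d=2000, k=20
        print([len(S) for S in catchwords(M)])                    # catchword set size per topic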

  • Our Results (Bansal et al., NIPS 2014)

    Under the assumptions, the TSVD algorithm succeeds with high probability in finding an M̂ so that

    Σ_{i,ℓ} |M_{iℓ} - M̂_{iℓ}| ≤ O(kδ), provided

    s ≥ Ω( (1/w₀) ( k⁶ m² / (δ² p₀²) + m² k² / (ε₀ δ² p₀) + d / (ε₀ δ²) ) ).

    The dependence of s on w₀, that is Ω(1/w₀), is optimal.
    The dependence of s on d, Ω(d / (ε₀ w₀ δ²)), is optimal.
    For Arora et al., to get a comparable error we would need a quadratic dependence on d.

  • Thresholded SVD-based k-means (TSVD)

    Randomly partition the columns of A into A(1) and A(2).

    Thresholding
    Compute thresholds on A(1): for each word i, let ζ_i be the highest value in {0, 1, 2, . . . , m} such that
    |{j : A(1)_ij > ζ_i/m}| ≥ w₀ s / 2 and |{j : A(1)_ij = ζ_i/m}| ≤ 3 w₀ s.
    Threshold A(2):
    B_ij = ζ_i if A(2)_ij > ζ_i/m and ζ_i ≥ 8 ln(20/w₀); 0 otherwise.

    SVD: Find the best rank-k approximation B(k) to B.

    Identify Dominant Topics
    Project and Cluster: find an (approximately) optimal k-means clustering of the columns of B(k).
    Lloyd's Algorithm: using the clustering found in the previous step as the starting clustering, apply Lloyd's k-means algorithm to the columns of B (B, not B(k)).
    Let R_1, R_2, . . . , R_k be the corresponding k-partition of [s].
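    A heavily simplified sketch of this pipeline on synthetic data: threshold the word-frequency matrix using per-word quantiles learned on one half of the documents, take a rank-k SVD of the thresholded matrix, and cluster documents by dominant topic with k-means followed by Lloyd refinement on the unprojected matrix. The thresholding rule, constants, scikit-learn usage, and data generation are my simplifications of the slide, not the paper's exact procedure.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        d, k, s, m = 500, 5, 2000, 200
        M = rng.dirichlet(0.05 * np.ones(d), size=k).T                 # ground-truth topics (d x k)
        W = rng.dirichlet(0.05 * np.ones(k), size=s).T                 # topic weights per document (k x s)
        counts = np.stack([rng.multinomial(m, M @ W[:, j]) for j in range(s)], axis=1)
        A = counts / m                                                 # word-frequency matrix (d x s)

        # Split documents, learn a per-word threshold on one half, apply it to the other half.
        perm = rng.permutation(s)
        A1, A2 = A[:, perm[: s // 2]], A[:, perm[s // 2:]]
        w0 = 1.0 / k                                                   # rough lower bound on topic frequency
        zeta = np.quantile(A1, 1 - w0 / 2, axis=1, keepdims=True)      # per-word threshold from A(1)
        B = np.where(A2 > zeta, A2, 0.0)                               # thresholded matrix built from A(2)

        # Rank-k SVD of B, k-means on the projected columns, then Lloyd's refinement on B itself.
        U, sig, Vt = np.linalg.svd(B, full_matrices=False)
        Bk = U[:, :k] @ np.diag(sig[:k]) @ Vt[:k, :]
        init = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Bk.T)
        labels = KMeans(n_clusters=k, init=init.cluster_centers_, n_init=1).fit_predict(B.T)
        print(np.bincount(labels))    # documents grouped by (estimated) dominant topic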


  • Thresholded SVD-based k-means (TSVD)

    Identify Catchwords
    For each i, ℓ, compute g(i, ℓ) = the (ε₀ w₀ s / 2)-th highest element of {A(2)_ij : j ∈ R_ℓ}.
    Let J_ℓ = { i : g(i, ℓ) > max( (4/(m δ²)) ln(20/w₀), μ · max_{ℓ' ≠ ℓ} g(i, ℓ') ) },
    where μ is an explicit constant determined by α, β, ρ and δ.

    Find Topic Vectors
    Among all j ∈ [s], find the ε₀ w₀ s / 2 columns with the highest Σ_{i ∈ J_ℓ} A(2)_ij.
    Return the average of these columns A_{·,j} as our approximation M̂_{·,ℓ} to M_{·,ℓ}.
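    In the same simplified spirit as the previous sketch, a sketch of these last two steps: within each document cluster, score every word by a high quantile of its frequency, keep words that clearly beat their score in every other cluster (candidate catchwords), and average the documents with the most catchword mass to estimate the topic vector. The quantile, ρ, and top_frac constants are illustrative, not the paper's.

        import numpy as np

        def estimate_topics(A, labels, k, quantile=0.9, rho=1.5, top_frac=0.05):
            # A: word-frequency matrix (d x s); labels: cluster id per document (length s).
            # Assumes k >= 2 and that every cluster is non-empty.
            d, s = A.shape
            # g[i, l]: high quantile of word i's frequency within cluster l (proxy for the slide's g(i, l)).
            g = np.stack([np.quantile(A[:, labels == l], quantile, axis=1) for l in range(k)], axis=1)
            M_hat = np.zeros((d, k))
            for l in range(k):
                others = np.delete(g, l, axis=1).max(axis=1)
                J = np.where(g[:, l] > rho * others)[0]                # candidate catchwords of topic l
                score = A[J, :].sum(axis=0)                            # per-document catchword mass
                top = np.argsort(score)[::-1][: max(1, int(top_frac * s))]
                M_hat[:, l] = A[:, top].mean(axis=1)                   # average the nearly pure documents
            return M_hat

        # e.g. M_hat = estimate_topics(A2, labels, k) with A2 and labels from the previous sketch.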


  • Why TSVD Works

    Data matrix A (left) and thresholded matrix B (right). Black: non-catchwords, blue: catchwords.

  • Empirical Results: Datasets

    NIPS: 1,500 NIPS full papers
    NYT: random subset of 30,000 documents from the New York Times dataset
    Pubmed: random subset of 30,000 documents from the Pubmed abstracts dataset
    20NG: 13,389 documents from 20NewsGroup

    Baselines
    Recover (Arora et al., 2013): state-of-the-art provable algorithm based on the separability assumption
    Tensor (Anandkumar et al., 2012): state-of-the-art provable algorithm using tensor decomposition

  • Empirical Results: Assumptions

    Corpus   Documents   K    α = 0.4   α = 0.8   α = 0.9
    NIPS     1,500       50   56.6%     10.7%     4.8%
    NYT      30,000      50   63.7%     20.9%     12.7%
    Pubmed   30,000      50   62.2%     20.3%     10.7%
    20NG     13,389      20   74.1%     54.4%     44.3%

    Table: Fraction of documents satisfying the dominant topic assumption.

    Corpus   K    Mean per-topic frequency of CW   % Topics with CW
    NIPS     50   0.05                             95%
    NYT      50   0.11                             100%
    Pubmed   50   0.05                             90%
    20NG     20   0.06                             100%

    Table: Catchwords (CW) assumption with ρ = 1.1, p₀ = 0.25.

  • Empirical Results: L1 Reconstruction Error

    Average improvement over the best of R-KL & Tensor: 30.7%

    Corpus   Documents   Tensor   R-L2    R-KL    TSVD    % Improvement
    NIPS     40,000      0.298    0.342   0.308   0.094   68.5%
    NIPS     60,000      0.296    0.346   0.311   0.089   69.9%
    NIPS     80,000      0.285    0.335   0.303   0.087   69.4%
    NIPS     100,000     0.280    0.344   0.306   0.086   69.3%
    NIPS     150,000     0.320    0.336   0.302   0.084   72.2%
    NIPS     200,000     0.322    0.335   0.301   0.113   62.5%
    Pubmed   40,000      0.379    0.388   0.332   0.326   1.8%
    Pubmed   60,000      0.317    0.372   0.328   0.287   9.5%
    Pubmed   80,000      0.321    0.358   0.320   0.276   13.8%
    Pubmed   100,000     0.304    0.350   0.315   0.276   9.2%
    Pubmed   150,000     0.355    0.344   0.313   0.239   23.6%
    Pubmed   200,000     0.322    0.334   0.309   0.225   27.3%
    20NG     40,000      0.174    0.126   0.120   0.124   -3.3%
    20NG     60,000      0.207    0.114   0.110   0.106   3.6%
    20NG     80,000      0.203    0.110   0.108   0.095   12.0%
    20NG     100,000     0.151    0.103   0.102   0.087   14.7%
    20NG     200,000     0.162    0.096   0.097   0.072   25.8%
    NYT      40,000      0.316    0.214   0.208   0.174   16.3%
    NYT      60,000      0.330    0.205   0.200   0.156   22.0%
    NYT      80,000      0.330    0.198   0.196   0.168   14.3%
    NYT      100,000     0.353    0.198   0.196   0.163   16.8%
    NYT      150,000     0.310    0.192   0.192   0.156   18.8%
    NYT      200,000     0.292    0.189   0.189   0.173   8.5%

  • Empirical Results: L1 Reconstruction Error

    Histogram of L1 error across topics for 40k synthetic documents. On the majority of topics (> 90%) the recovery error for TSVD is significantly smaller.

    [Figure: per-corpus histograms (NIPS, NYT, Pubmed, 20NG) of L1 reconstruction error (x-axis) against number of topics (y-axis) for the R-KL, Tensor and TSVD algorithms.]

  • Empirical Results on Real Data: Perplexity & Topic Coherence

    [Figure: bar charts of perplexity (left) and topic coherence (right) on 20NG, NIPS, NYT and Pubmed for the TSVD, Tensor, R-L2 and R-KL algorithms.]

  • Check out!

    Paper: A provable SVD-based algorithm for learning topics in dominant admixture corpus (NIPS 2014)

    Code: http://mllab.csa.iisc.ernet.in/tsvd/

  • Thank you

    What are topics | Latent Semantic Indexing | Probabilistic Topic Models: LDA | Learning Topics from a finite number of Samples