
TRANSCRIPT

  • Discovering Topics from Unstructured Text

    Deep Tech Summit, NPC
    C. Bhattacharyya

    Machine Learning Lab, Department of CSA, IISc

    26th Oct, 2016

  • Information retrieval from Unstructured text

    What is IR? (Manning, Raghavan & Schütze, 2008)
    Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


  • Challenges in handling Unstructured Text Corpora

    How do we build automatic indexing for large corpora?
    NLP-based methodologies will not scale.

  • What are Topics

    run, inning, hit, season, game


  • What are Topics

    Sport     Cooking      Healthcare   Computers
    run       cup          patient      computer
    inning    minutes      drug         software
    hit       add          doctor       system
    season    tablespoon   cancer       microsoft
    game      oil          medical      company

  • Models for discovering themes

    Topic models attempt to discover themes in document collections.
    Themes can be used for annotating documents.
    Can be useful for organizing and searching large document corpora.
    Do not require supervision.

  • Visualizing Topics: Browsing Wikipedia

    Wikipedia topics: Allison Chaney's TMVE

    https://github.com/ajbc/tmve-original

  • Outline

    What are topics

    Latent Semantic Indexing

    Probabilistic Topic Models: LDA

    Learning Topics from a finite number of Samples


  • Information Retrieval

    Corpus: a collection of documents. Document: a collection of words.

    IR revisited: given a document, find similar documents in a corpus.

  • Corpus is a matrix


  • SMART Information retrieval system

    Pioneered by G. Salton¹ in 1975. Given a query q, find the closest documents.

    score(q, d) = (A_q · A_d) / (||A_q|| ||A_d||)

    Representations and scoring systems were developed.
    ¹G. Salton, A. Wong, and C. S. Yang (1975), A Vector Space Model for Automatic Indexing, Communications of the ACM
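    A minimal sketch of this vector-space scoring on a toy term-document count matrix; the vocabulary, documents, and query below are illustrative, not from the talk.

        import numpy as np

        # Toy term-document matrix A: rows = words, columns = documents (raw counts).
        # Column A[:, j] is the vector-space representation A_d of document j.
        A = np.array([
            [2, 0, 1],   # "run"
            [1, 0, 0],   # "inning"
            [0, 3, 0],   # "drug"
            [0, 1, 2],   # "software"
        ], dtype=float)

        def cosine_score(q, d):
            # score(q, d) = (A_q . A_d) / (||A_q|| ||A_d||)
            return q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

        # Query represented in the same word space (here: one occurrence of "run").
        q = np.array([1.0, 0.0, 0.0, 0.0])
        scores = [cosine_score(q, A[:, j]) for j in range(A.shape[1])]
        print(np.argsort(scores)[::-1])   # documents ranked by similarity; document 0 ranks first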

  • Keyword search

    Query: Who won the Turing award in 2015?

    Retrieval given query q:

    A_ij = 1 if the i-th word is present in the j-th document, 0 otherwise

    Return document j if A_j^T q is high.
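    A small sketch of this keyword (Boolean incidence) retrieval; the toy documents and query are illustrative.

        import numpy as np

        docs = [
            "whitfield diffie and martin hellman won the turing award in 2015",
            "the turing award is given by the acm",
            "baseball season opens with a big game",
        ]
        vocab = sorted({w for d in docs for w in d.split()})
        index = {w: i for i, w in enumerate(vocab)}

        # Incidence matrix: A[i, j] = 1 if word i is present in document j, 0 otherwise.
        A = np.zeros((len(vocab), len(docs)))
        for j, d in enumerate(docs):
            for w in set(d.split()):
                A[index[w], j] = 1.0

        query = "who won the turing award in 2015"
        q = np.zeros(len(vocab))
        for w in query.split():
            if w in index:
                q[index[w]] = 1.0

        # Return documents j whose column has a high overlap A_j^T q with the query.
        scores = A.T @ q
        print(np.argsort(scores)[::-1])   # document 0 should rank first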

  • Polysemy, Synonymy, Term Dependence

    Polysemy: words which have more than one meaning, e.g. cricket.
    Polysemous words in queries can reduce precision.

    Synonymy: different words have the same meaning, e.g. automobile and car.
    Queries with synonymous words can be a problem.

    Term Dependence: terms are not orthogonal; this misses themes.
    Often certain groups of words occur together.


  • Latent Semantic Indexing¹

    d: number of words, n: number of documents

    SVD: Singular Value Decomposition

    A = [A_1, . . . , A_n]

    A_{d×n} = M_{d×r} D_{r×r} S_{r×n}

    ¹Deerwester, S., et al., Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.
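    A minimal numerical sketch of the LSI factorization A ≈ M D S via a truncated SVD; the random count matrix and the rank r are illustrative choices, not part of the talk.

        import numpy as np

        rng = np.random.default_rng(0)
        d, n, r = 1000, 200, 10          # words, documents, number of latent dimensions
        A = rng.poisson(0.05, size=(d, n)).astype(float)   # toy word-count matrix

        # Full SVD, then keep the top-r singular triplets: A ~= M D S
        U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
        M = U[:, :r]                     # d x r  (word-to-concept directions)
        D = np.diag(sigma[:r])           # r x r  (singular values)
        S = Vt[:r, :]                    # r x n  (concept representation of each document)

        print(np.linalg.norm(A - M @ D @ S) / np.linalg.norm(A))   # relative rank-r error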

  • Retrieving documents with LSI

    A_i = M D S_i

    q~ = D^{-1} M^T q

    sim_LSI(q, A_i) = q~^T S_i

    LSI outperformed keyword search: sim_LSI(q, A_i) outperformed q^T A_i.

  • Retrieving documents

    Project the query onto the columns of M and find documents closest to the projected query.

    Works well, but why? Maybe M encodes semantics.
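    A self-contained sketch of LSI retrieval in the same spirit as the SVD sketch above: project the query into the latent space, q~ = D^{-1} M^T q, and rank documents by q~^T S_i. The data and the choice of query are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.poisson(0.05, size=(1000, 200)).astype(float)   # toy word-document count matrix
        U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
        r = 10
        M, D, S = U[:, :r], np.diag(sigma[:r]), Vt[:r, :]        # A ~= M D S as on the previous slide

        q = A[:, 0] + A[:, 1]                     # illustrative query vector in word space
        q_proj = np.linalg.inv(D) @ (M.T @ q)     # project the query: q~ = D^{-1} M^T q
        sims = q_proj @ S                         # sim_LSI(q, A_i) = q~^T S_i for each document i
        print(np.argsort(sims)[::-1][:5])         # documents 0 and 1 should rank near the top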

  • LSI: A probabilistic analysis (Papadimitriou et al. 2000)

    Each document has only one topic.
    Each topic, a column of M, has some primary words.
    Probability mass on the primary words is very high.
    Could mathematically explain the superior performance of LSI.

  • What are Topics

    A topic is a probability distribution over words.

    Each document: m i.i.d. draws from a topic.

  • LSI and Information retrieval

    When does LSI work? If the corpus is pure and each topic has some primary words, then

    S_i^T S_j ≥ c whenever S_i and S_j share the same topic.

    Primary words of a topic: a group of words with a significant fraction of the probability mass within a topic.
    They should be disjoint across topics, e.g. run, inning, hit, season, game.

  • LSI is not the answer

    Topics: Computer Science, Arts

  • Outline

    What are topics

    Latent Semantic Indexing

    Probabilistic Topic Models: LDA

    Learning Topics from a finite number of Samples

  • Probabilistic Topic Models: Latent Dirichlet Allocation (LDA)

    Unsupervised corpus analysis.
    Generative model for documents.
    Topic defined by a p.m.f. over words.
    Learn topics inherent in the corpus.

  • LDA: Generative model

    Document: p.m.f. over topics, θ

    Topic: p.m.f. over words, indexed by z

    Process: pick θ, then a topic z, and then choose a word from topic z.

    Generative example: θ over {sports, cooking, movies}; z: sport; w: runs
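    A minimal sketch of the generative process just described: draw the document's topic proportions θ from a Dirichlet, then for each word draw a topic z from θ and a word from that topic. The vocabulary, topic distributions, and hyperparameters are made up for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        vocab = ["run", "inning", "hit", "cup", "oil", "doctor", "drug", "software"]

        # Topics: each row is a p.m.f. over the vocabulary (illustrative numbers).
        topics = np.array([
            [0.35, 0.25, 0.25, 0.05, 0.05, 0.02, 0.02, 0.01],   # "sport"
            [0.02, 0.02, 0.02, 0.40, 0.40, 0.05, 0.05, 0.04],   # "cooking"
            [0.02, 0.02, 0.02, 0.04, 0.05, 0.40, 0.40, 0.05],   # "healthcare"
        ])

        alpha, m = 0.5, 20                                       # Dirichlet parameter, words per document
        theta = rng.dirichlet(alpha * np.ones(len(topics)))      # document's p.m.f. over topics
        doc = []
        for _ in range(m):
            z = rng.choice(len(topics), p=theta)                 # pick a topic z ~ theta
            w = rng.choice(len(vocab), p=topics[z])              # pick a word from topic z
            doc.append(vocab[w])
        print(theta.round(2), doc)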

  • Example: Dynamic Topic Model of Science

    75-topic dynamic topic model of the journal Science (1880-2002).

    Words in topics evolve over time.

    Source: http://topics.cs.princeton.edu/Science/

  • Topic Model of Science: Example Topics

  • Dynamic Topic Model of Science: Example I

  • Dynamic Topic Model of Science: Example II

  • Resource scarce languages: Multilingual topics

    Training: English-Hindi-Bengali Wikipedia, 3.3K document triplets.
    Test: EN-HI-BN news from FIRE; EN 14K, HI 15K, BN 12K articles.

    Example topic, top words per language (Hindi and Bengali words shown by their English glosses):
    English: film, films, award, disney, awards, hitchcock, simpsons, chaplin, movie, academy
    Hindi: chaplin, film, the, jerry, tom, film, pitt, best, actor, and
    Bengali: film, prize, do, film, one, the, him, cyrus, film

  • Insights into LDA

    LDA works well for large documents and big corpora.

    Applies to dyadic data: videos, software code, cross-lingual retrieval.

  • When does LDA work?

    Theorem (Tang et al. 2014¹)
    W.h.p., if log n ≤ m, then d(Ĝ, G) ≤ C ε, where

    ε = log n / n + log m / m + log m / n

    n = number of documents, m = number of words in a document.

    n very large ⇒ ε ≈ log m / m: not good for short messages.
    m very large ⇒ ε ≈ log n / n: not good for a small corpus.

    ¹Jian Tang et al. (2014), Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis, ICML 2014

  • LDA: Observations

    Inference is NP-hard. Learning the parameters from a corpus is also NP-hard.
    Requires MCMC techniques or variational techniques.
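    In practice one fits LDA with such approximate inference. A hedged sketch using the gensim library's variational LDA (gensim and the toy corpus are my illustrative choices, not part of the talk).

        from gensim import corpora, models

        texts = [
            ["run", "inning", "hit", "season", "game"],
            ["cup", "minutes", "add", "tablespoon", "oil"],
            ["patient", "drug", "doctor", "cancer", "medical"],
            ["run", "hit", "game", "season", "inning"],
        ]

        dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
        bow_corpus = [dictionary.doc2bow(t) for t in texts]    # each doc as (word_id, count) pairs

        # Variational-Bayes LDA; num_topics and passes are illustrative settings.
        lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)
        for topic_id, words in lda.print_topics(num_words=5):
            print(topic_id, words)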

  • Topic Simplex: Three topics can be viewed as a triangle in 2-D.

  • Topic Simplex: Documents put weights on the vertices of the triangle.

  • Topic Simplex: Documents are points inside the triangle!

  • Outline

    What are topics

    Latent Semantic Indexing

    Probabilistic Topic Models: LDA

    Learning Topics from a finite number of Samples

  • General model for probabilistic topic models

    Let each column of M be a topic: a probability distribution over words.

    Randomly choose w_ℓ, a weight for each topic ℓ; the weights should sum to 1.

    Sample m words from Σ_{ℓ=1}^{k} M_{·,ℓ} w_ℓ to create a document.
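    A minimal sketch of this generative view: each column of M is a topic, each document draws topic weights summing to 1, and its m words are i.i.d. samples from the mixture M w. The symbols mirror the slide; the concrete sizes and Dirichlet priors are illustrative.

        import numpy as np

        rng = np.random.default_rng(1)
        d, k, s, m = 50, 3, 200, 100                     # vocabulary, topics, documents, words/doc

        M = rng.dirichlet(np.ones(d) * 0.1, size=k).T    # d x k: each column a p.m.f. over words
        W = rng.dirichlet(np.ones(k), size=s).T          # k x s: each column sums to 1 (topic weights)

        A = np.zeros((d, s))
        for j in range(s):
            mixture = M @ W[:, j]                         # word distribution of document j
            words = rng.choice(d, size=m, p=mixture)      # m i.i.d. word draws
            A[:, j] = np.bincount(words, minlength=d) / m # empirical word frequencies

        print(np.abs(A - M @ W).mean())                   # A concentrates around M W as m grows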

  • Fit multiple topics to a single document, provably

    Question: How many documents do I need to recover M from A?

    A recent breakthrough (Arora et al. 2012¹) gave guarantees: polynomial-time algorithms.

    Separability: each topic t has an anchor word w with M_{w,t} ≥ p₀ and M_{w,t'} = 0 for every other topic t'.

    ¹Arora et al., Learning Topic Models: Going beyond SVD, FOCS 2012
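    A small sketch of what the separability assumption asks of the topic matrix M: every topic must own an anchor word that has probability at least p₀ in that topic and probability zero in every other topic. The helper below simply checks this property on a given M; the toy matrix and names are illustrative.

        import numpy as np

        def find_anchor_words(M, p0):
            # For each topic t, return a word index that anchors t, or None if there is no anchor.
            d, k = M.shape
            anchors = []
            for t in range(k):
                others = np.delete(M, t, axis=1)                              # columns of all other topics
                candidates = np.where((M[:, t] >= p0) & np.all(others == 0, axis=1))[0]
                anchors.append(int(candidates[0]) if len(candidates) else None)
            return anchors

        # Toy topic matrix: word 0 anchors topic 0, word 3 anchors topic 1.
        M = np.array([
            [0.4, 0.0],
            [0.3, 0.3],
            [0.1, 0.3],
            [0.0, 0.2],
            [0.2, 0.2],
        ])
        print(find_anchor_words(M, p0=0.1))   # -> [0, 3]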

  • Fitting topics using separability

    Theorem
    If all topics have anchor words, there is a polynomial-time algorithm that returns an M̂ such that, with high probability,

    Σ_{ℓ=1}^{k} Σ_{i=1}^{d} |M_{iℓ} - M̂_{iℓ}| ≤ ε, provided

    s ≥ max{ O( d² k⁶ log d / (a⁴ γ² p₀⁶ ε² m) ), O( k⁴ / (γ² a² ε²) ) },

    where γ is the condition number of E(WWᵀ), a is the minimum expected weight of a topic, and m is the number of words in each document.

  • Fitting topics using separability

    s ≥ max{ O( d² k⁶ log d / (a⁴ γ² p₀⁶ ε² m) ), O( k⁴ / (γ² a² ε²) ) }

    The dependence of s on the parameter p₀ is 1/p₀⁶.
    For the topic baseball, the word "run" may be an anchor word with p₀ = 0.1.
    Then the requirement is that every 10th word in a document on this topic is "run" (too strong).
    More realistic to ask that a set of words like run, hit, score together have frequency 0.1.

  • Our Assumptions: Dominant Topics

    Dominant Admixture assumption:
    Every document has a dominant topic: one topic has weight significantly higher than the others.
    For every topic, there is a small fraction of documents which are nearly purely on that topic.

    Formally, let α, β, ρ, δ, ε₀ be non-negative reals satisfying mild constraints spelled out in the paper.

    For j ∈ [s], document j has a dominant topic ℓ(j), s.t. W_{ℓ(j),j} ≥ α and W_{ℓ,j} ≤ β for ℓ ≠ ℓ(j).

    For each topic ℓ, there are at least ε₀ w₀ s documents for which topic ℓ has weight at least 1 - δ.
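    A small sketch of how the dominant-topic part of this assumption can be checked empirically on a document-topic weight matrix W (k x s), as in the empirical table later in the talk: the fraction of documents whose largest topic weight is at least a threshold α. The random W here is purely illustrative.

        import numpy as np

        def dominant_topic_fraction(W, alpha):
            # Fraction of documents (columns of W) whose largest topic weight is at least alpha.
            return float((W.max(axis=0) >= alpha).mean())

        rng = np.random.default_rng(0)
        W = rng.dirichlet(0.1 * np.ones(50), size=30000).T    # toy weights: k=50 topics, s=30000 documents
        for alpha in (0.4, 0.8, 0.9):
            print(alpha, round(dominant_topic_fraction(W, alpha), 3))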

  • Our Assumptions: Catchwords

    Catchwords of a topic: a group of words such that
    each word occurs strictly more frequently in the topic than in other topics, and
    together they have high frequency.

    Formally: there are disjoint sets of words S_ℓ, ℓ ∈ {1, . . . , k}, such that for all i ∈ S_ℓ and all ℓ' ≠ ℓ:

    M_{iℓ} ≥ ρ M_{iℓ'},   Σ_{i ∈ S_ℓ} M_{iℓ} ≥ p₀,   and   m δ² M_{iℓ} ≥ 8 ln(20/w₀).
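    A small sketch, under the definition above, of extracting candidate catchword sets from a topic matrix M: per topic, keep words whose probability beats every other topic by a factor ρ, and accept the set only if it carries at least p₀ mass. The ρ and p₀ values and the random M are illustrative.

        import numpy as np

        def catchwords(M, rho=1.1, p0=0.05):
            # Per topic: indices of words dominating all other topics by a factor rho,
            # kept only if their joint probability mass is at least p0 (else an empty set).
            d, k = M.shape
            out = []
            for t in range(k):
                others = np.delete(M, t, axis=1).max(axis=1)      # strongest competing topic, per word
                S = np.where(M[:, t] >= rho * others)[0]
                out.append(S if M[S, t].sum() >= p0 else np.array([], dtype=int))
            return out

        rng = np.random.default_rng(0)
        M = rng.dirichlet(0.05 * np.ones(2000), size=20).T        # toy topic matrix, d=2000, k=20
        print([len(S) for S in catchwords(M)])                    # catchword set size per topic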

  • Our Results (Bansal et al., NIPS 2014)

    Under the assumptions, the TSVD algorithm succeeds with high probability in finding an M̂ so that

    Σ_{i,ℓ} |M_{iℓ} - M̂_{iℓ}| ≤ O(kδ), provided

    s ≥ Ω( (1/w₀) ( k⁶ m² / (δ² p₀²) + m² k² / (ε₀ δ² p₀) + d / (ε₀ δ²) ) ).

    The dependence of s on w₀, that is Ω(1/w₀), is optimal.
    The dependence of s on d, Ω(d / (ε₀ w₀ δ²)), is optimal.
    For Arora et al., to get a comparable error we would need a quadratic dependence on d.

  • Thresholded SVD-based k-means (TSVD)

    Randomly partition the columns of A into A(1) and A(2).

    Thresholding
    Compute thresholds on A(1): for each word i, let ζ_i be the highest value in {0, 1, 2, . . . , m} such that
    |{j : A(1)_ij > ζ_i/m}| ≥ w₀ s / 2 and |{j : A(1)_ij = ζ_i/m}| ≤ 3 w₀ s.
    Threshold A(2):
    B_ij = ζ_i if A(2)_ij > ζ_i/m and ζ_i ≥ 8 ln(20/w₀); 0 otherwise.

    SVD: Find the best rank-k approximation B(k) to B.

    Identify Dominant Topics
    Project and Cluster: find an (approximately) optimal k-means clustering of the columns of B(k).
    Lloyd's Algorithm: using the clustering found in the previous step as the starting clustering, apply Lloyd's k-means algorithm to the columns of B (B, not B(k)).
    Let R_1, R_2, . . . , R_k be the corresponding k-partition of [s].
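    A heavily simplified sketch of this pipeline on synthetic data: threshold the word-frequency matrix using per-word quantiles learned on one half of the documents, take a rank-k SVD of the thresholded matrix, and cluster documents by dominant topic with k-means followed by Lloyd refinement on the unprojected matrix. The thresholding rule, constants, scikit-learn usage, and data generation are my simplifications of the slide, not the paper's exact procedure.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        d, k, s, m = 500, 5, 2000, 200
        M = rng.dirichlet(0.05 * np.ones(d), size=k).T                 # ground-truth topics (d x k)
        W = rng.dirichlet(0.05 * np.ones(k), size=s).T                 # topic weights per document (k x s)
        counts = np.stack([rng.multinomial(m, M @ W[:, j]) for j in range(s)], axis=1)
        A = counts / m                                                 # word-frequency matrix (d x s)

        # Split documents, learn a per-word threshold on one half, apply it to the other half.
        perm = rng.permutation(s)
        A1, A2 = A[:, perm[: s // 2]], A[:, perm[s // 2:]]
        w0 = 1.0 / k                                                   # rough lower bound on topic frequency
        zeta = np.quantile(A1, 1 - w0 / 2, axis=1, keepdims=True)      # per-word threshold from A(1)
        B = np.where(A2 > zeta, A2, 0.0)                               # thresholded matrix built from A(2)

        # Rank-k SVD of B, k-means on the projected columns, then Lloyd's refinement on B itself.
        U, sig, Vt = np.linalg.svd(B, full_matrices=False)
        Bk = U[:, :k] @ np.diag(sig[:k]) @ Vt[:k, :]
        init = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Bk.T)
        labels = KMeans(n_clusters=k, init=init.cluster_centers_, n_init=1).fit_predict(B.T)
        print(np.bincount(labels))    # documents grouped by (estimated) dominant topic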


  • Thresholded SVD-based k-means (TSVD)

    Identify Catchwords
    For each i, ℓ, compute g(i, ℓ) = the (ε₀ w₀ s / 2)-th highest element of {A(2)_ij : j ∈ R_ℓ}.
    Let J_ℓ = { i : g(i, ℓ) > max( (4/(m δ²)) ln(20/w₀), μ · max_{ℓ' ≠ ℓ} g(i, ℓ') ) },
    where μ is an explicit constant determined by α, β, ρ and δ.

    Find Topic Vectors
    Among all j ∈ [s], find the ε₀ w₀ s / 2 columns with the highest Σ_{i ∈ J_ℓ} A(2)_ij.
    Return the average of these columns A_{·,j} as our approximation M̂_{·,ℓ} to M_{·,ℓ}.
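    In the same simplified spirit as the previous sketch, a sketch of these last two steps: within each document cluster, score every word by a high quantile of its frequency, keep words that clearly beat their score in every other cluster (candidate catchwords), and average the documents with the most catchword mass to estimate the topic vector. The quantile, ρ, and top_frac constants are illustrative, not the paper's.

        import numpy as np

        def estimate_topics(A, labels, k, quantile=0.9, rho=1.5, top_frac=0.05):
            # A: word-frequency matrix (d x s); labels: cluster id per document (length s).
            # Assumes k >= 2 and that every cluster is non-empty.
            d, s = A.shape
            # g[i, l]: high quantile of word i's frequency within cluster l (proxy for the slide's g(i, l)).
            g = np.stack([np.quantile(A[:, labels == l], quantile, axis=1) for l in range(k)], axis=1)
            M_hat = np.zeros((d, k))
            for l in range(k):
                others = np.delete(g, l, axis=1).max(axis=1)
                J = np.where(g[:, l] > rho * others)[0]                # candidate catchwords of topic l
                score = A[J, :].sum(axis=0)                            # per-document catchword mass
                top = np.argsort(score)[::-1][: max(1, int(top_frac * s))]
                M_hat[:, l] = A[:, top].mean(axis=1)                   # average the nearly pure documents
            return M_hat

        # e.g. M_hat = estimate_topics(A2, labels, k) with A2 and labels from the previous sketch.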


  • Why TSVD Works

    Data matrix A (left) and thresholded matrix B (right). Black: non-catchwords, blue: catchwords.

  • Empirical Results: Datasets

    NIPS: 1,500 NIPS full papers
    NYT: random subset of 30,000 documents from the New York Times dataset
    Pubmed: random subset of 30,000 documents from the Pubmed abstracts dataset
    20NG: 13,389 documents from 20NewsGroup

    Baselines
    Recover (Arora et al., 2013): state-of-the-art provable algorithm based on the separability assumption
    Tensor (Anandkumar et al., 2012): state-of-the-art provable algorithm using tensor decomposition

  • Empirical Results: Assumptions

    Corpus   Documents   K    α = 0.4   α = 0.8   α = 0.9
    NIPS     1,500       50   56.6%     10.7%     4.8%
    NYT      30,000      50   63.7%     20.9%     12.7%
    Pubmed   30,000      50   62.2%     20.3%     10.7%
    20NG     13,389      20   74.1%     54.4%     44.3%

    Table: Fraction of documents satisfying the dominant topic assumption.

    Corpus   K    Mean per-topic frequency of CW   % Topics with CW
    NIPS     50   0.05                             95%
    NYT      50   0.11                             100%
    Pubmed   50   0.05                             90%
    20NG     20   0.06                             100%

    Table: Catchwords (CW) assumption with ρ = 1.1, p₀ = 0.25.

  • Empirical Results: L1 Reconstruction Error

    Average improvement over the best of R-KL & Tensor: 30.7%

    Corpus   Documents   Tensor   R-L2    R-KL    TSVD    % Improvement
    NIPS     40,000      0.298    0.342   0.308   0.094   68.5%
    NIPS     60,000      0.296    0.346   0.311   0.089   69.9%
    NIPS     80,000      0.285    0.335   0.303   0.087   69.4%
    NIPS     100,000     0.280    0.344   0.306   0.086   69.3%
    NIPS     150,000     0.320    0.336   0.302   0.084   72.2%
    NIPS     200,000     0.322    0.335   0.301   0.113   62.5%
    Pubmed   40,000      0.379    0.388   0.332   0.326   1.8%
    Pubmed   60,000      0.317    0.372   0.328   0.287   9.5%
    Pubmed   80,000      0.321    0.358   0.320   0.276   13.8%
    Pubmed   100,000     0.304    0.350   0.315   0.276   9.2%
    Pubmed   150,000     0.355    0.344   0.313   0.239   23.6%
    Pubmed   200,000     0.322    0.334   0.309   0.225   27.3%
    20NG     40,000      0.174    0.126   0.120   0.124   -3.3%
    20NG     60,000      0.207    0.114   0.110   0.106   3.6%
    20NG     80,000      0.203    0.110   0.108   0.095   12.0%
    20NG     100,000     0.151    0.103   0.102   0.087   14.7%
    20NG     200,000     0.162    0.096   0.097   0.072   25.8%
    NYT      40,000      0.316    0.214   0.208   0.174   16.3%
    NYT      60,000      0.330    0.205   0.200   0.156   22.0%
    NYT      80,000      0.330    0.198   0.196   0.168   14.3%
    NYT      100,000     0.353    0.198   0.196   0.163   16.8%
    NYT      150,000     0.310    0.192   0.192   0.156   18.8%
    NYT      200,000     0.292    0.189   0.189   0.173   8.5%

  • Empirical Results: L1 Reconstruction Error

    Histogram of L1 error across topics for 40k synthetic documents. On the majority of topics (> 90%) the recovery error for TSVD is significantly smaller.

    [Figure: per-corpus histograms (NIPS, NYT, Pubmed, 20NG) of L1 reconstruction error (x-axis) against number of topics (y-axis) for the R-KL, Tensor and TSVD algorithms.]

  • Empirical Results on Real Data: Perplexity & Topic Coherence

    [Figure: bar charts of perplexity (left) and topic coherence (right) on 20NG, NIPS, NYT and Pubmed for the TSVD, Tensor, R-L2 and R-KL algorithms.]

  • Check out!

    Paper: A provable SVD-based algorithm for learning topics in dominant admixture corpus (NIPS 2014)

    Code: http://mllab.csa.iisc.ernet.in/tsvd/

  • Thank you

    What are topics | Latent Semantic Indexing | Probabilistic Topic Models: LDA | Learning Topics from a finite number of Samples