CSC 594 Topics in AI – Text Mining and Analytics
Fall 2015/16
7. Topic Extraction
Text Topics
• Word association, represented by concept links, is useful for understanding the relationships between terms (as concepts).
• The same idea can be applied to understand the association between documents associated with a topic.
Problems with “Term as Topic”
• Using a single term to define a topic is problematic.
  – Lack of expressive power
    • Can only represent simple topics
    • Cannot represent complicated topics
  – Incompleteness in vocabulary coverage
    • Cannot capture variations of vocabulary (e.g. related terms)
  – Ambiguous words
    • Many words have more than one meaning/sense.
Multiple Terms as Topic
• A solution is to use multiple terms to define a topic.
  – Topic = {word1, word2, ...}
  – A weight assigned to each term represents the importance/relevance of the term in the topic.
  – Every document in the corpus can be given a score that represents the strength of association to a topic.
  – A document can contain zero, one or many topics.
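The idea of a topic as a set of weighted terms, with each document scored against each topic, can be sketched as follows. This is only an illustration: the topic names, the term weights, the count-weighted sum used as the score, and the `threshold` value are all assumptions, not taken from the slides.

```python
from collections import Counter

# Hypothetical topics: each maps terms to illustrative weights (not from the slides).
TOPICS = {
    "sports": {"soccer": 0.6, "play": 0.3, "team": 0.1},
    "tablets": {"ipad": 0.7, "screen": 0.2, "app": 0.1},
}


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def topic_scores(text, topics=TOPICS):
    """Score a document per topic as the sum of term weights times term counts."""
    counts = Counter(tokenize(text))
    return {
        name: sum(weight * counts[term] for term, weight in terms.items())
        for name, terms in topics.items()
    }


def assigned_topics(text, threshold=0.25):
    """Return the topics whose score reaches the (assumed) threshold."""
    scores = topic_scores(text)
    return sorted(name for name, score in scores.items() if score >= threshold)


print(assigned_topics("I love iPad."))
print(assigned_topics("Kids play soccer on an iPad app"))
print(assigned_topics("The weather is nice"))
```

Because the assignment is a thresholded score per topic, a document can end up with zero, one or many topics.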
Approach (1): Probabilistic Topic Mining
Coursera, Text Mining and Analytics, ChengXiang Zhai 5
Topic as Word Distribution
Probabilistic Topic Mining
Techniques for Probabilistic Topic Mining
• Several techniques have been used in probabilistic topic mining to extract topics.
  – Maximum Likelihood
  – Bayesian
  – Mixture Model (where parameters are typically estimated using the Expectation Maximization (EM) algorithm)
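As a rough sketch of the mixture-model technique, the code below runs EM for a mixture of two unigram language models: one unknown topic model and one background model. The setup is an assumption made for illustration: the background distribution is taken as known and fixed, the mixing weight `topic_prior` is fixed at 0.5, and the toy tokens and background probabilities are hypothetical.

```python
from collections import Counter


def em_two_component_mixture(tokens, background, topic_prior=0.5, iters=100):
    """Estimate topic word probabilities p(w | topic) with EM.

    Each word occurrence is assumed to be drawn from the topic model with
    probability `topic_prior`, otherwise from the fixed `background` model.
    `background` must give a positive probability to every word in `tokens`.
    """
    counts = Counter(tokens)
    vocab = sorted(counts)
    # Initialize the topic model as a uniform distribution over observed words.
    topic = {w: 1.0 / len(vocab) for w in vocab}

    for _ in range(iters):
        # E-step: posterior probability that an occurrence of w came from the topic.
        from_topic = {}
        for w in vocab:
            p_topic = topic_prior * topic[w]
            p_background = (1.0 - topic_prior) * background[w]
            from_topic[w] = p_topic / (p_topic + p_background)

        # M-step: re-estimate the topic model from the expected topic counts.
        expected = {w: counts[w] * from_topic[w] for w in vocab}
        total = sum(expected.values())
        topic = {w: expected[w] / total for w in vocab}

    return topic


# Hypothetical toy data: "the" is frequent in the document but also in the background.
tokens = "the text mining the text the".split()
background = {"the": 0.8, "text": 0.1, "mining": 0.1}

topic = em_two_component_mixture(tokens, background)
for word, prob in sorted(topic.items(), key=lambda item: -item[1]):
    print(f"{word}: {prob:.3f}")
```

In this toy case the background model absorbs much of the common word "the", so the estimated topic gives it less probability than its relative frequency in the document.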
Mixture Model for Topic Extraction (1)
Mixture Model for Topic Extraction (2)
Mixture Model as a Generative Model
Mixture of Two Unigram Language Models
Expectation-Maximization (EM) Algorithm
Approach (2): Dimensionality Reduction for Topics Extraction
• Reduced dimensions can also be considered topics.
• Singular Value Decomposition derives eigenvectors (SVD dimensions/Principal Components), which serve as topics.
D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
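A minimal sketch of this approach on the four example documents, using NumPy's `numpy.linalg.svd`. The tokenization, the use of raw term counts (rather than TF*IDF), keeping all words including stop words, and the choice of `k = 2` dimensions are assumptions made for illustration.

```python
import numpy as np

docs = [
    "I love iPad.",
    "iPad is great for kids.",
    "Kids love to play soccer.",
    "I play soccer at OSU.",
]


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


# Build a term-by-document count matrix A (rows: terms, columns: documents).
vocab = sorted({term for doc in docs for term in tokenize(doc)})
index = {term: i for i, term in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for term in tokenize(doc):
        A[index[term], j] += 1.0

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of SVD dimensions kept as topics (an assumed choice)
term_topic = U[:, :k]                  # coordinates of each term in the SVD space
doc_topic = (s[:k, None] * Vt[:k]).T   # coordinates of each document

# The sign of an SVD dimension is arbitrary, so rank terms by absolute weight.
for topic in range(k):
    top = np.argsort(-np.abs(term_topic[:, topic]))[:4]
    print(f"topic {topic}:", [vocab[i] for i in top])
```

Each column of `term_topic` is one SVD dimension, read here as a topic; the terms with the largest absolute coordinates describe it.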
Example: Topics extracted by SAS Enterprise Miner for the Yelp data
• Term topic weight – relevance of the term in the topic
  • Each term is assigned a weight corresponding to each topic.
  • Since each topic is an SVD dimension, the term topic weights for a term are the coordinates of the term in the SVD space.
  • The Term cutoff is used to determine whether a term belongs to a topic.
• Document topic weight – relevance of the document to the topic
  • Every document in the corpus is assigned a weight corresponding to each topic.
  • The document topic weight of a document towards a topic is the normalized sum of the TF*IDF weights for each term in the document multiplied by their term topic weights.
  • The Document cutoff is used to determine whether a document belongs to a topic.
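The document topic weight described above can be sketched as a matrix product followed by a normalization. The slide does not say which normalization is used, so dividing by the sum of the document's TF*IDF weights is an assumption here, as is applying the cutoff to absolute values; the numbers are hypothetical and this is not SAS Enterprise Miner's actual implementation.

```python
import numpy as np


def document_topic_weights(tfidf, term_topic):
    """Compute document topic weights.

    tfidf:      (n_docs, n_terms) TF*IDF weights.
    term_topic: (n_terms, n_topics) term topic weights.
    Normalization by the document's total TF*IDF weight is an assumption.
    """
    raw = tfidf @ term_topic                 # sum over terms of tfidf * term weight
    norm = tfidf.sum(axis=1, keepdims=True)  # per-document normalizer
    norm[norm == 0] = 1.0                    # avoid dividing by zero for empty documents
    return raw / norm


def apply_cutoff(weights, cutoff):
    """Mark membership where the absolute weight reaches the cutoff (assumed rule)."""
    return np.abs(weights) >= cutoff


# Hypothetical data: 2 documents, 3 terms, 2 topics.
tfidf = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 2.0]])
term_topic = np.array([[0.8, 0.1],
                       [0.6, 0.0],
                       [0.1, 0.9]])

weights = document_topic_weights(tfidf, term_topic)
print(weights)
print(apply_cutoff(weights, cutoff=0.5))
```

The same `apply_cutoff` helper could be applied to `term_topic` with a term cutoff to decide which terms belong to each topic.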
Interpretability of Extracted Topics
• A topic as a collection of weighted terms provides precise information about the topic.
• But some analysts find binary topics easier to understand.