Turning Text into Insights: An Introduction to Topic Models
Post on 18-Feb-2017
Turning text into insight:
AN INTRODUCTION TO TOPIC MODELING
Handling Raw, Unlabeled Text
§ Common Datasets:
  - Product/Customer Reviews
  - Call Center Transcripts
  - Newspaper Articles
  - Legal Documents
§ Common Tasks:
  - Find documents we're interested in?
  - Categorize documents?
  - Retrieve information?
§ The Challenge:
  - Normal quantitative approaches don’t work with text.
  - Datasets are large, complicated, sparse, and unwieldy.
  - Data is often unlabeled.
Example: Understanding Customer Reviews
§ Mon Ami Gabi is a restaurant in the Paris Las Vegas Hotel and Casino.
§ Thousands of customer reviews for the restaurant over the last 8 years.
What are customers saying?
Excellent breakfast menu. They just need to hire more staff to have a better service.
Great place for brunch!
Highly recommend the steak and fries and sitting outside.
Had a great meal with a great atmosphere
Food was ok… What it has going for it is the view from the outside terrace.
Topic Modeling: Framework
Documents: "Excellent breakfast menu. They just need to hire more staff to have a better service"
Topics: Breakfast, Quality of Service
Words and Phrases: breakfast, better, service, staff
Topic Modeling: Preprocessing
§ Tokenize: Extract meaningful units from sentences.
  - Example: "I ordered a french toast" → {I, ordered, a, french toast}
  - Regular expression cleanup, end-of-line hyphenation, contraction, and sentence-initial capitalization rules.
§ Stemming Algorithm: Consolidate feature space into word stems or lemmas.
  - Example: {I, ordered, a, french toast} → {I, order, a, french toast}
  - Suffix stripping, part-of-speech tagging.
§ Matrix Factorization: Convert text into a data structure for learning algorithms.
  - Word-document matrices often have 1,000,000,000,000+ values. Special compression algorithms are needed to make the data manageable.
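
The three preprocessing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline used for the slides: the function names (`tokenize`, `stem`, `build_sparse_matrix`), the phrase list, and the suffix rules are all assumptions made for the example, and a real project would normally rely on an established stemmer and sparse-matrix library. The tokenizer lowercases everything, so "I" becomes "i".

```python
import re
from collections import Counter

def tokenize(text, phrases=("french toast",)):
    """Lowercase the text, keep known multi-word phrases together,
    and split everything else on non-letter characters."""
    text = text.lower()
    for phrase in phrases:
        # Temporarily glue phrase words together so the regex keeps them as one token.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [tok.replace("_", " ") for tok in re.findall(r"[a-z_']+", text)]

def stem(token):
    """Toy suffix stripper (hypothetical rules, much cruder than a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_sparse_matrix(docs):
    """Return (vocab, cells), where cells maps (doc_id, word_id) -> count.
    Only nonzero entries are stored, which is what keeps a huge
    word-document matrix manageable."""
    vocab = {}
    cells = Counter()
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            word_id = vocab.setdefault(stem(token), len(vocab))
            cells[(doc_id, word_id)] += 1
    return vocab, cells

docs = ["I ordered a french toast", "Great place for brunch!"]
vocab, cells = build_sparse_matrix(docs)
print(tokenize(docs[0]))
print([stem(t) for t in tokenize(docs[0])])
print(len(cells), "stored cells out of", len(docs) * len(vocab), "in the dense matrix")
```

Storing only the nonzero (document, word) pairs is the simplest form of the compression the slide refers to: the share of empty cells grows quickly as the vocabulary and the number of documents grow.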
Topic Modeling: Estimation with Gibbs Sampler
- Use Markov Chain Monte Carlo methods to simulate our document-topic and topic-word probability distributions.
- Results:

Topic-Word:
  Breakfast topic    Service topic
  Breakfast: 0.31    Service: 0.28
  Eggs: 0.27         Staff: 0.24
  Coffee: 0.24       Friendly: 0.21

Document-Topic:
  "The french toast was great"
    French Toast: 0.71, Breakfast: 0.25, Service: 0.03
  "The staff was great, but the outdoor patio was a bit noisy."
    Service: 0.51, Environment: 0.44, Breakfast: 0.02
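
For readers who want to see how a Gibbs sampler can produce document-topic and topic-word distributions like the ones above, here is a minimal sketch. It assumes the underlying model is latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, which the slides do not state explicitly. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are illustrative assumptions, not the code or data behind the reported numbers.

```python
import random

def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of strings).
    Returns (theta, phi): document-topic and topic-word probabilities."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every word occurrence.
    assign = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed estimates from the final state of the chain.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [{w: (tw[w] + beta) / (topic_total[k] + V * beta) for w in vocab}
           for k, tw in enumerate(topic_word)]
    return theta, phi

# Toy, made-up documents (already tokenized) purely to exercise the sampler.
toy_docs = [
    ["breakfast", "eggs", "coffee", "breakfast"],
    ["service", "staff", "friendly", "staff"],
    ["breakfast", "coffee", "service"],
]
theta, phi = gibbs_lda(toy_docs, n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

The sketch reads the estimates off the last sample only. Averaging over several samples after a burn-in period is a common refinement and would give more stable numbers.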
Harnessing the Model: Topic Frequency
What are my customers talking about?
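
One simple way to answer this question from a fitted model is to average each topic's weight over all documents. The sketch below assumes that definition of topic frequency (the slides do not say how it was computed) and uses a hypothetical helper, `topic_frequency`. The input is the two-document example from the Document-Topic results on the estimation slide.

```python
from collections import defaultdict

def topic_frequency(doc_topics):
    """Average each topic's weight across documents, sorted from most to least frequent.
    doc_topics is a list of {topic_name: weight} dicts, one per document."""
    totals = defaultdict(float)
    for weights in doc_topics:
        for topic, weight in weights.items():
            totals[topic] += weight
    n_docs = len(doc_topics)
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    return {topic: total / n_docs for topic, total in ranked}

# Document-topic weights from the estimation slide.
doc_topics = [
    {"French Toast": 0.71, "Breakfast": 0.25, "Service": 0.03},
    {"Service": 0.51, "Environment": 0.44, "Breakfast": 0.02},
]
for topic, share in topic_frequency(doc_topics).items():
    print(f"{topic}: {share:.3f}")
```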
Harnessing the Model: Evaluate Products and Verticals
How do customers feel about my products?
Harnessing the Model: Temporal Insights
How has customer sentiment evolved among my product lines over time?
Harnessing the Model: Deep Product Insights
Which properties of French Toast drive satisfaction (or dissatisfaction)?
Thank you.