text mining mengmeng & jack_lsu

Introduction to Text Mining in SAS® Enterprise Miner Mengmeng Liu and Jack Dai November 10, 2014


Page 1: Text mining mengmeng & jack_lsu

Introduction to Text Mining in SAS® Enterprise Miner

Mengmeng Liu and Jack Dai

November 10, 2014

Page 2: Text mining mengmeng & jack_lsu

Growing Text

• First tweet: 2006

• Tweets per day in 2007: 5,000

• Tweets per day in 2013: 500,000,000

Page 3: Text mining mengmeng & jack_lsu

What is Text Mining?

Unstructured Text Data

Numeric Data

Statistical Analysis
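
The step from unstructured text to numeric data can be illustrated with a minimal sketch. It assumes a tiny made-up corpus and plain whitespace tokenization; the documents and the names `documents`, `term_counts`, `vocabulary`, and `matrix` are illustrative and do not come from the slides or from SAS Enterprise Miner.

```python
from collections import Counter

# Hypothetical mini-corpus; the documents are illustrative only.
documents = [
    "the room was clean",
    "the pool was closed",
    "clean room and clean pool",
]

def term_counts(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())

counts = [term_counts(doc) for doc in documents]

# Sorted list of every distinct term seen in the corpus.
vocabulary = sorted(set().union(*counts))

# Document-by-term matrix: one row per document, one column per term.
matrix = [[c[term] for term in vocabulary] for c in counts]

for doc, row in zip(documents, matrix):
    print(row, doc)
```

Each row is now a numeric vector, which is the kind of input that statistical methods can work with.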

Page 4: Text mining mengmeng & jack_lsu

Text Mining Process Flow in SAS Enterprise Miner

Page 5: Text mining mengmeng & jack_lsu

Data Importing

• Clean the text as much as you can before importing

Import node | Data structure

File Import | All text in one file (CSV)

Text Import | Separate documents (TXT, PDF)

Create New Data Sources | SAS dataset (sas7bdat)
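
Cleaning the text before importing might look like the sketch below. It assumes the reviews sit in a CSV file with one document per row and that the raw text carries HTML residue such as `&quot;`; the sample data, the column names, and the `clean_text` helper are hypothetical.

```python
import csv
import html
import io
import re

# Hypothetical CSV content standing in for a file with one review per row.
raw_csv = 'id,review\n1,"Great pool &quot;oasis&quot; <br> but   noisy"\n2,"Room was  OK"\n'

def clean_text(text):
    """Decode HTML entities, drop simple tags, and collapse whitespace."""
    text = html.unescape(text)             # e.g. &quot; becomes "
    text = re.sub(r"<[^>]+>", " ", text)   # remove simple HTML tags
    return re.sub(r"\s+", " ", text).strip()

rows = list(csv.DictReader(io.StringIO(raw_csv)))
cleaned = [{"id": r["id"], "review": clean_text(r["review"])} for r in rows]

for row in cleaned:
    print(row["id"], row["review"])
```

Decoding entities at this stage keeps fragments such as `quot` from showing up later as terms.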

Page 6: Text mining mengmeng & jack_lsu

Data Importing

Page 7: Text mining mengmeng & jack_lsu

Text Parsing

• Parse the text variable with the longest length

• Associate similar terms into one group

• Build customized dictionary of relevant terms

• Control number of terms per document
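
The parsing ideas above can be sketched in a few lines. This is an illustration only, assuming a hand-made synonym list and stop list; `SYNONYMS`, `STOP_LIST`, and `parse` are hypothetical names, and the Text Parsing node's own stemming and dictionaries are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical synonym list: each variant maps to one parent term.
SYNONYMS = {"rooms": "room", "stayed": "stay", "staying": "stay"}

# Hypothetical stop list of terms to drop from the analysis.
STOP_LIST = {"the", "a", "in", "we", "and", "were"}

def parse(text):
    """Tokenize, group variants under a parent term, and drop stop terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grouped = [SYNONYMS.get(token, token) for token in tokens]
    return Counter(term for term in grouped if term not in STOP_LIST)

parsed = parse("We stayed in a room, and the rooms were clean")
print(parsed)
```

Grouping "room" and "rooms" under one parent term means the two forms are counted together rather than as separate terms.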

Page 8: Text mining mengmeng & jack_lsu

Text Filter

• Correct misspellings

• Assign frequency weights and term weights

• Manually filter out terms using filter view

Target variable | Term weight option

Present | Mutual information

Not present | Entropy
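
As a rough illustration of the weighting step, the sketch below computes a log frequency weight and an entropy term weight using the commonly documented formulas. It is assumed, not confirmed by the slides, that the Text Filter node computes them the same way; the counts and function names are hypothetical.

```python
import math

# Hypothetical term-by-document counts: term -> count in each of 4 documents.
term_doc_counts = {
    "hotel":  [2, 2, 2, 2],   # spread evenly across documents
    "casino": [0, 4, 0, 0],   # concentrated in a single document
    "pool":   [1, 0, 1, 0],
}

def log_frequency_weight(count):
    """Dampen raw counts: log2(count + 1)."""
    return math.log2(count + 1)

def entropy_weight(counts):
    """1 + sum_j p_j * log2(p_j) / log2(n), where p_j = count_j / total."""
    n = len(counts)
    total = sum(counts)
    s = sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 1 + s / math.log2(n)

weights = {term: entropy_weight(c) for term, c in term_doc_counts.items()}

for term, weight in weights.items():
    print(term, round(weight, 3))
```

Under this formula a term that appears equally often in every document gets a weight near 0, while a term concentrated in one document gets a weight near 1, so evenly spread terms contribute little to distinguishing documents.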

Page 9: Text mining mengmeng & jack_lsu

Text Cluster

• Text Cluster groups documents with similar text content

• Represent each document by its Singular Value Decomposition (SVD) dimensions, computed from the term weights and frequency weights

• Group documents into mutually exclusive clusters based on the SVD dimensions

• Select the number of SVD dimensions and the number of clusters
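
The clustering step can be sketched as follows. The example assumes each document has already been reduced to two coordinates that stand in for SVD dimensions, and it uses k-means as a simple stand-in; it does not reproduce the algorithm used by the Text Cluster node. The coordinates, starting centroids, and function names are hypothetical.

```python
import math

# Hypothetical 2-dimensional coordinates standing in for SVD dimensions.
doc_vectors = {
    "doc1": (0.9, 0.1),
    "doc2": (0.8, 0.2),
    "doc3": (0.1, 0.9),
    "doc4": (0.2, 0.8),
}

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))

def k_means(vectors, centroids, iterations=10):
    """Assign each document to exactly one cluster with plain k-means."""
    points = list(vectors.values())
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        centroids = updated
    return dict(zip(vectors, (nearest(p, centroids) for p in points)))

clusters = k_means(doc_vectors, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(clusters)
```

Every document ends up with a single cluster label, which matches the "mutually exclusive" property described on the slide.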

Page 10: Text mining mengmeng & jack_lsu

Text Topic

• Create a number of topics that are prevalent in documents

• Score each document on probability of containing the topic

• Each document could have multiple topics
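
The difference from clustering, where each document gets exactly one label, can be shown with a small sketch. It assumes topics are defined by hand as weighted term lists and scores documents with a weighted sum and a cutoff rather than a probability; the topics, weights, `CUTOFF`, and function names are hypothetical and are not the Text Topic node's actual computation.

```python
# Hypothetical topics: each topic maps terms to weights.
topics = {
    "cleanliness": {"clean": 1.0, "dirty": 1.0, "smell": 0.5},
    "casino": {"casino": 1.0, "slot": 0.8, "poker": 0.8},
}

CUTOFF = 1.0  # hypothetical minimum score for a document to carry a topic

def topic_scores(term_counts):
    """Weighted sum of a document's term counts for every topic."""
    return {
        name: sum(weight * term_counts.get(term, 0) for term, weight in terms.items())
        for name, terms in topics.items()
    }

def assigned_topics(term_counts):
    """All topics whose score reaches the cutoff (may be none or several)."""
    return sorted(name for name, score in topic_scores(term_counts).items()
                  if score >= CUTOFF)

review = {"clean": 1, "smell": 1, "casino": 2, "pool": 1}
print(topic_scores(review))
print(assigned_topics(review))
```

A single review can pass the cutoff for several topics at once, or for none.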

Page 11: Text mining mengmeng & jack_lsu

Possible statistical analysis methods

For Classification Purposes:

Page 12: Text mining mengmeng & jack_lsu

Demo: Hotel Reviews for Riviera

Page 13: Text mining mengmeng & jack_lsu

Data Structure

Page 14: Text mining mengmeng & jack_lsu

Cleaning the raw data

Page 15: Text mining mengmeng & jack_lsu

Manually filtering terms

Terms Eliminated:

• quot

• riviera

• hotel

• stay (verb)

• strip

• vegas

• year

• riv

Page 16: Text mining mengmeng & jack_lsu

Results

Page 17: Text mining mengmeng & jack_lsu

Alternative: Read the Reviews

Page 18: Text mining mengmeng & jack_lsu

Questions?

Page 19: Text mining mengmeng & jack_lsu

Resources:

• Dr. Jim Love and Dr. Joni Shreve from LSU ISDS Department

• Data obtained from UCI Data Repository

• http://www.internetlivestats.com/twitter-statistics/

• ‘Text Analytics Using SAS Enterprise Miner’