Introduction to Text Mining in SAS® Enterprise Miner
Mengmeng Liu and Jack Dai
November 10, 2014
Growing Text
• First tweet: 2006
• Tweets per day in 2007: 5,000
• Tweets per day in 2013: 500,000,000
What is Text Mining?
Unstructured text data → numeric data → statistical analysis
Text Mining Process Flow in SAS Enterprise Miner
Data Importing
• Clean the text as much as you can before importing
Import node | Data structure
File Import | All text in one file (CSV)
Text Import | Separate documents (TXT, PDF)
Create New Data Sources | SAS dataset (sas7bdat)
Text Parsing
• Parse the text variable with the longest length
• Associate similar terms into one group
• Build customized dictionary of relevant terms
• Control number of terms per document
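The parsing steps above can be sketched in a few lines of Python. This is only an illustration of the ideas (grouping similar terms, keeping a custom dictionary of relevant terms, capping terms per document), not what the Text Parsing node does internally. The names SYNONYMS, START_TERMS, and parse, and the sample sentence, are hypothetical.

```python
import re
from collections import Counter

# Hypothetical synonym list: map variants onto one parent term.
SYNONYMS = {"rooms": "room", "riv": "riviera", "cleaned": "clean"}

# Hypothetical custom dictionary of relevant terms to keep.
START_TERMS = {"room", "riviera", "clean", "quiet"}


def parse(text, start_terms=None, max_terms=None):
    """Return term counts for one document."""
    # Lowercase and split into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Group similar terms under one parent term.
    terms = [SYNONYMS.get(t, t) for t in tokens]
    # Keep only terms in the custom dictionary, if one is given.
    if start_terms is not None:
        terms = [t for t in terms if t in start_terms]
    counts = Counter(terms)
    # Optionally cap the number of distinct terms kept per document.
    if max_terms is not None:
        counts = Counter(dict(counts.most_common(max_terms)))
    return counts


doc = "The rooms at the Riv were clean and the room was quiet"
counts = parse(doc, start_terms=START_TERMS)
top_term = parse(doc, start_terms=START_TERMS, max_terms=1)
print(counts)
```

Here "rooms" and "room" are counted as one term, and words outside the custom dictionary are dropped.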
Text Filter
• Correct misspellings
• Assign frequency weights and term weights
• Manually filter out terms using filter view
Target variable | Option
Present | Mutual information
Not present | Entropy
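To make the weighting step concrete, here is a minimal Python sketch of a log frequency weight and an entropy term weight. The formulas are one common definition of these weights and are assumed here for illustration; check the SAS documentation for the exact definitions used by the Text Filter node. The function names and the sample counts are hypothetical.

```python
import math


def log_frequency_weight(f):
    """Dampen a raw term count f within one document."""
    return math.log2(f + 1)


def entropy_weight(term_counts):
    """Term weight from the counts of one term across all documents.

    Close to 1 when the term is concentrated in few documents,
    close to 0 when it is spread evenly over all of them.
    """
    n = len(term_counts)
    total = sum(term_counts)
    if total == 0 or n < 2:
        return 0.0
    s = sum((f / total) * math.log2(f / total) for f in term_counts if f > 0)
    return 1.0 + s / math.log2(n)


# Hypothetical counts of two terms across four documents.
even_term = [1, 1, 1, 1]    # appears everywhere: little information
rare_term = [3, 0, 0, 0]    # appears in one document only

w_even = entropy_weight(even_term)
w_rare = entropy_weight(rare_term)

# Weighted entry for the rare term in the first document.
weighted = log_frequency_weight(rare_term[0]) * w_rare
print(w_even, w_rare, weighted)
```

A term that appears evenly in every document gets a weight near zero, which is why terms such as "hotel" add little in a set of hotel reviews.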
Text Cluster
• Text Cluster groups documents with similar text content
• Represent each document by Singular Value Decomposition (SVD) dimensions computed from the term weights and frequency weights
• Group documents into mutually exclusive clusters based on the SVD dimensions
• Select the number of SVD dimensions and the number of clusters
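The clustering idea can be sketched in plain Python: project a weighted document-by-term matrix onto its leading SVD dimensions, then group the projected documents. This sketch approximates the SVD with power iteration and uses a simple k-means as a stand-in; it is not the algorithm the Text Cluster node uses. The matrix values and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else v


def top_right_singular_vectors(A, k, iters=100):
    """Leading k right singular vectors of A (power iteration with deflation)."""
    columns = list(zip(*A))
    found = []
    for i in range(k):
        v = [1.0 + 0.1 * ((j + i) % 3) for j in range(len(A[0]))]  # fixed start
        for _ in range(iters):
            for u in found:  # remove directions already found
                proj = dot(v, u)
                v = [a - proj * b for a, b in zip(v, u)]
            Av = [dot(row, v) for row in A]
            v = normalize([dot(col, Av) for col in columns])  # A^T (A v)
        found.append(v)
    return found


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with farthest-point initialization (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


# Hypothetical weighted matrix: rows are documents, columns are terms.
weights = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]

vectors = top_right_singular_vectors(weights, k=2)
coords = [[dot(row, v) for v in vectors] for row in weights]  # SVD dimensions
labels = kmeans(coords, k=2)
print(labels)
```

The first two documents share terms and land in one cluster; the last two land in the other. Both the number of SVD dimensions and the number of clusters are choices the analyst makes, as noted above.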
Text Topic
• Create a number of topics that are prevalent in documents
• Score each document on probability of containing the topic
• Each document could have multiple topics
Possible statistical analysis methods
For Classification Purposes:
Demo: Hotel Reviews for Riviera
Data Structure
Cleaning the raw data
Manually filtering terms
Terms Eliminated:
• quot
• riviera
• hotel
• stay (verb)
• strip
• vegas
• year
• riv
Results
Alternative: Read the Reviews
Questions?
Resources:
• Dr. Jim Love and Dr. Joni Shreve from LSU ISDS Department
• Data obtained from UCI Data Repository
• http://www.internetlivestats.com/twitter-statistics/
• ‘Text Analytics Using SAS Enterprise Miner’