text mining mengmeng & jack_lsu

Introduction to Text Mining in SAS® Enterprise Miner

Mengmeng Liu and Jack Dai

November 10, 2014

Growing Text

• First tweet: 2006

• Tweets per day in 2007: 5,000

• Tweets per day in 2013: 500,000,000

What is Text Mining?

Unstructured Text Data

Numeric Data

Statistical Analysis

Text Mining Process Flow in SAS Enterprise Miner

Data Importing

• Clean the text as much as you can before importing

Import node               Data structure
File Import               All text in one file (CSV)
Text Import               Separate documents (TXT, PDF)
Create New Data Sources   SAS dataset (sas7bdat)
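The advice to clean the text before importing can be sketched outside SAS. The example below is a minimal Python sketch, not part of the original demo: it decodes HTML entities, drops leftover tags, collapses whitespace, and writes one document per row to CSV text suitable for a File Import node. The column names `doc_id` and `text` and the sample reviews are assumptions made for illustration.

```python
import csv
import html
import io
import re


def clean_text(raw):
    """Decode HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(raw)              # e.g. "&quot;" becomes '"'
    text = re.sub(r"<[^>]+>", " ", text)   # remove leftover HTML tags
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


def to_csv(documents):
    """Return CSV text with one cleaned document per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "text"])    # hypothetical column names
    for doc_id, raw in enumerate(documents, start=1):
        writer.writerow([doc_id, clean_text(raw)])
    return buffer.getvalue()


# Made-up sample reviews, for illustration only.
reviews = [
    "The room was &quot;great&quot;,<br>  but the pool was closed.",
    "Staff   were friendly.\nWould stay again!",
]
csv_text = to_csv(reviews)
print(csv_text)
```

Decoding entities before removing tags means that escaped markup is stripped as well; whether that is desirable depends on the raw data.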

Data Importing

Text Parsing

• Parse the text variable with the longest length

• Associate similar terms into one group

• Build customized dictionary of relevant terms

• Control number of terms per document
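The parsing ideas above can be illustrated with a small Python sketch. It is not how the Text Parsing node is implemented; the synonym list, the start list (custom dictionary), and the `max_terms` cap are hypothetical stand-ins for the corresponding node settings, and capping distinct terms is only one reading of "control number of terms per document".

```python
import re
from collections import Counter

# Hypothetical synonym list: maps variants to one representative term.
SYNONYMS = {
    "rooms": "room",
    "stayed": "stay",
    "staying": "stay",
    "employees": "staff",
}

# Hypothetical start list (custom dictionary): only these terms are kept.
START_LIST = {"room", "stay", "staff", "pool", "clean", "dirty"}


def parse(text, max_terms=None):
    """Tokenize, map variants to their parent term, keep dictionary terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    terms = [SYNONYMS.get(tok, tok) for tok in tokens]
    counts = Counter(t for t in terms if t in START_LIST)
    # Optionally cap the number of distinct terms kept for this document.
    return dict(counts.most_common(max_terms))


doc = ("We stayed two nights. The rooms were clean and "
       "the employees kept the pool clean.")
all_terms = parse(doc)
top_two = parse(doc, max_terms=2)
print(all_terms)
print(top_two)
```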

Text Filter

• Correct misspellings

• Assign frequency weights and term weights

• Manually filter out terms using filter view

Target variable   Term weight option
Present           Mutual information
Not present       Entropy
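To make the weighting step concrete, here is a minimal Python sketch of a log frequency weight combined with an entropy term weight, the option listed for data without a target variable. The formulas are the commonly used log-entropy definitions and are assumed, not confirmed by the slides, to match what the Text Filter node computes. The sample counts are made up, and mutual information (which needs a target) is not shown.

```python
import math


def log_frequency_weight(count):
    """Dampen a raw term count: log2(count + 1)."""
    return math.log2(count + 1)


def entropy_term_weights(doc_term_counts):
    """Entropy weight per term: 1 + sum_j p_ij * log2(p_ij) / log2(n_docs).

    p_ij is the share of term i's total occurrences that fall in document j.
    Requires at least two documents, since log2(1) is zero.
    """
    n_docs = len(doc_term_counts)
    totals = {}
    for counts in doc_term_counts:
        for term, c in counts.items():
            totals[term] = totals.get(term, 0) + c
    weights = {}
    for term, total in totals.items():
        s = 0.0
        for counts in doc_term_counts:
            c = counts.get(term, 0)
            if c:
                p = c / total
                s += p * math.log2(p)
        weights[term] = 1.0 + s / math.log2(n_docs)
    return weights


# Made-up term counts for three documents.
docs = [
    {"hotel": 2, "pool": 3},
    {"hotel": 2},
    {"hotel": 2, "buffet": 1},
]
weights = entropy_term_weights(docs)

# Weighted entry = frequency weight * term weight.
weighted = [
    {t: log_frequency_weight(c) * weights[t] for t, c in d.items()}
    for d in docs
]
print(weights)
print(weighted)
```

A term spread evenly over every document ("hotel" here) gets a weight near 0, while a term concentrated in a single document gets a weight of 1, which is why the entropy weight favors discriminating terms.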

Text Cluster

• Text Cluster groups documents with similar text content

• Represent each document in a reduced set of Singular Value Decomposition (SVD) dimensions computed from the term weights and frequency weights

• Group documents into mutually exclusive clusters based on their SVD dimensions

• Select the number of SVD dimensions and the number of clusters
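The SVD-then-cluster idea can be sketched in plain Python. This is an illustration under stated assumptions, not the Text Cluster node's algorithm: the top right singular vectors are approximated by power iteration with deflation, and each document is assigned to the SVD dimension on which it has the largest absolute coordinate, a deliberately simple stand-in for a real clustering method. The weighted document-by-term matrix is made up; in practice a linear algebra library would compute the SVD.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_right_singular_vectors(docs, k, iterations=200):
    """Approximate the top-k right singular vectors of the document-by-term
    matrix using power iteration on A^T A with deflation."""
    n_terms = len(docs[0])
    gram = [[sum(d[i] * d[j] for d in docs) for j in range(n_terms)]
            for i in range(n_terms)]
    vectors = []
    for _ in range(k):
        v = normalize([1.0] * n_terms)
        for _ in range(iterations):
            v = normalize([dot(row, v) for row in gram])
        eigenvalue = dot(v, [dot(row, v) for row in gram])
        vectors.append(v)
        # Deflate: remove the component just found before the next pass.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j]
                 for j in range(n_terms)] for i in range(n_terms)]
    return vectors


# Made-up weighted document-by-term matrix (rows = documents).
# Documents 0-1 use only the first two terms, documents 2-3 the last two.
docs = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
]

k = 2  # number of SVD dimensions, also used as the number of clusters here
basis = top_right_singular_vectors(docs, k)

# Coordinates of each document in the reduced SVD space.
coords = [[dot(d, v) for v in basis] for d in docs]

# Mutually exclusive assignment: the dominant SVD dimension per document.
clusters = [max(range(k), key=lambda j: abs(c[j])) for c in coords]
print(clusters)
```

Documents that share vocabulary end up with the same cluster label, and each document receives exactly one label.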

Text Topic

• Create a number of topics that are prevalent in documents

• Score each document on the probability that it contains each topic

• Each document could have multiple topics
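The difference from clustering is that a document may belong to several topics at once. The Python sketch below illustrates this with hand-defined topics; the topic names, term weights, and cutoff are hypothetical, and the score is a raw weighted term count rather than the probability-style score mentioned on the slide.

```python
import re

# Hypothetical topics, each defined by weighted terms (illustrative only).
TOPICS = {
    "cleanliness": {"clean": 1.0, "dirty": 1.0, "smell": 0.5},
    "service": {"staff": 1.0, "rude": 1.0, "friendly": 1.0},
    "casino": {"casino": 1.0, "slots": 0.5},
}


def topic_scores(text):
    """Sum the topic's term weights over every token in the document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        topic: sum(weights.get(tok, 0.0) for tok in tokens)
        for topic, weights in TOPICS.items()
    }


def assigned_topics(text, cutoff=1.0):
    """A document is assigned every topic whose score reaches the cutoff."""
    return [t for t, s in topic_scores(text).items() if s >= cutoff]


review = "The room was dirty and the staff was rude."
scores = topic_scores(review)
topics = assigned_topics(review)
print(scores)
print(topics)
```

The sample review is assigned both "cleanliness" and "service", which a mutually exclusive clustering could not express.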

Possible statistical analysis methods

For Classification Purposes:

Demo: Hotel Reviews for Riviera

Data Structure

Cleaning the raw data

Manually filtering terms

Terms Eliminated:

• quot

• riviera

• hotel

• stay (verb)

• strip

• vegas

• year

• riv
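Manual filtering amounts to applying a stop list to the parsed terms. The Python sketch below uses the eliminated terms listed above; reading "stay verb" as the term "stay" with the role verb is an assumption, and the parsed (term, role) pairs are made up for illustration.

```python
from collections import Counter

# Stop list taken from the slide. A role of None means "any role";
# treating "stay verb" as (term, role) is an assumed interpretation.
STOP_LIST = {
    ("quot", None),
    ("riviera", None),
    ("hotel", None),
    ("stay", "verb"),
    ("strip", None),
    ("vegas", None),
    ("year", None),
    ("riv", None),
}


def keep(term, role):
    """Keep a term unless it is stopped for any role or for this role."""
    return (term, None) not in STOP_LIST and (term, role) not in STOP_LIST


# Hypothetical parser output: (term, role) pairs.
parsed = [
    ("riviera", "noun"),
    ("stay", "verb"),
    ("stay", "noun"),
    ("room", "noun"),
    ("quot", "noun"),
    ("room", "noun"),
    ("casino", "noun"),
]

filtered = Counter(term for term, role in parsed if keep(term, role))
print(filtered)
```

Terms such as "riviera", "hotel", and "vegas" occur in nearly every review of this hotel, so removing them leaves the terms that actually distinguish one review from another.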

Results

Alternative: Read the Reviews

Questions?

Resources:

• Dr. Jim Love and Dr. Joni Shreve from LSU ISDS Department

• Data obtained from UCI Data Repository

• http://www.internetlivestats.com/twitter-statistics/

• ‘Text Analytics Using SAS Enterprise Miner’
