sentiment analysis

Sentiment Analysis in Machine Learning Jennifer D. Davis, Ph.D. American Computing Machinery, Austin Chapter Sub-group on Knowledge, Discovery and Data Mining June 2, 2015

Upload: jennifer-d-davis-phd

Post on 17-Aug-2015



Who uses sentiment analysis anyway?

What is sentiment analysis?

A machine learning technique that classifies comments and phrases using what is called a ‘corpus’: a group of annotated texts in which words are given numerical weights

Defined as: “Sentiment analysis (opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.” (Wikipedia)

Sentiment Analysis: Not your Mother’s Twitter Feed!

Sentiment Analysis can be used to:

Understand the intent behind language in an unbiased manner

Business areas that frequently use Sentiment Analysis:

Retail

Entertainment

Healthcare

Any customer-centered organization

Respond to customer complaints with better solutions, a sort of virtual call center (e.g. Amelia)

Retail

Introduce new products more successfully by understanding culture & social media

Understand and respond to customer needs using internal data sources such as customer reviews or feedback

Develop new products based on customer wants and needs as expressed in reviews, on-line and social media

Entertainment

Create interest or excitement about movies by understanding the market segment

Target movie advertising or recommender systems based on social commentary and collaborative filtering

Target advertising to gender or population or by cultural affinity.

Healthcare and Medical Treatment

Healthcare: Learn about patient wellness

Potentially detect depression from journal entries

Assist with patient adherence to treatment

Learn about patient satisfaction and what is working

Gather outcomes measures associated with patient satisfaction

This is a hot area of research and several academic institutions are investing in research related to patient outcomes and sentiment analysis.

What are the overall steps for sentiment analysis?

Gather unstructured data from your own sources, web-sources, databases (healthcare.gov surprisingly has some) and competitions like Kaggle.

Parse out unnecessary punctuation and “stop” words or phrases, perform other pre-processing as needed or appropriate.

Transform the words or phrases to a numerical representation such as a vector

Choose an appropriate classification algorithm. For example, Random Forest is often highly accurate but isn’t always computationally efficient. We discussed several other methods previously.

Apply your algorithm to a training set and if enough data is available, cross-validate. Tune the algorithm using appropriate parameters matched to features, but avoid over-fitting.

Apply the algorithm to test data (the fun part).
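The pre-processing and vectorizing steps above can be sketched in plain Python. This is a minimal illustration, not the code from the talk's notebook: the stop-word list, the function names and the two sample reviews are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real projects usually
# use a fuller list (for example the one shipped with NLTK).
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Turn one text into a vector of word counts (a bag of words)."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words unseen when building the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Made-up sample reviews, for illustration only.
reviews = ["This movie was great, great fun!", "The plot was dull."]
vocab = build_vocabulary(reviews)
vectors = [vectorize(r, vocab) for r in reviews]
```

Each review becomes a list of counts with one position per vocabulary word, which is the numerical form a classifier needs in the later steps.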

What techniques can we use?

Many are under development by machine-learning focused corporations and in academic linguistic laboratories

Often an ensemble of algorithms works best and is most accurate

Text data is often unstructured data. You will spend a portion of time cleaning and organizing data. Not fun, but necessary.

Today we will very briefly give a high-level overview of three methods: (i) Bayesian probability classification, (ii) Word2Vec and (iii) recursive neural networks
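The ensemble idea mentioned above can be as simple as a majority vote over the labels that several classifiers predict. A minimal sketch, assuming three hypothetical classifiers have already produced their labels (the function name and the votes are invented for illustration):

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of three different classifiers for one review.
votes = ["positive", "negative", "positive"]
label = majority_vote(votes)
```

Real ensembles often weight the votes or average predicted probabilities instead; plain voting is only the simplest variant.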

Bayesian Probability and classification method

Naïve Bayes classification uses probability formulas based on the assumption that all features are independent of one another

In most cases this is surprisingly accurate, and it can typically yield accuracies of 70-80%

You can read more about this in the textbook for this course, “Building Machine Learning Systems with Python”
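To make the independence assumption concrete, here is a small from-scratch sketch of a Naïve Bayes text classifier with add-one (Laplace) smoothing. It is not the textbook's or the notebook's code; the function names and the four training examples are made up, and a real project would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label from (tokens, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # log P(label)
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocabulary:
                continue  # skip words never seen in training
            # log P(word | label) with add-one (Laplace) smoothing;
            # each word contributes independently of the others
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
training = [
    (["great", "fun", "movie"], "positive"),
    (["loved", "great", "acting"], "positive"),
    (["dull", "boring", "plot"], "negative"),
    (["boring", "movie"], "negative"),
]
model = train_naive_bayes(training)
prediction = predict(["great", "acting"], model)
```

Summing log-probabilities word by word is exactly where the "naïve" independence assumption enters: the score never looks at word order or at how words interact.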

Word2vec “deep” learning method

This method relies upon creating a “Bag of Words” from semi-structured data

Many tools are available in the scikit-learn and NLTK Python libraries (we will show some in our Jupyter (IPython) notebook)

Invented by Google engineers, who describe it as a “tool [that provides] an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words”

In other words (pun intended), each word is assigned a vector of numbers representing its importance and meaning
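The skip-gram architecture named in the quote learns word vectors by predicting the words that surround each word. The sketch below shows only how the (center word, context word) training pairs could be generated; it does not train any vectors, and the function name and window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["this", "movie", "has", "humor"]
pairs = skipgram_pairs(sentence, window=1)
```

Words that keep appearing in similar contexts end up with similar vectors, which is why the resulting numbers carry meaning.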

Neural recursive network method

The best (and most convenient to use) library is Stanford University’s Natural Language Processing library.

The method uses a recursion algorithm that will distinguish between phrases based upon the order of words & phrases

For example “this movie has humor that could not be denied” would be graded as positive whereas “this movie did not have any humor whatsoever” would be graded as negative based on order and choice of words & phrases.

The Stanford NLP Group can be found at: nlp.stanford.edu; their live demonstration is available at: nlp.stanford.edu/sentiment
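Why order matters can be seen in a toy version of the basic recursive composition step, in which two child vectors are combined into a parent vector as tanh(W [left; right] + b). This is a sketch with made-up, untrained numbers, not Stanford's implementation: the weights, the bias, the word vectors and the function names are all hypothetical.

```python
import math

# Hypothetical, untrained toy parameters: a 2x4 weight matrix and a bias.
# A real model learns these (and the word vectors) from labeled parse trees.
W = [[0.5, -0.3, 0.8, 0.1],
     [-0.2, 0.7, 0.1, -0.6]]
B = [0.0, 0.1]

# Made-up 2-dimensional word vectors.
WORD_VECTORS = {
    "not": [-0.9, 0.2],
    "funny": [0.8, 0.5],
    "very": [0.1, 0.9],
}


def compose(left, right):
    """Combine two child vectors into one parent vector: tanh(W [left; right] + b)."""
    children = left + right  # concatenate the two child vectors
    return [
        math.tanh(sum(w * x for w, x in zip(row, children)) + b)
        for row, b in zip(W, B)
    ]


def encode(tree):
    """Recursively encode a binary tree of words (nested tuples), bottom-up."""
    if isinstance(tree, str):
        return WORD_VECTORS[tree]
    left, right = tree
    return compose(encode(left), encode(right))


# The same two words in a different order produce different phrase vectors.
a = encode(("not", "funny"))
b = encode(("funny", "not"))
# Longer phrases are built by nesting: "not (very funny)".
c = encode(("not", ("very", "funny")))
```

In a trained model, a classifier on top of each phrase vector assigns the sentiment grade, so the same words in a different arrangement can receive different labels.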

So which do I choose?

It depends upon the complexity of data you are analyzing

It depends upon the accuracy you desire versus scalability (always a balancing act)

It depends on your time frame and how you will integrate the knowledge derived from using sentiment analysis

Out of the box solutions can work, but sometimes you will need to build your own

So now we can give it a try!

A Jupyter Notebook has been created and can be accessed via my GitHub account at: https://github.com/jddavis-100/Statistics-and-Machine-Learning/

Data is available at Kaggle.com by joining the Kaggle Competition

The test set was designed by me, and I can provide it to you or Omar.

Gather your own data from a number of APIs or web crawlers, such as:

Rotten Tomatoes API

Twitter API

Web-scraping tools such as Scrapy (Python tool available at scrapy.org)