Opinion Analysis Sudeshna Sarkar IIT Kharagpur

Upload: alice-mills

Post on 17-Dec-2015


Page 1: Opinion Analysis Sudeshna Sarkar IIT Kharagpur. Introduction – facts and opinions Two main types of information on the Web. Facts and Opinions Current

Opinion Analysis

Sudeshna Sarkar

IIT Kharagpur


Introduction – facts and opinions

Two main types of information on the Web:
• Facts and opinions

Current search engines search for facts (and assume they are true).
• Facts can be expressed with topic keywords.

Search engines do not search for opinions.
• Opinions are hard to express with a few keywords.
• E.g., what do people think of Motorola cell phones?
• The current search ranking strategy is not appropriate for opinion retrieval/search.


Overview

Motivation

Definitions

Coarse grained vs fine grained opinion analysis

Opinion lexicons

Approaches to document level opinion analysis
• Lexicon based
• Supervised learning approaches
• Mixed approaches

Approaches to fine-grained opinion analysis
• Rule based
• Learning

Opinion mining work at IIT Kharagpur


Opinion Mining

Search for and aggregate opinions from online sources

Many reviews have both positive and negative sentences

Many products are liked by some and disliked by others – there must be different reasons

Identify different features/ aspects of the target and the opinion on these separately


Why do opinion analysis?

Opinion search
• to extract examples of particular types of positive or negative statements on some topic.

Opinion question answering
• What is the reaction to the Left Front’s stand on the nuclear deal?
• Is support diminishing for the UPA government?

Product review mining
• What features of the “Mr Coffee programmable coffee maker” do users like and what do they dislike? (Microsoft Live)

Review classification

Tracking sentiment toward topics over time
• to track the ups and downs of aggregate attitudes to a brand or product


Introduction – Applications

Businesses and organizations: product and service benchmarking, market intelligence.
• Businesses spend a huge amount of money to find consumer sentiments and opinions.
• Consultants, surveys, focus groups, etc.

Individuals: interested in others’ opinions when
• purchasing a product or using a service,
• finding opinions on political topics,
• making many other decisions.

Ad placement: placing ads in user-generated content
• Place an ad when someone praises a product.
• Place an ad from a competitor when someone criticizes a product.

Opinion retrieval/search: providing general search for opinions.


Question Answering

Opinion question answering: Q: What is the international reaction to the reelection of Robert Mugabe as President of Zimbabwe?

A: African observers generally approved of his victory while Western Governments denounced it.


Opinion search (Liu, Web Data Mining book, 2007)

Can you search for opinions as conveniently as general Web search?

Whenever you need to make a decision, you may want some opinions from others.
• Wouldn’t it be nice if you could find them on a search system instantly, by issuing queries such as the following?
• Opinions: “Motorola cell phones”
• Comparisons: “Motorola vs. Nokia”

Cannot be done yet!


Typical opinion search queries

Find the opinion of a person or organization (opinion holder) on a particular object or a feature of an object.
• E.g., what is Bill Clinton’s opinion on abortion?

Find positive and/or negative opinions on a particular object (or some features of the object), e.g.,
• customer opinions on a digital camera,
• public opinions on a political topic.

Find how opinions on an object change with time.

How does object A compare with object B?
• Gmail vs. Yahoo mail


Find the opinion of a person on X

In some cases, a general search engine can handle it, i.e., using suitable keywords.
• Bill Clinton’s opinion on abortion

Reason:
• One person or organization usually has only one opinion on a particular topic.

• The opinion is likely contained in a single document.

• Thus, a good keyword query may be sufficient.


Find opinions on an object X

We use product reviews as an example: searching for opinions in product reviews is different from general Web search.
• E.g., search for opinions on “Motorola RAZR V3”

General Web search for a fact: rank pages according to some authority and relevance scores.
• The user views the first page (if the search is perfect).
• One fact = multiple facts

Opinion search: ranking is desirable, however
• reading only the review ranked at the top is dangerous because it is only the opinion of one person.
• One opinion ≠ multiple opinions


Search opinions (contd)

Ranking:
• Produce two rankings
• Positive opinions and negative opinions
• Some kind of summary of both, e.g., # of each
• Or, one ranking but
• The top (say 30) reviews should reflect the natural distribution of all reviews (assume that there is no spam), i.e., with the right balance of positive and negative reviews.

Questions:
• Should the user read all the top reviews? OR
• Should the system prepare a summary of the reviews?
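
The single-ranking option above can be sketched as follows. This is a minimal illustration, not a method given in the slides: it assumes every review already carries a polarity label ("pos" or "neg") and a relevance score, and the function name `balanced_top_k` is hypothetical.

```python
def balanced_top_k(reviews, k):
    """Pick k reviews whose positive/negative mix mirrors the full set.

    reviews: list of (review_id, polarity, score) tuples, polarity in
    {"pos", "neg"}; higher score means more relevant. (Assumed format.)
    """
    if not reviews:
        return []
    k = min(k, len(reviews))
    by_score = lambda r: -r[2]
    pos = sorted((r for r in reviews if r[1] == "pos"), key=by_score)
    neg = sorted((r for r in reviews if r[1] == "neg"), key=by_score)
    # Quota for positive reviews follows their share of all reviews;
    # the negative quota takes the remaining slots.
    pos_quota = round(k * len(pos) / len(reviews))
    neg_quota = k - pos_quota
    chosen = pos[:pos_quota] + neg[:neg_quota]
    return sorted(chosen, key=by_score)
```

With 14 positive and 6 negative reviews and k = 10, the sketch returns 7 positive and 3 negative reviews, ordered by score.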


User generated content

Word of mouth on the web.
• Review sites
• Blogs
• Online forums
• Shopping comparison sites
• User reviews

Mine opinions expressed in the user-generated content
• Challenging task

• Useful to individual consumers and companies.


Motivation for Consumer

I want to buy a camera. Which model should I pick?

• Ask my friends

• Use the internet

CEA-CNET Study: Tech-Savvy Consumers Use Internet to Research Products Before Buying Them
• Wireless News, November, 2007

Seventy Percent of Consumers Use Internet to Research Consumer Packaged Goods, According to Prospectiv Survey
• Market Wire, January, 2008


Businesses

Identify opinions about products – help to position/ adapt products

Much of product feedback is web-based
• provided by customers/critics online through websites, discussion boards, mailing lists, blogs, and CRM portals.

Market research is becoming unwieldy
• Sources are heterogeneous and multilingual in nature


Facts vs Opinions

An opinion is a person's ideas and thoughts towards something. It is an assessment, judgment or evaluation of something. An opinion is not a fact, because opinions are either not falsifiable, or the opinion has not been proven or verified. ...en.wikipedia.org/wiki/Opinion

Subjectivity: The linguistic expression of somebody’s emotions, sentiments, evaluations, opinions, beliefs, speculations, etc.

Polarity: positive and negative

• This camera is awesome.

• The movie is too long and boring.

Strength of opinion


Levels of opinion analysis

Coarse to fine grained opinion analysis

Document level: at the document (or review) level
• Subjective vs. objective
• Sentiment classification: positive, negative or neutral

Sentence level, expression level

Task 1: identifying subjective/opinionated sentences (or clauses/ phrases)

• Classes: objective and subjective (opinionated)

Task 2: sentiment classification of sentences

• Classes: positive, negative and neutral.

But a document/sentence may contain multiple opinions on more than one topic from one or more opinion holders.


Lexicon Development

Manual, semi-automatic, or fully automatic

Find relevant words, phrases, patterns that can be used to express subjectivity

Determine the polarity of subjective expressions


Opinion Words

An opinion lexicon containing lists of positive and negative phrases is very useful for the opinion mining task at different levels

Positive: beautiful, wonderful, good, amazing

Negative: bad, poor, terrible, cost someone an arm and a leg

How to compile such a list?
• Dictionary-based approaches
• Corpus-based approaches
• Supervised
• Semi-supervised

BUT
• Some opinion words are context independent (e.g., good).
• Some are context dependent (e.g., long).


Hand created lists

Create lists of opinion words appropriate for the domain manually
• Sentiment term

• Polarity

• Strength

These approaches, while being interesting, are labor intensive and can be vulnerable to error and high maintenance costs


Dictionary-based approaches

Start from a set of seed opinion words

Use WordNet’s synsets and hierarchies to acquire opinion words
• Use the seeds to search for synonyms and antonyms in WordNet (e.g., Hu and Liu, 2004).


Dictionary-based approaches

• Use additional information (e.g., glosses) and learning from WordNet (Andreevskaia and Bergler, 2006; Esuli and Sebastiani, 2005).


Dictionary-based approaches

Advantage: finds a large number of opinion words

Weakness: does not find context-dependent opinion words, e.g., small, long, fast.
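
The seed-expansion idea behind the dictionary-based approaches can be sketched as below. This is only an illustration under stated assumptions: the `SYNONYMS` and `ANTONYMS` tables are a hypothetical toy thesaurus standing in for WordNet lookups, and `expand_seeds` is not taken from any of the cited papers.

```python
from collections import deque

# Hypothetical toy thesaurus standing in for WordNet synonym/antonym links.
SYNONYMS = {
    "good": ["nice", "fine"],
    "nice": ["pleasant"],
    "bad": ["poor", "awful"],
}
ANTONYMS = {
    "good": ["bad"],
    "pleasant": ["unpleasant"],
}

def expand_seeds(seeds):
    """Grow a polarity lexicon from seed words.

    seeds: dict mapping word -> +1 (positive) or -1 (negative).
    Synonyms inherit the polarity of the word that reached them;
    antonyms receive the opposite polarity.
    """
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        links = [(s, polarity[word]) for s in SYNONYMS.get(word, [])]
        links += [(a, -polarity[word]) for a in ANTONYMS.get(word, [])]
        for other, pol in links:
            if other not in polarity:  # first label reached wins
                polarity[other] = pol
                queue.append(other)
    return polarity
```

Starting from the single seed `{"good": 1}`, the sketch labels "nice", "fine" and "pleasant" as positive and "bad", "poor", "awful" and "unpleasant" as negative. Words such as "long" stay unlabeled unless a link reaches them, which mirrors the weakness noted above.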


Corpus-based approaches

Rely on syntactic rules and co-occurrence patterns to extract opinion words from large corpora
• Use a list of seed words

• A large domain corpus

• Machine learning

Advantages: This approach can find domain (corpus) dependent opinions.


How to identify subjective terms?

Assume that contexts are coherent
• Statistical association: if words of the same orientation tend to co-occur, then the presence of one makes the other more probable.
• Use statistical measures of association to capture this interdependence.

Assume that alternatives are similarly subjective


Corpus-based approaches (contd)

Conjunctions: conjoined adjectives usually have the same orientation (Hatzivassiloglou and McKeown 1997).
• E.g., “This car is beautiful and spacious.” (conjunction)

1. Start with seed words

2. Use conjunctions to find adjectives with similar orientations

3. Use log-linear regression to aggregate information from various conjunctions

4. Use hierarchical clustering on a graph representation of adjective similarities to find two groups of same orientation
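
A much simplified sketch of steps 1 and 2 is given below. It is not the method of the cited paper: the log-linear regression and hierarchical clustering of steps 3 and 4 are replaced here by plain label propagation over a conjunction graph, and the input format (adjective, connective, adjective) is an assumption made for illustration.

```python
from collections import defaultdict, deque

def propagate_orientation(seeds, conjunctions):
    """Spread seed orientations over a graph of conjoined adjectives.

    seeds: dict word -> +1 / -1.
    conjunctions: list of (adj1, connective, adj2) triples. "and" links
    adjectives of the same orientation, "but" links opposite ones.
    """
    graph = defaultdict(list)
    for a, connective, b in conjunctions:
        sign = -1 if connective == "but" else 1
        graph[a].append((b, sign))
        graph[b].append((a, sign))
    orientation = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for other, sign in graph[word]:
            if other not in orientation:
                orientation[other] = orientation[word] * sign
                queue.append(other)
    return orientation
```

Given the seed "beautiful" (+1) and the pairs "beautiful and spacious", "spacious but expensive", "expensive and noisy", the sketch labels "spacious" as positive and both "expensive" and "noisy" as negative. A single noisy conjunction can mislabel a whole chain, which is one reason the original approach aggregates evidence statistically.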


[Figure: graph of adjectives (nice, handsome, terrible, comfortable, painful, expensive, fun, scenic, slow)]


Growing contextual opinion words

[Ding, Liu, Wu]

Intra-sentence conjunction rule
• Opinions on both sides of “and”, or in two consecutive sentences, tend to be the same.
• E.g., “This camera takes great pictures and has a long battery life.”
• But with a “but”-like clause, the opinions tend to be of opposite polarity.

Context is important
• Long battery life vs. long time to focus

Growing
• by applying various conjunctive rules

Verifying the results, using those conjunctive rules, as the system sees more reviews

Only keep those opinions which the system is confident about, controlled by a confidence limit.
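
One way to picture the grow-and-verify loop is as vote counting over (opinion word, feature) pairs. The sketch below is an assumption-laden illustration, not the cited authors' algorithm: each observation is taken to be a polarity already inferred by a conjunction rule, and `min_count` and `min_ratio` are hypothetical stand-ins for the confidence limit.

```python
from collections import defaultdict

def grow_contextual_lexicon(observations, min_count=3, min_ratio=0.8):
    """Keep only context-dependent opinions that are seen consistently.

    observations: iterable of ((opinion_word, feature), polarity) pairs,
    polarity +1 / -1, each inferred from one review by a conjunction rule.
    """
    votes = defaultdict(lambda: [0, 0])  # [positive votes, negative votes]
    for key, polarity in observations:
        votes[key][0 if polarity > 0 else 1] += 1
    lexicon = {}
    for key, (pos, neg) in votes.items():
        total = pos + neg
        # Confidence limit: enough evidence and a clear majority.
        if total >= min_count and max(pos, neg) / total >= min_ratio:
            lexicon[key] = 1 if pos > neg else -1
    return lexicon
```

With four positive observations of ("long", "battery life") and three negative observations of ("long", "time to focus"), both pairs are kept with opposite polarities, while a pair with mixed votes is dropped.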


Semantic Orientation by Association

Labeled semantic orientation of words

Pwords = {good, nice, excellent, positive, fortunate, correct, superior}

Nwords = {bad, nasty, poor, negative, unfortunate, wrong, inferior}

Various approaches to calculate the semantic association A of two words
• Pointwise Mutual Information (PMI) [Church and Hanks 1989]
• Latent Semantic Indexing (LSI) [Dumais et al. 1990]
• Likelihood Ratios [Dunning 1993]

$$SO(word) = \sum_{pword \in Pwords} A(word, pword) - \sum_{nword \in Nwords} A(word, nword)$$
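
The orientation score can be computed with any association measure plugged in for A. The sketch below uses a PMI estimated from sentence-level co-occurrence in a toy corpus. The estimator, the smoothing constant `eps`, and the choice to return 0 for unseen words are assumptions made for illustration, not part of the cited work.

```python
import math

PWORDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
NWORDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]

def make_pmi(sentences, eps=0.01):
    """Return A(w1, w2): PMI estimated from sentence-level co-occurrence."""
    sets = [set(s.lower().split()) for s in sentences]
    n = len(sets)

    def count(*words):
        return sum(1 for s in sets if all(w in s for w in words))

    def pmi(w1, w2):
        c1, c2 = count(w1), count(w2)
        if c1 == 0 or c2 == 0:
            return 0.0  # assumption: no evidence means no association
        joint = (count(w1, w2) + eps) / n  # eps avoids log(0)
        return math.log2(joint / ((c1 / n) * (c2 / n)))

    return pmi

def semantic_orientation(word, assoc, pwords=PWORDS, nwords=NWORDS):
    """SO(word) = sum of A(word, pword) minus sum of A(word, nword)."""
    return sum(assoc(word, p) for p in pwords) - sum(assoc(word, n) for n in nwords)
```

A positive score suggests the word tends to appear with the positive seeds, a negative score with the negative seeds. On a realistic corpus the estimates would need far more data than a toy example provides.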


Turney 2002; Turney & Littman 2003

Determine the semantic orientation of each extracted phrase based on its association with seven positive and seven negative seed words

$$PMI(word_1, word_2) = \log_2 \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}$$

$$SO\text{-}PMI\text{-}IR(word) = \log_2 \frac{hits(word \; NEAR \; p\_query)\; hits(n\_query)}{hits(word \; NEAR \; n\_query)\; hits(p\_query)}$$
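
As a worked illustration of the SO-PMI-IR formula, the sketch below takes the four hit counts as plain numbers. How the counts are obtained (which search engine, which NEAR operator) is left out, and the small `smoothing` term added to the NEAR counts to avoid division by zero is an assumption of this sketch.

```python
import math

def so_pmi_ir(hits_near_p, hits_near_n, hits_p, hits_n, smoothing=0.01):
    """Semantic orientation from search-engine hit counts.

    hits_near_p: hits(word NEAR p_query)
    hits_near_n: hits(word NEAR n_query)
    hits_p, hits_n: hits(p_query), hits(n_query), both assumed > 0.
    """
    numerator = (hits_near_p + smoothing) * hits_n
    denominator = (hits_near_n + smoothing) * hits_p
    return math.log2(numerator / denominator)
```

For example, a word that appears near the positive query ten times as often as near the negative query, with equal base counts for both queries, scores about log2(10) ≈ 3.3; a word that appears equally often near both scores 0.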


Weakly supervised learning

Gamon & Aue 2005

Given a list of seed words (seed words 1)

Get more seed words (seed words 2): words with low PMI at sentence level

Get semantic orientation of (seed words 2) by PMI at document level

Get semantic orientation of all words by PMI with all seed words


Document level opinion analysis

Polarity classification: classify documents (e.g., reviews) based on the overall sentiments expressed by authors.

Approaches
• Use opinion lexicon

• Knowledge Engineering

• Supervised learning techniques

• Classifying using the Web as a corpus

• Semi-supervised


Knowledge Engineering

Make use of lists of sentiment terms

Manually create analysis components based on cognitive linguistic theory: parser, feature structure representation, etc.


Supervised polarity classifier

Requirements: a labeled database of opinions
• Download ratings from Amazon.com, epinions.com, etc.

Build a binary opinion classifier
• From positive and negative ratings
• Merge 1 and 2 stars to negative and 3, 4 and 5 to positive
• Use thresholded SVM, maximum entropy, naïve Bayes, etc.


Supervised Training

1. Obtain Labeled Sentences: positive, neutral, negative

2. Extract features: words, n-grams, multi word expressions, feature generalization [Kim & Hovy 2007]

3. Feature values: binary/ frequency

4. Run Training algorithm on the features to give a classifier

5. [Optional] Do feature selection (use log-likelihood ratio)
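
Steps 1 to 4 above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a reference implementation: labels come from the star-merging rule of the previous slide, features are binary unigrams and bigrams, and the classifier is a simplified naive Bayes that scores only the features present in the input, with add-one smoothing. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def stars_to_label(stars):
    # Merging rule from the slides: 1-2 stars negative, 3-5 stars positive.
    return "neg" if stars <= 2 else "pos"

def extract_features(text):
    # Binary features: the set of unigrams and bigrams present in the text.
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed to classify."""
    doc_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        doc_counts[label] += 1
        for feat in extract_features(text):
            feature_counts[label][feat] += 1
            vocabulary.add(feat)
    return doc_counts, feature_counts, vocabulary

def classify(model, text):
    doc_counts, feature_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    feats = extract_features(text) & vocabulary  # ignore unseen features
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        for feat in feats:
            # Add-one smoothed estimate of P(feature present | label).
            score += math.log((feature_counts[label][feat] + 1) / (n_docs + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Usage: build `examples` as `[(text, stars_to_label(stars)), ...]`, call `train_naive_bayes(examples)`, then `classify(model, "great pictures")`. Frequency-valued features or an SVM could be swapped in without changing the overall pipeline.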


Semi-supervised approaches

Fully supervised techniques require
• large amount of labeled data for the given domain

Semi-supervised systems
• Use small amount of domain knowledge

1. From a small set of seed words use domain corpus to get domain relevant opinion words as discussed earlier


Semi-supervised approach

Gamon & Aue 2005

1. Obtain opinion words by semi-supervised approach

2. Given a domain corpus, label data using average semantic orientation

3. Train classifier on labeled data
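
Step 2 can be sketched as follows, assuming the semantic orientation scores from step 1 are available as a dictionary. The `margin` threshold, which leaves near-neutral documents unlabeled, is a hypothetical parameter introduced for this illustration and is not taken from the cited paper.

```python
def average_orientation(text, so_scores):
    """Mean semantic orientation of the known opinion words in a document."""
    tokens = text.lower().split()
    scored = [so_scores[t] for t in tokens if t in so_scores]
    return sum(scored) / len(scored) if scored else 0.0

def auto_label(corpus, so_scores, margin=0.5):
    """Label documents whose average orientation is clearly positive or negative.

    Documents inside the (-margin, margin) band are skipped rather than
    given an unreliable label.
    """
    labeled = []
    for text in corpus:
        avg = average_orientation(text, so_scores)
        if avg >= margin:
            labeled.append((text, "pos"))
        elif avg <= -margin:
            labeled.append((text, "neg"))
    return labeled
```

The resulting (text, label) pairs can then be passed to any supervised learner for step 3.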