Sentiment Classification with Case-Based Reasoning
DESCRIPTION
Given at the ICCBR conference, Sep/2012, Lyon, France.
TRANSCRIPT
Case-Based Approach to Cross-Domain Sentiment Classification
ICCBR - Sep/2012
Bruno Ohana
Sarah-Jane Delany
Brendan Tierney
Dublin Institute of Technology - Ireland
Outline
● Sentiment Classification
● Domain Dependence
● Lexicon-based methods.
● Case Based Approach
● Experiment and Results.
Sentiment Classification
● For a given piece of text, determine sentiment orientation.
● Positive or Negative?
“This is by far the worst hotel experience i've ever had. the owner overbooked while i was staying there (even though i booked the room two months in advance) and made me move to another room, but that room wasn't even a hotel room!”
Applications
● Search and Recommendation Engines.
○ Show only positive/negative/neutral.
● Market Research.
○ What is being said about brand X on Twitter?
● Ad Placement.
● Mediation of online communities.
Domain Dependence
Supervised Learning Methods
● Good performance, but:
○ Labeled data is expensive.
○ Availability for all domains is unlikely.
● Classifiers are domain specific.
○ Ex: “Kubrick” may be a good opinion predictor for film reviews, but not in other domains.
● (Aue & Gamon '05)
○ Straightforward train/test across domains yields poor results.
Using a Sentiment Lexicon
A database of terms associated with positive or negative sentiment.
● Manual: General Inquirer (Stone et al '67)
● Corpus Based: (Hatzivassiloglou & McKeown '97)
● Lexical Induction: SentiWordNet (Esuli et al '06)
● Some sample sizes:
○ GI: 4K
○ SWN: 26K
Approach:
● Scan the document for term occurrences; prediction is based on aggregated results for the positive/negative classes.
● No need for Training data sets.
Sentiment Classification with Lexicons
[Pipeline diagram: POS Tagger → NegEx → Classifier → Prediction, with the Sentiment Lexicon feeding the classifier]
Lexicon-Based classification
● Annotate text with POS and negation information.
● Identify words present in the lexicon.
○ Retrieve a numerical score from the lexicon indicating opinion.
● Aggregate results; use a rule to make the prediction.
○ Ex: max(PosScore, NegScore)
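The steps above can be sketched as follows. The lexicon entries and scores here are invented for illustration (SentiWordNet-style positive/negative term scores); they are not taken from any real lexicon, and POS/negation annotation is omitted to keep the sketch short.

```python
# Hypothetical lexicon: term -> (positive score, negative score).
LEXICON = {
    "worst": (0.0, 0.875),
    "appealing": (0.625, 0.0),
    "laughs": (0.5, 0.125),
}

def classify(tokens, lexicon=LEXICON):
    """Aggregate per-term scores, then predict via max(PosScore, NegScore)."""
    pos = neg = 0.0
    for tok in tokens:
        if tok in lexicon:          # identify words present in the lexicon
            p, n = lexicon[tok]     # retrieve the numerical opinion scores
            pos += p
            neg += n
    return "positive" if pos >= neg else "negative"

print(classify("it offers appealing visuals and big laughs".split()))
# -> positive
```

No training data is needed: the prediction comes entirely from the lexicon scores and the aggregation rule.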
Sentiment Classification with Lexicons
The/DT computer-animated/JJ comedy/NN ''/'' shrek/NN ''/'' is/VBZ designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS of/IN action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
The computer-animated comedy "shrek" is designed to be enjoyed on different levels by different groups . for children , it offers imaginative visuals , appealing new characters mixed with a host of familiar faces , loads of action and a barrage of big laughs
Lexicon-Based Classification: Issues
● Performance of supervised learning methods is better.
● Selection of lexicon and classifier is established upfront.
○ Ex: Use SWN with classifier F.
○ Your choice can be sub-optimal.
● Lexicons perform differently on different domains. (Ohana et al, '11)
Sentiment Classification with Lexicons
[Pipeline diagram: POS Tagger → NegEx → Classifier → Prediction, with the Sentiment Lexicon feeding the classifier]
Classifier Considerations
● Which sentiment lexicon to use?
● How to apply term sentiment information to the document?
○ What part-of-speech to use.
○ Enable/disable negation detection.
○ How to count terms? (once, every time, adjusted for frequency)
[Diagram: multiple candidate classifiers and sentiment lexicons to choose from]
Our Approach
Build a case-base using out-of-domain data where:
● Problem description maps to document characteristics.
● Solution description maps to successful combinations of lexicons/classifiers.
Use the case base to decide which lexicon and classifier to use on a new document/domain.
Experiment - Case Representation
Problem Description
● Counts for words, tokens and sentences; avg. sentence size.
● Part-of-speech frequencies.
● Total syllable and monosyllable counts.
● Spacing ratio; word-token ratio.
● Stop words ratio.
● Unique words count.
Solution Description
● Set of lexicons S={L1,...,Ln} that yielded a correct prediction on the input document.
● We use 5 different lexicons from the literature.
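A minimal sketch of extracting the problem-description features listed above. The feature names, the naive whitespace/period tokenization, and the tiny stop-word list are all illustrative stand-ins, not the paper's actual feature extractor.

```python
def problem_description(text):
    """Map a raw document to surface features (counts and ratios)."""
    tokens = text.split()
    words = [t for t in tokens if t.isalpha()]
    sentences = [s for s in text.split(".") if s.strip()]
    stop = {"the", "a", "of", "and", "is", "it"}  # tiny illustrative stop list
    return {
        "n_tokens": len(tokens),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_sentence_size": len(tokens) / max(len(sentences), 1),
        "word_token_ratio": len(words) / max(len(tokens), 1),
        "stopword_ratio": sum(t.lower() in stop for t in words) / max(len(words), 1),
        "unique_words": len({w.lower() for w in words}),
    }
```

All features are domain-independent surface statistics, which is what allows the case base to be built from out-of-domain documents.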
Experiment - Data Sets
User generated reviews on 6 domains.
● English, plain text.
● Balanced classes.
● Borderline cases removed.
Data Set Size Source
Hotels 2874 Tripadvisor
Films 2000 IMDB
Electronics 2072 Amazon.com
Music 5902 Amazon.com
Books 2034 Amazon.com
Apparel 566 Amazon.com
Experiment - Case Base
6 domains.
● Customer reviews in raw text.
● Build 6 case-bases of 5 domains each (leave one out).
Movies
Electronics
Apparel
Hotels
Books
Music Albums
Building the Case Base
Experiment - Case Bases
Case creation:
● A case is kept when at least one lexicon gives a correct prediction.
Left out Domain Case Base Size % Positive % Negative
Books 9683 53.3 46.7
Electronics 9592 53.6 46.4
Film 9614 54.1 45.9
Music 6137 52.6 47.4
Hotels 11516 53.5 46.5
Apparel 11002 53.4 46.6
Lexicons in Case Solution
Experiment - Retrieval and Ranking
● K-NN and Euclidean Distance.
● Ranking: Select most common Lexicon out of K cases retrieved.
Solutions (k=3): case1 = {L1,L3,L4}, case2 = {L1,L2}, case3 = {L1,L3,L5}
Ranking (Count): L1 (3), L3 (2), L2/L4/L5 (1)
Selected: {L1}
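The retrieval-and-ranking step can be sketched as: a Euclidean k-NN over the case problem-description features, followed by a count of lexicons across the k retrieved solution sets. The function and variable names are illustrative.

```python
from collections import Counter
import math

def select_lexicon(query, case_base, k=3):
    """case_base: list of (feature_vector, solution_set) pairs.
    Returns the lexicon appearing most often in the k nearest solutions."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(case_base, key=lambda case: dist(query, case[0]))[:k]
    counts = Counter(lex for _, solution in nearest for lex in solution)
    return counts.most_common(1)[0][0]
```

With the three retrieved solution sets from the example ({L1,L3,L4}, {L1,L2}, {L1,L3,L5}), L1 appears 3 times and is selected.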
Case Based Approach
Experiment Results
Baseline Results
● Results for the lexicon that performed best in each domain (out of 5 lexicons).
Summary
Case Based Approach
● Selection of lexicon/classifier is left to the case-base.
● Expandable.
○ Easy to add more lexicons, classifiers, cases.
● Experimental results beat best-lexicon baseline in 4 of 6 domains.
Next Steps
Grow Solution Search Space
● More lexicons, more classifiers.
Retrieval and Ranking
● For a larger search space, the current ranking will not scale.
● Room to improve the case problem description.
Case Base Creation
● Add negative results instead of discarding them.
Thank You.