Generating High-Coverage Semantic Orientation Lexicons From Overtly Marked Words and a Thesaurus. Saif Mohammad, Cody Dunne, and Bonnie Dorr. Institute for Advanced Computer Studies and CLIP lab; Human-Computer Interaction Lab; Department of Computer Science, University of Maryland; Human Language Technology Center of Excellence.

Post on 15-Dec-2015


Page 1: Generating High-Coverage Semantic Orientation Lexicons From Overtly Marked Words and a Thesaurus † Institute for Advanced Computer Studies and CLIP lab

Generating High-Coverage Semantic Orientation Lexicons From Overtly Marked Words and a Thesaurus

Saif Mohammad†, Cody Dunne‡, and Bonnie Dorr†∗

†Institute for Advanced Computer Studies and CLIP lab
‡Human-Computer Interaction Lab
Department of Computer Science, University of Maryland.
∗Human Language Technology Center of Excellence.


Generating Semantic Orientation Lexicons. Mohammad, Dunne, Dorr.


Evaluative sentences

Sony’s new digital camera is fabulous.

The characters in the movie are flawed.

Creative solutions are valued.

Singapore has an immaculate transportation system.

Our waters have never been more contaminated.



Semantic orientation

Positive semantic orientation (SO) (or polarity)
◦ Term is often used to convey favorable sentiment or evaluation of the target.
◦ E.g.: excellent, happy, honest, …

Negative semantic orientation
◦ Term is often used to convey unfavorable sentiment or evaluation of the target.
◦ E.g.: poor, sad, dishonest, …


Applications

Automatic product recommendation systems (Tatemura, 2000; Terveen et al., 1997)

Question answering (Somasundaran et al., 2007; Lita et al., 2005)

Summarizing multiple viewpoints and opinions (Seki et al., 2004; Mohammad et al., 2008a)

Identifying flames (Spertus, 1997)

Appropriate ad placement (Jin et al., 2007)


Manually created lexicons

General Inquirer (GI) (Stone et al., 1966)
◦ http://www.wjh.harvard.edu/inquirer
◦ has labels for only about 3,600 entries

Pittsburgh subjectivity lexicon (PSL) (Wilson et al., 2005)
◦ http://www.cs.pitt.edu/mpqa
◦ draws from the General Inquirer and other sources
◦ has labels for only about 8,000 words.


Automatically created lexicons

Hatzivassiloglou and McKeown (1997)
◦ a supervised algorithm to determine the semantic orientation of adjectives.

Turney and Littman lexicon (TLL) (2003)
◦ Exploit tendency to co-occur with a seed set
◦ Need very large corpora (100 billion words)

Esuli and Sebastiani (2006): SentiWordNet (SWN)
◦ Attach labels to WordNet synsets
◦ Use supervised classifiers
◦ Need significant manual annotation


Semantic oppositeness scale


[Figure: a scale running from antonymous (e.g., big–small) to not antonymous (e.g., big–large).]

Many antonym pairs have opposite semantic orientation (one positive, one negative): good–bad; beautiful–ugly; honest–dishonest


Detecting word-pair antonymy: Mohammad, Dorr, Hirst (2008)

Use affix patterns to identify seed pairs of strong antonyms.

Use a Roget-like thesaurus to identify near-synonyms of seed words.

Mark pairs of words near-synonymous to seed pairs as contrasting.

The degree of antonymy is proportional to their tendency to co-occur.

Created a list of more than 3 million strongly antonymous word pairs.


Our approach

Identify a seed set of positive and negative words:
◦ From edicts of marking theory

Identify their synonyms:
◦ Use a Roget-like thesaurus

Mark as negative:
◦ words synonymous with a negative seed

Mark as positive:
◦ words synonymous with a positive seed


Step 1: Identify seed words

From marking theory:
◦ Overtly marked words tend to be negative. E.g., undo, unhappy, dishonest, immobile
◦ Their unmarked counterparts tend to be positive. E.g., do, happy, honest, mobile

Exceptions exist:
◦ impartial–partial, unbiased–biased, unstuck–stuck


Affix patterns

word1    word2    # of word pairs    example pairs
X        disX     382                honest–dishonest
X        imX      196                possible–impossible
X        inX      691                consistent–inconsistent
X        malX     28                 adroit–maladroit
X        misX     146                fortune–misfortune
X        nonX     73                 sense–nonsense
X        unX      844                happy–unhappy
X        Xless    208                gut–gutless
lX       illX     25                 legal–illegal
rX       irX      48                 responsible–irresponsible
Xless    Xful     51                 harmless–harmful
Total             2692
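
The affix patterns in this table can be applied mechanically to a word list. The following is a minimal sketch, assuming a plain Python collection of known words stands in for the thesaurus vocabulary. The names prefix_rule, suffix_swap_rule, RULES, and find_seed_pairs are hypothetical and do not come from the original work. It treats word1 as the unmarked (positive) seed and word2 as the overtly marked (negative) seed, and it assumes a pair counts only when both forms are in the vocabulary.

```python
def prefix_rule(prefix, required_initial=""):
    """Build a rule mapping X to prefix + X, optionally only for words
    that start with required_initial (used for lX -> illX, rX -> irX)."""
    def apply(word):
        if word.startswith(required_initial):
            return prefix + word
        return None
    return apply


def suffix_swap_rule(old_suffix, new_suffix):
    """Build a rule that replaces old_suffix (possibly empty) with new_suffix."""
    def apply(word):
        stem_length = len(word) - len(old_suffix)
        if stem_length > 0 and word.endswith(old_suffix):
            return word[:stem_length] + new_suffix
        return None
    return apply


# One rule per row of the affix pattern table (labels are illustrative).
RULES = {
    "X -> disX": prefix_rule("dis"),
    "X -> imX": prefix_rule("im"),
    "X -> inX": prefix_rule("in"),
    "X -> malX": prefix_rule("mal"),
    "X -> misX": prefix_rule("mis"),
    "X -> nonX": prefix_rule("non"),
    "X -> unX": prefix_rule("un"),
    "X -> Xless": suffix_swap_rule("", "less"),
    "lX -> illX": prefix_rule("il", required_initial="l"),
    "rX -> irX": prefix_rule("ir", required_initial="r"),
    "Xless -> Xful": suffix_swap_rule("less", "ful"),
}


def find_seed_pairs(vocabulary):
    """Return (positive_seed, negative_seed, pattern) triples.

    Assumption: a pair is kept only if both words occur in the vocabulary.
    """
    vocab = set(vocabulary)
    pairs = []
    for word in sorted(vocab):
        for pattern, rule in RULES.items():
            marked = rule(word)
            if marked is not None and marked in vocab:
                pairs.append((word, marked, pattern))
    return pairs


if __name__ == "__main__":
    toy_vocabulary = ["honest", "dishonest", "legal", "illegal",
                      "harmless", "harmful", "gut", "gutless", "table"]
    for positive, negative, pattern in find_seed_pairs(toy_vocabulary):
        print(pattern, positive, negative)
```

Pure string matching like this would also produce the exceptions noted under Step 1 (for example, partial would come out as the positive seed and impartial as the negative one), so the sketch makes no attempt to filter them.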


Step 2: Identify synonyms of seed words

Take synonyms from a Roget-like thesaurus
◦ We used the Macquarie Thesaurus
◦ Has 98,000 word-types


Thesaurus categories

All words classified into ~1000 categories

ability, absence, accept, accompanied, action, affect, affirm, agree, allow, approach, ask, assemble, attack, attitude, awareness, be, beautiful, beings, belief, better, big, blood, body, breath, calm, care for, careful, cause, certain, change, choice, clean, clear, collect, colors, comfort, concern, conflict, connect, continue, control, convex, correct, count, courtesy


Example category entry

369 HONESTY

adj. paragraph honest above board authentic bona fide legit …

noun paragraph bona fides reliability soundness trueness trustiness …

adj. paragraph reliable sound steadfast trustworthy trusty …

noun paragraph honesty incorruptness integrity probity sincerity …


Words in each paragraph are near-synonyms.



Step 3: Mark as positive synonyms of positive seeds

369 HONESTY

Seed pair: honest (positive) – dishonest (negative)

adj. paragraph (all words marked +) honest above board authentic bona fide legit …

Seed pair: reliable (positive) – unreliable (negative)

adj. paragraph (all words marked +) reliable sound steadfast trustworthy trusty …

noun paragraph bona fides reliability soundness trueness trustiness …

noun paragraph honesty incorruptness integrity probity sincerity …


Step 4: Mark as negative synonyms of negative seeds

370 DISHONESTY

Seed pair: honest (positive) – dishonest (negative)

adj. paragraph (all words marked −) crooked dishonest knavish shady unjust …

noun paragraph crookedness dishonesty fraudulence improbity trickery …

… …


Majority voting

All words in a paragraph are assigned identical orientation. If multiple seeds are in the same paragraph:
◦ simple voting determines orientation.

369 HONESTY

noun paragraph honesty incorruptness integrity probity sincerity …

Seed pairs (positive – negative) and the votes they cast in this paragraph:
◦ honesty – dishonesty: honesty votes +
◦ corruptness – incorruptness: incorruptness votes −
◦ probity … – improbity: probity votes +
◦ sincerity … – insincerity: sincerity votes +


Positive orientation has the majority, so all words in the paragraph are marked positive.
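
The marking and voting steps can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes thesaurus paragraphs are available as a dictionary from a paragraph identifier to a list of words, and the names label_paragraphs and expand_to_entries are hypothetical. The slides do not say what happens on a tie or when a paragraph has no seed, so the sketch leaves such paragraphs unlabeled and flags that as an assumption.

```python
from collections import Counter


def label_paragraphs(paragraphs, positive_seeds, negative_seeds):
    """Assign one orientation per thesaurus paragraph by simple voting.

    paragraphs: dict mapping a paragraph id to a list of words.
    Returns a dict mapping paragraph id to 'positive' or 'negative'.
    """
    labels = {}
    for paragraph_id, words in paragraphs.items():
        votes = Counter()
        for word in words:
            if word in positive_seeds:
                votes["positive"] += 1
            if word in negative_seeds:
                votes["negative"] += 1
        if votes["positive"] > votes["negative"]:
            labels[paragraph_id] = "positive"
        elif votes["negative"] > votes["positive"]:
            labels[paragraph_id] = "negative"
        # Assumption: ties and seedless paragraphs stay unlabeled.
    return labels


def expand_to_entries(paragraphs, labels):
    """Give every word in a labeled paragraph that paragraph's orientation."""
    return {
        (word, paragraph_id): labels[paragraph_id]
        for paragraph_id, words in paragraphs.items()
        if paragraph_id in labels
        for word in words
    }


if __name__ == "__main__":
    # Words taken from the HONESTY and DISHONESTY examples in the slides.
    paragraphs = {
        "369 HONESTY / noun": ["honesty", "incorruptness", "integrity",
                               "probity", "sincerity"],
        "370 DISHONESTY / adj": ["crooked", "dishonest", "knavish",
                                 "shady", "unjust"],
    }
    positive_seeds = {"honesty", "corruptness", "probity", "sincerity", "honest"}
    negative_seeds = {"dishonesty", "incorruptness", "improbity",
                      "insincerity", "dishonest"}
    labels = label_paragraphs(paragraphs, positive_seeds, negative_seeds)
    print(labels)
    print(expand_to_entries(paragraphs, labels))
```

In the HONESTY noun paragraph, three positive seeds outvote the single negative seed (incorruptness), so integrity, which is not a seed at all, also ends up positive.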


Sense and word lexicons

Macquarie Semantic Orientation Lexicon (MSOL)
◦ Assigns orientation to word–category combinations
◦ Categories are coarse word senses

Most natural language text is not sense disambiguated

We create word lexicons from MSOL and SentiWordNet
◦ By choosing for each word the orientation most common amongst its senses
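
Collapsing a sense-level lexicon into a word-level one amounts to a per-word vote over senses. The sketch below assumes the sense lexicon is a dictionary keyed by (word, category) pairs. The function name word_lexicon_from_senses and the toy entries are hypothetical, and the category names "CATEGORY A" and "CATEGORY B" are invented placeholders. How ties are resolved is not stated in the slides, so tied words are simply omitted here.

```python
from collections import Counter, defaultdict


def word_lexicon_from_senses(sense_lexicon):
    """Pick, for each word, the orientation most common among its senses.

    sense_lexicon: dict mapping (word, category) to 'positive' or 'negative'.
    Returns a dict mapping word to 'positive' or 'negative'.
    """
    tallies = defaultdict(Counter)
    for (word, _category), orientation in sense_lexicon.items():
        tallies[word][orientation] += 1

    word_lexicon = {}
    for word, tally in tallies.items():
        if tally["positive"] > tally["negative"]:
            word_lexicon[word] = "positive"
        elif tally["negative"] > tally["positive"]:
            word_lexicon[word] = "negative"
        # Assumption: words with tied counts are left out.
    return word_lexicon


if __name__ == "__main__":
    # Toy data for illustration only; not real MSOL entries.
    toy_sense_lexicon = {
        ("sound", "369 HONESTY"): "positive",
        ("sound", "CATEGORY A"): "positive",
        ("sound", "CATEGORY B"): "negative",
        ("shady", "370 DISHONESTY"): "negative",
    }
    print(word_lexicon_from_senses(toy_sense_lexicon))
```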


Size of lexicons

SentiWordNet (SWN)
◦ 56,200 entries (85.1% positive and 14.9% negative)

Affix seeds lexicon (ASL)
◦ 5,031 entries (47.3% positive and 52.7% negative)

MSOL(ASL)
◦ 51,157 entries (66.8% positive and 33.2% negative)
◦ 3,643 multi-word expressions

MSOL(ASL and GI)
◦ Uses both affix pairs and GI entries as seeds
◦ 76,400 entries (39.9% positive and 60.1% negative)
◦ Available for download: http://www.umiacs.umd.edu/~saif/WebPages/ResearchInterests.html#SemanticOrientation


Intrinsic evaluation: The percentage of GI entries that match those of the automatically generated lexicons.

[Figure: bar chart comparing SWN, TLL, and MSOL(ASL); vertical axis from 0 to 90, labeled F-score.]
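
One way to read the intrinsic measure is as a simple agreement rate against GI. The sketch below is an illustration under that reading: the function name gi_match_percentage is hypothetical, both lexicons are assumed to be dictionaries from word to 'positive' or 'negative', and a GI word missing from the generated lexicon is assumed to count as a non-match. The slides do not spell out these details.

```python
def gi_match_percentage(gi_lexicon, generated_lexicon):
    """Percentage of GI entries whose label is reproduced by the generated lexicon.

    Assumption: GI words absent from the generated lexicon count as non-matches.
    """
    if not gi_lexicon:
        return 0.0
    matches = sum(
        1 for word, label in gi_lexicon.items()
        if generated_lexicon.get(word) == label
    )
    return 100.0 * matches / len(gi_lexicon)


if __name__ == "__main__":
    # Toy lexicons for illustration only.
    gi = {"good": "positive", "bad": "negative",
          "honest": "positive", "poor": "negative"}
    generated = {"good": "positive", "bad": "positive", "honest": "positive"}
    print(gi_match_percentage(gi, generated))
```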


Extrinsic evaluation

Gold standard of phrases manually annotated with semantic orientation:
◦ MPQA corpus (version 1.1)
◦ positive phrases (1726) and negative phrases (4485)

A simple algorithm to determine the polarity of a phrase:
◦ If target phrase has a negative word, then the phrase is marked negative.
◦ If target phrase has no negative word and has at least one positive word, then it is marked positive.
◦ Otherwise, the classifier refrains from assigning a tag.

Even better accuracies: supervised classifiers and more sophisticated context features (Choi and Cardie, 2008).
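
The three-rule phrase tagger described on this slide is small enough to write out directly. The following is a sketch under stated assumptions: the function name phrase_polarity is hypothetical, the lexicon is a dictionary from lowercase word to 'positive' or 'negative', and tokenization is a simple regular expression that does not handle multi-word expressions. The toy lexicon uses words from the evaluative sentences at the start of the talk, with labels chosen for illustration.

```python
import re


def phrase_polarity(phrase, lexicon):
    """Tag a phrase using the simple rule-based scheme from the slide.

    Returns 'negative' if any word is negative, 'positive' if there is no
    negative word and at least one positive word, and None when the
    classifier refrains from assigning a tag.
    """
    # Assumption: lowercase alphabetic tokens (with apostrophes) are enough.
    tokens = re.findall(r"[a-z']+", phrase.lower())
    orientations = {lexicon[token] for token in tokens if token in lexicon}
    if "negative" in orientations:
        return "negative"
    if "positive" in orientations:
        return "positive"
    return None


if __name__ == "__main__":
    # Toy lexicon for illustration only.
    toy_lexicon = {
        "fabulous": "positive",
        "immaculate": "positive",
        "flawed": "negative",
        "contaminated": "negative",
    }
    for phrase in ["Sony's new digital camera is fabulous.",
                   "fabulous but flawed",
                   "digital camera"]:
        print(phrase, "->", phrase_polarity(phrase, toy_lexicon))
```

Because a single negative word decides the tag, a phrase containing both positive and negative words is always tagged negative under this scheme.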


Extrinsic evaluation: Performance of phrase polarity tagging. No semantic-orientation labeled data used.

[Figure: bar chart of F-score (axis from 0 to 0.6) for ASL, GI, SWN, TLL, and MSOL(ASL).]


Extrinsic evaluation: Performance of phrase polarity tagging. Using GI labels.

[Figure: bar chart of F-score (axis from 0.2 to 0.8) for PSL, GI-SWN, and MSOL(ASL,GI).]


Orientation of thesaurus categories

Red: negative; Blue: positive; Size of node: intensity; Edge: oppositeness



Pollyanna Hypothesis

People use positive expressions more frequently than negative expressions. (Boucher and Osgood, 1969; Kelly, 2000)

[Figure: bar chart of the percentage of positive and negative entries in ASL (5031 entries) and MSOL(ASL) (51157 entries); vertical axis: percentage of entries, 0 to 80.]


Summary

Created a high-coverage semantic orientation lexicon:
◦ using only affix rules and a Roget-like thesaurus.
◦ no manually annotated semantic orientation labels required.

The lexicon:
◦ has about twenty times the number of entries in GI.
◦ has entries for both single words and common multi-word expressions.
◦ more useful in phrase-polarity annotation than SentiWordNet, GI, or the Turney lexicon.


Future work

Creating even better semantic orientation lexicons by combining:
◦ our approach (affix rules and thesaurus)
◦ with the Turney–Littman 2003 method (co-occurrence statistics).

Create orientation lexicons for resource-poor languages.
◦ use a bilingual dictionary
◦ use English thesaurus
◦ use affix rules from both (multiple) languages.
