On Empirical Sentiment Accuracy Bounds Shawn Rutledge, Chief Scientist

Upload: visible-technologies

Post on 07-Jul-2015

TRANSCRIPT

Page 1: Empirical Sentiment Accuracy Bounds

On Empirical Sentiment Accuracy Bounds

Shawn Rutledge, Chief Scientist

Page 2: Empirical Sentiment Accuracy Bounds

Copyright © 2011 Visible. All rights reserved.

Visible’s Sentiment Approach

Algorithms

• State of the art

• Beyond overhyped NLP

Features

• Deep experience

• Social NLP & Context

Data

• Massive proprietary data

A sentiment model based on years of labeling social data for enterprises.

10^7+ labels, 10^5+ topics, 10^2+ enterprises.

Visible was one of the first Social Media Monitoring solutions in the market.

Page 3: Empirical Sentiment Accuracy Bounds

We have tens of millions of human-annotated social media posts.

Page 4: Empirical Sentiment Accuracy Bounds

Basically all breakthroughs in the last two decades have come from better data.

Page 5: Empirical Sentiment Accuracy Bounds

Sentiment, The Accuracy Disconnect

• Claims: “We have 97% Accuracy”

• Experience: “The best vendor tested had 50% accuracy at the post level”

• Experience: Sentiment accuracy is the most dissatisfying feature according to Forrester research; only 45% are satisfied with vendor sentiment accuracy

There is a disconnect between the hype and the experience in the marketplace.

Page 6: Empirical Sentiment Accuracy Bounds

Key Findings

1. Solve relevance first, sentiment second.

2. Accuracy is the wrong measure to optimize.

3. Sentiment is more subjective than you think it is.

After spending several years on research with the best available data, here are some of the key findings.

Page 7: Empirical Sentiment Accuracy Bounds

We won’t have time to cover the first two. The third could be an alternate title for this talk.

Page 8: Empirical Sentiment Accuracy Bounds

Double Blind, Multi-Reviewer Study: Audit Findings, Large Financial Institution

No statistically significant difference between human labeled and AI labeled sentiment

1. Same posts labeled by both human labeling practice and automation.

2. At least two auditors grade each label. Blind to label source.

A typical study. Reviewers can’t tell the difference between Visible’s statistical models and human annotators.
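The slide does not say which significance test the audit used. As a rough sketch of how such a comparison might be checked, the Python below runs a two-sided two-proportion z-test on the share of labels graded correct for each source. The function name and the counts are hypothetical, and the test treats the two sets of grades as independent samples, which is a simplification because the same posts were labeled by both sources.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (not from the study): labels graded "correct" by auditors.
z, p_value = two_proportion_z_test(correct_a=370, n_a=500,   # human labels
                                   correct_b=360, n_b=500)   # automated labels
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these made-up counts the p-value is well above 0.05, so the difference would not be called significant. For paired data like this, a paired test such as McNemar's may be a better fit.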

Page 9: Empirical Sentiment Accuracy Bounds

Double Blind, Multi-Reviewer Study: Audit Findings, Large Financial Institution

No statistically significant difference between human labeled and AI labeled sentiment

1. Same posts labeled by both human labeling practice and automation.

2. At least two auditors grade each label. Blind to label source.

But…

Auditors agree with each other only 73% of the time [95% CI: 69%-77%].

So is sentiment “solved”? No. Auditors think people and automation are both poor, and they don’t agree with each other.
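One common way to arrive at an interval like the one quoted above is a normal-approximation (Wald) interval on the agreement proportion. The source does not state the method or the sample size, so the function name and the counts below are assumptions chosen only for illustration.

```python
import math


def agreement_ci(agreements, n, z=1.96):
    """Wald (normal-approximation) confidence interval for a proportion.

    z=1.96 corresponds to a 95% interval.
    """
    p = agreements / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts (not the study's): auditors agreed on 350 of 480 labels.
rate, low, high = agreement_ci(agreements=350, n=480)
print(f"agreement = {rate:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

The width of the interval shrinks with the square root of the number of graded labels, so a few hundred doubly graded posts are enough to pin agreement down to within a few points.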

Page 10: Empirical Sentiment Accuracy Bounds

Key Audit Findings, Large Financial Institution

Social Media Professionals Grading Human Annotations

Both auditors agree with label only 58% of the time (proxy for “hard” graders)

At least one auditor agrees with label 91% of the time (proxy for “easy” graders)

58% - 91% is a huge range.

Another way of looking at the same study.
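A minimal sketch of how the two figures above could be computed from paired auditor grades. The data layout (one pair of booleans per graded label) and the function name are assumptions; the ten grades are invented for illustration and are not the study's data.

```python
def audit_bounds(grades):
    """Return (strict, lenient) accuracy from paired auditor grades.

    grades: list of (auditor_1_agrees, auditor_2_agrees) booleans,
    one pair per graded label.
    strict  = share of labels both auditors accept ("hard" grader proxy)
    lenient = share of labels at least one auditor accepts ("easy" grader proxy)
    """
    n = len(grades)
    both = sum(1 for a, b in grades if a and b)
    at_least_one = sum(1 for a, b in grades if a or b)
    return both / n, at_least_one / n


# Hypothetical grades for ten labels.
grades = (
    [(True, True)] * 6       # both auditors accept the label
    + [(True, False)] * 2    # only the first auditor accepts
    + [(False, True)] * 1    # only the second auditor accepts
    + [(False, False)] * 1   # neither accepts
)
strict, lenient = audit_bounds(grades)
print(f"strict = {strict:.0%}, lenient = {lenient:.0%}")
```

The gap between the two numbers comes entirely from labels where the auditors split, which is why low auditor agreement produces such a wide range.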

Page 11: Empirical Sentiment Accuracy Bounds

True Across a Wide Variety of Problems

Multi-Reviewer 3rd party audits across a variety of Brands consistently show relatively low agreement rates.

About 81% Inter-Annotator Agreement [IQR: 78% - 83%]

This talk promised bounds and here they are.
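For readers who want to summarize their own audits the same way, this sketch computes a median and interquartile range over per-audit agreement rates using Python's standard library. The seven rates are hypothetical placeholders, not Visible's audit results.

```python
import statistics

# Hypothetical inter-annotator agreement rates, one per audit.
agreement_rates = [0.76, 0.78, 0.80, 0.81, 0.82, 0.83, 0.85]

median = statistics.median(agreement_rates)
# quantiles(n=4) returns the three quartile cut points (default method: exclusive).
q1, _, q3 = statistics.quantiles(agreement_rates, n=4)

print(f"median = {median:.0%}, IQR = [{q1:.0%}, {q3:.0%}]")
```

Reporting the IQR rather than a single number makes it clear how much agreement varies from one brand or problem to the next.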

Page 12: Empirical Sentiment Accuracy Bounds

80% is also consistent with academic research.

Page 13: Empirical Sentiment Accuracy Bounds

Take Aways

1. Yes, your team

2. Evaluating sentiment takes care

3. Accuracy claims in the 90s are either exaggerated or naïve (over-fit)

4. It will take effort to get your team in tight agreement on sentiment definitions

5. Real breakthroughs in sentiment accuracy will come from personalization

We all think we’re better than average drivers. Similarly, although most of us have heard something like the 80% agreement statistic, we don’t think it applies to us. The main thing I want you to take away from this talk is that it does apply to you. People within your department, your team, sitting in the cube next to you, disagree with you about 20% of the time.

Page 14: Empirical Sentiment Accuracy Bounds

The implications are also worth taking to heart. When people claim accuracies much higher than 80% they are either lying or they don’t know what they are doing (overfit to one dataset).

Page 15: Empirical Sentiment Accuracy Bounds

Similar to what has happened in Search, real breakthroughs will come through personalization. Deeper linguistics (dealing with sarcasm, humor, contextual knowledge) are interesting but can’t help break the 80% barrier.

If teams put the work into getting tight, consistent sentiment definitions (with >80% agreement), only then do algorithms have a chance to do that well.

Page 16: Empirical Sentiment Accuracy Bounds

Thank You!

@shawnrut

@Visible

VisibleTechnologies.com