Twitter Sentiment Detection via Ensemble Classification Using Averaged Confidence Scores

Matthias Hagen, Martin Potthast, Michel Büchner, Benno Stein

Bauhaus-Universität Weimar [www.webis.de]



Reproducibility in Computer Science

© www.webis.de, March 30th 2015


[Diagram: Insight (Phenomenon / Hypothesis), Data (Raw / Cooked), Experiment (Field / Lab), Software (Source / Executable)]

- General goal: algorithmic automation of certain tasks
- Evaluations establish success at doing so
- Both rely on pieces of software

[Diagram: the previous figure extended with Report (Model / Theory), which again relies on Software (Source / Executable)]

- Reports frequently lack information for “painless” reproduction
- Assets used during modeling and evaluation are frequently not published

Reproducibility in Computer Science: Motivation, Incentives, and Barriers to Reproducing and Sharing Software

Personal motivation to reproduce a piece of research (bias in parentheses):

1. to compare it with one’s own approach for a given task (high)
2. to double-check the results (e.g., to police fraud) (medium)
3. to employ it as a sub-module of another algorithm (low)
4. to complete a library on a given task (low)
5. to identify the best approach for application (low)


Personal incentives to share one’s software:

1. to ensure optimal performance in evaluations
2. to build trust
3.-5. to foster adoption in research and practice


Top barriers to sharing software [Stodden 2010] (n = 134):

- The time it takes to clean up and document for release: 77.78%
- Dealing with questions from users about the code / software: 51.85%
- Supporting others without getting credit / acknowledgement: 44.78%
- Patents or IP constraints: 40.00%

Reproducibility in Computer Science: Related Work from [Gollub et al. 2012]

[Figure: per-session bar chart of SIGIR 2011 papers claiming availability of their data; overall: 51%. Image source: Gollub et al. 2012]

- SIGIR 2011: 108 full papers, grouped by conference session
- Papers were analyzed regarding claims of availability; no attempts were made at downloading the mentioned assets


[Figure: the previous chart extended with claims of software availability; data: 51%, software: 18%. Image source: Gollub et al. 2012]


Reproducibility in Computer Science: Related Work from [Collberg et al. 2015]

[Figure: of 601 sampled papers, 63 had no code, 30 required special hardware, and 106 had author overlap, leaving 402 applicable papers. Image source: Collberg et al. 2015]

- Papers sampled from 8 ACM conferences and 5 journals
- About 21% of applicable papers contained a link to code
- Code successfully built for about 48% of applicable papers


[Figure: for the 402 applicable papers, code was obtained via a link in the paper (85), via web search (54), or after an email request (263 sent: 87 replied “yes”, 146 replied “no”, 30 did not reply), yielding 226 papers with code. Image source: Collberg et al. 2015]


[Figure: of the 226 papers with code, 32 failed to build, 130 built in under 30 minutes, and 64 built in more than 30 minutes. Image source: Collberg et al. 2015]

Our Reproducibility Study: Notebooks from a Shared Task


What we did:

- Reproduced three selected Tweet sentiment classifiers from SemEval 2013
- Trained an ensemble classifier


Why particularly this task?

- Related investigations where this task came up
- Focus on software, since data sets and performance measures are fixed

Why notebooks?

- Not all shared task notebooks are flawless
- Shared task results are frequently cited


Reproducibility approach:

- Create a low-bias situation toward the originals
- Reproduce rather than replicate
- Maximize performance where possible

An Ongoing Debate in Science Reproducibility: Replicability vs. Reproducibility

- Replicability: same input and same method ⇒ same output

Image source: F1000Research.com, 2014

- Reproducibility: similar input and equivalent method ⇒ comparable output

Image source: ScienceNews, 2015

- What should we do in case of failure?

Image source: The Economist, 2013

IMPROVE IT!

An Ongoing Debate in Science Reproducibility: Replicability vs. Reproducibility vs. Improvability

- Improvability: same input and better method ⇒ better output

(may include improvisation)


- Order of degrees of freedom: improve > reproduce > replicate
- What shared asset helps: source code > library or API > demo or web service


Possible code of conduct for reproducing research:

1. Try to improve it (e.g., by adding your own expertise and experience)
2. Try to reproduce it with variations (e.g., different domains of application)
3. As a last resort, replicate it, following the original to the letter

Our Reproducibility Study: Remarks on Reproducing the Selected Approaches


Selection criteria:

- High performance at SemEval 2013: NRC-Canada, GU-MLT-LT, KLUE
- Complementary approaches (i.e., not simply the top three)


Notable improvements / improvisations:

- All: feature descriptions generally very terse
- All: unification of Tweet normalization procedures
- All: L2-regularized logistic regression instead of the original learning algorithms
- All: different parameter settings per approach
- KLUE: creation of our own emoticon polarity dictionary
- KLUE: unification from word frequency to Boolean occurrence
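To make the learner swap concrete: all three reimplementations can be driven by one L2-regularized logistic regression over Boolean token-occurrence features. The sketch below is illustrative only; the toy tweets and the scikit-learn pipeline are assumptions, not the original SemEval feature extractors.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the SemEval 2013 Tweets.
tweets = [
    "I love this phone :)",
    "worst service ever",
    "great game tonight!",
    "this is terrible news",
]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(
    # binary=True records token occurrence instead of frequency,
    # mirroring the unification applied to the KLUE features.
    CountVectorizer(binary=True, lowercase=True),
    # L2-regularized logistic regression replaces the original
    # learning algorithms of all three approaches.
    LogisticRegression(penalty="l2", C=1.0),
)
clf.fit(tweets, labels)
```

After fitting, `clf.predict` yields a polarity label and `clf.predict_proba` exposes the per-class confidence scores that an ensemble can average.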


Performance comparison:

Team         Original SemEval 2013   Reimplementation   Delta
NRC-Canada   69.02                   69.44              +0.42
GU-MLT-LT    65.27                   67.27              +2.00
KLUE         63.06                   67.05              +3.99
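The ensemble rule in the title can be sketched in a few lines: average each base classifier's per-class confidence scores and pick the class with the highest mean. The score matrix below is made up for illustration; in the study the rows would come from the three reimplementations.

```python
import numpy as np

classes = ["negative", "neutral", "positive"]

# One row of class-confidence scores per base classifier
# (e.g. NRC-Canada, GU-MLT-LT, KLUE) for a single tweet.
scores = np.array([
    [0.10, 0.30, 0.60],
    [0.20, 0.35, 0.45],
    [0.15, 0.45, 0.40],
])

avg = scores.mean(axis=0)             # averaged confidence per class
label = classes[int(np.argmax(avg))]  # ensemble decision
```

Averaging smooths out individual outliers: the third classifier alone would vote neutral, but the averaged scores still favor positive.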

Our Reproducibility Study: Performance in the Context of SemEval

SemEval 2013:
Team            F1
Our ensemble    71.09
NRC-Canada      69.02
GU-MLT-LT       65.27
teragram        64.86
BOUNCE          63.53
KLUE            63.06
AMI&ERIC        62.55
FBM             61.17
AVAYA           60.84
SAIL            60.14
(27 more ...)

SemEval 2014:
Team             F1
TeamX            70.96
coooolll         70.14
RTRGO            69.95
NRC-Canada       69.85
Our ensemble     69.79
TUGAS            69.00
CISUC KIS        67.95
SAIL             67.77
Swiss-Chocolate  67.54
Synalp-Empathic  67.43
(40 more ...)

SemEval 2015:
Team             F1
Our ensemble     64.84
unitn            64.59
Isislif          64.27
INESC-ID         64.17
Splusplus        63.73
wxiaoac          63.00
IOA              62.62
Swiss-Chocolate  62.61
CLaC-SentiPipe   62.00
TwitterHawk      61.99
(30 more ...)


- Adding TeamX pushes our ensemble to the top in 2015
- Refer to [Hagen et al. 2015] for details
- Task organizers should predict ensemble performance as a baseline


Evaluation as a Service using TIRA

[www.tira.io]

Summary & Conclusion

Summary:

- State-of-the-art Twitter sentiment detection approaches are reproducible
- Our code is publicly available at GitHub: http://www.github.com/webis-de
- None of the existing approaches maximizes performance


Take-home messages:

- Computer science can tackle reproducibility at a fundamental level
- Replicability vs. reproducibility lacks a third dimension: improvability
- Reproducibility should incorporate personal expertise and experience
- Sharing software may greatly improve aspects of reproducibility


Open questions ahead:

- What are the most worthy targets?
- What constitutes impact in reproducing science?
- Will sharing software become the norm?


Thank you for your attention!
