


Classifying Tweets Based on Climate Change Stance

Chuma Kabaghe, Jason Qin

Department of Computer Science, Stanford University; Department of Chemical Engineering, Stanford University

Motivation

Climate change has become an increasingly polarizing topic in popular and political media. Using Twitter, we seek to develop a classifier that can discriminate between text expressing belief vs. disbelief in human-caused climate change. We classify tweets into positive, neutral, and negative categories using semisupervised methods: a Quasi-Newton Semi-Supervised SVM implementation (QN-S3VM) and Multinomial Naive Bayes with Expectation Maximization (MNB-EM). Our results show that semisupervised learning does not offer significant improvements over supervised learning.

Data Set
- Three categories: positive, neutral, and negative belief in human-caused climate change
- Manually labeled training/validation/test set of ~14K tweets
- Unlabeled set of ~38K tweets related to climate change/global warming, collected using Tweepy
- Data split 70/20/10 into train/test/validation
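A minimal sketch of one way to produce the 70/20/10 split, assuming scikit-learn; the `tweets` and `labels` lists here are illustrative stand-ins, not the actual ~14K-tweet dataset:

```python
from sklearn.model_selection import train_test_split

tweets = ["tweet %d" % i for i in range(100)]   # stand-in data
labels = [i % 3 - 1 for i in range(100)]        # stand-in stance labels in {-1, 0, 1}

# First carve off 30% of the data, then split that 30% into 20% test and 10% validation.
X_train, X_rest, y_train, y_rest = train_test_split(tweets, labels, test_size=0.3, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)
print(len(X_train), len(X_test), len(X_val))  # → 70 20 10
```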

Tweet | Class
"climate change, a factor in texas floods, largely ignored" | 1
"global warming on mars...." | 0
"scientists were attempting the same global warming scam 60 years ago" | -1

Methods

Supervised Learning

Multinomial Naive Bayes
- Unigram: each feature $x_j^{(i)}$ is a single word
- Bigram: each feature $x_j^{(i)} = (w_j^{(i)}, w_{j+1}^{(i)})$ is a pair of adjacent words
- Assume independence of all words/pairs within and between each example
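The two supervised baselines can be sketched with scikit-learn's `CountVectorizer` and `MultinomialNB`; the example tweets and labels below are illustrative stand-ins for the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in data echoing the example table; labels: 1 = positive, 0 = neutral, -1 = negative.
tweets = [
    "climate change, a factor in texas floods, largely ignored",
    "global warming on mars....",
    "scientists were attempting the same global warming scam 60 years ago",
]
labels = [1, 0, -1]

# Unigram model: each feature counts a single word.
unigram_nb = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
# Bigram model: each feature counts a pair of adjacent words.
bigram_nb = make_pipeline(CountVectorizer(ngram_range=(2, 2)), MultinomialNB())

unigram_nb.fit(tweets, labels)
bigram_nb.fit(tweets, labels)
```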

Semisupervised Learning: Self-Training and QN-S3VM

QN-S3VM
- Maximum-margin classification algorithm
- Finds the optimal labels y for the unlabeled data
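In outline, the S3VM objective (following Gieseke et al. [2]) jointly optimizes the margin and the guessed labels $\hat{y}_j$ of the unlabeled points; the hinge-loss form and the constants $C$, $C^*$ below are generic assumptions, not values taken from the poster:

$$\min_{w,\, b,\, \hat{y}} \;\; \frac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{m} \max\big(0,\, 1 - y_i (w^\top x_i + b)\big) \;+\; C^* \sum_{j=1}^{u} \max\big(0,\, 1 - \hat{y}_j (w^\top x_j + b)\big)$$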

MNB-EM

Joint likelihood of the labeled data:

$$\mathcal{L}(\varphi) = \prod_{i=1}^{m} \left( \prod_{j=1}^{n_i} p\big(x_j^{(i)} \mid y^{(i)}; \varphi_{k|y}\big) \right) p\big(y^{(i)}; \varphi_y\big)$$

With unlabeled data added, we maximize the labeled log-likelihood plus a $\lambda$-weighted lower bound on the unlabeled log-likelihood:

$$L = \sum_{i=1}^{m} \left[ \sum_{j=1}^{n_i} \log p\big(x_j^{(i)} \mid y^{(i)}; \varphi_{k|y}\big) + \log p\big(y^{(i)}; \varphi_y\big) \right] + \lambda \sum_{a=1}^{u} \log \sum_{y_a} Q_a(y_a)\, \frac{\prod_{l=1}^{n_a} p\big(x_l^{(a)} \mid y_a; \varphi_{k|y}\big)\, p\big(y_a; \varphi_y\big)}{Q_a(y_a)}$$

Self-Training
1. Train model with training data
2. Predict labels of unlabeled data
3. Add data with >50% confidence in its label to the training set
4. Repeat; the converged model is the final model
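The self-training loop above can be sketched as follows, assuming scikit-learn; `X_labeled`, `y_labeled`, and `X_unlabeled` are hypothetical count-feature inputs, not the poster's actual pipeline:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.5, max_rounds=10):
    """Repeatedly train, then absorb unlabeled examples predicted with confidence > threshold."""
    X, y, pool = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    model = MultinomialNB()
    for _ in range(max_rounds):
        model.fit(X, y)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        keep = proba.max(axis=1) > threshold  # confidence filter (>50% by default)
        if not keep.any():
            break
        # Move confidently pseudo-labeled examples into the training set.
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, model.classes_[proba[keep].argmax(axis=1)]])
        pool = pool[~keep]
    return model
```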

E-step:

$$Q_a(y_a) = \frac{\prod_{l=1}^{n_a} p\big(x_l^{(a)} \mid y_a; \varphi_{k|y}\big)\, p\big(y_a; \varphi_y\big)}{\sum_{y'} \prod_{l=1}^{n_a} p\big(x_l^{(a)} \mid y'; \varphi_{k|y}\big)\, p\big(y'; \varphi_y\big)}$$

M-step (with Laplace smoothing; $V$ is the vocabulary size, $K$ the number of classes):

$$\varphi_{k|y=c} = \frac{1 + \sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = c\} + \lambda \sum_{a=1}^{u} Q_a(c) \sum_{l=1}^{n_a} 1\{x_l^{(a)} = k\}}{V + \sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{y^{(i)} = c\} + \lambda \sum_{a=1}^{u} Q_a(c)\, n_a}$$

$$\varphi_{y=c} = \frac{1 + \sum_{i=1}^{m} 1\{y^{(i)} = c\} + \lambda \sum_{a=1}^{u} Q_a(c)}{K + m + \lambda u}$$
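A hedged sketch of the full MNB-EM loop: it uses scikit-learn's `MultinomialNB` with `sample_weight` to implement the soft, $\lambda$-weighted M-step (each unlabeled example is duplicated once per class and weighted by its posterior), rather than coding the count updates by hand. All input names are hypothetical stand-ins:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def mnb_em(X_l, y_l, X_u, lam=1.0, n_iters=10, alpha=1.0):
    """EM for semisupervised Multinomial NB, down-weighting unlabeled data by lam."""
    classes = np.unique(y_l)  # sorted, so it matches predict_proba's column order
    model = MultinomialNB(alpha=alpha).fit(X_l, y_l)
    for _ in range(n_iters):
        Q = model.predict_proba(X_u)                 # E-step: soft class posteriors
        # M-step: one weighted copy of each unlabeled example per class.
        X = np.vstack([X_l] + [X_u] * len(classes))
        y = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
        w = np.concatenate([np.ones(len(X_l))] + [lam * Q[:, k] for k in range(len(classes))])
        model = MultinomialNB(alpha=alpha).fit(X, y, sample_weight=w)
    return model
```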

Results

Comparing Algorithms

[Figure: accuracy per class]

Model | Training Acc (%) | Validation Acc (%) | Test Acc (%)
MNB Unigram | 76.8 | 67.6 | 66.4
MNB Bigram | 72.8 | 62.5 | 60.6
Naive Bayes Self-Training | 66.7 | 60.9 | 61.6
QN-S3VM | 59.2 | 59.0 | 55.0
MNB-EM | 81.9 | 62.5 | 63.9

Model Tuning

QN-S3VM kernel:

Kernel | Training Acc (%) | Validation Acc (%)
RBF | 59.2 | 59.0
Linear | 52.9 | 49.2

Smoothing parameter alpha:

Alpha | Training Acc (%) | Validation Acc (%)
1 | 67.3 | 60.7
5 | 76.8 | 62.3
20 | 81.9 | 65.2
50 | 82.6 | 65.1
100 | 82.9 | 64.5

Data Downsampling

Question: how much would unlabeled data help if we had less labeled data?

Conclusions
- The unigram Multinomial Naive Bayes model performs best
- Semisupervised learning with NB performs well, likely because the underlying NB model fits the data well
- More data would improve predictions

In future
- Incorporate more recently labeled data
- Attempt classification with DL models

References

[1] Edward C. Qian. "edwardcqian/climate_change_sentiment". GitHub (Mar. 2019). URL: https://github.com/edwardcqian/climate_change_sentiment
[2] Fabian Gieseke et al. "Sparse Quasi-Newton Optimization for Semi-Supervised Support Vector Machines". Vol. 1. Jan. 2012, pp. 45-54.
[3] Kamal Nigam et al. "Text Classification from Labeled and Unlabeled Documents using EM". Machine Learning 39.2 (May 2000), pp. 103-134. ISSN: 1573-0565. DOI: 10.1023/A:1007692713085. URL: https://doi.org/10.1023/A:1007692713085