2013-1 machine learning lecture 03 - sergio jimenez - text classification …
![Page 1: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/1.jpg)
Text Classification and Clustering with WEKA
A guided example by
Sergio Jiménez
![Page 2: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/2.jpg)
The Task
Building a model that classifies English-language movie reviews
as positive or negative.
![Page 3: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/3.jpg)
Sentiment Polarity Dataset Version 2.0
1000 positive and 1000 negative movie review texts from:
Thumbs up? Sentiment Classification using Machine Learning Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79--86, 2002.
“Our data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment.”
http://www.cs.cornell.edu/people/pabo/movie-review-data/
![Page 4: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/4.jpg)
The Data (1/2)
![Page 5: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/5.jpg)
The Data (2/2)
[Figure: two histograms of review length, one for the 1000 negative reviews and one for the 1000 positive reviews. x-axis: # characters; y-axis: # documents.]
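A length histogram like the ones on this slide can be approximated with a few lines of Python. This is only a sketch: the bin width of 1000 characters and the stand-in documents are assumptions, since the slide does not state how the bins were chosen.

```python
from collections import Counter

def length_histogram(texts, bin_width=1000):
    """Count documents per character-length bin; keys are bin lower bounds."""
    bins = Counter((len(t) // bin_width) * bin_width for t in texts)
    return dict(sorted(bins.items()))

# Hypothetical stand-in documents; the real input would be the review texts.
sample = ["a" * 120, "b" * 950, "c" * 1800, "d" * 1001]
print(length_histogram(sample))
```

Running the function once on the negative reviews and once on the positive reviews gives the two series to plot.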
![Page 6: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/6.jpg)
What is WEKA?
• “Weka is a collection of machine learning algorithms for data mining tasks”.
• “Weka contains tools for:
– data pre-processing,
– classification,
– regression,
– clustering,
– association rules,
– and visualization”
![Page 7: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/7.jpg)
Where to start?
![Page 8: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/8.jpg)
Getting WEKA
![Page 9: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/9.jpg)
Before Running WEKA
Increasing available memory for Java in RunWeka.ini
Change
maxheap=256m
to
maxheap=1024m
![Page 10: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/10.jpg)
Running WEKA
using
“RunWeka.bat”
![Page 11: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/11.jpg)
Creating a .arff dataset
![Page 12: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/12.jpg)
Saving the .arff dataset
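The slides show this step only as screenshots. To illustrate what such a file might contain, here is a minimal Python sketch that writes reviews as an ARFF file with one string attribute and one nominal class attribute. The attribute name `text`, the labels `pos` and `neg`, and the relation name are assumptions; `class1` follows the renaming suggested later in the slides.

```python
def arff_quote(text):
    """Quote a string value for ARFF, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    escaped = escaped.replace("\n", " ")  # keep each instance on one line
    return "'" + escaped + "'"

def to_arff(reviews, relation="movie_reviews"):
    """reviews: list of (text, label) pairs; returns the ARFF file content."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",        # hypothetical attribute name
        "@attribute class1 {pos,neg}",   # hypothetical label values
        "",
        "@data",
    ]
    for text, label in reviews:
        lines.append(f"{arff_quote(text)},{label}")
    return "\n".join(lines) + "\n"

print(to_arff([("great movie", "pos"), ("it's bad", "neg")]))
```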
![Page 13: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/13.jpg)
From text to vectors
$V = [v_1, v_2, v_3, \ldots, v_n, \mathit{class}]$
review1=“great movie”
review2=“excellent film”
review3=“worst film ever”
review4=“sucks”
Vocabulary (vector positions, in order): ever, excellent, film, great, movie, sucks, worst
$V_1 = [0, 0, 0, 1, 1, 0, 0, +]$
$V_2 = [0, 1, 1, 0, 0, 0, 0, +]$
$V_3 = [1, 0, 1, 0, 0, 0, 1, -]$
$V_4 = [0, 0, 0, 0, 0, 1, 0, -]$
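The mapping from the four example reviews to presence vectors can be reproduced with a short Python sketch. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which matches the column order on the slide.

```python
reviews = [
    ("great movie", "+"),
    ("excellent film", "+"),
    ("worst film ever", "-"),
    ("sucks", "-"),
]

# One vector position per distinct word, in alphabetical order.
vocabulary = sorted({w for text, _ in reviews for w in text.split()})

def to_vector(text, label):
    """Binary presence vector over the vocabulary, with the class appended."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocabulary] + [label]

vectors = [to_vector(text, label) for text, label in reviews]
for v in vectors:
    print(v)
```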
![Page 14: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/14.jpg)
Converting to Vector Space Model
Edit “movie_reviews.arff”
and change “class” to
“class1”. Apply the filter
again after the change.
![Page 15: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/15.jpg)
Visualize the vector data
![Page 16: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/16.jpg)
StringToWordVector filter options
Lowercase conversion
TF-IDF weighting
Stopword removal using a list of words in a file
Stemming
Use word frequencies instead of simple presence
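To make the effect of these options concrete, here is a rough Python sketch of lowercasing, stopword removal, presence versus frequency, and a simple TF-IDF weight. It is not WEKA's implementation: the stopword list is hypothetical, the weight `value * log(N / df)` is one common TF-IDF variant that may differ from the formula the filter uses, and stemming is left out.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "of"}  # hypothetical stopword list

def tokenize(text, lower_case=True, stopwords=None):
    """Split on whitespace, optionally lowercasing and dropping stopwords."""
    tokens = text.split()
    if lower_case:
        tokens = [t.lower() for t in tokens]
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

def vectorize(docs, use_counts=False, tf_idf=False, **tok_opts):
    """Return (vocabulary, rows) with one row of term values per document."""
    tokenized = [tokenize(d, **tok_opts) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for term in vocab:
            value = counts[term] if use_counts else int(counts[term] > 0)
            if tf_idf:
                value = value * math.log(n / df[term])
            row.append(value)
        rows.append(row)
    return vocab, rows

docs = ["The movie is great great", "the film is bad"]
print(vectorize(docs, stopwords=STOPWORDS))
print(vectorize(docs, use_counts=True, stopwords=STOPWORDS))
```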
![Page 17: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/17.jpg)
Generating datasets for experiments
| Dataset file name | Stopwords | Stemming | Presence or freq. |
|---|---|---|---|
| movie_reviews_1.arff | kept | no | presence |
| movie_reviews_2.arff | kept | no | frequency |
| movie_reviews_3.arff | kept | yes | presence |
| movie_reviews_4.arff | kept | yes | frequency |
| movie_reviews_5.arff | removed | no | presence |
| movie_reviews_6.arff | removed | no | frequency |
| movie_reviews_7.arff | removed | yes | presence |
| movie_reviews_8.arff | removed | yes | frequency |
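The eight datasets are the full cross product of three binary choices. A small Python sketch that enumerates them in the same order as the file numbering (the label `kept` for datasets without stopword removal is an assumed wording):

```python
from itertools import product

configs = []
options = product(["kept", "removed"], ["no", "yes"], ["presence", "frequency"])
for i, (stopwords, stemming, weighting) in enumerate(options, start=1):
    configs.append((f"movie_reviews_{i}.arff", stopwords, stemming, weighting))

for row in configs:
    print(row)
```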
![Page 18: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/18.jpg)
Classifying Reviews
Click!
Select a classifier
Select class attribute
Select number of folds
Start!
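The "number of folds" setting controls k-fold cross-validation: the data is split into k parts, and each part is held out once for testing while the rest is used for training. A minimal Python sketch of a class-balanced split follows. It is an illustration only: it does not shuffle the data, and the function names are hypothetical.

```python
def stratified_folds(labels, k=3):
    """Assign each instance index to one of k folds, balancing the classes."""
    folds = [[] for _ in range(k)]
    seen = {}  # how many instances of each class have been placed so far
    for index, label in enumerate(labels):
        position = seen.get(label, 0)
        folds[position % k].append(index)
        seen[label] = position + 1
    return folds

def train_test_splits(labels, k=3):
    """Yield (train_indices, test_indices) once per held-out fold."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

labels = ["pos"] * 6 + ["neg"] * 6
for train, test in train_test_splits(labels, k=3):
    print(len(train), len(test))
```

The reported accuracy is then the share of correctly classified instances accumulated over all held-out folds.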
![Page 19: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/19.jpg)
Results
![Page 20: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/20.jpg)
Results: Correctly Classified Reviews

| Dataset name | Stopwords | Stemming | Presence or freq. | NaiveBayes, 3-fold | NaiveBayesMultinomial, 3-fold |
|---|---|---|---|---|---|
| movie_reviews_1.arff | kept | no | presence | 80.65% | 83.80% |
| movie_reviews_2.arff | kept | no | frequency | 69.30% | 78.65% |
| movie_reviews_3.arff | kept | yes | presence | 79.40% | 82.15% |
| movie_reviews_4.arff | kept | yes | frequency | 68.10% | 79.70% |
| movie_reviews_5.arff | removed | no | presence | 81.80% | 84.35% |
| movie_reviews_6.arff | removed | no | frequency | 69.40% | 81.75% |
| movie_reviews_7.arff | removed | yes | presence | 78.90% | 82.40% |
| movie_reviews_8.arff | removed | yes | frequency | 68.30% | 80.50% |
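For readers unfamiliar with the multinomial variant, here is a minimal Python sketch of a multinomial Naive Bayes classifier with add-one smoothing, trained on the four toy reviews from the vector example. It illustrates the general technique and is not WEKA's NaiveBayesMultinomial code; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists. Returns {label: (log_prior, log_likelihoods)}."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(docs))
        # Add-one (Laplace) smoothing so unseen words get non-zero probability.
        log_like = {
            w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict(model, doc):
    """Return the label with the highest log-posterior; unknown words are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)

train_docs = [["great", "movie"], ["excellent", "film"],
              ["worst", "film", "ever"], ["sucks"]]
train_labels = ["+", "+", "-", "-"]
model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, ["great", "film"]))
print(predict(model, ["worst", "ever"]))
```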
![Page 21: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/21.jpg)
Attribute (word) Selection
Choose an attribute selection algorithm
Select the class attribute
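The slide does not name the attribute selection algorithm that was chosen. As an illustration of how a word can be scored against the class attribute, here is a Python sketch of information gain for a binary presence feature. The choice of information gain is an assumption, not necessarily the method used in the lecture.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(presence, labels):
    """presence: 0/1 per document for one word; labels: class per document."""
    classes = sorted(set(labels))

    def class_counts(rows):
        return [sum(1 for label in rows if label == c) for c in classes]

    gain = entropy(class_counts(labels))
    for value in (0, 1):
        subset = [l for p, l in zip(presence, labels) if p == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(class_counts(subset))
    return gain

labels = ["+", "+", "-", "-"]
print(information_gain([1, 1, 0, 0], labels))  # word that separates the classes
print(information_gain([1, 0, 1, 0], labels))  # word unrelated to the class
```

Ranking all words by such a score and keeping the top ones yields a pruned vocabulary.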
![Page 22: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/22.jpg)
Selected Attributes (words)
also
awful
bad
boring
both
dull
fails
pointless
poor
ridiculous
script
seagal
sometimes
stupid
deserves
effective
flaws
greatest
hilarious
memorable
great
joke
lame
life
many
maybe
mess
nothing
others
perfect
performances
tale
terrible
true
visual
waste
wasted
world
worst
animation
definitely
overall
perfectly
realistic
share
solid
subtle
terrific
unlike
view
wonderfully
![Page 23: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/23.jpg)
Pruned movie_reviews_1.arff dataset
![Page 24: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/24.jpg)
Naïve Bayes with the pruned dataset
![Page 25: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/25.jpg)
Clustering
Correctly clustered instances: 65.25%
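A figure like "correctly clustered instances" requires mapping each cluster to a class before counting matches. The Python sketch below tries every cluster-to-class assignment and keeps the best one. It assumes there are no more clusters than classes, and it is only an approximation of how such an evaluation can be done, not a description of WEKA's exact procedure.

```python
from itertools import permutations

def clustering_accuracy(cluster_ids, labels):
    """Best fraction of instances whose cluster maps to their true class."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(labels))
    best = 0
    # Try every one-to-one assignment of clusters to classes.
    for assignment in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(1 for c, l in zip(cluster_ids, labels) if mapping[c] == l)
        best = max(best, correct)
    return best / len(labels)

cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
labels = ["+", "+", "+", "-", "-", "-", "+", "-"]
print(clustering_accuracy(cluster_ids, labels))
```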
![Page 26: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/26.jpg)
Other results
Results of Pang et al. (2002) on version 1.0 of the dataset, with 700 positive and 700 negative reviews
![Page 27: 2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …](https://reader033.vdocuments.site/reader033/viewer/2022051613/54c63cc34a7959c9388b471a/html5/thumbnails/27.jpg)
Thanks