Automatic term extraction of dynamically updated text collections for sentiment
classification into three classes
Yuliya Rubtsova
The A.P. Ershov Institute of Informatics Systems (IIS)
Applied problems which can be solved with sentiment classification
study of consumer reviews of commercial products for businesses;
recommender systems;
human-machine interfaces that adapt the system's behavior to the person's current emotional state;
psychological and medical diagnosis;
safety control by analyzing the behavior of mass gatherings;
assistance in carrying out investigative measures.
Most common sentiment analysis approaches
Supervised machine learning
Dictionaries and rules
Combined method
Existing corpora
Corpora of reviews that contain user ratings
Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)
News corpora (few emotional texts)
Filtration
Texts containing both positive and negative emotions;
Uninformative tweets (fewer than 40 characters long);
Copied texts and retweets.
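The three filters listed above can be sketched roughly as follows. This is only an illustration: the slides do not say how emotions, copies or retweets were detected, so the emoticon lists, the "RT " prefix check and the whitespace-insensitive duplicate check are all assumptions. Only the 40-character threshold comes from the slide.

```python
# Hypothetical sentiment markers; the slides do not specify how emotions were detected.
POSITIVE_MARKERS = (":)", ":-)", ":D", "=)")
NEGATIVE_MARKERS = (":(", ":-(", "=(")
MIN_LENGTH = 40  # threshold stated on the slide


def keep_tweet(text, seen):
    """Return True if the tweet passes all three filters."""
    has_pos = any(marker in text for marker in POSITIVE_MARKERS)
    has_neg = any(marker in text for marker in NEGATIVE_MARKERS)
    if has_pos and has_neg:          # mixed emotions
        return False
    if len(text) < MIN_LENGTH:       # uninformative
        return False
    if text.startswith("RT "):       # retweet (assumed convention)
        return False
    key = " ".join(text.lower().split())
    if key in seen:                  # copied text
        return False
    seen.add(key)
    return True


def filter_tweets(tweets):
    """Keep only the tweets that pass the filters, preserving order."""
    seen = set()
    return [tweet for tweet in tweets if keep_tweet(tweet, seen)]
```

Calling `filter_tweets` on a list of raw strings returns the retained tweets in their original order.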
Corpus of short texts consists of
114 991 – positive texts
111 923 – negative texts
107 990 – neutral texts
Corpus of short texts

| Collection type | Number of words | Number of unique words |
|---|---|---|
| Positive messages | 1 559 176 | 150 720 |
| Negative messages | 1 445 517 | 191 677 |
| Neutral messages | 1 852 995 | 105 239 |

Distribution of unique terms depending on the number of tweets
Uniformity of the collections used
Word frequency distribution
Most common approaches to N-gram extraction
Manually, using a thesaurus.
Term extraction based on the significance of a term for the collection.
Data set characteristics
The entire data set is known
The entire data set is available
The entire data set is static (it cannot change during the calculation)
When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N^2).
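A small sketch may help show where the quadratic cost comes from. It assumes the usual idf form log(N / df), which the slides do not spell out; the function names are illustrative only. Because N appears in every idf value, adding one document changes the weight of every known term, so each arrival forces all earlier documents to be re-weighted.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Compute idf = log(N / df) for every term (assumed idf form)."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def reweighting_cost(n_documents):
    """Count document re-weightings when documents arrive one at a time."""
    cost = 0
    for k in range(1, n_documents + 1):
        cost += k  # the new document plus the k - 1 earlier ones
    return cost    # equals N * (N + 1) / 2, which grows as O(N^2)


before = idf_weights([["good", "film"], ["bad", "film"]])
after = idf_weights([["good", "film"], ["bad", "film"], ["good", "day"]])
# Every previously known term now has a different idf,
# including "bad", which does not occur in the new document.
changed = [term for term in before if before[term] != after[term]]
```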
Human speech is constantly changing => there is a need to update emotional dictionaries
Change in vocabulary and topics discussed
Percentage of references to the Olympic theme among all posts: February 12.00%, August 0.50%
Percentage of references to the vacation theme among all posts: February 0.06%, August 0.12%
Percentage of posts using the term “Sebyashka” (Russian for “selfie”) among all posts: February 0.00%, August 0.02%
Filtration
Punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);
References to significant personalities and events;
Proper names;
Numerals;
All links were replaced with the word "Link" and treated as a single token;
Runs of dots were replaced with an ellipsis.
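Part of this filtration can be sketched with regular expressions. The patterns below are assumptions rather than the rules actually used. Removing proper names and references to personalities and events would need a name list or a named-entity recognizer, which is left out here.

```python
import re

# All patterns are illustrative assumptions, not the original rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
DOTS_RE = re.compile(r"\.{2,}")
NUMERAL_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
PUNCT_RE = re.compile(r"[,:\"«»“”]")  # commas, colons, quotation marks


def normalize(text):
    """Apply the link, ellipsis, numeral and punctuation rules to one message."""
    text = URL_RE.sub(" Link ", text)    # links first, so their colons survive until here
    text = DOTS_RE.sub(" … ", text)      # runs of dots become one ellipsis
    text = NUMERAL_RE.sub(" ", text)     # drop numerals
    text = PUNCT_RE.sub(" ", text)       # drop filtered punctuation, keep ! and ?
    return " ".join(text.split())        # collapse the whitespace left behind
```

Links are replaced before punctuation is stripped so that the colon inside a URL does not split it into pieces.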
TF-ICF
C – the number of categories,
cf – the number of categories in which the weighted term is found
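The formula itself did not survive on this slide. As a rough illustration only, the sketch below assumes one common form of the weight, tf · log(C / cf), where tf is the term's frequency in the category being weighted. The function name and the choice of the natural logarithm are assumptions too.

```python
import math


def tf_icf(tf, n_categories, cf):
    """TF-ICF weight of a term, assuming the form tf * log(C / cf)."""
    if cf < 1 or cf > n_categories:
        raise ValueError("cf must be between 1 and the number of categories")
    return tf * math.log(n_categories / cf)


# With three classes, a term found in only one of them keeps a high weight,
# while a term found in all three gets weight 0 under this form.
specific = tf_icf(5, 3, 1)
ubiquitous = tf_icf(5, 3, 3)
```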
TF-IDF
tf – the frequency of the term's occurrence in the collection (positive or negative tweets),
T – the total number of messages in the collections,
– the number of messages in the positive and negative collections that contain the term
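The definitions above can be put together in a short sketch. It assumes the standard form tf · log(T / df), with tf taken as a raw count and `df` used as a hypothetical name for the number of messages containing the term. The formula image is missing from the slide, so this is not necessarily the exact variant used.

```python
import math
from collections import Counter


def tf_idf_for_collection(collection, all_messages):
    """Weight the terms of one collection against all messages.

    collection   -- tokenized messages of one class (must be part of all_messages)
    all_messages -- tokenized messages of every class
    """
    tf = Counter(token for message in collection for token in message)
    df = Counter(token for message in all_messages for token in set(message))
    total = len(all_messages)  # T in the definitions above
    return {term: count * math.log(total / df[term]) for term, count in tf.items()}


positive = [["good", "day"], ["good", "film"]]
negative = [["bad", "film"]]
weights = tf_idf_for_collection(positive, positive + negative)
```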
Experiments
Corpus of news texts consists of
46 339 – positive news
46 337 – negative news
46 340 – neutral news
ROMIP mixed collection consists of
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts
Reviews of books, movies, or digital cameras from blogs
Short text collection

| Metric | TF-IDF | TF-ICF |
|---|---|---|
| Accuracy (%) | 95.5981 | 95.0664 |
| Precision | 0.958092631 | 0.953112184 |
| Recall | 0.955204837 | 0.94984672 |
| F-Measure | 0.956646554 | 0.95147665 |

News collection

| Metric | TF-IDF | TF-ICF |
|---|---|---|
| Accuracy (%) | 69.8619 | 58.1397 |
| Precision | 0.709246342 | 0.61278022 |
| Recall | 0.698624505 | 0.581402868 |
| F-Measure | 0.703895355 | 0.596679322 |

ROMIP collection

| Metric | TF-IDF | TF-ICF |
|---|---|---|
| Accuracy (%) | 53.9773 | 57.9545 |
| Precision | 0.561341047 | 0.558902611 |
| Recall | 0.5311636 | 0.535790598 |
| F-Measure | 0.545835539 | 0.547102625 |

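The reported F-measure values appear to be the harmonic mean of precision and recall (F1). The slides do not state the formula, so this is an inference from the numbers. A minimal check: precision 0.561341047 and recall 0.5311636 reproduce the reported 0.545835539.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the reported results.
f_low = f_measure(0.561341047, 0.5311636)
f_mid = f_measure(0.709246342, 0.698624505)
```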
Results
Experimental results in terms of F-measure (%)
Short texts: TF-IDF 95.66, TF-ICF 95.15
News: TF-IDF 70.39, TF-ICF 59.68
ROMIP: TF-IDF 54.58, TF-ICF 54.71
The program module makes it possible to
dynamically update the unigram dictionary and recalculate term weights depending on the collection a term belongs to;
take into account lexical changes in speech over time;
investigate new terms entering the active vocabulary.
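One way such a dynamically updated dictionary could work is sketched below. This is not the author's module: the class name, its methods and the weight form tf · log(C / cf) are all assumptions. The point it illustrates is that with a fixed number of categories, adding a text only touches the counts of its own terms, and weights can be computed on demand rather than recalibrated for the whole collection.

```python
import math
from collections import Counter


class DynamicUnigramDictionary:
    """Per-category unigram counts that are updated as new texts arrive."""

    def __init__(self, categories):
        self.counts = {category: Counter() for category in categories}

    def add_text(self, category, tokens):
        """Add one tokenized text; only the counts of its own terms change."""
        self.counts[category].update(tokens)

    def weight(self, term, category):
        """TF-ICF weight computed on demand, assuming tf * log(C / cf)."""
        tf = self.counts[category][term]
        cf = sum(1 for counts in self.counts.values() if counts[term] > 0)
        if tf == 0 or cf == 0:
            return 0.0
        return tf * math.log(len(self.counts) / cf)


dictionary = DynamicUnigramDictionary(["positive", "negative", "neutral"])
dictionary.add_text("positive", ["sebyashka", "cool"])
first = dictionary.weight("sebyashka", "positive")   # term seen in one category
dictionary.add_text("negative", ["sebyashka", "boring"])
second = dictionary.weight("sebyashka", "positive")  # term now seen in two categories
```

A new term such as "sebyashka" enters the dictionary the first time it is seen, and its weight drops once it starts to appear in a second category.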