sentichenews - sentiment analysis on newspapers and tweets
TRANSCRIPT
SentiCheNews: A tool for analyzing possible relationships between news and tweet sentiments
Data Mining Class
Sapienza, University of Rome
A. Y. 2016 - 2017
To begin Data Collection & Preprocessing Results Analysis
Hi!
Simone [email protected]
https://it.linkedin.com/in/simone-santacroce-272739134
Manuel [email protected]
https://it.linkedin.com/in/manuelcoppotelli
George Adrian [email protected]
https://it.linkedin.com/in/george-adrian-munteanu-707744134
SentiCheNews
Agenda
1 To begin
• Sentiment Analysis: what is it?
• Our goals
• A good lexicon
• Dictionary Structure
2 Data Collection & Preprocessing
• Collecting Data
• Preprocessing
• Design Choices
3 Results Analysis
• Dashboard
• Analysis
Sentiment Analysis. . .
. . . refers to the use of:
• natural language processing
• text analysis
• computational linguistics
to identify and capture subjective information from source materials (news, social media, reviews...)
Our goals
Given a collection of Italian news and Italian tweets within the same time period... is there any connection between them? In particular:
• do newspapers and tweets report the same sentiment for a certain day (a sort of influence of the news on the tweets)?
• which newspaper's average sentiment is closest to the average sentiment of the tweets?
• are there any differences among newspapers' sentiments?
• how does each newspaper's sentiment vary over time?
A good lexicon
Sentiment Analysis for the English language:
• is a well-studied problem, therefore
• has a lot of excellent lexicons ready to use
This is not true for the Italian language: WE HAD TO BUILD OUR OWN DICTIONARY, starting from an English dictionary available at: http://sentiwordnet.isti.cnr.it
Dictionary Structure 1/2
Each row of the dictionary is represented by a tuple: < s, (p, n) > where
• s: string
• p: positive score
• n: negative score
Therefore, each string is associated with a positive and a negative score.
Dictionary Structure 2/2
Given a tuple t: < s, (p, n) >
String s can consist of a single word or of up to four words, separated by underscores.
E.g.
• tuple x : < a, (p,n) >
• tuple y : < a_b, (p',n') >
• . . .
• tuple z : < a_b_c_d, (p'',n'') >
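The tuple layout above can be sketched as a plain Python mapping. This is only an illustration: the entries, the scores, and the `lookup` helper are hypothetical and are not taken from the project's dictionary.

```python
from typing import Dict, Optional, Sequence, Tuple

Score = Tuple[float, float]  # (positive score p, negative score n)

# Hypothetical entries; the words and scores are invented for illustration.
lexicon: Dict[str, Score] = {
    "buono": (0.75, 0.0),
    "non_buono": (0.0, 0.625),
    "a_b_c_d": (0.125, 0.25),
}


def lookup(tokens: Sequence[str]) -> Optional[Score]:
    """Return the (p, n) pair for a key made of 1 to 4 words, or None."""
    if not 1 <= len(tokens) <= 4:
        raise ValueError("keys have between one and four words")
    # Multi-word keys are joined with underscores, as in the dictionary.
    return lexicon.get("_".join(tokens))


print(lookup(["non", "buono"]))
```

Storing the string as the key keeps a lookup for any one-to-four-word candidate to a single dictionary access.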
Collecting Data
Tweets
• Step 1: getting Twitter API keys
• Step 2: connecting to the Twitter Streaming API
• Step 3: for each tweet, save text and date
News
• We exploit the newspapers' RSS feeds and for each news item we save:
• date
• title
• newspaper source
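For the news side, a minimal sketch of extracting date, title and source from an RSS 2.0 document with the Python standard library could look as follows. The sample feed, the `parse_feed` name and the record layout are assumptions made for illustration; the slides do not say how the feeds were actually fetched or parsed.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 document standing in for a real newspaper feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Newspaper</title>
    <item>
      <title>Titolo di prova</title>
      <pubDate>Mon, 02 Jan 2017 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text, source):
    """Return one record (date, title, source) per <item> in the feed."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            # RSS dates use the RFC 822 format, which email.utils can parse.
            "date": parsedate_to_datetime(item.findtext("pubDate")),
            "title": item.findtext("title"),
            "source": source,
        })
    return records


for record in parse_feed(SAMPLE_RSS, "Example Newspaper"):
    print(record["date"].isoformat(), record["source"], record["title"])
```

Downloading the feed and the Twitter side are left out on purpose, since both depend on credentials and endpoints not given in the slides.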
Preprocessing
Tweets and news are preprocessed with the following techniques:
• stop-word removal
• normalization (lower case, accents, etc)
• stemming (but we realized that...)
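The first two techniques can be sketched as below. The stop-word list is a tiny hypothetical sample, and treating "accents" as "strip the accent marks" is an assumption, since the slides do not say how accents were normalized.

```python
import unicodedata

# Tiny hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"il", "la", "di", "e", "che", "un", "una"}


def strip_accents(text):
    """Remove combining accent marks, e.g. 'città' becomes 'citta'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(text):
    """Lower-case, strip accents, split on whitespace, drop stop words."""
    tokens = strip_accents(text.lower()).split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("La città è bella"))
```

Punctuation handling is omitted here; a fuller version would also need to deal with it before matching tokens against the dictionary.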
Design Choices: stemming operation
Stemming different words may produce the same result.
E.g.
• ’amaro’
• ’amare’
both have the same root ’amar’, whereas they have entirely different meanings and different (positive, negative) values.
We do not apply the stemming preprocessing operation.
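The collision can be reproduced with a toy example. The `toy_stem` rule below is a deliberately naive stand-in, not a real Italian stemmer, and the scores are invented.

```python
def toy_stem(word):
    """Toy rule: drop one trailing vowel. Not a real Italian stemmer."""
    return word[:-1] if word and word[-1] in "aeiou" else word


# Invented (positive, negative) scores for the two words from the slide.
scores = {"amaro": (0.0, 0.5), "amare": (0.625, 0.0)}

# Group the original scores by stem to expose the collision.
by_stem = {}
for word, score in scores.items():
    by_stem.setdefault(toy_stem(word), []).append(score)

print(by_stem)
```

Both words end up under the single key 'amar' with conflicting scores, so a stemmed dictionary could not tell them apart.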
Design Choices: string scoring
Given a string s (either a news item or a tweet), we exploit the dictionary's structure as efficiently as possible to assign a score to s.
In particular, we reason on four tokens at a time.
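One way to read "four tokens at a time" is a greedy longest-match scan: at each position try the four-word key first, then three, two and one. The sketch below follows that reading. The greedy strategy, the summing of scores and the sample lexicon are all assumptions, not the project's confirmed method.

```python
MAX_WORDS = 4  # dictionary keys contain at most four words

# Hypothetical lexicon with invented scores.
LEXICON = {
    "buono": (0.75, 0.0),
    "molto": (0.25, 0.0),
    "non_buono": (0.0, 0.5),
}


def score_text(tokens, lexicon):
    """Sum (positive, negative) scores using greedy longest-match lookups."""
    pos = neg = 0.0
    i = 0
    while i < len(tokens):
        # Try the longest candidate key first, then progressively shorter ones.
        for size in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            key = "_".join(tokens[i:i + size])
            if key in lexicon:
                p, n = lexicon[key]
                pos += p
                neg += n
                i += size  # skip the tokens consumed by this match
                break
        else:
            i += 1  # no key starts at this token; move on
    return pos, neg


print(score_text(["non", "buono"], LEXICON))
print(score_text(["molto", "buono"], LEXICON))
```

Matching the longest key first lets a multi-word entry such as "non_buono" override the score of "buono" alone.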
Dashboard: spotting the results
Mean & Variance
Each colored bubble represents a data source, news or tweets, where:
• center represents the sentiments' mean
• radius represents the variance of sentiments
What’s inside each bubble?
A point for each news/tweet. Each point is represented by a tuple <p,n,t>
• p: positive score
• n: negative score
• t: time
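Given such points, the center and radius of a bubble could be computed roughly as follows. The slides do not define how the two-dimensional (p, n) scores are reduced to one variance, so using the sum of the population variances of p and n is purely an assumption, as are the `bubble` name and the sample points.

```python
from datetime import datetime
from statistics import mean, pvariance


def bubble(points):
    """Return ((mean p, mean n), spread) for a list of (p, n, t) points."""
    ps = [p for p, _, _ in points]
    ns = [n for _, n, _ in points]
    center = (mean(ps), mean(ns))
    # Assumed definition of the radius: total variance over both axes.
    spread = pvariance(ps) + pvariance(ns)
    return center, spread


# Invented points for a single data source.
points = [
    (0.5, 0.0, datetime(2017, 1, 2, 9, 0)),
    (0.0, 0.5, datetime(2017, 1, 2, 15, 0)),
]
print(bubble(points))
```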
Sentiments’ trend per time interval (mean)
Given a time interval [t1, t2], there is a mean sentiment bubble every 6 hours.
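A minimal sketch of the 6-hour grouping is shown below. Aligning the windows to midnight (00:00, 06:00, 12:00, 18:00) is an assumption, and the function name and sample points are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def six_hour_means(points):
    """Map each 6-hour window start to the mean (p, n) of its points."""
    buckets = defaultdict(list)
    for p, n, t in points:
        # Round the timestamp down to the start of its 6-hour window.
        start = t.replace(hour=t.hour - t.hour % 6,
                          minute=0, second=0, microsecond=0)
        buckets[start].append((p, n))
    return {
        start: (mean([p for p, _ in values]), mean([n for _, n in values]))
        for start, values in sorted(buckets.items())
    }


# Invented (p, n, t) points.
points = [
    (0.5, 0.0, datetime(2017, 1, 2, 1, 0)),
    (0.0, 0.5, datetime(2017, 1, 2, 5, 30)),
    (1.0, 0.0, datetime(2017, 1, 2, 7, 0)),
]
for start, (p, n) in six_hour_means(points).items():
    print(start.isoformat(), p, n)
```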
Sentiments’ trend per time interval (variance)
Thank you for your attention
All the material can be found at:
GitHub Repository
https://github.com/manuelcoppotelli/SentiCheNews