TRANSCRIPT
Mining Topics and Sentiments from Yahoo! Finance Message Boards
Jerry Fu, March 16, 2010
Thursday, April 15, 2010
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Yahoo! Finance Message Board
• Text dataset is extracted from Yahoo!’s finance message boards
• Individual message board for each stock ticker symbol
• Unmoderated postings from anyone who wishes to write on the message boards
Example Messages
How can we make use of the information on Yahoo!?
• Previously developed a system that periodically downloads messages and stores them in a database
• Messages encoded with length, author, and relevance information
• Created a message filtering system that uses an SVM classifier
• Train the SVM with a set of handpicked messages
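As an illustration, the filtering step can be sketched with a toy linear classifier over bag-of-words features. This is a stand-in for the actual SVM system: the messages, labels, and perceptron-style training below are invented for the example.

```python
# Toy stand-in for the SVM message filter: a linear classifier over
# bag-of-words features, trained with the perceptron rule on a few
# hand-labeled messages. (Illustrative only; the real system trained
# an SVM on handpicked Yahoo! messages.)
from collections import Counter

def featurize(text):
    # Bag-of-words counts as features.
    return Counter(text.lower().split())

def train(messages, labels, epochs=20):
    w = Counter()
    for _ in range(epochs):
        for text, y in zip(messages, labels):
            f = featurize(text)
            score = sum(w[t] * c for t, c in f.items())
            pred = 1 if score > 0 else -1
            if pred != y:  # misclassified: nudge weights toward the label
                for t, c in f.items():
                    w[t] += y * c
    return w

def keep_message(w, text):
    # Positive score => keep the message for topic mining.
    return sum(w[t] * c for t, c in featurize(text).items()) > 0

msgs = [
    "dividend raised for 75 consecutive years",        # relevant
    "company suspends exports says financial times",   # relevant
    "im in email me at spam address",                  # noise
    "put yor bit in anti american rant",               # noise
]
labels = [1, 1, -1, -1]
weights = train(msgs, labels)
```

A real SVM adds a large-margin objective and regularization, but the decision rule (sign of a weighted sum over word features) has the same shape.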
Examples from SVM filtering system
• High rated messages:
• Article copied from the Financial Times:
Moscow suspends By Catherine Belton and Charles Clover in Moscow...
• J&J is one of only five companies rated with five stars in the United States. The company has a history of consecutive dividend payment for something like 75 years and the dividend is consistently increased by ten to fifteent percent year after year...
• Low rated messages:
• Re: Law Firm has contacted us back! This law firm won against FDIC
im in,,,,[email protected]
• Ge is a 15 dollar stock.Put yor bit in there for this anti american socialist company.
• Any doubt now that taxpayers did not pay for IRAQ war?George Bush and John McCain that Americans did not have to pay for the war in Iraq evidence by the fact that taxes were not raised to fund the war. After this bailout, does anyone have any doubt left that taxpayers are paying for the war? Who will pay the $6 trillion of debt accumulated by the GOP in 8 years? more than has been accumulated since independence.
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Goals of TSM approach
• Find the topics discussed in messages, and the sentiment/opinion on these topics
• Original work by Mei et al. [1], presented at WWW 2007, focused on blog articles
• Re-use bullish/bearish sentiment from existing investment publication(s)
Key Questions
• How do blog articles and message board posts differ in quality?
• Is the TSM model effective on a non-blog text dataset?
• What is the effect of applying TSM in conjunction with SVM filtering?
• Does TSM find unexpected topics in the text?
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Topic Sentiment Mixture Model
• Model that extracts topics and sentiments regarding these topics from a text dataset
• Topic: a set of semantically coherent words
• Sentiment: a set of words representing opinions about a topic. Typically, we have positive and negative sentiment
• Can incorporate prior data derived from other text datasets
Topic Sentiment Mixture Model
• Each word in each document d is sampled from one of the j topics or the background model
• A word from one of the topics is sampled from either the neutral, positive or negative sentiment models
[Model diagram: each word w in document d is drawn either from the background model B (with probability λB) or from one of the topics T1, ..., Tj; a topic word is drawn from the topic’s neutral model θ1, ..., θj or from the positive (θP) or negative (θN) sentiment model, governed by the topic probabilities πdj and sentiment coverages δj,d,F, δj,d,P, δj,d,N.]
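The generation process in the diagram can be sketched in code. This is a toy illustration with made-up distributions for two topics, not the fitted model:

```python
# Minimal sketch of the TSM generative process for one document d.
# All distributions below are made-up toy values.
import random

random.seed(0)

lambda_B = 0.3                       # prob. of drawing a background word
background = {"the": 0.5, "and": 0.5}
pi_d = [0.6, 0.4]                    # topic probabilities for document d
# Sentiment coverage per topic: (neutral F, positive P, negative N).
delta_d = [(0.5, 0.3, 0.2), (0.7, 0.1, 0.2)]
theta = [{"bailout": 1.0}, {"uaw": 1.0}]   # neutral topic word models
theta_P = {"strong": 1.0}                  # positive sentiment words
theta_N = {"hurt": 1.0}                    # negative sentiment words

def draw(dist):
    # Sample a word from a {word: probability} distribution.
    r, acc = random.random(), 0.0
    for word, p in dist.items():
        acc += p
        if r < acc:
            return word
    return word

def generate_word():
    if random.random() < lambda_B:             # background model
        return draw(background)
    j = 0 if random.random() < pi_d[0] else 1  # pick a topic
    f, p, n = delta_d[j]
    r = random.random()                        # pick a sentiment model
    if r < f:
        return draw(theta[j])                  # neutral topic words
    elif r < f + p:
        return draw(theta_P)                   # positive sentiment words
    return draw(theta_N)                       # negative sentiment words

doc = [generate_word() for _ in range(10)]
```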
Topic Sentiment Mixture Model
Let α = {d1, ..., dm} represent a collection of m messages. The log likelihood of the collection is:

log p(α) = Σ_{d∈α} Σ_{w∈V} c(w,d) × log [ λB p(w|B) + (1 − λB) Σ_{j=1..k} πdj ( δj,d,F p(w|θj) + δj,d,P p(w|θP) + δj,d,N p(w|θN) ) ]

where
• c(w,d) is the count of word w in document d
• λB is the probability of choosing a word from the background model
• πdj is the probability of the j-th topic occurring in document d
• δj,d,X, where X ∈ {F, P, N}, is the sentiment coverage of topic j in document d
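As a sanity check, the log likelihood above can be evaluated directly for a toy corpus. All parameter values below are illustrative:

```python
# Direct evaluation of the TSM log likelihood on a toy setup:
# one topic, one document, vocabulary V = {"bailout", "strong", "hurt"}.
import math

lambda_B = 0.2
p_B = {"bailout": 0.2, "strong": 0.4, "hurt": 0.4}        # background
theta = [{"bailout": 0.8, "strong": 0.1, "hurt": 0.1}]    # neutral topic
theta_P = {"bailout": 0.1, "strong": 0.8, "hurt": 0.1}    # positive
theta_N = {"bailout": 0.1, "strong": 0.1, "hurt": 0.8}    # negative

# One document: word counts, topic probabilities, sentiment coverages.
docs = [{"counts": {"bailout": 3, "strong": 1},
         "pi": [1.0],                     # pi_{d,j}
         "delta": [(0.6, 0.3, 0.1)]}]     # (delta_F, delta_P, delta_N)

def log_likelihood(docs):
    total = 0.0
    for d in docs:
        for w, c in d["counts"].items():
            # Mixture probability of word w in document d.
            mix = lambda_B * p_B[w]
            for j, pi_dj in enumerate(d["pi"]):
                f, p, n = d["delta"][j]
                mix += (1 - lambda_B) * pi_dj * (
                    f * theta[j][w] + p * theta_P[w] + n * theta_N[w])
            total += c * math.log(mix)
    return total
```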
EM on Topic Sentiment Mixture Model
• Parameters to estimate:
1. Topic (neutral sentiment) models θ1, ..., θj
2. Sentiment models θP and θN
3. Document topic probabilities πdj
4. Sentiment coverage per document δj,d,S, where S ∈ {F, P, N}
• Can estimate these parameters using expectation maximization (EM)
• Define hidden variables {zd,w,j,S}, where S ∈ {F, P, N}
• p(zd,w,j,S = 1) represents the probability that word w in document d is generated from topic j using sentiment model S
EM on Topic Sentiment Mixture Model
• Use EM to calculate the maximum likelihood estimate iteratively
• E-step: calculate p(zd,w,j,S = 1)
• M-step: re-estimate the parameters above
• Results:
1. Top positive and negative sentiment words (θP and θN)
2. Topic words (θ1, ..., θj)
3. Overall sentiment strength for each topic:

Ψ(j, S) = ( Σ_{d∈α} πdj δj,d,S ) / ( Σ_{d∈α} πdj )
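A single EM iteration can be sketched as follows. For brevity this toy version holds the topic and sentiment word models fixed and re-estimates only the topic probabilities and sentiment coverages; the full algorithm also updates the word models. All numbers are illustrative:

```python
# One EM iteration on toy data. E-step: posterior probability that each
# word occurrence came from (topic j, sentiment S) vs. the background.
# M-step: re-estimate pi_{d,j} and delta_{j,d,S} from expected counts.
lambda_B = 0.2
p_B = {"bailout": 0.2, "strong": 0.4, "hurt": 0.4}
theta = [{"bailout": 0.8, "strong": 0.1, "hurt": 0.1}]
theta_P = {"bailout": 0.1, "strong": 0.8, "hurt": 0.1}
theta_N = {"bailout": 0.1, "strong": 0.1, "hurt": 0.8}
senti = {"F": lambda j, w: theta[j][w],
         "P": lambda j, w: theta_P[w],
         "N": lambda j, w: theta_N[w]}

counts = {"bailout": 3, "strong": 1}        # one document d
pi = [1.0]                                  # pi_{d,j}, single topic
delta = [{"F": 0.34, "P": 0.33, "N": 0.33}]

def em_step(pi, delta):
    # E-step: responsibilities z[w][(j, S)], plus z[w]["B"] for background.
    z = {}
    for w in counts:
        parts = {"B": lambda_B * p_B[w]}
        for j in range(len(pi)):
            for S in ("F", "P", "N"):
                parts[(j, S)] = ((1 - lambda_B) * pi[j]
                                 * delta[j][S] * senti[S](j, w))
        norm = sum(parts.values())
        z[w] = {k: v / norm for k, v in parts.items()}
    # M-step: normalize expected counts.
    topic_mass = [sum(counts[w] * z[w][(j, S)]
                      for w in counts for S in ("F", "P", "N"))
                  for j in range(len(pi))]
    new_pi = [m / sum(topic_mass) for m in topic_mass]
    new_delta = [{S: sum(counts[w] * z[w][(j, S)] for w in counts)
                  / topic_mass[j]
                  for S in ("F", "P", "N")} for j in range(len(pi))]
    return new_pi, new_delta

pi, delta = em_step(pi, delta)
```

Because the toy document is dominated by the neutral topic word "bailout", one iteration shifts the sentiment coverage toward the neutral component F.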
Incorporating prior knowledge
• Without incorporating prior knowledge, the sentiment and topics are mixed and the results are highly biased towards the text dataset
• Prior data can be calculated by running this model on an existing text dataset where the positive and negative labels are known (fixing δj,d,S)
Prior knowledge from Morningstar
• Morningstar is a reputable investment website with written analyses of many stocks
• These analyses include a bullish (positive) and bearish (negative) perspective on the stock
• Used S&P 500 stocks to gather a general text dataset for sentiment
Calculated prior sentiments
θP: strong, brand, should, improv, provide, benefit, world, system, boost, develop
θN: may, hurt, press, declin, competit, difficult, suff, stil, be, could
Shortcomings of original TSM paper
• Sentiment models apply to the entire text set, as opposed to one per topic
• The generated topics are not verified, so it is difficult to tell whether results happened by random chance or whether the model consistently produces useful/coherent results
• Several parameters are set empirically, but with little detail given
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Experiments
• For each dataset:
• Run the algorithm with and without preset topics; this tests how well the algorithm can detect topics on its own
• Run on an unfiltered and an SVM-filtered dataset
• All words stemmed with a Paice/Husk stemmer [2]
• Documents converted into term-document matrices using the Matlab Text-to-Matrix Generator (TMG) toolkit [3]
• Words that appear five times or fewer are dropped from the dataset (and, if necessary, corresponding documents)
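The last two preprocessing steps can be sketched in plain Python. Stemming and the TMG toolkit are omitted here; the documents and the count threshold are illustrative:

```python
# Sketch of the preprocessing pipeline: build a term-document count
# matrix, drop words appearing five times or fewer, and drop documents
# left empty after pruning. (The project used a Paice/Husk stemmer and
# the Matlab TMG toolkit; this pure-Python version is illustrative.)
from collections import Counter

def build_matrix(docs, min_count=6):
    # Total corpus-wide counts decide which words survive.
    totals = Counter()
    for d in docs:
        totals.update(d.lower().split())
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for t in d.lower().split():
            if t in index:
                row[index[t]] += 1
        if any(row):          # drop documents left empty after pruning
            rows.append(row)
    return vocab, rows

docs = ["bailout bailout uaw", "bailout uaw uaw",
        "bailout bailout uaw uaw", "rare word only once",
        "uaw uaw bailout"]
vocab, matrix = build_matrix(docs, min_count=6)
```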
GM messages - May 31, 2009
• Messages taken from May 31, 2009 around the time GM was filing for bankruptcy
• 1510 documents, 1090 unique words
• One run with preset topics of ‘bailout’, ‘uaw’, and ‘bondholder’, another run with no preset topics
GM set results with preset topics

θ1 = bailout: bailout, bank, sav, lmao, mess, republic, luck, announce, obam
θ2 = uaw: uaw, pay, your, unsec, pen, taxpayer, manage, task, bill
θ3 = bondholder: bondholder, bankruptcy, bill, deal, individ, warr, bondhold, equ, restruct
θ4: obama, and, died, lied, obam, car, wil, peopl, that
θ5: for, that, who, old, perc, they, but, monday, don

θP: should, sel, next, buy, own, help, wel, has, company
θN: not, may, hav, stil, wil, put, any, could, fail

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.2444  0.2695  0.2836  0.0947  0.0837
Ψ(j,F)   0.4755  0.4274  0.3485  0.7999  0.8256
Ψ(j,N)   0.2799  0.3029  0.3678  0.1052  0.0906
GM results with no preset topics

θ1: com, new, http, www, that, and, bloomberg, share, about
θ2: bondholder, obama, lied, died, the, individ, our, their, restruct
θ3: you, and, the, they, wil, bush, that, for, peopl
θ4: you, lik, was, but, fail, that, just, when, not
θ5: the, would, can, what, fil, bankruptcy, company, job, said

θP: should, system, cash, next, benefit, mov, famy, help, mill
θN: fund, cost, low, price, must, pen, dollar, already, interest

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.0429  0.0967  0.0364  0.0330  0.0397
Ψ(j,F)   0.9116  0.8119  0.9228  0.9307  0.9186
Ψ(j,N)   0.0453  0.0912  0.0406  0.0361  0.0416
GM messages - May 31, 2009
• Same message set, but filtered with SVM
• 511 documents, 505 unique words
• One run with preset topics of ‘bailout’, ‘uaw’, and ‘bondholder’, another run with no preset topics
GM SVM-filtered, preset topics

θ1 = bailout: his, was, presid, bush, with, econom, black, yet, democr
θ2 = uaw: uaw, bill, credit, pay, unsec, the, bondhold, class, produc
θ3 = bondholder: bondholder, bill, deal, off, equ, debt, investor, govern, individ
θ4: new, the, wil, and, hav, would, not, but, that
θ5: you, what, and, buy, obama, obam, how, the, want

θP: wel, sel, driv, strong, should, last, year, brand, per
θN: low, may, problem, might, stil, fail, press, cost, foreign

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.1239  0.1529  0.1305  0.0330  0.0205
Ψ(j,F)   0.7656  0.6801  0.6684  0.9333  0.9527
Ψ(j,N)   0.1104  0.1668  0.2010  0.0336  0.0267
GM SVM-filtered, no preset topics

θ1: obama, you, and, old, the, new, how, trad, pay
θ2: they, that, deal, want, hav, com, obam, http, not
θ3: real, right, the, them, and, just, could, bondholder, tax
θ4: buy, govern, bondholder, and, debt, what, mor, mak, blam
θ5: wil, new, company, has, plan, with, and, monday, that

θP: strong, brand, should, improv, provide, benefit, world, system, boost
θN: may, hurt, press, declin, competit, stil, difficult, suff, be

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.0017  0.0023  0.0015  0.0013  0.0021
Ψ(j,F)   0.9965  0.9954  0.9975  0.9963  0.9959
Ψ(j,N)   0.0016  0.0022  0.0009  0.0022  0.0019
Observations
• When topics are preset the EM algorithm learns more coherent topics than the randomly (not preset) initialized topics
• There are also stronger sentiments for pre-defined topics
• A fair number of background words make it into the randomly initialized topics
• The SVM filtering does not seem to have much impact
AAPL messages - Mar 12-15, 2010
• This was the weekend when the iPad presale began
• 905 documents, 727 unique words
• Run with preset topics of ‘ipad’, ‘macbook’, and ‘cap’
AAPL messages with preset topics

θ1: ipad, sale, day, produc, first, http, sold, com, apple
θ2: apple, for, board, macbook, battery, tho, iphon, put, beca
θ3: market, cap, big, talk, msft, sit, dollar, when, grow
θ4: how, they, week, the, right, now, for, you, that
θ5: you, what, aapl, out, hav, was, just, ther, the

θP: should, strong, doubl, cash, technolog, brand, help, allow, sel
θN: may, could, low, hou, like, hurt, mat, stil, cost

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.0944  0.1268  0.1644  0.0300  0.0255
Ψ(j,F)   0.7657  0.6971  0.6771  0.9318  0.9374
Ψ(j,N)   0.1397  0.1760  0.1584  0.0380  0.0369
Conclusions
• TSM model can be applied to a message board dataset
• SVM filtering does not seem to be necessary, and might remove messages with interesting topics
• TSM picked out political topics that were not the primary focus of discussion.
References
[1] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW 2007, Data Mining track, pp. 171-180, 2007.
[2] PHP implementation of the Paice/Husk stemmer. http://alx2002.free.fr/utilitarism/stemmer/stemmer_en.html
[3] D. Zeimpekis and E. Gallopoulos. Design of a MATLAB toolbox for term-document matrix generation. In Proc. Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with the 5th SIAM Int'l Conf. on Data Mining), I. S. Dhillon, J. Kogan, and J. Ghosh, eds., pp. 38-48, April 2005, Newport Beach, California. Also Technical Report HPCLAB-SCG 2/02-05, Computer Engineering & Informatics Dept., University of Patras, Greece, February 2005.
Questions?
Topic Sentiment Mixture Model
• [show document generation process]
EM Update Equations