TRANSCRIPT
Mining Topics and Sentiments from Yahoo! Finance Message Boards
Jerry Fu, March 16, 2010
Thursday, April 15, 2010
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Yahoo! Finance Message Board
• Text dataset is extracted from Yahoo!’s finance message boards
• Individual message board for each stock ticker symbol
• Unmoderated postings from anyone who wishes to write on the message boards
Example Messages
How can we make use of the information on Yahoo!?
• Previously developed a system that periodically downloads messages and stores them in a database
• Messages encoded with length, author, and relevance information
• Created a message filtering system that uses an SVM classifier
• Train the SVM with a set of handpicked messages
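As an illustration, the filtering step can be sketched with a toy linear classifier over bag-of-words features. This is a stand-in for the actual SVM system: the messages, labels, and perceptron-style training below are invented for the example.

```python
# Toy stand-in for the SVM message filter: a linear classifier over
# bag-of-words features, trained with the perceptron rule on a few
# hand-labeled messages. (Illustrative only; the real system trained
# an SVM on handpicked Yahoo! messages.)
from collections import Counter

def featurize(text):
    # Bag-of-words counts as features.
    return Counter(text.lower().split())

def train(messages, labels, epochs=20):
    w = Counter()
    for _ in range(epochs):
        for text, y in zip(messages, labels):
            f = featurize(text)
            score = sum(w[t] * c for t, c in f.items())
            pred = 1 if score > 0 else -1
            if pred != y:  # misclassified: nudge weights toward the label
                for t, c in f.items():
                    w[t] += y * c
    return w

def keep_message(w, text):
    # Positive score => keep the message for topic mining.
    return sum(w[t] * c for t, c in featurize(text).items()) > 0

msgs = [
    "dividend raised for 75 consecutive years",        # relevant
    "company suspends exports says financial times",   # relevant
    "im in email me at spam address",                  # noise
    "put yor bit in anti american rant",               # noise
]
labels = [1, 1, -1, -1]
weights = train(msgs, labels)
```

A real SVM adds a large-margin objective and regularization, but the decision rule (sign of a weighted sum over word features) has the same shape.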
Examples from SVM filtering system
• High rated messages:
• Article copied from the Financial Times:
Moscow suspends By Catherine Belton and Charles Clover in Moscow...
• J&J is one of only five companies rated with five stars in the United States. The company has a history of consecutive dividend payment for something like 75 years and the dividend is consistently increased by ten to fifteent percent year after year...
• Low rated messages:
• Re: Law Firm has contacted us back! This law firm won against FDIC
im in,,,,[email protected]
• Ge is a 15 dollar stock.Put yor bit in there for this anti american socialist company.
• Any doubt now that taxpayers did not pay for IRAQ war?George Bush and John McCain that Americans did not have to pay for the war in Iraq evidence by the fact that taxes were not raised to fund the war. After this bailout, does anyone have any doubt left that taxpayers are paying for the war? Who will pay the $6 trillion of debt accumulated by the GOP in 8 years? more than has been accumulated since independence.
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Goals of TSM approach
• Find the topics discussed in messages, and the sentiment/opinion on these topics
• Original work by Mei et al. [1], presented at WWW 2007, focused on blog articles
• Re-use bullish/bearish sentiment from existing investment publication(s)
Key Questions
• How do blog articles and message board posts differ in quality?
• Is the TSM model effective on a non-blog text dataset?
• What is the effect of applying TSM in conjunction with SVM filtering?
• Does TSM find unexpected topics in the text?
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Topic Sentiment Mixture Model
• Model that extracts topics and sentiments regarding these topics from a text dataset
• Topic: a set of semantically coherent words
• Sentiment: a set of words representing opinions about a topic. Typically, we have positive and negative sentiment
• Can incorporate prior data derived from other text datasets
Topic Sentiment Mixture Model
• Each word in each document d is sampled from one of the j topics or the background model
• A word from one of the topics is sampled from either the neutral, positive or negative sentiment models
[Model diagram: each word w in document d is drawn either from the background model B (with probability λB) or from one of the topics T1, ..., Tj; a topic word is drawn from the topic’s neutral model θ1, ..., θj or from the positive (θP) or negative (θN) sentiment model, governed by the topic probabilities πdj and sentiment coverages δj,d,F, δj,d,P, δj,d,N.]
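The generation process in the diagram can be sketched in code. This is a toy illustration with made-up distributions for two topics, not the fitted model:

```python
# Minimal sketch of the TSM generative process for one document d.
# All distributions below are made-up toy values.
import random

random.seed(0)

lambda_B = 0.3                       # prob. of drawing a background word
background = {"the": 0.5, "and": 0.5}
pi_d = [0.6, 0.4]                    # topic probabilities for document d
# Sentiment coverage per topic: (neutral F, positive P, negative N).
delta_d = [(0.5, 0.3, 0.2), (0.7, 0.1, 0.2)]
theta = [{"bailout": 1.0}, {"uaw": 1.0}]   # neutral topic word models
theta_P = {"strong": 1.0}                  # positive sentiment words
theta_N = {"hurt": 1.0}                    # negative sentiment words

def draw(dist):
    # Sample a word from a {word: probability} distribution.
    r, acc = random.random(), 0.0
    for word, p in dist.items():
        acc += p
        if r < acc:
            return word
    return word

def generate_word():
    if random.random() < lambda_B:             # background model
        return draw(background)
    j = 0 if random.random() < pi_d[0] else 1  # pick a topic
    f, p, n = delta_d[j]
    r = random.random()                        # pick a sentiment model
    if r < f:
        return draw(theta[j])                  # neutral topic words
    elif r < f + p:
        return draw(theta_P)                   # positive sentiment words
    return draw(theta_N)                       # negative sentiment words

doc = [generate_word() for _ in range(10)]
```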
Topic Sentiment Mixture Model
Let α = {d1, ..., dm} represent a collection of m messages. The log likelihood of the collection is:

log p(α) = Σ_{d∈α} Σ_{w∈V} c(w,d) × log [ λB p(w|B) + (1 − λB) Σ_{j=1..k} πdj ( δj,d,F p(w|θj) + δj,d,P p(w|θP) + δj,d,N p(w|θN) ) ]

where
• c(w,d) is the count of word w in document d
• λB is the probability of choosing a word from the background model
• πdj is the probability of the j-th topic occurring in document d
• δj,d,X, where X ∈ {F, P, N}, is the sentiment coverage of topic j in document d
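As a sanity check, the log likelihood above can be evaluated directly for a toy corpus. All parameter values below are illustrative:

```python
# Direct evaluation of the TSM log likelihood on a toy setup:
# one topic, one document, vocabulary V = {"bailout", "strong", "hurt"}.
import math

lambda_B = 0.2
p_B = {"bailout": 0.2, "strong": 0.4, "hurt": 0.4}        # background
theta = [{"bailout": 0.8, "strong": 0.1, "hurt": 0.1}]    # neutral topic
theta_P = {"bailout": 0.1, "strong": 0.8, "hurt": 0.1}    # positive
theta_N = {"bailout": 0.1, "strong": 0.1, "hurt": 0.8}    # negative

# One document: word counts, topic probabilities, sentiment coverages.
docs = [{"counts": {"bailout": 3, "strong": 1},
         "pi": [1.0],                     # pi_{d,j}
         "delta": [(0.6, 0.3, 0.1)]}]     # (delta_F, delta_P, delta_N)

def log_likelihood(docs):
    total = 0.0
    for d in docs:
        for w, c in d["counts"].items():
            # Mixture probability of word w in document d.
            mix = lambda_B * p_B[w]
            for j, pi_dj in enumerate(d["pi"]):
                f, p, n = d["delta"][j]
                mix += (1 - lambda_B) * pi_dj * (
                    f * theta[j][w] + p * theta_P[w] + n * theta_N[w])
            total += c * math.log(mix)
    return total
```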
EM on Topic Sentiment Mixture Model
• Parameters to estimate:
1. Topic (neutral sentiment) models θ1, ..., θj
2. Sentiment models θP and θN
3. Document topic probabilities πdj
4. Sentiment coverage per document δj,d,S, where S ∈ {F, P, N}
• Can estimate these parameters using expectation maximization (EM)
• Define hidden variables {zd,w,j,S}, where S ∈ {F, P, N}
• p(zd,w,j,S = 1) represents the probability that word w in document d is generated from topic j using sentiment model S
EM on Topic Sentiment Mixture Model
• Use EM to calculate the maximum likelihood estimate iteratively
• E-step: calculate p(zd,w,j,S = 1)
• M-step: re-estimate the parameters above
• Results:
1. Top positive and negative sentiment words (θP and θN)
2. Topic words (θ1, ..., θj)
3. Overall sentiment strength for each topic:

Ψ(j, S) = ( Σ_{d∈α} πdj δj,d,S ) / ( Σ_{d∈α} πdj )
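A single EM iteration can be sketched as follows. For brevity this toy version holds the topic and sentiment word models fixed and re-estimates only the topic probabilities and sentiment coverages; the full algorithm also updates the word models. All numbers are illustrative:

```python
# One EM iteration on toy data. E-step: posterior probability that each
# word occurrence came from (topic j, sentiment S) vs. the background.
# M-step: re-estimate pi_{d,j} and delta_{j,d,S} from expected counts.
lambda_B = 0.2
p_B = {"bailout": 0.2, "strong": 0.4, "hurt": 0.4}
theta = [{"bailout": 0.8, "strong": 0.1, "hurt": 0.1}]
theta_P = {"bailout": 0.1, "strong": 0.8, "hurt": 0.1}
theta_N = {"bailout": 0.1, "strong": 0.1, "hurt": 0.8}
senti = {"F": lambda j, w: theta[j][w],
         "P": lambda j, w: theta_P[w],
         "N": lambda j, w: theta_N[w]}

counts = {"bailout": 3, "strong": 1}        # one document d
pi = [1.0]                                  # pi_{d,j}, single topic
delta = [{"F": 0.34, "P": 0.33, "N": 0.33}]

def em_step(pi, delta):
    # E-step: responsibilities z[w][(j, S)], plus z[w]["B"] for background.
    z = {}
    for w in counts:
        parts = {"B": lambda_B * p_B[w]}
        for j in range(len(pi)):
            for S in ("F", "P", "N"):
                parts[(j, S)] = ((1 - lambda_B) * pi[j]
                                 * delta[j][S] * senti[S](j, w))
        norm = sum(parts.values())
        z[w] = {k: v / norm for k, v in parts.items()}
    # M-step: normalize expected counts.
    topic_mass = [sum(counts[w] * z[w][(j, S)]
                      for w in counts for S in ("F", "P", "N"))
                  for j in range(len(pi))]
    new_pi = [m / sum(topic_mass) for m in topic_mass]
    new_delta = [{S: sum(counts[w] * z[w][(j, S)] for w in counts)
                  / topic_mass[j]
                  for S in ("F", "P", "N")} for j in range(len(pi))]
    return new_pi, new_delta

pi, delta = em_step(pi, delta)
```

Because the toy document is dominated by the neutral topic word "bailout", one iteration shifts the sentiment coverage toward the neutral component F.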
Incorporating prior knowledge
• Without incorporating prior knowledge, the sentiment and topics are mixed and the results are highly biased towards the text dataset
• Prior data can be calculated by running this model on an existing text dataset where the positive and negative labels are known (fixing δj,d,S)
Prior knowledge from Morningstar
• Morningstar is a reputable investment website with written analyses of many stocks
• These analyses include a bullish (positive) and bearish (negative) perspective on the stock
• Used S&P 500 stocks to gather a general text dataset for sentiment
Calculated prior sentiments
θP: strong, brand, should, improv, provide, benefit, world, system, boost, develop
θN: may, hurt, press, declin, competit, difficult, suff, stil, be, could
Shortcomings of original TSM paper
• Sentiment models apply to the entire text set, as opposed to one per topic
• The generated topics are not verified, so it is difficult to tell whether results happened by random chance or whether the model consistently produces useful/coherent results
• Several parameters are set empirically, but with little detail given
Presentation Overview
• Background information about Yahoo! dataset
• Goals of project
• Topic Sentiment Mixture (TSM) model approach
• Experimentation
• Conclusions
Experiments
• For each dataset:
• Run the algorithm with and without preset topics; this tests how well the algorithm can detect topics on its own
• Run on an unfiltered and an SVM-filtered dataset
• All words stemmed with a Paice/Husk stemmer [2]
• Documents converted into term-document matrices using the Matlab Text-to-Matrix Generator (TMG) toolkit [3]
• Words that appear five times or fewer are dropped from the dataset (and, if necessary, corresponding documents)
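The last two preprocessing steps can be sketched in plain Python. Stemming and the TMG toolkit are omitted here; the documents and the count threshold are illustrative:

```python
# Sketch of the preprocessing pipeline: build a term-document count
# matrix, drop words appearing five times or fewer, and drop documents
# left empty after pruning. (The project used a Paice/Husk stemmer and
# the Matlab TMG toolkit; this pure-Python version is illustrative.)
from collections import Counter

def build_matrix(docs, min_count=6):
    # Total corpus-wide counts decide which words survive.
    totals = Counter()
    for d in docs:
        totals.update(d.lower().split())
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for t in d.lower().split():
            if t in index:
                row[index[t]] += 1
        if any(row):          # drop documents left empty after pruning
            rows.append(row)
    return vocab, rows

docs = ["bailout bailout uaw", "bailout uaw uaw",
        "bailout bailout uaw uaw", "rare word only once",
        "uaw uaw bailout"]
vocab, matrix = build_matrix(docs, min_count=6)
```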
GM messages - May 31, 2009
• Messages taken from May 31, 2009 around the time GM was filing for bankruptcy
• 1510 documents, 1090 unique words
• One run with preset topics of ‘bailout’, ‘uaw’, and ‘bondholder’, another run with no preset topics
GM set results with preset topics

θ1 = bailout: bailout, bank, sav, lmao, mess, republic, luck, announce, obam
θ2 = uaw: uaw, pay, your, unsec, pen, taxpayer, manage, task, bill
θ3 = bondholder: bondholder, bankruptcy, bill, deal, individ, warr, bondhold, equ, restruct
θ4: obama, and, died, lied, obam, car, wil, peopl, that
θ5: for, that, who, old, perc, they, but, monday, don

θP: should, sel, next, buy, own, help, wel, has, company
θN: not, may, hav, stil, wil, put, any, could, fail

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.2444  0.2695  0.2836  0.0947  0.0837
Ψ(j,F)   0.4755  0.4274  0.3485  0.7999  0.8256
Ψ(j,N)   0.2799  0.3029  0.3678  0.1052  0.0906
GM results with no preset topics

θ1: com, new, http, www, that, and, bloomberg, share, about
θ2: bondholder, obama, lied, died, the, individ, our, their, restruct
θ3: you, and, the, they, wil, bush, that, for, peopl
θ4: you, lik, was, but, fail, that, just, when, not
θ5: the, would, can, what, fil, bankruptcy, company, job, said

θP: should, system, cash, next, benefit, mov, famy, help, mill
θN: fund, cost, low, price, must, pen, dollar, already, interest

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.0429  0.0967  0.0364  0.0330  0.0397
Ψ(j,F)   0.9116  0.8119  0.9228  0.9307  0.9186
Ψ(j,N)   0.0453  0.0912  0.0406  0.0361  0.0416
GM messages - May 31, 2009
• Same message set, but filtered with SVM
• 511 documents, 505 unique words
• One run with preset topics of ‘bailout’, ‘uaw’, and ‘bondholder’, another run with no preset topics
GM SVM-filtered, preset topics

θ1 = bailout: his, was, presid, bush, with, econom, black, yet, democr
θ2 = uaw: uaw, bill, credit, pay, unsec, the, bondhold, class, produc
θ3 = bondholder: bondholder, bill, deal, off, equ, debt, investor, govern, individ
θ4: new, the, wil, and, hav, would, not, but, that
θ5: you, what, and, buy, obama, obam, how, the, want

θP: wel, sel, driv, strong, should, last, year, brand, per
θN: low, may, problem, might, stil, fail, press, cost, foreign

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.1239  0.1529  0.1305  0.0330  0.0205
Ψ(j,F)   0.7656  0.6801  0.6684  0.9333  0.9527
Ψ(j,N)   0.1104  0.1668  0.2010  0.0336  0.0267
GM SVM-filtered, no preset topics

θ1: obama, you, and, old, the, new, how, trad, pay
θ2: they, that, deal, want, hav, com, obam, http, not
θ3: real, right, the, them, and, just, could, bondholder, tax
θ4: buy, govern, bondholder, and, debt, what, mor, mak, blam
θ5: wil, new, company, has, plan, with, and, monday, that

θP: strong, brand, should, improv, provide, benefit, world, system, boost
θN: may, hurt, press, declin, competit, stil, difficult, suff, be

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.0017  0.0023  0.0015  0.0013  0.0021
Ψ(j,F)   0.9965  0.9954  0.9975  0.9963  0.9959
Ψ(j,N)   0.0016  0.0022  0.0009  0.0022  0.0019
Observations
• When topics are preset the EM algorithm learns more coherent topics than the randomly (not preset) initialized topics
• There are also stronger sentiments for pre-defined topics
• A fair number of background words make it into the randomly initialized topics
• The SVM filtering does not seem to have much impact
AAPL messages - Mar 12-15, 2010
• This was the weekend when the iPad presale began
• 905 documents, 727 unique words
• Run with preset topics of ‘ipad’, ‘macbook’, and ‘cap’
AAPL messages with preset topics

θ1: ipad, sale, day, produc, first, http, sold, com, apple
θ2: apple, for, board, macbook, battery, tho, iphon, put, beca
θ3: market, cap, big, talk, msft, sit, dollar, when, grow
θ4: how, they, week, the, right, now, for, you, that
θ5: you, what, aapl, out, hav, was, just, ther, the

θP: should, strong, doubl, cash, technolog, brand, help, allow, sel
θN: may, could, low, hou, like, hurt, mat, stil, cost

         θ1      θ2      θ3      θ4      θ5
Ψ(j,P)   0.0944  0.1268  0.1644  0.0300  0.0255
Ψ(j,F)   0.7657  0.6971  0.6771  0.9318  0.9374
Ψ(j,N)   0.1397  0.1760  0.1584  0.0380  0.0369
Conclusions
• TSM model can be applied to a message board dataset
• SVM filtering does not seem to be necessary, and might remove messages with interesting topics
• TSM picked out political topics that were not the primary focus of discussion.
References
[1] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW 2007, Data Mining track, pp. 171-180, 2007.
[2] PHP implementation of the Paice/Husk stemmer. http://alx2002.free.fr/utilitarism/stemmer/stemmer_en.html
[3] D. Zeimpekis and E. Gallopoulos. Design of a MATLAB toolbox for term-document matrix generation. In Proc. Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with the 5th SIAM Int'l Conf. on Data Mining), I. S. Dhillon, J. Kogan, and J. Ghosh, eds., pp. 38-48, April 2005, Newport Beach, California. Also Technical Report HPCLAB-SCG 2/02-05, Computer Engineering & Informatics Dept., University of Patras, Greece, February 2005.
Questions?
Topic Sentiment Mixture Model
• [show document generation process]
EM Update Equations