Mining Topics and Sentiments from Yahoo! Finance Message Boards

Jerry Fu
March 16, 2010

Thursday, April 15, 2010

Upload: others

Post on 17-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining Topics and Sentiments from Yahoo! Finance Message ...cse.ucsd.edu/sites/cse/files/cse/assets/studentaffairs/docs/Jerry.Fu... · Mining Topics and Sentiments from Yahoo! Finance

Mining Topics and Sentiments from Yahoo! Finance Message Boards

Jerry FuMarch 16, 2010

Thursday, April 15, 2010

Page 2: Mining Topics and Sentiments from Yahoo! Finance Message ...cse.ucsd.edu/sites/cse/files/cse/assets/studentaffairs/docs/Jerry.Fu... · Mining Topics and Sentiments from Yahoo! Finance

Presentation Overview

• Background information about Yahoo! dataset

• Goals of project

• Topic Sentiment Mixture (TSM) model approach

• Experimentation

• Conclusions


Yahoo! Finance Message Board

• Text dataset is extracted from Yahoo!’s finance message boards

• Individual message board for each stock ticker symbol

• Unmoderated postings from anyone that wishes to write on message boards


Example Messages


How can we make use of the information on Yahoo!?

• Previously developed a system that periodically downloads messages and stores them in a database

• Messages encoded with length, author, and relevance information

• Created message filtering system that uses an SVM classifier

• Train SVM with a set of handpicked messages


Examples from SVM filtering system

• High rated messages:

  • Article copied from the Financial Times: Moscow suspends By Catherine Belton and Charles Clover in Moscow...

  • J&J is one of only five companies rated with five stars in the United States. The company has a history of consecutive dividend payment for something like 75 years and the dividend is consistently increased by ten to fifteent percent year after year...

• Low rated messages:

  • Re: Law Firm has contacted us back! This law firm won against FDIC im in,,,,[email protected]

  • Ge is a 15 dollar stock.Put yor bit in there for this anti american socialist company.

  • Any doubt now that taxpayers did not pay for IRAQ war?George Bush and John McCain that Americans did not have to pay for the war in Iraq evidence by the fact that taxes were not raised to fund the war. After this bailout, does anyone have any doubt left that taxpayers are paying for the war? Who will pay the $6 trillion of debt accumulated by the GOP in 8 years? more than has been accumulated since independence.


Goals of TSM approach

• Find the topics discussed in messages, and the sentiment/opinion on these topics

• Original work by Mei et al. [1], presented at WWW 2007, focused on blog articles

• Re-use bullish/bearish sentiment from existing investment publication(s)


Key Questions

• Quality differences between blog articles and message board posts

• Is the TSM model effective on a non-blog text dataset?

• What is the effect of applying TSM in conjunction with SVM filtering?

• Does TSM find unexpected topics in the text?


Topic Sentiment Mixture Model

• Model that extracts topics and sentiments regarding these topics from a text dataset

• Topic: a set of semantically coherent words

• Sentiment: a set of words representing opinions about a topic. Typically, we have positive and negative sentiment

• Can incorporate prior data derived from other text datasets


Topic Sentiment Mixture Model

• Each word in each document d is sampled from one of the k topics (indexed by j) or the background model

• A word from one of the topics is sampled from either the neutral, positive or negative sentiment models

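[Figure: generation process for a word w in document d. With probability λB the word is drawn from the background model; otherwise one of the topics T1, ..., Tk is chosen with probability πdj, and the word is then drawn from that topic's neutral model θj, the positive model θP, or the negative model θN, with probabilities μj,d,F, μj,d,P, and μj,d,N.]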
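The generation process described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the presenter's implementation: the function name `sample_word`, the dictionary representation of the word distributions, and the toy distributions below are all hypothetical.

```python
import random

def sample_word(rng, lambda_b, background, pi_d, mu_d, topics, theta_p, theta_n):
    """Draw one word for a single document, following the TSM generation process.

    lambda_b   -- probability of drawing from the background model
    background -- dict word -> p(w|B)
    pi_d       -- list of topic probabilities for this document
    mu_d       -- per-topic (neutral, positive, negative) sentiment coverage
    topics     -- list of dicts word -> p(w|theta_j), one per topic
    theta_p/n  -- dicts for the shared positive / negative sentiment models
    """
    def draw(dist):
        words = list(dist)
        return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

    # Step 1: background model or one of the topics?
    if rng.random() < lambda_b:
        return draw(background)
    # Step 2: pick a topic j according to pi_d.
    j = rng.choices(range(len(topics)), weights=pi_d, k=1)[0]
    # Step 3: pick neutral (F), positive (P) or negative (N) for that topic.
    s = rng.choices(["F", "P", "N"], weights=mu_d[j], k=1)[0]
    model = {"F": topics[j], "P": theta_p, "N": theta_n}[s]
    return draw(model)

# Toy distributions (hypothetical values, chosen only for illustration).
background = {"the": 1.0}
topics = [{"bailout": 1.0}, {"uaw": 1.0}]
theta_p = {"strong": 1.0}
theta_n = {"hurt": 1.0}
```

Setting `lambda_b` close to 1 makes almost every sampled word a background word, which is one way to see why common function words should be absorbed by the background model rather than by the topics.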


Topic Sentiment Mixture Model

Let α = {d1, ..., dm} represent a collection of m messages. The following equation represents the log likelihood of the collection:

\[
\log(\alpha) = \sum_{d \in \alpha} \sum_{w \in V} c(w,d) \times \log \Big[ \lambda_B \, p(w|B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{dj} \times \big( \mu_{j,d,F} \, p(w|\theta_j) + \mu_{j,d,P} \, p(w|\theta_P) + \mu_{j,d,N} \, p(w|\theta_N) \big) \Big]
\]

where

• c(w,d) is the count of word w in document d

• λB is the probability of choosing a word from the background model

• πdj is the probability of the j-th topic occurring in document d

• μj,d,X, where X ∈ {F,P,N}, is the sentiment coverage of topic j in document d
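As a rough illustration, the log likelihood above can be evaluated directly from its definition. The sketch below assumes (hypothetically) that documents are dictionaries of word counts, that word distributions are dictionaries, and that the sentiment coverage is stored as `mu[d][j] = (neutral, positive, negative)`; none of these representations come from the slides.

```python
import math

def tsm_log_likelihood(counts, lambda_b, p_bg, pi, mu, theta, theta_p, theta_n):
    """Log likelihood of a collection under the topic-sentiment mixture.

    counts  -- list of dicts, counts[d][w] = c(w, d)
    p_bg    -- dict word -> p(w|B)
    pi      -- pi[d][j], probability of topic j in document d
    mu      -- mu[d][j] = (neutral, positive, negative) coverage of topic j in d
    theta   -- list of dicts, theta[j][w] = p(w|theta_j)
    """
    total = 0.0
    for d, doc in enumerate(counts):
        for w, c in doc.items():
            mix = 0.0
            for j, topic in enumerate(theta):
                m_f, m_p, m_n = mu[d][j]
                mix += pi[d][j] * (m_f * topic.get(w, 0.0)
                                   + m_p * theta_p.get(w, 0.0)
                                   + m_n * theta_n.get(w, 0.0))
            total += c * math.log(lambda_b * p_bg[w] + (1 - lambda_b) * mix)
    return total

# Toy check: one document containing the word "a" twice, one neutral-only topic.
ll = tsm_log_likelihood(
    counts=[{"a": 2}],
    lambda_b=0.5,
    p_bg={"a": 0.5, "b": 0.5},
    pi=[[1.0]],
    mu=[[(1.0, 0.0, 0.0)]],
    theta=[{"a": 1.0}],
    theta_p={"b": 1.0},
    theta_n={"b": 1.0},
)
```

In the toy check each occurrence of "a" has probability 0.5 × 0.5 + 0.5 × 1.0 = 0.75, so the result equals 2 · log(0.75).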


EM on Topic Sentiment Mixture Model

• Parameters to estimate:

  1. Topic (neutral sentiment) models θ1,...,θk

  2. Sentiment models θP and θN

  3. Document topic probabilities πdj

  4. Sentiment coverage per document μj,d,S, where S ∈ {F,P,N}

• Can estimate these parameters using expectation maximization (EM)

• Define hidden variables {zd,w,j,S} where S ∈ {F,P,N}

  • p(zd,w,j,S=1) represents the probability that word w in document d is generated from topic j using sentiment model S
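The transcript does not include the EM update equations, so the following is only a sketch of how the hidden-variable posterior could be computed, obtained by applying Bayes' rule to the mixture components of the log likelihood. It is an assumption, not necessarily the update used in the presentation or in [1]; the function name and data layout are hypothetical.

```python
def e_step_posteriors(w, lambda_b, p_bg, pi_d, mu_d, theta, theta_p, theta_n):
    """Posterior over where word w in one document came from.

    Returns a dict whose keys are "B" (background) or (j, S) with
    S in {"F", "P", "N"}; the values sum to 1.
    """
    # Unnormalized joint probability of each component generating w.
    joint = {"B": lambda_b * p_bg.get(w, 0.0)}
    for j, topic in enumerate(theta):
        models = {"F": topic, "P": theta_p, "N": theta_n}
        for idx, s in enumerate(["F", "P", "N"]):
            joint[(j, s)] = ((1 - lambda_b) * pi_d[j] * mu_d[j][idx]
                             * models[s].get(w, 0.0))
    norm = sum(joint.values())
    if norm == 0.0:
        raise ValueError("word has zero probability under every component")
    return {key: value / norm for key, value in joint.items()}

# Toy example with a single topic (all numbers are made up).
post = e_step_posteriors(
    "buy",
    lambda_b=0.5,
    p_bg={"buy": 0.1},
    pi_d=[1.0],
    mu_d=[(0.5, 0.5, 0.0)],
    theta=[{"buy": 0.2}],
    theta_p={"buy": 0.6},
    theta_n={},
)
```

In the toy example the unnormalized weights are 0.05 (background), 0.05 (neutral), 0.15 (positive) and 0 (negative), so the positive sentiment model receives 60% of the responsibility for the word.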


EM on Topic Sentiment Mixture Model

• Use EM to calculate the maximum likelihood estimate iteratively

  • E-step: calculate p(zd,w,j,S=1)

  • M-step: update the estimates of the parameters

• Results:

  1. Top positive and negative sentiment words (θP and θN)

  2. Topic words (θ1,...,θk)

  3. Overall sentiment strength for each topic:

\[
\Psi(j, S) = \frac{\sum_{d \in \alpha} \pi_{dj} \, \mu_{j,d,S}}{\sum_{d \in \alpha} \pi_{dj}}
\]
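The overall sentiment strength Ψ(j, S) is a weighted average of the per-document sentiment coverage, with the topic probabilities πdj as weights. A minimal sketch, assuming the same hypothetical layout `pi[d][j]` and `mu[d][j] = (neutral, positive, negative)`:

```python
def sentiment_strength(pi, mu, j, s):
    """Psi(j, S): coverage of sentiment s in topic j, averaged over documents
    and weighted by how strongly each document exhibits topic j."""
    idx = ["F", "P", "N"].index(s)
    num = sum(pi[d][j] * mu[d][j][idx] for d in range(len(pi)))
    den = sum(pi[d][j] for d in range(len(pi)))
    return num / den

# Two toy documents, two topics (made-up numbers).
pi = [[0.5, 0.5], [1.0, 0.0]]
mu = [
    [(0.6, 0.3, 0.1), (1.0, 0.0, 0.0)],
    [(0.9, 0.0, 0.1), (1.0, 0.0, 0.0)],
]
psi_topic0 = {s: sentiment_strength(pi, mu, 0, s) for s in ["F", "P", "N"]}
```

Because each document's coverage triple sums to 1, the three values of Ψ(j, ·) for a topic also sum to 1, which matches the columns of the result tables in the experiments up to rounding.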


Incorporating prior knowledge

• Without incorporating prior knowledge, the sentiment and topics are mixed and the results are highly biased towards the text dataset

• Prior data can be calculated by using this model on an existing text dataset where we know positive and negative labels (can fix μj,d,S)


Prior knowledge from Morningstar

• Morningstar is a reputable investment website with written analyses of many stocks

• These analyses include a bullish (positive) and bearish (negative) perspective on the stock

• Used S&P 500 stocks to gather a general text dataset for sentiment


Calculated prior sentiments

θP          θN
strong      may
brand       hurt
should      press
improv      declin
provide     competit
benefit     difficult
world       suff
system      stil
boost       be
develop     could


Shortcomings of original TSM paper

• Sentiment models apply only to entire text set, as opposed to one topic

• The generated topics are not verified, so it is difficult to tell whether the results happened by random chance or whether the model consistently produces useful/coherent results

• Several parameters are set empirically, but the paper gives little detail on how they were chosen


Experiments

• For each dataset:

  • Run the algorithm with and without preset topics. This tests how well the algorithm can detect topics on its own

  • Run on an unfiltered and an SVM-filtered dataset

• All words stemmed with a Paice/Husk stemmer [2]

• Documents converted into term-document matrices using Matlab Text-to-Matrix Generator toolkit (TMG) [3]

• Words that appear five times or fewer are dropped from the dataset (and, if necessary, corresponding documents)
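The rare-word filtering step can be illustrated with a short Python sketch. The experiments used the Matlab TMG toolkit [3]; the code below is not TMG, only a stand-in for the idea. It assumes "appear five times or fewer" refers to the total count across the collection (the slides do not say whether it is total count or document frequency), and the function name `drop_rare_words` is hypothetical.

```python
from collections import Counter

def drop_rare_words(docs, min_count=6):
    """Remove words seen fewer than min_count times in the whole collection,
    then remove any document left with no words."""
    totals = Counter(w for doc in docs for w in doc)
    kept = [[w for w in doc if totals[w] >= min_count] for doc in docs]
    return [doc for doc in kept if doc]

# Toy collection of already-stemmed tokens: "gm" occurs 6 times, "lmao" 3 times.
docs = [["gm"] * 6 + ["lmao"], ["lmao", "lmao"]]
filtered = drop_rare_words(docs)
```

Here the second toy document consists only of a rare word, so it is dropped entirely, which corresponds to the "if necessary, corresponding documents" part of the bullet.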


GM messages - May 31, 2009

• Messages taken from May 31, 2009 around the time GM was filing for bankruptcy

• 1510 documents, 1090 unique words

• One run with preset topics of ‘bailout’, ‘uaw’, and ‘bondholder’, another run with no preset topics


GM set results with preset topics

Top topic words:

θ1 = bailout   θ2 = uaw    θ3 = bondholder   θ4      θ5
bailout        uaw         bondholder        obama   for
bank           pay         bankruptcy        and     that
sav            your        bill              died    who
lmao           unsec       deal              lied    old
mess           pen         individ           obam    perc
republic       taxpayer    warr              car     they
luck           manage      bondhold          wil     but
announce       task        equ               peopl   monday
obam           bill        restruct          that    don

Top sentiment words:

θP          θN
should      not
sel         may
next        hav
buy         stil
own         wil
help        put
wel         any
has         could
company     fail

Sentiment strength per topic:

          θ1       θ2       θ3       θ4       θ5
Ψ(j,P)    0.2444   0.2695   0.2836   0.0947   0.0837
Ψ(j,F)    0.4755   0.4274   0.3485   0.7999   0.8256
Ψ(j,N)    0.2799   0.3029   0.3678   0.1052   0.0906


GM results with no preset topics

Top topic words:

θ1          θ2           θ3      θ4      θ5
com         bondholder   you     you     the
new         obama        and     lik     would
http        lied         the     was     can
www         died         they    but     what
that        the          wil     fail    fil
and         individ      bush    that    bankruptcy
bloomberg   our          that    just    company
share       their        for     when    job
about       restruct     peopl   not     said

Top sentiment words:

θP          θN
should      fund
system      cost
cash        low
next        price
benefit     must
mov         pen
famy        dollar
help        already
mill        interest

Sentiment strength per topic:

          θ1       θ2       θ3       θ4       θ5
Ψ(j,P)    0.0429   0.0967   0.0364   0.0330   0.0397
Ψ(j,F)    0.9116   0.8119   0.9228   0.9307   0.9186
Ψ(j,N)    0.0453   0.0912   0.0406   0.0361   0.0416


GM messages - May 31, 2009

• Same message set, but filtered with SVM

• 511 documents, 505 unique words

• One run with preset topics of ‘bailout’, ‘uaw’, and ‘bondholder’, another run with no preset topics


GM SVM-filtered, preset topics

Top topic words:

θ1 = bailout   θ2 = uaw    θ3 = bondholder   θ4      θ5
his            uaw         bondholder        new     you
was            bill        bill              the     what
presid         credit      deal              wil     and
bush           pay         off               and     buy
with           unsec       equ               hav     obama
econom         the         debt              would   obam
black          bondhold    investor          not     how
yet            class       govern            but     the
democr         produc      individ           that    want

Top sentiment words:

θP          θN
wel         low
sel         may
driv        problem
strong      might
should      stil
last        fail
year        press
brand       cost
per         foreign

Sentiment strength per topic:

          θ1       θ2       θ3       θ4       θ5
Ψ(j,P)    0.1239   0.1529   0.1305   0.0330   0.0205
Ψ(j,F)    0.7656   0.6801   0.6684   0.9333   0.9527
Ψ(j,N)    0.1104   0.1668   0.2010   0.0336   0.0267


GM SVM-filtered, no preset topics

Top topic words:

θ1      θ2      θ3           θ4           θ5
obama   they    real         buy          wil
you     that    right        govern       new
and     deal    the          bondholder   company
old     want    them         and          has
the     hav     and          debt         plan
new     com     just         what         with
how     obam    could        mor          and
trad    http    bondholder   mak          monday
pay     not     tax          blam         that

Top sentiment words:

θP          θN
strong      may
brand       hurt
should      press
improv      declin
provide     competit
benefit     stil
world       difficult
system      suff
boost       be

Sentiment strength per topic:

          θ1       θ2       θ3       θ4       θ5
Ψ(j,P)    0.0017   0.0023   0.0015   0.0013   0.0021
Ψ(j,F)    0.9965   0.9954   0.9975   0.9963   0.9959
Ψ(j,N)    0.0016   0.0022   0.0009   0.0022   0.0019


Observations

• When topics are preset the EM algorithm learns more coherent topics than the randomly (not preset) initialized topics

• There are also stronger sentiments for pre-defined topics

• Fair number of background words make it into randomly initialized topics

• The SVM filtering does not seem to have much impact


AAPL messages - Mar 12-15, 2010

• This was the weekend when iPad presale began

• 905 documents, 727 unique words

• Run with preset topics of ‘ipad’, ‘macbook’, and ‘cap’


AAPL messages with preset topics

Top topic words:

θ1      θ2        θ3       θ4      θ5
ipad    apple     market   how     you
sale    for       cap      they    what
day     board     big      week    aapl
produc  macbook   talk     the     out
first   battery   msft     right   hav
http    tho       sit      now     was
sold    iphon     dollar   for     just
com     put       when     you     ther
apple   beca      grow     that    the

Top sentiment words:

θP          θN
should      may
strong      could
doubl       low
cash        hou
technolog   like
brand       hurt
help        mat
allow       stil
sel         cost

Sentiment strength per topic:

          θ1       θ2       θ3       θ4       θ5
Ψ(j,P)    0.0944   0.1268   0.1644   0.0300   0.0255
Ψ(j,F)    0.7657   0.6971   0.6771   0.9318   0.9374
Ψ(j,N)    0.1397   0.1760   0.1584   0.0380   0.0369


Conclusions

• TSM model can be applied to a message board dataset

• SVM filtering does not seem to be necessary, and might remove messages with interesting topics

• TSM picked out political topics that were not the primary focus of discussion.


References

[1] Mei, Qiaozhu, Ling, Xu, Wondra, Matthew, Su, Hang, & Zhai, ChengXiang. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW 2007 / Track: Data Mining, pp. 171-180, 2007

[2] PHP Implementation of Paice/Husk stemmer. http://alx2002.free.fr/utilitarism/stemmer/stemmer_en.html

[3] D. Zeimpekis and E. Gallopoulos, "Design of a MATLAB toolbox for term-document matrix generation", Technical Report HPCLAB-SCG 2/02-05, Computer Engineering & Informatics Dept., University of Patras, Greece, February 2005. In Proc. Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with 5th SIAM Int'l Conf. Data Mining), I.S. Dhillon, J. Kogan and J. Ghosh eds., pp. 38-48, April 2005, Newport Beach, California.


Questions?


Topic Sentiment Mixture Model

• [show document generation process]


EM Update Equations
