LIWC Dictionary Expansion
DESCRIPTION
This presentation explains the research I did while working at the Social Computing Lab at KAIST. The main goal was to expand the LIWC vocabulary and adapt it for Twitter sentiment analysis. Download it to see the animations :)
TRANSCRIPT
![Page 1: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/1.jpg)
LIWC Dictionary Expansion
Luiz Gustavo Ferraz Aoqui
Social Computing Lab – GSCT – KAIST
![Page 2: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/2.jpg)
Motivation
• Dictionary-based classifiers have high precision
• But usually low recall
• Natural language is very dynamic
• New words appear
• Words change their meaning and sentiment
• Heaps’ Law
• Hard to update the dictionary at the same speed
![Page 3: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/3.jpg)
LIWC Dictionary
• Fairly large dictionary
• Almost 4,500 words and stems
• 406 positive
• 499 negative
• Developing and updating it is a long process
• Almost exclusively done manually
• Requires a lot of human resources
• Last update was in 2007
• Twitter was launched in July, 2006
![Page 4: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/4.jpg)
System overview
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>...
Positive:
.. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
album via luv photo ;- john pic different kno wearing
la ).
Negative:
!! :( ?? getting twitter omg ?! ppl :/ dude idk da
weather bout wtf iphone smh wat internet =( heat dnt
=/ facebook :| gosh kate :[ fml ima jon swear punch
text =[ cringe ): nd ** imma
![Page 5: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/5.jpg)
System overview
![Page 6: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/6.jpg)
System overview/Parser
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>...
haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.
![Page 7: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/7.jpg)
Parser
Extract tweet (RegEx)
Filter
Remove user name (RegEx)
Remove URL (RegEx)
Remove hash tag (RegEx)
Clean
Structured Text
Tweets
Clean Tweets
![Page 8: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/8.jpg)
Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
<d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2. #spymaster http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>
<t>(.*?)</t>
I just reached level 2. #spymaster http://bit.ly/playspy
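A minimal Python sketch of this extraction step, using the pattern from the slide. The function name `extract_tweet` is hypothetical. The record carries two `<t>` fields (tweet text and time zone), so the sketch assumes the first one is the tweet.

```python
import re

# Non-greedy group, so the match stops at the first closing </t>.
TWEET_RE = re.compile(r"<t>(.*?)</t>", re.DOTALL)

def extract_tweet(record):
    """Return the text of the first <t>...</t> field, or None if absent."""
    match = TWEET_RE.search(record)
    return match.group(1) if match else None

record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t> "
    "asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> "
    "<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>"
)
tweet = extract_tweet(record)
```

A greedy `(.*)` would run on to the last `</t>` and swallow the user fields and the time zone as well.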
![Page 9: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/9.jpg)
Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
#[0-9a-zA-Z+_]*
I just reached level 2. #spymaster http://bit.ly/playspy
I just reached level 2. http://bit.ly/playspy
![Page 10: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/10.jpg)
Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
((http://|www.)([a-zA-Z0-9/.~])*)
I just reached level 2. #spymaster http://bit.ly/playspy
I just reached level 2. #spymaster
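The cleaning filters could be chained roughly as below. The hash tag and URL patterns are the ones on the slides. The user name pattern `@\w+` and the function name `clean_tweet` are assumptions, since the slides do not show them.

```python
import re

HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")                # from the slides
URL_RE = re.compile(r"((http://|www.)([a-zA-Z0-9/.~])*)")  # from the slides
USER_RE = re.compile(r"@\w+")                              # assumption

def clean_tweet(text):
    """Strip URLs, hash tags and user names, then normalise whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, USER_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())

cleaned = clean_tweet("I just reached level 2. #spymaster http://bit.ly/playspy")
```

Note that the URL pattern as written does not cover `https://`, and the unescaped dot in `www.` matches any character.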
![Page 11: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/11.jpg)
System overview/Master
haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.
Index Frequency Chunks Co-frequency
![Page 12: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/12.jpg)
Master
[Diagram: the Master pipeline. Stages: Splitter, Indexer, Mapper (parallel workers M), Reducer (parallel workers R), Sort. Data: Tweets, Chunks, Index, Co-frequency, Unsorted Frequency, Frequency.]
![Page 13: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/13.jpg)
Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words in the LIWC dictionary
• Split the input file in smaller chunks
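A sketch of the splitter under simplifying assumptions: tweets are already in memory, words are matched by plain whitespace splitting (no stems), and the function name and the `n_chunks` parameter are hypothetical.

```python
import math

def split_into_chunks(tweets, vocabulary, n_chunks):
    """Keep tweets with at least one dictionary word, then cut them into chunks."""
    kept = [t for t in tweets if vocabulary & set(t.lower().split())]
    # Counting the kept lines first lets us size the chunks evenly.
    size = max(1, math.ceil(len(kept) / n_chunks))
    return [kept[i:i + size] for i in range(0, len(kept), size)]

tweets = [
    "that makes me happy",
    "I can see the bus again",
    "my life makes me sad",
    "i feel sick",
]
chunks = split_into_chunks(tweets, {"happy", "sad", "sick"}, n_chunks=2)
```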
![Page 14: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/14.jpg)
Master/Indexer
• Simply save the vocabulary to a file, sorted alphabetically
• Becomes important later on
![Page 15: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/15.jpg)
Master/Mapper
• Spawn processes in parallel and divide the chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
Worker
Chunk
someone 6, down 8, ever 10, kinda 2, crazy 14, …
Frequency.tmp
![Page 16: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/16.jpg)
Master/Mapper
• Spawn processes in parallel and divide the chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
• Second: save the co-words for each word
![Page 17: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/17.jpg)
haha
Worker
Master/Mapper
haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
haha
nooo
!
i
just
wanna
kill
mee
!!!!
i
didn`t
do
my
homework
...
and
i
feel
sick
=(
Split Words
Remove Duplicates
Generate files
Save co-words
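The two jobs of a mapper worker might look like the sketch below. It keeps everything in memory instead of writing one file per word, and it assumes that frequency counts every occurrence while co-words are counted once per tweet (after removing duplicates). The function name `map_chunk` is hypothetical.

```python
from collections import Counter, defaultdict

def map_chunk(tokenized_tweets):
    """Return (word frequencies, co-word counts) for one chunk of tweets."""
    frequency = Counter()
    co_words = defaultdict(Counter)
    for tokens in tokenized_tweets:
        frequency.update(tokens)      # job 1: (word, frequency) pairs
        unique = set(tokens)          # "Remove Duplicates"
        for word in unique:           # job 2: co-words of each word
            for other in unique:
                if other != word:
                    co_words[word][other] += 1
    return frequency, co_words

freq, co = map_chunk([["i", "feel", "sick", "i"], ["i", "feel", "happy"]])
```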
![Page 18: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/18.jpg)
Master/Mapper/Issues
• Splitting is not trivial
• Splitting on whitespace
• homework… ≠ homework
• Remove punctuation
• :) ☐
• Solution: RegEx again
• ([\w\-\'`]*)(\W*)
• File names:
• Unique, easy to find and respect OS rules
• Hash
• This is why the index file is important
![Page 19: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/19.jpg)
Master/Mapper/Issues
• Parallel programming in Python
• The original interpreter (CPython) doesn’t run threads in parallel because of its global interpreter lock…
• Alternatives, such as Jython and IronPython, do
• …but it is still possible to work in parallel
• Multi-thread vs. Multi-process
• Multi-process in Python
• multiprocessing module
• http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
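A small sketch of the multi-process approach with `multiprocessing.Pool`. The worker `count_words` and the sample chunks are illustrative only, not the original code.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Worker: count the words in one chunk of tweets."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.split())
    return counts

def count_in_parallel(chunks, processes=2):
    """Hand one chunk to each worker process and merge the partial counts."""
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_words, chunks)
    return sum(partial_counts, Counter())

if __name__ == "__main__":
    chunks = [["i feel sick", "i feel happy"], ["that makes me happy"]]
    print(count_in_parallel(chunks).most_common(3))
```

The `__main__` guard matters because worker processes may re-import the module when they start.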
![Page 20: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/20.jpg)
Master/Reducer
• Spawn processes in parallel and split the words among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and saves them
car 4, house 2, ball 5, car 1, house 1
frequency.tmp
car 5, house 3, ball 5
frequency.txt
Reducer
![Page 21: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/21.jpg)
Master/Reducer
• Spawn processes in parallel and split the words among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and saves them
• Second: sums the co-occurrence frequency
Worker
car 1, ball 3, car 2, house 1
trip
car 3, ball 3, house 1
trip
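The summing step of the reducer can be sketched in a few lines. `reduce_pairs` is a hypothetical name, and the pairs are passed in memory rather than read from the temporary files.

```python
from collections import Counter

def reduce_pairs(pairs):
    """Sum the counts of repeated words coming from different mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

totals = reduce_pairs(
    [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
)
```

The same function would serve for the co-occurrence counts of a single word, such as the `trip` example.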
![Page 22: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/22.jpg)
Master/Reducer/Issues
• Index file
• Useful to access the files
• Each word has a file with a list of co-words
• But the file name is hashed
• Non-invertible function
• Look up the word in the index, hash it, and get the file
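A sketch of the lookup. The slides do not say which hash was used, so MD5 here is an assumption, as are the directory name and the function name `co_word_path`.

```python
import hashlib
import os

def co_word_path(word, directory="co_words"):
    """Map any token, even ':)' or '=(', to a file name every OS accepts."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return os.path.join(directory, digest + ".txt")

# The hash cannot be inverted, so the sorted index keeps the original tokens.
index = sorted([":)", "=(", "haha", "sick"])
paths = {word: co_word_path(word) for word in index}
```

Walking the index and hashing each word again is the only way back to the files, which is why the index has to be saved.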
![Page 23: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/23.jpg)
Master/Sort
• Simply sort the frequencies file
• Most frequent first
![Page 24: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/24.jpg)
Classifier
Frequency
Co-frequency
Scores
New words
α β γ δ
Max results
![Page 25: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/25.jpg)
Classifier/Sentiment words
Frequency
Car 232, Ball 143, Street 125, House 121, Boat 114, Pencil 105, Pen 98, Computer 81
Top α%
Car, Ball, Street, House, Boat
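Selecting the top α% might be done as below. The rounding rule (round up) and the value α = 60 are assumptions chosen so that the sketch reproduces the five words on the slide.

```python
import math

def top_percent(frequencies, alpha):
    """Return the alpha% most frequent words, most frequent first."""
    ranked = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)
    keep = math.ceil(len(ranked) * alpha / 100)
    return [word for word, _ in ranked[:keep]]

frequencies = {
    "Car": 232, "Ball": 143, "Street": 125, "House": 121,
    "Boat": 114, "Pencil": 105, "Pen": 98, "Computer": 81,
}
top_words = top_percent(frequencies, alpha=60)
```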
![Page 26: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/26.jpg)
Classifier/Co-words
Car, Ball, Street
Top β%
Car: tire, door, engine
Ball: court, play, game
Street: name, size
![Page 27: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/27.jpg)
Classifier/Score
tire door engine
court play game
door size
size type home room
size door price
engine
tire
door
size
1 0
1 0
2
2
1
1
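The diagram on this slide is only partly recoverable. One possible reading, which is an assumption, is that a candidate's positive (negative) score is the number of positive (negative) sentiment words whose top co-word list contains it. In the sketch below the first three lists are treated as positive and the last two as negative, which is also an assumption.

```python
from collections import Counter

def score_candidates(positive_lists, negative_lists):
    """Return {candidate: (positive score, negative score)}."""
    pos = Counter(w for words in positive_lists for w in set(words))
    neg = Counter(w for words in negative_lists for w in set(words))
    return {w: (pos[w], neg[w]) for w in set(pos) | set(neg)}

scores = score_candidates(
    positive_lists=[
        ["tire", "door", "engine"],
        ["court", "play", "game"],
        ["door", "size"],
    ],
    negative_lists=[
        ["size", "type", "home", "room"],
        ["size", "door", "price"],
    ],
)
```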
![Page 28: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/28.jpg)
Classifier/Collapse
• Created to deal with problems like:
• :) :)) :), :).
• They should all be treated as the same token
• Harder for words
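A rough sketch of collapsing emoticon variants. The rules here (squeeze repeated characters, drop trailing `.` and `,`, leave anything with letters or digits alone) are assumptions, not the original implementation.

```python
import re

def collapse(token):
    """Map variants such as ':))' or ':),' to one canonical token."""
    if any(ch.isalnum() for ch in token):
        return token  # words are harder: 'feel' must not become 'fel'
    squeezed = re.sub(r"(.)\1+", r"\1", token)  # ':))' -> ':)'
    stripped = squeezed.rstrip(".,")            # ':),' -> ':)'
    return stripped or squeezed

variants = [collapse(t) for t in [":)", ":))", ":),", ":)."]]
```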
![Page 29: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/29.jpg)
Classifier/New words
• Rules to compare the scores
• So far the rules are
• If the positive score is bigger than the negative score plus delta, tag the word as positive
• Same idea for negative
• Returns the new words up to a maximum value
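The comparison rule could be written as follows. Ranking the results by score margin before cutting at the maximum is an assumption, and all names and sample scores are hypothetical.

```python
def classify_new_words(scores, delta, max_results):
    """Tag a word positive if pos > neg + delta, negative if neg > pos + delta."""
    positive, negative = [], []
    for word, (pos, neg) in scores.items():
        if pos > neg + delta:
            positive.append((pos - neg, word))
        elif neg > pos + delta:
            negative.append((neg - pos, word))
    # Largest margin first, then cut at the maximum number of results.
    positive = [w for _, w in sorted(positive, reverse=True)[:max_results]]
    negative = [w for _, w in sorted(negative, reverse=True)[:max_results]]
    return positive, negative

scores = {"luv": (9, 1), "smh": (0, 7), "dog": (3, 3), "aww": (5, 2)}
new_positive, new_negative = classify_new_words(scores, delta=2, max_results=1)
```

A larger delta trades recall for precision, since fewer ambiguous words such as `dog` get tagged.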
![Page 30: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/30.jpg)
Other ideas
• WordNet based
• PMI similarity score
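For reference, pointwise mutual information compares how often two words co-occur with how often they would co-occur if they were independent. A sketch from raw counts, with hypothetical parameter names:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), computed from counts."""
    return math.log2((count_xy * total) / (count_x * count_y))

independent = pmi(count_xy=1, count_x=10, count_y=10, total=100)
associated = pmi(count_xy=10, count_x=20, count_y=50, total=1000)
```

A score of zero means the two words co-occur exactly as often as chance predicts, and positive values indicate association.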
![Page 31: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/31.jpg)
Evaluation
• Two evaluation methods:
• First method
• Find tweets that could not be categorized before but now can be
• Manually check the precision of the result
• Second method
• Manually select positive and negative tweets
• Compare the precision of the old dictionary with the new dictionary
![Page 32: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/32.jpg)
Sub-product
• LIWC Dictionary Library for Python
• Provides easy access to the dictionary information
• Easy search
• Reverse index
• Match wildcard
• Ex.:
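The original example is not preserved in this transcript. The sketch below is a hypothetical illustration of the three features, not the library's actual API. It assumes LIWC-style entries in which a trailing `*` marks a stem.

```python
class LiwcDictionary:
    """Hypothetical wrapper: search, reverse index and wildcard matching."""

    def __init__(self, entries):
        # entries maps a pattern such as 'happ*' to a set of category names.
        self.exact = {p: c for p, c in entries.items() if not p.endswith("*")}
        self.stems = {p[:-1]: c for p, c in entries.items() if p.endswith("*")}
        self.reverse = {}  # category -> patterns
        for pattern, categories in entries.items():
            for category in categories:
                self.reverse.setdefault(category, set()).add(pattern)

    def categories(self, word):
        """Return every category whose entry matches the word."""
        found = set(self.exact.get(word, ()))
        for stem, categories in self.stems.items():
            if word.startswith(stem):
                found |= categories
        return found

liwc = LiwcDictionary({"happ*": {"posemo"}, "sad": {"negemo"}})
happy_categories = liwc.categories("happy")
```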
![Page 33: LIWC Dictionary Expansion](https://reader036.vdocuments.site/reader036/viewer/2022081404/559a56ab1a28abef788b4838/html5/thumbnails/33.jpg)