working with text data


Upload: katerina-willamow

Post on 10-May-2015


TRANSCRIPT

Page 1: Working with text data

Data Sources

How to retrieve the data?

Data preprocessing

Some key concepts

Facebook

R package

Twitter & R

Sentiment analysis (based on Chris Potts's tutorial)

Working with linguistic data

Ekaterina Vylomova

April 14, 2014

Ekaterina Vylomova Working with linguistic data

Page 2: Working with text data


Possible data sources

Dictionaries and corpora

Linked Data

Social media

Page 3: Working with text data

Thesauri & Corpora

WordNets

Roget's Thesaurus

Moby Project

Page 4: Working with text data

Moby Project

Moby Hyphenator - 185,000 entries fully hyphenated

Moby Language - Word lists in five of the world's great languages

Moby Part-of-Speech - 230,000 entries fully described by part(s) of speech

Moby Pronunciator - 175,000 entries fully International Phonetic Alphabet coded

Moby Thesaurus - 30,000 root words, 2.5 million synonyms and related words

Moby Words - 610,000+ words and phrases

Page 5: Working with text data

Linked & Structured Data

Using RDF format.

DBpedia is a project that extracts structured content from the information created as part of the Wikipedia project

Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members

BabelNet is a multilingual lexicalized semantic network and ontology, created automatically from Wikipedia.

YAGO is a knowledge base developed at the Max Planck Institute, also built automatically.

Page 6: Working with text data

Spoken corpus

TalkBank (multilingual): first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language.

CHILDES (part of TalkBank): Child Language Data Exchange System

Page 7: Working with text data

Sentiment data

SentiWordNet

Dictionary by Warriner et al.

Dictionary by Hu and Liu

Page 8: Working with text data

Social media

Rating systems: IMDB, Amazon, TripAdvisor, OpenTable

Sentiment: ExperienceProject, FMyLife, MyLifeIsAverage

Facebook (OpenGraph)

Twitter

Blogs (LiveJournal, Blogger, etc.)

Page 9: Working with text data

Possible ways to get the data

Corpora: just download them!

Facebook, Twitter and other social media: use the API

Blogs, Internet data: parse HTML or XML (download the webpage using wget/curl)

Linked data: parse RDF
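As an illustration of the "parse HTML" route, here is a minimal sketch that extracts the visible text from a page using only the Python standard library. The class name TextExtractor, the helper html_to_text and the sample page are hypothetical and not part of the original slides; a real page would first be fetched with wget/curl or urllib.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; in practice it would come from wget/curl
# or from urllib.request.urlopen(url).read().decode("utf-8").
sample_page = """
<html><head><title>My blog</title><style>p {color: red;}</style></head>
<body><h1>First post</h1><p>I really <b>love</b> corpora.</p>
<script>var x = 1;</script></body></html>
"""

print(html_to_text(sample_page))
```

The extracted string can then go through the preprocessing steps (tokenization, stop-word removal, stemming).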

Page 10: Working with text data

Don't forget this step!

Tokenization

Remove punctuation, and possibly numbers and stop words; lower-case everything

Lemmatization or stemming (Porter, Snowball)

For a bag-of-words model you may create a term x document or term x term matrix (using TF, TF-IDF, or RIDF for weighting)
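The preprocessing steps can be sketched in plain Python. This is only an illustration under simplifying assumptions: the stop-word list is a tiny made-up one, stemming is left out, and TF-IDF is computed as relative term frequency times log(N / document frequency), which is one common variant among several.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def preprocess(text):
    """Lower-case, drop punctuation and digits, tokenize, remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "The cat sat in the hat.",
    "The dog sat on 2 logs!",
    "Cats and dogs are animals.",
]
for row in tf_idf(docs):
    print(row)
```

Terms that occur in every document get weight 0 under this variant, which is the point of the IDF factor.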

Page 11: Working with text data

A few key terms from data mining

Compute set similarity: Jaccard, Dice, F-scores

Transform words to vectors: LSA, MDS

Get topics of documents: LDA

For classification you may use: SVM, neural networks, discriminant analysis, Bayesian networks, decision trees, random forest, AdaBoost

For clustering you may use: k-means, kNN, SOM, SVM

For regression you may use: SVM, neural networks, GLM, NLS
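To make the set-similarity measures concrete, here is a small sketch of the Jaccard and Dice coefficients over token sets. The function names and the example documents are made up for illustration; returning 1.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


doc1 = ["text", "mining", "with", "r"]
doc2 = ["text", "mining", "with", "python"]
print(jaccard(doc1, doc2))  # 3 shared terms out of 5 distinct ones
print(dice(doc1, doc2))     # 2 * 3 shared terms over 4 + 4 terms
```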

Page 12: Working with text data

Connect to Facebook OpenGraph

Get access token

Go to https://developers.facebook.com/tools/access_token/

Check that it works: https://developers.facebook.com/tools/explorer?method=GET&path=me%3Ffields%3Did%2Cname (query: me?fields=id,name,gender)

Use the tutorial: https://developers.facebook.com/docs/graph-api/common-scenarios/

Page 13: Working with text data

Facebook & Python

Download the package: https://github.com/pythonforfacebook/facebook-sdk

Install it: python setup.py install

Page 14: Working with text data

Facebook & Python

Get the names and genders of your friends. Possible project: predicting gender from names

import facebook

# paste the access token obtained from the Graph API tools
token = 'your_token'
graph = facebook.GraphAPI(token)
profile = graph.get_object("me")
friends = graph.get_connections("me", "friends")
friend_list = [friend['id'] for friend in friends['data']]
for friend_id in friend_list:
    # fetch each friend's profile and print name and gender when available
    data = graph.get_object(friend_id)
    if 'gender' in data:
        print(data['name'], data['gender'])

Page 15: Working with text data

Using R

Packages you may need

tm - text mining; tm.plugin.webmining adds web corpora, HTML parsers, and plain-text extraction

topicmodels - topicality

wordcloud - create a cloud of words

qdap - sentiment analysis

RCurl - curl (get the contents of a webpage)

twitteR - to use data from twitter

wordnet - WordNet usage (dictionary needed)

e1071 - machine learning (clustering, SVM, naive Bayes, LSA)

Page 16: Working with text data

Packages usage

Installation: install.packages("name")

Usage: library(name)

Page 17: Working with text data

Twitter with R

Load packages:

library(twitteR)

library(tm)

library(RCurl)

library(qdap)

library(wordcloud)

Page 18: Working with text data

Twitter with R

Get a token:

library(ROAuth)  # provides OAuthFactory
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "key"
consumerSecret <- "secret"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
# The method will return a link to get a PIN code; you should enter the code
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                                        package = "RCurl"))
registerTwitterOAuth(twitCred)

Page 19: Working with text data

Twitter with R

Get the data and convert it to a corpus:

# search by hashtag (you may also search by plain words); get n=1000 entries
gglTweets <- searchTwitter('#sochi2014', n=1000)
n <- length(gglTweets)
# show the first 3 entries
gglTweets[1:3]
# put it in a data frame
df <- do.call("rbind", lapply(gglTweets, as.data.frame))
# get dimensionality
dim(df)
# create a corpus
myCorpus <- Corpus(VectorSource(df$text))

Page 20: Working with text data

Twitter with R

Do normalization:

# lower-case everything (recent tm versions need content_transformer for base functions)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stopwords (very frequent words, e.g. articles, prepositions)
myStopwords <- c(stopwords('english'), "sochi", "amp", "get")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

Page 21: Working with text data

Twitter with R

Stem the documents:

# keep a copy to use as the dictionary for stem completion
dictCorpus <- myCorpus
# apply stemming for normalization; you may use lemmatization instead
myCorpus <- tm_map(myCorpus, stemDocument)
inspect(myCorpus[1:3])
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
inspect(myCorpus[1:3])

Page 22: Working with text data

Twitter with R

Create TDM:

# create a term-document matrix; you may use the TF or TF-IDF metric
# (wordLengths = c(1, Inf) keeps words of any length)
myDtm <- TermDocumentMatrix(myCorpus, control =
                            list(wordLengths = c(1, Inf),
                                 weighting = weightTfIdf))
inspect(myDtm[66:70, 11:20])
# frequent terms and associations
findFreqTerms(myDtm, lowfreq=10)

Page 23: Working with text data

Twitter with R

Create a wordcloud:

# convert the TDM to a plain matrix
m <- as.matrix(myDtm)
# sum each term's row and sort in decreasing order
v <- sort(rowSums(m), decreasing=TRUE)
# show the first 14 entries
head(v, 14)
# get the terms (the names of the sorted vector)
words <- names(v)
# create a data frame of words with their frequencies
dat <- data.frame(word=words, freq=v)
# create a wordcloud from words which appeared at least 5 times
wordcloud(dat$word, dat$freq, min.freq=5)

Page 24: Working with text data

Experience project; IMDB: Vector space models

Experience Project is a free social networking website consisting of various online communities. Users/members submit "experiences": personal stories, confessions, blogs, groups, photos, and videos. The users assign categories to the stories.

Example: I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted.

Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0;

Author age: 21

Author gender: female

Text group: friends

Page 25: Working with text data

Data

Let's load the data:

# read the .csv file with the data (stringsAsFactors=TRUE so that levels() works on text columns)
ep = read.csv('ep3-context.csv', stringsAsFactors=TRUE)

Here: Count is the number of Category reactions received by confessions containing Word in Group with an author of Gender and Age. Total is the number of Category reactions used by confessions containing any Word in Group with an author of Gender and Age.

Page 26: Working with text data

Data

Look at different parameters:

# show examples of words

levels(ep$Word)

Page 27: Working with text data

Words and categories

Word-Category Correlation

Check if there is any correlation between words and categories

# include source file
source('ep.R')
# create a subset for the word "funny"
funny = epCollapsedFrame(ep, 'funny')
# plot the counts of the word for each category
plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')

Page 28: Working with text data

Words and categories

Word-Category Correlation

"Funny" corresponds to the "understand" category. This doesn't look realistic.

Page 29: Working with text data

Words and categories

Word-Category Correlation

We need normalization!

# apply normalization (divide each Count by the Total for its category)
funny$Count / funny$Total
# get a subset for "funny", taking frequencies into account
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
# create a plot
plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')

Page 30: Working with text data

Words and categories

Word-Category Correlation

Much better!

Page 31: Working with text data


Probability theory

Get category from word

Freq corresponds to the conditional probability P(word|category), i.e. the probability that a speaker used 'word' in a given 'category'. Let's apply Bayes' rule and compute P(category|word), i.e. the probability of the category given that a speaker used 'word'. Dividing each Freq by the sum over categories amounts to assuming a uniform prior over categories.

funny$Freq / sum(funny$Freq)
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')

Question: are there any other words specific to a category?
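The same computation can be followed on a small example in Python. The counts and totals below are invented for illustration (they are not taken from the Experience Project data); they are chosen so that the raw counts peak at "understand" while the normalized probabilities peak at "teehee".

```python
categories = ["hugs", "rock", "teehee", "understand", "wow"]
counts = [10, 20, 60, 80, 30]            # hypothetical Count per category
totals = [1000, 2000, 1500, 8000, 3000]  # hypothetical Total per category

# Freq = Count / Total, an estimate of P(word | category)
freq = [c / t for c, t in zip(counts, totals)]

# Pr = Freq / sum(Freq), i.e. P(category | word) under a uniform category prior
pr = [f / sum(freq) for f in freq]

for name, c, f, p in zip(categories, counts, freq, pr):
    print(f"{name:12s} count={c:3d} freq={f:.3f} pr={p:.3f}")
```

Large categories dominate raw counts simply because they contain more reactions overall; dividing by Total removes that effect before the probabilities are compared.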

Page 32: Working with text data

Compare with estimated value

Estimate expected value

funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)

Expected value: $\mathrm{Exp} = \sum_{i=1}^{N} x_i \, p(x_i)$, where $p(x_i)$ is the probability of $x_i$.

# probability of each category
category.probs = (funny$Total/sum(funny$Total))
# total count of the word across categories
funny.count = sum(funny$Count)
# expected count of the word in each category
funny.expected = funny.count * category.probs
funny.expected

Compare expected and observed values:

# funny$Count holds the observed counts
(funny$Count / funny.expected) - 1

A value less than 0 means that the word is underrepresented in the category.
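A short Python sketch of the observed/expected comparison follows. The numbers are invented for illustration and the function name observed_expected is hypothetical; the logic mirrors the R lines: expected count = total word count times the category's share of all reactions.

```python
def observed_expected(counts, totals):
    """Return (expected counts, observed/expected - 1) per category."""
    word_count = sum(counts)            # total occurrences of the word
    grand_total = sum(totals)
    category_probs = [t / grand_total for t in totals]
    expected = [word_count * p for p in category_probs]
    oe = [c / e - 1 for c, e in zip(counts, expected)]
    return expected, oe


categories = ["hugs", "rock", "teehee", "understand", "wow"]
counts = [10, 20, 60, 80, 30]            # hypothetical observed counts
totals = [1000, 2000, 1500, 8000, 3000]  # hypothetical category totals

expected, oe = observed_expected(counts, totals)
for name, e, value in zip(categories, expected, oe):
    label = "over" if value > 0 else "under"
    print(f"{name:12s} expected={e:7.2f} O/E-1={value:+.2f} ({label}represented)")
```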

Page 33: Working with text data

Adding context: 'awesome' by gender

Usage of 'awesome' by male/female/unknown

eptok = read.csv('ep3-context-tokencounts.csv')
par(mfrow=c(1,3))
epPlot(ep, eptok, 'awesome', genders='male', probs=T)
epPlot(ep, eptok, 'awesome', genders='female', probs=T)
epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)

Page 34: Working with text data

Adding context: 'awesome' by gender

Usage of 'awesome' by male/female/unknown

Page 35: Working with text data

Adding context: 'awesome' by age

Usage of 'awesome' by people of different ages

par(mfrow=c(2,3))
for (i in 1:5) { epPlot(ep, eptok, 'awesome', ages=i, probs=T) }

Page 36: Working with text data

Adding context: 'awesome' by age

Usage of 'awesome' by people of different ages

Page 37: Working with text data

Adding context: 'awesome' comparing gender with the category

'Awesome': gender+category

Changing the parameter for each category separately:

epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')

Page 38: Working with text data

Adding context: 'awesome' comparing gender with the category

'Awesome': gender+category

Page 39: Working with text data

Adding context: 'drunk' comparing age with the category

'Drunk': age+category

Stories with "drunk" depend on the age:

epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')

Page 40: Working with text data

Adding context: 'drunk' comparing age with the category

'Drunk': age+category

Page 41: Working with text data

Creating a logistic regression model

Regression modelling

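Let's create a regression model: predict the frequency of 'drunk' using age and category

drunk = epFullFrame(ep, 'drunk', age=c(1,2,3,4,5))
# treat the age group as a numeric predictor
drunk$Age = as.numeric(drunk$Age)
# binomial GLM on (successes, failures); "- 1" drops the intercept so each category gets its own coefficient
fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age, data=drunk, family=binomial)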

summary(fit.glm)

Page 42: Working with text data

Creating a logistic regression model

Regression modelling

Define a function that predicts the probability of the word from the category and the author's age

FittedGlmFunc = function(fit, category, age) {
  coefs = fit$coef
  # coefficient names look like 'Categorywow'
  cat.coef = coefs[[paste('Category', category, sep='')]]
  # inverse logit of the linear predictor
  prediction = plogis(cat.coef + coefs[['Age']]*age)
  return(prediction)
}

Calling the function:

FittedGlmFunc(fit.glm, 'wow', 1)
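To show what the prediction step computes, here is a Python sketch of the same formula: the inverse logit of the category coefficient plus the age coefficient times age. The coefficient values are invented for illustration and do not come from a fitted model; plogis and fitted_glm are hypothetical Python counterparts of the R functions.

```python
import math


def plogis(x):
    """Inverse logit (logistic function), like R's plogis."""
    return 1.0 / (1.0 + math.exp(-x))


def fitted_glm(coefs, category, age):
    """Predicted probability for a category and age group."""
    return plogis(coefs["Category" + category] + coefs["Age"] * age)


# Hypothetical coefficients, named the way R names them for
# `~ Category - 1 + Age` (one coefficient per category, one slope for Age).
coefs = {
    "Categoryhugs": -7.4,
    "Categoryrock": -7.1,
    "Categoryteehee": -6.5,
    "Categoryunderstand": -7.3,
    "Categorywow": -6.9,
    "Age": 0.15,
}

for age in range(1, 6):
    print(age, fitted_glm(coefs, "wow", age))
```

Because the Age coefficient is shared by all categories, the model shifts every category's curve by the same amount on the logit scale as age changes.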

Page 43: Working with text data

Creating a logistic regression model

Regression modelling

Visualize the data and compare empirical (black) values with fitted (red) values.

par(mfrow=c(2,3))
cats = levels(ep$Category)
for (i in 1:5) {
  # empirical values for age group i
  epPlot(ep, eptok, 'drunk', ages=i)
  for (j in 1:5) {
    # fitted value for category j and age group i
    val = FittedGlmFunc(fit.glm, cats[j], i)
    points(j, val, col='red', pch=19)
  }
}

Page 44: Working with text data

Calculating expected value

Regression modelling

Visualize the data and compare empirical (black) values with fitted (red) values.

Page 45: Working with text data

IMDB data

Analysis of "ADV-ADJ" collocations

Page 46: Working with text data

Data from rating systems

Data

We will use the data from rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:

d = read.csv('ratings-advadj.csv')

head(d)

Page 47: Working with text data

Extract subsets

'Horrid' categories

horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)

nrow(horrid)

head(horrid)

Page 48: Working with text data

Extract subsets

'Absolutely'+'Horrid'

With modifier:

horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')

nrow(horrid)

head(horrid)

Page 49: Working with text data

Tonality evaluation for adjectives

Probabilities of categories for 'horrid'

horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)

horrid

Page 50: Working with text data

Tonality

Probabilities vs frequencies

par(mfrow=c(1,2))

ratingPlot(d, 'horrid', probs=FALSE)
ratingPlot(d, 'horrid', probs=TRUE)

Question: give an example of an adjective that maximizes the median point of the plot.

Page 51: Working with text data

Evaluating expectation

Predict category using adjective

Predict a category based on an adjective. Expectation:

sum(horrid$Category * horrid$Pr)

The ExpectedCategory function does the same:

ExpectedCategory(horrid)

Adding the value to the plot:

ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
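The expectation is a probability-weighted average of the rating categories. A minimal Python sketch, with an invented distribution for a negative adjective (the numbers are not from the ratings data, and expected_category is a hypothetical stand-in for the R helper):

```python
def expected_category(ratings, probs):
    """Expected rating: sum of rating * P(rating | word)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(r * p for r, p in zip(ratings, probs))


ratings = [1, 2, 3, 4, 5]
# hypothetical P(rating | word) for a negative adjective
probs = [0.5, 0.2, 0.1, 0.1, 0.1]

print(expected_category(ratings, probs))  # approximately 2.1
```

A word used mostly in low-star reviews gets an expected category near the bottom of the scale, so the value can serve as a rough polarity score.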

Page 52: Working with text data

Evaluating expectation

Predict category using adjective

Page 53: Working with text data

Regression model

A model for predicting

Let's create a model to predict the probability that a word will be in a particular category

fit.horrid = glm(cbind(horrid$Count, horrid$Total-horrid$Count) ~ Category, family=quasibinomial, data=horrid)

fit.horrid

Page 54: Working with text data

Regression model

A model for predicting

Page 55: Working with text data

Regression model

A model for predicting

Improve the model by adding a quadratic term

GlmWordQuadratic <- function(pf) {
  # add the squared rating as a second predictor
  pf$Category2 = pf$Category^2
  fit = glm(cbind(Count, Total-Count) ~ Category + Category2, family=quasibinomial, data=pf)
  return(fit)
}
par(mfrow=c(2,2))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))

Page 56: Working with text data


Regression model

A model for predicting

Page 57: Working with text data

Vector space models

Vector space models

How to transform words to vectors:

LSA (latent semantic analysis)

MDS (multidimensional scaling)

Page 58: Working with text data

Basics about vectors

Euclidean distance:

$$\mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Vector length:

$$\mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Vector normalization: each component is divided by the vector's length. Cosine distance between vectors:

$$\mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)}$$
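These definitions translate directly into a few lines of Python. The sketch below uses plain lists and assumes non-zero vectors of equal length; the function names are chosen here to mirror the formulas and are not taken from vsm.R.

```python
import math


def _check(x, y):
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")


def vector_length(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def euclidean_dist(x, y):
    _check(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def length_norm(x):
    """Divide each component by the vector's length."""
    length = vector_length(x)
    return [v / length for v in x]


def cosine_dist(x, y):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    _check(x, y)
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (vector_length(x) * vector_length(y))


a = [1000, 2000, 3000]
b = [1, 2, 3]
print(euclidean_dist(a, b))  # large: the magnitudes differ
print(cosine_dist(a, b))     # close to 0: the directions coincide
```

Cosine distance ignores magnitude, which is why it is often preferred for comparing co-occurrence vectors of frequent and rare words.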

Page 59: Working with text data


Vector space models

Data from IMDB

Initial data: a term x term matrix, where element x_ij is the frequency of co-occurrence of term i and term j in a context (document, sentence, etc.)

source('vsm.R')
# co-occurrence matrix (words appearing in the same context: phrase, sentence, paragraph)
imdb = Csv2Matrix('imdb-wordword.csv')
imdb[100:110, 100:110]

Page 60: Working with text data

Semantically related words

Extract semantically related words

df = Neighbors(imdb, 'happy')

head(df)

Page 61: Working with text data

Semantically related words

Problem: both normalizations erase the difference in magnitude between a and b

a = c(1000, 2000, 3000)
b = c(1, 2, 3)
a/sum(a)
# 0.1666667 0.3333333 0.5000000
b/sum(b)
# 0.1666667 0.3333333 0.5000000
LengthNorm(a)
# 0.2672612 0.5345225 0.8017837
LengthNorm(b)
# 0.2672612 0.5345225 0.8017837

Page 62: Working with text data

PMI - Pointwise mutual information

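How to deal with it? - PMI!

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

PMI normalization:

$$\mathrm{NPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right) + 1}$$

where p(i, j) = M_ij / sum(M), and M is the term x term matrix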
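As a rough illustration of the PMI formula, the following Python sketch computes (positive) PMI for every cell of a small count matrix. It is not the PMI function from vsm.R: the discounting factor is left out, and treating zero counts as PMI 0 is a convention chosen here.

```python
import math


def pmi_matrix(counts, positive=True):
    """PMI for each cell of a co-occurrence count matrix (list of rows)."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                out.append(0.0)  # convention: avoid log(0)
                continue
            p_ij = c / total
            p_i = row_sums[i] / total
            p_j = col_sums[j] / total
            value = math.log(p_ij / (p_i * p_j))
            # positive PMI clips negative associations to 0
            out.append(max(value, 0.0) if positive else value)
        result.append(out)
    return result


# Hypothetical 2 x 2 co-occurrence counts
associated = [[10, 0], [0, 10]]
independent = [[1, 1], [1, 1]]
print(pmi_matrix(associated))   # diagonal cells get log(2)
print(pmi_matrix(independent))  # all zeros: no association
```

Unlike raw counts or length-normalized counts, PMI compares each co-occurrence with what independence would predict, so very frequent words no longer dominate.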

Page 63: Working with text data

PMI - Pointwise mutual information

PMI

imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)

head(df)

Page 64: Working with text data

Semantic orientation method

Semantic orientation

Describe 2 sets of words S1 and S2 (vector representations)

Choose the distance measure

For a word w: calculate the sum of distances to the vectors of S1 and of S2

The tonality is the difference between the two sums
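The four steps can be sketched in Python with toy vectors. Everything below is illustrative: the two-dimensional vectors are made up, and semantic_orientation is a hypothetical stand-in for the R helper, computing the sum of distances to the first seed set minus the sum of distances to the second.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


def semantic_orientation(vectors, word, seeds1, seeds2, distfunc=cosine_distance):
    """Sum of distances to seeds1 minus sum of distances to seeds2."""
    w = vectors[word]
    dist1 = sum(distfunc(w, vectors[s]) for s in seeds1 if s in vectors)
    dist2 = sum(distfunc(w, vectors[s]) for s in seeds2 if s in vectors)
    return dist1 - dist2


# Toy word vectors (hypothetical); real ones would be rows of a PMI matrix.
vectors = {
    "bad": [5, 1], "poor": [4, 1],
    "good": [1, 5], "nice": [1, 4],
    "great": [1, 6], "horrid": [6, 1],
}
neg = ["bad", "poor"]
pos = ["good", "nice"]

print(semantic_orientation(vectors, "great", neg, pos))   # positive score
print(semantic_orientation(vectors, "horrid", neg, pos))  # negative score
```

With the negative seeds passed first, a word far from the negative set and close to the positive set gets a score above 0.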

Page 65: Working with text data

Semantic orientation method

Example of semantic orientation method

neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# 0.8923544
SemanticOrientation(imdb.ppci, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# -0.04741898

Page 66: Working with text data

More information

Data & examples

For more detailed examples and tutorials about sentiment analysis, go to Chris Potts's tutorials.

http://nasslli2012.christopherpotts.net

http://sentiment.christopherpotts.net

Email me if you need any help!
