text analytics intro

Post on 15-Dec-2014

Category: Data & Analytics

DESCRIPTION

This is a simple text analytics intro I put together for people with traditional numeric backgrounds who want to venture into text prediction. Some of this work came out of a competition that Skullcandy helped facilitate.

TRANSCRIPT

Intro 2 text analytics | Ben Taylor @bentaylordata

Text Analytics Are Awesome!

Thank you to our Sponsors!

HIREVUE | TALENT INTERACTION

Agenda

1 Text handling, introduction

2 Sentiment

3 SPAM

4 Levenshtein distance (word, sentence, cloud)

5 Map Reduce / Clustering

6 Interview text analytics

Text handling: Input not expected?

[Diagram: Input → Model (M) → Output]

[Diagram: Input → Model (M)]

[Diagram: Input → Model (M) → Output, with Stderr: "You’re an idiot & I don’t like you anymore"]

Need to map unstructured text to summary metric

Sentiment: How are you feeling?

Let’s make this easy.

Problem statement: Expletives + @skullcandy mention? Good or bad?

Negative Sentiment

1048940088: "I've got two pairs of Ink'd earbuds by @Skullcandy and they both broke in two weeks. I $#@&ing hate @Skullcandy! #$#@&You"

1054044204: "$#@& only one headphone stopped working stupid $#@&ing headphones y is it only one headphone i blame you @skullcandy"

1376767884: "@skullcandy never buyin another pair of skull candy headphones this is the fourth pair in the last 2 months that $#@&ed up"

141343855: "My headphones blew $#@& you skullcandy -___-"

16352011: "BAHHHHH My SkullCandys are $#@&ing up AGAIN!"

1376767884: "@skullcandy $#@& skullcandy"

Positive Sentiment

161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"

1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"

1117713458: "@skullcandy $#@&in bass is badass"

1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"

1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"

132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."

Neutral Sentiment

1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"

Conclusion

Sentiment classification    Count
Negative                    6
Positive                    6
Neutral                     1

46% chance (6 of 13) that a tweet is negative. Now what?

Welcome to the majority of the sentiment solutions on the market:

Single-word naïve Bayesian classification
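The 46% figure is simply the class prior: 6 negative tweets out of 13. A single-word naive Bayes classifier adjusts that prior using per-word counts. The sketch below is a minimal illustration of the idea, not the method used in the competition. The whitespace tokenizer, the add-one smoothing, and the cleaned-up training tweets are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()

def train(labeled_tweets):
    """Count words per label and tweets per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labeled_tweets:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest posterior log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_tweets = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_tweets in label_counts.items():
        counts = word_counts[label]
        n_words = sum(counts.values())
        score = math.log(n_tweets / total_tweets)  # class prior
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Class prior from the slide's counts.
slide_counts = {"negative": 6, "positive": 6, "neutral": 1}
prior_negative = slide_counts["negative"] / sum(slide_counts.values())
print(round(prior_negative * 100))  # 46

# Hypothetical, cleaned-up training tweets (not the competition data).
training = [
    ("they both broke in two weeks i hate these", "negative"),
    ("only one headphone stopped working", "negative"),
    ("never buying another pair", "negative"),
    ("the bass is truly amazing", "positive"),
    ("ur headphones are awesome", "positive"),
    ("awesome bass", "positive"),
]
word_counts, label_counts = train(training)
print(classify("stopped working broke", word_counts, label_counts))  # negative
print(classify("awesome bass", word_counts, label_counts))           # positive
```

Because every word is scored independently, an expletive contributes the same evidence whether it sits next to "hate" or next to "awesome".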

Positive Sentiment (second pass)

161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"

1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"

1117713458: "@skullcandy $#@&in bass is badass"

1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"

1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"

132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."

1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"

Conclusion

Sentiment classification    Count
Negative                    6
Positive                    ~0
Neutral                     ~0

~100% chance that a tweet is negative with tuple (word combination) assistance. How do we find complex tuples automatically?

Bayesian bootstrap matrix

[Figure: square matrix; both axes are labeled "Unique words in training cloud"]
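The slide does not say how the matrix is filled in. One possible reading is a table of word pairs (2-tuples) over the unique words of the training cloud. The sketch below only counts how often two words appear in the same tweet. It is an assumption for illustration, not the "Bayesian bootstrap" procedure itself, and the function name and sample tweets are hypothetical.

```python
from itertools import combinations

def pair_matrix(tweets):
    """Square co-occurrence matrix over the unique words in `tweets`.

    matrix[i][j] is the number of tweets containing both words[i] and words[j].
    """
    token_sets = [set(t.lower().split()) for t in tweets]
    words = sorted(set().union(*token_sets))
    index = {w: i for i, w in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for tokens in token_sets:
        for a, b in combinations(sorted(tokens), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1  # keep the matrix symmetric
    return words, matrix

# Hypothetical sample tweets.
words, matrix = pair_matrix(["bass is awesome", "awesome bass", "pair broke"])
i, j = words.index("awesome"), words.index("bass")
print(words)         # ['awesome', 'bass', 'broke', 'is', 'pair']
print(matrix[i][j])  # 2
```

Building one such matrix per sentiment class and comparing them cell by cell would surface pairs that occur in only one class, which is one way to look for tuples automatically.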

Basic sentiment output

Credit: Ben Peters

Keyword     Negative    Positive
warranty    28.7        1
cant        11.8        1
back        11.8        1
break       11.8        1
after       11.1        1
what        9.1         1
never       9.1         1
Don’t       9.1         1
second      8.4         1
side        8.4         1

SPAM: I can’t handle this

Lost future customer

SPAM examples:

>80%

SPAM list

Keyword            Spam    Good
@nikesb            52.0    1
@lrgskate          52.0    1
live               34.0    1
know               1       28.8
have               1       22.3
pair               1       16.3
earbud             16.1    1
Non-ascii-chars    12.4    1
some               1       11.9
check              1       11.6

Credit: Ben Peters

Training… Where do you get your training set?

What about @#tags? Misspellings? SPAM?

Manual trainer: http://54.186.199.209/

Credit: Ben Peters

Levenshtein: Now things are getting interesting

The things we take for granted

You type: Awsome

Computer: It’s actually spelled Awesome

① kitten → sitten (substitution of "s" for "k")
② sitten → sittin (substitution of "i" for "e")
③ sittin → sitting (insertion of "g" at the end)
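The three edits above are what the Levenshtein distance counts: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch of the standard dynamic-programming computation follows; the function name is an assumption for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Awsome", "Awesome"))  # 1
```

A spell checker can use this by suggesting the dictionary word with the smallest distance to what was typed.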

Levenshtein word level

Ref: I am going skiing tomorrow

Hyp: I am going skiing on Saturday
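At word level the same edit distance is computed over tokens instead of characters. The sketch below assumes whitespace tokenization and uses the common definition of word error rate (edits divided by the number of reference words). The `wer` function used later in the deck may scale its result differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

ref = "I am going skiing tomorrow".split()
hyp = "I am going skiing on Saturday".split()

distance = edit_distance(ref, hyp)
word_error_rate = distance / len(ref)
print(distance)         # 2 (substitute "tomorrow" with "on", insert "Saturday")
print(word_error_rate)  # 0.4
```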

Levenshtein word-cloud level

Ref:
alphanumeric_sort(word_cloud_1)
alphanumeric_sort(unique(word_cloud_1))

Hyp:
alphanumeric_sort(word_cloud_2)
alphanumeric_sort(unique(word_cloud_2))

>> wer(str1,str1)
ans = 0

>> wer(strjoin(sort(strsplit(str1,' ')),' '),str1)
ans = 15
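The console output shows why sorting matters: the same words in a different order already produce a non-zero word-level distance. Sorting the unique words of both clouds first makes the comparison independent of word order. The Python sketch below illustrates that idea. Lowercasing, whitespace tokenization, and the function names are assumptions, and the sample clouds are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def cloud_distance(cloud_1, cloud_2):
    """Word-level edit distance between the sorted unique words of two clouds."""
    ref = sorted(set(cloud_1.lower().split()))
    hyp = sorted(set(cloud_2.lower().split()))
    return edit_distance(ref, hyp)

# Same words, different order and repetition: distance 0 after sorting.
print(cloud_distance("skiing winter snow skiing", "snow winter skiing"))  # 0
# One word differs between the clouds: distance 1.
print(cloud_distance("snow winter skiing", "snow summer skiing"))         # 1
# Without sorting, reordering alone is penalized.
print(edit_distance("skiing winter snow".split(), "snow winter skiing".split()))  # 2
```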

MapReduce: Great for text processing, e.g. word counts
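Word counting is the classic MapReduce example: the map step emits a (word, 1) pair per word, the pairs are grouped by word, and the reduce step sums each group. The sketch below simulates those three steps in a single Python process. A real framework would distribute them across machines, and the function names and sample documents are assumptions.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample documents.
counts = word_count(["the bass is amazing", "the bass broke"])
print(counts["bass"])   # 2
print(counts["broke"])  # 1
```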

CLUSTERING: Now things are getting interesting

Group of tweets? Once we have categorized tweets we can build word clouds!!!

Category A (could be negative sentiment, low selling areas, etc.)

Category B (could be positive sentiment, high selling areas, etc.)

[Figure: one word cloud per category]

Levenshtein wordcloud similarity

Cluster 1 example

Camping, Virgin, Gaming, Battlefield

Cluster 2 example

Skiing, winter, Stringray

Cluster 3 example

MMA, Boxing, Skateboarding

Twitter Surgery

[Figure: (image) - (image) = (image)]

Training a blacklist filter

Acting…

Getting…

Holding…

Going…

Brings…

Turning…

Blacklist dictionary
