text analytics intro

Intro 2 text analytics | Ben Taylor @bentaylordata Text Analytics Are Awesome!

Upload: benjamin-taylor

Post on 15-Dec-2014

Category:

Data & Analytics


DESCRIPTION

This is a simple text analytics intro I put together for people with traditional numeric backgrounds who want to venture into text prediction. Some of this work came out of a competition that Skullcandy helped facilitate.

TRANSCRIPT

Page 1: Text analytics intro

Intro 2 text analytics | Ben Taylor @bentaylordata

Text Analytics Are Awesome!

Page 2: Text analytics intro

Thank you to our Sponsors!

Page 3: Text analytics intro

HIREVUE | TALENT INTERACTION

Agenda

1. Text handling, introduction
2. Sentiment
3. SPAM
4. Levenshtein distance (word, sentence, cloud)
5. Map Reduce / Clustering
6. Interview text analytics

Page 4: Text analytics intro

Text handling

Input not expected?

Page 5: Text analytics intro

Model

Input → M → Output

Page 6: Text analytics intro

Model

Input → M

Page 7: Text analytics intro

Model

Input → M → Output

Stderr: You’re an idiot & I don’t like you anymore

Page 8: Text analytics intro

Input

Page 9: Text analytics intro

Page 10: Text analytics intro

Page 11: Text analytics intro

Page 12: Text analytics intro

Need to map unstructured text to a summary metric

Page 13: Text analytics intro

Sentiment

How are you feeling?

Page 14: Text analytics intro

Let’s make this easy.

Problem statement: Expletives + @skullcandy mention? Good or bad?

Page 15: Text analytics intro

Negative Sentiment

1048940088: "I've got two pairs of Ink'd earbuds by @Skullcandy and they both broke in two weeks. I $#@&ing hate @Skullcandy! #$#@&You"

1054044204: "$#@& only one headphone stopped working stupid $#@&ing headphones y is it only one headphone i blame you @skullcandy"

1376767884: "@skullcandy never buyin another pair of skull candy headphones this is the fourth pair in the last 2 months that $#@&ed up"

141343855: "My headphones blew $#@& you skullcandy -___-"

16352011: "BAHHHHH My SkullCandys are $#@&ing up AGAIN!"

1376767884: "@skullcandy $#@& skullcandy"

Page 16: Text analytics intro

Positive Sentiment

161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"

1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"

1117713458: "@skullcandy $#@&in bass is badass"

1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"

1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"

132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."

Page 17: Text analytics intro

Neutral Sentiment

1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"

Page 18: Text analytics intro

Conclusion

Sentiment classification   Count
Negative                   6
Positive                   6
Neutral                    1

46% chance tweet is negative (6 of 13), now what?

Welcome to the majority of the sentiment solutions on the market:

Single-word naïve Bayesian classification
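To see why single words fall short here, the sketch below implements a very small single-word naive Bayes classifier with add-one smoothing. The training sentences, the placeholder expletive "darn", and the function names are all made up for illustration; this is not the Skullcandy data or any particular vendor's method.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def train(labeled_texts):
    """Count single words and documents per class from (label, text) pairs."""
    word_counts = {}        # label -> Counter of single words
    doc_counts = Counter()  # label -> number of training texts
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return posterior probabilities per class using add-one smoothing."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    # Convert log scores to probabilities that sum to 1.
    top = max(scores.values())
    unnormalized = {k: math.exp(v - top) for k, v in scores.items()}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}

# Hypothetical toy data: the expletive shows up equally in both classes.
training = [
    ("negative", "darn headphones broke"),
    ("negative", "darn hate these"),
    ("positive", "darn awesome bass"),
    ("positive", "darn love these"),
]
word_counts, doc_counts = train(training)
print(classify("darn", word_counts, doc_counts))       # both classes at 0.5
print(classify("darn hate", word_counts, doc_counts))  # leans negative
```

With the expletive alone the classifier cannot decide, which mirrors the roughly even split on the slide; only the neighbouring word tips the balance.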

Page 19: Text analytics intro

Positive Sentiment (second pass)

161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"

1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"

1117713458: "@skullcandy $#@&in bass is badass"

1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"

1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"

132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."

1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"

Page 20: Text analytics intro

Conclusion

Sentiment classification   Count
Negative                   6
Positive                   ~0
Neutral                    ~0

~100% chance tweet is negative with tuple assistance. How to find complex tuples automatically!?

Bayesian bootstrap matrix

[Figure: matrix with "Unique words in training cloud" on both axes]
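As a rough illustration of what "tuple assistance" buys, the sketch below counts adjacent word pairs per class instead of single words. It is only a plain bigram count under assumed toy data (with "darn" as a placeholder expletive); it is not the Bayesian bootstrap matrix named on the slide, whose construction the deck does not spell out.

```python
from collections import Counter

def word_tuples(text, n=2):
    """Return all runs of n adjacent words as tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tuple_counts(labeled_texts, n=2):
    """Count word tuples per class from (label, text) pairs."""
    counts = {}
    for label, text in labeled_texts:
        counts.setdefault(label, Counter()).update(word_tuples(text, n))
    return counts

# Hypothetical toy data, not the tweets from the slides.
training = [
    ("negative", "i darn hate these headphones"),
    ("negative", "darn hate this brand"),
    ("positive", "darn awesome bass"),
    ("positive", "these are darn awesome"),
]
counts = tuple_counts(training)
print(counts["negative"][("darn", "hate")])     # 2
print(counts["positive"][("darn", "hate")])     # 0
print(counts["positive"][("darn", "awesome")])  # 2
```

The single word "darn" is useless as evidence, but the pair ("darn", "hate") only ever appears in the negative class, which is the effect the second-pass conclusion describes.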

Page 21: Text analytics intro

Basic sentiment output

Credit: Ben Peters

Keyword    Negative   Positive
warranty   28.7       1
cant       11.8       1
back       11.8       1
break      11.8       1
after      11.1       1
what       9.1        1
never      9.1        1
Don’t      9.1        1
second     8.4        1
side       8.4        1
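The slide does not say how these scores were computed. One plausible reading is a ratio of smoothed word frequencies between the two classes, scaled so the smaller side is 1. The sketch below follows that assumption; the function name and the toy texts are hypothetical.

```python
from collections import Counter

def keyword_ratios(class_a_texts, class_b_texts):
    """Rank words by how much more frequent they are in one class than the other.

    Returns (word, score_a, score_b) rows where the smaller score is fixed at 1.
    Frequencies use add-one smoothing so unseen words do not divide by zero.
    """
    a = Counter(w for t in class_a_texts for w in t.lower().split())
    b = Counter(w for t in class_b_texts for w in t.lower().split())
    vocab = set(a) | set(b)
    total_a = sum(a.values()) + len(vocab)
    total_b = sum(b.values()) + len(vocab)
    rows = []
    for word in vocab:
        freq_a = (a[word] + 1) / total_a
        freq_b = (b[word] + 1) / total_b
        if freq_a >= freq_b:
            rows.append((word, round(freq_a / freq_b, 1), 1))
        else:
            rows.append((word, 1, round(freq_b / freq_a, 1)))
    # Strongest keywords first, whichever class they point to.
    rows.sort(key=lambda r: max(r[1], r[2]), reverse=True)
    return rows

# Hypothetical toy data.
negative = ["warranty claim denied", "warranty expired", "need warranty"]
positive = ["love the bass", "great bass"]
rows = keyword_ratios(negative, positive)
print(rows[0])  # ('warranty', 3.5, 1)
```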

Page 22: Text analytics intro

SPAM

I can’t handle this

Page 23: Text analytics intro

Lost future customer

Page 24: Text analytics intro

SPAM examples:

>80%

Page 25: Text analytics intro

SPAM list

Keyword           Spam   Good
@nikesb           52.0   1
@lrgskate         52.0   1
live              34.0   1
know              1      28.8
have              1      22.3
pair              1      16.3
earbud            16.1   1
Non-ascii-chars   12.4   1
some              1      11.9
check             1      11.6

Credit: Ben Peters
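The list mixes ordinary words with @mentions and a "Non-ascii-chars" entry, which suggests the tweets were turned into tokens plus a few synthetic flags before scoring. A minimal sketch of that kind of feature extraction follows; the regular expression, the function name, and the flag spelling are assumptions, not the original pipeline.

```python
import re

def spam_features(tweet):
    """Tokenize a tweet, keeping @mentions whole and flagging non-ASCII text."""
    tokens = re.findall(r"@\w+|\w+", tweet.lower())
    if any(ord(ch) > 127 for ch in tweet):
        # Synthetic token so the classifier can score "has non-ASCII characters".
        tokens.append("non-ascii-chars")
    return tokens

print(spam_features("Check out @nikesb ★ live"))
# ['check', 'out', '@nikesb', 'live', 'non-ascii-chars']
print(spam_features("I have a pair"))
# ['i', 'have', 'a', 'pair']
```

The resulting tokens can be fed to the same per-keyword scoring used for sentiment, with "spam" and "good" as the two classes.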

Page 26: Text analytics intro

Training…. Where do you get your training set?

What about @#tags? Misspellings? ?

Page 27: Text analytics intro

Training…. Where do you get your training set?

What about @#tags? Misspellings? ? SPAM?

Page 28: Text analytics intro

Manual trainer: http://54.186.199.209/

Credit: Ben Peters

Page 29: Text analytics intro

Levenshtein

Now things are getting interesting

Page 30: Text analytics intro

The things we take for granted

You type: Awsome

Computer: It’s actually spelled Awesome

① kitten → sitten (substitution of "s" for "k")
② sitten → sittin (substitution of "i" for "e")
③ sittin → sitting (insertion of "g" at the end)
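The kitten → sitting steps can be reproduced with the standard dynamic-programming form of Levenshtein distance. This is a generic textbook sketch, not code from the talk.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Awsome", "Awesome"))  # 1
```

A spell checker can use the same function to suggest the dictionary word with the smallest distance to what was typed.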

Page 31: Text analytics intro

Levenshtein word level

Ref: I am going skiing tomorrow

Hyp: I am going skiing on Saturday

Page 32: Text analytics intro

Levenshtein word-cloud level

Ref:
alphanumeric_sort(word_cloud_1)
alphanumeric_sort(unique(word_cloud_1))

Hyp:
alphanumeric_sort(word_cloud_2)
alphanumeric_sort(unique(word_cloud_2))

>> wer(str1,str1)
ans = 0

>> wer(strjoin(sort(strsplit(str1,' ')),' '),str1)
ans = 15

Page 33: Text analytics intro

MapReduce

Great for text processing, i.e. word counts
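The word-count case is small enough to sketch in a single process. The code below only mimics the map and reduce phases with Python built-ins; a real MapReduce job would distribute them across machines, and the sample lines are invented.

```python
from collections import Counter
from functools import reduce

def map_phase(line):
    """Map: turn one line of text into partial word counts."""
    return Counter(line.lower().split())

def reduce_phase(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right

# Hypothetical input; each line could live on a different worker.
lines = ["the bass is amazing", "the bass broke"]
totals = reduce(reduce_phase, map(map_phase, lines), Counter())
print(totals["the"], totals["bass"], totals["broke"])  # 2 2 1
```

Because merging counts is associative, the partial results can be combined in any order, which is why the problem parallelizes well.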

Page 34: Text analytics intro

CLUSTERING

Now things are getting interesting

Page 35: Text analytics intro

Group of tweets? Once we have categorized tweets we can build word clouds!!!

Category A (could be negative sentiment, low selling areas, etc.)

Category B (could be positive sentiment, high selling areas, etc.)

[Figure: one word cloud per category]

Page 36: Text analytics intro

Levenshtein wordcloud similarity

Page 37: Text analytics intro

Levenshtein wordcloud similarity

Page 38: Text analytics intro

Cluster 1 example

Camping, Virgin, Gaming, Battlefield

Page 39: Text analytics intro

Cluster 2 example

Skiing, winter, Stringray

Page 40: Text analytics intro

Cluster 3 example

MMA, Boxing, Skateboarding

Page 41: Text analytics intro

Twitter Surgery

- =

Page 42: Text analytics intro

Training a blacklist filter

Acting…

Getting…

Holding…

Going…

Brings…

Turning…

Blacklist dictionary
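The slides do not describe how the blacklist was trained. One simple assumption is that words showing up in most word clouds, regardless of cluster, carry no signal and can be subtracted. The sketch below follows that idea; the function names, the 75% cutoff, and the sample clouds are all hypothetical.

```python
from collections import Counter

def train_blacklist(clouds, min_fraction=0.75):
    """Blacklist words that appear in at least min_fraction of the clouds."""
    cloud_freq = Counter()
    for cloud in clouds:
        cloud_freq.update(set(cloud))  # count each word once per cloud
    cutoff = min_fraction * len(clouds)
    return {word for word, n in cloud_freq.items() if n >= cutoff}

def apply_blacklist(cloud, blacklist):
    """Word cloud minus blacklist = the words worth clustering on."""
    return [word for word in cloud if word not in blacklist]

# Hypothetical word clouds, one per user or cluster.
clouds = [
    ["going", "camping", "gaming"],
    ["going", "skiing", "winter"],
    ["going", "getting", "boxing"],
    ["getting", "going", "mma"],
]
blacklist = train_blacklist(clouds)
print(blacklist)                              # {'going'}
print(apply_blacklist(clouds[1], blacklist))  # ['skiing', 'winter']
```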