Machine Learning Applications on Text Data

Using Machine Learning and R

Finding Order in the Chaos

Harshad Saykhedkar

Source of text and applications

Emails : Spam detection

Product descriptions / reviews : Sentiment analysis, recommendation

Blogs / informational content : Content recommendations

Web pages / news articles : Topic identification, trending topics

Tweets / comments / social content : Sentiment analysis, named entity recognition

(Text mining) is a wonderful world. Let's go exploring...!

The main idea

Itinerary

● R you ready ?

● Prep camp

● The wandering traveller

● The seeker

R you ready ?

Packing our bags : Checks

● Starting R

● Loading required packages

● Check sessionInfo( )
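These checks can be sketched as follows. The slides do not name the required packages, so `stats` and `utils` (both ship with R) are used here purely as stand-ins.

```r
# Packages the session needs; "stats" and "utils" are placeholders
# for whatever the actual analysis requires.
required <- c("stats", "utils")

for (pkg in required) {
  # Fail loudly if a package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package not installed: ", pkg)
  }
  library(pkg, character.only = TRUE)
}

# Inspect the R version and the attached base packages.
info <- sessionInfo()
print(info$R.version$version.string)
print(info$basePkgs)
```

Printing the whole `info` object also shows the platform and any non-base packages that are attached.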

Packing our bags : Datatypes

Atomic Vector

Lists

"Let's try our hands"

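A small hands-on sketch of the two datatypes. All values are made up for illustration.

```r
# Atomic vector: every element has the same type.
scores <- c(1, -1, 0)
words  <- c("great", "terrible", "ok")

# Mixing types coerces everything to the most general type (character here).
mixed <- c(1, "a", TRUE)
class(mixed)          # "character"

# Operations are vectorised: they apply element by element.
scores * 2            # 2 -2 0

# List: elements may differ in type and length.
review <- list(text  = "Great location, fabulous staff",
               stars = 5,
               tags  = c("hotel", "staff"))
review$stars          # 5
review[["tags"]][1]   # "hotel"
length(review)        # 3
```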
Packing our bags : Functions

● Expressions which are evaluated

● Can be passed around

● Definitions can be nested

Details not covered : Argument matching, call by value, environments and lexical scoping, promises, etc.
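The three points about functions can be illustrated with a short sketch. The function names are hypothetical and chosen only for this example.

```r
# A function body is an expression evaluated on call; its last value is returned.
square <- function(x) x^2

# Functions can be passed around like any other value.
apply_twice <- function(f, x) f(f(x))
apply_twice(square, 3)        # 81

# Definitions can be nested: make_scaler() defines and returns an inner function.
make_scaler <- function(k) {
  scale_by_k <- function(x) k * x
  scale_by_k
}
twice <- make_scaler(2)
twice(c(1, 2, 3))             # 2 4 6

# Anonymous functions are common with sapply().
sapply(c("spam", "ham"), function(w) nchar(w))
```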

Prep Camp

Prep camp : Sentiment Analysis

● Bag of words model

● Simple aggregated score

' terrible service & disorganised '

' OK - some good some bad '

' Great location, fabulous staff '
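A minimal sketch of a bag of words score for the three reviews above. The two word lists are tiny hand-made lexicons assumed for illustration, not a real sentiment dictionary, and `tokenize` and `sentiment_score` are hypothetical helper names.

```r
# Hand-made lexicons (assumed for this example only).
positive <- c("good", "great", "fabulous")
negative <- c("bad", "terrible", "disorganised")

# Bag of words: lower-case, strip punctuation, split on spaces, ignore word order.
tokenize <- function(text) {
  clean <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(clean, " +")[[1]]
  words[words != ""]
}

# Aggregated score: (number of positive words) - (number of negative words).
sentiment_score <- function(text) {
  words <- tokenize(text)
  sum(words %in% positive) - sum(words %in% negative)
}

reviews <- c(" terrible service & disorganised ",
             " OK - some good some bad ",
             " Great location, fabulous staff ")
scores <- sapply(reviews, sentiment_score, USE.NAMES = FALSE)
scores   # -2 0 2
```

Every lexicon word counts equally here, which is the simplest possible weighting.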

Prep camp : Improvements

● Part of speech ambiguity

● Further exploration ?

● Equal weightage model

● Double negations ?

The Wandering Traveller

Wandering traveller : Unsupervised Learning

Can define distance

Entity as point in space

How to derive this model for text ?

[Figure: entities plotted as points, with axes Feature 1 and Feature 2]
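Once entities are points in a feature space, distances between them follow directly. A minimal sketch with made-up coordinates, using base R's `dist()`:

```r
# Each row is an entity, each column a feature (values are made up).
points <- rbind(a = c(1, 1),
                b = c(2, 1),
                c = c(8, 9))
colnames(points) <- c("feature1", "feature2")

# Euclidean distance between every pair of points.
d <- dist(points, method = "euclidean")
as.matrix(d)
```

Points `a` and `b` come out close together and `c` far from both, which is the kind of structure clustering relies on.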

Wandering traveller : Vector Space Model

[Figure: matrix relating documents (Comments, Blogs, Tweets) to features (Word, Phrase, Theme)]
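One way to sketch the vector space model in base R is a document-term matrix built by hand. The three documents are invented for illustration; in practice a text mining package would normally do this step.

```r
# Toy documents (made up for this example).
docs <- c(d1 = "camera with touch screen",
          d2 = "phone with touch screen",
          d3 = "washing machine")

# Split each document into lower-case words.
tokens <- strsplit(tolower(docs), " +")
names(tokens) <- names(docs)

# The vocabulary is the sorted set of distinct words.
vocab <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are raw counts.
dtm <- t(sapply(tokens, function(w) as.vector(table(factor(w, levels = vocab)))))
colnames(dtm) <- vocab
dtm
```

Each row is now a point in a space with one dimension per word, so the distance idea from unsupervised learning carries over to text.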

Wandering traveller : TfIdf and other details

" But how do we measure the importance of a word for a doc ? "

● Binary : Is the 'word' in the 'doc' ?

● Tf : How many times does the 'word' occur in the 'doc' ?

● TfIdf : Penalize the obvious!
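The three weightings can be compared on a toy count matrix. TfIdf has several variants; this sketch assumes the plain form tf × log(N / df), where N is the number of documents and df the number of documents containing the word. The counts are made up.

```r
# Term counts for three toy documents (rows) over four terms (columns).
counts <- rbind(d1 = c(2, 1, 0, 1),
                d2 = c(0, 1, 1, 1),
                d3 = c(0, 0, 3, 1))
colnames(counts) <- c("camera", "screen", "washer", "the")

binary <- (counts > 0) * 1          # Binary: is the word in the doc?
tf     <- counts                    # Tf: how often the word occurs in the doc
df     <- colSums(counts > 0)       # number of docs containing the word
idf    <- log(nrow(counts) / df)    # 0 for a word that appears in every doc
tfidf  <- sweep(tf, 2, idf, `*`)    # scale each term column by its idf

round(tfidf, 3)
```

The word "the" occurs in every document, so its TfIdf weight is zero everywhere: the obvious is penalized.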

Wandering traveller : Hierarchical Clustering

● Define distance measure

● Keep Merging based on similarity

Washing Machine

Washer Dryer

Camera
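The two steps can be sketched with base R's `dist()` and `hclust()` on the three products above. The term counts and term names are made up, and Euclidean distance with average linkage is just one reasonable choice among several.

```r
# Toy term counts for three product descriptions (values are made up).
dtm <- rbind("Washing Machine" = c(1, 1, 0, 0, 1),
             "Washer Dryer"    = c(1, 0, 1, 0, 1),
             "Camera"          = c(0, 0, 0, 1, 0))
colnames(dtm) <- c("wash", "machine", "dryer", "lens", "laundry")

# Step 1: define a distance measure.
d <- dist(dtm, method = "euclidean")

# Step 2: keep merging the closest clusters (average linkage).
hc <- hclust(d, method = "average")

# Cut the tree into two groups; plot(hc) would draw the dendrogram.
groups <- cutree(hc, k = 2)
groups
```

With these counts the two laundry products merge first and the camera stays in its own group.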

Wandering traveller : Improvements

● Stemming, lemmatization

● Latent semantic analysis

"Cameras" Vs "Camera"

"Phone" "Touch Screen"

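Both improvements can be sketched in base R. `naive_stem` is a hypothetical helper that only strips a trailing "s"; it is a crude stand-in for a real stemmer and will mangle words such as "glass". The latent semantic analysis part uses a truncated SVD of a made-up term-document matrix.

```r
# Crude plural stripping as a stand-in for a real stemmer (hypothetical helper).
naive_stem <- function(words) sub("s$", "", tolower(words))
naive_stem(c("Cameras", "Camera"))     # "camera" "camera"

# LSA sketch: truncated SVD of a term-document matrix (made-up counts).
tdm <- rbind(phone  = c(1, 1, 0, 0),
             touch  = c(1, 0, 1, 0),
             screen = c(1, 0, 1, 0),
             washer = c(0, 0, 0, 1))
colnames(tdm) <- paste0("d", 1:4)

s <- svd(tdm)
k <- 2                                        # latent dimensions kept (assumed)
term_space <- s$u[, 1:k] %*% diag(s$d[1:k])   # term coordinates in latent space
rownames(term_space) <- rownames(tdm)
term_space
```

Terms that occur in the same documents, such as "touch" and "screen" here, get the same coordinates in the latent space.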
The Seeker

Seeker : Supervised Learning

● Labels given with features

● Find rule, classify unobserved case

[Figure: plot with axes Feature 1 and Feature 2]

Seeker : Naive Bayes Classifier

● Independence of features

● Train the model on training set

● Test accuracy on a holdout sample

           Predicted 0   Predicted 1
Actual 0   F(0, 0)       F(0, 1)
Actual 1   F(1, 0)       F(1, 1)
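A minimal base R sketch of the train / holdout workflow, assuming a multinomial Naive Bayes with add-one smoothing. The messages and labels are made up, and `train_nb` and `predict_nb` are hypothetical helpers rather than functions from any package.

```r
# Toy labelled messages (made up): 1 = spam, 0 = not spam.
train_text  <- c("win money now", "free money offer", "meeting at noon",
                 "lunch at noon", "win free prize", "project meeting notes")
train_label <- c(1, 1, 0, 0, 1, 0)
test_text   <- c("free money", "noon meeting", "win prize now", "project lunch")
test_label  <- c(1, 0, 1, 0)

tokenize <- function(x) strsplit(tolower(x), " +")
vocab <- sort(unique(unlist(tokenize(train_text))))

# Train: class priors and per-class word probabilities with add-one smoothing.
train_nb <- function(text, label) {
  lapply(split(text, label), function(docs) {
    counts <- table(factor(unlist(tokenize(docs)), levels = vocab))
    list(log_prior = log(length(docs) / length(text)),
         log_lik   = log((counts + 1) / (sum(counts) + length(vocab))))
  })
}

# Predict: under the independence assumption, per-word log-probabilities add up.
predict_nb <- function(model, text) {
  sapply(tokenize(text), function(words) {
    words  <- words[words %in% vocab]          # ignore words unseen in training
    scores <- sapply(model, function(m) m$log_prior + sum(m$log_lik[words]))
    as.numeric(names(scores)[which.max(scores)])
  })
}

model <- train_nb(train_text, train_label)
pred  <- predict_nb(model, test_text)

# Confusion matrix on the holdout sample, and overall accuracy.
confusion <- table(Actual = test_label, Predicted = pred)
confusion
accuracy <- mean(pred == test_label)
```

The toy data is trivially separable, so the holdout accuracy here says nothing about real-world performance. The cells of `confusion` correspond to F(i, j) in the table above.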

Learnings

● How to clean up and preprocess data in text form ?

● How to model the data ?

● How to cluster the data ?

● How to classify the data ?


Questions ?

"Avid R learner, trying to apply a bunch of these techniques to the digital ads world"

Contact [email protected]

About me
