Machine learning applications on text data
DESCRIPTION
Do you get the feeling of ‘the cart before the horse’ on hearing buzzwords like social data mining or sentiment analysis? Fundamental text mining methods are the real ‘workhorses’ behind these buzzwords. This presentation aims to explain those fundamentals in plain English.
TRANSCRIPT
Using Machine learning and R
Finding Order in the Chaos
Harshad Saykhedkar
Source of text and applications
Emails : Spam detection
Product descriptions / reviews : Sentiment analysis, recommendation
Blogs / informational content : Content recommendations
Web pages / news articles : Topic identification, trending topics
Tweets / comments / social content : Sentiment analysis, named entity recognition
The main idea
(Text mining) is a wonderful world. Let's go exploring!
Itinerary
● R you ready ?
● Prep camp
● The wandering traveller
● The seeker
R you ready ?
Packing our bags : Checks
● Starting R
● Loading required packages
● Check sessionInfo( )
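A minimal sketch of these checks, assuming base R only. The package names in `required` are placeholders, since the slides do not say which packages the session loads.

```r
# Placeholder package names (assumption): replace with what the session needs
required <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is installed and loadable
available <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
available

# sessionInfo() reports the R version, platform and attached packages
info <- sessionInfo()
info$R.version$version.string
```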
Packing our bags : Datatypes
Atomic Vector
Lists
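A small sketch of the two datatypes, using made-up values. The variable names are illustrative only and do not come from the talk.

```r
# Atomic vector: every element has the same type
scores <- c(4.5, 3.0, 5.0)
words  <- c("great", "location", "staff")

# Mixing types coerces everything to the most general type (character here)
mixed <- c(1, "two", TRUE)

# List: elements can have different types and lengths
review <- list(id = 1L,
               text = "Great location, fabulous staff",
               tokens = words)

typeof(scores)      # "double"
typeof(mixed)       # "character"
review$tokens[2]    # "location"
length(review)      # 3
```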
"Let's try our hands"
Packing our bags : Functions
● Expressions which are evaluated
● Can be passed around
● Definitions can be nested
Details not covered : Argument matching, call by value, environments and lexical scoping, promises etc.
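The three bullet points above can be sketched as follows. `count_words` and `make_scaler` are hypothetical helpers written for this illustration.

```r
# A function is an object: it can be assigned to a name and passed around
count_words <- function(text) {
  length(strsplit(text, "\\s+")[[1]])
}

# sapply() receives the function itself as an argument
texts <- c("terrible service", "great location fabulous staff")
word_counts <- sapply(texts, count_words, USE.NAMES = FALSE)

# Definitions can be nested: make_scaler() defines and returns an inner function
make_scaler <- function(k) {
  function(x) x * k
}
double_it <- make_scaler(2)

word_counts     # 2 4
double_it(21)   # 42
```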
Prep Camp
Prep camp : Sentiment Analysis
● Bag of words model
● Simple aggregated score
' terrible service & disorganised '
' OK - some good some bad '
' Great location, fabulous staff '
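A minimal sketch of a bag-of-words score for the three reviews above, in base R. The `positive` and `negative` word lists are toy lexicons made up for this example; the talk does not say which lexicon it used.

```r
reviews <- c("terrible service & disorganised",
             "OK - some good some bad",
             "Great location, fabulous staff")

# Hypothetical toy lexicons; a real analysis would use a published word list
positive <- c("good", "great", "fabulous")
negative <- c("terrible", "disorganised", "bad")

score_review <- function(text) {
  # Bag of words: lowercase, drop punctuation, split on whitespace, ignore order
  cleaned <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  # Simple aggregated score: positive matches minus negative matches
  sum(words %in% positive) - sum(words %in% negative)
}

scores <- sapply(reviews, score_review, USE.NAMES = FALSE)
scores   # -2 0 2
```

Every matched word counts equally here, which is the equal weightage limitation this approach carries.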
Prep camp : Improvements
● Part of speech ambiguity
● Further exploration ?
● Equal weightage model
● Double negations ?
The Wandering Traveller
Wandering traveller : Unsupervised Learning
Can define distance
Entity as point in space
How to derive this model for text ?
[Figure: entities plotted as points; axes: Feature 1, Feature 2]
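Once each entity is a point in feature space, a distance follows directly. A tiny sketch with made-up coordinates, assuming Euclidean distance (other measures are possible):

```r
# Two entities described by two numeric features (made-up values)
entities <- rbind(a = c(feature1 = 1, feature2 = 2),
                  b = c(feature1 = 4, feature2 = 6))

d <- dist(entities)   # Euclidean distance by default
as.numeric(d)         # 5
```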
Wandering traveller : Vector Space Model
[Figure: documents (comments, blogs, tweets) represented against terms (word, phrase, theme)]
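One way to build such a model is a document-term matrix. The sketch below uses base R and three made-up documents; in practice a text mining package would usually do this step.

```r
# Made-up documents for illustration
docs <- c("washing machine with dryer",
          "washer dryer machine",
          "digital camera with zoom")

tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in one document
count_terms <- function(words) {
  vapply(vocab, function(term) sum(words == term), integer(1))
}

# One row per document, one column per term; each cell is a raw count
dtm <- t(sapply(tokens, count_terms))
rownames(dtm) <- c("d1", "d2", "d3")
dtm
```

Each row is now a point in a space with one dimension per term, so distances between documents can be computed.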
Wandering traveller : TfIdf and other details
"But how to measure the importance of a word for a doc ?"
● Binary : Is the 'word' in the 'doc' ?
● Tf : How many times does the 'word' occur in the 'doc' ?
● TfIdf : Penalize the obvious! Words that occur in most docs get a lower weight.
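The three weightings can be sketched on a toy count matrix. The numbers are made up, and `log(N / df)` is only one common idf variant; the talk does not say which formula it used.

```r
# Toy term counts: rows are documents, columns are terms (made-up numbers)
tf <- matrix(c(2, 1, 0,
               1, 0, 3,
               1, 2, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("d1", "d2", "d3"),
                             c("the", "camera", "washer")))

binary <- (tf > 0) * 1              # Binary: is the word in the doc?
df     <- colSums(tf > 0)           # number of docs containing each term
idf    <- log(nrow(tf) / df)        # assumed idf variant: log(N / df)
tfidf  <- sweep(tf, 2, idf, `*`)    # scale each term column by its idf

round(tfidf, 2)
```

"the" occurs in every document, so its idf is log(1) = 0 and its whole column is zeroed out, which is the penalty for the obvious.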
Wandering traveller : Hierarchical Clustering
● Define distance measure
● Keep merging based on similarity
[Figure: example items: Washing Machine, Washer Dryer, Camera]
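A minimal sketch of both steps with base R's `dist()` and `hclust()`. The term counts are invented for the three example products, and Euclidean distance with average linkage is an assumption, not necessarily what the talk used.

```r
# Made-up term counts for three product descriptions
dtm <- matrix(c(2, 1, 0, 0,
                1, 1, 0, 0,
                0, 0, 2, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Washing Machine", "Washer Dryer", "Camera"),
                              c("laundry", "load", "lens", "zoom")))

d  <- dist(dtm)                        # step 1: define a distance measure
hc <- hclust(d, method = "average")    # step 2: keep merging the closest clusters
groups <- cutree(hc, k = 2)            # cut the tree into two clusters
groups
```

Calling `plot(hc)` draws the dendrogram, where the two laundry products merge first and the camera joins last.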
Wandering traveller : Improvements
● Stemming, lemmatization : "Cameras" vs "Camera"
● Latent semantic analysis : "Phone" and "Touch Screen"
The Seeker
Seeker : Supervised Learning
● Labels given with features
● Find rule, classify unobserved case
[Figure: labelled points; axes: Feature 1, Feature 2]
Seeker : Naive Bayes Classifier
● Independence of features
● Train the model on training set
● Test accuracy on a holdout sample
          Predicted 0   Predicted 1
Actual 0  F(0, 0)       F(0, 1)
Actual 1  F(1, 0)       F(1, 1)
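A from-scratch sketch of these steps in base R, assuming a multinomial Naive Bayes with add-one smoothing. The messages and labels are made up, and the holdout set is far too small to say anything about real accuracy; it only shows the mechanics.

```r
# Made-up labelled messages: 1 = spam, 0 = not spam
train_text  <- c("win cash prize now", "cheap prize win", "meeting at noon",
                 "lunch meeting tomorrow", "win cheap cash", "project meeting notes")
train_label <- c(1, 1, 0, 0, 1, 0)
test_text   <- c("win prize cash", "notes for meeting")   # holdout sample
test_label  <- c(1, 0)

tokenize <- function(text) strsplit(tolower(text), "\\s+")[[1]]
vocab <- unique(unlist(lapply(train_text, tokenize)))

# Per-class log P(word | class), with add-one (Laplace) smoothing
word_log_prob <- function(label) {
  words  <- unlist(lapply(train_text[train_label == label], tokenize))
  counts <- vapply(vocab, function(w) sum(words == w), integer(1)) + 1
  log(counts / sum(counts))
}
log_prob  <- list(`0` = word_log_prob(0), `1` = word_log_prob(1))
log_prior <- c(`0` = log(mean(train_label == 0)),
               `1` = log(mean(train_label == 1)))

predict_one <- function(text) {
  words <- tokenize(text)
  words <- words[words %in% vocab]   # ignore words never seen in training
  # Independence assumption: per-word log probabilities simply add up
  scores <- sapply(c("0", "1"), function(cls) {
    log_prior[[cls]] + sum(log_prob[[cls]][words])
  })
  as.numeric(names(which.max(scores)))
}

predicted <- sapply(test_text, predict_one, USE.NAMES = FALSE)

# Confusion matrix in the same layout as the table above
confusion <- table(Actual    = factor(test_label, levels = c(0, 1)),
                   Predicted = factor(predicted,  levels = c(0, 1)))
accuracy <- mean(predicted == test_label)
confusion
```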
Learnings
● How to clean up and preprocess data in text form ?
● How to model the data ?
● How to cluster the data ?
● How to classify the data ?
Questions ?
About me
"Avid R learner, trying to apply a bunch of these techniques to the digital ads world"
Contact [email protected]