workshop on machine learning

Machine Learning what, why and how

Upload: harshad-saykhedkar

Post on 02-Dec-2014


DESCRIPTION

This is a presentation from an introductory machine learning workshop that I conducted. It was held as part of the run-up to Fifth Elephant 2014, a conference on big data and analytics.

TRANSCRIPT

Page 1: Workshop on Machine Learning

Machine Learning: what, why and how

Page 2: Workshop on Machine Learning

me ...

Harshad
➔ Senior Data Scientist @ Sokrati
➔ Spent last 4 years trying to understand and apply machine learning

Sokrati is a digital advertising startup based out of Pune

Page 3: Workshop on Machine Learning

what are we going to do ?

Page 4: Workshop on Machine Learning

get a 10,000-foot view ...

Page 5: Workshop on Machine Learning

then go to specifics ...

Page 6: Workshop on Machine Learning

10,000-foot view

Page 7: Workshop on Machine Learning

what is ML ?

➔ Too many definitions!
➔ Too much debate
➔ Analytics vs ML vs data mining vs Big Data vs Statistics vs next buzzword in the market

Page 8: Workshop on Machine Learning

what is ML ?

Teaching machines to take decisions with the help of data

practical man’s definition of machine learning!

Page 9: Workshop on Machine Learning

bit of history ...

Page 10: Workshop on Machine Learning

That’s too ancient!

That’s not ML...

Page 11: Workshop on Machine Learning

bit of (relevant) history ...

➔ Insurance, Banking industry
  ◆ Credit scoring
  ◆ mathematical models in finance
➔ Artificial Intelligence and other fancy ideas
  ◆ Deep Blue, Samuel’s checkers machine
  ◆ IBM Watson computer

Page 12: Workshop on Machine Learning

Find ‘f’ => endeavour to understand the world!

g( y ) = f( X )

Page 13: Workshop on Machine Learning

the two cultures in ML

Stats Culture

Vs

AI / ML Culture

Page 14: Workshop on Machine Learning

the two cultures in ML

Stats Culture
➔ Focus on ‘why this model’
➔ Goodness of fit, hypothesis tests, residuals
➔ or MCMC methods, Bayesian modeling
➔ regression models, survival analysis

AI / ML Culture
➔ Focus on ‘good predictions’
➔ Cross validations, ensemble of models
➔ Focus on underfit vs overfit analysis
➔ neural nets, tree based models (random forests et al.)

Page 15: Workshop on Machine Learning

but let’s build bridges

Stats
➔ Focus on basics, sound theory
➔ Exploration, summaries
➔ Models

ML / AI
➔ Focus on predictions
➔ Model evaluation
➔ Feature selection

Computations
➔ Focus on application
➔ Achieving scale and usability
➔ Hadoop, Storm and friends..

Business Knowledge
➔ Focus on interpretation
➔ Visualizations
➔ Creating stories out of data

Page 16: Workshop on Machine Learning

typical ML process

➔ Objective
➔ Source data
➔ Explore
➔ Model
➔ Evaluate
➔ Apply
➔ Validate

Page 17: Workshop on Machine Learning

objective
in brief!

Page 18: Workshop on Machine Learning

bottom line : not in terms of algorithm but outcome

probability of customer churn

group set of emails by topic

predict rainfall

recommend item to a consumer to increase likelihood of click

Page 19: Workshop on Machine Learning

sourcing data
to be covered at end!

Page 20: Workshop on Machine Learning

explore
R you ready ?

Page 21: Workshop on Machine Learning

introduction to R

➔ Starting R
➔ Data Structures
  ◆ Atomic Vectors, matrix
  ◆ Lists
  ◆ Data frames
➔ Data types
  ◆ usual suspects (integer, double, character)
  ◆ Factors
  ◆ logical

Page 22: Workshop on Machine Learning

data frame, the workhorse

➔ Load sample data frame
➔ Explore data frame (head, tail)
➔ Access elements by index
  ◆ access rows
  ◆ access columns
    ● single
    ● multiple (by name, by index, by negative index)
➔ Find metadata
  ◆ names
  ◆ dimensions
➔ Explore using plot (pairs)

Page 23: Workshop on Machine Learning

broadcasting / vectorization

➔ Very important concept
➔ Subsetting
  ◆ vectors
  ◆ data frames
➔ Applying operations
  ◆ operations on entire column

Page 24: Workshop on Machine Learning

data exploration

➔ summarizing using mean
➔ quantiles, when mean is not enough
  ◆ outlier detection
➔ functional roots of R : sapply summary
➔ summary function
  ◆ on numerical
  ◆ on factors
➔ plots (basic)
➔ histograms
➔ correlations
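
The workshop did this exploration live in R. As a rough stand-in, here is a minimal Python sketch (standard library only) of why quantiles help when the mean is not enough. The sample values are made up for illustration, and the 1.5 * IQR cutoff is a common rule of thumb, not something the slides specify.

```python
import statistics

# Hypothetical sample: mostly small values plus one extreme point.
values = [10, 12, 11, 13, 12, 11, 95]

mean = statistics.mean(values)      # pulled upward by the extreme point
median = statistics.median(values)  # barely affected by it

# n=4 returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Rule of thumb (assumption): flag points more than 1.5 * IQR beyond the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

print("mean:", round(mean, 2), "median:", median)
print("quartiles:", q1, q2, q3)
print("outliers:", outliers)
```

The mean lands far from where most of the data sits, while the median and quartiles still describe the bulk of the values.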

Page 25: Workshop on Machine Learning

models
simple & real world

Page 26: Workshop on Machine Learning

basic types of models

➔ Supervised learning
➔ Unsupervised learning
➔ Semi-supervised learning
➔ Reinforcement learning

Page 27: Workshop on Machine Learning

linear regression

➔ load sample dataset (cars)
➔ build linear regression model
➔ understand the output
  ◆ summary
  ◆ plots
➔ understanding train vs predict cycle
  ◆ most important idea!
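
The train vs predict cycle can be sketched without any library. The workshop used R and the cars dataset; the Python below is only an illustration with made-up numbers, and the helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Train step: ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, xs):
    """Predict step: apply the fitted coefficients to inputs the model has not seen."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Hypothetical training data (not the cars dataset).
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]

model = fit_line(train_x, train_y)     # train once on known (x, y) pairs
predictions = predict(model, [6.0, 7.0])  # then predict for new x values

print("intercept, slope:", model)
print("predictions:", predictions)
```

The fitted model is just a pair of numbers. Training produces them from labelled data, and prediction reuses them on new inputs.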

Page 28: Workshop on Machine Learning

demo on real world dataset

data exploration, classification

Page 29: Workshop on Machine Learning

hierarchical clustering

➔ basic idea of clustering
  ◆ distance as a proxy for similarity, group by distance
  ◆ group anything as long as distance can be calculated
➔ load and explore eurodist data
➔ fit hierarchical cluster
➔ plot dendrogram
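
To make "group by distance" concrete, here is a small Python sketch of agglomerative clustering driven purely by a table of pairwise distances. It is not the workshop's R code: the four items and their distances are invented, and single linkage (distance between clusters = closest pair) is an assumption chosen for brevity. The list of merges it returns is the information a dendrogram draws.

```python
def single_linkage(names, dist):
    """Repeatedly merge the two closest clusters; return the merges in order.

    dist maps frozenset({a, b}) to the distance between items a and b.
    """
    clusters = [frozenset([name]) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical pairwise distances between four items.
raw = {("A", "B"): 1, ("A", "C"): 5, ("A", "D"): 6,
       ("B", "C"): 4, ("B", "D"): 7, ("C", "D"): 2}
dist = {frozenset(pair): d for pair, d in raw.items()}

merges = single_linkage(["A", "B", "C", "D"], dist)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Only the distance table is needed, which is why anything with a computable distance can be clustered this way.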

Page 30: Workshop on Machine Learning

demo on real world data
and the most important idea!

Page 31: Workshop on Machine Learning

concept of vector space model

➔ Words as axes
➔ Bag of words defines vector space
➔ Document as a point in space
➔ We can
  ◆ define distance
  ◆ measure similarity (cosine similarity)
  ◆ group documents
➔ what can be a document ?
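
A minimal Python sketch of the idea, assuming whitespace tokenization and raw word counts (real pipelines usually add stop-word removal and weighting). The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as word counts: one axis per distinct word."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
doc1 = bag_of_words("cheap flights to pune")
doc2 = bag_of_words("cheap flights to mumbai")
doc3 = bag_of_words("learn machine learning")

print(cosine_similarity(doc1, doc2))  # share three of four words
print(cosine_similarity(doc1, doc3))  # share no words
```

Anything that can be turned into counts over a vocabulary (an email, a search query, a user's click history) can play the role of a document.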

Page 32: Workshop on Machine Learning

evaluation

Page 33: Workshop on Machine Learning

evaluation metrics

➔ depends on type of model
  ◆ regression : MAPE, MSSE
  ◆ classification : accuracy, precision, recall, F score
  ◆ clustering : within vs between variance
➔ ML world (ref : two cultures) has a much better story
➔ Not enough to perform well on training set
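
Some of these metrics are short enough to write by hand. The Python sketch below uses hypothetical helper names and toy labels. It computes precision, recall and the F score (taken here as F1, the harmonic mean of the two) for a binary classifier, plus MAPE for a regression. MAPE as written assumes no actual value is zero.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is 0."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Toy labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))

# Toy regression targets and predictions.
print(mape([100.0, 200.0], [110.0, 180.0]))
```

To respect the last bullet above, these numbers should be computed on data the model was not trained on.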

Page 34: Workshop on Machine Learning

brief intro to regularization

➔ Bias vs Variance problem
➔ We want to be ‘just right’
➔ Concept of regularization
➔ Intro Cross validation
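
Cross validation rests on one mechanical step: split the data into k folds so that every sample is held out exactly once. Here is a Python sketch of that step only, under the assumption of contiguous, unshuffled folds. The name `k_fold_indices` is hypothetical, and in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model is then trained on each `train_idx` and scored on the matching `test_idx`. Averaging the k scores estimates how well it will do on unseen data, which is what helps pick a model that is neither underfit nor overfit.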

Page 35: Workshop on Machine Learning

demo of evaluation
and fantastic Scikits API

Page 36: Workshop on Machine Learning

sourcing and applying
and the great ML divide

Page 37: Workshop on Machine Learning

the great ML divide

Lab Culture
➔ Theory
➔ Small Datasets
➔ In memory
➔ Not live
➔ R, Octave, Python..

Source and Apply
➔ The practice
➔ Huge datasets
➔ Live in production
➔ Hadoop and friends, Python ? , R ?

Page 38: Workshop on Machine Learning

processing data at scale

➔ Data is not available in final form
➔ Non standard data
  ◆ click streams
  ◆ event logs
  ◆ free form text
➔ Process at scale
➔ Transform, clean, group in final form
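
The transform, clean and group steps can be shown on a single machine. The Python sketch below is purely illustrative: the log format, field names and sample lines are all invented. A framework such as Hadoop applies the same parse-then-group-then-count pattern across many machines.

```python
from collections import defaultdict

# Hypothetical raw event log lines, including one that does not parse.
raw_events = [
    "2014-07-01T10:00:00 user=42 action=click",
    "  2014-07-01T10:00:05 user=42 action=view ",
    "malformed line",
    "2014-07-01T10:01:00 user=7 action=click",
]


def parse(line):
    """Transform + clean: return (user, action), or None if the line is unusable."""
    parts = line.split()
    if len(parts) != 3:
        return None
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    if "user" not in fields or "action" not in fields:
        return None
    return fields["user"], fields["action"]


# Group: count events per (user, action) key.
counts = defaultdict(int)
for line in raw_events:
    record = parse(line)
    if record is not None:
        counts[record] += 1

for key, n in sorted(counts.items()):
    print(key, n)
```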

Page 39: Workshop on Machine Learning

5 min intro to Hadoop Ecosystem

➔ Write in assembly
  ◆ Java
➔ DSLs
  ◆ Pig
  ◆ Hive
  ◆ Impala
➔ Functional Languages to rescue!

Page 40: Workshop on Machine Learning

Introduction to Cascalog
processing data at scale

Page 41: Workshop on Machine Learning

Recap

➔ Pragmatic ML
➔ Key phases
➔ Supervised learning
➔ Unsupervised learning
➔ Evaluation
➔ Application issues
➔ Processing data @ scale

Page 42: Workshop on Machine Learning

Let’s discuss...