Machine Learning at Geeky Base 2

Kan Ouivirach, Geeky Base (2015)

http://www.bigdata-madesimple.com/

TRANSCRIPT

Page 1: Machine Learning at Geeky Base 2

Machine Learning

http://www.bigdata-madesimple.com/

Kan Ouivirach

Geeky Base (2015)

Page 2: Machine Learning at Geeky Base 2

About Me

Research & Development Engineer

www.kanouivirach.com

Kan Ouivirach

Page 3: Machine Learning at Geeky Base 2

Outline

• What is Machine Learning?

• Main Types of Learning

• Model Validation, Selection, and Evaluation

• Applied Machine Learning Process

• Cautions

Page 4: Machine Learning at Geeky Base 2

What is Machine Learning?

Page 5: Machine Learning at Geeky Base 2

–Arthur Samuel (1959)

“Field of study that gives computers the ability to learn without being explicitly programmed.”

Page 6: Machine Learning at Geeky Base 2

–Tom Mitchell (1997)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Page 7: Machine Learning at Geeky Base 2

Statistics vs. Data Mining vs. Machine Learning vs. …?

Page 8: Machine Learning at Geeky Base 2

Programming vs. Machine Learning?

Page 9: Machine Learning at Geeky Base 2

Programming

“Given a specification of a function f, implement f that meets the specification.”

Machine Learning

“Given example (x, y) pairs, induce f such that y = f(x) for given pairs and generalizes well for unseen x.”

–Peter Norvig (2014)

Page 10: Machine Learning at Geeky Base 2

Why is Machine Learning so hard?

http://veronicaforand.com/

Page 11: Machine Learning at Geeky Base 2

http://www.thinkgeek.com/product/f0ba/

What do you see?

11111110 11100101 00001010

While the computer sees this

Page 12: Machine Learning at Geeky Base 2

Machine Learning and Feature Representation

Learning Algorithm

Input

Feature Representation

Page 13: Machine Learning at Geeky Base 2

Dog and Cat?

http://thisvsthatshow.com/

Page 14: Machine Learning at Geeky Base 2

Applications of Machine Learning

• Search Engines

• Medical Diagnosis

• Object Recognition

• Stock Market Analysis

• Credit Card Fraud Detection

• Speech Recognition

• etc.

Page 15: Machine Learning at Geeky Base 2

Recommendation System on Amazon.com

Page 16: Machine Learning at Geeky Base 2

http://www.npr.org/sections/money/2011/11/15/142366953/the-tuesday-podcast-from-harvard-economist-to-casino-ceo

Gary Loveman, Caesars Entertainment Corporation

Page 17: Machine Learning at Geeky Base 2

God’s Eye (Fast & Furious 7)

http://www.standbyformindcontrol.com/2015/04/furious-7-gets-completely-untethered/

Page 18: Machine Learning at Geeky Base 2

PredPol (PREDictive POLicing): predicts the type, place, and time of crime

http://www.predpol.com/

Page 19: Machine Learning at Geeky Base 2

Speech Recognition from Microsoft

Page 20: Machine Learning at Geeky Base 2

Robot Localization

https://github.com/mjl/particle_filter_demo

Page 21: Machine Learning at Geeky Base 2

Machine Learning Tasks

Page 22: Machine Learning at Geeky Base 2

Classification

Regression

Similarity Matching

Clustering

Co-Occurrence Grouping

Profiling

Link Prediction

Data Reduction

Causal Modeling

Page 23: Machine Learning at Geeky Base 2

Main Types of Learning

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

Page 24: Machine Learning at Geeky Base 2

Supervised Learning

y = f(x)

Given x, y pairs, find a function f that will map new x to a proper y.

Page 25: Machine Learning at Geeky Base 2

Supervised Learning Problems

• Regression

• Classification

Page 26: Machine Learning at Geeky Base 2

Regression

Page 27: Machine Learning at Geeky Base 2

Linear Regression

y = wx + b
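A minimal sketch of fitting y = wx + b by ordinary least squares (closed form, single feature). The code and the toy data below are not from the deck; they only illustrate the idea.

```python
# Fit y = w*x + b with the closed-form least-squares solution for one feature.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = covariance(x, y) / variance(x)
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# toy data lying exactly on y = 2x + 1
w, b = fit_line([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
```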

Page 28: Machine Learning at Geeky Base 2

http://thisvsthatshow.com/

Classification

Page 29: Machine Learning at Geeky Base 2

k-Nearest Neighbors

http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/

Page 30: Machine Learning at Geeky Base 2

Perceptron

Processor

Input 0

Input 1

Output

One or more inputs, a processor, and a single output

Page 31: Machine Learning at Geeky Base 2

Perceptron Algorithm

(figure: a perceptron with inputs 12 and 4, weights 0.5 and -1)

(12 x 0.5) + (4 x -1) = 2

sign(2)

+1
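The computation above can be sketched in a few lines of Python, together with the standard perceptron weight-update rule. This is a generic sketch, not code from the deck.

```python
def predict(inputs, weights):
    # weighted sum of inputs, then a sign activation
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else -1

def train_step(inputs, weights, target, lr=0.1):
    # perceptron rule: nudge each weight by (target - prediction) * input
    error = target - predict(inputs, weights)
    return [w + lr * error * x for w, x in zip(weights, inputs)]

# the worked example: (12 * 0.5) + (4 * -1) = 2, and sign(2) = +1
output = predict([12, 4], [0.5, -1])
```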

Page 32: Machine Learning at Geeky Base 2

Perceptron’s Goal

https://datasciencelab.wordpress.com/2014/01/10/machine-learning-classics-the-perceptron/

w0x0 + w1x1

Page 33: Machine Learning at Geeky Base 2

How Perceptron Learning Works

https://datasciencelab.wordpress.com/2014/01/10/machine-learning-classics-the-perceptron/

Page 34: Machine Learning at Geeky Base 2

Let’s implement k-Nearest Neighbors!
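One possible implementation, as a starting point for the exercise. The toy points and labels are made up for illustration; distance is Euclidean and ties in the vote go to the most common of the k nearest labels.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))
    top_labels = [label for _, label in neighbors[:k]]
    # majority vote among the k nearest neighbors
    return Counter(top_labels).most_common(1)[0][0]

points = [((0, 0), "cat"), ((0, 1), "cat"), ((1, 0), "cat"),
          ((5, 5), "dog"), ((5, 6), "dog"), ((6, 5), "dog")]
label = knn_predict(points, (0.5, 0.5))
```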

Page 35: Machine Learning at Geeky Base 2

Probability Theory

https://seisanshi.wordpress.com/tag/probability/

Page 36: Machine Learning at Geeky Base 2

Calculating Conditional Probability

• Probability that I eat bread for breakfast, P(A), is 0.6.

• Probability that I eat steak for lunch, P(B), is 0.5.

• Given I eat steak for lunch, the probability that I eat bread for breakfast, P(A | B), is 0.7.

• What is P(B | A)?

• What about when A and B are independent?
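The quiz above is a direct application of Bayes' rule; a quick check in Python:

```python
p_a = 0.6          # P(A): bread for breakfast
p_b = 0.5          # P(B): steak for lunch
p_a_given_b = 0.7  # P(A | B)

# Bayes' rule: P(B | A) = P(A | B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a  # 0.35 / 0.6, about 0.583

# If A and B were independent, P(A | B) would equal P(A),
# and P(B | A) would simply equal P(B) = 0.5.
```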

Page 37: Machine Learning at Geeky Base 2

A2A1 A3 An

Ck

. . .

P(Ck | A1, …, An) = P(Ck) * P(A1, …, An | Ck) / P(A1, …, An)

With the independence assumption, we then have

P(Ck | A1, …, An) ∝ P(Ck) * P(A1 | Ck) * … * P(An | Ck)

Naive Bayes

Page 38: Machine Learning at Geeky Base 2

Naive Bayes

No. Content Spam?

1 Party Yes

2 Sale Discount Yes

3 Party Sale Discount Yes

4 Python Party No

5 Python Programming No

Page 39: Machine Learning at Geeky Base 2

Naive Bayes

P(Spam | Party, Programming) ∝ P(Spam) * P(Party | Spam) * P(Programming | Spam)

P(NotSpam | Party, Programming) ∝ P(NotSpam) * P(Party | NotSpam) * P(Programming | NotSpam)

We want to find whether “Party Programming” is spam or not.

We need to know

P(Spam), P(NotSpam)

P(Party | Spam), P(Party | NotSpam)

P(Programming | Spam), P(Programming | NotSpam)

Page 40: Machine Learning at Geeky Base 2

Naive Bayes

No. Content Spam?

1 Party Yes

2 Sale Discount Yes

3 Party Sale Discount Yes

4 Python Party No

5 Python Programming No

P(Spam) = ? P(NotSpam) = ?

P(Party | Spam) = ? P(Party | NotSpam) = ?

P(Programming | Spam) = ? P(Programming | NotSpam) = ?

Page 41: Machine Learning at Geeky Base 2

Naive Bayes

No. Content Spam?

1 Party Yes

2 Sale Discount Yes

3 Party Sale Discount Yes

4 Python Party No

5 Python Programming No

P(Spam) = 3/5 P(NotSpam) = 2/5

P(Party | Spam) = 2/3 P(Party | NotSpam) = 1/2

P(Programming | Spam) = 0 P(Programming | NotSpam) = 1/2

Page 42: Machine Learning at Geeky Base 2

Naive Bayes

P(Spam | Party, Programming) ∝ 3/5 * 2/3 * 0 = 0

P(NotSpam | Party, Programming) ∝ 2/5 * 1/2 * 1/2 = 0.1

P(NotSpam | Party, Programming) > P(Spam | Party, Programming)

“Party Programming” is NOT spam.
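The whole worked example fits in a few lines of Python; this sketch recomputes the table's probabilities from the five training messages.

```python
docs = [  # (content, is_spam) from the training table
    ("Party", True),
    ("Sale Discount", True),
    ("Party Sale Discount", True),
    ("Python Party", False),
    ("Python Programming", False),
]

def p_word_given_class(word, spam):
    # fraction of documents in the class that contain the word
    class_docs = [set(text.split()) for text, label in docs if label == spam]
    return sum(word in d for d in class_docs) / len(class_docs)

p_spam = sum(1 for _, label in docs if label) / len(docs)  # 3/5
p_not_spam = 1 - p_spam                                    # 2/5

score_spam = p_spam * p_word_given_class("Party", True) \
    * p_word_given_class("Programming", True)              # 3/5 * 2/3 * 0 = 0
score_not_spam = p_not_spam * p_word_given_class("Party", False) \
    * p_word_given_class("Programming", False)             # 2/5 * 1/2 * 1/2 = 0.1
```

Note the zero: "Programming" never appears in a spam message, so the spam score collapses to 0. In practice Laplace (add-one) smoothing is used to avoid exactly this problem.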

Page 43: Machine Learning at Geeky Base 2

Decision Tree

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes

Day Outlook Temp Humidity Wind Play

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Mild High Strong Yes

D4 Rain Cool Normal Strong No

Play tennis?
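ID3-style decision trees pick the attribute to split on by information gain. A minimal sketch, using only the four example days shown above (not the full PlayTennis dataset) and only the Outlook attribute:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# (Outlook, Play) pairs from the four example days
rows = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "No")]

base = entropy([play for _, play in rows])
weighted = 0.0
for value in {"Sunny", "Overcast", "Rain"}:
    subset = [play for outlook, play in rows if outlook == value]
    weighted += len(subset) / len(rows) * entropy(subset)

gain = base - weighted  # information gain of splitting on Outlook
```

On these four rows every Outlook branch is pure, so the gain equals the base entropy (about 0.81 bits).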

Page 44: Machine Learning at Geeky Base 2

Support Vector Machines

(figure: sample data plotted on x and y axes)

Page 45: Machine Learning at Geeky Base 2

Support Vector Machines

(figure: the same data shown in the current x-y coordinate system and in a new x-z coordinate system)

“Kernel Trick”

Page 46: Machine Learning at Geeky Base 2

Support Vector Machines

http://www.mblondel.org/journal/2010/09/19/support-vector-machines-in-python/

3 support vectors

Page 47: Machine Learning at Geeky Base 2

Unsupervised Learning

f(x)

Given x, find a function f that gives a compact description of x.

Page 48: Machine Learning at Geeky Base 2

Unsupervised Learning

• k-Means Clustering

• Hierarchical Clustering

• Gaussian Mixture Models (GMMs)

Page 49: Machine Learning at Geeky Base 2

k-Means Clustering

http://stackoverflow.com/questions/24645068/k-means-clustering-major-understanding-issue/24645894#24645894

Page 50: Machine Learning at Geeky Base 2

Recommendation

Page 51: Machine Learning at Geeky Base 2

Should I recommend “The Last Witch Hunter” to Roofimon? (User-Based)

The Hunger Games

Warcraft The Beginning

The Good Dinosaur

The Last Witch Hunter

Kan 5 4 1 3

Roofimon 5 4 3 ?

Juacompe 1 3 3

John 4 1

What should the rating be?

Find the most similar user to Roofimon

Page 52: Machine Learning at Geeky Base 2

Should I recommend “The Last Witch Hunter” to Roofimon? (Item-Based)

The Hunger Games

Warcraft The Beginning

The Good Dinosaur

The Last Witch Hunter

Kan 5 4 1 3

Roofimon 5 4 3 ?

Juacompe 1 3 3

John 4 1

Find the most similar item to The Last Witch Hunter

What should the rating be?

Page 53: Machine Learning at Geeky Base 2

Should I recommend “The Last Witch Hunter” to Roofimon? (Matrix Factorization)

The Hunger Games

Warcraft The Beginning

The Good Dinosaur

The Last Witch Hunter

Roofimon 5 4 3 ?

User Scary Kiddy

Roofimon 2 5

Movie Scary Kiddy

TLWH 3/4 1/4

(2 x 3/4) + (5 x 1/4) = 2.75
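The prediction is just the dot product of the user's latent factors and the movie's latent factors. A sketch with the two factor vectors from the slide (the dictionary names are only for illustration):

```python
user_factors = {"Roofimon": [2, 5]}        # (scary, kiddy) preferences
movie_factors = {"TLWH": [3 / 4, 1 / 4]}   # how scary / kiddy the movie is

# predicted rating = dot product of user and movie factor vectors
predicted = sum(
    u * m for u, m in zip(user_factors["Roofimon"], movie_factors["TLWH"])
)
# (2 * 3/4) + (5 * 1/4) = 2.75
```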

Page 54: Machine Learning at Geeky Base 2

Anomaly Detection

http://modernfarmer.com/2013/11/farm-pop-idioms/

Page 55: Machine Learning at Geeky Base 2

http://boxesandarrows.com/designing-screens-using-cores-and-paths/

Page 56: Machine Learning at Geeky Base 2

Let’s try k-Means!

Page 57: Machine Learning at Geeky Base 2

1D k-Means Clustering

• Given these items: {2, 4, 10, 12, 3, 20, 30, 11, 25}

• Given these initial centroids: m1 = 2 and m2 = 4

• Find me the final clusters!

(flowchart: Initialize → Assign → Update Centroids → Converged?; if yes, Done; if no, back to Assign)
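A sketch that works through the exercise. One assumption to flag: the point 3 is equidistant from the initial centroids 2 and 4, and this sketch breaks ties toward the first centroid.

```python
def kmeans_1d(points, centroids, max_iters=100):
    for _ in range(max_iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to its cluster's mean
        new_centroids = [sum(c) / len(c) if c else m
                         for c, m in zip(clusters, centroids)]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [2, 4])
```

With these items and starting centroids, the algorithm converges to the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}, with centroids 7 and 25.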

Page 58: Machine Learning at Geeky Base 2

Recap: Supervised vs. Unsupervised?

Page 59: Machine Learning at Geeky Base 2

Reinforcement Learning

y = f(x)

Given x and z, find a function f that generates y.

z

Page 60: Machine Learning at Geeky Base 2

Flappy Bird Hack using Reinforcement Learning

http://sarvagyavaish.github.io/FlappyBirdRL/

Page 61: Machine Learning at Geeky Base 2
Page 62: Machine Learning at Geeky Base 2

Model Validation

Page 63: Machine Learning at Geeky Base 2

I’ve got a perfect classifier!

https://500px.com/photo/65907417/like-a-frog-trapped-inside-a-coconut-shell-by-ellena-susanti

Page 64: Machine Learning at Geeky Base 2

http://blog.csdn.net/love_tea_cat/article/details/25972921

Overfitting (High Variance)

Normal fit Overfitting

Page 65: Machine Learning at Geeky Base 2

http://blog.csdn.net/love_tea_cat/article/details/25972921

Underfitting (High Bias)

Normal fit Underfitting

Page 66: Machine Learning at Geeky Base 2

How to Avoid Overfitting and Underfitting

• Using more data does NOT always help.

• It is recommended to:

• find a good number of features;

• perform cross validation;

• use regularization when overfitting is found.

Page 67: Machine Learning at Geeky Base 2

Model Selection

Page 68: Machine Learning at Geeky Base 2

Model Selection

• Use cross validation to find the best parameters for the model.

Page 69: Machine Learning at Geeky Base 2

Model Evaluation

Page 70: Machine Learning at Geeky Base 2

Metrics

• Accuracy

• True Positive, False Positive, True Negative, False Negative

• Precision and Recall

• F1 Score

• etc.

Page 71: Machine Learning at Geeky Base 2

Let’s evaluate this Giving Cats system!

Page 72: Machine Learning at Geeky Base 2

Give me cats!

3 True Positives

1 False Positive

2 False Negatives

4 True Negatives

System

User

Page 73: Machine Learning at Geeky Base 2

Precision and Recall

http://en.wikipedia.org/wiki/Precision_and_recall

Page 74: Machine Learning at Geeky Base 2

False Positive or False Negative?

Page 75: Machine Learning at Geeky Base 2

Metrics Summary

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Page 76: Machine Learning at Geeky Base 2

Applied Machine Learning Process

http://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

Page 77: Machine Learning at Geeky Base 2

Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000)

https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

Page 78: Machine Learning at Geeky Base 2

Define the Problem

https://youmustdesireit.wordpress.com/2014/03/05/developing-and-nurturing-creative-problem-solving/

Page 79: Machine Learning at Geeky Base 2

Prepare Data

http://vpnexpress.net/big-data-use-a-vpn-block-data-collection/

Page 80: Machine Learning at Geeky Base 2

Spot Check Algorithms

https://www.flickr.com/photos/withassociates/4385364607/sizes/l/

Page 81: Machine Learning at Geeky Base 2

If two models fit the data equally well, choose the simpler one.

Page 82: Machine Learning at Geeky Base 2

Improve Results

http://www.mobilemechanicprosaustin.com/

Page 83: Machine Learning at Geeky Base 2

Present Results

http://www.langevin.com/blog/2013/04/25/5-tips-for-projecting-confidence/presentation-skills-2/

Page 84: Machine Learning at Geeky Base 2

http://newventurist.com/

• Curse of dimensionality

• Correlation does NOT imply causation.

• Learn many models, not just ONE.

• More data beats a clever algorithm.

• Data alone are not enough.

A Few Useful Things to Know about Machine Learning, Pedro Domingos (2012)

Some Cautions

Page 85: Machine Learning at Geeky Base 2

–John G. Richardson

“Learning Best Through Experience”

https://studio.azureml.net/

Page 86: Machine Learning at Geeky Base 2

Machine Learning and Feature Representation

Learning Algorithm

Input

— Feature engineering is the key. —

Feature Representation

Page 87: Machine Learning at Geeky Base 2

Garbage In - Garbage Out

http://blog.marksgroup.net/2013/05/zoho-crm-garbage-in-garbage-out-its.html

Page 88: Machine Learning at Geeky Base 2

Example of Feature Engineering

Width (m) Length (m) Cost (baht)

100 100 1,200,000

500 50 1,300,000

100 80 1,000,000

400 100 1,500,000

Are the data in a good form to model the plot’s cost?

Size (m²) Cost (baht)

10,000 1,200,000

25,000 1,300,000

8,000 1,000,000

40,000 1,500,000

Engineer features.

They look better here.
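The engineered feature is simply width times length; a one-line transformation in Python (the list of tuples is just the slide's table transcribed):

```python
plots = [  # (width_m, length_m, cost_baht)
    (100, 100, 1_200_000),
    (500, 50, 1_300_000),
    (100, 80, 1_000_000),
    (400, 100, 1_500_000),
]

# engineer a single "size" feature from width and length
engineered = [(width * length, cost) for width, length, cost in plots]
```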

Page 89: Machine Learning at Geeky Base 2

Can we do better?

Page 90: Machine Learning at Geeky Base 2

Deep Learning at Microsoft’s Speech Group

Page 91: Machine Learning at Geeky Base 2

Recommended Books

Page 92: Machine Learning at Geeky Base 2

http://www.barnstable.k12.ma.us/domain/210

Page 93: Machine Learning at Geeky Base 2
Page 94: Machine Learning at Geeky Base 2

https://github.com/zkan/intro-to-machine-learning