Data Science in Industry - Applying Machine Learning to Real-world Challenges

Upload: yuchen-zhao

Post on 15-Jul-2015

Category: Data & Analytics


TRANSCRIPT

Page 1: Data Science in Industry - Applying Machine Learning to Real-world Challenges

Data Science in Industry - Applying Machine Learning to Real-world Challenges

About me - Yuchen Zhao

principal data scientist at

obtained Ph.D. in data mining and machine learning

worked in both academia and industry

Not just a researcher, but a coder & hacker

What is data science?

data is everywhere...

data science helps extract knowledge from data...

Data scientists investigate complex data problems

find and interpret rich data sources

visualize the data

get insights from data

from insights...

Questions?

Now comes the fun part...

Data science techniques!

Data science 101

● regression
● classification
● clustering
● ranking (not covered in this lecture)
● recommendation (not covered in this lecture)

Regression

What is regression?

A slightly more formal definition...

Regression models a functional relationship between an input variable x and a response variable y

[Figure: plot of y against x]

find the equation that relates x to y
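As an illustration of "finding the equation", here is a minimal sketch that fits a straight line y = slope * x + intercept by ordinary least squares. The slides do not say which regression model was used, so the linear form, the function name `fit_line`, and the toy numbers are all assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data scattered around the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Once the slope and intercept are known, predicting y for a new x is just a matter of plugging x into the fitted equation.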

What else can regression do?

Predicting who may change jobs!

Recap - regression

[Figure: plot of y against x]

Classification

identify to which of a set of categories a new data point belongs

Spam or not spam?

Approve credit or not?

Optical character recognition

Document classification

SVM

Decision tree
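To make the decision-tree idea concrete, here is a rough sketch of the smallest possible tree: a single split on one numeric feature, often called a decision stump. It is not the model from the talk; the feature (number of links in an email), the function names, and the data are hypothetical and chosen only to echo the spam example.

```python
def train_stump(values, labels):
    """Learn a one-level decision tree (a "stump") on a single numeric feature.

    Returns (threshold, label_if_above, label_otherwise), chosen to
    minimize the number of training errors.
    """
    classes = sorted(set(labels))
    best = None
    for threshold in sorted(set(values)):
        for above in classes:
            for below in classes:
                errors = sum(
                    (above if v > threshold else below) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, above, below)
    _, threshold, above, below = best
    return threshold, above, below


def predict(stump, value):
    """Classify a new data point with a trained stump."""
    threshold, above, below = stump
    return above if value > threshold else below


# Hypothetical training data: links per email and a hand-assigned label.
link_counts = [0, 1, 1, 2, 6, 7, 9, 12]
labels = ["not spam"] * 4 + ["spam"] * 4

stump = train_stump(link_counts, labels)
print(stump)
print(predict(stump, 8), "/", predict(stump, 0))
```

A full decision tree repeats this kind of split recursively on each side, and usually picks splits with a purity measure rather than raw error counts.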

Use classification to find who you are in social networks

classification

Missing data

Outdated data

Non-standard data

Why do we want to classify?

Understanding users’ social roles is crucial to many social network applications, including targeted advertising, marketing, personalization, and recommendation.

Finding out who you really are...

Manual labeling is time-consuming and error-prone

Human learning

Machine learning

SVM

Decision tree

How accurate can we get?
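One common way to put a number on this question is accuracy: the fraction of held-out examples whose predicted label matches the true label. The sketch below assumes that metric; the slides do not state which measure was reported, and the role labels are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical held-out examples: true social roles vs. a classifier's guesses.
actual = ["engineer", "recruiter", "student", "student", "engineer"]
predicted = ["engineer", "recruiter", "engineer", "student", "engineer"]

print(accuracy(predicted, actual))
```

Measuring on examples the classifier never saw during training matters here, since accuracy on the training data itself tends to be overly optimistic.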

Can we further improve?

Clustering

grouping a set of data points so that data points in the same group (cluster) are more similar to each other than to those in other groups (clusters)

k-means clustering algorithm

k clusters

k = 3

step 1: randomly select k points as centroids

3 random centroids

step 2: assign every data point to the nearest centroid

step 3: calculate the mean of each cluster as the new centroid

repeat: reassign data points based on the new centroids, and keep repeating steps 2 and 3 until the assignments stop changing
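The three steps and the repeat loop can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation; the function names, the `seed` and `max_iterations` parameters, and the sample points are assumptions introduced for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select k points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every data point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: the mean of each cluster becomes its new centroid.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, assignments


# Hypothetical 2-D data with three visible groups, clustered with k = 3.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.5),
]
centroids, assignments = kmeans(points, k=3)
print(centroids)
print(assignments)
```

Because the starting centroids are random, different seeds can give different groupings, and k-means may settle on a poor local solution; running it several times and keeping the best result is a common precaution.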

How can clustering help solve big data problems?

Machine data is massive

1 TB/day is normal

no one has time to read all the data...

Clustering comes to the rescue!

a clustering algorithm summarizes big data into a few groups

each group represents a number of similar data points

instead of investigating data points one by one...

...just investigate the clusters!
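As a rough sketch of how clustering can condense machine data, the example below groups log lines with a simple greedy one-pass scheme (sometimes called leader clustering): each line joins the first cluster whose first member shares enough words with it, otherwise it starts a new cluster. The talk does not say which algorithm was used on machine data, so the method, the 0.5 similarity threshold, and the log lines are all assumptions.

```python
def tokens(line):
    """Split a log line into a set of lowercase words."""
    return set(line.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_cluster(lines, threshold=0.5):
    """Greedy one-pass clustering: compare each line to each cluster's first line."""
    clusters = []
    for line in lines:
        line_tokens = tokens(line)
        for cluster in clusters:
            if jaccard(line_tokens, cluster["leader_tokens"]) >= threshold:
                cluster["members"].append(line)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append(
                {"leader": line, "leader_tokens": line_tokens, "members": [line]}
            )
    return clusters


# Hypothetical log lines standing in for a day of machine data.
logs = [
    "connection from 10.0.0.1 closed",
    "connection from 10.0.0.2 closed",
    "disk usage at 91 percent on /var",
    "connection from 10.0.0.7 closed",
    "disk usage at 95 percent on /var",
    "user alice logged in",
]

clusters = leader_cluster(logs)
for cluster in clusters:
    print(len(cluster["members"]), "x", cluster["leader"])
```

Six lines collapse into three groups here, so a reader only has to look at one representative line and a count per group.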

Things to consider in practice...

scalability

velocity

variety

real-time

What’s next?

Recap

● regression

● classification

● clustering

This presentation was initially created for a guest lecture at Utah State University for teaching and education purposes.

Thanks!