Good Enough Analytics by Kai Xin

Upload: kai-xin-thia

Post on 26-Jan-2015


Category:

Technology



DESCRIPTION

Presented @ Bigdata Singapore Meetup. Good Enough Analytics is a methodology I am working on to achieve decent analytical results at a reasonable cost. Warning: For the consumption of Data Nerds Only. For 99% of normal humans, these slides are snooze inducing =P.

TRANSCRIPT

Page 1: Good Enough Analytics

Good Enough Analytics
by Kai Xin

Page 2: Good Enough Analytics
Page 3: Good Enough Analytics
Page 4: Good Enough Analytics

The Good Enough Stuff
Analytical Tools

Page 5: Good Enough Analytics

Analytical Tools are like spoons

Page 6: Good Enough Analytics

Analytical Tools are like spoons

Page 7: Good Enough Analytics
Page 8: Good Enough Analytics
Page 9: Good Enough Analytics

Usefulness

Page 10: Good Enough Analytics

Usefulness

Point of stupidity

Page 11: Good Enough Analytics

Usefulness

Point of stupidity

Page 12: Good Enough Analytics

Usefulness

Point of stupidity

Point of stupidity

Page 13: Good Enough Analytics

Point of stupidity

What is stupid today, might not be stupid tomorrow

Page 14: Good Enough Analytics

Good Enough Analytics
Big data analytics using cost-efficient tools

Page 15: Good Enough Analytics

The Good Enough Stuff
Ensembles of good enough models

Page 16: Good Enough Analytics

Point of stupidity: The perfect model

[Slide background: a wall of random digits]

A “perfect” model is too complex, too costly to build, too hard to maintain, and not flexible enough to change.

Page 17: Good Enough Analytics

“There are known knowns; there are things we know that we know.

There are known unknowns; there are things that we now know we don't know.

But there are also unknown unknowns; there are things we do not know we don't know.”

By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist

Why the perfect model is stupid

Page 18: Good Enough Analytics

“In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models”

Good Enough Analytics: Ensembles

[Slide graphic: three smaller blocks of random digits joined by "+" signs]

Page 19: Good Enough Analytics

Source: scholarpedia.org (refer to References)

Page 20: Good Enough Analytics

Source: scholarpedia.org (refer to References)

Page 21: Good Enough Analytics

Source: scholarpedia.org (refer to References)

Page 22: Good Enough Analytics

The Serious Stuff
…beyond theorycraft

Page 23: Good Enough Analytics

Simple Ensembles – GLM Bootstrap aggregating (bagging)

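# Needs library(foreach) and a registered parallel backend
predictions <- foreach(1:1000, .combine = cbind) %dopar% {
  # Draw row positions with replacement (90% of the training set size)
  training_positions <- sample(nrow(train), size = floor(nrow(train) * 0.9), replace = TRUE)
  # %in% keeps each drawn row once, so repeated draws collapse
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos, ])
  rxPredict(glmMod, test, type = "response")
}
# Average the 1000 prediction columns row by row
result <- rowMeans(predictions)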
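For readers without Revolution R, the bagging idea can be sketched in plain Python. This is only an illustration under stated assumptions: an ordinary least-squares line stands in for rxLinMod, the data are made up, and the names fit_line and bagged_predict are hypothetical. It uses a classic bootstrap resample (duplicates kept), which may differ slightly from the slide code.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample (all x equal): fall back to the mean
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bagged_predict(train_x, train_y, test_x, n_models=200, frac=0.9, seed=0):
    """Average the predictions of n_models lines, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    n = len(train_x)
    size = int(n * frac)
    totals = [0.0] * len(test_x)
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(size)]  # sample with replacement
        a, b = fit_line([train_x[i] for i in idx], [train_y[i] for i in idx])
        for j, x in enumerate(test_x):
            totals[j] += a + b * x
    return [t / n_models for t in totals]


# Hypothetical noisy data around y = 2x + 1
data_rng = random.Random(42)
train_x = [i / 10 for i in range(100)]
train_y = [2 * x + 1 + data_rng.gauss(0, 0.5) for x in train_x]
print(bagged_predict(train_x, train_y, [0.0, 5.0, 10.0]))
```

Each resampled model is cheap and independent of the others, which is why the slide can hand the loop to %dopar%.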

Page 24: Good Enough Analytics

Simple Ensembles – Gradient Boosting Machines

gbmMod <- gbm(eqn, data = train, n.trees = 10000, shrinkage = 0.002, distribution = "gaussian",
              interaction.depth = 7, bag.fraction = 0.9, n.minobsinnode = 50)

Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier.
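As a rough illustration of what a gradient boosting machine does in the squared-error (distribution="gaussian") setting, here is a minimal Python sketch. It is not the gbm package: it boosts one-split regression stumps on a single feature, skips gbm's row subsampling (bag.fraction), and the names fit_stump, boost and predict are hypothetical. It assumes at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one mean on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=200, shrinkage=0.1):
    """Each new stump is fit to the residuals left by the stumps before it."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + shrinkage * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, shrinkage, stumps


def predict(model, x):
    """Base value plus the shrunken contribution of every stump."""
    base, shrinkage, stumps = model
    return base + shrinkage * sum(lm if x <= t else rm for t, lm, rm in stumps)


# Hypothetical data: y = x squared on 0..19
xs = list(range(20))
ys = [x * x for x in xs]
model = boost(xs, ys)
print(round(predict(model, 10), 1))
```

A small shrinkage with many trees, as in the slide's call, trades run time for smoother fitting.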

Page 25: Good Enough Analytics

Simple Ensembles - Random Forest

rf <- foreach(ntree = rep(333, 3), .combine = combine, .packages = 'randomForest') %dopar%
  randomForest(train[, 3:length(train)], train$Act, ntree = ntree, do.trace = 1000,
               mtry = round(colNumber / 3), replace = FALSE, nodesize = 5, na.action = na.omit)

Page 26: Good Enough Analytics

Ensemble of Ensembles

1. Mean(RF, GBM, BagGLM)
2. Median(RF, GBM, BagGLM)
3. 0.4*RF + 0.4*GBM + 0.2*BagGLM
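The three blending rules can be sketched in a few lines of Python. The prediction values below are invented for illustration, and the function name combine is hypothetical; only the 0.4/0.4/0.2 weights come from the slide.

```python
import statistics


def combine(rf, gbm, bag_glm, weights=(0.4, 0.4, 0.2)):
    """Blend three models' predictions row by row: mean, median and weighted sum."""
    rows = list(zip(rf, gbm, bag_glm))
    return {
        "mean": [statistics.fmean(r) for r in rows],
        "median": [statistics.median(r) for r in rows],
        "weighted": [sum(w * p for w, p in zip(weights, r)) for r in rows],
    }


# Hypothetical predictions from the three models for three test rows
rf = [0.10, 0.50, 0.90]
gbm = [0.20, 0.40, 0.70]
bag_glm = [0.60, 0.45, 0.80]
blended = combine(rf, gbm, bag_glm)
print(blended["median"])
```

The median is less sensitive to one model going badly wrong on a row, while the weighted sum lets you lean on the models you trust more.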

Page 27: Good Enough Analytics

Ensembles – Why it matters

Improve accuracy
Ensembles tend to yield better results than their constituent models when there is significant diversity among the models.

Developing multiple simple models is faster than attempting to develop the perfect model.

More resistant to overfitting
Less reliant on any single model

Concurrent development
Different models can be run and developed on different instances/machines by different data scientists.

Page 28: Good Enough Analytics

Ensembles – point of stupidity

Netflix Prize, 1 million dollar winner: a blended ensemble for a 10% improvement
Too complicated, costly and inflexible to change

2007 Progress Prize entry: ensemble of 107 models for an 8.43% improvement
Actual deployment: 2 of those models

Moral of the story: a Good Enough Ensemble is good enough

Page 29: Good Enough Analytics

Good Enough Analytics
Big data analytics using cost-efficient tools
and good enough ensemble of models

Page 30: Good Enough Analytics

The Good Enough Stuff
Data Optimization

Page 31: Good Enough Analytics

Data cleaning vs Data optimization

Data cleaning: important, but I assume you know it

Data optimization: done AFTER data cleaning

Page 32: Good Enough Analytics

Kaggle Medical Drug Competition

15 sets of data
Each data set:
1,000 to 2,000 attributes
500 to 20,000 rows

Question: identify rogue drugs

Page 33: Good Enough Analytics

Point of stupidity: Trying to run analysis on all attributes

Drug  Rogue %  Company  Color  Component 1  Component 2…2000
A     0.0400   XYZ      Red    200          30
B     0.0002   XYZ      Green  920          50
C     0.8000   XYZ      Blue   30           1000
D     ?        XYZ      Red    340          800

Page 34: Good Enough Analytics

Not all attributes are born equal

Drug  Rogue %  Company  Color  Component 1  Component 2…2000
A     0.0400   XYZ      Red    200          30
B     0.0002   XYZ      Green  920          50
C     0.8000   XYZ      Blue   30           1000
D     ?        XYZ      Red    340          800

Company: no variance
Color: irrelevant
Component 1…2000: too many attributes

Page 35: Good Enough Analytics

Remove no variance / near zero variance attributes

Drug  Rogue %  Company
A     0.0400   XYZ
B     0.0002   XYZ
C     0.8000   XYZ
D     ?        XYZ

Company <- this attribute does not help in differentiating between the drugs

R code:
library(caret)
healthdata[nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10)] <- list(NULL)
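To show what a near-zero-variance filter checks, here is a small Python sketch loosely modeled on the two cut-offs passed to nearZeroVar above. It is not caret's implementation: the function names are hypothetical, missing values are not handled, and the table is just the toy drug example from the slide.

```python
from collections import Counter


def near_zero_var(column, freq_cut=95 / 5, unique_cut=10):
    """Flag a column that is constant, or dominated by one value with few distinct values."""
    counts = Counter(column).most_common()
    if len(counts) <= 1:
        return True  # a single distinct value: zero variance
    freq_ratio = counts[0][1] / counts[1][1]  # most common vs second most common
    percent_unique = 100 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


def drop_near_zero_var(table):
    """Keep only the columns that are not flagged."""
    return {name: col for name, col in table.items() if not near_zero_var(col)}


# Toy version of the slide's table (column name -> values)
drugs = {
    "company": ["XYZ"] * 4,
    "color": ["Red", "Green", "Blue", "Red"],
    "component_1": [200, 920, 30, 340],
}
print(list(drop_near_zero_var(drugs)))  # company is dropped
```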

Page 36: Good Enough Analytics

Remove not important attributes

Drug  Rogue %  Color
A     0.0400   Red
B     0.0002   Green
C     0.8000   Blue
D     ?        Red

Color <- this attribute has no relevance to % rogue drug

R code for Random Forest: importanceScore <- importance(myMod)

R code for GBM: importanceScore <- summary(myMod, n.trees = ntree)

Page 37: Good Enough Analytics

Attribute reduction using Principal Component Analysis

Drug  Rogue %  Component 1  Component 2…2000
A     0.0400   200          30
B     0.0002   920          50
C     0.8000   30           1000
D     ?        340          800

Component 1…2000 <- too many attributes take very long to analyze

R code: pc <- prcomp(train[, 2:length(train)], tol = 0.12)

Page 38: Good Enough Analytics

Attribute reduction using Principal Component Analysis

Andrew Ng: Always try analysis without PCA first.

[Figure: six data points (X) plotted against Attribute 1 and Attribute 2]

Source: Andrew Ng, Machine Learning Course (refer to References)

Page 39: Good Enough Analytics

Attribute reduction using Principal Component Analysis

Andrew Ng: Always try analysis without PCA first.

[Figure: the six data points (X) with the Principal Component drawn through them]

Source: Andrew Ng, Machine Learning Course (refer to References)

Page 40: Good Enough Analytics

Attribute reduction using Principal Component Analysis

[Figure: six data points (X) plotted against Attribute 1 and Attribute 2]

Source: Andrew Ng, Machine Learning Course (refer to References)

Page 41: Good Enough Analytics

Attribute reduction using Principal Component Analysis

The 1D red line and points are now representative of the 2D graph

[Figure: six points (0) placed along the Principal Component line]

Source: Andrew Ng, Machine Learning Course (refer to References)
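The 2D-to-1D picture can be reproduced with a short Python sketch. It handles only the two-attribute case, using the closed-form angle of the main axis of the scatter rather than a general PCA routine such as prcomp. The six points are made up and the function names are hypothetical.

```python
import math
import statistics


def first_principal_component(points):
    """Return (centre, unit direction) of the first principal component of 2-D points."""
    xs, ys = zip(*points)
    cx, cy = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - cx) ** 2 for x in xs)
    syy = sum((y - cy) ** 2 for y in ys)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Angle of the direction of greatest spread for a 2x2 covariance matrix
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), (math.cos(angle), math.sin(angle))


def project(points):
    """Replace each 2-D point by its 1-D coordinate along the principal component."""
    (cx, cy), (ux, uy) = first_principal_component(points)
    return [(x - cx) * ux + (y - cy) * uy for x, y in points]


# Six hypothetical points lying roughly along a line
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 6.0)]
print([round(score, 2) for score in project(points)])
```

The six numbers printed are the positions along the red line; the small offsets perpendicular to it are what the reduction throws away.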

Page 42: Good Enough Analytics

Data Optimization – Why it matters

Performance Improvement (importance, nearZeroVar)

Cut down attributes that are useless or not “good enough”. More accurate and complex models can be built on the attributes that matter.

Cost Savings (PCA)

Less data needs to be processed, so models and results turn around faster.

Page 43: Good Enough Analytics

Good Enough Analytics
Big data analytics using cost-efficient tools
and good enough ensemble of models based on optimized data

Page 44: Good Enough Analytics

The Good Enough Stuff
Scaling on cloud

Page 45: Good Enough Analytics

Why use Cloud

How often do you really need a multimillion-dollar machine on standby 24/7 to churn data?

Do you really need real-time analytics, or is an hourly/daily/weekly/monthly report good enough?

Page 46: Good Enough Analytics

Cloud – Why it matters

Excellent bang for the buck
<$5/hr to rent a million dollars' worth of computing power. No need to purchase/maintain hardware. Scale on demand.

Great for Ensemble Modeling
You can start multiple instances, each running one simple model, and then ensemble the results.

But beware of data security and privacy laws
Not suitable for all kinds of data/applications. For example, Amazon Web Services is HIPAA compliant but Rackspace is not.

Page 47: Good Enough Analytics

Prepare data for the cloud

Name   Age  Income  Postal
Peter  23   $2,000  400573
Sally  11   $0      520028
Paul   70   $500    521201
Mark   30   $8,000  247392

Page 48: Good Enough Analytics

Prepare data for the cloud

Name   Age  Age Group  Income  Income Range   Postal  Postal Area
Peter  23   Youth      $2,000  $1,000-$3,000  400***  Eunos
Sally  11   Child      $0      $0             520***  Simei
Paul   70   Senior     $500    $1-$1,000      521***  Tampines
Mark   30   Adult      $8,000  >$5,000        247***  Tanglin

Name: remove identity
Age Group: use general category
Income Range: use range category
Postal: masking
Postal Area: rollup

Reference: Dr. Yap Ghim Eng (A*Star)
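A minimal Python sketch of these generalisation steps is given below. The age and income cut-offs are assumptions picked so that the four example rows come out as on the slide (the slide does not state them), the postal-area lookup contains only the four prefixes shown, and all function names are hypothetical.

```python
# Postal prefix -> area, taken from the four example rows only
POSTAL_AREAS = {"400": "Eunos", "520": "Simei", "521": "Tampines", "247": "Tanglin"}


def age_group(age):
    """General category for age; the cut-offs are assumptions."""
    if age < 13:
        return "Child"
    if age < 25:
        return "Youth"
    if age < 65:
        return "Adult"
    return "Senior"


def income_range(income):
    """Range category for income; band edges beyond the slide's examples are assumptions."""
    if income == 0:
        return "$0"
    if income <= 1000:
        return "$1-$1,000"
    if income <= 3000:
        return "$1,000-$3,000"
    if income <= 5000:
        return "$3,000-$5,000"
    return ">$5,000"


def mask_postal(postal, keep=3):
    """Masking: keep the first digits and star out the rest."""
    return postal[:keep] + "*" * (len(postal) - keep)


def generalise(record):
    """Drop the name and replace exact values with coarser categories."""
    return {
        "age_group": age_group(record["age"]),
        "income_range": income_range(record["income"]),
        "postal": mask_postal(record["postal"]),
        "postal_area": POSTAL_AREAS.get(record["postal"][:3], "Unknown"),  # rollup
    }


peter = {"name": "Peter", "age": 23, "income": 2000, "postal": "400573"}
print(generalise(peter))
```

Whether a given level of generalisation is enough for a particular privacy law is a separate question that the sketch does not answer.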

Page 49: Good Enough Analytics

Good Enough Analytics
Big data analytics using cost-efficient tools
and good enough ensemble of models based on optimized data, scaled on cloud services

Page 50: Good Enough Analytics

The Good Enough Stuff
…that we have no time for

Amazon Web Services

Page 51: Good Enough Analytics

Basic code to setup Amazon instance for analytics

sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
sudo yum install -y lapack-devel blas-devel

# Build R from source
wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
tar -xf R-2.15.2.tar.gz
cd R-2.15.2
./configure --with-x=no
sudo make
PATH=$PATH:~/R-2.15.2/bin/
cd ..

# NumPy (-O names the saved file, because the URL ends in /download)
wget -O numpy-1.6.2.tar.gz http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
tar -xzf numpy-1.6.2.tar.gz
cd numpy-1.6.2
sudo python setup.py install
cd ..

# SciPy
wget -O scipy-0.11.0.tar.gz http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
tar -xzf scipy-0.11.0.tar.gz
cd scipy-0.11.0
sudo python setup.py install
cd ..

# nose
wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
tar -xzf nose-1.1.2.tar.gz
cd nose-1.1.2
sudo python setup.py install

After sudo-ing and running R, type:
install.packages('gbm')
install.packages('randomForest')

To leave R or Python jobs running while you are not logged on: "nohup R CMD BATCH myfile.r &"

Page 52: Good Enough Analytics

Amazon EC2 Spot Instance

Cluster Compute Eight Extra Large
60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet
$0.27 per hour

High-Memory Quadruple Extra Large Instance
68.4 GiB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform
$0.14 per hour

Page 53: Good Enough Analytics

Weakness of Spot Instance

Bidding system. If your bid < spot instance price, the instance will be terminated.

Solutions:
1) Put master on normal cloud instance and slave on spot instance
2) Heartbeat + Queue with Checkpoint
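The "queue with checkpoint" idea can be sketched in Python as follows. This is a single-machine illustration only: the heartbeat and the master/slave split are left out, the termination is simulated with a hypothetical fail_after parameter, and the function name run_with_checkpoint and the JSON checkpoint layout are assumptions.

```python
import json
import os
import tempfile


def run_with_checkpoint(tasks, work, checkpoint_path, fail_after=None):
    """Process tasks in order, recording progress so a restart can resume."""
    state = {"next": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)  # pick up where the previous run stopped
    done_this_run = 0
    while state["next"] < len(tasks):
        if fail_after is not None and done_this_run >= fail_after:
            raise RuntimeError("instance terminated")  # simulated spot termination
        state["results"].append(work(tasks[state["next"]]))
        state["next"] += 1
        done_this_run += 1
        # Write to a temp file and rename, so a kill mid-write cannot corrupt the checkpoint
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, checkpoint_path)
    return state["results"]


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
tasks = [1, 2, 3, 4, 5]
try:
    run_with_checkpoint(tasks, lambda n: n * n, path, fail_after=2)
except RuntimeError:
    pass  # the "instance" died after two tasks
print(run_with_checkpoint(tasks, lambda n: n * n, path))  # resumes at the third task
```

On a real spot instance the checkpoint would need to live somewhere that survives termination, such as the master or shared storage.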

Page 54: Good Enough Analytics

The Good Enough Stuff
…that we have no time for

PCA with KNN

Page 55: Good Enough Analytics

Principal Component Analysis - With K-Nearest Neighbor

library(FNN)
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)

# Fit PCA on all columns except the first, dropping weak components
pc <- prcomp(train[, 2:length(train)], tol = 0.12)
mydata <- data.frame(label = train[, "label"], pc$x)
labels <- mydata[, 1]
mydata2 <- mydata[, -1]
# Project the test set onto the same components
test.p <- predict(pc, newdata = test)

# 1-nearest-neighbour on the PCA scores; map factor levels back to digits 0-9
results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1, algorithm = "cover_tree")]
write(results, file = "knn_PCA.csv", ncolumns = 1)
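The k = 1 classification step is simple enough to sketch in Python. This version is a brute-force search, not the cover tree that FNN uses, and the reduced-dimension points and labels below are invented for illustration.

```python
import math


def nearest_neighbour(train_points, train_labels, query):
    """Return the label of the training point closest to query (k = 1)."""
    best_index = min(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], query))
    return train_labels[best_index]


# Hypothetical PCA scores (two components) with their digit labels
train_points = [(-2.0, 0.1), (-1.8, -0.3), (2.1, 0.4), (1.9, -0.2)]
train_labels = [0, 0, 7, 7]
print(nearest_neighbour(train_points, train_labels, (1.5, 0.0)))
```

Brute force compares the query with every training point, so working on a handful of PCA scores instead of all the original attributes directly cuts the cost of each comparison.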

Page 56: Good Enough Analytics

The Good Enough Stuff
…that we have no time for

Data Chunking

Page 57: Good Enough Analytics

Data Chunking – Revolution R

Use a format called XDF

Loosely based on NoSQL

The XDF format is a binary file format that stores data in blocks and processes data in chunks (groups of blocks) for efficient reading of arbitrary columns and contiguous rows.

For more details, visit the RevR website

Page 58: Good Enough Analytics

Data Chunking – Why it matters

# Chunk 6.5 GB worth of data onto HDD in XDF
rxImport(inData = trainFile, outFile = "trainingData.xdf")

# RevR provides methods like rxGlm to run a huge Poisson regression directly on the XDF file
myPos <- rxGlm(amount2 ~ Mailed + Donated + RR, data = "trainingData.xdf", family = poisson())

* This cannot be done using normal R on my laptop, as R tries to load the entire dataset into memory
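The general idea of chunking, keeping only one block of rows in memory at a time, can be sketched in Python with a running total. This is not the XDF format or rxGlm: it streams a CSV and computes a column mean, and the function name chunked_mean and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def chunked_mean(lines, column, chunk_rows=1000):
    """Mean of one CSV column, holding at most chunk_rows rows in memory at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    while True:
        chunk = list(islice(reader, chunk_rows))  # read the next block of rows
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical in-memory "file"; an open file handle works the same way
sample = io.StringIO("amount\n" + "\n".join(str(i) for i in range(1, 101)) + "\n")
print(chunked_mean(sample, "amount", chunk_rows=16))
```

Statistics that can be built up from per-chunk summaries (sums, counts, cross-products) suit this pattern, which is what makes regression on data larger than memory feasible.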

Page 59: Good Enough Analytics

Data Chunking – Speeding it up using SSD instead of normal HDD

RAM: Fast but expensive

SSD: ~4x faster than normal HDD when chunking

Page 60: Good Enough Analytics

The Good Enough Stuff
…that we have no time for

Multicore

Page 61: Good Enough Analytics

Multicore Processing – Revolution R

library(foreach)
library(doSNOW)
cluster <- makeCluster(3, type = "SOCK")
registerDoSNOW(cluster)
setMKLthreads(1)

predictions <- foreach(1:1000, .combine = cbind) %dopar% {
  training_positions <- sample(nrow(train), size = floor(nrow(train) * 0.9), replace = TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos, ])
  rxPredict(glmMod, test, type = "response")
}
result <- rowMeans(predictions)

Page 62: Good Enough Analytics

Multicore Processing – Why it matters

License Cost (usually charged per CPU)
1 CPU with 4 cores = 1 single-user license
Distributed: 4 CPUs with 1 core each = 4 licenses or a group license

Performance Improvement
~2x performance for 3 cores vs 1 core

Page 63: Good Enough Analytics

Visualization

Page 64: Good Enough Analytics

Good Enough References

Random Forest
• Obtaining knowledge from a random forest
• Suggestions for speeding up Random Forests
• Random Forest with classes that are very unbalanced

GBM
• Define boosting
• Generalized Boosted Models: A guide to the gbm package
• What are some useful guidelines for GBM parameters?
• R gbm logistic regression
• How to win the KDD Cup Challenge with R and gbm

Ensembles
• Ensemble learning introduction
• Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets
• Resources for learning how to implement ensemble methods
• Ensemble methods
• Intro to ensemble learning in R
• Predictive analytics & decision tree

Page 66: Good Enough Analytics

Good Enough Analytics
Big data analytics using cost-efficient tools
and good enough ensemble of models based on optimized data, scaled on cloud services

Page 67: Good Enough Analytics

Qns? Email me @ [email protected]
Profile
Kaggle Profile

Good Enough Analytics
Big data analytics using cost-efficient tools
and good enough ensemble of models based on optimized data, scaled on cloud services

Asia?

Page 68: Good Enough Analytics

Photo Credits

• Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
• Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
• Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
• Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
• Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
• Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
• Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
• Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
• Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
• Slide 19-21: www.scholarpedia.org
• Slide 23/25: www.wikipedia.org
• Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
• Slide 63: www.kaggle.com