LA HUG 2012 02-07
DESCRIPTION
Hadoop User Group talk in L.A. (2012)
TRANSCRIPT
![Page 1: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/1.jpg)
Beating up on Bayesian Bandits
![Page 2: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/2.jpg)
Mahout
• Scalable Data Mining for Everybody
![Page 3: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/3.jpg)
What is Mahout
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
![Page 4: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/4.jpg)
What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of)
• Classification (learn decision making from examples)
• Stuff (LDA, SVM, frequent item-set, math)
![Page 5: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/5.jpg)
Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
![Page 7: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/7.jpg)
Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
![Page 8: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/8.jpg)
An Example
![Page 9: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/9.jpg)
And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
![Page 10: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/10.jpg)
Feature Encoding
![Page 11: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/11.jpg)
Hashed Encoding
![Page 12: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/12.jpg)
Feature Collisions
![Page 13: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/13.jpg)
How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– Weighted sum of feature * weight is the key
• Each weight is a single real value
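The weighted sum above, combined with the hashed encoding named on the earlier slides, can be sketched in a few lines of Python. This is an illustration only, not Mahout's implementation: the vector size, the use of CRC32 as the hash, the learning rate, and the two token lists are all assumptions made for the example.

```python
import math
import zlib

NUM_FEATURES = 1024  # assumed vector size; smaller sizes cause more collisions

def encode(tokens, size=NUM_FEATURES):
    """Hash each token to a slot and set it to 1 (binary features)."""
    x = [0.0] * size
    for tok in tokens:
        # Two different tokens may land in the same slot: a feature collision.
        x[zlib.crc32(tok.encode("utf-8")) % size] = 1.0
    return x

def score(weights, x):
    """Weighted sum of feature * weight, squashed into a probability."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, x, label, rate=0.5):
    """One on-line (sequential) logistic regression update."""
    err = label - score(weights, x)
    return [w + rate * err * xi for w, xi in zip(weights, x)]

# Hypothetical training examples, loosely modelled on the two emails.
spam = encode("confidential business deal transfer sum".split())
ham = encode("pleasure talking lunch tomorrow noon".split())

weights = [0.0] * NUM_FEATURES  # each weight is a single real value
for _ in range(200):
    weights = sgd_step(weights, spam, 1.0)
    weights = sgd_step(weights, ham, 0.0)

print(score(weights, spam), score(weights, ham))
```

Note that each weight here is a single number, which is exactly the limitation the next slides pick up on.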
![Page 14: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/14.jpg)
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
![Page 15: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/15.jpg)
A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
![Page 16: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/16.jpg)
A Second Conclusion
• A single number is a bad way to express uncertain knowledge
• A distribution of values might be better
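One common way to hold "a distribution of values" for a coin is a Beta distribution over the probability of heads. The sketch below assumes a uniform prior, so that after seeing h heads and t tails the belief is Beta(h + 1, t + 1). The prior and the example counts are assumptions for illustration, not something the slides state.

```python
import math

def beta_summary(heads, tails):
    """Mean and standard deviation of Beta(heads + 1, tails + 1)."""
    a, b = heads + 1, tails + 1
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

# Illustrative counts: no data, balanced data, lopsided data.
for heads, tails in [(0, 0), (5, 5), (2, 10)]:
    mean, std = beta_summary(heads, tails)
    print(f"{heads} heads, {tails} tails: mean={mean:.3f}, std={std:.3f}")
```

With no data the distribution is flat; more observations narrow it, and lopsided counts move its center.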
![Page 17: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/17.jpg)
I Dunno
![Page 18: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/18.jpg)
5 and 5
![Page 19: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/19.jpg)
2 and 10
![Page 20: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/20.jpg)
The Cynic Among Us
![Page 21: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/21.jpg)
A Second Diversion
![Page 22: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/22.jpg)
Two-armed Bandit
![Page 23: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/23.jpg)
Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
– Playing the lesser machine has “opportunity cost”
• But how do we know which is which?
– Explore versus Exploit!
![Page 24: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/24.jpg)
Algorithmic Costs
• Option 1
– Explicitly code the explore/exploit trade-off
• Option 2
– Bayesian Bandit
![Page 25: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/25.jpg)
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
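The four steps above can be sketched as a small simulation. The slide does not say which distributions to compute; this sketch assumes win/loss payoffs with a uniform prior, which gives a Beta(wins + 1, losses + 1) distribution per bandit. The true payoff rates, play count, and seed are hypothetical values picked for the demonstration.

```python
import random

def choose(wins, losses, rng):
    """Sample p1 and p2 from each bandit's distribution; play the larger."""
    p1 = rng.betavariate(wins[0] + 1, losses[0] + 1)
    p2 = rng.betavariate(wins[1] + 1, losses[1] + 1)
    return 0 if p1 > p2 else 1

def simulate(true_rates, plays, seed=0):
    """Play two bandits with hidden payoff rates and record the outcomes."""
    rng = random.Random(seed)
    wins, losses = [0, 0], [0, 0]
    for _ in range(plays):
        arm = choose(wins, losses, rng)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

# Hypothetical machines: the second one pays off more often.
wins, losses = simulate(true_rates=(0.2, 0.6), plays=2000)
pulls = [w + l for w, l in zip(wins, losses)]
print("pulls per bandit:", pulls)
```

There is no explicit explore/exploit switch in the code. While the distributions are wide, either bandit can win the draw, and as data accumulates the better machine wins almost every time.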
![Page 26: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/26.jpg)
![Page 27: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/27.jpg)
![Page 28: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/28.jpg)
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
![Page 29: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/29.jpg)
Deployment with Storm/MapR
All state managed transactionally in MapR file system
![Page 30: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/30.jpg)
Service Architecture
MapR Lockless Storage Services
MapR Pluggable Service Management
Storm
Hadoop
![Page 31: LA HUG 2012 02-07](https://reader036.vdocuments.site/reader036/viewer/2022062418/556a7424d8b42a7c758b45d5/html5/thumbnails/31.jpg)
Find Out More
• Me: [email protected] [email protected] [email protected]
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning