Intro to Mahout
A short introduction to Mahout, presented at the SCISR meetup: http://bit.ly/scisr
Ofer Vugman, May 2012
Agenda and such…
What is ML (Machine Learning)
ML Common Use Cases
Mahout Overview
Algorithms in Mahout
Mahout Commercial Use
Mahout Summary
What is ML
“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
“Introduction to Machine Learning” by E. Alpaydin
ML Common Use Cases
Recommendation
ML Common Use Cases
Classification
ML Common Use Cases
Clustering
ML Common Libraries
Mahout Overview – What ?
A mahout is a person who keeps and drives an elephant
Mahout Overview – What ?
A scalable machine learning library
Mahout Overview – What ?
Began life in 2008 as a subproject of Apache’s Lucene project
In 2010 Mahout became a top-level Apache project in its own right
Implemented in Java
Built upon Apache’s Hadoop (Look! An elephant!)
Mahout Overview – Why ?
Many open source ML libraries either:
Lack community
Lack documentation and examples
Lack scalability
Lack the Apache license
Are research oriented
Are not well tested
Are not built over existing production quality libraries
Mahout Overview – Why ?
Scalability
Scalable to reasonably large datasets (core algorithms implemented in Map/Reduce, runnable on Hadoop)
Scalable to support your business case (Apache License)
Scalable community
Mahout Overview – Why ?
Built over existing production quality libraries
Mahout Overview – Use Cases
Mahout currently supports four main use cases:
1. Recommendation
2. Clustering
3. Classification
4. Frequent Itemset Mining
Mahout Overview - Technical
System Requirements
Linux (or Cygwin on Windows)
Java 1.6.x or greater
Maven 2.0.11 or greater to build the source code
Hadoop 0.2 or greater*
* Not all algorithms are implemented to work on Hadoop clusters
Algorithms in Mahout
We’ll focus on one example: Collaborative Filtering (Recommenders)
There are many (many!) more; you can find them all at https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Algorithms Examples – Recommendation
Help users find items they might like based on historical preferences
Based on an example by Sebastian Schelter in “Distributed Itembased Collaborative Filtering with Apache Mahout”
Algorithms Examples – Recommendation
         Matrix   Alien   Inception
Alice    5        1       4
Bob      ?        2       5
Peter    4        3       2
Algorithms Examples – Recommendation
Algorithm
Neighborhood-based approach
Works by finding similarly rated items in the user-item matrix (using a similarity measure such as cosine, Pearson correlation, or the Tanimoto coefficient)
Estimates a user's preference towards an item by looking at his/her preferences towards similar items
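The three similarity measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration and not Mahout's own implementation: the function names and signatures are assumptions made for this example. The cosine and Pearson functions expect two equally long lists holding the ratings of users who rated both items, and the Tanimoto function expects the sets of users who rated each item.

```python
import math

def cosine(xs, ys):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0

def pearson(xs, ys):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return cosine([x - mean_x for x in xs], [y - mean_y for y in ys])

def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient: shared raters divided by all raters."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0
```

Cosine and Pearson use the rating values themselves, while Tanimoto only looks at who rated what, which makes it a natural fit for data without explicit ratings.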
Algorithms Examples – Recommendation
Prediction: Estimate Bob's preference towards “The Matrix”
1. Look at all items that
   a) are similar to “The Matrix”
   b) have been rated by Bob
   => “Alien”, “Inception”
2. Estimate the unknown preference with a weighted sum
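One common way to write the weighted sum in step 2 is given below. It is a sketch rather than a formula quoted from the talk: the symbols are introduced here for illustration, and dividing by the sum of absolute similarities is an assumption (the usual choice when similarities can be negative).

```latex
\hat{r}_{u,i} = \frac{\sum_{j \in S(u,i)} \operatorname{sim}(i,j)\, r_{u,j}}{\sum_{j \in S(u,i)} \lvert \operatorname{sim}(i,j) \rvert}
```

Here u is the user (Bob), i is the item to predict (“The Matrix”), S(u,i) is the set of items similar to i that u has rated (“Alien”, “Inception”), and r_{u,j} is u's rating of item j.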
Algorithms Examples – Recommendation
MapReduce phase 1
Map – Make user the key
(Alice, Matrix, 5)
(Alice, Alien, 1)
(Alice, Inception, 4)
(Bob, Alien, 2)
(Bob, Inception, 5)
(Peter, Matrix, 4)
(Peter, Alien, 3)
(Peter, Inception, 2)
=>
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
Algorithms Examples – Recommendation
MapReduce phase 1
Reduce – Create inverted index
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
=>
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
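Phase 1 can be simulated in plain Python to make the data flow concrete. This is a single-process sketch, not Hadoop or Mahout code: the names `map_phase1` and `reduce_phase1` are hypothetical, and the dictionary grouping stands in for the shuffle step that Hadoop performs between map and reduce.

```python
from collections import defaultdict

# Rating triples (user, item, rating) from the slides.
ratings = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

def map_phase1(user, item, rating):
    """Map: emit the user as key and (item, rating) as value."""
    yield user, (item, rating)

def reduce_phase1(user, values):
    """Reduce: collect all (item, rating) pairs of one user into one record."""
    return user, list(values)

# Stand-in for the shuffle step: group the mapper output by key.
grouped = defaultdict(list)
for triple in ratings:
    for key, value in map_phase1(*triple):
        grouped[key].append(value)

by_user = dict(reduce_phase1(user, values) for user, values in grouped.items())

for user, items in by_user.items():
    print(user, items)
```

Each printed record corresponds to one row of the reducer output on the slide, for example Bob with (Alien, 2) and (Inception, 5).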
Algorithms Examples – Recommendation
MapReduce phase 2
Map – Isolate all co-occurred ratings (all cases where a user rated both items)
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
=>
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
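The phase 2 map step can be sketched the same way: for each user record, emit every pair of items that user rated together with the two ratings. This is an illustrative single-process sketch with a hypothetical function name, `map_phase2`. It assumes every user's items appear in the same order (Matrix, Alien, Inception), so that the same two items always yield the same key; a real job would need a canonical ordering, for example by item ID.

```python
from itertools import combinations

# Output of phase 1: each user's (item, rating) pairs, as on the slides.
by_user = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}

def map_phase2(user_ratings):
    """Map: for every pair of items one user rated, emit (item pair, rating pair)."""
    for (item_a, rating_a), (item_b, rating_b) in combinations(user_ratings, 2):
        yield (item_a, item_b), (rating_a, rating_b)

cooccurrences = [
    pair for user_ratings in by_user.values() for pair in map_phase2(user_ratings)
]

for item_pair, rating_pair in cooccurrences:
    print(item_pair, rating_pair)
```

Bob contributes only one pair, (Alien, Inception) with ratings (2, 5), because he has not rated Matrix. That missing rating is exactly what the recommender later estimates.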
Algorithms Examples – Recommendation
MapReduce phase 2
Reduce – Compute similarities
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
=>
Matrix, Alien (-0.47)
Matrix, Inception (0.47)
Alien, Inception (-0.63)
Algorithms Examples – Recommendation
         Matrix   Alien   Inception
Alice    5        1       4
Bob      1.5      2       5
Peter    4        3       2
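The value 1.5 for Bob and “The Matrix” can be reproduced from the similarities reported on the phase 2 reduce slide. The sketch below assumes the weighted sum is normalized by the sum of the absolute similarities, which is consistent with the result shown here; the function name `predict` and the variable names are hypothetical.

```python
# Item-item similarities to "Matrix", as reported on the phase 2 reduce slide.
similarity_to_matrix = {"Alien": -0.47, "Inception": 0.47}

# Bob's known ratings from the user-item matrix.
bob_ratings = {"Alien": 2, "Inception": 5}

def predict(similarities, user_ratings):
    """Weighted sum of the user's ratings, normalized by the absolute similarities."""
    rated = [item for item in similarities if item in user_ratings]
    numerator = sum(similarities[item] * user_ratings[item] for item in rated)
    denominator = sum(abs(similarities[item]) for item in rated)
    return numerator / denominator

prediction = predict(similarity_to_matrix, bob_ratings)
print(round(prediction, 2))  # 1.5
```

Written out, the arithmetic is (-0.47 × 2 + 0.47 × 5) / (0.47 + 0.47) = 1.41 / 0.94 = 1.5.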
Mahout Commercial Use
Commercial use
Mahout Resources
Mahout website - http://mahout.apache.org/
Introducing Apache Mahout – http://www.ibm.com/developerworks/java/library/j-mahout/
“Mahout In Action” by Sean Owen and Robin Anil
Mahout Summary
ML is all over the web today
Mahout is about scalable machine learning
Mahout has functionality for many of today’s common machine learning tasks
MapReduce magic in action
Mahout Summary
Thank you and good night