Apache Mahout Industrial Strength Machine Learning May 2008

Apache Mahout

Industrial Strength Machine Learning

May 2008

Current Situation

• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of “machine learning” algorithms
• The world needs scalable implementations of ML under open license - ASF

Where is ML Used Today

• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering, fraud detection

History of Mahout

• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Project formed under Apache Lucene
  – January 25, 2008

Who We Are (so far)

Grant Ingersoll
Karl Wettin
Isabel Drost
Ted Dunning
Jeff Eastman
Dawid Weiss
Otis Gospodnetic
Erik Hatcher
Sean Owen
Ozgur Yilmazel

Current Code Base

• Matrix & Vector library
  – Memory resident sparse & dense implementations
• Clustering
  – Canopy
  – K-Means
  – Mean Shift
• Collaborative Filtering
  – Taste
• Utilities
  – Distance Measures
  – Parameters
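
To make the "sparse & dense" and "Distance Measures" items concrete, here is a minimal sketch in plain Python. It is not Mahout's Java API: the function names (`to_sparse`, `euclidean`, `manhattan`) and the dict-based sparse layout are assumptions chosen for illustration only.

```python
import math

def to_sparse(dense):
    """Convert a dense list into a sparse {index: value} dict of non-zeros."""
    return {i: float(v) for i, v in enumerate(dense) if v != 0}

def euclidean(a, b):
    """Euclidean distance between two sparse vectors (missing index = 0)."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def manhattan(a, b):
    """Manhattan (city-block) distance between two sparse vectors."""
    keys = a.keys() | b.keys()
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

# Two mostly-zero vectors: only the non-zero cells are stored.
p = to_sparse([0, 3, 0, 0])
q = to_sparse([4, 0, 0, 0])
d_euclid = euclidean(p, q)
d_manhattan = manhattan(p, q)
```

Passing the distance function as a parameter is what lets a clustering routine work with whichever measure is supplied.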

Example: K-Means

• Given K, assign the first K random points to be the initial cluster centers

• Assign subsequent points to the closest cluster using the supplied distance measure

• Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta

• Run a final pass over the points to cluster them for output
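
The four steps above can be sketched as a small in-memory routine. This is an illustrative Python sketch, not Mahout's implementation. It assumes the input is already in random order, so that taking the first K points serves as the random initialization, and it assumes an empty cluster keeps its previous center. The names `kmeans`, `nearest`, `centroid` and the `max_iterations` guard are hypothetical.

```python
import math

def euclidean(a, b):
    """Default distance measure; any function of two points can be supplied."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers, distance):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, distance=euclidean, delta=1e-3, max_iterations=100):
    # Step 1: the first K points become the initial cluster centers.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Step 2: assign every point to its closest cluster.
        members = [[] for _ in centers]
        for p in points:
            members[nearest(p, centers, distance)].append(p)
        # Step 3: recompute centroids (assumption: empty clusters keep their center).
        new_centers = [centroid(m) if m else c for m, c in zip(members, centers)]
        moved = max(distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= delta:  # converged within delta
            break
    # Step 4: final pass that labels each point for output.
    assignment = [nearest(p, centers, distance) for p in points]
    return centers, assignment

points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centers, assignment = kmeans(points, k=2)
```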

K-Means Map/Reduce Design

• Driver
  – Runs multiple iteration jobs using mapper+combiner+reducer
  – Runs final clustering job using only mapper
• Mapper
  – Configure: Single file containing encoded Clusters
  – Input: File split containing encoded Vectors
  – Output: Vectors keyed by nearest cluster
• Combiner
  – Input: Vectors keyed by nearest cluster
  – Output: Cluster centroid vectors keyed by “cluster”
• Reducer (singleton)
  – Input: Cluster centroid vectors
  – Output: Single file containing Vectors keyed by cluster
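
The data flow of one iteration job can be simulated without Hadoop. The following Python sketch mirrors the mapper, combiner and reducer roles listed above. It is an assumption-laden illustration rather than Mahout code: the function names are hypothetical, input splits are plain lists, and the combiner emits a partial (sum, count) pair per cluster rather than a finished centroid so that the single reducer can merge partials exactly.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (enough for picking the nearest center)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mapper(split, centers):
    """Emit (cluster id, vector) for every vector in one input split."""
    for v in split:
        cid = min(range(len(centers)), key=lambda i: sq_dist(v, centers[i]))
        yield cid, v

def combiner(pairs):
    """Collapse one mapper's output into a partial (sum, count) per cluster."""
    partial = {}
    for cid, v in pairs:
        total, count = partial.get(cid, ((0.0,) * len(v), 0))
        partial[cid] = (tuple(t + x for t, x in zip(total, v)), count + 1)
    return list(partial.items())

def reducer(partials, centers):
    """Merge all partials and return the new list of cluster centers."""
    merged = {}
    for cid, (total, count) in partials:
        old_total, old_count = merged.get(cid, ((0.0,) * len(total), 0))
        merged[cid] = (tuple(a + b for a, b in zip(old_total, total)),
                       old_count + count)
    new_centers = list(centers)  # clusters that received no points are unchanged
    for cid, (total, count) in merged.items():
        new_centers[cid] = tuple(t / count for t in total)
    return new_centers

def run_iteration(splits, centers):
    """One iteration job: map and combine each split, then reduce once."""
    partials = []
    for split in splits:
        partials.extend(combiner(mapper(split, centers)))
    return reducer(partials, centers)

splits = [[(0.0, 0.0), (10.0, 10.0), (0.0, 2.0)],
          [(10.0, 12.0), (2.0, 0.0)]]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = run_iteration(splits, centers)
```

Running the combiner on each split before the reducer keeps the volume of data sent to the singleton reducer proportional to the number of clusters rather than the number of vectors.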

K-Means Hadoop Implementation

• KMeansDriver
  – runJob()
  – runIteration()
  – isConverged()
  – runCluster()
• KMeansMapper
  – configure()
  – map()
• KMeansCombiner
  – configure()
  – reduce()
• KMeansReducer
  – configure()
  – reduce()
• Cluster
  – configure()
  – formatCluster()
  – decodeCluster()
  – addPoint()
  – computeCentroid()
  – accessors
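
How the driver methods might fit together can be sketched as a control loop. The Python below borrows the method names from the KMeansDriver list (in snake_case) but is only a hypothetical in-memory analogue. The real signatures are not given in these slides, and the `delta` and `max_iterations` parameters are assumptions.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers):
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))

def run_iteration(points, centers):
    """One pass: assign points to clusters, then recompute each centroid."""
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    return [
        tuple(sum(c) / len(g) for c in zip(*g)) if g else old
        for g, old in zip(groups, centers)
    ]

def is_converged(old_centers, new_centers, delta):
    """True when no center moved farther than delta."""
    return all(distance(o, n) <= delta for o, n in zip(old_centers, new_centers))

def run_cluster(points, centers):
    """Final pass: emit (cluster id, point) pairs for output."""
    return [(nearest(p, centers), p) for p in points]

def run_job(points, initial_centers, delta=0.001, max_iterations=20):
    """Iterate until convergence, then run the final clustering pass."""
    centers = list(initial_centers)
    iterations = 0
    while iterations < max_iterations:
        new_centers = run_iteration(points, centers)
        iterations += 1
        converged = is_converged(centers, new_centers, delta)
        centers = new_centers
        if converged:
            break
    return centers, run_cluster(points, centers), iterations

points = [(1.0,), (2.0,), (9.0,), (10.0,)]
final_centers, clustered, iterations = run_job(points, [(1.0,), (2.0,)])
```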

Under Development

• Naïve Bayes
• Perceptron
• PLSI/EM
• Genetic Programming
• Dirichlet Process Clustering
• Clustering Examples
• Hama (Incubator) for very large arrays

GSoC @ Mahout

• Many interesting submissions
• 4 projects approved for Mahout (http://code.google.com/soc/2008/asf/about.html)
  – “Mahout: Parallel implementation of [NB/SOM/RF] machine learning algorithms”, Farid Bourennani
  – “Implementing Logistic Regression in Mahout”, Yun Jiang
  – “Codename Mahout.GA for mahout-machine-learning”, Abdel Hakim Deneche
  – “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

Conclusion

• This is just the beginning
• High demand for scalable machine learning
• Contributors needed who have
  – Interest, enthusiasm & programming ability
  – Test driven development readiness
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large data sets you want to analyze
  – Access to clusters that we could use for testing