Intro to Mahout -- DC Hadoop

DESCRIPTION

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

TRANSCRIPT

Page 1: Intro to Mahout -- DC Hadoop

Intro to Apache Mahout

Grant Ingersoll, Lucid Imagination

http://www.lucidimagination.com

Page 2: Intro to Mahout -- DC Hadoop

Anyone Here Use Machine Learning?

• Any users of:
– Google?
  • Search?
  • Priority Inbox?

– Facebook?

– Twitter?

– LinkedIn?

Page 3: Intro to Mahout -- DC Hadoop

Topics

• Background and Use Cases
• What can you do in Mahout?
• Where’s the community at?
• Resources
• K-Means in Hadoop (time permitting)

Page 4: Intro to Mahout -- DC Hadoop

Definition

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
– Introduction to Machine Learning by E. Alpaydin

• Subset of Artificial Intelligence
• Lots of related fields:
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Many more

Page 5: Intro to Mahout -- DC Hadoop

Common Use Cases

• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content
• Find associations/patterns in actions/behaviors
• Identify key topics/summarize text
– Documents and Corpora
• Detect anomalies/fraud
• Rank search results
• Others?

Page 6: Intro to Mahout -- DC Hadoop

Apache Mahout

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
– http://mahout.apache.org

• Why Mahout?
– Many Open Source ML libraries either:
  • Lack Community
  • Lack Documentation and Examples
  • Lack Scalability
  • Lack the Apache License
  • Or are research-oriented

Definition: http://dictionary.reference.com/browse/mahout

Page 7: Intro to Mahout -- DC Hadoop

What does scalable mean to us?

• Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
– Some algorithms won’t scale to massive machine clusters
– Others fit logically on a MapReduce framework like Apache Hadoop
– Still others will need different distributed programming models
– Others are already fast (SGD)

• Be pragmatic

Page 8: Intro to Mahout -- DC Hadoop

Sampling of Who uses Mahout?

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Page 9: Intro to Mahout -- DC Hadoop

What Can I do with Mahout Right Now?

3C + FPM + O = Mahout
(Collaborative Filtering, Clustering, Classification + Frequent Pattern Mining + Other)

Page 10: Intro to Mahout -- DC Hadoop

Collaborative Filtering

• Extensive framework for collaborative filtering (recommenders)

• Recommenders
– User based
– Item based

• Online and Offline support
– Offline can utilize Hadoop

• Many different Similarity measures
– Cosine, LLR, Tanimoto, Pearson, others (see the sketch below)
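As a rough illustration of the recommender framework (the Taste API that ships with Mahout), here is a minimal user-based recommender sketch. It assumes a hypothetical ratings.csv file of userID,itemID,preference lines; the class names are from the 0.4/0.5-era org.apache.mahout.cf.taste packages, so verify them against your release.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,itemID,preference" line per rating (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Pearson correlation is one of several pluggable similarity measures
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Use the 10 most similar users as the neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top-3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

Swapping in TanimotoCoefficientSimilarity or LogLikelihoodSimilarity, or switching to an item-based recommender, is a one-line change; the offline, Hadoop-based recommenders follow the same data model.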

Page 11: Intro to Mahout -- DC Hadoop

Clustering

• Document level
– Group documents based on a notion of similarity
– K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
– All Map/Reduce
– Distance Measures (example below)
  • Manhattan, Euclidean, others

• Topic Modeling
– Cluster words across documents to identify topics
– Latent Dirichlet Allocation (M/R)
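To make the pluggable distance measures concrete, here is a small sketch using Mahout's math Vector type with two of the measures named above. The imports reflect the mahout-core and mahout-math modules of this era and should be checked against the release you use.

```java
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceSketch {
  public static void main(String[] args) {
    Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Vector b = new DenseVector(new double[] {4.0, 2.0, 0.0});

    // The same pair of points, measured two different ways
    double euclidean = new EuclideanDistanceMeasure().distance(a, b);
    double manhattan = new ManhattanDistanceMeasure().distance(a, b);

    System.out.println("Euclidean: " + euclidean);  // sqrt(9 + 0 + 9), about 4.24
    System.out.println("Manhattan: " + manhattan);  // 3 + 0 + 3 = 6
  }
}
```

Which measure you pick changes which points count as "close", and therefore what the clusters look like.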

Page 12: Intro to Mahout -- DC Hadoop

Categorization

• Place new items into predefined categories:
– Sports, politics, entertainment
– Recommenders

• Implementations (sketch below)
– Naïve Bayes (M/R)
– Complementary Naïve Bayes (M/R)
– Decision Forests (M/R)
– Logistic Regression via SGD (sequential, but fast!)

• See Chapter 17 of Mahout in Action for the Shop It To Me use case: http://awe.sm/5FyNe
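The sequential-but-fast entry above is the SGD-based OnlineLogisticRegression trainer. The sketch below is illustrative only: the toy features, labels, and training loop are invented for the example, and the exact constructor and prior classes are assumptions to verify against your release.

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // 2 categories, 3 features (a constant bias plus two inputs), L1 regularization
    OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 3, new L1());

    // Toy training data: label 1 when the first input exceeds the second (purely illustrative)
    double[][] examples = {{1, 0.9, 0.1}, {1, 0.8, 0.3}, {1, 0.2, 0.7}, {1, 0.1, 0.9}};
    int[] labels = {1, 1, 0, 0};
    for (int pass = 0; pass < 100; pass++) {
      for (int i = 0; i < examples.length; i++) {
        learner.train(labels[i], new DenseVector(examples[i]));
      }
    }

    // classifyFull returns a vector of per-category probabilities
    Vector scores = learner.classifyFull(new DenseVector(new double[] {1, 0.95, 0.05}));
    System.out.println("P(category 1) = " + scores.get(1));
  }
}
```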

Page 13: Intro to Mahout -- DC Hadoop

Freq. Pattern Mining

• Identify frequently co-occurring items

• Useful for:
– Query Recommendations
  • Apple -> iPhone, orange, OS X
– Related product placement
  • Basket Analysis

• Map/Reduce

http://www.amazon.com

Page 14: Intro to Mahout -- DC Hadoop

Other

• Primitive Collections!

• Collocations (M/R)

• Math library (sketch below)
– Vectors, Matrices, etc.

• Noise Reduction via Singular Value Decomp (M/R)
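A small sketch of the math library and primitive collections mentioned above. The specific classes (RandomAccessSparseVector, OpenIntDoubleHashMap) are assumptions based on the mahout-math module of this era.

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.map.OpenIntDoubleHashMap;

public class MathSketch {
  public static void main(String[] args) {
    // A sparse vector with large cardinality but only a few non-zero entries
    Vector sparse = new RandomAccessSparseVector(1000000);
    sparse.set(42, 3.0);
    sparse.set(777000, 1.5);

    Vector dense = new DenseVector(new double[] {1.0, 2.0, 3.0});

    System.out.println("non-zeros: " + sparse.getNumNondefaultElements());
    System.out.println("L2 norm of dense: " + dense.norm(2));

    // Primitive collections avoid boxing int keys and double values
    OpenIntDoubleHashMap counts = new OpenIntDoubleHashMap();
    counts.put(7, counts.get(7) + 1.0);  // get() is assumed to return 0.0 for a missing key
    counts.put(7, counts.get(7) + 1.0);
    System.out.println("counts[7] = " + counts.get(7));
  }
}
```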

Page 15: Intro to Mahout -- DC Hadoop

Prepare Data from Raw Content

• Data Sources:
– Lucene integration
  • bin/mahout lucene.vector …
– Document Vectorizer
  • bin/mahout seqdirectory …
  • bin/mahout seq2sparse …
– Programmatically (see the sketch below)
  • See the Utils module in Mahout and the Iterator<Vector> classes
– Database
– File system
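For the programmatic route, seq2sparse writes Hadoop SequenceFiles of Text keys and VectorWritable values, which you can read back directly. The path below is hypothetical; the reader pattern itself uses only standard Hadoop and Mahout classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ReadVectorsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: one part file from seq2sparse's tfidf-vectors output
    Path path = new Path("reuters-vectors/tfidf-vectors/part-r-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();                 // document name
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        Vector v = value.get();
        System.out.println(key + " has " + v.getNumNondefaultElements() + " terms");
      }
    } finally {
      reader.close();
    }
  }
}
```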

Page 16: Intro to Mahout -- DC Hadoop

How to: Command Line

• Most algorithms have a Driver program
– $MAHOUT_HOME/bin/mahout.sh helps with most tasks

• Prepare the Data
– Different algorithms require different setup

• Run the algorithm
– Single Node
– Hadoop

• Print out the results or incorporate into application
– Several helper classes (see the sketch below):
  • LDAPrintTopics, ClusterDumper, etc.
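The helper classes can also be driven from Java rather than the command line. Here is a ClusterDumper sketch following the pattern shown in Mahout in Action; the constructor arguments and output paths are assumptions that depend on which clustering job produced them.

```java
import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.clustering.ClusterDumper;

public class DumpClustersSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical output paths from a k-means run:
    // the final cluster centroids and the (optional) clustered points
    ClusterDumper dumper = new ClusterDumper(
        new Path("output/clusters-10"),
        new Path("output/clusteredPoints"));

    // Passing null skips mapping term indexes back through a dictionary
    dumper.printClusters(null);
  }
}
```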

Page 17: Intro to Mahout -- DC Hadoop

What’s Happening Now?

• Unified Framework for Clustering and Classification

• 0.5 release on the horizon (May?)
• Working towards 1.0 release by focusing on:
– Tests, examples, documentation
– API cleanup and consistency

• Gearing up for Google Summer of Code
– New M/R work for Hidden Markov Models

Page 18: Intro to Mahout -- DC Hadoop

Summary

• Machine learning is all over the web today

• Mahout is about scalable machine learning

• Mahout has functionality for many of today’s common machine learning tasks

• Many Mahout implementations use Hadoop

Page 19: Intro to Mahout -- DC Hadoop

Resources

• http://mahout.apache.org
• http://cwiki.apache.org/MAHOUT
• {user|dev}@mahout.apache.org
• http://svn.apache.org/repos/asf/mahout/trunk
• http://hadoop.apache.org

Page 20: Intro to Mahout -- DC Hadoop

Resources

• “Mahout in Action” by Owen, Anil, Dunning, and Friedman
– http://awe.sm/5FyNe

• “Introducing Apache Mahout”
– http://www.ibm.com/developerworks/java/library/j-mahout/

• “Taming Text” by Ingersoll, Morton, and Farris
• “Programming Collective Intelligence” by Toby Segaran
• “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
• “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer

Page 21: Intro to Mahout -- DC Hadoop

K-Means

• Clustering Algorithm
– Nicely parallelizable! (see the sketch below)

http://en.wikipedia.org/wiki/K-means_clustering
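Before looking at the Map-Reduce version, here is a compact single-machine sketch of the algorithm itself (plain Java, no Mahout or Hadoop), with synthetic data invented for the example.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
  // One k-means iteration: assign each point to its nearest centroid,
  // then recompute each centroid as the mean of its assigned points.
  static double[][] iterate(double[][] points, double[][] centroids) {
    int k = centroids.length, dim = centroids[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];

    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double d = 0;
        for (int j = 0; j < dim; j++) {
          double diff = p[j] - centroids[c][j];
          d += diff * diff;                 // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      counts[best]++;
      for (int j = 0; j < dim; j++) sums[best][j] += p[j];
    }

    double[][] next = new double[k][dim];
    for (int c = 0; c < k; c++) {
      for (int j = 0; j < dim; j++) {
        next[c][j] = counts[c] == 0 ? centroids[c][j] : sums[c][j] / counts[c];
      }
    }
    return next;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double[][] points = new double[200][2];
    for (int i = 0; i < points.length; i++) {
      // Two synthetic blobs around (0,0) and (5,5)
      double base = i < 100 ? 0.0 : 5.0;
      points[i][0] = base + rnd.nextGaussian();
      points[i][1] = base + rnd.nextGaussian();
    }
    double[][] centroids = { points[0], points[150] };  // naive seeding
    for (int iter = 0; iter < 10; iter++) {
      centroids = iterate(points, centroids);
    }
    System.out.println(Arrays.deepToString(centroids));
  }
}
```

The assignment loop treats every point independently, which is exactly the piece the KMeansMapper parallelizes in the slides that follow.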

Page 22: Intro to Mahout -- DC Hadoop

K-Means in Map-Reduce

• Input:
– Mahout Vectors representing the original content
– Either:
  • A predefined set of initial centroids (can be from Canopy)
  • --k – the number of clusters to produce

• Iterate
– Do the centroid calculation (more in a moment)

• Clustering Step (optional)

• Output
– Centroids (as Mahout Vectors)
– Points for each Centroid (if the Clustering Step was taken)

Page 23: Intro to Mahout -- DC Hadoop

Map-Reduce Iteration

• Each iteration calculates the Centroids using:
– KMeansMapper
– KMeansCombiner
– KMeansReducer

• Clustering Step
– Calculate the points for each Centroid using:
  • KMeansClusterMapper

Page 24: Intro to Mahout -- DC Hadoop

KMeansMapper

• During Setup:
– Load the initial Centroids (or the Centroids from the last iteration)

• Map Phase
– For each input vector
  • Calculate its distance from each Centroid and output the closest one (sketch below)

• Distance Measures are pluggable
– Manhattan, Euclidean, Squared Euclidean, Cosine, others
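The sketch below is not Mahout's actual KMeansMapper (which loads Cluster objects and emits richer records), but a stripped-down illustration of the same role, written against the standard Hadoop Mapper API and Mahout's Vector and DistanceMeasure types.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative only: the shape of the work matches the slide, not the real class.
public class NearestCentroidMapper
    extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  private final List<Vector> centroids = new ArrayList<Vector>();
  private final DistanceMeasure measure = new EuclideanDistanceMeasure();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Load the current centroids, e.g. from a SequenceFile side path
    // configured by the driver (loading code omitted in this sketch).
  }

  @Override
  protected void map(WritableComparable<?> key, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    Vector point = value.get();
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = measure.distance(centroids.get(i), point);
      if (d < best) { best = d; nearest = i; }
    }
    // Key the point by its nearest centroid so the combiner/reducer
    // can sum all the points that belong to each cluster.
    context.write(new IntWritable(nearest), value);
  }
}
```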

Page 25: Intro to Mahout -- DC Hadoop

KMeansReducer

• Setup:
– Load up clusters
– Convergence information
– Partial sums from KMeansCombiner (more in a moment)

• Reduce Phase
– Sum all the vectors in the cluster to produce a new Centroid (sketch below)
– Check for Convergence

• Output cluster
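Again as an illustration rather than the real KMeansReducer (which works with the partial sums and counts produced by KMeansCombiner so that combining and reducing compose correctly), the reducer's job boils down to averaging the vectors keyed to each cluster and checking how far the centroid moved.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative only: simplified single-pass reducer for the mapper sketch above.
public class CentroidReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private Vector[] previousCentroids;          // would be loaded in setup() in a real job
  private final double convergenceDelta = 0.001;

  @Override
  protected void reduce(IntWritable clusterId, Iterable<VectorWritable> points, Context context)
      throws IOException, InterruptedException {
    Vector sum = null;
    int count = 0;
    for (VectorWritable vw : points) {
      Vector v = vw.get();
      if (sum == null) {
        sum = v.like();                      // empty vector of the same cardinality
      }
      sum = sum.plus(v);
      count++;
    }
    Vector newCentroid = sum.divide(count);

    // Converged if the centroid barely moved since the last iteration;
    // signaling via a counter lets the driver decide when to stop iterating.
    if (previousCentroids != null) {
      double moved = new EuclideanDistanceMeasure()
          .distance(previousCentroids[clusterId.get()], newCentroid);
      if (moved < convergenceDelta) {
        context.getCounter("kmeans", "converged").increment(1);
      }
    }

    context.write(clusterId, new VectorWritable(newCentroid));
  }
}
```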

Page 26: Intro to Mahout -- DC Hadoop

KMeansCombiner

• Just like KMeansReducer, but it only produces a partial sum for each cluster, based on the data local to the Mapper

Page 27: Intro to Mahout -- DC Hadoop

KMeansClusterMapper

• Some applications only care about what the Centroids are, so this step is optional

• Setup:
– Load up the clusters and the DistanceMeasure used

• Map Phase
– Calculate which Cluster the point belongs to
– Output <ClusterId, Vector>