Apache Mahout: Driving the Yellow Elephant

Apache Mahout – Driving the Yellow Elephant, Grant Ingersoll, TriHUG (http://www.trihug.org)

Upload: grant-ingersoll

Post on 10-May-2015


Page 1: Apache Mahout: Driving the Yellow Elephant

Apache Mahout – Driving the Yellow Elephant

Grant Ingersoll

TriHUG http://www.trihug.org

Lucid Imagination, Inc.

Anyone Here Use Machine Learning?

•Any users of:
  o Google?
      Search?
      Priority Inbox?
  o Facebook?
  o Twitter?
  o LinkedIn?

Topics

•What is Machine Learning?

•ML Use Cases

•What is Mahout?

•A Word on Scaling

•What can I do with it right now?

•Mahout and Hadoop: An Example

What is Machine Learning?

[Images: Amazon.com, Google News]

Really it’s…

•“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  o Introduction to Machine Learning by E. Alpaydin

•Subset of Artificial Intelligence

•Lots of related fields:
  o Information Retrieval
  o Stats
  o Biology
  o Linear algebra
  o Many more

Common Use Cases

•Recommend friends/dates/products

•Classify content into predefined groups

•Find similar content based on object properties

•Find associations/patterns in actions/behaviors

•Identify key topics in large collections of text

•Detect anomalies in machine output

•Rank search results

•Others?

Apache Mahout

•An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  o http://mahout.apache.org
•Why Mahout?
  o Many Open Source ML libraries either:
      Lack Community
      Lack Documentation and Examples
      Lack Scalability
      Lack the Apache License ;-)
      Or are research-oriented

http://dictionary.reference.com/browse/mahout

Who uses Mahout?

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

What does scalable mean?

•Ted Dunning (Mahout committer):
  o As data grows linearly, either scale linearly in time or in machines
  o 2X data requires 2X time or 2X machines (or less!)
•Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  o Some algorithms won’t scale to massive machine clusters
  o Others fit logically on a Map Reduce framework like Apache Hadoop
  o Still others will need different distributed programming models
  o Be pragmatic

What Can I do with Mahout Right Now?

Recommendations

•Extensive framework for collaborative filtering

•Recommenders
  o User based
  o Item based
•Online and Offline support
  o Offline can utilize Hadoop
•Many different Similarity measures
  o Cosine, LLR, Tanimoto, Pearson, others

It’s Valentine’s Day soon!
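To make the similarity measures on this slide concrete, here is a minimal Python sketch of three of them (cosine, Tanimoto, Pearson) over sparse rating dictionaries, plus a tiny user-based neighbor lookup. This is not Mahout's API: the function names, the `ratings` data, and the `top_n` parameter are hypothetical and exist only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors (dicts of item -> rating)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tanimoto(a, b):
    """Tanimoto coefficient on the sets of rated items; rating values are ignored."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


def pearson(a, b):
    """Pearson correlation computed over the items both users rated."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[k] for k in common]
    ys = [b[k] for k in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def most_similar_users(target, ratings, similarity, top_n=2):
    """Rank every other user by similarity to the target user (user-based neighborhood)."""
    scores = [(similarity(ratings[target], ratings[u]), u) for u in ratings if u != target]
    scores.sort(reverse=True)
    return [user for _, user in scores[:top_n]]


# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob": {"a": 4.0, "b": 2.0, "d": 1.0},
    "carol": {"d": 5.0},
}
```

Because the similarity function is passed in as an argument, swapping `tanimoto` for `cosine` in `most_similar_users` changes the measure without touching the neighbor lookup.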

Clustering

• Document level
  o Group documents based on a notion of similarity
  o K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  o Distance Measures
      Manhattan, Euclidean, other
• Topic Modeling
  o Cluster words across documents to identify topics
  o Latent Dirichlet Allocation

Categorization

• Place new items into predefined categories:
  o Sports, politics, entertainment
  o Recommenders
• Implementations
  o Naïve Bayes
  o Complementary Naïve Bayes
  o Decision Forests
  o Linear Regression
• See Chapter 17 of Mahout in Action for Shop It To Me use case:
  o http://awe.sm/5FyNe
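As an illustration of placing new items into predefined categories, the following is a minimal multinomial Naïve Bayes sketch in plain Python with add-one smoothing. It is an assumption-laden toy, not Mahout's implementation: `train`, `classify`, and the sample documents are hypothetical names and data.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count labels and per-label word occurrences from (label, tokens) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior for the category
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero out a category
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for two predefined categories
docs = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["team", "score", "win"]),
    ("politics", ["vote", "election", "senate"]),
    ("politics", ["election", "policy", "vote"]),
]
model = train(docs)
```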

Frequent Pattern Mining

• Identify frequently co-occurring items
• Useful for:
  o Query Recommendations
      Apple -> iPhone, orange, OS X
  o Related product placement
      Basket Analysis

http://www.amazon.com
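A rough way to see what "frequently co-occurring items" means for basket analysis is to count item pairs across baskets and keep those above a support threshold. The sketch below does exactly that with plain pair counting; it is a simplification chosen for illustration, not the algorithm Mahout uses, and `frequent_pairs`, `min_support`, and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support=2):
    """Count how often each unordered item pair shares a basket; keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping baskets
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]
```

With these baskets and `min_support=2`, the pairs that survive are ("beer", "chips") and ("beer", "diapers"), each seen twice.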

Evolutionary

•Map-Reduce ready fitness functions for genetic programming

•Integration with Watchmaker
  o http://watchmaker.uncommons.org/index.php
•Problems solved:
  o Traveling salesman
  o Class discovery
  o Many others

•Caveat: Hasn’t received as much attention as others

Other

•Primitive Collections!

•Math library
  o Vectors, Matrices, etc.

•Noise Reduction via Singular Value Decomposition

•Export from Lucene/Solr and other formats

Mahout and Hadoop

•Most Mahout implementations are built on Map-Reduce
  o Many also have sequential implementations
  o Linear Regression is blazingly fast without needing M/R

•Let’s look at how K-Means is implemented in Mahout

K-Means

•Clustering Algorithm
  o Nicely parallelizable!

http://en.wikipedia.org/wiki/K-means_clustering

K-Means in Map-Reduce

•Input:
  o Mahout Vectors representing the original content
  o Either:
      A predefined set of initial centroids (can be from Canopy)
      --k: the number of clusters to produce
•Iterate
  o Do the centroid calculation (more in a moment)
•Clustering Step (optional)
•Output
  o Centroids (as Mahout Vectors)
  o Points for each Centroid (if Clustering Step was taken)
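The input, iterate, optional clustering step, and output flow above can be sketched as a small sequential K-Means in Python. This is an in-memory illustration under stated assumptions, not Mahout's driver: vectors are plain tuples, and the names `kmeans`, `max_iterations`, `convergence_delta`, and `cluster_points` are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, centroids, max_iterations=10, convergence_delta=1e-3, cluster_points=True):
    """Iterate the centroid calculation, then optionally run the clustering step."""
    for _ in range(max_iterations):
        # Group every point under its closest current centroid
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # New centroid is the mean of its group; an empty group keeps the old centroid
        updated = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        converged = all(
            euclidean(c, u) <= convergence_delta for c, u in zip(centroids, updated)
        )
        centroids = updated
        if converged:
            break
    assignments = None
    if cluster_points:
        # Optional clustering step: list the points that belong to each final centroid
        assignments = {i: [] for i in range(len(centroids))}
        for p in points:
            assignments[nearest(p, centroids)].append(p)
    return centroids, assignments


# Hypothetical input vectors and a predefined set of initial centroids
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Passing `cluster_points=False` returns only the centroids, which matches applications that do not need the per-point assignments.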

Map-Reduce Iteration

•Each Iteration calculates the Centroids using:
  o KMeansMapper
  o KMeansCombiner
  o KMeansReducer
•Clustering Step
  o Calculate the points for each Centroid using:
      KMeansClusterMapper

KMeansMapper

•During Setup:
  o Load the initial Centroids (or the Centroids from the last iteration)
•Map Phase
  o For each input
      Calculate its distance from each Centroid and output the closest one
•Distance Measures are pluggable
  o Manhattan, Euclidean, Squared Euclidean, Cosine, others
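The map phase and the pluggable distance measures can be illustrated with a short Python sketch in which the distance is simply a function argument. The names here (`kmeans_map`, `cosine_distance`, the centroid ids) are hypothetical and only mimic the idea; Mahout's actual classes and signatures are not shown.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans_map(point, centroids, distance=euclidean):
    """Emit (centroid_id, point) for the centroid closest to the point.

    `centroids` maps an id to a vector and stands in for the centroids
    loaded during setup.
    """
    closest = min(centroids, key=lambda cid: distance(point, centroids[cid]))
    return closest, point


# Hypothetical centroids loaded at setup
centroids = {"x": (1.0, 0.0), "y": (5.0, 5.0)}
```

The choice of measure can change the assignment: for the point `(0.5, 0.4)`, Euclidean and Manhattan distance pick centroid "x" (it is nearby), while cosine distance picks "y" (it points in a more similar direction).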

KMeansReducer

•Setup:
  o Load up clusters
  o Convergence information
  o Partial sums from KMeansCombiner (more in a moment)
•Reduce Phase
  o Sum all the vectors in the cluster to produce a new Centroid
  o Check for Convergence
•Output cluster
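A minimal sketch of the reduce phase, assuming each incoming value is a `(vector_sum, count)` partial for one cluster: the partials are merged, divided by the total count to get the new centroid, and compared with the previous centroid to decide convergence. `kmeans_reduce` and `convergence_delta` are hypothetical names, and using the Euclidean shift of the centroid as the convergence test is an assumption made for this example.

```python
import math


def kmeans_reduce(cluster_id, partials, old_centroid, convergence_delta=1e-3):
    """Merge (vector_sum, count) partials into a new centroid and flag convergence."""
    total = [0.0] * len(old_centroid)
    count = 0
    for vector_sum, n in partials:
        total = [t + v for t, v in zip(total, vector_sum)]
        count += n
    # A cluster that received no points keeps its previous centroid
    new_centroid = tuple(t / count for t in total) if count else tuple(old_centroid)
    # Converged when the centroid moved no more than convergence_delta
    shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_centroid, old_centroid)))
    return cluster_id, new_centroid, shift <= convergence_delta
```

For example, merging the partials `((2.0, 3.0), 2)` and `((4.0, 6.0), 1)` for a cluster whose old centroid was `(0.0, 0.0)` gives the new centroid `(2.0, 3.0)`, which has moved and is therefore not yet converged.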

KMeansCombiner

•A Combiner is like a Map-side Reducer which helps save on IO

•Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper
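To show why a combiner saves IO, the sketch below collapses one mapper's `(cluster_id, vector)` records into a single `(vector_sum, count)` partial per cluster, so only one record per cluster leaves the node. The function name `kmeans_combine` and the record layout are assumptions for illustration, not Mahout's types.

```python
from collections import defaultdict


def kmeans_combine(mapper_output):
    """Collapse local (cluster_id, vector) records into one (vector_sum, count) per cluster."""
    sums = {}
    counts = defaultdict(int)
    for cluster_id, vector in mapper_output:
        if cluster_id in sums:
            sums[cluster_id] = tuple(s + v for s, v in zip(sums[cluster_id], vector))
        else:
            sums[cluster_id] = tuple(vector)
        counts[cluster_id] += 1
    return {cid: (sums[cid], counts[cid]) for cid in sums}


# Hypothetical output of a single mapper: three records for two clusters
local_records = [
    ("c0", (1.0, 1.0)),
    ("c0", (1.0, 2.0)),
    ("c1", (9.0, 9.0)),
]
partials = kmeans_combine(local_records)
```

Here three records shrink to two partials; on a real split with many points per cluster the reduction in shuffled data is far larger, and the reducer can still recover the exact mean because the counts travel with the sums.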

KMeansClusterMapper

•Some applications only care about what the Centroids are, so this step is optional

•Setup:
  o Load up the clusters and the DistanceMeasure used
•Map Phase
  o Calculate which Cluster the point belongs to
  o Output <ClusterId, Vector>

Summary

•Machine learning is all over the web today

•Mahout is about scalable machine learning

•Mahout has functionality for many of today’s common machine learning tasks

•Many Mahout implementations use Hadoop

•K-Means clustering is an example of a machine learning algorithm in Mahout that is implemented using Map-Reduce

Resources

•http://mahout.apache.org

•http://cwiki.apache.org/MAHOUT

•{user|dev}@mahout.apache.org

•http://svn.apache.org/repos/asf/mahout/trunk

•http://hadoop.apache.org

Resources

•“Mahout in Action” by Owen, Anil, Dunning and Friedman
  o http://awe.sm/5FyNe

•“Introducing Apache Mahout”
  o http://www.ibm.com/developerworks/java/library/j-mahout/

•“Programming Collective Intelligence” by Toby Segaran

•“Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank

References

•HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg
•Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg
•Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg
•Google News: http://news.google.com
•Amazon.com: http://www.amazon.com
•Facebook: http://www.facebook.com
•Couple: http://www.vlemx.com/
•Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/
  o http://www.theregister.co.uk/2006/08/15/beer_diapers/
•DMOZ: http://www.dmoz.org
•Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html