Presentation on Apache Mahout at Linux Conference Australia 2011
Orchestrating the Intelligent Web with Apache Mahout
Presented by Aneesha Bakharia
Twitter: aneesha
Email: [email protected]
What is Apache Mahout?
• Open source
• Machine Learning Java library
• Scalable (Apache Hadoop)
• Framework for developing, testing and deploying large-scale algorithms
http://mahout.apache.org/
What’s in a Name?
• Mahout is Hindi for Elephant Driver
What is Apache Mahout?
• Framework
– Vector Math/Matrices (e.g. SVD)
– Collections
– Hadoop
• Algorithms
– Classification, Clustering, etc.
• Your Application???
– You can orchestrate the intelligent web!!!
A New Breed of Developer
• Key Skills
– Databases
– Programming
– Networking
– Security
• …but now also
– Distributed data processing is fast becoming an essential part of the developer’s toolbox.
You never know where you will use Probability and Statistics!!!!
Video snippet from Equilibrium:
http://en.wikipedia.org/wiki/Equilibrium_%28film%29
You never know what you will discover!!!!
Where people swear in the United States?
http://flowingdata.com/2011/01/25/where-people-swear-in-the-united-states/
Algorithms in Apache Mahout
• Recommendation (collaborative filtering)
• Clustering
• Classification
• Evolutionary Algorithms
Algorithms in Apache Mahout
• Top 10 algorithms in data mining
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.
• Already supported: k-Means, Apriori (FP-Growth), kNN, Naive Bayes, SVM (coming)
Requirements
• Java 1.6
java -version
• Maven 2.2
mvn --version
• Hadoop 0.20
Running Mahout
• Command line launcher
bin/mahout (This shows the list of algorithms)
Valid program names are:
  canopy: : Canopy clustering
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  dirichlet: : Dirichlet Clustering
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lda: : Latent Dirchlet Allocation
  ldatopics: : LDA Print Topics
  lucene.vector: : Generate Vectors from a Lucene index
  matrixmult: : Take the product of two matrices
  meanshift: : Mean Shift clustering
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  …..
Running Mahout
• Run any algorithm, e.g. kmeans, locally
bin/mahout kmeans --help
Job-Specific Options:
  --input (-i) input
  --output (-o) output
  --distanceMeasure (-dm) e.g. SquaredEuclidean
  --numClusters (-k) k
Running Mahout
• Scale out
Runs on a cluster as per the conf files in the Hadoop directory
• export HADOOP_HOME=/pathto/hadoop-0.20.2/
• Need to use the driver classes
KMeansDriver.runJob(Path input, Path output ...)
Clustering
• Unsupervised Machine Learning technique
• Organise items into clusters/groups based upon similarity
• Good for finding patterns and exploring data
Clustering
• Lots of Algorithms:
k-means, Fuzzy K-means, Mean Shift, Canopy, Dirichlet Process, Latent Dirichlet Allocation
• Similarity Distance Measures
– Euclidean
– Cosine
– Tanimoto
– Manhattan
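The four distance measures listed above can be sketched in a few lines of plain Python. This is only an illustration of the maths, not Mahout's DistanceMeasure classes; the function names are made up for this example, and the cosine and Tanimoto versions assume non-zero vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each dimension
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle); ignores vector length, only direction matters
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b / (|a|^2 + |b|^2 - a.b)); an extended Jaccard measure
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

a = [10.0, 2.0, 4.0]
b = [5.0, 1.0, 2.0]   # same direction as a, half the length
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b), tanimoto_distance(a, b))
```

Note that a and b point in the same direction, so the cosine distance is (almost exactly) zero even though the Euclidean distance is not. Which behaviour you want depends on your data.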
Vectors
• Documents
Bag of words
word1 => 10
word2 => 2
word3 => 4
Resulting vector [10.0, 2.0, 4.0, .... ]
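A minimal sketch of the bag-of-words idea in plain Python. The function name, the sample sentence and the fixed vocabulary are assumptions for illustration; Mahout's own vectorizers build the dictionary for you.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Count each word, then read the counts off in vocabulary order
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

doc = "mahout runs on hadoop and mahout scales"
vocabulary = ["mahout", "hadoop", "elephant"]
vector = bag_of_words(doc, vocabulary)
print(vector)
```

Each position in the vector corresponds to one word in the vocabulary, so every document maps to a vector of the same length.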
Range of Vectorization Tools
• Collate multiple words (n-grams)
• Normalization
• TF-IDF
• Stop word removal
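To make these options concrete, here is a rough plain-Python sketch of stop word removal, bigrams and a textbook TF-IDF weighting (term count times log(N / document frequency)). The stop word list and sample documents are invented, and the exact weighting formula Mahout applies may differ from this one.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "on", "and", "is"}  # tiny illustrative list

def tokenize(text):
    # Lowercase, split on whitespace, drop stop words
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def bigrams(tokens):
    # Collate adjacent word pairs (n-grams with n = 2)
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the elephant is on the hill",
    "the driver and the elephant",
    "a hadoop cluster",
]
weights = tf_idf(docs)
print(weights[0])
print(bigrams(tokenize(docs[0])))
```

In the first document "hill" gets a higher weight than "elephant" because "elephant" also appears in the second document, so it is less distinctive.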
kmeans Example
• Set of text files in a directory
• Use seqdirectory to convert the files to SequenceFiles
bin/mahout seqdirectory -i <input> -o <seq-output>
• Use seq2sparse to convert to sparse vectors
bin/mahout seq2sparse -i <seq-output> -o <vector-output>
• Run kmeans with k=5
bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 5
• View output
bin/mahout clusterdump
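What the kmeans step does can be sketched in plain Python. This is a toy, single-machine version of the standard (Lloyd's) k-means iteration with hand-picked starting centroids and invented 2-D points; it is not Mahout's distributed implementation.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Component-wise average of a non-empty list of points
    return [sum(dim) / len(points) for dim in zip(*points)]

def kmeans(points, centroids, iterations=10):
    # Assumes iterations >= 1
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

With k=2 the two obvious groups are found. On real data the result depends on k, on the starting centroids and on the distance measure.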
Easy enough, but
• How do you know k?
• Data Exploration is required to find the
– k for your purposes
– Similarity distance for your purpose
• Role for the Data Scientist
– Explore, Model, Test and Evaluate
Recommender Engines
• The machine learning technique you encounter the most
• Recommend products (books, movies, etc.) based upon past actions
• Infer tastes and preferences to identify unknown items of interest
Recommendation
• Algorithms: user-based and item-based recommendation
• Framework for storage, online and offline computation
• Similarity Measures
– Cosine
– Tanimoto
– Pearson
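A rough sketch of user-based recommendation with Pearson similarity, in plain Python. The ratings are invented and the function names are hypothetical; Mahout's recommender framework offers the same idea with pluggable similarity and neighbourhood classes.

```python
import math

# Invented ratings: user -> {item: rating}
ratings = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob":   {"book_a": 4.0, "book_b": 2.0, "book_c": 3.0, "book_d": 5.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_c": 2.0, "book_d": 1.0},
}

def pearson(u, v):
    # Correlation of two users' ratings over the items both have rated
    common = sorted(set(u) & set(v))
    n = len(common)
    if n < 2:
        return 0.0
    mean_u = sum(u[i] for i in common) / n
    mean_v = sum(v[i] for i in common) / n
    cov = sum((u[i] - mean_u) * (v[i] - mean_v) for i in common)
    sd_u = math.sqrt(sum((u[i] - mean_u) ** 2 for i in common))
    sd_v = math.sqrt(sum((v[i] - mean_v) ** 2 for i in common))
    if sd_u == 0 or sd_v == 0:
        return 0.0
    return cov / (sd_u * sd_v)

def recommend(user):
    # Similarity-weighted average rating of items the user has not seen,
    # using only positively correlated users
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)

print(recommend("alice"))
```

Here bob rates like alice and carol rates the opposite way, so bob's high rating of book_d drives the recommendation.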
Frequent Pattern Mining
• Discover interesting patterns based upon how items occur in a sequence
• Example
Sales Transactions
(Bread, Milk and Eggs)
(Nappies, Beer)
• Parallel FPGrowth Algorithm
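To show what a "frequent pattern" is, the sketch below counts item pairs by brute force in plain Python. This is deliberately not FP-Growth (which avoids enumerating every candidate), and the last two transactions are invented to give the slide's example something to count.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk", "eggs"},
    {"nappies", "beer"},
    {"bread", "milk"},            # invented
    {"nappies", "beer", "milk"},  # invented
]

def frequent_pairs(transactions, min_support):
    # Count every pair of items that occurs together in a basket,
    # then keep the pairs seen at least min_support times
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

Brute-force counting grows quickly with basket size, which is why an algorithm such as Parallel FPGrowth is needed on large data.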
Classification
• Set of classes/categories (observed pattern)
• Decide if a new input matches a category
• Supervised technique – need training
• E.g. spam or not spam
Classification
• Algorithms:
Naive Bayes, Random Forest Decision Tree, SVM coming
• Learn a model from a manually labelled training dataset
• Predict the class of an unseen object based on features
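The learn-then-predict cycle can be sketched with a tiny multinomial Naive Bayes classifier in plain Python. The training messages, the "spam"/"ham" labels and the function names are invented for illustration, and add-one smoothing is an assumption of this sketch; it is not Mahout's Bayes trainer.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    # Learn per-class document counts and per-class word counts
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("win cash prize now", "spam"),
    ("cheap prize win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify("win a prize", model))
print(classify("monday meeting", model))
```

The features here are just word counts; in practice the choice of features matters at least as much as the choice of algorithm.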
Latent Dirichlet Allocation
– Convert text to term-document matrix
– LDA produces
• word-theme mapping
• theme-document mapping
– Allows topic overlap
– Need to specify number of Topics (k)
Latent Dirichlet Allocation
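• Tweets (Tweet 1, Tweet 2, Tweet 3) are converted to a Term-Document Matrix
• LDA is run on the matrix with a specified number of themes (k)
• Output: Tweet to Topic Mapping X Topic to Word Mapping

Term-Document Matrix
         Word 1  Word 2  Word n
Doc 1    1       0       2
Doc 2    0       1       0
Doc 3    0       1       1

Topic to Word Mapping
         Word 1  Word 2  Word n
Topic 1  0.5     0       1
Topic 2  0       0.5     0

Tweet to Topic Mapping
         Topic 1  Topic 2
Doc 1    1        0
Doc 2    0        1
Doc 3    0        1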
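The "X" on this slide can be read as a matrix product: multiplying the tweet-to-topic mapping by the topic-to-word mapping gives a rough reconstruction of the term-document matrix. The plain-Python sketch below uses the slide's own numbers; treating the diagram as a literal matrix product is an interpretation, and real LDA output is a set of probability distributions rather than these toy values.

```python
# Values copied from the slide
doc_topic = [[1, 0],
             [0, 1],
             [0, 1]]
topic_word = [[0.5, 0, 1],
              [0, 0.5, 0]]

def matmul(a, b):
    # Plain nested-loop matrix product: (rows of a) x (columns of b)
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

doc_word = matmul(doc_topic, topic_word)
for row in doc_word:
    print(row)
```

The product has the same shape as the term-document matrix (3 documents by 3 words) and roughly the same pattern of zero and non-zero entries, which is the sense in which the two small mappings summarise the large matrix.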
Latent Dirichlet Allocation
– Run LDA
bin/mahout lda --input <PATH> --output <PATH> --numTopics 20
– View Topics
bin/mahout LDAPrintTopics --input <PATH> --output <PATH> --dictionaryType sequencefile
Suggesting Twitter Lists
– Twitter introduced Lists, which group people you follow so you can see only their timeline of tweets
– Build an application that can recommend people who should be grouped in the same list.
– Use LDA because it allows overlapping list membership. This is great because people talk about multiple topics.
Suggesting Twitter Lists
– Twitter API Tasks
• Get list of people that a user follows
• Retrieve tweets for each person
• Save Lists back to Twitter
– Data Processing
• Combine all tweets for a person
• Remove stop words
• Stem words
• Create a user-word matrix
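The data processing steps above can be sketched in plain Python. The sample tweets, the stop word list and the crude suffix-stripping stemmer are all stand-ins invented for this example; a real application would fetch tweets through the Twitter API and use a proper stemmer (for example a Porter stemmer).

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "to", "and", "of", "in"}  # illustrative only

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a few common suffixes
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def user_word_matrix(tweets):
    # 1. Combine all tweets for a person
    combined = defaultdict(list)
    for user, text in tweets:
        combined[user].extend(text.lower().split())
    # 2. Remove stop words, 3. stem, then count words per user
    counts = {
        user: Counter(crude_stem(w) for w in words if w not in STOP_WORDS)
        for user, words in combined.items()
    }
    # 4. Create the user-word matrix (rows: users, columns: words)
    vocabulary = sorted({w for c in counts.values() for w in c})
    users = sorted(counts)
    matrix = [[counts[u][w] for w in vocabulary] for u in users]
    return users, vocabulary, matrix

tweets = [
    ("ann", "Clustering tweets in Hadoop"),
    ("bob", "Cooking the pasta"),
    ("ann", "Hadoop clusters"),
]
users, vocabulary, matrix = user_word_matrix(tweets)
print(vocabulary)
for user, row in zip(users, matrix):
    print(user, row)
```

The resulting matrix plays the role of the term-document matrix for LDA, with one "document" per person.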
Suggesting Twitter Lists
– Web UI
• Authenticate to Twitter
• Display suggested lists (based on estimate of k)
(Could also display the important tweets that place the person in the group?)
• Allow users to change k, i.e. decide on the number of Lists
• Allow group re-organisation with jQuery sortables
Gently Getting into Machine Learning and Data Mining
• Programming Collective Intelligence
by Toby Segaran
• Mahout in Action
by Owen, Anil, Dunning and Friedman
Summary
• Mahout offers good abstraction for building intelligent web applications
• Skills in data analysis and exploration are now more important than ever
• Mahout is a good platform for distributed algorithm development
Fascinating Algorithms
• My Top 3 algorithms
– Some interesting and some disturbing and interesting at the same time
Fascinating Algorithms
• No 3 – Identifying Manipulated Images
http://www.technologyreview.com/computing/20423/page1/
Fascinating Algorithms
• No 2 – Seam Carving
Content Aware Resizing
Example http://swieskowski.net/carve/
Disturbing Algorithms
• No 1 – Digital Face Beautification
http://leyvand.com/research/beautification/dfb_sketch.pdf
Image from Shrek Copyright Dreamworks
Discussion/Questions
• What will you build?