Xebia Knowledge Exchange (March 2011) - Machine Learning with Apache Mahout
![Page 1: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/1.jpg)
Machine Learning with Apache Mahout
Classification, Clustering and Recommendation
3/3/2011 Michaël Figuière
![Page 2: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/2.jpg)
Machine Learning
![Page 3: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/3.jpg)
Machine Learning
Machine Learning is a subset of Artificial Intelligence
![Page 4: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/4.jpg)
NoSQL, Search and Machine Learning
NoSQL, Search and Machine Learning complement each other very well!
![Page 5: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/5.jpg)
Machine Learning algorithms
• Recommendations: advise users with recommended items
• Classification: automatically classify documents based on a given set of examples
• Clustering: automatically discover groups within a set of documents
• Pattern mining, evolutionary algorithms, ...
![Page 6: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/6.jpg)
Recommendation - User based
Amazon suggests items bought by similar customers
![Page 7: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/7.jpg)
Recommendation - Item based
On a product page, Amazon uses item-based recommendation
![Page 8: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/8.jpg)
Similarities between users
[Diagram: items A, B, C, D, E and F, and users 1 and 2]
Here we observe that users 1 and 2 have similar tastes
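One common way to quantify "similar tastes" is the Pearson correlation between two users' ratings of the items they have both rated. The following is a minimal plain-Java sketch of that idea, not Mahout's implementation. The class name `UserSimilaritySketch` and the sample ratings are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, written for illustration only.
final class UserSimilaritySketch {

    /** Pearson correlation over the items both users have rated; NaN if it is undefined. */
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN; // not enough common items
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN; // one user gave the same rating to every common item
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Made-up ratings: user 2 rates every common item one point lower than user 1.
        Map<String, Double> user1 = new HashMap<>();
        user1.put("A", 5.0);
        user1.put("B", 3.0);
        user1.put("C", 4.0);
        Map<String, Double> user2 = new HashMap<>();
        user2.put("A", 4.0);
        user2.put("B", 2.0);
        user2.put("C", 3.0);
        user2.put("D", 5.0);
        System.out.println("similarity(1, 2) = " + pearson(user1, user2));
    }
}
```

A value close to 1 means the two users rank their common items in the same way, a value close to -1 means opposite tastes.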
![Page 9: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/9.jpg)
Recommendation use cases
• Suggest items to users on e-commerce websites, and increase revenue
• Suggest features a user may be interested in within a Web application, as most features are usually unknown to users
• Filter and adapt the scoring of search engine results, based on similar users' clicks, ...
![Page 10: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/10.jpg)
Classification
Emails classified as spam by Gmail
![Page 11: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/11.jpg)
Classification use cases
• Automatically attach tags to documents, based on existing manual tagging, Wikipedia, ...
• Extract suspicious documents: spam, corrupted documents, ...
![Page 12: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/12.jpg)
Clustering
Trending topics discovered by Google News
![Page 13: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/13.jpg)
Clustering with K-Means
[Diagram: data points A, B, C, D, E and F]
![Page 14: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/14.jpg)
Clustering with K-Means
[Diagram: data points A to F and the cluster centers]
Cluster centers start with random initial positions
![Page 15: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/15.jpg)
Clustering with K-Means
[Diagram: data points A to F and the cluster centers]
Data points are attached to the nearest cluster center
![Page 16: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/16.jpg)
Clustering with K-Means
[Diagram: data points A to F and the cluster centers]
Each cluster center is moved to the mean of its attached data points, which minimizes the sum of squared distances
![Page 17: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/17.jpg)
Clustering with K-Means
[Diagram: data points A to F and the cluster centers]
Data point C is then attached to the first center, as it has become the nearest
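The K-Means steps walked through on the previous slides (attach each data point to its nearest center, move the centers, repeat) can be sketched in plain Java as follows. This is an illustrative single-machine version, not Mahout's MapReduce implementation. The class name `KMeansSketch`, the coordinates of points A to F and the fixed initial centers (the slides use random ones) are all assumptions chosen so that point C changes cluster, as on the slides.

```java
import java.util.Arrays;

// Hypothetical class, written for illustration only.
final class KMeansSketch {

    /** Updates 'centers' in place and returns the cluster index of each point. */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Step 1: attach each point to its nearest center.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (squaredDistance(points[p], centers[c])
                            < squaredDistance(points[p], centers[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point moved to another cluster: converged
            }
            // Step 2: move each center to the mean of its attached points.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[centers[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) {
                        continue;
                    }
                    for (int d = 0; d < sum.length; d++) {
                        sum[d] += points[p][d];
                    }
                    count++;
                }
                if (count == 0) {
                    continue; // empty cluster: leave its center where it is
                }
                for (int d = 0; d < sum.length; d++) {
                    centers[c][d] = sum[d] / count;
                }
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double total = 0;
        for (int d = 0; d < a.length; d++) {
            total += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up coordinates for points A, B, C, D, E, F.
        double[][] points = {{1, 1}, {2, 1}, {4, 3}, {7, 6}, {8, 6}, {8, 7}};
        double[][] centers = {{1, 0}, {5, 4}}; // fixed here instead of random
        int[] assignment = cluster(points, centers, 10);
        System.out.println("assignment = " + Arrays.toString(assignment));
        System.out.println("centers    = " + Arrays.deepToString(centers));
    }
}
```

With these sample values, C is first attached to the second center and joins the first one once the centers have moved.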
![Page 18: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/18.jpg)
Clustering use cases
• Find key topics in a set of documents: news feeds, business documents, ...
• Find typical behaviors within a set of users: visit frequency, buying habits, ...
![Page 19: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/19.jpg)
Apache Mahout
![Page 20: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/20.jpg)
In a few words
• Implementation of machine learning algorithms in Java: a continuously growing collection of algorithms
• Most of them come in a MapReduce implementation for Hadoop: scalable to huge datasets
• Still quite young but growing fast: started in early 2009
• Intended to be for Machine Learning what Lucene is for Information Retrieval
![Page 21: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/21.jpg)
Documentation
![Page 22: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/22.jpg)
Recommendation example
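// Load "userID,itemID,preference" lines from a CSV file
DataModel model = new FileDataModel(new File("data.csv"));
// Measure how close the tastes of two users are
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Keep the 2 most similar users as the neighborhood
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// Ask for 1 recommended item for the user with ID 1
List<RecommendedItem> recommendations = recommender.recommend(1, 1);
The code for a basic recommendation is pretty straightforward!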
![Page 23: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/23.jpg)
Classification with Mahout
Training examples → Training algorithm → Model
New data → Model (copy) → Decision
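The train-then-decide flow of this slide can be illustrated with a tiny multinomial Naive Bayes text classifier in plain Java. This is only a sketch under assumptions: it does not use Mahout's classifier API, and the class name `NaiveBayesSketch`, the tokenizer and the sample messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class, written for illustration only.
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Training step: update the model with one labeled example. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Decision step: most probable label for new data, or null if nothing was trained. */
    String classify(String text) {
        List<String> words = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior of the label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int labelWords = wordsPerLabel.getOrDefault(label, 0);
            for (String word : words) {
                int count = counts.getOrDefault(word, 0);
                // Laplace smoothing so unseen words do not zero out the probability.
                score += Math.log((count + 1.0) / (labelWords + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        NaiveBayesSketch model = new NaiveBayesSketch();
        // Training examples (made up).
        model.train("spam", "win money now");
        model.train("spam", "cheap money offer");
        model.train("ham", "meeting agenda for monday");
        model.train("ham", "project status report");
        // New data -> decision.
        System.out.println(model.classify("win cheap money"));
    }
}
```

The object filled by `train` plays the role of the model on the slide, and `classify` produces the decision for new data.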
![Page 24: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/24.jpg)
Clustering with Mahout
Documents → Clustering algorithm → List of clusters
![Page 25: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/25.jpg)
Relevance evaluation
The entire dataset is split into two parts:
• Data used for training
• Data used to evaluate the relevance of an algorithm and its settings
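A minimal plain-Java sketch of such a split is given below, together with one possible relevance measure (the mean absolute difference between predicted and actual values). The class name `HoldoutSplitSketch`, the 80/20 ratio and the seed are assumptions made for illustration. Mahout ships its own evaluators, which are not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical class, written for illustration only.
final class HoldoutSplitSketch {

    /** Returns [training, evaluation] after a seeded shuffle; the input list is not modified. */
    static <T> List<List<T>> split(List<T> dataset, double trainingFraction, long seed) {
        if (trainingFraction < 0.0 || trainingFraction > 1.0) {
            throw new IllegalArgumentException("trainingFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(dataset);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Mean absolute difference between predictions and actual values of the evaluation set. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double total = 0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        List<Integer> dataset = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            dataset.add(i);
        }
        List<List<Integer>> parts = split(dataset, 0.8, 42L);
        System.out.println("training:   " + parts.get(0));
        System.out.println("evaluation: " + parts.get(1));
    }
}
```

Keeping the evaluation part out of training is what makes the measured relevance meaningful when comparing algorithms and settings.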
![Page 26: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/26.jpg)
A search engine use case
![Page 27: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/27.jpg)
A Search Engine
Search
![Page 28: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/28.jpg)
A Search Engine
Search: MyCustomer
![Page 29: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/29.jpg)
A Search Engine
Search: MyCustomer
Document: Non Disclosure Agreement, 12 days ago
... MyCustomer agrees not to disclose any part of ...
Document: 2010 Sales Report, 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 2 days ago
Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
![Page 30: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/30.jpg)
Indexing Pipeline
[Diagram: Phone Call, Text Extractor (Tika), two Analyzers and the Search Index (Lucene)]
![Page 31: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/31.jpg)
A more complex Search Engine
Search: MyCustomer
Categories: Sales, Legal, Accounting
Document: 2010 Sales Report, 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 2 days ago
Customer: MyCustomer, Time: 9:55am, Duration: 13min, Description: Invoice not received for order #2354E
![Page 32: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/32.jpg)
Indexing Pipeline with Mahout
[Diagram: Phone Call, Text Extractor (Tika), two Classifiers (Mahout), two Analyzers and the Search Index (Lucene)]
![Page 33: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/33.jpg)
Query pipeline
[Diagram: Query, Analyzer and Search Index (Lucene), Results]
![Page 34: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/34.jpg)
Query pipeline with Mahout
Using Mahout recommendations
[Diagram: Query, two Analyzers, Custom Scoring and Search Index (Lucene), Results]
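The slide does not detail how the custom scoring combines search relevance and recommendations. One possible approach, shown here as a plain-Java sketch under assumptions, is to boost each hit's text score by the user's estimated preference for that document. The class `CustomScoringSketch`, the `Hit` type, the blending formula and the `weight` parameter are all hypothetical, and no Lucene or Mahout API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class, written for illustration only.
final class CustomScoringSketch {

    /** A search hit: a document id plus the text relevance score from the search index. */
    static final class Hit {
        final String docId;
        final double textScore;

        Hit(String docId, double textScore) {
            this.docId = docId;
            this.textScore = textScore;
        }
    }

    /** Boosts the text score by the user's estimated preference (0 when unknown). */
    static double blend(Hit hit, Map<String, Double> preferences, double weight) {
        double preference = preferences.getOrDefault(hit.docId, 0.0);
        return hit.textScore * (1.0 + weight * preference);
    }

    /** Returns a new list sorted by blended score, highest first. */
    static List<Hit> rerank(List<Hit> hits, Map<String, Double> preferences, double weight) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> blend(h, preferences, weight)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up hits and preferences.
        List<Hit> hits = Arrays.asList(
                new Hit("sales-report", 1.0),
                new Hit("phone-call", 0.8),
                new Hit("nda", 0.5));
        Map<String, Double> preferences = new HashMap<>();
        preferences.put("phone-call", 0.9); // e.g. estimated from similar users' clicks
        for (Hit hit : rerank(hits, preferences, 1.0)) {
            System.out.println(hit.docId + " " + blend(hit, preferences, 1.0));
        }
    }
}
```

A multiplicative boost keeps hits with no known preference at their original score. Other blends (for instance a weighted sum) would work too, depending on how comparable the two scores are.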
![Page 35: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/35.jpg)
Conclusion
• Machine learning brings a lot of valuable features for enterprises: increased revenue, better productivity, user adoption, ...
• Mahout is growing fast and is becoming a great choice for Java apps, with easy integration into business applications
• Business people are not used to this kind of use case: collaboration with technical folks is mandatory
![Page 36: Xebia Knowledge Exchange (mars 2011) - Machine Learning with Apache Mahout](https://reader034.vdocuments.site/reader034/viewer/2022042613/54b777d14a7959df648b46bc/html5/thumbnails/36.jpg)
Questions / Answers
@mfiguiere
blog.xebia.fr