real world machine learning with java for fumankaitori.com
Real World Machine Learning in Java 8 at Fumankaitori.com
Mathieu Dumoulin, Chief Data Scientist at fumankaitori.com, Data Science Team Manager at en-japan
Today’s menu
● About me and 不満買取センター (Fuman Kaitori Center)
● The business problem: Post pricing
● Project Overview
○ Why use ML
○ How to use ML in projects
○ How we used ML in this project
● Results
● Live code (depends on time)
● Conclusion
Presentation goals
● Any Java engineer can do machine learning
● Java is a great programming language for real-world machine learning systems
● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free”
● You don’t need a PhD to use machine learning, just some self-study, good tools and libraries, and experience built one project at a time
About me
Google map for Quebec City here!
My Work: Java SE, Hadoop Engineer, Data Scientist
● Launched in Mar 2015. Provides web, Android, and iOS applications.
● An application to collect data about people's dissatisfactions.
● Features:
○ Users can post about any dissatisfaction with any product or service.
○ Users earn points as a reward for their posts. Points can be exchanged for coupon codes at e-commerce (EC) sites.
● 250,000 users and 1,500,000 posts in total (as of the end of Nov 2015)
Problem statement: post point value prediction
● Fuman user posts have a monetary value
● We want to give more points for “good” posts
● At first, operations staff checked all posts, but they can’t check 10,000 posts each day...
We made rules, but the point values were worse:
● Rules can’t check the content of the posts
● Rules always miss something
● Making hundreds or thousands of rules by hand is ridiculous
ML is the best solution for 不満買取センター
● ML Problem: Estimate the point value of a user post (0-25)
● Project goal: Estimate the value of posts with less than 5 points difference from human judgement
● Data: All user posts and user profile data
● Data with known output (labels): staff have already set points manually for 200k posts
This is a classic case of supervised learning (Wiki). Another reference from Microsoft
Predicting a price requires a regression model because the prediction is a number, as opposed to a classification model, which predicts which of a fixed set of classes each post belongs to.
Real world ML project overview
● Machine Learning Workflow
● Data Scientist and Java Engineer roles
● Java for production ML
● Java 8 benefits
● Our point prediction system details
● Results
Machine Learning Workflow
Training: Load data (data, labels) → Extract Features (feature vectors, labels) → Train Model (prediction, labels) → Evaluate vs. business goal; iterate, then keep the best model
Prediction: Load new data (data) → Extract Features, the same as in training (feature vectors) → Predict using model (predictions) → Act on prediction
Workflow for machine learning system
1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into machine learning algorithm
4. Train and evaluate machine learning model until it reaches the goal
5. Deploy best model
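The five steps can be sketched in a few lines of Java. This is only an illustration under loud assumptions: the feature vectors and labels are made up, the "model" is a baseline that always predicts the mean label, and the names `WorkflowSketch`, `trainMeanBaseline` and `rmse` are hypothetical. It is not the system described in this talk.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

class WorkflowSketch {
    // Step 4 (train): a baseline "model" that ignores the features and always
    // predicts the mean of the training labels. A real project would plug in
    // a learning algorithm here.
    static ToDoubleFunction<double[]> trainMeanBaseline(double[] labels) {
        double mean = Arrays.stream(labels).average().orElse(0.0);
        return features -> mean;
    }

    // Step 4 (evaluate): root mean squared error against staff-assigned points.
    static double rmse(ToDoubleFunction<double[]> model, double[][] features, double[] labels) {
        double sumSquares = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double diff = model.applyAsDouble(features[i]) - labels[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / labels.length);
    }

    public static void main(String[] args) {
        // Steps 2-3: made-up feature vectors and labels, for illustration only.
        double[][] features = {{12, 3}, {40, 9}, {25, 5}};
        double[] labels = {5, 15, 10};
        ToDoubleFunction<double[]> model = trainMeanBaseline(labels);
        // Evaluating on the training rows keeps the sketch short; a real
        // evaluation would use held-out posts.
        double error = rmse(model, features, labels);
        // Step 1: the business goal, here "less than 5 points difference".
        System.out.println("RMSE = " + error + ", goal met: " + (error < 5.0));
    }
}
```

Step 5 (deploy) then amounts to shipping whichever model object met the goal.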
Data Scientist’s role
1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into machine learning algorithm
4. Train and evaluate machine learning model until it reaches the goal
5. Deploy best model
Choose features
Build many models
Software Engineer’s role
Implement and integrate into production system
1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into machine learning algorithm
4. Train and evaluate machine learning model until it reaches the goal
5. Deploy best model
Get data from data source
Implement production code
But we don’t have a data scientist...
You can outsource!
Java for production ML
● Easy integration with Java applications
● Fast (vs. Python or R)
● Easy to program (vs. C++)
● Most common enterprise programming language, IDE support and excellent support libraries
● Lots of state-of-the-art machine learning libraries have a Java API
Machine Learning libraries
Benefits of Java 8
● Java 8’s functional style is a very good match with ML operations
a. Feature extraction: data in → transform → data out
● Java 8’s streams and lambdas
a. Code is easier to understand and less verbose
● Easy parallel code
a. Faster “for free”
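A minimal sketch of the "data in → transform → data out" idea with Java 8 streams. The feature (post length in code points) and the names `StreamFeatures` and `postLengths` are assumptions chosen for illustration; the only change needed to run the same transform in parallel is swapping `stream()` for `parallelStream()`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamFeatures {
    // Hypothetical feature: the length of each post in Unicode code points.
    // The lambda is a pure function of one post, so it is safe to run in parallel.
    static List<Integer> postLengths(List<String> posts, boolean parallel) {
        Stream<String> stream = parallel ? posts.parallelStream() : posts.stream();
        return stream
                .map(post -> post.codePointCount(0, post.length()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList("too slow", "too expensive", "");
        // Collecting to a list keeps the input order, even for a parallel stream.
        System.out.println(postLengths(posts, false));
        System.out.println(postLengths(posts, true));
    }
}
```

Whether the parallel version is actually faster depends on the data size and the cost of the transform, so it is worth measuring before relying on it.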
Post point prediction system: step by step
Components: Fuman DB, Feature Extraction, DR (DataRobot) Prediction API, Prediction Service
Data passed between steps: posts, label (CSV format)
● Train/Test split
● Categorical features transformation
● Select best features
● Try many algorithms
● Tune algorithms
● Evaluate models
● REST Prediction API
Iterate until results meet business goals
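Two of the steps listed for this system, the train/test split and the categorical feature transformation, are simple enough to sketch by hand. In this project they were handled by the modeling service, so the code below is only an illustration of the ideas; `DatasetPrep`, the 80/20 ratio, the seed and the job categories are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class DatasetPrep {
    // Shuffle with a fixed seed (for repeatability), then keep the first
    // trainFraction of the rows for training and the rest for testing.
    static <T> List<List<T>> trainTestSplit(List<T> rows, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> split = new ArrayList<>();
        split.add(new ArrayList<>(shuffled.subList(0, cut)));               // train
        split.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // test
        return split;
    }

    // One-hot encode a categorical value against a fixed, ordered category list.
    // A value that is not in the list encodes as all zeros.
    static double[] oneHot(String value, List<String> categories) {
        double[] encoded = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index >= 0) {
            encoded[index] = 1.0;
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<Integer> postIds = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<List<Integer>> split = trainTestSplit(postIds, 0.8, 42L);
        System.out.println("train=" + split.get(0) + " test=" + split.get(1));

        // Hypothetical job categories from the user profile.
        List<String> jobs = Arrays.asList("student", "engineer", "other");
        System.out.println(Arrays.toString(oneHot("engineer", jobs)));
    }
}
```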
Feature Extraction details
● We added character and word statistics about each fuman user post
○ Number of hiragana, katakana, kanji, alphabet characters and words
○ Number of words, length of words
○ Ratio of hiragana, katakana, kanji, alphabet words to the number of tokens in a post
● User profile information
○ age, gender, job category, etc.
● Bag-of-words models:
○ Words using Tf-Idf, removing stopwords (これ、あれ、それ、です、など、…: “this”, “that”, “it”, “is”, “etc.”, …)
○ Part-of-speech (名詞、動詞、形容詞、…: noun, verb, adjective, …)
○ Word types features (hiragana word, katakana word, kanji word, …)
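For readers who have not seen Tf-Idf before, here is a minimal sketch of the weighting with stopword removal. It assumes one common formula, (term count / kept tokens in the post) × ln(number of posts / number of posts containing the term); the talk does not say which variant was used. The tokens are English stand-ins for what a tokenizer would produce, and `TfIdfSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class TfIdfSketch {
    // Document frequency: the number of posts in which each term appears.
    static Map<String, Long> documentFrequency(List<List<String>> posts) {
        return posts.stream()
                .flatMap(post -> post.stream().distinct())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Tf-Idf weight of each non-stopword term in one post.
    static Map<String, Double> tfIdf(List<String> post, Map<String, Long> df,
                                     int numPosts, Set<String> stopwords) {
        List<String> kept = post.stream()
                .filter(term -> !stopwords.contains(term))
                .collect(Collectors.toList());
        Map<String, Long> counts = kept.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        Map<String, Double> weights = new TreeMap<>();
        counts.forEach((term, count) -> {
            double tf = count.doubleValue() / kept.size();
            // Terms missing from df are treated as appearing in one post.
            double idf = Math.log((double) numPosts / df.getOrDefault(term, 1L));
            weights.put(term, tf * idf);
        });
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> posts = Arrays.asList(
                Arrays.asList("potato", "is", "cold"),
                Arrays.asList("potato", "expensive"));
        Set<String> stopwords = new HashSet<>(Arrays.asList("is"));
        Map<String, Long> df = documentFrequency(posts);
        // "potato" appears in every post, so its weight is zero; "cold" is rarer.
        System.out.println(tfIdf(posts.get(0), df, posts.size(), stopwords));
    }
}
```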
Feature Extraction: Example
マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。 (“I asked for freshly fried McDonald’s fries, but they weren’t freshly fried.”)
Feature Example: MeCab analyzer
マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
マック 名詞,固有名詞,一般,*,*,*,マック,マック,マック
の 助詞,連体化,*,*,*,*,の,ノ,ノ
ポテト 名詞,一般,*,*,*,*,ポテト,ポテト,ポテト
揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
で 助詞,格助詞,一般,*,*,*,で,デ,デ
お願い 名詞,サ変接続,*,*,*,*,お願い,オネガイ,オネガイ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
のに 助詞,接続助詞,*,*,*,*,のに,ノニ,ノニ
、 記号,読点,*,*,*,*,、,、,、
揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
じゃ 助詞,副助詞,*,*,*,*,じゃ,ジャ,ジャ
なかっ 助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。 記号,句点,*,*,*,*,。,。,。
EOS
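Output like this is easy to turn into part-of-speech features. The sketch below assumes MeCab's default text output, where each line is the surface form, a tab, then comma-separated fields whose first entry is the part of speech, with `EOS` closing the sentence. It parses text that was already produced by MeCab and does not call MeCab itself; `MecabLines` and `Token` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class MecabLines {
    // One analyzed token: the surface form and its top-level part of speech.
    static final class Token {
        final String surface;
        final String partOfSpeech;

        Token(String surface, String partOfSpeech) {
            this.surface = surface;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Parse "surface<TAB>pos,sub1,..." lines, skipping blanks and the EOS marker.
    static List<Token> parse(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isEmpty() && !line.equals("EOS"))
                .map(line -> line.split("\t", 2))
                .filter(parts -> parts.length == 2)
                .map(parts -> new Token(parts[0], parts[1].split(",", 2)[0]))
                .collect(Collectors.toList());
    }

    // Bag-of-POS feature: how many tokens carry each part-of-speech tag.
    static Map<String, Long> posCounts(List<Token> tokens) {
        return tokens.stream()
                .collect(Collectors.groupingBy(t -> t.partOfSpeech, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "マック\t名詞,固有名詞,一般,*,*,*,マック,マック,マック",
                "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
                "ポテト\t名詞,一般,*,*,*,*,ポテト,ポテト,ポテト",
                "EOS");
        System.out.println(posCounts(parse(lines)));
    }
}
```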
Feature Extraction: Example
Character counts
Hiragana: 20
Katakana: 6
Kanji: 3
Alpha: 0
Digits: 0
Marks (!,?): 0
Token type counts
Hiragana: 8
Katakana: 2
Kanji: 3
Alpha: 0
Digits: 0
Marks: 0
Token length
1: 5
2: 2
3: 4
4: 2
5+: 0
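The character counts above can be reproduced with the JDK alone. This is a sketch, not the code used in the project: it classifies each code point with `Character.UnicodeScript`, where kanji are reported as `HAN`. The class name `CharStats` is an assumption.

```java
import java.lang.Character.UnicodeScript;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class CharStats {
    // The example post from the slides.
    static final String EXAMPLE = "マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。";

    // Count code points per Unicode script: HIRAGANA, KATAKANA, HAN (kanji),
    // LATIN, and COMMON for punctuation such as "、" and "。".
    static Map<UnicodeScript, Long> scriptCounts(String text) {
        return text.codePoints()
                .mapToObj(UnicodeScript::of)
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<UnicodeScript, Long> counts = scriptCounts(EXAMPLE);
        System.out.println("Hiragana: " + counts.getOrDefault(UnicodeScript.HIRAGANA, 0L));
        System.out.println("Katakana: " + counts.getOrDefault(UnicodeScript.KATAKANA, 0L));
        System.out.println("Kanji: " + counts.getOrDefault(UnicodeScript.HAN, 0L));
    }
}
```

One caveat: Unicode assigns the long-vowel mark ー to the COMMON script rather than KATAKANA, so a production version would probably want to handle it explicitly.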
Training and evaluation of our model
We reached the project goal!
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
Business result:
● Higher quality evaluation than rules
● Operation staff don’t need to manually check posts
● We can validate points every day
Our result: 3.5 point difference from human judgement
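The two reported metrics are related: RMSE is the square root of MSE, which puts the error back in the same unit as the label (points). A short check of the reported numbers against the "less than 5 points" goal follows; `ErrorMetrics` and `rmseFromMse` are hypothetical names.

```java
class ErrorMetrics {
    // RMSE is the square root of MSE, so it is expressed in points, like the label.
    static double rmseFromMse(double mse) {
        return Math.sqrt(mse);
    }

    public static void main(String[] args) {
        double reportedMse = 12.53; // MSE reported for DataRobot's best model
        double rmse = rmseFromMse(reportedMse);
        System.out.println("RMSE = " + rmse);
        System.out.println("Within the 5-point goal: " + (rmse < 5.0));
    }
}
```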
Deployment issues
● Problem: The Prediction API was very slow (>1s / post), so we had to run it as a batch process each night.
● We want: Local predictions with low latency, without losing the good predictive performance we already have.
We solved this problem using H2O, the excellent open source, distributed machine learning library by H2O.ai.
Co-founder: Cliff Click, who made the Java HotSpot Server Compiler
Post point prediction system: Current system
Components: Fuman DB, Feature Extraction, Train Production Model: H2O, Prediction POJO, Prediction Service, Fuman Webapp
Labels on the diagram: posts, label (CSV format); get new post values; make feature vectors
● Train/Test split
● Categorical features transformation
● Distributed, fast and state-of-the-art algorithms
● POJO prediction class generation
Overview: Making Predictions
● Use the prediction POJO generated by H2O
● For each new post, query the Prediction Service
○ Convert to vector (Double[] for H2O)
○ Get prediction from prediction POJO (Double value, round to integer)
○ Update database with predicted price
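A sketch of the "get prediction, round to integer" step. The model here is a plain `ToDoubleFunction<double[]>` used as a stand-in for the generated prediction class, because the H2O-generated code is not shown in this talk; `PointPredictor` and `predictPoints` are hypothetical names. Clamping to the 0-25 point range is also an assumption, not something the slides state.

```java
import java.util.function.ToDoubleFunction;

class PointPredictor {
    static final int MIN_POINTS = 0;
    static final int MAX_POINTS = 25;

    // Score one feature vector, round to the nearest integer, and clamp the
    // result to the valid point range (the clamp is an assumption).
    static int predictPoints(ToDoubleFunction<double[]> model, double[] features) {
        long rounded = Math.round(model.applyAsDouble(features));
        return (int) Math.max(MIN_POINTS, Math.min(MAX_POINTS, rounded));
    }

    public static void main(String[] args) {
        // Stand-in model: a made-up linear formula, not a trained model.
        ToDoubleFunction<double[]> standIn = f -> 2.0 + 0.3 * f[0];
        double[] features = {29.0}; // e.g. a character count for a new post
        System.out.println("Predicted points: " + predictPoints(standIn, features));
    }
}
```

Because the model runs in-process, there is no network round trip per post, which is the point of moving from the remote Prediction API to a local POJO.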
We reached the business goal!
Project goal: Get similar performance from H2O as from DataRobot
H2O is not ideal to explore different models and features, but for production, it is FAST with similar predictive performance. It is implemented in pure Java (Github).
● H2O: Train a new model for production
○ GBM (Gradient Boosting Machine)
○ MSE: 12.8
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
Real world ML loves Java!
● Java is a top choice for making production machine learning systems
● The benefits of Java 8 make Java fun and relevant again
● Integration in a Java web application was not hard
● Java is not a good choice for experimentation
○ Start with a Python prototype with Scikit-learn
○ Use a Machine Learning service like DataRobot.com
You can use ML in your projects!
● Web API services are like a personal data scientist
○ No need for a Data Scientist for simple uses of ML
○ But harder datasets will need expertise
● Real world ML projects need Engineers:
○ Get data to train a good model (log files, sales results, mail campaign results,…)
○ Transform data into input for ML library or web service
○ Deploy and integrate into production
● Most steps are just normal programming
○ Get data from DB
○ Transform data into a CSV
○ Call a REST API or Java POJO to make predictions
○ Integrate with the system that needs predictions
Questions?
Live code
Feature engineering with streams and lambdas
The goal is to take raw data from the DB and create arrays of numerical or categorical features.
1. Get Fuman user post data from DB -> UserPost
2. Learn the vocabulary of all user posts word types
3. Create the dataset:
a. For each post,
i. Add the statistics features
ii. Add the word types features
4. Transform to csv output (for DataRobot)
Instances are Weka SparseInstance (sparse vectors for memory efficiency), but in retrospect, a specialized vector library would have been better, I think. Weka is a terrible production library
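The four steps can be sketched with streams alone, using plain strings instead of Weka instances. Everything here is an assumption made for illustration: the fields of `UserPost`, the two statistics features (post length and token count) and the CSV column names. The real live code is not reproduced in this transcript.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PostsToCsv {
    // Step 1: hypothetical shape of a post loaded from the DB.
    static final class UserPost {
        final String text;
        final List<String> wordTypes; // e.g. "hiragana", "katakana", "kanji" per token
        final int points;             // label set by staff

        UserPost(String text, List<String> wordTypes, int points) {
            this.text = text;
            this.wordTypes = wordTypes;
            this.points = points;
        }
    }

    // Step 2: the vocabulary is the sorted set of word types seen in any post.
    static List<String> learnVocabulary(List<UserPost> posts) {
        return posts.stream()
                .flatMap(post -> post.wordTypes.stream())
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Step 3: statistics features, then one count per vocabulary entry, then the label.
    static String toCsvRow(UserPost post, List<String> vocabulary) {
        Stream<String> stats = Stream.of(
                String.valueOf(post.text.codePointCount(0, post.text.length())),
                String.valueOf(post.wordTypes.size()));
        Stream<String> counts = vocabulary.stream()
                .map(type -> String.valueOf(post.wordTypes.stream().filter(type::equals).count()));
        Stream<String> label = Stream.of(String.valueOf(post.points));
        return Stream.concat(Stream.concat(stats, counts), label).collect(Collectors.joining(","));
    }

    // Step 4: a header line followed by one row per post.
    static String toCsv(List<UserPost> posts) {
        List<String> vocabulary = learnVocabulary(posts);
        String header = Stream.concat(
                Stream.concat(Stream.of("length", "tokens"), vocabulary.stream()),
                Stream.of("points")).collect(Collectors.joining(","));
        return Stream.concat(Stream.of(header), posts.stream().map(p -> toCsvRow(p, vocabulary)))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<UserPost> posts = Arrays.asList(
                new UserPost("abc", Arrays.asList("kanji", "hiragana", "kanji"), 10),
                new UserPost("de", Arrays.asList("hiragana"), 3));
        System.out.println(toCsv(posts));
    }
}
```

The dense rows are fine for a handful of features; with a large vocabulary, a sparse representation like the one mentioned above becomes worthwhile.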