real world machine learning with java for fumankaitori.com

Real World Machine Learning in Java 8 at Fumankaitori.com Mathieu Dumoulin, Chief Data Scientist fumankaitori.com, Data Science Team manager at en-japan

Upload: mathieu-dumoulin

Post on 07-Jan-2017


Page 1: Real world machine learning with Java for Fumankaitori.com

Real World Machine Learning in Java 8 at Fumankaitori.com

Mathieu Dumoulin, Chief Data Scientist at fumankaitori.com, Data Science Team Manager at en-japan

Page 2

Today’s menu

● About me and 不満買取センター (Fuman Kaitori Center)
● The business problem: post pricing
● Project overview
  ○ Why use ML
  ○ How to use ML in projects
  ○ How we used ML in this project
● Results
● Live code (depends on time)
● Conclusion

Page 3

Presentation goals

● Any Java engineer can do machine learning
● Java is a great programming language for real-world machine learning systems
● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free”
● You don’t need a PhD to use machine learning, just some self-study, good tools and libraries, and experience built one project at a time

Page 4

About me

Page 5

Google map for Quebec City here!

Page 6

My Work: Java SE, Hadoop Engineer, Data Scientist

Page 7

● Launched in March 2015. Provides web, Android and iOS applications.
● An application that collects data about people's dissatisfactions.
● Features:
  ○ Users can post any dissatisfaction with any product or service.
  ○ Users earn points as a reward for their posts. Points can be exchanged for coupon codes at e-commerce (EC) sites.
● 250,000 users and 1,500,000 posts (cumulative, as of the end of November 2015)

Page 8

Problem statement: post point value prediction

● Fuman user posts have a monetary value
● We want to give more points for “good” posts
● At first, operations staff checked all posts, but they can’t check 10,000 posts each day...

We made rules, but the point values were worse:

● Rules can’t check the content of the posts
● Rules always miss something
● Making hundreds or thousands of rules by hand is ridiculous

Page 9

ML is the best solution for 不満買取センター

● ML problem: estimate the point value of a user post (0-25)
● Project goal: estimate the value of posts with less than 5 points of difference from human judgement
● Data: all user posts and user profile data
● Data with known output (labels): staff had already set points manually for 200k posts

This is a classic case of supervised learning (Wiki). Another reference from Microsoft

Predicting a price requires building a regression model because the prediction is a number, as opposed to a classification problem, which predicts which class (for example, one of two) each post belongs to.

Page 10

Real world ML project overview

● Machine Learning Workflow
● Data Scientist and Java Engineer roles
● Java for production ML
● Java 8 benefits
● Our point prediction system details
● Results

Page 11

Machine Learning Workflow

[Diagram: two pipelines, with the data passed between steps shown in parentheses]

Training: Load data (data, labels with known result) → Extract Features (feature vectors, labels) → Train Model (prediction, labels) → Evaluate vs. business goal → iterate → best model

Prediction: Load new data (data) → Extract Features, the same as in training (feature vectors) → Predict using model (predictions) → Act on prediction

Page 12

Workflow for machine learning system

1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into the machine learning algorithm
4. Train and evaluate machine learning models until the goal is reached
5. Deploy the best model

Page 13

Data Scientist’s role

1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into the machine learning algorithm
4. Train and evaluate machine learning models until the goal is reached
5. Deploy the best model

● Choose features
● Build many models

Page 14

Software Engineer’s role

Implement and integrate into the production system

1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into the machine learning algorithm
4. Train and evaluate machine learning models until the goal is reached
5. Deploy the best model

● Get data from the data source
● Implement production code

Page 15

But we don’t have a data scientist...

Page 16

You can outsource!

Page 17

Java for production ML

● Easy integration with Java applications
● Fast (vs. Python or R)
● Easy to program (vs. C++)
● Most common enterprise programming language, with IDE support and excellent support libraries
● Lots of state of the art machine learning libraries have a Java API

Page 18

Machine Learning libraries

Page 19

Benefits of Java 8

● Java 8’s functional style is a very good match for ML operations
  a. Feature extraction: data in → transform → data out
● Java 8’s streams and lambdas
  a. Code is easier to understand and less verbose
● Easy parallel code
  a. Faster “for free”
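As a minimal sketch of the "data in → transform → data out" idea, the code below uses a lambda as the feature extractor and a parallel stream to apply it to every post. The class name, the two toy features (post length and digit count) and the sample posts are illustrative assumptions, not the features or code from the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamFeatureSketch {

    // Hypothetical extractor: post text in, numeric feature vector out.
    static final Function<String, double[]> EXTRACT = post -> new double[] {
        post.length(),                                   // length in UTF-16 chars
        post.chars().filter(Character::isDigit).count()  // number of digit characters
    };

    // Applies the extractor to every post. parallelStream() spreads the work
    // over the available cores; the collected list keeps the input order.
    static List<double[]> extractAll(List<String> posts) {
        return posts.parallelStream().map(EXTRACT).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up posts for illustration only.
        List<String> posts = Arrays.asList("The fries were cold", "Waited 30 minutes");
        extractAll(posts).forEach(v -> System.out.println(Arrays.toString(v)));
    }
}
```

Switching between sequential and parallel execution is a one-word change (`stream()` vs. `parallelStream()`), which is the sense in which the speed-up comes "for free". It only pays off when the extractor is stateless and the per-post work is non-trivial.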

Page 20

Post point prediction system: step by step

[Diagram] Components: Fuman DB, Feature Extraction, Prediction Service. Arrow labels: posts, label; CSV format; DR Prediction API.

● Train/Test split
● Categorical features transformation
● Select best features
● Try many algorithms
● Tune algorithms
● Evaluate models
● REST Prediction API

Iterate until results meet business goals
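The train/test split step can be sketched in a few lines of plain Java. In this system the split is handled by the modeling service, so the class below is only an illustration of the idea: the class and method names, the 80/20 ratio and the seed are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainTestSplitSketch {

    // Simple holder for the two partitions.
    static final class Split<T> {
        final List<T> train;
        final List<T> test;
        Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }
    }

    // Shuffles a copy of the rows with a fixed seed, then cuts it at trainFraction.
    static <T> Split<T> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split<>(
            new ArrayList<>(shuffled.subList(0, cut)),
            new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("post-" + i);   // made-up row identifiers
        }
        Split<String> parts = split(rows, 0.8, 42L);
        System.out.println("train=" + parts.train.size() + ", test=" + parts.test.size());
    }
}
```

A fixed seed keeps the split reproducible, so that models trained in different iterations are evaluated on the same held-out posts.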

Page 21

Feature Extraction details

● We added character and word statistics about each fuman user post
  ○ Number of hiragana, katakana, kanji, alphabet characters and words
  ○ Number of words, length of words
  ○ Ratio of hiragana, katakana, kanji, alphabet words to the number of tokens in a post
● User profile information
  ○ age, gender, job category, etc.
● Bag-of-words models:
  ○ Words using Tf-Idf, removing stopwords (これ, あれ, それ, です, など, …: "this", "that over there", "that", "is", "and so on")
  ○ Part-of-speech (名詞, 動詞, 形容詞, …: noun, verb, adjective)
  ○ Word type features (hiragana word, katakana word, kanji word, …)
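The character statistics can be computed with the JDK alone. The sketch below counts code points per Unicode script using `Character.UnicodeScript`; it is one possible way to do it, not necessarily how the production code works, and the class and method names are assumptions.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

class ScriptCountSketch {

    // Counts code points per Unicode script (HIRAGANA, KATAKANA, HAN, LATIN, COMMON, ...).
    // Japanese punctuation such as the ideographic comma and full stop falls under COMMON.
    static Map<UnicodeScript, Long> countScripts(String text) {
        Map<UnicodeScript, Long> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            .forEach(script -> counts.merge(script, 1L, Long::sum));
        return counts;
    }

    public static void main(String[] args) {
        String post = "マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。";
        Map<UnicodeScript, Long> counts = countScripts(post);
        System.out.println("Hiragana: " + counts.getOrDefault(UnicodeScript.HIRAGANA, 0L));
        System.out.println("Katakana: " + counts.getOrDefault(UnicodeScript.KATAKANA, 0L));
        System.out.println("Kanji:    " + counts.getOrDefault(UnicodeScript.HAN, 0L));
    }
}
```

For the example post used in the talk, this gives 20 hiragana, 6 katakana and 3 kanji characters. Kanji are reported as the `HAN` script.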

Page 22

Feature Extraction: Example

マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
("I asked for freshly fried McDonald's fries, but they weren't freshly fried.")

Page 23

Feature Example: MeCab analyzer

マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。

マック 名詞,固有名詞,一般,*,*,*,マック,マック,マック
の 助詞,連体化,*,*,*,*,の,ノ,ノ
ポテト 名詞,一般,*,*,*,*,ポテト,ポテト,ポテト
揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
で 助詞,格助詞,一般,*,*,*,で,デ,デ
お願い 名詞,サ変接続,*,*,*,*,お願い,オネガイ,オネガイ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
のに 助詞,接続助詞,*,*,*,*,のに,ノニ,ノニ
、 記号,読点,*,*,*,*,、,、,、
揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
じゃ 助詞,副助詞,*,*,*,*,じゃ,ジャ,ジャ
なかっ 助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。 記号,句点,*,*,*,*,。,。,。
EOS

Page 24

Feature Extraction: Example

Character counts
  Hiragana: 20
  Katakana: 6
  Kanji: 3
  Alpha: 0
  Digits: 0
  Marks (!,?): 0

Token type counts
  Hiragana: 8
  Katakana: 2
  Kanji: 3
  Alpha: 0
  Digits: 0
  Marks: 0

Token length
  1: 5
  2: 2
  3: 4
  4: 2
  5+: 0
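The token-level counts can be reproduced from the MeCab surface forms with two `groupingBy` collectors. This is a sketch under assumptions: the tokens are hard-coded instead of coming from MeCab, punctuation tokens are dropped, and the typing rule (any kanji makes a kanji word, otherwise the first character decides) is a guess at the rule behind the slide's numbers. Tokens are assumed to be non-empty.

```java
import java.lang.Character.UnicodeScript;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TokenStatsSketch {

    // Assumed rule: a token containing any kanji is a KANJI word;
    // otherwise the script of its first character decides.
    static String tokenType(String token) {
        if (token.codePoints().anyMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN)) {
            return "KANJI";
        }
        switch (UnicodeScript.of(token.codePointAt(0))) {
            case HIRAGANA: return "HIRAGANA";
            case KATAKANA: return "KATAKANA";
            case LATIN:    return "ALPHA";
            default:       return "OTHER";
        }
    }

    // Number of tokens per token type, sorted by type name.
    static Map<String, Long> typeCounts(List<String> tokens) {
        return tokens.stream().collect(
            Collectors.groupingBy(TokenStatsSketch::tokenType, TreeMap::new, Collectors.counting()));
    }

    // Number of tokens per length in code points; lengths of 5 or more share bucket 5.
    static Map<Integer, Long> lengthCounts(List<String> tokens) {
        return tokens.stream().collect(
            Collectors.groupingBy(
                t -> Math.min(t.codePointCount(0, t.length()), 5),
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Surface forms from the MeCab output of the example post, punctuation dropped.
        List<String> tokens = Arrays.asList("マック", "の", "ポテト", "揚げたて", "で", "お願い",
            "し", "た", "のに", "揚げたて", "じゃ", "なかっ", "た");
        System.out.println(typeCounts(tokens));
        System.out.println(lengthCounts(tokens));
    }
}
```

With these 13 tokens the sketch yields 8 hiragana, 3 kanji and 2 katakana words, and the length histogram 1: 5, 2: 2, 3: 4, 4: 2, which agrees with the counts on the slide.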

Page 25

Training and evaluation of our model

Page 26

We reached the project goal!

● DataRobot’s best model
  ○ eXtreme Gradient Boosted Trees
  ○ RMSE: 3.54
  ○ MSE: 12.53

Business result:

● Higher quality evaluation than rules
● Operations staff don’t need to manually check posts
● We can validate points every day

Our result: 3.5 point difference from human judgement
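The two reported metrics are related: RMSE is the square root of MSE, and √12.53 ≈ 3.54, which is where the "3.5 point difference" comes from. A minimal sketch of both metrics follows. The class name and the sample values are made up for illustration.

```java
class RegressionMetricsSketch {

    // Mean squared error between predicted and human-assigned points.
    static double mse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double error = predicted[i] - actual[i];
            sum += error * error;
        }
        return sum / predicted.length;
    }

    // Root mean squared error, expressed in the same unit as the points themselves.
    static double rmse(double[] predicted, double[] actual) {
        return Math.sqrt(mse(predicted, actual));
    }

    public static void main(String[] args) {
        double[] predicted = {10, 7, 20};   // made-up model outputs
        double[] actual    = {12, 7, 16};   // made-up staff-assigned points
        System.out.println("MSE  = " + mse(predicted, actual));
        System.out.println("RMSE = " + rmse(predicted, actual));
    }
}
```

Because RMSE is in points, it can be compared directly with the project goal of staying within 5 points of human judgement.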

Page 27

Deployment issues

● Problem: the Prediction API was very slow (>1 s per post), so we had to run it as a batch process each night.
● What we want: make predictions locally with low latency, without losing the good prediction performance we already have.

We solved this problem using H2O, the excellent open source, distributed machine learning library by H2O.ai.

Co-founder: Cliff Click, who made the Java HotSpot Server Compiler

Page 28

Post point prediction system: Current system

[Diagram] Components: Fuman DB, Feature Extraction, Prediction Service, Prediction POJO, Fuman Webapp. Arrow labels: posts, label; CSV format; get new post values; make feature vectors.

● Train/Test split
● Categorical features transformation
● Distributed, fast and state of the art algorithms
● POJO prediction class generation

Page 29

Train Production Model: H2O

Page 30

Overview: Making Predictions

● Use the prediction POJO generated by H2O
● For each new post, query the Prediction Service
  ○ Convert to vector (Double[] for H2O)
  ○ Get prediction from prediction POJO (Double value, round to integer)
  ○ Update database with predicted price
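The prediction steps can be sketched as follows. To keep the example self-contained, the generated model is represented by a plain `ToDoubleFunction<double[]>`, which is a hypothetical stand-in and not the API of the class H2O generates. Clamping the result to the 0-25 range is also an added assumption, and the database update is left out.

```java
import java.util.function.ToDoubleFunction;

class PredictionServiceSketch {

    // Hypothetical stand-in for the generated model: feature vector in, raw score out.
    private final ToDoubleFunction<double[]> model;

    PredictionServiceSketch(ToDoubleFunction<double[]> model) {
        this.model = model;
    }

    // Rounds the raw score to an integer point value.
    // Clamping to the 0-25 point range is an assumption added for this sketch.
    int predictPoints(double[] features) {
        long rounded = Math.round(model.applyAsDouble(features));
        return (int) Math.max(0, Math.min(25, rounded));
    }

    public static void main(String[] args) {
        // Toy model for illustration only: one point per 10 characters of post length.
        PredictionServiceSketch service = new PredictionServiceSketch(f -> f[0] / 10.0);
        double[] features = {87.0};   // hypothetical vector: [post length]
        System.out.println(service.predictPoints(features));
    }
}
```

Keeping the model behind a small interface like this means the web application does not care whether the score comes from a local POJO or a remote REST call, so the switch between the two only touches one class.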

Page 31

We reached the business goal!

Project goal: Get similar performance from H2O as from DataRobot

H2O is not ideal for exploring different models and features, but for production, it is FAST with similar predictive performance. It is implemented in pure Java (Github).

● H2O: Train a new model for production
  ○ GBM (Gradient Boosting Machine)
  ○ MSE: 12.8
● DataRobot’s best model
  ○ eXtreme Gradient Boosted Trees
  ○ RMSE: 3.54
  ○ MSE: 12.53

Page 32

Real world ML loves Java!

● Java is a top choice for making production machine learning systems
● The benefits of Java 8 make Java fun and relevant again
● Integration in a Java web application was not hard
● Java is not a good choice for experimentation
  ○ Start with a Python prototype with Scikit-learn
  ○ Use a Machine Learning service like DataRobot.com

Page 33

You can use ML in your projects!

● Web API services are like a personal data scientist
  ○ No need for a Data Scientist for simple use of ML
  ○ But harder datasets will need expertise
● Real world ML projects need Engineers:
  ○ Get data to train a good model (log files, sales results, mail campaign results, …)
  ○ Transform data into input for an ML library or web service
  ○ Deploy and integrate into production
● Most steps are just normal programming
  ○ Get data from DB
  ○ Transform data into a CSV
  ○ Call a REST API or Java POJO to make predictions
  ○ Integrate with the system that needs predictions

Page 34

Questions?

Page 35

Live code

Page 36

Feature engineering with streams and lambdas

The goal is to take raw data from the DB and create arrays of numerical or categorical features.

1. Get Fuman user post data from DB -> UserPost
2. Learn the vocabulary of all user posts' word types
3. Create the dataset:
  a. For each post,
    i. Add the statistics features
    ii. Add the word types features
4. Transform to CSV output (for DataRobot)
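The steps above can be sketched with streams and lambdas as follows. This is not the live-demo code: the `UserPost` class here is a hypothetical stand-in (already tokenized text plus the staff-assigned points), the vocabulary is built from plain tokens, the only statistics feature is the token count, and plain strings are used instead of Weka instances.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CsvDatasetSketch {

    // Hypothetical stand-in for the UserPost entity loaded from the DB (step 1).
    static final class UserPost {
        final List<String> tokens;
        final int points;
        UserPost(List<String> tokens, int points) {
            this.tokens = tokens;
            this.points = points;
        }
    }

    // Step 2: the vocabulary is the sorted list of distinct tokens over all posts.
    static List<String> learnVocabulary(List<UserPost> posts) {
        return posts.stream()
            .flatMap(p -> p.tokens.stream())
            .distinct()
            .sorted()
            .collect(Collectors.toList());
    }

    // Step 3: one row = statistics feature (token count), one count per vocabulary entry, label.
    static String toCsvRow(UserPost post, List<String> vocabulary) {
        Stream<String> stats = Stream.of(String.valueOf(post.tokens.size()));
        Stream<String> counts = vocabulary.stream()
            .map(word -> String.valueOf(post.tokens.stream().filter(word::equals).count()));
        Stream<String> label = Stream.of(String.valueOf(post.points));
        return Stream.concat(Stream.concat(stats, counts), label)
            .collect(Collectors.joining(","));
    }

    // Step 4: header line followed by one line per post.
    // Tokens are used as column names, so they are assumed to contain no commas or quotes.
    static String toCsv(List<UserPost> posts) {
        List<String> vocabulary = learnVocabulary(posts);
        String header = "token_count," + String.join(",", vocabulary) + ",points";
        return Stream.concat(Stream.of(header), posts.stream().map(p -> toCsvRow(p, vocabulary)))
            .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Made-up posts for illustration only.
        List<UserPost> posts = Arrays.asList(
            new UserPost(Arrays.asList("fries", "cold", "cold"), 10),
            new UserPost(Arrays.asList("queue", "long"), 5));
        System.out.println(toCsv(posts));
    }
}
```

Each row here is dense, with one column per vocabulary entry. With a real vocabulary most of those counts are zero, which is why the original code used sparse vectors.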

Instances are Weka SparseInstance objects (sparse vectors for memory efficiency), but in retrospect, I think a specialized vector library would have been better. Weka is a terrible production library.