hivemall dbtechshowcase 20160713 #dbts2016
TRANSCRIPT
Machine Learning Made Easy by Using Hivemall
Research Engineer Makoto YUI @myui
bit.ly/hivemall
2016/07/13 DB tech showcase
Who am I?
➢ 2015/04 Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
  ➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
  ➢ XML native database and Parallel Database systems
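[Figure: Treasure Data Cloud Service overview. Data sources (app logs, sensors, Apache logs, CRM, ERP, RDBMS) feed data collectors (Treasure Agent, Embulk, Mobile SDK, JS SDK, embedded) that PUSH into the Treasure Data Cloud Service. The service provides batch queries (Hive), ad hoc queries (Presto), and machine learning, and connects to BI tools, data analysis, and external integrations (SQL Server) via API and ODBC/JDBC. 900,000 records stored per sec.]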
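Data Imported to Treasure Data
[Figure: records imported over time (unit: billion records; axis from 0 to 12,000), annotated with milestones: service launch, Series A funding, adoption by 100 companies, selection as a Gartner "Cool Vendor in Big Data", 5 trillion records, 10 trillion records.]
Treasure Data by the numbers (October 2014):
• 400,000 records imported per second
• More than 10 trillion records imported in total
• 12 billion records sent per day by a single ad-tech customer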
Agenda
1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. How to use Hivemall
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
[Figure: software stack, bottom to top. Distributed File System: Hadoop HDFS. Resource Management: Apache YARN, MESOS. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Query Processing: Hive, Pig, SparkSQL. Machine Learning: Hivemall, MLlib.]
Won IDG's InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
bit.ly/hivemall-award
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting a probability of a positive class.
Factorization Machines are a good fit where features are sparse and categorical.
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items.
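The idea behind picking top-k items per user can be sketched in plain Python. This is only an illustration of "top-k within each group", not Hivemall's each_top_k implementation or its signature; the function name, the row layout (group, score, item), and the sample data are assumptions made for this example.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """Return {group: [(rank, score, item), ...]} with the k best-scoring items per group.

    rows is an iterable of (group, score, item) tuples (hypothetical layout).
    """
    grouped = defaultdict(list)
    for group, score, item in rows:
        grouped[group].append((score, item))
    result = {}
    for group, scored in grouped.items():
        # Keep only the k highest scores in this group, best first.
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        result[group] = [(rank, score, item)
                         for rank, (score, item) in enumerate(best, start=1)]
    return result

# Hypothetical recommendation scores: (user, score, item).
rows = [("u1", 0.9, "a"), ("u1", 0.2, "b"), ("u1", 0.5, "c"), ("u2", 0.1, "d")]
recommendations = top_k_per_group(rows, k=2)
```

Each user keeps at most k rows, so the output stays small even when the candidate list per user is long.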
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Industry use cases of Hivemall
Ø CTR prediction of Ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
Ø Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications
http://eventdots.jp/eventreport/458208
Ø Value prediction of Real estates
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
Source: http://itnp.net/article/2016/02/18/2286.html
Ø Churn Detection
• OISIX
http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
Copyright ©2015 Treasure Data. All Rights Reserved.
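Churn prediction for a membership service
• Subscription purchases by 100,000 members drive the company's overall sales and profit, but there was no way to identify members at risk of churning in advance or to prevent them from leaving
• Machine learning without statistical expertise
• Granting points to members on the churn-prediction list halved the churn rate
• Surfaced the campaigns and events that carry churn risk, and also made it possible to identify behaviors characteristic of members who do not churn
• Also enabled indirect service improvements, such as changing the UI according to the level of risk
• Used machine learning to build a list of customers likely to churn in the coming month, based on data from the past month
• Specifically, each step (creating the training table -> normalization -> building the model -> logistic regression) was implemented simply as queries using TD + Hivemall
Data used for prediction (Web, Mobile): attribute data, behavior logs, complaint data, traffic sources, service usage data
Direct measures: granting points, care calls
Indirect measures: guiding users to a successful experience, UI changes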
Ø Recommendation
• Portal site
Agenda
1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. How to use Hivemall
Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
That's why I built Hivemall.
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Figure: raw data on HDFS/S3 goes through Extract-Transform-Load into a feature vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …), which is then fed to Machine Learning.]
Extract-Transform-Load: need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)
Machine Learning: does not scale. Have to learn R/Python APIs.
Does not meet my needs in terms of its scalability, ML algorithms, and usability
I ❤ scalable SQL query
Survey on existing ML frameworks
Framework                            User interface
Mahout                               Java API programming
Spark MLlib/MLI                      Scala API programming, Scala Shell (REPL)
H2O                                  R programming, GUI
Cloudera Oryx                        HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)  C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use
Hivemall's Vision: ML on SQL
Classification with Mahout (listing not included in this transcript)
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
Hivemall on Apache Spark
Installation is very easy as follows:
$ spark-shell --packages maropu:hivemall-spark:0.0.6
Agenda
1. What is Hivemall
2. Why Hivemall
3. How to use Hivemall
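How to use Hivemall
[Figure: machine learning workflow. Training: labeled feature vectors are used to learn a prediction model. Prediction: the prediction model is applied to a feature vector to produce a label.]
Data preparation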
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
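To make the delimiters in the table definition concrete, here is a minimal Python sketch of how one text row maps onto the three columns: fields are split on tabs and the features collection on commas. The sample row and the "index:value" shape of each feature string are assumptions for illustration and are not taken from the actual dataset.

```python
def parse_row(line):
    """Split one tab-delimited row into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # Collection items are comma-separated, matching ARRAY<STRING>.
    return int(rowid), float(label), features.split(",")

# Hypothetical row: rowid, label, then comma-separated feature strings.
sample = "1\t-3.58\t197:0.02,321:0.04,399:0.01\n"
rowid, label, features = parse_row(sample)
```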
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
Transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;
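Min-max normalization maps a value v to (v - min) / (max - min), so the smallest label becomes 0.0 and the largest becomes 1.0. The sketch below shows that arithmetic in Python. The function name is hypothetical, and raising an error when min equals max is a choice made for this example, not a statement about how Hivemall's rescale handles that case.

```python
def min_max_rescale(value, min_value, max_value):
    """Map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: a degenerate range is treated as an error.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)

# Hypothetical label range of [-7.0, 1.0].
scaled = [min_max_rescale(v, -7.0, 1.0) for v in (-7.0, -3.0, 1.0)]
```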
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature
• The inner SELECT is a map-only task to learn a prediction model
• Map outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
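The three steps above (learn per-mapper models, shuffle by feature, average per feature) can be sketched in Python as follows. This is a rough illustration under stated assumptions: the per-shard learner is plain SGD logistic regression written for this example and is not the internals of logress, and the function names, learning rate, epoch count, and toy data are all hypothetical.

```python
import math
from collections import defaultdict

def train_shard(rows, eta=0.1, epochs=20):
    """Plain SGD logistic regression on one shard (stand-in for a map-only task).

    rows is a list of (features, label) with features as {name: value}
    and label in {0, 1}.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return dict(w)

def average_models(models):
    """Average weights per feature, like avg(weight) ... GROUP BY feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}

# Two hypothetical shards, each trained independently and then averaged.
shard_a = [({"x": 1.0}, 1), ({"y": 1.0}, 0)]
shard_b = [({"x": 1.0}, 1)]
merged = average_models([train_shard(shard_a), train_shard(shard_b)])
```

As in the SQL, a feature is averaged only over the models that emitted it, so a feature seen by a single mapper keeps that mapper's weight.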
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature
Vote to use negative or positive weights for avg
+0.7, +0.3, +0.2, -0.1, +0.7
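The voting step described on the slide can be sketched in Python: count the positive and negative weights for a feature, then average only the side that wins the vote. The handling of ties (falling back to the negative side) and of empty input (returning 0.0) are assumptions made for this sketch and are not confirmed by the slide.

```python
def voted_avg(weights):
    """Average only the weights on the side (positive or negative) with more votes."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie the negative side is used.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        # Assumption: no non-zero weights yields 0.0.
        return 0.0
    return sum(winners) / len(winners)

# The slide's example: four positive votes beat one negative vote,
# so only the positive weights are averaged (about 0.475).
example = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```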
How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
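The join-then-aggregate logic of this query can be sketched in Python: each (rowid, feature) pair looks up a weight in the model, matched weights are summed per rowid, and the sum is passed through the sigmoid. The dict-based model, the sample rows, and returning None for a row with no matching feature (mirroring a NULL sum) are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, model):
    """Emulate LEFT OUTER JOIN + GROUP BY rowid over (rowid, feature) pairs."""
    totals = {}
    for rowid, feature in testing_exploded:
        totals.setdefault(rowid, None)
        weight = model.get(feature)  # None plays the role of NULL for unmatched features
        if weight is not None:
            totals[rowid] = (totals[rowid] or 0.0) + weight
    return {rowid: (None if total is None else sigmoid(total))
            for rowid, total in totals.items()}

# Hypothetical model and exploded test rows.
model = {"a": 0.5, "b": -0.5, "c": 2.0}
rows = [(1, "a"), (1, "b"), (2, "c"), (2, "unseen"), (3, "unseen")]
probs = predict(rows, model)
```

Only the weights for features that actually appear in a test row are touched, which is the reason the slide notes that the whole model never has to sit in memory.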
Real-time prediction
[Figure: batch training on Hadoop learns a prediction model from labeled feature vectors; the prediction model is exported to an RDBMS, where online prediction maps a feature vector to a label.]
Export prediction models
bit.ly/hivemall-rtp
Export Prediction Model to a RDBMS
Any RDBMS
TD export
Periodical export is very easy in Treasure Data
103  -0.4896543622016907
104  -0.0955817922949791
105  0.12560302019119263
106  0.09214721620082855
Real-time Prediction on MySQL
Online prediction on MySQL
SIGMOID(x) = 1.0 / (1.0 + exp(-x))
SELECT sigmoid(sum(t.value * m.weight)) as prob
FROM testing_exploded t LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
Index lookups are very efficient in RDBMSs
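The same query shape can be tried locally with Python's built-in sqlite3 module, used here purely as a stand-in for MySQL. The model rows are the four weights shown on the export slide; the feature values in testing_exploded, the table layouts, and the sigmoid function registered from Python are assumptions for this sketch.

```python
import math
import sqlite3

def sigmoid(x):
    # Pass NULL through unchanged, as SQL functions usually do.
    return None if x is None else 1.0 / (1.0 + math.exp(-x))

conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, sigmoid)

# The primary key gives the indexed lookup the slide refers to.
conn.execute("CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.executemany("INSERT INTO prediction_model VALUES (?, ?)", [
    (103, -0.4896543622016907),
    (104, -0.0955817922949791),
    (105, 0.12560302019119263),
    (106, 0.09214721620082855),
])

# Hypothetical feature vector for one request; feature 999 is not in the model.
conn.execute("CREATE TABLE testing_exploded (feature INTEGER, value REAL)")
conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)",
                 [(103, 1.0), (105, 2.0), (999, 1.0)])

prob = conn.execute("""
    SELECT sigmoid(sum(t.value * m.weight)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()[0]
```

The unmatched feature produces a NULL product, which sum() skips, so only the features known to the model contribute to the score.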
RandomForest in Hivemall
Ensemble of Decision Trees
Training of RandomForest
Prediction of RandomForest
https://console.treasuredata.com/jobs/75633717
Conclusion
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
Ø For SQL users that need ML
Ø For those who are already using Hive
Ø Ease-of-use and scalability in mind
Does not require coding, packaging, compiling or introducing a new programming language or APIs.
Treasure Data provides ML-as-a-Service using the latest version of Hivemall
We support machine learning in Cloud
Any feature request? Or, questions?