hivemall dbtechshowcase 20160713 #dbts2016

Machine Learning Made Easy by Using Hivemall

Research Engineer Makoto YUI @myui

<[email protected]>

bit.ly/hivemall

2016/07/13 DB tech showcase

Who am I?

➢ 2015/04 Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
  ➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
  ➢ XML native database and Parallel Database systems

[Figure: Treasure Data Cloud Service overview. Data sources: App log, Sensor, Apache log, CRM, ERP, RDBMS. Data Collectors: Treasure Agent, Embulk, Mobile SDK, JS SDK, Embedded. Processing: Hive (Batch), Presto (Ad hoc), Machine Learning. Output: API, ODBC/JDBC, PUSH, External Integrations (SQL Server, BI tools, Data analysis). 900,000 records stored per sec.]

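Data Imported to Treasure Data

[Chart: records imported, in units of 1 billion records (axis from 0 to 12,000). Milestones marked: service launch, Series A Funding, adoption by 100 companies, selected by Gartner as a "Cool Vendor in Big Data", 5 trillion records, 10 trillion records.]

Treasure Data by the numbers (October 2014):
➢ 400,000 records imported per second
➢ More than 10 trillion records imported
➢ 12 billion records sent every day by a single customer in the ad-tech industry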

Agenda

1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. How to use Hivemall

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

[Figure: software stack, top to bottom. Machine Learning: Hivemall, MLlib. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN, MESOS. Distributed File System: Hadoop HDFS.]

Won IDG's InfoWorld 2014 Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)

bit.ly/hivemall-award

List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression

➢ SCW is a good first choice. Try RandomForest if SCW does not work.
➢ Logistic regression is good for getting a probability of a positive class.
➢ Factorization Machines is good where features are sparse and categorical.

List of Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
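To make the idea of recommending top-k items concrete, here is a minimal plain-Python sketch of a per-group top-k selection. It only illustrates the concept and is not Hivemall's each_top_k implementation or signature; the function name top_k_per_group, the (group, item, score) row layout, and the sample scores are all hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-scoring items for each group.

    rows: iterable of (group, item, score) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for group, item, score in rows:
        groups[group].append((score, item))
    # heapq.nlargest keeps only the k best (score, item) pairs per group.
    return {
        group: [item for score, item in heapq.nlargest(k, scored)]
        for group, scored in groups.items()
    }


# Hypothetical recommendation scores: (user, item, predicted score).
scores = [
    ("u1", "a", 0.9), ("u1", "b", 0.4), ("u1", "c", 0.7),
    ("u2", "a", 0.1), ("u2", "d", 0.8),
]
recommendations = top_k_per_group(scores, k=2)
```

With k=2, each user keeps their two best-scoring items, ordered from highest to lowest score.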

Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
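As one illustration of the feature engineering items listed above, the following sketch shows the general idea behind feature hashing: feature names are mapped to a fixed range of integer indices so that the model does not need a dictionary of names. This is a generic sketch, not Hivemall's implementation. The use of MD5, the bucket count of 2^20, and the function names are assumptions chosen only to keep the example deterministic and self-contained.

```python
import hashlib


def hash_feature(name, num_buckets=2 ** 20):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into range.
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(features, num_buckets=2 ** 20):
    """Convert {name: value} into {bucket_index: value}, summing collisions."""
    hashed = {}
    for name, value in features.items():
        index = hash_feature(name, num_buckets)
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed


# Hypothetical feature vector keyed by name.
vector = hash_features({"gender=man": 1.0, "age": 34.0, "height": 173.0})
```

Two different names can land in the same bucket; this sketch sums their values in that case, which is one common way to handle collisions.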

Industry use cases of Hivemall

➢ CTR prediction of Ad click logs
  • Freakout Inc., Fan Communications, and more
  • Replaced Spark MLlib w/ Hivemall at company X
  http://www.slideshare.net/masakazusano75/sano-hmm-20150512

➢ Gender prediction of Ad click logs
  • Scaleout Inc. and Fan Communications
  http://eventdots.jp/eventreport/458208

➢ Value prediction of Real estates
  • Livesense
  http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  Source: http://itnp.net/article/2016/02/18/2286.html

➢ Churn Detection
  • OISIX
  http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

Copyright ©2015 Treasure Data. All Rights Reserved.


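Churn prediction for a membership service

• Recurring purchases by 100,000 members drive the company's overall revenue and profit, but there was no way to identify members at risk of cancelling in advance and prevent them from leaving
• Machine learning without statistical expertise
• Granting points to members on the predicted-churn list halved the churn rate
• Surfaced the campaigns and events that carry churn risk, and also made it possible to identify behavior characteristic of members who do not churn
• Also enabled indirect service improvements, such as changing the UI according to the level of risk
• Used machine learning on the past month of data to build a list of customers likely to cancel within the next month
• Specifically, each step (creating the training table -> normalization -> building the model -> logistic regression) was implemented simply as queries using TD + Hivemall

[Figure: data used for prediction, collected from Web and Mobile: attribute data, behavior logs, complaint data, traffic source, service usage data. Direct measures: granting points, care calls. Indirect measures: guiding users to a successful experience, UI changes.]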

➢ Recommendation
  • Portal site


Why Hivemall

1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.

2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.

That's why I built Hivemall.

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …) -> Machine Learning]

➢ Extract-Transform-Load: need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)
➢ Machine Learning: does not scale; have to learn R/Python APIs

How I used to do ML before Hivemall

Given raw data stored on Hadoop HDFS

Does not meet my needs in terms of its scalability, ML algorithms, and usability

I ❤ scalable SQL query

Survey on existing ML frameworks

Framework                             User interface
Mahout                                Java API programming
Spark MLlib/MLI                       Scala API programming, Scala Shell (REPL)
H2O                                   R programming, GUI
Cloudera Oryx                         Http REST API programming
Vowpal Wabbit (w/ Hadoop streaming)   C++ API programming, Command Line

Existing distributed machine learning frameworks are NOT easy to use

Hivemall's Vision: ML on SQL

[Figure: Classification with Mahout]

CREATE TABLE lr_model AS
SELECT
  feature,               -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t                      -- map-only task
GROUP BY feature;        -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop

Hivemall on Apache Spark

Installation is very easy as follows:

$ spark-shell --packages maropu:hivemall-spark:0.0.6


How to use Hivemall

[Figure: Machine Learning workflow. Training: Feature Vector + Label -> Prediction Model. Prediction: Feature Vector -> Label. Current step: Data preparation.]

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
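To show what a row of such a text file looks like once it is split by the declared delimiters, here is a small Python sketch. It assumes the tab-separated columns and comma-separated feature list described by the table definition; the sample row and the "index:value" feature strings are hypothetical, and parse_row is a name invented for this illustration.

```python
def parse_row(line):
    """Split one text row into (rowid, label, features).

    Columns are tab-separated; the features column is a comma-separated list.
    """
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")


# Hypothetical row; the "index:value" feature strings are an assumed format.
row = parse_row("1\t-3.5\t10:0.25,42:0.5\n")
```

Hive performs this splitting itself when the table is queried; the sketch only mirrors the layout so the delimiters are easier to picture.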

How to use Hivemall

[Figure: Machine Learning workflow (Training, Prediction). Current step: Feature Engineering.]

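How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0

-- ${min_label} and ${max_label} are Hive variables holding the smallest and largest label values
CREATE VIEW e2006tfidf_train_scaled AS
SELECT
  rowid,
  rescale(label, ${min_label}, ${max_label}) AS label, -- the table above names this column label
  features
FROM e2006tfidf_train;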
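Min-max normalization itself is a one-line formula, (value - min) / (max - min). The following Python sketch shows it on a few hypothetical label values. The function is named rescale only to mirror the query above; it is not Hivemall's code, and the handling of a zero-width range is a choice made for this sketch.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range; returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values.
labels = [-4.0, -3.0, -2.0, 0.0]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

The smallest label maps to 0.0 and the largest to 1.0, with the others placed proportionally in between.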

How to use Hivemall

[Figure: Machine Learning workflow (Training, Prediction). Current step: Training.]

How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

Inner SELECT: map-only task to learn a prediction model
GROUP BY feature: shuffle map-outputs to reducers by feature
avg(weight): reducers perform model averaging in parallel
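The three stages of this query (independent training per map task, grouping by feature, averaging) can be mimicked in a few lines of Python. This is only a sketch of the model-averaging step; it does not train anything, and the per-task weights below are hypothetical numbers standing in for what each map task would emit.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights across models trained on separate data splits.

    partial_models: list of {feature: weight} dicts, one per map task.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:          # each map task emits (feature, weight) rows
        for feature, weight in model.items():
            totals[feature] += weight     # rows are grouped by feature (the shuffle)
            counts[feature] += 1
    # Each group is reduced to its mean weight, like avg(weight) ... GROUP BY feature.
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical weights learned independently by two map tasks.
model = average_models([
    {"f1": 0.5, "f2": -1.0},
    {"f1": 1.5, "f3": 0.25},
])
```

A feature seen by only one map task keeps that task's weight, because the average runs over the rows that actually exist for the feature.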

How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature

Inner SELECT: training for the CW classifier
voted_avg: vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
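One way to read "vote to use negative or positive weights for avg" is sketched below in Python: count how many weights are positive and how many are negative, then average only the side that wins. This is an interpretation of the slide, not Hivemall's source code; in particular, the tie rule (falling back to the negative side) is an assumption made for this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote.

    Ties (and all-zero input) fall back to the negative side in this sketch;
    that tie rule is an assumption, not taken from the slides.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The weights from the slide: four positive votes against one negative.
weight = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the slide's example the positive side wins four to one, so the result is the mean of the four positive weights, about 0.475, and the single -0.1 is ignored.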

How to use Hivemall

[Figure: Machine Learning workflow (Training, Prediction). Current step: Prediction.]

How to use Hivemall - Prediction

CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
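The join-then-aggregate pattern in this query can be sketched in Python as a dictionary lookup per feature followed by a sum per row. The sketch assumes an exploded test table with one (rowid, feature, value) tuple per feature; multiplying by a feature value is an assumption, and with every value equal to 1.0 it reduces to summing the weights as in the query above. The sample model and rows are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(test_rows, model):
    """Score each row by joining its features against the model weights.

    test_rows: iterable of (rowid, feature, value), one tuple per feature
               (the "exploded" layout).
    model: {feature: weight}; stands in for the model table.
    """
    sums = {}
    for rowid, feature, value in test_rows:
        # Features missing from the model contribute nothing, just as the
        # NULL weight from a LEFT OUTER JOIN is skipped by sum().
        weight = model.get(feature, 0.0)
        sums[rowid] = sums.get(rowid, 0.0) + value * weight
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


# Hypothetical model and test rows.
model = {"f1": 2.0, "f2": -2.0}
test_rows = [(1, "f1", 1.0), (1, "f2", 1.0), (2, "f1", 1.0), (2, "f9", 1.0)]
probs = predict(test_rows, model)
```

One difference from SQL is worth noting: if none of a row's features appear in the model, sum() over only NULLs yields NULL, whereas this sketch yields a sum of 0.0 and therefore a probability of 0.5.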

Real-time prediction

[Figure: Machine Learning workflow. Batch Training on Hadoop -> Export prediction models -> Online Prediction on RDBMS]

bit.ly/hivemall-rtp

Export Prediction Model to a RDBMS

Periodical export is very easy in Treasure Data

[Figure: Prediction Model -> TD export -> Any RDBMS]

103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855

Real-time Prediction on MySQL

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

Online prediction on MySQL

[Figure: Feature Vector (testing_exploded) joined with Prediction Model (prediction_model)]

SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)

Index lookups are very efficient in RDBMSs

RandomForest in Hivemall

Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest

https://console.treasuredata.com/jobs/75633717

Conclusion

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

➢ For SQL users that need ML
➢ For those already using Hive
➢ Ease of use and scalability in mind

Hivemall's Positioning: does not require coding, packaging, compiling, or introducing a new programming language or APIs.

Treasure Data provides ML-as-a-Service using the latest version of Hivemall

We support machine learning in Cloud

Any feature request? Or, questions?