hivemall dbtechshowcase 20160713 #dbts2016

Machine Learning Made Easy by using Hivemall. Research Engineer Makoto YUI @myui <[email protected]> bit.ly/hivemall. 2016/07/13 DB tech showcase

Upload: makoto-yui

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Hivemall dbtechshowcase 20160713 #dbts2016

Machine Learning Made Easy by using Hivemall

Research Engineer Makoto YUI @myui

<[email protected]>

bit.ly/hivemall

2016/07/13 DB tech showcase

Page 2: Hivemall dbtechshowcase 20160713 #dbts2016

Who am I?

➢ 2015/04 Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
  ➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
  ➢ XML native database and Parallel Database systems

Page 3: Hivemall dbtechshowcase 20160713 #dbts2016

Treasure Data Cloud Service

[Figure: Treasure Data Cloud Service overview. Data sources: server, CRM, RDBMS, app log, sensor, Apache log, ERP. Data collectors: Treasure Agent (push), Embulk, Mobile SDK, JS SDK, embedded. Inside the service: SQL (Hive for batch, Presto for ad hoc) and machine learning. Outputs: external integrations via API and ODBC/JDBC, BI tools, data analysis.]

900,000 records stored per sec.

Page 4: Hivemall dbtechshowcase 20160713 #dbts2016

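Data Imported to Treasure Data

[Figure: chart of cumulative imported records. Vertical axis: 0 to 12,000 (unit: billion records). Annotated milestones: service launch, Series A funding, adoption by 100 companies, selected as a Gartner "Cool Vendor in Big Data", 5 trillion records, 10 trillion records.]

Treasure Data by the numbers (October 2014):
• 400,000 records imported per second
• More than 10 trillion records imported in total
• 12 billion records sent every day by a single customer in the ad-tech industry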

Page 5: Hivemall dbtechshowcase 20160713 #dbts2016

1. What is Hivemall (short intro.)

2. Why Hivemall (motivations etc.)

3. How to use Hivemall

Agenda


Page 6: Hivemall dbtechshowcase 20160713 #dbts2016

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

[Figure: software stack, top to bottom.
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System: Hadoop HDFS]

Page 7: Hivemall dbtechshowcase 20160713 #dbts2016

Won IDG's InfoWorld 2014 Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)

bit.ly/hivemall-award

Page 8: Hivemall dbtechshowcase 20160713 #dbts2016

List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression

Page 9: Hivemall dbtechshowcase 20160713 #dbts2016

List of supported Algorithms

Classification
✓ SCW is a good first choice
✓ Try RandomForest if SCW does not work

Regression
✓ Logistic regression is good for getting a probability of a positive class

Classification and Regression
✓ Factorization Machines is good where features are sparse and categorical

Page 10: Hivemall dbtechshowcase 20160713 #dbts2016

List of Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
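The top-k idea can be illustrated outside Hive. The following is a minimal Python sketch, assuming input rows of (user, item, score); the helper name `top_k_per_group` and the sample rows are hypothetical, and the sketch does not reproduce the signature of Hivemall's each_top_k.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """Return {group: [(score, item), ...]} holding the k best-scoring items per group."""
    groups = defaultdict(list)
    for group, item, score in rows:
        groups[group].append((score, item))
    # heapq.nlargest returns the k largest entries, ordered from highest to lowest
    return {g: heapq.nlargest(k, scored) for g, scored in groups.items()}

# Hypothetical (user, item, score) rows
rows = [
    ("u1", "a", 0.9), ("u1", "b", 0.4), ("u1", "c", 0.7),
    ("u2", "a", 0.1), ("u2", "d", 0.8),
]
recommendations = top_k_per_group(rows, k=2)
```

Keeping only k entries per group is what makes this pattern attractive for recommendation: the output size is bounded by k times the number of users, however many candidate items were scored.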

Page 11: Hivemall dbtechshowcase 20160713 #dbts2016

Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Page 12: Hivemall dbtechshowcase 20160713 #dbts2016

Industry use cases of Hivemall

Ø CTR prediction of Ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X

http://www.slideshare.net/masakazusano75/sano-hmm-20150512

Page 13: Hivemall dbtechshowcase 20160713 #dbts2016

Industry use cases of Hivemall

Ø Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications

http://eventdots.jp/eventreport/458208

Page 14: Hivemall dbtechshowcase 20160713 #dbts2016

Industry use cases of Hivemall

Ø Value prediction of Real estates
• Livesense

http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall

Page 15: Hivemall dbtechshowcase 20160713 #dbts2016

Source: http://itnp.net/article/2016/02/18/2286.html

Industry use cases of Hivemall


Page 16: Hivemall dbtechshowcase 20160713 #dbts2016

Industry use cases of Hivemall

Ø Churn Detection
• OISIX

http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

Page 17: Hivemall dbtechshowcase 20160713 #dbts2016

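Copyright ©2015 Treasure Data. All Rights Reserved.

Churn prediction for a membership service

• Recurring purchases by 100,000 members drive the company's overall revenue and profit, but there was no way to identify members at risk of churning in advance or to prevent them from leaving
• Machine learning without statistical expertise
• Granting points to members on the churn-prediction list halved the churn rate
• Surfaced the campaigns and events that carry churn risk, and made it possible to identify behavior characteristic of members who do not churn
• Also enabled indirect service improvements, such as changing the UI according to the degree of risk
• Machine learning on the past month of data produces a list of customers likely to churn within the next month
• Specifically, each step (training table creation -> normalization -> model training -> logistic regression) is implemented simply as queries using TD + Hivemall

[Figure: data used for prediction (Web and Mobile): attribute information, behavior logs, complaint information, acquisition source, service usage information. Direct measures: point grants, care calls. Indirect measures: guiding users to a successful experience, UI changes.]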

Page 18: Hivemall dbtechshowcase 20160713 #dbts2016

Industry use cases of Hivemall

Ø Recommendation
• Portal site

Page 19: Hivemall dbtechshowcase 20160713 #dbts2016

1. What is Hivemall (short intro.)

2. Why Hivemall (motivations etc.)

3. How to use Hivemall

Agenda


Page 20: Hivemall dbtechshowcase 20160713 #dbts2016

Why Hivemall

1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.

2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.

That's why I built Hivemall.

Page 21: Hivemall dbtechshowcase 20160713 #dbts2016

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: raw data (HDFS, S3) -> Extract-Transform-Load -> feature vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …) -> Machine Learning]

Page 22: Hivemall dbtechshowcase 20160713 #dbts2016

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: raw data (HDFS, S3) -> Extract-Transform-Load -> feature vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …) -> Machine Learning]

Extract-Transform-Load: need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)

Page 23: Hivemall dbtechshowcase 20160713 #dbts2016

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: raw data (HDFS, S3) -> Extract-Transform-Load -> feature vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …)]

Machine learning tools:
• Do not scale
• Have to learn R/Python APIs

Page 24: Hivemall dbtechshowcase 20160713 #dbts2016

How I used to do ML before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: raw data (HDFS, S3) -> Extract-Transform-Load -> feature vector (height: 173cm, weight: 60kg, age: 34, gender: man, …)]

Does not meet my needs in terms of its scalability, ML algorithms, and usability

I ❤ scalable SQL query

Page 25: Hivemall dbtechshowcase 20160713 #dbts2016

Survey on existing ML frameworks

Framework                             User interface
Mahout                                Java API programming
Spark MLlib/MLI                       Scala API programming, Scala Shell (REPL)
H2O                                   R programming, GUI
Cloudera Oryx                         HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)   C++ API programming, Command Line

Existing distributed machine learning frameworks are NOT easy to use

Page 26: Hivemall dbtechshowcase 20160713 #dbts2016

Hivemall's Vision: ML on SQL

Classification with Mahout

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)

✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop

Page 27: Hivemall dbtechshowcase 20160713 #dbts2016

Hivemall on Apache Spark

Installation is very easy as follows:

$ spark-shell --packages maropu:hivemall-spark:0.0.6

Page 28: Hivemall dbtechshowcase 20160713 #dbts2016

1. What is Hivemall

2. Why Hivemall

3. How to use Hivemall

Agenda


Page 29: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall

[Figure: machine learning workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector + prediction model -> label.]

Step: Data preparation

Page 30: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

Create external table e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
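To make the delimiters in the table definition concrete, here is a minimal Python sketch of how one line of the underlying text file maps onto the three columns. The helper `parse_row`, the sample line, and the "index:value" encoding of each feature string are assumptions for illustration; the slide only states that features are an array of strings.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features) using the table's delimiters."""
    # FIELDS TERMINATED BY '\t'
    rowid, label, features = line.rstrip("\n").split("\t")
    # COLLECTION ITEMS TERMINATED BY ","
    return int(rowid), float(label), features.split(",")

# Hypothetical sample line; the "index:value" feature encoding is an assumption
sample = "1\t-3.58\t12:0.5,407:0.25,1033:0.125"
rowid, label, features = parse_row(sample)
```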

Page 31: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall

[Figure: machine learning workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector + prediction model -> label.]

Step: Feature Engineering

Page 32: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;

Transforming a label value to a value between 0.0 and 1.0
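Min-max normalization itself is a one-line formula, (value - min) / (max - min). The Python sketch below shows that arithmetic on hypothetical label values; the handling of a degenerate range (min equal to max) is an assumption, not something the slide specifies.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption: a degenerate range maps to the midpoint
    return (value - min_value) / (max_value - min_value)

labels = [-7.0, -4.0, -1.0]  # hypothetical label values
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

In the view, `${min_label}` and `${max_label}` play the role of `lo` and `hi`, so they have to be computed over the training labels before the view is defined.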

Page 33: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall

[Figure: machine learning workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector + prediction model -> label.]

Step: Training

Page 34: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

• Inner SELECT: map-only task to learn a prediction model
• GROUP BY: shuffle map-outputs to reducers by feature
• avg(weight): reducers perform model averaging in parallel
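The map/shuffle/reduce structure of this query can be mimicked in a few lines of Python. This is a sketch under stated assumptions: each "mapper" runs plain SGD for logistic regression on its own shard and emits (feature, weight) pairs, and the "reducers" average weights per feature. The function names, learning rate, epoch count, and toy data are hypothetical, and the update rule is a textbook one rather than Hivemall's logress implementation.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_shard(rows, eta=0.1, epochs=20):
    """One 'mapper': plain SGD for logistic regression over its own shard of rows."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            pred = sigmoid(sum(w[f] * v for f, v in features.items()))
            for f, v in features.items():
                w[f] += eta * (label - pred) * v  # gradient step on the log loss
    return list(w.items())  # emitted as (feature, weight) pairs

def average_models(shard_outputs):
    """The 'reducers': group emitted pairs by feature and average the weights."""
    sums, counts = defaultdict(float), defaultdict(int)
    for pairs in shard_outputs:
        for feature, weight in pairs:
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Hypothetical toy data: feature "a" signals label 1, feature "b" signals label 0
shard1 = [({"a": 1.0}, 1.0), ({"b": 1.0}, 0.0)]
shard2 = [({"a": 1.0}, 1.0), ({"b": 1.0}, 0.0), ({"a": 1.0, "b": 1.0}, 1.0)]
model = average_models([train_shard(shard1), train_shard(shard2)])
```

Because each shard is trained independently, the only coordination needed is the per-feature grouping, which is exactly what GROUP BY feature expresses in the SQL.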

Page 35: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature

• train_cw: training for the CW classifier
• voted_avg: vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
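One reading of the voting note above is: count positive and negative weights, then average only the side that wins the vote. The Python sketch below implements that reading on the slide's example values. The tie rule (ties fall to the negative side) and the treatment of zeros are assumptions, so treat this as an illustration of the idea rather than a specification of Hivemall's voted_avg.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: ties fall to the negative side; zeros do not vote
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0

# Example weights from the slide: four positive votes, one negative vote
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

With these values the positive side wins four to one, so the result is the mean of the four positive weights, roughly 0.475, and the single -0.1 is ignored instead of dragging the average down.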

Page 36: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall

[Figure: machine learning workflow. Training: feature vectors and labels -> machine learning -> prediction model. Prediction: feature vector + prediction model -> label.]

Step: Prediction

Page 37: Hivemall dbtechshowcase 20160713 #dbts2016

How to use Hivemall - Prediction

CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
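The join-then-aggregate logic of this query can be traced in plain Python. In this sketch the model is a dictionary from feature to weight and the test data is a list of (rowid, feature) pairs, one per exploded feature; the names and sample values are hypothetical. One deliberate difference from SQL: a row whose features are all missing from the model yields NULL in the query but 0.5 here, because missing weights are treated as 0.0.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, model):
    """Mimic the LEFT OUTER JOIN + GROUP BY rowid: sum matched weights per row."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # Features missing from the model behave like NULL weights and add nothing
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}

# Hypothetical model and exploded test rows (one row per rowid/feature pair)
lr_model = {"f1": 2.0, "f2": -1.0}
testing_exploded = [(1, "f1"), (1, "f2"), (2, "f2"), (2, "unseen")]
probs = predict(testing_exploded, lr_model)
```

The dictionary lookup stands in for the join. In Hive the same matching is done by the distributed join, which is why the model never has to fit in a single process's memory.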

Page 38: Hivemall dbtechshowcase 20160713 #dbts2016

Real-time prediction

[Figure: batch training on Hadoop: feature vectors and labels -> machine learning -> prediction model. Export prediction models. Online prediction on RDBMS: feature vector + prediction model -> label.]

bit.ly/hivemall-rtp

Page 39: Hivemall dbtechshowcase 20160713 #dbts2016

Export Prediction Model to an RDBMS

[Figure: prediction model -> TD export -> any RDBMS]

Periodical export is very easy in Treasure Data

Prediction Model (feature, weight):
103   -0.4896543622016907
104   -0.0955817922949791
105    0.12560302019119263
106    0.09214721620082855

Page 40: Hivemall dbtechshowcase 20160713 #dbts2016

Real-time Prediction on MySQL

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

[Figure: feature vector + prediction model -> label]

Online prediction on MySQL

SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)

Index lookups are very efficient in RDBMSs
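The online-prediction query can be tried end to end with an embedded database. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example runs without a server. The model weights are the four rows shown on the export slide; the feature values in testing_exploded, including the unseen feature 999, are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
# Register SIGMOID(x) = 1.0 / (1.0 + exp(-x)) as a SQL function; NULL stays NULL
conn.create_function(
    "sigmoid", 1, lambda x: None if x is None else 1.0 / (1.0 + math.exp(-x))
)
conn.executescript("""
    -- The primary key gives the join an index to look up each feature
    CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL);
    CREATE TABLE testing_exploded (feature INTEGER, value REAL);
""")
conn.executemany("INSERT INTO prediction_model VALUES (?, ?)", [
    (103, -0.4896543622016907), (104, -0.0955817922949791),
    (105, 0.12560302019119263), (106, 0.09214721620082855),
])
# Hypothetical feature vector; feature 999 is absent from the model
conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)",
                 [(103, 1.0), (105, 2.0), (999, 1.0)])
(prob,) = conn.execute("""
    SELECT sigmoid(sum(t.value * m.weight)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()
```

The unseen feature joins to a NULL weight, and sum() skips NULLs, so it contributes nothing to the score. Keying the model table on feature is what lets each lookup use an index instead of scanning the model.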

Page 41: Hivemall dbtechshowcase 20160713 #dbts2016

RandomForest in Hivemall

Ensemble of Decision Trees


Page 42: Hivemall dbtechshowcase 20160713 #dbts2016

Training of RandomForest


Page 43: Hivemall dbtechshowcase 20160713 #dbts2016

Prediction of RandomForest


Page 44: Hivemall dbtechshowcase 20160713 #dbts2016


https://console.treasuredata.com/jobs/75633717

Page 45: Hivemall dbtechshowcase 20160713 #dbts2016

Conclusion

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

Ø For SQL users that need ML
Ø For those already using Hive
Ø Ease of use and scalability in mind

Does not require coding, packaging, compiling or introducing a new programming language or APIs.

Hivemall's Positioning

Treasure Data provides ML-as-a-Service using the latest version of Hivemall

Page 46: Hivemall dbtechshowcase 20160713 #dbts2016

We support machine learning in Cloud

Any feature request? Or, questions?