HadoopCon'16, Taipei @myui
![Page 1: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/1.jpg)
Hivemall: Machine Learning Library for Apache Hive/Spark
Research Engineer Makoto YUI (油井誠) @myui
2016/09/09 HadoopCon'16, Taipei
![Page 2: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/2.jpg)
Little about me..
• 2015.04~ Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
• 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology (産業技術総合研究所), Japan
  • Developed Hivemall as a personal research project
• 2009.03 Ph.D. in Computer Science from NAIST
  • Majored in Parallel Data Processing, not ML then
• Visiting scholar at CWI, Amsterdam and Univ. Edinburgh
![Page 3: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/3.jpg)
Treasure Data
• Hiro Yoshikawa, CEO: open source business veteran
• Kaz Ota, CTO: founder of the world's largest Hadoop group
• Sada Furuhashi, Chief Architect: invented Fluentd and MessagePack

• Today: 100+ employees, 30M+ funding
• 2015: New office in Seoul, Korea
• 2013: New office in Tokyo, Japan
• 2012: Founded in Mountain View, CA

Investors
• Jerry Yang, Yahoo! founder
• Bill Tai, angel investor
• Yukihiro Matsumoto, Ruby inventor
• Sierra Ventures (Tim Guleri), enterprise software
• Scale Ventures (Andy Vitus), B2B SaaS
![Page 4: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/4.jpg)
We open-source! TD invented..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming query processor
• Machine learning on Hadoop
• Workflow engine (Beta): digdag.io
![Page 5: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/5.jpg)
Our technology users
Point: Microsoft Operations Management Suite and Google Cloud Platform (Kubernetes) are using Fluentd for log collection
![Page 6: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/6.jpg)
![Page 7: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/7.jpg)
Treasure Data's Solution
![Page 8: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/8.jpg)
Big Data Stats in TD
![Page 9: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/9.jpg)
Our Customers
• Ad-tech: Agency / Trading Desk, DMP / DSP, Ad-Network
• IoT: Mitsubishi Heavy Industries (三菱重工)
• EC, Media, Game/SNS: Gaming, e-Commerce, Internet Service
• Other domain: Retail, Finance, Technology, Telecommunication, Maker
![Page 10: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/10.jpg)
![Page 11: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/11.jpg)
Agenda
1. What is Hivemall (introduction)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
5. Future roadmap
![Page 12: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/12.jpg)
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall
![Page 13: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/13.jpg)
Hivemall's Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
![Page 14: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/14.jpg)
Hivemall's Vision: ML on SQL
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
![Page 15: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/15.jpg)
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines are good where features are sparse and categorical.
![Page 16: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/16.jpg)
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
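The kind of per-group top-k selection that each_top_k is described as being useful for can be sketched in plain Python. This is only an illustration of the idea, not Hivemall's implementation: the function name `top_k_per_group` and the `(group, score, item)` row layout are assumptions made for the example.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """Return the k highest-scoring items for each group.

    rows: iterable of (group, score, item) tuples (hypothetical layout).
    """
    grouped = defaultdict(list)
    for group, score, item in rows:
        grouped[group].append((score, item))
    # heapq.nlargest keeps the k best (score, item) pairs, best first.
    return {g: heapq.nlargest(k, scored) for g, scored in grouped.items()}

rows = [
    ("user1", 0.9, "itemA"),
    ("user1", 0.2, "itemB"),
    ("user1", 0.7, "itemC"),
    ("user2", 0.4, "itemA"),
]
top2 = top_k_per_group(rows, 2)
```

Here each user plays the role of a group, and only the two best-scored items per user survive.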
![Page 17: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/17.jpg)
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
![Page 18: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/18.jpg)
Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
![Page 19: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/19.jpg)
Industry use cases of Hivemall
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
Problem: Recommendation based on hot items is hard in a handcrafted-product market because each creator sells only a few one-off items (which soon go out of stock)
minne.com
![Page 20: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/20.jpg)
Industry use cases of Hivemall
• Value prediction of real estate
  • Algorithm: Regression
  • Livesense
![Page 21: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/21.jpg)
Industry use cases of Hivemall
• User score calculation
  • Algorithm: Regression
  • Klout (klout.com): influencer marketing
bit.ly/klout-hivemall
![Page 22: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/22.jpg)
Churn Detection of Monthly Payment Service
OISIX, a leading food delivery service company in Japan, used Hivemall's Logistic Regression to get churn probability
Churn rate dropped by almost half by giving gift points to customers predicted to leave :)
![Page 23: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/23.jpg)
![Page 24: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/24.jpg)
Motivation - Why a new ML framework?
Machine Learning frameworks out there that run with Hadoop:
• Mahout?
• Vowpal Wabbit? (w/ Hadoop streaming)
• Spark MLlib?
• 0xdata H2O?
• Cloudera Oryx?
Quick Poll: How many people in this room are using them?
![Page 25: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/25.jpg)
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …) → Machine Learning]
![Page 26: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/26.jpg)
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file → Machine Learning]
Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)
![Page 27: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/27.jpg)
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file → Machine Learning]
Does not scale. Have to learn R/Python APIs.
![Page 28: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/28.jpg)
Hivemall's Vision: ML on SQL (again)
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
![Page 29: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/29.jpg)
Hivemall on Apache Spark
Installation is very easy as follows:
$ spark-shell --packages maropu:hivemall-spark:0.0.6
![Page 30: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/30.jpg)
![Page 31: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/31.jpg)
How Hivemall works in training
Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation.
[Diagram: Training table of tuple<label, array<features>> rows (e.g. +1,<1,2> .. +1,<1,7,9> and -1,<1,3,9> .. +1,<3,8>) → parallel train UDTFs emit tuple<feature, weights> → shuffle by feature → param-mix → prediction model as Relation<feature, weights>]
● The resulting prediction model is a relation of features and their weights
● The numbers of mappers and reducers are configurable
Parallelism is Powerful
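The "shuffle by feature" and "param-mix" steps of the diagram can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not Hivemall's code: the function names `shuffle_by_feature` and `param_mix` are made up for the example, and plain averaging is assumed as the mixing rule.

```python
from collections import defaultdict

def shuffle_by_feature(mapper_outputs):
    """Group the (feature, weight) rows emitted by every train task by feature."""
    shuffled = defaultdict(list)
    for output in mapper_outputs:          # one list of rows per train task
        for feature, weight in output:
            shuffled[feature].append(weight)
    return shuffled

def param_mix(shuffled):
    """Mix the weights of each feature; plain averaging is assumed here."""
    return {feature: sum(ws) / len(ws) for feature, ws in shuffled.items()}

# Hypothetical outputs of two parallel train tasks.
mapper_a = [(1, 0.5), (2, -0.25), (9, 1.0)]
mapper_b = [(1, 1.5), (3, 0.75)]
model = param_mix(shuffle_by_feature([mapper_a, mapper_b]))
```

The resulting `model` maps each feature to one weight, which corresponds to the Relation<feature, weights> at the end of the diagram.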
![Page 32: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/32.jpg)
Parameter averaging (bagging)
[Diagram: the training table of tuple<label, features> rows (e.g. +1,<1,2> .. +1,<1,7,9>; -1,<1,3,9> .. +1,<3,8>; -1,<2,7,9> .. +1,<3,8>) is split across parallel train tasks; each task holds an array<weight>, and the weights are exchanged through MIX]
![Page 33: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/33.jpg)
Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to enumerate iteration effects in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand()
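What the view above does to the training rows can be sketched in plain Python: every row is emitted xtimes times and the result is shuffled. This is only a sketch of the idea; the helper names and the sample rows are assumptions, and `random.shuffle` merely stands in for CLUSTER BY rand().

```python
import random

def amplify(xtimes, rows):
    """Emit every input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row

def amplified_and_shuffled(xtimes, rows, seed=None):
    out = list(amplify(xtimes, rows))
    random.Random(seed).shuffle(out)   # stand-in for CLUSTER BY rand()
    return out

# Hypothetical (rowid, label, features) rows.
training = [(1, +1, ["1", "2"]), (2, -1, ["1", "3", "9"])]
training_x3 = amplified_and_shuffled(3, training, seed=42)
```

A learner that reads `training_x3` once sees every example three times in random order, which is how a single pass can approximate the effect of several iterations.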
![Page 34: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/34.jpg)
![Page 35: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/35.jpg)
How to use Hivemall
[Diagram: Training takes labels and feature vectors and builds a prediction model via machine learning; Prediction applies the prediction model to feature vectors and outputs labels. Current step: Data preparation]
![Page 36: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/36.jpg)
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
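To make the row format declared by this table concrete, here is a small Python sketch that parses one text line the same way: fields split on tabs, array items split on commas. The function name `parse_row` and the sample line (including the "index:value" look of the feature strings) are assumptions for illustration only.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # features is the comma-separated ARRAY<STRING> column
    return int(rowid), float(label), features.split(",")

# Hypothetical line in the declared layout.
row = parse_row("1\t-3.5\t10:0.25,42:0.5\n")
```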
![Page 37: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/37.jpg)
How to use Hivemall
[Diagram: training/prediction pipeline. Current step: Feature Engineering]
![Page 38: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/38.jpg)
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
Transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
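Min-max normalization itself is a one-line formula, (value - min) / (max - min). The Python sketch below shows it under assumptions: the handling of a degenerate range (min equal to max) is a choice made for this example and is not taken from the slides.

```python
def rescale(value, min_value, max_value):
    """Map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: avoid division by zero on a degenerate range.
        return 0.5
    return (value - min_value) / (max_value - min_value)

scaled = [rescale(v, -2.0, 6.0) for v in (-2.0, 2.0, 6.0)]
```

The minimum and maximum play the role of `${min_label}` and `${max_label}` and have to be computed over the training data beforehand.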
![Page 39: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/39.jpg)
How to use Hivemall
[Diagram: training/prediction pipeline. Current step: Training]
![Page 40: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/40.jpg)
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature
• The inner SELECT is a map-only task to learn a prediction model
• GROUP BY shuffles map outputs to reducers by feature
• Reducers perform model averaging in parallel
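The three stages of this query (per-task training, shuffle by feature, averaging) can be imitated in plain Python. This is a rough sketch and not what logress does internally: one SGD pass with a fixed learning rate `eta`, 0/1 labels, and binary features are all assumptions, and the function names are made up for the example.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_partition(rows, eta=0.1):
    """One SGD pass of logistic regression over one partition.

    rows: list of (features, label) with label 0 or 1 and binary features.
    Returns (feature, weight) pairs, playing the role of one map task's output.
    """
    weights = defaultdict(float)
    for features, label in rows:
        predicted = sigmoid(sum(weights[f] for f in features))
        for f in features:
            weights[f] += eta * (label - predicted)
    return list(weights.items())

def average_by_feature(pairs):
    """Counterpart of GROUP BY feature with avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in pairs:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two hypothetical partitions of the training table.
partition_1 = [(["a"], 1), (["b"], 0)]
partition_2 = [(["a", "c"], 1)]
emitted = train_partition(partition_1) + train_partition(partition_2)
lr_model = average_by_feature(emitted)
```

Each partition is trained independently, so the map side needs no coordination; only the small (feature, weight) rows are shuffled and averaged.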
![Page 41: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/41.jpg)
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature
Vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
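One way to read "vote to use negative or positive weights for avg" is a majority vote on the sign followed by averaging only the winning side. The Python sketch below implements that reading; it is an interpretation of the slide, not Hivemall's source, and the handling of ties and of all-zero input is an arbitrary choice for the example.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: nothing to vote on
    # Assumption: ties go to the negative side in this sketch.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)

# The weights shown on the slide: four positive votes, one negative.
weight = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example the single negative weight is discarded and the four positive weights are averaged.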
![Page 42: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/42.jpg)
How to use Hivemall
[Diagram: training/prediction pipeline. Current step: Prediction]
![Page 43: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/43.jpg)
How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
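The join-then-aggregate logic of this query can be sketched in plain Python. It is only an illustration: `testing_exploded` is modeled as (rowid, feature) pairs and the model as a dict, both assumptions for the example. Note one difference from SQL: a row whose features are all missing from the model would yield NULL in the query, while this sketch returns sigmoid(0) = 0.5.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, lr_model):
    """testing_exploded: iterable of (rowid, feature); lr_model: feature -> weight."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature absent from the model adds nothing to the sum.
        totals[rowid] += lr_model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}

# Hypothetical model and exploded test rows.
lr_model = {"a": 1.0, "b": -1.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "a"), (3, "unseen")]
probs = predict(testing_exploded, lr_model)
```

Because the lookup happens row by row, only the weights matching the test features are touched, which mirrors the point that the whole model never has to sit in memory.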
![Page 44: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/44.jpg)
Real-time prediction
[Diagram: Batch Training on Hadoop builds the prediction model via machine learning; prediction models are exported for Online Prediction on RDBMS, which maps feature vectors to labels]
bit.ly/hivemall-rtp
![Page 45: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/45.jpg)
RandomForest in Hivemall
Ensemble of Decision Trees
![Page 46: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/46.jpg)
Training of RandomForest
![Page 47: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/47.jpg)
Prediction of RandomForest
![Page 48: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/48.jpg)
![Page 49: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/49.jpg)
Future of Hivemall
Hivemall will become Apache Hivemall (?) Now on voting though..
![Page 50: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/50.jpg)
Apache Incubation status
![Page 51: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/51.jpg)
Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  • Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  • Hivemall on Apache Pig
  • Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  • Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>
![Page 52: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/52.jpg)
Project mentors
Nominated Mentors
• Reynold Xin <Databricks, ASF member>, Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>, Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>, Apache Spark PMC member
Champion
• Roman Shaposhnik <Pivotal, ASF member>, Apache Bigtop / Incubator PMC member
![Page 53: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/53.jpg)
Roadmap
• Possibly enter Apache Incubator soon
• Sept to Nov:
  • IP clearance and project/repository site setup
  • Contribution guideline
  • Create a who-uses-Hivemall list
  • More documentation!
• Initial Apache release will be Dec (or late Nov?)
![Page 54: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/54.jpg)
Coming New Features - already merged in Master
✓ Hivemall on Spark 2.0 w/ DataFrame support
✓ XGBoost support
Please refer to bit.ly/hivemall-xgboost for detail
![Page 55: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/55.jpg)
Coming New Features - already merged in Master
✓ ChangeFinder
  • Efficient algorithm for finding change points and outliers from time series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
![Page 56: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/56.jpg)
![Page 57: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/57.jpg)
Coming New Features - already merged in Master
✓ Various Evaluation Metrics
  • PR#326
![Page 58: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/58.jpg)
Other undergoing new features
• v0.5-beta{1,2} release (Oct-Nov)
✓ one-hot encoding
✓ Field-aware Factorization Machines
✓ Kernelized Passive Aggressive
✓ Generalized Linear Model
✓ Optimizer framework including ADAM
✓ L1/L2 regularization

✓ Gradient Tree Boosting
✓ Online LDA
![Page 59: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/59.jpg)
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
• For SQL users that need ML
• For those already using Hive
• Ease-of-use and scalability in mind
Does not require coding, packaging, compiling or introducing a new programming language or APIs.
We welcome your contributions to Apache Hivemall :)
![Page 60: HadoopCon'16, Taipei @myui](https://reader033.vdocuments.site/reader033/viewer/2022052606/58f0da821a28ab1d6b8b45b1/html5/thumbnails/60.jpg)
Any feature requests or questions?
#hivemall