Apache Hivemall: Machine Learning Library for Apache Hive/Spark/Pig
Research Engineer Makoto YUI @myui
2016/10/29 @Dots
Little about me…
• 2015.04~ Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
• 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
  • Developed Hivemall as a personal research project
• 2009.03 Ph.D. in Computer Science from NAIST
  • Majored in Parallel Data Processing, not ML then
• Visiting scholar in CWI, Amsterdam and Univ. Edinburgh
Treasure Data
Hiro Yoshikawa, CEO: Open source business veteran
Kaz Ota, CTO: Founder - world’s largest Hadoop group
Sada Furuhashi, Chief Architect: Invented Fluentd, MessagePack
TODAY: 100+ Employees, 30M+ funding
2015: New office in Seoul, Korea
2013: New office in Tokyo, Japan
2012: Founded in Mountain View, CA
Investors
• Jerry Yang, Yahoo! Founder
• Bill Tai, Angel Investor
• Yukihiro Matsumoto, Ruby Inventor
• Sierra Ventures - Tim Guleri, Enterprise Software
• Scale Ventures - Andy Vitus, B2B SaaS
We open-source! TD invented..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming query processor
• Machine learning on Hadoop
• Workflow engine (Beta): digdag.io
Agenda
1. What is Hivemall (introduction)
2. How to use Hivemall
3. Roadmap and coming new features
Hivemall entered Apache Incubator on Sept 13, 2016 🎉
hivemall.incubator.apache.org
@ApacheHivemall
Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  • Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  • Hivemall on Apache Pig
  • Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  • Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>
Project mentors (Champion and Nominated Mentors)
• Reynold Xin <Databricks, ASF member>, Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>, Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>, Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>, Apache Bigtop/Incubator PMC member
What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
Multi/Cross-platform, Versatile, Scalable, Ease-of-use
Hivemall is easy and scalable…
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
This SQL query automatically runs in parallel on a Hadoop cluster
Hivemall is Multi/Cross-platform..
Hivemall is a multi/cross-platform ML library
HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive
Hivemall is a Versatile library..
✓ Hivemall is not only for Machine Learning
✓ Hivemall provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing!
Don’t Repeat Yourself!
Hivemall generic functions
• Array and Map
• Bit and compress
• String and NLP
We welcome contributions of your generic UDFs to Hivemall!
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting a probability of a positive class.
Factorization Machines are good where features are sparse and categorical.
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
Top-k query processing
List top-2 students for each class
Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
Result:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60
Top-k query processing
List top-2 students for each class, using rank() over():
SELECT * FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
Top-k query processing
List top-2 students for each class, using each_top_k:
SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
Top-k query processing by RANK OVER()
[Diagram, on each node: partition by class → sort by class, score → rank over() → filter rank <= 2]
Top-k query processing by EACH_TOP_K
[Diagram, on each node: distributed by class → sort by class → each_top_k → OUTPUT only K items]
Comparison between RANK and EACH_TOP_K
• EACH_TOP_K: distributed by class → sort by class → each_top_k (OUTPUT only K items)
• RANK: sort by class, score (SORTING IS HEAVY) → rank over() (NEED TO PROCESS ALL) → filter rank <= 2
each_top_k is very efficient where the number of classes is large
A bounded priority queue is utilized
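The bounded priority queue idea can be sketched in plain Python. This is only an illustration of the technique, not Hivemall's implementation: the function name `each_top_k_sketch`, the tuple layout, and the tie-breaking rule are assumptions made for the example. It assumes rows arrive already clustered by class, as DISTRIBUTE BY / SORT BY class would arrange them, so a single heap of at most k entries is alive at any time.

```python
import heapq
from itertools import groupby


def each_top_k_sketch(k, rows):
    """Yield (rank, score, group, payload) for the k highest scores per group.

    Hypothetical helper: rows are (group, score, payload) tuples that are
    already clustered by group. Only k entries per group are ever held.
    """
    for group, members in groupby(rows, key=lambda r: r[0]):
        heap = []  # min-heap that never grows beyond k entries
        for seq, (_, score, payload) in enumerate(members):
            # -seq makes the earlier row win when scores tie (an assumption).
            entry = (score, -seq, payload)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            else:
                # Push the new entry, then evict the smallest of the k + 1.
                heapq.heappushpop(heap, entry)
        ordered = sorted(heap, reverse=True)  # sorts k items, not the whole group
        for rank, (score, _, payload) in enumerate(ordered, start=1):
            yield (rank, score, group, payload)


# The student table from the slides, as (class, score, student) tuples.
students = [("b", 70, 1), ("a", 80, 2), ("a", 90, 3),
            ("b", 50, 4), ("a", 70, 5), ("b", 60, 6)]
clustered = sorted(students, key=lambda r: r[0])  # stands in for SORT BY class
top2 = list(each_top_k_sketch(2, clustered))
```

With this input, `top2` holds students 3 and 2 for class a and students 1 and 6 for class b, matching the result table shown earlier. The per-group cost is O(n log k) with O(k) memory, whereas ranking requires ordering every row of every group.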
Performance reported by TD customer
• 1,000 students in each class
• 20 million classes
RANK over() query does not finish in 24 hours 🙁
EACH_TOP_K finishes in 2 hours 🙂
Refer for details: https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
http://www.slideshare.net/eventdotsjp/hivemall
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
minne.com
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
• User score calculation
  • Algorithm: Regression
  • Klout
  • bit.ly/klout-hivemall
Influencer marketing
klout.com
Churn Detection of Monthly Payment Service
OISIX, a leading food delivery service company in Japan, used Hivemall’s Logistic Regression to get churn probability
Churn rate dropped almost by half by giving gift points to customers predicted to leave 🙂
Agenda
1. What is Hivemall (introduction)
2. How to use Hivemall
3. Roadmap and coming new features
How to use Hivemall
[Diagram: Machine Learning workflow. Training: Feature Vector + Label → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Current step: Data preparation]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall
[Diagram repeated: Machine Learning workflow. Current step: Feature Engineering]
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
CREATE VIEW e2006tfidf_train_scaled AS
SELECT
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
FROM
  e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
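Min-max normalization itself is a one-line formula, (value - min) / (max - min). The sketch below shows it in plain Python. It is not Hivemall's `rescale` source: the function name `min_max_rescale`, the sample labels, and the choice to reject a zero-width range are assumptions made for the example.

```python
def min_max_rescale(value, min_value, max_value):
    """Map value linearly so that min_value -> 0.0 and max_value -> 1.0."""
    if max_value == min_value:
        # Assumption for this sketch: a zero-width range is treated as an error.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values; in the query these bounds are supplied
# through ${min_label} and ${max_label}.
labels = [-3.0, -1.0, 1.0]
lo, hi = min(labels), max(labels)
scaled = [min_max_rescale(v, lo, hi) for v in labels]
```

Here `scaled` is `[0.0, 0.5, 1.0]`. The bounds have to be computed over the training data beforehand, which is why the view takes them as query variables.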
How to use Hivemall
[Diagram repeated: Machine Learning workflow. Current step: Training]
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- Reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature -- Shuffle map outputs to reducers by feature
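The GROUP BY feature / avg(weight) step amounts to averaging the per-feature weights that independent mappers emit. A minimal Python sketch of that averaging follows. The function name `average_models`, the feature names, and the mapper outputs are hypothetical; the learning step that `logress` performs inside each mapper is not modelled.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average weights per feature across independently trained partial models.

    partial_models: iterable of lists of (feature, weight) pairs, one list per
    mapper. A feature is averaged only over the mappers that emitted it,
    which is what GROUP BY feature with avg(weight) computes.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical (feature, weight) rows emitted by two mappers.
mapper_outputs = [
    [("price", 0.5), ("size", -1.0)],
    [("price", 1.5), ("age", 0.25)],
]
model = average_models(mapper_outputs)
```

`model` ends up as `{"price": 1.0, "size": -1.0, "age": 0.25}`. Because each feature is reduced independently, the averaging parallelizes across reducers.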
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature
voted_avg: vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
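The voting idea can be sketched in a few lines of Python: count the positive and negative weights, then average only the side that wins the vote. This is an illustration based on the slide's description, not Hivemall's `voted_avg` source. How ties and zero weights are handled here is an assumption.

```python
def voted_avg_sketch(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: a tie goes to the negative side, and zeros do not vote.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The weights shown on the slide: four positive votes, one negative vote.
slide_weights = [0.7, 0.3, 0.2, -0.1, 0.7]
voted = voted_avg_sketch(slide_weights)
plain = sum(slide_weights) / len(slide_weights)
```

For the slide's weights the positive side wins, so `voted` is about 0.475 (the mean of the four positive weights), while the plain mean `plain` is about 0.36 because the outvoted negative weight drags it down.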
How to use Hivemall
[Diagram repeated: Machine Learning workflow. Current step: Prediction]
How to use Hivemall - Prediction
CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
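The join-then-aggregate logic of this query can be mirrored in plain Python to see what each row contributes. The sketch is illustrative only: the function names, feature names, and weights are hypothetical, and a Python dict stands in for the model table even though the SQL version never has to hold the model in memory.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_sketch(test_rows, model):
    """test_rows: iterable of (rowid, feature); model: dict feature -> weight.

    Mirrors the LEFT OUTER JOIN followed by GROUP BY rowid: a feature with
    no match in the model contributes nothing to the sum.
    """
    totals = {}
    for rowid, feature in test_rows:
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


# Hypothetical model and exploded test rows (one row per feature occurrence).
toy_model = {"a": 2.0, "b": -2.0}
toy_rows = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict_sketch(toy_rows, toy_model)
```

Row 1 sums to 0.0 and gets probability 0.5. Row 2 sums to 2.0 and gets roughly 0.88. One difference from SQL: if every feature of a row is missing from the model, sum() yields NULL in SQL, whereas this sketch returns 0.5.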
Real-time prediction
[Diagram: Batch Training on Hadoop (Feature Vector + Label → Prediction Model) → Export prediction models → Online Prediction on RDBMS (Feature Vector + Prediction Model → Label)]
bit.ly/hivemall-rtp
Export Prediction Model to an RDBMS
• Any RDBMS
• TD export: periodical export is very easy in Treasure Data
Prediction Model
feature  weight
103      -0.4896543622016907
104      -0.0955817922949791
105      0.12560302019119263
106      0.09214721620082855
Real-time Prediction on MySQL
[Diagram: Feature Vector + Prediction Model → Label]
SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
Index lookups are very efficient in RDBMSs!
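The same join can be tried locally without a MySQL server. The sketch below swaps in SQLite through Python's standard `sqlite3` module purely for convenience. The model rows are the four weights shown on the export slide, while the test rows, the column types, and the primary key on `feature` are assumptions. The sigmoid is applied in Python because a `sigmoid` SQL function is not assumed to exist in SQLite.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key gives the join an index to look up each feature.
conn.execute(
    "CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [(103, -0.4896543622016907), (104, -0.0955817922949791),
     (105, 0.12560302019119263), (106, 0.09214721620082855)])

# Hypothetical feature vector of one request, one row per feature.
conn.execute("CREATE TABLE testing_exploded (feature INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)])  # 999 is absent from the model

(score,) = conn.execute("""
    SELECT sum(t.value * m.weight)
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()

prob = 1.0 / (1.0 + math.exp(-score))
conn.close()
```

Feature 999 joins to NULL and is skipped by sum(), so `score` is -0.4896543622016907 + 2 × 0.12560302019119263, and `prob` comes out a little above 0.44. Each prediction touches only the model rows for the features present in the request.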
Agenda
1. What is Hivemall (introduction)
2. How to use Hivemall
3. Roadmap and coming new features
Roadmap
• IP clearance and project/repository site setup
• Create contribution guidelines
• Move repository from GitHub to ASF
• Add more tests and documentation
• Initial Apache Release will be Dec or Jan
Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
Change-point detection by Singular Spectrum Transformation
T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Fewer hyper-parameters than ChangeFinder 🙂
Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Example: map ages into 3 bins
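Quantile-based binning can be sketched in plain Python as follows. This is not the planned Hivemall feature: the helper names `quantile_edges` and `to_bin`, the way the cut points are picked, and the sample ages are assumptions made for the example.

```python
from bisect import bisect_right


def quantile_edges(values, num_bins):
    """Pick num_bins - 1 cut points so each bin gets roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def to_bin(value, edges):
    """Return the 0-based bin index; a value equal to an edge opens the next bin."""
    return bisect_right(edges, value)


# Hypothetical ages mapped into 3 bins.
ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]
edges = quantile_edges(ages, 3)
bins = [to_bin(age, edges) for age in ages]
```

For these nine ages the cut points are 31 and 52, so `bins` is `[0, 0, 0, 1, 1, 1, 2, 2, 2]`. The edges are learned once from training data and then reused to bin unseen values.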
Feature Transformation – One-hot encoding
Maps a categorical variable to a unique number starting from 1
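A small Python sketch of this mapping follows. The function name `build_category_index`, the sample categories, and the rule that numbers are assigned in order of first appearance are assumptions for illustration. The final step expands each number into a 0/1 indicator vector to show how the index relates to a one-hot representation.

```python
def build_category_index(categories):
    """Assign each distinct category a unique number, starting from 1."""
    index = {}
    for category in categories:
        if category not in index:
            index[category] = len(index) + 1  # order of first appearance
    return index


# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
index = build_category_index(colors)
encoded = [index[c] for c in colors]

# Expanding each number into an indicator vector gives the one-hot form.
width = len(index)
one_hot = [[1 if index[c] == i else 0 for i in range(1, width + 1)]
           for c in colors]
```

Here `index` is `{"red": 1, "green": 2, "blue": 3}` and `encoded` is `[1, 2, 1, 3]`. In a sparse feature representation only the assigned number needs to be stored, since it identifies the single non-zero position.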
Other new features to come
✓ Spark 2.0 DataFrame support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization
Conclusion and Takeaway
Hivemall is a machine learning library that is…
Multi/Cross-platform, Versatile, Scalable, Ease-of-use
hivemall.incubator.apache.org
We welcome your contributions to Apache Hivemall 🙂
Hivemall’s Positioning
• For Data Engineers who need ML
• Deep Learning is out of scope
• Recommendation is high-priority for us