Production Grade Data Science for Hadoop
Villu Ruusmann
Openscoring OÜ
About openscoring.io
"Standards-based, Open-source Middleware for Predictive Analytics Applications"
aka
"Rapid deployment of R and Scikit-Learn models on JVM"
2/25
"From Laboratory to Factory"
3/25
From grams (to kilograms) to megagrams
Scaling the vessel / Hardware
Scaling the chemical reaction / Software and processes
4/25
Scalability through re-engineering
5/25
Broader objectives
● Platform
  ○ Portability of applications
● Application
  ○ Central governance and dissemination of models
● Model
  ○ "Decisioning as a Service"
● Decision
  ○ Traceability, reproducibility, explainability
6/25
R and Scikit-Learn
PMML, PFA
Java and C
Domain-Specific Languages
General-Purpose Languages
7/25
The PMML connection
8/25
X model producers
Y model consumers
(ten years into past & future)
1 model API
9/25
Model API pipeline
Ephemeral: Conversion → Deployment
Persistent, asset-like: Conversion → Maintenance → Deployment
10/25
Conversion into PMML
Capturing and expressing the essentials of the modeling workflow using PMML vocabulary:
Input → Feature vector → Response vector → Output
Connecting stable data schemas
11/25
12/25
13/25
Standardized representation
R: ada, cforest, gbm, randomForest, xgb.Booster
Scikit-Learn: AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
<MiningModel function="classification">
  <Segmentation multipleModelMethod="weightedAverage">
    <Segment id="1" weight="1">
      <True/>
      <TreeModel>
        <Node> ... </Node>
      </TreeModel>
    </Segment>
    ...
  </Segmentation>
</MiningModel>
14/25
Supra-standardized representation
Model model = MiningModelUtil.createClassifierEnsemble(
    MultipleModelMethodType.WEIGHTED_AVERAGE,
    Arrays.asList(
        PMMLUtil.loadModel("xgboost.pmml"),
        PMMLUtil.loadModel("keras-mlp.pmml"),
        PMMLUtil.loadModel("sklearn-rf.pmml")
    ),
    Arrays.asList(5d/9d, 2d/9d, 2d/9d));
PMMLUtil.storeModel(model, "kaggle-submission.pmml");
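To make the weighted-average method concrete, here is a minimal plain-Java sketch of the arithmetic such an ensemble performs at scoring time. It is not JPMML code: the class name WeightedAverageSketch, the combine method and the example probabilities are all hypothetical, and it assumes every member model reports the same class labels.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of weighted-average voting; not part of any JPMML library.
final class WeightedAverageSketch {

    // Combines per-model class probabilities into a single distribution.
    // Assumes one positive weight per model and identical class labels across models.
    static Map<String, Double> combine(List<Map<String, Double>> predictions, List<Double> weights) {
        if (predictions.size() != weights.size()) {
            throw new IllegalArgumentException("Expected one weight per model");
        }
        double weightSum = 0d;
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < predictions.size(); i++) {
            double weight = weights.get(i);
            weightSum += weight;
            for (Map.Entry<String, Double> entry : predictions.get(i).entrySet()) {
                // Accumulate weight * probability for each class label
                result.merge(entry.getKey(), weight * entry.getValue(), Double::sum);
            }
        }
        if (weightSum <= 0d) {
            throw new IllegalArgumentException("Weights must sum to a positive value");
        }
        // Normalize, so the weights need not sum to exactly one
        for (Map.Entry<String, Double> entry : result.entrySet()) {
            entry.setValue(entry.getValue() / weightSum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up probabilities from three member models, weighted 5/9, 2/9, 2/9
        List<Map<String, Double>> predictions = List.of(
            Map.of("yes", 0.9, "no", 0.1),
            Map.of("yes", 0.6, "no", 0.4),
            Map.of("yes", 0.3, "no", 0.7));
        List<Double> weights = List.of(5d / 9d, 2d / 9d, 2d / 9d);
        System.out.println(combine(predictions, weights));
    }
}
```

With the weights from the slide, the first model contributes more than the other two combined, so its vote dominates unless both of the others disagree strongly.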
15/25
State machine for model maintenance
[State diagram; labels: Raw, Standardized, Enhanced, Optimized]
16/25
Model as a function
Target = f(Active_1, Active_2, ..., Active_n)
Output_feature = f_feature(Target)
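The two formulas can be read as plain function composition. The sketch below illustrates that reading in Java, using java.util.function.Function. The logistic coefficients, the field names x1 and x2, and the "approve"/"decline" output are invented for illustration and do not come from any real model.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of "model as a function"; all numbers are made up.
final class ModelAsFunctionSketch {

    // Target = f(Active_1, ..., Active_n): a toy logistic model over two active fields
    static final Function<Map<String, Double>, Double> TARGET = active -> {
        double z = -1.0 + 0.8 * active.get("x1") - 0.5 * active.get("x2");
        return 1.0 / (1.0 + Math.exp(-z));
    };

    // Output_feature = f_feature(Target): derives a decision label from the target value
    static final Function<Double, String> DECISION = p -> (p >= 0.5 ? "approve" : "decline");

    // The complete model is the composition of the two functions
    static String score(Map<String, Double> active) {
        return TARGET.andThen(DECISION).apply(active);
    }

    public static void main(String[] args) {
        System.out.println(score(Map.of("x1", 5.0, "x2", 2.0)));
        System.out.println(score(Map.of("x1", 0.0, "x2", 0.0)));
    }
}
```

Treating outputs as functions of the target keeps post-processing (thresholds, labels, explanations) inside the model rather than in application code.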
17/25
Model metadata API
Evaluator evaluator = getEvaluator();
List<FieldName> activeFields = evaluator.getActiveFields();
for(FieldName activeField : activeFields){
    DataField dataField = evaluator.getDataField(activeField);
    // Inspect data type, operational type, value space etc.
}
18/25
Model evaluation API
Evaluator evaluator = getEvaluator();
while(!done){
    Map<FieldName, ?> arguments = readInRecord();
    Map<FieldName, ?> results = evaluator.evaluate(arguments);
    writeOutRecord(results);
}
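The metadata and evaluation calls fit together as one scoring loop: ask the model which fields it needs, pass on exactly those, collect the results. The sketch below shows this with plain JDK types so that it runs without any library. SketchEvaluator is a hypothetical, much simplified stand-in and not the JPMML Evaluator interface. In-memory lists replace readInRecord and writeOutRecord, and the toy "risk" rule is invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a metadata-driven scoring loop; not JPMML code.
final class EvaluationLoopSketch {

    // Simplified stand-in for an evaluator: field metadata plus an evaluate method
    interface SketchEvaluator {
        List<String> getActiveFields();
        Map<String, Object> evaluate(Map<String, Object> arguments);
    }

    static List<Map<String, Object>> scoreAll(SketchEvaluator evaluator, List<Map<String, Object>> records) {
        List<Map<String, Object>> results = new ArrayList<>();
        for (Map<String, Object> record : records) {
            // Pass on only the fields that the model declares as active
            Map<String, Object> arguments = new LinkedHashMap<>();
            for (String activeField : evaluator.getActiveFields()) {
                if (!record.containsKey(activeField)) {
                    throw new IllegalArgumentException("Missing active field: " + activeField);
                }
                arguments.put(activeField, record.get(activeField));
            }
            results.add(evaluator.evaluate(arguments));
        }
        return results;
    }

    // Toy evaluator with an invented rule, standing in for a deployed model
    static SketchEvaluator toyEvaluator() {
        return new SketchEvaluator() {
            @Override
            public List<String> getActiveFields() {
                return List.of("age", "income");
            }

            @Override
            public Map<String, Object> evaluate(Map<String, Object> arguments) {
                double age = ((Number) arguments.get("age")).doubleValue();
                double income = ((Number) arguments.get("income")).doubleValue();
                return Map.<String, Object>of("risk", (age < 25 || income < 20000) ? "high" : "low");
            }
        };
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = new ArrayList<>();
        records.add(Map.<String, Object>of("age", 22, "income", 50000, "id", "a"));
        records.add(Map.<String, Object>of("age", 40, "income", 60000, "id", "b"));
        System.out.println(scoreAll(toyEvaluator(), records));
    }
}
```

The calling code never hard-codes field names. Swapping in a different model changes the answer of getActiveFields, not the loop.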
19/25
http://github.com/jpmml/jpmml-${framework}
Volume
Velocity
{ REST }
20/25
JPMML-Cascading
Evaluator evaluator = getEvaluator();
PMMLPlanner pmmlPlanner = new PMMLPlanner(evaluator);
pmmlPlanner.setHeadName("input");
pmmlPlanner.setTailName("output");
FlowDef flowDef = ...;
flowDef.addAssemblyPlanner(pmmlPlanner);
21/25
JPMML-Pig
grunt> REGISTER jpmml-pig-distributable-1.0.jar;
grunt> DEFINE my_udf org.jpmml.pig.PMMLFunc('model.pmml');
grunt> output = FOREACH input GENERATE my_udf(*);
22/25
JPMML-Spark
Evaluator evaluator = getEvaluator();
PMMLFunction pmmlFunction = new PMMLFunction(evaluator);
JavaRDD<Row> input = ...;
JavaRDD<Row> output = input.map(pmmlFunction);
23/25
Thoughts on API-driven architecture
● Based on a relevant standard
  ○ "Conventions over configuration"
● High(est) abstraction level
  ○ Productivity
  ○ Maintainability
● End-to-end value proposition
  ○ Separation of concerns
24/25