Production Grade Data Science for Hadoop Villu Ruusmann Openscoring OÜ

Upload: dataworks-summithadoop-summit

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Production Grade Data Science for Hadoop

Production Grade Data Science for Hadoop

Villu Ruusmann
Openscoring OÜ

Page 2: Production Grade Data Science for Hadoop

About openscoring.io

"Standards-based, Open-source Middleware for Predictive Analytics Applications"

aka

"Rapid deployment of R and Scikit-Learn models on JVM"

Page 3: Production Grade Data Science for Hadoop

"From Laboratory to Factory"

Page 4: Production Grade Data Science for Hadoop

From grams (to kilograms) to megagrams

Scaling the vessel / Hardware
Scaling the chemical reaction / Software and processes

Page 5: Production Grade Data Science for Hadoop

Scalability through re-engineering

Page 6: Production Grade Data Science for Hadoop

Broader objectives

● Platform
  ○ Portability of applications

● Application
  ○ Central governance and dissemination of models

● Model
  ○ "Decisioning as a Service"

● Decision
  ○ Traceability, reproducibility, explainability

Page 7: Production Grade Data Science for Hadoop

R and Scikit-Learn

PMML and PFA

Java and C

Domain-Specific Languages

General-Purpose Languages

Page 8: Production Grade Data Science for Hadoop

The PMML connection

Page 9: Production Grade Data Science for Hadoop

X model producers
Y model consumers
(ten years into past & future)

1 model API

Page 10: Production Grade Data Science for Hadoop

Model API pipeline

Ephemeral: Conversion → Deployment

Persistent, asset-like: Conversion → Maintenance → Deployment

Page 11: Production Grade Data Science for Hadoop

Conversion into PMML

Capturing and expressing the essentials of the modeling workflow using PMML vocabulary:

Input → Feature vector → Response vector → Output

Connecting stable data schemas

Page 12: Production Grade Data Science for Hadoop

Page 13: Production Grade Data Science for Hadoop

Page 14: Production Grade Data Science for Hadoop

Standardized representation

R: ada, cforest, gbm, randomForest, xgb.Booster

Scikit-Learn: AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier

<MiningModel function="classification">
  <Segmentation multipleModelMethod="weightedAverage">
    <Segment id="1" weight="1">
      <True/>
      <TreeModel>
        <Node> ... </Node>
      </TreeModel>
    </Segment>
    ...
  </Segmentation>
</MiningModel>
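Because every ensemble above lands in the same MiningModel/Segmentation shape, a consumer can inspect it with nothing more than an XML parser. The following is a minimal sketch using only the JDK; the class name, the two-segment stub document, and the helper methods are hypothetical, and the fragment omits the PMML namespace and header that a real file would carry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SegmentationInspector {

    // Hand-written stub shaped like the slide's example; not a complete PMML document.
    static final String XML =
        "<MiningModel function=\"classification\">"
      + "<Segmentation multipleModelMethod=\"weightedAverage\">"
      + "<Segment id=\"1\" weight=\"1\"><True/><TreeModel><Node/></TreeModel></Segment>"
      + "<Segment id=\"2\" weight=\"1\"><True/><TreeModel><Node/></TreeModel></Segment>"
      + "</Segmentation>"
      + "</MiningModel>";

    // Parses an XML string into a DOM tree, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    // Reads how member models are combined, e.g. "weightedAverage".
    static String combinationMethod(Document doc) {
        Element segmentation = (Element) doc.getElementsByTagName("Segmentation").item(0);
        return segmentation.getAttribute("multipleModelMethod");
    }

    // Counts the member models (one Segment element per member).
    static int segmentCount(Document doc) {
        return doc.getElementsByTagName("Segment").getLength();
    }

    // Sums the "weight" attributes of all segments.
    static double totalWeight(Document doc) {
        NodeList segments = doc.getElementsByTagName("Segment");
        double total = 0d;
        for (int i = 0; i < segments.getLength(); i++) {
            total += Double.parseDouble(((Element) segments.item(i)).getAttribute("weight"));
        }
        return total;
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(combinationMethod(doc) + ", " + segmentCount(doc)
            + " segments, total weight " + totalWeight(doc));
    }
}
```

The same three questions (how are members combined, how many are there, how are they weighted) have the same answers regardless of which R or Scikit-Learn class produced the model.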

Page 15: Production Grade Data Science for Hadoop

Supra-standardized representation

Model model = MiningModelUtil.createClassifierEnsemble(
    MultipleModelMethodType.WEIGHTED_AVERAGE,
    Arrays.asList(
        PMMLUtil.loadModel("xgboost.pmml"),
        PMMLUtil.loadModel("keras-mlp.pmml"),
        PMMLUtil.loadModel("sklearn-rf.pmml")
    ),
    Arrays.asList(5d/9d, 2d/9d, 2d/9d));
PMMLUtil.storeModel(model, "kaggle-submission.pmml");
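To make the arithmetic behind the weighted-average ensemble concrete, here is a minimal plain-Java sketch. It assumes the usual reading of a weighted average over class probabilities, sum(w_i * p_i) / sum(w_i); the class name and the three probability vectors are made up for illustration, and only the 5/9, 2/9, 2/9 weights come from the slide.

```java
import java.util.Arrays;
import java.util.List;

class WeightedAverageEnsemble {

    // Combines per-model class probabilities as sum(w_i * p_i) / sum(w_i).
    static double[] combine(List<double[]> predictions, List<Double> weights) {
        if (predictions.isEmpty() || predictions.size() != weights.size()) {
            throw new IllegalArgumentException("Need one weight per member model");
        }
        double[] combined = new double[predictions.get(0).length];
        double weightSum = 0d;
        for (int i = 0; i < predictions.size(); i++) {
            double weight = weights.get(i);
            double[] probabilities = predictions.get(i);
            weightSum += weight;
            for (int c = 0; c < combined.length; c++) {
                combined[c] += weight * probabilities[c];
            }
        }
        // Normalize so the result is a probability distribution again.
        for (int c = 0; c < combined.length; c++) {
            combined[c] /= weightSum;
        }
        return combined;
    }

    public static void main(String[] args) {
        // Made-up two-class probabilities from three hypothetical member models.
        List<double[]> predictions = Arrays.asList(
            new double[] {0.9, 0.1},
            new double[] {0.4, 0.6},
            new double[] {0.3, 0.7});
        List<Double> weights = Arrays.asList(5d / 9d, 2d / 9d, 2d / 9d);
        System.out.println(Arrays.toString(combine(predictions, weights)));
    }
}
```

With these inputs the first model dominates: it alone votes for the first class, yet its 5/9 weight outweighs the other two members combined.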

Page 16: Production Grade Data Science for Hadoop

State machine for model maintenance

[State diagram; states: Raw, Standardized, Enhanced, Optimized]

Page 17: Production Grade Data Science for Hadoop

Model as a function

Target = f(Active_1, Active_2, ..., Active_n)

Output_feature = f_feature(Target)
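The two formulas can be sketched with java.util.function types: one function maps the active fields to the target, and each output field is a further function of the target alone. Everything here is hypothetical (the class name, the field names x1 and x2, the toy linear rule, and the two outputs); it only illustrates the shape of the contract.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class ModelAsFunction {

    // Target = f(Active_1, ..., Active_n): a toy rule standing in for a real model.
    static final Function<Map<String, Double>, Double> TARGET =
        active -> 0.5 * active.get("x1") + 0.25 * active.get("x2");

    // Output_feature = f_feature(Target): every output is derived from the target only.
    static final Map<String, Function<Double, Object>> OUTPUTS = new LinkedHashMap<>();
    static {
        OUTPUTS.put("doubled", target -> target * 2);
        OUTPUTS.put("isHigh", target -> target > 1.0);
    }

    // Evaluates the target first, then each output feature in declaration order.
    static Map<String, Object> evaluate(Map<String, Double> active) {
        Double target = TARGET.apply(active);
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("target", target);
        OUTPUTS.forEach((name, feature) -> results.put(name, feature.apply(target)));
        return results;
    }

    public static void main(String[] args) {
        Map<String, Double> active = new LinkedHashMap<>();
        active.put("x1", 2.0);
        active.put("x2", 4.0);
        System.out.println(evaluate(active));
    }
}
```

Keeping the outputs as functions of the target, rather than of the raw inputs, means they can be added or changed without touching the model itself.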

Page 18: Production Grade Data Science for Hadoop

Model metadata API

Evaluator evaluator = getEvaluator();

List<FieldName> activeFields = evaluator.getActiveFields();
for(FieldName activeField : activeFields){
    DataField dataField = evaluator.getDataField(activeField);
    // Inspect data type, operational type, value space etc.
}
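One thing field metadata makes possible is checking an input record before it reaches the model. The sketch below is plain Java and does not use the JPMML classes from the slide: FieldSpec is a hypothetical stand-in that carries only a name, a data type, and an optional set of valid values, and operational type is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FieldMetadataCheck {

    // Hypothetical stand-in for the per-field metadata a model exposes.
    static final class FieldSpec {
        final String name;
        final Class<?> dataType;
        final Set<Object> validValues; // empty set means "any value of the data type"

        FieldSpec(String name, Class<?> dataType, Object... validValues) {
            this.name = name;
            this.dataType = dataType;
            this.validValues = new HashSet<>(Arrays.asList(validValues));
        }
    }

    // Returns one message per field that is missing, mistyped, or outside its value space.
    static List<String> validate(List<FieldSpec> activeFields, Map<String, ?> row) {
        List<String> problems = new ArrayList<>();
        for (FieldSpec field : activeFields) {
            Object value = row.get(field.name);
            if (value == null) {
                problems.add(field.name + ": missing");
            } else if (!field.dataType.isInstance(value)) {
                problems.add(field.name + ": expected " + field.dataType.getSimpleName());
            } else if (!field.validValues.isEmpty() && !field.validValues.contains(value)) {
                problems.add(field.name + ": value outside declared value space");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<FieldSpec> fields = Arrays.asList(
            new FieldSpec("age", Double.class),
            new FieldSpec("gender", String.class, "M", "F"));
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("age", 42.0);
        row.put("gender", "X");
        System.out.println(validate(fields, row));
    }
}
```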

Page 19: Production Grade Data Science for Hadoop

Model evaluation API

Evaluator evaluator = getEvaluator();

while(!done){
    Map<FieldName, ?> arguments = readInRecord();
    Map<FieldName, ?> results = evaluator.evaluate(arguments);
    writeOutRecord(results);
}

Page 20: Production Grade Data Science for Hadoop

http://github.com/jpmml/jpmml-${framework}

Volume

Velocity

{ REST }

Page 21: Production Grade Data Science for Hadoop

JPMML-Cascading

Evaluator evaluator = getEvaluator();

PMMLPlanner pmmlPlanner = new PMMLPlanner(evaluator);
pmmlPlanner.setHeadName("input");
pmmlPlanner.setTailName("output");

FlowDef flowDef = ...;
flowDef.addAssemblyPlanner(pmmlPlanner);

Page 22: Production Grade Data Science for Hadoop

JPMML-Pig

grunt> REGISTER jpmml-pig-distributable-1.0.jar;

grunt> DEFINE my_udf org.jpmml.pig.PMMLFunc('model.pmml');

grunt> output = FOREACH input GENERATE my_udf(*);

Page 23: Production Grade Data Science for Hadoop

JPMML-Spark

Evaluator evaluator = getEvaluator();

PMMLFunction pmmlFunction = new PMMLFunction(evaluator);

JavaRDD<Row> input = ...;
JavaRDD<Row> output = input.map(pmmlFunction);

Page 24: Production Grade Data Science for Hadoop

Thoughts on API-driven architecture

● Based on a relevant standard
  ○ "Conventions over configuration"

● High(est) abstraction level
  ○ Productivity
  ○ Maintainability

● End-to-end value proposition
  ○ Separation of concerns

Page 25: Production Grade Data Science for Hadoop

Q&A

http://openscoring.io
http://github.com/jpmml
