Post on 16-Apr-2017

Production Grade Data Science for Hadoop

Villu Ruusmann

Openscoring OÜ

About openscoring.io

"Standards-based, Open-source Middleware for Predictive Analytics Applications"

aka

"Rapid deployment of R and Scikit-Learn models on JVM"

2/25

"From Laboratory to Factory"

3/25

From grams (to kilograms) to megagrams

Scaling the vessel / Hardware

Scaling the chemical reaction / Software and processes

4/25

Scalability through re-engineering

5/25

Broader objectives

● Platform
  ○ Portability of applications

● Application
  ○ Central governance and dissemination of models

● Model
  ○ "Decisioning as a Service"

● Decision
  ○ Traceability, reproducibility, explainability

6/25

R and Scikit-Learn

PMML and PFA

Java and C

Domain-Specific Languages

General-Purpose Languages

7/25

The PMML connection

8/25

X model producers

Y model consumers

(ten years into past & future)

1 model API

9/25

Model API pipeline

Ephemeral: Conversion → Deployment

Persistent, asset-like: Conversion → Maintenance → Deployment

10/25

Conversion into PMML

Capturing and expressing the essentials of the modeling workflow using PMML vocabulary:

Input → Feature vector → Response vector → Output
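The four stages above can be sketched in plain Java as three small steps chained into one call. This is only an illustration of the data flow: the class name, field names ("age", "member"), the encoding and the coefficients are all made up for the example and do not come from PMML or JPMML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of Input -> Feature vector -> Response vector -> Output.
class WorkflowSketch {

    // Input -> feature vector: encode a raw record as numbers (assumed encoding).
    static double[] toFeatures(Map<String, Object> input) {
        double age = ((Number) input.get("age")).doubleValue();
        double isMember = "yes".equals(input.get("member")) ? 1d : 0d;
        return new double[] {age / 100d, isMember};
    }

    // Feature vector -> response vector: a toy logistic model with made-up coefficients.
    static double[] toResponse(double[] features) {
        double z = -1d + 2d * features[0] + 0.5d * features[1];
        double p = 1d / (1d + Math.exp(-z));
        return new double[] {1d - p, p};
    }

    // Response vector -> output: decode numbers back into named, user-facing fields.
    static Map<String, Object> toOutput(double[] response) {
        Map<String, Object> output = new LinkedHashMap<>();
        output.put("probability(no)", response[0]);
        output.put("probability(yes)", response[1]);
        output.put("label", response[1] >= 0.5d ? "yes" : "no");
        return output;
    }

    // The whole workflow, seen from outside, maps an input record to an output record.
    static Map<String, Object> evaluate(Map<String, Object> input) {
        return toOutput(toResponse(toFeatures(input)));
    }
}
```

Only the first and last stage touch named fields, which is why the input and output schemas can stay stable while the middle stages change.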

Connecting stable data schemas

11/25

12/25

13/25

Standardized representation

R: ada, cforest, gbm, randomForest, xgb.Booster

Scikit-Learn: AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier

<MiningModel function="classification">
  <Segmentation multipleModelMethod="weightedAverage">
    <Segment id="1" weight="1">
      <True/>
      <TreeModel>
        <Node>
          ...
        </Node>
      </TreeModel>
    </Segment>
    ...
  </Segmentation>
</MiningModel>
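Because every ensemble ends up in the same element structure, a consumer can inspect it without knowing which R or Scikit-Learn class produced it. The sketch below does this with the JDK's built-in DOM parser only; the class and method names are hypothetical, and it assumes a namespace-free fragment shaped like the one above (a real PMML document has a namespace and a `PMML` root element, and nested segmentations would need more careful traversal).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: summarizes a MiningModel fragment using only JDK classes.
class EnsembleInspector {

    // Returns "<function>/<multipleModelMethod>/<number of Segment elements>".
    static String summarize(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element miningModel = document.getDocumentElement();
            Element segmentation =
                (Element) miningModel.getElementsByTagName("Segmentation").item(0);
            // Counts all descendant Segment elements (simplification: no nested ensembles).
            int segments = segmentation.getElementsByTagName("Segment").getLength();
            return miningModel.getAttribute("function")
                + "/" + segmentation.getAttribute("multipleModelMethod")
                + "/" + segments;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable MiningModel fragment", e);
        }
    }
}
```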

14/25

Supra-standardized representation

Model model = MiningModelUtil.createClassifierEnsemble(
    MultipleModelMethodType.WEIGHTED_AVERAGE,
    Arrays.asList(
        PMMLUtil.loadModel("xgboost.pmml"),
        PMMLUtil.loadModel("keras-mlp.pmml"),
        PMMLUtil.loadModel("sklearn-rf.pmml")
    ),
    Arrays.asList(5d/9d, 2d/9d, 2d/9d));
PMMLUtil.storeModel(model, "kaggle-submission.pmml");
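To make the arithmetic behind the weights 5/9, 2/9, 2/9 concrete, here is a stand-alone sketch of a weighted average over class probabilities. It assumes the ensemble follows the usual weighted-average rule (sum of weight times probability per class, divided by the sum of weights); the class and method names are hypothetical and it does not call any JPMML code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of weighted-average voting over class probabilities.
class WeightedAverageSketch {

    // memberProbabilities.get(i) maps class label -> probability predicted by member i.
    static Map<String, Double> combine(List<Map<String, Double>> memberProbabilities,
                                       List<Double> weights) {
        if (memberProbabilities.size() != weights.size()) {
            throw new IllegalArgumentException("One weight per member model is required");
        }
        double weightSum = 0d;
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < weights.size(); i++) {
            double weight = weights.get(i);
            weightSum += weight;
            // Accumulate weight * probability for every class label.
            for (Map.Entry<String, Double> entry : memberProbabilities.get(i).entrySet()) {
                result.merge(entry.getKey(), weight * entry.getValue(), Double::sum);
            }
        }
        // Normalize by the total weight so the result is an average, not a sum.
        final double total = weightSum;
        result.replaceAll((label, value) -> value / total);
        return result;
    }
}
```

For example, if three members predict "yes" with probabilities 0.9, 0.6 and 0.3, these weights give (5 × 0.9 + 2 × 0.6 + 2 × 0.3) / 9 = 0.7.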

15/25

State machine for model maintenance

[Diagram: model states Raw, Standardized, Enhanced, Optimized]

16/25

Model as a function

Target = f(Active_1, Active_2, ..., Active_n)

Output_feature = f_feature(Target)
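The two formulas can be read as one function that computes the target from the active fields, plus a set of small functions that each derive one output field from that target. A minimal Java sketch of this reading follows; the field names ("x1", "x2"), the coefficients and the output names are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of "model as a function".
class ModelAsFunction {

    // Target = f(Active_1, ..., Active_n): a toy linear rule over two active fields.
    static final Function<Map<String, Double>, Double> TARGET =
        active -> 3d + 2d * active.get("x1") - 0.5d * active.get("x2");

    // Output_feature = f_feature(Target): each output field depends on the target only.
    static final Map<String, Function<Double, Object>> OUTPUTS = new LinkedHashMap<>();
    static {
        OUTPUTS.put("predicted", target -> target);
        OUTPUTS.put("rounded", target -> Math.round(target));
        OUTPUTS.put("isHigh", target -> target > 5d);
    }

    // Evaluates the target once, then derives every output field from it.
    static Map<String, Object> evaluate(Map<String, Double> active) {
        Double target = TARGET.apply(active);
        Map<String, Object> results = new LinkedHashMap<>();
        OUTPUTS.forEach((name, feature) -> results.put(name, feature.apply(target)));
        return results;
    }
}
```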

17/25

Model metadata API

Evaluator evaluator = getEvaluator();

List<FieldName> activeFields = evaluator.getActiveFields();
for(FieldName activeField : activeFields){
    DataField dataField = evaluator.getDataField(activeField);
    // Inspect data type, operational type, value space etc.
}

18/25

Model evaluation API

Evaluator evaluator = getEvaluator();

while(!done){
    Map<FieldName, ?> arguments = readInRecord();
    Map<FieldName, ?> results = evaluator.evaluate(arguments);
    writeOutRecord(results);
}
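The loop above leaves `getEvaluator()`, `readInRecord()` and `writeOutRecord()` undefined. The sketch below shows the same read, evaluate, write shape in a form that runs with the JDK alone. `RecordScorer` is a stand-in interface invented for this example, not the JPMML `Evaluator`, and plain `String` keys replace `FieldName`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, library-free version of the record-at-a-time scoring loop.
class ScoringLoopSketch {

    // Stand-in for a model evaluator; NOT the JPMML Evaluator interface.
    interface RecordScorer {
        Map<String, Object> evaluate(Map<String, Object> arguments);
    }

    // Reads records until the source is exhausted, scores each one, collects the results.
    static List<Map<String, Object>> scoreAll(Iterator<Map<String, Object>> records,
                                              RecordScorer scorer) {
        List<Map<String, Object>> results = new ArrayList<>();
        while (records.hasNext()) {
            Map<String, Object> arguments = records.next();  // "readInRecord"
            results.add(scorer.evaluate(arguments));         // "evaluate" + "writeOutRecord"
        }
        return results;
    }
}
```

The loop never looks inside the model, so the same code can drive any scorer that honors the map-in, map-out contract.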

19/25

http://github.com/jpmml/jpmml-${framework}

Volume

Velocity

{ REST }

20/25

JPMML-Cascading

Evaluator evaluator = getEvaluator();

PMMLPlanner pmmlPlanner = new PMMLPlanner(evaluator);
pmmlPlanner.setHeadName("input");
pmmlPlanner.setTailName("output");

FlowDef flowDef = ...;
flowDef.addAssemblyPlanner(pmmlPlanner);

21/25

JPMML-Pig

grunt> REGISTER jpmml-pig-distributable-1.0.jar;

grunt> DEFINE my_udf org.jpmml.pig.PMMLFunc('model.pmml');

grunt> output = FOREACH input GENERATE my_udf(*);

22/25

JPMML-Spark

Evaluator evaluator = getEvaluator();

PMMLFunction pmmlFunction = new PMMLFunction(evaluator);

JavaRDD<Row> input = ...;
JavaRDD<Row> output = input.map(pmmlFunction);

23/25

Thoughts on API-driven architecture

● Based on a relevant standard
  ○ "Conventions over configuration"

● High(est) abstraction level
  ○ Productivity
  ○ Maintainability

● End-to-end value proposition
  ○ Separation of concerns

24/25

Q&A

villu@openscoring.io

http://openscoring.io

http://github.com/jpmml

25/25
