Automating Machine Learning Workflows: A Report from the Trenches - Jose A. Ortega Ruiz @ PAPIs...

Post on 08-Jan-2017

Automating Machine Learning Features and Workflows

jao@bigml.com

PAPIs Connect Valencia, 2016

Outline

Introduction: ML as a System Service

Feature Engineering Automation

Workflow Automation

Challenges and Outlook

Machine Learning as a System Service

The goal

Machine Learning as a system-level service

The means

- APIs: ML building blocks
- Abstraction layer over feature engineering
- Abstraction layer over algorithms
- Automation

Machine Learning Workflows

Dr. Natalia Konstantinova (http://nkonst.com/machine-learning-explained-simple-words/)

Machine Learning Workflows for real

Jeannine Takaki, Microsoft Azure Team

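Machine Learning Automation Today

from bigml.api import BigML

api = BigML()

project = api.create_project({'name': 'ToyBoost'})

# `source` is the path or URL of the training data; its definition
# is not shown on the slide.
orig_source = api.create_source(source,
                                {"name": "ToyBoost",
                                 "project": project['resource']})
api.ok(orig_source)  # wait until the source is ready

orig_dataset = api.create_dataset(orig_source, {"name": "Boost"})
api.ok(orig_dataset)

# `trainset` is assumed to already name the weighted training dataset;
# the step that derives it from orig_dataset is not shown on the slide.
trainset = api.get_dataset(trainset)

for loop in range(0, 10):
    api.ok(trainset)
    # Train a model on the current training set, using the weight field.
    model = api.create_model(trainset, {
        "name": "ToyBoost - Model%d" % loop,
        "objective_fields": ["letter"],
        "excluded_fields": ["weight"],
        "weight_field": "100011"})
    api.ok(model)
    # Score the training set with the new model.
    batchp = api.create_batch_prediction(model, trainset, {
        "name": "ToyBoost - Result%d" % loop,
        "all_fields": True,
        "header": True})
    api.ok(batchp)
    batchp = api.get_batch_prediction(batchp)
    # The batch prediction output becomes the next training set.
    batchp_dataset = api.get_dataset(batchp['object'])
    trainset = api.create_dataset(batchp_dataset, {})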

Machine Learning Automation Today

Problems of current solutions

Complexity: Lots of details outside the problem domain

Reuse: No inter-language compatibility

Scalability: Client-side workflows hard to optimize

Not enough abstraction

Machine Learning Automation Tomorrow

Solution: Domain-specific languages

Domain-specific Expressions (sexps)

(if (missing? "height")
    (random-value "height")
    (field "height"))

(window "income" 10)

(within-percentiles? "age" 0.5 0.95)

(cond (> (field "score") (mean "score")) "above average"
      (= (field "score") (mean "score")) "below average"
      "mediocre")

Domain-specific Expressions (JSON)

["if", ["missing?", "height"],
       ["random-value", "height"],
       ["field", "height"]]

["window", "income", 10]

["within-percentiles?", "age", 0.5, 0.95]

["cond", [">", ["field", "score"], ["mean", "score"]], "above average",
         ["=", ["field", "score"], ["mean", "score"]], "below average",
         "mediocre"]
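The two notations above carry the same structure, so one can be derived mechanically from the other. The following is a minimal sketch, in Python, of a reader that turns an s-expression into the nested-list JSON form. The names `sexp_to_list` and `parse_atom` are made up for this illustration and are not part of any BigML library; the sketch also assumes well-formed input with balanced parentheses.

```python
import json
import re

# Tokens: parentheses, double-quoted strings, or bare atoms (symbols, numbers).
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_atom(token):
    """Turn one token into a Python value: string, number, or symbol name."""
    if token.startswith('"'):
        return json.loads(token)      # quoted string literal
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token                  # operator or function name, kept as a string


def sexp_to_list(text):
    """Parse a single s-expression into nested Python lists (hypothetical helper)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1                  # skip the closing parenthesis
            return items
        if token == ")":
            raise ValueError("unexpected ')'")
        return parse_atom(token)

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return result


expr = sexp_to_list('(if (missing? "height") (random-value "height") (field "height"))')
print(json.dumps(expr))
# prints ["if", ["missing?", "height"], ["random-value", "height"], ["field", "height"]]
```

As in the JSON form on the slide, operator names and string literals both end up as JSON strings; only their position in the list tells them apart.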

Abstraction via the Language

;; (if (missing? "height")
;;     (random-value "height")
;;     (field "height"))
(ensure-value "height")

(window "income" 10)

(within-percentiles? "age" 0.5 0.95)

;; (cond (> (field "score") (mean "score")) "above average"
;;       (= (field "score") (mean "score")) "below average"
;;       "mediocre")
(discretize "score" "above average" "below average" "mediocre")
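One way to read the slide above is that each shorthand form stands for the longer expression shown in the comments. The Python sketch below illustrates that reading as a rewrite over the JSON list form. It is an assumption about how such shorthands could be expanded, not a description of how Flatline implements them, and `expand` is a hypothetical name.

```python
def expand(expr):
    """Recursively rewrite shorthand forms into the longer forms they stand for."""
    if not isinstance(expr, list) or not expr:
        return expr                   # literals (strings, numbers) pass through
    op, *args = expr
    args = [expand(arg) for arg in args]
    if op == "ensure-value":
        (name,) = args
        return ["if", ["missing?", name],
                ["random-value", name],
                ["field", name]]
    if op == "discretize":
        # Labels are taken in the order they appear in the commented-out cond.
        name, first, second, fallback = args
        return ["cond",
                [">", ["field", name], ["mean", name]], first,
                ["=", ["field", name], ["mean", name]], second,
                fallback]
    return [op, *args]


print(expand(["ensure-value", "height"]))
```

Expressions that use no shorthand, such as `["window", "income", 10]`, come back unchanged.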

Abstraction via the User Interface

Remote for efficiency and reuse, local for discoverability

Flatline: A DSL for Feature Engineering

- Domain-specific: new fields from an input sliding window as declarative expressions
- Simple syntax: JSON → s-expressions
- Efficient: full server-side implementation
- Discoverable: in-browser client-side implementation
- Reusable: the same expressions usable from any language binding
- Bonus: applicable to filtering
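To make "new fields as declarative expressions" concrete, here is a small Python sketch of a local evaluator for a handful of the operators used on the earlier slides. It is a toy with simplified, assumed semantics (missing values are `None`, `mean` ignores them, only a few operators are covered), not the Flatline implementation; `evaluate` and `add_field` are hypothetical names.

```python
import operator
from statistics import mean

COMPARISONS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def evaluate(expr, row, rows):
    """Evaluate one JSON-style expression for `row`, given the whole dataset `rows`."""
    if not isinstance(expr, list):
        return expr                                   # literal string or number
    op, *args = expr
    if op == "field":
        return row.get(args[0])
    if op == "missing?":
        return row.get(args[0]) is None
    if op == "mean":
        # Dataset-level summary, skipping missing values (assumed semantics).
        return mean(r[args[0]] for r in rows if r.get(args[0]) is not None)
    if op == "if":
        test, then, otherwise = args
        branch = then if evaluate(test, row, rows) else otherwise
        return evaluate(branch, row, rows)
    if op in COMPARISONS:
        left, right = (evaluate(arg, row, rows) for arg in args)
        return COMPARISONS[op](left, right)
    if op == "cond":
        # Alternating test/value pairs followed by one default value.
        *pairs, default = args
        for test, value in zip(pairs[0::2], pairs[1::2]):
            if evaluate(test, row, rows):
                return evaluate(value, row, rows)
        return evaluate(default, row, rows)
    raise ValueError(f"unsupported operator: {op}")


def add_field(rows, name, expr):
    """Return new rows with one extra field computed from `expr`."""
    return [{**row, name: evaluate(expr, row, rows)} for row in rows]


rows = [{"height": 150}, {"height": None}, {"height": 170}]
filled = add_field(rows, "height_filled",
                   ["if", ["missing?", "height"],
                    ["mean", "height"],
                    ["field", "height"]])
```

Because the expression is plain data, the same list could be sent unchanged to a server-side implementation or evaluated in a browser, which is the reuse argument the slide makes.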

Machine Learning Workflows

A DSL for Machine Learning Workflows? Absolutely!

Machine Learning Workflows

Same problems, only worse...

Complexity: Hairy logic and control flow

Reuse: More complex algorithms and behaviour, very hard to port to other languages

Scalability: Lots of iterations and intermediate resources, very hard to make efficient on the client side

Machine Learning Workflows

WhizzML, same solution, only better...

WhizzML: A sexp-based, domain-specific language

(define apple
  "https://s3.amazonaws.com/bigml-public/csv/nasdaq_aapl.csv")

(define source (create-and-wait-source {"remote" apple
                                        "name" "whizz"}))

(define dataset (create-and-wait-dataset {"source" source}))

(define anomaly (create-and-wait-anomaly {"dataset" dataset}))

(define input {"Open" 275 "High" 300 "Low" 250})

(define score
  (create-and-wait-anomalyscore {"anomaly" anomaly
                                 "input_data" input}))

(get (fetch score) "score")
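Each `create-and-wait-*` call in the script above folds two client-side steps, creating a resource and waiting for it to finish, into one. The Python sketch below shows that pattern against an in-memory stand-in for the service. `FakeService` and `create_and_wait` are invented for this illustration and are not BigML APIs; a real client would also sleep between polls.

```python
import itertools


class FakeService:
    """In-memory stand-in for a remote ML service (hypothetical)."""

    def __init__(self, polls_until_ready=2):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls = polls_until_ready

    def create(self, kind, args):
        """Register a new resource and return its id immediately."""
        resource_id = f"{kind}/{next(self._ids)}"
        self._remaining[resource_id] = self._polls
        return resource_id

    def is_finished(self, resource_id):
        """Each poll brings the resource one step closer to being ready."""
        self._remaining[resource_id] -= 1
        return self._remaining[resource_id] <= 0


def create_and_wait(service, kind, args, max_polls=10):
    """Create a resource, then poll until it is finished or give up."""
    resource_id = service.create(kind, args)
    for _ in range(max_polls):
        if service.is_finished(resource_id):
            return resource_id
    raise TimeoutError(f"{resource_id} not finished after {max_polls} polls")


service = FakeService()
source = create_and_wait(service, "source", {"remote": "https://example.com/data.csv"})
dataset = create_and_wait(service, "dataset", {"source": source})
anomaly = create_and_wait(service, "anomaly", {"dataset": dataset})
```

With the waiting folded into the call, the workflow reads as a plain sequence of definitions, much like the WhizzML version.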

WhizzML vs Flatline (as languages)

A better language:

- Better data structures (dictionaries, sets...)
- Better control flow: (tail) recursion, iteration, loops
- Better abstraction: procedures

WhizzML: Lambda Abstraction

Abstraction

(define (score-stock name input)
  (let (base "https://s3.amazonaws.com/bigml-public/csv"
        stock (str base "/" name)
        source (create-and-wait-source {"remote" stock})
        dataset (create-and-wait-dataset {"source" source})
        anomaly (create-and-wait-anomaly {"dataset" dataset}))
    (create-and-wait-anomalyscore {"anomaly" anomaly
                                   "input_data" input})))

WhizzML: Reusable Procedures

Abstraction

(score-stock "aapl" {"Open" 275 "High" 300 "Low" 250})

WhizzML: Server-side fortes

A better server-side:

- Better reusability: scripts, executions and libraries as first-class ML resources
- Higher efficiency gains: automatic parallelism
- More opportunities for UI extensions
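The efficiency point rests on the observation that steps with no dependency on each other can run at the same time. The slides do not say how the server decides this, so the Python sketch below only illustrates the underlying idea by hand: several models trained on the same dataset are independent and can be submitted concurrently. `train_model` is a stand-in for a remote call and `train_all` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def train_model(dataset_id, seed):
    """Stand-in for a remote model-creation call (hypothetical)."""
    return f"model-for-{dataset_id}-seed-{seed}"


def train_all(dataset_id, seeds, max_workers=4):
    """Train one model per seed concurrently; results keep the order of `seeds`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda seed: train_model(dataset_id, seed), seeds))


models = train_all("dataset/1", [1, 2, 3])
```

On the client this parallelism has to be written out explicitly, as here; the slide's claim is that a server-side interpreter can apply it without the script author asking for it.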

WhizzML Source Code as a Machine Learning Resource

{"library": {
   "imports": ["12343addb343f2890f23492d"],
   "source_code": "(define (mu2) (mu (g 3 8)))",
   "exports": [{"name": "mu2", "signature": []}]}}

{"script": {
   "parameters": [{"name": "remote_uri", "type": "string"},
                  {"name": "timeout", "type": "number",
                   "default": 10000}],
   "source_code":
     "(define id (create-source {\"remote\" remote_uri}))\n(wait id timeout)",
   "outputs": [{"name": "id", "type": "source-id"}]}}

Rich metadata, reuse and shareability of WhizzML code
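The `parameters` metadata in the script resource above is enough to check a caller's inputs before any code runs. The Python sketch below shows one way such a check could look; `resolve_inputs` and the `TYPE_CHECKS` table are assumptions made for this illustration, covering only the two types that appear on the slide, and do not describe the actual service.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def resolve_inputs(parameters, inputs):
    """Fill in defaults and type-check `inputs` against declared `parameters`."""
    resolved = {}
    for param in parameters:
        name = param["name"]
        if name in inputs:
            value = inputs[name]
        elif "default" in param:
            value = param["default"]
        else:
            raise ValueError(f"missing required input: {name}")
        check = TYPE_CHECKS.get(param["type"])
        if check is not None and not check(value):
            raise TypeError(f"{name} should be of type {param['type']}")
        resolved[name] = value
    unknown = set(inputs) - {param["name"] for param in parameters}
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")
    return resolved


parameters = [{"name": "remote_uri", "type": "string"},
              {"name": "timeout", "type": "number", "default": 10000}]
inputs = resolve_inputs(parameters, {"remote_uri": "https://example.com/data.csv"})
```

Here `timeout` falls back to its declared default of 10000 because the caller did not supply it.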

Executions as a Machine Learning Resource

{"execution": {"script_id": "1a2232bf3498f95dde",
               "username": "bittwidler",
               "tlp": 4,
               "resource_limits": {"total": 50,
                                   "source": 10,
                                   "dataset": 5,
                                   "model": 10},
               "max_execution_time": 3600,
               "max_execution_steps": 10000,
               "max_recursion_depth": 1024}}
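The `resource_limits` entry suggests that an execution is charged for every resource it creates. The Python sketch below shows one plausible way to enforce such a budget. The class name `ResourceBudget`, and the reading of `total` as a cap across all resource types, are assumptions made for this illustration rather than documented behaviour.

```python
class ResourceBudget:
    """Track resource creations against per-type and total limits (hypothetical)."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {}

    def charge(self, kind):
        """Record one creation of `kind`, or raise if a limit would be exceeded."""
        total_used = sum(self.used.values())
        if "total" in self.limits and total_used >= self.limits["total"]:
            raise RuntimeError("total resource limit reached")
        if kind in self.limits and self.used.get(kind, 0) >= self.limits[kind]:
            raise RuntimeError(f"{kind} limit reached")
        self.used[kind] = self.used.get(kind, 0) + 1


budget = ResourceBudget({"total": 50, "source": 10, "dataset": 5, "model": 10})
for _ in range(5):
    budget.charge("dataset")      # the first five datasets fit within the limit
```

A sixth `budget.charge("dataset")` would raise, which is the kind of guard that lets a runaway loop in a script stop early instead of creating resources indefinitely.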

WhizzML: Client-side fortes

A better client-side:

- Better interactive experience: read-eval-print loop
- Scripts usable from the user's machine
- Interoperability: Java, JavaScript and NodeJS REPLs
- Challenge: behavioural coherence between server and client sides

Challenges

Solved

- Local REPL and remote shared implementation
- Automatic parallelization
- Error reporting
- Traceability: stack traces and stepwise execution

Open

- Better error management (dynamic typing, type inferencer)
- Resumable workflows
- Data locality: optimizing repeated access to the same datasets
