PipelineAI + TensorFlow AI + Spark ML + Kubernetes + Istio + AWS SageMaker + Google Cloud ML +...
TRANSCRIPT
HIGH PERFORMANCE MODEL SERVING WITH KUBERNETES AND ISTIO…
…AND AWS SAGEMAKER, GOOGLE CLOUD ML, AZURE ML!
CHRIS FREGLY, FOUNDER @ PIPELINE.AI
RECENT PIPELINE.AI NEWS
Sept 2017
Dec 2017
INTRODUCTIONS: ME
§ Chris Fregly, Founder & Engineer @PipelineAI
§ Formerly Netflix, Databricks, IBM Spark Tech
§ Advanced Spark and TensorFlow Meetup
§ Please Join Our 60,000+ Global Members!!
Contact [email protected]
@cfregly
Global Locations
* San Francisco
* Chicago
* Austin
* Washington DC
* Dusseldorf
* London
INTRODUCTIONS: YOU
§ Software Engineer, DevOps Engineer, Data {Scientist, Engineer}
§ Interested in Optimizing and Deploying TF Models to Production
§ Nice to Have a Working Knowledge of TensorFlow (Not Required)
PIPELINE.AI IS 100% OPEN SOURCE
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Some VC’s Value GitHub Stars @ $1,500 Each (?!)
PIPELINE.AI OVERVIEW
450,000 Docker Downloads
60,000 Users Registered for GA
60,000 Meetup Members
40,000 LinkedIn Followers
2,200 GitHub Stars
12 Enterprise Beta Users
WHY HEAVY FOCUS ON MODEL SERVING?

Model Training (100's of Training Jobs per Day)
§ Batch & Boring
§ Offline in Research Lab
§ Pipeline Ends at Training
§ No Insight into Live Production
§ Small Number of Data Scientists
§ Optimizations Very Well-Known

Model Serving (1,000,000's of Predictions per Sec)
§ Real-Time & Exciting!!
§ Online in Live Production
§ Pipeline Extends into Production
§ Continuous Insight into Live Production
§ Huge Number of Application Users
§ Many Optimizations Not Yet Utilized
AGENDA
Part 0: Latest PipelineAI Research
Part 1: PipelineAI + Kubernetes + Istio
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
PACKAGE MODEL + RUNTIME AS ONE
§ Build Model with Runtime into Immutable Docker Image
§ Emphasize Immutable Deployment and Infrastructure
  § Same Runtime Dependencies in All Environments
  § Local, Development, Staging, Production
  § No Library or Dependency Surprises
§ Deploy and Tune Model + Runtime Together

Build Local Model Server A:
pipeline predict-server-build --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A \
  --model-path=./models/tensorflow/mnist/
LOAD TEST LOCAL MODEL + RUNTIME
§ Perform Mini-Load Test on Local Model Server
§ Immediate, Local Prediction Performance Metrics
§ Compare to Previous Model + Runtime Variations

Start Local Model Server A:
pipeline predict-server-start --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A

Load Test Local Model Server A:
pipeline predict --model-endpoint-url=http://localhost:8080 \
  --test-request-path=test_request.json \
  --test-request-concurrency=1000
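The concurrency load test above can be approximated in plain Python: fire N requests at a fixed concurrency and report latency percentiles. The `send` callable below is a stand-in stub (a real run would POST `test_request.json` to the model endpoint at `http://localhost:8080`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send, num_requests=1000, concurrency=50):
    """Send requests at the given concurrency; return latency stats in ms."""
    def timed_call(_):
        start = time.perf_counter()
        send()
        return (time.perf_counter() - start) * 1000
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(num_requests)))
    return {
        "count": len(latencies),
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95)],
        "max_ms": latencies[-1],
    }

# Stand-in for an HTTP call to the local model server (sleep simulates ~1ms latency)
stats = load_test(lambda: time.sleep(0.001), num_requests=200, concurrency=20)
```

Comparing these stats across model tags (A vs. B vs. C) before pushing gives the "previous variations" comparison from the bullet above.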
PUSH IMAGE TO DOCKER REGISTRY
§ Supports All Public + Private Docker Registries
  § DockerHub, Artifactory, Quay, AWS, Google, …
  § Or Self-Hosted, Private Docker Registry

Push Image to Docker Registry:
pipeline predict-server-push --image-registry-url=<your-registry> \
  --image-registry-repo=<your-repo> \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A
CLOUD-BASED OPTIONS
§ AWS SageMaker
  § Released Nov 2017 @ re:Invent
  § Custom Docker Images for Training/Serving (e.g., PipelineAI Images)
  § Distributed TensorFlow Training through Estimator API
  § Traffic Splitting for A/B Model Testing
§ Google Cloud ML Engine
  § Mostly Command-Line Based
  § Driving TensorFlow Open Source API (e.g., Experiment API)
§ Azure ML
TUNE MODEL + RUNTIME AS SINGLE UNIT
§ Model Training Optimizations
  § Model Hyper-Parameters (e.g., Learning Rate)
  § Reduced Precision (e.g., FP16 Half Precision)
§ Post-Training Model Optimizations
  § Quantize Model Weights + Activations from 32-bit to 8-bit
  § Fuse Neural Network Layers Together
§ Model Runtime Optimizations
  § Runtime Configs (e.g., Request Batch Size)
  § Different Runtimes (e.g., TensorFlow Lite, Nvidia TensorRT)
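The 32-bit-to-8-bit weight quantization mentioned above can be sketched in plain Python as affine quantization over a weight list. The helper names are illustrative, not PipelineAI or TensorFlow API:

```python
def quantize_weights(weights, bits=8):
    """Affine-quantize float weights to unsigned ints of the given bit width."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1           # 255 levels for 8-bit
    scale = (hi - lo) / levels or 1.0  # avoid divide-by-zero for constant weights
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_weights(q, scale, lo):
    """Map the quantized integers back to approximate float weights."""
    return [lo + qi * scale for qi in q]

weights = [-1.5, -0.25, 0.0, 0.7, 2.0]
q, scale, lo = quantize_weights(weights)
restored = dequantize_weights(q, scale, lo)
# each restored value is within one quantization step (scale) of the original
```

The 4x size reduction comes from storing `q` as uint8 instead of float32; real toolchains (e.g., the Graph Transform Tool below) apply the same idea per tensor.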
POST-TRAINING OPTIMIZATIONS
§ Prepare Model for Serving
  § Simplify Network
  § Reduce Model Size
  § Lower Precision for Fast Math
§ Some Tools
  § Graph Transform Tool (GTT)
  § tfcompile

pipeline optimize --optimization-list=[quantize_weights,tfcompile] \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A \
  --model-path=./tensorflow/mnist/model \
  --output-path=./tensorflow/mnist/optimized_model

(Chart: linear regression model, after training vs. after optimizing)
RUNTIME OPTION: TENSORFLOW LITE
§ Post-Training Model Optimizations
§ Currently Supports iOS and Android
§ On-Device Prediction Runtime
  § Low-Latency, Fast Startup
§ Selective Operator Loading
  § 70KB Min - 300KB Max Runtime Footprint
§ Supports Accelerators (GPU, TPU)
  § Falls Back to CPU without Accelerator
§ Java and C++ APIs
RUNTIME OPTION: NVIDIA TENSORRT
§ Post-Training Model Optimizations
  § Specific to Nvidia GPUs
§ GPU-Optimized Prediction Runtime
  § Alternative to TensorFlow Serving
§ PipelineAI Supports TensorRT!
DEPLOY MODELS SAFELY TO PROD
§ Deploy from CLI or Jupyter Notebook
§ Tear-Down or Rollback Models Quickly
§ Shadow Canary Deploy: e.g., 20% Live Traffic
§ Split Canary Deploy: e.g., 97-2-1% Live Traffic

Start Production Model Cluster B:
pipeline predict-cluster-start --model-runtime=tflite \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=B \
  --traffic-split=2

Start Production Model Cluster C:
pipeline predict-cluster-start --model-runtime=tensorrt \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=C \
  --traffic-split=1

Start Production Model Cluster A:
pipeline predict-cluster-start --model-runtime=tfserving_gpu \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A \
  --traffic-split=97
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
COMPARE MODELS OFFLINE & ONLINE
§ Offline, Batch Metrics
  § Validation + Training Accuracy
  § CPU + GPU Utilization
§ Live Prediction Values
  § Compare Relative Precision
  § Newly-Seen, Streaming Data
§ Online, Real-Time Metrics
  § Response Time, Throughput
  § Cost ($) Per Prediction
VIEW REAL-TIME PREDICTION STREAM
§ Visually Compare Real-Time Predictions

(Dashboard: prediction inputs; prediction results & confidences for Model A, Model B, Model C)
PREDICTION PROFILING AND TUNING
§ Pinpoint Performance Bottlenecks
§ Fine-Grained Prediction Metrics
§ 3 Steps in Real-Time Prediction
  1. transform_request()
  2. predict()
  3. transform_response()
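The per-step profiling above can be sketched by timing each of the three stages separately. The handler bodies here are stand-in stubs, not PipelineAI internals:

```python
import time

def transform_request(raw):
    # stub: parse the raw request payload into model input features
    return [float(x) for x in raw.split(",")]

def predict(features):
    # stub: stand-in model; real serving would call TF Serving / TFLite / TensorRT
    return sum(features)

def transform_response(result):
    # stub: serialize the prediction into a response payload
    return {"prediction": result}

def timed_predict(raw):
    """Run the 3 serving steps, recording per-stage wall-clock time in ms."""
    timings = {}
    start = time.perf_counter()
    features = transform_request(raw)
    timings["transform_request_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    result = predict(features)
    timings["predict_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    response = transform_response(result)
    timings["transform_response_ms"] = (time.perf_counter() - start) * 1000
    return response, timings

response, timings = timed_predict("1.0,2.0,3.0")
```

Breaking latency out this way pinpoints whether the bottleneck is request parsing, the model itself, or response serialization.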
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, ADAPTIVE TRAFFIC ROUTING
§ A/B Tests
  § Inflexible and Boring
§ Multi-Armed Bandits
  § Adaptive and Exciting!

Adjust Traffic Routing Dynamically:
pipeline traffic-router-split --model-type=tensorflow \
  --model-name=mnist \
  --model-tag-list=[A,B,C] \
  --model-weight-list=[1,2,97]
SHIFT TRAFFIC TO MAX(REVENUE)
§ Shift Traffic to Winning Model Using AI Bandit Algos
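One simple bandit scheme for this kind of traffic shift is epsilon-greedy: route most traffic to the model with the best observed reward (e.g., revenue per prediction) while keeping a small exploration share on the others. A sketch under that assumption, not the PipelineAI algorithm:

```python
def epsilon_greedy_weights(rewards, epsilon=0.10):
    """Turn per-model average rewards into traffic weights summing to 100.

    rewards: dict of model tag -> observed average reward per prediction.
    The best model gets (1 - epsilon) of traffic; the rest split epsilon.
    """
    best = max(rewards, key=rewards.get)
    others = [tag for tag in rewards if tag != best]
    weights = {best: round((1 - epsilon) * 100)}
    for tag in others:
        weights[tag] = round(epsilon * 100 / len(others))
    return weights

# Hypothetical observed revenue per prediction for models A, B, C
weights = epsilon_greedy_weights({"A": 0.010, "B": 0.012, "C": 0.009})
# model B wins, so it receives the bulk of the traffic
```

The resulting weights map directly onto the `--model-weight-list` argument of the traffic-router command on the previous slide.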
SHIFT TRAFFIC TO MIN(CLOUD CO$T)
§ Based on Cost ($) Per Prediction
§ Cost Changes Throughout the Day
  § Lose AWS Spot Instances
  § Google Cloud Becomes Cheaper
§ Shift Across Clouds & On-Prem
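The cost-based shift above reduces to periodically re-picking the deployment with the lowest current cost per prediction. A minimal sketch; the target names and prices are made up:

```python
def cheapest_deployment(costs):
    """Return the deployment target with the lowest cost ($) per prediction."""
    return min(costs, key=costs.get)

# Hypothetical spot/on-demand prices per 1M predictions, refreshed as prices change
costs = {"aws-spot": 0.42, "gcp": 0.38, "azure": 0.55, "on-prem": 0.50}
target = cheapest_deployment(costs)
```

When a spot instance is lost or another cloud's price drops, re-running this selection and shifting the traffic weights toward `target` implements the cross-cloud behavior described above.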
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, CONTINUOUS MODEL TRAINING
§ The Holy Grail of Machine Learning
§ Q1 2018: PipelineAI Supports Continuous Model Training!
  § Kafka, Kinesis
  § Spark Streaming
PSEUDO-CONTINUOUS TRAINING
§ Identify and Fix Borderline Predictions (~50-50% Confidence)
§ Fix Along Class Boundaries
§ Retrain Newly-Labeled Data
§ Game-ify Labeling Process
§ Enable Crowd Sourcing
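Selecting the borderline predictions described above can be sketched as filtering binary-classification confidences near 0.5; the margin and record shape here are illustrative:

```python
def borderline(predictions, margin=0.10):
    """Select predictions whose confidence is within `margin` of 50-50."""
    return [p for p in predictions if abs(p["confidence"] - 0.5) <= margin]

predictions = [
    {"id": 1, "confidence": 0.98},  # confident: leave as-is
    {"id": 2, "confidence": 0.52},  # borderline: queue for human labeling
    {"id": 3, "confidence": 0.45},  # borderline: queue for human labeling
    {"id": 4, "confidence": 0.03},  # confident: leave as-is
]
to_label = borderline(predictions)
```

The `to_label` set is what would flow into the game-ified, crowd-sourced labeling queue before retraining.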
DEMOS!!
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
AGENDA
Part 0: Latest PipelineAI Research
Part 1: PipelineAI + Kubernetes + Istio
SPECIAL THANKS TO CHRISTIAN POSTA
§ http://blog.christianposta.com/istio-workshop/slides/
KUBERNETES INGRESS
§ Single Service
  § Can Also Use Service (LoadBalancer or NodePort)
§ Fan Out & Name-Based Virtual Hosting
  § Route Traffic Using Path or Host Header
  § Reduces # of Load Balancers Needed
  § 404 Implemented as Default Backend
§ Federation / Hybrid-Cloud
  § Creates Ingress Objects in Every Cluster
  § Monitors Health and Capacity of Pods within Each Cluster
  § Routes Clients to Appropriate Backend Anywhere in Federation
Fan Out (Path):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gateway-fanout
  annotations:
    kubernetes.io/ingress.class: istio
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        backend:
          serviceName: s1
          servicePort: 80
      - path: /bar
        backend:
          serviceName: s2
          servicePort: 80
Virtual Hosting (Host):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gateway-virtualhost
  annotations:
    kubernetes.io/ingress.class: istio
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - backend:
          serviceName: s1
          servicePort: 80
  - host: bar.foo.com
    http:
      paths:
      - backend:
          serviceName: s2
          servicePort: 80
KUBERNETES INGRESS CONTROLLER
§ Ingress Controller Types
  § Google Cloud: kubernetes.io/ingress.class: gce
  § Nginx: kubernetes.io/ingress.class: nginx
  § Istio: kubernetes.io/ingress.class: istio
§ Must Start Ingress Controller Manually
  § Just Deploying Ingress Is Not Enough
  § Not Started by kube-controller-manager
  § Start Istio Ingress Controller:
kubectl apply -f \
  $ISTIO_INSTALL_PATH/install/kubernetes/istio.yaml
ISTIO ARCHITECTURE: ENVOY
§ Lyft Project
§ High-Perf Proxy (C++)
§ Lots of Metrics
§ Zone-Aware
§ Service Discovery
§ Load Balancing
§ Fault Injection, Circuits
§ %-Based Traffic Split, Shadow
§ Sidecar Pattern
§ Rate Limiting, Retries, Outlier Detection, Timeout with Budget, …
ISTIO ARCHITECTURE: MIXER
§ Enforce Access Control
§ Evaluate Request Attributes
§ Collect Metrics
§ Platform-Independent
§ Extensible Plugin Model
ISTIO ARCHITECTURE: PILOT
§ Envoy Service Discovery
§ Intelligent Routing
  § A/B Tests
  § Canary Deployments
§ RouteRule -> Envoy Config
  § Propagates to Sidecars
§ Supports Kube, Consul, ...
ISTIO ARCHITECTURE: ISTIO-AUTH
§ Mutual TLS Auth
§ Credential Management
  § Uses Service Identity
§ Fine-Grained ACLs
  § Attribute & Role-Based
§ Auditing & Monitoring
ISTIO ROUTE RULES
§ Kubernetes Custom Resource Definition (CRD)

kind: CustomResourceDefinition
metadata:
  name: routerules.config.istio.io
spec:
  group: config.istio.io
  names:
    kind: RouteRule
    listKind: RouteRuleList
    plural: routerules
    singular: routerule
  scope: Namespaced
  version: v1alpha2
A/B & BANDIT MODEL TESTING
§ Live Experiments in Production
  § Compare Existing Model A with Model B, Model C
§ Safe Split-Canary Deployment
§ Tip: Keep Ingress Simple – Use Route Rules Instead!
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: live-experiment-20-5-75
spec:
  destination:
    name: predict-mnist
  precedence: 2  # Greater than global deny-all
  route:
  - labels:
      version: A
    weight: 20   # 20% still routes to model A
  - labels:
      version: B
    weight: 5    # 5% routes to new model B
  - labels:
      version: C
    weight: 75   # 75% routes to new model C
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: live-experiment-1-2-97
spec:
  destination:
    name: predict-mnist
  precedence: 2  # Greater than global deny-all
  route:
  - labels:
      version: A
    weight: 1    # 1% routes to model A
  - labels:
      version: B
    weight: 2    # 2% routes to new model B
  - labels:
      version: C
    weight: 97   # 97% routes to new model C
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: live-experiment-97-2-1
spec:
  destination:
    name: predict-mnist
  precedence: 2  # Greater than global deny-all
  route:
  - labels:
      version: A
    weight: 97   # 97% still routes to model A
  - labels:
      version: B
    weight: 2    # 2% routes to new model B
  - labels:
      version: C
    weight: 1    # 1% routes to new model C
ISTIO AUTO-SCALING
§ Traffic Routing and Auto-Scaling Occur Independently
§ Istio Continues to Obey Traffic Splits After Auto-Scaling
§ Auto-Scaling May Occur In Response to New Traffic Route
ADVANCED ROUTING RULES
§ Content-Based Routing
  § Uses Headers, Username, Payload, …
§ Cross-Environment Routing
  § Shadow Traffic prod => staging
ISTIO DESTINATION POLICIES
§ Load Balancing
  § ROUND_ROBIN (default)
  § LEAST_CONN (between 2 randomly-selected hosts)
  § RANDOM
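LEAST_CONN's "between 2 randomly-selected hosts" is the power-of-two-choices trick: sample two hosts at random and send the request to the one with fewer active connections. A sketch; the pod names and connection counts are illustrative:

```python
import random

def least_conn_pick(active_conns, rng=random):
    """Power-of-two-choices: sample 2 hosts, return the less-loaded one."""
    host_a, host_b = rng.sample(list(active_conns), 2)
    return host_a if active_conns[host_a] <= active_conns[host_b] else host_b

# Hypothetical in-flight connection counts per backend pod
active_conns = {"pod-1": 12, "pod-2": 3, "pod-3": 7}
choice = least_conn_pick(active_conns)
# the most-loaded pod is never chosen, since it loses every pairwise comparison
```

Sampling only two hosts avoids scanning the full host set per request while still strongly biasing traffic away from hot pods.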
§ Circuit Breaker
  § Max Connections
  § Max Requests per Connection
  § Consecutive Errors
  § Penalty Timer (15 mins)
  § Scan Window (5 mins)

circuitBreaker:
  simpleCb:
    maxConnections: 100
    httpMaxRequests: 1000
    httpMaxRequestsPerConnection: 10
    httpConsecutiveErrors: 7
    sleepWindow: 15m
    httpDetectionInterval: 5m
ISTIO EGRESS
§ Whitelisted Domains Accessible Within Service Mesh
§ Apply RouteRules and DestinationPolicies
§ Supports TLS, HTTP, gRPC

kind: EgressRule
metadata:
  name: foo-egress-rule
spec:
  destination:
    service: api.pipeline.ai
  ports:
  - port: 80
    protocol: http
  - port: 443
    protocol: https
ISTIO & CHAOS + LATENCY MONKEYS
§ Fault Injection
  § Delay
  § Abort

Abort Fault:
kind: RouteRule
metadata:
  name: predict-mnist
spec:
  destination:
    name: predict-mnist
  httpFault:
    abort:
      httpStatus: 420
      percent: 100

Delay Fault:
kind: RouteRule
metadata:
  name: predict-mnist
spec:
  destination:
    name: predict-mnist
  httpFault:
    delay:
      fixedDelay: 7.000s
      percent: 100
ISTIO METRICS AND MONITORING
§ Verify Traffic Splits
§ Fine-Grained Request Tracing
ISTIO SECURITY
§ Istio Certificate Authority
§ Mutual TLS
AGENDA
Part 0: Latest PipelineAI Research
Part 1: PipelineAI + Kubernetes + Istio
THANK YOU!!
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Reminder: VC’s Value GitHub Stars @ $1,500 Each (!!)
Contact [email protected]
@cfregly