PipelineAI + TensorFlow AI + Spark ML + Kubernetes + Istio + AWS SageMaker + Google Cloud ML +...
TRANSCRIPT
HIGH PERFORMANCE MODEL SERVING WITH KUBERNETES AND ISTIO…
…AND AWS SAGEMAKER, GOOGLE CLOUD ML, AZURE ML!
CHRIS FREGLY, FOUNDER @ PIPELINE.AI
RECENT PIPELINE.AI NEWS
Sept 2017
Dec 2017
INTRODUCTIONS: ME
§ Chris Fregly, Founder & Engineer @PipelineAI
§ Formerly Netflix, Databricks, IBM Spark Tech
§ Advanced Spark and TensorFlow Meetup
§ Please Join Our 60,000+ Global Members!!
Contact [email protected]
@cfregly
Global Locations
* San Francisco
* Chicago
* Austin
* Washington DC
* Dusseldorf
* London
INTRODUCTIONS: YOU
§ Software Engineer, DevOps Engineer, Data {Scientist, Engineer}
§ Interested in Optimizing and Deploying TF Models to Production
§ Nice to Have a Working Knowledge of TensorFlow (Not Required)
PIPELINE.AI IS 100% OPEN SOURCE
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Some VC’s Value GitHub Stars @ $1,500 Each (?!)
PIPELINE.AI OVERVIEW
450,000 Docker Downloads
60,000 Users Registered for GA
60,000 Meetup Members
40,000 LinkedIn Followers
2,200 GitHub Stars
12 Enterprise Beta Users
WHY HEAVY FOCUS ON MODEL SERVING?

Model Training (100's of Training Jobs per Day)
§ Batch & Boring
§ Offline in Research Lab
§ Pipeline Ends at Training
§ No Insight into Live Production
§ Small Number of Data Scientists
§ Optimizations Very Well-Known

Model Serving (1,000,000's of Predictions per Sec)
§ Real-Time & Exciting!!
§ Online in Live Production
§ Pipeline Extends into Production
§ Continuous Insight into Live Production
§ Huge Number of Application Users
§ Many Optimizations Not Yet Utilized
AGENDA
Part 0: Latest PipelineAI Research
Part 1: PipelineAI + Kubernetes + Istio
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
PACKAGE MODEL + RUNTIME AS ONE
§ Build Model with Runtime into Immutable Docker Image
§ Emphasize Immutable Deployment and Infrastructure
  § Same Runtime Dependencies in All Environments
  § Local, Development, Staging, Production
  § No Library or Dependency Surprises
§ Deploy and Tune Model + Runtime Together

Build Local Model Server A:
pipeline predict-server-build --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A \
  --model-path=./models/tensorflow/mnist/
LOAD TEST LOCAL MODEL + RUNTIME
§ Perform Mini-Load Test on Local Model Server
§ Immediate, Local Prediction Performance Metrics
§ Compare to Previous Model + Runtime Variations

Start Local Model Server A:
pipeline predict-server-start --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A

Load Test Local Model Server A:
pipeline predict --model-endpoint-url=http://localhost:8080 \
  --test-request-path=test_request.json \
  --test-request-concurrency=1000
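The concurrency load test above can be approximated in plain Python: fire N requests at a fixed concurrency and report latency percentiles. The `send` callable below is a stand-in stub (a real run would POST `test_request.json` to the model endpoint at `http://localhost:8080`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send, num_requests=1000, concurrency=50):
    """Send requests at the given concurrency; return latency stats in ms."""
    def timed_call(_):
        start = time.perf_counter()
        send()
        return (time.perf_counter() - start) * 1000
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(num_requests)))
    return {
        "count": len(latencies),
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95)],
        "max_ms": latencies[-1],
    }

# Stand-in for an HTTP call to the local model server (sleep simulates ~1ms latency)
stats = load_test(lambda: time.sleep(0.001), num_requests=200, concurrency=20)
```

Comparing these stats across model tags (A vs. B vs. C) before pushing gives the "previous variations" comparison from the bullet above.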
PUSH IMAGE TO DOCKER REGISTRY
§ Supports All Public + Private Docker Registries
  § DockerHub, Artifactory, Quay, AWS, Google, …
  § Or Self-Hosted, Private Docker Registry

Push Image to Docker Registry:
pipeline predict-server-push --image-registry-url=<your-registry> \
  --image-registry-repo=<your-repo> \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A
CLOUD-BASED OPTIONS
§ AWS SageMaker
  § Released Nov 2017 @ re:Invent
  § Custom Docker Images for Training/Serving (e.g., PipelineAI Images)
  § Distributed TensorFlow Training through Estimator API
  § Traffic Splitting for A/B Model Testing
§ Google Cloud ML Engine
  § Mostly Command-Line Based
  § Driving TensorFlow Open Source API (e.g., Experiment API)
§ Azure ML
TUNE MODEL + RUNTIME AS SINGLE UNIT
§ Model Training Optimizations
  § Model Hyper-Parameters (e.g., Learning Rate)
  § Reduced Precision (e.g., FP16 Half Precision)
§ Post-Training Model Optimizations
  § Quantize Model Weights + Activations from 32-bit to 8-bit
  § Fuse Neural Network Layers Together
§ Model Runtime Optimizations
  § Runtime Configs (e.g., Request Batch Size)
  § Different Runtimes (e.g., TensorFlow Lite, Nvidia TensorRT)
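The 32-bit-to-8-bit weight quantization mentioned above can be sketched in plain Python as affine quantization over a weight list. The helper names are illustrative, not PipelineAI or TensorFlow API:

```python
def quantize_weights(weights, bits=8):
    """Affine-quantize float weights to unsigned ints of the given bit width."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1           # 255 levels for 8-bit
    scale = (hi - lo) / levels or 1.0  # avoid divide-by-zero for constant weights
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_weights(q, scale, lo):
    """Map the quantized integers back to approximate float weights."""
    return [lo + qi * scale for qi in q]

weights = [-1.5, -0.25, 0.0, 0.7, 2.0]
q, scale, lo = quantize_weights(weights)
restored = dequantize_weights(q, scale, lo)
# each restored value is within one quantization step (scale) of the original
```

The 4x size reduction comes from storing `q` as uint8 instead of float32; real toolchains (e.g., the Graph Transform Tool below) apply the same idea per tensor.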
POST-TRAINING OPTIMIZATIONS
§ Prepare Model for Serving
  § Simplify Network
  § Reduce Model Size
  § Lower Precision for Fast Math
§ Some Tools
  § Graph Transform Tool (GTT)
  § tfcompile

pipeline optimize --optimization-list=[quantize_weights,tfcompile] \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A \
  --model-path=./tensorflow/mnist/model \
  --output-path=./tensorflow/mnist/optimized_model

(Chart: linear regression model, after training vs. after optimizing)
RUNTIME OPTION: TENSORFLOW LITE
§ Post-Training Model Optimizations
§ Currently Supports iOS and Android
§ On-Device Prediction Runtime
  § Low-Latency, Fast Startup
§ Selective Operator Loading
  § 70KB Min - 300KB Max Runtime Footprint
§ Supports Accelerators (GPU, TPU)
  § Falls Back to CPU without Accelerator
§ Java and C++ APIs
RUNTIME OPTION: NVIDIA TENSORRT
§ Post-Training Model Optimizations
  § Specific to Nvidia GPUs
§ GPU-Optimized Prediction Runtime
  § Alternative to TensorFlow Serving
§ PipelineAI Supports TensorRT!
DEPLOY MODELS SAFELY TO PROD
§ Deploy from CLI or Jupyter Notebook
§ Tear-Down or Rollback Models Quickly
§ Shadow Canary Deploy: e.g., 20% Live Traffic
§ Split Canary Deploy: e.g., 97-2-1% Live Traffic

Start Production Model Cluster B:
pipeline predict-cluster-start --model-runtime=tflite \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=B \
  --traffic-split=2

Start Production Model Cluster C:
pipeline predict-cluster-start --model-runtime=tensorrt \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=C \
  --traffic-split=1

Start Production Model Cluster A:
pipeline predict-cluster-start --model-runtime=tfserving_gpu \
  --model-type=tensorflow \
  --model-name=mnist \
  --model-tag=A \
  --traffic-split=97
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
COMPARE MODELS OFFLINE & ONLINE
§ Offline, Batch Metrics
  § Validation + Training Accuracy
  § CPU + GPU Utilization
§ Live Prediction Values
  § Compare Relative Precision
  § Newly-Seen, Streaming Data
§ Online, Real-Time Metrics
  § Response Time, Throughput
  § Cost ($) Per Prediction
VIEW REAL-TIME PREDICTION STREAM
§ Visually Compare Real-Time Predictions

(Dashboard: prediction inputs; prediction results & confidences for Model A, Model B, Model C)
PREDICTION PROFILING AND TUNING
§ Pinpoint Performance Bottlenecks
§ Fine-Grained Prediction Metrics
§ 3 Steps in Real-Time Prediction
  1. transform_request()
  2. predict()
  3. transform_response()
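The per-step profiling above can be sketched by timing each of the three stages separately. The handler bodies here are stand-in stubs, not PipelineAI internals:

```python
import time

def transform_request(raw):
    # stub: parse the raw request payload into model input features
    return [float(x) for x in raw.split(",")]

def predict(features):
    # stub: stand-in model; real serving would call TF Serving / TFLite / TensorRT
    return sum(features)

def transform_response(result):
    # stub: serialize the prediction into a response payload
    return {"prediction": result}

def timed_predict(raw):
    """Run the 3 serving steps, recording per-stage wall-clock time in ms."""
    timings = {}
    start = time.perf_counter()
    features = transform_request(raw)
    timings["transform_request_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    result = predict(features)
    timings["predict_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    response = transform_response(result)
    timings["transform_response_ms"] = (time.perf_counter() - start) * 1000
    return response, timings

response, timings = timed_predict("1.0,2.0,3.0")
```

Breaking latency out this way pinpoints whether the bottleneck is request parsing, the model itself, or response serialization.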
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, ADAPTIVE TRAFFIC ROUTING
§ A/B Tests
  § Inflexible and Boring
§ Multi-Armed Bandits
  § Adaptive and Exciting!

Adjust Traffic Routing Dynamically:
pipeline traffic-router-split --model-type=tensorflow \
  --model-name=mnist \
  --model-tag-list=[A,B,C] \
  --model-weight-list=[1,2,97]
SHIFT TRAFFIC TO MAX(REVENUE)
§ Shift Traffic to Winning Model Using AI Bandit Algos
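One simple bandit scheme for this kind of traffic shift is epsilon-greedy: route most traffic to the model with the best observed reward (e.g., revenue per prediction) while keeping a small exploration share on the others. A sketch under that assumption, not the PipelineAI algorithm:

```python
def epsilon_greedy_weights(rewards, epsilon=0.10):
    """Turn per-model average rewards into traffic weights summing to 100.

    rewards: dict of model tag -> observed average reward per prediction.
    The best model gets (1 - epsilon) of traffic; the rest split epsilon.
    """
    best = max(rewards, key=rewards.get)
    others = [tag for tag in rewards if tag != best]
    weights = {best: round((1 - epsilon) * 100)}
    for tag in others:
        weights[tag] = round(epsilon * 100 / len(others))
    return weights

# Hypothetical observed revenue per prediction for models A, B, C
weights = epsilon_greedy_weights({"A": 0.010, "B": 0.012, "C": 0.009})
# model B wins, so it receives the bulk of the traffic
```

The resulting weights map directly onto the `--model-weight-list` argument of the traffic-router command on the previous slide.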
SHIFT TRAFFIC TO MIN(CLOUD CO$T)
§ Based on Cost ($) Per Prediction
§ Cost Changes Throughout the Day
  § Lose AWS Spot Instances
  § Google Cloud Becomes Cheaper
§ Shift Across Clouds & On-Prem
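The cost-based shift above reduces to periodically re-picking the deployment with the lowest current cost per prediction. A minimal sketch; the target names and prices are made up:

```python
def cheapest_deployment(costs):
    """Return the deployment target with the lowest cost ($) per prediction."""
    return min(costs, key=costs.get)

# Hypothetical spot/on-demand prices per 1M predictions, refreshed as prices change
costs = {"aws-spot": 0.42, "gcp": 0.38, "azure": 0.55, "on-prem": 0.50}
target = cheapest_deployment(costs)
```

When a spot instance is lost or another cloud's price drops, re-running this selection and shifting the traffic weights toward `target` implements the cross-cloud behavior described above.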
AGENDA
Part 0: Latest PipelineAI Research
§ Deploy, Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, CONTINUOUS MODEL TRAINING
§ The Holy Grail of Machine Learning
§ Q1 2018: PipelineAI Supports Continuous Model Training!
  § Kafka, Kinesis
  § Spark Streaming
PSEUDO-CONTINUOUS TRAINING
§ Identify and Fix Borderline Predictions (~50-50% Confidence)
§ Fix Along Class Boundaries
§ Retrain Newly-Labeled Data
§ Game-ify Labeling Process
§ Enable Crowd Sourcing
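Selecting the borderline predictions described above can be sketched as filtering binary-classification confidences near 0.5; the margin and record shape here are illustrative:

```python
def borderline(predictions, margin=0.10):
    """Select predictions whose confidence is within `margin` of 50-50."""
    return [p for p in predictions if abs(p["confidence"] - 0.5) <= margin]

predictions = [
    {"id": 1, "confidence": 0.98},  # confident: leave as-is
    {"id": 2, "confidence": 0.52},  # borderline: queue for human labeling
    {"id": 3, "confidence": 0.45},  # borderline: queue for human labeling
    {"id": 4, "confidence": 0.03},  # confident: leave as-is
]
to_label = borderline(predictions)
```

The `to_label` set is what would flow into the game-ified, crowd-sourced labeling queue before retraining.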
DEMOS!!
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
AGENDA
Part 0: Latest PipelineAI Research
Part 1: PipelineAI + Kubernetes + Istio
SPECIAL THANKS TO CHRISTIAN POSTA
§ http://blog.christianposta.com/istio-workshop/slides/
KUBERNETES INGRESS
§ Single Service
  § Can Also Use Service (LoadBalancer or NodePort)
§ Fan Out & Name-Based Virtual Hosting
  § Route Traffic Using Path or Host Header
  § Reduces # of Load Balancers Needed
  § 404 Implemented as Default Backend
§ Federation / Hybrid-Cloud
  § Creates Ingress Objects in Every Cluster
  § Monitors Health and Capacity of Pods within Each Cluster
  § Routes Clients to Appropriate Backend Anywhere in Federation
Fan Out (Path):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gateway-fanout
  annotations:
    kubernetes.io/ingress.class: istio
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        backend:
          serviceName: s1
          servicePort: 80
      - path: /bar
        backend:
          serviceName: s2
          servicePort: 80
Virtual Hosting (Host):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gateway-virtualhost
  annotations:
    kubernetes.io/ingress.class: istio
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - backend:
          serviceName: s1
          servicePort: 80
  - host: bar.foo.com
    http:
      paths:
      - backend:
          serviceName: s2
          servicePort: 80
KUBERNETES INGRESS CONTROLLER
§ Ingress Controller Types
  § Google Cloud: kubernetes.io/ingress.class: gce
  § Nginx: kubernetes.io/ingress.class: nginx
  § Istio: kubernetes.io/ingress.class: istio
§ Must Start Ingress Controller Manually
  § Just Deploying Ingress Is Not Enough
  § Not Started by kube-controller-manager
  § Start Istio Ingress Controller:
kubectl apply -f \
  $ISTIO_INSTALL_PATH/install/kubernetes/istio.yaml
ISTIO ARCHITECTURE: ENVOY
§ Lyft Project
§ High-Perf Proxy (C++)
§ Lots of Metrics
§ Zone-Aware
§ Service Discovery
§ Load Balancing
§ Fault Injection, Circuits
§ %-Based Traffic Split, Shadow
§ Sidecar Pattern
§ Rate Limiting, Retries, Outlier Detection, Timeout with Budget, …
ISTIO ARCHITECTURE: MIXER
§ Enforce Access Control
§ Evaluate Request Attributes
§ Collect Metrics
§ Platform-Independent
§ Extensible Plugin Model
ISTIO ARCHITECTURE: PILOT
§ Envoy Service Discovery
§ Intelligent Routing
  § A/B Tests
  § Canary Deployments
§ RouteRule -> Envoy Config
  § Propagates to Sidecars
§ Supports Kube, Consul, ...
ISTIO ARCHITECTURE: ISTIO-AUTH
§ Mutual TLS Auth
§ Credential Management
  § Uses Service Identity
§ Fine-Grained ACLs
  § Attribute & Role-Based
§ Auditing & Monitoring
ISTIO ROUTE RULES
§ Kubernetes Custom Resource Definition (CRD)

kind: CustomResourceDefinition
metadata:
  name: routerules.config.istio.io
spec:
  group: config.istio.io
  names:
    kind: RouteRule
    listKind: RouteRuleList
    plural: routerules
    singular: routerule
  scope: Namespaced
  version: v1alpha2
A/B & BANDIT MODEL TESTING
§ Live Experiments in Production
  § Compare Existing Model A with Model B, Model C
§ Safe Split-Canary Deployment
§ Tip: Keep Ingress Simple – Use Route Rules Instead!
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: live-experiment-20-5-75
spec:
  destination:
    name: predict-mnist
  precedence: 2  # Greater than global deny-all
  route:
  - labels:
      version: A
    weight: 20   # 20% still routes to model A
  - labels:
      version: B
    weight: 5    # 5% routes to new model B
  - labels:
      version: C
    weight: 75   # 75% routes to new model C
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: live-experiment-1-2-97
spec:
  destination:
    name: predict-mnist
  precedence: 2  # Greater than global deny-all
  route:
  - labels:
      version: A
    weight: 1    # 1% routes to model A
  - labels:
      version: B
    weight: 2    # 2% routes to new model B
  - labels:
      version: C
    weight: 97   # 97% routes to new model C
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: live-experiment-97-2-1
spec:
  destination:
    name: predict-mnist
  precedence: 2  # Greater than global deny-all
  route:
  - labels:
      version: A
    weight: 97   # 97% still routes to model A
  - labels:
      version: B
    weight: 2    # 2% routes to new model B
  - labels:
      version: C
    weight: 1    # 1% routes to new model C
ISTIO AUTO-SCALING
§ Traffic Routing and Auto-Scaling Occur Independently
§ Istio Continues to Obey Traffic Splits After Auto-Scaling
§ Auto-Scaling May Occur In Response to New Traffic Route
ADVANCED ROUTING RULES
§ Content-Based Routing
  § Uses Headers, Username, Payload, …
§ Cross-Environment Routing
  § Shadow Traffic prod => staging
ISTIO DESTINATION POLICIES
§ Load Balancing
  § ROUND_ROBIN (default)
  § LEAST_CONN (between 2 randomly-selected hosts)
  § RANDOM
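LEAST_CONN's "between 2 randomly-selected hosts" is the power-of-two-choices trick: sample two hosts at random and send the request to the one with fewer active connections. A sketch; the pod names and connection counts are illustrative:

```python
import random

def least_conn_pick(active_conns, rng=random):
    """Power-of-two-choices: sample 2 hosts, return the less-loaded one."""
    host_a, host_b = rng.sample(list(active_conns), 2)
    return host_a if active_conns[host_a] <= active_conns[host_b] else host_b

# Hypothetical in-flight connection counts per backend pod
active_conns = {"pod-1": 12, "pod-2": 3, "pod-3": 7}
choice = least_conn_pick(active_conns)
# the most-loaded pod is never chosen, since it loses every pairwise comparison
```

Sampling only two hosts avoids scanning the full host set per request while still strongly biasing traffic away from hot pods.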
§ Circuit Breaker
  § Max Connections
  § Max Requests per Connection
  § Consecutive Errors
  § Penalty Timer (15 mins)
  § Scan Window (5 mins)

circuitBreaker:
  simpleCb:
    maxConnections: 100
    httpMaxRequests: 1000
    httpMaxRequestsPerConnection: 10
    httpConsecutiveErrors: 7
    sleepWindow: 15m
    httpDetectionInterval: 5m
ISTIO EGRESS
§ Whitelisted Domains Accessible Within Service Mesh
§ Apply RouteRules and DestinationPolicies
§ Supports TLS, HTTP, gRPC

kind: EgressRule
metadata:
  name: foo-egress-rule
spec:
  destination:
    service: api.pipeline.ai
  ports:
  - port: 80
    protocol: http
  - port: 443
    protocol: https
ISTIO & CHAOS + LATENCY MONKEYS
§ Fault Injection
  § Delay
  § Abort

Abort Fault:
kind: RouteRule
metadata:
  name: predict-mnist
spec:
  destination:
    name: predict-mnist
  httpFault:
    abort:
      httpStatus: 420
      percent: 100

Delay Fault:
kind: RouteRule
metadata:
  name: predict-mnist
spec:
  destination:
    name: predict-mnist
  httpFault:
    delay:
      fixedDelay: 7.000s
      percent: 100
ISTIO METRICS AND MONITORING
§ Verify Traffic Splits
§ Fine-Grained Request Tracing
ISTIO SECURITY
§ Istio Certificate Authority
§ Mutual TLS
AGENDA
Part 0: Latest PipelineAI Research
Part 1: PipelineAI + Kubernetes + Istio
THANK YOU!!
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Reminder: VC’s Value GitHub Stars @ $1,500 Each (!!)
Contact [email protected]
@cfregly