deep learning as a service - files.devnetwork.cloud · nvidia tesla available in dlaas ......

Deep Learning as a Service Susan Diamond Senior Technical Staff Member/Manager 10/2/2019




• Why is deep learning the future over traditional machine learning?

• Why are Watson services adopting DLaaS to power their model training?

• If you are a data scientist, are you ready to try out DLaaS (the lite plan provides free GPUs)? If you are in a leadership position in your company, are you ready to promote deep learning technology to power your business?


Agenda

• Why build Deep Learning as a Service (DLaaS)

• DLaaS Architecture

• Key Design Aspects in DLaaS

• Watson Services Build Models on DLaaS

• Getting Started with DLaaS APIs


Machine Learning and AI are everywhere

• Facial recognition unlocks your phone

• Fraud detection protects your credit

• Recommendations help you shop faster

• Speech recognition lets you go hands-free

• Chat bots route calls quicker

• Autonomous vehicles detect pedestrians

• Machine vision detects cancer early

• Spam detection unclogs your Inbox

The future is now


A human brain has:

• 200 billion neurons

• 32 trillion connections between them

An artificial neural network has:

• 25 million “neurons”

• 100 million connections (parameters)

Deep Learning = Training Artificial Neural Networks

Intelligence arises from system interactions


Deep learning is neural network design

Machine Learning is algorithm selection

AI is systems architecture


AI requires more…

data + compute + network complexity

[Figure: Performance of machine learning compared with deep learning]


NVIDIA Tesla available in DLaaS

Graphics Processing Units: NVIDIA® Tesla® K80 and V100

[Figure: CPU-only, K80, V100]


Training Lifecycle Management

You provide code + data + training definition. DLaaS handles the rest.

Training assets are managed and tracked.

[Diagram labels: source code, training run definition, Kubernetes container, compute cluster (NVIDIA Tesla K80, V100), training artifacts, Cloud Object Storage]


Code with your favorite frameworks and tools

Hyperparameter Optimization: Efficiently automate searching your network’s hyperparameter space to ensure the best model performance with the fewest training runs.

Graphs not Log Files: Don’t stare at text logs when you can overlay accuracy and loss graphs to dive deeper into the training of your neural networks.

Don’t be constrained. Select the framework appropriate to the unique requirements of your problem domain and skills of your team.
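
To make the idea of automated hyperparameter search on this slide concrete, here is a minimal random-search sketch in plain Python. It is not the DLaaS optimizer: `random_search`, `toy_objective`, and `SEARCH_SPACE` are hypothetical names, and the objective is a stand-in for a real training run that would return a validation score.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical search space: each hyperparameter maps to candidate values.
SEARCH_SPACE: Dict[str, List[float]] = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout": [0.0, 0.25, 0.5],
}


def toy_objective(params: Dict[str, float]) -> float:
    """Stand-in for a training run; returns a made-up validation score.

    The score is highest at learning_rate=0.01 and dropout=0.25.
    """
    return 1.0 - abs(params["learning_rate"] - 0.01) - abs(params["dropout"] - 0.25)


def random_search(
    objective: Callable[[Dict[str, float]], float],
    space: Dict[str, List[float]],
    max_runs: int,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Sample max_runs random configurations and return the best one seen."""
    rng = random.Random(seed)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for _ in range(max_runs):
        # Draw one value per hyperparameter, independently.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Calling `random_search(toy_objective, SEARCH_SPACE, max_runs=20)` returns the best configuration found and its score. A smarter optimizer would use earlier scores to pick the next configuration, which is how the number of training runs is kept small.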


DLaaS - A deep learning platform to bridge innovations/optimization across the entire stack

Stack layers:

• Infrastructure (Softlayer): new hardware, CPUs, NVIDIA GPUs

• Container and resource management

• Frameworks

• API: train/manage/watch

• Data Science Experience, Watson services

Innovations bridged:

• Advances in the cloud stack

• Improvement in training techniques

• Optimizations in DL frameworks

• Innovation in neural net design

• Better user experience and tools


Cloud native architecture

• Resilience
  Challenge: Faults observed in different layers (GPU, network, etc.)
  Solution: Engineered to survive restarts and recover from intermittent failures

• Scalability and elasticity
  Challenge: Scale infrastructure to match workload
  Solution: Nodes/GPUs and service replicas can be added and removed live

• Serviceability
  Challenge: Users cannot log in and run diagnostic tools
  Solution: Expose standard APIs for useful logs and metrics

• Security
  Challenge: Run untrustworthy user code for the DLaaS retail offering
  Solution: Defense-in-depth with multiple isolation techniques (at the process, container, pod, and network level)

• Performance
  Challenge: Multi-TB customer training data transfer in the cloud environment
  Solution: Cloud object storage to store training data; high network bandwidth for data transfer; s3fs driver mounts training data to the training pod

• Distributed training
  Challenge: Support framework-specific and optimized distribution techniques (e.g. DDL, Horovod)
  Solution: Framework-independent provisioning
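
One common way to recover from intermittent failures, as the resilience item on this slide calls for, is to retry with exponential backoff. The sketch below is a generic illustration and not the DLaaS implementation; `TransientError` and `retry_with_backoff` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a network blip)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is exhausted.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. Only `TransientError` is retried; any other exception propagates immediately.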


DLaaS training architecture

[Architecture diagram, summarized from its labels]

• Clients: Watson Machine Learning/Visual Recognition/Watson Assistant etc.

• Microservices: Trainer microservice, Lifecycle manager microservice, Training data microservice

• Data stores: MongoDB (document store), Elasticsearch (log store)

• Created per training job: Helper (Controller, Log collector, Data broker, Job monitor) and Learners

• Flows between components: job record; job status; logs, metrics

• NFS volume mount: logs (stdout, tensorboard, etc.), job state

• Cloud Object Store mount: training data, training results, logs

• Learners: use GPUs (in exclusive mode); block network access (running user code), except to workers in the same job; pluggable DL frameworks

• Platform: Logging (ELK stack), Docker registry, Cloud Object Storage, NFS volumes (SSD-backed), Kubernetes, Jenkins
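
The diagram shows job status flowing between the job monitor, the lifecycle manager, and the trainer. A minimal way to picture that bookkeeping is a small state machine, sketched below. The status names and the `JobRecord` class are assumptions for illustration; the slides only mention "job status" and "job record" without listing values or fields.

```python
from enum import Enum
from typing import Dict, List, Set


class JobStatus(Enum):
    # Hypothetical status names; the slides do not enumerate them.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Allowed transitions; terminal states have no outgoing edges.
TRANSITIONS: Dict[JobStatus, Set[JobStatus]] = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


class JobRecord:
    """In-memory stand-in for a job record kept in a document store."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.status = JobStatus.PENDING
        self.history: List[JobStatus] = [JobStatus.PENDING]

    def update(self, new_status: JobStatus) -> None:
        """Apply a status update, rejecting transitions that are not allowed."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"{self.status.name} -> {new_status.name} is not allowed")
        self.status = new_status
        self.history.append(new_status)
```

Rejecting invalid transitions means a late or duplicated status message cannot move a finished job back into a running state.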


High throughput object store access for deep learning

• Provide high throughput streaming-like access to IBM Cloud Object Storage

• Enable deep learning frameworks, e.g., TensorFlow, to run as-is

• No local storage requirement; no intermediate cache such as NFS storage

• Collaborated with HRL to develop s3fs driver as container storage with the Armada team

[Diagram labels: Kubernetes master; DL learners in GPU-enabled Docker containers, each with /tmpfs and /s3fs mounts; Object Store]
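
Because the object store is mounted as a filesystem, training code can read data with ordinary file I/O, which is why frameworks can run as-is. The sketch below shows streaming-style reads from such a mount using only the Python standard library. The function name and the idea of passing the mount directory as an argument are assumptions; the slides do not give the actual mount path.

```python
from pathlib import Path
from typing import Iterator

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary illustrative value


def stream_training_files(mount_dir: str, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the contents of every file under mount_dir in fixed-size chunks.

    mount_dir stands for the directory where the bucket is mounted inside
    the learner container (hypothetical; not stated in the slides).
    """
    for path in sorted(Path(mount_dir).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break  # end of this file; move on to the next one
                yield chunk
```

Nothing in this code knows about object storage: it works the same on a local directory, and no file is held in memory beyond one chunk at a time.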


Optimized network for fast data transfer


DLaaS – Basic Flow

1. Create model in a DL framework supported by DLaaS (Caffe, TensorFlow, Torch, Theano, Keras, …)

2. Store training data in object storage

3. Write the model spec (manifest.yml): model metadata (framework, …); resource requirements (GPUs, mem, num learners, …); pointer to training data (object store, s3, …)

4. Start a training job

5. Query training status and retrieve logs

6. Get trained model

[Diagram labels: Model Spec (manifest.yml); DLaaS API: /v1/models; Training Request; Kubernetes cluster runs training jobs; Object storage; Training logs, status; Trained Model]
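
Steps 3 and 4 can be sketched roughly as follows. This is an illustration under loud assumptions: the manifest field names, the JSON request body, and the host name are all hypothetical, since the slides name only `manifest.yml` and the `/v1/models` path. The code builds the request but does not send it.

```python
import json
import urllib.request
from typing import Any, Dict

# Hypothetical manifest contents; these field names are illustrative and
# are not the documented manifest.yml schema.
MANIFEST: Dict[str, Any] = {
    "name": "example-model",
    "framework": {"name": "tensorflow", "command": "python train.py"},
    "resources": {"gpus": 1, "memory": "8Gb", "learners": 1},
    "training_data": {"type": "s3", "bucket": "my-training-data"},
}

# The three parts that step 3 says a model spec must describe.
REQUIRED_SECTIONS = ("framework", "resources", "training_data")


def validate_manifest(manifest: Dict[str, Any]) -> None:
    """Check that metadata, resources, and a data pointer are all present."""
    missing = [key for key in REQUIRED_SECTIONS if key not in manifest]
    if missing:
        raise ValueError(f"manifest is missing: {', '.join(missing)}")


def build_training_request(base_url: str, manifest: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a POST to the /v1/models endpoint (step 4)."""
    validate_manifest(manifest)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/models",
        # Assumption: the body is sent as JSON; the real API may differ.
        data=json.dumps(manifest).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a real endpoint and credentials, passing the returned object to `urllib.request.urlopen` would submit the job. Steps 5 and 6 would then be further requests for status, logs, and the trained model, whose paths the slides do not show.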


Watson AI services build models with DLaaS

https://cloud.ibm.com/catalog?category=ai


DLaaS Summary

• Deep learning is the future over traditional machine learning technology.

• Deep learning training is much faster than traditional machine learning and achieves higher accuracy.

• DLaaS is a cloud-based platform that supports major deep learning frameworks and is powered by NVIDIA GPUs.

• DLaaS is designed and implemented to ensure security, high scale, resilience, and high performance, and to be easy to use and serviceable.


Q&A

• Why is deep learning the future over traditional machine learning?

• Why are Watson services adopting DLaaS to power their model training?

• If you are a data scientist, are you ready to try out DLaaS (the lite plan provides free GPUs)? If you are in a leadership position in your company, are you ready to promote deep learning technology to power your business?


Try DLaaS

• https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml_dlaas.html


Thank You!