
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Julien Simon

Principal Technical Evangelist, AI and Machine Learning

@julsimon

Deep Dive on Deep Learning

• Deep Learning concepts

• Common architectures and use cases

• Apache MXNet

• Infrastructure for Deep Learning

• Demos along the way: MXNet, Gluon, Keras, TensorFlow, PyTorch☺

Agenda

Deep Learning concepts

Activation functions

u = Σ (xi ∗ wi), for i = 1 … l

"Multiply and Accumulate"

Source: Wikipedia

The neuron
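As a quick illustration of the multiply-and-accumulate step, here is a minimal NumPy sketch of a single neuron; the input values, weights, and the choice of a sigmoid activation are purely illustrative.

import numpy as np

def neuron(x, w, b=0.0):
    # Multiply and accumulate: u = sum(xi * wi) + bias
    u = np.dot(x, w) + b
    # Pass u through a non-linear activation (sigmoid, as an example)
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([0.5, -1.2, 3.0])   # l = 3 inputs
w = np.array([0.4, 0.1, -0.7])   # one weight per input
print(neuron(x, w))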

x = [ x11, x12, …, x1I
      x21, x22, …, x2I
      …
      xm1, xm2, …, xmI ]        (m samples, I features)

y = [ 2, 0, 4, … ]              (m labels, N categories)

One-hot encoding:
2 → 0,0,1,0,0,…,0
0 → 1,0,0,0,0,…,0
4 → 0,0,0,0,1,…,0

Neural networks
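A minimal sketch of one-hot encoding the labels shown above with NumPy; the value N = 10 categories is an illustrative assumption.

import numpy as np

labels = np.array([2, 0, 4])        # m = 3 labels
num_categories = 10                 # N categories (assumed value)
one_hot = np.eye(num_categories)[labels]
print(one_hot)   # row 0 has a 1 at index 2, row 1 at index 0, row 2 at index 4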


Accuracy = Number of correct predictions / Total number of predictions


Neural networks

Initially, the network will not predict correctly: f(X1) = Y’1

A loss function measures the difference between the real label Y1 and the predicted label Y’1

error = loss(Y1, Y’1)

For a batch of samples:

batch error = Σ loss(Yi, Y’i), for i = 1 … batch size

The purpose of the training process is to minimize error by gradually adjusting weights.

Neural networks
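A minimal sketch of computing a batch error with MXNet Gluon, assuming a cross-entropy loss and random predictions; the batch size and class count are illustrative.

from mxnet import nd, gluon

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

y_pred = nd.random.uniform(shape=(4, 10))    # predictions Y' for a batch of 4 samples
y_true = nd.array([2, 0, 4, 1])              # real labels Y
batch_error = loss_fn(y_pred, y_true).sum()  # sum of loss(Yi, Y'i) over the batch
print(batch_error.asscalar())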

Training data set → Training → Trained neural network

Hyperparameters: batch size, learning rate, number of epochs

Backpropagation

Training
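Below is a minimal Gluon training-loop sketch tying these pieces together: the hyperparameters (batch size, learning rate, number of epochs), backpropagation, and weight updates. The network architecture and hyperparameter values are illustrative, and the MNIST data set is downloaded on first use.

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn

batch_size, learning_rate, num_epochs = 64, 0.1, 10   # hyperparameters

net = nn.Sequential()
net.add(nn.Dense(128, activation='relu'), nn.Dense(10))
net.initialize(mx.init.Xavier())

train_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST(train=True).transform_first(
        gluon.data.vision.transforms.ToTensor()),
    batch_size=batch_size, shuffle=True)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': learning_rate})

for epoch in range(num_epochs):
    for data, label in train_data:
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()            # backpropagation: compute gradients
        trainer.step(batch_size)   # gradually adjust the weights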

Stochastic Gradient Descent

Imagine you stand on top of a mountain with skis strapped to your feet. You want to get down to the valley as quickly as possible, but there is fog and you can only see your immediate surroundings. How can you get down the mountain as quickly as possible? You look around and identify the steepest path down, go down that path for a bit, again look around and find the new steepest path, go down that path, and repeat—this is exactly what gradient descent does.

Tim Dettmers, University of Lugano, 2015

https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/

The « step size » depends on the learning rate

[Surface plot: z = f(x, y)]

Local minima and saddle points

« Do neural networks enter and escape a series of local minima? Do they move at varying speed as they approach and then pass a variety of saddle points? Answering these questions definitively is difficult, but we present evidence strongly suggesting that the answer to all of these questions is no. »

« Qualitatively characterizing neural network optimization problems », Goodfellow et al, 2015 https://arxiv.org/abs/1412.6544

https://medium.com/@julsimon/tumbling-down-the-sgd-rabbit-hole-part-1-740fa402f0d7

SGD works remarkably well and is still widely used.

Adaptive optimizers use a variable learning rate.

Some even use a learning rate per dimension (Adam).

Optimizers
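In Gluon, for example, the optimizer is chosen when creating the Trainer, so swapping plain SGD for an adaptive optimizer such as Adam is a one-line change; the sketch below assumes a toy network and illustrative learning rates.

from mxnet import gluon
from mxnet.gluon import nn

net = nn.Dense(10)   # any network would do
net.initialize()

# Plain SGD: one fixed learning rate for every parameter
sgd_trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.1})

# Adam: adapts the step size per dimension as training progresses
adam_trainer = gluon.Trainer(net.collect_params(), 'adam',
                             {'learning_rate': 0.001})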

Validation data set (also called dev set) → Neural network in training → Validation accuracy

Prediction at the end of each epoch

This data set must have the same distribution as real-life samples, or else validation accuracy won’t reflect real-life accuracy.

Validation

Test data set → Fully trained neural network → Test accuracy

Prediction at the end of experimentation

This data set must have the same distribution as real-life samples, or else test accuracy won’t reflect real-life accuracy.

Test

[Chart: training accuracy, validation accuracy, and loss vs. epochs, with the best epoch marked before validation accuracy degrades]

OVERFITTING

« Deep Learning ultimately is about finding a minimum that generalizes well, with bonus points for finding one fast and reliably », Sebastian Ruder

Early stopping
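A sketch of early stopping around a training loop: keep the parameters from the best epoch and stop once validation accuracy has stopped improving for a few epochs. train_one_epoch() and evaluate() are hypothetical helpers standing in for a training epoch and a pass over the validation set.

best_val_acc, patience, wait = 0.0, 3, 0

for epoch in range(100):
    train_one_epoch(net, train_data)            # hypothetical helper
    val_acc = evaluate(net, validation_data)    # hypothetical helper
    if val_acc > best_val_acc:
        best_val_acc, wait = val_acc, 0
        net.save_parameters('best.params')      # remember the best epoch
    else:
        wait += 1
        if wait >= patience:                    # no improvement: stop early
            break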

Common architectures and use cases

Convolutional Neural Networks (CNN)

Le Cun, 1998: handwritten digit recognition, 32x32 pixels

https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/

Source: http://timdettmers.com

Extracting features with convolution

Convolution extracts features automatically. Kernel parameters are learned during the training process.

Downsampling images with pooling

Source: Stanford University

Pooling shrinks images while preserving significant information.
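To make the shapes concrete, here is a small Gluon sketch stacking a convolution and a max-pooling layer; the channel count, kernel size, and input size are illustrative.

from mxnet import nd
from mxnet.gluon import nn

block = nn.Sequential()
block.add(nn.Conv2D(channels=20, kernel_size=5, activation='relu'),  # learned 5x5 kernels
          nn.MaxPool2D(pool_size=2, strides=2))                      # downsample by 2
block.initialize()

x = nd.random.uniform(shape=(1, 1, 28, 28))   # one 28x28 grayscale image
print(block(x).shape)                         # (1, 20, 12, 12): smaller feature maps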

Object Detection

https://github.com/precedenceguo/mx-rcnn https://github.com/zhreshold/mxnet-yolo

MXNet

Object Segmentation

https://github.com/TuSimple/mx-maskrcnn

MXNet

Text Detection and Recognition

https://github.com/Bartzi/stn-ocr

MXNet

Face Detection

https://github.com/tornadomeet/mxnet-face

MXNet

Face Recognition

https://github.com/deepinsight/insightface
https://arxiv.org/abs/1801.07698

LFW 99.80%+, MegaFace 98%+ with a single model

MXNet

Real-Time Pose Estimation

https://github.com/dragonfly90/mxnet_Realtime_Multi-Person_Pose_Estimation

MXNet

Long Short Term Memory Networks (LSTM)

• An LSTM neuron computes the output based on the input and a previous state

• LSTM networks have memory

• They’re great at predicting sequences, e.g. machine translation
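A minimal Gluon sketch of an LSTM processing a batch of sequences and carrying its state (its "memory") from step to step; the layer sizes and sequence length are illustrative.

from mxnet import nd
from mxnet.gluon import rnn

lstm = rnn.LSTM(hidden_size=64, num_layers=2, layout='NTC')
lstm.initialize()

x = nd.random.uniform(shape=(8, 20, 32))   # 8 sequences, 20 steps, 32 features each
state = lstm.begin_state(batch_size=8)     # the memory carried between steps
output, state = lstm(x, state)
print(output.shape)                        # (8, 20, 64): one output per time step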

Machine Translation

https://github.com/awslabs/sockeye

MXNet

GAN: Welcome to the (un)real world, Neo

Generating new ”celebrity” faces
https://github.com/tkarras/progressive_growing_of_gans

From semantic map to 2048x1024 picture https://tcwang0509.github.io/pix2pixHD/

TF

PyTorch

Wait! There’s more! Models can also generate text from text, text from images, text from video, images from text, sound from video, 3D models from 2D images, etc.

Apache MXNet

Apache MXNet: Open Source library for Deep Learning

Programmable: simple syntax, multiple languages

Portable: highly efficient models for mobile and IoT

High Performance: near-linear scaling across hundreds of GPUs

Most Open: accepted into the Apache Incubator

Best on AWS: optimized for Deep Learning on AWS

Convolution (learned kernels turn the input into output feature maps):
mx.sym.Convolution(data, kernel=(5, 5), num_filter=20)

Pooling (e.g. max(4, 2, 2, 0) = 4 or avg(4, 2, 2, 0) = 2 over each window):
mx.sym.Pooling(data, pool_type="max", kernel=(2, 2), stride=(2, 2))

Fully connected layer (inputs and weights combined into activations):
mx.sym.FullyConnected(data, num_hidden=128)

LSTM unrolling for sequence data:
lstm.lstm_unroll(num_lstm_layer, seq_len, len, num_hidden, num_embed)

Word embeddings (e.g. index 2 mapped to a vector [0.2, -0.1, …, 0.7] for “Queen”; cos(w, queen) = cos(w, king) - cos(w, man) + cos(w, woman)):
mx.symbol.Embedding(data, input_dim, output_dim=k)

Activation functions ("relu", "tanh", "sigmoid", "softrelu"):
mx.sym.Activation(data, act_type="xxxx")

Output layer and training:
mx.sym.SoftmaxOutput
mx.model.FeedForward, model.fit

Use cases across image, video, speech, text and events: Neural Art, Face Search, Image Segmentation, Image Caption (“People Riding Bikes”), Image Labels (Bicycle, People, Road, Sport), Machine Translation (“People Riding Bikes” → “Οι άνθρωποι ιππασίας ποδήλατα”).

Anatomy of a Deep Learning Model

PROS

• More chances for optimization

• Language independent

• E.g. TensorFlow, Theano, Caffe, MXNet Symbol API

CONS

• Less flexible

• ’Black box’ training

C can share memory with D because C is deleted later.

A = Variable('A')
B = Variable('B')
C = B * A
D = C + 1
f = compile(D)
d = f(A=np.ones(10), B=np.ones(10)*2)

[Computation graph: A and B feed a multiplication node, whose result feeds an addition with 1]

Declarative Programming

‘define then run’

DEMO: Symbol API
1 – Fully Connected Neural Network (MNIST)

2 – Convolutional Neural Network (MNIST)
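A minimal sketch of what the first demo builds with the Symbol API: a fully connected network for MNIST, declared as a graph and then bound to a Module for training. Layer sizes are illustrative and the data iterators are omitted.

import mxnet as mx

data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data, num_hidden=128)
act1 = mx.sym.Activation(fc1, act_type="relu")
fc2  = mx.sym.FullyConnected(act1, num_hidden=64)
act2 = mx.sym.Activation(fc2, act_type="relu")
fc3  = mx.sym.FullyConnected(act2, num_hidden=10)
mlp  = mx.sym.SoftmaxOutput(fc3, name='softmax')

# 'define then run': bind the graph to a Module, then train it
mod = mx.mod.Module(symbol=mlp, context=mx.cpu())
# mod.fit(train_iter, eval_data=val_iter, optimizer='sgd',
#         optimizer_params={'learning_rate': 0.1}, num_epoch=10)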

import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1

PROS

• Straightforward and flexible.

• Take advantage of language native features (loop, condition, debugger).

• E.g. NumPy, PyTorch, MXNet Gluon API

CONS

• Harder to optimize

Imperative Programming

‘define by run’

DEMO: Gluon API
Fully Connected Network (MNIST)

Gluon CV: classification, detection, segmentation

https://github.com/dmlc/gluon-cv

[electric_guitar], with probability 0.671

DEMO: Gluon CV
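A sketch of what the Gluon CV demo does: load a pretrained classification model from the model zoo and run it on a local image. The model name ('ResNet50_v2') and the image path are illustrative assumptions.

import mxnet as mx
from gluoncv import model_zoo, data

net = model_zoo.get_model('ResNet50_v2', pretrained=True)    # downloads weights on first use

img = mx.image.imread('guitar.jpg')                          # illustrative image path
img = data.transforms.presets.imagenet.transform_eval(img)   # resize, normalize, add batch dim

prob = mx.nd.softmax(net(img))[0]
top = prob.argmax().astype('int').asscalar()
print(net.classes[top], prob[top].asscalar())                # e.g. electric_guitar 0.671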

https://github.com/awslabs/mxnet-model-server/

https://aws.amazon.com/blogs/ai/announcing-onnx-support-for-apache-mxnet/

Keras-MXNet

https://github.com/awslabs/keras-apache-mxnet

DEMO: Keras-MXNet
Convolutional Neural Network (MNIST)
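A sketch of the kind of Keras CNN the demo trains on MNIST; with keras-mxnet installed, MXNet is selected as the backend in ~/.keras/keras.json ("backend": "mxnet") and the Keras code itself stays unchanged. Layer sizes and training settings are illustrative.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=64, epochs=10, validation_split=0.1)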

Infrastructure for Deep Learning

Amazon EC2 C5 instances

C4: 36 vCPUs (“Haswell”), 60 GiB memory, 4 Gbps to EBS

C5: 72 vCPUs (“Skylake”, AVX-512), 144 GiB memory, 12 Gbps to EBS

C4 → C5: 2x vCPUs, 2.4x memory, 3x EBS throughput, 2x performance

C5: Next Generation Compute-Optimized Instances with Intel® Xeon® Scalable Processor

AWS Compute optimized instances support the new Intel® AVX-512 advanced instruction set, enabling you to more efficiently run vector processing workloads with single and double floating point precision, such as AI/machine learning or video processing.

25% improvement in price/performance over C4

Faster TensorFlow training on C5

https://aws.amazon.com/blogs/machine-learning/faster-training-with-optimized-tensorflow-1-6-on-amazon-ec2-c5-and-p3-instances/

Amazon EC2 P3 Instances

• P3.2xlarge, P3.8xlarge, P3.16xlarge

• Up to eight NVIDIA Tesla V100 GPUs in a single instance

• 40,960 CUDA cores, 5120 Tensor cores

• 128GB of GPU memory

• 1 PetaFLOPs of computational performance – 14x better than P2

• 300 GB/s GPU-to-GPU communication (NVLink) – 9x better than P2

The fastest, most powerful GPU instances in the cloud

Preconfigured environments to quickly build Deep Learning applications

Conda AMI: for developers who want pre-installed pip packages of DL frameworks in separate virtual environments.

Base AMI: for developers who want a clean slate to set up private DL engine repositories or custom builds of DL engines.

AMI with source code: for developers who want pre-installed DL frameworks and their source code in a shared Python environment.

https://aws.amazon.com/machine-learning/amis/

AWS Deep Learning AMI

Amazon SageMaker

Pre-built notebooks for common problems

Built-in, high-performance algorithms

ALGORITHMS: K-Means Clustering, Principal Component Analysis, Neural Topic Modelling, Factorization Machines, Linear Learner, XGBoost, Latent Dirichlet Allocation, Image Classification, Seq2Seq, and more!

FRAMEWORKS: Apache MXNet, TensorFlow, Caffe2, CNTK, PyTorch, Torch

ML workflow: set up and manage environments for training → train and tune model (trial and error) → deploy model in production → scale and manage the production environment

Build

Amazon SageMaker

Build: pre-built notebooks for common problems; built-in, high-performance algorithms

Train: one-click training; hyperparameter optimization

Deploy model in production

Scale and manage the production environment

Build | Train

Amazon SageMaker

Build: pre-built notebooks for common problems; built-in, high-performance algorithms

Train: one-click training; hyperparameter optimization

Deploy: one-click deployment; fully managed hosting with auto-scaling

Build | Train | Deploy

[Architecture diagram: training code and inference code containers come from Amazon ECR. Model Training (on EC2) consumes the training data and ground truth, runs the training code and helper code, and produces model artifacts. Model Hosting (on EC2) loads those model artifacts and runs the inference code and helper code behind an Inference Endpoint; the client application sends inference requests and receives inference responses.]

Amazon SageMaker

Open Source Containers for TF and MXNet

https://github.com/aws/sagemaker-tensorflow-containers

https://github.com/aws/sagemaker-mxnet-containers

• Build them and run them on your own machine

• Run them directly on a notebook instance (aka local mode)

• Customize them and push them to ECR

• Run them on SageMaker for training and prediction at scale
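For example, training and deploying with the SageMaker Python SDK looks roughly like the sketch below (as it did at the time of this talk); the entry-point script, IAM role, S3 path, instance types, and hyperparameters are all placeholders.

from sagemaker.mxnet import MXNet

estimator = MXNet(entry_point='train.py',               # your training script
                  role='SageMakerRole',                  # IAM role (placeholder)
                  train_instance_count=1,
                  train_instance_type='ml.p3.2xlarge',
                  hyperparameters={'epochs': 10, 'learning_rate': 0.1})

estimator.fit('s3://my-bucket/training-data')            # training at scale on EC2

predictor = estimator.deploy(initial_instance_count=1,   # create an inference endpoint
                             instance_type='ml.m5.xlarge')
# predictor.predict(sample) sends an inference request to the endpoint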

DEMO: SageMaker
1 – Use the built-in algorithm for image classification (CIFAR-10)

2 – Bring your own TensorFlow script for image classification (MNIST)

3– Bring your own Gluon script for sentiment analysis (Stanford Sentiment Tree Bank 2)

4 – Build your own Keras-MXNet container (CNN + MNIST)

5 – Build your own PyTorch container (CNN + MNIST)

Danke schön!

Julien Simon

Principal Technical Evangelist, AI and Machine Learning

@julsimon

https://aws.amazon.com/machine-learning
https://aws.amazon.com/blogs/ai

https://mxnet.incubator.apache.org | https://github.com/apache/incubator-mxnet
https://gluon.mxnet.io | https://github.com/gluon-api

https://aws.amazon.com/sagemaker
https://github.com/awslabs/amazon-sagemaker-examples
https://github.com/aws/sagemaker-python-sdk
https://github.com/aws/sagemaker-spark

https://medium.com/@julsimon
https://youtube.com/juliensimonfr
https://gitlab.com/juliensimon/dlnotebooks