
Deep Learning Student workshop

September, 2017

Agenda

⎯ Welcome & Introductions

⎯ Intel® Nervana™ AI Academy for Students

⎯ Intel® & AI

⎯ What is Machine Learning & Data Science

⎯ Deep Learning and Neural Networks

⎯ DL frameworks optimized for IA

Questions? Ask us!

Ben Odom

Developer Evangelist

benjamin.j.odom@intel.com

Bob Duffy

Student Ambassador Program Manager

robert.p.duffy@intel.com

Meghana Rao

Developer Evangelist

meghana.s.rao@intel.com

Niven Singh

AI Student Developer Community Manager

niven.singh@intel.com

Announcing: Intel® Nervana™ AI Academy for Students

With the Intel® Nervana™ AI Academy for Students, our goal is to drive awareness of the innovation around AI at the academic level.

We do this by training students on campus and online, and then showcasing and highlighting their expertise, inspiration and innovation as Intel Student Ambassadors.

⎯ Educate students, on campus, in person and begin to build a relationship between students, professors, universities and Intel

⎯ Recruit qualified Student Ambassadors

⎯ Support them with IA access and training

⎯ Coach and help them to deliver innovative ideas, expert content and student training to other students

⎯ Showcase examples of early innovation work by students

Intel student ambassadors - Who are they?

They’re just like you!

- Graduate and PhD students who are excited and want to do real work in the field of Deep Learning

- They are subject matter experts who go to events like SXSW, SIGCSE and PyCon, and speak on campus, to talk about their work

- They are active participants, working on projects, papers, articles – content that has their name on it!

- They are curious and inventive thinkers – trying new things, creating demos and working on REALLY cool stuff to share with the community

Intel student ambassadors – What are they doing?

Intel Student Ambassadors are working on innovative, real-world, applicable research and projects, like:

- Using smartphone cameras to collect data on mosquitoes and identify harmful vs. harmless ones

- Leveraging neural networks and deep learning to conduct stock price analysis and predictions

- Enabling individuals with speech impediments to use speech-to-text software to recognize and dictate their speech.

- Using ML & AI to solve medical problems, like disease detection and identifying cures for epidemics. http://devmesh.intel.com

Intel & AI

Intel® Nervana™ PORTFOLIO

[Figure: the Intel® Nervana™ portfolio shown as a layered stack. Recoverable labels by layer:]

experiences: Intel® Nervana™ DL Software, Intel® Computer Vision SDK, Movidius Fathom, Intel® GO™ Automotive

toolkits: Intel® DL Training & Deployment

Frameworks: Intel Distribution, Mlib, BigDL, Intel® Nervana™ Graph*

libraries: Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL

hardware: Compute, Memory/Storage, Networking

*Future

What is data science?

"Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured."

- Wikipedia

Source: https://en.wikipedia.org/wiki/Data_science

The data science process

How to become a data scientist?

[Figure: skills and traits of a data scientist.]

- Statistics
- R, Python, Scala
- NoSQL
- Machine Learning
- Deep Learning
- Neural Networks
- Visualization
- Communication
- Domain Knowledge
- Story Teller
- Hacker Mindset
- Love the Data
- Passion
- Engage with "C" Level

What is machine learning?

Applying algorithms to observed data in order to make predictions based on that data.

Machines learn in two ways: supervised learning and unsupervised learning.

Supervised Learning

We train the model by feeding it the correct answers (the "ground truth"). The model learns from them and finally predicts.

Unsupervised Learning

Data is given to the model, but the right answers are not. The model makes sense of the data on its own, and can teach you something you were probably not aware of in the given dataset.

Types of Supervised and Unsupervised learning

SUPERVISED: Classification, Regression

UNSUPERVISED: Clustering, Recommendation

CLASSIFICATION

Predict a label for an entity with a given set of features.

Examples shown: prediction, sentiment analysis

REGRESSION

Predict a real numeric value for an entity with a given set of features.

[Figure: property attributes (address, parking, school, transit, total sqft, lot size, bathrooms, bedrooms, fireplace) feed a linear regression model that outputs a price ($).]

CLUSTERING

Group entities with similar features.

[Figure: market segmentation of gamers by play time in hours (axis from 10 to 90), separating casual gamers from serious gamers.]

RECOMMENDATION

Recommend an item to a user based on past behavior or preferences of similar users.

[Figure: user info + your past purchase data + purchases of other users + product info feed a recommendation ML method (classifier, matrix), which produces recommendations.]

Applications of Machine Learning

Fraud Detection

Movie Recommendation

Face Detection

Anomaly Detection

Product Sentiment Analysis

Natural Language Processing

Image Analysis

IoT Analysis

Spam Filtering/Virus Detection

Working with data sets

Machine Learning Vocabulary - How do you read a data set?

Target: predicted category or value of the data (column to predict)

Features: properties of the data used for prediction (non-target columns)

Example: a single data point within the data (one row)

Label: the target value for a single data point

An example data set

sepal length   sepal width   petal length   petal width   species
6.7            3.0           5.2            2.3           virginica
6.4            2.8           5.6            2.1           virginica
4.6            3.4           1.4            0.3           setosa
6.9            3.1           4.9            1.5           versicolor
4.4            2.9           1.4            0.2           setosa
4.8            3.0           1.4            0.1           setosa
5.9            3.0           5.1            1.8           virginica
5.4            3.9           1.3            0.4           setosa
4.9            3.0           1.4            0.2           setosa
5.4            3.4           1.7            0.2           setosa

Features: the four measurement columns (sepal length, sepal width, petal length, petal width)

Target: the species column

Example: any single row
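As a minimal sketch of the vocabulary above, the snippet below separates a few rows of the example data set into features and labels. The helper name `split_features_and_labels` is hypothetical and is not part of the workshop material.

```python
# Each row is one example: four feature values followed by the target label.
rows = [
    (6.7, 3.0, 5.2, 2.3, "virginica"),
    (4.6, 3.4, 1.4, 0.3, "setosa"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
]
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]


def split_features_and_labels(rows):
    """Return (features, labels): the non-target columns and the target column."""
    features = [list(row[:-1]) for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


features, labels = split_features_and_labels(rows)
```

Here `features[0]` is one example's feature vector and `labels[0]` is its label.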

Training, validation and test data sets

If our data set is 100,000 homes sold in Portland, a typical split would be:

Train = 70,000 homes

Validation = 10,000 homes

Test = 20,000 homes
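A 70/10/20 split like the one above could be sketched as follows. The function name, the shuffle seed and the use of integer percentages are assumptions made for illustration; the homes are stand-in integers, not real data.

```python
import random


def train_validation_test_split(examples, train_pct=70, validation_pct=10, seed=0):
    """Shuffle the examples, then cut them into train / validation / test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains is the test set
    return train, validation, test


homes = range(100_000)  # placeholder IDs standing in for 100,000 homes
train, validation, test = train_validation_test_split(homes)
```

Shuffling before cutting avoids a split that simply follows the order in which the data was recorded.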

Setting up your environment

What is in a Basic Data Science Toolkit

Intel® Distribution of Python* 2017

6 Steps to Jupyter Notebook with Intel Distribution of Python

1. Install Anaconda: https://www.continuum.io/downloads#linux

2. Choose Intel packages: conda config --add channels intel

3. Create the environment: conda create -n intelpython3 intelpython3_full python=3

4. Activate the environment: source activate intelpython3

5. Run the Jupyter notebook: jupyter notebook --no-browser (only use --no-browser if running remotely or using Bash on Windows)

6. Access the notebook: http://localhost:8888

linear regression

Introduction to Linear Regression

$y_\beta(x) = \beta_0 + \beta_1 x$

[Figure: box office revenue plotted against budget, with a fitted line. $\beta_0$ is the intercept coefficient and $\beta_1$ is the budget coefficient.]

$\beta_0$ = 80 million, $\beta_1$ = 0.6

Predicting from Linear Regression

With $\beta_0$ = 80 million and $\beta_1$ = 0.6:

Predict about 175 million gross for a 160 million budget (the stated coefficients give 80 + 0.6 × 160 = 176 million).
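The prediction can be reproduced with a few lines of Python. This is only a sketch using the two coefficients quoted on the slide; the function name is hypothetical. Note that the coefficients as stated give 176 million, slightly above the 175 million shown.

```python
BETA_0 = 80.0  # intercept, in millions
BETA_1 = 0.6   # budget coefficient


def predict_gross(budget_millions):
    """Linear model y = beta_0 + beta_1 * x, with x and y in millions."""
    return BETA_0 + BETA_1 * budget_millions


prediction = predict_gross(160.0)
```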

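Which Model Fits the Best?

[Figure: several candidate lines drawn through the plot of revenue against budget.]

Calculating the Residuals

The residual for observation $i$ is the predicted value minus the observed value:

$y_\beta(x_{obs}^{(i)}) - y_{obs}^{(i)} = (\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)}$

Mean Squared Error

$\frac{1}{m}\sum_{i=1}^{m}\left((\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)}\right)^2$

Minimum Mean Squared Error

$\min_{\beta_0,\beta_1} \frac{1}{m}\sum_{i=1}^{m}\left((\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)}\right)^2$

Cost Function

$J(\beta_0,\beta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left((\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)}\right)^2$

Here $m$ is the number of observations.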
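The residuals, mean squared error and cost function can be sketched in plain Python. The data points below are made up for illustration, and the 1/(2m) scaling of the cost function is an assumption (a common convention that only rescales the mean squared error).

```python
def residuals(beta_0, beta_1, xs, ys):
    """Predicted minus observed value for every data point."""
    return [(beta_0 + beta_1 * x) - y for x, y in zip(xs, ys)]


def mean_squared_error(beta_0, beta_1, xs, ys):
    errors = residuals(beta_0, beta_1, xs, ys)
    return sum(e * e for e in errors) / len(errors)


def cost(beta_0, beta_1, xs, ys):
    """Assumed form: J = 1/(2m) * sum of squared residuals."""
    return mean_squared_error(beta_0, beta_1, xs, ys) / 2.0


# Hypothetical observations lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit = mean_squared_error(0.0, 2.0, xs, ys)
poor_fit = mean_squared_error(0.0, 1.0, xs, ys)
```

Comparing `perfect_fit` with `poor_fit` is the numeric version of asking which line fits best.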

Gradient Descent

Start with a cost function $J(\beta)$, then gradually move towards the minimum (the global minimum).

Gradient Descent with Linear Regression

Now imagine there are two parameters $(\beta_0, \beta_1)$.

This is a more complicated surface on which the minimum must be found.

How can we do this without knowing what $J(\beta_0, \beta_1)$ looks like?

Compute the gradient, $\nabla J(\beta_0, \beta_1)$, which points in the direction of the biggest increase!

$-\nabla J(\beta_0, \beta_1)$ (the negative gradient) points in the direction of the biggest decrease at that point!

The gradient is a vector whose coordinates are the partial derivatives of $J$ with respect to the parameters:

$\nabla J(\beta_0, \ldots, \beta_n) = \left\langle \frac{\partial J}{\partial \beta_0}, \ldots, \frac{\partial J}{\partial \beta_n} \right\rangle$

Then use the gradient ($\nabla$) and the cost function to calculate the next point ($\omega_1$) from the current one ($\omega_0$):

$\omega_1 = \omega_0 - \alpha \nabla J(\omega_0)$

The learning rate ($\alpha$) is a tunable parameter that determines step size.

Each point can be iteratively calculated from the previous one:

$\omega_2 = \omega_1 - \alpha \nabla J(\omega_1)$

$\omega_3 = \omega_2 - \alpha \nabla J(\omega_2)$

[Figure: the surface $J(\beta_0, \beta_1)$ over the $\beta_0$ and $\beta_1$ axes, with successive points stepping towards the minimum.]
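The iterative update can be sketched for linear regression as follows. This assumes a squared-error cost scaled by 1/(2m), a fixed learning rate and a fixed number of steps; the data and all names are illustrative only.

```python
def gradient(beta_0, beta_1, xs, ys):
    """Partial derivatives of J = 1/(2m) * sum(((b0 + b1*x) - y)^2)."""
    m = len(xs)
    errors = [(beta_0 + beta_1 * x) - y for x, y in zip(xs, ys)]
    d_beta_0 = sum(errors) / m
    d_beta_1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_beta_0, d_beta_1


def gradient_descent(xs, ys, learning_rate=0.1, steps=5000):
    beta_0, beta_1 = 0.0, 0.0  # starting point (omega_0)
    for _ in range(steps):
        d_beta_0, d_beta_1 = gradient(beta_0, beta_1, xs, ys)
        # Step against the gradient: omega_next = omega - alpha * grad J(omega)
        beta_0 -= learning_rate * d_beta_0
        beta_1 -= learning_rate * d_beta_1
    return beta_0, beta_1


# Hypothetical data generated from y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]
fitted_beta_0, fitted_beta_1 = gradient_descent(xs, ys)
```

A learning rate that is too large can make the updates diverge, while one that is too small makes convergence slow.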

Modelling Best Practice

Use cost function to fit model

Develop multiple models

Compare results and choose best one

k nearest neighbors

K Nearest Neighbors Classification

[Figure: patients plotted by age and number of malignant nodes, labeled "survived" or "did not survive". A new point is predicted from its nearest neighbors, starting with neighbor count K = 1.]

What is Needed to Select a KNN Model?

- Correct value for 'K'
- How to measure closeness of neighbors?

Value of 'K' Affects Decision Boundary

[Figure: decision boundaries for different values of K.]

Measurement of Distance in KNN

Euclidean Distance (L2 Distance)

$d = \sqrt{\Delta Nodes^2 + \Delta Age^2}$

Manhattan Distance (L1 or City Block Distance)

$d = |\Delta Nodes| + |\Delta Age|$
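Both distance measures and a simple majority-vote KNN prediction can be sketched as below. The (age, nodes) points and labels are invented for illustration, and `knn_predict` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan_distance(a, b):
    """L1 (city block) distance: sum of absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def knn_predict(points, labels, query, k=3, distance=euclidean_distance):
    """Label the query by majority vote among its k closest training points."""
    ranked = sorted(zip(points, labels), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up (age, number of malignant nodes) examples.
points = [(30, 1), (35, 2), (40, 0), (60, 15), (65, 20), (70, 18)]
labels = ["survived"] * 3 + ["did not survive"] * 3

prediction = knn_predict(points, labels, query=(38, 3), k=3)
```

Fitting is just storing `points` and `labels`; all the work happens at prediction time, when every stored point is compared with the query.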

Scale is Important for Distance Measurement

[Figure: the same points plotted against number of surgeries (axis from 1 to 5) before and after "feature scaling". The set of nearest neighbors changes once the axes are rescaled.]
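One common way to put features on a comparable scale is min-max scaling, sketched here. The slides do not say which scaling method was used, so this choice is an assumption, and the values are illustrative.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(v - lowest) / (highest - lowest) for v in values]


surgeries = [1, 2, 3, 4, 5]
ages = [30, 50, 70]

scaled_surgeries = min_max_scale(surgeries)
scaled_ages = min_max_scale(ages)
```

After scaling, a difference of one surgery no longer looks negligible next to a difference of many years of age.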

Performance comparison - Linear Regression and KNN

Linear regression

- Fitting involves minimizing cost function (slow)
- Model has few parameters (memory efficient)
- Prediction involves calculation (fast)

K nearest neighbors

- Fitting involves storing training data (fast)
- Model has many parameters (memory intensive)
- Prediction involves finding closest neighbors (slow)

What is the issue with the linear classifiers we have learnt so far?

XOR: the counterexample to all models

We need non-linear functions.

Source: https://medium.com/towards-data-science/introducing-deep-learning-and-neural-networks-deep-learning-for-rookies-1-bd68f9cf5883

We need layers, usually lots, with non-linear transformations.

XOR = (X1 AND NOT X2) OR (NOT X1 AND X2)

[Figure: a small layered network that computes XOR. Labels shown: 1.5, 0.5, -2, output, threshold to 0 or 1.]
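One classic way to compute XOR with a hidden unit and thresholds is sketched below. Assigning 1.5 as the hidden threshold, -2 as the hidden-to-output weight and 0.5 as the output threshold is an assumption about how the figure's labels fit together.

```python
def threshold(value, cutoff):
    """Threshold to 0 or 1: fire only when the value exceeds the cutoff."""
    return 1 if value > cutoff else 0


def xor_network(x1, x2):
    # Hidden unit acts as AND: it fires only when both inputs are 1.
    hidden = threshold(x1 + x2, 1.5)
    # Output acts as OR, with the AND case cancelled by the -2 weight.
    return threshold(x1 + x2 - 2 * hidden, 0.5)


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No single threshold on a weighted sum of x1 and x2 produces this table, which is why the extra layer is needed.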

This is an emerging domain called Deep Learning. In the machine learning world, we use neural networks. The idea comes from biology. Each layer learns something.

"Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations."

- Wikipedia

[Figure: an image passes through Layer 1, Layer 2, ..., Layer N and a fully connected stage to reach a prediction such as "elephant" or "chairs". Each layer learns something.]

What is deep learning good for?

Classification and Detection

Detect and label the image (for example: person, motorcyclist).

Source: https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf

Semantic Segmentation

Label every pixel.

Natural Language Object Retrieval

Source: http://arxiv.org/pdf/1511.04164v3.pdf

Speech Recognition

The same architecture is used for English and Mandarin Chinese speech recognition.

Source: http://svail.github.io/mandarin/

The basics of building a neural network

Motivation for Neural Nets

• Use biology as inspiration for mathematical model

• Get signals from previous neurons

• Generate signals (or not) according to inputs

• Pass signals on to next neurons

• By layering many neurons, can create complex model

Basic Neuron Visualization

[Figure: a neuron with inputs x1, x2, x3, weights w1, w2, w3 and bias b, followed by an activation function.]

$z = x_1 w_1 + x_2 w_2 + x_3 w_3 + b$

Types of activation functions

• Sigmoid function

• Smooth transition in output between (0,1)

• Tanh function

• Smooth transition in output between (-1,1)

• ReLU function

• f(x) = max(x,0)

• Step function

• f(x) is either 0 or 1
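The four activation functions and a single neuron can be sketched with the standard library alone. Placing the step function's jump at zero is an assumption (the slide only says the output is 0 or 1), and the inputs and weights are made up.

```python
import math


def sigmoid(z):
    """Smooth transition between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Smooth transition between -1 and 1."""
    return math.tanh(z)


def relu(z):
    return max(z, 0.0)


def step(z):
    """Assumed convention: 1 at or above zero, otherwise 0."""
    return 1.0 if z >= 0 else 0.0


def neuron(inputs, weights, bias, activation):
    """Net input z = x1*w1 + x2*w2 + x3*w3 + b, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.5], 0.0, sigmoid)
```

In this example the net input is exactly 0, so the sigmoid neuron outputs 0.5.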

Why Neural Nets?

• Why not just use a single neuron? Why do we need a larger network?

• A single neuron (like logistic regression) only permits a linear decision boundary.

• Most real-world problems are considerably more complicated!

Feedforward Neural Network

[Figure: a feedforward network with inputs x1, x2, x3, an input layer, hidden layers and an output layer. Each neuron applies $\sigma$.]

• Weights (represented by matrices): $W^{(1)}$, $W^{(2)}$, $W^{(3)}$

• Net input (sum of weighted inputs, before activation function): $z^{(2)}$, $z^{(3)}$, $z^{(4)}$

• Activations (output of neurons to next layer): $a^{(1)}$, $a^{(2)}$, $a^{(3)}$, $a^{(4)}$

Matrix representation of computation

For a single data point (instance), with $x = (x_1, x_2, x_3)$ and $x = a^{(1)}$:

$z^{(2)} = x W^{(1)}$

$a^{(2)} = \sigma(z^{(2)})$

$W^{(1)}$ is a 3x4 matrix, $z^{(2)}$ is a 4-vector, and $a^{(2)}$ is a 4-vector.

Continuing the Computation

For a single training instance (data point)

Input: vector $x$ (a row vector of length 3)

Output: vector $\hat{y}$ (a row vector of length 3)

$z^{(2)} = x W^{(1)}$, $a^{(2)} = \sigma(z^{(2)})$

$z^{(3)} = a^{(2)} W^{(2)}$, $a^{(3)} = \sigma(z^{(3)})$

$z^{(4)} = a^{(3)} W^{(3)}$, $\hat{y} = softmax(z^{(4)})$

Multiple data points

In practice, we do these computations for many data points at the same time, by "stacking" the rows into a matrix. But the equations look the same!

Input: matrix $x$ (an nx3 matrix, each row a single instance)

Output: matrix $\hat{y}$ (an nx3 matrix, each row a single prediction)

$z^{(2)} = x W^{(1)}$, $a^{(2)} = \sigma(z^{(2)})$

$z^{(3)} = a^{(2)} W^{(2)}$, $a^{(3)} = \sigma(z^{(3)})$

$z^{(4)} = a^{(3)} W^{(3)}$, $\hat{y} = softmax(z^{(4)})$
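The forward pass for a stacked batch can be sketched in plain Python. The 3x4 shape of the first weight matrix comes from the slides; the 4x4 and 4x3 shapes of the other two, the random weights and the input values are assumptions for illustration, and no bias terms are used, matching the equations.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x p) matrix, both as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sigmoid_matrix(m):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]


def softmax_rows(m):
    """Row-wise softmax; subtracting the row maximum keeps exp() stable."""
    result = []
    for row in m:
        exps = [math.exp(v - max(row)) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward(x, w1, w2, w3):
    a2 = sigmoid_matrix(matmul(x, w1))   # z(2) = x W(1),    a(2) = sigma(z(2))
    a3 = sigmoid_matrix(matmul(a2, w2))  # z(3) = a(2) W(2), a(3) = sigma(z(3))
    return softmax_rows(matmul(a3, w3))  # z(4) = a(3) W(3), y_hat = softmax(z(4))


rng = random.Random(0)
w1 = random_matrix(3, 4, rng)
w2 = random_matrix(4, 4, rng)  # assumed hidden size
w3 = random_matrix(4, 3, rng)
x = [[0.2, 0.7, 0.1], [1.0, 0.0, 0.5]]  # two instances stacked as rows
y_hat = forward(x, w1, w2, w3)
```

Each row of `y_hat` is one prediction, and its three entries sum to 1 because of the softmax.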

How to Train a Neural Net?

[Figure: input (feature vector) flows through the network to the output (label).]

• Put in training inputs, get the output

• Compare output to correct answers: look at loss function J

• Adjust and repeat!

• Backpropagation tells us how to make a single adjustment using calculus.

Using Gradient Descent

1. Make prediction

2. Calculate loss

3. Calculate gradient of the loss function w.r.t. parameters

4. Update parameters by taking a step in the opposite direction

5. Iterate

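Calculate the loss function

Evaluate: $J(y_i, \hat{y}_i)$

Chain Rule

$\frac{\partial J}{\partial W^{(3)}} = (\hat{y} - y) \cdot a^{(3)}$

$\frac{\partial J}{\partial W^{(2)}} = (\hat{y} - y) \cdot W^{(3)} \cdot \sigma'(z^{(3)}) \cdot a^{(2)}$

$\frac{\partial J}{\partial W^{(1)}} = (\hat{y} - y) \cdot W^{(3)} \cdot \sigma'(z^{(3)}) \cdot W^{(2)} \cdot \sigma'(z^{(2)}) \cdot X$

• Recall that: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$

• Though they appear complex, the expressions above are easy to compute!

Backpropagation

Want: $\frac{\partial J(y_i, \hat{y}_i)}{\partial W^{(k)}}$ for each of $W^{(1)}$, $W^{(2)}$, $W^{(3)}$

[Figure: the gradient is computed for $W^{(3)}$ first and then propagated backwards through the network to $W^{(2)}$ and $W^{(1)}$.]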
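The chain rule can be checked numerically on the smallest possible case: one sigmoid neuron with one weight. The squared-error loss used here is an assumption chosen to keep the sketch short (the slides do not specify the loss for this network), and the numbers are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)


def loss(w, x, y):
    """Assumed loss: J = 1/2 * (y_hat - y)^2 with y_hat = sigma(w * x)."""
    return 0.5 * (sigmoid(w * x) - y) ** 2


def loss_gradient(w, x, y):
    """Chain rule: dJ/dw = (y_hat - y) * sigma'(z) * x, where z = w * x."""
    z = w * x
    return (sigmoid(z) - y) * sigmoid_prime(z) * x


def numerical_gradient(w, x, y, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)


w, x, y = 0.8, 1.5, 1.0
analytic = loss_gradient(w, x, y)
numeric = numerical_gradient(w, x, y)
updated_w = w - 0.5 * analytic  # one gradient descent step, learning rate 0.5
```

The analytic and numerical gradients should agree closely, and the updated weight should give a lower loss than the original one.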

What we have learnt so far

• Nomenclature required to build a NN

• Input, hidden, output layers

• Weights, activation

• Backpropagation using gradient descent

• Representing it all using matrices

Convolutional neural network

Convolutional Neural Nets

Primary Ideas behind Convolutional Neural Networks:

• Let the Neural Network learn which kernels are most useful

• Use same set of kernels across entire image (translation invariance)

• Reduces number of parameters and "variance" (from bias-variance point of view)

Kernels as Feature Detectors

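Kernels can be thought of as "local feature detectors".

[Figure: three example kernels. Only some rows of each kernel survive in this transcript:]

Vertical Line Detector: -1 1 -1

Horizontal Line Detector: -1 -1 -1

Corner Detector: -1 -1 -1 / -1 1 1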
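How a kernel acts as a local feature detector can be sketched by sliding it over a small image. The 3x3 kernel below repeats the one surviving row (-1 1 -1) three times, which is an assumption about the full kernel, and the 5x5 image is made up.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


# Assumed vertical line detector: positive centre column, negative sides.
vertical_line_kernel = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
]

# 5x5 image with a vertical line of ones in the middle column.
image = [[0, 0, 1, 0, 0] for _ in range(5)]

response = convolve2d_valid(image, vertical_line_kernel)
```

The response is strongest where the line sits under the kernel's centre column, and the 5x5 input shrinks to a 3x3 output because no padding is used.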

Padding the input data

Without padding, we lose data at the edges.

Pooling: Max-pool

• For each distinct patch, represent it by the maximum

• 2x2 maxpool shown below
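Zero padding and 2x2 max-pooling can be sketched as follows. Padding with zeros and using non-overlapping 2x2 patches are assumptions (the usual defaults), and the sample values are invented.

```python
def zero_pad(image, pad=1):
    """Surround the image with a border of zeros, `pad` pixels wide."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * width for _ in range(pad)]
    return padded


def max_pool_2x2(image):
    """Represent each distinct 2x2 patch by its maximum value."""
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, len(image[0]) - 1, 2)]
            for r in range(0, len(image) - 1, 2)]


image = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool_2x2(image)
padded = zero_pad([[1, 2], [3, 4]])
```

Pooling halves each spatial dimension, while padding by one pixel lets a 3x3 kernel produce an output the same size as its input.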

CNN for Digit recognition

Source: http://cs231n.github.io/

Convolutional Neural Networks (CNN) for Image Recognition

LeNet-5

How many total weights in the network?

Conv1: 1*6*5*5 + 6 = 156

Conv3: 6*16*5*5 + 16 = 2416

FC1: 400*120 + 120 = 48120

FC2: 120*84 + 84 = 10164

FC3: 84*10 + 10 = 850

Total: 61706

Less than a single FC layer with [1200x1200] weights!

Note that convolutional layers have relatively few weights.
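The weight counts above can be reproduced with two small helper functions. The helper names are hypothetical; the layer sizes are the ones listed on the slide.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights of a square convolution plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def fc_params(n_inputs, n_outputs):
    """Weights of a fully connected layer plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


layers = {
    "Conv1": conv_params(1, 6, 5),
    "Conv3": conv_params(6, 16, 5),
    "FC1": fc_params(400, 120),
    "FC2": fc_params(120, 84),
    "FC3": fc_params(84, 10),
}
total = sum(layers.values())
```

Most of the total comes from FC1, which illustrates how few weights the convolutional layers contribute.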

Differences between CNN and fully connected networks

CONVOLUTIONAL NEURAL NETWORK

• Each neuron connected to a small set of nearby neurons in the previous layer

• Uses same set of weights for each neuron

• Ideal for spatial feature recognition, e.g. image recognition

• Cheaper on resources due to fewer connections

FULLY CONNECTED NEURAL NETWORKS

• Each neuron is connected to every neuron in the previous layer

• Every connection has a separate weight

• Not optimal for detecting features

• Computationally intensive – heavy memory usage

Network architectures

AlexNet - Model Diagram

VGG16 Diagram

[Figure: three stacked layers, Layer 1 (Input), Layer 2 and Layer 3, connected by 3x3 convolutions.]

We can say that the "receptive field" of Layer 2 is 3x3: each output has been influenced by a 3x3 patch of inputs.

What about on Layer 3? An output on Layer 3 uses a 3x3 patch from Layer 2. How much from Layer 1 does it use? Each square in Layer 3 "sees" a 5x5 grid from Layer 1.

Two 3x3, stride 1 convolutions in a row have the same receptive field as one 5x5 convolution.

Three 3x3 convolutions have the same receptive field as one 7x7 convolution.

Benefit: fewer parameters

One 3x3 layer: $3 \times 3 \times C \times C = 9C^2$

Three 3x3 layers: $3 \times (9C^2) = 27C^2$

One 7x7 layer: $7 \times 7 \times C \times C = 49C^2$

$49C^2 \rightarrow 27C^2$: ≈45% reduction!
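Both the receptive-field growth and the parameter saving can be checked with a short calculation. The sketch assumes stride 1 throughout, C input and C output channels per layer, and no bias terms, as in the counts above; the channel count of 64 is arbitrary.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """With stride 1, each extra layer widens the field by kernel_size - 1."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weights of num_layers square convolutions, C channels in and out."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # arbitrary example value for C
three_small = conv_weights(3, channels, num_layers=3)  # 27 * C^2
one_large = conv_weights(7, channels)                  # 49 * C^2
reduction = 1 - three_small / one_large
```

The reduction does not depend on C, since the C squared factor cancels in the ratio 27/49.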

Inception V3 schematic

Inception

This whole "block" serves the function of a previous convolutional layer.

ResNet

• Add previous layer back in to current layer!

• Similar idea to "boosting"

examples

Unattended baggage detection using Intel® optimized caffe*

Source: https://software.intel.com/en-us/articles/unattended-baggage-detection-using-deep-neural-networks-in-intel-architecture

Why are Deep Neural Networks called "Deep"?

Source: https://research.facebook.com/publications/deepface-closing-the-gap-to-human-level-performance-in-face-verification/

Example of CNN topologies


GoogLeNet (2014)

[Figure legend: convolution, pooling, softmax, other]

Source: Google white paper and Krizhevsky et al.

Diagnosis of heart disease using CNNs

Source: http://cs231n.stanford.edu/reports2016/331_Report.pdf

Using 30 MRIs during one cardiac cycle from different axis views to predict VS and VD

Diabetic Retinopathy diagnosis

A Kaggle competition solution from deepsense.io

Images from EyePACS

Source: https://deepsense.io/diagnosing-diabetic-retinopathy-with-deep-learning/

Intel® NERVANA™ AI PORTFOLIO

[Figure: the Intel® Nervana™ portfolio shown as a layered stack. Recoverable labels by layer:]

experiences: Intel® Nervana™ DL Software, Intel® Computer Vision SDK, Movidius Fathom, Intel® GO™ Automotive

toolkits: Intel® DL Training & Deployment

Frameworks: Intel Distribution, Mlib, BigDL, Intel® Nervana™ Graph*

libraries: Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL

hardware: Compute, Memory/Storage, Networking

*Future

AI silicon positioning

[Figure: positioning of Intel AI silicon across training and inference workloads. Column labels: Batch, Many batch models, Stream, Edge. Recoverable text follows.]

Training

• Train machine learning models across a diverse set of dense and sparse data

• Train large deep neural networks

• Train large models as fast as possible (LAKECREST)

Inference

• Infer billions of data samples at a time and feed applications within ~1 day

• Infer deep data streams with low latency in order to take action within milliseconds

• Power-constrained environments

Other labels: "or other Intel® edge processor", "Option for higher throughput/watt", "Required for lower latency", "*Future*"

Intel® Movidius™ Neural Compute Stick

Get started: https://developer.movidius.com/

Intel® Nervana™ Full stack platform

"deep learning by design"

• Nervana Cloud: build an AI POC

• neon (deep learning framework): train DL models quickly

• Intel Nervana Graph: any framework, any hardware

• Intel Nervana HW: industry leading AI, coming soon

Multi-user collaboration

Interactive sessions

Model library

Fast training

Batch training

Experiment tracking

Multi-node distribution

Analytics & visualization

Hyperparameter optimization

Batch inference

Model compression

Inference deployment

Export to edge devices

Data curation/processing

Data partitioning

Data labeling

Accelerate time-to-solution by compressing both compute and labor-intensive steps in the innovation cycle to deliver scalable end-to-end AI solutions

Intel® Nervana™ Deep Learning Software

Intel® distribution of python* 2017

DL Framework Optimized for IA: TensorFlow

Performance Optimization on Modern Platforms

Hierarchical Parallelism

• Coarse-grained / multi-node: domain decomposition

• Fine-grained parallelism / within node, sub-domain: 1) multi-level domain decomposition (ex. across layers), 2) data decomposition (layer parallelism)

Utilize all the cores

• OpenMP, MPI, TBB…

• Reduce synchronization events, serial code

• Improve load balancing

Vectorize/SIMD

• Unit strided access per SIMD lane

• High vector efficiency

• Data alignment

Efficient memory/cache use

• Blocking

• Data reuse

• Prefetching

• Memory allocation

Scaling

• Improve load balancing

• Reduce synchronization events, all-to-all comms

Example Challenge 1: Data Layout Has Big Impact on Performance

• Data layout impacts performance

• Sequential access to avoid gather/scatter

• Have iterations in innermost loop to ensure high vector utilization

• Maximize data reuse, e.g. weights in a convolution layer

• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout

[Figure: the same grid of sample values stored in two different memory orders. The second order is labeled "Better optimized for some operations".]

Example Challenge 2: Minimize Conversions Overhead

• End-to-end optimization can reduce conversions

• Staying in optimized layout as long as possible becomes one of the tuning goals

• Minimize the number of back and forth conversions

• Use of graph optimization techniques

[Figure: a Convolution, Max Pool, Convolution pipeline in which each operation is wrapped in a "Native to MKL layout" conversion before and an "MKL layout to Native" conversion after.]

Optimizing TensorFlow & Other DL Frameworks for Intel® Architecture

• Leverage highly performant compute libraries and tools, e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler etc.

• Data format/shape: right format/shape for max performance (blocking, gather/scatter)

• Data layout: minimize cost of data layout conversions

• Parallelism: use all cores, eliminate serial sections and load imbalance

• Memory allocation: unique characteristics and ability to reuse buffers

• Data layer optimizations: parallelization, vectorization, IO

• Optimize hyperparameters: e.g. batch size for more parallelism; learning rate and optimizer to ensure accuracy/convergence

Initial Performance Gains on Modern Xeon (2 Sockets Broadwell - 22 Cores)

Benchmark          Metric      Batch  Baseline training  Baseline inference  Optimized training  Optimized inference  Speedup training  Speedup inference
ConvNet-Alexnet    Images/sec  128    33.52              84.2                524                 1696                 15.6x             20.2x
ConvNet-GoogleNet  Images/sec  128    16.87              49.9                112.3               439.7                6.7x              8.8x
ConvNet-           Images/sec  64     8.2                30.7                47.1                151.1                5.7x              4.9x

• Baseline using TensorFlow 1.0 release with standard compiler knobs

• Optimized performance using TensorFlow with Intel optimizations and built with

• bazel build --config=mkl --copt="-DEIGEN_USE_VML"

Initial Performance Gains on Modern Xeon Phi (Knights Landing – 68 Cores)

Benchmark          Metric      Batch  Baseline training  Baseline inference  Optimized training  Optimized inference  Speedup training  Speedup inference
ConvNet-Alexnet    Images/sec  128    12.21              31.3                549                 2698.3               45x               86.2x
ConvNet-GoogleNet  Images/sec  128    5.43               10.9                106                 576.6                19.5x             53x
ConvNet-           Images/sec  64     1.59               24.6                69.4                251                  43.6x             10.2x

• Baseline using TensorFlow 1.0 release with standard compiler knobs

• Optimized performance using TensorFlow with Intel optimizations and built with

• bazel build --config=mkl --copt="-DEIGEN_USE_VML"

Additional Performance Gains from Parameters Tuning

• Data format: CPU prefers NCHW data format

• Intra_op, inter_op and OMP_NUM_THREADS: set for best core utilization

• Batch size: higher batch size provides for better parallelism

• Too high a batch size can increase working set and impact cache/memory perf

Best Setting for Xeon (Broadwell – 2 Socket – 44 Cores)

Benchmark             Data format  Inter_op  Intra_op  KMP_BLOCKTIME  Batch size
ConvNet-Alexnet       NCHW         1         44        30             2048
ConvNet-Googlenet V1  NCHW         2         44        1              256
ConvNet-VGG           NCHW         1         44        1              128

Best Setting for Xeon Phi (Knights Landing – 68 Cores)

Benchmark             Data format  Inter_op  Intra_op  KMP_BLOCKTIME  OMP_NUM_THREADS  Batch size
ConvNet-Alexnet       NCHW         1         68        30             136              2048
ConvNet-Googlenet V1  NCHW         2         68        1              68               256
ConvNet-VGG           NCHW         1         68        1              136              128
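As a rough sketch of how such settings might be applied, the snippet below copies the Xeon (Broadwell) AlexNet row into a dictionary and writes the environment-variable part into an environment mapping. The helper name and dictionary keys are hypothetical. The TensorFlow 1.x call is shown only as a comment, on the assumption that TensorFlow is installed and that the variables are set before it is imported.

```python
import os

# Values taken from the Xeon (Broadwell) AlexNet row; key names are illustrative.
alexnet_xeon_settings = {
    "data_format": "NCHW",
    "inter_op": 1,
    "intra_op": 44,
    "KMP_BLOCKTIME": 30,
    "batch_size": 2048,
}


def apply_thread_environment(settings, env=os.environ):
    """Write the OpenMP-related settings into an environment mapping."""
    env["KMP_BLOCKTIME"] = str(settings["KMP_BLOCKTIME"])
    if "OMP_NUM_THREADS" in settings:  # only listed in the Xeon Phi table
        env["OMP_NUM_THREADS"] = str(settings["OMP_NUM_THREADS"])
    return env


# Demonstrated on a plain dict so the real process environment is untouched.
demo_env = apply_thread_environment(alexnet_xeon_settings, env={})

# With TensorFlow 1.x, the op-level thread pools would then be configured as:
#   config = tf.ConfigProto(
#       intra_op_parallelism_threads=alexnet_xeon_settings["intra_op"],
#       inter_op_parallelism_threads=alexnet_xeon_settings["inter_op"])
#   session = tf.Session(config=config)
```

Keeping the settings in one dictionary per benchmark makes it easy to switch between the Xeon and Xeon Phi rows when experimenting.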

Social Media & Survey Prize Winners

Want to learn more?

Check out the Intel® Nervana™ AI Academy for Students: software.intel.com/AIStudents

backup

Intel tools and libraries

Intel® Distribution for Python*

• Ready access to a set of tools and techniques for high performance on Intel® Architecture

• Accelerated Python packages - NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py

• Integrated with Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL) and pyDAAL, Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB)

• Get out-of-the-box performance that is closer to native code speeds.

• Speed up data analytics with pyDAAL and parallelize Python workloads.

• Manage packages and Jupyter Notebooks easily with conda, Anaconda Cloud, and PIP.

Learn more: https://software.intel.com/en-us/intel-distribution-for-python

Intel® Math Kernel Library (MKL)

• Features highly optimized, threaded and vectorized functions to maximize performance on Intel® Architecture and compatible processors

• Linear Algebra, Fast Fourier Transforms (FFT), Neural Network, Vector Math and Statistics functions

• Standard APIs for immediate performance results

• Utilizes de facto standard C and Fortran APIs for compatibility with BLAS, LAPACK and FFTW functions from other math libraries

• Available with both free community-supported and paid support licenses

Learn more: https://software.intel.com/en-us/intel-mkl

Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)

• A library of DNN performance primitives optimized for Intel architectures

• A set of highly optimized building blocks intended to accelerate compute-intensive parts of deep learning applications, particularly DNN frameworks such as Caffe, Tensorflow, Theano and Torch

• Distributed as source code through GitHub

• Implemented in C++ and provides both C++ and C APIs

• Allows the functionality to be used from a wide range of high-level languages, such as Python or Java

Learn more: https://01.org/mkl-dnn/overview

Intel® Data Analytics Acceleration Library (Intel® DAAL)

• Features highly tuned functions for deep learning, classical machine learning, and data analytics performance across the spectrum of Intel® architecture devices

• Intel® DAAL addresses all stages of the Big Data Ecosystem

• Includes Python*, C++, and Java* APIs and connectors to popular data sources including Spark* and Hadoop*

• Free and open source community-supported versions are available, as well as paid versions that include premium support.

Learn more: https://software.intel.com/en-us/intel-daal

Intel® Machine Learning Scaling Library for Linux* OS

• A library providing an efficient implementation of communication patterns used in deep learning.

• Built on top of MPI, allows for use of other communication libraries

• Optimized to drive scalability of communication patterns

• Works across various interconnects: Intel(R) Omni-Path Architecture, InfiniBand*, and Ethernet

• Common API to support Deep Learning frameworks (Caffe*, Theano*, Torch*, etc.)

Learn more: https://github.com/01org/MLSL

BigDL: Distributed Deep Learning Library for Apache Spark*

• Write deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters

• Rich deep learning support - numeric computing (via Tensor) and high level neural networks; load pre-trained Caffe or Torch models into Spark programs using BigDL

• Extremely high performance - uses Intel® MKL and multi-threaded programming in each Spark task

• Efficiently scale-out to “Big Data Scale” using Apache Spark

Learn more: https://github.com/intel-analytics/BigDL

Trusted Analytics Platform

• Facilitates data ingestion, preparation, and analysis with parallel processing and distributed analytics.

• The software leverages Apache Spark*, Intel® Data Analytics Acceleration Library, and Intel® Math Kernel Library for optimized distributed analytics and parallel processing on Intel® processors.

• Accelerates the modeling process with Intel optimized computational machine-learning and deep-learning algorithms, as well as graph operations, scoring engine, and pipelines.

• Integrates with industry-leading software frameworks such as Apache Spark, TensorFlow*, and Superset to expedite application development and enable deep-learning and visualization techniques.

Learn more: https://software.intel.com/en-us/bigdata/tap
