
Post on 26-Sep-2020


Deep Learning Student workshop

September, 2017

Agenda

⎯ Welcome & Introductions

⎯ Intel® Nervana™ AI Academy for Students

⎯ Intel® & AI

⎯ What is Machine Learning & Data Science

⎯ Deep Learning and Neural Networks

⎯ DL frameworks optimized for IA

Questions? Ask us!

Ben Odom, Developer Evangelist
benjamin.j.odom@intel.com

Bob Duffy, Student Ambassador Program Manager
robert.p.duffy@intel.com

Meghana Rao, Developer Evangelist
meghana.s.rao@intel.com

Niven Singh, AI Student Developer Community Manager
niven.singh@intel.com

Announcing: Intel® Nervana™ AI Academy for Students

With the Intel® Nervana™ AI Academy for Students, our goal is to drive awareness of the innovation around AI at the academic level.

We do this by training students on campus and online, and then showcasing and highlighting their expertise, inspiration and innovation as Intel Student Ambassadors.

⎯ Educate students on campus and in person, and begin to build a relationship between students, professors, universities and Intel

⎯ Recruit qualified Student Ambassadors

⎯ Support them with IA access and training

⎯ Coach and help them to deliver innovative ideas, expert content and student training to other students

⎯ Showcase examples of early innovation work by students

Intel student ambassadors - Who are they?

They’re just like you!

- Graduate and PhD students who are excited and want to do real work in the field of Deep Learning

- They are subject matter experts, who are going to events like SXSW, SIGSE, PyCon, and on campus to talk about their work

- They are active participants, working on projects, papers, articles – content that has their name on it!

- They are curious and inventive thinkers – trying new things, creating demos and working on REALLY cool stuff to share with the community

Intel student ambassadors – What are they doing?

Intel Student Ambassadors are working on innovative, real-world, applicable research and projects, like:

- Using smartphone cameras to collect data on mosquitoes and identify harmful vs. harmless ones

- Leveraging neural networks and deep learning to conduct stock price analysis and predictions

- Enabling individuals with speech impediments to use speech-to-text software to recognize and dictate their speech

- Using ML & AI to solve medical problems, like disease detection and identifying cures for epidemics

http://devmesh.intel.com

Intel & AI

Intel® Nervana™ portfolio

[Diagram: layered portfolio with the tiers experiences, toolkits, frameworks, libraries and hardware. "*" marks future items.]

- Hardware: Compute, Memory/Storage, Networking
- Libraries: Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL, Intel Distribution
- Frameworks: Mlib, BigDL, Intel® Nervana™ Graph*
- Other items shown: Intel® Nervana™ DL Software & Cloud, Computer Vision, Intel® DL Training & Deployment, Intel® Computer Vision SDK, Movidius Fathom, Intel® GO™ Automotive SDK

What is data science?

“Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.”

- Wikipedia

Source: https://en.wikipedia.org/wiki/Data_science

The data science process

How to become a data scientist?

[Word cloud of skills: NoSQL; Math; Statistics; R, Python, Scala; Machine Learning; Deep Learning; Neural Networks; Visualization; Communication; Domain Knowledge; Story Teller; Hacker Mindset; Passion; Love the Data; Engage with “C” Level]

What is machine learning?

Applying algorithms to observed data in order to make predictions based on that data.

Machines learn in two ways:

Supervised Learning & Unsupervised Learning

Supervised Learning

We train the model by feeding it the correct answers (the “ground truth”). The model learns from them and finally predicts.

Unsupervised Learning

Data is given to the model, but the right answers are not. The model makes sense of the data given to it, and can teach you something you were probably not aware of in the given dataset.

Types of Supervised and Unsupervised learning

SUPERVISED: Classification, Regression

UNSUPERVISED: Clustering, Recommendation

CLASSIFICATION
Predict a label for an entity with a given set of features.

Examples: spam prediction, sentiment analysis

REGRESSION
Predict a real numeric value for an entity with a given set of features.

[Figure: predicting a home's price ($) from its property attributes (address, type, age, parking, school, transit, total sqft, lot size, bathrooms, bedrooms, yard, pool, fireplace). A linear regression model is drawn through a plot of price against sqft.]

CLUSTERING
Group entities with similar features.

Example: market segmentation

[Figure: play time in hours plotted against age (10 to 90), with three clusters: casual gamers, non-gamers and serious gamers]

RECOMMENDATION
Recommend an item to a user based on past behavior or preferences of similar users.

[Figure: data (user info + your past purchase data + purchases of other users + product info) is fed to a recommendation ML method (classifier, matrix), which outputs recommendations (YMAL)]

Applications of Machine Learning

Fraud Detection

Movie Recommendation

Face Detection

Anomaly Detection

Product Sentiment Analysis

Natural Language Processing

Image Analysis

IoT Analysis

Spam Filtering/Virus Detection

Working with data sets

Machine Learning Vocabulary - How do you read a data set?

Target: predicted category or value of the data (column to predict)

Features: properties of the data used for prediction (non-target columns)

Example: a single data point within the data (one row)

Label: the target value for a single data point

An example data set

sepal length  sepal width  petal length  petal width  species
6.7           3.0          5.2           2.3          virginica
6.4           2.8          5.6           2.1          virginica
4.6           3.4          1.4           0.3          setosa
6.9           3.1          4.9           1.5          versicolor
4.4           2.9          1.4           0.2          setosa
4.8           3.0          1.4           0.1          setosa
5.9           3.0          5.1           1.8          virginica
5.4           3.9          1.3           0.4          setosa
4.9           3.0          1.4           0.2          setosa
5.4           3.4          1.7           0.2          setosa

Annotations on the slide: the species column is the target, the four measurement columns are the features, each row is an example, and a single species value is a label.

Training, Validation & Test data sets

If our data set is 100,000 homes sold in Portland, a typical split would be:

Train = 70,000 Homes

Validation = 10,000 Homes

Test = 20,000 Homes
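The 70/10/20 split above can be sketched in a few lines of plain Python. This is only an illustration: the function name `train_val_test_split`, the fixed seed and the use of integers as stand-ins for home records are assumptions, not part of the workshop material.

```python
import random


def train_val_test_split(rows, train=0.7, val=0.1, seed=0):
    """Shuffle rows, then cut them into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle so each split is a random sample
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])  # whatever is left becomes the test set


homes = range(100_000)  # hypothetical stand-in for 100,000 home records
train_set, val_set, test_set = train_val_test_split(homes)
print(len(train_set), len(val_set), len(test_set))
```

The test set is held back until the end; the validation set is what you use to compare models while tuning.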

Setting up your environment

What is in a Basic Data Science Toolkit

Intel® Distribution of Python* 2017

6 Steps to Jupyter Notebook with Intel Distribution of Python

1. Install Anaconda: https://www.continuum.io/downloads#linux

2. Choose Intel packages: conda config --add channels intel

3. Create the environment: conda create -n intelpython3 intelpython3_full python=3

4. Activate the environment: source activate intelpython3

5. Run the Jupyter notebook: jupyter notebook --no-browser (only use --no-browser if running remotely or using Bash on Windows)

6. Access the notebook: http://localhost:8888

linear regression

Introduction to Linear Regression

$y_\beta(x) = \beta_0 + \beta_1 x$

[Figure: scatter plot of box office revenue against movie budget, both axes from 0.0 to 2.0 in units of $10^8$, with a fitted line]

Here $y_\beta(x)$ is the box office revenue, $x$ is the movie budget, $\beta_0$ is coefficient 0 and $\beta_1$ is coefficient 1.

Fitted values: $\beta_0$ = 80 million, $\beta_1$ = 0.6

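Predicting from Linear Regression

$y_\beta(x) = \beta_0 + \beta_1 x$

$\beta_0$ = 80 million, $\beta_1$ = 0.6

[Figure: the fitted line on the box office vs. budget plot, read off at a budget of 160 million]

Predict roughly 175 million gross for a 160 million budget (with the rounded coefficients, 80 + 0.6 × 160 = 176 million).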

Which Model Fits the Best?

[Figure: several candidate lines drawn through the box office vs. budget scatter plot]

Calculating the Residuals

[Figure: for each data point, the vertical gap between the predicted value on the line and the observed value]

Residual for point $i$ (predicted value minus observed value):

$y_\beta(x_{obs}^{(i)}) - y_{obs}^{(i)} = \left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}$

Mean Squared Error

$\frac{1}{m}\sum_{i=1}^{m}\left(\left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}\right)^2$

Minimum Mean Squared Error

$\min_{\beta_0,\beta_1} \frac{1}{m}\sum_{i=1}^{m}\left(\left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}\right)^2$

Cost Function

$J(\beta_0, \beta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}\right)^2$
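As a rough illustration of the cost function, the sketch below evaluates $J(\beta_0, \beta_1)$ for the coefficients quoted on the earlier slides (80 million and 0.6). The budget and revenue figures are made up for the example, and the function names `predict` and `cost` are illustrative only.

```python
def predict(beta0, beta1, x):
    """Linear model: y = beta0 + beta1 * x."""
    return beta0 + beta1 * x


def cost(beta0, beta1, xs, ys):
    """J(beta0, beta1) = 1/(2m) * sum of squared residuals."""
    m = len(xs)
    squared = ((predict(beta0, beta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared) / (2 * m)


# Hypothetical movies, in millions of dollars (not data from the slides).
budgets = [50.0, 100.0, 160.0]
revenues = [120.0, 130.0, 180.0]

print(predict(80.0, 0.6, 160.0))            # prediction for a 160 million budget
print(cost(80.0, 0.6, budgets, revenues))   # cost of the quoted coefficients
print(cost(0.0, 1.0, budgets, revenues))    # cost of a worse candidate line
```

A lower value of `cost` means the line sits closer to the observed points, which is the criterion used to compare candidate models.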

Gradient Descent

Start with a cost function $J(\beta)$:

[Figure: curve of $J(\beta)$ plotted against $\beta$, with the global minimum marked]

Then gradually move towards the minimum.

Gradient Descent with Linear Regression

Now imagine there are two parameters $(\beta_0, \beta_1)$.

This is a more complicated surface on which the minimum must be found.

[Figure: surface of $J(\beta_0, \beta_1)$ over the $(\beta_0, \beta_1)$ plane]

How can we do this without knowing what $J(\beta_0, \beta_1)$ looks like?

Compute the gradient, $\nabla J(\beta_0, \beta_1)$, which points in the direction of the biggest increase!

$-\nabla J(\beta_0, \beta_1)$ (the negative gradient) points to the biggest decrease at that point!

The gradient is a vector whose coordinates are the partial derivatives of $J$ with respect to each parameter:

$\nabla J(\beta_0, \dots, \beta_n) = \left\langle \frac{\partial J}{\partial \beta_0}, \dots, \frac{\partial J}{\partial \beta_n} \right\rangle$

Then use the gradient ($\nabla$) and the cost function to calculate the next point ($\omega_1$) from the current one ($\omega_0$):

$\omega_1 = \omega_0 - \alpha \nabla \frac{1}{2}\sum_{i=1}^{m}\left(\left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}\right)^2$

The learning rate ($\alpha$) is a tunable parameter that determines step size.

Each point can be iteratively calculated from the previous one:

$\omega_2 = \omega_1 - \alpha \nabla \frac{1}{2}\sum_{i=1}^{m}\left(\left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}\right)^2$

$\omega_3 = \omega_2 - \alpha \nabla \frac{1}{2}\sum_{i=1}^{m}\left(\left(\beta_0 + \beta_1 x_{obs}^{(i)}\right) - y_{obs}^{(i)}\right)^2$

[Figure: surface of $J(\beta_0, \beta_1)$ with the successive points $\omega_0$, $\omega_1$, $\omega_2$, $\omega_3$ stepping towards the minimum]
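The update rule can be sketched as a short loop. This is a minimal illustration, not the workshop's code: it uses the cost $J$ with the $\frac{1}{2m}$ factor (the constant $\frac{1}{m}$ only rescales the learning rate), the data points are invented so that they lie exactly on the line $y = 0.8 + 0.6x$, and the learning rate and step count are arbitrary choices.

```python
def gradient(beta0, beta1, xs, ys):
    """Partial derivatives of J = 1/(2m) * sum((beta0 + beta1*x - y)^2)."""
    m = len(xs)
    residuals = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
    d_beta0 = sum(residuals) / m
    d_beta1 = sum(r * x for r, x in zip(residuals, xs)) / m
    return d_beta0, d_beta1


def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    """Repeat omega_next = omega - alpha * gradient, starting from (0, 0)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        d0, d1 = gradient(beta0, beta1, xs, ys)
        beta0 -= alpha * d0  # step against the gradient
        beta1 -= alpha * d1
    return beta0, beta1


# Hypothetical data in units of 10^8, generated from y = 0.8 + 0.6 * x.
xs = [0.5, 1.0, 1.6, 2.0]
ys = [0.8 + 0.6 * x for x in xs]

fitted_beta0, fitted_beta1 = gradient_descent(xs, ys)
print(fitted_beta0, fitted_beta1)
```

If the learning rate is too large the steps overshoot and the values diverge; if it is too small, many more iterations are needed.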


Modelling Best Practice

Use cost function to fit model

Develop multiple models

Compare results and choose best one

k nearest neighbors

K Nearest Neighbors Classification

[Figure: scatter plot of patients, age (20 to 60) against number of malignant nodes (0 to 20), with two classes: survived and did not survive. A new point is marked “Predict”.]

Neighbor counts per class around the point to predict, as K grows:

K = 1: 0 and 1
K = 2: 1 and 1
K = 3: 2 and 1
K = 4: 3 and 1

What is Needed to Select a KNN Model?

- Correct value for 'K'

- How to measure closeness of neighbors?

Value of 'K' Affects Decision Boundary

[Figure: two panels of age (20 to 60) against number of malignant nodes (0 to 20), showing the decision boundary for K=1 and for K=All]

Measurement of Distance in KNN

[Figure: age against number of malignant nodes, with the distance between two points highlighted]

Euclidean Distance (L2 Distance)

$d = \sqrt{\Delta Nodes^2 + \Delta Age^2}$

[Figure: right triangle between two points with legs $\Delta Nodes$ and $\Delta Age$ and hypotenuse $d$]

Manhattan Distance (L1 or City Block Distance)

$d = |\Delta Nodes| + |\Delta Age|$

[Figure: the same two points, joined along the legs $\Delta Nodes$ and $\Delta Age$]
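To make the two ingredients of a KNN model concrete (a value for K and a distance measure), here is a small sketch of a majority-vote classifier. The patient records are invented for illustration, and the function names are not from any library.

```python
import math
from collections import Counter


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    """L1 (city block) distance: sum of absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def knn_predict(train, query, k, distance=euclidean):
    """Label the query by majority vote among its k closest training points."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical patients: (malignant nodes, age) -> outcome.
patients = [
    ((1, 30), "survived"),
    ((2, 45), "survived"),
    ((3, 35), "survived"),
    ((15, 50), "did not survive"),
    ((18, 55), "did not survive"),
]

print(knn_predict(patients, (4, 40), k=3))
print(knn_predict(patients, (16, 52), k=3, distance=manhattan))
```

Note that "fitting" here is just keeping `patients` around; all the work happens at prediction time, when every stored point is compared with the query.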

Scale is Important for Distance Measurement

[Figure sequence: age (20 to 60) plotted against number of surgeries (1 to 5). The age axis is then zoomed to the range 18 to 24 and the nearest neighbors of a point are highlighted.]

“Feature Scaling”

[Figure sequence: the same data after feature scaling, with age (20 to 60) and number of surgeries (0 to 5) drawn on comparable scales, and the nearest neighbors highlighted again.]
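One common way to do feature scaling is min-max scaling, which maps every feature to the range 0 to 1 so that no single feature dominates the distance. The slides do not say which scaling method they use, so treat the choice below, the function name and the sample values as assumptions.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: nothing to spread out
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical patients: ages span 40 years, surgeries span only 4.
ages = [20, 40, 60]
surgeries = [1, 3, 5]

scaled_ages = min_max_scale(ages)
scaled_surgeries = min_max_scale(surgeries)
print(scaled_ages, scaled_surgeries)
```

Before scaling, a difference of 4 surgeries counts for less than a difference of 5 years of age; after scaling, both features contribute on the same footing.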

Performance comparison - Linear Regression and KNN

Linear regression
- Fitting involves minimizing cost function (slow)
- Model has few parameters (memory efficient)
- Prediction involves calculation (fast)

K nearest neighbors
- Fitting involves storing training data (fast)
- Model has many parameters (memory intensive)
- Prediction involves finding closest neighbors (slow)

What is the issue with linear classifiers we have learnt so far?

XOR: the counter example to all models

X1  X2  y
0   0   0
0   1   1
1   0   1
1   1   0

[Figure: the four XOR points plotted on X1 and X2 axes; no single straight line separates the two classes]

We need non-linear functions

Source: https://medium.com/towards-data-science/introducing-deep-learning-and-neural-networks-deep-learning-for-rookies-1-bd68f9cf5883

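We need layers, usually lots, with non-linear transformations

XOR = (X1 AND NOT X2) OR (NOT X1 AND X2)

[Figure: small network with two inputs, one hidden unit with threshold 1.5, and one output unit with threshold 0.5. The connection weights shown are +1, +1, +1, +1 and -2; each unit's sum is thresholded to 0 or 1.]

X1  X2  y
0   0   0
0   1   1
1   0   1
1   1   0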
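The numbers on the slide match a classic two-layer threshold network for XOR. The sketch below assumes a particular wiring that the extracted text does not spell out: both inputs feed the hidden unit with weight +1 (threshold 1.5), both inputs also feed the output unit with weight +1, and the hidden unit feeds the output with weight -2 (threshold 0.5).

```python
def step(total, threshold):
    """Threshold to 0 or 1."""
    return 1 if total > threshold else 0


def xor_network(x1, x2):
    # Hidden unit fires only when both inputs are 1 (acts like AND).
    hidden = step(1 * x1 + 1 * x2, 1.5)
    # Output fires when at least one input is 1, unless the hidden unit vetoes it.
    return step(1 * x1 + 1 * x2 - 2 * hidden, 0.5)


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

Without the hidden unit, the output would be a single thresholded sum of the inputs, which can only draw one straight line through the plane.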

This is an emerging domain called Deep Learning

In the machine learning world, we use neural networks. The idea comes from biology. Each layer learns something.

“Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.”

- Wikipedia

Layer 1 → Layer 2 → ... → Layer N → Prediction

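Each layer learns something

[Figure: features learned at successive layers for the categories faces, cars, elephants and chairs, followed by a fully connected layer that outputs the prediction “Elephant”]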

What is deep learning good for?

Classification and Detection

Detect and label the image (example labels: person, motorcyclist, bike)

https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf

Semantic Segmentation

Label every pixel

https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf

Natural Language Object Retrieval

http://arxiv.org/pdf/1511.04164v3.pdf

Speech Recognition

The same architecture is used for English and Mandarin Chinese speech recognition

http://svail.github.io/mandarin/

The basics of building a neural network

Motivation for Neural Nets

• Use biology as inspiration for mathematical model

• Get signals from previous neurons

• Generate signals (or not) according to inputs

• Pass signals on to next neurons

• By layering many neurons, can create complex model

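Basic Neuron Visualization

[Figure: inputs $x_1$, $x_2$, $x_3$ with weights $w_1$, $w_2$, $w_3$ and a constant input 1 with weight $b$ feed a single neuron, which applies an activation function $f$]

$z = x_1 w_1 + x_2 w_2 + x_3 w_3 + b$

Output: $f(z)$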
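The single neuron above amounts to a weighted sum followed by an activation function. A minimal sketch, with made-up inputs, weights and bias, and a sigmoid chosen as the example activation:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Compute f(z) with z = x1*w1 + x2*w2 + x3*w3 + b."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical values, chosen only to show the call.
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2, sigmoid)
print(output)
```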

Types of activation functions

• Sigmoid function
  • Smooth transition in output between (0,1)

• Tanh function
  • Smooth transition in output between (-1,1)

• ReLU function
  • $f(x) = \max(x, 0)$

• Step function
  • $f(x) \in \{0, 1\}$
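The four activation functions can be written directly with the standard library. One detail is an assumption: the slide does not say where the step function switches, so this sketch switches at zero.

```python
import math


def sigmoid(x):
    """Smooth transition between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Smooth transition between -1 and 1."""
    return math.tanh(x)


def relu(x):
    """f(x) = max(x, 0)."""
    return max(x, 0.0)


def step(x):
    """Outputs 0 or 1; the switch point at 0 is an assumption."""
    return 1 if x >= 0 else 0


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), step(value))
```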

Why Neural Nets?

• Why not just use a single neuron? Why do we need a larger network?

• A single neuron (like logistic regression) only permits a linear decision boundary.

• Most real-world problems are considerably more complicated!

Feedforward Neural Network

[Figure: fully connected network with inputs $x_1$, $x_2$, $x_3$, two hidden layers of four $\sigma$ units each, and outputs $y_1$, $y_2$, $y_3$. Successive slides highlight the parts listed below.]

• Weights

• Input Layer

• Hidden Layers

• Output Layer

• Weights (represented by matrices): $W^{(1)}$, $W^{(2)}$, $W^{(3)}$

• Net Input (sum of weighted inputs, before activation function): $z^{(2)}$, $z^{(3)}$, $z^{(4)}$

• Activations (output of neurons to next layer): $a^{(1)}$, $a^{(2)}$, $a^{(3)}$, $a^{(4)}$

Matrix representation of computation

For a single data point (instance):

$x = (x_1, x_2, x_3)$, with $x = a^{(1)}$

$z^{(2)} = x W^{(1)}$

$a^{(2)} = \sigma(z^{(2)})$

• $W^{(1)}$ is a 3x4 matrix

• $z^{(2)}$ is a 4-vector

• $a^{(2)}$ is a 4-vector

Continuing the Computation

For a single training instance (data point)

Input: vector $x$ (a row vector of length 3)
Output: vector $\hat{y}$ (a row vector of length 3)

$z^{(2)} = x W^{(1)}$, $a^{(2)} = \sigma(z^{(2)})$

$z^{(3)} = a^{(2)} W^{(2)}$, $a^{(3)} = \sigma(z^{(3)})$

$z^{(4)} = a^{(3)} W^{(3)}$, $\hat{y} = softmax(z^{(4)})$

Multiple data points

In practice, we do these computations for many data points at the same time, by “stacking” the rows into a matrix. But the equations look the same!

Input: matrix $x$ (an nx3 matrix; each row a single instance)
Output: matrix $\hat{y}$ (an nx3 matrix; each row a single prediction)
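The three-step computation can be traced with plain Python lists. This is a sketch only: the weights are random placeholders (a 3x4, a 4x4 and a 4x3 matrix, matching the layer sizes described above), and the helper names are illustrative.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid_all(matrix):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matrix]


def softmax_rows(matrix):
    """Turn each row into probabilities that sum to 1."""
    result = []
    for row in matrix:
        top = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def forward(x, w1, w2, w3):
    a2 = sigmoid_all(matmul(x, w1))    # z(2) = x W(1),    a(2) = sigma(z(2))
    a3 = sigmoid_all(matmul(a2, w2))   # z(3) = a(2) W(2), a(3) = sigma(z(3))
    return softmax_rows(matmul(a3, w3))  # y = softmax(a(3) W(3))


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w1, w2, w3 = random_matrix(3, 4), random_matrix(4, 4), random_matrix(4, 3)
batch = [[0.2, 0.5, 0.1], [0.9, 0.3, 0.4]]  # two instances stacked as rows
predictions = forward(batch, w1, w2, w3)
print(predictions)
```

The same `forward` call handles one instance or many, because stacking instances only adds rows to each intermediate matrix.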

How to Train a Neural Net?

[Figure: network mapping an input (feature vector) to an output (label)]

• Put in training inputs, get the output
• Compare output to correct answers: look at loss function $J$
• Adjust and repeat!

• Backpropagation tells us how to make a single adjustment using calculus.

Using Gradient Descent

1. Make prediction
2. Calculate loss
3. Calculate gradient of the loss function w.r.t. parameters
4. Update parameters by taking a step in the opposite direction
5. Iterate

Calculate the loss function

[Figure: the network's predictions $\hat{y}_1$, $\hat{y}_2$, $\hat{y}_3$ are compared with the true labels $y_1$, $y_2$, $y_3$]

Evaluate: $J(y_i, \hat{y}_i)$

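Chain Rule

$\frac{\partial J}{\partial W^{(3)}} = (\hat{y} - y) \cdot a^{(3)}$

$\frac{\partial J}{\partial W^{(2)}} = (\hat{y} - y) \cdot W^{(3)} \cdot \sigma'(z^{(3)}) \cdot a^{(2)}$

$\frac{\partial J}{\partial W^{(1)}} = (\hat{y} - y) \cdot W^{(3)} \cdot \sigma'(z^{(3)}) \cdot W^{(2)} \cdot \sigma'(z^{(2)}) \cdot X$

• Recall that: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
• Though they appear complex, the above are easy to compute!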
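The identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is what makes these gradients cheap: the derivative reuses the activation already computed in the forward pass. A quick numerical sanity check, comparing the formula with a central finite difference (the step size `h` and the test points are arbitrary choices):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z, h=1e-5):
    """Central finite difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
```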

Backpropagation

Want: $\frac{\partial J(y_i, \hat{y}_i)}{\partial W_k}$ for each weight matrix $W^{(1)}$, $W^{(2)}$, $W^{(3)}$

[Figure sequence: the gradients are filled in from the output back towards the input: first $\frac{\partial J(y_i, \hat{y}_i)}{\partial W_3}$, then $\frac{\partial J(y_i, \hat{y}_i)}{\partial W_2}$, then $\frac{\partial J(y_i, \hat{y}_i)}{\partial W_1}$]

What we have learnt so far

• Nomenclature required to build a NN

• Input, hidden, output layers

• Weights, activation

• Backpropagation using gradient descent

• Representing it all using matrices

Convolutional neural network

Convolutional Neural Nets

Primary ideas behind Convolutional Neural Networks:

• Let the neural network learn which kernels are most useful
• Use the same set of kernels across the entire image (translation invariance)
• Reduces the number of parameters and “variance” (from a bias-variance point of view)

Kernels as Feature Detectors

Kernels can be thought of as “local feature detectors”.

Vertical Line Detector

-1 1 -1

-1 1 -1

-1 1 -1

Horizontal Line Detector

-1 -1 -1

1 1 1

-1 -1 -1

Corner Detector

-1 -1 -1

-1 1 1

-1 1 1
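To see a kernel act as a feature detector, the sketch below slides the vertical line kernel from above over a tiny made-up image that contains a vertical line. It uses no padding and stride 1, and, as is usual in deep learning frameworks, it does not flip the kernel (strictly a cross-correlation). The function name and the image are illustrative.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


vertical_line_kernel = [[-1, 1, -1],
                        [-1, 1, -1],
                        [-1, 1, -1]]

# Hypothetical 4x5 image with a vertical line in the middle column.
image = [[0, 0, 1, 0, 0] for _ in range(4)]

response = convolve2d(image, vertical_line_kernel)
print(response)
```

The response is largest where the kernel is centered on the line, and the output is smaller than the input, which is the edge loss that padding addresses.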

Without Padding, we lose data at the edges

Padding the input data

Pooling: Max-pool

• For each distinct patch, represent it by the maximum

• 2x2 maxpool shown below

[Figure: 2x2 max-pooling of a small grid]
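Since the figure did not survive, here is a small stand-in for 2x2 max-pooling over non-overlapping patches. The input grid is invented for the example.

```python
def max_pool_2x2(grid):
    """Replace each distinct 2x2 patch by its maximum (rows and cols must be even)."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


grid = [[1, 3, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4]]

pooled = max_pool_2x2(grid)
print(pooled)
```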

CNN for Digit recognition

Source: http://cs231n.github.io/

Convolutional Neural Networks (CNN) for Image Recognition

LeNet-5

How many total weights in the network?

Conv1: 1*6*5*5 + 6 = 156
Conv3: 6*16*5*5 + 16 = 2416
FC1: 400*120 + 120 = 48120
FC2: 120*84 + 84 = 10164
FC3: 84*10 + 10 = 850
Total: 61706

Less than a single FC layer with [1200x1200] weights!
Note that convolutional layers have relatively few weights.
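The weight counts above follow two simple formulas: a convolutional layer has (input channels × output channels × kernel height × kernel width) weights plus one bias per output channel, and a fully connected layer has (inputs × outputs) weights plus one bias per output. A short sketch that reproduces the numbers (helper names are illustrative):

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square-kernel convolutional layer."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def fc_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


lenet5 = {
    "Conv1": conv_params(1, 6, 5),
    "Conv3": conv_params(6, 16, 5),
    "FC1": fc_params(400, 120),
    "FC2": fc_params(120, 84),
    "FC3": fc_params(84, 10),
}
total_params = sum(lenet5.values())
print(lenet5, total_params)
print(1200 * 1200)  # weights in a single 1200x1200 fully connected layer
```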

Differences between CNN and fully connected networks

CONVOLUTIONAL NEURAL NETWORK
• Each neuron connected to a small set of nearby neurons in the previous layer
• Uses same set of weights for each neuron
• Ideal for spatial feature recognition, e.g. image recognition
• Cheaper on resources due to fewer connections

FULLY CONNECTED NEURAL NETWORKS
• Each neuron is connected to every neuron in the previous layer
• Every connection has a separate weight
• Not optimal for detecting features
• Computationally intensive – heavy memory usage

Network architectures

AlexNet - Model Diagram

VGG16 Diagram

VGG

[Figure sequence: three stacked grids labeled Layer 1 (Input), Layer 2 and Layer 3, connected by 3x3 convolutions]

We can say that the “receptive field” of Layer 2 is 3x3. Each output has been influenced by a 3x3 patch of inputs.

What about on Layer 3?

This output on Layer 3 uses a 3x3 patch from Layer 2. How much from Layer 1 does it use?

Each square in Layer 3 “sees” a 5x5 grid from Layer 1.

VGG

• Two 3x3, stride 1 convolutions in a row → one 5x5
• Three 3x3 convolutions → one 7x7 convolution

One 3x3 layer: $3 \times 3 \times C \times C = 9C^2$
Three 3x3 layers: $3 \times (9C^2) = 27C^2$
One 7x7 layer: $7 \times 7 \times C \times C = 49C^2$

$49C^2 \rightarrow 27C^2$: ≈45% reduction!

Benefit: fewer parameters
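Both claims can be checked with a few lines of arithmetic. The sketch assumes stride 1 and the same channel count C in and out of every layer, with biases ignored as on the slide; the value of C is arbitrary and the function names are illustrative.

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the input patch seen after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def stacked_weights(num_layers, kernel_size, channels):
    """Weights in num_layers conv layers with C input and C output channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # arbitrary choice; the ratio does not depend on C
three_3x3 = stacked_weights(3, 3, channels)
one_7x7 = stacked_weights(1, 7, channels)
reduction = 1 - three_3x3 / one_7x7

print(receptive_field(2), receptive_field(3))
print(three_3x3, one_7x7, reduction)
```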

Inception V3 schematic

Inception

This whole “block” serves

the function of a previous

convolutional layer.

ResNet

• Add previous layer back in to current layer!
• Similar idea to “boosting”

examples

Unattended baggage detection using Intel® optimized Caffe*

Source: https://software.intel.com/en-us/articles/unattended-baggage-detection-using-deep-neural-networks-in-intel-architecture

Why are Deep Neural Networks called “Deep”?

Source: https://research.facebook.com/publications/deepface-closing-the-gap-to-human-level-performance-in-face-verification/

Example of CNN topologies

11/9/2017 Intel Confidential

GoogLeNet (2014)

[Diagram legend: Convolution, Pooling, Softmax, Other]

Source: Google white paper and Krizhevsky et al.

Diagnosis of heart disease using CNNs

Using 30 MRIs during one cardiac cycle from different axis views to predict VS and VD

Source: http://cs231n.stanford.edu/reports2016/331_Report.pdf

Diabetic Retinopathy diagnosis

A Kaggle competition solution from deepsense.io

Images from EyePACS

Source: https://deepsense.io/diagnosing-diabetic-retinopathy-with-deep-learning/

Intel® NERVANA™ AI PORTFOLIO


AI silicon positioning

Training (batch; many batch models)
• Train machine learning models across a diverse set of dense and sparse data
• Train large deep neural networks
• Train large models as fast as possible (LAKECREST)

Inference
• Batch: infer billions of data samples at a time and feed applications within ~1 day
• Stream: infer deep data streams with low latency in order to take action within milliseconds
• Edge: power-constrained environments (or other Intel® edge processor)

Diagram notes: “Option for higher throughput/watt”; “Required for lower latency”; “*” marks future products.

Intel® Movidius™ Neural Compute Stick

Get started: https://developer.movidius.com/

• Nervana Cloud: build an AI POC

• neon: train DL models quickly

• Intel Nervana Graph: any framework, any hardware

• Intel Nervana HW: industry leading AI, coming soon

neon deep learning framework: “deep learning by design”

Intel® Nervana™ Full stack platform

Multi-user collaboration

Interactive sessions

Model library

Fast training

Batch training

Experiment tracking

Multi-node distribution

Analytics & visualization

Hyperparameter optimization

Batch inference

Model compression

Inference deployment

Export to edge devices

Data curation/processing

Data partitioning

Data labeling

Accelerate time-to-solution by compressing both compute and labor-intensive steps in the innovation cycle to deliver scalable end-to-end AI solutions

Intel® Nervana™ Deep Learning Software


Intel® distribution of python* 2017

DL Framework Optimized for IA:

Tensorflow

Performance Optimization on Modern Platforms

Hierarchical Parallelism
• Coarse-grained / multi-node: domain decomposition
• Fine-grained parallelism / within node, sub-domain:
  1) Multi-level domain decomposition (ex. across layers)
  2) Data decomposition (layer parallelism)

Utilize all the cores
• OpenMP, MPI, TBB…
• Reduce synchronization events, serial code
• Improve load balancing

Vectorize/SIMD
• Unit strided access per SIMD lane
• High vector efficiency
• Data alignment

Efficient memory/cache use
• Blocking
• Data reuse
• Prefetching
• Memory allocation

Scaling
• Improve load balancing
• Reduce synchronization events, all-to-all comms

Example Challenge 1: Data Layout Has Big Impact on Performance

• Data layout impacts performance
  • Sequential access to avoid gather/scatter
  • Have iterations in the innermost loop to ensure high vector utilization
  • Maximize data reuse; e.g. weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout

Two 5x5 arrays:

21 18 32  6  3
 1  8  0  3 26
40  9 22 76 81
23 44 81 32 11
 5 38 10 11  1

 8 92 37 29 44
11  9 22  3 26
 3 47 29 88  1
15 16 22 46 12
29  9 13 11  1

Two ways to lay them out in memory:

21 18 … 1 .. 8 92 ..
vs
21 8 18 92 .. 1 11 ..

Better optimized for some operations
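The two orderings in the example can be reproduced by flattening the two arrays either one after the other or element by element. The names `planar` and `interleaved` are descriptive labels chosen here, not terms from the slide.

```python
channel_a = [[21, 18, 32, 6, 3],
             [1, 8, 0, 3, 26],
             [40, 9, 22, 76, 81],
             [23, 44, 81, 32, 11],
             [5, 38, 10, 11, 1]]

channel_b = [[8, 92, 37, 29, 44],
             [11, 9, 22, 3, 26],
             [3, 47, 29, 88, 1],
             [15, 16, 22, 46, 12],
             [29, 9, 13, 11, 1]]


def flatten(matrix):
    """Row-major flattening of a list of rows."""
    return [value for row in matrix for value in row]


# Layout 1: all of the first array, then all of the second.
planar = flatten(channel_a) + flatten(channel_b)

# Layout 2: alternate between the two arrays at every position.
interleaved = [value
               for pair in zip(flatten(channel_a), flatten(channel_b))
               for value in pair]

print(planar[:4], planar[24:27])
print(interleaved[:4], interleaved[10:12])
```

Which layout is faster depends on the access pattern: an operation that walks one array at a time reads contiguous memory in the first layout, while one that needs both values at each position reads contiguous memory in the second.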

Example Challenge 2: Minimize Conversions Overhead

• End to end optimization can reduce conversions
• Staying in optimized layout as long as possible becomes one of the tuning goals
  • Minimize the number of back and forth conversions
  • Use of graph optimization techniques

[Figure: a Convolution → Max Pool → Convolution pipeline with “Native to MKL layout” and “MKL layout to Native” conversions around the operations]

Optimizing TensorFlow & Other DL Frameworks for Intel® Architecture

• Leverage high performant compute libraries and tools
  • e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler etc.
• Data format/shape:
  • Right format/shape for max performance: blocking, gather/scatter
• Data layout:
  • Minimize cost of data layout conversions
• Parallelism:
  • Use all cores, eliminate serial sections, load imbalance
• Memory allocation
  • unique characteristics and ability to reuse buffers
• Data layer optimizations:
  • parallelization, vectorization, IO
• Optimize hyper parameters:
  • e.g. batch size for more parallelism
  • learning rate and optimizer to ensure accuracy/convergence

Initial Performance Gains on Modern Xeon (2 Sockets Broadwell - 22 Cores)

Benchmark             Metric        Batch  Baseline  Baseline   Optimized  Optimized  Speedup   Speedup
                                    size   training  inference  training   inference  training  inference
ConvNet-Alexnet       Images / sec  128    33.52     84.2       524        1696       15.6x     20.2x
ConvNet-GoogleNet v1  Images / sec  128    16.87     49.9       112.3      439.7      6.7x      8.8x
ConvNet-VGG           Images / sec  64     8.2       30.7       47.1       151.1      5.7x      4.9x

• Baseline using TensorFlow 1.0 release with standard compiler knobs

• Optimized performance using TensorFlow with Intel optimizations and built with
  • bazel build --config=mkl --copt="-DEIGEN_USE_VML"

Initial Performance Gains on Modern Xeon Phi (Knights Landing – 68 Cores)

Benchmark             Metric        Batch  Baseline  Baseline   Optimized  Optimized  Speedup   Speedup
                                    size   training  inference  training   inference  training  inference
ConvNet-Alexnet       Images / sec  128    12.21     31.3       549        2698.3     45x       86.2x
ConvNet-GoogleNet v1  Images / sec  128    5.43      10.9       106        576.6      19.5x     53x
ConvNet-VGG           Images / sec  64     1.59      24.6       69.4       251        43.6x     10.2x

• Baseline using TensorFlow 1.0 release with standard compiler knobs

• Optimized performance using TensorFlow with Intel optimizations and built with
  • bazel build --config=mkl --copt="-DEIGEN_USE_VML"

Additional Performance Gains from Parameters Tuning

• Data format: CPU prefers NCHW data format
• Intra_op, inter_op and OMP_NUM_THREADS: set for best core utilization
• Batch size: higher batch size provides for better parallelism
  • Too high a batch size can increase working set and impact cache/memory perf

Best Setting for Xeon (Broadwell – 2 Socket – 44 Cores)

Benchmark             Data format  Inter_op  Intra_op  KMP_BLOCKTIME  Batch size
ConvNet-Alexnet       NCHW         1         44        30             2048
ConvNet-Googlenet V1  NCHW         2         44        1              256
ConvNet-VGG           NCHW         1         44        1              128

Best Setting for Xeon Phi (Knights Landing – 68 Cores)

Benchmark             Data format  Inter_op  Intra_op  KMP_BLOCKTIME  OMP_NUM_THREADS  Batch size
ConvNet-Alexnet       NCHW         1         68        30             136              2048
ConvNet-Googlenet V1  NCHW         2         68        1              68               256
ConvNet-VGG           NCHW         1         68        1              136              128

Q&A

Social Media & Survey Prize Winners

Want to learn more?

Check out the Intel® Nervana™ AI Academy for students: software.intel.com/AIStudents

backup

Intel tools and libraries

Intel® Distribution for Python*

• Ready access to set of tools and techniques for high performance on Intel® Architecture

• Accelerated Python packages - NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py

• Integrated with Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL) and pyDAAL, Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB)

• Get out-of-the-box performance that is closer to native code speeds.

• Speed up data analytics with pyDAAL and parallelize Python workloads.

• Manage packages and Jupyter Notebooks easily with conda, Anaconda Cloud, and PIP.

Learn more: https://software.intel.com/en-us/intel-distribution-for-python


Intel® Math Kernel Library (MKL)

• Features highly optimized, threaded and vectorized functions to maximize performance on Intel® Architecture and compatible processors

• Linear Algebra, Fast Fourier Transforms (FFT), Neural Network, Vector Math and Statistics functions

• Standard APIs for immediate performance results

• Utilizes de facto standard C and Fortran APIs for compatibility with BLAS, LAPACK and FFTW functions from other math libraries

• Available with both free community-supported and paid support licenses

Learn more: https://software.intel.com/en-us/intel-mkl


Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)

• A library of DNN performance primitives optimized for Intel architectures

• A set of highly optimized building blocks intended to accelerate compute-intensive parts of deep learning applications, particularly DNN frameworks such as Caffe, Tensorflow, Theano and Torch

• Distributed as source code through GitHub

• Implemented in C++ and provides both C++ and C APIs

• Allows the functionality to be used from a wide range of high-level languages, such as Python or Java

Learn more: https://01.org/mkl-dnn/overview

Intel® Data Analytics Acceleration Library (Intel® DAAL)

• Features highly tuned functions for deep learning, classical machine learning, and data analytics performance across spectrum of Intel® architecture devices

• Intel® DAAL addresses all stages of the Big Data Ecosystem

• Includes Python*, C++, and Java* APIs and connectors to popular data sources including Spark* and Hadoop*

• Free and open source community-supported versions are available, as well as paid versions that include premium support.

Learn more: https://software.intel.com/en-us/intel-daal


Intel® Machine Learning Scaling Library for Linux* OS

• A library providing an efficient implementation of communication patterns used in deep learning.

• Built on top of MPI, allows for use of other communication libraries

• Optimized to drive scalability of communication patterns

• Works across various interconnects: Intel(R) Omni-Path Architecture, InfiniBand*, and Ethernet

• Common API to support Deep Learning frameworks (Caffe*, Theano*, Torch*, etc.)

Learn more: https://github.com/01org/MLSL


BigDL: Distributed Deep Learning Library for Apache Spark*

• Write deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters

• Rich deep learning support - numeric computing (via Tensor) and high level neural networks; load pre-trained Caffe or Torch models into Spark programs using BigDL

• Extremely high performance - uses Intel® MKL and multi-threaded programming in each Spark task

• Efficiently scale-out to “Big Data Scale” using Apache Spark

Learn more: https://github.com/intel-analytics/BigDL

Trusted analytics platform

• Facilitates data ingestion, preparation, and analysis with parallel processing and distributed analytics.

• The software leverages Apache Spark*, Intel® Data Analytics Acceleration Library, and Intel® Math Kernel Library for optimized distributed analytics and parallel processing on Intel® processors.

• Accelerates the modeling process with Intel optimized computational machine-learning and deep-learning algorithms, as well as graph operations, scoring engine, and pipelines.

• Integrates with industry-leading software frameworks such as Apache Spark, TensorFlow*, and Superset to expedite application development and enable deep-learning and visualization techniques.

Learn more: https://software.intel.com/en-us/bigdata/tap
