TRANSCRIPT
Deep Learning Explained
Module 2: Logistic Regression
Sayan D. Pathak, Ph.D., Principal ML Scientist, Microsoft
Roland Fernandez, Senior Researcher, Microsoft
Module Outline
Application: OCR with MNIST data
Model: Logistic Regression
Concepts: loss, minibatch, train-test-predict workflow
MNIST Handwritten Digits (OCR)
• Data set of handwritten digits (0-9) with:
✓ 60,000 training images
✓ 10,000 test images
• Each image is 28 x 28 pixels
[Figure: sample handwritten digit images with their corresponding labels]
Logistic Regression
Each 28 x 28 image is flattened into a vector of 784 pixels (x).
Model (W, b): z = W xᵀ + b
Model parameters: weights W (a 10 x 784 matrix, one row per digit class 0-9) and bias b (a vector of 10).
The model outputs one score per class, e.g.: 0.1 0.1 0.3 0.9 0.4 0.2 0.1 0.1 0.6 0.3
Logistic regression is a model that maps input features to discrete output classes, as opposed to linear regression, which predicts continuous values.
Logistic Regression
The score for class 0 is a weighted sum over all 784 pixels:
S₀ = sum(weights × pixels) = w₀ · xᵀ
where w₀ is the weight vector for class 0.
Each class has its own weight vector. For class 1:
S₁ = sum(weights × pixels) = w₁ · xᵀ
Stacking the ten weight vectors w₀ … w₉ as the rows of a 10 x 784 weight matrix W computes all ten class scores at once:
z = W xᵀ
Adding a bias b (one value per class, dimension 10) and an activation function that maps each score to the (0, 1) range, e.g. sigmoid, completes the model:
Model (W, b): z = W xᵀ + b
Example activation outputs: 0.1 0.1 0.3 0.9 0.4 0.2 0.1 0.1 0.6 0.3
Logistic Regression with Softmax
With no per-score activation (pass-through), the raw scores z₀ … z₉ from z = W xᵀ + b are converted into predicted probabilities (p) by the softmax function:
pᵢ = e^(zᵢ) / Σⱼ₌₀⁹ e^(zⱼ)
Example predicted probabilities: 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
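The softmax formula above can be sketched in a few lines of NumPy (the score values below are made up for illustration, not taken from the slide):

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Ten raw class scores standing in for z0..z9.
z = np.array([0.2, 0.2, 0.4, 1.0, 0.5, 0.3, 0.2, 0.2, 0.7, -1.0])
p = softmax(z)
# p sums to 1, every entry is positive, and the ordering of z is
# preserved, so argmax(p) == argmax(z).
```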
Loss Function
The label is one-hot encoded (Y); for the digit 3:
0 0 0 1 0 0 0 0 0 0
The model (W, b) produces predicted probabilities (p):
0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
Loss functions compare p to Y:
Squared error: se = Σⱼ₌₀⁹ (yⱼ − pⱼ)²
Cross-entropy error: ce = −Σⱼ₌₀⁹ yⱼ log(pⱼ)
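Using the one-hot label and the predicted probabilities shown above, both losses can be computed directly; a minimal NumPy sketch:

```python
import numpy as np

# One-hot label for the digit 3 and the example predicted probabilities.
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)
p = np.array([0.08, 0.08, 0.10, 0.17, 0.11, 0.09, 0.08, 0.08, 0.13, 0.01])

se = np.sum((y - p) ** 2)    # squared error
ce = -np.sum(y * np.log(p))  # cross entropy: only the true class
                             # contributes, so ce = -log(0.17)
```

Note that the cross entropy ignores every probability except the one assigned to the true class, which is why it pairs naturally with one-hot labels.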
Train Workflow
[Diagram: a data sampler draws features (x) and labels (Y) from the training data; the model z(params) computes predictions; the loss is computed and reported over iterations; while more training is needed, the learner updates the params and the loop repeats.]
Train Workflow (MNIST)
• Input features X: 128 x 784 (a minibatch of 128 samples)
• One-hot encoded labels Y: 128 x 10, e.g. for digits 3, 7, 8, 0:
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
• Model: z = times(X, W) + b, i.e. z = W Xᵀ + b, with weights W (784 x 10) and bias b (dimension 10)
• Loss: cross_entropy_with_softmax(z, Y)
• Error (optional): classification_error(z, Y)
• Trainer(model, (loss, error), learner)
• Trainer.train_minibatch({X, Y})
• Learners (sgd, adagrad, etc.) are solvers that estimate W and b
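The course drives training with CNTK's Trainer; as an illustrative sketch only (random stand-in data, not a real MNIST minibatch), one train_minibatch step amounts to a mini-batch SGD update like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in minibatch: 128 random "images" (not real MNIST data).
batch, n_pixels, n_classes = 128, 784, 10
X = rng.random((batch, n_pixels))
labels = rng.integers(0, n_classes, size=batch)
Y = np.eye(n_classes)[labels]            # one-hot, 128 x 10

W = np.zeros((n_pixels, n_classes))      # weights, 784 x 10
b = np.zeros(n_classes)                  # bias, dim 10

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# One mini-batch SGD step (conceptually what train_minibatch does).
lr = 0.01
z = X @ W + b                            # z = times(X, W) + b
p = softmax(z)
loss = -np.mean(np.sum(Y * np.log(p), axis=1))   # cross_entropy_with_softmax
error = np.mean(np.argmax(z, axis=1) != labels)  # classification_error

grad_z = (p - Y) / batch                 # d(mean loss)/dz
W -= lr * (X.T @ grad_z)                 # gradient step on W
b -= lr * grad_z.sum(axis=0)             # gradient step on b
```

After the update, recomputing the loss on the same minibatch gives a smaller value, which is exactly what each training iteration aims for.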
Learn the weights: Learners / Optimizers / Solvers
For 1 sample:
Loss Lᵢ = −Σⱼ₌₀⁹ yⱼ⁽ⁱ⁾ log(pⱼ), where pⱼ = f(x⁽ⁱ⁾; θ)ⱼ and θ ∈ (w, b)
For all samples (m = 60,000 images):
Total loss = Σᵢ₌₁ᵐ Lᵢ(θ; (x⁽ⁱ⁾, y⁽ⁱ⁾))
Convex function: there is one and only one minimum.
Fig. courtesy http://codingwiththomas.blogspot.com/2012/09/particle-swarm-optimization.html
Gradient Descent
θ′ = θ − μ grad(L; θ)
where θ = model parameters and μ = learning rate.
Computing the total loss (Σᵢ Lᵢ) over a large data set is expensive and often redundant
- refer to http://sebastianruder.com/optimizing-gradient-descent/ for details
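As a concrete illustration (the 1-D quadratic loss and starting point here are made up), applying the update θ′ = θ − μ grad(L; θ) repeatedly walks θ to the single minimum of a convex loss:

```python
# Minimize L(theta) = (theta - 3)^2; the unique minimum is theta = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)   # dL/dtheta

theta, mu = 0.0, 0.1             # initial parameter and learning rate
for _ in range(100):
    theta = theta - mu * grad(theta)
# theta has converged to (very nearly) 3.0
```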
Stochastic Gradient Descent (SGD)
SGD: update the parameters for each (data, label) pair.
Mini-batch SGD: update the parameters for each mini-batch, a set of (data, label) pairs.
Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details on different learners.
Other learners
Momentum SGD, Nesterov, Adagrad, Adadelta, Adam
Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details on different learners.
Image by: Alec Radford
Validation Workflow
[Diagram: the train loop from the Train Workflow, plus a validation loop: a second data sampler draws features (x) and labels (Y) from the validation data, the model is evaluated with the trained params, and results are reported until validation is done; the output is the final model.]
Test Workflow
[Diagram: a data sampler draws features (x) and labels (Y) from the test data; the final model is evaluated with the trained params, reporting until the test data is exhausted.]
Test Workflow (MNIST)
• Input features X*: 32 x 784 (a minibatch of 32 test samples)
• One-hot encoded labels Y*: 32 x 10, e.g. for digits 3, 7, 8, 0:
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
• Model: z = times(X*, W*) + b*, i.e. z = W* X*ᵀ + b*, using the weights W* (784 x 10) and bias b* (dimension 10) learned on the MNIST training set
• Trainer.test_minibatch({X*, Y*}) returns the classification error as the percentage of incorrectly labeled MNIST images.
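What test_minibatch reports can be sketched in NumPy (the score rows and labels below are made up for illustration):

```python
import numpy as np

# Made-up scores for a minibatch of 4 test samples (10 classes each),
# standing in for z = W* X*^T + b*, plus their true labels.
z = np.array([
    [0.1, 0.0, 0.2, 0.9, 0.1, 0.0, 0.1, 0.0, 0.2, 0.1],   # predicts 3
    [0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.8, 0.1, 0.0],   # predicts 7
    [0.2, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.7, 0.0],   # predicts 8
    [0.3, 0.1, 0.6, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0],   # predicts 2
])
true_labels = np.array([3, 7, 8, 0])

predicted = np.argmax(z, axis=1)                 # highest-scoring class
error = np.mean(predicted != true_labels) * 100  # percent incorrect
# One of four samples is mislabeled, so error is 25.0
```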
Prediction Workflow
A new MNIST image (here a 9) gives the input feature new X: 1 x 784. Evaluate the trained model (W, b):
Model.eval(new X)
Predicted softmax probabilities (predicted_label):
0.02 0.09 0.03 0.03 0.01 0.02 0.02 0.06 0.02 0.70
[ numpy.argmax(predicted_label) for predicted_label in predicted_labels ]
[9]
A batch of new MNIST images (e.g. 9, 5, 8, …, 2) gives the input feature new X: 25 x 784. Evaluating the trained model (W, b) with
Model.eval(new X)
returns the predicted softmax probabilities (predicted_label) for each image; taking the argmax per image yields the predicted digits:
[ numpy.argmax(predicted_label) for predicted_label in predicted_labels ]
[9, 5, 8, …, 2]
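The argmax step above, sketched with made-up softmax outputs standing in for the result of Model.eval:

```python
import numpy as np

# Made-up softmax probabilities for 3 images (one row per image).
predicted_labels = np.array([
    [0.02, 0.09, 0.03, 0.03, 0.01, 0.02, 0.02, 0.06, 0.02, 0.70],  # a 9
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05, 0.05, 0.05],  # a 5
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.82, 0.02],  # an 8
])

# The index of the largest probability in each row is the predicted digit.
digits = [int(np.argmax(p)) for p in predicted_labels]
```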