
Page 1: Do deep nets really need to be deep?

DO DEEP NETS REALLY NEED TO BE DEEP?

Meoni Marco – UNIPI – March 7th 2016

Lei Jimmy Ba University of Toronto

Rich Caruana Microsoft Research

PhD course in Deep Learning

Page 2: Do deep nets really need to be deep?

NNs

[Diagram: inputs and outputs of three architectures — SNN: single hidden layer; DNN: three hidden layers; CNN: three hidden layers on top of convolutional/max-pooling layers]

Page 3: Do deep nets really need to be deep?

Introduction

• DNNs outperform SNNs
  • e.g. accuracy when trained on 1M labeled points: 91% vs 86%
• Where does the improvement of DNNs over SNNs come from?
  • Deep nets have more parameters?
  • Deep nets can learn more complex functions?
  • Does convolution give the edge?

Page 4: Do deep nets really need to be deep?

Contribution

• It is possible to train an SNN that mimics the function learned by a DNN
  • via a model compression method
• Possible to mimic, but not to train directly
  • SNNs can be as accurate as DNNs, even though it is not possible to train SNNs to that accuracy on the original labeled data
• Do nets really need to be deep?
  • If an SNN can mimic a DNN, the function the DNN learns may not be intrinsically deep
• The success of deep nets may be tied to the learning process rather than to representational power

Page 5: Do deep nets really need to be deep?

Model Compression

1. Build a complex model (DNN, CNN, …, or an ensemble) and train it on the labeled data
2. Train a simple model (SNN) to mimic the complex model's function, using its scores as targets
3. Apply the simple model

[Diagram: ensemble of DNNs/CNNs trained on data + labels produces scores; the SNN is trained on data + scores, then applied to new data alone]

• Compress large ensembles into smaller, faster models
• Train the small model to learn the function learned by the larger model, not the original labels

Page 6: Do deep nets really need to be deep?

Model Compression (Bucila, Caruana & Niculescu-Mizil 2006)

• Train a smaller model to mimic a larger, smarter model
  • train the smart model any way you want: DNN, CNN, or ensemble of CNNs
  • pass a large unlabeled data set through the model to collect its predictions (this captures the function learned by the smart model)
  • train the "small" model to mimic the large model on this newly labeled data (a minimal sketch follows)
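A minimal sketch of this two-stage procedure, assuming PyTorch; the function names, full-batch training loop, and hyperparameters are illustrative, not the paper's exact setup:

```python
import torch

def collect_teacher_scores(teacher, unlabeled_loader):
    """Pass unlabeled data through the trained teacher and record its
    raw (pre-softmax) outputs; these become the student's targets."""
    teacher.eval()
    inputs, scores = [], []
    with torch.no_grad():
        for x in unlabeled_loader:
            inputs.append(x)
            scores.append(teacher(x))        # scores, not hard labels
    return torch.cat(inputs), torch.cat(scores)

def train_student(student, inputs, scores, epochs=10, lr=1e-3):
    """Train the small model to reproduce the teacher's function,
    not the original labels."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(student(inputs), scores)
        loss.backward()
        opt.step()
    return student
```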

Page 7: Do deep nets really need to be deep?

Logits

• Model compression here: train mimic SNNs on data labeled by DNNs
• The DNN is trained with softmax outputs and cross-entropy loss
• The SNN is trained on the logits, the log-probability values before the softmax activation (see the example below)
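A small NumPy illustration of why the logits are the better regression target: softmax squashes large score gaps toward 0 and 1, discarding most of the information about which wrong classes the teacher considered nearly plausible:

```python
import numpy as np

z = np.array([10.0, 20.0, 30.0])        # teacher logits for 3 classes
p = np.exp(z - z.max())                 # numerically stable softmax
p /= p.sum()
print(p)  # ~[2e-9, 5e-5, 0.99995]: the 10-vs-20 gap all but vanishes
# Regressing the student on z preserves these relative differences.
```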

Page 8: Do deep nets really need to be deep?

SNN-MIMIC

• Training data: pairs (x, z) of inputs and the teacher's logits
• Objective function: squared error between the student's output and the teacher's logits (see below)
• Weights updated with back-propagation and SGD with momentum
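The slide's equation images did not survive the transcript; reconstructed from the Ba & Caruana paper, the mimic objective over T training pairs of inputs x^(t) and teacher logits z^(t) is:

```latex
% Student g with input-to-hidden weights W and hidden-to-output weights beta:
\mathcal{L}(W,\beta) = \frac{1}{2T}\sum_{t}
  \bigl\lVert g\bigl(x^{(t)};W,\beta\bigr) - z^{(t)} \bigr\rVert_2^2 ,
\qquad g(x;W,\beta) = \beta\, f(Wx)
```

where f is the hidden layer's non-linearity.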

Page 9: Do deep nets really need to be deep?

Speed-up Mimic Learning

• An SNN with the same number of parameters learns slowly (weeks of GPU time)
• Add a bottleneck linear layer (see the sketch below)
  • k linear hidden units between the input and the non-linear hidden layer
  • this factorizes W ∈ R^{H×D} into the product of two low-rank matrices
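A sketch of the factorization in PyTorch, sized like the 8k-unit SNN from the TIMIT experiments; the bottleneck size k is an illustrative assumption:

```python
import torch.nn as nn

D, H, k = 1845, 8000, 250   # input dim, hidden units; k is illustrative

# Direct input-to-hidden layer: a single H x D weight matrix
direct = nn.Linear(D, H)                 # O(H*D) weights

# Factorized: W ~= U V with V in R^{k x D} and U in R^{H x k};
# no activation in between, so the k-unit layer stays purely linear
factored = nn.Sequential(
    nn.Linear(D, k, bias=False),         # V: project the input down to k dims
    nn.Linear(k, H),                     # U: expand back up to H hidden units
)                                        # O(k*(H+D)) weights

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct), count(factored))    # ~14.8M vs ~2.5M parameters
```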

Page 10: Do deep nets really need to be deep?

Cost Function with Linear Layer

• O(k(H+D)) memory instead of O(HD)
• Factorizing between the input and hidden levels is new and improves convergence speed during training (the factorized objective is written out below)
• Previous works factorized only the last output layer
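Written out (again reconstructed from the paper), the factorized objective is:

```latex
% With W factorized as UV (V: k x D projection, U: H x k expansion):
\mathcal{L}(U,V,\beta) = \frac{1}{2T}\sum_{t}
  \bigl\lVert \beta\, f\bigl(U V x^{(t)}\bigr) - z^{(t)} \bigr\rVert_2^2
```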

Page 11: Do deep nets really need to be deep?

Use Cases

• TIMIT (phoneme recognition)
  • In: lexically/phonetically labeled sentences
  • Out: phonemes
• CIFAR-10 (image recognition)
  • In: images
  • Out: classes

Page 12: Do deep nets really need to be deep?

TIMIT Phoneme Recognition

• 1845-dimension input vectors from raw waveform audio data
• 183-dimension target label vectors (61 phonemes × 3)
• 1.1M examples in the training set
• DNN
  • 3 hidden layers with 2000 ReLU units each
• CNN
  • convolutional + max-pooling + 3 hidden (2000 ReLU) layers
• ECNN
  • ensemble of 9 CNNs
• SNN
  • 8k/50k/400k non-linear hidden units (DNN and SNN-MIMIC sketched below)
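One plausible PyTorch rendering of the two fully-connected architectures above; layer sizes follow the slide, while the mimic SNN's bottleneck size k is an assumption:

```python
import torch.nn as nn

D_IN, D_OUT = 1845, 183   # TIMIT input and target dimensions (from the slide)

# Teacher DNN: three hidden layers of 2000 ReLU units
dnn = nn.Sequential(
    nn.Linear(D_IN, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, D_OUT),
)

# Student SNN-MIMIC (8k variant): linear bottleneck, then one wide
# non-linear hidden layer, trained to regress the teacher's logits
k, H = 250, 8000
snn = nn.Sequential(
    nn.Linear(D_IN, k, bias=False),   # V: linear bottleneck
    nn.Linear(k, H), nn.ReLU(),       # U, then non-linearity f
    nn.Linear(H, D_OUT),              # beta: linear output (no softmax)
)
```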

Page 13: Do deep nets really need to be deep?

TIMIT - Compression Results

Page 14: Do deep nets really need to be deep?

TIMIT - Accuracy

Page 15: Do deep nets really need to be deep?

CIFAR-10 Image Recognition

• 3072-dimension input vectors (32×32 pixels × 3 color channels)
• 10-dimension target label vectors
• 1.05M images in two merged training sets

Page 16: Do deep nets really need to be deep?

CIFAR-10 - Compression Results

Page 17: Do deep nets really need to be deep?

Discussion

• Why can mimic models be more accurate than models trained on the original labels?
  • If some labels are erroneous, the teacher may eliminate those errors, making learning easier for the student
  • The teacher may resolve complex regions of the input space
  • Learning from probabilities is easier than learning from hard 0/1 labels
  • Every teacher output carries a "reason" the student can learn from, whereas the teacher itself may encounter unexplainable cases in the original labels

Page 18: Do deep nets really need to be deep?

Representational Power

"We see little evidence that shallow models have limited capacity or representational power. Instead, the main limitation appears to be the learning and regularization procedures used to train the shallow models."

Page 19: Do deep nets really need to be deep?

THANK YOU!