TRANSCRIPT
DO DEEP NETS REALLY NEED TO BE DEEP?
Meoni Marco – UNIPI – March 7th 2016
Lei Jimmy Ba University of Toronto
Rich Caruana Microsoft Research
PhD course in Deep Learning
NNs
[Architecture diagrams, each mapping Inputs to Outputs:]
• SNN: Single Hidden Layer
• DNN: Three Hidden Layers
• CNN: Three Hidden Layers above Convolutional/MaxPooling Layers
Introduction
• DNNs excel over SNNs
  • e.g. accuracy when trained on 1M labeled points is 91% vs 86%
• Where does the improvement of DNNs over SNNs come from?
  • Deep nets have more parameters?
  • Deep nets can learn more complex functions?
  • Does convolution give a plus?
Contribution
• It is possible to train an SNN that mimics the function of a DNN
  • model compression method
• Possible to mimic, but not able to train directly
  • SNNs can be as accurate as DNNs, even though it is not possible to train SNNs as accurate as DNNs on the original labeled data
• Is it necessary to be deep?
  • If an SNN can mimic a DNN, is the function learned by the DNN not that deep?
  • Success is related to the learning process
Model Compression
[Diagram: a complex model (DNN, CNN, …, or Ensemble) is trained on the data and produces scores; an SNN is then trained on the data using those scores as targets; finally the SNN alone is applied to produce labels.]
1. Build a complex model
2. Train a simple model to mimic the complex function
3. Apply it
• Compress large ensembles into smaller, faster models
• Train to learn the function learned by the larger model, not on the original labels
Model Compression (Bucilă, Caruana & Niculescu-Mizil 2006)
• Train a smaller model to mimic a larger, smarter model
  • train the smart model any way you want: DNN, CNN, or ensemble of CNNs
  • pass a large unlabeled data set through the model to collect its predictions (capture the function learned by the smart model; sketched below)
  • train the “small” model to mimic the large model on this newly labeled data
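A minimal sketch of this data-scoring step, assuming PyTorch; the teacher model and the unlabeled-data loader are placeholders, not something defined in the slides:

import torch

@torch.no_grad()
def collect_teacher_targets(teacher, unlabeled_loader, device="cpu"):
    # Pass large unlabeled data through the trained teacher ("smart" model)
    # and keep its raw predictions; these become the targets for the small model.
    teacher.eval()
    inputs, targets = [], []
    for x in unlabeled_loader:            # no true labels are needed here
        z = teacher(x.to(device))         # teacher predictions (logits)
        inputs.append(x.cpu())
        targets.append(z.cpu())
    return torch.cat(inputs), torch.cat(targets)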
Logits
• Model compression
  • train mimic SNNs using data labeled by DNNs
• DNN trained with softmax output and cross-entropy loss
• SNN trained on logits (log probability values before the softmax activation); see the training-step sketch below
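To make the contrast concrete, a hedged PyTorch sketch (layer sizes are arbitrary placeholders): the teacher is trained with cross-entropy on its softmax outputs, while the mimic SNN regresses directly on the teacher's logits with a squared loss.

import torch
import torch.nn as nn

# Teacher objective: cross-entropy on softmax outputs vs. the true class labels
# (nn.CrossEntropyLoss applies log-softmax internally, so it takes logits + labels).
ce_loss = nn.CrossEntropyLoss()

# Student (mimic) objective: L2 regression on the teacher's logits.
mse_loss = nn.MSELoss()

student = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 10))  # placeholder sizes
opt = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)

def mimic_step(x, teacher_logits):
    # One back-propagation / SGD-with-momentum update on the logit targets.
    opt.zero_grad()
    loss = mse_loss(student(x), teacher_logits)
    loss.backward()
    opt.step()
    return loss.item()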
SNN-MIMIC
• Training data: inputs paired with the teacher DNN's logits
• Objective function: squared loss between the student's output and the teacher's logits (formula below)
• Weights updated with back-propagation and SGD with momentum
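The slide's equations were not transcribed; as a hedged reconstruction following the Ba & Caruana paper, the training data are pairs (x^(t), z^(t)) of inputs and teacher logits, and the mimic objective over the T transfer examples is

\mathcal{L}(W, \beta) \;=\; \frac{1}{2T} \sum_{t} \left\| g\big(x^{(t)}; W, \beta\big) - z^{(t)} \right\|_2^2,
\qquad g(x; W, \beta) = \beta\, f(W x)

where W are the input-to-hidden weights, β the hidden-to-output weights, and f the hidden non-linearity.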
Speed-up Mimic Learning
• SNN with the same number of parameters: slow learning (GPU weeks)
• Add a bottleneck linear layer
  • k linear hidden units between the input and the non-linear hidden layer
  • factorize W ∈ ℝ^{H×D} into the product of two low-rank matrices
Cost Function with Linear Layer
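The formula itself is missing from the transcript; with the k-unit linear bottleneck the input weight matrix is factored as W ≈ UV and, along the lines of the paper, the cost becomes

\mathcal{L}(U, V, \beta) \;=\; \frac{1}{2T} \sum_{t} \left\| \beta\, f\big(U V x^{(t)}\big) - z^{(t)} \right\|_2^2,
\qquad U \in \mathbb{R}^{H \times k},\; V \in \mathbb{R}^{k \times D}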
• O(k(H+D)) memory instead of O(HD)
• Factorizing between the input and hidden layers is new and improves convergence speed during training
  • previous works factorize the last output layer
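A hedged PyTorch sketch of such a factorized input layer; the class and parameter names are mine, not from the slides:

import torch.nn as nn

class BottleneckLinear(nn.Module):
    # Replaces one H x D weight matrix W with the product U V:
    # a k-unit *linear* layer feeding the non-linear hidden layer,
    # so the parameter count drops from O(H*D) to O(k*(H+D)).
    def __init__(self, d_in, d_hidden, k):
        super().__init__()
        self.V = nn.Linear(d_in, k, bias=False)   # k x D factor
        self.U = nn.Linear(k, d_hidden)           # H x k factor

    def forward(self, x):
        # no activation between V and U, so U(V(x)) is still a linear map of x
        return self.U(self.V(x))

Because there is no non-linearity between the two factors, the composition stays linear in the input; per the slide, the gain is memory and faster convergence, not extra capacity.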
Use Cases
• TIMIT (phoneme recognition)
  • In: lexically/phonetically labeled sentences
  • Out: phonemes
• CIFAR-10 (image recognition)
  • In: images
  • Out: classes
TIMIT Phoneme Recognition
• 1845-dimension input vector from raw waveform audio data
• 183-dimension target label vectors (61 phonemes × 3)
• 1.1M examples in the training set
• DNN: 3 hidden layers with 2000 ReLU units
• CNN: convolutional + max-pooling + 3 hidden (2000 ReLU) layers
• ECNN: ensemble of 9 CNNs
• SNN: 8k/50k/400k non-linear hidden units (architectures sketched below)
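To make the compared models concrete, a hedged PyTorch sketch of the two simplest architectures listed above (layer widths from the slide; dropout, the convolutional front-end, and other training details are omitted):

import torch.nn as nn

D_IN, D_OUT = 1845, 183          # TIMIT input / target dimensions from the slide

# DNN: 3 hidden layers of 2000 ReLU units
dnn = nn.Sequential(
    nn.Linear(D_IN, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, D_OUT),
)

# SNN-MIMIC with 8k non-linear hidden units (the 50k/400k variants widen this layer)
snn_8k = nn.Sequential(
    nn.Linear(D_IN, 8000), nn.ReLU(),
    nn.Linear(8000, D_OUT),
)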
TIMIT - Compression Results
TIMIT - Accuracy
CIFAR-10 Image Recognition
• 3072-dimension input vector (32×32 pixels × 3 colors)
• 10-dimension target label vectors
• 1.05M images in two merged training sets
CIFAR-10 - Compression Results
Discussion
• Why MIMIC models can be more accurate than training on the original labels:
  • If the labels have errors, the teacher may eliminate them, making learning easier for the student
  • The teacher might resolve complex regions of the input space
  • Learning from probabilities (soft targets) is easier than from hard labels
  • Every target the student sees has a “reason” derivable from the inputs, since the teacher produced it, while the teacher may encounter unexplainable labels in the original data
Representational Power
“We see little evidence that shallow models have limited capacity or representational power. Instead, the main limitation appears to be the learning and regularization procedures used to train the shallow models.”
THANK YOU!