quoc le tera-scale deep learning
TRANSCRIPT
Tera-scale deep learning Quoc V. Le
Stanford University and Google
Samy Bengio, Zhenghao Chen, Tom Dean, Pang Wei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xu, Serena Yeung, Will Zou
Additional Thanks:
Greg Corrado, Jeff Dean, Matthieu Devin, Kai Chen
Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato
Paul Tucker, Ke Yang
Joint work with
Machine Learning successes
Face recognition, OCR, autonomous cars
Recommendation systems, web page ranking
Email classification
Classifier
Feature extraction (mostly hand-crafted features)

Hand-Crafted Features
Computer vision: SIFT/HOG, SURF, …
Speech recognition: MFCC, spectrogram, ZCR, …
New feature-designing paradigm
Unsupervised Feature Learning / Deep Learning
Reconstruction ICA
Expensive and typically applied to small problems
The Trend of Big Data
No matter the algorithm, more features are always more successful.
Outline
- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling-up results
Topographic Independent Component Analysis (TICA)
Input data: x, with first-layer filters W = [W_1; W_2; …; W_10000]
1. Feature computation: first-layer units compute squared filter responses (W_1^T x)^2, …, (W_9^T x)^2, …
2. Learning: second-layer units pool groups of squared responses, sqrt(sum over the pool of (W_i^T x)^2)
[Diagram: two-layer network mapping the input through squared filter responses to pooled units]
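The two-step computation above can be sketched in a few lines of NumPy. The sizes here are illustrative (the slide's network has 10,000 first-layer filters), and the disjoint pooling groups are a simplification of TICA's overlapping topographic pools:

```python
import numpy as np

n_input, n_filters, pool_size = 64, 16, 4      # illustrative sizes

rng = np.random.default_rng(0)
W = rng.standard_normal((n_filters, n_input))  # first-layer filters W_1..W_n
x = rng.standard_normal(n_input)               # one input patch

# 1. Feature computation: squared filter responses (W_i^T x)^2
z = (W @ x) ** 2

# 2. Learning/pooling: each second-layer unit is the square root of the
#    sum of a group of squared responses (disjoint groups for simplicity)
pooled = np.sqrt(z.reshape(-1, pool_size).sum(axis=1))
```

Each pooled unit responds whenever any filter in its group responds, which is what produces the invariance discussed on the next slide.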
Topographic Independent Component Analysis (TICA)
Invariance explained
[Diagram: the same edge at two locations; features F1 and F2 respond (1, 0) to Image1 at Loc1 and (0, 1) to Image2 at Loc2]
Pooled feature of F1 and F2:
Image1: sqrt(1^2 + 0^2) = 1
Image2: sqrt(0^2 + 1^2) = 1
Same value regardless of the location of the edge.
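The arithmetic on this slide can be checked directly: square-root pooling over F1 and F2 returns the same value for both images.

```python
import numpy as np

# Feature responses from the slide: the same edge at two locations.
f_loc1 = np.array([1.0, 0.0])   # Image1: F1 fires, F2 does not
f_loc2 = np.array([0.0, 1.0])   # Image2: F2 fires, F1 does not

def pool(f):
    """Square-root pooling over a group of feature responses."""
    return np.sqrt(np.sum(f ** 2))

# Same pooled value regardless of the edge's location.
v1, v2 = pool(f_loc1), pool(f_loc2)
```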
TICA:
minimize_W sum_i sum_k sqrt(sum_{j in pool k} (W_j^T x^(i))^2), subject to W W^T = I
Reconstruction ICA: swap the hard orthogonality constraint for a soft reconstruction cost:
minimize_W (lambda/m) sum_i ||W^T W x^(i) - x^(i)||^2 + sum_i sum_k sqrt(sum_{j in pool k} (W_j^T x^(i))^2)
Equivalence between Sparse Coding, Autoencoders, RBMs and ICA.
Build deep architectures by treating the output of one layer as the input to another layer.
Data whitening: the orthogonality constraint is only meaningful on whitened data; the reconstruction cost removes this requirement.
Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
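The slides mention data whitening without specifying the transform; below is a minimal ZCA-whitening sketch (one common choice) on synthetic correlated data. The epsilon regularizer is an assumption, not a value from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic data: 500 samples of a 20-dimensional signal.
X = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 20))

Xc = X - X.mean(axis=0)                        # center the data
C = Xc.T @ Xc / len(Xc)                        # sample covariance
eigval, eigvec = np.linalg.eigh(C)
eps = 1e-8                                     # guards tiny eigenvalues
W_zca = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
Xw = Xc @ W_zca                                # whitened: covariance ~ identity
```

After this transform the features are decorrelated with unit variance, which is the precondition the ICA orthogonality constraint relies on.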
Why RICA?
[Table comparing Sparse Coding, RBMs/Autoencoders, TICA, and Reconstruction ICA on three criteria: speed, invariant features, and ease of training; Reconstruction ICA is the only method with all three]
Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
Summary of RICA
- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features
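A minimal NumPy sketch of the RICA objective as summarized above: an unconstrained reconstruction cost plus a smooth-L1 sparsity penalty on the filter responses. The sizes and the lambda and epsilon values are illustrative, not the paper's settings:

```python
import numpy as np

def rica_cost(W, X, lam=0.1, eps=1e-2):
    # Reconstruction cost replaces ICA's hard constraint W W^T = I:
    # W^T W x should approximately reconstruct x.
    m = X.shape[1]
    recon = W.T @ (W @ X) - X
    # Smooth-L1 sparsity on the responses W x (sqrt(t^2 + eps) ~ |t|).
    sparsity = np.sqrt((W @ X) ** 2 + eps)
    return lam * np.sum(recon ** 2) / m + np.sum(sparsity) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 100))        # 16-dim patches, 100 examples
W = rng.standard_normal((32, 16)) * 0.1   # overcomplete: 32 filters > 16 dims
c = rica_cost(W, X)
```

Because the constraint is now a penalty, W can be overcomplete (more filters than input dimensions) and trained with standard unconstrained optimizers.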
Applications of RICA
Action recognition: Sit up, Drive car, Get out of car, Eat, Answer phone, Kiss, Run, Stand up, Shake hands
Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011
[Bar charts: accuracy on the KTH, Hollywood2, UCF, and YouTube action-recognition benchmarks; learned features match or beat engineered features (pLSA, HOG/HOF, HOG3D, GRBMs, 3DCNN, HMAX, Hessian/SURF) as well as combined engineered features]
Le, et al., Learning hierarchical spa1o-‐temporal features for ac1on recogni1on with independent subspace analysis. CVPR 2011
Cancer classification
Tissue classes: Apoptotic, Viable tumor region, Necrosis, …
Le, et al., Learning Invariant Features for Tumor Signatures. ISBI 2012
[Bar chart (axis 84–92%): RICA outperforms hand-engineered features on tumor classification]
Scaling up deep RICA networks
Scaling up Deep Learning
[Plot contrasting the scale of real-world data with the scale of typical deep learning datasets]
No matter the algorithm, more features are always more successful.
It's better to have more features!
Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS 2011
Most are local features.
Local receptive field networks
[Diagram: RICA features computed over local patches of the image, partitioned across Machine #1–#4]
Le, et al., Tiled Convolutional Neural Networks. NIPS 2010
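A toy sketch of local receptive fields: each feature sees only a small image patch, so the weight matrix is block-sparse and the blocks (here one per patch) could be assigned to different machines. The sizes, the non-overlapping patches, and the uniform filter are all illustrative simplifications:

```python
import numpy as np

img, rf = 8, 4   # 8x8 image, 4x4 receptive fields (illustrative)
x = np.random.default_rng(0).standard_normal((img, img))

features = []
for r in range(0, img, rf):          # non-overlapping patches for simplicity
    for c in range(0, img, rf):
        patch = x[r:r + rf, c:c + rf].ravel()
        w = np.ones(rf * rf) / (rf * rf)   # one toy averaging filter per patch
        features.append(w @ patch)         # a local feature sees only its patch
```

Because no feature touches pixels outside its patch, the computation partitions cleanly across machines, with communication only where receptive fields meet.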
Challenges with 1000s of machines
Asynchronous Parallel SGDs
[Diagram: model replicas exchanging gradients and parameters with a central parameter server]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Summary of Scaling up
- Local connectivity
- Asynchronous SGDs
… And more:
- RPC vs MapReduce
- Prefetching
- Single vs double precision
- Removing slow machines
- Optimized Softmax
- …
Training
Dataset: 10 million 200x200 unlabeled images from YouTube/Web
Trained on 2,000 machines (16,000 cores) for 1 week
1.15 billion parameters
- 100x larger than previously reported
- Small compared to the visual cortex
[Diagram of one layer: image size = 200, 3 input channels, receptive field size = 18, 8 output channels/maps, pooling size = 5, LCN size = 5; the output, a W×H image with 8 channels, is the input to the layer above]
Le, et al., Building high-‐level features using large-‐scale unsupervised learning. ICML 2012
[Diagram: three RICA layers stacked on top of the image]
The face neuron
[Figure: top stimuli from the test set, and the optimal stimulus found by numerical optimization]
[Histogram: frequency of feature values for faces vs. random distractors]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Invariance properties
[Plots: feature response vs. horizontal shifts (0–20 pixels), vertical shifts (0–20 pixels), 3D rotation angle (0–90°), and scale factor (0.4x–1.6x)]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
[Figure: top stimuli from the test set, and the optimal stimulus found by numerical optimization]
[Histogram: frequency of feature values for pedestrians vs. random distractors]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
[Figure: top stimuli from the test set, and the optimal stimulus found by numerical optimization]
[Histogram: frequency of feature values for cat faces vs. random distractors]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet classification
22,000 categories; 14,000,000 images
Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
22,000 is a lot of categories…
… smoothhound, smoothhound shark, Mustelus mustelus; American smooth dogfish, Mustelus canis; Florida smoothhound, Mustelus norrisi; whitetip shark, reef whitetip shark, Triaenodon obesus; Atlantic spiny dogfish, Squalus acanthias; Pacific spiny dogfish, Squalus suckleyi; hammerhead, hammerhead shark; smooth hammerhead, Sphyrna zygaena; smalleye hammerhead, Sphyrna tudes; shovelhead, bonnethead, bonnet shark, Sphyrna tiburo; angel shark, angelfish, Squatina squatina, monkfish; electric ray, crampfish, numbfish, torpedo; smalltooth sawfish, Pristis pectinatus; guitarfish; roughtail stingray, Dasyatis centroura; butterfly ray; eagle ray; spotted eagle ray, spotted ray, Aetobatus narinari; cownose ray, cow-nosed ray, Rhinoptera bonasus; manta, manta ray, devilfish; Atlantic manta, Manta birostris; devil ray, Mobula hypostoma; grey skate, gray skate, Raja batis; little skate, Raja erinacea; …
Stingray
Manta ray
Best stimuli
[Figure: best stimuli for Features 1–5, with the one-layer architecture diagram repeated as an inset]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Best stimuli
[Figure: best stimuli for Features 6–9, with the one-layer architecture diagram repeated as an inset]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Best stimuli
[Figure: best stimuli for Features 10–13, with the one-layer architecture diagram repeated as an inset]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
0.005% Random guess
9.5% State-of-the-art (Weston, Bengio '11)
? Feature learning from raw pixels
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet 2009 (10k categories): best published result 17% (Sanchez & Perronnin '11); our method 20%. Using only 1,000 categories, our method > 50%.
0.005% Random guess
9.5% State-of-the-art (Weston, Bengio '11)
15.8% Feature learning from raw pixels
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
No matter the algorithm, more features are always more successful.
Other results
- We also have great features for:
  - Speech recognition
  - Word-vector embeddings for NLP
Conclusions
• RICA learns invariant features
• A face neuron emerges from totally unlabeled data, given enough training and data
• State-of-the-art performance on:
  - Action recognition
  - Cancer image classification
  - ImageNet
[Recap figures: action-recognition benchmarks, cancer classification, feature visualization, the face neuron, and ImageNet results (0.005% random guess, 9.5% best published result, 15.8% our method)]
References
• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.
• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features for Tumor Signatures. ISBI, 2012.
• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.
http://ai.stanford.edu/~quocle