Quoc Le, slides, MLconf 11/15/13
Large Scale Deep Learning
Quoc V. Le Google & CMU
Deep Learning

• Google is using Machine Learning
• Machine Learning is difficult
• It requires domain knowledge from human experts

Deep Learning:
• Great performance on many problems
• Works well with a large amount of data
• Requires less domain knowledge

Focus:
• Scale deep learning to bigger models and bigger problems

Quoc V. Le
What is Deep Learning?

[Diagram: input x (images, audio, texts, etc.) → first-layer features u = g(A x) → second-layer features v = g(B u) → …, where A and B are weight matrices and g is a nonlinearity.]
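The layered model on this slide can be sketched as a forward pass in NumPy. The layer sizes, the random weights, and the choice of tanh for the nonlinearity g are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    """Elementwise nonlinearity; the slides leave g unspecified, tanh is one common choice."""
    return np.tanh(z)

# Illustrative sizes (assumptions, not from the slides).
x = rng.standard_normal(784)         # input: e.g. pixels of a 28x28 image
A = rng.standard_normal((256, 784))  # first-layer weight matrix
B = rng.standard_normal((128, 256))  # second-layer weight matrix

u = g(A @ x)  # first-layer features:  u = g(A x)
v = g(B @ u)  # second-layer features: v = g(B u)

print(u.shape, v.shape)  # (256,) (128,)
```

Stacking more layers just repeats the same matrix-multiply-then-nonlinearity step.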
High-level features by Deep Learning

[Diagram: Training Data → Model. The model learns a feature hierarchy: Pixels → Edge detectors → … → Face detector, Cat detector.]

Quoc V. Le
Google’s DistBelief

Goal: train deep learning models on many machines.
Model: a multi-layered architecture.
• Forward pass to compute the features
• Backward pass to compute the gradient

DistBelief distributes a model across multiple machines and multiple cores.
[Diagram: Training Data → Model, partitioned per machine.]
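The forward/backward passes mentioned here can be sketched for a tiny one-layer model. The sizes, the tanh activation, and the squared-error loss are illustrative assumptions; the backward pass is just the chain rule, checked numerically at one entry:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-layer model with a squared-error loss; sizes are illustrative.
x = rng.standard_normal(10)
target = rng.standard_normal(4)
W = rng.standard_normal((4, 10)) * 0.1

# Forward pass: compute the features.
z = W @ x
h = np.tanh(z)
loss = 0.5 * np.sum((h - target) ** 2)

# Backward pass: compute the gradient of the loss w.r.t. W by the chain rule.
dh = h - target               # dL/dh
dz = dh * (1.0 - h ** 2)      # dL/dz, since tanh'(z) = 1 - tanh(z)^2
dW = np.outer(dz, x)          # dL/dW

# Numerical check of one gradient entry by finite differences.
eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
loss2 = 0.5 * np.sum((np.tanh(W2 @ x) - target) ** 2)
print(abs((loss2 - loss) / eps - dW[0, 0]))  # tiny discrepancy
```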
Model partition with DistBelief

[Diagram: the model is partitioned across machines, and each machine's partition is split across cores; training data feeds the model.]
Model partition with DistBelief

• Training with Stochastic Gradient Descent (SGD)
• Model parameters are partitioned across machines
• Can use up to 1,000 cores
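SGD itself is a simple loop: pick one example at random, compute its gradient, and take a small step. A minimal sketch on synthetic linear-regression data (all names, sizes, and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data; everything here is illustrative.
true_w = np.array([2.0, -3.0])
X = rng.standard_normal((1000, 2))
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(2)
lr = 0.1
for step in range(2000):
    i = rng.integers(len(X))          # pick one example at random
    grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x·w - y)^2
    w -= lr * grad                    # SGD update
print(w)  # close to [2, -3]
```

DistBelief runs this same update, but with the parameters and gradient computation partitioned across machines and cores.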
Model partition with DistBelief

But training is still slow on large data sets.
Can we add more parallelism?
Idea: train multiple models on different partitions of the data, and merge them.
Data partition with DistBelief

[Diagram: Data Shards → Model Workers → Parameter Server. Each worker sends an update ∆p; the server applies p′ = p + ∆p and sends the new parameters p′ back to the workers.]
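The parameter-server loop (workers send ∆p, the server applies p′ = p + ∆p) can be sketched with threads standing in for machines. This is a minimal single-process caricature of DistBelief-style data parallelism, not its implementation; the linear model, shard count, and learning rate are all assumptions:

```python
import threading

import numpy as np

class ParameterServer:
    """Holds the shared parameters p and applies updates p' = p + Δp."""

    def __init__(self, dim):
        self.p = np.zeros(dim)
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.p.copy()

    def apply(self, delta):  # p' = p + Δp
        with self.lock:
            self.p += delta

def worker(server, shard_X, shard_y, steps=500, lr=0.05):
    """One model worker: SGD on its own data shard, updates sent asynchronously."""
    rng = np.random.default_rng()
    for _ in range(steps):
        p = server.fetch()                          # possibly stale parameters
        i = rng.integers(len(shard_X))
        grad = (shard_X[i] @ p - shard_y[i]) * shard_X[i]
        server.apply(-lr * grad)                    # send Δp = -lr * grad

rng = np.random.default_rng(0)
true_p = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((4000, 3))
y = X @ true_p

server = ParameterServer(3)
shards = np.array_split(np.arange(len(X)), 4)       # 4 data shards, 4 workers
threads = [threading.Thread(target=worker, args=(server, X[s], y[s]))
           for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.p)  # approaches true_p despite asynchronous, stale updates
```

The key property illustrated: workers may compute gradients on slightly stale parameters, yet with a small step size training still converges.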
Data partition with DistBelief

• Model parallelism via model partitioning
• Data parallelism via data partitioning and asynchronous communication

DistBelief can scale to billions of examples and use 100,000 cores or more.
Thanks to its speed, DistBelief dramatically improves many applications.
Parallelism in DistBelief
Applications

• Voice Search
• Photo Search
• Text Understanding
Voice Search

[Diagram: speech frame → hidden layers with 1000s of nodes → classifier → label]
Photo Search

The cat detector made the front page of the New York Times.
[Example labels from Google+ Photo Search: seat-belt, Boston rocker, archery, shredder, amusement park, face, hammock]
Text understanding
Text understanding is very useful but also difficult.
To understand text, we should try to understand the meaning of words.
Deep Learning can learn the meaning of words as points in a ~100-D vector space.

[Diagram: 2-D visualization of word vectors; similar words lie close together, e.g. dolphin near whale and Clinton near Obama, while Paris lies elsewhere.]
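The "nearby words have similar meaning" idea can be illustrated with cosine similarity. The 3-D vectors below are made up for the example; real models learn ~100-D vectors from data so that neighborhoods like these emerge on their own:

```python
import numpy as np

# Toy word vectors, invented for illustration (real ones are learned).
vecs = {
    "dolphin": np.array([0.9, 0.1, 0.0]),
    "whale":   np.array([0.8, 0.2, 0.1]),
    "Obama":   np.array([0.0, 0.9, 0.3]),
    "Clinton": np.array([0.1, 0.8, 0.4]),
    "Paris":   np.array([0.1, 0.2, 0.9]),
}

def nearest(word):
    """Return the other word whose vector has the highest cosine similarity."""
    q = vecs[word]
    def cos(v):
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    return max((w for w in vecs if w != word), key=lambda w: cos(vecs[w]))

print(nearest("dolphin"))  # whale
print(nearest("Clinton"))  # Obama
```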
Predicting the next word in a sentence

[Diagram: input words "the cat sat on the" → word matrix E → hidden layers → classifier predicts the next word. E is a matrix of dimension |Vocab| × d.]
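A minimal sketch of this architecture: look up each context word in the word matrix E (|Vocab| × d), feed the concatenated vectors through a hidden layer, and let a softmax classifier score the next word. The vocabulary, sizes, and random (untrained) weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and illustrative sizes.
vocab = ["the", "cat", "sat", "on", "mat"]
V, d, h = len(vocab), 8, 16
E = rng.standard_normal((V, d)) * 0.1       # word matrix: |Vocab| x d
W1 = rng.standard_normal((h, 4 * d)) * 0.1  # hidden layer (4 context words)
W2 = rng.standard_normal((V, h)) * 0.1      # classifier over the vocabulary

def predict_next(context):
    """context: 4 word ids, e.g. 'the cat sat on'; returns P(next word)."""
    x = np.concatenate([E[i] for i in context])  # embedding lookup
    hidden = np.tanh(W1 @ x)
    logits = W2 @ hidden
    probs = np.exp(logits - logits.max())        # stable softmax
    return probs / probs.sum()

p = predict_next([0, 1, 2, 3])  # "the cat sat on" -> distribution over vocab
print(p.shape, p.sum())
```

Training would adjust E, W1, and W2 so the true next word gets high probability; the word vectors in E then pick up the meanings visualized on the previous slide.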
Visualizing the word vectors

• Example nearest neighbors from a model trained on Google News (queries: apple, Apple, iPhone)

Mikolov, Sutskever, Le. Learning the Meaning behind Words. Google Open Source Blog, 2013.
Relation Extraction
Machine Translation
Summary

• Model partition and data partition with DistBelief
• Applications: Voice Search, Photo Search, Text Understanding
Joint work with:
Greg Corrado, Jeff Dean, Matthieu Devin, Kai Chen, Rajat Monga, Andrew Ng, Paul Tucker, Ke Yang

Additional thanks:
Samy Bengio, Tom Dean, Josh Levenberg, Geoff Hinton, Tomas Mikolov, Mark Mao, Patrick Nguyen, Marc’Aurelio Ranzato, Mark Segal, Jon Shlens, Ilya Sutskever, Vincent Vanhoucke