Quoc Le, slides, MLconf 11/15/13
Large Scale Deep Learning
Quoc V. Le Google & CMU
Deep Learning

• Google is using Machine Learning
• Machine Learning is difficult
• It requires domain knowledge from human experts

Deep Learning:
• Great performance on many problems
• Works well with a large amount of data
• Requires less domain knowledge

Focus:
• Scale deep learning to bigger models and bigger problems

Quoc V. Le
What is Deep Learning?

[Diagram: input x (images, audio, texts, etc.) → first-layer features u = g(A x) → second-layer features v = g(B u) → …, where A and B are weight matrices and g is a nonlinearity.]
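The layered model on this slide can be sketched as a forward pass in NumPy. The layer sizes, the random weights, and the choice of tanh for the nonlinearity g are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    """Elementwise nonlinearity; the slides leave g unspecified, tanh is one common choice."""
    return np.tanh(z)

# Illustrative sizes (assumptions, not from the slides).
x = rng.standard_normal(784)         # input: e.g. pixels of a 28x28 image
A = rng.standard_normal((256, 784))  # first-layer weight matrix
B = rng.standard_normal((128, 256))  # second-layer weight matrix

u = g(A @ x)  # first-layer features:  u = g(A x)
v = g(B @ u)  # second-layer features: v = g(B u)

print(u.shape, v.shape)  # (256,) (128,)
```

Stacking more layers just repeats the same matrix-multiply-then-nonlinearity step.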
High-level features by Deep Learning

[Diagram: Training Data → Model. The model learns a feature hierarchy: Pixels → Edge detectors → … → Face detector, Cat detector.]

Quoc V. Le
Google’s DistBelief

Goal: train deep learning models on many machines.
Model: a multi-layered architecture.
• Forward pass to compute the features
• Backward pass to compute the gradient

DistBelief distributes a model across multiple machines and multiple cores.
[Diagram: Training Data → Model, partitioned per machine.]
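The forward/backward passes mentioned here can be sketched for a tiny one-layer model. The sizes, the tanh activation, and the squared-error loss are illustrative assumptions; the backward pass is just the chain rule, checked numerically at one entry:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-layer model with a squared-error loss; sizes are illustrative.
x = rng.standard_normal(10)
target = rng.standard_normal(4)
W = rng.standard_normal((4, 10)) * 0.1

# Forward pass: compute the features.
z = W @ x
h = np.tanh(z)
loss = 0.5 * np.sum((h - target) ** 2)

# Backward pass: compute the gradient of the loss w.r.t. W by the chain rule.
dh = h - target               # dL/dh
dz = dh * (1.0 - h ** 2)      # dL/dz, since tanh'(z) = 1 - tanh(z)^2
dW = np.outer(dz, x)          # dL/dW

# Numerical check of one gradient entry by finite differences.
eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
loss2 = 0.5 * np.sum((np.tanh(W2 @ x) - target) ** 2)
print(abs((loss2 - loss) / eps - dW[0, 0]))  # tiny discrepancy
```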
Model partition with DistBelief

[Diagram: the model is partitioned across machines, and each machine's partition is split across cores; training data feeds the model.]
Model partition with DistBelief

• Training with Stochastic Gradient Descent (SGD)
• Model parameters are partitioned across machines
• Can use up to 1,000 cores
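SGD itself is a simple loop: pick one example at random, compute its gradient, and take a small step. A minimal sketch on synthetic linear-regression data (all names, sizes, and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data; everything here is illustrative.
true_w = np.array([2.0, -3.0])
X = rng.standard_normal((1000, 2))
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(2)
lr = 0.1
for step in range(2000):
    i = rng.integers(len(X))          # pick one example at random
    grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x·w - y)^2
    w -= lr * grad                    # SGD update
print(w)  # close to [2, -3]
```

DistBelief runs this same update, but with the parameters and gradient computation partitioned across machines and cores.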
Model partition with DistBelief

But training is still slow on large data sets.
Can we add more parallelism?
Idea: train multiple models on different partitions of the data, and merge them.
Data partition with DistBelief

[Diagram: Data Shards → Model Workers → Parameter Server. Each worker sends an update ∆p; the server applies p′ = p + ∆p and sends the new parameters p′ back to the workers.]
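The parameter-server loop (workers send ∆p, the server applies p′ = p + ∆p) can be sketched with threads standing in for machines. This is a minimal single-process caricature of DistBelief-style data parallelism, not its implementation; the linear model, shard count, and learning rate are all assumptions:

```python
import threading

import numpy as np

class ParameterServer:
    """Holds the shared parameters p and applies updates p' = p + Δp."""

    def __init__(self, dim):
        self.p = np.zeros(dim)
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.p.copy()

    def apply(self, delta):  # p' = p + Δp
        with self.lock:
            self.p += delta

def worker(server, shard_X, shard_y, steps=500, lr=0.05):
    """One model worker: SGD on its own data shard, updates sent asynchronously."""
    rng = np.random.default_rng()
    for _ in range(steps):
        p = server.fetch()                          # possibly stale parameters
        i = rng.integers(len(shard_X))
        grad = (shard_X[i] @ p - shard_y[i]) * shard_X[i]
        server.apply(-lr * grad)                    # send Δp = -lr * grad

rng = np.random.default_rng(0)
true_p = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((4000, 3))
y = X @ true_p

server = ParameterServer(3)
shards = np.array_split(np.arange(len(X)), 4)       # 4 data shards, 4 workers
threads = [threading.Thread(target=worker, args=(server, X[s], y[s]))
           for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.p)  # approaches true_p despite asynchronous, stale updates
```

The key property illustrated: workers may compute gradients on slightly stale parameters, yet with a small step size training still converges.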
Data partition with DistBelief

• Model parallelism via model partitioning
• Data parallelism via data partitioning and asynchronous communication

DistBelief can scale to billions of examples and use 100,000 cores or more.
Thanks to its speed, DistBelief dramatically improves many applications.
Parallelism in DistBelief
Applications

• Voice Search
• Photo Search
• Text Understanding
Voice Search

[Diagram: speech frame → hidden layers with 1000s of nodes → classifier → label]
Photo Search

The cat detector made the front page of the New York Times.
[Example labels from Google+ Photo Search: seat-belt, Boston rocker, archery, shredder, amusement park, face, hammock]
Text understanding
Text understanding is very useful but also difficult.
To understand text, we should try to understand the meaning of words.
Deep Learning can learn the meaning of words as points in a ~100-D vector space.

[Diagram: 2-D visualization of word vectors; similar words lie close together, e.g. dolphin near whale and Clinton near Obama, while Paris lies elsewhere.]
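The "nearby words have similar meaning" idea can be illustrated with cosine similarity. The 3-D vectors below are made up for the example; real models learn ~100-D vectors from data so that neighborhoods like these emerge on their own:

```python
import numpy as np

# Toy word vectors, invented for illustration (real ones are learned).
vecs = {
    "dolphin": np.array([0.9, 0.1, 0.0]),
    "whale":   np.array([0.8, 0.2, 0.1]),
    "Obama":   np.array([0.0, 0.9, 0.3]),
    "Clinton": np.array([0.1, 0.8, 0.4]),
    "Paris":   np.array([0.1, 0.2, 0.9]),
}

def nearest(word):
    """Return the other word whose vector has the highest cosine similarity."""
    q = vecs[word]
    def cos(v):
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    return max((w for w in vecs if w != word), key=lambda w: cos(vecs[w]))

print(nearest("dolphin"))  # whale
print(nearest("Clinton"))  # Obama
```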
Predicting the next word in a sentence

[Diagram: input words "the cat sat on the" → word matrix E → hidden layers → classifier predicts the next word. E is a matrix of dimension |Vocab| × d.]
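A minimal sketch of this architecture: look up each context word in the word matrix E (|Vocab| × d), feed the concatenated vectors through a hidden layer, and let a softmax classifier score the next word. The vocabulary, sizes, and random (untrained) weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and illustrative sizes.
vocab = ["the", "cat", "sat", "on", "mat"]
V, d, h = len(vocab), 8, 16
E = rng.standard_normal((V, d)) * 0.1       # word matrix: |Vocab| x d
W1 = rng.standard_normal((h, 4 * d)) * 0.1  # hidden layer (4 context words)
W2 = rng.standard_normal((V, h)) * 0.1      # classifier over the vocabulary

def predict_next(context):
    """context: 4 word ids, e.g. 'the cat sat on'; returns P(next word)."""
    x = np.concatenate([E[i] for i in context])  # embedding lookup
    hidden = np.tanh(W1 @ x)
    logits = W2 @ hidden
    probs = np.exp(logits - logits.max())        # stable softmax
    return probs / probs.sum()

p = predict_next([0, 1, 2, 3])  # "the cat sat on" -> distribution over vocab
print(p.shape, p.sum())
```

Training would adjust E, W1, and W2 so the true next word gets high probability; the word vectors in E then pick up the meanings visualized on the previous slide.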
Visualizing the word vectors

• Example nearest neighbors from a model trained on Google News (queries: apple, Apple, iPhone)

Mikolov, Sutskever, Le. Learning the Meaning behind Words. Google Open Source Blog, 2013.
Relation Extraction
Machine Translation
Summary

• Model partition and data partition with DistBelief
• Applications: Voice Search, Photo Search, Text Understanding
Joint work with:
Greg Corrado, Jeff Dean, Matthieu Devin, Kai Chen, Rajat Monga, Andrew Ng, Paul Tucker, Ke Yang

Additional thanks:
Samy Bengio, Tom Dean, Josh Levenberg, Geoff Hinton, Tomas Mikolov, Mark Mao, Patrick Nguyen, Marc’Aurelio Ranzato, Mark Segal, Jon Shlens, Ilya Sutskever, Vincent Vanhoucke