Harnessing Neural Networks
Corinna Cortes, Google Research, NY
MLconf NYC 2017

TRANSCRIPT

Page 1:

Harnessing Neural Networks

Corinna Cortes, Google Research, NY

Page 2:

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Page 3:

Google’s mission is to organize the world’s information and make it universally accessible and useful.

Page 4:

Google Translate

Page 5:

Smart Reply in Inbox

10% of all responses sent on mobile

Page 6:

LSTM in Action

Page 7:

LSTMs and Extrapolation

They daydream or hallucinate :-)

Feature or bug?

Page 8:

DeepDream Art Auction and Symposium (A&MI)

Page 9:

Magenta

[Figure: unrolled LSTM cell diagram, with input x_t and output h_t]

A.I. Duet: https://aiexperiments.withgoogle.com/ai-duet/view/

Page 10:

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Page 11:

Restricting the Output. Smart Replies.
http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf

● Problem: the raw LSTM produces ungrammatical and inappropriate answers
○ "thanks hon!"; "Yup, got it thx"; "Leave me alone!"
● Work with a fixed response set
○ Sanitized answers are clustered into semantically similar groups using label propagation;
○ The answers in the clusters are used to filter the candidate set generated by the LSTM. Diversity is ensured by using top answers from different clusters.
● Efficient search via tries (see the sketch below)

Page 12:

Search Tree, Trie, for Valid Responses

[Figure: trie over clustered responses, with branches such as "I can do" → {Tuesday, Wednesday}, "How about" → {Tuesday?, Wednesday?}, and continuations ending in "." or "!" followed by "What time works for you?"]

Page 13:

Computational Complexity

● Exhaustive search: R × l
(R = size of the response set, l = length of the longest sentence)
● Beam search: b × l

Typical size of R: millions; typical size of b: 10-30.
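To make the saving concrete (illustrative numbers, using the orders of magnitude above): with R on the order of 10^6 responses and a beam of b = 20, beam search scores roughly 20 / 1,000,000 of the candidates an exhaustive pass would, a speedup of around 50,000× at this step.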

Page 14:

What if the Response Set is in the Billions?

● A more elegant solution based on rules
○ Exploit rules to efficiently enlarge the response set:
■ "Can you do Monday?" → "Yes, I can do Monday"
■ "Can you do Tuesday?" → "Yes, I can do Tuesday"
■ ...
○ Generalize to a template (a toy sketch follows below):
"Can you do <time>?" → "Yes, I can do <time>" or "No, I can do <time + 1>"
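As a toy illustration of the template idea (hypothetical, not Google's implementation; the pattern and the <time + 1> successor logic are assumptions for illustration):

```python
import re

# Toy sketch of rule-based response-set expansion.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

def next_day(day):
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]

def rule_responses(message):
    """One rule stands in for millions of literal (message, response) pairs."""
    match = re.match(r"Can you do (\w+)\?", message)
    if not match or match.group(1) not in DAYS:
        return []
    day = match.group(1)
    return [f"Yes, I can do {day}", f"No, I can do {next_day(day)}"]

print(rule_responses("Can you do Tuesday?"))
# ['Yes, I can do Tuesday', 'No, I can do Wednesday']
```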

Page 15:

Rules for Response Set

Text normalization for text-to-speech (TTS) systems
Navigation assistant

Page 16:

Text Normalization

Richard Sproat, Navdeep Jaitly, Google: "RNN Approaches to Text Normalization: A Challenge", https://arxiv.org/pdf/1611.00068.pdf

Page 17:

Break the Task in Two

● Channel model: what are the possible normalizations of a token? Maps a sequence of tokens to words.
○ Example: 123
■ one hundred twenty three, one two three, one twenty three, ...
● Language model: which normalization is appropriate in the given context? Maps words to words. (A toy sketch follows below.)
○ Example: 123
■ 123 King Ave. - the correct reading in American English would normally be one twenty three.
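A toy sketch of the two-stage decomposition (both components here are hypothetical stand-ins, not the paper's trained models): the channel model proposes verbalizations for a token, the language model scores them in context, and we keep the highest-scoring combination.

```python
# Toy sketch of the channel model + language model decomposition.

# Channel model: token -> possible verbalizations (with channel scores).
CHANNEL = {
    "123": [("one hundred twenty three", 0.5),
            ("one two three", 0.3),
            ("one twenty three", 0.2)],
}

def language_model_score(words, context):
    """Stand-in LM: prefer 'one twenty three' in street addresses."""
    if context.endswith("Ave.") and words == "one twenty three":
        return 0.9
    return 0.1

def normalize(token, context):
    candidates = CHANNEL.get(token, [(token, 1.0)])
    # Combine channel and LM scores; argmax over candidates.
    return max(candidates,
               key=lambda c: c[1] * language_model_score(c[0], context))[0]

print(normalize("123", "123 King Ave."))  # one twenty three
```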

Page 18:

Combining the Models

One combined LSTM

Page 19:

Silly Mistakes

Page 20:

Add a Grammar to Constrain the Output

Rule: <number> + <measurement abbreviation> => <number> + the possible verbalizations of the measurement abbreviation.

Instantiation: 24.2kg => twenty four point two kilogram, twenty four point two kilograms, twenty four point two kilo.

Finite State Transducers: finite state automata that produce output as well as read input; they support pattern matching and regular expressions.

Page 21:

Thrax Grammar

MEASURE: <number> + <measurement abbreviation> -> <number> + measurement verbalizations
Input: 5 kg -> five kilo/kilogram/kilograms

MONEY: $ <number> -> <number> dollars

The input is composed with the FSTs; the output of the FST is used to restrict the output of the LSTM (a toy sketch follows below).
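To illustrate the restriction step (a toy regular-expression stand-in for a real Thrax/FST grammar; the table contents and function names are illustrative assumptions):

```python
import re

# Toy stand-in for an FST grammar: for a measure expression, enumerate
# the verbalizations the grammar allows. Not real Thrax.
MEASURE_VERBALIZATIONS = {"kg": ["kilo", "kilogram", "kilograms"]}
NUMBER_WORDS = {"5": "five", "24": "twenty four"}

def grammar_allowed(token):
    """Return the set of verbalizations the grammar licenses, or None."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", token)
    if not match or match.group(2) not in MEASURE_VERBALIZATIONS:
        return None                      # grammar does not apply
    number = NUMBER_WORDS.get(match.group(1))
    if number is None:
        return None
    return {f"{number} {unit}"
            for unit in MEASURE_VERBALIZATIONS[match.group(2)]}

def restrict(lstm_hypotheses, token):
    allowed = grammar_allowed(token)
    if allowed is None:
        return lstm_hypotheses           # leave the LSTM unconstrained
    return [h for h in lstm_hypotheses if h in allowed]

print(restrict(["five kilograms", "five hours"], "5 kg"))
# ['five kilograms']
```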

Page 22:

TTS: RNN + FST

Measure and Money restricted by the grammar.

Page 23:

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Page 24:

Super-Multiclass Classification Problem

One class per image type (horse, car, …): M classes.
Output layer: M units. Last hidden layer: N units.

Neural network inference: just computing the last layer requires MN multiply-adds.
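For scale (illustrative numbers, not from the talk): with N = 1,024 hidden units and M = 100,000 classes, the last layer alone costs about 10^8 multiply-adds per example, which dominates inference.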

Page 25:

Asymmetric Hashing

Weight vectors W1, W2, W3, …, WM into the output layer, each partitioned into N/k chunks.

● Represent each chunk position with a set of (256) cluster centers found using k-means.
● Save the coordinates of the centers: (ID, coordinates).
● Save each weight vector as the sequence of its closest center IDs, its hash code.

Page 26:

[Repeat of the previous slide, now showing example hash codes for the weight vectors, e.g. 78 184 15, 12 63 192, 56 82 72, 201 37 51]

Page 27:

Asymmetric Hashing: Searching

● For a given activation vector u, divide it into its N/k chunks u_j:
○ Compute the distances from each chunk to the 256 centers: 256N multiply-adds, not MN.
○ Compute the distance to every hash code by summing the precomputed chunk distances: MN/k additions.
● The "asymmetric" in "asymmetric hashing" refers to the fact that we hash the weight vectors but not the activation vector (see the NumPy sketch below).
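A minimal NumPy sketch of the idea, i.e. product quantization with asymmetric distance computation (sizes and names are illustrative assumptions; a real system would train proper k-means centers):

```python
import numpy as np

# Minimal sketch of asymmetric hashing (product quantization).
rng = np.random.default_rng(0)
M, N, k, C = 1000, 64, 8, 256      # classes, hidden units, chunk dim, centers
n_chunks = N // k

W = rng.normal(size=(M, N))        # output-layer weight vectors W1..WM

# "Training": per chunk position, pick C centers (random rows stand in
# for k-means centroids here).
centers = np.stack([W[rng.choice(M, C, replace=False), j*k:(j+1)*k]
                    for j in range(n_chunks)])           # (n_chunks, C, k)

# Hash each weight vector: per chunk, the ID of its closest center.
codes = np.empty((M, n_chunks), dtype=np.int64)
for j in range(n_chunks):
    d = ((W[:, None, j*k:(j+1)*k] - centers[j][None]) ** 2).sum(-1)
    codes[:, j] = d.argmin(1)                            # (M,)

def approx_distances(u):
    """Asymmetric: u stays exact, only the weights are quantized."""
    # Lookup tables cost 256*N multiply-adds total, independent of M.
    tables = np.stack([((u[j*k:(j+1)*k] - centers[j]) ** 2).sum(-1)
                       for j in range(n_chunks)])        # (n_chunks, C)
    # Each weight vector then needs only N/k table additions: MN/k total.
    return tables[np.arange(n_chunks), codes].sum(1)     # (M,)

u = rng.normal(size=N)
print(approx_distances(u).shape)  # (1000,)
```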

Page 28:

Asymmetric Hashing

Incredible savings in inference time, sometimes even with a bit of improved accuracy.

Page 29:

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Page 30:

“Learning to Learn”, a.k.a. “Automated Hyperparameter Tuning”

Google: AdaNet; Neural Architecture Search with Reinforcement Learning
MIT: Designing Neural Network Architectures Using Reinforcement Learning
Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks

Approaches: genetic algorithms, reinforcement learning, boosting algorithms

Page 31:

Modeling Challenges for ML

The right model choice can significantly improve performance. For deep learning the choice is particularly hard: the search space is huge, and

● the optimization is difficult and non-convex;
● there is a lack of sufficient theory.

Questions
● Can neural network architectures be learned together with their weights?
● Can this problem be solved efficiently and in a principled way?
● Can we capture the end-to-end process?

Page 32:

AdaNet

● Incremental construction: at each round, the algorithm adds a subnetwork to the existing neural network;
● The algorithm leverages previously learned embeddings;
● It adaptively grows the network, balancing the trade-off between empirical error and model complexity;
● Learning bound (reproduced below):
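The bound itself appears on the slide as an image. From the AdaNet paper (Cortes et al., ICML 2017) it has roughly the following form, where f = Σ_k w_k · h_k is the learned network, ρ > 0 a margin, and 𝔑_m a Rademacher complexity; this is a paraphrase from memory, so consult the paper for the exact statement:

```latex
% Approximate form of the AdaNet generalization bound (Cortes et al., 2017):
% with probability at least 1 - \delta over a sample S of size m,
% for all f = \sum_{k=1}^{l} \mathbf{w}_k \cdot \mathbf{h}_k,
R(f) \;\leq\; \widehat{R}_{S,\rho}(f)
      \;+\; \frac{4}{\rho} \sum_{k=1}^{l} \|\mathbf{w}_k\|_1\,
            \mathfrak{R}_m\big(\widetilde{\mathcal{H}}_k\big)
      \;+\; \widetilde{O}\!\left(\frac{1}{\rho}\sqrt{\frac{\log l}{m}}\right)
```

Intuitively, subnetworks from richer families H̃_k can be added, but they pay a larger complexity penalty unless they sufficiently reduce the empirical margin error.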

Page 33:

Experimental Results, AdaNet

CIFAR-10: 60,000 images, 10 classes. Standard deviation of all numbers: 0.01.

Label Pair         AdaNet   Log. Reg.   NN
deer-truck         0.94     0.90        0.92
deer-horse         0.84     0.77        0.81
automobile-truck   0.85     0.80        0.81
cat-dog            0.69     0.67        0.66
dog-horse          0.84     0.80        0.81

Page 34:

Neural Architecture Search with RL

Page 35:

Neural Architecture Search with RL

Error rates on CIFAR-10

Perplexity on Penn Treebank

Current accuracy of NAS on ImageNet: 78%. State of the art: 80.x%.

Page 36:

“Learning to Learn”, a.k.a. “Automated Hyperparameter Tuning”

Google: AdaNet; Neural Architecture Search with Reinforcement Learning
MIT: Designing Neural Network Architectures Using Reinforcement Learning
Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks

Approaches: genetic algorithms, reinforcement learning, boosting algorithms

Page 37:

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?