
Page 1: NIPS DeepLearningWorkshop NNforText

Learning Representations of Text using Neural Networks

Tomas Mikolov

Joint work with Ilya Sutskever, Kai Chen, Greg Corrado,

Jeff Dean, Quoc Le, Thomas Strohmann

Google Research

NIPS Deep Learning Workshop 2013

1 / 31

Page 2: NIPS DeepLearningWorkshop NNforText

Overview

Distributed Representations of Text

Efficient learning

Linguistic regularities

Examples

Translation of words and phrases

Available resources

2 / 31

Page 3: NIPS DeepLearningWorkshop NNforText

Representations of Text

Representation of text is very important for performance of many real-world applications. The most common techniques are:

Local representations:

N-grams

Bag-of-words

1-of-N coding

Continuous representations:

Latent Semantic Analysis

Latent Dirichlet Allocation

Distributed Representations

3 / 31

Page 4: NIPS DeepLearningWorkshop NNforText

Distributed Representations

Distributed representations of words can be obtained from various neural network based language models:

Feedforward neural net language model

Recurrent neural net language model

4 / 31

Page 5: NIPS DeepLearningWorkshop NNforText

Feedforward Neural Net Language Model

[Figure: feedforward NNLM with input, projection, hidden and output layers; the context words w(t-3), w(t-2), w(t-1) share the projection matrix U, and weight matrices V and W connect the later layers to predict w(t)]

Four-gram neural net language model architecture (Bengio 2001)

The training is done using stochastic gradient descent and backpropagation

The word vectors are in matrix U
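A minimal numpy sketch of the forward pass of such a model (toy sizes, untrained random weights; the variable names are illustrative, not taken from any original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, N = 10_000, 100, 500, 3               # vocab size, embedding dim, hidden units, context words (toy values)
U  = rng.normal(scale=0.01, size=(V, D))       # the word vectors live in this projection matrix
Wh = rng.normal(scale=0.01, size=(N * D, H))   # projection -> hidden weights
Wo = rng.normal(scale=0.01, size=(H, V))       # hidden -> output weights

def nnlm_probs(context):
    """Distribution over the next word w(t) given the indices of w(t-3), w(t-2), w(t-1)."""
    x = U[context].reshape(-1)        # look up and concatenate the projected context words
    h = np.tanh(x @ Wh)               # hidden layer
    logits = h @ Wo
    e = np.exp(logits - logits.max()) # softmax over the full vocabulary (the expensive part)
    return e / e.sum()

p = nnlm_probs([12, 7, 431])          # probabilities for every word in the vocabulary
```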

5 / 31

Page 6: NIPS DeepLearningWorkshop NNforText

Efficient Learning

The training complexity of the feedforward NNLM is high:

Propagation from projection layer to the hidden layer

Softmax in the output layer

Using this model just for obtaining the word vectors is very inefficient
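For reference, following the notation of the accompanying paper (Efficient Estimation of Word Representations in Vector Space), the per-training-example complexity of the feedforward NNLM is roughly

Q = N × D + N × D × H + H × V

where N is the number of context words, D the projection dimensionality, H the hidden-layer size and V the vocabulary size; the N × D × H and H × V terms dominate and are what the tricks on the next slide remove.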

6 / 31

Page 7: NIPS DeepLearningWorkshop NNforText

Efficient Learning

The full softmax can be replaced by:

Hierarchical softmax (Morin and Bengio)

Hinge loss (Collobert and Weston)

Noise contrastive estimation (Mnih et al.)

Negative sampling (our work)

We can further remove the hidden layer: for large models, this can provide an additional speedup of 1000x

Continuous bag-of-words model

Continuous skip-gram model
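As a rough illustration of the negative-sampling idea, here is a minimal numpy sketch of one skip-gram training step; the matrices, learning rate and sampling are toy placeholders, not the actual word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 100                              # vocabulary size, embedding dimensionality (toy values)
W_in  = rng.normal(scale=0.01, size=(V, D))     # "input" vectors (the word vectors)
W_out = np.zeros((V, D))                        # "output" (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One stochastic gradient step of skip-gram with negative sampling."""
    v = W_in[center]                             # vector of the current word
    idx = np.concatenate(([context], negatives))
    labels = np.zeros(len(idx)); labels[0] = 1.0 # 1 for the true context word, 0 for noise words
    u = W_out[idx]                               # output vectors of context + negative samples
    g = labels - sigmoid(u @ v)                  # gradient of the log-likelihood w.r.t. the scores
    W_in[center] += lr * (g @ u)                 # update the word vector
    W_out[idx]   += lr * np.outer(g, v)          # update the output vectors

# toy usage: word 5 with observed context word 17 and 5 random negative samples
sgns_step(5, 17, rng.integers(0, V, size=5))
```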

7 / 31

Page 8: NIPS DeepLearningWorkshop NNforText

Skip-gram Architecture

[Figure: skip-gram architecture with input, projection and output layers; the current word w(t) is projected and used to predict the surrounding words w(t-2), w(t-1), w(t+1), w(t+2)]

Predicts the surrounding words given the current word

8 / 31

Page 9: NIPS DeepLearningWorkshop NNforText

Continuous Bag-of-words Architecture

[Figure: continuous bag-of-words architecture with input, projection and output layers; the projections of the surrounding words w(t-2), w(t-1), w(t+1), w(t+2) are summed and used to predict the current word w(t)]

Predicts the current word given the context

9 / 31

Page 10: NIPS DeepLearningWorkshop NNforText

Efficient Learning - Summary

Efficient multi-threaded implementation of the new models greatly reduces the training complexity

The training speed is on the order of 100K - 5M words per second

Quality of word representations improves significantly with more training data

10 / 31

Page 11: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities in Word Vector Space

The word vector space implicitly encodes many regularities among words

11 / 31

Page 12: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities in Word Vector Space

The resulting distributed representations of words contain a surprising amount of syntactic and semantic information

There are multiple degrees of similarity among words:

KING is similar to QUEEN as MAN is similar to WOMAN

KING is similar to KINGS as MAN is similar to MEN

Simple vector operations with the word vectors provide very intuitive results
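A minimal sketch of such a vector operation, assuming `vectors` is a dict mapping words to numpy arrays loaded from some pretrained model (the helper name is ours, not part of any released tool):

```python
import numpy as np

def nearest(vectors, query, exclude, topn=1):
    """Return the words whose vectors have the highest cosine similarity to `query`,
    skipping the input words themselves (as done in the analogy evaluation)."""
    q = query / np.linalg.norm(query)
    scored = []
    for word, v in vectors.items():
        if word in exclude:
            continue
        scored.append((float(v @ q / np.linalg.norm(v)), word))
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# vectors: dict mapping word -> numpy array, e.g. loaded from a trained model (assumed to exist)
# analogy: king - man + woman should land near queen
# nearest(vectors, vectors["king"] - vectors["man"] + vectors["woman"],
#         exclude={"king", "man", "woman"})
```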

12 / 31

Page 13: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities - Results

Regularity of the learned word vector space is evaluated using a test set with about 20K questions

The test set contains both syntactic and semantic questions

We measure TOP1 accuracy (input words are removed during search)

We compare our models to previously published word vectors
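A sketch of how this TOP1 evaluation can be computed, assuming the same `vectors` dict as above and a list of (a, b, c, d) analogy questions; this mirrors the described procedure but is not the original evaluation script:

```python
import numpy as np

def analogy_accuracy(vectors, questions):
    """TOP1 accuracy on analogy questions (a : b :: c : d); the three input words
    are excluded from the nearest-neighbour search, as described above."""
    words = list(vectors)
    M = np.array([vectors[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalise once
    correct = 0
    for a, b, c, d in questions:
        q = vectors[b] - vectors[a] + vectors[c]
        sims = M @ (q / np.linalg.norm(q))
        for w in (a, b, c):                            # remove the input words from the search
            sims[words.index(w)] = -np.inf
        correct += words[int(np.argmax(sims))] == d
    return correct / len(questions)

# questions: list of (a, b, c, d) tuples such as ("man", "king", "woman", "queen")
```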

13 / 31

Page 14: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities - Results

Model                  Vector Dimensionality   Training Words   Training Time   Accuracy [%]
Collobert NNLM         50                      660M             2 months        11
Turian NNLM            200                     37M              few weeks       2
Mnih NNLM              100                     37M              7 days          9
Mikolov RNNLM          640                     320M             weeks           25
Huang NNLM             50                      990M             weeks           13
Our NNLM               100                     6B               2.5 days        51
Skip-gram (hier.s.)    1000                    6B               hours           66
CBOW (negative)        300                     1.5B             minutes         72

14 / 31

Page 15: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities in Word Vector Space

Expression                                Nearest token
Paris - France + Italy                    Rome
bigger - big + cold                       colder
sushi - Japan + Germany                   bratwurst
Cu - copper + gold                        Au
Windows - Microsoft + Google              Android
Montreal Canadiens - Montreal + Toronto   Toronto Maple Leafs

15 / 31

Page 16: NIPS DeepLearningWorkshop NNforText

Performance on Rare Words

Word vectors from neural networks were previously criticized for their poor performance on rare words

Scaling up training data set size helps to improve performance on rare words

For evaluation of progress, we have used the data set from Luong et al.: Better word representations with recursive neural networks for morphology, CoNLL 2013
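A sketch of this kind of evaluation, assuming `vectors` is a word-to-numpy-array dict and `pairs_with_ratings` holds (word1, word2, human rating) triples such as those in the Luong et al. data set (an illustration, not the official scorer):

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_correlation(vectors, pairs_with_ratings):
    """Spearman rank correlation between model cosine similarities and human
    similarity ratings, as used for the rare-words evaluation above."""
    model, human = [], []
    for w1, w2, rating in pairs_with_ratings:
        v1, v2 = vectors[w1], vectors[w2]
        model.append(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
        human.append(rating)
    return spearmanr(model, human).correlation
```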

16 / 31

Page 17: NIPS DeepLearningWorkshop NNforText

Performance on Rare Words - Results

Model                                   Correlation with Human Ratings (Spearman's rank correlation)
Collobert NNLM                          0.28
Collobert NNLM + Morphology features    0.34
CBOW (100B)                             0.50

17 / 31

Page 18: NIPS DeepLearningWorkshop NNforText

Rare Words - Examples of Nearest Neighbours

Model                 Redmond              Havel                    graffiti       capitulate
Collobert NNLM        conyers              plauen                   cheesecake     abdicate
                      lubbock              dzerzhinsky              gossip         accede
                      keene                osterreich               dioramas       rearm
Turian NNLM           McCarthy             Jewell                   gunfire        -
                      Alston               Arzu                     emotion        -
                      Cousins              Ovitz                    impunity       -
Mnih NNLM             Podhurst             Pontiff                  anaesthetics   Mavericks
                      Harlang              Pinochet                 monkeys        planning
                      Agarwal              Rodionov                 Jews           hesitated
Skip-gram (phrases)   Redmond Wash.        Vaclav Havel             spray paint    capitulation
                      Redmond Washington   president Vaclav Havel   grafitti       capitulated
                      Microsoft            Velvet Revolution        taggers        capitulating

18 / 31

Page 19: NIPS DeepLearningWorkshop NNforText

From Words to Phrases and Beyond

Often we want to represent more than just individual words: phrases, queries, sentences

The vector representation of a query can be obtained by:

Forming the phrases

Adding the vectors together

19 / 31

Page 20: NIPS DeepLearningWorkshop NNforText

From Words to Phrases and Beyond

Example query: restaurants in mountain view that are not very good

Forming the phrases: restaurants in (mountain view) that are (not very good)

Adding the vectors: restaurants + in + (mountain view) + that + are + (not very good)

Very simple and efficient

Will not work well for long sentences or documents
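A minimal sketch of the addition step, assuming phrases have already been joined into single tokens (e.g. mountain_view) in the vocabulary of the model:

```python
import numpy as np

def query_vector(vectors, tokens):
    """Represent a query by summing the vectors of its words/phrases.
    Out-of-vocabulary tokens are simply skipped in this sketch."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    for tok in tokens:
        if tok in vectors:
            total += vectors[tok]
    return total

# e.g. query_vector(vectors, ["restaurants", "in", "mountain_view",
#                             "that", "are", "not_very_good"])
```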

20 / 31

Page 21: NIPS DeepLearningWorkshop NNforText

Compositionality by Vector Addition

Expression          Nearest tokens
Czech + currency    koruna, Czech crown, Polish zloty, CTK
Vietnam + capital   Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines   airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river     Moscow, Volga River, upriver, Russia
French + actress    Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg

21 / 31

Page 22: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

We can visualize the word vectors by projecting them to 2D space

PCA can be used for dimensionality reduction

Although a lot of information is lost, the regular structure is often visible
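A small sketch of such a projection, using PCA computed via SVD and matplotlib for plotting (the word list passed in is just an example):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_2d(vectors, words):
    """Project selected word vectors to 2D with PCA and label the points."""
    X = np.array([vectors[w] for w in words])
    X = X - X.mean(axis=0)                     # centre before computing components
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                      # first two principal components
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), w in zip(coords, words):
        plt.annotate(w, (x, y))
    plt.show()

# e.g. plot_2d(vectors, ["king", "queen", "man", "woman", "prince", "princess"])
```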

22 / 31

Page 23: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection of word vectors showing male-female pairs: king-queen, he-she, male-female, actor-actress, hero-heroine, prince-princess, landlord-landlady, bull-cow, cock-hen]

23 / 31

Page 24: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection of word vectors showing verb tense patterns: give-gave-given, take-took-taken, draw-drew-drawn, fall-fell-fallen]

24 / 31

Page 25: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection of word vectors showing country-capital pairs: China-Beijing, Japan-Tokyo, France-Paris, Russia-Moscow, Germany-Berlin, Italy-Rome, Spain-Madrid, Greece-Athens, Turkey-Ankara, Poland-Warsaw, Portugal-Lisbon]

25 / 31

Page 26: NIPS DeepLearningWorkshop NNforText

Machine Translation

Word vectors should have similar structure when trained on comparable corpora

This should hold even for corpora in different languages

26 / 31

Page 27: NIPS DeepLearningWorkshop NNforText

Machine Translation - English to Spanish

[Figure: 2D projections of word vectors; one panel shows the English words cat, dog, cow, horse, pig, the other the Spanish words gato (cat), perro (dog), vaca (cow), caballo (horse), cerdo (pig), which form a similar spatial arrangement]

The figures were manually rotated and scaled

27 / 31

Page 28: NIPS DeepLearningWorkshop NNforText

Machine Translation

For translation from one vector space to another, we need to learn a linear projection (which will perform rotation and scaling)

A small starting dictionary can be used to train the linear projection

Then, we can translate any word that was seen in the monolingual data
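A least-squares sketch of learning such a projection from a small seed dictionary; this follows the idea described above, though the real experiments may differ in details such as the optimizer:

```python
import numpy as np

def learn_projection(src_vectors, tgt_vectors, dictionary):
    """Learn a linear map W taking source-language vectors close to the vectors of
    their translations, from a small seed dictionary of (src, tgt) word pairs."""
    X = np.array([src_vectors[s] for s, t in dictionary])   # source word vectors
    Z = np.array([tgt_vectors[t] for s, t in dictionary])   # target word vectors
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)                # minimise ||X W - Z||^2
    return W

def translate(word, W, src_vectors, tgt_vectors, topn=1):
    """Translate by projecting the source vector and taking the nearest target words."""
    q = src_vectors[word] @ W
    q /= np.linalg.norm(q)
    scored = sorted(((float(v @ q / np.linalg.norm(v)), t)
                     for t, v in tgt_vectors.items()), reverse=True)
    return [t for _, t in scored[:topn]]
```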

28 / 31

Page 29: NIPS DeepLearningWorkshop NNforText

MT - Accuracy of English to Spanish translation

[Figure: Precision@1 and Precision@5 of English to Spanish word translation as a function of the number of training words, from 10^7 to 10^10]

29 / 31

Page 30: NIPS DeepLearningWorkshop NNforText

Machine Translation

When applied to English to Spanish word translation, the accuracy is above 90% for the most confident translations

Can work for any language pair (we tried English to Vietnamese)

More details in the paper: Exploiting similarities among languages for machine translation

30 / 31

Page 31: NIPS DeepLearningWorkshop NNforText

Available Resources

The project webpage is code.google.com/p/word2vec

open-source code

pretrained word vectors (model for common words and phrases will be uploaded soon)

links to the papers
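For example, the released vectors can be loaded with third-party tools such as gensim (not part of the word2vec release itself; the file name below is a placeholder):

```python
# pip install gensim
from gensim.models import KeyedVectors

# load vectors stored in the word2vec binary format
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```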

31 / 31