
Page 1: NIPS DeepLearningWorkshop NNforText

Learning Representations of Text using Neural Networks

Tomas Mikolov

Joint work with Ilya Sutskever, Kai Chen, Greg Corrado,

Jeff Dean, Quoc Le, Thomas Strohmann

Google Research

NIPS Deep Learning Workshop 2013

1 / 31

Page 2: NIPS DeepLearningWorkshop NNforText

Overview

Distributed Representations of Text

Efficient learning

Linguistic regularities

Examples

Translation of words and phrases

Available resources

2 / 31

Page 3: NIPS DeepLearningWorkshop NNforText

Representations of Text

Representation of text is very important for performance of many real-world applications. The most common techniques are:

Local representations:

N-grams

Bag-of-words

1-of-N coding

Continuous representations:

Latent Semantic Analysis

Latent Dirichlet Allocation

Distributed Representations

3 / 31

Page 4: NIPS DeepLearningWorkshop NNforText

Distributed Representations

Distributed representations of words can be obtained from various neural network based language models:

Feedforward neural net language model

Recurrent neural net language model

4 / 31

Page 5: NIPS DeepLearningWorkshop NNforText

Feedforward Neural Net Language Model

[Figure: feedforward NNLM with input, projection, hidden and output layers; the context words w(t-3), w(t-2), w(t-1) share the projection matrix U, and weight matrices V and W connect the later layers to predict w(t)]

Four-gram neural net language model architecture (Bengio 2001)

The training is done using stochastic gradient descent and backpropagation

The word vectors are in matrix U
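A minimal numpy sketch of the forward pass of such a model (toy sizes, untrained random weights; the variable names are illustrative, not taken from any original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, N = 10_000, 100, 500, 3               # vocab size, embedding dim, hidden units, context words (toy values)
U  = rng.normal(scale=0.01, size=(V, D))       # the word vectors live in this projection matrix
Wh = rng.normal(scale=0.01, size=(N * D, H))   # projection -> hidden weights
Wo = rng.normal(scale=0.01, size=(H, V))       # hidden -> output weights

def nnlm_probs(context):
    """Distribution over the next word w(t) given the indices of w(t-3), w(t-2), w(t-1)."""
    x = U[context].reshape(-1)        # look up and concatenate the projected context words
    h = np.tanh(x @ Wh)               # hidden layer
    logits = h @ Wo
    e = np.exp(logits - logits.max()) # softmax over the full vocabulary (the expensive part)
    return e / e.sum()

p = nnlm_probs([12, 7, 431])          # probabilities for every word in the vocabulary
```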

5 / 31

Page 6: NIPS DeepLearningWorkshop NNforText

Efficient Learning

The training complexity of the feedforward NNLM is high:

Propagation from projection layer to the hidden layer

Softmax in the output layer

Using this model just for obtaining the word vectors is very inefficient
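For reference, following the notation of the accompanying paper (Efficient Estimation of Word Representations in Vector Space), the per-training-example complexity of the feedforward NNLM is roughly

Q = N × D + N × D × H + H × V

where N is the number of context words, D the projection dimensionality, H the hidden-layer size and V the vocabulary size; the N × D × H and H × V terms dominate and are what the tricks on the next slide remove.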

6 / 31

Page 7: NIPS DeepLearningWorkshop NNforText

Efficient Learning

The full softmax can be replaced by:

Hierarchical softmax (Morin and Bengio)

Hinge loss (Collobert and Weston)

Noise contrastive estimation (Mnih et al.)

Negative sampling (our work)

We can further remove the hidden layer: for large models, this can provide an additional speedup of 1000x

Continuous bag-of-words model

Continuous skip-gram model
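As a rough illustration of the negative-sampling idea, here is a minimal numpy sketch of one skip-gram training step; the matrices, learning rate and sampling are toy placeholders, not the actual word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 100                              # vocabulary size, embedding dimensionality (toy values)
W_in  = rng.normal(scale=0.01, size=(V, D))     # "input" vectors (the word vectors)
W_out = np.zeros((V, D))                        # "output" (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One stochastic gradient step of skip-gram with negative sampling."""
    v = W_in[center]                             # vector of the current word
    idx = np.concatenate(([context], negatives))
    labels = np.zeros(len(idx)); labels[0] = 1.0 # 1 for the true context word, 0 for noise words
    u = W_out[idx]                               # output vectors of context + negative samples
    g = labels - sigmoid(u @ v)                  # gradient of the log-likelihood w.r.t. the scores
    W_in[center] += lr * (g @ u)                 # update the word vector
    W_out[idx]   += lr * np.outer(g, v)          # update the output vectors

# toy usage: word 5 with observed context word 17 and 5 random negative samples
sgns_step(5, 17, rng.integers(0, V, size=5))
```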

7 / 31

Page 8: NIPS DeepLearningWorkshop NNforText

Skip-gram Architecture

[Figure: skip-gram architecture with input, projection and output layers; the current word w(t) is projected and used to predict the surrounding words w(t-2), w(t-1), w(t+1), w(t+2)]

Predicts the surrounding words given the current word

8 / 31

Page 9: NIPS DeepLearningWorkshop NNforText

Continuous Bag-of-words Architecture

[Figure: continuous bag-of-words architecture with input, projection and output layers; the projections of the surrounding words w(t-2), w(t-1), w(t+1), w(t+2) are summed and used to predict the current word w(t)]

Predicts the current word given the context

9 / 31

Page 10: NIPS DeepLearningWorkshop NNforText

Efficient Learning - Summary

Efficient multi-threaded implementation of the new models greatly reduces the training complexity

The training speed is on the order of 100K - 5M words per second

Quality of word representations improves significantly with more training data

10 / 31

Page 11: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities in Word Vector Space

The word vector space implicitly encodes many regularities among words

11 / 31

Page 12: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities in Word Vector Space

The resulting distributed representations of words contain a surprising amount of syntactic and semantic information

There are multiple degrees of similarity among words:

KING is similar to QUEEN as MAN is similar to WOMAN

KING is similar to KINGS as MAN is similar to MEN

Simple vector operations with the word vectors provide very intuitive results
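A minimal sketch of such a vector operation, assuming `vectors` is a dict mapping words to numpy arrays loaded from some pretrained model (the helper name is ours, not part of any released tool):

```python
import numpy as np

def nearest(vectors, query, exclude, topn=1):
    """Return the words whose vectors have the highest cosine similarity to `query`,
    skipping the input words themselves (as done in the analogy evaluation)."""
    q = query / np.linalg.norm(query)
    scored = []
    for word, v in vectors.items():
        if word in exclude:
            continue
        scored.append((float(v @ q / np.linalg.norm(v)), word))
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# vectors: dict mapping word -> numpy array, e.g. loaded from a trained model (assumed to exist)
# analogy: king - man + woman should land near queen
# nearest(vectors, vectors["king"] - vectors["man"] + vectors["woman"],
#         exclude={"king", "man", "woman"})
```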

12 / 31

Page 13: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities - Results

Regularity of the learned word vector space is evaluated using a test set with about 20K questions

The test set contains both syntactic and semantic questions

We measure TOP1 accuracy (input words are removed during search)

We compare our models to previously published word vectors
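A sketch of how this TOP1 evaluation can be computed, assuming the same `vectors` dict as above and a list of (a, b, c, d) analogy questions; this mirrors the described procedure but is not the original evaluation script:

```python
import numpy as np

def analogy_accuracy(vectors, questions):
    """TOP1 accuracy on analogy questions (a : b :: c : d); the three input words
    are excluded from the nearest-neighbour search, as described above."""
    words = list(vectors)
    M = np.array([vectors[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalise once
    correct = 0
    for a, b, c, d in questions:
        q = vectors[b] - vectors[a] + vectors[c]
        sims = M @ (q / np.linalg.norm(q))
        for w in (a, b, c):                            # remove the input words from the search
            sims[words.index(w)] = -np.inf
        correct += words[int(np.argmax(sims))] == d
    return correct / len(questions)

# questions: list of (a, b, c, d) tuples such as ("man", "king", "woman", "queen")
```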

13 / 31

Page 14: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities - Results

Model                  Vector Dimensionality   Training Words   Training Time   Accuracy [%]
Collobert NNLM         50                      660M             2 months        11
Turian NNLM            200                     37M              few weeks       2
Mnih NNLM              100                     37M              7 days          9
Mikolov RNNLM          640                     320M             weeks           25
Huang NNLM             50                      990M             weeks           13
Our NNLM               100                     6B               2.5 days        51
Skip-gram (hier.s.)    1000                    6B               hours           66
CBOW (negative)        300                     1.5B             minutes         72

14 / 31

Page 15: NIPS DeepLearningWorkshop NNforText

Linguistic Regularities in Word Vector Space

Expression                                Nearest token
Paris - France + Italy                    Rome
bigger - big + cold                       colder
sushi - Japan + Germany                   bratwurst
Cu - copper + gold                        Au
Windows - Microsoft + Google              Android
Montreal Canadiens - Montreal + Toronto   Toronto Maple Leafs

15 / 31

Page 16: NIPS DeepLearningWorkshop NNforText

Performance on Rare Words

Word vectors from neural networks were previously criticized for their poor performance on rare words

Scaling up training data set size helps to improve performance on rare words

For evaluation of progress, we have used the data set from Luong et al.: Better word representations with recursive neural networks for morphology, CoNLL 2013
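A sketch of this kind of evaluation, assuming `vectors` is a word-to-numpy-array dict and `pairs_with_ratings` holds (word1, word2, human rating) triples such as those in the Luong et al. data set (an illustration, not the official scorer):

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_correlation(vectors, pairs_with_ratings):
    """Spearman rank correlation between model cosine similarities and human
    similarity ratings, as used for the rare-words evaluation above."""
    model, human = [], []
    for w1, w2, rating in pairs_with_ratings:
        v1, v2 = vectors[w1], vectors[w2]
        model.append(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
        human.append(rating)
    return spearmanr(model, human).correlation
```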

16 / 31

Page 17: NIPS DeepLearningWorkshop NNforText

Performance on Rare Words - Results

Model                                   Correlation with Human Ratings (Spearman's rank correlation)
Collobert NNLM                          0.28
Collobert NNLM + Morphology features    0.34
CBOW (100B)                             0.50

17 / 31

Page 18: NIPS DeepLearningWorkshop NNforText

Rare Words - Examples of Nearest Neighbours

Model                 Redmond              Havel                    graffiti       capitulate
Collobert NNLM        conyers              plauen                   cheesecake     abdicate
                      lubbock              dzerzhinsky              gossip         accede
                      keene                osterreich               dioramas       rearm
Turian NNLM           McCarthy             Jewell                   gunfire        -
                      Alston               Arzu                     emotion        -
                      Cousins              Ovitz                    impunity       -
Mnih NNLM             Podhurst             Pontiff                  anaesthetics   Mavericks
                      Harlang              Pinochet                 monkeys        planning
                      Agarwal              Rodionov                 Jews           hesitated
Skip-gram (phrases)   Redmond Wash.        Vaclav Havel             spray paint    capitulation
                      Redmond Washington   president Vaclav Havel   grafitti       capitulated
                      Microsoft            Velvet Revolution        taggers        capitulating

18 / 31

Page 19: NIPS DeepLearningWorkshop NNforText

From Words to Phrases and Beyond

Often we want to represent more than just individual words: phrases, queries, sentences

The vector representation of a query can be obtained by:

Forming the phrases

Adding the vectors together

19 / 31

Page 20: NIPS DeepLearningWorkshop NNforText

From Words to Phrases and Beyond

Example query: restaurants in mountain view that are not very good

Forming the phrases: restaurants in (mountain view) that are (not very good)

Adding the vectors: restaurants + in + (mountain view) + that + are + (not very good)

Very simple and efficient

Will not work well for long sentences or documents
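A minimal sketch of the addition step, assuming phrases have already been joined into single tokens (e.g. mountain_view) in the vocabulary of the model:

```python
import numpy as np

def query_vector(vectors, tokens):
    """Represent a query by summing the vectors of its words/phrases.
    Out-of-vocabulary tokens are simply skipped in this sketch."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    for tok in tokens:
        if tok in vectors:
            total += vectors[tok]
    return total

# e.g. query_vector(vectors, ["restaurants", "in", "mountain_view",
#                             "that", "are", "not_very_good"])
```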

20 / 31

Page 21: NIPS DeepLearningWorkshop NNforText

Compositionality by Vector Addition

Expression          Nearest tokens
Czech + currency    koruna, Czech crown, Polish zloty, CTK
Vietnam + capital   Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines   airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river     Moscow, Volga River, upriver, Russia
French + actress    Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg

21 / 31

Page 22: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

We can visualize the word vectors by projecting them to 2D space

PCA can be used for dimensionality reduction

Although a lot of information is lost, the regular structure is often visible
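A small sketch of such a projection, using PCA computed via SVD and matplotlib for plotting (the word list passed in is just an example):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_2d(vectors, words):
    """Project selected word vectors to 2D with PCA and label the points."""
    X = np.array([vectors[w] for w in words])
    X = X - X.mean(axis=0)                     # centre before computing components
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                      # first two principal components
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), w in zip(coords, words):
        plt.annotate(w, (x, y))
    plt.show()

# e.g. plot_2d(vectors, ["king", "queen", "man", "woman", "prince", "princess"])
```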

22 / 31

Page 23: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection of word vectors showing male-female pairs: king-queen, he-she, male-female, actor-actress, hero-heroine, prince-princess, landlord-landlady, bull-cow, cock-hen]

23 / 31

Page 24: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection of word vectors showing verb tense patterns: give-gave-given, take-took-taken, draw-drew-drawn, fall-fell-fallen]

24 / 31

Page 25: NIPS DeepLearningWorkshop NNforText

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection of word vectors showing country-capital pairs: China-Beijing, Japan-Tokyo, France-Paris, Russia-Moscow, Germany-Berlin, Italy-Rome, Spain-Madrid, Greece-Athens, Turkey-Ankara, Poland-Warsaw, Portugal-Lisbon]

25 / 31

Page 26: NIPS DeepLearningWorkshop NNforText

Machine Translation

Word vectors should have similar structure when trained on comparable corpora

This should hold even for corpora in different languages

26 / 31

Page 27: NIPS DeepLearningWorkshop NNforText

Machine Translation - English to Spanish

[Figure: 2D projections of word vectors; one panel shows the English words cat, dog, cow, horse, pig, the other the Spanish words gato (cat), perro (dog), vaca (cow), caballo (horse), cerdo (pig), which form a similar spatial arrangement]

The figures were manually rotated and scaled

27 / 31

Page 28: NIPS DeepLearningWorkshop NNforText

Machine Translation

For translation from one vector space to another, we need to learn a linear projection (which will perform rotation and scaling)

A small starting dictionary can be used to train the linear projection

Then, we can translate any word that was seen in the monolingual data
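A least-squares sketch of learning such a projection from a small seed dictionary; this follows the idea described above, though the real experiments may differ in details such as the optimizer:

```python
import numpy as np

def learn_projection(src_vectors, tgt_vectors, dictionary):
    """Learn a linear map W taking source-language vectors close to the vectors of
    their translations, from a small seed dictionary of (src, tgt) word pairs."""
    X = np.array([src_vectors[s] for s, t in dictionary])   # source word vectors
    Z = np.array([tgt_vectors[t] for s, t in dictionary])   # target word vectors
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)                # minimise ||X W - Z||^2
    return W

def translate(word, W, src_vectors, tgt_vectors, topn=1):
    """Translate by projecting the source vector and taking the nearest target words."""
    q = src_vectors[word] @ W
    q /= np.linalg.norm(q)
    scored = sorted(((float(v @ q / np.linalg.norm(v)), t)
                     for t, v in tgt_vectors.items()), reverse=True)
    return [t for _, t in scored[:topn]]
```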

28 / 31

Page 29: NIPS DeepLearningWorkshop NNforText

MT - Accuracy of English to Spanish translation

[Figure: Precision@1 and Precision@5 of English to Spanish word translation as a function of the number of training words, from 10^7 to 10^10]

29 / 31

Page 30: NIPS DeepLearningWorkshop NNforText

Machine Translation

When applied to English to Spanish word translation, the accuracy is above 90% for the most confident translations

Can work for any language pair (we tried English to Vietnamese)

More details in the paper: Exploiting similarities among languages for machine translation

30 / 31

Page 31: NIPS DeepLearningWorkshop NNforText

Available Resources

The project webpage is code.google.com/p/word2vec

open-source code

pretrained word vectors (model for common words and phrases will be uploaded soon)

links to the papers
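For example, the released vectors can be loaded with third-party tools such as gensim (not part of the word2vec release itself; the file name below is a placeholder):

```python
# pip install gensim
from gensim.models import KeyedVectors

# load vectors stored in the word2vec binary format
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```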

31 / 31