visual-semantic embeddings: some thoughts on language
TRANSCRIPT
1
@graphificRoelof Pieters
Visual-Semantic Embeddings:
Some Thoughts on Language
23 January 2015 KTH
Feeda
www.csc.kth.se/~roelof/
Language Understanding
2
Can we understand Language ?
1. Language is ambiguous: Every sentence has many possible interpretations.
2. Language is productive: We will always encounter new words or new constructions
3. Language is culturally specific
Some of the challenges in Language Understanding:
3
Can we understand Language ?
1. Language is ambiguous: Every sentence has many possible interpretations.
2. Language is productive: We will always encounter new words or new constructions
• plays well with others
VB ADV P NN
NN NN P DT
• fruit flies like a banana
NN NN VB DT NN
NN VB P DT NN
NN NN P DT NN
NN VB VB DT NN
• the students went to class
DT NN VB P NN
4
Some of the challenges in Language Understanding:
Can we understand Language ?
1. Language is ambiguous: Every sentence has many possible interpretations.
2. Language is productive: We will always encounter new words or new constructions
5
Some of the challenges in Language Understanding:
[Karlgren 2014, NLP Sthlm Meetup]
6
Can we understand Language ?
1. Language is ambiguous: Every sentence has many possible interpretations.
2. Language is productive: We will always encounter new words or new constructions
3. Language is culturally specific
Some of the challenges in Language Understanding:
7
ML: Traditional Approach
1. Gather as much LABELED data as you can get
2. Throw some algorithms at it (often just an SVM, and leave it at that)
3. If you actually tried more algorithms: pick the best
4. Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.)
5. Repeat…
For each new problem/question (a quick sketch of this pipeline follows below):
8
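To make the pipeline above concrete, here is a minimal sketch in scikit-learn; the feature matrix X and labels y are hypothetical placeholders, and the library choice is mine, not the slide's:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# hypothetical hand-engineered features and labels
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = Pipeline([
    ("pca", PCA(n_components=10)),   # dimensionality reduction (step 4)
    ("svm", SVC(kernel="rbf")),      # "throw an SVM at it" (step 2)
])
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))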
Machine Learning for NLP
Data
Classic Approach: Data is fed into a learning algorithm:
Learning Algorithm
9
Machine Learning for NLP
some of the (many) treebank datasets
source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks
10
Penn Treebank
That’s a lot of “manual” work:
11
• the students went to class
DT NN VB P NN
• plays well with others
VB ADV P NN
NN NN P DT
• fruit flies like a banana
NN NN VB DT NN
NN VB P DT NN
NN NN P DT NN
NN VB VB DT NN
With a lot of issues:
Penn Treebank
12
Machine Learning for NLP
[Pipeline diagram: Data → “Features” → Learning Algorithm (train set); “Features” → Classifier → Prediction (test set)]
13
Machine Learning for NLP
[Pipeline diagram repeated: “Features” → Learning Algorithm (train set); “Features” → Classifier → Prediction (test set)]
14
Deep Learning: Why for NLP ?
Beat state of the art at:
• Language Modeling (Mikolov et al. 2011) [WSJ AR task]
• Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
• Sentiment Classification (Socher et al. 2011)
and we have already known for some time that it works well for Computer Vision, of course:
• MNIST hand-written digit recognition (Ciresan et al. 2010)
• Image Recognition (Krizhevsky et al. 2012) [ImageNet]
15
One Model rules them all?
DL approaches have been successfully applied to:
Deep Learning: Why for NLP ?
Automatic summarization
Coreference resolution
Discourse analysis
Machine translation
Morphological segmentation
Named entity recognition (NER)
Natural language generation
Natural language understanding
Optical character recognition (OCR)
Part-of-speech tagging
Parsing
Question answering
Relationship extraction
Sentence boundary disambiguation
Sentiment analysis
Speech recognition
Speech segmentation
Topic segmentation and recognition
Word segmentation
Word sense disambiguation
Information retrieval (IR)
Information extraction (IE)
Speech processing
16
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
Semantics: Meaning
17
• NLP (at least rule-based/statistical approaches) mainly treats words as atomic symbols:
• or in vector space:
• also known as “one hot” representation.
• Its problem? (see the sketch below)
Word Representation
Love Candy Store
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]
Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
18
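A minimal NumPy sketch of that problem, using a toy vocabulary of my own (not from the slides): the dot product of any two distinct one-hot vectors is zero, so “Candy” and “Store” look completely unrelated.

import numpy as np

vocab = ["love", "candy", "store"]          # toy vocabulary (hypothetical)

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

candy, store = one_hot("candy"), one_hot("store")
print(np.dot(candy, store))                 # 0.0 -- no notion of similarity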
• Structure corresponds to meaning:
Structure and Meaning
19
• Semantics
• Syntax
20
NLP: what can we work with?
• Language models define probability distributions over (natural language) strings or sentences
• Joint and Conditional Probability
Language Model
21
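In symbols (standard textbook formulation, not copied from the slides): a language model assigns a joint probability to a word sequence via the chain rule, and an n-gram model approximates each conditional with a truncated history:

P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\;\approx\; \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})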
Word senses
What is the meaning of words?
• Most words have many different senses: dog = animal or sausage?
How are the meanings of different words related?
• Specific relations between senses: Animal is more general than dog.
• Semantic fields: money is related to bank
24
Word senses
Polysemy:
• A lexeme is polysemous if it has different related senses
• bank = financial institution or building
Homonyms:
• Two lexemes are homonyms if their senses are unrelated, but they happen to have the same spelling and pronunciation
• bank = (financial) bank or (river) bank
25
Word senses: relations
Symmetric relations:
• Synonyms: couch/sofa (two lemmas with the same sense)
• Antonyms: cold/hot, rise/fall, in/out (two lemmas with the opposite sense)
Hierarchical relations:
• Hypernyms and hyponyms: pet/dog (the hyponym, dog, is more specific than the hypernym, pet)
• Holonyms and meronyms: car/wheel (the meronym, wheel, is a part of the holonym, car; see the WordNet sketch below)
26
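These sense relations can be explored programmatically with WordNet; a small sketch assuming NLTK with the WordNet corpus downloaded (nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# polysemy / homonymy: "bank" has several senses
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())

dog = wn.synset("dog.n.01")
print(dog.hypernyms())                        # more general senses (hypernyms)
print(wn.synset("car.n.01").part_meronyms())  # parts of a car, e.g. a wheel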
Distributional representations
“You shall know a word by the company it keeps” (J. R. Firth 1957)
One of the most successful ideas of modern statistical NLP!
these words represent banking
• Hard (class based) clustering models
• Soft clustering models
27
Distributional hypothesis
He filled the wampimuk, passed it around and we all drank some
We found a little, hairy wampimuk sleeping behind the tree
(McDonald & Ramscar 2001)
28
Distributional semantics
Landauer and Dumais (1997), Turney and Pantel (2010), …
29
Distributional semantics
Distributional meaning as co-occurrence vector:
30
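A minimal sketch of such a co-occurrence vector, using a toy three-sentence corpus of my own and a symmetric window of two words (real systems use large corpora and weighting schemes such as PMI):

from collections import Counter, defaultdict

corpus = ["he deposited money at the bank",
          "the river bank was muddy",
          "she withdrew money from the bank"]
window = 2
cooc = defaultdict(Counter)

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[w][words[j]] += 1

print(cooc["bank"])   # the context counts act as the meaning vector of "bank"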
Distributional representations
• Taking it further:
• Continuous word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as a dense vector:
Candy =
31
Word Embeddings: Socher
Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:
32
Word Embeddings: Socher
Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born
33
• Can theoretically (given enough units) approximate “any” function
• and fit to “any” kind of data
• Efficient for NLP: hidden layers can be used as word lookup tables (see the sketch below)
• Dense distributed word vectors + efficient NN training algorithms:
• Can scale to billions of words !
Why Neural Networks for NLP?
34
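A tiny sketch of the “word lookup table” idea mentioned above, with a hypothetical vocabulary and dimensionality: the first layer is just an embedding matrix, and looking up a word is selecting a row.

import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}       # toy vocabulary
E = np.random.rand(len(vocab), 5)            # |V| x d embedding matrix ("lookup table")

sentence = ["the", "cat", "sat"]
vectors = E[[vocab[w] for w in sentence]]    # lookup = row selection
print(vectors.shape)                         # (3, 5)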
• Representation of words as continuous vectors has a long history (Hinton et al. 1986; Rumelhart et al. 1986; Elman 1990)
• First neural network language model: NNLM (Bengio et al. 2001; Bengio et al. 2003) based on earlier ideas of distributed representations for symbols (Hinton 1986)
How?
35
Word Embeddings: Socher
Vector Space Model
Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born ?
…
36
Compositionality
Principle of compositionality:
the “meaning (vector) of a complex expression (sentence) is determined by:
- the meanings of its constituent expressions (words) and
- the rules (grammar) used to combine them”
— Gottlob Frege (1848 - 1925)
37
• How do we handle the compositionality of language in our models?
38
Compositionality
• How do we handle the compositionality of language in our models?
• Recursion: the same operator (same parameters) is applied repeatedly on different components (see the sketch below)
39
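A minimal sketch of that recursive idea, with hypothetical 3-dimensional word vectors, a random shared composition matrix, and a hand-built binary parse; the point is that the same W and b are applied at every node:

import numpy as np

d = 3
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d))   # shared composition parameters
b = np.zeros(d)

def compose(left, right):
    # the same operator applied at every node of the tree
    return np.tanh(W @ np.concatenate([left, right]) + b)

# toy word vectors and a toy parse: ((the movie) (was great))
the, movie, was, great = (rng.standard_normal(d) for _ in range(4))
sentence_vec = compose(compose(the, movie), compose(was, great))
print(sentence_vec)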
Compositionality
• How do we handle the compositionality of language in our models?
• Option 1: Recurrent Neural Networks (RNN)
40
RNN 1: Recurrent Neural Networks
• How do we handle the compositionality of language in our models?
• Option 2: Recursive Neural Networks (also sometimes called RNN)
41
RNN 2: Recursive Neural Networks
• achieved SOTA in 2011 on Language Modeling (WSJ AR task) (Mikolov et al., INTERSPEECH 2011):
• and again at ASRU 2011:
42
Recurrent Neural Networks
“Comparison to other LMs shows that RNN LMs are state of the art by a large margin. Improvements increase with more training data.”
“[ RNN LM trained on a] single core on 400M words in a few days, with 1% absolute improvement in WER on state of the art setup”
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011). Recurrent neural network based language model
43
Recurrent Neural Networks
(simple recurrent neural network for LM)
input
hidden layer(s)
output layer
+ sigmoid activation function + softmax function (see the equations below):
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011). Recurrent neural network based language model
44
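The standard simple-recurrent (Elman-style) language model equations behind that diagram, written out here for clarity rather than copied from the slide: the hidden state mixes the current word with the previous state through a sigmoid, and the output layer is a softmax over the vocabulary.

s(t) = \sigma\big(U\,w(t) + W\,s(t-1)\big), \qquad
y(t) = \mathrm{softmax}\big(V\,s(t)\big)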
Recurrent Neural Networks
backpropagation through time
45
Recurrent Neural Networks
backpropagation through time
class based recurrent NN
[code (Mikolov’s RNNLM Toolkit) and more info: http://rnnlm.org/ ]
• Recursive Neural Network for LM (Socher et al. 2011; Socher 2014)
• achieved SOTA on the new Stanford Sentiment Treebank dataset (compared against many other models):
Recursive Neural Network
46
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
Recursive Neural Tensor Network
47
code & info: http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks
Socher, R., Liu, C.C., Ng, A.Y., Manning, C.D. (2011). Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Recursive Neural Tensor Network
48
• RNN (Socher et al. 2011a)
Recursive Neural Network
49
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
• RNN (Socher et al. 2011a)
• Matrix-Vector RNN (MV-RNN) (Socher et al., 2012)
Recursive Neural Network
50
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
• RNN (Socher et al. 2011a)
• Matrix-Vector RNN (MV-RNN) (Socher et al., 2012)
• Recursive Neural Tensor Network (RNTN) (Socher et al. 2013)
Recursive Neural Network
51
• negation detection:
Recursive Neural Network
52
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
Recurrent NN for Vector Space
Parse Tree
[Parse-tree figure: nodes NP, PP/IN, NP; leaves DT NN PRP$ NN]
53
Compositionality
Parse Tree
[Parse-tree figure: nodes NP, PP/IN, NP; leaves DT NN IN PRP$ NN]
54
Recurrent NN: Compositionality
Recurrent NN for Vector Space
[Parse-tree figure: nodes NP, IN, NP; leaves PRP NN DT NN]
55
Recurrent NN: Compositionality
Recurrent NN for Vector Space
[Parse-tree figure composing up to NP (S / ROOT), with “rules” and “meanings” combined at each node]
56
Recurrent NN: Compositionality
Recurrent NN for Vector Space
Vector Space + Word Embeddings: Socher
57
Recurrent NN: Compositionality
Recurrent NN for Vector Space
Vector Space + Word Embeddings: Socher
58
Recurrent NN for Vector Space
Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
59
Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
60
Word Embeddings: Collobert & Weston (2011)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011) . Natural Language Processing (almost) from Scratch
61
Multi-embeddings: Stanford (2012)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012)Improving Word Representations via Global Context and Multiple Word Prototypes
62
Linguistic Regularities: Mikolov (2013)
code & info: https://code.google.com/p/word2vec/ Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations
63
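A hedged sketch of these regularities using gensim’s word2vec implementation (linked above); it assumes gensim 4.x and a list of tokenized sentences, and with a toy corpus this small the analogy will not actually resolve to “queen”:

from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "man", "and", "a", "woman"]]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# vector("king") - vector("man") + vector("woman") should land near vector("queen")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))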
Word Embeddings for MT: Mikolov (2013)
Mikolov, T., Le, V. L., Sutskever, I. (2013) . Exploiting Similarities among Languages for Machine Translation
64
Recursive Deep Models & Sentiment: Socher (2013)
Socher, R., Perelygin, A., Wu, J., Chuang, J.,Manning, C., Ng, A., Potts, C. (2013) Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
code & demo: http://nlp.stanford.edu/sentiment/index.html
65
Paragraph Vectors: Le & Mikolov (2014)
Le, Q., Mikolov, T. (2014). Distributed Representations of Sentences and Documents
66
• add context (sentence, paragraph, document) to word vectors during training
Results on Stanford Sentiment Treebank dataset:
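A hedged sketch of the paragraph-vector idea with gensim’s Doc2Vec (assumes gensim 4.x; the two toy documents are mine, not the paper’s):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["a", "great", "movie"], tags=["d0"]),
        TaggedDocument(words=["a", "terrible", "movie"], tags=["d1"])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# infer a vector for a new paragraph and find the closest training document
vec = model.infer_vector(["an", "awful", "film"])
print(model.dv.most_similar([vec], topn=1))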
Global Vectors, GloVe: Stanford (2014)
Pennington, J., Socher, R., Manning, C.D. (2014). GloVe: Global Vectors for Word Representation
code & demo: http://nlp.stanford.edu/projects/glove/
[Figure: GloVe vs. word2vec results on the word analogy task: “similar accuracy”]
67
Dependency-based Embeddings: Levy & Goldberg (2014)
Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings
code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
- Syntactic Dependency Context
Australian scientist discovers star with telescope
- Bag of Words (BoW) Context
[Precision-recall plot comparing dependency-based and BoW contexts]
“Dependency-based embeddings have more functional similarities”
68
Joint Image-Word Embeddings
69
1. Multimodal representation learning
2. Generating descriptions of images
3. Ranking images and captions (“image-sentence ranking”)
4. Evaluation Methods?
Some Current Approaches
70
• Learning joint image-word embeddings in a low-dimension embedding space (Weston et al 2010; Frome et al. 2013)
• Embedding images and sentences into a common space (Socher et al. TACL 2014; Karpathy et al. NIPS 2014)
• also: deep Boltzmann machines (Srivastava&Salakhutdinov NIPS 2012), log-bilinear neural language models (Kiros et al. ICML 2014), autoencoders (Ngiam ICML 2011), recurrent neural networks (m-RNN: Mao et al. 2014) and topic-models (Jia ICCV 2011)
1. Multimodal representation learning
71
• Neural network methods
• Composition-based methods.
• Template-based methods.
2. Generating descriptions of images
72
• Bi-directional approaches (Hockenmaier group): kernel CCA (Hodosh et al. JAIR 2013), normalized CCA (Gong et al. ECCV 2014)
• Dependency tree recursive networks (Stanford)(Socher et al. TACL 2014)
3. Ranking images and captions
73
• Current evaluation methods are unreliable and don’t match human judgements
• BLEU (see the sketch below)
• ROUGE
• Perplexity??
• Ranking as proxy for generation / scoring function
4. Evaluation?
74
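For reference, a small sketch of computing one of the metrics just listed, sentence-level BLEU, with NLTK (bigram weights chosen here so the toy example does not degenerate; both captions are made up):

from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "dog", "runs", "on", "the", "beach"]]         # human caption(s)
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]  # generated caption
print(sentence_bleu(reference, candidate, weights=(0.5, 0.5)))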
• unsupervised pre-training on many images
• in parallel, train a neural network language model
• train a linear mapping between image representations and word embeddings, representing the different “classes” (see the sketch below)
75
Zero-shot Learning
DeViSE model (Frome et al. 2013)
• skip-gram text model on wikipedia corpus of 5.7 million documents (5.4 billion words) - approach from (Mikolov et al. ICLR 2013)
76
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013). DeViSE: A Deep Visual-Semantic Embedding Model
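A minimal sketch of the linear-mapping idea behind zero-shot labeling, not the actual DeViSE training code (DeViSE uses a hinge rank loss and learned deep features): fit a least-squares map from image features into word-embedding space, then classify a new image by the nearest label embedding, including labels never seen in training. All arrays here are hypothetical placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_train, d_img, d_word = 100, 512, 50
img_feats = rng.standard_normal((n_train, d_img))    # CNN features of training images
word_vecs = rng.standard_normal((n_train, d_word))   # word embeddings of their labels

# least-squares linear map M from image space into word-embedding space
M, *_ = np.linalg.lstsq(img_feats, word_vecs, rcond=None)

label_embeddings = {"dog": rng.standard_normal(d_word),
                    "cat": rng.standard_normal(d_word),
                    "zebra": rng.standard_normal(d_word)}   # "zebra": class unseen in training

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

test_img = rng.standard_normal(d_img)
projected = test_img @ M
pred = max(label_embeddings, key=lambda lbl: cosine(projected, label_embeddings[lbl]))
print("predicted label:", pred)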
Encoder: A deep convolutional network (CNN) and long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding. Decoder: A new neural language model that combines structure and content vectors for generating words one at a time in sequence.
Encoder-Decoder pipeline (Kiros et al 2014)
77
Kiros, R., Salakhutdinov, R., Zemel, R.S. (2014). Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
• matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections
• new best results when using the 19-layer Oxford convolutional network.
• linear encoders: learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars)
Encoder-Decoder pipeline (Kiros et al 2014)
78
demo time (code adapted from: https://github.com/karpathy/neuraltalk)
79
• Theano - CPU/GPU symbolic expression compiler in python (from LISA lab at University of Montreal). http://deeplearning.net/software/theano/
• Pylearn2 - library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/
• Torch - Matlab-like environment for state-of-the-art machine learning algorithms in lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu) http://torch.ch/
• more info: http://deeplearning.net/software_links/
Wanna Play ? General Deep Learning
80
• RNNLM (Mikolov) http://rnnlm.org
• NB-SVM https://github.com/mesnilgr/nbsvm
• Word2Vec (skipgrams/cbow): https://code.google.com/p/word2vec/ (original), http://radimrehurek.com/gensim/models/word2vec.html (python)
• GloVe: http://nlp.stanford.edu/projects/glove/ (original), https://github.com/maciejkula/glove-python (python)
• Socher et al / Stanford RNN Sentiment code: http://nlp.stanford.edu/sentiment/code.html
• Deep Learning without Magic Tutorial: http://nlp.stanford.edu/courses/NAACL2013/
Wanna Play ? NLP
81
• cuda-convnet2 (Alex Krizhevsky, Toronto) (c++/CUDA, optimized for GTX 580) https://code.google.com/p/cuda-convnet2/
• Caffe (Berkeley) (Cuda/OpenCL, Theano, Python) http://caffe.berkeleyvision.org/
• OverFeat (NYU) http://cilvr.nyu.edu/doku.php?id=code:start
Wanna Play ? Computer Vision
82
In touch!
as PhD candidate KTH/CSC: “Always interested in discussing Machine Learning, Deep Architectures, Graphs, and Language Technology”
[email protected]
www.csc.kth.se/~roelof/
Academic/Research, Internship / Entrepreneurship
as CIO/CTO Feeda: “Always looking for additions to our brand new R&D team”
[Internships upcoming on KTH exjobb website…]
Feeda
83
We’re Hiring!
Feeda
• Dev Ops • Software Developers • Data Scientists
84