
Mimicking Word Embeddings using Subword RNNs

Yuval Pinter, Robert Guthrie, Jacob Eisenstein

@yuvalpi

Presented at EMNLP, September 2017, Copenhagen

The Word Embedding Pipeline

Unlabeled corpus (Wikipedia, GigaWord, Reddit, ...)
→ Embedding model (vectors): W2V, GloVe, Polyglot, FastText, ...
→ Supervised task corpus: Penn TreeBank, SemEval, OntoNotes, Univ. Dependencies, ...
→ Task model: Tagging, Parsing, Sentiment, NER, ...

Assumed Pattern: supervised task text ⊂ unlabeled text ⊂ all possible text

Actual Pattern: some supervised task text falls outside the unlabeled text

● No pre-trained vectors

● Affects supervised tasks

● Multiple treatments suggested

● Our method: a compositional subword OOV model

Sources of OOVs

● Names: "Chalabi has increasingly marginalized within Iraq, ..."

● Domain-specific jargon: "Important species (...) include shrimp, (...) and some varieties of flatfish."

● Foreign words: "This term was first used in German (Hochrenaissance), ..."

● Rare morphological derivations: "Without George Martin the Beatles would have been just another untalented band as Oasis."

● Nonce words: "What if Google morphed into GoogleOS?"

● Nonstandard orthography: "We'll have four bands, and Big D is cookin'. lots of fun and great prizes."

● Typos and other errors: "I dislike this urban society and I want to leave this whole enviroment."

● ...

Common OOV handling techniques

● None (random init)

● One UNK to rule them all

○ Average existing embeddings

○ Trained with embeddings (stochastic unking)

● Add subword model during WE training

○ Bhatia et al. (2016), Wieting et al. (2016)

○ What if we don't have access to the original corpus? (e.g. FastText)
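The "one UNK" treatment above can be sketched in a few lines. This is a toy illustration with invented 2-d vectors, not the real Polyglot data: every OOV word falls back to the average of all in-vocabulary embeddings.

```python
# Toy sketch of UNK-averaging (invented vectors, not real embeddings).
def average_unk(embeddings):
    """Build a single UNK vector as the mean of all known embeddings."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for vec in embeddings.values():
        for i, x in enumerate(vec):
            total[i] += x
    n = len(embeddings)
    return [t / n for t in total]

def lookup(word, embeddings, unk):
    """Return the word's embedding, falling back to the shared UNK vector."""
    return embeddings.get(word, unk)

vocab = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
unk = average_unk(vocab)
print(lookup("flatfish", vocab, unk))  # -> [0.5, 0.5]
```

Note the weakness this talk addresses: every OOV word, whatever its spelling, receives the same vector.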

Char2Tag

● Add subword layer to the supervised task

○ Ling et al. (2015), Plank et al. (2016)

● OOVs benefit from the co-trained character model

● Requires a large supervised training set for efficient transfer to test-set OOVs

Enter MIMICK

● What data do we have, post-unlabeled corpus?

○ Vector dictionary

○ Orthography (the way words are spelled)

● Use the former as the training objective, the latter as input

● Pre-trained vectors as target

○ No need to access the original unlabeled corpus

○ Many training examples

○ (No context)

● Subword units as inputs

○ Very extensible

○ (Character inventory changes?)

MIMICK Training

(Diagram: the characters of an in-vocabulary word, e.g. m-a-k-e, are mapped to character embeddings and fed through a forward LSTM and a backward LSTM, then a multilayered perceptron; the resulting mimicked embedding is trained with an L2 loss against the word's pre-trained embedding from Polyglot/FastText/etc.)

MIMICK Inference

(Diagram: the characters of an OOV word, e.g. b-l-a-h, pass through the same character embeddings, bidirectional LSTM, and multilayered perceptron to produce a mimicked embedding.)
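The training and inference steps above can be sketched end to end. This is a toy stand-in with invented vectors: the actual model composes character embeddings with a bi-LSTM and an MLP, while here a simple bag-of-characters encoder plays that role, trained with the same L2 objective and then applied unchanged to an unseen spelling.

```python
import random

# Toy sketch of the MIMICK objective (all vectors invented; the bag-of-
# characters encoder is a stand-in for the talk's bi-LSTM + MLP).
random.seed(0)
DIM = 4  # embedding dimension (assumed)

targets = {  # stand-ins for pre-trained (Polyglot/FastText/etc.) vectors
    "make": [0.5, -0.2, 0.1, 0.3],
    "cake": [0.4, -0.1, 0.2, 0.3],
}
chars = sorted({c for w in targets for c in w})
char_emb = {c: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for c in chars}

def encode(word):
    """Sum the word's character embeddings (crude stand-in for the bi-LSTM)."""
    vec = [0.0] * DIM
    for c in word:
        for i, x in enumerate(char_emb.get(c, [0.0] * DIM)):
            vec[i] += x
    return vec

def train(epochs=500, lr=0.05):
    """Minimize the squared (L2) distance between mimicked and target vectors."""
    for _ in range(epochs):
        for word, target in targets.items():
            err = [p - t for p, t in zip(encode(word), target)]
            for c in word:  # gradient step on every character occurrence
                for i in range(DIM):
                    char_emb[c][i] -= lr * 2 * err[i]

train()
oov_vec = encode("mace")  # inference on a spelling never seen in training
```

The point of the sketch is the data flow: training needs only the vector dictionary and the spellings, and inference is just a forward pass over any character sequence.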

Observation – Nearest Neighbors

● English (OOV → nearest in-vocab words)

○ MCT → AWS, OTA, APT, PDM

○ pesky → euphoric, disagreeable, horrid, ghastly

○ lawnmower → tradesman, bookmaker, postman, hairdresser

● Hebrew

○ תפתור תתג → (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve

○ טריים גיאומטריים → geometric (m.pl., nontrad. spelling) geometric (m.pl.)

○ ארך רדסון’ריצ → Richardson Eustrach

● ✔ Surface form ✔ Syntactic properties ✘ Semantics
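The probe behind these lists can be sketched as a cosine nearest-neighbor lookup. The 2-d vectors below are invented for illustration; in the real setting the query would be a mimicked OOV embedding and the candidates the pre-trained vocabulary.

```python
import math

# Toy nearest-neighbor probe (invented 2-d vectors, not real embeddings).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(query_vec, vocab_vecs, k=4):
    """Return the k in-vocab words most cosine-similar to the query vector."""
    return sorted(vocab_vecs, key=lambda w: cosine(query_vec, vocab_vecs[w]),
                  reverse=True)[:k]

vocab_vecs = {  # invented vectors for a few in-vocab words
    "euphoric": [0.9, 0.1],
    "horrid": [0.8, 0.2],
    "lawnmower": [-0.5, 0.9],
}
pesky_vec = [1.0, 0.0]  # pretend this came out of the mimicking network
print(nearest(pesky_vec, vocab_vecs, k=2))  # -> ['euphoric', 'horrid']
```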

Intrinsic Evaluation – RareWords

● RareWords similarity task: morphologically complex, mostly unseen words

● Covers the OOV sources above: names; domain-specific jargon; foreign words; rare(-ish) morphological derivations; nonce words; nonstandard orthography; typos and other errors; ...

● Nearest: programmatic, transformational, mechanistic, transactional, contextual (NN FUN!!!)

Extrinsic Evaluation – POS + Attribute Tagging

● UD is annotated for POS and morphosyntactic attributes

○ Eng: his stated goals → Tense=Past|VerbForm=Part

○ Cze: osoby v pokročilém věku ("people of advanced age") → Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing

● POS model from Ling et al. (2015): word embeddings fed through forward and backward LSTMs (e.g. "the cat is sitting" → DT NN VBZ VBG)

● Attributes: same architecture as the POS layer, with separate outputs per attribute (POS, Number, Tense)

● Negative effect on POS

● Attribute evaluation metric

○ Micro F1
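The micro-F1 metric above can be sketched directly. The labels below are toy attribute sets, not the paper's data: micro averaging pools true positives, false positives, and false negatives across every attribute-value decision before computing a single F1.

```python
# Toy sketch of micro-averaged F1 over per-token attribute sets.
def micro_f1(gold, pred):
    """gold, pred: lists of label sets, one set per token."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but wrong
        fn += len(g - p)  # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"Tense=Past", "VerbForm=Part"}, {"Number=Sing"}]
pred = [{"Tense=Past"}, {"Number=Sing", "Case=Loc"}]
print(micro_f1(gold, pred))  # tp=2, fp=1, fn=1 -> F1 = 2/3
```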

Language Selection

● |UD ∩ Polyglot| = 44 languages; we took 23

● Morphological structure

○ 12 fusional

○ 3 analytic

○ 1 isolating

○ 7 agglutinative

● Genealogical diversity

○ 13 Indo-European (7 different branches)

○ 10 from 8 non-IE branches

● MRLs (e.g. Slavic languages)

○ Much word-level data

○ Relatively free word order

(Language-family illustration by Minna Sundberg)

Language Selection (contd.)

● Script type

○ 7 in non-alphabetic scripts

○ Ideographic (Chinese): ~12K characters

○ Hebrew, Arabic: no casing, no vowels, syntactic fusion

○ Vietnamese: tokens are non-compositional syllables

● Attribute-carrying tokens

○ Range from 0% (Vietnamese) to 92.4% (Hindi)

● OOV rate (UD against Polyglot vocabulary)

○ 16.9%-70.8% type-level (median 29.1%)

○ 2.2%-33.1% token-level (median 9.2%)
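The type- and token-level OOV rates quoted above are computed as follows; the corpus and vocabulary here are toy stand-ins for the real UD treebanks and Polyglot vocabularies.

```python
# Toy sketch of type- vs token-level OOV rates (invented corpus and vocab).
def oov_rates(tokens, vocab):
    """Return (type-level, token-level) OOV rates of a token list vs a vocab."""
    types = set(tokens)
    oov_types = {t for t in types if t not in vocab}
    oov_tokens = [t for t in tokens if t not in vocab]
    return len(oov_types) / len(types), len(oov_tokens) / len(tokens)

tokens = ["the", "cat", "is", "sitting", "the", "flatfish"]
vocab = {"the", "cat", "is"}
print(oov_rates(tokens, vocab))  # type-level 2/5, token-level 2/6
```

Type-level rates run higher than token-level rates because OOVs are concentrated in the long tail of rare words, exactly the pattern in the UD/Polyglot numbers above.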

Evaluated Systems

● NONE: Polyglot's default UNK embedding

● MIMICK

● CHAR2TAG: additional character-LSTM layer in the tagger

○ 3x training time

● BOTH: MIMICK + CHAR2TAG

Results - Full Data

(Bar charts: POS tag accuracy and morphological attribute micro F1 for NONE, MIMICK, CHAR2TAG, BOTH)

Results - 5,000 training tokens

(Bar charts: POS tag accuracy and morphological attribute micro F1 for NONE, MIMICK, CHAR2TAG, BOTH)

Results - Language Types (5,000 tokens)

(Bar charts: Slavic languages POS; agglutinative languages morphological attribute F1, for NONE, MIMICK, CHAR2TAG, BOTH)

Results - Chinese

(Bar charts: POS tag accuracy and morphological attribute micro F1 for NONE, MIMICK, CHAR2TAG, BOTH)

A Word (Model) from our Sponsor

● Our extrinsic results are on tagging

● Please consider us for all your WE use cases!

○ Sentiment!

○ Parsing!

○ IE!

○ QA!

○ ...

● Code compatible with w2v, Polyglot, FastText

● Models for Polyglot also on GitHub

○ <1MB each, DyNet format

○ Learn all OOVs in advance and add to the param table, or

○ Load into memory and infer on-line

Code & models: https://github.com/yuvalpinter/Mimick
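The two deployment modes above can be sketched side by side. The helper names and the trivial `mimic` stand-in are hypothetical; in practice `mimic` would be a trained MIMICK model mapping a spelling to a vector.

```python
# Toy sketch of the two deployment modes (hypothetical helpers; `mimic`
# stands in for a trained character model).
def mimic(word):
    return [float(len(word)), 0.0]  # invented stand-in, not the real model

# Mode 1: learn all OOVs in advance and add them to the parameter table.
def extend_table(table, corpus_words):
    for w in corpus_words:
        if w not in table:
            table[w] = mimic(w)
    return table

# Mode 2: keep the model in memory and infer on-line at lookup time.
def lookup(table, word):
    return table[word] if word in table else mimic(word)

table = {"make": [0.5, -0.2]}
extend_table(table, ["make", "blah"])
print(lookup(table, "blah"))      # precomputed: [4.0, 0.0]
print(lookup(table, "GoogleOS"))  # inferred on-line: [8.0, 0.0]
```

Mode 1 keeps the downstream model a plain lookup table; mode 2 trades a forward pass per OOV for coverage of spellings never seen before deployment.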

Conclusions

● MIMICK: an OOV-extension embedding processing step for downstream tasks

● A compositional model complementing a distributional artifact

● A powerful technique for low-resource scenarios

● Especially good for:

○ Morphologically rich languages

○ Large character vocabularies

● Sore spots and future work

○ Vietnamese: syllabic vocabulary

○ Hebrew and Arabic: nontrivial tokenization, no case

○ Try other subword levels (morphemes, phonemes, bytes)

○ Improve the morphosyntactic attribute tagging scheme

Questions?

Code & models: https://github.com/yuvalpinter/Mimick