
Page 1:

RepEval 2019, June 6th 2019, Minneapolis, USA

The Influence of Down-Sampling Strategies on SVD Word Embedding Stability

Johannes Hellrich, Bernd Kampe & Udo Hahn

Jena University Language & Information Engineering (JULIE) Lab
Friedrich Schiller University Jena, Jena, Germany
www.julielab.de

Page 2:

Typical Word Embeddings are Unstable

[Figure: a corpus ("lots of text") is turned into embeddings by a pipeline with random components; the panels "random embeddings" and "final embeddings" show dog, cat, and tiger in different relative positions after "random processing".]

Page 3:

Typical Word Embeddings are Unstable

[Figure: the same diagram for a second training run; dog, cat, and tiger again shift positions between the "random embeddings" and "final embeddings" panels.]

Page 4:

Measuring Stability

[Figure: two embedding spaces trained from the same corpus ("lots of text"), each placing dog, cat, and tiger; stability compares the resulting neighborhoods.]

The inclusion of accuracy measurements and the larger size of our training corpora exceed prior work. We show how the choice of down-sampling strategies, a seemingly minor detail, leads to major differences in the characterization of SVD_PPMI in recent studies (Hellrich and Hahn, 2017; Antoniak and Mimno, 2018). We also present SVD_wPPMI, a simple modification of SVD_PPMI that replaces probabilistic down-sampling with weighting. What, at first sight, appears to be a small change leads, nevertheless, to an unrivaled combination of stability and accuracy, making it particularly well-suited for the above-mentioned corpus linguistic applications.

2 Computational Methodology

2.1 Measuring Stability

Measuring word embedding stability can be linked to older research comparing distributional thesauri (Salton and Lesk, 1971) by the most similar words they contain for particular anchor words (Weeds et al., 2004; Padró et al., 2014). Most stability experiments focused on repeatedly training the same algorithm on one corpus (Hellrich and Hahn, 2016a,b, 2017; Antoniak and Mimno, 2018; Pierrejean and Tanguy, 2018; Chugh et al., 2018), whereas Wendlandt et al. (2018) quantified stability by comparing word similarity for models trained with different algorithms. We follow the former approach, since we deem it more relevant for ensuring that study results can be replicated or reproduced.

Stability can be quantified by calculating the overlap between sets of words considered most similar in relation to pre-selected anchor words. Reasonable metrical choices are, e.g., the Jaccard coefficient (Jaccard, 1912) between these sets (Antoniak and Mimno, 2018; Chugh et al., 2018), or a percentage-based coefficient (Hellrich and Hahn, 2016a,b; Wendlandt et al., 2018; Pierrejean and Tanguy, 2018). We here use j@n, i.e., the Jaccard coefficient for the n most similar words. It depends on a set M of word embedding models m, for which the n most similar words (by cosine) for each anchor word a from a set A, as provided by the 'most similar words' function msw(a, n, m), are compared:

$$
j@n := \frac{1}{|A|} \sum_{a \in A} \frac{\left|\bigcap_{m \in M} \mathrm{msw}(a, n, m)\right|}{\left|\bigcup_{m \in M} \mathrm{msw}(a, n, m)\right|} \tag{1}
$$
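As a concrete reading of Eq. (1), the following minimal Python sketch (our own illustration, not the evaluation code used in the paper) computes j@n from precomputed most-similar-word lists standing in for msw(a, n, m):

```python
from typing import Dict, List

def jaccard_at_n(models: List[Dict[str, List[str]]], anchors: List[str]) -> float:
    """j@n: mean Jaccard overlap of each anchor's n-best neighbor sets across models."""
    total = 0.0
    for a in anchors:
        neighbor_sets = [set(m[a]) for m in models]  # msw(a, n, m) per model
        shared = set.intersection(*neighbor_sets)    # words all models agree on
        combined = set.union(*neighbor_sets)         # words any model proposes
        total += len(shared) / len(combined)
    return total / len(anchors)

# Two hypothetical models' 3 most similar words for the anchor "dog":
run_1 = {"dog": ["cat", "puppy", "wolf"]}
run_2 = {"dog": ["cat", "puppy", "fox"]}
print(jaccard_at_n([run_1, run_2], ["dog"]))  # 2 shared / 4 combined = 0.5
```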

2.2 SVD_PPMI Word Embeddings

The SVD_PPMI algorithm from Levy et al. (2015) generates word embeddings in a three-step process. First, a corpus is transformed into a word-context matrix listing co-occurrence frequencies. Next, the frequency-based word-context matrix is transformed into a word-context matrix that contains word association values. Finally, singular value decomposition (SVD; Berry, 1992; Saad, 2003) is applied to the latter matrix to reduce its dimensionality and generate word embeddings.

Each token from the corpus is successively processed in the first step by recording co-occurrences with other tokens within a symmetric window of a certain size. For example, in a token sequence ..., w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}, ..., with w_i as the currently modeled token, a window of size 1 would be concerned with w_{i-1} and w_{i+1} only. Down-sampling as described by Levy et al. (2015) increases accuracy by ignoring certain co-occurrences while populating the word-context matrix (further details are described below). A word-context matrix is also used in GloVe, whereas SGNS directly operates on sampled co-occurrences in a streaming manner.
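A minimal sketch of this counting step, assuming the corpus is already tokenized into a plain list of strings (illustrative only; down-sampling is omitted here):

```python
from collections import Counter
from typing import List

def count_cooccurrences(tokens: List[str], window: int) -> Counter:
    """Record (word, context) co-occurrences within a symmetric window."""
    counts: Counter = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word, tokens[j]] += 1
    return counts

counts = count_cooccurrences("the cat sat on the mat".split(), window=1)
print(counts["cat", "sat"])  # 1
```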

Positive pointwise mutual information (PPMI) is a variant of pointwise mutual information (Fano, 1961; Church and Hanks, 1990), independently developed by Niwa and Nitta (1994) and Bullinaria and Levy (2007). PPMI measures the ratio between observed co-occurrences (normalized and treated as a joint probability) and the expected co-occurrences (based on normalized frequencies treated as individual probabilities) for two words i and j, while ignoring all cases in which the observed co-occurrences are fewer than the expected ones:

$$
\mathrm{PPMI}(i, j) :=
\begin{cases}
0 & \text{if } \frac{P(i,j)}{P(i)P(j)} < 1 \\
\log\left(\frac{P(i,j)}{P(i)P(j)}\right) & \text{otherwise}
\end{cases} \tag{2}
$$

Truncated SVD reduces the dimensionality of the vector space described by the PPMI word-context matrix M. SVD factorizes M into three special¹ matrices, so that M = UΣV^T. The entries of Σ are ordered by size, which allows one to infer the relative importance of the vectors in U and V. This can be used to discard all but the highest d values.

¹ U and V are orthogonal matrices containing so-called singular vectors. Σ is a diagonal matrix containing singular values.
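To make steps two and three concrete, here is a compact numpy/scipy sketch that turns a word-context count matrix into PPMI values (Eq. 2) and then into d-dimensional word vectors via truncated SVD. It is our own simplification: Levy et al. (2015) additionally discuss refinements such as context-distribution smoothing and eigenvalue weighting, which are omitted here.

```python
import numpy as np
from scipy.sparse.linalg import svds

def ppmi_matrix(counts: np.ndarray) -> np.ndarray:
    """Convert a word-context count matrix into PPMI values (Eq. 2)."""
    total = counts.sum()
    p_joint = counts / total                            # P(i, j)
    p_word = counts.sum(axis=1, keepdims=True) / total  # P(i)
    p_ctx = counts.sum(axis=0, keepdims=True) / total   # P(j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_word * p_ctx))
    return np.where(pmi > 0, pmi, 0.0)  # ratios below 1 (and empty cells) become 0

def svd_embeddings(ppmi: np.ndarray, d: int) -> np.ndarray:
    """Keep only the d largest singular values; rows are word vectors."""
    u, s, _vt = svds(ppmi, k=d)  # truncated SVD: M ~ U @ diag(s) @ Vt
    order = np.argsort(-s)       # svds yields singular values in ascending order
    return u[:, order] * s[order]

# Toy counts from the slides (rows: dog, cat, tiger; columns: food, roar):
counts = np.array([[475.0, 156.0], [823.0, 492.0], [51.0, 19.0]])
vectors = svd_embeddings(ppmi_matrix(counts), d=1)
```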

Page 5:

Why SVD Embeddings?

[Figure: a corpus ("lots of text") is counted into a word-context matrix, and SVD turns that matrix into the final embeddings for dog, cat, and tiger.]

        food   roar
dog      475    156
cat      823    492
tiger     51     19

Page 6:

Why SVD Embeddings?

[Figure: the same pipeline, now with counting & down-sampling before SVD.]

        food   roar
dog     0.02   0.01
cat     0.5    0.4
tiger   0.01   0.19

Replaced with association values in SVD_PPMI (Levy et al., TACL 2015)

Page 7:

Why Down-Sampling?

• Avoids over-representing frequent words

• Closer context words are more salient than distant ones

→ Increased performance (Mikolov et al., NIPS 2013)
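For reference, the frequency sub-sampling heuristic of Mikolov et al. (NIPS 2013) discards each occurrence of a word w with probability

$$
P_{\mathrm{discard}}(w) = 1 - \sqrt{\frac{t}{f(w)}}
$$

where f(w) is the word's relative corpus frequency and t is a small threshold (on the order of 10⁻⁵ in the original paper); word2vec additionally samples the effective window size, which down-weights distant context words.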

Page 8:

Down-Sampling Mechanism

Weighting
• GloVe
• New: SVD_wPPMI

Probabilistic
• word2vec
• SVD_PPMI
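A hedged sketch of the contrast (our own illustration, not the authors' code): both variants consume (word, context, p) triples, where p is a keep-probability produced by heuristics such as those above; the probabilistic variant reproduces the run-to-run noise of word2vec-style sampling, while the weighting variant behind SVD_wPPMI is deterministic.

```python
import random
from collections import Counter

def downsample_probabilistic(pairs, rng: random.Random) -> Counter:
    """word2vec / SVD_PPMI style: keep each co-occurrence with probability p,
    so the resulting counts differ from run to run."""
    counts = Counter()
    for word, ctx, p in pairs:
        if rng.random() < p:
            counts[word, ctx] += 1
    return counts

def downsample_weighted(pairs) -> Counter:
    """Weighting (the SVD_wPPMI idea): credit the fractional weight p directly,
    so the resulting counts are identical on every run."""
    counts = Counter()
    for word, ctx, p in pairs:
        counts[word, ctx] += p
    return counts

# Ten identical co-occurrences, each with a hypothetical keep-probability of 0.5:
pairs = [("cat", "food", 0.5)] * 10
print(downsample_probabilistic(pairs, random.Random(7)))  # varies around 5
print(downsample_weighted(pairs))                         # exactly 5.0, every run
```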

Page 9:

Experimental Design I/II

• Three corpora:
  • Corpus of Historical American English, 2000s decade (COHA; 28M tokens)
  • English News Crawl corpus (NEWS; 550M tokens)
  • Wikipedia (WIKI; 1.7G tokens)
→ Other studies used mostly COHA-sized corpora!

Page 10:

Experimental Design II/II

• Train 10 models each with SGNS, GloVe, SVD_PPMI (none / probabilistic down-sampling), and SVD_wPPMI

• Evaluate intrinsically with four word similarity & two analogy test sets

• Measure stability with j@10 for 1k most frequent words

Page 11:

Stability Results

GloVe's high stability (Antoniak & Mimno, TACL 2018; Wendlandt et al., NAACL 2018) holds only for small corpora

Page 12:

Exemplary Accuracy Results

A Wilcoxon rank-sum test shows SVD_wPPMI and SGNS to be indistinguishable in accuracy across all test sets and corpora

Page 13:

Conclusion

• Typical word embeddings are unstable

• Down-sampling details greatly affect stability

• GloVe’s stability is worse than claimed in literature

• SVD_wPPMI embeddings provide SGNS-like performance and perfect stability

• See paper for additional results (and bootstrapping)

Page 14:

The Influence of Down-Sampling Strategies on SVD Word Embedding Stability

Johannes Hellrich, Bernd Kampe & Udo Hahn

Jena University Language & Information Engineering (JULIE) Lab
Friedrich Schiller University Jena, Jena, Germany
www.julielab.de