
arXiv:2004.10813v1 [cs.CL] 22 Apr 2020

Revisiting the Context Window for Cross-lingual Word Embeddings

Ryokan Ri and Yoshimasa Tsuruoka
The University of Tokyo

7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
{li0123,tsuruoka}@logos.t.u-tokyo.ac.jp

Abstract

Existing approaches to mapping-based cross-lingual word embeddings are based on the assumption that the source and target embedding spaces are structurally similar. The structures of embedding spaces largely depend on the co-occurrence statistics of each word, which the choice of context window determines. Despite this obvious connection between the context window and mapping-based cross-lingual embeddings, their relationship has been underexplored in prior work. In this work, we provide a thorough evaluation, in various languages, domains, and tasks, of bilingual embeddings trained with different context windows. The highlight of our findings is that increasing both the source and target window sizes improves the performance of bilingual lexicon induction, especially the performance on frequent nouns.

1 Introduction

Cross-lingual word embeddings can capture word semantics invariant among multiple languages, and facilitate cross-lingual transfer for low-resource languages (Ruder et al., 2019). Recent research has focused on mapping-based methods, which find a linear transformation from the source to target embedding spaces (Mikolov et al., 2013b; Artetxe et al., 2016; Lample et al., 2018). Learning a linear transformation is based on a strong assumption that the two embedding spaces are structurally similar or isometric.

The structure of word embeddings heavily depends on the co-occurrence information of words (Turney and Pantel, 2010; Baroni et al., 2014), i.e., word embeddings are computed by counting other words that appear in a specific context window of each word. The choice of context window changes the co-occurrence statistics of words and thus is crucial to determine the structure of an embedding space. For example, it has been known that an embedding space trained with a smaller linear window captures functional similarities, while a larger window captures topical similarities (Levy and Goldberg, 2014a). Despite this important relationship between the choice of context window and the structure of embedding space, how the choice of context window affects the structural similarity of two embedding spaces has not been fully explored yet.

In this paper, we attempt to deepen the understanding of cross-lingual word embeddings from the perspective of the choice of the context window through carefully designed experiments. We experiment with a variety of settings, with different domains and languages. We train monolingual word embeddings varying the context window sizes, align them with a mapping-based method, and then evaluate them with both intrinsic and downstream cross-lingual transfer tasks. Our research questions and the summary of the findings are as follows:

RQ1: What kind of context window produces a better alignment of two embedding spaces? Our results show that increasing the window sizes of both the source and target embeddings consistently improves the accuracy of bilingual dictionary induction, regardless of the domains of the source and target corpora. Our fine-grained analysis reveals that frequent nouns receive the most benefit from larger context sizes.

RQ2: In downstream cross-lingual transfer, do the context windows that perform well on the source language also perform well on the target languages? No. We find that even when some context window performs well on the source language task, it is often not the best choice for the target language. The general tendency is that broader context windows produce better performance for the target languages.


2 Background and Related Work

2.1 Context Window of Word Embeddings

Word embeddings are computed from the co-occurrence information of words, i.e., context words that appear around a given word. The embedding algorithm used in this work is the skip-gram with negative sampling (Mikolov et al., 2013c). In the skip-gram model, each word w in the vocabulary W is associated with a word vector $v_w$ and a context vector $c_w$.1 The objective is to maximize the dot product $v_{w_t} \cdot c_{w_c}$ for the observed word-context pairs $(w_t, w_c)$, and to minimize the dot product for negative examples.
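For reference, the per-pair objective of skip-gram with negative sampling can be written in its standard form as follows, where $\sigma$ is the logistic function, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution from which negative context words are drawn:

    $$ \log \sigma(v_{w_t} \cdot c_{w_c}) \;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v_{w_t} \cdot c_{w_i}) \right] $$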

The most common type of context is a linear window. When the window size is set to k, the context words of a target word $w_t$ in a sentence $[w_1, w_2, \ldots, w_t, \ldots, w_L]$ are $[w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}]$. The choice of context is crucial to the resulting embeddings as it will change the co-occurrence statistics associated with each target word. Table 1 demonstrates the effect of the context window size on the nearest-neighbor structure of the embedding space; with a small window size, the resulting embeddings capture functional similarity, while with a larger window size, the embeddings capture topical similarities.
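To make the definition concrete, the following is a minimal sketch (an illustration, not the extraction code used in our experiments) of how word-context pairs are collected with a fixed linear window of size k:

    def linear_window_pairs(sentence, k):
        # `sentence` is a list of tokens [w_1, ..., w_L]; for each position t,
        # the context is [w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}],
        # clipped at the sentence boundaries.
        pairs = []
        for t, target in enumerate(sentence):
            lo, hi = max(0, t - k), min(len(sentence), t + k + 1)
            pairs.extend((target, sentence[c]) for c in range(lo, hi) if c != t)
        return pairs

    # With k=2, "dogs" in "the two dogs chased the cat" pairs with
    # "the", "two", "chased", and "the".
    print(linear_window_pairs("the two dogs chased the cat".split(), 2))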

Among the other types of context windows that have been explored by researchers are linear windows enriched with positional information (Levy and Goldberg, 2014b; Ling et al., 2015a; Li et al., 2017), syntactically informed context windows based on dependency trees (Levy and Goldberg, 2014a; Li et al., 2017), and one that dynamically weights the surrounding words with the attention mechanism (Ling et al., 2015b). In this paper, we mainly discuss the most common linear window and investigate how the choice of the window size affects the isomorphism of two embedding spaces and the performance of cross-lingual transfer.

2.2 Cross-lingual Word Embeddings

Cross-lingual word embeddings aim to learn a shared semantic space in multiple languages. One promising solution is to jointly train the source and target embeddings, so-called joint methods, by exploiting cross-lingual supervision signals in the form of word dictionaries (Duong et al., 2016), parallel corpora (Gouws et al., 2015; Luong et al., 2015), or document-aligned corpora (Vulic and Moens, 2016).

1 Conceptually, the word and context vocabularies are regarded as separate, but for simplicity, we assume that they share the same vocabulary.

Query word     window size 1     window size 10
words          phrases           word
               loanwords         phrases
               morphemes         phrase
               verses            ungrammatical
               phonemes          homographs
typological    synchronic        totemism
               mechanistic       typology
               numerological     categorizations
               architectonic     dialectology
               dialectical       fusional

Table 1: The top-5 nearest neighbors in English embedding spaces trained with different context windows in our experiment. The smaller window size captures functional similarities (-s, -cal, -ic), while the larger captures topical similarities.


Another line of research is off-line mapping-based approaches (Ruder et al., 2019), where monolingual embeddings are independently trained in multiple languages, and a post-hoc alignment matrix is learned to align the embedding spaces with a seed word dictionary (Mikolov et al., 2013b; Xing et al., 2015; Artetxe et al., 2016), with only a little supervision such as identical strings or numerals (Artetxe et al., 2017; Smith et al., 2017), or even in a completely unsupervised manner (Lample et al., 2018; Artetxe et al., 2018). Mapping-based approaches have recently become popular because of their lower computational cost compared to joint approaches, as they can make use of pre-trained monolingual word embeddings.

The assumption behind the mapping-based methods is the isomorphism of monolingual embedding spaces, i.e., the embedding spaces are structurally similar, or the nearest neighbor graphs from the different languages are approximately isomorphic (Søgaard et al., 2018). Considering that the structures of the monolingual embedding spaces are closely related to the choice of the context window, it is natural to expect that the context window has a considerable impact on the performance of mapping-based bilingual word embeddings.

However, most existing work has not provided empirical results on the effect of the context window on cross-lingual embeddings, as their focus is on how to learn a mapping between the two embedding spaces.


In order to shed light on the effect of the context window on cross-lingual embeddings, we trained cross-lingual embeddings with different context windows, and carefully analyzed the implications of their varying performance on both intrinsic and extrinsic tasks.

3 Experimental Design

3.1 Training Monolingual Embeddings

The experiment is designed to deal with multiple settings to fully understand the effect of the context window.

Languages. As the target language, we choose English (En) because of its richness of resources, and as the source languages, we choose French (Fr), German (De), Russian (Ru), and Japanese (Ja), taking into account typological variety and the availability of evaluation resources.

Note that the language pairs analyzed in this paper are limited to those including English, and there is a possibility that some results may not generalize to other language pairs.

Corpus for Training Word Embeddings. To train the monolingual embeddings, we use the Wikipedia Comparable Corpora.2 We choose comparable corpora for the main analysis in order to accentuate the effect of the context window by setting up an ideal situation for training cross-lingual embeddings.

We also experiment with different-domain settings, where we use corpora from the news domain3 for the source languages, because the isomorphism assumption has been shown to be very sensitive to the domains of the source and target corpora (Søgaard et al., 2018). We refer to those results when we are interested in whether the same trend with respect to the context window can be observed in the different-domain settings.

For the size of the data, to simulate the setting of transferring from a low-resource language to a high-resource language, we use 5M sentences for the target language (English) and 1M sentences for the source languages.4

Context Window. Since we want to measure the effect of the context window size, we vary the window size among 1, 2, 3, 4, 5, 7, 10, 15, and 20.

2 https://linguatools.org/tools/corpora/wikipedia-comparable-corpora/

3 https://wortschatz.uni-leipzig.de/en/download

4 We also experimented with very low-resource settings, where the source corpus size is set to 100K, but the results showed similar trends to the 1M setting, and thus we only include the results of the 1M setting in this paper.

Besides the linear window, we also experimented with the unbound dependency context (Li et al., 2017), where we extract context words that are the head, modifiers, and siblings of the target word in a dependency tree.

Our initial motivation was that, while the linear context is directly affected by different word orders, the dependency context can mitigate the effect of language differences and thus may produce better cross-lingual embeddings. However, the performance of the dependency context turned out to always fall between those of smaller and larger linear windows, and we found nothing notable. Therefore, the following analysis focuses only on the results of the linear context window.

Implementation of Word2Vec. Note that some common existing implementations of the skip-gram model may obfuscate the effect of the window size. The original C implementation of word2vec and its Python implementation Gensim5 adopt a dynamic window mechanism, where the window size is uniformly sampled between 1 and the specified window size for each target word (Mikolov et al., 2013a). Also, those implementations remove frequent tokens by subsampling before extracting word-context pairs (so-called "dirty" subsampling) (Levy et al., 2015), which enlarges the context size in effect. Our experiment is based on word2vecf,6 which takes arbitrary word-context pairs as input. We extract word-context pairs with a fixed window size and perform subsampling afterward.
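The distinction matters in practice. The sketch below is a simplified illustration of the two behaviors (it is not the word2vecf code, and the subsampling formula is one common variant of the word2vec heuristic): a dynamic window samples the effective window size per target word, whereas our setup extracts pairs from a fixed window and subsamples only afterward, so the effective window size is not silently enlarged.

    import math
    import random

    def dynamic_window_pairs(sentence, k):
        # Behavior of the original word2vec and Gensim: the effective window
        # for each target word is sampled uniformly from {1, ..., k}.
        pairs = []
        for t, target in enumerate(sentence):
            b = random.randint(1, k)
            lo, hi = max(0, t - b), min(len(sentence), t + b + 1)
            pairs.extend((target, sentence[c]) for c in range(lo, hi) if c != t)
        return pairs

    def keep_prob(freq, total, threshold=1e-5):
        # Word2vec-style subsampling: frequent words are kept with probability
        # roughly sqrt(threshold / f(w)), clipped to 1.
        f = freq / total
        return min(1.0, math.sqrt(threshold / f)) if f > 0 else 1.0

    def fixed_window_then_subsample(sentence, k, counts, total):
        # Our setting: extract pairs from a fixed window first, then drop each
        # pair according to the frequencies of its two words.
        pairs = linear_window_pairs(sentence, k)  # see the earlier sketch
        return [(w, c) for (w, c) in pairs
                if random.random() < keep_prob(counts[w], total)
                and random.random() < keep_prob(counts[c], total)]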

We train 300-dimensional embeddings. For details on the hyperparameters, we refer the readers to Appendix A.

3.2 Aligning Monolingual Embeddings

After training monolingual embeddings in the source and target languages, we align them with a mapping-based algorithm. To induce an alignment matrix W for the source and target embeddings x, y, we use a simple supervised method of solving the Procrustes problem $\arg\min_{W} \sum_{i=1}^{m} \lVert W x_i - y_i \rVert^2$ over a training word dictionary $\{(x_i, y_i)\}_{i=1}^{m}$ (Mikolov et al., 2013b), with the orthogonality constraint on W, and with length normalization and mean-centering as preprocessing for the source and target embeddings (Artetxe et al., 2016).
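A minimal numpy sketch of this alignment step is given below (the standard orthogonal Procrustes recipe under the stated preprocessing, not a reproduction of any particular toolkit); the index arrays in the usage comment are illustrative names:

    import numpy as np

    def preprocess(E):
        # Length-normalize each vector, then mean-center the matrix
        # (the two preprocessing steps mentioned above).
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        return E - E.mean(axis=0, keepdims=True)

    def procrustes(X, Y):
        # Solve argmin_W sum_i ||W x_i - y_i||^2 with W orthogonal.
        # X, Y are (m, d) matrices whose i-th rows are the seed pair (x_i, y_i);
        # the closed-form solution is W = U V^T where U S V^T = SVD(Y^T X).
        U, _, Vt = np.linalg.svd(Y.T @ X)
        return U @ Vt

    # Usage sketch: src_emb, tgt_emb are (V, d) embedding matrices and
    # src_idx, tgt_idx are index arrays for the 5K seed dictionary pairs.
    # src, tgt = preprocess(src_emb), preprocess(tgt_emb)
    # W = procrustes(src[src_idx], tgt[tgt_idx])
    # aligned_src = src @ W.T   # source vectors mapped into the target space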

5 https://radimrehurek.com/gensim/

6 https://bitbucket.org/yoavgo/word2vecf/src/default/


Figure 1: BLI performance in the comparable setting. The target window size is fixed and the source window size is varied.


The word dictionaries are automatically created using Google Translate.7 We translate all words in our English vocabulary into the source languages and filter out words that do not exist in the source vocabularies. We also perform this process in the opposite direction (translating from the source languages into English) and take the union of the two resulting dictionaries. We then randomly select 5K tuples for training and 2K for testing. Although using word dictionaries automatically derived from such a system is currently common practice in this field, it should be acknowledged that this can pose problems: the generated dictionaries are noisy, and the definition of word translation is unclear (e.g., how do we handle polysemy?). This can hinder valid comparisons between systems or detailed analyses of them, and should be addressed in future research.

For each setting, we train three pairs of aligned embeddings with different random seeds in the monolingual embedding training, as training word embeddings is known to be unstable and different runs result in different nearest neighbors (Wendlandt et al., 2018). The following results are presented with their averages and standard deviations.

4 Bilingual Lexicon Induction

We first evaluate the learned bilingual embeddings with bilingual lexicon induction (BLI). The task is to retrieve the target translations of source words by searching for nearest neighbors with cosine similarity in the bilingual embedding space.

7 https://translate.google.com/ (October 2019)

The evaluation metric used in prior work is usually top-k precision, but here we use a more informative measure, the mean reciprocal rank (MRR), as recommended by Glavas et al. (2019).
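A small sketch of how this evaluation can be computed over the aligned space is shown below (cosine nearest-neighbor retrieval; when the gold dictionary lists several translations for a source word, we take the best-ranked one — this reflects our reading of the metric rather than the exact evaluation script):

    import numpy as np

    def mrr(aligned_src, tgt_emb, test_pairs):
        # aligned_src: (Vs, d) source embeddings mapped into the target space.
        # tgt_emb:     (Vt, d) target embeddings.
        # test_pairs:  dict mapping a source index to a set of gold target indices.
        S = aligned_src / np.linalg.norm(aligned_src, axis=1, keepdims=True)
        T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        reciprocal_ranks = []
        for s, golds in test_pairs.items():
            sims = T @ S[s]              # cosine similarity to every target word
            order = np.argsort(-sims)    # target indices, best first
            ranks = np.nonzero(np.isin(order, list(golds)))[0]
            reciprocal_ranks.append(1.0 / (ranks.min() + 1))
        return float(np.mean(reciprocal_ranks))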

Fixed Target Context Window Settings. First, we consider the settings where the target context size is fixed and the source context size is configurable. This setting assumes the common situation where the embeddings of the target language are available in the form of pre-trained embeddings.

Figure 1 shows the results for the four languages. Firstly, we observe that windows that are too small (1 to 3) for the source embeddings do not yield good performance, probably because the model fails to learn accurate word embeddings from the insufficient number of training word-context pairs that such small windows capture.

At first, this result may seem to contradict the result of Søgaard et al. (2018). They trained English and Spanish embeddings with fasttext (Bojanowski et al., 2017) and a window size of 2, and then aligned them with an unsupervised mapping algorithm (Lample et al., 2018). When they changed the window size of the Spanish embedding to 10, they only observed a very slight drop in top-1 precision (from 81.89 to 81.28). We suspect that the discrepancy with our result is due to the different settings. First of all, fasttext adopts a dynamic window mechanism, which may obfuscate the difference in the context window. Also, they trained embeddings with full Wikipedia articles, which is an order of magnitude larger than ours; the fasttext algorithm, which takes into account the character n-gram information of words, can exploit a non-trivial amount of subword overlap between the quite similar languages.


Figure 2: BLI performance for each PoS in the comparable setting.

Figure 3: BLI performance in the comparable setting.

Overall, we observe that the best context window size for the source embeddings increases as the target context size increases, and increasing the context sizes of both the source and target embeddings seems beneficial to the BLI performance.

Configurable Source/Target Context Window Settings. Hereafter, we present the results where both the source and target sizes are configurable and set to the same value. Figure 3 summarizes the results of the same-domain setting.

As we expected from the observation of the settings where the target window size is fixed, the performance consistently improves as the source and target context sizes increase. Given that larger context windows tend to capture topical similarities of words, we hypothesize that the more topical the embeddings are, the easier they are to align. Topics are invariant across different languages to some extent as long as the corpora are comparable. It is natural to think that topic-oriented embeddings capture language-agnostic semantics of words and are thus easier to align among different languages.

Figure 4: BLI performance in the different-domain setting.

This hypothesis can be further supported by looking at the metrics for each part of speech (PoS). Intuitively, nouns tend to be more representative of topics than other PoS, and are thus expected to show a high correlation with the window size. Figure 2 shows the scores for each PoS.8 In all languages, nouns and adjectives show a stronger (almost perfect) correlation than verbs and adverbs.

Different-domain Settings. The results so far are obtained in the settings where the source and target corpora are comparable. When the corpora are comparable, it is natural that topical embeddings are easier to align, as comparable corpora share their topics. In order to see whether the observations from the comparable settings hold true for different-domain settings, we also present the results from the different-domain (news) source corpora in Figure 4.

8 We assigned to each word its most frequent PoS tag in the Brown Corpus (Kucera and Francis, 1967), following Wada et al. (2019).
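For reference, the PoS assignment of footnote 8 can be approximated with NLTK as in the sketch below (assuming the Brown corpus and the universal tagset have been downloaded; the exact mapping used in our experiments may differ in minor details):

    from collections import Counter, defaultdict
    from nltk.corpus import brown  # requires nltk.download('brown') and nltk.download('universal_tagset')

    def most_frequent_pos():
        # Count the universal PoS tags (NOUN, VERB, ADJ, ADV, ...) of each word
        # over the Brown Corpus and keep the most frequent tag per word.
        tag_counts = defaultdict(Counter)
        for word, tag in brown.tagged_words(tagset="universal"):
            tag_counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}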


Figure 5: BLI performance for each PoS in the different-domain setting.

Figure 6: BLI performance with the top 500 frequent and rare words in the comparable setting.

Firstly, compared to the same-domain settings (Figure 3), the scores are lower by around 0.1 to 0.2 points across the languages and context windows, even with the same amount of training data. This result confirms previous findings showing that domain consistency is important to the isomorphism assumption (Søgaard et al., 2018).

As to the relation between the BLI performance and the context window, we observe a similar trend to the comparable settings: increasing the context window size basically improves the performance. Figure 5 summarizes the results for each PoS. The performance on nouns and adjectives still accounts for much of the correlation with the window size. This suggests that even when the source and target domains are different, some domain-invariant topics are captured by larger-context embeddings for nouns and adjectives.

Frequency Analysis. To further gain insight into what kind of words receive the benefit of larger context windows, we analyze the effect of word frequency.

Figure 7: BLI performance on the top 500 frequent and rare words in the different-domain setting.

We extract the top and bottom 500 frequent words9 from the test vocabularies and evaluate the performance on each group separately.
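Concretely, the split can be obtained as in the sketch below (hypothetical variable names; `mrr` refers to the evaluation sketch in Section 4, and the frequency counts come from our English Wikipedia subset, cf. footnote 9):

    from collections import Counter

    def frequency_buckets(test_words, corpus_tokens, n=500):
        # Rank the test vocabulary by corpus frequency and return the
        # n most frequent and the n rarest test words.
        counts = Counter(corpus_tokens)
        ranked = sorted(test_words, key=lambda w: counts[w], reverse=True)
        return ranked[:n], ranked[-n:]

    # top500, bottom500 = frequency_buckets(test_vocab, english_wiki_tokens)
    # BLI is then evaluated separately on the test pairs whose source word
    # falls into each bucket, e.g. with the mrr() sketch from Section 4.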

The results of the comparable setting in eachlanguage are shown in Figure 6.

The scores for the frequent words (top 500) are notably higher than those for the rare words (bottom 500). This confirms previous empirical results that existing mapping-based methods perform significantly worse for rare words (Braune et al., 2018; Czarnowska et al., 2019).

With respect to the relation with the context size, both frequent and rare words benefit from larger window sizes, although the gain for the rare words is less obvious in some languages (Ja and Ru).

In the different-domain settings, as shown in Figure 7, the rare words, in turn, suffer from larger window sizes, especially for Fr and Ru, but the performance on frequent words still improves as the context window increases.

9 The frequencies were calculated from our subset of the English Wikipedia corpus.


Figure 8: Downstream evaluations in the comparable settings. SA: sentiment analysis; DC: document classification; DP: dependency parsing. The window sizes of both the source and target embeddings are varied.


We conjecture that when training a skip-gram model, frequent words observe many context words, which mitigates the effect of irrelevant words (noise) caused by a larger window size and results in high-quality topical embeddings; however, rare words have to rely on a limited number of context words, and larger windows just amplify the noise and domain difference, resulting in an inaccurate alignment for them.

5 Downstream Tasks

Although BLI is a common evaluation method for bilingual embeddings, good performance on BLI does not necessarily generalize to downstream tasks (Glavas et al., 2019). To further gain insight into the effect of the context size on bilingual embeddings, we evaluate the embeddings with three downstream tasks: 1) sentiment analysis; 2) document classification; 3) dependency parsing. Here, we briefly describe the dataset and model used for each task.

Sentiment Analysis (SA). We use the Webis-CLS-10 corpus10 (Prettenhofer and Stein, 2010), which is comprised of Amazon product reviews in four languages: English, German, French, and Japanese (no Russian data is available). We cast sentiment analysis as a binary classification task, where we label reviews with scores of 1 or 2 as negative and reviews with scores of 4 or 5 as positive. For the model, we employ a simple CNN encoder followed by a multi-layer perceptron classifier.

Document Classification (DC). MLDoc11 (Schwenk and Li, 2018) is compiled from the Reuters corpus for eight languages, including all the languages used in this paper. The task is a four-way classification of news article topics: Corporate/Industrial, Economics, Government/Social, and Markets. We use the same model architecture as for sentiment analysis.

Dependency Parsing (DP). We train deep biaffine parsers (Dozat and Manning, 2017) with the UD English EWT dataset12 (Silveira et al., 2014). We use the PUD treebanks13 as test data.

10 https://webis.de/data/webis-cls-10.html

11 https://github.com/facebookresearch/MLDoc



The hyperparameters used in this experiment are shown in Appendix B.

Evaluation Setup. We evaluate, in a cross-lingual transfer setup, how well the bilingual embeddings trained with different context windows transfer lexical knowledge across languages. Here, we focus on the settings where both the source and target context sizes are varied.

For each task, we train models with our pre-trained English embeddings. We do not update the parameters of the embeddings during training. Then, we evaluate the model on the test data in the other languages available in the dataset. At test time, we feed the model the word embeddings of the test language aligned to the training English embeddings.
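A condensed PyTorch sketch of this setup for the classification tasks is shown below (a CNN encoder roughly following Appendix B.1, with frozen pre-trained embeddings; the swap_language helper and all names are illustrative, not our exact training code):

    import torch
    import torch.nn as nn

    class CNNClassifier(nn.Module):
        def __init__(self, pretrained, num_classes, filters=100,
                     ngram_sizes=(2, 3, 4, 5), hidden=64):
            super().__init__()
            # Frozen pre-trained embeddings: not updated during training.
            self.embed = nn.Embedding.from_pretrained(pretrained, freeze=True)
            dim = pretrained.size(1)
            self.convs = nn.ModuleList(
                nn.Conv1d(dim, filters, kernel_size=n) for n in ngram_sizes)
            self.mlp = nn.Sequential(
                nn.Linear(filters * len(ngram_sizes), hidden), nn.ReLU(),
                nn.Linear(hidden, num_classes))

        def forward(self, token_ids):                  # (batch, seq_len), padded
            x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.mlp(torch.cat(pooled, dim=1))  # (batch, num_classes)

    def swap_language(model, aligned_embeddings):
        # At test time, replace the English embeddings with the (frozen)
        # embeddings of the test language aligned to the English space.
        model.embed = nn.Embedding.from_pretrained(aligned_embeddings, freeze=True)
        return model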

We train nine models in total for each setting, with different random seeds and English embeddings, and we present their average scores and standard deviations.

Result and Discussion. The results from all three tasks are presented in Figure 8.

For sentiment analysis and document classification, we observe a similar trend where the best window size is around 3 to 5 for the source English task, but for the test languages, larger context windows achieve better results. The only deviation is the Japanese document classification, where the score does not show a significant correlation. We attribute this to low-quality alignments due to the large typological difference between English and Japanese, which can be confirmed by the fact that the Japanese scores are the lowest across the board.

For dependency parsing, embeddings with smaller context windows perform better on the source English task, which is consistent with the observation that smaller context windows tend to produce syntax-oriented embeddings (Levy and Goldberg, 2014a). However, the performance of the small-window embeddings does not transfer to the test languages. The best context window for the English development data (a size of 1) performs the worst for all the test languages, and the transferred accuracy seems to benefit from larger context sizes, although it does not always correlate with the window size.

12 https://universaldependencies.org/treebanks/en_ewt/index.html

13 https://universaldependencies.org/conll17/

This observation highlights the difficulty of transferring syntactic knowledge across languages. Word embeddings trained with small windows capture more grammatical aspects of words in each language, which, as different languages have different grammars, makes the source and target embedding spaces so different that it is difficult to align them.

In summary, a general trend we observe here is that good context windows for the source language task do not necessarily produce good transferable bilingual embeddings. In practice, it seems better to choose a context window that aligns the source and target well, rather than the window size that simply performs best for the source language.

6 Conclusion and Future Work

Despite their obvious connection, the relation between the choice of context window and the structural similarity of two embedding spaces has not been fully investigated in prior work. In this study, we have offered the first thorough empirical results on the relation between the context window size and bilingual embeddings, and shed new light on the properties of bilingual embeddings. In summary, we have shown that:

• larger context windows for both the source and target facilitate the alignment of words, especially nouns.

• for cross-lingual transfer, the best context window for the source task is often not the best for the test languages. Especially for dependency parsing, the smallest context size produces the best result for the source task, but performs the worst for the test languages.

We hope that our study will provide insights into ways to improve cross-lingual embeddings, not only through mapping methods but also through the properties of monolingual embedding spaces.

Acknowledgement

We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by JST CREST Grant Number JPMJCR1513, Japan.


References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294, Austin, Texas. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia. Association for Computational Linguistics.

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Fabienne Braune, Viktor Hangya, Tobias Eder, and Alexander Fraser. 2018. Evaluating bilingual word embeddings on the long tail. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 188–193, New Orleans, Louisiana. Association for Computational Linguistics.

Paula Czarnowska, Sebastian Ruder, Edouard Grave, Ryan Cotterell, and Ann Copestake. 2019. Don't Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 974–983, Hong Kong, China. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the International Conference on Learning Representations.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning Crosslingual Word Embeddings without Bilingual Corpora. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1285–1295, Austin, Texas. Association for Computational Linguistics.

Goran Glavas, Robert Litschko, Sebastian Ruder, and Ivan Vulic. 2019. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 710–721, Florence, Italy. Association for Computational Linguistics.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 748–756, Lille, France.

Henry Kucera and W. Nelson Francis. 1967. Computational analysis of present-day American English. Brown University Press.

Guillaume Lample, Alexis Conneau, Marc Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2018. Word Translation without Parallel Data. In Proceedings of the International Conference on Learning Representations.

Omer Levy and Yoav Goldberg. 2014a. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014b. Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, Ann Arbor, Michigan. Association for Computational Linguistics.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, and Xiaoyong Du. 2017. Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2421–2431, Copenhagen, Denmark. Association for Computational Linguistics.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015a. Two/Too Simple Adaptations of Word2Vec for Syntax Problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304, Denver, Colorado. Association for Computational Linguistics.


Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W. Black, Isabel Trancoso, and Chu-Cheng Lin. 2015b. Not All Contexts Are Created Equal: Better Word Representations with Variable Attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1367–1372, Lisbon, Portugal. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual Word Representations with Monolingual Quality in Mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, Denver, Colorado. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting Similarities among Languages for Machine Translation. Computing Research Repository, arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Peter Prettenhofer and Benno Stein. 2010. Cross-Language Text Classification Using Structural Correspondence Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1118–1127, Uppsala, Sweden. Association for Computational Linguistics.

Sebastian Ruder, Ivan Vulic, and Anders Søgaard. 2019. A Survey of Cross-lingual Word Embedding Models. Journal of Artificial Intelligence Research, 65(1):569–630.

Holger Schwenk and Xian Li. 2018. A Corpus for Multilingual Document Classification in Eight Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A Gold Standard Dependency Corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2897–2904, Reykjavik, Iceland. European Language Resources Association (ELRA).

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax. In Proceedings of the International Conference on Learning Representations, pages 1–10.

Anders Søgaard, Sebastian Ruder, and Ivan Vulic. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788, Melbourne, Australia. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Ivan Vulic and Marie-Francine Moens. 2016. Bilingual Distributed Word Representations from Document-aligned Comparable Data. Journal of Artificial Intelligence Research, 55(1):953–994.

Takashi Wada, Tomoharu Iwata, and Yuji Matsumoto. 2019. Unsupervised multilingual word embedding with limited resources using neural language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3113–3124, Florence, Italy. Association for Computational Linguistics.

Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. 2018. Factors influencing the surprising instability of word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–2102, New Orleans, Louisiana. Association for Computational Linguistics.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011, Denver, Colorado. Association for Computational Linguistics.


A The hyperparameters for training monolingual word embeddings

hyperparameter                Source embeddings (1M sentences)    Target embeddings (5M sentences)
embedding size                300                                 300
number of negative samples    15                                  15
alpha (learning rate)         0.025 (linearly decayed during training)
minimum word count            10                                  15
number of iterations          10                                  5

B The hyperparameters for downstream tasks

B.1 Document Classification and Sentiment Analysis

CNN Classifier
number of filters     100
ngram filter sizes    2, 3, 4, 5
MLP hidden size       64

Training
optimizer       Adam
learning rate   0.001 (halved each time the dev score stops improving)
patience        3
batch size      64

B.2 Dependency Parsing

Graph-based Parser
LSTM hidden size          200
LSTM number of layers     3
tag representation dim    100
arc representation dim    500
PoS tag embedding dim     50

Training
optimizer       Adam
learning rate   0.001 (halved each time the dev score stops improving)
patience        3
batch size      32