
NeuraBASE for Finding Word Similarities

Kim-Fong Ho, Kit-Yee Wong, Robert G. Hercus

Abstract

Word2vec [1] is a popular neural-network-based approach to learning distributed vector representations for words. This paper revisits the co-occurrence-based word-space model as an alternative approach for learning word similarities from large datasets. The proposed method uses the NeuraBASE neuronal network described in the published patent [2].

Introduction

A classic approach in word-space modelling is to record the co-occurrence counts between words within a dataset in a matrix [3]. However, this method becomes infeasible for huge datasets: the matrix grows quadratically with the size of the vocabulary V, requiring O(|V|^2) entries. One quickly notes that this storage method may be needlessly expensive; after all, some words will never be linked to others, leaving many zero entries in the matrix. NeuraBASE provides an easy-to-use and easy-to-customize alternative by creating associations only between co-occurring words. These representations can subsequently be used to find words of similar semantic or syntactic meaning.

NeuraBASE takes a text corpus as input and produces a data structure containing all the required

relationships between the words.
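The storage argument can be made concrete with a small sketch; the dictionary layout below is an illustrative assumption, not NeuraBASE's internal representation. A dense matrix reserves a cell for every possible word pair, whereas a sparse map only grows with the pairs that are actually observed.

from collections import defaultdict

# A dense co-occurrence matrix needs |V| x |V| cells even though most word
# pairs never co-occur; the 258k-word vocabulary used later in this paper
# would already require roughly 6.7e10 cells.
links = defaultdict(int)              # (word_a, word_b) -> co-occurrence count
links[("spain", "portugal")] += 1     # only observed pairs take up any space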

NeuraBASE first constructs a dictionary of words from the training data and subsequently learns the co-occurrence frequencies between words. Once NeuraBASE has been trained, it can be used in many natural language processing and machine learning tasks, such as parallel text alignment and concept extraction. A simple way to investigate the learned associations is to find similar words for a user-specified word within a given corpus. In this paper, a simple approach similar to the Jaccard similarity measure is used for this purpose. For example, if the word of interest is 'spain', the similarity calculation identifies the most similar words and their scores, listed in descending order:

Word Score

portugal 0.25433

italy 0.17578

france 0.17205

spanish 0.16916

romania 0.15957

castile 0.15519

invaded 0.15354

morocco 0.14561

philippines 0.14110

argentina 0.13768

provinces 0.13659


Training Setup

The dataset used in this experiment consists of the first 100MB of text from Wikipedia, which

contains approximately 17 million words and 258k unique words. Using the NeuraBASE toolbox

(http://www.neuramatix.com/download.php), training is performed in two passes over the corpus:

the first pass builds a cortical network of the vocabulary and records the frequency of each entry.

Rare words with frequencies below a set threshold of 5, as well as 122 common connector words (e.g. you, me, their, etc.), are subsequently flagged for omission from the next training phase, resulting in an effective vocabulary of ~72K words.

Using a context window of size 10 (5 in either direction), the second pass creates associations between the word located at the centre of the sliding context window and the adjacent words located within the window, and the co-occurrence frequencies are updated accordingly.
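As a rough illustration of this two-pass procedure, consider the sketch below. The threshold of 5, the connector-word list and the window of 5 words in either direction come from the description above, but the data structures and the function itself are assumptions made for illustration only, not the NeuraBASE toolbox API.

from collections import Counter, defaultdict

def two_pass_training(tokens, connector_words, min_freq=5, half_window=5):
    # Pass 1: build the vocabulary and record the frequency of each entry.
    freq = Counter(tokens)
    # Flag rare words and connector words for omission from the second pass.
    keep = {w for w, c in freq.items() if c >= min_freq and w not in connector_words}

    # Pass 2: slide a context window over the corpus and update the
    # co-occurrence frequency between the centre word and each adjacent word.
    links = defaultdict(int)  # (centre, neighbour) -> co-occurrence count
    for i, centre in enumerate(tokens):
        if centre not in keep:
            continue
        for j in range(max(0, i - half_window), min(len(tokens), i + half_window + 1)):
            if j != i and tokens[j] in keep:
                links[(centre, tokens[j])] += 1
    return freq, links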

Referring to the NeuraBASE Toolbox User Guide, the words are learned using the LearnSN function, which builds them into a cortical network named 'Words_CN', whereas the co-occurrence links between words within the corpus are generated in an interneuronal network named 'WordLinks_IN' using the LinkIN function.

An illustration of the process is shown in Figure 1 for a context window of size 4 (2 in either

direction).

Figure 1: NeuraBASE Approach in Forming Word Relationships


Word Similarity Measure

Basic Notations

Using k-skip-bigrams to represent the interneuronal links, a formal representation is given as

follows:

1. w_a, w_b denote words in the vocabulary.

2. f(w_a) denotes the frequency of the word w_a in the corpus.

3. f(w_a, w_b) denotes the co-occurrence frequency of the skip-bigram (w_a, w_b), i.e. the weight of the interneuronal link between w_a and w_b.

4. N denotes the total number of word tokens in the corpus.

5. p(w_a) = f(w_a) / N is the probability of observing w_a.

6. p(w_a, w_b) = f(w_a, w_b) / N is the joint probability of observing the skip-bigram (w_a, w_b).

The joint probability score S(w_a, w_b) used in our calculations is similar to pointwise mutual information (PMI):

PMI(w_a, w_b) = log [ p(w_a, w_b) / ( p(w_a) p(w_b) ) ] = log [ N f(w_a, w_b) / ( f(w_a) f(w_b) ) ]

If we choose to discard the log and the constant terms, we arrive at the formula

S(w_a, w_b) = f(w_a, w_b)^2 / ( f(w_a) f(w_b) )

In this case, the choice of an additional f(w_a, w_b) factor in the numerator is purely arbitrary; it appears to produce better results for our experiment.
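Given the stored frequencies, the score is a one-line computation. The sketch below assumes the Counter and link dictionary from the earlier sketch together with the score formula as reconstructed above.

def score(word_a, word_b, freq, links):
    # Joint probability score: squared co-occurrence count over the product of
    # the individual word frequencies (log and constant terms discarded).
    f_ab = links.get((word_a, word_b), 0)
    f_a, f_b = freq[word_a], freq[word_b]
    return (f_ab * f_ab) / (f_a * f_b) if f_a and f_b else 0.0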


Example Query

Instead of comparing all words in the vocabulary to the word of interest for a similarity measure, a

pool of word candidates is extracted from the links of links to the query word. An example is given

as follows:

Figure 2: The Pool of Candidate Words is extracted from the Second Level of Associations

1. Given the word of interest 'bread', we first obtain a word vector of maximum dimension n, sorted based on the joint probability score of each link (or skip-bigram). In this example we will set n to 100:

V_bread = { w : S('bread', w) is among the n highest scores over all links of 'bread' }, sorted by S('bread', w) in descending order.

2. In order to find candidate words, we repeat Step 1 for every word in the set V_bread; the union of the resulting word vectors forms the pool of candidate words.

3. There is a wide range of other methods available for similarity computation, even though only one is shown here. In this example, the similarity between each candidate word w_d and the query word is calculated using the "min/max" measure, loosely regarded as a weighted version of the Jaccard similarity measure, as follows:

sim('bread', w_d) = [ Σ_{w ∈ V_bread} min( S('bread', w), S(w_d, w) ) ] / [ Σ_{w ∈ V_bread} max( S('bread', w), S(w_d, w) ) ]

There are a few things to note in the equation above. First of all, as opposed to 'bread', there is no restriction imposed regarding the position of a word w in the word vector for w_d:

V_{w_d} = { w : S(w_d, w) > 0 }, with no truncation to the top n entries.


Therefore, a word can be ranked at a position lower than n in the word vector for w_d and still contribute to the similarity score calculation. This implies that the similarity score essentially measures "how much w_d matches 'bread'" rather than making a mutual comparison between the two sets. Consequently, the measure is not symmetric: comparing the two words in the reversed order will, in general, not give the same score, i.e. sim('bread', w_d) ≠ sim(w_d, 'bread').

4. The word most similar to 'bread' is the candidate word with the highest similarity score:

w* = argmax_{w_d} sim('bread', w_d)
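The whole query procedure (Steps 1 to 4) can be sketched as below, reusing the freq and links structures and the score() helper from the earlier sketches; this is a hypothetical illustration rather than the NeuraBASE toolbox implementation.

def word_vector(word, freq, links, n=None):
    # All skip-bigram partners of `word` with their joint probability scores,
    # optionally truncated to the n highest-scoring links.
    scores = {w2: score(word, w2, freq, links)
              for (w1, w2) in links if w1 == word}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n]) if n else dict(ranked)

def most_similar(query, freq, links, n=100):
    v_query = word_vector(query, freq, links, n)                 # Step 1
    candidates = {w2 for w in v_query                            # Step 2: links of links
                  for w2 in word_vector(w, freq, links, n)} - {query}

    def sim(cand):                                               # Step 3: min/max measure
        v_cand = word_vector(cand, freq, links)                  # no top-n restriction
        num = sum(min(s, v_cand.get(w, 0.0)) for w, s in v_query.items())
        den = sum(max(s, v_cand.get(w, 0.0)) for w, s in v_query.items())
        return num / den if den else 0.0

    return sorted(((c, sim(c)) for c in candidates),             # Step 4: rank candidates
                  key=lambda kv: kv[1], reverse=True)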

Results and Discussion

Comparison with word2vec

A comparison between the output obtained using our approach against the word2vec approach is

given in the Appendix.

Word2vec is undoubtedly an impressive tool that has achieved state-of-the-art performance in finding 'word embeddings'. However, the authors noted that the correct training parameters, such as the vector-space dimensionality and the choice of learning algorithm for a specific dataset, are required to achieve good results. The non-incremental nature of learning caused by the learning

rate decay regularisation method (and binary tree speed-up) also makes it more of a ‘one-off’

learning process. Any expansion of the vector size as well as additional entries into the vocabulary

or corpus requires retraining of the entire model (according to our understanding). Naturally, there

should be workarounds for these available; one such option would be to set a minimum learning

rate or to disregard new vocabulary when training on new datasets at the potential risk of reduced

performance. The objective of the algorithm itself is intuitive: words that are used in similar

contexts will be closer to each other in the language space; however, this intuition does not appear

to be equally obvious when it comes to the actual implementation.

On the other hand, our approach of processing the vectors only at the 'query' stage rather than at the 'build' stage allows for incremental learning and gives us the option to customize the similarity computation algorithms without having to retrain the model.
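Concretely, folding new text into the trained model is just an update of the stored frequencies; the sketch below (a hypothetical helper reusing the structures from the earlier sketches) requires no retraining of previously seen words, because similarity scores are only computed at query time.

def update_with_new_text(new_tokens, freq, links, keep, half_window=5):
    # Incrementally add a new batch of tokenised text to the existing counts;
    # nothing else needs to be rebuilt or re-vectorised.
    for i, centre in enumerate(new_tokens):
        freq[centre] += 1
        if centre not in keep:
            continue
        for j in range(max(0, i - half_window), min(len(new_tokens), i + half_window + 1)):
            if j != i and new_tokens[j] in keep:
                links[(centre, new_tokens[j])] += 1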

For detailed information or feedback, please email [email protected]


REFERENCES

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word

Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Robert Hercus. Neural network with learning and expression capability. U.S. Patent 20090119236, 2009.

[3] Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent

syntagmatic and paradigmatic relations between words in high-dimensional vector spaces.

Doctoral thesis, Stockholm University, 2006.


Appendix

The following is a comparison of top 20 results between our method using NeuraBASE and word2vec for selected query words.

Note:

The word2vec training parameters have been set as follows:

Architecture: Continuous bag-of-words, vector size: 200, window size: 5 in either direction
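For reference, roughly equivalent settings can be expressed with the gensim library; this is an illustrative assumption only, since the experiments here used the original word2vec tool and gensim's remaining defaults may differ.

from gensim.models import Word2Vec

# `corpus` should be an iterable of tokenised sentences, e.g. the first 100MB
# of Wikipedia text used above; a tiny placeholder is shown here.
corpus = [["bread", "and", "butter"], ["bread", "with", "cheese"]]

# Continuous bag-of-words (sg=0), 200-dimensional vectors, window of 5 words
# in either direction. On the real corpus, min_count=5 would mirror the
# rare-word threshold used for NeuraBASE.
model = Word2Vec(sentences=corpus, vector_size=200, window=5, sg=0, min_count=1)

# With the real corpus, nearest neighbours can then be queried, e.g.:
# model.wv.most_similar("bread", topn=20)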

Table 3: Comparison between the top 20 results obtained using NeuraBASE and word2vec

Query words (column groups, left to right): teacher (frequency 671), fire (frequency 2557), apple (frequency 1500), three (frequency 119608). Within each group, the first word and score in a row are the NeuraBASE result and the second pair is the word2vec result.

Rank

1 student 0.06398 lawyer 0.47024 firing 0.12191 firing 0.41802 macintosh 0.21016 intel 0.46681 four 0.79729 four 0.66971

2 philosopher 0.05321 lecturer 0.44089 fires 0.11469 muzzle 0.38783 mac 0.17156 macintosh 0.42771 five 0.74859 five 0.62120

3 teaching 0.04941 student 0.43920 weapons 0.09505 jarama 0.36706 os 0.15506 ibm 0.41315 two 0.72236 six 0.51925

4 school 0.04817 widower 0.43565 guns 0.09000 seams 0.36572 pc 0.13798 pentium 0.40663 six 0.71376 gru 0.45639

5 founder 0.04477 clergyman 0.43272 tank 0.08618 bolts 0.36306 microsoft 0.12412 amiga 0.39183 seven 0.64750 simms 0.44842

6 instructor 0.04387 thinker 0.42993 combat 0.08356 uncontrollable 0.36014 ibm 0.12162 intellivision 0.37861 zero 0.62782 seven 0.43043

7 taught 0.04348 blacksmith 0.42918 explosive 0.08209 projectiles 0.35780 interface 0.11784 pc 0.35993 eight 0.62387 sq 0.41097

8 grade 0.03875 practitioner 0.42727 strike 0.08151 guns 0.35386 computers 0.11782 jef 0.35638 one 0.59077 lup 0.40167

9 teachers 0.03699 layman 0.42538 carrying 0.08146 propellers 0.35074 linux 0.11563 mac 0.35579 nine 0.48254 xxxxx 0.40072

10 composer 0.03648 niece 0.42347 air 0.08020 lift 0.35040 operating 0.11163 wozniak 0.34897 starting 0.32980 stp 0.39837

11 teach 0.03495 librarian 0.42090 missile 0.07979 wwi 0.34618 computer 0.11082 microsoft 0.34672 roughly 0.31867 two 0.39612

12 training 0.03390 freedman 0.42028 fired 0.07951 whipped 0.34549 compatibility 0.10822 dynamite 0.34394 year 0.31022 honorifics 0.39431

13 professor 0.03355 notary 0.41575 assault 0.07840 shechem 0.34497 windows 0.10734 borland 0.33859 approximately 0.30992 eight 0.39233

14 painter 0.03313 courtier 0.41554 storm 0.07771 dive 0.34354 platforms 0.10701 mcintosh 0.33169 until 0.30584 rol 0.39112

15 worked 0.03297 outsider 0.41110 enemy 0.07585 intramuscular 0.34003 intel 0.10645 absinth 0.33154 nd 0.30417 hyi 0.38923

16 students 0.03216 platonist 0.41048 causing 0.07322 brine 0.33900 amiga 0.10294 ericsson 0.32359 june 0.30054 pngimage 0.38703

17 friend 0.03201 apprenticeship 0.40783 weapon 0.07294 attrition 0.33751 jobs 0.10171 mattel 0.32254 average 0.29573 chpts 0.38110

18 aristotle 0.03074 masterpiece 0.40695 kill 0.07271 loading 0.33539 compatible 0.10015 sgi 0.32185 march 0.29248 donoghue 0.37823

19 mystic 0.03055 hitter 0.40645 burst 0.07262 quarantine 0.33479 software 0.09991 visio 0.32148 th 0.29202 pix 0.37812

20 ross 0.02977 gentleman 0.40466 arrows 0.07243 artillery 0.33460 portable 0.09968 diablo 0.32068 dated 0.28907 mj 0.37463



Table 4: Comparison between the top 20 results obtained using NeuraBASE and word2vec

Query words (column groups, left to right): mole (frequency 101), fable (frequency 57), father (frequency 3918), bread (frequency 349). Within each group, the first word and score in a row are the NeuraBASE result and the second pair is the word2vec result.

Rank

1 moles 0.07254 firth 0.41636 allegory 0.10141 sceptre 0.43982 son 0.30681 mother 0.72927 cheese 0.10337 wine 0.62061

2 marsupial 0.05761 azov 0.40188 fables 0.05919 sequel 0.37218 mother 0.23377 son 0.67530 wine 0.10308 vegetables 0.50552

3 molar 0.05614 fennel 0.39648 dictates 0.05881 retelling 0.37126 sons 0.17113 wife 0.67479 potatoes 0.09714 apples 0.50368

4 rodent 0.05479 camel 0.39565 parable 0.05781 fleck 0.36977 wife 0.16532 brother 0.66381 sauce 0.09621 beans 0.50284

5 grams 0.05177 boil 0.39487 sentimental 0.05691 misapplication 0.36843 daughter 0.15612 nephew 0.65945 foods 0.09371 cheese 0.50279

6 mol 0.04886 gulfs 0.39249 mmorpg 0.05635 voyaging 0.36500 god 0.15364 grandmother 0.62331 rice 0.09141 soup 0.50235

7 fractions 0.04880 girth 0.39208 fairy 0.05347 lipstick 0.36400 brother 0.15068 uncle 0.62317 flour 0.08546 grains 0.49829

8 lizard 0.04404 cession 0.39021 tale 0.05019 caricature 0.36294 whom 0.14671 grandson 0.61680 soup 0.08205 drink 0.49422

9 swan 0.04247 millimetre 0.39011 anecdote 0.04923 depiction 0.36118 throne 0.12655 grandfather 0.61605 vegetables 0.07852 roasted 0.48826

10 monkey 0.04089 ovules 0.38912 myth 0.04825 mazurek 0.35948 uncle 0.12228 niece 0.60081 meal 0.07812 onions 0.48034

11 trout 0.04065 eruption 0.38138 ballad 0.04557 philologist 0.35874 died 0.12047 daughter 0.59447 pork 0.07697 meat 0.47758

12 react 0.04053 zebras 0.37883 isis 0.04516 manilius 0.35673 sister 0.11867 girlfriend 0.59433 dishes 0.07686 stewed 0.47642

13 leopard 0.03840 gozo 0.37794 sings 0.04430 juvenal 0.35589 husband 0.11494 mentor 0.58004 cooking 0.07287 pork 0.47550

14 hemoglobin 0.03833 enjoyment 0.37740 colorful 0.04405 tuft 0.35444 younger 0.11366 parents 0.58003 fried 0.07245 sweets 0.47385

15 rat 0.03528 uterus 0.36936 sacrificing 0.04348 tetrahedron 0.35383 king 0.11305 cousin 0.56180 chicken 0.07222 fruit 0.46802

16 salamander 0.03426 lullaby 0.36814 vivid 0.04237 thesaurus 0.35325 jesus 0.11235 widow 0.56092 wheat 0.07184 cooked 0.46495

17 burrowing 0.03353 pug 0.36580 inversion 0.04064 portmanteau 0.35262 parents 0.11092 majesty 0.55683 meat 0.07169 baked 0.46407

18 glue 0.03331 peterhead 0.36504 afrikaans 0.04049 missa 0.35099 child 0.11040 successors 0.55654 sugar 0.07095 juices 0.46392

19 spider 0.03213 inducement 0.36222 narrative 0.04039 colloquialism 0.34994 heir 0.10957 colleague 0.54956 cake 0.06991 twigs 0.46162

20 marmoset 0.03188 remnants 0.36141 ballads 0.04003 pictorial 0.34955 grandson 0.10912 stepson 0.52620 fruit 0.06464 maize 0.45954
