lexbank: a multilingual lexical resource for low … ·  · 2016-09-13iv al tarouti, feras a....


Page 1: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

LexBank: A Multilingual Lexical Resource for Low-Resource

Languages

by

Feras Ali Al Tarouti

M.S., King Fahd University of Petroleum & Minerals, 2008

B.S., University of Dammam, 2001

A Dissertation submitted to the Graduate Faculty of the

University of Colorado at Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2016

© Copyright by Feras Ali Al Tarouti 2016

All Rights Reserved

This dissertation for Doctor of Philosophy degree by

Feras Ali Al Tarouti

has been approved for the

Department of Computer Science

by

Jugal Kalita, Chair

Tim Chamillard

Rory Lewis

Khang Nhut Lam

Sudhanshu Semwal

Date

Al Tarouti, Feras A. (Ph.D., Computer Science)

LexBank: A Multilingual Lexical Resource for Low-Resource Languages

Dissertation directed by Professor Jugal Kalita

In this dissertation, we present new methods to create essential lexical resources for low-resource languages. Specifically, we develop methods for enhancing automatically created wordnets. As a baseline, we start by producing core wordnets for several languages, using methods that need only limited, freely available resources (Lam et al., 2014a,b, 2015b). Then, we establish the semantic relations between synsets in the wordnets we create. Next, we introduce a new method to automatically add glosses to the synsets in our wordnets. Our techniques use limited resources as input to ensure that they can be felicitously used with languages that currently lack many original resources. Most existing research works with languages that have significant lexical resources available, which are costly to construct. To make our lexical resources publicly available, we developed LexBank, a Web-based system that provides language services for several low-resource languages.

To my mother, father and my wife.

Acknowledgments

I would like to express my appreciation to my wife and the mother of my kids, Omima, for the unlimited support she gave me during my journey toward my Ph.D. I am also very grateful for the support and guidance provided by my advisor, Dr. Jugal Kalita. In addition, I would like to thank my dissertation committee members, Dr. Sudhanshu Semwal, Dr. Tim Chamillard, Dr. Rory Lewis and Dr. Khang Nhut Lam, for their guidance and consultation.

Table of Contents

1 Introduction
1.1 Motivation
1.2 Research Focus
1.2.1 Assamese Language
1.2.1.1 Assamese Script
1.2.1.2 Assamese Morphology
1.2.1.3 Assamese Syntax
1.2.2 Vietnamese Language
1.2.2.1 Vietnamese Script
1.2.2.2 Vietnamese Morphology
1.2.2.3 Vietnamese Syntax
1.3 Research Contributions

2 Case Study: The Current Status and Challenges of processing information in Arabic
2.1 Introduction
2.2 Fundamental of Arabic
2.2.1 Arabic Script
2.2.2 Arabic Morphology
2.2.3 Arabic Syntax
2.3 Summary

3 Literature Review
3.1 Automatic Construction of Wordnets
3.2 Wordnet Management Tools
3.3 Creating Bilingual Dictionaries
3.4 Summary

4 Automatically Constructing Structured Wordnets
4.1 Constructing Core Wordnets
4.2 Constructing Wordnet Semantic Relations
4.3 Experiment and Evaluation
4.4 Summary

5 Enhancing Automatic Wordnet Construction Using Word Embeddings
5.1 Introduction
5.2 Similarity Metrics
5.3 Generating Word Embeddings
5.4 Removing Irrelevant Words in Synsets
5.5 Validating Candidate Relations
5.6 Selecting Thresholds
5.7 Experiments
5.7.1 Generating Vector Representations of Wordnets Words
5.7.2 Producing Word Embeddings for Arabic
5.8 Evaluation & Discussion
5.9 Summary

6 Selecting Glosses for Wordnet Synsets Using Word Embeddings
6.1 Creating Language Model Using Word Embedding
6.2 Generating Vector Representation of Wordnet Synsets
6.3 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
6.4 Evaluation
6.4.1 Using Synset2vec to Select Glosses for PWN Synsets
6.4.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
6.4.3 Results & Discussion
6.5 Summary

7 LexBank: a Multilingual Lexical Resource
7.1 Introduction
7.2 Database Design
7.2.1 The system settings database
7.2.1.1 Users_Info
7.2.1.2 System_log
7.2.2 The lexical resources database
7.2.2.1 CoreWordnet
7.2.2.2 Sem_Relations
7.2.2.3 WordnetGlosses
7.2.2.4 Sem_Relations_Eval_Data
7.2.2.5 Sem_Relations_Eval_Response
7.2.2.6 WordnetGlosses_Eval_Data
7.2.2.7 WordnetGlosses_Eval_Response
7.3 Application layer
7.4 Web Interface Design & Implementation
7.4.1 Registration Form
7.4.2 Log-in Form
7.4.3 The Main Menu
7.4.4 Searching Wordnet By Lexeme Web Form
7.4.5 Searching Wordnet By OffsetPos Web Form
7.4.6 Evaluating Semantic Relations Between Synsets Web Form
7.4.7 Evaluating Wordnet Synsets Glosses Web Form
7.4.8 Users Management Web Form
7.5 Summary

8 Conclusions

9 Future Work
9.1 Extending Bilingual Dictionaries
9.1.1 Related Work
9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
9.2 Integrating Part-of-speech Tagging into Wordnet Construction
9.3 Wordnet Expansion Using Word Embeddings
9.4 Producing Vector Representation for Multi-word Lexemes
9.5 Vector Representation for Multi-lingual Wordnets

Bibliography

Appendices
A Data Processing Software Code
A.1 computCosineSim.py
A.2 GenerateVectorForSynset.py
A.3 GenerateVectorForGloss.py
A.4 ComputeGlossSynsetSimilarity.py
B Microsoft SQL Server Tables
C LexBank Utility Class

List of Tables

3.1 A list of the Java libraries tested in (Finlayson, 2014).
3.2 A comparison between some of the Java libraries for accessing the PWN.
4.1 Wordnet semantic relations.
4.2 Size, coverage and precision of the core wordnet we create for Arabic, Assamese and Vietnamese.
4.3 Precision of the semantic relations established for our Arabic wordnet.
5.1 An example of cosine similarity between words in a candidate synset.
5.2 The weighted average similarity between related words in AWN.
5.3 Comparison between the weighted similarity average obtained using different word2vec settings.
5.4 Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
5.5 Precision of the Arabic wordnet we create.
5.6 Precision of the Assamese wordnet we create.
5.7 Precision of the Vietnamese wordnet we create.
5.8 Examples of related words and their cosine similarity from our Arabic wordnet.
5.9 Examples of related words and their cosine similarity from our Assamese wordnet.
5.10 Examples of related words and their cosine similarity from our Vietnamese wordnet.
6.1 Meanings of the noun “spill” and its synonyms.
6.2 Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.3 The precision of selecting glosses for PWN synsets.
6.4 Examples of Arabic glosses we produce in our Arabic wordnet.
6.5 Examples of Assamese glosses we produce in our Assamese wordnet.
6.6 Examples of Vietnamese glosses we produce in our Vietnamese wordnet.
6.7 The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets.

List of Figures

3.1 An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014).
4.1 IWND.
4.2 Core wordnet mapping to structured wordnet.
4.3 Creating wordnet semantic relations using intermediate wordnet.
4.4 The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.
4.5 Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.
5.1 A histogram of synonyms, semantically related words, and non-related words extracted from AWN.
6.1 An example of creating a vector for a wordnet synset that includes more than one word.
6.2 An example of creating vectors for wordnet synsets that share a single word.
7.1 An overview of LexBank system.
7.2 LexBank web site map.
7.3 The registration web form.
7.4 Sequence diagram of the registration process.
7.5 The log-in web form.
7.6 Sequence diagram of the log-in process.
7.7 The main menu.
7.8 The Web form for searching wordnet by lexeme. The form is showing the result of searching the Arabic lexeme (مصر) which means Egypt.
7.9 Sequence diagram of the process of searching wordnet using lexeme.
7.10 The Web form for searching wordnet by OffsetPos. The form is showing the result of searching the Arabic synset (08897065-n).
7.11 Sequence diagram of the process of searching wordnet using OffsetPos.
7.12 The Web form for evaluating semantic relations between synsets in a wordnet. The form is showing an example of evaluating a hyponymy relation between the two Assamese lexemes radiotelegraph and radio.
7.13 Sequence diagram of the process of evaluating the relation between two lexemes.
7.14 The Web form for evaluating wordnet synsets glosses. The form is showing an example of evaluating Arabic synset (13108841-n).
7.15 Sequence diagram of the process of evaluating the relation between two lexemes.
7.16 The Web form for managing users in LexBank.
7.17 Sequence diagram of the process of managing users in LexBank.
9.1 The IW approach for creating a new bilingual dictionary.
9.2 Extending bilingual dictionaries using structured wordnets.

Chapter 1

INTRODUCTION

1.1 Motivation

A lexical resource is a classified group of lexical units that provides some linguistic information. The lexical units can be morphemes, words or multi-word phrases. The basic unit of a lexical resource is usually called a lexical entry. Some lexical resources can be used by humans directly, while others are machine readable. Lexical resources are the base of most Natural Language Processing (NLP) applications.

There are many types of lexical resources. Depending on its type, a lexical resource can provide syntactic, morphological, phonological or semantic information. Lexicons, unilingual dictionaries, bilingual dictionaries and wordnets are examples of lexical resources. A few fortunate languages, such as English and Chinese, have a relatively large number of high-quality lexical resources. These languages are usually called resource-rich. Most of the lexical resources of the resource-rich languages have been painstakingly created by hand by researchers over many years. Unfortunately, most other languages lack many of those lexical resources. Languages that lack lexical resources are called low-resource or resource-poor languages. While some of those languages have some resources, others have barely any. Especially poor in this context are the endangered languages around the world.

One important resource that is very helpful in computational processing and in human language learning is a thesaurus, which provides synonyms and antonyms of words. An extended version of a thesaurus that provides additional relations among words in the computational context is usually called a wordnet. A wordnet is a structured lexical ontology that groups words by meaning into sets called synsets. For example, the words helicopter, chopper, whirlybird and eggbeater are grouped in one synset that has the gloss: an aircraft without wings that obtains its lift from the rotation of overhead blades. The wordnet connects synsets with each other based on semantic relations. Wordnets are used in many applications such as word sense disambiguation, machine translation, information retrieval, text classification and text summarization.
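The synset structure described above can be sketched in a few lines of Python. This is only an illustrative in-memory layout, not the storage format used by PWN or by LexBank: the class names `Synset` and `Wordnet`, the identifiers `ex-helicopter` and `ex-aircraft`, the placeholder gloss, and the hypernym link between the two example synsets are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One synset: synonymous lexemes, a gloss, and outgoing relations."""
    synset_id: str                                  # hypothetical identifier, not a PWN offset
    lexemes: set
    gloss: str
    relations: dict = field(default_factory=dict)   # relation name -> set of synset ids


class Wordnet:
    """A toy in-memory wordnet indexed by synset id and by lexeme."""

    def __init__(self):
        self.synsets = {}      # synset id -> Synset
        self.by_lexeme = {}    # lexeme -> set of synset ids

    def add(self, synset):
        """Store a synset and index each of its lexemes."""
        self.synsets[synset.synset_id] = synset
        for lexeme in synset.lexemes:
            self.by_lexeme.setdefault(lexeme, set()).add(synset.synset_id)

    def relate(self, source_id, relation, target_id):
        """Record a directed semantic relation between two stored synsets."""
        source = self.synsets[source_id]
        source.relations.setdefault(relation, set()).add(target_id)

    def lookup(self, lexeme):
        """Return every synset that contains the given lexeme."""
        ids = sorted(self.by_lexeme.get(lexeme, ()))
        return [self.synsets[i] for i in ids]


wn = Wordnet()
wn.add(Synset(
    "ex-helicopter",
    {"helicopter", "chopper", "whirlybird", "eggbeater"},
    "an aircraft without wings that obtains its lift from "
    "the rotation of overhead blades",
))
# Second synset and its gloss are placeholders for the illustration only.
wn.add(Synset("ex-aircraft", {"aircraft"}, "placeholder gloss"))
wn.relate("ex-helicopter", "hypernym", "ex-aircraft")
wn.relate("ex-aircraft", "hyponym", "ex-helicopter")

for synset in wn.lookup("chopper"):
    print(synset.synset_id, sorted(synset.lexemes), synset.relations)
```

Keeping a separate lexeme index means that a word with several senses maps to several synsets, which is the property that distinguishes a wordnet from a flat synonym list.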

The Princeton WordNet (PWN) is the original English wordnet. It has been painstakingly produced over several decades at Princeton University through diligent manual work, augmented by the development of computational tools. Similar complete wordnets have also been produced for a small number of additional languages such as French (Sagot and Fišer, 2008), Finnish (Lindén and Carlson, 2010) and Japanese (Kaji and Watanabe, 2006). Efforts to produce wordnets for a variety of other languages have been proposed, but most are moving slowly, such as the efforts to construct the Asian wordnets (Charoenporn et al., 2008) and Indian wordnets (Bhattacharyya, 2010).

Another important type of resource is the bilingual dictionary, an essential tool for human language learners. Most existing (online) bilingual dictionaries are between two resource-rich languages or between a resource-rich language and a resource-poor language. It is fortunate that many endangered languages have one bilingual dictionary, usually created by explorers, evangelists or other scholars. However, dictionaries or translators between two resource-poor languages do not really exist. Wiktionary¹, a dictionary created by volunteers, supports over 171 languages, although coverage is poor for many of them. The online translation machines developed by Google² and Microsoft³ provide pairwise translations, including translations for single words, for 90 and 51 languages, respectively. While this is a wide range of languages, these machine translators still leave out many widely spoken languages, not to mention endangered ones.

¹ http://en.wiktionary.org/wiki/Wiktionary:Main_Page
² http://translate.google.com/
³ http://www.bing.com/translator

In previous work we focused on developing new techniques that leverage existing resources for resource-rich languages to build bilingual dictionaries, core wordnets and other resources such as simple translators for resource-poor languages, including a few endangered ones (Lam et al., 2014a,b, 2015b). In this thesis work, we take these resources to the next level by improving their functionality, quality and coverage. We present several new techniques that we did not use in our previous work. Our ultimate goal is to produce an integrated multilingual lexical resource available online, one that includes several important individual resources for several languages. We believe that our resources will help researchers, speakers, learners and other users of these languages.

1.2 Research Focus

The goal of this dissertation is to create and make available multilingual lexical resources for several languages by bootstrapping from a limited number of existing resources. Our study has the potential not only to construct new lexical resources, but also to provide support for communities using languages with limited resources. Additionally, our research presents novel approaches to generate new lexical resources from a limited number of existing resources.

The main focus of our work is to collect data from disparate sources, develop algorithms for mining and integrating such data, produce lexical resources, and evaluate the resources with regard to the quality and quantity of entries. To develop and test our ideas, we work with a few languages for which we have in-house expertise: Assamese (asm), Arabic (arb), English (eng) and Vietnamese (vie). In Chapter 2 we present a detailed introduction to Arabic. Next, we present a brief introduction to Assamese and Vietnamese.

1.2.1 Assamese Language

Assamese is an Indo-European language that is spoken by more than 15 million people (Hinkle et al., 2013). It is mainly used in the Indian states of Assam, Arunachal Pradesh, Meghalaya, Nagaland and West Bengal. Assamese has four dialects: Standard Assamese, Jharwa, Mayang and Western Assamese (Gordon and Grimes, 2005). We present a brief description of the script, morphology and syntax of Assamese.

1.2.1.1 Assamese Script

Assamese script consists of 37 consonants, 11 vowels, 147 conjuncts and a few punctuation marks (Hinkle et al., 2013). Unlike English, where written letters may have variable pronunciations, each Assamese written letter has one pronunciation. A consonant that does not occur at the end of a word is assumed to be followed by the implicit vowel a. However, when several consonants need to be pronounced together, they are usually written using a new conjunct letter.

When a vowel follows a consonant, the vowel is not written explicitly, but implicitly as an operator. These operators are attached to consonants in different positions (Hinkle et al., 2013). They can appear to the left, right, below or above the consonants. Foreign words can appear in Assamese script as transliterations. However, it is not unusual to write foreign words in foreign alphabets within a piece of Assamese text.

1.2.1.2 Assamese Morphology

Assamese morphology has two types of morphological transformations: derivational and inflectional. Around 48% of Assamese words are constructed using these two types of transformation (Sharma et al., 2008). The derivational transformation in Assamese is usually performed by changing the vowel component in the word, while the inflectional transformation is performed by adding prefixes or suffixes to the word. Assamese is well known for its complex suffixes. It is common for an Assamese word to include a sequence of suffixes; four to six suffixes in sequence are not uncommon (Saharia et al., 2009).

In Assamese, suffixes are used for many purposes. The most common purpose of suffixes is determination (Sharma et al., 2008). In fact, a large number of the Assamese suffixes are determiners. As in other languages, some determiners are attached to nouns and pronouns to make them specific. This is similar to using this and that in English. Unlike in many other languages, such as English, where affixes are used, determiners in Assamese are also used to turn a singular noun into a plural.

1.2.1.3 Assamese Syntax

Assamese has a less rigid syntax and is considered a free word order language: sentences can be written in different word orders and still have the same meaning. The normal form of a simple Assamese sentence is Subject+Object+Verb (SOV) (Sarma, 2012), although other orders are acceptable.

1.2.2 Vietnamese Language

Vietnamese, the first language of Vietnam, is an Austroasiatic language that arose in Indo-China (Thompson, 1987). It is the first language of more than 75 million people living in Vietnam (Gordon and Grimes, 2005). Also, due to emigration, it is the first language of many people living around the world, especially in East and Southeast Asia. Vietnamese, which is also called Annamese, has five main dialects that differ mainly in their sound systems: Northern Vietnamese, North-Central Vietnamese, Mid-Central Vietnamese, South-Central Vietnamese and Southern Vietnamese (Wikipedia, 2016a). In the next sections, we present a brief description of the script, morphology and syntax of Vietnamese.

1.2.2.1 Vietnamese Script

Old Vietnamese texts are written using Chinese characters. In the 17th century, the Latin alphabet was introduced to Vietnamese by the French. By the beginning of the 20th century, the Romanized version of Vietnamese became dominant (Thompson, 1987).

Compared to other languages, Vietnamese has a large number of vowels. It has 11 single vowels in addition to three types of composed vowels: centering diphthongs, closing diphthongs and triphthongs (Gordon and Grimes, 2005). The composed vowels are created by combining single vowels. Vowels are modified by diacritics. The diacritics, which can be written above or below a vowel, are used to specify the tone of the vowel. These tones have different lengths, pitch heights, pitch melodies and phonations. There are 25 consonants in Vietnamese. Consonants are represented in written script by a variable number of letters. Some consonants are represented by one letter and others by a digraph, which is a combination of two letters. Some consonants are represented by more than one digraph or letter (Wikipedia, 2016a).

1.2.2.2 Vietnamese Morphology

In Vietnamese, the majority of words are polysyllabic (Noyer, 1998), that is, composed of two or more syllables. Polymorphemic words in Vietnamese are constructed in three ways: combining two words, adding affixes to a stem, or reduplication. Words formed by reduplication are constructed by duplicating a word or a part of a word. There are a small number of affixes in Vietnamese, most of them prefixes and suffixes. One distinct characteristic of Vietnamese is that it does not have any number, gender, case or tense distinction (Wikipedia, 2016b). However, usually a noun classifier is used as a determiner and is added after the word to specify those characteristics.

1.2.2.3 Vietnamese Syntax

Vietnamese sentences follow the Subject+Verb+Object (SVO) word order. To distinguish between verbs and nouns in a Vietnamese sentence, a copula is used before the nouns. Noun phrases are usually composed of a noun and a modifier. The modifier can be a numerator, classifier, prepositional phrase or other descriptive word. As in other languages, pronouns are used to substitute for nouns and noun phrases.

1.3 Research Contributions

The resources created in Khang Nhut Lam's PhD dissertation (Lam, 2015) and reported in

(Lam et al., 2014a,b, 2015b) have many gaps. For example, the wordnets contain only

synsets, which are sets of synonyms for words. In this dissertation, we develop algorithms

and models to automatically establish the semantic relations between synsets in our

previously created core wordnets for our languages of focus, using both pre-existing

resources and resources we bootstrap ourselves. The contributions of this dissertation

are as follows:

• We construct the rest of the structure for our core wordnets with acceptable quality.

We focus on the construction of wordnet semantic relations such as Hypernyms,

Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms between the

synsets. We believe that our work contributes significantly to the repository of

resources for languages that lack them.

• We present a method to enhance the quality of the wordnets we create in the first task

by filtering out mistakenly created synsets and relations. In this task, we use one

of the state-of-the-art techniques, word embedding (Mikolov et al., 2013).

This method addresses the problem of wrong translations produced by the

translation method.

• We produce an approach to create a vector representation for synsets. This approach

aims to produce a better way of representing meaning. This representation can be

used in several areas. In this task, we use it to automatically extract glosses from

corpora for the wordnet synsets we create in the previous tasks. It can also be used in

the word-sense disambiguation (WSD) problem, which occurs with words that have

multiple meanings.

• Then, based on the vector representation of synsets, we present a novel approach

to add a gloss for each synonym set (synset) in our core wordnets. A gloss is a

definition or a sentence that clarifies the meaning of the synset. Glosses are mostly

added manually by humans or generated automatically using a rule-based generation

approach (Cucchiarelli et al., 2004).

• Finally, we present LexBank, a system that makes our created resources

available to the public. We design and implement the system such that it provides

useful services, in a friendly manner, for users who seek linguistic resources. We aim

to make our system flexible and expandable so that it can accommodate additional

languages and resources.

Chapter 2

CASE STUDY: THE CURRENT STATUS AND CHALLENGES OF

PROCESSING INFORMATION IN ARABIC

Since Arabic is one of the languages we use in our experiments throughout this dis-

sertation, we present the current status of Arabic language processing as an example in this

chapter.

2.1 Introduction

According to Ethnologue (Gordon and Grimes, 2005), Arabic is the official language

of more than 223 million people in 25 countries, which makes it one of the most widely

spoken languages in the world. Arabic is the language of Islam, which is the religion of

1.6 billion people around the world. Muslims are required to use Arabic to read the Quran

(the Holy Book of Islam) and to perform the rituals of Islam. There are around 30 major

dialects in Arabic. These dialects have different phonologies, morphologies, syntax and

even lexicons (Habash, 2010). However, these dialects are not used as official languages

by themselves. They are used for informal speech. For formal writing and speaking, the

official Modern Standard Arabic (MSA) is used. MSA was developed based on Classical

Arabic, which is the language of historical literature. Dialects are now commonly used

for writing in social media, but they are rarely used in books, newspapers and

literary writing. Even though most Arabs can speak MSA, it is not the natively spoken

language of any region (Diab and Habash, 2007). This coexistence between MSA and

dialects is problematic for Arabic language processing. The same problem arises in

most of the widely spoken languages of the world.

One important survey (Farghaly and Shaalan, 2009) discussed the importance of

research in the field of Arabic processing from two perspectives. The first is the perspective

of non-Arabic speakers who need to process a huge amount of Arabic text. The Department

of Homeland Security in the United States is a good example. With increasing security

risks, there is a crucial need to understand the meaning of Arabic documents

and retrieve important information from them, such as names, organizations and places. The

second perspective is that of Arabic speakers. Machine translation, information retrieval,

summarization, and linguistic tools are some of the applications requested by

Arabic speakers.

In the rest of this chapter, we give a summary of the features that make the process-

ing of Arabic text so challenging and some of the solutions and resources that have been

designed to address these challenges. First, in Section 2, we discuss the fundamental issues

in Arabic which are the script, the morphological issues, and the syntactical issues. Then,

in Section 3, we discuss three of the most valuable resources for Arabic processing. These

are The Penn Arabic Treebank (PATB), The Prague Arabic Dependency Treebank (PADT),

and The Columbia Arabic Treebank (CATiB).

2.2 Fundamentals of Arabic

In this section we discuss the script, morphology and syntax of Arabic.

2.2.1 Arabic Script

Arabic is written in a right-to-left script. The Arabic script is also used by languages

such as Kurdish, Urdu, Persian and Pashto (Habash, 2010). One important aspect of Arabic

is that most Arabic letters are composed of two parts: a base form and a mark. There

are three kinds of marks in Arabic letters. The first kind consists of dots which are used

to distinguish between letters that share the same base form. An example of letters that

share the same base form are the letters (�) “ba”,(�) “ta”, and (�) “tha”. The second kind

of mark is the Hamza mark (�) which can be written above some letters, as in (�) “u”, or

under some letters, as in (�) “I”. Unfortunately, people often misspell words by not writing

such marks making it hard to distinguish between similar letters and causing ambiguity in

the text. It is also important to notice that Hamza (�) can also be considered a letter by

itself. An example of a word that has the Hamza letter is the word (����) which means

“sky”. The third kind of mark is the Hamza mark that distinguishes the letter (�) “Kaf”

from the letter (�) “Lam”.

Most letters in Arabic have several shapes. The shape of a written letter is determined

based on the position of that letter in the word. Let us take the letter (�) “qaf” as an

example. If it appears at the beginning of the word, it will have the shape (��) whereas it

will have the shape (���) if it appears in the middle of the word, and the shape (��) if it

is at the end of the word. All word processors select the appropriate letter shape based on

the rules which govern these shapes, and therefore, there is only one key for each letter.
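The relationship between the single key for a letter and its positional shapes can be inspected with Python's standard unicodedata module. The following is only an illustrative sketch and not part of the original work. It assumes that the four shapes of qaf are the code points U+FED5 to U+FED8 of the Unicode Arabic Presentation Forms-B block, and it shows that compatibility normalization maps each of them back to the one base letter U+0642.

```python
import unicodedata

QAF = "\u0642"  # base letter qaf: the single code point typed on a keyboard

# Positional shapes of qaf (assumed: Arabic Presentation Forms-B code points).
# Rendering software normally chooses among these automatically.
QAF_SHAPES = ["\uFED5", "\uFED6", "\uFED7", "\uFED8"]


def base_letter(shape):
    """Map a positional presentation form back to its base letter."""
    return unicodedata.normalize("NFKC", shape)


if __name__ == "__main__":
    for shape in QAF_SHAPES:
        print(unicodedata.name(shape), "->", unicodedata.name(base_letter(shape)))
```

Stored text normally contains only the base letter. The shape is a rendering decision, which is why one key per letter is sufficient.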

Inflectional morphology is also a factor that governs the shape of some Arabic letters.

The Arabic letter “Hamza” is a good example for that. The word (������), which means

“friends”, becomes (�������) instead of (�������) when we add the letter (�) which

means the possessive pronoun “my”.

In Arabic, each letter is mapped to one unvarying sound, which makes it a phonetic

language. For example, the Arabic letter (��) always has the pronunciation /s/. On the

other hand, letter “s” in English has three pronunciations: /z/, /s/, and /sh/ as in “nose”,

“salt”, and “sugar”, respectively. However, in Arabic a short vowel may be added to the

letter to change its sound. There are three short vowels in Arabic, which means that each

letter has three more sounds in addition to the original sound. There are no dedicated letters

to represent short vowels. The short vowels may be specified in the written language using

optional diacritics. To show how the short vowels change the sound of Arabic letters, let us

look at the Arabic letter (��) again. We said that (��) is pronounced as /s/; however, if we

add the short vowel “Dhamma” it will be pronounced as “su” and it may be written, with

the “Dhamma” diacritic, as (���). If we add the short vowel “Kasra”, it will be pronounced

as “si” and it may be written with “Kasra” diacritic, as (���). Keep in mind that in MSA, the

writing of the diacritics is optional, although a change in a diacritic of a letter can change

the meaning of the word and may even change the morphological structure of the sentence.

Clearly, this is a major source of ambiguity in Arabic processing (Diab and Habash, 2007).

Obviously, with all these problems caused by the Arabic script, Arabic input text

has to be pre-processed to enhance recognition during the actual processing. This pre-processing,

which is called normalization, aims to standardize different Arabic script variations.

Several solutions have been proposed to normalize the Arabic script. For example,

Larkey et al. (2002) normalized the corpus, the queries, and the dictionaries of Arabic

using the following steps. They first unified the encoding and removed punctuation

from the text. Then they removed all the diacritics and the non-letter character called “tatweel”.

After that, they removed the Hamza mark (�) from the letter “Alif” to standardize all the

variations (�),(�), and (�) to (�). Also, they replaced (��) with (�), (�) with (�), and (�)

with (�). The Stanford Natural Language Processing Group adopted a similar procedure in

the Stanford Arabic Statistical Parser (Green and Manning, 2010). Normalization,

as one might expect, does not come without a price. Since the purpose of all these removed

marks is to resolve ambiguity, normalizing the variant scripts increases the probability of

ambiguity (Farghaly and Shaalan, 2009).
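Some of the normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of Larkey et al. or of the Stanford parser. The Unicode ranges are our assumption about which characters count as diacritics and as marked forms of Alif, and the remaining letter replacements mentioned in the text are deliberately left out.

```python
import re
import unicodedata

# Assumed code points: tanween, short vowels, shadda and sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character, not a letter
# Assumed Alif variants: with madda, with hamza above, with hamza below.
ALIF_WITH_MARK = re.compile("[\u0622\u0623\u0625]")
BARE_ALIF = "\u0627"


def normalize_arabic(text):
    """Apply punctuation, diacritic, tatweel and Alif normalization."""
    # Drop every character whose Unicode general category is punctuation.
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALIF_WITH_MARK.sub(BARE_ALIF, text)
```

Because the function discards the marks that distinguish otherwise identical strings, two different input words can map to the same output. This is the added ambiguity noted above.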

Unlike English, there are no silent letters in Arabic. An example of a silent letter

in English is the letter “p” in the word “pneumonia”. There are no new sounds in Arabic

produced by combining two letters. For instance, in English, “c” and “h” are combined

to produce three distinct sounds: the sound at the beginning of “cheese”, the sound at the

beginning of “character”, and the sound at the beginning of “chef.”

It is well known that splitting text into sentences is an essential step

in many Natural Language Processing (NLP) applications. In English, this is a relatively

easy task since English sentences start with an uppercase letter and finish with a period.

However, splitting Arabic sentences is not as easy since there is no capital

form for Arabic letters (Chinese, Japanese, and Korean have no capitalization either). In

addition, punctuation rules in Arabic are not strict, so many people do not use punctuation

properly. In fact, Arabic writers make extensive use of coordination, subordination and

logical connectives to conjoin sentences (Farghaly and Shaalan, 2009). Hence, it is not

unusual for an Arabic article to have a complete paragraph that does not include any

periods other than the period at the end of the paragraph. Therefore, Arabic texts must

go through complicated preprocessing.
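The English heuristic mentioned above (a period followed by an uppercase letter) can be written as a short regular expression. The sketch below is a deliberately naive illustration of why the heuristic depends on capitalization. The function name and the pattern are ours, and real sentence splitters handle many more cases, such as abbreviations.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace and an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Naive splitter that relies on punctuation plus capitalization."""
    return BOUNDARY.split(text.strip())


if __name__ == "__main__":
    print(split_sentences("It rained all day. We stayed home."))
    # Without capitalization the same text is not split at all.
    print(split_sentences("it rained all day. we stayed home."))
```

The second call returns the text as a single segment, which mirrors the situation in a script that has no uppercase letters.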

The lack of capitalization also makes it hard to detect named entities (Darwish,

2013), which is an essential part of Information Retrieval (IR). In English, extracting named

entities such as cities, names of people, addresses and organizations is done with the help of

capitalization and punctuation. For example, to recognize a name like “Barack H. Obama”,

a simple algorithm can search for a capitalized word followed by an initial

with an optional period followed by another capitalized word. We are not claiming that

named entity recognition (NER) in English is straightforward or simple in general, but

since Arabic does not have these features, other methods must be used to address the problem

(Darwish, 2013).
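The simple name pattern described above can be expressed as a regular expression. This is only a sketch of that one heuristic, with a pattern and function name of our own choosing. It is not a general NER method, and it has no counterpart for a script without capital letters.

```python
import re

# Capitalized word, a one-letter initial with an optional period, capitalized word.
NAME_WITH_INITIAL = re.compile(r"\b[A-Z][a-z]+ [A-Z]\.? [A-Z][a-z]+\b")


def find_names(text):
    """Return all substrings that look like 'First M. Last'."""
    return NAME_WITH_INITIAL.findall(text)


if __name__ == "__main__":
    print(find_names("President Barack H. Obama spoke on Tuesday."))
```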

2.2.2 Arabic Morphology

Arabic has a very rich and complex morphology (Attia, 2008). Similar to the other

Semitic languages, morphology in Arabic is of two types, derivational and inflectional.

Derivational morphology is the process of creating new words. This is done by mapping

a root to a pattern. The root holds the meaning while the pattern changes the structure of

the root generating a new word with a different part-of-speech. This type of derivational

morphology is called nonlinear morphology (Bhattacharya et al., 2005). On the other

hand, inflectional morphology is the process of modifying the words with features to create

plural, feminine, or definite forms of the word (Habash, 2010).

A morpheme is “a linguistic form which bears no partial phonetic-semantic resem-

blance to any other form” (Bloomfield, 1933). Morphological processing in NLP is the

process of decomposing a word into morphemes. This is a relatively easy task in

concatenative morphology. However, in languages with nonconcatenative morphology, like

Arabic, it is a much harder task. In Arabic, words are built by merging a consonantal root

and a vocalism (McCarthy, 1981). The root holds a semantic field while the vocalism

specifies the grammatical form. An example showing the nonconcatenative morphology

of Arabic would be the word (���) “katab” which means “to write”. It is composed by

associating the root (���) /k-t-b/ which has the meaning of “writing”.
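The merging of a consonantal root with a vocalism can be illustrated in Latin transliteration. The sketch below is a toy model only. The template notation, in which each "C" marks a slot for a root consonant, and the pattern strings are our own assumptions for illustration and are not taken from the cited works.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot of the pattern with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one 'C' slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


if __name__ == "__main__":
    print(apply_pattern("ktb", "CaCaC"))   # the form "katab" discussed above
    print(apply_pattern("ktb", "CaaCiC"))  # a second, hypothetical pattern
```

The vowels are interleaved with the root consonants rather than attached at either end, which is why prefix and suffix stripping alone cannot recover the root.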

Several approaches have been used to decompose Arabic words. The first approach

recovers the root by extracting all prefixes and suffixes from the word, then, stripping the

rest of the word using a lexicon of roots (Hlal, 1985). This approach is very common;

however, it requires a lexicon of all possible Arabic roots, prefixes, infixes and suffixes

(Beesley, 1996; Shaalan et al., 2006). Buckwalter introduced another approach in his mor-

phological analyzer (BAMA) (Buckwalter, 2004). Rather than recovering the root, BAMA

recovers the stem and considers it the main building block for the Arabic word. The stem

is recovered by just removing the prefixes and the suffixes. Therefore, BAMA decomposes

the Arabic word into three parts: Arabic stems, Arabic prefixes and Arabic suffixes.

The decomposition process searches for the prefixes and the suffixes in the word

that satisfy constraints governing the possibility of combining them with the stem in the

word. BAMA has a bidirectional transliteration schema from Arabic script to Latin script.

This means that developers can work with unstructured Arabic texts without any knowledge

of the Arabic language. For this reason, many recent statistical Arabic NLP (ANLP) systems use BAMA as

the foundation for machine translation and information retrieval. However, BAMA has the

limitation of giving a general analysis that includes all possible cases of the word without

considering the context of the input text. A more refined result can be obtained using a

disambiguation module that considers the context of the input text after eliminating the

incorrect analyses (Habash and Rambow, 2005).
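The prefix, stem and suffix decomposition with compatibility constraints can be sketched as an exhaustive search over split points. The code below is a toy illustration in Latin transliteration and is not BAMA itself. The three small lexicons and the single incompatibility rule are hypothetical entries chosen for the example.

```python
# Hypothetical toy lexicons in Latin transliteration (not BAMA's data).
PREFIXES = {"", "w", "Al"}
STEMS = {"ktb", "ktAb"}
SUFFIXES = {"", "h", "At"}
# Hypothetical constraint: the prefix "Al" cannot combine with the suffix "h".
INCOMPATIBLE = {("Al", "h")}


def segment(word):
    """Return every (prefix, stem, suffix) split allowed by the lexicons."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if (prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES
                    and (prefix, suffix) not in INCOMPATIBLE):
                analyses.append((prefix, stem, suffix))
    return analyses


if __name__ == "__main__":
    print(segment("wktbh"))
    print(segment("AlktAbh"))
```

Like the analyzer described above, this search returns every analysis that the lexicons allow and ignores context. Choosing among several analyses is left to a separate disambiguation step.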

Dialectal Arabic differs from MSA morphologically, lexically and phonologically

(Habash et al., 2013). Furthermore, there are no standard orthographies and no language

academies in dialectal Arabic. Therefore, the tools and resources designed for MSA do

not work with dialectal Arabic. Recently, several research efforts have focused on Arabic

dialectal texts (Habash and Rambow, 2005; Habash et al., 2013; Zaidan and Callison-

Burch, 2014). The state-of-the-art dialectal Arabic morphological analyzer is the Columbia

Arabic Language and dialect Morphological Analyzer (CALIMA) (Habash et al., 2013).

Arabic is an agglutinative language, which means that Arabic words usually include

affixes and clitics that represent different parts-of-speech. Let us take the word (���������)

“katabto ho” which means “I wrote it”. This word is a verb that has the subject and the

object attached to it. The subject is the diacritic on the fourth letter (�) “ta”, while the

object is the suffix (��) “ha”. This is just a simple example whereas words usually have

more complex structures that include other clitics to specify the gender, person, number and

voice. Hence, due to complex phonological rules, the decomposition of words in Arabic is

relatively more difficult.

2.2.3 Arabic Syntax

According to Habash (2010), there are two kinds of sentences in Arabic: sentences

that start with a verb (V-SENT), and sentences that start with a noun (N-SENT). Verb-subject-object

(VSO) is the primary structure of a V-SENT sentence in Classical and Modern

Standard Arabic. However, the object-verb-subject (OVS) and subject-verb-object (SVO) orders are also

commonly used. As we mentioned before, Arabic is a pro-drop language, which means that

subjectless sentences are perfectly grammatical in Arabic. Also, unlike English, Arabic

allows equational sentences such as “He a journalist” without the need for a

“to be” verb. Russian, Hungarian, Hebrew, and the Quechuan languages also allow this type

of sentence.

In Arabic, the structure of constituent questions is usually composed by starting with

a wh-phrase. However, it is grammatically correct if the constituent question does not start

with the wh-phrase. For example, the question (������ ���� ����) literally means “you

eat what yesterday?”. Furthermore, relative clauses in Arabic are connected using relative

pronouns. For example, in the sentence (������� ���� ����� �����) there are two clauses:

(����� �����) which means “I liked the house”, and (������� ����) which means “which

I bought”. The two clauses are connected using the relative pronoun (����) which means

“which”. An Arabic relative pronoun must agree in number and gender with the noun

that it modifies.

2.3 Summary

In this chapter, we presented a short overview of information processing in Arabic.

We summarized the challenges that developers and researchers face when processing Arabic

text due to many of its features. The lack of capitalization, dropped subjects, missing

short vowels and the nonconcatenative morphology are some of these features. In addition, there

are many dialects of Arabic, which are used in informal speaking and writing. These

dialects must be treated differently when processing Arabic texts. Much research has been

conducted to address the challenges of Arabic text processing. Some valuable resources

and techniques have been presented for Arabic. However, more work needs to be done to

give Arabic developers and speakers the support they need.

Chapter 3

LITERATURE REVIEW

In this chapter, we provide a summary of the main existing approaches for creating

lexical resources. We focus on two types of lexical resources: wordnets and bilingual

dictionaries.

3.1 Automatic Construction of Wordnets

A wordnet is a lexical ontology of words. There are two ways to construct wordnets

for languages that do not possess such resources: manual construction and automatic con-

struction. We intend to use automatic construction using core wordnets we have created in

our earlier work (Lam et al., 2014a,b, 2015b) and other existing resources that are freely

available. Other efforts are underway to manually (or mostly manually) create wordnets in

a variety of languages, although progress seems slow all around.

High-quality wordnets have been developed for a small number of languages. Word-

nets, other than the Princeton WordNet (Fellbaum, 1998; Miller, 1995), are typically con-

structed by one of two approaches. The first approach, which is called the expansion ap-

proach, translates the PWN to target languages (Akaraputthiporn et al., 2009; Barbu, 2007;

Bilgin et al., 2004; Kaji and Watanabe, 2006; Lam et al., 2014b; Lindén and Niemi, 2014;

Oliver and Climent, 2012; Sagot and Fišer, 2008; Saveski and Trajkovski, 2010). In con-

trast, the second approach, which is called the merge approach, builds the semantic taxon-

omy of a wordnet in a target language, and then aligns it with the Princeton WordNet by

generating translations (Borin and Forsberg, 2014; Gunawan and Saputra, 2010; Maziarz

et al., 2013; Rigau et al., 1998). To construct the taxonomic relations between words,

first definitions of words are retrieved from machine readable dictionaries. Then, a genus

disambiguation process, which is the process of finding a word with a broad meaning that

more specific words fall under, is performed using the definitions to construct a hierarchical

class of concepts. Next, classes are merged with the synsets in the PWN using a bilingual

dictionary to form the target wordnet.

The expansion approach dominates the merge approach in popularity. Wordnets gen-

erated using the merge approach may have different structures from the Princeton Word-

Net. In contrast, wordnets created using the expansion approach have the same structure

as the Princeton WordNet, which provides for a level of uniformity among them, pos-

sibly at the cost of some natural language-specific expressiveness (Leenoi et al., 2008).

Many approaches to construct wordnets are semi-automatic and can therefore be used

only for languages that already have some lexical resources. Consequently, these methods

cannot be applied to resource-poor languages, which lack such resources. Moreover,

while wordnets are always difficult to evaluate, it is even harder to evaluate

machine-created wordnets in resource-poor languages because these languages do not

have gold standards to compare with, and frequently do not have easily-accessible experts

to evaluate such resources.

Crouch clusters documents first using a complete link clustering algorithm and gener-

ates thesaurus classes or synonym lists based on user-supplied parameters (Crouch, 1990).

Curran and Moens evaluate the performance and efficiency of thesaurus extraction meth-

ods and also propose an approximation method that provides for better time complexity

with little loss in accuracy (Curran and Moens, 2002a,b). Ramirez and Matsumoto develop

a multilingual Japanese-English-Spanish thesaurus using two freely available resources:

Wikipedia and the Princeton WordNet (Ramírez et al., 2013). They extract translation tu-

ples from Wikipedia articles in these languages, disambiguate them by mapping to wordnet

senses, and extract a multilingual thesaurus with a total of 25,375 entries. One thing we

must note about all these approaches is that they are resource-hungry, requiring a large cor-

pus of Wikipedia or non-Wikipedia documents and wordnets. For example, Lin works with

a 64-million word English corpus to produce a high quality thesaurus with about 10,000

entries (Lin, 1998). Ramirez and Matsumoto have the entire Wikipedia at their disposal

with millions of articles in three languages, although for experiments they use only about

13,000 articles in total (Ramírez et al., 2013). Furthermore, Miller and Gurevych (2014)

work with more than 19 thousand Wiktionary senses and 16 thousand Wikipedia

articles to produce a three-way alignment of WordNet, Wiktionary, and Wikipedia. When

we work with low-resource or endangered languages, we do not have the luxury of collect-

ing such big corpora or accessing even a few thousand articles from Wikipedia or the entire

Web. Many such languages have no or very limited Web presence. As a result, we have to

work with whatever limited resources are available.

In this work we propose approaches to generate synonyms, hypernyms, hyponyms

and some other semantic relations. To enhance the quality of wordnets we create, several

approaches are used to measure relatedness between concepts or words. Some potential

approaches for measuring semantic relationships are a dictionary-based approach (Kozima

and Furugori, 1993) and thesaurus-based approach (Hirst and St-Onge, 1998).

Oliver presented approaches for constructing wordnets using the expand model (Vossen, 1998)

and made them available through a Python toolkit (Oliver, 2014). The authors

designed three strategies that use three types of resources to construct wordnets:

dictionaries, semantic networks (Navigli and Ponzetto, 2010) and parallel corpora. While the

construction approaches using dictionaries and semantic networks were direct,

the authors used machine translation and automatic sense-tagging to construct their

wordnets using parallel corpora. The toolkit (http://lpg.uoc.edu/wn-toolkit) provides access

to the three construction methods as well as to some freely available lexical resources.

To test their dictionary-based approach, the authors constructed wordnets for six languages:

Spanish, Catalan, French, Italian, German and Portuguese, with precision between 48.09%

and 84.8%. Using their semantic-network-based approach, the authors constructed wordnets

for the six languages with precision between 49.43% and 94.58%. The parallel-corpus-based

approach achieved precision between 70.26% and 93.81% with machine translation, and

between 75.35% and 82.44% with automatic sense-tagging. The authors stated that their

automatically calculated precision values are very prone to errors.

Another example of constructing wordnets using dictionary-based methods is JAWS

(Mouton and de Chalendar, 2010). JAWS is a French wordnet for nouns constructed by

translating wordnet nouns using a bilingual dictionary and a syntactic language model. The

construction of JAWS starts with copying the structure (the synsets with no words) of the

source wordnet. Then, the phrases that are available in the bilingual dictionary are used to

fill out the initial synsets. Finally, the language model is used to incrementally add new

phrases to JAWS. An improved version of JAWS is called WoNeF (Pradet et al., 2014).

The new improved wordnet includes parts of speech and was evaluated using a gold

standard produced by two annotators. In addition, WoNeF uses a better translation selection

algorithm that uses machine learning to select variable thresholds for translations.

In (Lam et al., 2014b), we presented three methods to construct wordnet synsets

for several resource-rich and resource-poor languages. We used some publicly available

wordnets, a machine translator and a single bilingual dictionary. Our algorithms translate

synsets of existing wordnets to a target language T, then apply a ranking method on the

translation candidates to find best translations in T. The approaches we used are applicable

to any language which has at least one existing bilingual dictionary translating from English

to it.

In the first approach, which we call the direct translation approach (DR), for each

synset in PWN, we directly translate the words from English to the target language. In

the second approach, which we call IW, we extract candidates from several intermediate

wordnets rather than just using PWN to disambiguate the translation. In the third approach,

which we call IWND, we try to reduce the number of bilingual dictionaries we use in the

second approach. When the intermediate wordnet is not PWN, we translate the extracted

words from the wordnets to English, and then we use a single bilingual dictionary to trans-

late the words from English to the target language. In all of these methods, after extracting

the candidates, we use a ranking method to select the best translations and insert them as a

synset in the target wordnet.
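The direct translation idea can be sketched as follows. The text above does not spell out the ranking method, so this example assumes a simple stand-in: a candidate is ranked by how many words of the source synset translate to it. The function name, the `top_k` parameter and the toy dictionary are hypothetical.

```python
from collections import Counter


def translate_synset(synset_words, bilingual_dict, top_k=2):
    """Translate each synset word, then keep the most supported candidates.

    Ranking by vote count is an assumption made for this sketch; ties are
    broken alphabetically so that the output is deterministic.
    """
    votes = Counter()
    for word in synset_words:
        for candidate in set(bilingual_dict.get(word, [])):
            votes[candidate] += 1
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [candidate for candidate, _ in ranked[:top_k]]


if __name__ == "__main__":
    toy_dict = {  # hypothetical English-to-target entries
        "car": ["coche", "carro"],
        "auto": ["coche", "auto"],
        "automobile": ["coche", "automovil"],
    }
    print(translate_synset(["car", "auto", "automobile"], toy_dict, top_k=1))
```

A candidate supported by several members of the synset is less likely to be a translation of an unrelated sense of one ambiguous word. That intuition motivates ranking the candidates before inserting them into the target wordnet.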

Library      URL
CICWN        http://fviveros.gelbukh.com/wordnet.html
extJWNL      http://extjwnl.sourceforge.net/
Javatools    http://www.mpi-inf.mpg.de/yago-naga/javatools/
Jawbone      http://sites.google.com/site/mfwallace/jawbone/
JawJaw       http://www.cs.cmu.edu/~hideki/software/jawjaw/
JAWS         http://lyle.smu.edu/~tspell/jaws/
JWI          http://projects.csail.mit.edu/jwi/
JWNL         http://sourceforge.net/apps/mediawiki/jwordnet/
URCS         http://www.cs.rochester.edu/research/cisd/wordnet/
WNJN         http://wnjn.sourceforge.net/
WNPojo       http://wnpojo.sourceforge.net/
WordnetEJB   http://wnejb.sourceforge.net/

Table 3.1. A list of the Java libraries tested in (Finlayson, 2014).

3.2 Wordnet Management Tools

Maintaining wordnets is an important area of research. The manual construction of

a wordnet is an intensive process that requires a large number of specialists to work for

several years. Furthermore, a wordnet is not static. The meaning of many phrases changes

over time and new phrases appear every year. For example, the country Sudan was

divided into two countries Sudan and South Sudan in 2011. If one searches the PWN 3.1 for

Sudan, only the senses corresponding to the old Sudan show up since the new sense has not

yet been added. Moreover, the representation of wordnets evolves over time. For example,

many old wordnets were upgraded to provide the XML representation. In addition, as this

section shows, many wordnets are built based on the PWN. Every time PWN gets updated,

these wordnets must be updated also to preserve the alignment with PWN. All the previous

issues show the need for wordnet maintenance tools.

One recent work on tools for maintaining wordnets is (Mladenovic et al., 2014). The tools are designed to provide upgrade, cleaning, validation, search, import and export functionalities for the Serbian wordnet (Christodoulakis et al., 2002). Another recent work, (Finlayson, 2014), develops a Java library called JWI for accessing the PWN and compares it with eleven other libraries. The comparison between the libraries is based on five features: special requirements, similarity metrics, ability to edit the wordnet, whether the library needs to work with the Maven project, and forward-compatibility with Java. Table 3.1 shows the tested libraries and Table 3.2 shows a summary of the comparison.

Library      Standalone  Similarity Metrics  Editing  Maven  Minimum Java
CICWN        Yes         No                  No       No     1.6
extJWNL      No          No                  Yes      Yes    1.6
Javatools    Yes         Yes                 No       No     1.6
Jawbone      Yes         Yes                 No       No     1.6
JawJaw       Yes         Yes                 No       No     1.5
JAWS         Yes         No                  No       No     1.4
JWI          Yes         Yes                 No       No     1.5
JWNL         No          Yes                 No       Yes    1.4
URCS         Yes         No                  No       No     1.6
WNJN         No          No                  No       No     1.5
WNPojo       No          No                  No       No     1.6
WordnetEJB   No          No                  No       No     1.6

Table 3.2. A comparison between some of the Java libraries for accessing the PWN.

Another wordnet management tool was also presented recently for the IndoWordNet² (Nagvenkar et al., 2014). The tool, which is called the Concept Space Synset Management Tool³ (CSS), provides an interactive user interface for creating new language synsets and linking them to other Indian language wordnets. The CSS tool uses role-based access control to restrict access to the wordnet. Figure 3.1 shows an overview of the CSS tool.

² http://www.cfilt.iitb.ac.in/indowordnet/
³ http://indradhanush.unigoa.ac.in/conceptspace

Figure 3.1: An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014).

Sense marking is the process of tagging words with senses in a corpus. It is a necessary

task in preparing training data for machine learning techniques. Since sense marking is an

intensive process, sense marking tools are very handy. For example, the Indian Institute

of Technology Bombay has developed a sense marker tool for the IndoWordNet (Prab-

hugaonkar et al., 2014). The sense marking tool shows a highlighted word in a piece of text

and asks the annotator to choose the most appropriate sense from the available senses. The

tool also allows the annotator to add new senses that do not exist in the wordnet.

3.3 Creating Bilingual Dictionaries

Bilingual dictionaries are essential lexical resources which we use in our approaches.

The majority of low-resource languages have bilingual dictionaries that provide phrase

translation between them and resource-rich languages. However, only relatively few bilingual

dictionaries are available for translation between low-resource languages. Several methods

have been presented to automatically construct such dictionaries between low-resource

languages. Since the wordnets we create in this dissertation are aligned with each other, we

believe that they can be good resources for phrase translation between languages. In this

section, we discuss some methods for automatically creating bilingual dictionaries.

Given two input dictionaries L1-Lp and Lp-L2, a naïve method to create a new bilingual

dictionary L1-L2 may use Lp as a pivot using a straightforward transitive approach.

However, if a word has more than one sense, being a polysemous word, this method may

introduce incorrect translations. After computing an initial bilingual dictionary, past re-

searchers have used several approaches to mitigate the effect of ambiguity in word senses.

Methods used for disambiguation use wordnet distance between source and target words in

some way, look at dictionary entries in both forward and backward directions and compute

the amount of overlap to compute disambiguation scores (Ahn and Frampton, 2006; Bond

and Ogura, 2008; Gollins and Sanderson, 2001; Lam and Kalita, 2013; Shaw et al., 2013;

Soderland et al., 2010; Tanaka and Umemura, 1994).
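The naïve transitive approach described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `compose_via_pivot` and the toy Spanish-English-French entries are hypothetical and do not come from the cited works. It shows how a polysemous pivot word ("bank") can introduce a translation that does not match the sense of the source word.

```python
from collections import defaultdict


def compose_via_pivot(src_to_pivot, pivot_to_tgt):
    """Naive transitive composition of two dictionaries: L1 -> Lp -> L2.

    Both arguments map a word to a set of translations. Every target
    translation of every pivot translation is accepted, with no
    disambiguation of senses.
    """
    result = defaultdict(set)
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            result[src_word].update(pivot_to_tgt.get(pivot_word, ()))
    # Drop source words for which no target translation was found.
    return {word: targets for word, targets in result.items() if targets}


# Toy, hypothetical entries: Spanish -> English (pivot) -> French.
es_en = {"banco": {"bank", "bench"}}
en_fr = {"bank": {"banque", "rive"}, "bench": {"banc"}}

es_fr = compose_via_pivot(es_en, en_fr)
for word, targets in es_fr.items():
    print(word, sorted(targets))
```

In this toy data, "rive" (a river bank) is reached only through the ambiguous pivot word "bank", which is the kind of spurious entry that the disambiguation methods cited above try to filter out.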

Researchers have also merged information from several sources such as parallel cor-

pora or comparable corpora (Nerima and Wehrli, 2008; Otero and Campos, 2010) and a

wordnet (István and Shoichi, 2009; Lam and Kalita, 2013) to address the ambiguity problem.

Some researchers extract bilingual dictionaries directly from monolingual corpora,

parallel corpora or comparable corpora using statistical methods (Bouamor et al., 2013;

Brown, 1997; Haghighi et al., 2008; Héja, 2010; Ljubešic and Fišer, 2011; Nakov and Ng,

2009; Yu and Tsujii, 2009).

Obviously, the quality and quantity of existing resources strongly affect the accura-

cies of newly-created dictionaries. For instance, Nerima and Wehrli create new English-

German and English-Italian bilingual dictionaries with 21,600 and 26,834 entries, respec-

tively, from 76,311 entries in an English-French dictionary, 45,492 entries in a German-

French dictionary, and 36,672 entries in a French-Italian dictionary (Nerima and Wehrli,

2008). Given parallel corpora of Lithuanian consisting of 1,765,000 tokens and Hungarian

including 2,121,000 tokens, Héja can extract only 2,616 correct translation candidates with

accuracy over a certain threshold from 4,025 translation candidates (Héja, 2010). Thus,

new bilingual dictionaries created using current approaches have very few entries com-

pared to the size of the input dictionaries. Furthermore, most resource-poor languages do

not have any corpora, or even online documents. Some languages have only one very small

bilingual dictionary, such as the Karbi-English dictionary of 2,341 words.

In (Lam et al., 2015b), we present approaches to automatically build a large num-

ber of new bilingual dictionaries for low-resource languages, especially resource-poor and

endangered languages, using a single input bilingual dictionary. Our algorithms produce

translations of words in a source language to many target languages using publicly avail-

able wordnets and a machine translator (MT). Our approaches may produce any bilingual

dictionary as long as one of the two languages is English or has a wordnet linked to the

PWN. Using our approaches and starting with 5 available bilingual dictionaries, we created

48 new bilingual dictionaries. Of these, 30 pairs of languages are not supported by the

popular MTs: Google⁴ and Bing⁵.

3.4 Summary

In this chapter, we have discussed the existing methods for the automatic construction

of wordnets. We have also discussed several tools and systems for managing wordnets.

Moreover, we covered some of the approaches for automatically creating bilingual dictio-

naries.

⁴ http://translate.google.com/
⁵ http://www.bing.com/translator

Chapter 4

AUTOMATICALLY CONSTRUCTING STRUCTURED

WORDNETS

The core idea behind wordnet is to group words which are synonyms, or roughly syn-

onymous, into lexical categories that are called synsets. Then, semantic relations between

these synsets are established in a hierarchical manner. In this chapter, we present a method

to automatically construct the wordnet semantic relations such as Hypernyms, Hyponyms,

Member Meronyms, Part Meronyms and Part Holonyms using PWN.

4.1 Constructing Core Wordnets

In (Lam et al., 2014b) we introduced an approach, which we refer to as the IWND

approach, that creates wordnet synsets with relatively high coverage. As Figure 4.1 shows,

in IWND, to create wordnet synsets for a target language T we used existing wordnets and

a machine translator (MT) and/or a single bilingual dictionary. First, we extracted every

synset in Princeton WordNet (PWN) using the unique offset-POS key, which refers to the

offset for a synset with a particular part-of-speech (POS). Notice here that each synset

may have one or more words, each of which may be in one or more synsets. Words in a

synset have the same sense. Then, we extracted the corresponding synsets for each offset-

POS from existing wordnets linked to PWN, in several languages. Next, we translated

the extracted synsets in each language to T to produce synset candidates using MT or a

dictionary. Then, we applied a ranking method on these candidates to find the correct

words for a specific offset-POS in T.

Figure 4.1: Creating wordnet synsets using the IWND algorithm (Lam et al., 2014b).

The ranking method we used in (Lam et al., 2014b) is based on the occurrence count of a candidate. Specifically, the rank of a word w, denoted $\mathrm{rank}_w$, is computed as

$$\mathrm{rank}_w = \frac{\mathrm{occur}_w}{\mathrm{numCandidates}} \times \frac{\mathrm{numDstWordNets}}{\mathrm{numWordNets}}$$

where:

- numCandidates is the total number of translation candidates of an offset-POS

- $\mathrm{occur}_w$ is the occurrence count of the word w in the numCandidates

- numWordNets is the number of intermediate wordnets used, and

- numDstWordNets is the number of distinct intermediate wordnets that have words translated to the word w in the target language.
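As a rough illustration of the ranking formula, the following sketch computes the rank of each candidate for a single offset-POS. The function name, the input layout (one list of candidates per intermediate wordnet) and the placeholder candidate strings are assumptions made for this example; in particular, it assumes that numCandidates counts repeated candidates.

```python
from collections import Counter


def rank_candidates(candidates_by_wordnet, num_wordnets):
    """Rank translation candidates for one offset-POS.

    candidates_by_wordnet maps an intermediate wordnet name to the list of
    target-language candidates obtained through that wordnet.
    """
    all_candidates = [
        word for words in candidates_by_wordnet.values() for word in words
    ]
    num_candidates = len(all_candidates)  # assumption: repeats are counted
    occurrences = Counter(all_candidates)

    ranks = {}
    for word, occur in occurrences.items():
        # Number of distinct intermediate wordnets that produced this word.
        num_dst_wordnets = sum(
            1 for words in candidates_by_wordnet.values() if word in words
        )
        ranks[word] = (occur / num_candidates) * (num_dst_wordnets / num_wordnets)
    return ranks


# Hypothetical candidates produced through three intermediate wordnets.
candidates = {
    "wordnet_a": ["cand_x", "cand_y"],
    "wordnet_b": ["cand_x"],
    "wordnet_c": ["cand_x", "cand_z"],
}
ranks = rank_candidates(candidates, num_wordnets=3)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

A candidate proposed by every intermediate wordnet, such as `cand_x` here, receives a much higher rank than one proposed by a single wordnet.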

4.2 Constructing Wordnet Semantic Relations

Synsets in wordnet are linked in a hierarchical fashion. The hierarchy in wordnet is established using the super-subordinate relation between synsets. For example, nouns are linked using hyperonymy, which is a relation between general synsets and more specific ones. An example of a hyperonymy relation is the relation between the synsets {food, solid_food} and {baked_goods}. The hyperonymy relation is transitive; for example, the synset {bread}, which is a hyponym of the synset {baked_goods}, is also a hyponym of the synset {food, solid_food}. Table 4.1 shows the semantic relations available in wordnet (Wikipedia, 2015).

In (Lam et al., 2014b), we constructed core wordnets, which essentially means that we created synsets with no connections between them. As Figure 4.2 shows, our goal is to recover the taxonomy of synsets. To establish the semantic relations between the synsets we created in (Lam et al., 2014b), we rely on Princeton WordNet (Fellbaum, 2005) as an intermediate resource.

Phrase Type  Relation          Definition
Nouns        Hypernyms         Y is a hypernym of X if every X is a (kind of) Y
Nouns        Hyponyms          Y is a hyponym of X if every Y is a (kind of) X
Nouns        Coordinate terms  Y is a coordinate term of X if X and Y share a hypernym
Nouns        Meronyms          Y is a meronym of X if Y is a part of X
Nouns        Holonyms          Y is a holonym of X if X is a part of Y
Verbs        Hypernyms         The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y
Verbs        Troponyms         The verb Y is a troponym of the verb X if the activity Y is doing X in some manner
Verbs        Entailments       The verb Y is entailed by X if by doing X you must be doing Y
Verbs        Coordinate terms  Those verbs sharing a common hypernym

Table 4.1. Wordnet semantic relations.

Figure 4.2: Core wordnet mapping to structured wordnet.

As Figure 4.3 shows, to construct the links between synsets in our wordnet for language T, we extract each $\mathit{synset}_i$ from $\mathit{wordnet}_T$ and map it to $\mathit{synset}_j$, which is the corresponding synset in the Princeton WordNet. Then, for each $\mathit{synset}_j$ in Princeton WordNet, we extract each semantic relation $r_j$ and the linked $\mathit{synset}_k$. Next, we check the availability of $\mathit{synset}_k$ in $\mathit{wordnet}_T$. Finally, if $\mathit{synset}_k$ is available in $\mathit{wordnet}_T$, we add a relation between $\mathit{synset}_i$ and $\mathit{synset}_k$ to $\mathit{wordnet}_T$.

Figure 4.3: Creating wordnet semantic relations using intermediate wordnet.
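The relation-recovery procedure can be sketched in a few lines. This is a simplified illustration, not the implementation used in the dissertation: it assumes that target synsets and PWN synsets share the same offset-POS key, and the keys, relation triples and placeholder target words below are hypothetical.

```python
def project_relations(target_synsets, pwn_relations):
    """Copy PWN relations whose two endpoints both exist in the target wordnet.

    target_synsets maps an offset-POS key to the set of target-language words.
    pwn_relations is an iterable of (source_key, relation, linked_key) triples.
    """
    projected = []
    for source_key, relation, linked_key in pwn_relations:
        # The relation is kept only if both synsets were created for T.
        if source_key in target_synsets and linked_key in target_synsets:
            projected.append((source_key, relation, linked_key))
    return projected


# Hypothetical keys standing in for PWN offset-POS identifiers.
pwn_relations = [
    ("key-bread", "hypernym", "key-baked_goods"),
    ("key-baked_goods", "hypernym", "key-food"),
]

complete = {
    "key-food": {"t_food"},
    "key-baked_goods": {"t_baked_goods"},
    "key-bread": {"t_bread"},
}
# Same wordnet, but the middle synset was never created.
partial = {k: v for k, v in complete.items() if k != "key-baked_goods"}

print(len(project_relations(complete, pwn_relations)))
print(len(project_relations(partial, pwn_relations)))
```

With the middle synset missing, neither relation can be copied, so the link between the bread and food synsets is lost even though both exist in the target wordnet.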

We must notice here that although we used some disambiguation methods when we

created the core wordnets, there still are words that are misplaced. This will cause some

false classification of synset relations. Another challenge is that translation leads to loss

of some information. For example, it is very important to distinguish between classes and

instances in wordnets (Miller and Hristea, 2006). There is no guarantee that an instance

will not be translated into the target language as a class and vice versa. Furthermore, as

Figure 4.4 shows, since the core wordnets are automatically created, there will be some

missing synsets that might not be available in the target languages. This will lead to

fragments in the recovered links. All of these issues need to be observed and dealt with to

obtain acceptable accuracy.

Figure 4.4: The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.

4.3 Experiment and Evaluation

In this section, we generate the semantic relations between synsets in three wordnets: Arabic, Assamese and Vietnamese. We start by creating the core wordnets using the algorithms we described in Section 4.1. Table 4.2 shows the result of creating the core wordnets for the three languages. Next, we apply our method, which is presented in Section 4.2, to link the synsets. The algorithm was able to recover a total of 206,766 relations between the Arabic synsets, 139,502 relations between the Assamese synsets and 146,172 relations between the Vietnamese synsets. As Figure 4.5 shows, most of the recovered relations are hyponym and hypernym relations.

Language     Synsets   Coverage  Precision (/4.00)
Arabic       93,383    59.95%    3.82
Assamese     107,616   36.95%    3.78
Vietnamese   55,451    36.20%    3.75

Table 4.2. Size, coverage and precision of the core wordnet we create for Arabic, Assamese and Vietnamese.

Relation        Precision
SimilarTo       75.62%
Hypernym        70.41%
Hyponym         71.23%
MemberMeronym   77.54%
PartHolonym     84.29%
Average         75.82%

Table 4.3. Precision of the semantic relations established for our Arabic wordnet.

To evaluate our algorithm, we evaluated the relations recovered for the Arabic wordnet. We asked three Arabic speakers to evaluate a sample of 500 relations. The sample consists of the following relations: 100 “hypernym” relations, 100 “hyponym” relations, 100 “similar to” relations, 100 “MemberMeronym” relations and 100 “PartHolonym” relations. The evaluation was done using True/False questions, where True gives a score of 1 and False gives a score of 0 to the relation.

As Table 4.3 shows, the precision of the algorithm ranged from 70.41%, for the “hypernym” relation, to 84.29%, for the “PartHolonym” relation. The average precision score was 75.82%.

Figure 4.5: Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.

4.4 Summary

In this chapter, we presented an approach that automatically constructs semantic relations

between synsets in a wordnet. The approach depends on the PWN to establish the

links between the synsets. We conducted an experiment to evaluate our algorithm. Our

approach produces semantic relations between the Arabic synsets with 75.82% precision.

Chapter 5

ENHANCING AUTOMATIC WORDNET CONSTRUCTION USING

WORD EMBEDDINGS

In the previous chapters we have shown that a wordnet for a new language, possibly

resource-poor, can be constructed automatically by translating wordnets of resource-rich

languages. The quality of these constructed wordnets is affected by the quality of the

resources used such as dictionaries and translation methods in the construction process.

Recent work shows that vector representation of words (word embeddings) can be used to

discover related words in text. In this chapter, we propose a method that performs such

similarity computation using word embeddings to improve the quality of automatically

constructed wordnets.

5.1 Introduction

It is well known that one way to find semantically related words is to use context as a

lead (Firth, 1957; Harris, 1954). Words that share the same neighbors are usually somehow

related to each other. For example, consider the two sentences:

“He rides his bike to the park every day” and

“He rides his bicycle to the park every day”.

One can conclude that the words “bike” and “bicycle” are similar or semantically related since they appear in similar contexts. This observation led researchers to what are called distributional methods, which are widely used today. In these methods, also known as vector semantics and word embeddings, co-occurrences of the words in a corpus are represented as vectors in a multidimensional space forming a word-word matrix (Jurafsky and Martin, 2009).

Since corpora consist of a large number of distinct words, these vectors are usually long and sparse. The sparseness of the vectors is caused by the fact that a word often co-occurs with a limited number of other words in a given corpus. For these reasons, special algorithms are used to process and store these sparse vectors. Also, the co-occurrence of a word is usually limited to a specific window of words before and after the word. According to (Jurafsky and Martin, 2009), there are two types of co-occurrence: first-order co-occurrence and second-order co-occurrence. The first type describes words that appear next to each other, while in the second type, the words share similar surrounding words.

In order to reduce the effect of stop words, i.e. words that co-occur with most of the words, the pointwise mutual information measure (PMI) (Fano and Hawkins, 1961) is usually used rather than the raw co-occurrence counts. This measure compares the probability of two words co-occurring with the probability expected if the two words occurred independently of each other. Usually, the PMI between two words $w_1$ and $w_2$ is

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (5.1)$$

where:

$P(w_1)$ is the probability of word $w_1$

$P(w_2)$ is the probability of word $w_2$

$P(w_1, w_2)$ is the probability of $w_1$ in the context of $w_2$
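Equation 5.1 can be illustrated with a small sketch that estimates the probabilities from counts. The function name and the counts are hypothetical, and using a single total for both word counts and pair counts is a simplifying assumption made only for this example.

```python
import math


def pmi(pair_count, count_w1, count_w2, total):
    """Pointwise mutual information (base 2) estimated from raw counts.

    Assumption: word and pair probabilities are estimated against the same
    total, which is a simplification for illustration.
    """
    p_joint = pair_count / total
    p_w1 = count_w1 / total
    p_w2 = count_w2 / total
    return math.log2(p_joint / (p_w1 * p_w2))


# Hypothetical counts: the pair occurs four times more often than expected
# under independence, so the PMI is close to log2(4) = 2.
associated = pmi(pair_count=20, count_w1=100, count_w2=50, total=1000)

# Here the pair occurs about as often as chance predicts, so PMI is close to 0.
independent = pmi(pair_count=5, count_w1=100, count_w2=50, total=1000)

print(round(associated, 3), round(independent, 3))
```

A frequent stop word has a large individual probability, which enlarges the denominator and keeps its PMI with most words low even when the raw co-occurrence count is high.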

5.2 Similarity Metrics

There are many ways to compute similarity between vectors (Jurafsky and Martin, 2009). Next, we list three of the common metrics used to measure similarity or relatedness between two vectors $\vec{A}$ and $\vec{B}$ of size N.

• Cosine Similarity: the most common measure used in natural language processing. It produces similarity values from 0 to 1 when using raw co-occurrence counts or positive PMI values, which are non-negative. Words with a cosine similarity value near 1 are supposedly very similar and words with a cosine similarity value near 0 are supposedly unrelated. Cosine similarity is usually computed using the following formula:

$$\mathrm{cosine}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}} \qquad (5.2)$$

• Jaccard Measure: introduced by (Jaccard, 1912) and adapted by (Grefenstette, 2012) to be used with vectors. The Jaccard similarity is computed using the following formula:

$$\mathrm{Jaccard}_{sim}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)} \qquad (5.3)$$

• Dice Measure: originally used with binary vectors and adapted by (Curran, 2004) to be applied to semantic similarity. The Dice similarity measure can be computed using the following equation:

$$\mathrm{Dice}_{sim}(\vec{A}, \vec{B}) = \frac{2 \sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} (A_i + B_i)} \qquad (5.4)$$
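The three measures listed above can be written compactly for non-negative vectors of equal length. The sketch below is illustrative only; the function names and the example vectors are assumptions, and the Jaccard denominator is taken as the sum of element-wise maxima, which is the usual weighted form of the measure.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Weighted Jaccard: sum of minima over sum of maxima."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator


def dice(a, b):
    """Weighted Dice: twice the sum of minima over the sum of all entries."""
    numerator = 2 * sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator


# Hypothetical co-occurrence vectors for two words.
vec_a = [1, 2, 0]
vec_b = [2, 1, 1]

print(round(cosine(vec_a, vec_b), 4))
print(round(jaccard(vec_a, vec_b), 4))
print(round(dice(vec_a, vec_b), 4))
```

All three functions return 1 for identical non-zero vectors; none of them guards against all-zero vectors, which would need separate handling in practice.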

5.3 Generating Word Embeddings

In order to validate the synsets we create using translation and obtain relations between them, we use the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm uses a feedforward neural network to predict the vector representation of words within a multi-dimensional language model. Word2vec has two variations: Skip-Gram (SG) and Continuous Bag-Of-Words (CBOW). In the SG version, the neural network predicts words adjacent to a given word on either side, while in the CBOW model the network predicts the word in the middle of a given sequence of words. In the work presented in this section, we generate representations of words using both models with several different vector and window sizes to obtain the settings for the highest precision. The purpose of the steps discussed next is to improve the quality of synsets produced by the translation process in addition to generating relations among the synsets.
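To make the difference between the two variations concrete, the following sketch builds the prediction examples each model would see for one sentence and a given window size. It only illustrates how the inputs and outputs are paired; it is not the word2vec training procedure itself, and the function names are assumptions.

```python
def skipgram_pairs(tokens, window):
    """SG: each word is used to predict every neighbor within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: the surrounding words jointly predict the word in the middle."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


tokens = "he rides his bike".split()
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

for center, neighbor in sg:
    print("SG  ", center, "->", neighbor)
for context, target in cbow:
    print("CBOW", context, "->", target)
```

A larger window adds more distant neighbors to each example, which is why the window size is one of the settings varied in the experiments.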

5.4 Removing Irrelevant Words in Synsets

We compute the cosine similarity between word vectors within each single synset in TWN, the wordnet being constructed in language T, to filter false word members within synsets. To filter the initially constructed synsets in TWN, we pick a threshold value $\alpha$ such that the selected words have cosine similarity larger than $\alpha$ with each other. Next, we describe the filtering process we propose.

1. Let

$$\mathit{synset}_i^c = \{\mathit{word}_1, \mathit{word}_2, \mathit{word}_3, \mathit{word}_4\} \qquad (5.5)$$

be a candidate synset to be potentially included in TWN.

2. We compute the cosine similarity between all the possible pairs of words in $\mathit{synset}_i^c$.

3. We extract the pair of words with the highest cosine similarity.

4. If this pair of words has cosine similarity larger than $\alpha$, the pair is kept in the final synset $\mathit{synset}_i$; otherwise, $\mathit{synset}_i^c$ itself is discarded. This may have been a low quality candidate synset generated in the translation process.

5. Next, among the remaining words in $\mathit{synset}_i^c$, a word is kept if it has a connection with any word in $\mathit{synset}_i$ with similarity higher than $\alpha$.

For example, let us assume that the cosine similarities between the words in $\mathit{synset}_i^c$ are as shown in Table 5.1 and $\alpha = 0.70$. First, the pair with the highest cosine similarity, $(\mathit{word}_1, \mathit{word}_2)$, is kept in the final $\mathit{synset}_i$ since its cosine similarity is larger than $\alpha$. Then, $\mathit{word}_3$ is discarded since it does not have a cosine similarity larger than $\alpha$ with any of the words in the current final $\mathit{synset}_i$. Finally, $\mathit{word}_4$ is kept in $\mathit{synset}_i$ since it has a cosine similarity with $\mathit{word}_1$ that satisfies the threshold $\alpha$.

Pair             Cosine Similarity
(word1, word2)   0.91
(word1, word3)   0.22
(word1, word4)   0.82
(word2, word3)   0.34
(word2, word4)   0.72
(word3, word4)   0.12

Table 5.1. An example of cosine similarity between words in a candidate synset
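The filtering steps of this section can be sketched as follows, using the similarities of Table 5.1. The function name and the dictionary-of-pairs input are assumptions made for the example. The sketch also assumes that the remaining words are examined once, in their original order, against the words kept so far, and that a candidate synset with fewer than two words is returned unchanged; neither detail is stated in the text.

```python
from itertools import combinations


def filter_synset(words, similarity, threshold):
    """Filter a candidate synset using pairwise cosine similarities.

    similarity maps frozenset({w1, w2}) to the cosine similarity of the pair.
    Returns the set of retained words, or an empty set if the candidate
    synset is discarded.
    """
    words = list(words)
    if len(words) < 2:
        return set(words)  # assumption: nothing to compare, keep as is

    # Steps 2-3: find the pair with the highest cosine similarity.
    best_pair = max(combinations(words, 2), key=lambda p: similarity[frozenset(p)])

    # Step 4: discard the whole candidate if even the best pair is too weak.
    if similarity[frozenset(best_pair)] <= threshold:
        return set()

    # Step 5: keep a remaining word if it is close enough to any kept word.
    kept = list(best_pair)
    for word in words:
        if word in kept:
            continue
        if any(similarity[frozenset((word, k))] > threshold for k in kept):
            kept.append(word)
    return set(kept)


# Cosine similarities from Table 5.1.
table_5_1 = {
    frozenset(("word1", "word2")): 0.91,
    frozenset(("word1", "word3")): 0.22,
    frozenset(("word1", "word4")): 0.82,
    frozenset(("word2", "word3")): 0.34,
    frozenset(("word2", "word4")): 0.72,
    frozenset(("word3", "word4")): 0.12,
}
candidate = ["word1", "word2", "word3", "word4"]

final_synset = filter_synset(candidate, table_5_1, threshold=0.70)
print(sorted(final_synset))  # ['word1', 'word2', 'word4']
```

With a threshold above 0.91, no pair qualifies and the whole candidate synset is discarded.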

5.5 Validating Candidate Relations

Similarly, we compute the cosine similarity between words within pairs of semantically related synsets. This allows us to verify the constructed relations between synsets in TWN. For example, let

$\mathit{synset}_i = \{\mathit{word}_{i1}, \mathit{word}_{i2}, \mathit{word}_{i3}, \mathit{word}_{i4}\}$, and

$\mathit{synset}_j = \{\mathit{word}_{j1}, \mathit{word}_{j2}, \mathit{word}_{j3}, \mathit{word}_{j4}\}$

be synsets in TWN, and let $\rho_{ij}$ be a candidate semantic relation between $\mathit{synset}_i$ and $\mathit{synset}_j$. We compute the cosine similarity between all the possible pairs of words from $\mathit{synset}_i$ and $\mathit{synset}_j$ and take the maximum similarity obtained. Then, if this value is larger than a threshold $\alpha_\rho$, we retain the relation $\rho_{ij}$; otherwise, we discard it. Pseudocode of the validation algorithm is shown in Algorithm 1.

Algorithm 1: Validating Semantic Relation
    Data: synset_i, synset_j, relation ρ_ij, threshold α_ρ
    Result: retain or discard the relation ρ_ij
    initialization;
    Similarity_max ← 0;
    foreach word_i in synset_i do
        foreach word_j in synset_j do
            sim ← ComputeCosineSimilarity(word_i, word_j);
            if sim > Similarity_max then
                Similarity_max = sim;
            end
        end
    end
    if Similarity_max < α_ρ then
        Discard(ρ_ij);
    end
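Algorithm 1 translates almost line by line into Python. The sketch below is a hedged illustration: the function name, the way the similarity function is passed in, and the toy similarity values are assumptions, and the similarity function stands in for a lookup in the trained word embeddings.

```python
def validate_relation(synset_i, synset_j, cosine_similarity, threshold):
    """Return True to retain the candidate relation, False to discard it.

    Follows Algorithm 1: the relation is discarded when the maximum
    word-to-word similarity between the two synsets is below the threshold.
    """
    similarity_max = 0.0
    for word_i in synset_i:
        for word_j in synset_j:
            sim = cosine_similarity(word_i, word_j)
            if sim > similarity_max:
                similarity_max = sim
    return similarity_max >= threshold


# Hypothetical similarities standing in for embedding lookups.
toy_similarity = {
    ("bread", "food"): 0.40,
    ("bread", "solid_food"): 0.10,
    ("loaf", "food"): 0.20,
    ("loaf", "solid_food"): 0.05,
}


def lookup(word_a, word_b):
    return toy_similarity[(word_a, word_b)]


retained = validate_relation(["bread", "loaf"], ["food", "solid_food"], lookup, 0.30)
discarded = validate_relation(["bread", "loaf"], ["food", "solid_food"], lookup, 0.50)
print(retained, discarded)
```

Because only the maximum is used, a single strongly related word pair is enough to retain the relation between two synsets.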

5.6 Selecting Thresholds

To pick the synset similarity threshold value $\alpha$ and the threshold $\alpha_\rho$ for each semantic

relation we create, we compute the cosine similarity between pairs of synonym words,

semantically related words, and non-related words obtained from existing wordnets. Then,

based on the previous data, we select the threshold values that are associated with higher

precision and maximum coverage.

5.7 Experiments

In this section, we discuss the enhancement of the Arabic, Assamese and Vietnamese

wordnets we create, using the method described in the previous sections.

5.7.1 Generating Vector Representations of Wordnets Words

For generating vector representations of the Arabic words, we use the following freely

available corpora:

• Watan-2004 corpus (12 million words) (Abbas et al., 2011),

• Khaleej-2004 corpus (3 million words) (Abbas and Smaili, 2005), and

• 21 million words of Wikipedia¹ Arabic articles.

We process and combine the three corpora into a single plain text file.

For both Assamese and Vietnamese, we used Wikipedia articles to generate the vector representations of words. The Assamese Wikipedia articles we used contain 1.4 million words, while the Vietnamese articles contain 80 million words.

Figure 5.1: A histogram of synonyms, semantically related words, and non-related words extracted from AWN.

In order to compute the synset similarity threshold value $\alpha$ and the threshold $\alpha_\rho$ for each semantic relation, we use the freely available Arabic wordnet (AWN) (Rodríguez et al., 2008). AWN was manually constructed in 2006 and has been semi-automatically enhanced and extended several times. We start by extracting synonym words, semantically related words, and non-related words from AWN. The Python program that we wrote to compute the cosine similarity between the words is listed in Appendix A.1. Then, we use the histogram representation of the cosine similarity of the previous sets of words to set the thresholds. As Figure 5.1 shows, more than 67% of the non-related words have cosine similarity less than 0.1, while about 23% of the synonym words in AWN have a cosine similarity less than 0.1. Furthermore, about 34% of the semantically related words in AWN have cosine similarity less than 0.1. Table 5.2 shows the weighted average cosine similarity between synonyms, hypernyms, topic-domain related words, part-holonyms, instance-hypernyms, and member-meronyms in AWN, where the frequency of the similarity value is the weight.

¹ https://ar.wikipedia.org

Relation            Weighted Average Similarity
Synonyms            0.28
Hypernyms           0.22
TopicDomains        0.23
PartHolonyms        0.28
InstanceHypernyms   0.08
MemberMeronyms      0.29

Table 5.2. The weighted average similarity between related words in AWN.
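The two quantities read off the histogram, the share of pairs below a similarity value and the frequency-weighted average similarity, can be sketched as follows. The function names and the histogram values are hypothetical; this is not the program listed in Appendix A.1.

```python
def weighted_average(histogram):
    """Average similarity where each value is weighted by its frequency.

    histogram is a list of (similarity_value, frequency) pairs.
    """
    total = sum(freq for _, freq in histogram)
    return sum(value * freq for value, freq in histogram) / total


def fraction_below(histogram, threshold):
    """Share of pairs whose similarity value is below the threshold."""
    total = sum(freq for _, freq in histogram)
    below = sum(freq for value, freq in histogram if value < threshold)
    return below / total


# Hypothetical histogram: (bin similarity value, number of word pairs).
histogram = [(0.05, 2), (0.25, 5), (0.55, 3)]

print(round(weighted_average(histogram), 3))
print(round(fraction_below(histogram, 0.1), 3))
```

Comparing `fraction_below` for synonym pairs and for non-related pairs at the same value gives a rough sense of how many correct and incorrect pairs a candidate threshold would remove.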

5.7.2 Producing Word Embeddings for Arabic

In this part of the experiment, we use the word2vec algorithm to produce vector representations of Arabic words. We test the word2vec algorithm with different window sizes to select the window size that produces the highest similarity. We generate word embeddings using the CBOW version with window sizes 3, 5 and 8. Next, we compute the weighted averages of the cosine similarity between the synonyms in AWN. The highest weighted average we obtained was 0.288 with window size 3, while the weighted averages obtained with window sizes 5 and 8 were 0.283 and 0.277, respectively. Then, we compare SG and CBOW with different vector sizes. Table 5.3 shows the weighted average cosine similarity obtained between 16,000 pairs of synonyms in AWN using both variations of word2vec, with window size=3 and vector size set to 100, 200, and 500. We notice that both versions produce almost identical results, with a slight advantage to SG at vector size 100 at the cost of more execution time. However, for the corpus we use, a smaller vector size produces a higher weighted average similarity.

Algorithm  Vector Size  Similarity Average
SG         100          0.289
SG         200          0.258
SG         500          0.194
CBOW       100          0.288
CBOW       200          0.259
CBOW       500          0.195

Table 5.3. Comparison between the weighted similarity average obtained using different word2vec settings.

Threshold  AWN    Our Arabic WordNet
0.000      5,941  17,349
0.100      3,433  2,073
0.288      2,471  943
0.500      1,190  271
0.750      209    13

Table 5.4. Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.

5.8 Evaluation & Discussion

We compute the cosine similarity between semantically related words extracted from the initial Arabic, Assamese and Vietnamese wordnets produced in the previous chapter. The language model used to calculate the cosine similarity is created using CBOW with vector size=100 and window size=3. Table 5.4 shows a comparison between the number of Arabic synsets we create and the number of synsets in AWN.

We notice that the translation method we use produces a high number of synsets compared to the manually constructed AWN. However, the number of synsets sharply decreases after filtering the initial synonyms using the method described in Section 5.3. Although our Arabic wordnet is automatically created, the number of synsets we create is 60% of the number of synsets in the manually created AWN when filtering the synsets with the threshold set to 0.1.

                   Threshold Range
                   0 - 0.1    0.1 - 0.288    0.288 - 1
Synonyms           34.8%      56.8%          78.4%
Hypernyms          45.2%      57.2%          84.4%
PartHolonym        50.8%      75.2%          90.4%
Member-Meronym     40.8%      56.8%          79.6%
Overall            42.9%      61.5%          83.2%

Table 5.5. Precision of the Arabic wordnet we create.

                   Threshold Range
                   0 - 0.1    0.1 - 0.288    0.288 - 1
Synonyms           52.0%      57.6%          88.0%
Hypernyms          37.6%      49.6%          76.0%
PartHolonym        51.2%      46.4%          82.4%
Member-Meronym     62.4%      67.2%          81.6%
Overall            50.8%      55.2%          82.0%

Table 5.6. Precision of the Assamese wordnet we create.

We evaluate precision by comparing 600 pairs of synonyms, hypernyms, part-holonyms,

and member-meronyms with three ranges of cosine similarity values: 0 to 0.1, 0.1 to 0.288,

and 0.288 to 1. We asked 3 Arabic speakers to evaluate the pairs using a 0 to 5 scale where 0

represents the minimum score and 5 represents the maximum score. We compute precision

by taking the average score and converting it to a percentage. See Table 5.5.
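The conversion from evaluator scores to a precision percentage can be sketched as follows. This is a minimal illustration that assumes the percentage is the mean score divided by the maximum score of 5; the function name and the sample ratings are hypothetical.

```python
def precision_from_scores(scores, max_score=5):
    """Convert ratings on a 0..max_score scale into a percentage.

    Assumption: precision = mean(score) / max_score * 100.
    """
    if not scores:
        raise ValueError("no scores given")
    average = sum(scores) / len(scores)
    return 100.0 * average / max_score


# Made-up ratings from hypothetical evaluators for a handful of word pairs.
ratings = [5, 4, 3, 4, 5, 3]
print(f"{precision_from_scores(ratings):.1f}%")  # prints 80.0%
```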

                   Threshold Range
                   0 - 0.1    0.1 - 0.288    0.288 - 1
Synonyms           31.2%      40.2%          57.6%
Hypernyms          31.8%      39.0%          69.4%
PartHolonym        32.2%      42.8%          75.0%
Member-Meronym     22.0%      24.0%          73.8%
Overall            29.3%      36.5%          68.95%

Table 5.7. Precision of the Vietnamese wordnet we create.

Table 5.8. Examples of related words and their cosine similarity from our Arabic wordnet.

The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms we produce is 78.4%, 84.4%, 90.4%, and 79.6% respectively, with the threshold set to 0.288. This is higher than the precision obtained by Lam et al. (2014b), who produce synonyms with 76.4% precision when using only PWN. The precision of the Assamese and Vietnamese wordnets is shown in Tables 5.6 and 5.7. As shown in Tables 5.8, 5.9, and 5.10, our results suggest that producing synsets with lower precision reduces the quality of the other semantic relations built on them. Our results show that pairs with higher cosine similarity are more likely to be semantically related. This supports the benefit of combining the translation method with word embeddings in the process of automatically generating new wordnets.
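To make the filtering step concrete, here is a minimal sketch of keeping only candidate pairs whose cosine similarity reaches a threshold. The helper names, the toy two-dimensional vectors, and the choice to keep pairs at or above the threshold and to skip words without an embedding are assumptions for illustration, not the implementation used in the experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, vectors, threshold):
    """Keep candidate word pairs whose cosine similarity is >= threshold."""
    kept = []
    for left, right in pairs:
        if left not in vectors or right not in vectors:
            continue  # assumption: pairs without embeddings are dropped
        if cosine_similarity(vectors[left], vectors[right]) >= threshold:
            kept.append((left, right))
    return kept


# Toy embeddings, not real word2vec output.
toy_vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "banana": [0.0, 1.0]}
candidates = [("car", "automobile"), ("car", "banana")]
print(filter_pairs(candidates, toy_vectors, threshold=0.288))
```

Raising the threshold trades coverage for precision, which matches the pattern in Tables 5.4 to 5.7.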

5.9 Summary

In this chapter, we discuss an approach for enhancing the automatically generated wordnets we create for low-resource languages. Our approach takes advantage of word embeddings to enhance the translation method for automatic wordnet creation. We present an application of our approach to producing a new Arabic wordnet. Our method automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of words with up to 90.4% precision.

Table 5.9. Examples of related words and their cosine similarity from our Assamese wordnet.

Table 5.10. Examples of related words and their cosine similarity from our Vietnamese wordnet.


Chapter 6

SELECTING GLOSSES FOR WORDNET SYNSETS USING WORD EMBEDDINGS

Word embedding is a way to represent words as vectors in a multi-dimensional space such that related words are represented as vectors with similar directions. It has been shown that this model can be used to discover relations between words effectively. In this chapter, we introduce a method to represent wordnet synsets in a similar way. A wordnet synset is a group of synonymous words grouped together because they all represent the same concept. Our proposed method can be used in several NLP applications such as word-sense disambiguation and automatic wordnet construction. To test our method, we use it in the task of selecting glosses for wordnet synsets of several languages.

6.1 Creating Language Model Using Word Embedding

We start by creating word embeddings using a corpus and the word2vec software (Mikolov et al., 2013). word2vec is a two-layer feedforward neural-network learning model that produces multi-dimensional vector representations of words. There are two implementations of this learning model: the Skip-Gram (SG) implementation and the Continuous Bag-Of-Words (CBOW) implementation. In the SG implementation, the model learns to predict the words around a given word, while in the CBOW implementation the model learns to predict a word from the sequence of words surrounding it.
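The difference between the two implementations is easiest to see in the training examples each one derives from a sentence. The sketch below only builds those examples; it does not train a model, and the function names and the toy sentence are assumptions for illustration.

```python
def context_window(tokens, index, window):
    """Words within `window` positions to the left and right of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def skipgram_examples(tokens, window):
    """SG: each example is (input word, one context word to predict)."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


def cbow_examples(tokens, window):
    """CBOW: each example is (input context words, word to predict)."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


sentence = ["the", "cat", "sat"]
print(skipgram_examples(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_examples(sentence, window=1))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```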

6.2 Generating Vector Representation of Wordnet Synsets

In this section, we present our method to produce vector representations of wordnet synsets. We build our method on the vectors of the synonym words produced by the word embedding method. We believe that combining the vectors of synonym words into one vector can produce a way to represent meaning. Next, we describe our proposed method to build the vector representation of synsets, which we call synset2vec.

Let $synset_i = \{word_1, word_2, \ldots, word_j\}$ be a synset in $wordnet_x$, let $\{n_1, n_2, \ldots, n_j\}$ be the number of synsets for each word in $synset_i$, and let $\{\vec{V}_1, \vec{V}_2, \ldots, \vec{V}_j\}$ be the corresponding vectors for $\{word_1, word_2, \ldots, word_j\}$ in the word embedding model.

We identify two cases:

1. The first case is when a word that does not have any synonyms represents several synsets, i.e. has more than one meaning. In this case, the vector produced by the word embedding actually represents the combined meanings of the word. For example, in PWN, the word “abduction” is the only word in both synset 00775460-n, “the criminal act of capturing and carrying away by force a family member”, and synset 00333037-n, “moving of a body part away from the central axis of the body”. Hence, the vector for “abduction” actually represents both meanings.

2. The second case is when a word that has one or more synonyms has one or more meanings. In this case, the synonyms might or might not have other meanings as well. For example, the noun “spill” has four meanings in PWN and six synonyms. Table 6.1 shows all the meanings of the noun “spill” and all its synonyms in PWN.

Synset Key    Gloss                                                                            Synonyms
00076884-n    a sudden drop from an upright position                                           {spill, tumble, fall}
00329619-n    the act of allowing a fluid to escape                                            {spill, spillage, release}
04277034-n    a channel that carries excess water over or around a dam or other obstruction    {spill, spillway, wasteweir}
15049594-n    liquid that is spilled                                                           {spill}

Table 6.1. Meanings of the noun “spill” and its synonyms.

To generate a combined vector for a synset, we need a way to limit the effect of the other meanings that the synonyms might hold. We start with the second case, where the synset has more than one word. In this case, we normalize the vector of each word by dividing its coordinates by the number of synsets that the word belongs to. This reduces the noise that the other meanings of a word introduce when generating the synset vector. We define the vector of $synset_i$ ($\vec{V}_{s_i}$) as follows:

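$$\vec{V}_{s_i} = \frac{1}{j} \cdot \left( \vec{V}_1 \cdot \frac{1}{n_1} + \vec{V}_2 \cdot \frac{1}{n_2} + \ldots + \vec{V}_j \cdot \frac{1}{n_j} \right).$$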

Figure 6.1 shows an example of creating a vector for the synset 00076884-n, which includes three words: spill, tumble and fall.

Figure 6.1: An example of creating a vector for a wordnet synset that includes more than one word.
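The formula for a multi-word synset can be sketched in a few lines. This is an illustrative version, not the implementation listed in the appendix; the function name, the toy two-dimensional vectors, and the synset counts used for "tumble" and "fall" are made up for the example.

```python
def synset_vector(words, vectors, synset_counts):
    """Average the word vectors, each scaled by 1 / (number of synsets of the word).

    words:         the j words of the synset
    vectors:       word -> embedding (list of floats)
    synset_counts: word -> n, the number of synsets the word belongs to
    """
    if not words:
        raise ValueError("synset has no words")
    dim = len(vectors[words[0]])
    total = [0.0] * dim
    for word in words:
        scale = 1.0 / synset_counts[word]
        for k, value in enumerate(vectors[word]):
            total[k] += value * scale
    return [value / len(words) for value in total]


# Toy data: vectors and most counts are invented for illustration.
toy_vectors = {"spill": [4.0, 0.0], "tumble": [2.0, 2.0], "fall": [0.0, 6.0]}
toy_counts = {"spill": 4, "tumble": 2, "fall": 3}
print(synset_vector(["spill", "tumble", "fall"], toy_vectors, toy_counts))
```

A word that belongs to many synsets contributes less to each synset vector, which is the intended damping of its other meanings.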

Next, we produce vectors for the synsets that share a single word, i.e. words that do not have any synonyms and have more than one meaning. In this case, for each synset, we produce the synset vector by combining the word vector with the vector of a word in a related synset, e.g. a hypernym, a hyponym, or a meronym. For example, let $synset_i$ and $synset_k$ be synsets that both include the same single word $w$, let $h_1$ be a word from the hypernym of $synset_i$, and let $h_2$ be a word from the hypernym of $synset_k$. We define the vector of $synset_i$ ($\vec{V}_{s_i}$) as follows:

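$$\vec{V}_{s_i} = \frac{1}{2} \cdot \left( \vec{V}_w \cdot \frac{1}{n_w} + \vec{V}_{h_1} \cdot \frac{1}{n_{h_1}} \right).$$

Similarly, we define the vector of $synset_k$ ($\vec{V}_{s_k}$) as follows:

$$\vec{V}_{s_k} = \frac{1}{2} \cdot \left( \vec{V}_w \cdot \frac{1}{n_w} + \vec{V}_{h_2} \cdot \frac{1}{n_{h_2}} \right).$$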

Figure 6.2 shows an example of creating vectors for the two synsets of the word “abduction”. In Appendix A.2 we list a Python implementation of the procedure.

Figure 6.2: An example of creating vectors for wordnet synsets that share a single word.
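For readers without the appendix at hand, the single-word case can be sketched as below. This is not the Appendix A.2 listing; the function name, the toy vectors, and the stand-in hypernym words "crime" and "movement" (with invented synset counts) are assumptions for illustration only.

```python
def single_word_synset_vector(word, related_word, vectors, synset_counts):
    """Compute 0.5 * (V_w / n_w + V_h / n_h) for a synset with one member word.

    related_word is a word taken from a related synset, e.g. a hypernym.
    """
    v_w, n_w = vectors[word], synset_counts[word]
    v_h, n_h = vectors[related_word], synset_counts[related_word]
    return [0.5 * (a / n_w + b / n_h) for a, b in zip(v_w, v_h)]


# Toy data: the related words and all numbers are hypothetical.
toy_vectors = {"abduction": [2.0, 2.0], "crime": [4.0, 0.0], "movement": [0.0, 4.0]}
toy_counts = {"abduction": 2, "crime": 1, "movement": 2}

crime_sense = single_word_synset_vector("abduction", "crime", toy_vectors, toy_counts)
motion_sense = single_word_synset_vector("abduction", "movement", toy_vectors, toy_counts)
print(crime_sense, motion_sense)  # [2.5, 0.5] [0.5, 1.5]
```

The two senses of the same word end up with different vectors because each one is pulled toward its own related synset.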

6.3 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec

In this section, we give one usage example of our model. We show how our proposed model can be used in the automatic selection of glosses for wordnet synsets. The automatic selection of a synset gloss is a word-sense disambiguation problem. A gloss is a short sentence which is usually manually attached to a synset to clarify the meaning of the synset. This short sentence can be a definition or an example sentence of one of the members of the synset. We test our method using PWN and then apply it to automatically add glosses to the wordnets created in (Lam et al., 2014b).

In the following steps, we present our method to select a gloss for the $synset_i$ we defined in Section 6.2.

• Let $G = \{g_1, g_2, \ldots, g_{|G|}\}$ be the set of candidate glosses that include a word belonging to $synset_i$.

• To select the gloss in $G$ that is closest to $synset_i$, we generate a vector for each gloss $g_z \in G$. We list a Python function for this step in Appendix A.3.

• Assume that the gloss $g_z$ consists of the words $\{w_1, w_2, \ldots, w_d\}$, that $\{m_1, m_2, \ldots, m_d\}$ is the number of synsets for each word in $g_z$, and that $\{\vec{V}_{w_1}, \vec{V}_{w_2}, \ldots, \vec{V}_{w_d}\}$ are the corresponding vectors for $\{w_1, w_2, \ldots, w_d\}$.

• We compute the vector of gloss $g_z$ as follows:

$$\vec{V}_{g_z} = \frac{1}{d} \cdot \left( \vec{V}_{w_1} \cdot \frac{1}{m_1} + \vec{V}_{w_2} \cdot \frac{1}{m_2} + \ldots + \vec{V}_{w_d} \cdot \frac{1}{m_d} \right).$$

• Then, we compute the cosine similarity between the vector of each gloss $g_z$ and $\vec{V}_{s_i}$. We present a Python implementation for this step in Appendix A.4.

• Finally, we select the gloss with the highest cosine similarity with $\vec{V}_{s_i}$.

For instance, as shown in Table 6.2, if we consider the word “abduction” which belongs

to two synsets and does not have any synonyms, we notice that our algorithm was able to

distinguish between the two meanings and select the right gloss for both synsets.

Synset Key    Gloss                                                                        Cosine Similarity
00333037-n    the criminal act of capturing and carrying away by force a family member    0.172
              moving of a body part away from the central axis of the body                0.214
00775460-n    the criminal act of capturing and carrying away by force a family member    0.204
              moving of a body part away from the central axis of the body                0.189

Table 6.2. Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
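The gloss selection steps of this section can be sketched end to end as follows. This is a hedged illustration rather than the code from Appendices A.3 and A.4: the function names, whitespace tokenization, the toy vectors, and the handling of words without an embedding or without a wordnet entry are all assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def gloss_vector(words, vectors, synset_counts, dim):
    """(1/d) * sum of V_w / m_w over the d words of the gloss."""
    total = [0.0] * dim
    for word in words:
        if word not in vectors:
            continue  # assumption: words without an embedding contribute nothing
        m = max(synset_counts.get(word, 1), 1)  # assumption: unknown words count as 1
        for k, value in enumerate(vectors[word]):
            total[k] += value / m
    return [value / len(words) for value in total] if words else total


def select_gloss(synset_vec, candidate_glosses, vectors, synset_counts):
    """Return (similarity, gloss) for the candidate closest to the synset vector."""
    scored = []
    for gloss in candidate_glosses:
        vec = gloss_vector(gloss.lower().split(), vectors, synset_counts, len(synset_vec))
        scored.append((cosine_similarity(synset_vec, vec), gloss))
    return max(scored)


# Toy embeddings and counts, invented for illustration.
toy_vectors = {"criminal": [1.0, 0.0], "capturing": [0.8, 0.2],
               "moving": [0.0, 1.0], "body": [0.1, 0.9]}
toy_counts = {"criminal": 1, "capturing": 1, "moving": 1, "body": 1}
glosses = ["criminal act of capturing", "moving of a body part"]
best = select_gloss([2.5, 0.5], glosses, toy_vectors, toy_counts)
print(best[1])
```

Because cosine similarity ignores vector length, the 1/d factor does not change which gloss is selected; it only keeps the gloss vector on the same scale as the synset vector.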

6.4 Evaluation

In this section, we introduce two forms of evaluation. First, we apply our method to select glosses for the PWN synsets. In this case, we directly compare our results to the glosses manually attached in PWN. Then, we apply our method to attach glosses to wordnet synsets generated by Lam et al. (2014b). In this case, we ask human judges to evaluate the resulting glosses for three languages: Arabic, Assamese and Vietnamese.

6.4.1 Using Synset2vec to Select Glosses for PWN Synsets

In order to evaluate our synset vector representation in the task of selecting glosses for

wordnets, we use it in the process of gloss selection for PWN synsets. We take advantage of

the glosses manually added to the synsets in PWN to automatically measure the precision of

our synset representation. The following steps describe the evaluation process of selecting

glosses for PWN synsets.

• For each $synset_i$ in PWN, we construct a set of candidate glosses. The candidate glosses are extracted from PWN using the following method. First, the gloss attached to $synset_i$ in PWN is added to the candidate set of glosses. Next, to generate negative glosses for $synset_i$, we extract words which belong to both $synset_i$ and other synsets, i.e. words that have the meaning of $synset_i$ and one or more other meanings, and add the glosses of those other synsets to the candidate set. This allows us to examine the ability of the algorithm to differentiate between the different meanings of synsets.

• We randomly select two types of synsets from PWN: synsets that have single words, i.e. synsets that are represented by only a single word, and synsets that include multiple synonym words.

• We generate the synset vectors using the algorithm we described in Section 6.2.

• Next, we generate the gloss vectors using the method we described in Section 6.3.

• Then, we compute the cosine similarity between $synset_i$ and each gloss in the candidate set.

• Finally, we select the gloss with the highest cosine similarity.
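The candidate-set construction and the automatic precision measurement in the steps above can be sketched as follows. The function names and data layout are assumptions; the synset keys and glosses are copied from Table 6.1, and the "selected" glosses at the end are invented to show the calculation.

```python
def build_candidates(synset_id, members, glosses):
    """Positive gloss first, then glosses of other synsets sharing a member word.

    members: synset id -> set of member words
    glosses: synset id -> gloss attached in the wordnet
    """
    candidates = [glosses[synset_id]]
    for other_id, other_words in members.items():
        if other_id != synset_id and members[synset_id] & other_words:
            candidates.append(glosses[other_id])  # negative gloss (another sense)
    return candidates


def selection_precision(selected, glosses):
    """Percentage of synsets whose selected gloss is the one attached in the wordnet."""
    correct = sum(1 for sid, gloss in selected.items() if glosses[sid] == gloss)
    return 100.0 * correct / len(selected)


members = {
    "00076884-n": {"spill", "tumble", "fall"},
    "00329619-n": {"spill", "spillage", "release"},
    "15049594-n": {"spill"},
}
glosses = {
    "00076884-n": "a sudden drop from an upright position",
    "00329619-n": "the act of allowing a fluid to escape",
    "15049594-n": "liquid that is spilled",
}
print(build_candidates("00076884-n", members, glosses))

# Hypothetical selections: one right, one wrong.
selected = {
    "00076884-n": "a sudden drop from an upright position",
    "00329619-n": "liquid that is spilled",
}
print(selection_precision(selected, glosses))  # prints 50.0
```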

6.4.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets

In this section, we examine the precision of our method by applying it to select glosses from corpora to attach to the wordnets we create in the previous chapters. In this experiment, we use the Arabic, Assamese and Vietnamese wordnets. Next, we describe the steps of evaluating the glosses selected by our method for the synsets of the target languages:

• For each $synset_i$ in the target wordnet $wordnet_t$, we generate a set of candidate glosses by extracting the set of sentences that include any member of $synset_i$ from the corpora we described in Section 5.7.

• We randomly select two types of synsets from $wordnet_t$: synsets that have single words, i.e. synsets that are represented by only a single word, and synsets that include multiple synonym words.

• We generate the synset vectors using the algorithm we described in Section 6.2.

• Next, we generate a vector for each sentence in the set of candidate glosses using the method we described in Section 6.3.

• Then, we compute the cosine similarity between $synset_i$ and each sentence in the candidate set.

• Next, we select the 3 sentences with the highest cosine similarity with $synset_i$.

• Finally, 3 native speakers of the target language are asked to evaluate the selected sentences using a 5-point scale.

Synset Type      Number of Synsets    Precision
Single Member    1400                 76.5%
Multi Member     600                  79.6%

Table 6.3. The precision of selecting glosses for PWN synsets.

6.4.3 Results & Discussion

As shown in Table 6.3, we used our algorithm to select glosses for 1400 single-

member synsets from PWN. The algorithm achieved 76.5% precision. Also, we used it to

select glosses for 600 multi-member synsets from PWN. The precision was 79.6% in this

case.

Table 6.4. Examples of Arabic glosses we produce in our Arabic wordnet.

In the second evaluation, we randomly selected 300 synsets from the Arabic, Assamese and Vietnamese wordnets we create (100 synsets each). For each synset, we extracted all the sentences that included any member of the synset from the corpora. The sentences were sorted according to their cosine similarity with the synset vector and the top 3 sentences were selected.
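Selecting the three highest-scoring sentences for a synset can be sketched with the standard library. The function name and the scored sentences below are placeholders; the similarity values would come from the synset and sentence vectors described in Sections 6.2 and 6.3.

```python
import heapq


def top_k_sentences(scored_sentences, k=3):
    """Return the k (cosine_similarity, sentence) pairs with the highest similarity."""
    return heapq.nlargest(k, scored_sentences, key=lambda item: item[0])


# Placeholder scores and sentences, for illustration only.
scored = [(0.2, "sentence a"), (0.9, "sentence b"), (0.5, "sentence c"), (0.7, "sentence d")]
print(top_k_sentences(scored))
# [(0.9, 'sentence b'), (0.7, 'sentence d'), (0.5, 'sentence c')]
```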

As shown in Table 6.7, the precision of selecting glosses for the Arabic synsets is 81.4% when selecting the sentences with the highest cosine similarity with the synset vector. Furthermore, the precision of the top 2 and top 3 sentences is 70.4% and 65.8% respectively. The overall precision of selecting glosses using our method for the Arabic synsets is 72.6%. Table 6.4 shows some examples of the glosses we produce for the Arabic synsets along with their cosine similarity values.

The precision of our method for selecting glosses for the Assamese synsets is 85.2% when selecting the sentences with the highest cosine similarity. Moreover, the top 2 and top 3 selected sentences achieved 83.2% and 84.6% respectively. The overall precision for Assamese glosses is 84.4%. Table 6.5 shows some examples of the glosses we produce for the Assamese synsets along with their cosine similarity values.

Table 6.5. Examples of Assamese glosses we produce in our Assamese wordnet.

The top Vietnamese glosses selected by our method have 39.4% precision. The top 2 and top 3 Vietnamese glosses selected by our method have 36.6% and 37.0% precision respectively. Table 6.6 shows some examples of the glosses we produce for the Vietnamese synsets along with their cosine similarity values.

Table 6.6. Examples of Vietnamese glosses we produce in our Vietnamese wordnet.

              Precision
Wordnet       Top 1    Top 2    Top 3    Overall
Arabic        81.4%    70.4%    65.8%    72.6%
Assamese      85.2%    83.2%    84.6%    84.4%
Vietnamese    39.4%    36.6%    37.0%    37.6%

Table 6.7. The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets.

In general, the precision of recently published algorithms (Apidianaki and Von Neumann, 2013) for the task of multilingual word-sense disambiguation is around 68.7%, meaning that our algorithm shows better performance for English, Arabic and Assamese. However, we notice that our method performs poorly with Vietnamese. The reason behind the poor results is that white spaces do not reliably separate words in Vietnamese (Gordon and Grimes, 2005): a single word can contain white spaces, so the meaning of most words can change based on the following words. This makes the process of generating the vectors for both the synsets and the sentences extremely difficult, since the word2vec algorithm assumes that words are separated by white spaces. The same problem appears in the process of automatically generating bilingual dictionaries for Vietnamese (Lam et al., 2015a). One possible solution to this problem is replacing the white spaces within single Vietnamese words with a special non-white-space character. This requires a dictionary of the language to identify the words that include white spaces.
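The proposed fix for Vietnamese can be sketched as a preprocessing pass that joins the syllables of dictionary words before the corpus is given to word2vec. The greedy longest-match strategy, the underscore joiner, the function name, and the two-entry toy dictionary are assumptions made for this example; this was not implemented or evaluated in this work.

```python
def join_multisyllable_words(tokens, dictionary, joiner="_", max_len=4):
    """Greedy longest-match: merge runs of tokens that form a dictionary word.

    tokens:     white-space separated syllables of a sentence
    dictionary: set of known words, written with spaces between syllables
    """
    result = []
    i = 0
    while i < len(tokens):
        match_len = 1
        # Try the longest span first, down to two syllables.
        for length in range(min(max_len, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + length]) in dictionary:
                match_len = length
                break
        result.append(joiner.join(tokens[i:i + match_len]))
        i += match_len
    return result


# Tiny illustrative dictionary; a real one would come from a lexical resource.
toy_dictionary = {"học sinh", "sinh viên"}
print(join_multisyllable_words("tôi là học sinh".split(), toy_dictionary))
# ['tôi', 'là', 'học_sinh']
```

After this pass, each multi-syllable word is a single token, so a tool that splits on white space treats it as one word.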


6.5 Summary

In this chapter, we presented a new method for selecting a synset gloss from a corpus. The method can be used for low-resource languages to attach glosses to wordnets constructed automatically. Our method provides a vector representation for wordnet synsets in a multi-dimensional space. We construct a synset vector by combining the word embedding vectors of the synonyms in the synset. Our evaluation showed that our method selects glosses with precision up to 84.4%.


Chapter 7

LEXBANK: A MULTILINGUAL LEXICAL RESOURCE

Figure 7.1: An overview of LexBank system.

7.1 Introduction

In this chapter, we discuss the design and implementation of LexBank: a system that provides access to the multilingual lexical resources we create in this dissertation. We aim to give public users the ability to access and use the resources that we have created in our project. The system provides wordnet search services for several low-resource languages in addition to bilingual dictionary lookup services. In addition, the system receives evaluation and feedback from users to improve the quality of the resources.


As Figure 7.1 shows, the system is divided into three layers: web interface, application layer and database layer. The web interface allows users to log into the system and access the search services. The web interface also provides a control panel that allows administrators to manage the system. The application layer includes all the software required to securely execute the users' requests. The database layer has two databases: the lexical resources database and the system database. The system database stores user information and the system settings. The design of the system makes it easy to include new language resources and to make modifications.

7.2 Database Design

LexBank uses two databases: one for storing the system settings and one for storing

the lexical resources. We have used Microsoft SQL Server to construct the databases. The

SQL code we used to construct the databases is listed in Appendix B. Next, we describe

each database in detail.

7.2.1 The system settings database

There are two tables in the system settings database: Users_Info and System_log. Next, we

describe both tables.

7.2.1.1 Users_Info

The Users_Info table contains information about the registered users. The following are the fields contained in the Users_Info table:

• UserId: a unique short alias, selected by the user, that is used to identify the user in the system.

• UserName: the full name of the user.

• UserEmail: the email address of the user.

• UserPwd: the encrypted password used by the user to access the system.

• UserPriv: a text field that determines the privileges that the user has. There are two levels of users in the system. The first level is administrator, which has the privileges of managing users and data in the system. The second level is client, which has the privilege of browsing the available resources.

• UserStatus: this field specifies the status of the user. The status can be Active, Inactive or New.

7.2.1.2 System_log

The System_log table keeps records of all user activities in the system. This helps us with maintenance and with keeping track of the utilization of the system. The following fields are contained in the System_log table:

• EventId: a unique key that is used to identify the event.

• EventDesc: a text description of the event.

• EventTime: the date and time of the event.

• UserId: the identification key of the user who committed the event.
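To make the two tables concrete, the sketch below creates them in an in-memory SQLite database from Python. This is only an illustration: the actual system uses Microsoft SQL Server with the SQL listed in Appendix B, and the column types, constraints, stored privilege strings, and sample row here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users_Info (
    UserId     TEXT PRIMARY KEY,
    UserName   TEXT NOT NULL,
    UserEmail  TEXT NOT NULL,
    UserPwd    TEXT NOT NULL,  -- stored in encrypted form, never plain text
    UserPriv   TEXT NOT NULL CHECK (UserPriv IN ('administrator', 'client')),
    UserStatus TEXT NOT NULL CHECK (UserStatus IN ('Active', 'Inactive', 'New'))
);
CREATE TABLE System_log (
    EventId   INTEGER PRIMARY KEY AUTOINCREMENT,
    EventDesc TEXT NOT NULL,
    EventTime TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UserId    TEXT NOT NULL REFERENCES Users_Info(UserId)
);
""")

# Hypothetical sample data.
conn.execute(
    "INSERT INTO Users_Info VALUES (?, ?, ?, ?, ?, ?)",
    ("user1", "Example User", "user1@example.org", "<encrypted value>", "client", "New"),
)
conn.execute(
    "INSERT INTO System_log (EventDesc, UserId) VALUES (?, ?)",
    ("registered", "user1"),
)
rows = conn.execute("SELECT UserId, EventDesc FROM System_log").fetchall()
print(rows)  # [('user1', 'registered')]
```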

7.2.2 The lexical resources database

The lexical resources database contains the resources we produce in this thesis. For each language supported by the system, the database maintains tables for storing the core wordnet, the semantic relations, the wordnet glosses, the evaluation data of the semantic relations and the evaluation data of the wordnet glosses. Next, we describe each table in this database.


7.2.2.1 CoreWordnet

The CoreWordnet table stores the wordnet synsets we create in this thesis. The core wordnet groups the synonym words into sets called synsets. In this table, synsets are identified using the offset-pos of the corresponding synset in PWN. In PWN, the offset-pos consists of two parts: the byte offset used to locate the synset in the data file and the part of speech of the synset. The following are the fields in the CoreWordnet table:

• offset-pos: the offset-pos of the wordnet synset, which is used as an identifier for the synset.

• Member: a word that belongs to the synset.

7.2.2.2 Sem_Relations

Whereas the synonymy relation is stored in the CoreWordnet table, other semantic relations such as hypernymy and meronymy are stored in the Sem_Relations table. As we described in Section 4.2, the semantic relations are directed relations. Therefore, we maintain the direction by specifying the side of each synset in the relation. The Sem_Relations table contains the following fields:

• Left_offset-pos: this field specifies the offset-pos of the synset on the left side of the relation.

• Relation: a text field that specifies the relation between the left side and the right side synsets.

• Right_offset-pos: the offset-pos of the synset on the right side of the relation.
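The way the two tables work together for a lookup can be sketched as follows, again using an in-memory SQLite database from Python rather than the SQL Server schema of Appendix B. The column names with underscores (hyphens are not valid in unquoted SQL identifiers), the helper functions, the English sample rows taken from Table 6.1, and the placeholder relation row are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CoreWordnet (
    offset_pos TEXT NOT NULL,
    Member     TEXT NOT NULL,
    PRIMARY KEY (offset_pos, Member)
);
CREATE TABLE Sem_Relations (
    Left_offset_pos  TEXT NOT NULL,
    Relation         TEXT NOT NULL,
    Right_offset_pos TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO CoreWordnet VALUES (?, ?)", [
    ("00076884-n", "spill"), ("00076884-n", "tumble"), ("00076884-n", "fall"),
    ("00329619-n", "spill"), ("00329619-n", "spillage"), ("00329619-n", "release"),
])
# Placeholder relation: the right-hand key is not a real PWN synset.
conn.execute("INSERT INTO Sem_Relations VALUES (?, ?, ?)",
             ("00076884-n", "hypernym", "00000000-n"))


def find_synsets(word):
    """All synsets that contain the given word."""
    rows = conn.execute(
        "SELECT offset_pos FROM CoreWordnet WHERE Member = ? ORDER BY offset_pos", (word,))
    return [row[0] for row in rows]


def find_relations(offset_pos):
    """Directed relations that start at the given synset."""
    rows = conn.execute(
        "SELECT Relation, Right_offset_pos FROM Sem_Relations WHERE Left_offset_pos = ?",
        (offset_pos,))
    return rows.fetchall()


print(find_synsets("spill"))         # ['00076884-n', '00329619-n']
print(find_relations("00076884-n"))  # [('hypernym', '00000000-n')]
```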

7.2.2.3 WordnetGlosses

The WordnetGlosses table stores the wordnet glosses we generate in Chapter 6. The following are the fields of the WordnetGlosses table:

• offset-pos: the offset-pos of the wordnet synset.

• Gloss: a text field that contains the gloss of the synset.

7.2.2.4 Sem_Relations_Eval_Data

The Sem_Relations_Eval_Data table contains the sample of semantic relations that is used in the evaluation. This table contains the following fields:

• RelationKey: a unique identification number used to identify the semantic relation being evaluated.

• Left_offset-pos: the offset-pos of the synset on the left side of the relation being evaluated.

• Word1: this field specifies the word on the left side of the relation being evaluated.

• Relation: a text field that specifies the type of relation being evaluated.

• Right_offset-pos: the offset-pos of the synset on the right side of the relation being evaluated.

• Word2: this field specifies the word on the right side of the relation being evaluated.

• COS: the cosine similarity, as measured in Section 5.4, between the left word and the right word in the relation being evaluated.

7.2.2.5 Sem_Relations_Eval_Response

The Sem_Relations_Eval_Response table contains the responses collected from evaluators for the semantic relations we produce. This table consists of the following fields:

• AnswerKey: a unique integer that is generated automatically to identify the response.

• RelationKey: the key of the semantic relation being evaluated.

• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator to the semantic relation.

• UserId: the identification key of the evaluator who submitted the response.

7.2.2.6 WordnetGlosses_Eval_Data

The WordnetGlosses_Eval_Data table holds the sample of wordnet glosses which is being evaluated by the users. The table includes the following fields:

• GlossKey: an automatically generated unique integer used to identify the gloss being evaluated.

• offset-pos: the offset-pos of the wordnet synset.

• Word: the word which is used in the gloss to represent the wordnet synset.

• Sentence: the sentence selected as the gloss for this wordnet synset.

• PWNGloss: the English gloss of the corresponding synset in PWN.

• CosSem: the cosine similarity between the selected sentence and the synset as measured in Section 6.3.

• GlossRank: an integer value that represents the rank of the gloss among the other candidate glosses. The rank is assigned by the system to the gloss being evaluated based on the CosSem value. The gloss with the highest CosSem value has a rank value of 1.
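The rank assignment described for GlossRank can be sketched as below. The function name, the dictionary layout, the placeholder sentences, and the tie handling (ties keep their input order) are assumptions for illustration.

```python
def assign_gloss_ranks(candidates):
    """Return the candidates sorted by CosSem (highest first) with GlossRank added.

    candidates: list of dicts with at least the keys 'Sentence' and 'CosSem'.
    """
    ordered = sorted(candidates, key=lambda c: c["CosSem"], reverse=True)
    return [dict(c, GlossRank=rank) for rank, c in enumerate(ordered, start=1)]


# Placeholder candidate glosses for one synset.
ranked = assign_gloss_ranks([
    {"Sentence": "sentence a", "CosSem": 0.2},
    {"Sentence": "sentence b", "CosSem": 0.9},
    {"Sentence": "sentence c", "CosSem": 0.5},
])
for row in ranked:
    print(row["GlossRank"], row["Sentence"], row["CosSem"])
```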

7.2.2.7 WordnetGlosses_Eval_Response

Responses from the users evaluating the wordnet glosses we produced in Section 6.3 are stored in the WordnetGlosses_Eval_Response table. This table consists of the following fields:

• AnswerKey: a unique integer that is generated automatically to identify the response.

• GlossKey: the key of the gloss being evaluated.

• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator to the gloss.

• UserId: the identification key of the evaluator who evaluated the gloss.

7.3 Application layer

In this section, we describe the main functions provided by LexBank. In order to

maintain simplicity, we implement most of the functions of the system in one utility class

(LexBankUtils.cs) written in Microsoft C#. The utility class, which is listed in Appendix

C, consists of the following methods:

• IsUserIdAvailable(): takes a UserId and returns true if it has never been used by another user.

• EncryptPassword(): takes a plain text password and returns an encrypted password.

• DecryptPassword(): takes an encrypted password and returns a decrypted password.

• CreateNewUser(): takes the details of a new user and creates an account for the user by storing the data in the Users_Info table.

• IsAuthenticated(): takes the user identification and password and returns true if they match the user information in the users table.

• FindSynSet(): takes a lexeme and returns a list of synsets that include this lexeme.

• FindSynSetLexemes(): takes an OffsetPos of a synset and returns the list of lexemes of this synset.

• IsSynSetAvailable(): takes an OffsetPos of a synset in a specific wordnet and returns true if the synset is available in the specified wordnet.

• FindSynSetRelations(): takes an OffsetPos of a synset and returns all the semantically related lexemes.

• FindGloss(): takes an OffsetPos of a synset and returns the gloss of the synset.

• ReadRelation(): takes a RelationKey and returns the details of the relation.

• ReadSynsetGloss(): takes a GlossKey and returns the details of the gloss.

• EvaluateRelation(): takes a RelationKey, Score and UserId and stores them in the evaluation table of the semantic relations.

• EvaluateGloss(): takes a GlossKey, Score and UserId and stores them in the evaluation table of the wordnet glosses.

• LogEvent(): takes an event description and stores it in the System_log table.

• ChangeUserStatus(): takes the UserId of a user and changes the user's status to a specified new status.

• RetrieveUsers(): returns a list of all the users in the system and their information.

7.4 Web Interface Design & Implementation

In this section, we describe the design of the web interface of LexBank. The web

interface is implemented in ASP.NET using Microsoft Visual Studio 2012. Figure 7.2

shows the site map of the web interface. The interface is accessed by the log-in web page

(frmLogin.aspx). New users need to register to gain access to the system. Registration is done by filling in the registration web form (frmRegister.aspx). Once a user has logged in to the system, the main menu web page (frmMainMenu.aspx) is shown. The main menu includes links to the services available in the system. In the following sections, we describe each web page in the system.

Figure 7.2: LexBank web site map

7.4.1 Registration Form

New users need to register in the system using the registration form (frmRegister.aspx). As shown in Figure 7.3, a new user needs to provide the full name, email, email confirmation, user identification, password and password confirmation, and then press the Register button.

Figure 7.3: The registration web form

The registration process starts when a new user submits this information through the registration web form. Once the registration form receives the information, it checks whether all the fields meet the requirements of the system. The requirements include a valid format for the email address and the password, and a user identification that has never been used by an existing user. If the information sent by the user passes the validation process, the registration form calls the CreateNewUser() method from the utility class. The CreateNewUser() method uses the EncryptPassword() method to encrypt the password, then writes the data into the Users_info table. The registration process is summarized in the sequence diagram shown in Figure 7.4.
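The registration flow described above can be sketched as follows. This is a Python illustration rather than the C# code of LexBankUtils.cs. The validation rules, the error messages, the in-memory users dictionary and the initial "Inactive" status are assumptions. The sketch also swaps in a salted one-way PBKDF2 hash where the text describes EncryptPassword(), which is a different technique from reversible encryption.

```python
import hashlib
import os
import re

# Deliberately loose format check; the real validation rules are not given in the text.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def protect_password(password, salt):
    """Stand-in for EncryptPassword(): a salted, one-way PBKDF2 hash."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def register(users, full_name, email, email_confirm, user_id, password, password_confirm):
    """Validate a registration request; store the user only if no errors are found."""
    errors = []
    if not EMAIL_RE.match(email):
        errors.append("invalid email format")
    if email != email_confirm:
        errors.append("emails do not match")
    if len(password) < 8:  # hypothetical password rule
        errors.append("password too short")
    if password != password_confirm:
        errors.append("passwords do not match")
    if user_id in users:  # plays the role of IsUserIdAvailable()
        errors.append("user id already taken")
    if not errors:
        salt = os.urandom(16)
        users[user_id] = {
            "name": full_name,
            "email": email,
            "salt": salt,
            "password": protect_password(password, salt),
            "status": "Inactive",  # assumed: an administrator activates new accounts
        }
    return errors
```

Returning the full list of errors, rather than stopping at the first one, lets a form report every invalid field in a single round trip.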

Figure 7.4: Sequence diagram of the registration process

7.4.2 Log-in Form

Registered users can log in to the system using the log-in web page (frmLogin.aspx), which is shown in Figure 7.5. A user with an active account needs to provide a user identification and a password to start the log-in process.

Figure 7.5: The log-in web form

Figure 7.6: Sequence diagram of the log-in process

As shown in Figure 7.6, when the log-in web form (frmLogin.aspx) receives the userid and the password, it calls the IsAuthenticated() method from the utility class. Then, the password is encrypted using the EncryptPassword() method and compared with the encrypted password stored in the users table. If the userid and the password provided by the user match the userid and the password stored in the users table, the main menu of the web interface is shown to the user; otherwise, an error message is shown. The main

menu is shown in Figure

7.4.3 The Main Menu

The main menu includes links to the services available in the system. The services provided by the web interface are:

• Searching wordnet using a lexeme, provided by the web page (frmWordnetSearch.aspx).

• Searching wordnet using an OffsetPos, provided by the web page (frmSynsetDetails.aspx).

• Evaluating semantic relations between synsets, provided by the web page (frmEvalRelations.aspx).

• Evaluating wordnet glosses, provided by the web page (frmEvalGloss.aspx).

• Users management, provided by the web page (frmManageUsers.aspx).

Figure 7.7: The main menu

7.4.4 Searching Wordnet By Lexeme Web Form

The web form (frmWordnetSearch.aspx) allows users to search for the synsets of a lexeme in a specific language. As shown in Figure 7.8, this web form consists of the following components:

• A text box in which the user enters a lexeme.

• A drop-down menu that allows the user to select the language.

• A list box showing the synsets of the entered lexeme.

• A list box showing the synonyms of the entered lexeme.

• A list box showing the related lexemes.

• A button to start the search.

Figure 7.8: The web form for searching wordnet by lexeme. The form is showing the result of searching the Arabic lexeme (مصر), which means Egypt.

The search process, as shown in Figure 7.9, starts when the user submits a lexeme and a language to the frmWordnetSearch.aspx web form. Then, the FindSynSet() method from the utility class is called to retrieve the synsets that include the entered lexeme, and the result is shown in the synsets list. Next, when the user selects a synset from the synsets list, the web form calls the FindSynSetLexemes() method from the utility class to show the synonyms of the lexeme in the synonym list. It also calls the FindSynSetRelations() method to obtain the related lexemes and show them in the related lexemes list. The user can also expand the details of a synset shown in the synsets list or the related lexemes list by double-clicking on the synset OffsetPos. This opens the frmSynsetDetails.aspx web form, which we describe next.
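The three lookups used by the search form can be sketched over an in-memory structure. The sketch is in Python, not the C# of the utility class, and the dictionary layout, the function names and the toy synsets (including their OffsetPos values) are assumptions made purely for illustration.

```python
# Toy wordnet: OffsetPos -> lexemes and outgoing relations (made-up data).
WORDNET = {
    "00000001-n": {"lexemes": ["car", "automobile"], "relations": {"hypernym": ["00000002-n"]}},
    "00000002-n": {"lexemes": ["vehicle"], "relations": {}},
    "00000003-n": {"lexemes": ["car", "railcar"], "relations": {"hypernym": ["00000002-n"]}},
}


def find_synsets(wordnet, lexeme):
    """Counterpart of FindSynSet(): OffsetPos of every synset containing the lexeme."""
    return sorted(key for key, synset in wordnet.items() if lexeme in synset["lexemes"])


def find_synset_lexemes(wordnet, offset_pos):
    """Counterpart of FindSynSetLexemes(): the synonyms stored in one synset."""
    return list(wordnet[offset_pos]["lexemes"])


def find_synset_relations(wordnet, offset_pos):
    """Counterpart of FindSynSetRelations(): (relation, target OffsetPos, lexeme) triples."""
    related = []
    for relation, targets in wordnet[offset_pos]["relations"].items():
        for target in targets:
            for lexeme in wordnet[target]["lexemes"]:
                related.append((relation, target, lexeme))
    return related
```

A search for "car" first lists both synsets that contain it; selecting one of them then drives the synonym and related-lexeme lookups, mirroring the two-step interaction of the form.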

Figure 7.9: Sequence diagram of the process of searching wordnet using lexeme

7.4.5 Searching Wordnet By OffsetPos Web Form

Wordnet search using OffsetPos is provided by the frmSynsetDetails.aspx web form, which is shown in Figure 7.10. This web form consists of the following components:

• A text box for entering the OffsetPos of the synset.

• A drop-down menu that allows the user to select the language.

• A text box showing the gloss of the synset.

• A text box showing the English gloss of the synset.

• A list box showing the synonym list of the synset.

• A list box showing the related synsets and lexemes of the entered synset.

• A button to start the search.

Figure 7.10: The web form for searching wordnet by OffsetPos. The form is showing the result of searching the Arabic synset (08897065-n).

In this form, the user starts the search by submitting the OffsetPos of the synset and the target language to the frmSynsetDetails.aspx web form. The web form calls the FindGloss() method from the utility class to retrieve the gloss of the synset. It also calls the FindSynSetLexemes() and FindSynSetRelations() methods to obtain the synonym list and the related synsets of the input synset and show them in the form.

Figure 7.11: Sequence diagram of the process of searching wordnet using OffsetPos.

7.4.6 Evaluating Semantic Relations Between Synsets Web Form

The web form frmEvalRelations.aspx allows users to evaluate semantic relations between lexemes and synsets in the system. The form shows the relation as a sentence and asks the user to rate the correctness of the sentence using a Likert-type scale. The form consists of the following components:

• A text box showing the relation key.

• A text box showing the relation in the form of a sentence.

• A text box showing the UserId of the evaluator.

• An option box that allows the user to rate the relation.

• A button to submit the score.

• A button to end the evaluation session.

Figure 7.12: The web form for evaluating semantic relations between synsets in a wordnet. The form is showing an example of evaluating a hyponymy relation between the two Assamese lexemes radiotelegraph and radio.

Figure 7.13: Sequence diagram of the process of evaluating the relation between two lexemes.

The evaluation form frmEvalRelations.aspx starts the evaluation process by calling the ReadRelation() method from the utility class to show the relation details to the user. When the user submits the score assigned to a relation, the evaluation form stores the score by calling the EvaluateRelation() method from the utility class. Then, the evaluation form reads the next relation and shows it to the user. The user can stop the evaluation process by clicking the End Session button. A user who has stopped can resume the evaluation process at any time without re-evaluating the relations already evaluated.
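The resume behaviour described above amounts to always serving the first relation that the current user has not scored yet. The sketch below illustrates this in Python with an in-memory dictionary in place of the evaluation table. The function names, the data layout and the 1 to 5 score range for relations are assumptions.

```python
def next_unevaluated(relation_keys, responses, user_id):
    """Return the first relation key this user has not scored yet, or None when done."""
    done = {key for (key, uid) in responses if uid == user_id}
    for key in relation_keys:
        if key not in done:
            return key
    return None


def evaluate_relation(responses, relation_key, score, user_id):
    """Store one score, keyed by (relation, user), after a range check."""
    if not 1 <= score <= 5:  # assumed 5-point Likert-type scale
        raise ValueError("score must be between 1 and 5")
    responses[(relation_key, user_id)] = score
```

Because progress is derived from the stored responses rather than from session state, ending a session needs no bookkeeping, and each evaluator progresses through the relations independently.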

7.4.7 Evaluating Wordnet Synsets Glosses Web Form

Figure 7.14: The web form for evaluating wordnet synset glosses. The form is showing an example of evaluating the Arabic synset (13108841-n).

The glosses of the wordnets are evaluated using the frmEvalGloss.aspx web form. To evaluate a synset gloss, the form attaches the English gloss of the synset obtained from the PWN to the selected gloss in the target language. Then, the user is asked whether the lexeme in the selected gloss has the same meaning as the PWN gloss. This evaluation form is composed of the following components:

• A text box showing the gloss key.

• A text box showing a lexeme from a synset, a candidate gloss written in the target language, and the English gloss of the synset.

• A text box showing the UserId of the evaluator.

• An option box that allows the user to rate the candidate gloss.

• A button to submit the score.

• A button to end the evaluation session.

Figure 7.15: Sequence diagram of the process of evaluating the relation between two lexemes.

The web form frmEvalGloss.aspx starts the gloss evaluation process by calling the ReadSynsetGloss() method from the utility class to obtain the lexeme, the candidate gloss and the English gloss of the synset being evaluated. Then, the web form uses this data to construct a question for the user. When the user submits the score assigned to the candidate gloss, the evaluation form stores the score by calling the EvaluateGloss() method from the utility class. Then, the evaluation form reads the next gloss and shows it to the user. The user can stop the gloss evaluation process by clicking the End Session button. The user can resume the gloss evaluation process at any time without re-evaluating the glosses already evaluated.

7.4.8 Users Management Web Form

To allow the administrators of LexBank to manage the users, we designed the frmManageUsers.aspx web form. Access to this form is restricted to administrators. The form lists all registered users with their information. An administrator can use this form to activate the accounts of new users and to deactivate any user in the list. This form can be extended in the future by adding more functionality. As shown in Figure 7.16, this form consists of the following components:

• ID: the UserId of the user.

• Name: the full name of the user.

• Email: the email address of the user.

• Privilege: the privilege assigned to the user. This can be administrator or client.

• Status: the current status of the user.

• Change Status: a command link to change the current status of the user. The status of the user can be changed to Inactive or Active.

Figure 7.16: The web form for managing users in LexBank.

Figure 7.17: Sequence diagram of the process of managing users in LexBank.

As summarized in the sequence diagram shown in Figure 7.17, an administrator starts the user management process by trying to access the frmManageUsers.aspx web form. The web form calls the IsAdmin() method from the utility class to verify whether the user is authorized to access the form. If the user is not authorized, an error message is sent to the user. Otherwise, the web form calls the RetrieveUsers() method to obtain the list of registered users in the system. The administrator can then select a user from the list and click the Change Status link to change the current status of the user. The web form then calls the ChangeUserStatus() method from the utility class to store the new status and reloads the updated users list on the screen.
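The authorization check followed by the status change can be sketched as below. This is a Python illustration with an in-memory dictionary standing in for the users table. The function names echo IsAdmin() and ChangeUserStatus(), but the signatures, the record layout and the use of exceptions for the error message are assumptions.

```python
VALID_STATUSES = ("Active", "Inactive")


def is_admin(users, user_id):
    """True only for a known user whose privilege is 'administrator'."""
    user = users.get(user_id)
    return user is not None and user["privilege"] == "administrator"


def change_user_status(users, acting_user_id, target_user_id, new_status):
    """Change a user's status on behalf of an administrator and return the refreshed list."""
    if not is_admin(users, acting_user_id):
        raise PermissionError("only administrators can manage users")
    if new_status not in VALID_STATUSES:
        raise ValueError("status must be Active or Inactive")
    users[target_user_id]["status"] = new_status
    # Returning the sorted list mimics reloading the users list on the screen.
    return sorted(users.items())
```

Checking the privilege inside the function that performs the change, and not only when the page is opened, keeps the restriction in force even if the form is reached by another route.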

7.5 Summary

In this chapter, we described the design and implementation of LexBank, the multilingual lexical resource we produce in this thesis. The architecture of LexBank consists of three layers: the database layer, the application layer and the web interface layer. The database layer consists of two databases: the system settings database and the resource database. The application layer of the system is implemented in Microsoft C#. It provides administrative and resource access services to the web interface. The web interface is designed and implemented using Microsoft Visual Studio 2012. The interface includes web forms for managing users and provides different wordnet search services in several languages. The system can easily be updated to accommodate other linguistic services and languages.

Chapter 8

CONCLUSIONS

In this chapter, we summarize the main contributions of this dissertation. This dissertation is motivated by the fact that many languages around the world lack the computational lexical resources that are essential in natural language processing. Our first goal in this dissertation is to develop automatic techniques, relying on a few publicly available resources, for constructing wordnets for low-resource languages. A wordnet is a structured lexical ontology that groups words by meaning into sets called synsets. Wordnets are an important lexical resource used in many applications, such as translation, word-sense disambiguation, information retrieval and document classification. The second goal of this dissertation is to design and implement a system that makes the lexical resources we produced available to the public. Next, we list the main contributions of this dissertation.

• We have developed an approach for constructing structured wordnets. This approach extends the approach for constructing core wordnets presented by Lam et al. (2014b). A core wordnet consists only of synsets, which group synonymous words into sets with unique ids. In a more comprehensive wordnet, these synsets are connected to represent the relations between their meanings. Our approach produces synsets that are connected by semantic relations. Examples of the semantic relations we produced are synonyms, hypernyms, topic-domain relations, part-holonyms, instance-hypernyms and member-meronyms.

• We presented an approach for enhancing the quality of automatically constructed wordnets. The approach is based on the vector representation of words (word embeddings). Word embedding is a machine learning technique that maps words to real-valued vectors in a multi-dimensional space. Our approach uses the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm is a feedforward neural network that predicts the vector representation of words within a multi-dimensional language model. Our approach computes the cosine similarity, using word2vec, between semantically related words in our constructed wordnets and filters out any entries that do not satisfy a pre-selected threshold value.

• We introduced synset2vec, an algorithm for representing wordnet synsets in a multi-dimensional space. Word embeddings provide an excellent vector representation of words. However, word representations are affected by the fact that many words have multiple meanings. In order to represent meanings rather than words, we combine the vectors of the synset lexemes into one vector that represents the meaning. We believe that this vector representation can be used in many important applications, for example word-sense disambiguation, machine translation and gloss selection for wordnet synsets.

• We used our algorithm synset2vec to add glosses to our automatically constructed synsets. Glosses are a very important part of wordnets. A gloss is used to declare or clarify the meaning of a synset in a wordnet. It can be a definition statement or an example sentence that shows the usage of the synonyms of the synset. To select a gloss from a corpus for a synset, we used synset2vec to generate vector representations of the candidate glosses and the synset. Then, we compute the cosine similarity between each candidate gloss and the synset. Finally, we select the gloss with the highest cosine similarity with the synset and attach it to the synset.

• We have developed LexBank, a web application that gives public users access to the resources we created. LexBank provides useful services, in a user-friendly manner, for users who seek linguistic assistance. It also includes evaluation web forms that are used to gather feedback from human judges. The design of LexBank is flexible and can easily be expanded to accommodate additional languages and resources.

Chapter 9

FUTURE WORK

In this chapter, we propose some potential future work that can be done based on this dissertation. The general goal of the proposed future work is to enhance the quality and extend the coverage of the lexical resources. For example, we produced our core wordnets based on machine translation and some small dictionaries. The quality of these wordnets is limited by the resources we used to create them. It is well known that these resources do not guarantee high coverage and accuracy for all of the target languages. Next, we list some of the potential future work.

9.1 Extending Bilingual Dictionaries

In this section, we describe one possible task for future work: a new method to extend the bilingual dictionaries created in (Lam et al., 2015b). To increase the coverage of the bilingual dictionaries, we take advantage of the wordnets we have created in this dissertation. This section is divided into two parts. In the first part, we describe the approach we used in (Lam et al., 2015b) to create the bilingual dictionaries. In the second part, we describe the proposed method to extend these bilingual dictionaries.

9.1.1 Related Work

In (Lam et al., 2015b), we created a large number of new bilingual dictionaries using intermediate core wordnets and a machine translator. A dictionary, or lexicon, as defined by Landau (1984), consists of sorted 2-tuple <LexicalUnit, Definition> entries. Each entry is called a LexicalEntry. The first part of a LexicalEntry is the phrase being defined, while the second part is the definition of the phrase. The definition includes the meaning of the LexicalUnit and usually has several Senses, each of which is a separate representation of a single aspect of the meaning of the phrase. In (Lam et al., 2015b), the entries in the dictionaries are of the form <LexicalUnit, Sense1>, <LexicalUnit, Sense2>, ....

The approach for creating dictionaries using intermediate wordnets and a machine translator (IW) is described in Figure 9.1 and Algorithm 2.

Figure 9.1: The IW approach for creating a new bilingual dictionary

Suppose that we would like to construct a bilingual dictionary Dict(S,D), where S is a source language and D is a target language, given the dictionary Dict(S,R), where R is a resource-rich intermediate language. The IW algorithm reads each LexicalEntry from Dict(S,R) and extracts SenseR from it. Then, it retrieves all Offset-POSs of SenseR from the wordnet of language R (Algorithm 2, lines 2-5). All the synonyms of the extracted Offset-POSs are extracted from all the available intermediate wordnets. Then, the algorithm constructs a candidate set candidateSet for the final translations in language D by translating all the extracted synonyms to language D using machine translation (Algorithm 3). Each candidate in candidateSet has two attributes: word, which represents a translation in language D, and rank, which counts the occurrences of this translation. The rank attribute is used to sort the candidates in descending order, so that the top candidate is the best translation. Finally, the sorted candidates are inserted into the new dictionary Dict(S,D) (Algorithm 2, lines 8-10).
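The IW procedure described above (Algorithms 2 and 3) can be sketched in Python as follows. The data layout is an assumption: Dict(S,R) is a list of (LexicalUnit, senses) pairs, each wordnet is a mapping from Offset-POS to words, r_index maps a sense in R to its Offset-POSs, and translate is any caller-supplied function standing in for the machine translator. Counting occurrences with a Counter and breaking rank ties alphabetically are one reading of the rank attribute, not necessarily the original implementation.

```python
from collections import Counter


def find_candidate_set(offset_poss, wordnets, translate, target_lang):
    """Sketch of Algorithm 3: translate every synonym and count each translation."""
    ranks = Counter()
    for offset_pos in offset_poss:
        for wordnet in wordnets:
            for word in wordnet.get(offset_pos, []):
                ranks[translate(word, target_lang)] += 1
    return ranks


def iw(dict_sr, r_index, wordnets, translate, target_lang):
    """Sketch of Algorithm 2: build the entries of Dict(S, D) from Dict(S, R)."""
    dict_sd = []
    for lexical_unit, senses_r in dict_sr:
        for sense_r in senses_r:
            offset_poss = r_index.get(sense_r, [])
            ranks = find_candidate_set(offset_poss, wordnets, translate, target_lang)
            # Highest rank first; ties are broken alphabetically (an assumption).
            ordered = sorted(ranks.items(), key=lambda item: (-item[1], item[0]))
            for word, _rank in ordered:
                dict_sd.append((lexical_unit, word))
    return dict_sd
```

A translation reached through several synonyms or several intermediate wordnets receives a higher rank, which is why it is listed before translations that were produced only once.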

Algorithm 2: IW algorithm
Input: Dict(S,R)
Output: Dict(S,D)
1: Dict(S,D) := ∅
2: for all LexicalEntry ∈ Dict(S,R) do
3:   for all SenseR ∈ LexicalEntry do
4:     candidateSet := ∅
5:     Find all Offset-POSs of synsets containing SenseR from the R Wordnet
6:     candidateSet = FindCandidateSet(Offset-POSs, D)
7:     sort all candidates in descending order based on their rank values
8:     for all candidate ∈ candidateSet do
9:       SenseD = candidate.word
10:      add tuple <LexicalUnit, SenseD> to Dict(S,D)
11:    end for
12:  end for
13: end for

Algorithm 3: FindCandidateSet(Offset-POSs, D)
Input: Offset-POSs, D
Output: candidateSet
1: candidateSet := ∅
2: for all Offset-POS ∈ Offset-POSs do
3:   for all word in the Offset-POS extracted from the PWN and other available WordNets linked to the PWN do
4:     candidate.word = translate(word, D)
5:     candidate.rank++
6:     candidateSet += candidate
7:   end for
8: end for
9: return candidateSet

9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets

In this section, we propose a new method to extend the dictionaries we created in (Lam et al., 2015b) using the structured wordnets that we have created in this dissertation. The following steps, which are summarized in Figure 9.2, describe the proposed method to extend the dictionaries.

Figure 9.2: Extending bilingual dictionaries using structured wordnets

• We start by extracting each input entry S_i of the source language S from the bilingual dictionary from S to D.

• Then, we retrieve the list of synsets of S_i from the wordnet of S.

• Next, we extract the corresponding synsets from the wordnet of D.

• For each synset member D_k that we extracted from the wordnet of D, we create a lexical entry (S_i, D_k).

• In addition, for each synset we extracted from the wordnet of D, we extract the direct hypernyms and create a lexical entry (S_i, H_l) for each hypernym H_l.

• Finally, we add each lexical entry created in the previous steps to the bilingual dictionary from S to D if it does not already exist in the dictionary.
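The steps above can be sketched in Python as follows. The data layout is assumed for illustration: the dictionary is a set of (source word, target word) pairs, wordnet_s maps a source word to the Offset-POSs of its synsets, wordnet_d maps an Offset-POS to the lexemes of the corresponding synset in D, and hypernyms_d maps an Offset-POS to the Offset-POSs of its direct hypernyms. Treating H_l as the lexemes of each hypernym synset is one reading of the proposal.

```python
def extend_dictionary(dictionary, wordnet_s, wordnet_d, hypernyms_d):
    """Return the dictionary extended with synset members and direct hypernyms."""
    extended = set(dictionary)
    source_words = {source for source, _target in dictionary}
    for s_i in source_words:
        for offset_pos in wordnet_s.get(s_i, []):
            # The wordnets are aligned, so the same Offset-POS is looked up in D.
            for d_k in wordnet_d.get(offset_pos, []):
                extended.add((s_i, d_k))
            for hypernym in hypernyms_d.get(offset_pos, []):
                for h_l in wordnet_d.get(hypernym, []):
                    extended.add((s_i, h_l))
    return extended
```

Using a set makes the final step implicit: an entry that already exists in the dictionary is simply not added a second time.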

9.2 Integrating Part-of-speech Tagging into Wordnet Construction

Since our approach for wordnet automatic construction is based on translation, some

of the generated synsets include words that are in the wrong part of speech form. One

solution is to use a Part-Of-Speech Tagger (POS Tagger) to correct the wrong form of the

words in the synset.

A POS Tagger is a computer program which is used to specify the part of speech

of words in a text written in some language. For example, the Stanford Part-Of-Speech

Tagger (Toutanova et al., 2003), which is freely available, provides part of speech tagging

for Arabic, Chinese, French, Spanish and German. Also, other POS Taggers are available

for Assamese (Saharia et al., 2009) and Vietnamese (Le-Hong et al., 2010). However, because we are dealing with low-resource languages, many languages do not have any POS Taggers, and therefore this approach is not applicable to them.

To correct the part of speech in the words within a synset, we propose the following

steps:

• For each synset synset_i in a wordnet wordnet_T, we extract the part of speech of the synset from the Offset-POS of synset_i.

• For each word word_j in synset_i, we determine the part of speech of word_j and compare it with the part of speech of synset_i. If the parts of speech of word_j and synset_i do not match, we convert word_j to the correct part-of-speech form and update synset_i.
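The detection half of these steps can be sketched in Python as follows. The sketch assumes that the part of speech is the suffix after the last hyphen of the Offset-POS (as in 08897065-n) and that a tagger is supplied by the caller as a function returning the same one-letter tags; both the function names and that tag convention are assumptions. Converting a word to the correct form needs a language-specific morphological tool and is not attempted here.

```python
def pos_of_synset(offset_pos):
    """Return the part-of-speech suffix of an Offset-POS, e.g. '08897065-n' -> 'n'."""
    return offset_pos.rsplit("-", 1)[1]


def find_pos_mismatches(wordnet, tag_word):
    """List (offset_pos, word, found_pos, expected_pos) for every mismatching word."""
    mismatches = []
    for offset_pos, words in wordnet.items():
        expected = pos_of_synset(offset_pos)
        for word in words:
            found = tag_word(word)  # caller-supplied POS tagger (hypothetical)
            if found != expected:
                mismatches.append((offset_pos, word, found, expected))
    return mismatches
```

The resulting list identifies exactly which synset members would need to be converted or reviewed, so the correction step can be applied only where a mismatch was found.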

9.3 Wordnet Expansion Using Word Embeddings

One possible way to automatically improve the coverage of a wordnet is by looking

for additional related words in a corpus using word embeddings. In Chapter 6, we introduced synset2vec, which is a vector representation of synsets in a multi-dimensional space. Taking advantage of synset2vec, we believe it is possible to look for previously unknown words that are semantically related to a synset and add them to the wordnet. Next, we present a brief description of our idea.

• Assume that we would like to expand a wordnet wordnet_T of language T. First, word embeddings for T are generated.

• Next, for each synset synset_i in wordnet_T, the vector V_i of synset_i is generated using synset2vec.

• Then, all the words whose cosine similarity with V_i is at least a preselected threshold are extracted. From those words, only the words that do not already have a semantic relation with synset_i are inserted into a candidate set C_i.

• Next, for each word word_j in C_i, a semantic relation r_j is selected based on a classification approach.

• Finally, word_j is inserted into wordnet_T and connected to synset_i using semantic relation r_j.
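The candidate-selection steps can be sketched in Python as follows. Several points are assumptions: the synset vector is taken to be the average of the vectors of its lexemes (synset2vec itself is defined in Chapter 6 and is not reproduced here), embeddings are plain lists of floats, and candidates are the words whose similarity to the synset vector is at or above the threshold, which is one reading of the intended direction. The classification step that chooses the relation r_j is not sketched.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synset_vector(lexemes, embeddings):
    """Average the vectors of the lexemes that have an embedding (assumed aggregation)."""
    vectors = [embeddings[word] for word in lexemes if word in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def expansion_candidates(lexemes, related_words, embeddings, threshold):
    """Words close to the synset vector that are not yet linked to the synset."""
    centre = synset_vector(lexemes, embeddings)
    if centre is None:
        return []
    known = set(lexemes) | set(related_words)
    return sorted(
        word
        for word, vec in embeddings.items()
        if word not in known and cosine(centre, vec) >= threshold
    )
```

In practice, the threshold trades coverage for precision: a lower value proposes more candidate words, but more of them will be unrelated to the synset.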

9.4 Producing Vector Representation for Multi-word Lexemes

One issue that appears when producing vector representations is that wordnet lexemes can be multi-word phrases. Most of the existing tools for producing word embeddings are single-word based. This means that they produce vectors for lexical units that are surrounded by spaces. Therefore, when we generate a vector for a wordnet synset, we skip multi-word lexemes. An enhanced version of our approach for generating vectors for wordnet synsets can be achieved by including a vector representation for multi-word lexemes. The vectors of the single words within a multi-word lexeme should be aggregated so that the lexeme is represented by one vector within the synset. However, one issue that arises is that each single word within the multi-word lexeme might have several meanings when it appears individually. Therefore, careful research is needed to determine a good solution for this problem.
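One simple aggregation, shown here only as a baseline sketch, is to average the vectors of the words that make up the lexeme. The choice of averaging, the function name and the plain-list vector format are assumptions; the text leaves the aggregation method open.

```python
def lexeme_vector(lexeme, embeddings):
    """Average the vectors of the space-separated words; None if any word is unknown."""
    words = lexeme.split()
    if not words or any(word not in embeddings for word in words):
        return None
    dim = len(embeddings[words[0]])
    return [sum(embeddings[word][i] for word in words) / len(words) for i in range(dim)]
```

A lexeme such as "hot dog" shows the limitation raised above: the average of the vectors for "hot" and "dog" reflects the individual words, not the meaning of the phrase as a whole.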

9.5 Vector Representation for Multi-lingual Wordnets

In this dissertation, we produced vector representations for the individual wordnets. One direction that might help with problems such as wordnet expansion and machine translation is the vector representation of aggregated wordnets of several languages. Since all of the wordnets we create in this dissertation are aligned with PWN, synsets having the same Offset-POS in different wordnets represent the same meaning. Therefore, we believe that combining the vectors of aligned synsets from different languages will produce a representation of the meaning across several languages. One can use this representation to discover the closest meaning of new words that are not included in the wordnets. It could also be used to discover a rough translation for words that are not included in a dictionary.

BIBLIOGRAPHY

M. Abbas and K. Smaili. Comparison of topic identification methods for arabic language.

In International Conference on Recent Advances in Natural Language Processing-

RANLP 2005, volume 14, 2005.

M. Abbas, K. Smaïli, and D. Berkani. Evaluation of topic identification methods on arabic

corpora. JDIM, 9(5):185–192, 2011.

K. Ahn and M. Frampton. Automatic generation of translation dictionaries using inter-

mediary languages. In Proceedings of the International Workshop on Cross-Language

Knowledge Induction, pages 41–44. Association for Computational Linguistics, 2006.

P. Akaraputthiporn, K. Kosawat, and W. Aroonmanakun. A Bi-directional Translation

Approach for Building Thai Wordnet. In Asian Language Processing, 2009. IALP’09.

International Conference on, pages 97–101. IEEE, 2009.

M. Apidianaki and R. J. Von Neumann. Limsi: Cross-lingual word sense disambiguation

using translation sense clustering. In Second Joint Conference on Lexical and Computa-

tional Semantics (* SEM), volume 2, pages 178–182, 2013.

M. A. Attia. Handling Arabic morphological and syntactic ambiguity within the LFG

framework with a view to machine translation. PhD thesis, University of Manchester,

2008.

E. Barbu. Automatic Building of Wordnets EdUArd BarbU* &: Verginica BarbU Mi-

TiTElU*** Graphitech Italy" Romanian Academy, Research Institute for Artificial In-

Page 115: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

99

telligence. Recent Advances in Natural Language Processing IV: Selected Papers from

RANLP 2005, 292:217, 2007.

K. R. Beesley. Arabic finite-state morphological analysis and generation. In Proceedings

of the 16th conference on Computational linguistics-Volume 1, pages 89–94. Association

for Computational Linguistics, 1996.

S. Bhattacharya, M. Choudhury, S. Sarkar, and A. Basu. Inflectional morphology synthesis

for bengali noun, pronoun and verb systems. Proc. of NCCPB, 8, 2005.

P. Bhattacharyya. Indowordnet. In In Proc. of LREC-10, 2010.

O. Bilgin, z. Çetinoglu, and K. Oflazer. Building a wordnet for Turkish. Romanian Journal

of Information Science and Technology, 7(1-2):163–172, 2004.

L. Bloomfield. Language. new york: Holt, rinehart and winston. A classic in linguistic

studies and the first serious attempt in the development of morphology. Pre-and post-

generative morphology conceptually were nurtured from the remarkable insights given

in this linguistic masterpiece, 1933.

F. Bond and K. Ogura. Combining linguistic resources to create a machine-tractable

Japanese-Malay dictionary. Language Resources and Evaluation, 42(2):127–136, 2008.

L. Borin and M. Forsberg. Swesaurus; or, the frankenstein approach to wordnet construc-

tion. In Proceedings of the Seventh Global WordNet Conference (GWC 2014), 2014.

D. Bouamor, N. Semmar, C. France, and P. Zweigenbaum. Using Wordnet and semantic

similarity for bilingual terminology mining from comparable corpora. In Proceedings of

the 6th Workshop on Building and Using Comparable Corpora, pages 16–23. Citeseer,

2013.

Page 116: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

100

R. D. Brown. Automated dictionary extraction for “knowledge-free” example-based trans-

lation. In Proceedings of the Seventh International Conference on Theoretical and

Methodological Issues in Machine Translation, pages 111–118, 1997.

T. Buckwalter. Issues in arabic orthography and morphology analysis. In Proceedings of

the Workshop on Computational Approaches to Arabic Script-based Languages, pages

31–34. Association for Computational Linguistics, 2004.

T. Charoenporn, V. Sornlertlamvanich, C. Mokarat, and H. Isahara. Semi-automatic com-

pilation of Asian WordNet. In 14th Annual Meeting of the Association for Natural Lan-

guage Processing, pages 1041–1044, 2008.

D. Christodoulakis, K. Oflazer, D. Dutoit, S. Koeva, G. Totkov, K. Pala, D. Cristea, D. Tufis,

M. Grigoriadou, I. Tsakou, and others. BalkaNet: A Multilingual Semantic Network for

Balkan Languages. In Proceedings of the 1st International Wordnet Conference, Mysore,

India, 2002.

C. J. Crouch. An approach to the automatic construction of global thesauri. Information

Processing & Management, 26(5):629–640, 1990.

A. Cucchiarelli, R. Navigli, F. Neri, and P. Velardi. Automatic Generation of Glosses in the

OntoLearn System. In LREC. Citeseer, 2004.

J. R. Curran. From distributional to semantic similarity. 2004.

J. R. Curran and M. Moens. Improvements in automatic thesaurus extraction. In Pro-

ceedings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, pages

59–66. Association for Computational Linguistics, 2002a.

J. R. Curran and M. Moens. Scaling context space. In Proceedings of the 40th Annual

Meeting on Association for Computational Linguistics, pages 231–238. Association for

Computational Linguistics, 2002b.

Page 117: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

101

K. Darwish. Named entity recognition using cross-lingual resources: Arabic as an example.

In ACL (1), pages 1558–1567, 2013.

M. Diab and N. Habash. Arabic dialect processing tutorial. In Proceedings of the Hu-

man Language Technology Conference of the NAACL, Companion Volume: Tutorial Ab-

stracts, pages 5–6. Association for Computational Linguistics, 2007.

R. M. Fano and D. Hawkins. Transmission of information: A statistical theory of commu-

nications. American Journal of Physics, 29(11):793–794, 1961.

A. Farghaly and K. Shaalan. Arabic natural language processing: Challenges and solutions.

ACM Transactions on Asian Language Information Processing (TALIP), 8(4):14, 2009.

C. Fellbaum. A semantic network of English verbs. WordNet: An electronic lexical

database, 3:153–178, 1998.

C. Fellbaum. WordNet and Wordnets. In A. Barber, editor, Encyclopedia of Language and

Linguistics, pages 2–665. Elsevier, 2005.

M. A. Finlayson. Java libraries for accessing the Princeton WordNet: Comparison and

evaluation. In Proceedings of the 7th Global Wordnet Conference, pages 78–85, 2014.

J. R. Firth. {A synopsis of linguistic theory, 1930-1955}. 1957.

T. Gollins and M. Sanderson. Improving cross language retrieval with triangulated transla-

tion. In Proceedings of the 24th annual international ACM SIGIR conference on Research

and development in information retrieval, pages 90–95. ACM, 2001.

R. G. Gordon and B. F. Grimes. Ethnologue: Languages of the world, volume 15. SIL

international Dallas, TX, 2005.

S. Green and C. D. Manning. Better arabic parsing: Baselines, evaluations, and analysis. In

Proceedings of the 23rd International Conference on Computational Linguistics, pages

394–402. Association for Computational Linguistics, 2010.

Page 118: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

102

G. Grefenstette. Explorations in automatic thesaurus discovery, volume 278. Springer

Science & Business Media, 2012.

G. Gunawan and A. Saputra. Building synsets for Indonesian Wordnet with monolingual

lexical resources. In Asian Language Processing (IALP), 2010 International Conference

on, pages 297–300. IEEE, 2010.

N. Habash and O. Rambow. Arabic tokenization, part-of-speech tagging and morphological

disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting on Asso-

ciation for Computational Linguistics, pages 573–580. Association for Computational

Linguistics, 2005.

N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological analysis and

disambiguation for dialectal arabic. In HLT-NAACL, pages 426–432, 2013.

N. Y. Habash. Introduction to arabic natural language processing. Synthesis Lectures on

Human Language Technologies, 3(1):1–187, 2010.

A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning Bilingual Lexicons

from Monolingual Corpora. In ACL, volume 2008, pages 771–779, 2008.

Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.

L. Hinkle, A. Brouillette, S. Jayakar, L. Gathings, M. Lezcano, and J. Kalita. Design and

evaluation of soft keyboards for brahmic scripts. ACM Transactions on Asian Language

Information Processing (TALIP), 12(2):6, 2013.

G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection

and correction of malapropisms. WordNet: An electronic lexical database, 305:305–332,

1998.

E. Héja. Dictionary Building based on Parallel Corpora and Word Alignment. In Proceed-

ings of the XIV Euralex International Congress, Leeuwarden, pages 6–10, 2010.

Page 119: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

103

Y. Hlal. Morphological analysis of arabic speech. In Workshop Papers Kuwait/Proceedings

of Kuwait Conference on Computer Processing of the Arabic Language, pages 273–294,

1985.

V. István and Y. Shoichi. Bilingual dictionary generation for low-resourced language pairs.

In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Pro-

cessing: Volume 2-Volume 2, pages 862–870. Association for Computational Linguistics,

2009.

P. Jaccard. The distribution of the flora in the alpine zone. New phytologist, 11(2):37–50,

1912.

D. Jurafsky and J. H. Martin. Speech and Language Processing (2Nd Edition). Prentice-

Hall, Inc., Upper Saddle River, NJ, USA, 2009. ISBN 0131873210.

H. Kaji and M. Watanabe. Automatic Construction of Japanese WordNet. Proceedings of

LREC2006, Italy, 2006.

H. Kozima and T. Furugori. Similarity between words computed by spreading activation

on an English dictionary. In Proceedings of the sixth conference on European chapter of

the Association for Computational Linguistics, pages 232–239. Association for Compu-

tational Linguistics, 1993.

K. N. Lam. Automatically Creating MultiLingual Resources. PhD thesis, University of

Colorado, Colorado Springs, Apr. 2015.

K. N. Lam and J. Kalita. Creating Reverse Bilingual Dictionaries. In HLT-NAACL, pages

524–528. Citeseer, 2013.

K. N. Lam, F. Al Tarouti, and J. Kalita. Creating Lexical Resources for Endangered Lan-

guages. In Proceedings of the 2014 Workshop on the Use of Computational Methods

Page 120: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

104

in the Study of Endangered Languages, pages 54–62, Baltimore, Maryland, USA, June

2014a. Association for Computational Linguistics.

K. N. Lam, F. A. Tarouti, and J. Kalita. Automatically constructing Wordnet synsets.

In 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014),

Baltimore, USA, June, 2014b.

K. N. Lam, F. Al Tarouti, and J. Kalita. Phrase translation using a bilingual dictionary and

n-gram data: A case study from vietnamese to english. In Proceedings of NAACL-HLT,

pages 65–69, 2015a.

K. N. Lam, F. Al Tarouti, and J. Kalita. Automatically Creating a Large Number of New

Bilingual Dictionaries. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Feb.

2015b.

S. I. Landau. Dictionaries. NY: Scribners, 1984.

L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for arabic informa-

tion retrieval: light stemming and co-occurrence analysis. In Proceedings of the 25th

annual international ACM SIGIR conference on Research and development in informa-

tion retrieval, pages 275–282. ACM, 2002.

P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, and M. Rossignol. An empirical study of

maximum entropy approach for part-of-speech tagging of vietnamese texts. In Traitement

Automatique des Langues Naturelles-TALN 2010, page 12, 2010.

D. Leenoi, T. Supnithi, and W. Aroonmanakun. Building a Gold Standard for Thai Word-

Net. In Proceeding of The International Conference on Asian Language Processing 2008

(IALP2008), pages 78–82, 2008.

D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th An-

nual Meeting of the Association for Computational Linguistics and 17th International

Page 121: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

105

Conference on Computational Linguistics-Volume 2, pages 768–774. Association for

Computational Linguistics, 1998.

K. Lindén and J. Niemi. Is it possible to create a very large wordnet in 100 days? an

evaluation. Language resources and evaluation, 48(2):191–201, 2014.

K. Lindén and L. Carlson. Finn WordNet-WordNet p\a a finska via översättning. Lexi-

coNordica, 17(17), 2010.

N. Ljubešic and D. Fišer. Bootstrapping bilingual lexicons from comparable corpora for

closely related languages. In Text, Speech and Dialogue, pages 91–98. Springer, 2011.

M. Maziarz, M. Piasecki, E. Rudnicka, and S. Szpakowicz. Beyond the transfer-and-merge

wordnet construction: plwordnet and a comparison with wordnet. In RANLP, pages

443–452, 2013.

J. J. McCarthy. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 12

(3):373–418, 1981.

T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word

representations. In HLT-NAACL, pages 746–751, 2013.

G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38

(11):39–41, 1995.

G. A. Miller and F. Hristea. WordNet nouns: Classes and instances. Computational Lin-

guistics, 32(1):1–3, 2006.

T. Miller and I. Gurevych. Wordnet-wikipedia-wiktionary: Construction of a three-way

alignment. In LREC, pages 2094–2100, 2014.

M. Mladenovic, J. Mitrovic, and C. Krstev. Developing and Maintaining a WordNet: Pro-

cedures and Tools. In Proceedings of the 7th Global Wordnet Conference (GWC 2014),

pages 55–62, 2014.

Page 122: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

106

C. Mouton and G. de Chalendar. JAWS: Just another WordNet subset. Proc. of TALN’10,

2010.

A. S. Nagvenkar, N. R. Prabhugaonkar, V. P. Prabhu, R. N. Karmali, and J. D. Pawar. Con-

cept Space Synset Manager Tool. In Proceedings of the 7th Global Wordnet Conference,

pages 86–94, 2014.

P. Nakov and H. T. Ng. Improved statistical machine translation for resource-poor lan-

guages using related resource-rich languages. In Proceedings of the 2009 Conference on

Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages 1358–

1367. Association for Computational Linguistics, 2009.

R. Navigli and S. P. Ponzetto. BabelNet: Building a very large multilingual semantic

network. In Proceedings of the 48th annual meeting of the association for computational

linguistics, pages 216–225. Association for Computational Linguistics, 2010.

L. Nerima and E. Wehrli. Generating Bilingual Dictionaries by Transitivity. In LREC,

volume 8, pages 2584–2587, 2008.

R. Noyer. Vietnamese’morphology’and the definition of word. University of Pennsylvania

Working Papers in Linguistics, 5(2):5, 1998.

A. Oliver. Wn-toolkit: Automatic generation of wordnets following the expand model.

Proceedings of the 7th Global WordNetConference, Tartu, Estonia, 2014.

A. Oliver and S. Climent. Parallel corpora for Wordnet construction: machine translation

vs. automatic sense tagging. In Computational Linguistics and Intelligent Text Process-

ing, pages 110–121. Springer, 2012.

P. G. Otero and J. R. P. Campos. Automatic generation of bilingual dictionaries using inter-

mediary languages and comparable corpora. In Computational Linguistics and Intelligent

Text Processing, pages 473–483. Springer, 2010.

Page 123: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

107

N. R. Prabhugaonkar, J. D. Pawar, and T. Plateau. Use of Sense Marking for Improving

WordNet Coverage. In Proceedings of the 7th Global Wordnet Conference, pages 95–99,

2014.

Q. Pradet, G. de Chalendar, and J. B. Desormeaux. Wonef, an improved, expanded and

evaluated automatic french translation of wordnet. Proceedings of the 7th Global Word-

NetConference, Tartu, Estonia, 2014.

J. Ramírez, M. Asahara, and Y. Matsumoto. Japanese-Spanish thesaurus construction using

English as a pivot. arXiv preprint arXiv:1303.1232, 2013.

G. Rigau, H. Rodriguez, and E. Agirre. Building accurate semantic taxonomies from

monolingual MRDs. In Proceedings of the 17th international conference on Compu-

tational linguistics-Volume 2, pages 1103–1109. Association for Computational Linguis-

tics, 1998.

H. Rodríguez, D. Farwell, J. Ferreres, M. Bertran, M. Alkhalifa, and M. A. Martí. Arabic

wordnet: Semi-automatic extensions using bayesian inference. In LREC, 2008.

B. Sagot and D. Fišer. Building a free French wordnet from multilingual resources. In

OntoLex, 2008.

N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for assamese text. In

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36. Associa-

tion for Computational Linguistics, 2009.

R. C. S. K. Sarma. Structured and logical representations of assamese text for question-

answering system. In 24th International Conference on Computational Linguistics,

page 27, 2012.

Page 124: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

108

M. Saveski and I. Trajkovski. Automatic construction of wordnets by using machine trans-

lation and language modeling. In 13th Multiconference Information Society, Ljubljana,

Slovenia, 2010.

K. Shaalan, A. A. Monem, and A. Rafea. Arabic morphological generation from interlin-

gua. In Intelligent Information Processing III, pages 441–451. Springer, 2006.

U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of morphology of an indic lan-

guage from text corpus. ACM Transactions on Asian Language Information Processing

(TALIP), 7(3):9, 2008.

R. Shaw, A. Datta, D. VanderMeer, and K. Dutta. Building a scalable database-driven

reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3):

528–540, 2013.

S. Soderland, O. Etzioni, D. S. Weld, K. Reiter, M. Skinner, M. Sammer, J. Bilmes, and

others. Panlingual lexical translation via probabilistic inference. Artificial Intelligence,

174(9):619–637, 2010.

K. Tanaka and K. Umemura. Construction of a bilingual dictionary intermediated by a third

language. In Proceedings of the 15th conference on Computational linguistics-Volume 1,

pages 297–303. Association for Computational Linguistics, 1994.

L. C. Thompson. A Vietnamese reference grammar, volume 13. University of Hawaii Press,

1987.

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging

with a cyclic dependency network. In Proceedings of the 2003 Conference of the North

American Chapter of the Association for Computational Linguistics on Human Language

Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

Page 125: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

109

P. Vossen. Introduction to eurowordnet. In EuroWordNet: A multilingual database with

lexical semantic networks, pages 1–17. Springer, 1998.

Wikipedia. Wordnet — wikipedia, the free encyclopedia, 2015. URL http://en.

wikipedia.org/w/index.php?title=WordNet&oldid=656664111.

[Online; accessed 22-April-2015].

Wikipedia. Vietnamese language — wikipedia, the free encyclopedia, 2016a. URL

https://en.wikipedia.org/w/index.php?title=Vietnamese_

language&oldid=731154067. [Online; accessed 30-July-2016].

Wikipedia. Vietnamese morphology — wikipedia, the free encyclopedia, 2016b.

URL https://en.wikipedia.org/w/index.php?title=Vietnamese_

morphology&oldid=730832239. [Online; accessed 30-July-2016].

K. Yu and J. Tsujii. Extracting bilingual dictionary from comparable corpora with de-

pendency heterogeneity. In Proceedings of Human Language Technologies: The 2009

Annual Conference of the North American Chapter of the Association for Computational

Linguistics, Companion Volume: Short Papers, pages 121–124. Association for Compu-

tational Linguistics, 2009.

O. F. Zaidan and C. Callison-Burch. Arabic dialect identification. Computational Linguis-

tics, 40(1):171–202, 2014.

Page 126: LexBank: A Multilingual Lexical Resource for Low … ·  · 2016-09-13iv Al Tarouti, Feras A. (Ph.D., Computer Science) LexBank: A Multilingual Lexical Resource for Low-Resource

Appendix A

DATA PROCESSING SOFTWARE CODE

A.1 computCosineSim.py

###########################
# Program to compute cosine similarity
# between semantically related words in a WordNet
# using Word2Vec
# Author: Feras Al Tarouti
# Date : Feb 4 2016

import unicodecsv as csv
import codecs
import gensim
import editdistance

# Load the pre-trained Vietnamese Word2Vec model.
# Note: gensim 1.0 and later expose this loader as
# gensim.models.KeyedVectors.load_word2vec_format.
word2vecmodel = gensim.models.Word2Vec.load_word2vec_format(
    'VieVectors_SG_Size100_W5.bin', binary=True)

with open('LexBankVieSemRelatedWords_WithCOS.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['OffsetPos1', 'Word1', 'Relation', 'OffsetPos2', 'Word2',
                     'COS', 'ld'])
    with open('LexBankVieSemRelatedWords.csv', 'rb') as infile:
        reader = csv.reader(infile, delimiter=',', quoting=csv.QUOTE_NONE)
        firstline = True
        rownum = 0
        for row in reader:
            if firstline:
                # skip the header row
                firstline = False
            else:
                print("Compute Similarity for pairs number: {0}".format(rownum))
                SynsetID1 = row[0]
                Word1 = row[1]
                Relation = row[2]
                SynsetID2 = row[3]
                Word2 = row[4]
                try:
                    cos = round(word2vecmodel.similarity(Word1, Word2), 3)
                except Exception:
                    # at least one of the two words is not in the model
                    cos = 0.0
                # Levenshtein (edit) distance between the two words
                ld = editdistance.eval(Word1, Word2)
                newrow = [SynsetID1, Word1, Relation, SynsetID2, Word2, cos, ld]
                writer.writerow(newrow)
            rownum = rownum + 1


A.2 GenerateVectorForSynset.py

###########################
# A function for computing a synset vector
# Author: Feras Al Tarouti
# Date : May 18 2016
# FindLemmasOfSyns, GenerateVectorForLemma and FindRelatedSyn
# are defined elsewhere and are not part of this listing.
import numpy as np

def GenerateVectorForSynset(syn, thislemma):
    FinalVector = np.zeros(100)
    VectorList = []  # define the vector set for this synset
    LemmasList = FindLemmasOfSyns(syn)  # the list of lemmas for this synset

    for lemma in LemmasList:
        if lemma != thislemma:
            Vector = GenerateVectorForLemma(lemma)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add vector of word to the synset vector set

    # Find out if this synset has only one word;
    # in this case we have to find a related word and add it to the vector set
    if len(VectorList) < 2:
        # we need to find a related synset
        relatedword = FindRelatedSyn(syn)
        if relatedword != "":
            Vector = GenerateVectorForLemma(relatedword)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add vector of word to the synset vector set

    for vec in VectorList:
        FinalVector = np.add(FinalVector, vec)

    # compute the average (skipped when no vector was found, to avoid dividing by zero)
    numbofVec = len(VectorList)
    if numbofVec > 0:
        scalar = np.divide(float(1), float(numbofVec))
        FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector


A.3 GenerateVectorForGloss.py

###########################
# A function for computing a gloss vector
# Author: Feras Al Tarouti
# Date : May 18 2016
# stopwrds, word2vecmodel and FindNumberOfSyns
# are defined elsewhere and are not part of this listing.
import numpy as np

def GenerateVectorFor(thisSentence, lemma):
    VectorList = []  # define the vector set for this sentence
    FinalVector = np.zeros(100)
    for word in thisSentence.split():
        if word not in stopwrds and word != lemma:
            try:
                Vector = word2vecmodel[word]
                NofSyns = FindNumberOfSyns(word)
                # Scale the vector based on the number of synsets
                if NofSyns > 1:
                    thisScalar = np.divide(float(1), float(NofSyns))
                    Vector = np.multiply(Vector, thisScalar)
                VectorList.append(Vector)  # we have this word in our model
            except Exception:
                # the word is not in our model, so it is skipped
                continue

    # compute the average of the collected vectors
    if len(VectorList) > 0:
        for vec in VectorList:
            FinalVector = np.add(FinalVector, vec)
        numbofVec = len(VectorList)
        scalar = np.divide(float(1), numbofVec)
        FinalVector = np.multiply(FinalVector, scalar)

    return FinalVector


A.4 ComputeGlossSynsetSimilarity.py

###########################
# A program for computing similarity between synset and gloss
# Author: Feras Al Tarouti
# Date : May 18 2016
# First Step : Open the synset-gloss files, and read the sentence
# Second Step : Generate the vector for the synset
# Third Step : Generate the vector for the sentence
# Fourth Step : Compute the cosine similarity between the synset vector
#               and the sentence vector
# Fifth Step : Save the result
###########################
# InputDataFile, outputfile and ComputeCosine
# are defined elsewhere and are not part of this listing.
import math
from decimal import Decimal

import unicodecsv as csv

with open(InputDataFile, 'rb') as SentencesFile, open(outputfile, 'wb') as out_file:
    reader = csv.reader(SentencesFile, encoding='utf-8', delimiter=',')
    writer = csv.writer(out_file, encoding='utf-8')
    writer.writerow(['ID', 'CosSem'])
    rownum = 0
    for row in reader:
        if rownum != 0:  # skip the header row
            print("Computing Cosine Similarity for Row numb: {0}".format(rownum))
            thisSenID = row[0]     # read the current sentence ID
            thisSynset = row[1]    # read the current synsetID
            thisSynMem = row[2]    # read number of members for this synset
            thiswrd = row[3]       # read the word used in this sentence
            thiswrdSyns = row[4]   # read the number of synsets for this word
            thisSentence = row[5]  # read the current sentence

            # Compute a vector for this synset
            thisSynsetVector = GenerateVectorForSynset(thisSynset, "")

            # Generate Vector for this sentence
            thisSentenceVector = GenerateVectorFor(thisSentence, "")

            CosDistance = ComputeCosine(thisSynsetVector, thisSentenceVector)
            x = Decimal(CosDistance)
            if math.isnan(x):
                # no usable vector on one side: record a similarity of 0
                CosDistance = 0
            newrow = [thisSenID, CosDistance]
            writer.writerow(newrow)

        rownum = rownum + 1
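The listing above calls ComputeCosine, which is not reproduced in this appendix. The following is a minimal sketch of what such a helper might look like, assuming it implements the standard cosine similarity. Returning NaN for a zero-length vector is an assumption chosen to match the math.isnan check in the listing; it is not the author's confirmed implementation.

```python
import numpy as np

def ComputeCosine(vec_a, vec_b):
    """Cosine similarity of two vectors; NaN if either vector has zero norm."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        # Undefined for a zero vector; the caller replaces NaN with 0.
        return float("nan")
    return float(np.dot(a, b) / norm)

print(ComputeCosine([1.0, 0.0], [1.0, 0.0]))
print(ComputeCosine([1.0, 0.0], [0.0, 1.0]))
```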


Appendix B

MICROSOFT SQL SERVER TABLES

--
-- Database: `LexBank_System`
--
-- --------------------------------------------------------
--
-- Table structure for table `Users_Info`
--
USE [LexBank_System]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Users_Info](
    [UserId] [varchar](50) NOT NULL,
    [UserName] [varchar](100) NOT NULL,
    [UserEmail] [varchar](70) NOT NULL,
    [UserPwd] [varchar](max) NOT NULL,
    [UserPriv] [varchar](15) NOT NULL,
    [UserStatus] [varchar](15) NOT NULL,
 CONSTRAINT [PK_Users_Info] PRIMARY KEY CLUSTERED
(
    [UserId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `System_Log`
--
USE [LexBank_System]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[System_Log](
    [EventId] [int] IDENTITY(1,1) NOT NULL,
    [EventDesc] [varchar](200) NOT NULL,
    [EventTime] [datetime] NOT NULL,
    [UserId] [varchar](50) NOT NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Database: `LexBank_Resources`
--
-- --------------------------------------------------------
--
-- Table structure for table `Arabic_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_CorWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Assamese_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_CorWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_CorWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ONGO

SET QUOTED_IDENTIFIER ONGO

SET ANSI_PADDING ONGO

CREATE TABLE [dbo].[Arabic_WordnetGlosses]([Offset_Pos] [varchar](10) NOT NULL,[Gloss] [varchar](4000) NULL

) ON [PRIMARY]

GO

SET ANSI_PADDING OFFGO-- ---------------------------------------------------------- Table structure for table `Assamese_WordnetGlosses`--USE [LexBank_Resources]GO

SET ANSI_NULLS ONGO

SET QUOTED_IDENTIFIER ONGO

SET ANSI_PADDING ONGO

CREATE TABLE [dbo].[Assamese_WordnetGlosses]([Offset_Pos] [varchar](10) NOT NULL,[Gloss] [varchar](4000) NULL

) ON [PRIMARY]

GO

SET ANSI_PADDING OFFGO-- ---------------------------------------------------------- Table structure for table `Vietnamese_WordnetGlosses`--USE [LexBank_Resources]GO

SET ANSI_NULLS ONGO


SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON


GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO


SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON


GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Response`
--


USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGlosses_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGlosses_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGlosses_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------


Appendix C

LEXBANK UTILITY CLASS

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Data;
using System.Data.SqlClient;
using System.Web.Configuration;
using System.IO;
using System.Text;
using System.Security.Cryptography;

namespace LexBank2016
{
    public class LexBankUtils
    {
        private string LexBankConnectionString = WebConfigurationManager.ConnectionStrings["LexBankData"].ToString();

        public Boolean IsUserIdAvailable(string UserId)
        {
            // Takes a user ID and checks whether it is still available (not already in use).
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT UserId FROM Users_Info where UserId like @UserId", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    //
                    // Invoke ExecuteScalar method; it returns null when no row matches.
                    //
                    var firstColumn = command.ExecuteScalar();


                    if (firstColumn == null)
                    {
                        result = true;
                    }
                }
            }
            return result;
        }

        public string EncryptPassword(string PlanePassword)
        {
            string EncryptionKey = "LexBank";
            byte[] PlaneBytes = Encoding.Unicode.GetBytes(PlanePassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65, 0x64, 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms, PasswordEncryptor.CreateEncryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(PlaneBytes, 0, PlaneBytes.Length);
                        cs.Close();
                    }
                    PlanePassword = Convert.ToBase64String(ms.ToArray());
                }
            }
            return PlanePassword;
        }

        public string DecryptPassword(string EncryptedPassword)
        {
            string EncryptionKey = "LexBank";
            byte[] DecryptedBytes = Convert.FromBase64String(EncryptedPassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65, 0x64, 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())


                {
                    using (CryptoStream cs = new CryptoStream(ms, PasswordEncryptor.CreateDecryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(DecryptedBytes, 0, DecryptedBytes.Length);
                        cs.Close();
                    }
                    EncryptedPassword = Encoding.Unicode.GetString(ms.ToArray());
                }
            }
            return EncryptedPassword;
        }

        public Boolean CreateNewUser(string UserId, string UserName, string UserEmail, string UserPwd)
        {
            Boolean result = false;
            string UserPriv = "client";
            string UserStatus = "New";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("INSERT INTO Users_Info VALUES(@UserId,@UserName,@UserEmail,@UserPwd,@UserPriv,@UserStatus)", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserName", UserName.Trim());
                    command.Parameters.AddWithValue("@UserEmail", UserEmail.Trim());
                    command.Parameters.AddWithValue("@UserPwd", UserPwd.Trim());
                    command.Parameters.AddWithValue("@UserPriv", UserPriv.Trim());
                    command.Parameters.AddWithValue("@UserStatus", UserStatus.Trim());
                    //
                    // Invoke ExecuteNonQuery method.
                    //
                    int c = 0;
                    try
                    {
                        c = command.ExecuteNonQuery();
                        if (c == 1)


                            result = true;
                    }
                    catch (Exception)
                    {
                        // Insert failed; result stays false.
                    }
                }
            }

            return result;
        }

        public bool IsAuthenticated(string userid, string userpassword)
        {
            bool result = false;
            // Dispose the connection, command, and reader when done so the connection is not leaked.
            using (SqlConnection LexBankDataConnection = new SqlConnection(LexBankConnectionString))
            using (SqlCommand AuthCommand = new SqlCommand("Select UserId, UserPriv, UserStatus from Users_Info where UserId=@userid and UserPwd=@userpassword", LexBankDataConnection))
            {
                AuthCommand.Parameters.AddWithValue("@userid", userid);
                AuthCommand.Parameters.AddWithValue("@userpassword", EncryptPassword(userpassword.Trim()));
                LexBankDataConnection.Open();
                using (SqlDataReader reader = AuthCommand.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        string UserStatus = reader["UserStatus"].ToString();
                        if (UserStatus == "Active")
                        {
                            result = true;
                            LogEvent("Login", DateTime.Now, userid.Trim());
                        }
                    }
                }
            }
            return result;
        }

        public List<string> FindSynSet(string lexeme, string WordNet)
        {
            List<string> result = new List<string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {


                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + WordNet + " where Member like @lexeme", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@lexeme", lexeme.Trim());
                    //
                    // Invoke ExecuteReader method.
                    //
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(0).Trim());
                    }//end while
                } //end the second using
            }//end the first using
            return result;
        }

        public List<string> FindSynSetLexemes(string OffsetPos, string WordNet)
        {
            List<string> result = new List<string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + WordNet + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    //
                    // Invoke ExecuteReader method.
                    //
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(1).Trim());


                    }//end while
                } //end the second using
            }//end the first using
            return result;
        }

        public Boolean IsSynSetAvailable(string OffsetPos, string Wordnet)
        {
            // Takes a synset ID and checks whether it is included in a Wordnet.
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT Offset_Pos FROM " + Wordnet.Trim() + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    //
                    // Invoke ExecuteReader method.
                    //
                    SqlDataReader reader = command.ExecuteReader();

                    if (reader.Read())
                        result = true;
                }
            }

            return result;
        }

        public Dictionary<string, string> FindSynSetRelations(string OffsetPos, string WordNet, string RelationsTable)
        {
            Dictionary<string, string> result = new Dictionary<string, string>();



            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + RelationsTable.Trim() + " where Left_Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    //
                    // Invoke ExecuteReader method.
                    //
                    SqlDataReader reader = command.ExecuteReader();

                    string Relation = "";

                    int c = 0;
                    while (reader.Read())
                    {
                        if (IsSynSetAvailable(reader.GetString(2).Trim(), WordNet))
                        {
                            Relation = reader.GetString(1).Trim() + " : " + reader.GetString(2).Trim();
                            string RelatedOffsetPos = reader.GetString(2).Trim();
                            List<string> RelatedLexemes = FindSynSetLexemes(RelatedOffsetPos, WordNet);

                            foreach (string lexeme in RelatedLexemes)
                            {
                                c++;
                                result.Add(RelatedOffsetPos + c.ToString(), Relation + "-->" + lexeme);
                            }
                        }
                    }//end while


                } //end the second using
            }//end the first using

            return result;
        }

        public string FindGloss(string OffsetPos, string GlossTable)
        {
            string result = "Gloss is not available";

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + GlossTable + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    //
                    // Invoke ExecuteReader method.
                    //
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        // The Gloss column is nullable; keep the default message for NULL values.
                        if (!reader.IsDBNull(1))
                            result = reader.GetString(1).Trim();
                    }//end while
                } //end the second using
            }//end the first using

            return result;
        }

        public List<string> ReadRelation(string RelationKey, string RelationDataTable)
        {
            // Reads a relation and returns it to be evaluated.
            List<string> Result = new List<string>();

            try


            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string Sqls = "SELECT [RelationKey], [Word1] , [Relation], [Word2] FROM " + RelationDataTable + " where [RelationKey] = @RelationKey";
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                // Bind the @RelationKey parameter referenced in the query.
                Mycommand.Parameters.AddWithValue("@RelationKey", RelationKey);
                DataTable MyTable = new DataTable();
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);

                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }

                return Result;
            }
            catch (Exception ex)
            {
                return Result;
            }
        }

        public List<string> ReadSynsetGloss(int GlossKey, string TableName)
        {
            // Reads a synset gloss from the table and returns it to be evaluated.
            List<string> Result = new List<string>();

            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);



                string Sqls = "SELECT [GlossKey], [Word] , [Sentence], [PWN_Gloss] FROM " + TableName + " where [GlossKey]=@GlossKey";
                DataTable MyTable = new DataTable();
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                Mycommand.Parameters.AddWithValue("@GlossKey", GlossKey);
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);
                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }

                return Result;
            }
            catch (Exception ex)
            {
                return Result;
            }
        }

        public Boolean EvaluateRelation(int RelationKey, int Score, string UserId, string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string sqls = "INSERT INTO " + EvaluationTable + " ([RelationKey],[Score] ,[UserID]) values (@RelationKey,@Score,@UserId)";
                var command = new SqlCommand(sqls, MyConnection);
                command.Parameters.AddWithValue("@RelationKey", RelationKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId.Trim());
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();


                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }

        private Boolean EvaluateGloss(int GlossKey, int Score, string UserId, string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string sqls2 = "INSERT INTO " + EvaluationTable + " ([GlossKey],[Score] ,[UserID]) values (@GlossKey,@Score,@UserId)";
                var command = new SqlCommand(sqls2, MyConnection);
                command.Parameters.AddWithValue("@GlossKey", GlossKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId);

                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }

        public void LogEvent(string EventDesc, DateTime EventTime, string UserId)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //


                using (SqlCommand command = new SqlCommand("INSERT INTO System_Log([EventDesc], [EventTime], [UserId]) VALUES(@EventDesc, @EventTime, @UserId)", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@EventDesc", EventDesc.Trim());
                    // Add(name, SqlDbType) declares the parameter type; the value is assigned separately.
                    command.Parameters.Add("@EventTime", SqlDbType.DateTime).Value = EventTime;
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    //
                    // Invoke ExecuteNonQuery method.
                    //
                    //try
                    //{
                    command.ExecuteNonQuery();
                    //}
                    //catch (Exception e)
                    //{

                    //}
                }
            }
        }

        public void ChangeUserStatus(string UserId, string NewStatus)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("UPDATE Users_Info SET UserStatus=@UserStatus WHERE UserId=@UserId", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserStatus", NewStatus.Trim());
                    //
                    // Invoke ExecuteNonQuery method.
                    //
                    //try
                    //{
                    command.ExecuteNonQuery();
                    //}


                    //catch (Exception e)
                    //{

                    //}
                }
            }
        }

        public DataTable RetrieveUsers()
        {
            DataTable result = new DataTable();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand("SELECT [UserId], [UserName], [UserEmail], [UserPriv], [UserStatus] FROM [Users_Info]", connection))
                {
                    SqlDataAdapter dadapter = new SqlDataAdapter(command);
                    dadapter.Fill(result);
                }
            }
            return result;
        }
    }
}