

Ben-Gurion University of the Negev

Faculty of Natural Science

Department of Computer Science

AUTOMATIC HEBREW TEXT VOCALIZATION

Thesis submitted as part of the requirements for the M.Sc. degree of Ben-Gurion University of the Negev

by

Eran Tomer

January 2012


Subject: Automatic Hebrew Text Vocalization

Written By: Eran Tomer

Advisor: Michael Elhadad

Department: Computer Science

Faculty: Natural Science

Ben-Gurion University of the Negev

Signatures:

Author: Eran Tomer Date

Advisor: Michael Elhadad Date

Dept. Committee Chairman Date


Abstract

Written Hebrew involves an exceptionally high ambiguity rate due to the lack of vowel markings in the vast majority of modern Hebrew texts. Each letter may be vocalized and pronounced in multiple ways. This letter-level ambiguity naturally increases word-level ambiguity, which complicates many other natural language tasks. Automatic translation of Hebrew, Hebrew text-to-speech and many other NLP tasks may benefit from a reliable system which restores vocalization signs.

In this research we aim to implement tools that may be used to enhance the capabilities of an automatic vowel restoration system. We focus on handling verbs. The Hebrew verb system has the most complex morphology and vocalization mechanisms of all parts of speech. We first present a comprehensive generation mechanism, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows. Given a classification of verbs into about 300 distinct clusters, our generation mechanism generates fully vocalized inflected verbal forms. Using our implementation of this method, we have produced a lexical resource for modern Hebrew that includes all inflected forms for over 4,000 distinct verbs. This database contains about 250,000 fully inflected and fully vocalized verbal forms, with an accuracy estimate of over 99.4%.

In the second part, we address the task of automatic segmentation of vocalized words into syllables. This task is a necessary component of text-to-speech systems. We estimate the accuracy of the syllable segmentation algorithm taught in classical Hebrew grammar books (Behor's 5 rules of Shva classification). We observe that this algorithm only succeeds on about 80%-85% of the verbs we tested. If we introduce additional knowledge, we manage to make this task succeed at over 99.4% accuracy for verbs. This success rate requires, as additional input to the syllable segmentation algorithm, the vocalized origin form of the verb. This finding provides strong support for the hypothesis that phonological mechanisms in Hebrew rely on a construction mechanism (that is, syllable segmentation starts from a base form plus an inflection form, and not directly from the fully inflected form).


Finally, we address the task of classifying the non-vocalized base form of a verb into one of the 300 inflection paradigms needed to determine the verb's full vocalization. We use supervised learning techniques and letter features: the position of the letters in the base form and the classification of the letters into guttural/non-guttural families. Surprisingly, this classification mechanism only succeeds at about 70% accuracy over a sample of several thousand verbs. This finding contradicts the traditional grammar intuition that simple letter-pattern rules predict the verb vocalization. We first confirm that the simpler task of classifying into Binyanim (the 7 basic major verb patterns in Hebrew) can be achieved on the basis of simple letter rules (with accuracy over 90%). We then investigate the hypothesis that corpus-level features may help the task of classifying non-vocalized verbal base forms into one of the 300 inflection paradigms. Corpus-level features capture ad-hoc letter patterns observed over a very large corpus of fully inflected non-vocalized verbs (gathered over a corpus of over 50M words). We find that such features improve verb classification by about 5%. This finding indicates that the verb inflection mechanism in Hebrew is more irregular than is commonly assumed.

The several resources (software and databases) we have produced provide important stepping stones towards the implementation of a fully automatic vowel restoration system for Hebrew. The experimental data we have gathered also provides new insights into the nature of the verbal word formation system in Hebrew.


Acknowledgements

I would like to thank my advisor, Prof. Michael Elhadad, for all his ideas and support, and Prof. Jacob Ben Tulila, member of the Academy of the Hebrew Language, who helped me understand the basics of manual and automatic vocalization. I also thank the NLP team at Ben-Gurion University, especially Dr. Meni Adler and Dr. Yoav Goldberg, who eased my ventures in implementing my SVM-based classifiers. Finally, I thank my parents and my wife for all their help and understanding, and last, my (newly born) baby girl, Dani, for sleeping so well at nights and thereby making the writing of this work possible.


Contents

Abstract II

Acknowledgements IV

Part I 2

1 Introduction 3
   1.1 Domain - Natural Language Processing 3
      1.1.1 Natural Language Processing in Hebrew 4
      1.1.2 Natural Language Processing in Hebrew - Related Work 5
   1.2 Overview 6

2 Motivation and Contributions 7
   2.1 Practical Applications 7
      2.1.1 Development of a Hebrew Text-To-Speech System 7
      2.1.2 Generation of Vocalized Text for Teaching Usages 8
      2.1.3 Improving Automatic Translation Systems 8
   2.2 Objectives 9
      2.2.1 Generation 9
      2.2.2 Syllable Segmentation 9
      2.2.3 Unknown Verbs Classification 9
   2.3 Challenges 10
      2.3.1 The Inflectional Model for Verbs 10
      2.3.2 Creating a Shva Classification Algorithm 10
      2.3.3 Missing Training/Testing Sets 10
      2.3.4 Evaluation of Generation and Syllable Segmentation Accuracy 10
   2.4 Research Questions 11
   2.5 Contributions 11
      2.5.1 Resources 11
      2.5.2 Methods 12

3 Previous Work Regarding Hebrew Vocalization 13
   3.1 Commercial/Non-academic Vowel Restoration Systems 14
   3.2 Commercial/Non-academic Attempts on Syllable Segmentation 18
   3.3 Academic Attempts on Vowel Restoration 19
   3.4 Academic Attempts on Syllable Segmentation 20
   3.5 Generation in Hebrew 21
   3.6 Conclusions 21

Part II 22

4 Background - Principles of Vocalization in Hebrew 23
   4.1 Linguistic Definitions 23
      4.1.1 Hebrew Letters 23
      4.1.2 Vocalization Signs in Hebrew 23
      4.1.3 Syllables 28
      4.1.4 Stress 29
      4.1.5 Deficient Spelling vs. Plene Spelling 29
      4.1.6 Vocalization Rules 29
   4.2 The Case of Verbs 31
      4.2.1 Patterns - Binyanim 32
      4.2.2 Inflection Tables 33
   4.3 Conclusions 34

5 Datasets 35
   5.1 Verbs List 35
   5.2 Morphologically Analysed Corpora 35

Part III 36

6 Method 37
   6.1 Generation 37
   6.2 Syllable Segmentation 40
      6.2.1 Following Behor's Footsteps - Heuristic Approach 40
      6.2.2 Shva Classification According to the Base Form 42
   6.3 Unknown Verbs Classification 46

7 Experiments and Results 49
   7.1 Generation Results 49
   7.2 Syllable Segmentation Results 50
      7.2.1 Following Behor's Footsteps - Heuristic Approach 50
      7.2.2 Syllable Segmentation According to the Base Form 51
      7.2.3 Discussion 51
   7.3 Results of Unknown Verbs Classification 52
      7.3.1 Base Forms Classification into Binyanim (Patterns) 52
      7.3.2 Classification of Base Forms to Inflection Tables 54
      7.3.3 Classification of Base Forms to Inflection Tables with Corpus Level Features 55
      7.3.4 Discussion 57

Part IV 58

8 Conclusions and Future Work 59
   8.1 Conclusions 59
   8.2 Future Work 60
      8.2.1 Generation 60
      8.2.2 Syllable Segmentation 61
      8.2.3 Unknown Verbs Classification 61
      8.2.4 Automatic Vocalization 62

A Data Set Example - Base Forms Correlated to Inflection Tables 64

B Data Set Example - Inflected Vocalized Verbs 65

C Data Set Example - Inflected Verbs Segmented into Syllables 66

Bibliography 67


List of Tables

1.1 Ways to vocalize ספר 4
4.1 Hebrew letters 24
4.2 Additional letters 24
4.3 The types of consonants 25
4.4 Vocalization vowels and semi-vowels 26
4.5 Dagesh sound manipulations 27
4.6 Long vs. short vowels 30
4.7 Application of the syllables and vowels rule 31
4.8 A complete paradigm from the Paal pattern - inflections of שפך 34
4.9 A complete paradigm from the Paal pattern - inflections of גדל 34
6.1 The valid forms of מועמד 38
6.2 Features created in the second classification phase for the base form הÇuר¯ז 48
7.1 Generation error analysis 50
7.2 Syllable segmentation comparison 51
7.3 Base form distribution among patterns 52
7.4 Classification of base forms to patterns - confusion matrix 54
7.5 Identification of exact inflection table - comparison of baseline manipulations 54
7.6 Classification of base forms to inflection tables with corpus-level features 57


List of Figures

3.1 Ambiguity in vocalization of Hebrew text 15
3.2 The Little Prince vocalization according to Nakdan Text 17
3.3 The Little Prince vocalization according to Snopi - Automatic Nikud 17
3.4 The Little Prince vocalization according to Nikuda 18
4.1 Hebrew patterns 32
7.1 Base form distribution among patterns 53
7.2 Inflection table confusion pairs 55


List of Algorithms

1 The syllables and vowels rule (כלל ההברות והתנועות) 30
2 The Behor scheme for Shva classification 41
3 Our heuristic for Shva classification 41
4 Edit distance matrix calculation 43
5 Initialization of the edit distance matrix 43
6 The edit distance δ-function 44
7 The Hebrew-adapted δ-function 44
8 Corpus-level feature extraction 48
9 Unification of clusters of inflection tables which cause confusion 56
10 Supervised learning for automatic vocalization 63


Part I


Chapter 1

Introduction

This research aims to investigate and develop new methodologies in Natural Language Processing (NLP) regarding the vocalization of Hebrew text. Our main goals are:

- Automatic generation of vocalized words.
- Automatic word segmentation into syllables.
- Handling unknown words in our generation model.

The strategy we use is a hybrid approach that combines a rule-based model, statistical methods and machine learning tools.

1.1 Domain - Natural Language Processing

The field of computational linguistics attempts to model and study languages using computational techniques. The diverse challenges confronted by computational linguistics researchers include machine translation, automatic text summarization, speech-to-text, text-to-speech and many more.

Throughout the years, many linguistic resources such as dictionaries, lexicons and other forms of labelled text were developed, mainly for English, yet some resources do exist for other languages as well. The first attempts to accomplish tasks in NLP were based mostly on formalizing deterministic models grounded in linguistic knowledge: for example, the attempt to model English with a Context-Free Grammar (CFG), as described in [20, Ch. 12], or the attempt by Klein and Simmons to develop an automatic tagging system by manually gathering hand-crafted rules [22]. Later on, the probabilistic path of NLP began blooming, and the use of corpus-based statistical data became widespread [20, Ch. 14]. Today, the use of machine learning, in its various forms, plays a key role in state-of-the-art NLP achievements.


1.1.1 Natural Language Processing in Hebrew

Computational linguistics in Hebrew is considered relatively difficult for two main reasons. The first is the lack of existing large-scale Hebrew linguistic resources, making supervised learning techniques generally harder to apply. The second difficulty is caused by the rich morphology and high ambiguity level of Hebrew. According to [3, p. 39], the average ambiguity rate for a Hebrew word is approximately 2.7, vs. English (1.41), French (1.7) or Spanish (1.25).

The high ambiguity rate in written Hebrew results, among other causes, from the lack of vocalization signs in the Hebrew standard writing system. Therefore, a given Hebrew word may have an astonishing number of different meanings. For example, Table 1.1 displays the different ways to vocalize the word ספר, with the corresponding meanings, morphological analyses and pronunciations.

The high ambiguity rate poses a significant obstacle for many computational linguistic tasks such as text-to-speech, automatic morphological analysis, machine translation and many other key tasks.

Table 1.1: Ways to vocalize ספר

Word | POS* | Morphology | Meaning | Pronunciation
ספר | Verb | Past-Masculine-3rd person-Singular | counted | safar
ספר | Verb | Imperative-Masculine-2nd person-Singular | count | sfor
ספר | Noun | Singular | book | sefer
ספר | Verb | Past-Masculine-3rd person-Singular | (he) told | siper
ספר | Verb | Past-Masculine-3rd person-Singular | (was) told | supar
ספר | Verb | Imperative-Masculine-2nd person-Singular | tell | saper
ספר | Noun | Singular | hairdresser | sapar
ספר | Verb | Past-Masculine-3rd person-Singular | (he) cut hair | siper
ספר | Verb | Past-Masculine-3rd person-Singular | (his) hair was cut | supar
ספר | Verb | Imperative-Masculine-2nd person-Singular | cut hair | saper
ספר | Noun | Singular | border, frontier | sfar
ספר | Noun | Singular | narrative | siper

* Part of Speech

In some cases, Hebrew texts do contain vocalization signs: the Bible and some ancient writings, poetry, educational material and even some encyclopaedias include vocalization signs. In such cases, the ambiguity rate drops significantly, yet such texts are rather scarce, and (apart from encyclopaedias) rarely use modern Hebrew grammar.

Moreover, such resources rarely exist in digital form, and transforming printed vocalized Hebrew text into digital form involves either a great deal of manual labour or, via OCR (see [41]), a substantial number of errors.


1.1.2 Natural Language Processing in Hebrew - Related Work

In spite of the difficulties, much has been accomplished in Natural Language Processing in Hebrew:

- The Hebrew TreeBank: The treebank [35] includes some 5,000 sentences from the "Ha'aretz" daily newspaper, with complete segmentation and POS tagging.

- MILA - Knowledge Center for Processing Hebrew: MILA [19] offers comprehensive corpora of plain Hebrew text, a limited corpus of spoken Hebrew, lexicons, and tools for tokenization, morphological tagging and morphological disambiguation.

- Word Segmentation: Word segmentation in Hebrew is a challenging task because of the agglutinative nature of Hebrew: some parts of speech can be glued, as prefixes or suffixes, to other words. For example:

  1. Hebrew has 7 particles, מ, ש, ה, ו, כ, ל, ב (letters that do not appear by themselves, but come as prefixes to a word), and in many cases more than one such prefix is stacked. Of course, these letters are also used as regular letters to form words. For example, the phrase "מכאן ומשם", which means "from here and from there", should be segmented as [מ][כאן] [ו][מ][ש][ם] (a code sketch of such prefix enumeration follows this list).

  2. Hebrew verbs tend to receive prefixes and suffixes corresponding to their morphology. For example, the word "אהבתיה" (Ahavtiha) is composed of "אהב" (loved), "תי" (I) and "ה" (her), such that "אהבתיה" actually means "I loved her".

  3. Nouns also receive suffixes that indicate number, gender and definiteness.

  Much has been accomplished in Hebrew word segmentation [14, 15], but a complete solution is yet to be achieved. Word segmentation is not a challenge unique to Hebrew; substantial research on Chinese word segmentation was, and still is, conducted [16, 18, 25, 30].

- Morphological tagging tool: The task of morphologically tagging a given text involves attaching a tag of morphological attributes to every word in the text. In 2007, Adler and Elhadad [3] defined a morphological tag-set for Hebrew and developed a full morphological analyzer for Hebrew text. The morphological analyzer provides segmentation into morphemes and POS tags at about 94% accuracy, and full morphological disambiguation at 91% accuracy.
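To make the particle-prefix ambiguity concrete, the following Java sketch (our own illustration, not code from any of the cited systems) enumerates the candidate splits of a word by recursively peeling off leading particle letters. A real segmenter would additionally validate each remaining core against a lexicon and score the candidates in context.

    import java.util.ArrayList;
    import java.util.List;

    public class PrefixSegmenter {
        // The seven Hebrew particle letters that may be glued to a word as prefixes.
        private static final String PARTICLES = "משהוכלב";

        /** Enumerates every way to split off zero or more leading particle letters. */
        public static List<List<String>> segmentations(String word) {
            List<List<String>> results = new ArrayList<>();
            results.add(List.of(word)); // the word may contain no prefix at all
            if (word.length() > 1 && PARTICLES.indexOf(word.charAt(0)) >= 0) {
                for (List<String> rest : segmentations(word.substring(1))) {
                    List<String> candidate = new ArrayList<>();
                    candidate.add(word.substring(0, 1)); // peeled-off particle
                    candidate.addAll(rest);
                    results.add(candidate);
                }
            }
            return results;
        }
    }

For מכאן this enumerates both the correct split [מ][כאן] and spurious candidates such as [מ][כ][אן], which is exactly why a lexicon and a context model are needed to choose among the candidates.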


1.2 Overview

In Part I, we introduce our general area of research and outline the motivation, previous related work, research questions and our contributions. Part II presents some Hebrew linguistic background and the datasets we use. Part III describes our method, experiments and results, and Part IV concludes our work and suggests future work. In addition, we present in the appendix samples from the datasets we produced throughout this work.


Chapter 2

Motivation and Contributions

2.1 Practical Applications

The motivation for an automatic vocalization mechanism includes various practical applications in which success is only partial today.

2.1.1 Development of a Hebrew Text-To-Speech System

A text-to-speech (TTS) system aims to convert written text into speech. The TTS task consists of two main parts: one involves processing the words in the text into normalized words, and the second includes the actual synthesis of voice (based upon the normalized words). A normalized word attempts to represent the way a word should be read. For example, the word lead in the sentence "I eat lead for breakfast" should be pronounced as led (and not leed); this should be deducible from the normalized form of lead.

TTS for non-vocalized Hebrew text is complex mainly because Hebrew non-vocalized words tend to be very ambiguous, and, therefore, a given non-vocalized word may be pronounced in many different ways. The high ambiguity rate in pronunciation is caused by several main attributes of Hebrew:

1. Hebrew's agglutinative nature, and the resulting difficulty in word segmentation.

2. Each Hebrew letter can be pronounced in various ways depending on its vocalization (up to 12 pronunciations per letter!).

3. Each Hebrew word is pronounced with a certain emphasis, which cannot be derived directly from the non-vocalized word and its morphology. Moreover, Hebrew word pronunciation depends on the syllables composing the word. Each syllable is pronounced at a frequency corresponding to the vocalization of the letters forming the syllable. Segmenting a non-vocalized word into its syllables is difficult since there exist no clues for deciding whether a given letter plays a vowel or a consonant role.


Vocalized and syllabified words seem like a good representation of normalized Hebrew text, by which all these issues are either completely solved or drastically simplified. Therefore, the lion's share of the first part of the TTS task may be resolved by creating a program that returns vocalized text given the standard non-vocalized input. Once the goal of creating such a program is achieved, we expect the formation of a TTS system to naturally follow.

The TTS task is well studied in various languages (see [38]); moreover, it has been implemented successfully in languages such as English and Spanish (see for example the Festival Speech Synthesis System [39]). In Hebrew, a few TTS systems exist, such as Kolan by Melingo [21, 26], the open source Qaryan Hebrew TTS [2] and some others. In [21], 98% accuracy was reported for Kolan, yet the manner in which this accuracy was measured was not specified. Other systems do not provide descriptive data assessing their capabilities.

2.1.2 Generation of Vocalized Text for Teaching Usages

Vocalized words are commonly used for teaching Hebrew. The idea is to reduce ambiguity and the vast number of pronunciation possibilities. Vocalizing words manually is regarded as a complicated lost art, which may be practically applied by only a handful of scholars. An automatic vocalization generator could satisfy the need for vocalized words for teaching.

2.1.3 Improving Automatic Translation Systems

Modern automatic translation systems that translate Hebrew to other languages tend to err when confronting words that have more than one meaning. For example, the word תמונה may be read as Tmunah, which means "picture", or as Temuneh, which means "(you) will be appointed". In such cases a translation system typically chooses the word that is more frequent. Google Translate, for example, translates the sentence "אתה תמונה לתפקיד" to "You picture the job" instead of "you will be appointed to the job".

Translating vocalized text is a much simpler task due to the dramatic reduction of ambiguity. In other words, a vocalized Hebrew word has a very limited number of meanings (one, in the absolute majority of cases).


2.2 Objectives

2.2.1 Generation

Given a non-vocalized base word (an absolute-state noun or a verb base form), we intend to automatically generate all appropriate inflections of the word and their corresponding morphological characteristics.

Obviously, accomplishing the above for the whole language is a tremendous task that involves a great deal of manual labour; therefore, we focus on a sub-objective which we aim to accomplish in this work: we confront the generation problem for verbs only.

2.2.2 Syllable Segmentation

Given a vocalized word, we intend to present a scheme for syllable segmentation. Once a word is segmented into syllables, a TTS system could use the vocalized and segmented output as its normalized text input and determine for each syllable its appropriate duration, volume and stress.

2.2.3 Unknown Verbs Classification

The general task of vocalization takes non-vocalized text as input and returns the text with its vocalization fully restored. The vocalized text could then be further analysed with relative ease, thanks to the significant reduction of word ambiguity.

The task of vocalization requires either a large corpus of vocalized text (which we did not manage to acquire in the scope of this work) to apply supervised learning, or a large lexical resource with full vocalization. In addition to the resource we intend to develop via our generation mechanism, we will provide a system for classifying unknown verbs into their corresponding pattern (Binyan) and into the specific corresponding inflection table.


2.3 Challenges

2.3.1 The Inflectional Model for Verbs

The Hebrew verb inflectional model is composed of a complex network of transformations and vocalization schemas. Implementing such a model involves the manual definition of over 250 inflection tables, where each table includes over 60 inflection schemes (on average) that form the actual verb inflections.

2.3.2 Creating a Shva Classification Algorithm

Segmenting a word into syllables relies heavily on identifying each Shva instance in the word as Shva Na or Shva Nach. For this reason we must define a method for classifying Shva instances as one of the two types.

2.3.3 Missing Training/Testing Sets

In order to apply machine learning techniques to confront the objective of 2.2.3, we must obtain a labelled dataset that correlates inflection tables and base forms of verbs. To our knowledge, no such digital dataset exists; therefore, we have to gather such a dataset manually.

2.3.4 Evaluation of Generation and Syllable Segmentation Accuracy

Due to the lack of a comprehensive vocalized lexicon (one that includes inflections), we have to manually check a substantial amount of vocalized words and their corresponding morphology. This manual check enables us to assess the accuracy of our generation model and of the resulting fully vocalized and morphologically disambiguated inflection list. Similarly, there exists no automatic way to assess the precision of our syllable segmentation algorithm, hence manual checking of a representative group cannot be avoided.


2.4 Research Questions

In this research we intend to address two questions regarding vocalization and syllable segmentation in Hebrew:

- How complex must the computational model be for full morphological and vocalization generation of verbs, and how much lexical knowledge and how many exceptions are required to cover the Hebrew verb lexicon?

- How complex is syllable segmentation, and what level of knowledge is required for successful segmentation?

2.5 Contributions

2.5.1 Resources

- A corpus of vocalized songs: Vocalized data is hard to obtain, yet Hebrew literature often makes use of vocalized words. The web site http://www.zemereshet.co.il includes over 2,500 fully vocalized Hebrew songs, with over 50 words per song on average. As part of this work we have gathered these vocalized songs into one vocalized corpus which may be used in the future. Yet, literature is far from an optimal source of vocalized data, with a high rate of names, places, words borrowed from other languages and non-typical grammar.

- A collection of vocalized and morphologically tagged verbs: We provide a collection of over 240k vocalized verb inflections along with a set of corresponding morphological attributes including tense, gender, person, number and spelling. Moreover, in case more than one valid form of an inflection exists, we provide all valid forms.

- A collection of verbs segmented into syllables: We introduce a collection of over 240k vocalized verb inflections that were automatically segmented into syllables. A word in the collection is segmented correctly with probability 99.33%, and a syllable is segmented correctly with probability 99.5%.


2.5.2 Methods

- Inflection tables implementation: We present a Java implementation of over 250 inflection tables. Each inflection table takes a verb in its past, masculine, 3rd person, singular, deficiently spelled form (base form) and generates all inflections with the corresponding morphology.

- Syllable segmentation: We introduce two methods for syllable segmentation of a given vocalized word. One method takes only the vocalized word as input, while the second method also takes the base tense form (the masculine, 3rd person, singular, deficiently spelled inflection in the same tense). The first method proved to be accurate in 81% of the cases, and the second method was accurate in 99.33% of the cases (per-word accuracy).

- Classification of unknown base forms into our generation model: We provide two methods for the classification of unknown base forms. The first method classifies a base form to its corresponding pattern (Binyan) with accuracy greater than 90%. The second method classifies a base form to its corresponding inflection table with accuracy of about 70%.


Chapter 3

Previous Work Regarding Hebrew Vocalization

The task of automatic vocalization of Hebrew text has been confronted by a few companies and academic researchers. The vowel restoration process may be viewed as the unification of three independent tasks:

- Word Segmentation (optional): As discussed in 1.1.2, in many cases Hebrew words are formed by the concatenation of prefixes and suffixes, which may indicate participant, preposition and more. Therefore, a simple dictionary that includes words with all these possible prefixes and suffixes would have to be very big. According to [4], the number of basic words in Hebrew is around 90,000, yet every noun has 10 suffix modifiers which determine participant, and 7 possible prefixes (particles) which indicate preposition, the definite article, etc., and some combinations of these prefixes are also valid. For verbs the case is even worse, as a base form of a verb may be inflected into about 60 inflections (as described in 6.1), while each inflection may be augmented with some possible prefixes. Overall, as stated in [21], the total number of words in a dictionary which includes inflections and other prefixes and suffixes is estimated to be around 70M (a back-of-the-envelope check of these figures follows this list). By using a word segmentation mechanism, certain prefixes and suffixes are separated from the core of the word, and the size of the dictionary may be dramatically reduced, by a factor of ~1000 (90k vs. 70M). Generally, a system that does not use word segmentation as part of its vocalization process may suffer from a high rate of unknown words.


- Suggestion of possible vocalizations: This task involves gathering a collection of the possible vocalizations for a given word in the text. Obviously, if each possible vocalized word in this collection is also associated with its corresponding morphology, the next selection phase may be performed more easily.

- Selection of a certain vocalization: Given a non-vocalized word and a collection of the possible ways to vocalize it, one vocalized form of the word should be selected from the collection. The selection may be performed according to part-of-speech agreement, morphology annotation agreement and other statistical and context-dependent schemes.
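As a back-of-the-envelope check of the dictionary sizes cited above (our own arithmetic; the per-verb inflection count is the rough figure from the text, and the count of valid prefix combinations is our assumption, chosen only to show the order of magnitude):

    public class DictionarySize {
        public static void main(String[] args) {
            long baseWords = 90_000L;    // basic Hebrew words, per [4]
            long inflections = 60;       // rough inflections per verb base form (see 6.1)
            long prefixCombos = 13;      // assumed: 7 particles plus a few valid stackings
            long expanded = baseWords * inflections * prefixCombos;
            // Prints roughly 70M and a reduction factor near 780; the text's
            // "factor of ~1000" is this same 70M / 90k ratio, rounded up.
            System.out.printf("expanded ~= %dM, factor ~= %d%n",
                    expanded / 1_000_000, expanded / baseWords);
        }
    }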

Some systems ignore the selection phase, resulting in semi-automatic vocalization software (meaning the user of the software is the one who performs the selection). Yet the holy grail is obviously a completely automatic vowel restoration system.
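The three-task decomposition above maps naturally onto a pipeline of components. The Java interfaces below are our own schematic rendering of that decomposition (none of the systems discussed publishes such an API); a fully automatic system implements all three stages, while a semi-automatic one replaces the selector with the user.

    import java.util.List;

    /** A vocalized candidate with the morphology that helps the selection phase. */
    record VocalizedAnalysis(String vocalizedForm, String posTag, String morphology) {}

    /** Optional first stage: split off particle prefixes and suffixes. */
    interface WordSegmenter {
        List<String> segment(String surfaceWord);
    }

    /** Second stage: collect all possible vocalizations of a word core. */
    interface VocalizationSuggester {
        List<VocalizedAnalysis> suggest(String wordCore);
    }

    /** Third stage: choose one candidate using POS agreement, morphology and context. */
    interface VocalizationSelector {
        VocalizedAnalysis select(List<VocalizedAnalysis> candidates, List<String> context);
    }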

3.1 Commercial/Non-academic Vowel Restoration Systems

Nakdan Text (נקדן טקסט) [9, 21], which was originally developed by the Israeli Center for Educational Technology, is marketed today by the Melingo company, and is claimed to vocalize Hebrew text at over 97% accuracy. As [13] states, it is unclear how this accuracy is measured (per character/per word), and no information has been published regarding the methods used by Nakdan Text. In his work [23], Kontorovich examines the gender agreement in some simple examples vocalized by Nakdan Text, and deduces that Nakdan Text does not use a generation model and is mostly based on lookup tables and ad-hoc rules. The resulting vocalized text is, therefore, in some cases surprisingly wrong; for example, two verbs that belong to the same paradigm may be vocalized by different vocalization templates.

Another commercial system is Snopi - Automatic Nikud (סנופי - ניקוד אוטומטי) [8], which shows exactly the same results as Nakdan Text on Kontorovich's examples. Again, we were not able to locate any specification of the methods used in Snopi - Automatic Nikud, yet due to the similarity of its results to Nakdan Text, we believe Kontorovich's conclusions concerning Nakdan Text are also valid in this case. According to other informal tests we conducted on Snopi - Automatic Nikud, it seems to use a rather poor POS tagger (if any) and a very limited word segmentation scheme.


Auto Nikud (אוטו ניקוד) [36] is a semi-automatic vocalization software; it lets the user select the appropriate vocalization for a given word out of all the vocalized words it knows (which were most probably automatically collected from some vocalized corpus). According to the Auto Nikud website, it includes a list of about 200k vocalized words.

The Nakdanit (נקדנית) commercial system offers automatic, semi-automatic and manual vocalization. Nakdanit is (according to its website description) completely dictionary-based software: a word is automatically vocalized according to the first corresponding entry in the dictionary. The dictionary of Nakdanit includes 260k vocalized words that are, according to the website, 99% validated to be correct.

In 2010, another software package called Nikuda (נקודה) [34] came to light. Nikuda clearly states that it does not use any linguistic rules, but relies solely on a database of vocalized words. The database includes words from the Bible, poetry and words manually vocalized by Nikuda users. In practice, some informal tests we have conducted suggest Nikuda is inferior to the previous systems. Many unknowns and mistakes were present in examples the previous systems vocalized successfully. Specifically, Nikuda's database-oriented method causes many unnecessary unknowns. For example, the sentence ילד חשב אתמול (which means "a child thought yesterday") is vocalized correctly, but in the sentence ילד שחשב אתמול (which means "a child that thought yesterday"), Nikuda treats שחשב as unknown, although it may be deduced from חשב via word segmentation.

Figure 3.1: Ambiguity in vocalization of Hebrew text

* The Little Prince, by Antoine de Saint-Exupéry, translated to Hebrew by Arieh Lerner.
* Words with more than one possible vocalization are underlined.


We conclude this brief summary of previous commercial and non-academic work on automatic vocalization with a comparison of three of the automatic vocalization tools. In Figure 3.1, we have a text that includes 88 non-vocalized Hebrew words, where words which may be vocalized in more than one way are underlined. Some 51 words in the text actually do have such ambiguity in vocalization (about 58%, very much like the 55% ambiguity rate over 40,000 words from the Ha'aretz newspaper reported by Levinger [24]). The ambiguities in the text and the mistakes made by the different systems were carefully annotated according to [1, 4, 5, 11, 31]. Our accuracy measure was calculated in a word-wise manner (no word segmentation was applied).

Figure 3.2 displays the resulting vocalization of our text by Nakdan Text. We note 9 words that are vocalized incorrectly and one word whose spelling was altered (about an 89% success rate). Three of the mis-vocalized words are non-words; one of these is מה, which should be vocalized differently according to a decision of the Academy of the Hebrew Language (see 2.5.3 in [1]). The other two non-words are mis-vocalizations of בתמהה and בענוה; in both cases the error is in the vocalization of the ב at the beginning of the word. The ב in these cases is not a part of the actual word, but a particle (a formative letter which means "in") that is glued to the word. These mistakes are surprising, since in general the vocalization of formative letters is relatively simple [11], under the assumption that the ב is recognized as a formative letter. Since the rest of each word is vocalized correctly, and since each word is not ambiguous, we deduce that the ב could simply be recognized as a formative letter, and, therefore, the reasoning about its vocalization should be immediate. As mentioned, one word (ספור) in the Nakdan Text output was spelled differently, as ספר. According to the suggested vocalization it is clear that Nakdan Text produced the correct word (ספור and ספר are both valid forms of the same word in Hebrew), yet for some mysterious reason Nakdan Text preferred an alternative spelling (as if ספור were misspelled, contrary to [4]).


Figure 3.2: The Little Prince vocalization according to Nakdan Text

* Vocalization mistakes are in red, non-words are in orange and alternatively spelled words are in pink.

Figure 3.3 displays the resulting vocalization of our text by Snopi - Automatic Nikud. As displayed, 10 vocalization mistakes are present and 2 words were not vocalized (overall, about an 86% success rate). We note that the odd mistakes regarding the vocalization of some formative letters which were present in Nakdan Text are absent here, yet other mistakes appear which may result from an inferior dictionary or from the absence of simple inflectional rules (for participant suffixes, for example).

Figure 3.3: The Little Prince vocalization according to Snopi - Automatic Nikud

* Vocalization mistakes are in red and non-vocalized words are in blue.


Figure 3.4 displays the resulting vocalization of our text by Nikuda. Here, 13 vocalization errors appear (one of these is the mis-vocalization of מה also seen in Nakdan Text), and 10 words were left non-vocalized (overall, about a 74% success rate). Again, the odd mistakes made by Nakdan Text are absent, yet as expected in a system which is completely dictionary-based, the number of unknowns is dramatically increased, and, therefore, more words are left non-vocalized. We notice that all unknown words include formative letters or participant suffixes, again as one should expect.

Figure 3.4: The Little Prince vocalization according to Nikuda

* Vocalization mistakes are in red, non-words are in orange and non-vocalized words are in blue.

3.2 Commercial/Non-academic Attempts on Syllable Segmentation

The only system we are aware of is a syllable segmentation component in the TTS system for Hebrew text by the Melingo company. The TTS system is called Kolan (קולן) [21, 26], and according to the Melingo web site [26], the syllable segmentation component in Kolan achieves 98% accuracy on average. Yet, it is unclear whether this measure refers to word accuracy or syllable accuracy. No additional descriptive information regarding the methods and datasets used by Melingo was available.


3.3 Academic Attempts on Vowel Restoration

In 2001, Kontorovich [23] attempted to use an HMM for Hebrew text vocalization. Kontorovich used parts of the Bible (Westminster Hebrew Morphological Database, 2001) for learning and testing, such that 90% of the data was used for training and the remaining 10% for testing. Kontorovich ran three experiments. The first included gathering a context-free list of vocalized words along with their frequency in the training set; a word in the test set was then vocalized according to the most frequent word (with similar letters) in the list. 77% of the test words were vocalized correctly. The second experiment included a list of frequencies for the vocalized words along with their corresponding part-of-speech tag; vocalization was then assigned to a word in the test set according to the highest-frequency word in the list with similar letters and a matching POS tag. However, the POS tags for the words in the test set were taken as given from the Westminster Database, which is clearly not applicable to general, untagged texts. Here, 79% of the test words were vocalized correctly. The third experiment used an HMM with 14 hidden states (corresponding to the 14 POS tags used by Westminster); here, 81% of the words were vocalized correctly.
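The context-free baseline shared by Kontorovich and (below) Gal is easy to restate in code. The sketch that follows is our reconstruction from the descriptions in the text, not either author's implementation; it memorizes, for each non-vocalized letter sequence, the vocalized form most frequently seen in training.

    import java.util.HashMap;
    import java.util.Map;

    public class FrequencyBaseline {
        // non-vocalized skeleton -> (vocalized form -> training count)
        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        /** Strips the vocalization marks (approximated here by the Hebrew points block). */
        static String skeleton(String vocalized) {
            return vocalized.replaceAll("[\u05B0-\u05C7]", "");
        }

        public void observe(String vocalizedWord) {
            counts.computeIfAbsent(skeleton(vocalizedWord), k -> new HashMap<>())
                  .merge(vocalizedWord, 1, Integer::sum);
        }

        /** Most frequent vocalization seen for this word, or the input itself if unknown. */
        public String vocalize(String plainWord) {
            Map<String, Integer> options = counts.get(plainWord);
            if (options == null) return plainWord;
            return options.entrySet().stream()
                          .max(Map.Entry.comparingByValue())
                          .get().getKey();
        }
    }

This kind of lookup is exactly the 77% baseline above; the HMM experiments then add POS context on top of it.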

In 2002, Gal [13] aimed to develop a robust system that enables vowel restoration for both Hebrew and Arabic. The corpora used included the Hebrew Bible (Westminster) and the Qur'an (publicly available from the Sacred Text Archive), such that 90% was used for training and the remaining 10% for testing. Like Kontorovich, Gal used a frequency-based lookup table to set a baseline, achieving accurately vocalized words in 68% and 74% of the cases, for Hebrew and Arabic respectively. Next, Gal used a bigram HMM which used the previous word as context; doing so, Gal achieved 81% and 86% accuracy for Hebrew and Arabic respectively. Another interesting angle noted by Gal was the similarity in modern phonology of some Hebrew diacritics. Armed with this notion, Gal clustered the Hebrew vowels into six sound groups, which enabled an impressive improvement of (partial) vowel restoration, to 87%. Obviously, for some usages, such as text-to-speech or reading assistance as mentioned in [37], restoring the sound group of a vowel is sufficient. For example, the distinction between Kamats and Patah is not relevant in modern Hebrew phonology: both signs are pronounced 'a'.
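Gal does not list the six groups in the passage above, so the mapping below is our assumption: a plausible clustering of the vocalization marks by their modern pronunciation (compare Table 4.4), with one group per vowel sound plus one for the Shva.

    import java.util.Map;

    public class VowelSoundGroups {
        // Assumed clustering by modern pronunciation; Gal's exact groups may differ.
        static final Map<Character, String> GROUP = Map.ofEntries(
            Map.entry('\u05B7', "a"),   // Patah
            Map.entry('\u05B8', "a"),   // Kamats (the rarer Kamats Katan is really 'o')
            Map.entry('\u05B2', "a"),   // Hataf Patah
            Map.entry('\u05B5', "e"),   // Tsere
            Map.entry('\u05B6', "e"),   // Segol
            Map.entry('\u05B1', "e"),   // Hataf Segol
            Map.entry('\u05B4', "i"),   // Hiriq
            Map.entry('\u05B9', "o"),   // Holam
            Map.entry('\u05B3', "o"),   // Hataf Kamats
            Map.entry('\u05BB', "u"),   // Kubuts
            Map.entry('\u05B0', "shva") // Shva (Na or Nach)
        );
    }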


In 2003, Spiegel and Volk used a neural network to address Hebrew automatic vocalization [37]. Again, the Westminster Database was exploited: half of the book of Genesis was used as the corpus, with a 90%/10% training/testing split. In their system, Spiegel and Volk restored exactly one vowel per letter; therefore, restoration of the Dagesh, the Mapik and apparently also the Shin dots (see 4.1.2 for further details on these diacritics) was not addressed. The neural network used included a hidden layer with 300 nodes and was given non-vocalized words with their corresponding morphology as input (no additional contextual input was used). Spiegel and Volk report 74% accuracy per letter when morphology is not used, 85% letter accuracy with morphology, and 87% accuracy per letter when vowels are clustered, as previously achieved by Gal [13].

In other languages, some attempts were made to restore vowels. In 2005, Yarowsky attempted to restore accents in Spanish and French [40], and [28, 32, 42] attempted to restore vowels in Arabic. Yet, these problems seem to be significantly simpler than the Hebrew vocalization problem, due to the sheer number of existing Hebrew vowel types and the higher ambiguity rate [3].

Generally, it seems that the lack of publicly available, modern, vocalized data poses a significant obstacle for automatic vocalization. Both Kontorovich and Gal claim that a substantial percentage of the mistakes made by their systems resulted from the irregularity of the (ancient) text that was used as the corpus. On the other hand, all the Hebrew-related systems discussed ignored the problem of segmenting the input text into words, and relied on the segmentation given by the Westminster Database. Moreover, some systems [23, 37] directly use the manually assigned POS tags and morphology given in the Westminster Database. For these reasons, it seems difficult to assess the true capability of these systems when handling modern, non-annotated text.

3.4 Academic Attempts on Syllable Segmentation

In 2001, Müller [27] used a probabilistic CFG (PCFG) for automatic detection of syllable boundaries in Dutch. The PCFG was automatically assembled according to a pronunciation dictionary which provides syllable boundaries (to the best of our knowledge, such a resource is not available for Hebrew). Müller reports that 96.4% of the words were syllabified correctly.

In 2002, Finkel and Stump [12] attempted to mark the stressed syllable in Hebrew verbs as part of their work. The method they use incorporates the heuristic developed by the linguist Rabbi Eliyahu Behor (which is described in detail in 6.2.1). Yet, no measure of success is provided.

In 2008, Bartlett, Kondrak and Cherry [7] used structured SVMs for automatic syllabification of English, German and Dutch. Training data was gathered from the CELEX annotated dataset (the same used by Müller). Here, 85.5%, 99% and 98% of the words were syllabified correctly for English, German and Dutch respectively.

Some attempts were also made for less studied languages such as Uyghur [33], where Saimaiti and Feng describe a rule-based algorithm that achieves 98.7% word accuracy.

3.5 Generation in Hebrew

In 2002, Finkel and Stump [12] used default inheritance hierarchies to model the inflectional system of Hebrew verbs. The hierarchy was expressed in the formal language KATR, such that given a query which includes some morphosyntactic attributes, the output is a fully vocalized verb. Moreover, Finkel and Stump also distinguish Shva Na from Shva Nach, and provide the stress location in their output. Yet, seemingly no accuracy checks were conducted for any of the parts of this system, and, therefore, the quality of the output is unclear. In addition, the authors state that only a portion of the total generation task for verbs was actually implemented. Moreover, it seems the "out of the ordinary" types of inflections are the ones left out, so it is difficult to assess the true advantage of using KATR.

In 2010, Dannélls and Camilleri [10] implemented a mechanism for generating verb inflections. This mechanism was implemented for both Hebrew and Maltese, which are both Semitic languages with some resembling characteristics. In their system, Dannélls and Camilleri do not provide vocalization; therefore, the number of inflection patterns that ought to be implemented for completely covering the Hebrew verb inflectional model drops significantly. Again, not all types of verb inflection patterns were implemented and no accuracy checks were conducted.

3.6 Conclusions

Overall, to the best of our knowledge, it seems neither commercial systems nor academic attempts at automatic vocalization have tried using an extensive generation mechanism as part of their methods. As noted by Kontorovich [23], existing systems which rely on partial or inconsistent dictionaries may be substantially crippled. Moreover, a generative model simplifies error pruning and increases the speed at which a fully tagged dictionary may be assembled.

Concerning syllable segmentation, due to the lack of existing annotated data in Hebrew, in this work we will attempt to develop an accurate rule-based system for syllabification.

Regarding verb classification into patterns (Binyanim) or paradigms, we are not aware of any previous work.


Part II


Chapter 4

Background - Principles of Vocalization in Hebrew

4.1 Linguistic Definitions

The following linguistic description is based in general on [1, 4, 5, 6, 11, 29, 31].

4.1.1 Hebrew Letters

The Hebrew alphabet uses 22 letters and 5 additional final letters, as displayed in Table 4.1. In addition, modern spoken and written Hebrew uses 2 more letters and one more final letter, as displayed in Table 4.2.

4.1.2 Vocalization Signs in Hebrew

Unlike in other languages, vowels in Hebrew are not displayed as independent letters, but as vocalization signs decorating the letters. Hebrew vocalization signs may be used to define several attributes of a given letter:

The Letter Function as a Consonant

Consonants are clustered into types that indicate the part of the diction system by which the consonant is pronounced, as displayed in Table 4.3 [29]. The five consonants that originate from the throat are called the guttural consonants (האותיות הגרוניות), and unlike most of the other consonants, their influence over vocalization is substantial.


Table 4.1: Hebrew letters

Name | Writing | Final | Spoken sound | Additional spoken sound
Alef | א | - | A | -
Bet | ב | - | B | V
Gimel | ג | - | G | -
Dalet | ד | - | D | -
Hey | ה | - | H | -
Vav | ו | - | W | V
Zain | ז | - | Z | -
Het | ח | - | No parallel in English | -
Tet | ט | - | T | -
Yod | י | - | Y | -
Kaf | כ | ך | K | No parallel in English
Lamed | ל | - | L | -
Mem | מ | ם | M | -
Nun | נ | ן | N | -
Samech | ס | - | S | -
Ayin | ע | - | No parallel in English | -
Peh | פ | ף | P | F
Tsadi | צ | ץ | TS | -
Kuf | ק | - | K | -
Reish | ר | - | R | -
Shin | ש | - | S | SH
Tav | ת | - | T | -

Table 4.2: Additional letters

Name | Writing | Final | Spoken sound | Additional spoken sound
Jimel | ג' | - | J | -
Chadi | צ' | ץ' | CH | -

A letter functioning as a consonant will either be vocalized with a Shva (◌ְ), or, if it is the last letter of the word, it will not be vocalized. The letters Kaf and Tav are exceptions: a consonant Kaf that is the last letter in the word will be vocalized with a Shva, and a consonant Tav that is the last letter in the word will be vocalized with a Shva if it belongs to a 2nd person, past, feminine, singular verb.

There exist two types of Shva in Hebrew: Shva Nach (שווא נח) and Shva Na (שווא נע). The distinction between the two types of Shva is necessary for syllable segmentation and for Dagesh Kal (דגש קל) positioning.


Table 4.3: The types of consonants

Origin | Letters
Throat | א, ה, ח, ע, ר
Palate | ג, י, כ, ק
Tongue | ד, ט, ל, נ, ת
Teeth | ז, ס, צ, שׁ, שׂ
Lips | ב, ו, מ, פ

A Shva may be identified as a Shva Na (שווא נע) or a Shva Nach (שווא נח) according to the following rules:

• A Shva is a Shva Na if it vocalizes the first letter in the word.

• A Shva is a Shva Nach if it vocalizes the last letter in the word.

• Two consecutive Shva instances at the end of the word are both Shva Nach.

• Any other type of Shva may be identified as Na or Nach by its presence in the origin form of the word (the absolute state for a noun, and the inflection with the same tense in its singular, masculine, 3rd person form, for a verb):

  - If the Shva is present in the origin form, it is a Shva Nach.

  - If the Shva vocalizes the letter that was last in the origin form, and this letter was not vocalized or was vocalized by a Patah Ganuv, it is a Shva Nach.

  - Otherwise, it is a Shva Na.

For example, the word נבלבל (Nevalbel), which means "(we) will confuse", includes two appearances of Shva: the first is a Shva Na (since it vocalizes the first letter of the word), and the second is a Shva Nach (due to its presence in יבלבל - the singular, masculine, 3rd person, future inflection).

In most cases a letter vocalized by a Shva is pronounced as a plain consonant, yet in a few cases a Shva denotes that the letter may be pronounced with an "E" vowel. This phenomenon is relatively common for Shva Na (for example, in נבלבל (Nevalbel) the first Shva is pronounced as "E") and rather rare for Shva Nach (e.g., in מעדת (Maadet), which means "(you, feminine) lost balance", the first Shva is a Shva Nach and is pronounced as "E").


The Letter Function as a Certain Vowel

Hebrew uses 9 vocalization marks to describe vowels, and 3 more marks for semi-vowels (Hataf Kamats, Hataf Patah and Hataf Segol). In the past, each vocalization sign corresponded to a unique vowel sound; modern Hebrew, on the other hand, uses the same sound for multiple vocalization marks, as shown in Table 4.4.

Semi-vowels (Hataf Patah, Hataf Kamats and Hataf Segol) are basically vowels that are pronounced very similarly to their corresponding vowel (Hataf Patah to Patah, Hataf Kamats to Kamats Katan and Hataf Segol to Segol), only in a shorter manner. In modern spoken Hebrew, the distinction between vowels and semi-vowels is mild, and so vowels from the same sound group (see Table 4.4) are pronounced in the same manner.

There exist two unique vowels that cannot be applied to just any letter: the Shuruk and the Holam Male can only be applied to Vav (וּ and וֹ).

Another exception is the case of Patah Ganuv: a Patah Ganuv is a Patah that vocalizes a Hey, Het or Ain that is the last letter in the word. Such a Patah is pronounced differently; unlike all the other vowels, its "A" sound is pronounced before the sound of the letter it vocalizes.

Table 4.4: Vocalization vowels and semi-vowels

Sound group | Vocalization sign | Name
A | ◌ָ | Kamats
  | ◌ַ | Patah
  | ◌ֲ | Hataf Patah (semi-vowel)
E | ◌ֶ | Segol
  | ◌ֵ | Tsere
  | ◌ֱ | Hataf Segol (semi-vowel)
I | ◌ִ | Hirik
U | ◌ֻ | Kubuts
  | וּ | Shuruk
O | ◌ֹ | Holam
  | וֹ | Holam Male
  | ◌ָ | Kamats Katan
  | ◌ֳ | Hataf Kamats (semi-vowel)


Changing the Pronunciation of the Letter

Several vocalization signs change the way certain letters are pronounced.

• Dagesh: Basically, the Dagesh (◌ּ) is used for emphasizing letters, yet modern Hebrew has neglected this emphasis for most of the letters. Today, a Dagesh is noticeable only when it vocalizes 3 letters: Bet, Kaf and Peh. Table 4.5 displays the influence of the Dagesh on the pronunciation of these letters. Several letters cannot be applied with a Dagesh: Alef (א), Hey (ה), Het (ח), Ain (ע) and Reish (ר). A Dagesh may belong to one of two types, Dagesh Kal (דגש קל) and Dagesh Hazak (דגש חזק).

Dagesh Hazak is either a Dagesh that is structured in the general pattern corresponding to the word (Mishkal for nouns and Binyan for verbs), or a Dagesh resulting from some linguistic phenomenon. For example, the noun מתנה (Matana), which means "present", includes a Dagesh Hazak that is structured in the Mishkal of the noun, while the Dagesh in the word נחתי (Nahatty), which means "(I) landed", is created through the linguistic phenomenon called unification: נחת + תי -> נחתתי -> נחתי.

Dagesh Kal may appear only in a small subset of letters (Bet ב, Gimel ג, Dalet ד, Kaf כ, Peh פ and Tav ת), and only if the letter is either the first one in the word or follows a Shva Nach. For example, the word תשתלב (Tishtalev), which means "(you) will fit in", includes 3 occurrences of Dagesh. The first is a Dagesh Kal, since it vocalizes the letter Tav and is positioned as the first letter of the word. The second Dagesh is also a Dagesh Kal, since it vocalizes a Tav that follows a Shva Nach. The third Dagesh is not a Dagesh Kal, since it vocalizes a Lamed and not one of the ב, ג, ד, כ, פ, ת letters; therefore, it is a Dagesh Hazak.
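The Dagesh Kal condition above reduces to a one-line predicate. This is an illustrative fragment only; it assumes the caller already knows whether the preceding letter carries a Shva Nach (precedingShva == null means the previous letter carries no Shva at all).

    final class DageshRules {
        enum ShvaType { NA, NACH }

        /** true if a Dagesh on this letter can be a Dagesh Kal. */
        static boolean isDageshKal(char letter, int index, ShvaType precedingShva) {
            boolean bgdkft = "בגדכפת".indexOf(letter) >= 0;   // ב,ג,ד,כ,פ,ת only
            return bgdkft && (index == 0 || precedingShva == ShvaType.NACH);
        }
    }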

Table 4.5: Dagesh sound manipulations

Name | With/without Dagesh | Letter sound
Bet | בּ | B
    | ב | V
Kaf | כּ | K
    | כ | No parallel in English
Peh | פּ | P
    | פ | F

• Mapik: The Mapik (הּ) sign indicates a consonant Hey at the end of the word. The Mapik emphasises the pronunciation of the Hey it vocalizes, yet its effect in modern Hebrew is rather mild. In many cases, a Mapik denotes a female possessor.

• Shin dots: In vocalized text the letter Shin will always be accompanied by a right or a left Shin dot (שׁ or שׂ). The Shin dot indicates whether the letter should be pronounced as SH (שׁ) or as S (שׂ).


4.1.3 Syllables

Each Hebrew word is composed of one or more sequences of letters called syllables. A syllable is a phonological entity that is pronounced in one effort. Each syllable includes exactly one vowel, and may contain one or more consonants.

There exist two conventions for segmenting a word into syllables [11]: one regards letters vocalized by Shva Na, Hataf Patah, Hataf Kamats and Hataf Segol as vowels, while the second regards these as consonants. For example, let us segment the word רואינתי (Roayanti), which means "(I) was interviewed", into syllables. By the first convention, the Hataf Patah (◌ֲ) is regarded as a vowel, and, therefore, the word is segmented as רו-א-ינ-תי (Ro-a-yan-ti). The second convention, though, treats ◌ֲ the same way as a Shva, and, therefore, the resulting segmentation is רו-אינ-תי (Ro-ayan-ti).

Syllables are divided into two types, open syllables and closed syllables. An open syllable is a syllable ending with a vowel, and a closed syllable ends with a consonant. For example, in רו-אינ-תי, the syllable אינ is considered closed since it ends with a Shva, while תי and רו are considered open because they end with a vowel. Note that the syllable תי is regarded as ending with a vowel although it ends with a non-vocalized letter; this occurs because there exist four cases where a vowel is not created by a single letter and its associated vowel marking. These cases are ◌ִי (Hirik Male), ◌ֵי (Tsere Male), וֹ (Holam Male) and וּ (Shuruk Male). In two of these cases (◌ִי and ◌ֵי) the syllable ends with a non-vocalized letter, and yet the syllable is regarded as open due to the phonological role of these forms as vowels.

Generally, the segmentation of a word is performed according to the way the word is pronounced; each exhalation effort corresponds to a syllable. Another way to segment a word into syllables is by using vowel markings (as mentioned, each syllable includes exactly one vowel) and consonant markings (a Shva Na denotes the beginning of a syllable, and a Shva Nach indicates a syllable ending) as indicators.
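The marking-based procedure can be sketched as follows, under the second convention above. This is an illustrative fragment, not the thesis implementation; it assumes the Shva instances have already been classified (the hard part, addressed in Chapter 6), and it treats Hataf signs as pre-classified like a Shva Na.

    import java.util.ArrayList;
    import java.util.List;

    final class Syllabifier {
        enum Shva { NA, NACH }

        /** vowel == null means unvocalized; shva != null marks a Shva (or a
         *  Hataf treated like one) and overrides the vowel reading. */
        record Letter(char ch, String vowel, Shva shva) {}

        static List<List<Letter>> segment(List<Letter> word) {
            List<List<Letter>> syllables = new ArrayList<>();
            List<Letter> current = new ArrayList<>();
            boolean hasVowel = false;                 // current syllable's vowel seen?
            for (Letter l : word) {
                boolean isVowel = l.vowel() != null && l.shva() == null;
                // a Shva Na opens a syllable; so does a second vowel, since
                // every syllable contains exactly one vowel
                boolean opens = l.shva() == Shva.NA || (isVowel && hasVowel);
                if (opens && !current.isEmpty()) {
                    syllables.add(current);
                    current = new ArrayList<>();
                    hasVowel = false;
                }
                current.add(l);                       // Shva Nach letters simply stay
                if (isVowel) hasVowel = true;         // and close the running syllable
            }
            if (!current.isEmpty()) syllables.add(current);
            return syllables;
        }
    }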


4.1.4 Stress

Generally, a Hebrew word is stressed according to one of two stress schemes, Milel (מלעיל) or Milra (מלרע). The Milel stress scheme denotes that the stress is located on the syllable preceding the last syllable, and Milra denotes that the stress is located on the last syllable.

For example, the word תינקת (Tinoket), which means "baby girl", is pronounced by the Milel stress scheme, and, therefore, is pronounced תי-נ-קת (Ti-no-ket), where the underscore denotes the stress position. On the other hand, מורה (Moreh), which means "teacher", is stressed according to Milra, and, therefore, is pronounced מו-רה (Mo-reh).

The vast majority of words in Hebrew are pronounced with the Milra stress [29], and a few words in modern Hebrew are stressed by neither Milel nor Milra (meaning the stress is positioned at an earlier syllable). Stress may sometimes be inferred by inverting the syllables and vowels rule, presented in 4.1.6.

4.1.5 Deficient Spelling vs. Plene Spelling

The Hebrew spelling scheme includes, in many cases, more than one valid form for writing a word. The letters Vav (ו) and Yod (י) may, in some cases, be omitted from the word's spelling. For example, the word אירפא (Erape), which means "(I) will be healed", may also be written in the following (deficient) way: ארפא.

Actually, in past times only the deficient spelling was regarded as valid, but due to the difficulty in reading a non-vocalized, deficiently spelled word, the Plene spelling evolved around the Middle Ages. Today, Plene writing dominates common written Hebrew, yet many texts use a mixture of Plene and deficient spelling.

4.1.6 Vocalization Rules

From a linguistic point of view, determining the way a given word should be vocalized relies heavily on the general sound of the word, the location of the stress and the segmentation of the word into syllables. Surprisingly, even when these are known, determining the vocalization is still regarded as a difficult task; this results mainly from two issues:

• Each vowel sound in Hebrew corresponds to more than one vocalization sign.

• Some vocalization signs gradually changed throughout history, such that nowadays they do not invoke any change in the pronunciation of the word. Obviously, a correct and complete vocalization does include these signs.


Table 4.6: Long vs. short vowels

Sound group | Long vowel | Short vowel
A | ◌ָ (Kamats) | ◌ַ (Patah)
E | ◌ֵי (Tsere Male), ◌ֵ (Tsere) | ◌ֶ (Segol)
I | ◌ִי (Hirik Male) | ◌ִ (Hirik)
U | וּ (Shuruk) | ◌ֻ (Kubuts)
O | וֹ (Holam Male) | ◌ָ (Kamats Katan)

The syllables and vowels rule [11]: If the pronunciation, the stress and the syllable segmentation are known, a guideline for general vocalization called "the syllables and vowels rule" (כלל ההברות והתנועות) can be applied (see Algorithm 1). The syllables and vowels rule determines the type of vowel (a short vowel or a long vowel, as shown in Table 4.6) a given syllable will get. Since the pronunciation of the syllable is known, the correct vocalization for the syllable can then be selected based on the sound group.

Algorithm 1 The syllables and vowels rule (כלל ההברות והתנועות)

Require: A stressed/non-stressed syllable (s)
  if s is a non-stressed syllable then
    if s is an open syllable then
      return Vocalize s with a long vowel (according to Table 4.6)
    else
      return Vocalize s with a short vowel (according to Table 4.6)
    end if
  else
    return In most cases s should be vocalized with a long vowel (according to Table 4.6), yet the number of exceptions is considerable
  end if
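A direct transcription of Algorithm 1 in code, for illustration only; the stressed branch simply returns a long vowel, glossing over the considerable exceptions the algorithm itself notes.

    final class SyllablesAndVowelsRule {
        enum VowelLength { LONG, SHORT }

        /** Picks the vowel length for a syllable, per Table 4.6. */
        static VowelLength vowelLengthFor(boolean stressed, boolean open) {
            if (!stressed) {
                // non-stressed: open syllables take a long vowel, closed a short one
                return open ? VowelLength.LONG : VowelLength.SHORT;
            }
            // stressed: in most cases a long vowel (many exceptions exist)
            return VowelLength.LONG;
        }
    }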

Despite the above, the syllables and vowels rule serves only as a general guideline for vocalizing Hebrew words, and does not give a complete solution for vocalization, for several reasons:

• Hebrew includes a vast number of special or unique cases in which the syllables and vowels rule cannot be applied and specially customized rules must be used; therefore, vocalization requires considerable acquaintance with such exceptional cases.

• The rule assumes that whoever uses it knows, for each syllable, whether it is an open or a closed syllable. This may obviously prove difficult, given a non-vocalized word.

• The syllables and vowels rule does not determine when semi-vowels should be used.

In Table 4.7, the syllables and vowels rule is applied to several words as an example.


Table 4.7: Application of the syllables and vowels rule

Word | Pronunciation | Meaning | Syllables | Stressed syllable | Vocalization
עכבר | Achbar | Mouse | עכ-בר / Ach-bar | בר / bar | עַכְבָּר
נהר | Nahar | River | נ-הר / Na-har | הר / har | נָהָר
ספר | Sefer | Book | ס-פר / Se-fer | ס / Se | סֵפֶר
לילה* | Laila | Night | לי-לה / Lai-la | לי / Lai | לַיְלָה
דלת* | Delet | Door | ד-לת / De-let | ד / De | דֶּלֶת

* Denotes an exception to the syllables and vowels rule.

4.2 The Case of Verbs

A Hebrew verb is composed of a stem under some inflection. The stem includes 3 letters in most cases, but 4 and 5 letter stems also exist. A stem may be inflected into an actual Hebrew verb by adding/removing/replacing letters and by setting vocalization signs; the resulting inflection corresponds to a set of morphological attributes:

• Tense: Past / Beinoni (Participle) / Present / Future / Imperative.

• Gender: Masculine / Feminine / Both.

• Person: First / Second / Third.

• Number: Singular / Plural.
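These attribute sets can be encoded compactly. The following record is one illustrative encoding, not the thesis' actual data model; the base form defined next is then just one distinguished attribute set.

    enum Tense { PAST, BEINONI, PRESENT, FUTURE, IMPERATIVE }
    enum Gender { MASCULINE, FEMININE, BOTH }
    enum Person { FIRST, SECOND, THIRD }
    enum GramNumber { SINGULAR, PLURAL }

    /** One full morphological attribute set for a verb inflection. */
    record Morphology(Tense tense, Gender gender, Person person, GramNumber number) {
        /** The base form: past, masculine, 3rd person, singular. */
        static Morphology baseForm() {
            return new Morphology(Tense.PAST, Gender.MASCULINE,
                                  Person.THIRD, GramNumber.SINGULAR);
        }
    }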

The base form of a verb is regarded as the inflection corresponding to the morphology past, masculine, 3rd person, singular. The formation of a verb inflection is usually perceived as a transformation of the base form, or of the stem.

Each set of morphological attributes derives a "typical" way of inflection, meaning that in many cases a given morphology (and a pattern - בניין) derives the addition/removal of certain letters and a specific pattern of vocalization signs. Yet, since Hebrew includes many out-of-the-ordinary verbs, the vocalization of inflected verbs remains a difficult task.

Since the Hebrew verb inflections include added/removed letters and vocalization signs that indicate the inflection's morphology, the number of possible inflections may be extremely high. Moreover, the fact that many inflections may be written in either Plene or deficient spelling, and that some inflections may be vocalized in multiple ways, causes some verbs to have an exceptional number of inflections, up to several hundreds.


4.2.1 Patterns - Binyanim

The Hebrew verb inflectional system is based on patterns (Binyanim - בניינים); each pattern (Binyan) corresponds to a set of templates in which the verb may be inflected. There exist 7 patterns: פעל (Paal), נפעל (Nifal), פיעל (Piel), פועל (Pual), התפעל (Hitpael), הפעיל (Hifil) and הופעל (Hufal) [6, 31].

The letters פ,ע,ל that are present in all the pattern names represent the stem letters that are cast into the pattern to form a valid inflection of the stem. The פ,ע,ל letters are called פ' הפועל (Peh Hapoal - the Peh of the verb), ע' הפועל (Ain Hapoal - the Ain of the verb) and ל' הפועל (Lamed Hapoal - the Lamed of the verb), respectively. Stems with more than 3 letters are regarded as having more than one Ain Hapoal (in such cases they are regarded as ע1, ע2, etc.).

As displayed in Figure 4.1, four of the patterns are regarded as light patterns - Paal, Nifal, Hifil and Hufal (הבניינים הקלים) - and the other 3 are regarded as heavy - Piel, Pual and Hitpael (הבניינים הכבדים). The light patterns include a Dagesh Hazak in their vocalization templates only in very specific circumstances, while the heavy patterns include a Dagesh Hazak at the Ain of the verb in the majority of cases.

Figure 4.1: Hebrew patterns. [A tree diagram: the root Verb (פועל) splits into the light patterns (הבניינים הקלים) - Paal (פעל), Nifal (נפעל), Hifil (הפעיל) and Hufal (הופעל) - and the heavy patterns (הבניינים הכבדים) - Piel (פיעל), Pual (פועל) and Hitpael (התפעל).]

Some stems can be inflected according to several patterns. For example, the stem שמ"נ can be inflected by Paal to produce a verb that means "(he) got fat", by Piel to produce a verb that means "(he) lubricated", by Pual to produce a verb that means "(he) was lubricated" (passive), and by Hifil to produce a verb that means "(he) was getting fat" (progressive).

Given a stem and a pattern, we expect to know the scheme for creating an actual verb, but in practice the task of generating the fully vocalized inflections (each inflection corresponds to a possible morphology) is far from easy. Each pattern is associated with dozens of inflection tables that describe how each type of stem should be inflected. For example, the Paal pattern includes over 50 distinctive inflection tables (according to [31]); each table describes the inflection pattern for a certain type of stem.


4.2.2 Inflection Tables

Even when given a pattern and a complete set of morphological attributes, the exact letters added/removed, and the vocalization by which the inflection is formed, remain ambiguous. This results from the fact that the template by which the word is inflected is influenced by the letters of the stem. Each set of letters forming a stem corresponds to an inflection table that describes the manner in which the given stem should be vocalized (see [6, 31]). So in practice, each pattern is associated with multiple inflection tables. The inflection tables divide into paradigms of stems; each paradigm corresponds to a family of stems, each family with specific attributes. In practice, each paradigm is divided into sub-paradigms (that correspond to specific inflection tables), but here we present only the top-level paradigms.

The verb paradigms (גזרות הפועל)

• Complete paradigms (גזרות השלמים): Paradigms for which all the letters of the stem are maintained in every inflection of the verb. For example, every inflection of the stem גד"ל according to the Paal pattern includes all the stem letters.

• Crippled paradigms (גזרות נחות): Paradigms for which some inflections include letters (from the stem) that are written but not pronounced, or inflections which replace a letter from the stem with another letter. For example, the inflection of the stem אכ"ל by the Paal pattern to the morphology future, 1st person, plural (which means "(we) will eat") is נאכל (Nochal), where the Alef is not pronounced [31].

• Defective paradigms (גזרות חסרות): Paradigms for which some inflections are missing a letter that is present in the stem. For example, the inflection of the stem נס"ע by the Paal pattern to the morphology imperative, masculine, 2nd person, singular is סע (which means "drive"); it is obviously missing the first letter of the stem.

• Double paradigms (גזרות הכפולים): Paradigms for which Ain Hapoal (the second letter in the stem) and Lamed Hapoal (the third letter in the stem) are identical. For example, the stem סב"ב has a Bet both as its Ain Hapoal and as its Lamed Hapoal.

• Compound paradigms (גזרות מורכבות): Paradigms which include stems that comply with more than one type of the other paradigms. For example, the stem אפ"ה is regarded as both crippled and defective [31].

Table 4.8 displays the inflection table of the stem שפ"כ (a morpheme that means "to spill"), which corresponds to the complete paradigm in the Paal pattern. The table describes all the inflected verbs and their proper morphology. Table 4.9 displays the inflection table of the stem גד"ל (also in the Paal pattern and the complete paradigm).


Clearly, some inflections with identical morphological attributes differ between the tables in more than just the letters of the stem. For example, Table 4.9 does not include any Beinoni inflections, as opposed to Table 4.8. Moreover, the present, future and imperative templates of inflection are also different.

Table 4.8: A complete paradigm from the Paal pattern - inflections of שפ"כ

Number | Singular | Plural
Person | 1 | 2 | 3 | 1 | 2 | 3
Gender | M F | M F | M F | M F | M F | M F

Past שµפÇתי! שµפÇת! שµפÇת! שµפ�! שµפÈה! שµפÇנו! !MתÇש פ !NתÇש פ שµפכו!Present שופ�! שופכת! שופ�! שופכת! שופ�! שופכת! !Mשופכי שופכות! !Mשופכי שופכות! !Mשופכי שופכות!Beinoni שµפו�! ש פוÈה! שµפו�! ש פוÈה! שµפו�! ש פוÈה! !Mש פוכי ש פוכות! !Mש פוכי ש פוכות! !Mש פוכי ש פוכות!Future אש פו�! תש פו�! תש פכי! י¢ש פו�! תש פו�! נ¢ש פו�! תש פכו! י¢ש פכו!

אש פ�! תש פ�! � י¢ש פ�! תש פ�! נ¢ש פ�! � תש פוÉÇה! � תש פוÉÇה!� � � � � � � � � תש פÉÇה! � תש פÉÇה!

Imperative � � ש פו�! ש¤פכי! � � � � ש¤פכו! � �

� � ש פ�! � � � � � � ש פוÉÇה! � �

� � � � � � � � � ש פÉÇה! � �

Table 4.9: A complete paradigm from the Paal pattern - inflections of גד"ל

Number | Singular | Plural
Person | 1 | 2 | 3 | 1 | 2 | 3
Gender | M F | M F | M F | M F | M F | M F

Past ג³ד¯לתי! ג³ד¯לת! ג³ד¯לת! ג³ד¯ל! ג³ד�לה! ג³ד¯לנו! !Mג�ד¯לת !Nג�ד¯לת ג³ד�לו!Present ג³ד§ל! ג³ד§לה! ג³ד§ל! ג³ד§לה! ג³ד§ל! ג³ד§לה! !Mג�ד§לי ג�ד§לות! !Mג�ד§לי ג�ד§לות! !Mג�ד§לי ג�ד§לות!Future אג�ד¯ל! תג�ד¯ל! תג�ד�לי! י¢ג�ד¯ל! תג�ד¯ל! נ¢ג�ד¯ל! תג�ד�לו! י¢ג�ד�לו!

� � � � � � � � � תג�ד¯לÉה! � תג�ד¯לÉה!Imperative � � ג�ד¯ל! ג¢ד�לי! � � � � ג¢ד�לו! � �

� � � � � � � � � ג�ד¯לÉה! � �

4.3 Conclusions

The Hebrew vocalization scheme includes a vast network of rules that are based on knowing the pronunciation of the word and the word's syllable segmentation. If these are known, then the syllables and vowels rule may be applied for vocalization, but since Hebrew includes a great deal of unique and special cases, the syllables and vowels rule serves only as a basic heuristic for vocalization. No shortcuts are at hand; a correct and complete vocalization requires a rare expertise that is held today only by linguists and experts.

Inflection tables describe the way a stem or a base form of a verb (see 4.2) can be inflected. Selecting an appropriate inflection table for a stem/base form depends both on the pattern and on the letters forming the stem.


Chapter 5

Datasets

5.1 Verbs List

As a part of this work, we used a manually constructed list including over 4k base forms of verbs (deficiently spelled verbs in their past, masculine, singular, 3rd person form). Verbs in the list are non-vocalized, but do include Shin dots and a corresponding inflection table indicator (see appendix A).

5.2 Morphologically Analysed Corpora

Using the morphological tagger [3], we obtained a list including approximately 50 million Hebrew words that are fully morphologically disambiguated. The corpora include materials from the "Haaretz" newspaper, the "Tapuz" website, the "TheMarker" newspaper, the documentation of the "Kneset" (the Israeli legislature) discussions, and more.


Part III


Chapter 6

Method

6.1 Generation

The first task we confronted was generating fully vocalized Hebrew verbs along with their corresponding morphology.

In 2002, the first release of Hspell [17] came to light; Hspell aims to implement a free Hebrew spell checking system. Doing so required the Hspell developers to devise a way to obtain a significant quantity of perfectly spelled words. Obtaining such a word list turned out to be a complicated task, since existing datasets are not comprehensive enough, and automatically gathering words from existing texts may also collect mistakes. For this reason, Hspell decided to develop a generation mechanism that inflects base forms into words.

The Hspell generation mechanism creates inflected, non-vocalized nouns, verbs and adjectives, with their corresponding morphological properties. Yet all the generated inflections are non-vocalized and include only deficiently spelled words (as defined by the Academy of the Hebrew Language [1]). As mentioned in 4.1.5, Plene spelling is widespread in modern written Hebrew, making Hspell's convention rather harsh.

In [6, 31] the Hebrew verb inflectional model is presented via representative inflection tables. For example, [31] describes 264 inflection tables corresponding to the different patterns and paradigms. Despite this, Hspell's decision to limit its scope to non-vocalized, deficiently spelled text narrowed down the needed manual labour significantly, since only a limited subset of the 264 inflection tables had to be implemented. Applying this approach, Hspell developed a vast reserve of inflected Hebrew verbs with a minimal error rate and a corresponding morphological attribute set for each word. On the other hand, Hspell's harsh (deficient spelling) convention and the lack of vocalization signs call for a less strict generation system that also produces the appropriate vocalization signs.


We decided to take Hspell's path and implement a comprehensive system for the generation of vocalized Hebrew verbs with corresponding morphological attributes. As opposed to Hspell, our system also generates Plene spelling and alternative valid forms of writing/vocalizing (in addition to the deficient writing). For example, Table 6.1 shows all the valid ways to spell and vocalize the word מועמד (Moamad), which means "candidate". All the inflections presented in Table 6.1 have the same morphology.

Table 6.1: The valid forms of מועמד

Plene | Deficient
מועמד | מֻעמד
מועמד | מֻעמד
מועמד | מעמד
מועמד | מעמד

In such cases, when several valid ways of writing/vocalizing a verb exist, our system generates all valid forms of the inflection. The input of our mechanism includes two parameters:

• The base form of a verb: Here we use a standard Unicode representation of the non-vocalized, deficiently spelled base form. One exception to this standard representation is the letter Shin (ש), which is the only letter in Hebrew that actually represents two distinct letters: Shin (שׁ) and Sin (שׂ). We treat Shin and Sin as unique, since Shin never transforms into Sin and vice versa, unlike Bet, Kaf and Peh, which change their pronunciation depending on their relative position in the word and their preceding vocalization (ב/בּ, כ/כּ and פ/פּ). For this reason we used ש to represent Shin, and ש' to represent Sin.

• A corresponding inflection table: As part of our generation model, we implemented the vast majority of the existing inflection tables in Hebrew [6, 31]. 264 inflection tables were manually implemented to inflect an input base form of a verb into all possible inflections. An average inflection table generates 60 inflections, yet some tables correspond to exceptional base forms that may form over 300 inflections. Some less common inflection tables were not developed, but implementing them involves a minimal amount of manual labour, since such tables are, in most cases, very similar to more common tables (which are already implemented). Since the inflection tables were implemented in Java, the built-in class inheritance may be used for extending our model and developing new tables in a quick and simple manner, as sketched below.
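The following is a hedged sketch of how that inheritance can look; the class names and method signatures are illustrative, not the thesis' actual hierarchy.

    import java.util.ArrayList;
    import java.util.List;

    /** One generated inflection (illustrative). */
    record Inflection(String vocalized, String morphology, boolean plene) {}

    abstract class InflectionTable {
        /** Expected base-form letter count, used to validate the input. */
        abstract int expectedBaseFormLength();

        /** All fully vocalized inflections of baseForm, with their morphology. */
        abstract List<Inflection> inflect(String baseForm);
    }

    /** A common table; a real table fills in dozens of inflection templates. */
    class PaalCompleteTable extends InflectionTable {
        @Override int expectedBaseFormLength() { return 3; }
        @Override List<Inflection> inflect(String baseForm) {
            List<Inflection> out = new ArrayList<>();
            // ... apply each template of this table to baseForm ...
            return out;
        }
    }

    /** A rarer table derived from a similar one, overriding only what differs. */
    class PaalGutturalTable extends PaalCompleteTable {
        @Override List<Inflection> inflect(String baseForm) {
            List<Inflection> out = super.inflect(baseForm);
            // ... adjust the vocalization around the guttural letter here ...
            return out;
        }
    }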


The vast majority of frequently used verbs may be inflected by using the appropriate inflection table to create the fully vocalized verb, its corresponding morphology and its spelling indicator (Plene/deficient).

The next phase of creating the automatic generation mechanism involved creating a comprehensive list which correlates base forms to inflection tables. Over 4k non-vocalized base forms were manually correlated to their appropriate inflection table. Some base forms that may be inflected by more than one table were correlated to all their corresponding inflection tables.

Applying our inflection mechanism over the comprehensive base form list generated a staggering 240k fully vocalized inflected verbs, along with a corresponding set of morphological attributes for each verb.

In addition to the generated verb inflections, our system also generates for each base form (given the appropriate inflection table) its proper infinitive form. The infinitive inflection is also fully vocalized, and is generated in all valid forms (Plene spelling, deficient spelling, and alternative writing/vocalization options).

The list of annotated verbs (4,000) and their fully inflected and vocalized forms (over 240,000) is provided at <http://www.cs.bgu.ac.il/∼nlpproj/hebrew-vocalization/>. It provides a significant new resource for the analysis of inflected words into their morphological attributes (through lookup) and for the generation of inflected, vocalized verbal forms.


6.2 Syllable Segmentation

Our second goal was to develop a scheme for the automatic segmentation of words into syllables. As discussed in 4.1.3, there exist two conventions by which a word may be segmented into syllables; we decided to follow the second convention in our attempts to perform syllable segmentation (as [11] does). Therefore, we consider all letters vocalized by a Shva or a Hataf as consonants.

Attempting to directly segment a non-vocalized word into syllables is difficult. This results not only from the high ambiguity rate in Hebrew, which means a given word may have multiple valid segmentations, but also from the fact that any Hebrew non-vocalized letter may function either as a vowel or as a consonant. Therefore, we decided to segment fully vocalized words, such as the words generated by the system we presented in 6.1.

Segmenting a word into syllables is based on the position of vowels and consonants in the word. Each syllable includes a single vowel and may contain several consonants. Vowels are easy to recognize, and obviously two vowels in a word are members of two separate syllables. Non-vocalized letters usually correspond to their preceding syllable. Identifying the correct syllable for a certain consonant, on the other hand, is a much more complicated problem.

The Shva (◌ְ) vocalization sign denotes a consonant, and as explained in 4.1.2, there exist two types of Shva: Shva Na (שווא נע) and Shva Nach (שווא נח). The two Shva types may be used to recognize the beginning and the ending of a syllable; according to [11], a letter vocalized by a Shva Na is the first letter in a syllable, and a letter vocalized by a Shva Nach is the last (this rule has one exception: in case two consecutive Shva instances are at the end of the word, they are both Shva Nach, and they both belong to the last syllable of the word). So, if we develop a scheme for identifying the Shva type in a given word, the task of segmentation into syllables will be at hand.

6.2.1 Following Behor's Footsteps - Heuristic Approach

During the 16th century, the linguist Rabbi Eliyahu Behor devised a simple heuristic for determining the type of Shva in Hebrew words. The heuristic presented by Behor included 5 simple rules by which one could classify a Shva as being a Shva Na or a Shva Nach; Behor's Shva classification scheme is presented in Algorithm 2.


Algorithm 2 The Behor scheme for Shva classification

Require: A vocalized word
  if the Shva is at the beginning of the word then
    return the Shva is a Shva Na
  else if the Shva is second among two consecutive Shva instances (which are not at the end of the word) then
    return the Shva is a Shva Na
  else if the Shva follows a long non-stressed vowel (according to Table 4.6) then
    return the Shva is a Shva Na
  else if the Shva vocalizes a letter with a Dagesh Hazak then
    return the Shva is a Shva Na
  else if the Shva vocalizes the first letter among two identical letters then
    return the Shva is a Shva Na
  end if

Directly implementing Behor's heuristic as a method to classify Shva appearances is somewhat problematic. The first two conditions are easy to validate (and are always correct), but the third condition requires the position of the stress, the fourth condition relies on identifying the type of Dagesh in a given letter, and the fifth rule has many exceptions. Therefore, we decided to use a resembling heuristic (Algorithm 3), which relies only on information that is explicitly included in the vocalized word.

In addition to the rules suggested by Behor, we use a known rule which has no exceptions. This rule determines that the first Shva among two consecutive Shva instances is a Shva Nach. As can be seen in Algorithm 3, in case none of the rules applies, we determine that the Shva is a Shva Nach.

Algorithm 3 Our heuristic for Shva classification

Require: A vocalized word
  if the Shva is at the beginning of the word then
    return the Shva is a Shva Na
  else if the Shva is first among two consecutive Shva instances then
    return the Shva is a Shva Nach
  else if the Shva is second among two consecutive Shva instances (which are not at the end of the word) then
    return the Shva is a Shva Na
  else if the Shva follows a long vowel (according to Table 4.6) then
    return the Shva is a Shva Na
  else
    return the Shva is a Shva Nach
  end if
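In code, the heuristic is a short cascade of checks. This sketch is illustrative rather than the thesis' exact code; the vowel-name constants follow Table 4.6, and word is assumed to be a list of letters with their vocalization.

    import java.util.List;
    import java.util.Set;

    final class ShvaHeuristic {
        enum ShvaType { NA, NACH }

        record Letter(char ch, String vowel) {
            boolean hasShva() { return "SHVA".equals(vowel); }
        }

        /** Long vowels, following Table 4.6. */
        static final Set<String> LONG_VOWELS = Set.of(
            "KAMATS", "TSERE", "TSERE_MALE", "HIRIK_MALE",
            "SHURUK", "HOLAM_MALE");

        /** Classifies the Shva on letter i of word, per Algorithm 3. */
        static ShvaType classify(List<Letter> word, int i) {
            if (i == 0) return ShvaType.NA;                           // word-initial
            boolean nextShva = i + 1 < word.size() && word.get(i + 1).hasShva();
            if (nextShva) return ShvaType.NACH;                       // first of a pair
            boolean prevShva = word.get(i - 1).hasShva();
            if (prevShva && i + 1 < word.size())
                return ShvaType.NA;                                   // second of a mid-word pair
            String prevVowel = word.get(i - 1).vowel();
            if (prevVowel != null && LONG_VOWELS.contains(prevVowel))
                return ShvaType.NA;                                   // follows a long vowel
            return ShvaType.NACH;                                     // default
        }
    }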


6.2.2 Shva Classification According to the Base Form

The partial success achieved by applying the syllable segmentation based on 6.2.1 motivated us to use more than just the vocalized word for classifying Shva instances.

We decided to use the origin form of the word as an indicator of the type of the Shva: for verbs, the masculine, singular, 3rd person, deficiently spelled inflection in the same tense (which we refer to as the base tense form), and for nouns, the singular, absolute state (צורת הנפרד) morphology.

As explained in [11], a Shva in a given word may be identified as a Shva Na or a Shva Nach by its presence in the origin form of the word. If the Shva is present in the origin form, it is a Shva Nach; otherwise, it is a Shva Na. Although this classification scheme seems simple, matching Shva instances in practice poses a challenge. A given word may contain more than one Shva, and in order to classify each one correctly, we need to match each Shva in the word to its corresponding Shva instance in the origin form (if such a corresponding Shva actually exists).

For example, the word תזדקקי (Tizdakekiy), which means "you (feminine) will need", includes two Shva instances, and its corresponding base tense form is יזדקק (Yizdakek), which means "(he) will need". We note that the base tense form also contains two Shva instances: one is visually obvious (at the ז), and the other is not (at the second ק), since it is the last letter in the word. Therefore, the first Shva in תזדקקי is a Shva Nach (since it is present in the base tense form), and the second Shva is a Shva Na (since it is not present in the corresponding ק in the base tense form).

To make things even more complicated, as mentioned, the Hebrew inflectional system for verbs may add, remove or replace letters and vocalization signs; therefore, matching Shva instances is far from a trivial task.

We decided to cope with the problem of matching Shva instances by using a string matching algorithm that is based on an edit distance metric. The idea is to transform a given string into another with the minimal needed change, by applying copy, addition, removal and replacement operations over characters in the string. The definition of what constitutes a minimal change is expressed by a weight (or a price) for each type of operation; the algorithm attempts to minimize the total cost of the transformation. Once the edit distance metric is defined, using dynamic programming and backtracking to align the strings is easy. Given such an alignment, identifying the Shva type according to its presence/absence in the base tense form is immediate. Algorithms 4, 5 and 6 describe the scheme for calculating a matrix that contains the edit distance of every two sub-strings of the input words (M[len1][len2] holds the edit distance between s1 and s2).


Algorithm 4 Edit distance matrix calculation

Require: Strings <s1, s2>, Integers <wInsert, wDelete, wCopy, wReplace>
Ensure: M[i][j] = the edit distance between s1.substring(i) and s2.substring(j)
  len1 ← s1.length()
  len2 ← s2.length()
  M ← initiateEditDistanceMatrix(len1, len2, wInsert, wDelete)
  for i ← 1 to len1 do
    for j ← 1 to len2 do
      M[i][j] ← Min( M[i-1][j-1] + δ(s1[i-1], s2[j-1], wCopy, wReplace),
                     M[i-1][j] + wDelete,
                     M[i][j-1] + wInsert )
    end for
  end for
  return M

Algorithm 5 Initiation of the edit distance matrix

Require: Integers <string1Length, string2Length, wInsert, wDelete>
  Initiate M as a matrix of dimensions string1Length × string2Length s.t. ∀i, j: M[i][j] = 0
  topStringSize ← Max(string1Length, string2Length)
  for i ← 1 to topStringSize do
    if i < string1Length then
      M[i][0] ← i * wDelete
    end if
    if i < string2Length then
      M[0][i] ← i * wInsert
    end if
  end for

Yet, directly applying Algorithm 4 to align a vocalized Hebrew verb inflection with a base tense form is far from optimal, because of the following properties of the Hebrew inflectional system:

• Final and non-final letters: As discussed in 4.1.1, some Hebrew letters have a corresponding final letter; therefore, the comparison of c1 and c2 in Algorithm 6 may indicate two similar letters as different (if one is final while the other is not).

• Only a subset of letters may be removed to form an inflection: Most Hebrew letters are never removed from the base tense form when forming an inflection; the only exceptions are ו,י,נ.

• Only a subset of letters may be added to form an inflection: The only letters that may be added to the base tense form to form an inflection are ו,ה,מ,א,י,ת,נ.

• Only a subset of letters may be replaced to form an inflection: The only letters that may be replaced in the base tense form to generate an inflection are ה,י,נ, and their replacing letters may only include ו,ה,מ,א,י,ת,נ.


Algorithm 6 The edit distance δ-function

Require: Characters <c1, c2>, Integers <wCopy, wReplace>
  if c1 = c2 then
    return wCopy
  else
    return wReplace
  end if

Solving the first issue is easy: instead of comparing c1 and c2 in Algorithm 6 directly, we check whether c1 and c2 represent the same Hebrew letter or whether c1 is equal to c2, such that equal characters (vocalization signs or letters) on the one hand, and final and non-final instances of the same letter on the other, are regarded as equal.

Following the rest of the issues, we note that the add, remove and replace operations that take place during the formation of an inflection may only involve the letters ו,ה,מ,א,י,ת,נ. In order to improve our string matching results, we decided to exploit this property. Therefore, the weight we give to every possible copy operation depends on the letter being copied: if one of the above letters is copied, the reward (to the edit distance value) is small, and we regard this as a "weak copy"; otherwise (the letter is not one of the above), the reward is significant, and we regard this as a "strong copy". Vocalization signs are treated as weak copies, since they all have the tendency to be added, removed and replaced while forming an inflection. In order to distinguish weak copies from strong copies, we rewrote the δ-function as shown in Algorithm 7.

Algorithm 7 The Hebrew-adapted δ-function

Require: Characters <c1, c2>, Integers <wWeakCopy, wStrongCopy, wReplace>
  if c1 = c2 or sameLetter(c1, c2) then
    if strongLetter(c1) then
      return wStrongCopy
    else
      return wWeakCopy
    end if
  else
    return wReplace
  end if

* sameLetter(c1, c2) denotes a function that returns true if c1 and c2 represent the same letter.
* strongLetter(c1) denotes a function that returns true if c1 is a Hebrew letter that does not belong to the set {ו,ה,מ,א,י,ת,נ}.


Initiating the above string matching algorithm requires us to allocate values to the weights wInsert, wDelete, wWeakCopy, wStrongCopy and wReplace. The weights we decided on are:

• wInsert = wDelete = 3: The least preferable option is for a character to be inserted into/removed from the word when it doesn't have to be; therefore, we apply the biggest penalty to this option.

• wWeakCopy = -1: A copy of a character is good, and, therefore, we reward it; on the other hand, this is a copy of a "weak" character (one among {ו,ה,מ,א,י,ת,נ} or a vocalization sign), and, therefore, the reward is modest.

• wStrongCopy = -50: A copy of a "strong" letter must never be missed; therefore, we reward it generously.

• wReplace = 2: If a copy is not possible, we would rather replace than insert/delete (if possible). This operation is widely applied for changes of vocalization.
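Putting Algorithms 4-7 and the chosen weights together gives the following consolidated sketch. It is illustrative rather than the thesis' exact code; in particular, treating every non-letter character as a vocalization sign (and hence as a weak copy) is an assumption of this sketch.

    import java.util.Set;

    final class HebrewEditDistance {
        // Weights chosen in the text: insert/delete 3, replace 2,
        // weak copy -1, strong copy -50.
        static final int W_INSERT = 3, W_DELETE = 3, W_REPLACE = 2;
        static final int W_WEAK_COPY = -1, W_STRONG_COPY = -50;

        // Letters that may be added/removed/replaced during inflection.
        static final Set<Character> WEAK = Set.of('ו', 'ה', 'מ', 'א', 'י', 'ת', 'נ');

        /** The Hebrew-adapted delta function (Algorithm 7). */
        static int delta(char c1, char c2) {
            if (c1 == c2 || sameLetter(c1, c2)) {
                return strongLetter(c1) ? W_STRONG_COPY : W_WEAK_COPY;
            }
            return W_REPLACE;
        }

        /** Treats final and non-final forms of the same letter as equal. */
        static boolean sameLetter(char c1, char c2) {
            return normalizeFinal(c1) == normalizeFinal(c2);
        }

        static char normalizeFinal(char c) {
            switch (c) {
                case 'ך': return 'כ';
                case 'ם': return 'מ';
                case 'ן': return 'נ';
                case 'ף': return 'פ';
                case 'ץ': return 'צ';
                default:  return c;
            }
        }

        /** A letter never added/removed in inflection; vocalization signs
         *  (non-letters) and the weak letters get only the modest reward. */
        static boolean strongLetter(char c) {
            return Character.isLetter(c) && !WEAK.contains(normalizeFinal(c));
        }

        /** Algorithms 4-5: the full edit-distance matrix. */
        static int[][] editDistanceMatrix(String s1, String s2) {
            int n = s1.length(), m = s2.length();
            int[][] M = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) M[i][0] = i * W_DELETE;
            for (int j = 1; j <= m; j++) M[0][j] = j * W_INSERT;
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    M[i][j] = Math.min(
                        M[i - 1][j - 1] + delta(s1.charAt(i - 1), s2.charAt(j - 1)),
                        Math.min(M[i - 1][j] + W_DELETE, M[i][j - 1] + W_INSERT));
                }
            }
            return M;   // backtracking over M yields the character alignment
        }
    }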


6.3 Unknown Verbs Classification

Our third objective was to automatically classify Hebrew non-vocalized base forms (of verbs) to their corresponding inflection table. By developing an automatic classifier for base forms, it becomes possible to generate a vast number of associations between base forms and tables. This may be achieved by morphologically analyzing some plain text, and applying our classification mechanism over the words identified as verbs with a morphology that corresponds to a base form.

The classifier we developed is a Support Vector Machine that receives a non-vocalized base form and returns a corresponding inflection table. The SVM classifier makes use of the following features:

• Length: The length of the base form that corresponds to each table is constant; therefore, it is easy to eliminate inflection tables whose corresponding base form length differs. This feature may reduce the number of possible tables significantly, yet on the other hand, many tables share a similar length. For example, if we are given a 3-letter base form, it definitely does not correspond to the Hitpael pattern (since all base forms in the Hitpael pattern are composed of 4 letters or more [31]). So the length feature directly eliminates the 60 inflection tables of which the Hitpael pattern consists (and 172 inflection tables overall) out of the total 264 inflection tables. On the other hand, the ambiguity level remains high, with 92 possible inflection tables.

• Letters position: Many of the base forms that correspond to tables have some letters fixed in a given position. For example, base forms from the Hufal pattern always include the letter Hey as the first letter (regardless of the stem). Our feature extraction mechanism extracts from a given base form any (preset) number of sequential letters and their corresponding (beginning) position.

• Guttural letters position: Many inflection tables (and their corresponding base forms) include guttural letters (see 4.1.2) at some fixed locations (an inflection table may be associated with any guttural and not with a specific one). Therefore, gutturals may be used as indicators to identify/eliminate inflection tables. The difference between this feature and the letters position feature is that here we generate the same feature for all types of gutturals, unlike the letters position feature, which generates a unique feature for every type of letter.


• Corpus level features: The above features narrow down the number of relevant inflection tables significantly, but in many cases the inflection table of a given base form may not be simply extrapolated from the length, letters and guttural features. For example, the base form שלם (Shilem), which means "(he) paid", corresponds to a certain inflection table from the Piel pattern, while the base form שדד (Shadad), which means "(he) robbed", corresponds to a completely different inflection table in the Paal pattern. The two words are of the same length with no guttural letters, and the letters that are present do not give any indication of the corresponding table (as mentioned, the input to our classifier does not include the vocalization signs, for otherwise this task would be much simpler). In such cases we attempt to differentiate base forms by using corpus level features; this is done by using a comprehensive, morphologically analyzed Hebrew corpus (see 5.2). The idea is that, given a base form, we first classify it into a cluster of inflection tables (instead of directly selecting a specific table). Then we use the data obtained from the corpora to pinpoint the correct table in the selected cluster. In detail, the idea is as follows:

Given a non-vocalized base form (let us mark this base form by β), we use the SVM to classify β into groups of inflection tables (let G be the set of all such groups, and let g denote some arbitrary group in G). g includes inflection tables that cause confusion for an SVM without corpus level features. After β is classified to some group (g, for example), we use the corpus level features to detect the correct inflection table in g (let i denote an arbitrary inflection table in g). This is achieved by inflecting β by each inflection table i in g, such that a set of vocalized inflections with their morphology (denoted by I) is generated. Now we simply count the total number of appearances of inflections from I (with their corresponding morphology) in the corpus. This number of matches is used as a feature to indicate the most likely inflection table in g. The general scheme for the two-phased classification with corpus level features is given in Algorithm 8.

Algorithm 8, which extracts corpus level features, may be applied with either a vocalized or a non-vocalized corpus; since we do not currently possess a vocalized resource, we apply this method using the corpora of 5.2.

Table 6.2 displays the features, and their associated weights, created for the base form הכרז (Huchraz), which means "(was) announced". The features presented are created in the second classification phase (after the base form was associated with a group of inflection tables which includes the tables F-8 (Hifil), G-1 (Hufal) and G-8 (Hufal)), and, therefore, the corpus level features are included.


Algorithm 8 Corpus level features extraction

Require: Base form <β>, Group of inflection tables <g>¹, Corpus <C>
  corpusLevelFeatures ← ∅
  totalNumberOfInflections ← 0
  for all i such that i ∈ g do
    I ← i.makeInflections(β)
    feature ← makeNewFeature(i.getName())
    feature.setValue(C.countOccurrences(I))
    totalNumberOfInflections ← totalNumberOfInflections + feature.getValue()
    corpusLevelFeatures ← corpusLevelFeatures ∪ {feature}
  end for
  return normalizeCorpusLevelFeatures²(corpusLevelFeatures, totalNumberOfInflections)

1. g is the group of inflection tables to which β was classified by the first step of classification (classification into groups).

2. The function normalizeCorpusLevelFeatures(features, totalInflections) normalizes the value of each feature by totalInflections (the total number of inflections generated by the inflection tables in g), so that the resulting value is a number (which we use as the weight of the feature) between 0 and 1, and

   Σ_{f ∈ features} f.getValue() = 1

   This number represents the frequency in the corpus of inflections corresponding to the given verb according to a specific inflection table.

Table 6.2: Features created in the second classification phase for the base form הכרז

Feature type | Feature | Weight
Length feature | WORD_SIZE__4 | 1
Letters position features* | AT_0__HEY | 1
 | AT_1__KAF | 1
 | AT_2__REISH | 1
 | AT_3__ZAIN | 1
 | AT_0__HEY&KAF | 1
 | AT_1__KAF&REISH | 1
 | AT_2__REISH&ZAIN | 1
Guttural letters position features | AT_0__GRONIT | 1
 | AT_2__GRONIT | 1
Corpus level features | TABLE__F20 | 0
 | TABLE__G1 | 0.5
 | TABLE__G8 | 0.5

* Here the letters position features were extracted for subsequences of two letters or less. As described in 7.3.2, we experimented with various possible lengths of such sequences.


Chapter 7

Experiments and Results

7.1 Generation Results

By using the inflection tables described in 6.1, along with the association of base forms to inflection tables (see 5.1), we generated about 240k fully vocalized and morphologically tagged verbs (see appendix B). Since there is no reliable, automatic way to test the correctness of our output, we manually tested the correctness of the inflections formed by 264 base forms and their corresponding morphologies - one base form (and all the generated inflections associated with it) per table. The total number of inflections manually checked is 15,221.

As displayed in Table 7.1, 91 errors resulting from several causes were found. Under the (false) assumption that base forms in the dataset are uniformly distributed among the inflection tables, a verb from our auto-generated inflection list is a valid verb with correct vocalization and morphology with a probability of 99.4%. In order to measure accuracy without this assumption, we must test a larger number of base forms from each inflection table (as opposed to our test, which included one base form per inflection table).

The mistake type marked as type I caused the most errors, yet all these errors were caused by a single defective base form in the dataset. This base form was טפל (from the Piel pattern), which was misspelled as טיפל (Plene spelling instead of the deficient spelling required in the dataset). Thanks to this error, we decided to add a function for preventing mistakes of this type (Plene instead of deficient); this function simply validates that the number of letters in the base form is equal to the number of letters expected by the inflection table. By applying this function we spotted 11 flawed base forms, which were inflected into 601 defective verb inflections. Three of the 11 mistakes resulted from mis-associated tables (type II mistakes), and the rest originated from a misspelled (Plene instead of deficient) base form.


Table 7.1: Generation error analysis

Mistakes in the dataset:
  Type I - Flawed base form: 50 mistakes
  Type II - Wrong base form to table association: 0 mistakes
  (All of these mistakes resulted from one flawed base form.)

Mistakes in the generation scheme:
  Type III - Wrong morphology: 7 mistakes
  Type IV - Spelling/vocalization error: 34 mistakes
  (All of these mistakes resulted from one inflection table.)

Overall mistakes: 91

We estimate that the current resource of over 240,000 fully vocalized and inflected verbal forms for 4,000 base verb forms is over 99.4% correct. This resource is available at <http://www.cs.bgu.ac.il/∼nlpproj/hebrew-vocalization>.

7.2 Syllable Segmentation Results

7.2.1 Following Behor's Footsteps - Heuristic Approach

The procedure we implemented, which is based on Behor's heuristic, may be applied to any vocalized word (not just to verbs). Yet since we do not possess a dataset that contains non-verb, fully vocalized words, we use our auto-generated, vocalized verb inflections as a test data set.

Due to the lack of a Hebrew resource in which syllables are tagged (like the resources used by [7, 27]), in order to estimate the accuracy of our heuristic approach, we manually checked the syllable segmentation. We tested the correctness of 300 randomly selected inflections. We find that a given word is correctly segmented into syllables with an accuracy of 81%, and the probability that an arbitrary syllable is segmented correctly is 85.92%.

Since the ◌ָ vocalization sign corresponds to two vocalizations (Kamats and Kamats Katan), we used the more common vocalization corresponding to the marking (Kamats). This note is important, since Kamats Katan is regarded as a short vowel and Kamats as a long one; therefore, we must define the manner in which our heuristic deals with such markings. This choice affects the results significantly: regarding every ◌ָ as a Kamats Katan results in a poor 75.33% word accuracy, and the probability that an arbitrary syllable is segmented correctly deteriorates to 81.72%.


7.2.2 Syllable Segmentation According to the Base Form

In order to estimate the accuracy of the string-edit approach to syllable segmentation, we manually tested the correctness of the same (randomly selected) 300 inflections used in 7.2.1. Our results show that an inflection is segmented correctly into syllables, according to string matching with the proper base tense form, in 99.33% of the cases. The probability that an arbitrary syllable is segmented correctly is 99.5%. Table 7.2 summarizes and compares the results of the two syllable segmentation schemes, and Appendix C displays several verbs segmented into syllables according to their base form.

Table 7.2: Syllable segmentation comparison

                                     Accuracy                                                Error analysis
                      Checked          Errors in        Accuracy  Accuracy      Na instead  Nach instead  Other
                      words/syllables  words/syllables  per word  per syllable  of Nach     of Na         errors
Algorithm based on
Behor's heuristic     300/810          57/114           81%       85.92%        52.63%(1)   43.85%(2)     3.52%
String matching with
the base tense form   300/810          2/4              99.33%    99.5%         0%          100%(3)       0%

1. All these errors resulted from a long vowel preceding the Shva. According to Behor, only a Shva that follows a non-stressed long vowel is a Shva Na; yet since we do not know the position of the stress, we apply this rule to every long vowel (regardless of stress).

2. All these errors resulted from a short vowel preceding the Shva, which is identified as a Shva Nach by our heuristic.

3. These errors resulted from a misalignment of the inflection string and the base tense form string.

We anticipate that the errors marked in Table 7.2 as (3) can be avoided by adding another rule to the string matching mechanism. This rule would state that a vocalization sign may never be replaced by a letter, and vice versa, as in the sketch below.
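A minimal sketch (weights, names, and the niqqud range are illustrative, not the thesis implementation) of a weighted alignment that enforces this rule: a letter may never be substituted for a vocalization sign.

def is_vowel_sign(ch):
    # Hebrew points (niqqud) lie roughly in U+05B0..U+05C2;
    # this range is a simplification.
    return '\u05b0' <= ch <= '\u05c2'

def align_score(base, inflection, match=2, mismatch=-1, gap=-1):
    n, m = len(base), len(inflection)
    forbidden = float('-inf')
    # score[i][j]: best score aligning base[:i] with inflection[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = base[i - 1], inflection[j - 1]
            if is_vowel_sign(a) != is_vowel_sign(b):
                sub = forbidden          # letter <-> vowel sign: never
            else:
                sub = match if a == b else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[n][m]

Tracing back through the score table recovers the alignment itself, through which the base form's syllable boundaries can be projected onto the inflection.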

7.2.3 Discussion

We observe that syllable segmentation is empirically more complex than might have been expected. Traditional grammars develop the intuition that a few simple rules can accurately predict syllable segmentation, with the classification of open and closed syllables, and of Shva into Shva Na and Shva Nach, forming the basis of the mechanism.

It turns out that even with access to a fully vocalized verb form, we succeed in only about 80-85% of the cases. Taking into account that automatic vocalization succeeds at only about 90% accuracy (in the best reported commercial systems), this indicates that the success rate for the automatic segmentation of a non-vocalized word form would be about 0.9 × 0.8 ≈ 0.72, i.e. about 72% accuracy.


If, however, we start the process from a non-inflected base form, plus the morphological attributes of the word, then a two-stage generation process succeeds with almost perfect accuracy: our vocalization generation implementation succeeds at over 99.4% accuracy given the verb inflection paradigm, and the syllable segmentation implementation using the string-edit distance succeeds at over 99.3% accuracy.

This observation strongly supports a view of Hebrew phonology that is based on a constructionist process, deriving phonological information through a process of morphological inflection from base forms, and not on a pipeline process in which morphology would first generate fully inflected word forms that are later segmented into phonological units.

7.3 Results of Unknown Verbs Classification

7.3.1 Base Forms Classification into Binyanim (patterns)

In this experiment we attempt to automatically classify non-vocalized base forms to their corresponding pattern. An important point is that many non-vocalized base forms have more than one (valid) corresponding pattern by which the base form may be inflected. For example, the base form !Nשמ may be inflected by the Paal pattern to form !Nמµש (Shaman), which means "(became) fat"; by the Piel pattern to form !Nש¤מ (Shimen), which means "(he) lubricated"; and by the Pual pattern to form !Nש«מ (Shuman), which means "(was) lubricated". The ultimate goal is, therefore, the classification of such words (words that correspond to more than one pattern) into clusters of corresponding patterns.

We focus on classifying a subset of our base form data set that includes only base forms that correspond to exactly one inflection table. This subset includes 2,703 base forms (vs. 4,246 base forms in our full dataset).

Table 7.3 compares the distribution of base forms among patterns in our two lists (and not in natural text). We compare base forms in the whole dataset vs. the subset that includes only base forms with exactly one corresponding pattern.

Table 7.3: Base form distribution among patterns

Pattern                Paal     Nifal    Piel     Pual     Hitpael  Hifil    Hufal    Total
The whole dataset      755      397      1,014    492      685      597      306      4,246
                       17.78%   9.34%    23.88%   11.58%   16.13%   14.06%   7.2%     100%
The subset with only
one pattern per        494      316      446      6        657      505      279      2,703
base form              18.27%   11.69%   16.5%    0.22%    24.3%    18.68%   10.32%   100%


In Figure 7.1, which corresponds to Table 7.3, we note a significant decrease in base forms that correspond to the Pual, Piel and Paal patterns. This phenomenon is caused by several resembling attributes of these patterns. For example, almost every (non-vocalized) base form that corresponds to the Pual pattern may also be inflected by the Piel pattern. For this reason the number of Pual base forms plunges so dramatically in our subset. We anticipate that the resemblance among base forms that correspond to these three patterns will cause the majority of our system's errors.

Figure 7.1: Base form distribution among patterns

We use a Support Vector Machine to classify the base forms according to the length, letter position and guttural letters position features. The base forms in the data set are randomly separated into a testing set and a training set; 70% were used for training and 30% for testing. Table 7.4 displays a typical confusion matrix, and the corresponding precision and recall. The average accuracy we achieve is 90.25% (averaged over 5 predictions with independently selected training/testing sets).
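As an illustration of this setup, the following is a hedged sketch (assuming scikit-learn; the thesis does not name its SVM implementation, and the feature encoding and guttural set below are our own illustration):

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

GUTTURALS = set('אהחע')  # guttural letters; ר is sometimes grouped with them

def features(base_form):
    f = {'length': len(base_form)}
    for i, ch in enumerate(base_form):
        f['letter_%d_%s' % (i, ch)] = 1      # letter position feature
        if ch in GUTTURALS:
            f['guttural_at_%d' % i] = 1      # guttural position feature
    return f

def train_pattern_classifier(base_forms, patterns):
    vec = DictVectorizer()
    X = vec.fit_transform([features(b) for b in base_forms])
    X_tr, X_te, y_tr, y_te = train_test_split(X, patterns, test_size=0.3)
    clf = SVC().fit(X_tr, y_tr)
    return clf, vec, clf.score(X_te, y_te)   # held-out accuracy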

As displayed in Table 7.4, the recall measure of the Piel pattern is particularly low, and most errors resulted from misclassification of its base forms as corresponding to the Paal pattern.


Table 7.4: Classification of base forms to patterns - confusion matrix

Actual \ Classifier  Paal    Nifal   Piel     Pual  Hitpael  Hifil   Hufal  Recall
Paal                 133     1       14       0     0        0       1      0.8926
Nifal                2       93      0        0     0        0       0      0.9789
Piel                 27      1       86       0     0        0       7      0.7107
Pual                 0       0       1        0     0        0       0      0
Hitpael              0       0       0        0     196      5       1      0.9702
Hifil                1       0       0        0     1        146     5      0.9542
Hufal                0       0       0        0     0        1       89     0.9888
Precision            0.8159  0.9789  0.85148  N/A   0.9949   0.9605  0.8640

7.3.2 Classification of Base Forms to Inflection Tables

In order to set a baseline for the task of identifying the exact inflection table for a given unknown base form, we first test the success rate of this task with the length and letter position features only. Table 7.5 compares the accuracy (and its standard deviation, STD) of this baseline to a classifier that also uses the gutturals feature. The accuracies presented were averaged over 5 independent experiments, with 70% of the samples used for training and the remaining 30% for testing.

Table 7.5: Identification of exact inflection table - comparison of baseline manipulations

Letter position feature           Baseline classifier   Classifier with the gutturals feature
manipulations                     Accuracy/STD          Accuracy/STD
Without the letter
position feature (1)              24.06%/0.45           32.5%/0.81
With 1 letter per feature (2)     66.88%/0.89           68.63%/0.9
With 1 and 2 letters
per feature (3)                   62.63%/1.18           65.67%/1.88
With 1, 2 and 3 letters
per feature (4)                   62.58%/2.49           65.32%/2.43

1. No letter position features are created.

2. A feature is created per letter.

3. A feature is created per 2 consecutive letters, in addition to the previous features (2).

4. A feature is created per 3 consecutive letters, in addition to the previous features (3); see the sketch below.
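A hedged sketch of these letter position feature manipulations (feature naming is our own illustration):

def letter_position_features(base_form, max_n=3):
    # One feature per letter n-gram anchored at its position;
    # max_n=1 reproduces manipulation (2), max_n=3 manipulation (4).
    feats = {}
    for n in range(1, max_n + 1):
        for i in range(len(base_form) - n + 1):
            feats['pos%d_%s' % (i, base_form[i:i + n])] = 1
    return feats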

The best results are achieved when the letter position features use one letter only; we believe this is a direct consequence of the enlarged number of features created when more than one letter is used. When the 1 and 2 letters per feature manipulations are used, the average number of features created increases by a factor of ~2.7. These additional features cause the SVM's dimensionality to grow, making our model less stable (higher variance) and less accurate.

The addition of the gutturals feature improved the best result by 1.75%, and again, 1-letter position features produced the best results.


7.3.3 Classification of Base Forms to Inflection Tables with Corpus Level Features

In order to improve our results, we attempted to use corpus level features (as described in 6.3). Due to the lack of a fully vocalized corpus, we make use of the non-vocalized corpus described in 5.2.

In order to decide which inflection table sets to use, we rely on the inflection tables that cause the classifier to err the most, as displayed in Figure 7.2. The error rates displayed in Figure 7.2 correspond to the 25 pairs of inflection tables which the classifier confuses the most. The figure displays the number of errors in a simple, one phase classification, averaged over 5 independent classification experiments (sorted by number of errors), and the corresponding cumulative error rate. All 5 experiments were configured to use our best experiment configuration (including the gutturals feature and 1 letter for the letter position features).

Figure 7.2: Inflection table confusion pairs

* A, B, C, D, E, F, G correspond to the patterns Paal, Nifal, Piel, Pual, Hitpael, Hifil, Hufal, respectively. The numbers associated with the patterns correspond to specific inflection tables (according to [31]).


Table 7.6 displays the average accuracy and STD achieved by the two step classification mechanism, where corpus level features are used in the second classification phase (as described in 6.3). We define the inflection table clusters for the second phase by unifying the i most confused pairs (displayed in Figure 7.2). The unification is done according to the scheme displayed in Algorithm 9.

Algorithm 9 Unification of clusters of inflection tables which cause confusion

Require: A set of inflection table confusion sets S = {s1, ..., sn}, a new pair p of confused inflection tables
  c1 ← p.getFirst()
  c2 ← p.getSecond()
  if c1 ∈ sa and c2 ∈ sb s.t. sa, sb ∈ S and sa ≠ sb then
    return S.unify(S.find(c1), S.find(c2))
  else if c1 ∈ sa and c2 ∈ sb s.t. sa, sb ∈ S and sa = sb then
    return S
  else if c1 ∈ sa s.t. sa ∈ S and c2 ∉ sb ∀ sb ∈ S then
    return S.unify(S.find(c1), S.makeSet(c2))
  else if c1 ∉ sa ∀ sa ∈ S and c2 ∈ sb s.t. sb ∈ S then
    return S.unify(S.makeSet(c1), S.find(c2))
  else
    return S.unify(S.makeSet(c1), S.makeSet(c2))
  end if

* makeSet, find and unify are the standard methods of the Union-Find data structure (see the Python sketch below):

  - S.makeSet(x): creates a new set in S that includes only x.

  - S.find(x): finds the set in S which contains x.

  - S.unify(x,y): merges the set in S that includes x with the set in S that includes y.
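A hedged Python sketch of Algorithm 9 over a dictionary-based union-find (not the thesis implementation). Note that the explicit case analysis collapses: making singleton sets for any unseen table and then unifying covers all five branches, since unifying a set with itself is a no-op.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def make_set(self, x):
        self.parent.setdefault(x, x)   # no-op if x is already present

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def unify(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def add_confusing_pair(clusters, pair):
    """Merge the clusters of two confused inflection tables,
    e.g. add_confusing_pair(uf, ('A1', 'C1'))."""
    c1, c2 = pair
    clusters.make_set(c1)
    clusters.make_set(c2)
    clusters.unify(c1, c2)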

To our disappointment, only the A1-C1 cluster shows a significant improvement in the overall accuracy. Generally, the more clusters we define, the higher the first step accuracy becomes (although there exist "difficult" clusters which disrupt the classifier, such as F9-F16). Yet in many cases the corpus level features do not seem effective enough to improve our results by improving the accuracy of the second classification step. We do expect that using a vocalized corpus instead may improve our accuracy more significantly; at this time, though, we do not possess such a resource.


Table 7.6: Classification of base forms to inflection tables with corpus level features

# of pairs  New pair  Clusters of         First step     Cluster          Total
                      inflection tables   accuracy/STD   accuracy*/STD    accuracy/STD
0           -         -                   68.63%/0.9     -                68.63%/0.9
1           A1-C1     {A1,C1}             72.62%/0.016   55.94%/3.82      70.08%/1.4
2           A17-C4    {A1,C1}             72.65%/1.19    48.13%/5.14      68.97%/1.06
                      {A17,C4}                           53.68%/5.91
3           A22-C1    {A1,A22,C1}         73.53%/1.25    48.92%/5.15      68.8%/0.9
                      {A17,C4}                           51.87%/7.84
4           F9-F16    {A1,A22,C1}         72.6%/2.42     48.19%/6.07      67.12%/2.24
                      {A17,C4}                           56.1%/5.35
                      {F9,F16}                           52.47%/20.98
5           C12-C20   {A1,A22,C1}         73.31%/1.15    50.88%/4.09      67.12%/0.77
                      {A17,C4}                           50.21%/9.2
                      {C12,C20}                          69.95%/5.53
                      {F9,F16}                           48.48%/6.83
6           A1-A22    No change           No change      No change        No change
7           B1-C20    {A1,A22,C1}         73.29%/2.18    53.96%/4.14      67.27%/2.83
                      {A17,C4}                           56.69%/13.28
                      {B1,C12,C20}                       78.1%/3.88
                      {F9,F16}                           35.95%/11.92

* Note that each cluster accuracy includes mistakes which result from misclassifications in the first classification step (classification into clusters).

7.3.4 Discussion

While classification of verbs into one of the 7 Binyanim (patterns) of Hebrew is a simple task given the letter pattern of the base form of a verb, the confusion between a few patterns remains challenging (in particular Paal vs. Piel). The pattern information, though, is not sufficient to predict the vocalization of the verb's many inflections.

Full classification of a non-vocalized verb form into a non-ambiguous inflection paradigm is a much more challenging task. This contradicts traditional grammars, which develop the intuition that simple letter-based rules are sufficient to predict the vocalization paradigm of a Hebrew verb. This approach only succeeds at a rate of about 68% on a large sample of about 2,700 verbs.

The addition of simple corpus-based features improves the classification in a significant manner. We experimented with simple unsupervised corpus-based features and observed that for one of the most common confusion sources (A1-C1, the most common group of Paal/Piel verbs), corpus-based features improve the classification significantly. We hypothesize that stronger corpus-based features based on vocalized data would provide higher improvements for more confusion sets.

This finding indicates that the Hebrew verb word formation system is much more irregular than one could assume on the basis of traditional grammar descriptions.


Part IV


Chapter 8

Conclusions and Future Work

8.1 Conclusions

This work aimed to create and use a vocalized Hebrew dataset for various NLP purposes. Using a generation model, we developed a large, fully vocalized dataset which includes about 240k Hebrew inflected verbs along with their corresponding morphological attributes. The 240k inflections are associated with over 4k base forms of Hebrew verbs. A manual check of a sample including 15,000 inflections and their corresponding morphology estimates that the dataset's accuracy of spelling, vocalization and morphological attributes is around 99.4% (under the uniformity assumption mentioned in 7.1). Generally, it seems that a considerable, fully vocalized and morphologically tagged dataset, with a minimal error rate, can be obtained by constructing a comprehensive generation mechanism. Now that this mechanism is implemented, new verbs and their corresponding inflections may be easily added to the dataset. This analysis indicates that given a non-vocalized base verb form and one of about 300 unambiguous inflection paradigms, one can deterministically generate a fully accurate vocalized form of any inflected form of the verb.

Having this dataset, we developed two algorithms for the segmentation of vocalized Hebrew words into syllables. Our first algorithm used the vocalized word as its only input. Based on a heuristic developed in the Middle Ages by Rabbi Eliyahu Behor, we managed to segment 85.92% of the syllables and 81% of the words correctly. This accuracy measurement was made by manually checking a random sample of 300 verbs comprising 810 syllables. Our second syllable segmentation algorithm used the vocalized word together with its corresponding origin form (or absolute state for nouns). We tested segmentation accuracy on the same sample. Our second approach significantly improved our results: some 99.5% of the syllables and 99.33% of the words were segmented correctly. This improved method requires additional knowledge to accurately perform syllable segmentation. If the vocalized non-inflected


base tense form of the verb is added, we can then perform the task almost perfectly. This finding strongly supports a constructionist view of Hebrew phonology.

In order to estimate the amount of knowledge required to map non-vocalized base forms to inflection tables (and, therefore, to predict the full vocalization paradigm of the verb), we experimented with the automatic classification of non-vocalized, non-inflected base forms of verbs into one of over 260 inflection paradigms. This task proved rather challenging, as many non-vocalized base forms correspond to more than one possible inflection table. We focused on classifying base forms with only one corresponding inflection table. We applied a two step classification process: first, classification into clusters of inflection tables, and second, pinpointing the table in the cluster. Following this approach we achieved 70.08% accuracy in classifying a base form into one of over 260 possible inflection tables. We also used our classifier for classifying base forms of verbs into their correlated patterns, achieving 90.25% accuracy. We conclude that classifying base forms into specific inflection tables and into patterns are tasks that are far from trivial when no contextual features are available; yet the use of the gutturals and corpus level features enabled us to produce improved results. We expect the corpus level features to play an even larger role when a vocalized corpus becomes available.

8.2 Future Work

8.2.1 Generation

Implementing Rare Inflection Tables

The scope of this work included the implementation of over 260 common inflection tables. An additional 40 infrequent tables still need to be implemented to cover all known cases of verbal inflection.

Implementing Inflection Tables for Nouns

As in other languages, nouns in Hebrew are by far the most common part of speech. Unlike verbs, the inflection tables for nouns consist of 10 inflections on average (as opposed to 60 inflections on average for verbs). Yet the number of noun inflection tables to be implemented is significantly larger than the roughly 300 verb inflection tables [5].

By creating a comprehensive generation mechanism for nouns, along with a significant list of nouns and their corresponding inflection tables, we expect to obtain a vast dataset which will complement the vocalized verbs dataset generated in this work. We expect this comprehensive, vocalized and morphologically tagged dataset to provide the foundation for the implementation of automatic, highly accurate Hebrew vocalization software.


8.2.2 Syllable Segmentation

Searching for Optimal Weights

Our string matching algorithm uses weights to define the benefit/penalty of matching/replacing/inserting/deleting a letter or a vocalization sign. The selection of these weights was made manually, based on a general understanding of the Hebrew inflectional model's properties. This selection may be improved, to achieve better results (over our manually segmented 300 verbs), by searching for optimal weights. This can be achieved by either a simple grid search (with some boundaries) or by a more sophisticated search scheme (such as gradient descent, genetic algorithms, etc.) over the weights.
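A minimal grid-search sketch, assuming a hypothetical segment_accuracy(match, mismatch, gap) that runs the segmenter over the 300 hand-segmented verbs and returns word accuracy; the weight grids are illustrative.

import itertools

def grid_search(segment_accuracy):
    best_weights, best_acc = None, -1.0
    for match, mismatch, gap in itertools.product(
            [1, 2, 3],        # benefit of matching a symbol
            [-3, -2, -1],     # penalty for replacing a symbol
            [-3, -2, -1]):    # penalty for inserting/deleting a symbol
        acc = segment_accuracy(match, mismatch, gap)
        if acc > best_acc:
            best_weights, best_acc = (match, mismatch, gap), acc
    return best_weights, best_acc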

Machine Learning of Syllable Segmentation

Our success in segmenting vocalized words into syllables given their base form may be used to improve syllable segmentation when the base form of the word is not available. This may be achieved by using our 240k verbs, which are segmented into syllables with over 99% word accuracy, as a corpus for some supervised learning mechanism. By doing so, we believe the 81% word segmentation accuracy may be improved upon with no need for a base form (which might be difficult to obtain).

8.2.3 Unknown Verbs Classification

Using a Vocalized Corpus

Currently, we classify base forms into inflection tables using length, letter position and guttural letters position as features. In addition, we attempted to use inflection counts from a non-vocalized corpus as an enhancement to these features, by performing the classification in two steps. As a further possibility, we suggest using a vocalized corpus, with which the inflection counts may be much more accurate. We expect the accuracy to improve due to the very low ambiguity rate of vocalized Hebrew words (in comparison to the very high ambiguity of non-vocalized Hebrew). Such accurate counts of inflections in a vast corpus may improve our classification accuracy significantly.

Classifying Vocalized Verbs into Inflection Tables

Once a vocalized corpus becomes available, our motivation for automatically classifying base forms of verbs to their corresponding inflection tables naturally increases. Obviously, the accuracy of classification will be significantly increased, and automatically gathering base forms for our generation mechanism will become immediate.


Performing Feature Selection

As described in Table 7.5, once the number of features grows, the accuracy drops significantly. We suspect that performing feature selection (such as discriminant analysis or principal component analysis) may improve the robustness of our model, and our results, by eliminating irrelevant features.
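A hedged sketch (assuming scikit-learn) of the suggested dimensionality reduction; the component count is illustrative.

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def make_reduced_classifier(n_components=100):
    # PCA keeps the directions of highest variance, discarding the long
    # tail of rarely active letter features; note that PCA requires a
    # dense feature matrix.
    return make_pipeline(PCA(n_components=n_components), SVC())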

General Verbs Classification

The classification task we currently confront is that of classifying base forms of verbs into their corresponding inflection tables/patterns. A more general task would consider classifying any inflection of a verb (not just a base form) to its relevant inflection table. Such a classifier could use the results of our generation mechanism for performing supervised learning. Once such a system is implemented, any verb in a given corpus may be used to enhance our dataset (unlike our current mechanism, which uses only base form verbs in the corpus).

Exploring the SVM Parameters

In this work we used a fixed setting of the SVM parameters (C, gamma, kernel, etc.). A search for optimal SVM settings may be performed (as in 8.2.2) by either a simple grid search or by a more sophisticated search scheme, as sketched below.
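A minimal sketch (assuming scikit-learn) of such a grid search over the SVM parameters; the grids are illustrative.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X, y):
    grid = {'C': [0.1, 1, 10, 100],
            'gamma': [0.001, 0.01, 0.1, 1],
            'kernel': ['linear', 'rbf']}
    search = GridSearchCV(SVC(), grid, cv=5)  # 5-fold cross-validation
    search.fit(X, y)
    return search.best_params_, search.best_score_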

8.2.4 Automatic Vocalization

Setting a Baseline with a Vocalized Corpus

Given a comprehensive vocalized corpus, which we anticipate obtaining within several months, the statistics of the possible vocalizations of a Hebrew word may be easily gathered (using a word segmentation system). For example, as described in 1.1, the non-vocalized word ספר! has a number of possible vocalizations; using the vocalized corpus we could count the number of instances of each valid vocalization. Then, in order to set a baseline, we could vocalize each word with its most common vocalization. This scheme, by which we intend to set our baseline, is not new; in fact, it is similar to [23], [13] and [37]. The innovation in such work lies in the corpus used: for the first time, a modern, typical vocalized text will be used for learning the distribution of vocalized words. Therefore, we believe our renewed baseline may reflect the difficulty of the automatic vocalization challenge much more accurately.
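A hedged sketch of this baseline scheme (the niqqud-stripping range is a simplification, and the corpus is assumed to be available as a list of vocalized tokens):

from collections import Counter, defaultdict

def strip_niqqud(word):
    # Hebrew points lie roughly in U+05B0..U+05C2 (a simplification).
    return ''.join(ch for ch in word if not ('\u05b0' <= ch <= '\u05c2'))

def train_baseline(vocalized_tokens):
    counts = defaultdict(Counter)
    for w in vocalized_tokens:
        counts[strip_niqqud(w)][w] += 1   # count each attested vocalization
    return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}

def vocalize(baseline, bare_word):
    return baseline.get(bare_word, bare_word)  # fall back to the input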


Improving the Baseline via Supervised Learning

In order to improve the baseline described in 8.2.4, we suggest the scheme described in Algorithm 10.

Algorithm 10 Supervised learning for automatic vocalization

Require: A vocalized corpus C
  W ← segment the words in C
  for all w such that w ∈ W do
    x ← w stripped of its vocalization signs
    V ← all possible vocalizations of x in C (x's confusion set)
  end for
  Split W into training and testing sets, and use supervised machine learning with contextual features to vocalize words correctly into one of the possible vocalizations in V.
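A hedged sketch of the confusion-set construction step of Algorithm 10; strip_niqqud is passed in (e.g. the helper sketched in 8.2.4 above):

from collections import defaultdict

def build_confusion_sets(vocalized_words, strip_niqqud):
    """Map each non-vocalized form x to its confusion set V:
    all vocalizations of x attested in the corpus."""
    confusion = defaultdict(set)
    for w in vocalized_words:
        confusion[strip_niqqud(w)].add(w)
    return confusion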


Appendix A

Data Set Example - Base Forms Correlated to Inflection Tables

The following words are a sample from our dataset of base forms, which includes over 4k base forms:

<base form, pattern, table number>

A,16,אבד!

C,20,אבזר!

!Nאבח,C,25

C,26,אבטח!

A,14,אגד!

!Pאג,A,14

A,14,אגר!

!Pאגר,C,20

A,21,אהב!

A,20,אהד!

D,14,אבזר!

!Nאבח,D,16

D,19,אבטח!

D,1,אבק!

!Nאב,D,3

D,1,אגד!

!Pאג,D,1

D,10,אדה!

C,20,אורר!

D,14,אורר!


Appendix B

Data Set Example - Inflected Vocalized Verbs

The following words are a sample from our automatically generated 250k dataset of inflected and fully vocalized verbs, along with their corresponding morphological attributes (Time+Person+Gender+Number+Spelling):

<pattern, table number, vocalized inflection, morphology, corresponding base form>

A,1,!בג®ד�תי,PAST+FIRST+MF+SINGULAR+COMPLETE,!בג®ד

A,1,!בג®ד�ת,PAST+SECOND+M+SINGULAR+COMPLETE,!בג®ד

A,1,!בג®ד�ת,PAST+SECOND+F+SINGULAR+COMPLETE,!בג®ד

A,1,!בג®ד,PAST+THIRD+M+SINGULAR+COMPLETE,!בג®ד

A,1,!בג�ד´ה,PAST+THIRD+F+SINGULAR+COMPLETE,!בג®ד

A,1,!בג®ד�נו,PAST+FIRST+MF+PLURAL+COMPLETE,!בג®ד

A,1,!Mבג®ד�ת,PAST+SECOND+M+PLURAL+COMPLETE,!בג®ד

A,1,!Nבג®ד�ת,PAST+SECOND+F+PLURAL+COMPLETE,!בג®ד

A,1,!בג�דו,PAST+THIRD+M+PLURAL+COMPLETE,!בג®ד

A,1,!בג�דו,PAST+THIRD+F+PLURAL+COMPLETE,!בג®ד

A,1,!בוג¦ד,PRESENT+FIRST+M+SINGULAR+COMPLETE, בג®ד!

A,1,!ד»תªבוג,PRESENT+FIRST+F+SINGULAR+COMPLETE, בג®ד!

A,1,!בוג¦ד,PRESENT+SECOND+M+SINGULAR+COMPLETE, בג®ד!

A,1,!ד»תªבוג,PRESENT+SECOND+F+SINGULAR+COMPLETE, בג®ד!

A,1,!בוג¦ד,PRESENT+THIRD+M+SINGULAR+COMPLETE, בג®ד!

A,1,!ד»תªבוג,PRESENT+THIRD+F+SINGULAR+COMPLETE, בג®ד!

A,1,!Mבוג�ד£י,PRESENT+FIRST+M+PLURAL+COMPLETE, בג®ד!

A,1,!בוג�דות,PRESENT+FIRST+F+PLURAL+COMPLETE, בג®ד!

A,1,!Mבוג�ד£י,PRESENT+SECOND+M+PLURAL+COMPLETE, בג®ד!


Appendix C

Data Set Example - Inflected Verbs Segmented into Syllables

The following segmented words are a sample from our automatically generated dataset of inflected verbs, segmented into syllables according to the string matching based algorithm:

בÊג®ד�Êתי!

בÊג®ד�Êת!

בÊג®ד�ת!

בÊג®ד!

בÊג�ד´ה!

בÊג®ד�Êנו!

!MתÊבג®ד�

!NתÊבג®ד�

בÊג�דו!

בוÊג¦ד!

בוÊגʪד»ת!

!Mג�ד£יÊבו

בוÊג�דות!

בÊגוד!

בגוÊד´ה!

!Mד£יÊבגו

בגוÊדות!

אבÊגוד!

אבÊגד!

תבÊגוד!


Bibliography

[1] The Academy for the Hebrew Language. http://hebrew-academy.huji.ac.il.

[2] Qaryan Hebrew TTS. http://www.software112.com/products/qaryan-hebrew-tts.html.

[3] M.M. Adler. Hebrew morphological disambiguation: An unsupervised stochastic word-based approach. PhD thesis, Citeseer, 2007.

[4] E. Avneyon, R. Nir, and I. Yosef. Milon sapir: The Concise Sapphire Dictionary. Hed Artsi, Tel Aviv (in Hebrew), 1997.

[5] S. Barkali. The complete tablet of names: tablets for the inflections of names in all their forms. Luah ha-shemot ha-shalem: luhot le-netiyat ha-shemot al kol mishkalehem ve-tsurotehem bi-tsiruf reshimah shel beerekh 20,000 shemot-etsem mi-tekufat ha-Tanakh ve-ad yemenu... Re'uven Mas (in Hebrew), 1962.

[6] S. Barkali. The complete tablet of verbs: tablets for the inflection of verbs. Luach ha-pe'alim ha-shalem: luchot li-netiyat ha-pe'alim... reshimah mechudeshet shel kol shoreshe ha-pe'alim ba-loshon ha-'Ivrit... Re'uven Mas (in Hebrew), 1988.

[7] S. Bartlett, G. Kondrak, and C. Cherry. Automatic syllabification with structured SVMs for letter-to-phoneme conversion. Proceedings of ACL-08: HLT, pages 568-576, 2008.

[8] T. Berkovich. Snopi automatic nikud. http://www.nakdan.com/Nakdan.aspx.

[9] Y. Choueka and Y. Neeman. Nakdan-text (an in-context text-vocalizer for modern Hebrew). In BISFAI-95, The Fifth Bar Ilan Symposium for Artificial Intelligence, 1995.

[10] D. Dannélls and J.J. Camilleri. Verb morphology of Hebrew and Maltese - towards an open source type theoretical resource grammar in GF. In Proceedings of LREC 2010, Workshop on Language Resources (LRs) and Human Language Technologies (HLT) for Semitic Languages: Status, Updates, and Prospects, 2010.


[11] A. Even-Shoshan. A new dictionary of the Hebrew language. Kiryat-Sefer, Jerusalem (in Hebrew), 1981.

[12] R. Finkel and G. Stump. Generating Hebrew verb morphology by default inheritance hierarchies. In Proceedings of the ACL-02 workshop on Computational approaches to Semitic languages, pages 1-10. Association for Computational Linguistics, 2002.

[13] Y. Gal. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL-02 workshop on Computational approaches to Semitic languages, pages 1-7. Association for Computational Linguistics, 2002.

[14] Y. Goldberg and M. Elhadad. Hebrew dependency parsing: Initial results. In Proceedings of the 11th International Conference on Parsing Technologies, pages 129-133. Association for Computational Linguistics, 2009.

[15] Y. Goldberg, R. Tsarfaty, M. Adler, and M. Elhadad. Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 327-335. Association for Computational Linguistics, 2009.

[16] L. Haizhou and Y. Baosheng. Chinese word segmentation. Language, 18:212-217, 1998.

[17] N. Har'el and D. Kenigsberg. Hspell - the free Hebrew spell checker and morphological analyzer. In Israeli Seminar on Computational Linguistics, 2004.

[18] C. Huang, J. Gao, Li Mu, and X. Chang. Chinese word segmentation, 2005. EP Patent 1,515,240.

[19] A. Itai and S. Wintner. Language resources for Hebrew. Language Resources and Evaluation, 42(1):75-98, March 2008. http://www.mila.cs.technion.ac.il/mila/eng/index.html.

[20] D. Jurafsky and J.H. Martin. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. MIT Press, 2006.

[21] D. Kamir, N. Soreq, and Y. Neeman. A comprehensive NLP system for modern standard Arabic and modern Hebrew. In Proceedings of the ACL-02 workshop on Computational approaches to Semitic languages, pages 1-9. Association for Computational Linguistics, 2002.


[22] S. Klein and R.F. Simmons. A computational approach to grammatical coding of English words. Journal of the ACM (JACM), 10(3):334-347, 1963.

[23] L. Kontorovich. Problems in Semitic NLP: Hebrew vocalization using HMMs. 2001.

[24] M. Levinger, A. Itai, and U. Ornan. Learning morpho-lexical probabilities from an untagged corpus with an application to Hebrew. Computational Linguistics, 21(3):383-404, 1995.

[25] J.K. Low, H.T. Ng, and W. Guo. A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 161-164. Jeju Island, Korea, 2005.

[26] Melingo. Kolan. http://www.melingo.co.il/kolan.htm.

[27] K. Müller. Automatic detection of syllable boundaries combining the advantages of treebank and bracketed corpora training. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 410-417. Association for Computational Linguistics, 2001.

[28] R. Nelken and S.M. Shieber. Arabic diacritization using weighted finite-state transducers. Computational Approaches to Semitic Languages, 8:79, 2005.

[29] N. Neser. Hanikud halachah lema'ase. Mesada (in Hebrew), 1976.

[30] Z. Qun and C. Yu. Research on Chinese word segmentation algorithm based on special identifiers. Computing and Intelligent Systems, pages 377-385, 2011.

[31] S. Asif. The tablets of verbs for the Hebrew language (Luahot ha-pe'alim ba-safa ha-'Ivrit). Prolog publishing house Ltd. (in Hebrew), 2009.

[32] H. Safadi, O. Dakkak, and N. Ghneim. Computational methods to vocalize Arabic texts. In Second Workshop on Internationalizing SSML, 2006.

[33] M. Saimaiti and Z. Feng. A syllabification algorithm and syllable statistics of written Uyghur. 2008.

[34] S. Shoval. Nikuda, 2010. http://www.nikuda.co.il/.

[35] K. Sima'an, A. Itai, Y. Winter, A. Altman, and N. Nativ. Building a tree-bank of modern Hebrew text. Traitement Automatique des Langues, 42(2), 2001.

[36] Torah Educational Software. Auto nikud plus. http://www.jewishstore.com/Software/AutoNikud.htm.


[37] M. Spiegel and J. Volk. Hebrew vowel restoration with neural networks. In Class of 2003 Senior Conference on Natural Language Processing. Citeseer, 2003.

[38] P. Taylor. Text-to-speech synthesis, volume 1. Citeseer, 2009.

[39] P. Taylor, A. Black, and R. Caley. The architecture of the Festival speech synthesis system. In The Third ESCA Workshop in Speech Synthesis, pages 147-151. Citeseer, 1998.

[40] D. Yarowsky. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 88-95. Association for Computational Linguistics, 1994.

[41] Y. Zamir. HOCR is a Hebrew optical character recognition library, 2008. http://hocr.berlios.de/index.html.

[42] I. Zitouni and R. Sarikaya. Arabic diacritic restoration approach based on maximum entropy models. Computer Speech & Language, 23(3):257-276, 2009.