Ben-Gurion University of the Negev
Faculty of Natural Science
Department of Computer Science
AUTOMATIC HEBREW TEXT
VOCALIZATION
Thesis submitted as part of the requirements for the
M.Sc. degree of Ben-Gurion University of the Negev
by
Eran Tomer
January 2012
Subject: Automatic Hebrew Text Vocalization
Written By: Eran Tomer
Advisor: Michael Elhadad
Department: Computer Science
Faculty: Natural Science
Ben-Gurion University of the Negev
Signatures:
Author: Eran Tomer Date
Advisor: Michael Elhadad Date
Dept. Committee Chairman Date
Abstract
Written Hebrew involves an exceptionally high ambiguity rate due to the lack of vowel
markings in the vast majority of modern Hebrew texts. Each letter may be vocalized and
pronounced in multiple ways. This letter level ambiguity naturally increases word level
ambiguity, which complicates many other natural language tasks. Automatic translation
of Hebrew, Hebrew text to speech and many other NLP tasks may benefit from a reliable
system which restores vocalization signs.
In this research we aim to implement tools that may be used to enhance the capabilities
of an automatic vowel restoration system. We focus on handling verbs. The Hebrew
verb system has the most complex morphology and vocalization mechanisms of all parts of
speech. We first present a comprehensive generation mechanism, which produces vocalized
and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an
indication of which pattern the verb follows. Given a classification of verbs into about 300
distinct clusters, our generation mechanism generates fully vocalized inflected verbal forms.
Using our implementation of this method, we have produced a lexical resource for modern
Hebrew that includes all inflected forms for over 4,000 distinct verbs. This database contains
about 250,000 fully inflected and fully vocalized verbal forms, with an accuracy estimate of
over 99.4%.
In the second part, we address the task of automatic segmentation of vocalized words
into syllables. This task is a necessary component of Text to Speech systems. We estimate
the accuracy of the syllable segmentation algorithm taught in classical Hebrew grammar
books (Behor's 5 rules of shva classification). We observe that this algorithm only succeeds
in about 80%-85% of the verbs we tested. In the case of verbs, if we introduce additional
knowledge, we manage to make this task succeed at over 99.4% accuracy. This success rate
requires, as additional input to the syllable segmentation algorithm, the vocalized origin form
of the verb. This finding provides strong support to indicate that phonological mechanisms
in Hebrew rely on a construction mechanism (that is, syllable segmentation starts from a
base form + inflection form and not directly from the fully inflected form).
Finally, we address the task of classifying the non-vocalized base form of a verb into one
of the 300 inflection paradigms needed to determine the verb's full vocalization. We use
supervised learning techniques and letter features: the position of the letters in the base form
and the classification of the letters into guttural/non-guttural families. Surprisingly, this
classification mechanism only succeeds at about 70% accuracy over a sample of several
thousand verbs. This finding contradicts traditional grammar intuitions, which indicate that
simple letter pattern rules predict the verb vocalization. We first confirm that the simpler
task of classifying into Binyanim (the 7 basic major verb patterns in Hebrew) can be achieved
on the basis of simple letter rules (with accuracy over 90%). We then investigate the
hypothesis that corpus-level features may help the task of classifying non-vocalized verbal
base forms into one of the 300 inflection paradigms. Corpus-level features capture ad-hoc
letter patterns observed over a very large corpus of fully inflected non-vocalized verbs
(gathered over a corpus of over 50M words). We find that such features improve verb
classification by about 5%. This finding indicates that the verb inflection mechanism in
Hebrew is more irregular than is usually assumed.
The resources (software and databases) we have produced provide important stepping
stones towards the implementation of a fully automatic vowel restoration system for
Hebrew. The experimental data we have gathered also provides new insights into the
nature of the verbal word formation system in Hebrew.
Acknowledgements
I would like to thank my advisor, Prof. Michael Elhadad, for all his ideas and support, and
the member of the Academy of the Hebrew Language, Prof. Jacob Ben Tulila, who helped
me understand the basics of manual and automatic vocalization. I also thank the entire
NLP team at Ben-Gurion University, and especially Dr. Meni Adler and Dr. Yoav Goldberg,
who eased my ventures in implementing my SVM-based classifiers, as well as my parents
and my wife for all their help and understanding. Last, I thank my (newly born) baby girl,
Dani, for sleeping so well at nights, and by that making the writing of this work possible.
Contents
Abstract II
Acknowledgements IV
I 2
1 Introduction 3
1.1 Domain - Natural Language Processing . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Natural Language Processing in Hebrew . . . . . . . . . . . . . . . . . 4
1.1.2 Natural Language Processing in Hebrew - Related Work . . . . . . . . 5
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Motivation and Contributions 7
2.1 Practical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Development of a Hebrew Text-To-Speech System . . . . . . . . . . . 7
2.1.2 Generation of Vocalized Text for Teaching Usages . . . . . . . . . . . 8
2.1.3 Improving Automatic Translation Systems . . . . . . . . . . . . . . . . 8
2.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Syllable Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Unknown Verbs Classification . . . . . . . . . . . . . . . . . . . . . 9
2.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 The Inflectional Model for Verbs . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Creating a Shva Classification Algorithm . . . . . . . . . . . . . . . . 10
2.3.3 Missing Training/Testing Sets . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.4 Evaluation of Generation and Syllable Segmentation Accuracy . . . . 10
2.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Previous Work Regarding Hebrew Vocalization 13
3.1 Commercial/Non-academic Vowel Restoration Systems . . . . . . . . . . . . . 14
3.2 Commercial/Non-academic Attempts on Syllable Segmentation . . . . . . . . 18
3.3 Academic Attempts on Vowel Restoration . . . . . . . . . . . . . . . . . . . . 19
3.4 Academic Attempts on Syllable Segmentation . . . . . . . . . . . . . . . . . . 20
3.5 Generation in Hebrew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
II 22
4 Background - Principles of Vocalization in Hebrew 23
4.1 Linguistic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.1 Hebrew Letters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.2 Vocalization Signs in Hebrew . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.3 Syllables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.4 Stress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.5 Deficient Spelling vs. Plene Spelling . . . . . . . . . . . . . . . . . . . 29
4.1.6 Vocalization Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 The Case of Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Patterns - Binyanim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Inflection Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Datasets 35
5.1 Verbs List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Morphologically Analysed Corpora . . . . . . . . . . . . . . . . . . . . . . . . 35
III 36
6 Method 37
6.1 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Syllable Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.2.1 Following Behor's Footsteps - Heuristic Approach . . . . . . . . . . . . 40
6.2.2 Shva Classification According to the Base Form . . . . . . . . . . . . . 42
6.3 Unknown Verbs Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7 Experiments and Results 49
7.1 Generation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Syllable Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.2.1 Following Behor's Footsteps - Heuristic Approach . . . . . . . . . . . . 50
7.2.2 Syllable Segmentation According to the Base Form . . . . . . . . . . . 51
7.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.3 Results of Unknown Verbs Classification . . . . . . . . . . . . . . . . . . . . . 52
7.3.1 Base Forms Classification into Binyanim (patterns) . . . . . . . . . . 52
7.3.2 Classification of Base Forms to Inflection Tables . . . . . . . . . . . . 54
7.3.3 Classification of Base Forms to Inflection Tables with Corpus Level
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
IV 58
8 Conclusions and Future Work 59
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.2.1 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.2.2 Syllable Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8.2.3 Unknown Verbs Classification . . . . . . . . . . . . . . . . . . . . . . . 61
8.2.4 Automatic Vocalization . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A Data Set Example - Base Forms Correlated to Inflection Tables 64
B Data Set Example - Inflected Vocalized Verbs 65
C Data Set Example - Inflected Verbs Segmented into Syllables 66
Bibliography 67
List of Tables
1.1 Ways to vocalize ספר! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Hebrew letters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Additional letters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 The types of consonants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Vocalization vowels and semi-vowels . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Dagesh sound manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Long vs. short vowels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.7 Application of the syllables and vowels rule . . . . . . . . . . . . . . . . . . . 31
4.8 A complete paradigm from the Paal pattern - inflections of שפ|!"כ|! . . . . . . . 34
4.9 A complete paradigm from the Paal pattern - inflections of גד!"ל! . . . . . . . 34
6.1 The valid forms of מועמד! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Features created on the second classification phase for the base form הÇuר¯ז! . . 48
7.1 Generation error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.2 Syllable segmentation comparison . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.3 Base form distribution among patterns . . . . . . . . . . . . . . . . . . . . . . 52
7.4 Classification of base forms to patterns - confusion matrix . . . . . . . . . . . 54
7.5 Identification of exact inflection table - comparison of baseline manipulations 54
7.6 Classification of base forms to inflection tables with corpus level features . . . 57
List of Figures
3.1 Ambiguity in vocalization of Hebrew text . . . . . . . . . . . . . . . . . . . . 15
3.2 The little prince vocalization according to Nakdan Text . . . . . . . . . . . . 17
3.3 The little prince vocalization according to Snopi - Automatic Nikud . . . . . 17
3.4 The little prince vocalization according to Nikuda . . . . . . . . . . . . . . . . 18
4.1 Hebrew patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.1 Base form distribution among patterns . . . . . . . . . . . . . . . . . . . . . . 53
7.2 Inflection table confusion pairs . . . . . . . . . . . . . . . . . . . . . . . . . . 55
List of Algorithms
1 The syllables and vowels rule ( והתנועות! ההברות (כלל . . . . . . . . . . . . . . . 30
2 The Behor scheme for Shva classification . . . . . . . . . . . . . . . . . . . . 41
3 Our heuristic for Shva classification . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Edit distance matrix calculation . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Initiation of the edit distance Matrix . . . . . . . . . . . . . . . . . . . . . . . 43
6 The edit distance δ-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7 The Hebrew adapted δ-function . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8 Corpus level features extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9 Unification of clusters of inflection tables which cause confusion . . . . . . . . 56
10 Supervised learning for automatic vocalization . . . . . . . . . . . . . . . . . 63
Part I
Chapter 1
Introduction
This research aims to investigate and develop new methodologies in Natural Language
Processing (NLP) regarding the vocalization of Hebrew text. Our main goals are:
- Automatic generation of vocalized words.
- Automatic word segmentation into syllables.
- Handling unknown words in our generation model.
The strategy we use is a hybrid approach that combines a rule-based model, statistical
methods and machine learning tools.
1.1 Domain - Natural Language Processing
The field of computational linguistics attempts to model and study languages using
computational techniques. The diverse challenges confronted by computational linguistics
researchers include machine translation, automatic text summarization, speech-to-text,
text-to-speech and many more.
Throughout the years, many linguistic resources such as dictionaries, lexicons and other
forms of labelled text were developed, mainly for English, yet some resources do exist for
other languages as well. The first attempts to accomplish tasks in NLP were based mostly
on formalizing deterministic models grounded in linguistic knowledge; for example, [20,
Ch.12] describes the attempt to model English as a Context Free Grammar (CFG), and
Klein and Simmons developed an automatic tagging system by manually gathering
hand-crafted rules [22]. Later on, the probabilistic path of NLP began blooming, and the
use of corpus-based statistical data became widespread [20, Ch.14]. Today, the use of
machine learning, in its various forms, plays a key role in state-of-the-art NLP
achievements.
1.1.1 Natural Language Processing in Hebrew
Computational linguistics in Hebrew is considered relatively difficult for two main reasons.
The first is the lack of existing large-scale Hebrew linguistic resources, making supervised
learning techniques generally harder to apply. The second difficulty is caused by the rich
morphology and high ambiguity level of Hebrew. According to [3, p.39], the average
ambiguity rate for a Hebrew word is approximately 2.7, vs. English (1.41), French (1.7) or
Spanish (1.25).
The high ambiguity rate in written Hebrew results, among other factors, from the lack of
vocalization signs in the standard Hebrew writing system. Therefore, a given Hebrew word
may have an astonishing number of different meanings. For example, Table 1.1 displays
the different ways to vocalize the word ספר!, and the corresponding meanings, morphological
analyses and pronunciations.
The high ambiguity rate poses a significant obstacle for many computational linguistic
tasks such as text-to-speech, automatic morphological analysis, machine translation and
many other key tasks.
Table 1.1: Ways to vocalize ספר!

Word  | POS* | Morphology                                | Meaning            | Pronunciation
ספר!  | Verb | Past-Masculine-3rd person-Singular        | counted            | safar
ספר!  | Verb | Imperative-Masculine-2nd person-Singular  | count              | sfor
ספר!  | Noun | Singular                                  | book               | sefer
ספר!  | Verb | Past-Masculine-3rd person-Singular        | (he) told          | siper
סuפר! | Verb | Past-Masculine-3rd person-Singular        | (was) told         | supar
ספר!  | Verb | Imperative-Masculine-2nd person-Singular  | tell               | saper
ספר!  | Noun | Singular                                  | hairdresser        | sapar
ספר!  | Verb | Past-Masculine-3rd person-Singular        | (he) cut hair      | siper
סuפר! | Verb | Past-Masculine-3rd person-Singular        | (his) hair was cut | supar
ספר!  | Verb | Imperative-Masculine-2nd person-Singular  | cut hair           | saper
ספר!  | Noun | Singular                                  | border, frontier   | sfar
ספר!  | Noun | Singular                                  | narrative          | siper

* Part Of Speech
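The ambiguity illustrated in Table 1.1 can be pictured as a one-to-many lookup: a single non-vocalized surface form maps to a whole list of analyses. A minimal sketch in Python (the transliterated key "spr", the field names and the subset of analyses are illustrative only, not the thesis' actual data format):

```python
# Toy lexicon: one non-vocalized surface form -> many candidate analyses,
# mirroring a few rows of Table 1.1.
LEXICON = {
    "spr": [  # transliteration of the non-vocalized form ספר
        {"pron": "safar", "pos": "Verb", "morph": "Past-Masc-3-Sg", "meaning": "counted"},
        {"pron": "sefer", "pos": "Noun", "morph": "Sg", "meaning": "book"},
        {"pron": "siper", "pos": "Verb", "morph": "Past-Masc-3-Sg", "meaning": "(he) told"},
        {"pron": "sapar", "pos": "Noun", "morph": "Sg", "meaning": "hairdresser"},
        {"pron": "sfar",  "pos": "Noun", "morph": "Sg", "meaning": "border, frontier"},
    ],
}

def analyses(surface):
    """Return every candidate analysis for a non-vocalized surface form."""
    return LEXICON.get(surface, [])

print(len(analyses("spr")))  # 5 candidate readings for this toy entry alone
```

Any downstream task (tagging, translation, TTS) must pick one entry from such a list for every word, which is exactly the disambiguation burden discussed above.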
In some cases, Hebrew texts do contain vocalization signs: the Bible and some ancient
writings, poetry, educational material and even some encyclopaedias include vocalization
signs. In such cases, the ambiguity rate drops significantly, yet such texts are rather scarce
and (apart from encyclopaedias) rarely use modern Hebrew grammar.
Moreover, such resources rarely exist in digital form, and transforming printed vocalized
Hebrew text into digital form involves either a great deal of manual labour or, via OCR
(see [41]), a substantial number of errors.
1.1.2 Natural Language Processing in Hebrew - Related Work
In spite of the difficulties, much has been accomplished in Natural Language Processing in
Hebrew:
- The Hebrew TreeBank: The tree bank [35] includes some 5000 sentences with
complete segmentation and POS-tagging from the "Ha'aretz" daily newspaper.
- MILA - Knowledge Center for Processing Hebrew: MILA [19] offers
comprehensive corpora of plain Hebrew text, a limited corpus of spoken Hebrew, lexicons
and tools for tokenization, morphological tagging, and morphological disambiguation.
- Word Segmentation: Word segmentation in Hebrew is a challenging task because
of the agglutinative nature of Hebrew - some parts of speech can be glued, as prefixes
or suffixes, to other words. For example:
1. In Hebrew, there are 7 particles " "מ|!","ש!","ה!","ו!","כ|!","ל!","ב! (letters that do not
appear by themselves, but come as prefixes to a word); in many cases more than
one such prefix is valid. Of course, these letters are also used as regular letters
to form words. For example, the phrase "!Mומש Nמכא" which means "from here and
from there", should be segmented as [ !M[ו!][מ|!][ש [ !N[מ|!][כא.
2. Hebrew verbs tend to receive prefixes and suffixes corresponding to their
morphology. For example, the word " "אהבתיה! (Ahavtiha) is composed of "אהב!" (loved),
"תי!" (I) and "ה!" (her), s.t. "אהבתיה!" actually means "I loved her".
3. Nouns also receive suffixes that indicate number, gender and definiteness.
Much has been accomplished in Hebrew word segmentation [14, 15], but a complete
solution is yet to be achieved. Word segmentation is not a challenge unique to Hebrew;
substantial research on Chinese word segmentation was, and still is, conducted [16, 18,
25, 30].
- Morphological tagging tool: The task of morphologically tagging a given text
involves attaching a tag of morphological attributes to every word in the text. In 2007,
Adler and Elhadad [3] defined a morphological tag-set for Hebrew and developed a
full morphological analyzer for Hebrew text. The morphological analyzer provides
segmentation into morphemes and POS-tags at about 94% accuracy, and full
morphological disambiguation at about 91% accuracy.
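The particle-prefix behaviour described in the word-segmentation item above can be sketched as a recursive search: peel off candidate particle letters one at a time and keep any split whose remainder is a known word. A minimal illustration (the three-word lexicon and the function name are toy assumptions; a real segmenter must also validate which particle combinations and orderings are grammatical):

```python
# Enumerate particle-prefix segmentations of a Hebrew word against a toy lexicon.
PARTICLES = set("משהוכלב")        # the 7 Hebrew particle letters
LEXICON = {"כאן", "שם", "ספר"}     # toy word list: "here", "there", "book"

def segmentations(word, prefix=()):
    """Return all (particle..., core) splits of `word` whose core is in LEXICON."""
    results = []
    if word in LEXICON:                      # the remainder is itself a word
        results.append(prefix + (word,))
    if word and word[0] in PARTICLES:        # try stripping one more particle
        results.extend(segmentations(word[1:], prefix + (word[0],)))
    return results

# "מכאן" = particle מ + "כאן" ("from here")
print(segmentations("מכאן"))    # [('מ', 'כאן')]
# stacked particles are peeled off one by one
print(segmentations("ושמכאן"))  # [('ו', 'ש', 'מ', 'כאן')]
```

With a realistic lexicon, many words would return several valid splits, which is precisely the segmentation ambiguity the works cited above confront.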
1.2 Overview
In Part I, we introduce our general area of research and outline the motivation, previous
related work, research questions and our contributions. Part II presents some Hebrew
linguistic background and the datasets we use. Part III describes our method, experiments
and results, and Part IV concludes our work and suggests future work. In addition, we
present in the appendix samples from the datasets we produced throughout this work.
Chapter 2
Motivation and Contributions
2.1 Practical Applications
The motivation for an automatic vocalization mechanism includes various practical
applications in which success today is only partial.
2.1.1 Development of a Hebrew Text-To-Speech System
A text-to-speech (TTS) system aims to convert written text into speech. The TTS task
consists of two main parts: one involves processing the words in the text into normalized
words, and the second includes the actual synthesis of voice (based upon the normalized
words). A normalized word attempts to represent the way a word should be read. For
example, the word lead in the sentence "I eat lead for breakfast" should be pronounced as
led (and not leed); this should be deducible from the normalized form of lead.
TTS for non-vocalized Hebrew text is complex mainly because Hebrew non-vocalized
words tend to be very ambiguous, and, therefore, a given non-vocalized word may be
pronounced in many different ways. The high ambiguity rate in pronunciation is caused by
several main Hebrew attributes:
1. Hebrew's agglutinative nature, and the resulting difficulty in word segmentation.
2. Each Hebrew letter can be pronounced in various ways depending on its vocalization
(up to 12 pronunciations per letter!).
3. Each Hebrew word is pronounced with a certain emphasis, which cannot be derived
directly from the non-vocalized word and its morphology. Moreover, Hebrew word
pronunciation depends on the syllables composing the word. Each syllable is
pronounced at a frequency corresponding to the vocalization of the letters forming the
syllable. Segmenting a non-vocalized word into its syllables is difficult since there exist
no clues for deciding whether a given letter plays a vowel or a consonant role.
Vocalized and syllabified words seem like a good representation of normalized Hebrew
text, by which all these issues are either completely solved or drastically simplified.
Therefore, the lion's share of the first part of the TTS task may be resolved by creating a
program that returns vocalized text given the standard non-vocalized input. Once the goal
of creating such a program is achieved, we expect the formation of a TTS system to
naturally follow.
The TTS task is well studied in various languages (see [38]); moreover, it was implemented
successfully in languages such as English and Spanish (see, for example, the Festival Speech
Synthesis System [39]). In Hebrew, a few TTS systems exist, such as Kolan by Melingo
[21, 26], the open source Qaryan Hebrew TTS [2] and some others. In [21], 98% accuracy
was reported for Kolan, yet the manner in which this accuracy was measured was not
specified. Other systems do not provide descriptive data assessing their capabilities.
2.1.2 Generation of Vocalized Text for Teaching Usages
Vocalized words are commonly used for teaching Hebrew. The idea is to reduce ambiguity
and the vast number of pronunciation possibilities. Vocalizing words manually is regarded
as a complicated lost art, practically mastered by only a handful of scholars.
An automatic vocalization generator could satisfy the need for vocalized words in teaching.
2.1.3 Improving Automatic Translation Systems
Modern automatic translation systems that translate Hebrew to other languages tend to
err when confronting words that have more than one meaning. For example, the word תמונה!
may either be תמוÉה! (Tmunah), which means "picture", or תמונªה! (Temuneh), which means
"(you) will be appointed". In such cases, a translation system usually chooses the word that
is more frequent. Google Translate, for example, translates the sentence לתפקיד!" תמונה "אתה
to "You picture the job" instead of "you will be appointed to the job".
Translating vocalized text is a much simpler task due to the dramatic reduction of
ambiguity. In other words, a vocalized Hebrew word has a very limited number of meanings
(one, in the absolute majority of cases).
2.2 Objectives
2.2.1 Generation
Given a non-vocalized base word (an absolute-state noun or a verb base form), we intend
to automatically generate all appropriate inflections of the word and their corresponding
morphological characteristics.
Obviously, accomplishing the above in full is a tremendous task that involves a great
deal of manual labour; therefore, we focus on a sub-objective in this work: we confront the
generation problem for verbs only.
2.2.2 Syllable Segmentation
Given a vocalized word, we intend to present a scheme for syllable segmentation. Once
a word is segmented into syllables, a TTS system could use the vocalized and segmented
output as its normalized text input and determine for each syllable its appropriate duration,
volume and stress.
2.2.3 Unknown Verbs Classification
The general task of vocalization takes non-vocalized text as input and returns the fully
restored vocalization of the text. The vocalized text could then be further analysed with
relative ease, thanks to the significant reduction of word ambiguity.
The task of vocalization requires either a large corpus of vocalized text (which we did
not manage to acquire in the scope of this work) to apply supervised learning, or a large
lexical resource with full vocalization. In addition to the resource we intend to develop via
our generation mechanism, we will provide a system for classifying unknown verbs into their
corresponding pattern (Binyan) and into the specific corresponding inflection table.
2.3 Challenges
2.3.1 The Inflectional Model for Verbs
The Hebrew verb inflectional model is composed of a complex network of transformations
and vocalization schemas. Implementing such a model involves the manual definition of over
250 inflection tables, where each table includes over 60 inflection schemes (on average) that
form the actual verb inflections.
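The scale this model implies can be checked with quick arithmetic, using the figures quoted in this work (250+ tables, about 60 schemes per table, 4,000+ verbs):

```python
# Back-of-the-envelope scale of the inflectional model described above.
tables = 250            # manually defined inflection tables
schemes_per_table = 60  # inflection schemes per table, on average
verbs = 4000            # distinct base forms covered by the lexicon

print(tables * schemes_per_table)  # 15000 inflection schemes to implement
print(verbs * schemes_per_table)   # 240000 inflected forms generated
```

The second figure matches the ~240k-250k fully inflected vocalized forms reported for the resulting database.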
2.3.2 Creating a Shva Classification Algorithm
Segmenting a word into syllables relies heavily on identifying the Shva instances in the word
as Shva Na or Shva Nach. For this reason, we must define a method for classifying Shva
instances as one of the two types.
2.3.3 Missing Training/Testing Sets
In order to apply machine learning techniques to the task in 2.2.3, we must obtain a labelled
dataset that correlates inflection tables and base forms of verbs. To our knowledge, no such
digital dataset exists; therefore, we have to gather such a dataset manually.
2.3.4 Evaluation of Generation and Syllable Segmentation Accuracy
Due to the lack of a comprehensive vocalized lexicon (that includes inflections), we have to
manually check a substantial amount of vocalized words and their corresponding morphology.
This manual check will enable us to assess the accuracy of our generation model and of our
resulting, fully vocalized and morphologically disambiguated, inflection list. Similarly, there
exists no automatic way to assess the precision of our syllable segmentation algorithm; hence
manual checking of a representative sample cannot be avoided.
2.4 Research Questions
In this research, we intend to address two questions regarding vocalization and syllable
segmentation in Hebrew:
- How complex must the computational model be for full morphological and vocalization
generation of verbs? And how much lexical knowledge and how many exceptions are
required to cover the Hebrew verb lexicon?
- How complex is syllable segmentation? And what level of knowledge is required for
successful segmentation?
2.5 Contributions
2.5.1 Resources
- A corpus of vocalized songs: Vocalized data is hard to obtain, yet Hebrew literature
often makes use of vocalized words. The web site http://www.zemereshet.co.il includes
over 2,500 fully vocalized Hebrew songs, with over 50 words in each song on average.
As part of this work, we have gathered these vocalized songs into one vocalized corpus
which may be used in the future. Yet, literature is far from an optimal source of
vocalized data, with a high rate of names, places, words borrowed from other languages
and non-typical grammar.
- A collection of vocalized and morphologically tagged verbs: We provide a
collection of over 240k vocalized verb inflections along with a set of corresponding
morphological attributes including tense, gender, person, number and spelling.
Moreover, in case more than one valid form of the inflection exists, we provide all valid
forms.
- A collection of verbs segmented into syllables: We introduce a collection of over
240k vocalized verb inflections that were automatically segmented into syllables. A
word in the collection is segmented correctly with probability 99.33%, and a syllable is
segmented correctly with probability 99.5%.
2.5.2 Methods
- Inflection tables implementation: We present the Java implementation of over
250 inflection tables. Each inflection table takes a verb in its past, masculine, 3rd
person, singular, deficiently spelled form (base form) and generates all inflections with
the corresponding morphology.
- Syllable segmentation: We introduce two methods for syllable segmentation of a
given vocalized word. One method takes the vocalized word only as input, while the
second method also takes the base tense form (the masculine, 3rd person, singular,
deficiently spelled inflection in the same tense). The first method proved to be accurate
in 81% of the cases, and the second method was accurate in 99.33% of the cases
(per-word accuracy).
- Classification of unknown base forms into our generation model: We provide
two methods for the classification of unknown base forms. The first method classifies
a base form to its corresponding pattern (Binyan) with accuracy greater than 90%.
The second method classifies a base form to its corresponding inflection table with
accuracy of about 70%.
Chapter 3
Previous Work Regarding Hebrew
Vocalization
The task of automatic vocalization of Hebrew text has been confronted by a few companies
and academic researchers. The vowel restoration process may be viewed as the unification
of three independent tasks:
- Word Segmentation (optional): As discussed in 1.1.2, in many cases Hebrew words
are formed by concatenating to words some prefixes and suffixes, which may indicate
participant, preposition and more. Therefore, a simple dictionary that includes
words with all these possible prefixes and suffixes would have to be very large.
According to [4], the number of basic words in Hebrew is around 90,000, yet every noun
has 10 suffix modifiers which determine participant, and 7 possible prefixes (particles)
which indicate preposition, the definite article etc., and some combinations of these
prefixes are also valid. For verbs the case is even worse, as a base form of a verb may be
inflected into about 60 inflections (as described in 6.1), while each inflection may be
augmented with some possible prefixes. Overall, as stated in [21], the total number of
words in a dictionary which includes inflections and other prefixes and suffixes is
estimated to be around 70M. By using a word segmentation mechanism, certain prefixes
and suffixes are separated from the core of the word, and the size of the dictionary may
be dramatically reduced, by a factor of ∼1000 (90k vs. 70M). Generally, a system that
does not use word segmentation as part of its vocalization process may suffer from a
high rate of unknown words.
• Suggestion of possible vocalizations: This task involves the gathering of a collection
of possible vocalizations for a given word in the text. Obviously, if each possible
vocalized word in this collection is also associated with its corresponding morphology,
the next selection phase may be performed more easily.
• Selection of a certain vocalization: Given a non-vocalized word and the collection
of its possible vocalizations, one vocalized form of the word should be selected from
the collection. The selection may be performed according to part of speech agreement,
morphology annotation agreement and other statistical and context dependent schemes.
Some systems ignore the selection phase, resulting in semi-automatic vocalization software
(meaning the user of the software is the one who performs the selection). Yet the holy
grail is obviously a completely automatic vowel restoration system.
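As a back-of-envelope illustration of the dictionary sizes discussed in the word-segmentation item above, the figures quoted from [4] and [21] can be combined as follows; the way the counts multiply is an illustrative assumption of this sketch, not the derivation used in [21]:

```python
# Back-of-envelope dictionary-size estimate using only the figures quoted
# above ([4], [21]); how the counts combine is an illustrative assumption.
base_words = 90_000        # basic Hebrew words, per [4]
noun_suffixes = 10         # participant suffix modifiers per noun
noun_prefixes = 7          # particle prefixes (preposition, definite article, ...)

# Nouns alone, before any prefix combinations, already give millions of forms:
noun_forms = base_words * noun_suffixes * noun_prefixes      # 6,300,000

# With ~60 verb inflections per base form (see 6.1) plus valid prefix
# combinations, the total approaches the ~70M full-form dictionary of [21].
full_dictionary = 70_000_000
reduction_factor = full_dictionary // base_words             # ~777, i.e. ~10^3
```

This makes the "factor of ∼1000" claim concrete: segmenting prefixes and suffixes away shrinks a ~70M full-form dictionary back toward the ~90k base-word lexicon.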
3.1 Commercial/Non-academic Vowel Restoration Systems
Nakdan Text (נקדן טקסט) [9, 21], which was originally developed by the Israeli Center for
Educational Technology, is marketed today by the Melingo company, and is claimed to vocalize
Hebrew text at over 97% accuracy. As [13] states, it is unclear how this accuracy is measured
(per character/per word), and no information was published regarding the methods
used by Nakdan Text. In his work [23], Kontorovich examines the gender agreement in
some simple examples vocalized by Nakdan Text, and deduces that Nakdan Text does not use
a generation model and is mostly based on lookup tables and ad-hoc rules. The resulting
vocalized text is, therefore, in some cases surprisingly wrong; for example, two verbs that
belong to the same paradigm may be vocalized by different vocalization templates.
Another commercial system is Snopi - Automatic Nikud [8] (סנופי - ניקוד אוטומטי), which
shows exactly the same results as Nakdan Text on Kontorovich's examples. Again, we were
not able to locate any specification of the methods used in Snopi - Automatic Nikud, yet
due to the similarity of the results to Nakdan Text, we believe Kontorovich's conclusions
concerning Nakdan Text are also valid in this case. According to other informal tests we
conducted on Snopi - Automatic Nikud, it seems to use a rather poor POS-tagger (if any)
and a very limited word segmentation scheme.
Auto Nikud [36] (אוטו ניקוד) is semi-automatic vocalization software: it lets the user
select the appropriate vocalization for a given word out of all the vocalized words it knows
(which were most probably automatically collected from some vocalized corpus). According
to the Auto Nikud website, it includes a list of about 200k vocalized words.
The Nakdanit (נקדנית) commercial system offers automatic, semi-automatic and manual
vocalization. Nakdanit is, according to its website description, completely dictionary-based
software: a word is automatically vocalized according to the first corresponding entry
in the dictionary. The dictionary of Nakdanit includes 260k vocalized words that, according
to the website, are 99% validated to be correct.
In 2010, another software product called Nikuda (נְקֻדָּה) [34] came to light. Nikuda clearly states
it does not use any linguistic rules, but relies solely on a database of vocalized words. The
database includes words from the Bible, poetry and words manually vocalized by Nikuda
users. In practice, some informal tests we have conducted suggest Nikuda is inferior to
the previous systems. Many unknowns and mistakes were present in examples the previous
systems vocalized successfully. Specifically, Nikuda's database-oriented method causes
many unnecessary unknowns. For example, the sentence ילד חשב אתמול (which means "a child
thought yesterday") is vocalized correctly as יֶלֶד חָשַׁב אֶתְמוֹל, but in the sentence ילד שחשב
אתמול (which means "a child that thought yesterday"), on the other hand, Nikuda treats שחשב
as unknown, although it may be deduced from חשב via word segmentation.
Figure 3.1: Ambiguity in vocalization of Hebrew text
* The little prince, by Antoine de Saint-Exupery, translated to Hebrew by Arieh Lerner.
* Words with more than one possible way for vocalization are underlined.
We conclude this brief summary of previous commercial and non-academic work regarding
automatic vocalization with a comparison of three of the automatic vocalization tools.
In Figure 3.1, we have a text that includes 88 non-vocalized Hebrew words, where words
which may be vocalized in more than one way are underlined. Some 51 words in the text
actually do have such ambiguity in vocalization (about 58%, very much like the 55% ambiguity
rate over 40,000 words from the Ha'aretz newspaper reported by Levinger [24]). The
ambiguities in the text and the mistakes made by the different systems were carefully
annotated according to [1, 4, 5, 11, 31]. Our accuracy measure was calculated in a word-wise
manner (no word segmentation was applied).
Figure 3.2 displays the resulting vocalization of our text by Nakdan Text. We note 9
words are vocalized incorrectly and one word whose spelling was altered (about 89%
success rate). Three of the mis-vocalized words are non-words; one of these is מה, which
should be vocalized differently according to a decision of the Academy of the Hebrew Language
(see 2.5.3 in [1]). The other two non-words (בתמהה and בענוה) are vocalized incorrectly
only in the ב at the beginning of the word. The ב in these cases is not a part of the actual
word, but a particle (a formative letter which means "in") that is glued to the word. These
mistakes are surprising, since in general, the vocalization of formative letters is relatively
simple [11], under the assumption that ב is recognized as a formative letter. Since the rest
of the word is vocalized correctly, and since the word is not ambiguous, we deduce the ב
may be simply recognized as a formative letter, and, therefore, the reasoning for its
vocalization should be immediate. As mentioned, one word (ספור) in the Nakdan Text input
was spelled differently (as ספר) in the output. According to the suggested vocalization it
is clear that Nakdan Text produced the correct word (ספור and ספר are both valid forms of
the same word in Hebrew), yet for some mysterious reason Nakdan Text preferred an
alternative spelling (as if ספור were misspelled, contrary to [4]).
Figure 3.2: The little prince vocalization according to Nakdan Text
* Vocalization mistakes are in red, non-words are in orange and alternatively spelled words
are in pink.
Figure 3.3 displays the resulting vocalization of our text by Snopi - Automatic Nikud.
As displayed, 10 vocalization mistakes are present and 2 words were not vocalized (overall
about 86% success rate). We note that the odd mistakes regarding the vocalization of some
formative letters which were present in Nakdan Text are absent here, yet other mistakes
appear, which may result from an inferior dictionary or from the absence of simple
inflectional rules (for participant suffixes, for example).
Figure 3.3: The little prince vocalization according to Snopi - Automatic Nikud
* Vocalization mistakes are in red and non-vocalized words are in blue.
Figure 3.4 displays the resulting vocalization of our text by Nikuda. Here, 13 vocalization
errors appear (one of these is the mis-vocalization of מה as seen in Nakdan Text), and 10
words were left non-vocalized (overall about 74% success rate). Again, the odd mistakes
made by Nakdan Text are absent, yet as expected in a system which is completely dictionary-
based, the number of unknowns is dramatically increased, and, therefore, more words are
left non-vocalized. We notice all unknown words include formative letters or participant
suffixes, again as one should expect.
Figure 3.4: The little prince vocalization according to Nikuda
* Vocalization mistakes are in red, non-words are in orange and non-vocalized words are in
blue.
3.2 Commercial/Non-academic Attempts on Syllable Segmentation
The only system we are aware of is a syllable segmentation component in the TTS system
for Hebrew text by the Melingo company. The TTS system is called Kolan (קולן) [21, 26],
and according to the Melingo web site [26], the syllable segmentation component in Kolan
achieves 98% accuracy on average. Yet, it is unclear whether this measure relates to word
accuracy or syllable accuracy. No additional descriptive information regarding the methods
and datasets used by Melingo was available.
3.3 Academic Attempts on Vowel Restoration
In 2001, Kontorovich [23] attempted to use HMMs for Hebrew text vocalization. Kontorovich
used parts of the Bible (Westminster Hebrew Morphological Database 2001) for learning
and testing, such that 90% of the data was used for training and the remaining 10% for
testing. Kontorovich conducted three experiments. The first included the gathering of a
context-free list of vocalized words along with their frequency in the training set; a word in
the test set was then vocalized according to the most frequent word (with the same letters)
in the list. 77% of the testing words were vocalized correctly. The second experiment included
a list of frequencies for the vocalized words along with their corresponding part of speech
tag; vocalization is then assigned to a word in the test set according to the highest frequency
of a word in the list with the same letters and a similar POS tag. Yet, the POS tags for the
words in the testing set were taken as given from the Westminster Database, which is clearly
not applicable for general, non-tagged texts. Here, 79% of the testing words were vocalized
correctly. The third experiment used an HMM with 14 hidden states (corresponding to the
14 POS tags used by Westminster); here 81% of the words were vocalized correctly.
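Kontorovich's first (context-free) experiment can be sketched as a frequency lookup table. The representation and the vowel-stripping helper below are assumptions of this illustration, not Kontorovich's code:

```python
from collections import Counter, defaultdict

def build_table(vocalized_corpus, strip_vowels):
    """Build the context-free baseline table: for every bare consonant
    string, count its vocalized realizations seen in the training set.
    `strip_vowels` maps a vocalized word to its consonants (assumed given)."""
    table = defaultdict(Counter)
    for word in vocalized_corpus:
        table[strip_vowels(word)][word] += 1
    return table

def vocalize(bare_word, table):
    """Vocalize a bare word by its most frequent vocalization, if known."""
    if bare_word in table:
        return table[bare_word].most_common(1)[0][0]
    return bare_word   # unknown word: leave it non-vocalized
```

The second and third experiments refine exactly this lookup, first by conditioning the counts on a POS tag and then by letting an HMM over 14 POS states supply that tag.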
In 2002, Gal [13] aimed to develop a robust system that enables vowel restoration for
both Hebrew and Arabic. The corpora used included the Hebrew Bible (Westminster) and
the Qur'an (publicly available from the sacred text archive), such that 90% were used for
training and the remaining 10% for testing. Like Kontorovich, Gal used a frequency-based
lookup table to set a baseline, achieving accurately vocalized words in 68% and 74% of the
cases, for Hebrew and Arabic respectively. Next, Gal used a bigram HMM which used
the previous word as context, achieving 81% and 86% accuracy for Hebrew and Arabic
respectively. Another interesting angle noted by Gal was the similarity in modern
phonology of some Hebrew diacritics. Armed with this notion, Gal clustered the Hebrew
vowels into six sound groups, which enabled an impressive improvement of (partial) vowel
restoration - 87%. Obviously, for some usages such as text to speech or reading assistance,
as mentioned in [37], restoring the sound group of a vowel is sufficient. For example, the
distinction between Kamats (◌ָ) and Patah (◌ַ) is not relevant in modern Hebrew phonology -
both signs are pronounced 'a'.
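Such vowel clustering can be sketched as a mapping from Unicode Hebrew vowel points to sound groups. The grouping below follows Table 4.4 later in this work rather than Gal's exact six groups, and the code is an illustration, not Gal's implementation:

```python
# Map Hebrew vowel points (Unicode) to modern sound groups; an illustrative
# sketch of the vowel-clustering idea, not Gal's actual code.
SOUND_GROUP = {
    "\u05B8": "A",  # Kamats (a Kamats Katan, encoded U+05C7, is "O")
    "\u05B7": "A",  # Patah
    "\u05B2": "A",  # Hataf Patah
    "\u05B6": "E",  # Segol
    "\u05B5": "E",  # Tsere
    "\u05B1": "E",  # Hataf Segol
    "\u05B4": "I",  # Hirik
    "\u05BB": "U",  # Kubuts
    "\u05B9": "O",  # Holam
    "\u05C7": "O",  # Kamats Katan
    "\u05B3": "O",  # Hataf Kamats
}

def to_sound_groups(vocalized):
    """Replace every vowel point in a vocalized string by its sound group;
    consonants and other marks pass through unchanged."""
    return "".join(SOUND_GROUP.get(ch, ch) for ch in vocalized)
```

Under this mapping, Kamats and Patah both collapse to "A", which is exactly why a sound-group-level evaluation is easier than full vowel restoration.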
In 2003, Spiegel and Volk used a neural network to address Hebrew automatic vocalization
[37]. Again, the Westminster Database was exploited, and half of the book of
Genesis was used as the corpus, with a 90%/10% training/testing split. In their system,
Spiegel and Volk restored exactly one vowel per letter; therefore, the restoration of Dagesh,
Mapik and apparently also Shin dots (see 4.1.2 for further details on these diacritics) was not
addressed. The neural network used included a hidden layer with 300 nodes and was given
non-vocalized words with a corresponding morphology as input (no additional contextual
input is used). Spiegel and Volk report 74% accuracy per letter when morphology is not
used, 85% letter accuracy with morphology and 87% accuracy per letter when vowels are
clustered, as previously achieved by Gal [13].
In other languages, some attempts were made to restore vowels. In 2005, Yarowsky
attempted to restore accents in Spanish and French [40], and [28, 32, 42] attempted to
restore vowels in Arabic. Yet, these tasks seem to be significantly simpler than the Hebrew
vocalization problem due to the sheer number of existing Hebrew vowel types and the higher
ambiguity rate [3].
Generally, it seems that the lack of publicly available, modern, vocalized data poses a
significant obstacle for automatic vocalization; both Kontorovich and Gal claim that a
substantial percentage of the mistakes made by their systems resulted from the irregularity
of the (ancient) text that was used as corpus. On the other hand, all the Hebrew-related systems
discussed ignored the problem of segmenting the input text into words, and relied on the
segmentation given by the Westminster Database. Moreover, some systems [23, 37] directly
use the manually assigned POS tags and morphology given in the Westminster Database. For
these reasons, it seems difficult to assess the true capability of these systems when handling
modern, non-annotated text.
3.4 Academic Attempts on Syllable Segmentation
In 2001, Müller [27] used a probabilistic CFG (PCFG) for automatic detection of syllable
boundaries in Dutch; the PCFG was automatically assembled according to a pronunciation
dictionary which provides syllable boundaries (to the best of our knowledge such a resource
is not available for Hebrew). Müller reports 96.4% of the words were syllabified correctly.
In 2002, Finkel and Stump [12] attempted to mark the stressed syllable in Hebrew verbs
as part of their work. The method they use incorporates the heuristic developed by the
linguist Rabbi Eliyahu Behor (which is described in detail in 6.2.1). Yet, no measure of
success is provided.
In 2008, Bartlett, Kondrak and Cherry [7] used structured SVMs for automatic
syllabification of English, German and Dutch. Training data was gathered from the CELEX
annotated dataset (same as Müller). Here 85.5%, 99% and 98% of the words were syllabified
correctly for English, German and Dutch respectively.
Some attempts were also made regarding lesser-known languages such as Uyghur [33], for
which Saimaiti and Feng describe a rule-based algorithm that achieves 98.7% word accuracy.
3.5 Generation in Hebrew
In 2002, Finkel and Stump [12] used default inheritance hierarchies to model the inflectional
system of Hebrew verbs. The hierarchy was expressed in the formal language KATR,
such that given a query which includes some morphosyntactic attributes, the output is a
fully vocalized verb. Moreover, Finkel and Stump also distinguish Shva Na from Shva Nach,
and provide the stress location in their output. Yet, seemingly no accuracy checks were
conducted for any of the parts of this system, and, therefore, the quality of the output is
unclear. In addition, the writers state that only a portion of the total generation task for verbs
was actually implemented. Moreover, it seems the "out of the ordinary" types of inflections
are the ones left out, so it is difficult to assess the true advantage of using KATR.
In 2010, Dannélls and Camilleri [10] implemented a mechanism for generating verb
inflections. This mechanism was implemented for both Hebrew and Maltese, which are
both Semitic languages with some resembling characteristics. In their system, Dannélls and
Camilleri do not provide vocalization; therefore, the number of inflection patterns that must
be implemented to completely cover the Hebrew verb inflectional model drops significantly.
Again, not all types of verb inflection patterns were implemented and no accuracy
checks were conducted.
3.6 Conclusions
Overall, to the best of our knowledge, it seems neither commercial systems nor academic
attempts at automatic vocalization have tried using an extensive generation mechanism as
part of their methods. As noted by Kontorovich [23], existing systems which rely on partial
or inconsistent dictionaries may be substantially crippled. Moreover, a generative model
simplifies error pruning and increases the speed at which a fully tagged dictionary may be
assembled.
Concerning syllable segmentation, due to the lack of existing annotated data in Hebrew,
in this work we will attempt to develop an accurate rule-based system for syllabification.
Regarding verb classification into patterns (Binyanim) or paradigms, we are not aware
of any previous work.
Part II
Chapter 4
Background - Principles of
Vocalization in Hebrew
4.1 Linguistic Definitions
The following linguistic description is based in general on [1, 4, 5, 6, 11, 29, 31].
4.1.1 Hebrew Letters
The Hebrew alphabet uses 22 letters and 5 more final letters, as displayed in Table 4.1. In
addition, modern spoken and written Hebrew uses 2 more letters and one more final letter,
as displayed in Table 4.2.
4.1.2 Vocalization Signs in Hebrew
Unlike other languages, vowels in Hebrew are not displayed as independent letters, but as
vocalization signs decorating the letters. Hebrew vocalization signs may be used to define
several attributes for a given letter:
The Letter Function as a Consonant
Consonants are clustered into types that indicate the part of the articulatory system by which the
consonant is pronounced, as displayed in Table 4.3 [29]. The five consonants that originate
from the throat are called the guttural consonants (האותיות הגרוניות), and their influence over
vocalization is substantial, unlike that of most other consonants.
Table 4.1: Hebrew letters

Name    Writing  Final  Spoken sound            Additional spoken sound
Alef    א        -      A                       -
Bet     ב        -      B                       V
Gimel   ג        -      G                       -
Dalet   ד        -      D                       -
Hey     ה        -      H                       -
Vav     ו        -      W                       V
Zain    ז        -      Z                       -
Het     ח        -      No parallel in English  -
Tet     ט        -      T                       -
Yod     י        -      Y                       -
Kaf     כ        ך      K                       No parallel in English
Lamed   ל        -      L                       -
Mem     מ        ם      M                       -
Nun     נ        ן      N                       -
Samech  ס        -      S                       -
Ayin    ע        -      No parallel in English  -
Peh     פ        ף      P                       F
Tsadi   צ        ץ      TS                      -
Kuf     ק        -      K                       -
Reish   ר        -      R                       -
Shin    ש        -      S                       SH
Tav     ת        -      T                       -
Table 4.2: Additional letters

Name   Writing  Final  Spoken sound  Additional spoken sound
Jimel  ג'       -      J             -
Chadi  צ'       ץ'     CH            -
A letter functioning as a consonant will either be vocalized with a Shva (◌ְ), or, if it is
the last letter of the word, it will not be vocalized. The letters Kaf and Tav are exceptions
- a consonant Kaf that is the last letter in the word will be vocalized with a Shva, and a
consonant Tav that is the last letter in the word will be vocalized with a Shva if it belongs
to a 2nd person, past, feminine, singular verb.
There exist two types of Shva in Hebrew, Shva Nach (שווא נח) and Shva Na (שווא נע).
The distinction between the two types of Shva is necessary for syllable segmentation and
for Dagesh Kal (דגש קל) positioning.
Table 4.3: The types of consonants

Origin  Letters
Throat  א, ה, ח, ע, ר
Palate  ג, י, כ, ק
Tongue  ד, ט, ל, נ, ת
Teeth   ז, ס, צ, שׂ, שׁ
Lips    ב, ו, מ, פ
A Shva may be identified as Shva Na (שווא נע) or Shva Nach (שווא נח) according to the
following rules:
• A Shva is a Shva Na if it vocalizes the first letter in the word.
• A Shva is a Shva Nach if it vocalizes the last letter in the word.
• Two consecutive Shva instances at the end of the word are both Shva Nach.
• Any other type of Shva may be identified as Na or Nach by its presence in the origin
form of the word (the absolute state for a noun, and the inflection with the same tense
in its singular, masculine, 3rd person form, for a verb):
  - If the Shva is present in the origin form, it is a Shva Nach.
  - If the Shva vocalizes the letter that was last in the origin form, and this letter was
    not vocalized or was vocalized by a Patah Ganuv, it is a Shva Nach.
  - Otherwise, it is a Shva Na.
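The rules above can be sketched in code. The sketch represents a vocalized word as a list of (letter, vowel mark) pairs and assumes the word aligns with its origin form letter by letter from the start, which is a simplification of this illustration rather than a claim about the thesis' implementation:

```python
SHVA = "\u05B0"    # Unicode Hebrew point Sheva
PATAH = "\u05B7"   # the mark used by a Patah Ganuv

def classify_shva(word, origin, i):
    """Classify the Shva on letter i of `word` as "na" or "nach".

    `word` and `origin` are lists of (letter, vowel_mark) pairs; `origin` is
    the origin form described above (absolute state for a noun, the same-tense
    singular/masculine/3rd-person inflection for a verb). Index-based
    alignment to the origin form is a simplifying assumption of this sketch.
    """
    assert word[i][1] == SHVA
    if i == 0:
        return "na"                      # first letter of the word
    if i == len(word) - 1:
        return "nach"                    # last letter of the word
    if i == len(word) - 2 and word[i + 1][1] == SHVA:
        return "nach"                    # two Shva at the end: both Nach
    if i < len(origin):
        if origin[i][1] == SHVA:
            return "nach"                # Shva present in the origin form
        if i == len(origin) - 1 and origin[i][1] in ("", PATAH):
            return "nach"                # letter was last in the origin form
    return "na"
```

On the example that follows, the first Shva of נְבַלְבֵּל comes out Na and the second Nach, matching the rules.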
For example, the word נְבַלְבֵּל (Nevalbel), which means "(We) will confuse", includes two
appearances of Shva; the first is a Shva Na (since it is positioned on the first letter of the
word), and the second is a Shva Nach (due to its presence in יְבַלְבֵּל - the singular, masculine,
3rd person, future inflection).
In most cases a letter vocalized by a Shva is pronounced as a pure consonant, yet in a few
cases a letter vocalized with a Shva is pronounced with an "E" vowel. This phenomenon is
relatively common for Shva Na; for example, in נְבַלְבֵּל (Nevalbel) the first Shva is pronounced
as "E". It is rather rare for Shva Nach; e.g., in מָעַדְתְּ (Maadet), which means "(you, feminine)
lost balance", the first Shva is a Shva Nach and is pronounced as "E".
The Letter Function as a Certain Vowel
Hebrew uses 9 vocalization marks to describe vowels, and 3 more for semi-vowels (Hataf
Kamats, Hataf Patah and Hataf Segol). In the past, each vocalization sign corresponded to a
unique vowel sound; modern Hebrew, on the other hand, uses the same sound for multiple
vocalization marks, as shown in Table 4.4.
Semi-vowels (Hataf Patah, Hataf Kamats and Hataf Segol) are basically vowels that
are pronounced very similarly to their corresponding vowel (Hataf Patah to Patah, Hataf
Kamats to Kamats Katan and Hataf Segol to Segol), only in a shorter manner. In modern,
spoken Hebrew, the distinction between vowels and semi-vowels is mild, and so vowels from
the same sound group (see Table 4.4) are pronounced in the same manner.
There exist two unique vowels that cannot be applied to just any letter: the Shuruk and
the Holam Male can only be applied to Vav (וּ and וֹ).
Another exception is the case of Patah Ganuv. A Patah Ganuv is a Patah that vocalizes
a Hey, Het or Ain that is the last letter in the word. Such a Patah is pronounced
differently: unlike all the other vowels, its "A" sound is pronounced before the sound of
the letter it vocalizes.
Table 4.4: Vocalization vowels and semi-vowels

Sound group  Vocalization sign  Name
A            ◌ָ                  Kamats
             ◌ַ                  Patah
             ◌ֲ                  Hataf Patah (semi-vowel)
E            ◌ֶ                  Segol
             ◌ֵ                  Tsere
             ◌ֱ                  Hataf Segol (semi-vowel)
I            ◌ִ                  Hirik
U            ◌ֻ                  Kubuts
             וּ                  Shuruk
O            ◌ֹ                  Holam
             וֹ                  Holam Male
             ◌ׇ                  Kamats Katan
             ◌ֳ                  Hataf Kamats (semi-vowel)
Differing the Pronunciation of the Letter
Several vocalization signs change the way certain letters are pronounced.
• Dagesh: Basically, the Dagesh (◌ּ) is used for emphasizing letters, yet modern Hebrew
has dropped this emphasis for most of the letters. Today, the Dagesh is noticeable only when
it vocalizes 3 letters - Bet, Kaf and Peh. Table 4.5 displays the influence of the Dagesh
on the pronunciation of these letters. Several letters cannot take a Dagesh:
Alef (א), Hey (ה), Het (ח), Ain (ע) and Reish (ר). A Dagesh may belong to one of
two types, Dagesh Kal (דגש קל) and Dagesh Hazak (דגש חזק).
A Dagesh Hazak is either a Dagesh that is part of the general pattern corresponding
to the word (Mishkal for nouns and Binyan for verbs), or a Dagesh resulting from some
linguistic phenomenon. For example, the noun מַתָּנָה (Matana), which means "present",
includes a Dagesh Hazak that belongs to the Mishkal of the noun, while the
Dagesh in the word נָחַתִּי (Nahatty), which means "(I) landed", is created through the
linguistic phenomenon called unification: נחת + תי -> נחתתי -> נחתי
A Dagesh Kal may appear only in a small subset of letters - Bet (ב), Gimel (ג), Dalet (ד),
Kaf (כ), Peh (פ) and Tav (ת) - and only if the letter is either the first one in the word,
or follows a Shva Nach. For example, the word תִּשְׁתַּלֵּב (Tishtalev), which means
"(you) will fit in", includes 3 occurrences of Dagesh. The first is a Dagesh Kal, since it
vocalizes the letter Tav and is positioned on the first letter of the word. The second
Dagesh is also a Dagesh Kal, since it vocalizes a Tav that follows a Shva Nach. The
third Dagesh is not a Dagesh Kal, since it vocalizes a Lamed and not one of the ב, ג,
ד, כ, פ, ת letters; therefore, it is a Dagesh Hazak.
Table 4.5: Dagesh sound manipulations

Name  With/Without Dagesh  Letter sound
Bet   בּ                    B
      ב                    V
Kaf   כּ                    K
      כ                    No parallel in English
Peh   פּ                    P
      פ                    F
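The Dagesh Kal rule lends itself to a short sketch. A word is represented as a list of (letter, vowel mark) pairs, an illustrative representation, and the Shva type of each letter is assumed to be already determined (e.g., by the rules of 4.1.2):

```python
BEGED_KEFET = set("בגדכפת")  # the only letters that may take a Dagesh Kal
SHVA = "\u05B0"

def dagesh_type(word, i, shva_nach_idx):
    """Classify a Dagesh on letter i of a vocalized word as "kal" or "hazak".

    `word` is a list of (letter, vowel_mark) pairs; `shva_nach_idx` is the
    set of indices whose Shva is a Shva Nach (assumed given). A sketch of
    the rule stated above, not an exhaustive treatment.
    """
    letter = word[i][0]
    if letter not in BEGED_KEFET:
        return "hazak"        # only Bet/Gimel/Dalet/Kaf/Peh/Tav take Dagesh Kal
    if i == 0:
        return "kal"          # first letter of the word
    if word[i - 1][1] == SHVA and (i - 1) in shva_nach_idx:
        return "kal"          # letter follows a Shva Nach
    return "hazak"
```

On תִּשְׁתַּלֵּב this yields Kal for the two Tav occurrences and Hazak for the Lamed, as in the example above.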
• Mapik: The Mapik (הּ) sign indicates a consonant Hey at the end of the word. The
Mapik emphasizes the pronunciation of the Hey it vocalizes, yet its effect in modern
Hebrew is rather mild. In many cases, a Mapik denotes a female possessor.
• Shin dots: In vocalized text the letter Shin will always be accompanied by a right
or a left Shin dot (שׁ or שׂ). The Shin dot indicates whether the letter should be
pronounced as SH (שׁ) or as S (שׂ).
4.1.3 Syllables
Each Hebrew word is composed of one or more sequences of letters called syllables. A syllable
is a phonological entity that is pronounced in one effort. Each syllable includes exactly one
vowel, and may contain one or more consonants.
There exist two conventions for segmenting a word into syllables [11]: one regards letters
vocalized by Shva Na, Hataf Patah, Hataf Kamats and Hataf Segol as vowels, while the
second regards these as consonants. For example, let us segment the word רוֹאֲיַנְתִּי (Roayanti),
which means "(I) was interviewed", into syllables. By the first convention, the Hataf Patah (◌ֲ)
is regarded as a vowel, and, therefore, the word is segmented as רו-א-ינ-תי (Ro-a-yan-ti).
The second convention, though, treats ◌ֲ the same way as a Shva, and, therefore, the resulting
segmentation is רו-אינ-תי (Ro-ayan-ti).
Syllables are divided into two types, open syllables and closed syllables. An open syllable
is a syllable ending with a vowel, and a closed syllable ends with a consonant. For example,
in רו-אינ-תי, the syllable אינ is considered closed since it ends with a Shva, while תי and רו
are considered open because they end with a vowel. Note that the syllable תי is regarded
as ending with a vowel although it ends with a non-vocalized letter; this occurs because
there exist four cases where a vowel is created not by a single letter and its associated vowel
mark, but together with a following letter. These cases are ◌ִי (Hirik Male), ◌ֵי (Tsere Male),
וֹ (Holam Male) and וּ (Shuruk). In two of these cases (◌ִי and ◌ֵי) the syllable ends with a
non-vocalized letter, and yet the syllable is regarded as open due to the phonological role
of these forms - as vowels.
Generally, the segmentation of a word is performed according to the way the word is
pronounced; each exhalation effort corresponds to a syllable. Another way to segment a word
into syllables is by using vowels (as mentioned, each syllable includes exactly one vowel)
and consonant markings (a Shva Na denotes the beginning of a syllable, and a Shva Nach
indicates a syllable ending) as indicators.
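The marking-based segmentation just described can be sketched as follows. The (letter, vowel mark) representation is an illustrative assumption, the Shva classification is taken as given, and Male vowels and semi-vowels receive no special treatment in this sketch:

```python
SHVA = "\u05B0"

def syllabify(word, shva_na_idx):
    """Segment a vocalized word into syllables.

    `word` is a list of (letter, vowel_mark) pairs; `shva_na_idx` is the set
    of indices whose Shva is a Shva Na (assumed already classified). Each
    syllable holds exactly one full vowel; a Shva Na opens a new syllable,
    while a Shva Nach closes the current one.
    """
    syllables, cur, has_vowel = [], [], False
    for i, (letter, mark) in enumerate(word):
        is_full_vowel = mark not in ("", SHVA)
        # start a new syllable once the current one has its vowel and the
        # next letter carries a full vowel or opens a syllable via Shva Na
        opens_new = has_vowel and (is_full_vowel or
                                   (mark == SHVA and i in shva_na_idx))
        if opens_new:
            syllables.append("".join(cur))
            cur, has_vowel = [], False
        cur.append(letter + mark)
        if is_full_vowel:
            has_vowel = True
    syllables.append("".join(cur))
    return syllables
```

On נְבַלְבֵּל (with the first Shva classified as Na), the sketch yields the two expected syllables, the first closed by its Shva Nach.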
4.1.4 Stress
Generally, a Hebrew word is stressed according to one of two stress schemes, Milel (מלעיל)
or Milra (מלרע).
The Milel stress scheme denotes that the stress is located on the syllable preceding the
last syllable, and Milra denotes that the stress is located on the last syllable.
For example, the word תִּינֹקֶת (Tinoket), which means "baby girl", is pronounced by the
Milel stress scheme, and, therefore, is pronounced תי-נ-קת (Ti-no-ket), where the underline
denotes the stress position. On the other hand, מוֹרֶה (Moreh), which means "teacher", is
stressed according to Milra, and, therefore, is pronounced מו-רה (Mo-reh).
The vast majority of words in Hebrew are pronounced with the Milra stress [29], and a few
words in modern Hebrew are not stressed by either Milel or Milra (meaning the stress is
positioned on an earlier syllable). Stress may sometimes be inferred by following the inversion
of the syllables and vowels rule, presented in 4.1.6.
4.1.5 Deficient Spelling vs. Plene Spelling
The Hebrew spelling scheme allows, in many cases, more than one valid way to write a word.
The letters Vav (ו) and Yod (י) may, in some cases, be omitted from the word's spelling.
For example, the word אירפא (Erape), which means "(I) will be healed", may also be written
in the following (deficient) way: ארפא.
Actually, in earlier times only the deficient spelling was regarded as valid, but due to
the difficulty of reading a non-vocalized, deficiently spelled word, the Plene spelling evolved
around the Middle Ages. Today, Plene writing dominates common written Hebrew, yet many
texts use a mixture of Plene and deficient spelling.
4.1.6 Vocalization Rules
From a linguistic viewpoint, determining the way a given word should be vocalized relies heavily
on the general sound of the word, the location of the stress and the segmentation of the
word into syllables. Surprisingly, even when these are known, determining the vocalization
is still regarded as a difficult task. This results mainly from two issues:
• Each vowel sound in Hebrew corresponds to more than one vocalization sign.
• Some vocalization signs gradually changed throughout history, such that nowadays
they do not invoke any change in the pronunciation of the word. Obviously, a correct
and complete vocalization does include these signs.
Table 4.6: Long vs. short vowels

Sound group  Long vowel        Short vowel
A            ◌ָ (Kamats)        ◌ַ (Patah)
E            ◌ֵי, ◌ֵ (Tsere)     ◌ֶ (Segol)
I            ◌ִי (Hirik Male)   ◌ִ (Hirik)
U            וּ (Shuruk)        ◌ֻ (Kubuts)
O            וֹ (Holam Male)    ◌ׇ (Kamats Katan)
The syllables and vowels rule [11]: If pronunciation, stress and the syllable segmentation
are known, a guideline for general vocalization called "the syllables and vowels rule"
(כלל ההברות והתנועות) can be applied (see Algorithm 1). The syllables and vowels rule
determines the type of vowel (a short vowel or a long vowel, as shown in Table 4.6) a given
syllable will get. Since the pronunciation of the syllable is known, the correct vocalization
for the syllable can be selected based on the sound group.
Algorithm 1 The syllables and vowels rule (כלל ההברות והתנועות)
Require: A stressed/non-stressed syllable (s)
  if s is a non-stressed syllable then
    if s is an open syllable then
      return Vocalize s with a long vowel (according to Table 4.6)
    else
      return Vocalize s with a short vowel (according to Table 4.6)
    end if
  else
    return In most cases s should be vocalized with a long vowel (according to Table 4.6),
           yet the number of exceptions is considerable
  end if
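Algorithm 1 can be rendered directly in code. The Unicode points chosen for the long and short sign of each sound group follow Table 4.6 and are an assumption of this sketch; for a stressed syllable only the usual case is returned:

```python
# Long and short vocalization signs per sound group (after Table 4.6);
# the specific Unicode points are an assumption of this sketch.
LONG = {"A": "\u05B8", "E": "\u05B5", "I": "\u05B4\u05D9",
        "U": "\u05D5\u05BC", "O": "\u05D5\u05B9"}
SHORT = {"A": "\u05B7", "E": "\u05B6", "I": "\u05B4",
         "U": "\u05BB", "O": "\u05C7"}

def syllables_and_vowels_rule(sound, stressed, open_syllable):
    """Pick a vowel sign for a syllable by the syllables-and-vowels rule.

    `sound` is the syllable's sound group ("A"/"E"/"I"/"U"/"O"). A stressed
    syllable usually takes a long vowel, but the exceptions are considerable;
    this sketch returns only the usual case.
    """
    if not stressed:
        return LONG[sound] if open_syllable else SHORT[sound]
    return LONG[sound]  # usual case for a stressed syllable
```

For עַכְבָּר (Achbar, Table 4.7), the rule picks a short Patah for the closed, non-stressed עכ and a long Kamats for the stressed בר, matching the listed vocalization.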
Despite the above, the syllables and vowels rule serves only as a general guideline for
vocalizing Hebrew words, and does not give a complete solution for vocalization, for
several reasons:
• Hebrew includes a vast number of special or unique cases in which the syllables and
vowels rule cannot be applied and specially customized rules must be used; therefore,
vocalization requires considerable acquaintance with such exceptional cases.
• The rule assumes that whoever uses it knows, for each syllable, whether it is an open
or a closed syllable. This may obviously prove difficult, given a non-vocalized
word.
• The syllables and vowels rule does not determine when semi-vowels should be used.
In Table 4.7 the syllables and vowels rule is applied to several words as an example.
Table 4.7: Application of the syllables and vowels rule

Word    Pronunciation  Meaning  Syllables        Stressed syllable  Vocalization
עכבר    Achbar         Mouse    עכ-בר / Ach-bar  בר / bar           עַכְבָּר
נהר     Nahar          River    נ-הר / Na-har    הר / har           נָהָר
ספר     Sefer          Book     ס-פר / Se-fer    ס / Se             סֵפֶר
לילה*   Laila          Night    לי-לה / Lai-la   לי / Lai           לַיְלָה
דלת*    Delet          Door     ד-לת / De-let    ד / De             דֶּלֶת

* Denotes an exception to the syllables and vowels rule.
4.2 The Case of Verbs
A Hebrew verb is composed of a stem under some inflection. The stem includes 3 letters
in most cases, but 4- and 5-letter stems also exist. A stem may be inflected into an actual
Hebrew verb by adding/removing/replacing letters, and by setting vocalization signs; the
resulting inflection corresponds to a set of morphological attributes:
• Tense: Past / Beinoni (Participle) / Present / Future / Imperative.
• Gender: Masculine / Feminine / Both.
• Person: First / Second / Third.
• Number: Singular / Plural.
The base form of a verb is regarded as the inflection corresponding to the morphology
past, masculine, 3rd person, singular. The formation of a verb inflection is usually perceived
as a transformation of the base form, or of the stem.
Each set of morphological attributes derives a "typical" way of inflection, meaning in
many cases a given morphology (and a pattern - בנין) derives the addition/removal of certain
letters, and a specific pattern of vocalization signs. Yet, since Hebrew includes many out-of-
the-ordinary verbs, vocalization of inflected verbs remains a difficult task.
Since the Hebrew verb inflections include added/removed letters and vocalization signs
that indicate the inflection's morphology, the number of possible inflections may be
extremely high. Moreover, the fact that many inflections may be written in either Plene or
deficient spelling, and that some inflections may be vocalized in multiple ways, causes some
verbs to have an exceptional number of inflections, up to several hundred.
4.2.1 Patterns - Binyanim
The Hebrew verb inflectional system is based on patterns (Binyanim - !Mבניני); each pattern
(Binyan) corresponds to a set of templates in which the verb may be inflected. There exist 7
patterns - פעל! (Paal), נ¢פעל! (Nifal), פעל! (Piel), פuעל! (Pual), התפעל! (Hitpael), הפעיל! (Hifil)
and הuפעל! (Hufal) [6, 31].
The letters פ|!,ע!,ל! that are present in all the patterns' names represent the stem letters
that are cast into the pattern to form a valid inflection of the stem. The פ|!,ע!,ל! letters are
called הפועל! פ|!' (Peh Hapoal - the Peh of the verb), הפועל! ע!' (Ain Hapoal - the Ain of the
verb) and הפועל! ל!' (Lamed Hapoal - the Lamed of the verb) respectively. Stems with more
than 3 letters are regarded as having more than one Ain Hapoal (in such cases they are
regarded as 1 ,ע! 2 ע! etc.).
As displayed in Figure 4.1, four of the patterns are regarded as light patterns - Paal,
Nifal, Hifil and Hufal ( !Mהקלי Mהבנייני), and the other three are regarded as heavy - Piel, Pual,
Hitpael ( !Mהכבדי Mהבנייני). The light patterns include a Dagesh Hazak in their vocalization
templates only in very specific circumstances, while the heavy patterns include a Dagesh
Hazak at the Ain of the verb in the majority of cases.
Figure 4.1: Hebrew patterns - the verb (פועל!) divides into the light patterns ( !Mהקלי Mהבניני):
Paal (פעל!), Nifal (נפעל!), Hifil (הפעיל!) and Hufal (הופעל!), and the heavy patterns
(!Mהכבדי Mהבניני): Piel (פיעל!), Pual (פועל!) and Hitpael (התפעל!).
Some stems can be inflected according to several patterns; for example, the stem ש!מ|!"נ|!
can be inflected by Paal to produce a verb that means "(He) got fat", by Piel to produce
a verb that means "(He) lubricated", by Pual to produce a verb that means "(He) was
lubricated" (passive) and by Hifil to produce a verb that means "(He) was getting fat"
(progressive).
Given a stem and a pattern, we expect to know the scheme for creating an actual verb, but
in practice the task of generating the fully vocalized inflections (each inflection corresponds
to a possible morphology) is far from easy. Each pattern is associated with dozens
of inflection tables that describe how each type of stem should be inflected. For example,
the Paal pattern includes over 50 distinctive inflection tables (according to [31]), each of which
describes the inflection pattern for a certain type of stem.
4.2.2 In�ection Tables
Even when given a pattern and a complete set of morphological attributes, the exact letters
added/removed and the vocalization by which the inflection is formed remain ambiguous.
This results from the fact that the template by which the word is inflected is influenced
by the letters of the stem. Each set of letters, forming a stem, corresponds to an inflection
table that describes the manner in which the given stem should be vocalized (see [6, 31]). So
in practice, each pattern is associated with multiple inflection tables. The inflection tables
divide into paradigms of stems; each paradigm corresponds to a family of stems, each family
with specific attributes. In practice, each paradigm is divided into sub-paradigms (that
correspond to specific inflection tables), but here we present only the top level paradigms.
The verb paradigms ( הפועל! (גזרות
• Complete paradigms !Mהשלמי :גזרות Paradigms in which all the letters of the stem
are maintained in every inflection of the verb. For example, every inflection of the
stem גד!"ל! according to the Paal pattern includes all the stem letters.
• Crippled paradigms נחות! :גזרות Paradigms in which some inflections include letters
(from the stem) that are written but are not pronounced, or inflections which replace
a letter from the stem by another letter. For example, the inflection of the stem "ל! אכ|!
by the Paal pattern to the morphology - future, 1st person, plural (which means "(we)
will eat") is נ¸אכל! (Nochal), where the Alef is not pronounced [31].
• Defective paradigms חסרות! :גזרות Paradigms in which some inflections are missing
a letter that is present in the stem. For example, the inflection of the stem נס!"ע! by
the Paal pattern to the morphology - imperative, masculine, 2nd person, singular is
סע! (which means "drive"). Obviously, it is missing the first letter of the stem.
• Double paradigms !Mהכפולי :גזרות Paradigms in which Ain Hapoal (the second
letter in the stem) and Lamed Hapoal (the third letter in the stem) are identical. For
example, the stem סב!"ב! has a Bet both as its Ain Hapoal and as its Lamed Hapoal.
• Compound paradigms מורכבות! :גזרות Paradigms which include stems that comply
with more than one of the other paradigm types. For example, the stem אפ|!"ה! is regarded
as both crippled and defective [31].
Table 4.8 displays the inflection table of the stem שפ|!"כ|! (a morpheme that means "to
spill"), which corresponds to the complete paradigm in the Paal pattern. The table describes
all the inflected verbs and their proper morphology. Table 4.9 displays the inflection table
of the stem גד!"ל! (also in the Paal pattern and the complete paradigm).
Clearly, some inflections with identical morphological attributes differ between the tables
in more than just the letters of the stem. For example, Table 4.9 does not include any Beinoni
inflections, as opposed to Table 4.8. Moreover, the present, future and imperative templates
of inflection are also different.
Table 4.8: A complete paradigm from the Paal pattern - inflections of שפ|!"כ|!
Number: Singular | Plural; Person: 1, 2, 3; Gender: M, F.
[Rows - Past, Present, Beinoni, Future and Imperative: the fully vocalized Hebrew inflections for each number/person/gender combination.]
Table 4.9: A complete paradigm from the Paal pattern - inflections of גד!"ל!
Number: Singular | Plural; Person: 1, 2, 3; Gender: M, F.
[Rows - Past, Present, Future and Imperative: the fully vocalized Hebrew inflections for each number/person/gender combination.]
4.3 Conclusions
The Hebrew vocalization scheme includes a vast network of rules that are based on knowing the
pronunciation of the word and the word's syllable segmentation. If these are known, then
the syllables and vowels rule may be applied for vocalization, but since Hebrew includes a
great deal of unique and special cases, the syllables and vowels rule only serves as a basic
heuristic for vocalization. No shortcuts are at hand; a correct and complete vocalization
requires a rare expertise that is held today only by linguists and experts.
Inflection tables describe the way a stem or a base form of a verb (see 4.2) can be
inflected. Selecting an appropriate inflection table for a stem/base form depends both on
the pattern and on the letters forming the stem.
Chapter 5
Datasets
5.1 Verbs List
As part of this work, we used a manually constructed list including over 4k base forms of
verbs (deficiently spelled verbs in their past, masculine, singular, 3rd person form). Verbs
in the list are non-vocalized, but do include Shin dots, and each carries a corresponding
inflection table indicator (see appendix A).
5.2 Morphologically Analysed Corpora
Using the morphological tagger [3], we obtained a list including approximately 50 million
Hebrew words that are fully morphologically disambiguated. The corpora include materials
from the "Haaretz" newspaper, the "Tapuz" website, the "TheMarker" newspaper, the
"Kneset" (the Israeli legislature) discussions documentation and more.
Part III
Chapter 6
Method
6.1 Generation
The first task we confronted is the task of generating fully vocalized Hebrew verbs along
with their corresponding morphology.
In 2002, the first release of Hspell [17] came to light. Hspell aims to implement a free Hebrew
spell checking system. Doing so required the Hspell developers to devise a way to obtain a
significant quantity of perfectly spelled words. Obtaining such a word list turned out to be
a complicated task, since existing datasets are not comprehensive enough, and automatically
gathering words from existing texts may also collect mistakes. For this reason, the Hspell
developers decided to develop a generation mechanism that inflects base forms into words.
The Hspell generation mechanism creates inflected, non-vocalized nouns, verbs and ad-
jectives, with their corresponding morphological properties. Yet all the generated inflections
are non-vocalized and include only deficiently spelled words (as defined by the Academy of
the Hebrew Language [1]). As mentioned in 4.1.5, Plene spelling is widespread in modern
written Hebrew, making Hspell's convention rather harsh.
In [6, 31] the Hebrew verb inflectional model is presented via representative inflec-
tion tables. For example, [31] describes 264 inflection tables corresponding to the different
patterns and paradigms. Despite this, Hspell's decision to limit its scope to non-
vocalized, deficiently spelled text narrowed down the needed manual labour significantly,
since only a limited subset of the 264 inflection tables had to be implemented. Applying this
approach, Hspell developed a vast reserve of inflected Hebrew verbs with a minimal error rate
and a corresponding morphological attribute set for each word. On the other hand, Hspell's
harsh (deficient spelling) convention and the lack of vocalization signs call for a less strict
generation system that also produces the appropriate vocalization signs.
We decided to take Hspell's path, and implement a comprehensive system for genera-
tion of vocalized Hebrew verbs with corresponding morphological attributes. As opposed
to Hspell, our system also generates Plene spelling and alternative valid forms of writ-
ing/vocalizing (in addition to the deficient writing). For example, Table 6.1 shows all the
valid ways to spell and vocalize the word מועמד! (Moamad), which means "candidate". All the
presented inflections in Table 6.1 have the same morphology.
Table 6.1: The valid forms of מועמד!
Plene | Deficient
[Four pairs of valid spellings of מועמד!, each Plene form alongside its deficient counterpart.]
In such cases, when several valid ways of writing/vocalizing a verb exist, our system gen-
erates all valid forms of the inflection. The input of our mechanism includes two parameters:
• The base form of a verb: Here we use a standard unicode representation of the non-
vocalized, deficiently spelled base form. One exception to this standard representation
is the letter Shin ,(ש!) which is the only letter in Hebrew that actually represents two
distinct letters - Shin ( (ש! and Sin .(ש!)
We treat Shin and Sin as unique, since Shin never transforms into Sin and vice versa,
unlike Bet, Kaf and Peh, which change their pronunciation depending on their relative
position in the word and their preceding vocalization ,ב|!/ב|!) כ|!/כ|! and .(פ|!/פ|! For this
reason we used ש! to represent Shin, and ' ש! to represent Sin.
• A corresponding inflection table: As part of our generation model, we imple-
mented the vast majority of existing inflection tables in Hebrew [6, 31]. 264 inflection
tables were manually implemented to inflect an input base form of a verb into all pos-
sible inflections. An average inflection table generates 60 inflections, yet some tables
correspond to exceptional base forms that may form over 300 inflections.
Some less common inflection tables were not developed, but implementing them in-
volves a minimal amount of manual labour, since such tables are, in most cases, very
similar to more common tables (which are already implemented). Since the inflec-
tion tables were implemented in Java, the built-in class inheritance may be used to
extend our model and develop new tables in a quick and simple manner.
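As a sketch of how such inheritance between inflection tables might look: a rarer table extends a common one and overrides only the inflections that differ. The class names and the toy suffix "templates" below are our own illustration, not the thesis's actual tables or real Hebrew inflection rules.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: inflection tables sharing logic via Java inheritance.
abstract class InflectionTable {
    abstract List<String> inflect(String base);
}

class CommonTable extends InflectionTable {
    @Override
    List<String> inflect(String base) {
        List<String> out = new ArrayList<>();
        out.add(base);        // the base form itself (past, masc., 3rd, singular)
        out.add(base + "ה");  // a toy suffix standing in for a real template
        return out;
    }
}

// A rarer table that differs from CommonTable in a single inflection.
class RareTable extends CommonTable {
    @Override
    List<String> inflect(String base) {
        List<String> out = super.inflect(base);
        out.set(1, base + "ת"); // override only the form that differs
        return out;
    }
}
```

Only the differing template needs re-implementation, which is why the thesis notes that adding the remaining rare tables would involve minimal manual labour.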
The vast majority of frequently used verbs may be inflected by using the appropriate
inflection table to create the fully vocalized verb, its corresponding morphology and its
spelling indicator (Plene/deficient).
The next phase of creating the automatic generation mechanism involved creating a
comprehensive list which correlates base forms to inflection tables. Over 4k non-vocalized
base forms were manually correlated to their appropriate inflection table. Some base forms
that may be inflected by more than one table were correlated to all their corresponding
inflection tables.
Applying our inflection mechanism over the comprehensive base form list generated a
staggering 240k fully vocalized inflected verbs, along with a corresponding set of morpho-
logical attributes for each verb.
In addition to the generated verb inflections, our system also generates for each base form
(given the appropriate inflection table) its proper infinitive form. The infinitive inflection is
also fully vocalized, and generated in all valid forms (Plene spelling, deficient spelling, and
alternative writing/vocalization options).
The list of annotated verbs (4,000) and their fully inflected and vocalized forms (over
240,000) is provided at <http://www.cs.bgu.ac.il/∼nlpproj/hebrew-vocalization/>. It pro-
vides a significant new resource for analysis of inflected words into their morphological
attributes (through lookup) and for generation of inflected vocalized verbal forms.
6.2 Syllable Segmentation
Our second goal was to develop a scheme for automatic segmentation of words into syllables.
As discussed in 4.1.3, there exist two conventions by which a word may be segmented into
syllables; we decided to follow the second convention in our attempts to perform syllable
segmentation (as [11] does). Therefore, we consider all letters vocalized by Shva and Hataf
as consonants.
Attempting to directly segment a non-vocalized word into syllables is difficult. This
results not only from the high ambiguity rate in Hebrew, which means a given word may
have multiple valid segmentations, but also from the fact that any Hebrew non-vocalized letter may
function either as a vowel or as a consonant. Therefore, we decided to segment fully vocalized
words, such as the words generated by the system we presented in 6.1.
Segmenting a word into syllables is based on the position of vowels and consonants in
the word. Each syllable includes a single vowel and may contain several consonants. Vowels
are easy to recognize, and obviously two vowels in a word belong to two separate
syllables. Non-vocalized letters usually correspond to their preceding syllable. Identifying
the correct syllable for a certain consonant, on the other hand, is a much more complicated
problem.
The Shva (!Ë) vocalization sign denotes a consonant, and as explained in 4.1.2 there exist
two types of Shva, Shva Na ( נע! (שווא and Shva Nach ( נח! .(שווא The two Shva types may
be used to recognize the beginning and the ending of a syllable: according to [11], a letter
vocalized by a Shva Na is the first letter in a syllable, and a letter vocalized by a Shva Nach
is the last (this rule has one exception: in case two consecutive Shva instances are at the
end of the word, they are both Shva Nach, and they both belong to the last syllable in the
word). So, if we develop a scheme for identifying the Shva types of a given word, the task of
segmentation into syllables will be at hand.
6.2.1 Following Behor's Footsteps - Heuristic Approach
During the 16th century, the linguist Rabbi Eliyahu Behor devised a simple heuristic for
determining the type of Shva in Hebrew words. The heuristic presented by Behor includes
5 simple rules by which one can classify a Shva as being Shva Na or Shva Nach. Behor's
Shva classification scheme is presented in Algorithm 2.
Algorithm 2 The Behor scheme for Shva classification
Require: A vocalized word
if The Shva is at the beginning of the word then
    return The Shva is a Shva Na
else if The Shva is second among two consecutive Shva instances (which are not at the end of the word) then
    return The Shva is a Shva Na
else if The Shva follows a long non-stressed vowel (according to Table 4.6) then
    return The Shva is a Shva Na
else if The Shva vocalizes a letter with Dagesh Hazak then
    return The Shva is a Shva Na
else if The Shva vocalizes the first letter among two identical letters then
    return The Shva is a Shva Na
end if
Directly implementing Behor's heuristic as a method to classify Shva appearances is
somewhat problematic. The first two conditions are easy to validate (and are always cor-
rect), but the third condition requires the position of the stress, the fourth condition relies
on identifying the type of Dagesh in a given letter, and the fifth rule has many exceptions.
Therefore, we decided to use a similar heuristic (Algorithm 3), which relies only on
information that is explicitly included in the vocalized word.
In addition to the rules suggested by Behor, we use a known rule which has no exceptions.
This rule determines that the first Shva among two consecutive Shva instances is a Shva
Nach. As can be seen in Algorithm 3, in case none of the rules applies, we determine that
the Shva is a Shva Nach.
Algorithm 3 Our heuristic for Shva classification
Require: A vocalized word
if The Shva is at the beginning of the word then
    return The Shva is a Shva Na
else if The Shva is first among two consecutive Shva instances then
    return The Shva is a Shva Nach
else if The Shva is second among two consecutive Shva instances (which are not at the end of the word) then
    return The Shva is a Shva Na
else if The Shva follows a long vowel (according to Table 4.6) then
    return The Shva is a Shva Na
else
    return The Shva is a Shva Nach
end if
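Algorithm 3 can be transcribed directly as a rule cascade. In this sketch (our own naming, not the thesis code) the caller supplies the positional facts about a given Shva instance, and the method returns true for Shva Na and false for Shva Nach.

```java
// A direct transcription of Algorithm 3's rule order.
public class ShvaClassifier {
    static boolean isShvaNa(boolean atWordStart,
                            boolean firstOfTwoConsecutive,
                            boolean secondOfTwoNotAtWordEnd,
                            boolean followsLongVowel) {
        if (atWordStart) return true;             // Behor's first rule
        if (firstOfTwoConsecutive) return false;  // the added, exception-free rule
        if (secondOfTwoNotAtWordEnd) return true; // second of two consecutive Shvas
        if (followsLongVowel) return true;        // long vowel per Table 4.6
        return false;                             // default: Shva Nach
    }
}
```

Note that the rules are order-sensitive: the exception-free "first of two consecutive Shvas" rule must be checked before the long-vowel rule, exactly as in Algorithm 3.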
6.2.2 Shva Classi�cation According to the Base Form
The partial successes achieved by applying the syllable segmentation based on 6.2.1 moti-
vated us to use more than just the vocalized word for classifying Shva instances.
We decided to use the origin form of the word - the masculine, singular, 3rd person, deficiently
spelled inflection in the same tense (we refer to this as the base tense form) for verbs, and the
singular, absolute state ( הנפרד! (צורת morphology for nouns - as an indicator of the type of
the Shva.
As explained in [11], a Shva in a given word may be identified as a Shva Na or a Shva
Nach by its presence in the origin form of the word. If the Shva is present in the origin form,
it is a Shva Nach; otherwise, it is a Shva Na. Although this classification scheme seems
simple, matching Shva instances in practice poses a challenge. A given word may contain
more than one Shva, and in order to classify each one correctly, we need to match each Shva in the
word to its corresponding Shva instance in the origin form (if such a corresponding Shva
actually exists).
For example, the word תז�ד¯קקי! (Tizdakekiy), which means "you (female) will need", includes
two Shva instances, and its corresponding base tense form is י¢ז�ד¯קק! (Yizdakek), which means
"(he) will need". We note that the base tense form also contains two Shva instances: one is
visually obvious (at the ,(ז! and the other is not (at the second (ק! since it is the last letter in
the word. Therefore, the first Shva in תז�ד¯קקי! is a Shva Nach (since it is present in the base
tense form), and the second Shva is a Shva Na (since it is not present in the corresponding
ק! in the base tense form).
To make things even more complicated, as mentioned, the Hebrew inflectional system for
verbs may add, remove or replace letters and vocalization signs; therefore, matching Shva
instances is far from a trivial task.
We decided to cope with the problem of matching Shva instances by using a string
matching algorithm that is based on an edit distance metric. The idea is to transform a given
string into another with the minimal needed change, by applying copy, addition, removal
and replacement operations over characters in the string. What counts as a minimal
change is expressed by a weight (or a price) for each type of operation; the algorithm
attempts to minimize the total cost of the transformation. Once the edit distance metric is
defined, using dynamic programming and backtracking to align the strings is easy. Given
such an alignment, identifying the Shva type according to its presence/absence in the base
tense form is immediate. Algorithms 4, 5 and 6 describe the scheme for calculating a matrix
that contains the edit distance of every two sub-strings of the input words (M[len1][len2]
holds the edit distance between s1 and s2).
Algorithm 4 Edit distance matrix calculation
Require: Strings <s1, s2>, Integers <wInsert, wDelete, wCopy, wReplace>
Ensure: M[i][j] = The edit distance between s1.substring(i) and s2.substring(j)
len1 ← s1.length()
len2 ← s2.length()
M ← initiateEditDistanceMatrix(len1, len2, wInsert, wDelete)
for i ← 1 to len1 do
    for j ← 1 to len2 do
        M[i][j] ← Min( M[i−1][j−1] + δ(s1[i−1], s2[j−1], wCopy, wReplace), M[i−1][j] + wDelete, M[i][j−1] + wInsert )
    end for
end for
return M
Algorithm 5 Initiation of the edit distance matrix
Require: Integers <string1Length, string2Length, wInsert, wDelete>
Initiate M as a matrix of dimensions string1Length × string2Length s.t. ∀i, j: M[i][j] = 0
topStringSize ← Max(string1Length, string2Length)
for i ← 1 to topStringSize do
    if i < string1Length then
        M[i][0] ← i ∗ wDelete
    end if
    if i < string2Length then
        M[0][i] ← i ∗ wInsert
    end if
end for
Yet, directly applying Algorithm 4 to align a vocalized Hebrew verb inflection and a base
tense form is far from optimal, because of the following properties of the Hebrew inflectional
system:
• Final and non-final letters: As discussed in 4.1.1, some Hebrew letters have a
corresponding final letter; therefore, the comparison of c1 and c2 in Algorithm 6 may
indicate two similar letters as different (if one is final while the other is not).
• Only a subset of letters may be removed to form an inflection: Most Hebrew
letters are never removed from the base tense form to form an inflection; the only
exceptions are .ו!,י!,נ|!
• Only a subset of letters may be added to form an inflection: The only letters
that may be added to the base tense form to form an inflection are .ו!,ה!,מ|!,א!,י!,ת!,נ|!
• Only a subset of letters may be replaced to form an inflection: The only
letters that may be replaced in the base tense form to generate an inflection are ה!,י!,נ|!
and their replacing letters may only include .ו!,ה!,מ|!,א!,י!,ת!,נ|!
Algorithm 6 The edit distance δ-function
Require: Characters <c1, c2>, Integers <wCopy, wReplace>
if c1 = c2 then
    return wCopy
else
    return wReplace
end if
Solving the first issue is easy: instead of comparing c1 and c2 in Algorithm 6 directly, we
check whether c1 and c2 represent the same Hebrew letter or whether c1 is equal to c2. Thus
equal characters (such as vocalization signs or letters) on the one hand, and final and non-final
instances of the same letter on the other, are regarded as equal.
Regarding the rest of the issues, we note that the add, remove and replace operations that
take place during the formation of an inflection may only use the letters .ו!,ה!,מ|!,א!,י!,ת!,נ|! In
order to improve our string matching results we decided to exploit this property. Therefore,
the weight we give to every possible copy operation depends on the letter being copied: if
one of the above letters is copied, the reward (to the edit distance value) is small - we regard
this as a "weak copy"; otherwise (the letter is not one of the above) the reward is significant -
we regard this as a "strong copy". Vocalization signs are treated as weak copies, since they
all have the tendency to be added, removed and replaced while forming an inflection. In
order to distinguish weak copies from strong copies we rewrote the δ-function as shown in
Algorithm 7.
Algorithm 7 The Hebrew adapted δ-function
Require: Characters <c1, c2>, Integers <wWeakCopy, wStrongCopy, wReplace>
if c1 = c2 or sameLetter(c1, c2) then
    if strongLetter(c1) then
        return wStrongCopy
    else
        return wWeakCopy
    end if
else
    return wReplace
end if
? sameLetter(c1, c2) denotes a function that returns true if c1 and c2 represent the same letter.
? strongLetter(c1) denotes a function that returns true if c1 is a Hebrew letter that does not belong to the set ,א,י,ת,נ|!} .{ו,ה,מ|
Initiating the above string matching algorithm requires us to allocate values to the
weights wInsert, wDelete, wWeakCopy, wStrongCopy and wReplace. The weights we de-
cided on are:
• wInsert = wDelete = 3: The least preferable option is for a character to be inserted/removed
from the word when it doesn't have to be; therefore, we apply the
biggest penalty to this option.
• wWeakCopy = −1: A copy of a character is good, and therefore we reward it; on
the other hand, this is a copy of a "weak" character (one among { {ו,ה,מ|,א,י,ת,נ|! or a
vocalization sign), and therefore the reward is modest.
• wStrongCopy = −50: A copy of a "strong" letter must never be missed; therefore, we
reward it generously.
• wReplace = 2: If a copy is not possible, we would rather replace than insert/delete
(if possible). This operation is widely applied for changes of vocalization.
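Putting Algorithms 4 through 7 together with these weights gives the following sketch. It is our own illustrative Java rendering, not the thesis code: the `sameLetter` handling of final/non-final letter forms is omitted for brevity, and every identical non-letter character (e.g. a vocalization sign) is treated as a weak copy, as described above.

```java
import java.util.Set;

// Weighted edit distance between a vocalized inflection and a base tense
// form, with the Hebrew-adapted delta-function (weak vs. strong copies).
public class HebrewEditDistance {
    static final int W_INSERT = 3, W_DELETE = 3, W_REPLACE = 2;
    static final int W_WEAK_COPY = -1, W_STRONG_COPY = -50;
    // The letters that may be added/removed/replaced during inflection.
    static final Set<Character> WEAK_LETTERS =
            Set.of('ו', 'ה', 'מ', 'א', 'י', 'ת', 'נ');

    // Algorithm 7 (simplified): the cost of aligning c1 with c2.
    static int delta(char c1, char c2) {
        if (c1 == c2) {
            // Non-letters (vocalization signs) and weak letters: weak copy.
            boolean strong = Character.isLetter(c1) && !WEAK_LETTERS.contains(c1);
            return strong ? W_STRONG_COPY : W_WEAK_COPY;
        }
        return W_REPLACE;
    }

    // Algorithms 4 and 5: dynamic-programming edit distance matrix;
    // m[len1][len2] holds the total alignment cost of s1 and s2.
    static int[][] distanceMatrix(String s1, String s2) {
        int len1 = s1.length(), len2 = s2.length();
        int[][] m = new int[len1 + 1][len2 + 1];
        for (int i = 1; i <= len1; i++) m[i][0] = i * W_DELETE;
        for (int j = 1; j <= len2; j++) m[0][j] = j * W_INSERT;
        for (int i = 1; i <= len1; i++)
            for (int j = 1; j <= len2; j++)
                m[i][j] = Math.min(
                        m[i - 1][j - 1] + delta(s1.charAt(i - 1), s2.charAt(j - 1)),
                        Math.min(m[i - 1][j] + W_DELETE, m[i][j - 1] + W_INSERT));
        return m;
    }
}
```

Aligning two identical three-letter strings of strong letters yields a total cost of 3 × wStrongCopy = −150, so a backtracking step (not shown) will always prefer strong copies over any insert/delete path, as intended.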
6.3 Unknown Verbs Classi�cation
Our third objective was to automatically classify Hebrew non-vocalized base forms (of
verbs) into their corresponding inflection table. By developing an automatic classifier for
base forms, it will be possible to generate a vast number of associations between base forms and
tables. This may be achieved by morphologically analyzing plain text and applying our
classification mechanism to the words identified as verbs with a morphology that corresponds
to a base form.
The classifier we developed is a Support Vector Machine that receives a non-vocalized
base form, and returns a corresponding inflection table. The SVM classifier makes use of
the following features:
• Length: The length of the base form that corresponds to each table is constant; there-
fore, eliminating inflection tables with a different corresponding base form
length is easy. This feature may reduce the number of possible tables significantly, yet
on the other hand, many tables share a similar length. For example, if we are given
a 3 letter base form, it definitely does not correspond to the Hitpael pattern (since
all base forms in the Hitpael pattern are composed of 4 letters or more [31]). So the
length feature directly eliminates the 60 inflection tables of which the Hitpael pattern
consists (and 172 inflection tables overall), out of the total 264 inflection tables.
On the other hand, the ambiguity level remains high, with 92 possible inflection tables.
• Letters position: Many of the base forms that correspond to tables have some
letters fixed in a given position. For example, base forms from the Hufal
pattern always include the letter Hey as the first letter in the base form (regardless
of the stem). Our feature extraction mechanism extracts from a given base form any
(preset) number of sequential letters and their corresponding (beginning) position.
• Guttural letters position: Many inflection tables (and their corresponding base
forms) include guttural letters (see 4.1.2) at some fixed locations (an inflection table
may be associated with any guttural and not with a specific one). Therefore, gutturals
may be used as indicators to identify/eliminate inflection tables. The difference between this
feature and the letters position feature is that here we generate the same feature for all
types of gutturals, unlike the letter position feature, which generates a unique feature for
every type of letter.
• Corpus level features: The above features narrow down the number of relevant
inflection tables significantly, but in many cases the inflection table of a given base form
may not be simply extrapolated according to the length, letters and guttural features.
For example, the base form !Mש¤ל (Shilem), which means "(he) paid", corresponds to a
certain inflection table from the Piel pattern, while the base form שµד¯ד! (Shadad), which
means "(he) robbed", corresponds to a completely different inflection table in the Paal
pattern. The two words are of the same length with no guttural letters, and the letters
that are present do not give any indication of the corresponding table (as mentioned,
the input to our classifier does not include the vocalization signs, for otherwise this
task would be much simpler). In such cases we attempt to differentiate base forms by using
corpus level features; this is done by using a comprehensive, morphologically analyzed
Hebrew corpus (see 5.2). The idea is that given a base form we first classify it into
a cluster of inflection tables (instead of directly selecting a specific table). Then we
use the data obtained from the corpora to pinpoint the correct table in the selected
cluster. In detail, the idea is as follows:
Given a non-vocalized base form (let us mark this base form by β), we use the SVM to
cluster β into groups of inflection tables (let G be the set of all such groups and let g
denote some arbitrary group in G). g includes inflection tables that cause confusion
for an SVM without corpus level features.
After β is classified to some group (g, for example), we use the corpus level features
to detect the correct inflection table in g (let i denote an arbitrary inflection table in
g). This is achieved by inflecting β by each inflection table (i, for example) in g, such
that a set of vocalized inflections with their morphology (denoted by I) is generated.
Now we simply check the total number of appearances of inflections from I (and
their corresponding morphology) in the corpus. This number of matches is used as a
feature to indicate the most likely inflection table in g. The general scheme for the
two-phased classification with corpus level features is given in Algorithm 8.
Algorithm 8, which extracts corpus level features, may be applied with either a vocalized
or a non-vocalized corpus; since we do not currently possess a vocalized resource, we
apply this method using the corpus described in 5.2.
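The length and letters-position features can be illustrated with a short sketch. The `WORD_SIZE__` and `AT_<pos>__<letters>` naming mimics the feature names shown in Table 6.2, but the exact string format is our assumption, not the thesis's feature encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative extraction of the length feature and of every sequence of
// up to maxSeqLen letters keyed by its starting position in the base form.
public class FeatureExtractor {
    static List<String> extract(String baseForm, int maxSeqLen) {
        List<String> feats = new ArrayList<>();
        feats.add("WORD_SIZE__" + baseForm.length()); // length feature
        for (int len = 1; len <= maxSeqLen; len++)
            for (int pos = 0; pos + len <= baseForm.length(); pos++)
                feats.add("AT_" + pos + "__" + baseForm.substring(pos, pos + len));
        return feats;
    }
}
```

For a 4-letter base form with sequences of up to two letters, this yields one length feature, four single-letter features and three letter-pair features, mirroring the letters-position rows of Table 6.2.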
Table 6.2 displays the features and their associated weights created for the base form
הÇuר¯ז! ("Huchraz"), which means "(was) announced". The features presented are created in
the second classification phase (after the base form was associated to a group of inflection tables which
includes the tables F-8 (Hifil), G-1 (Hufal) and G-8 (Hufal)), and, therefore, the corpus
level features are included.
Algorithm 8 Corpus level features extraction
Require: Base form <β>, Group of inflection tables g (1), Corpus <C>
corpusLevelFeatures ← ∅
totalNumberOfInflections ← 0
for all i such that i ∈ g do
    I ← i.makeInflections(β)
    feature ← makeNewFeature(i.getName())
    feature.setValue(C.countOccurences(I))
    totalNumberOfInflections ← totalNumberOfInflections + feature.getValue()
    corpusLevelFeatures ← corpusLevelFeatures ∪ {feature}
end for
return normalizeCorpusLevelFeatures(corpusLevelFeatures, totalNumberOfInflections) (2)
1. g is the group of inflection tables to which β was classified by the first step of classification (classification to groups).
2. The function normalizeCorpusLevelFeatures(features, totalInflections) normalizes the value of each feature by totalInflections (the total number of inflections generated by inflection tables in g), so the resulting value is a number (which we use as the weight of the feature) between 0 and 1, and
∑_{f ∈ features} f.getValue() = 1
This number represents the frequency in the corpus of inflections corresponding to the given verb according to a specific inflection table.
Table 6.2: Features created on the second classification phase for the base form הֻכְרַז

    Feature type                          Feature            Weight
    Length feature                        WORD_SIZE__4       1
    Letters position features*            AT_0__HEY          1
                                          AT_1__KAF          1
                                          AT_2__REISH        1
                                          AT_3__ZAIN         1
                                          AT_0__HEY&KAF      1
                                          AT_1__KAF&REISH    1
                                          AT_2__REISH&ZAIN   1
    Guttural letters position features    AT_0__GRONIT       1
                                          AT_2__GRONIT       1
    Corpus level features                 TABLE__F20         0
                                          TABLE__G1          0.5
                                          TABLE__G8          0.5
* Here the letter position features were extracted for subsequences of two letters or less. As
described in 7.3.2, we experimented with various possible lengths of such sequences.
Chapter 7
Experiments and Results
7.1 Generation Results
By using the inflection tables described in 6.1, along with the association of base forms
to inflection tables (see 5.1), we generated about 240k fully vocalized and morphologically
tagged verbs (see Appendix B). Since there is no reliable, automatic way to test the
correctness of our output, we manually tested the correctness of the inflections formed by
264 base forms and their corresponding morphologies, one base form (and all the generated
inflections associated with it) per table. The total number of inflections manually checked is
15,221.
As displayed in Table 7.1, 91 errors resulting from several causes were found. Under
the (false) assumption that base forms in the dataset are uniformly distributed among the
inflection tables, a verb from our auto-generated inflection list is a valid verb with correct
vocalization and morphology with probability 99.4%. In order to measure accuracy without
this assumption, we must test a larger number of base forms from each inflection table
(as opposed to our test, which included one base form per inflection table).
The mistake type marked as type I caused the most errors, yet all these errors were
caused by a single defective base form in the dataset. This base form was טפל! (from the Piel
pattern), which was misspelled as טיפל! (Plene spelling instead of the deficient spelling
required in the dataset). Prompted by this error, we decided to add a function for preventing
mistakes of this type (Plene instead of deficient); this function simply validates that the
number of letters in the base form equals the number of letters expected by the inflection
table. By applying this function we spotted 11 flawed base forms, which were inflected into
601 defective verb inflections. Three of the 11 mistakes resulted from mis-associated tables
(type II mistakes), and the rest originated from misspelled (Plene instead of deficient) base forms.
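The validation function described above is straightforward; a sketch, assuming each inflection table records the letter count it expects for its base forms (names and the Latin transliterations are illustrative):

```python
def validate_base_form(base_form, expected_letters):
    """Reject base forms whose letter count differs from what the
    inflection table expects; this catches Plene spellings (an extra
    vowel letter) where the table requires deficient spelling."""
    return len(base_form) == expected_letters

# A 3-letter table should reject a 4-letter (Plene) spelling of the same verb,
# as with the misspelled Piel base form discussed above.
assert validate_base_form("tpl", 3)       # deficient spelling: accepted
assert not validate_base_form("typl", 3)  # Plene spelling: flagged as flawed
```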
Table 7.1: Generation error analysis

                     Mistakes in the dataset              Mistakes in the generation scheme       Overall
    Mistake type     I                 II                 III              IV                     mistakes
    Errors source    Flawed            Wrong base form    Wrong            Spelling/
                     base form         to table           morphology       vocalization
                                       association                         error
    # of mistakes    50                0                  7                34                     91
    Details          All mistakes                                          All mistakes
                     resulted from                                         resulted from
                     one flawed                                            one inflection
                     base form                                             table
We estimate that the current resource of over 240,000 fully vocalized and inflected ver-
bal forms for 4,000 base verb forms is over 99.4% correct. This resource is available at
<http://www.cs.bgu.ac.il/∼nlpproj/hebrew-vocalization>.
7.2 Syllable Segmentation Results
7.2.1 Following Behor's Footsteps - Heuristic Approach
The procedure we implemented, which is based on Behor's heuristic, may be applied to any
vocalized word (not just verbs). Yet since we do not possess a dataset that contains
non-verb, fully vocalized words, we use our auto-generated, vocalized verb inflections as a
test data set.
Due to the lack of a Hebrew resource in which syllables are tagged (like the resources used
by [7, 27]), we manually checked the syllable segmentation in order to estimate the accuracy
of our heuristic approach. We tested the correctness of 300 randomly selected inflections.
We find that a given word is correctly segmented into syllables with an accuracy of 81%, and
the probability that an arbitrary syllable is segmented correctly is 85.92%.
Since the !Ë vocalization sign corresponds to two vocalizations (Kamats and Kamats
Katan), we used the more common vocalization corresponding to the marking (Ka-
mats). This point is important since Kamats Katan is regarded as a short vowel and
Kamats as a long one; therefore, we must define how our heuristic deals with
such markings. This choice affects the results significantly: regarding every !Ë as a Ka-
mats Katan results in a poor 75.33% word accuracy, and the probability that an arbitrary
syllable is segmented correctly deteriorates to 81.72%.
7.2.2 Syllable Segmentation According to the Base Form
In order to estimate the accuracy of the string-edit approach to syllable segmentation, we
manually tested the correctness of the same (randomly selected) 300 inflections used
in 7.2.1. Our results show that an inflection is segmented correctly into syllables, according
to string matching with the proper base tense form, in 99.33% of the cases. The probability
for an arbitrary syllable to be segmented correctly is 99.5%. Table 7.2 summarizes and
compares the results of the two syllable segmentation schemes, and Appendix C displays
several verbs segmented into syllables according to their base form.
Table 7.2: Syllable segmentation comparison

                           Checked           Errors in         Accuracy   Accuracy      Na instead   Nach instead   Other
                           words/syllables   words/syllables   per word   per syllable  of Nach      of Na          errors
    Algorithm based on
    Behor's heuristic      300/810           57/114            81%        85.92%        52.63%(1)    43.85%(2)      3.52%
    String matching with
    the base tense form    300/810           2/4               99.33%     99.5%         0%           100%(3)        0%
1. All these errors resulted from a long vowel preceding the Shva. According to Behor,
only a Shva that follows a non-stressed long vowel is a Shva Na; yet since we do not
know the position of the stress, we apply this rule for every long vowel (regardless of
the stress).

2. All these errors resulted from a short vowel preceding the Shva, which is identified as
a Shva Nach by our heuristic.

3. These errors resulted from a mis-alignment of the inflection string and the base tense
form string.
We anticipate that the errors marked in Table 7.2 as (3) can be avoided by adding
another rule to the string matching mechanism, stating that a vocalization sign may
never be replaced by a letter and vice versa.
7.2.3 Discussion
We observe that syllable segmentation is empirically more complex than could have been
expected. Traditional grammars develop the intuition that a few simple rules can accurately
predict syllable segmentation, with the classification of open and closed syllables, and the
classification of Shva into Shva Na and Shva Nach, being the basis of the mechanism.
It turns out that even when we have access to a fully vocalized verb form, we succeed in
only about 80-85% of the cases. Taking into account that automatic vocalization succeeds
at only about 90% accuracy (in the best reported commercial systems), this indicates
that the success rate for automatic segmentation of a non-vocalized word form would be
about 0.9 × 0.8 ≈ 72%.
If, however, we start the process from a non-inflected base form, plus the morpho-
logical attributes of the word, then a two-stage generation process succeeds with almost
perfect accuracy: our vocalization generation implementation succeeds at over 99.4% accu-
racy given the verb inflection paradigm, and the syllable segmentation implementation using
the string-edit distance succeeds at over 99.3% accuracy.
This observation strongly supports a view of Hebrew phonology that is based on a con-
structionist process (deriving phonological information through a process of morphological
inflection from base forms), and not on a pipeline process where morphology first
generates fully inflected word forms that are later segmented into phonological units.
7.3 Results of Unknown Verbs Classification
7.3.1 Base Forms Classification into Binyanim (patterns)
In this experiment we attempt to automatically classify non-vocalized base forms to
their corresponding pattern. An important note is that many non-vocalized base forms
have more than one (valid) corresponding pattern by which the base form may be inflected.
For example, the base form !Nשמ may be inflected by the Paal pattern to form !Nמµש (Shaman),
which means "(became) fat"; by the Piel pattern to form !Nש¤מ (Shimen), which means "(he)
lubricated"; and by the Pual pattern to form !Nש«מ (Shuman), which means "(was) lubricated".
The ultimate goal is, therefore, the classification of such words (words that correspond to
more than one pattern) to clusters of corresponding patterns.
We focus on classifying a subset of our base form data set that includes only base forms
that correspond to exactly one inflection table. This subset includes 2,703 base forms (vs.
4,246 base forms in the whole dataset).
Table 7.3 compares the distribution of base forms among patterns in our two lists (and
not in natural text). We compare base forms in the whole dataset vs. the subset that
includes only base forms with exactly one corresponding pattern.
Table 7.3: Base form distribution among patterns

    Pattern                    Paal     Nifal    Piel     Pual     Hitpael   Hifil    Hufal    Total
    The whole dataset          755      397      1,014    492      685       597      306      4,246
                               17.78%   9.34%    23.88%   11.58%   16.13%    14.06%   7.2%     100%
    The subset with only one   494      316      446      6        657       505      279      2,703
    pattern per base form      18.27%   11.69%   16.5%    0.22%    24.3%     18.68%   10.32%   100%
In Figure 7.1, which corresponds to Table 7.3, we note a significant decrease in base
forms corresponding to the Pual, Piel and Paal patterns. This phenomenon is caused by
several resembling attributes of these patterns. For example, almost every (non-vocalized)
base form that corresponds to the Pual pattern may also be inflected by the Piel pattern.
For this reason the number of Pual base forms plunges so dramatically in our subset. We
anticipate that the resemblance among base forms correlated to these three patterns will
cause the majority of our system's errors.
Figure 7.1: Base form distribution among patterns
We use a Support Vector Machine to classify the base forms according to the length,
letter position and guttural letter position features. The base forms in the data set were
randomly separated into a training set and a testing set: 70% were used for training, and 30%
for testing. Table 7.4 displays a typical confusion matrix, along with the corresponding precision
and recall. The average accuracy we achieve is 90.25% (averaged over 5 predictions with
independently selected training/testing sets).
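The precision and recall values reported in Table 7.4 follow the usual per-class definitions over the confusion matrix (recall row-wise, precision column-wise). A small sketch of the computation; the matrix here is a toy example, not the thesis data:

```python
def per_class_metrics(matrix, labels):
    """Compute per-class recall and precision from a confusion matrix
    given as a dict of dicts: matrix[actual][predicted] = count."""
    recall, precision = {}, {}
    for a in labels:
        row_total = sum(matrix[a][p] for p in labels)
        recall[a] = matrix[a][a] / row_total if row_total else 0.0
    for p in labels:
        col_total = sum(matrix[a][p] for a in labels)
        # precision is undefined (N/A) when nothing was labelled as p,
        # as with the Pual column in Table 7.4
        precision[p] = matrix[p][p] / col_total if col_total else None
    return precision, recall

labels = ["Paal", "Piel"]
matrix = {"Paal": {"Paal": 8, "Piel": 2},
          "Piel": {"Paal": 2, "Piel": 8}}
precision, recall = per_class_metrics(matrix, labels)
```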
As displayed in Table 7.4, the recall measure of the Piel pattern is particularly low;
most errors resulted from a mis-classification of the base forms as corresponding to the Paal
pattern.
Table 7.4: Classification of base forms to patterns - confusion matrix

    Actual \ Classifier   Paal     Nifal    Piel     Pual   Hitpael   Hifil    Hufal    Recall
    Paal                  133      1        14       0      0         0        1        0.8926
    Nifal                 2        93       0        0      0         0        0        0.9789
    Piel                  27       1        86       0      0         0        7        0.7107
    Pual                  0        0        1        0      0         0        0        0
    Hitpael               0        0        0        0      196       5        1        0.9702
    Hifil                 1        0        0        0      1         146      5        0.9542
    Hufal                 0        0        0        0      0         1        89       0.9888
    Precision             0.8159   0.9789   0.85148  N/A    0.9949    0.9605   0.8640
7.3.2 Classification of Base Forms to Inflection Tables
In order to set a baseline for the task of identifying the exact inflection table for a given
unknown base form, we first test the success rate of this task with the length and
letter position features only. Table 7.5 compares the accuracy and the accuracy's STD
of this baseline to a classifier that also uses the gutturals feature. Presented accuracies were
averaged over 5 independent experiments, with 70% of the samples used for training and
the remaining 30% for testing.
Table 7.5: Identification of exact inflection table - comparison of baseline manipulations

    Letter position feature manipulation       Baseline classifier   Classifier with the gutturals feature
                                               Accuracy/STD          Accuracy/STD
    Without the letter position feature(1)     24.06%/0.45           32.5%/0.81
    With 1 letter per feature(2)               66.88%/0.89           68.63%/0.9
    With 1 and 2 letters per feature(3)        62.63%/1.18           65.67%/1.88
    With 1, 2 and 3 letters per feature(4)     62.58%/2.49           65.32%/2.43
1. No letter position features are created.
2. A feature is created per letter.
3. A feature is created per 2 consecutive letters, in addition to previous features (2).
4. A feature is created per 3 consecutive letters, in addition to previous features (3).
The best results are achieved when the letter position feature uses one letter only; we
believe this is a direct consequence of the enlarged number of features created when more than
one letter is used. When the 1- and 2-letters-per-feature configuration is used, the average
number of created features increases by a factor of ∼2.7. These additional features cause the
SVM's dimensionality to grow, making our model less stable (higher variance) and less accurate.
The addition of the guttural feature improved the best result by 1.75%, and again, 1-letter
position features produced the best results.
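The growth in feature count is easy to see when the letter-position features are generated explicitly. A sketch, where the feature naming mimics the AT_&lt;position&gt;__&lt;letters&gt; convention of Table 6.2 (the code and the transliterated example word are illustrative):

```python
def letter_position_features(word, max_len):
    """Create one feature per subsequence of consecutive letters, up to
    max_len letters per feature, keyed by start position (as in the
    1-, 2- and 3-letters-per-feature configurations of Table 7.5)."""
    feats = set()
    for n in range(1, max_len + 1):
        for i in range(len(word) - n + 1):
            feats.add("AT_%d__%s" % (i, "&".join(word[i:i + n])))
    return feats

# For a 4-letter word: 4 unigram features, +3 bigrams, +2 trigrams.
assert len(letter_position_features("hkrz", 1)) == 4
assert len(letter_position_features("hkrz", 2)) == 7
assert len(letter_position_features("hkrz", 3)) == 9
```

Over a whole lexicon the number of distinct features grows much faster with longer subsequences, which matches the observed ∼2.7× growth and the resulting increase in variance.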
7.3.3 Classification of Base Forms to Inflection Tables with Corpus
Level Features
In order to improve our results, we attempted to use corpus level features (as described
in 6.3). Due to the lack of a fully vocalized corpus, we make use of the non-vocalized corpus
described in 5.2.
In order to decide which inflection table sets to use, we rely on the inflection tables that
cause the classifier to err the most, as displayed in Figure 7.2. The error rates displayed
in Figure 7.2 correspond to the 25 pairs of inflection tables which the classifier confuses
the most. The figure displays the number of errors in a simple, one-phase classification,
averaged over 5 independent classification experiments (sorted by number of errors), and the
corresponding cumulative error rate. All 5 experiments were configured to use our best
experiment configuration (including the guttural feature and 1 letter for the letter position
features).
Figure 7.2: Inflection table confusion pairs
* A, B, C, D, E, F, G correspond to the patterns Paal, Nifal, Piel, Pual, Hitpael, Hifil, Hufal
respectively. The numbers associated with the patterns correspond to specific inflection
tables (according to [31]).
Table 7.6 displays the average accuracy and STD achieved by the two step classification
mechanism, where corpus level features are used in the second classification phase (as de-
scribed in 6.3). We define the inflection table clusters for the second phase by unifying the
i most confusing pairs (displayed in Figure 7.2). The unification is done according to the
scheme displayed in Algorithm 9.
Algorithm 9 Unification of clusters of inflection tables which cause confusion

    Require: Set of inflection table confusion sets S = {s1, ..., sn}, a new pair of
             confusing inflection tables p
      c1 ← p.getFirst()
      c2 ← p.getSecond()
      if c1 ∈ sa and c2 ∈ sb s.t. sa, sb ∈ S and sa ≠ sb then
        return S.unify(S.find(c1), S.find(c2))
      else if c1 ∈ sa and c2 ∈ sb s.t. sa, sb ∈ S and sa = sb then
        return S
      else if c1 ∈ sa and c2 ∉ sb s.t. sa ∈ S, ∀sb ∈ S then
        return S.unify(S.find(c1), S.makeSet(c2))
      else if c1 ∉ sa and c2 ∈ sb s.t. ∀sa ∈ S, sb ∈ S then
        return S.unify(S.makeSet(c1), S.find(c2))
      else
        return S.unify(S.makeSet(c1), S.makeSet(c2))
      end if
* makeSet, find and unify are the known methods of the Union-Find data structure:

    ◦ S.makeSet(x): creates a new set in S that includes only x.
    ◦ S.find(x): finds the set in S which contains x.
    ◦ S.unify(x,y): merges the set that includes x in S with the set that includes y in S.
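Algorithm 9 can be sketched directly with a union-find-style structure; here clusters are plain Python sets and the merge logic mirrors the case analysis above (a simplified stand-in, not the thesis code):

```python
def unify_confusion_pair(clusters, pair):
    """Merge a new pair of confusable inflection tables into an existing
    list of clusters, as in Algorithm 9: join the clusters containing
    each member, creating singleton clusters for unseen tables."""
    c1, c2 = pair
    s1 = next((s for s in clusters if c1 in s), None)
    s2 = next((s for s in clusters if c2 in s), None)
    if s1 is None:
        s1 = {c1}
        clusters.append(s1)
    if s2 is None:
        s2 = {c2}
        clusters.append(s2)
    if s1 is not s2:
        s1 |= s2          # unify the two clusters in place
        clusters.remove(s2)
    return clusters

# Reproducing the A1-C1 / A17-C4 / A22-C1 sequence from Table 7.6:
clusters = []
for pair in [("A1", "C1"), ("A17", "C4"), ("A22", "C1")]:
    unify_confusion_pair(clusters, pair)
```

After the three pairs are processed, the clusters are {A1, A22, C1} and {A17, C4}, matching the cluster column of rows 1-3 in Table 7.6.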
To our disappointment, only the A1-C1 cluster shows a significant improvement in the
overall accuracy. Generally, the more clusters we define, the higher the first step accuracy
becomes (although there exist "difficult" clusters which disrupt the classifier, such as F9-F16).
Yet, in many cases the corpus level features do not seem effective enough to improve our results
by improving the accuracy of the second classification step. We do expect that using a
vocalized corpus instead may improve our accuracy more significantly; at this time, though,
we do not possess such a resource.
Table 7.6: Classi�cation of base forms to in�ection tables with corpus level features
# of pairs New pair Clusters of First step Cluster Totalin�ection tables accuracy/STD accuracy?/STD accuracy/STD
0 � 68.63/0.9 68.63/0.91 A1-C1 72.62%/0.016 70.08%/1.4
{A1,C1} 55.94%/3.82 A17-C4 72.65%/1.19 68.97%/1.06
{A1,C1} 48.13%/5.14{A17,C4} 53.68%/5.91
3 A22-C1 73.53%/1.25 68.8%/0.9{A1,A22,C1} 48.92%/5.15{A17,C4} 51.87%/7.84
4 F9-F16 72.6%/2.42 67.12%/2.24{A1,A22,C1} 48.19%/6.07{A17,C4} 56.1%/5.35{F9,F16} 52.47%/20.98
5 C12-C20 73.31%/1.15 67.12%/0.77{A1,A22,C1} 50.88%/4.09{A17,C4} 50.21%/9.2{C12,C20} 69.95%/5.53{F9,F16} 48.48%/6.83
6 A1-A22 No change No change No change No change7 B1-C20 73.29%/2.18 67.27%/2.83
{A1,A22,C1} 53.96%/4.14{A17,C4} 56.69%/13.28
{B1,C12,C20} 78.1%/3.88{F9,F16} 35.95%/11.92
* Note that each cluster accuracy includes mistakes which result from mis-classifications in the
first classification step (classification into clusters).
7.3.4 Discussion
While classification of verbs into one of the 7 Binyanim (patterns) of Hebrew is a simple
task given the letter pattern of the base form of a verb, the confusion between a few patterns
remains challenging (in particular Paal vs. Piel). The pattern information, though, is not
sufficient to predict the vocalization of the verb's many inflections.
Full classification of a non-vocalized verb form into a non-ambiguous inflection paradigm
is a much more challenging task. This contradicts traditional grammars, which develop the
intuition that simple letter-based rules are sufficient to predict the vocalization paradigm of
a Hebrew verb. This approach only succeeds at a rate of about 68% on a large sample of
about 2,700 verbs.
The addition of simple corpus-based features improves the classification for some con-
fusion sets. We experimented with simple unsupervised corpus-based features and observed
that for one of the most common confusion sources (A1-C1, which is the most common
group of Paal / Piel verbs), corpus-based features improve the classification significantly.
We hypothesize that stronger corpus-based features based on vocalized data would provide
higher improvements for more confusion sets.
This finding indicates that the Hebrew verb word formation system is much more irregular
than one could assume on the basis of traditional grammar descriptions.
Part IV
Chapter 8
Conclusions and Future Work
8.1 Conclusions
This work aimed to create and use a vocalized Hebrew dataset for various NLP usages.
Through a generation model, we developed a large, fully vocalized dataset which includes
about 240k Hebrew inflected verbs along with their corresponding morphological attributes.
The 240k inflections are associated with over 4k base forms of Hebrew verbs. A manual check
of a sample including over 15,000 inflections and their corresponding morphology estimates
that the dataset accuracy of spelling, vocalization and morphological attributes is around 99.4%
(under the uniformity assumption mentioned in 7.1). Generally, it seems that considerable,
fully vocalized and morphologically tagged datasets, with a minimal error rate, can be
obtained by constructing a comprehensive generation mechanism. Obviously, now that this
mechanism is implemented, new verbs and their corresponding inflections may easily be
added to the dataset. This analysis indicates that given a non-vocalized base verb form and
one of about 300 unambiguous inflection paradigms, one can deterministically generate a
fully accurate vocalized form of any inflected form of the verb.
Having this dataset, we developed two algorithms for segmentation of vocalized Hebrew
words into syllables. Our first algorithm used the vocalized word as its only input. Based on
a medieval heuristic developed by Rabbi Eliyahu Behor, we managed to segment 85.92%
of the syllables and 81% of the words correctly. This accuracy measurement was made by
manually checking a random sample of 300 verbs that comprised 810 syllables. Our second
syllable segmentation algorithm used the vocalized word together with its corresponding
origin form (or absolute state for nouns). We tested segmentation accuracy with the same
sample. Our second approach significantly improved our results: some 99.5% of the syllables
and 99.33% of the words were segmented correctly. This improved method requires addi-
tional knowledge to accurately perform syllable segmentation: if the vocalized non-inflected
base tense form of the verb is available, we can perform the task almost perfectly. This
finding strongly supports a constructionist view of Hebrew phonology.
In order to estimate the amount of knowledge required to map non-vocalized base forms
to inflection tables (and, therefore, to predict the full vocalization paradigm of the verb),
we experimented with automatic classification of non-vocalized, non-inflected base forms of
verbs into one of over 260 inflection paradigms. This task proved to be rather challenging,
since many non-vocalized base forms correspond to more than one possible inflection
table. We focused on classifying base forms with only one corresponding inflection table.
We applied a two step classification process: first, classification into clusters of inflection
tables, and second, pinpointing the table in the cluster. Following this approach we achieved
70.08% accuracy in classifying a base form into over 260 possible inflection tables. In
addition, we also used our classifier for classifying base forms of verbs into their correlated
patterns; here we achieve 90.25% accuracy. We conclude that classifying base forms into
specific inflection tables and into patterns are tasks that are far from trivial with no contextual
features available, and yet the use of the guttural and the corpus level features enabled us
to produce improved results. We expect the corpus-level features will play an even larger
role when a vocalized corpus becomes available.
8.2 Future Work
8.2.1 Generation
Implementing Rare Inflection Tables
The scope of this work included the implementation of over 260 common inflection tables.
An additional 40 infrequent tables still need to be implemented to cover all known cases of
verbal inflection.
Implementing Inflection Tables for Nouns
As in other languages, nouns in Hebrew are by far the most common part of speech. Unlike
verbs, the inflection tables for nouns consist of 10 inflections on average (as opposed to 60
inflections on average for verbs). Yet, the number of noun inflection tables to be
implemented is significantly larger than the overall 300 verb inflection tables [5].
By creating a comprehensive generation mechanism for nouns, along with a significant
list of nouns and their corresponding inflection tables, we expect to obtain a vast dataset
which will complement the vocalized verbs dataset generated in this work. We expect this
comprehensive, vocalized and morphologically tagged dataset will provide the foundation
for the implementation of an automatic, highly accurate Hebrew vocalization system.
8.2.2 Syllable Segmentation
Searching for Optimal Weights
Our string matching algorithm uses weights to define the benefit/penalty of match-
ing/replacing/inserting/deleting a letter or a vocalization sign. The selection of these weights
was made manually, based on a general understanding of the Hebrew inflectional model's
properties. This selection may be improved, to achieve better results (over our manually
segmented 300 verbs), by searching for optimal weights. This can be achieved either by a
simple grid search (within some boundaries), or by a more sophisticated search scheme (such
as gradient descent, a genetic algorithm, etc.) over the weights.
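The bounded grid search can be sketched as follows. Here `evaluate` stands for re-running the segmentation over the 300 manually segmented verbs with the candidate weights and returning the accuracy; the objective used below is a dummy placeholder, not real segmentation data:

```python
from itertools import product

def grid_search(weight_names, grid, evaluate):
    """Exhaustively try every combination of candidate weight values and
    return the assignment maximizing the evaluation score."""
    best_score, best_weights = float("-inf"), None
    for values in product(grid, repeat=len(weight_names)):
        weights = dict(zip(weight_names, values))
        score = evaluate(weights)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score

# Dummy objective: pretend accuracy peaks at match=0, replace=2.
def toy_evaluate(w):
    return -abs(w["match"] - 0) - abs(w["replace"] - 2)

best, score = grid_search(["match", "replace"], [0, 1, 2, 3], toy_evaluate)
```

The grid grows exponentially in the number of weights, which is why the text suggests bounded grids or smarter search schemes when more weights are tuned.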
Machine Learning of Syllable Segmentation
Our success in segmenting vocalized words into syllables given their base form may be used
to improve syllable segmentation when the base form of the word is not available. This may
be achieved by using our 240k verbs, segmented into syllables with over 99% word accuracy,
as a corpus for a supervised learning mechanism. By doing so, we believe the 81% word
segmentation accuracy may be improved with no need for a base form (which might be
difficult to obtain).
8.2.3 Unknown Verbs Classification
Using a Vocalized Corpus
Currently, we classify base forms into inflection tables by using length, letter position and
guttural letter position as features. In addition, we attempted to use inflection counts from
a non-vocalized corpus as an enhancement to these features, by performing the classification
in two steps. As an additional possible feature we suggest using a vocalized corpus, in
which the inflection counts may be much more accurate. We expect the accuracy to improve
due to the very low ambiguity rate of vocalized Hebrew words (in comparison to the very
high ambiguity of non-vocalized Hebrew). Such accurate counts of inflections in a vast
corpus may improve our classification accuracy significantly.
Classifying Vocalized Verbs into Inflection Tables
Once a vocalized corpus becomes available, our motivation for automatically classifying
base forms of verbs into their corresponding inflection tables naturally increases. Obviously,
the accuracy of classification will increase significantly, and automatically gathering base
forms for our generation mechanism will become immediate.
Performing Feature Selection
As described in Table 7.5, once the number of features grows, the accuracy drops significantly.
We suspect that performing feature selection (such as discriminant analysis or principal
component analysis) may improve the robustness of our model, and our results, by eliminating
irrelevant features.
General Verbs Classification
The current classification task we confront is classifying base forms of verbs into their
corresponding inflection tables/patterns. A more general task would consider classifying
any inflection of a verb (not just a base form) into its relevant inflection table. Such a
classifier could use the results of our generation mechanism for supervised learning. Once
such a system is implemented, any verb in a given corpus may be used to enhance our
dataset (unlike our current mechanism, which uses only base form verbs in the corpus).
Exploring the SVM Parameters
In this work we used a fixed setting of the SVM parameters (C, g, kernel, etc.). A search
for optimal SVM settings may be performed (as in 8.2.2) either by a simple grid search, or
by a more sophisticated search scheme.
8.2.4 Automatic Vocalization
Setting a Baseline with a Vocalized Corpus
Given a comprehensive vocalized corpus, which we anticipate obtaining in several months'
time, the statistics of possible vocalizations for a Hebrew word may easily be gathered
(using a word segmentation system). For example, as described in 1.1, the non-vocalized
word ספר! has a number of possible vocalizations; using the vocalized corpus we could count
the number of instances of each valid vocalization. Then, in order to set a baseline, we could
vocalize each word with its most common vocalization. This scheme, by which we intend to
set our baseline, is not new; in fact it is similar to [23], [13] and [37]. The innovation in such
work lies in the corpus used: for the first time, a modern, typical vocalized text will be used
for learning the distribution of vocalized words. Therefore, we believe our renewed baseline
may reflect the difficulty of the automatic vocalization challenge much more accurately.
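The counting baseline reduces to a frequency table; a sketch, with ASCII placeholders standing in for vocalized Hebrew strings (the `strip` mapping is an illustrative stand-in for real vocalization-sign removal):

```python
from collections import Counter, defaultdict

def build_baseline(vocalized_tokens, strip_vocalization):
    """Map each non-vocalized word to its most frequent vocalized form
    in the corpus, as in the proposed most-common-vocalization baseline."""
    counts = defaultdict(Counter)
    for w in vocalized_tokens:
        counts[strip_vocalization(w)][w] += 1
    return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}

# Toy corpus: lowercase letters stand in for vocalization signs to be stripped.
strip = lambda w: "".join(c for c in w if c.isupper())
corpus = ["SeFeR", "SeFeR", "SaFaR", "SoFeR"]
baseline = build_baseline(corpus, strip)
```

Here all three forms collapse to the same non-vocalized key, and the baseline picks the most frequent one, exactly the decision rule proposed for the baseline vocalizer.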
Improving the Baseline via Supervised Learning
In order to improve the baseline described in 8.2.4 we suggest the scheme described in
Algorithm 10.
Algorithm 10 Supervised learning for automatic vocalization

    Require: A vocalized corpus C
      W ← segment the words in C
      for all w such that w ∈ W do
        x ← w stripped from its vocalization signs
        V ← all possible vocalizations of x in C (x's confusion set)
      end for
      Split W into training and testing sets, and use supervised machine learning with con-
      textual features to vocalize words correctly into one of the possible vocalizations in V.
Appendix A
Data Set Example - Base Forms
Correlated to Inflection Tables
The following words are a sample from our dataset of base forms which includes over 4k
base forms:
<base form, pattern, Table number>
A,16,אבד!
C,20,אבזר!
!Nאבח,C,25
C,26,אבטח!
A,14,אגד!
!Pאג,A,14
A,14,אגר!
!Pאגר,C,20
A,21,אהב!
A,20,אהד!
D,14,אבזר!
!Nאבח,D,16
D,19,אבטח!
D,1,אבק!
!Nאב,D,3
D,1,אגד!
!Pאג,D,1
D,10,אדה!
C,20,אורר!
D,14,אורר!
Appendix B
Data Set Example - Inflected
Vocalized Verbs
The following words are a sample from our automatically generated 250k dataset of
inflected and fully vocalized verbs along with their corresponding morphological attributes
(Time+Person+Gender+Number+Spelling):
<pattern, Table number, vocalized in�ection, morphology, corresponding base form>
A,1,!בג®ד�תי,PAST+FIRST+MF+SINGULAR+COMPLETE,!בג®ד
A,1,!בג®ד�ת,PAST+SECOND+M+SINGULAR+COMPLETE,!בג®ד
A,1,!בג®ד�ת,PAST+SECOND+F+SINGULAR+COMPLETE,!בג®ד
A,1,!בג®ד,PAST+THIRD+M+SINGULAR+COMPLETE,!בג®ד
A,1,!בג�ד´ה,PAST+THIRD+F+SINGULAR+COMPLETE,!בג®ד
A,1,!בג®ד�נו,PAST+FIRST+MF+PLURAL+COMPLETE,!בג®ד
A,1,!Mבג®ד�ת,PAST+SECOND+M+PLURAL+COMPLETE,!בג®ד
A,1,!Nבג®ד�ת,PAST+SECOND+F+PLURAL+COMPLETE,!בג®ד
A,1,!בג�דו,PAST+THIRD+M+PLURAL+COMPLETE,!בג®ד
A,1,!בג�דו,PAST+THIRD+F+PLURAL+COMPLETE,!בג®ד
A,1,!בוג¦ד,PRESENT+FIRST+M+SINGULAR+COMPLETE, בג®ד!
A,1,!ד»תªבוג,PRESENT+FIRST+F+SINGULAR+COMPLETE, בג®ד!
A,1,!בוג¦ד,PRESENT+SECOND+M+SINGULAR+COMPLETE, בג®ד!
A,1,!ד»תªבוג,PRESENT+SECOND+F+SINGULAR+COMPLETE, בג®ד!
A,1,!בוג¦ד,PRESENT+THIRD+M+SINGULAR+COMPLETE, בג®ד!
A,1,!ד»תªבוג,PRESENT+THIRD+F+SINGULAR+COMPLETE, בג®ד!
A,1,!Mבוג�ד£י,PRESENT+FIRST+M+PLURAL+COMPLETE, בג®ד!
A,1,!בוג�דות,PRESENT+FIRST+F+PLURAL+COMPLETE, בג®ד!
A,1,!Mבוג�ד£י,PRESENT+SECOND+M+PLURAL+COMPLETE, בג®ד!
Appendix C
Data Set Example - Inflected
Verbs Segmented into Syllables
The following segmented words are a sample from our automatically generated dataset of
inflected verbs which were segmented into syllables according to the string matching based
algorithm:
בÊג®ד�Êתי!
בÊג®ד�Êת!
בÊג®ד�ת!
בÊג®ד!
בÊג�ד´ה!
בÊג®ד�Êנו!
!MתÊבג®ד�
!NתÊבג®ד�
בÊג�דו!
בוÊג¦ד!
בוÊגʪד»ת!
!Mג�ד£יÊבו
בוÊג�דות!
בÊגוד!
בגוÊד´ה!
!Mד£יÊבגו
בגוÊדות!
אבÊגוד!
אבÊגד!
תבÊגוד!
Bibliography
[1] The academy for the hebrew language. http://hebrew-academy.huji.ac.il.
[2] Qaryan hebrew tts. http://www.software112.com/products/qaryan-hebrew-
tts.html.
[3] M.M. Adler. Hebrew morphological disambiguation: An unsupervised stochastic
word-based approach. PhD thesis, Citeseer, 2007.
[4] E. Avneyon, R. Nir, and I. Yosef. Milon sapir: The Concise Sapphire Dictionary.
Hed Artsi, Tel Aviv. (in Hebrew), 1997.
[5] S. Barkali. The complete tablet of names: tablets for the in�ections of names
in all their forms. Luah ha-shemot ha-shalem: luhot le-netiyat ha-shemot al kol
mishkalehem ve-tsurotehem bi-tsiruf reshimah shel beerekh 20,000 shemot-etsem
mi-tekufat ha-Tanakh ve-ad yemenu... Re'uven Mas (in Hebrew), 1962.
[6] S. Barkali. The complete tablet of Verbs: tablets for the in�ection of verbs. Luach
ha-pe'alim ha-shalem: luchot li-netiyat ha-pe'alim... reshimah mechudeshet shel
kol shoreshe ha-pe'alim ba-loshon ha-'Ivrit... Re'uven Mas (in Hebrew), 1988.
[7] S. Bartlett, G. Kondrak, and C. Cherry. Automatic syllabi�cation with structured
svms for letter-to-phoneme conversion. Proceedings of ACL-08: HLT, pages 568�
576, 2008.
[8] T. Berkovich. Snopi automatic nikud. http://www.nakdan.com/Nakdan.aspx.
[9] Y. Choueka and Y. Neeman. Nakdan-text,(an in-context text-vocalizer for modern
hebrew). In BISFAI-95, The Fifth Bar Ilan Symposium for Arti�cial Intelligence,
1995.
[10] D. Dannélls and J.J. Camilleri. Verb morphology of hebrew and maltese-towards
an open source type theoretical resource grammar in gf. In Proceedings of LREC
2010. Workshop on Language Resources (LRs) and Human Language Technologies
(HLT) for Semitic Languages Status, Updates, and Prospects., 2010.
67
BIBLIOGRAPHY 68
[11] A. Even-Shoshan. A new dictionary of the Hebrew language. Kiryat-Sefer,
Jerusalem (in Hebrew), 1981.
[12] R. Finkel and G. Stump. Generating Hebrew verb morphology by default inheritance
hierarchies. In Proceedings of the ACL-02 Workshop on Computational Approaches
to Semitic Languages, pages 1–10. Association for Computational Linguistics, 2002.
[13] Y. Gal. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings
of the ACL-02 Workshop on Computational Approaches to Semitic Languages,
pages 1–7. Association for Computational Linguistics, 2002.
[14] Y. Goldberg and M. Elhadad. Hebrew dependency parsing: Initial results. In
Proceedings of the 11th International Conference on Parsing Technologies, pages
129–133. Association for Computational Linguistics, 2009.
[15] Y. Goldberg, R. Tsarfaty, M. Adler, and M. Elhadad. Enhancing unlexicalized
parsing performance using a wide-coverage lexicon, fuzzy tag-set mapping, and
EM-HMM-based lexical probabilities. In Proceedings of the 12th Conference of the
European Chapter of the Association for Computational Linguistics, pages 327–335.
Association for Computational Linguistics, 2009.
[16] L. Haizhou and Y. Baosheng. Chinese word segmentation. Language, 18:212–217,
1998.
[17] N. Har'el and D. Kenigsberg. Hspell – the free Hebrew spell checker and morphological
analyzer. In Israeli Seminar on Computational Linguistics, 2004.
[18] C. Huang, J. Gao, M. Li, and X. Chang. Chinese word segmentation, 2005. EP
Patent 1,515,240.
[19] Alon Itai and Shuly Wintner. Language resources for Hebrew.
Language Resources and Evaluation, 42(1):75�98, March 2008.
http://www.mila.cs.technion.ac.il/mila/eng/index.html.
[20] D. Jurafsky and J.H. Martin. Speech and language processing: An introduction
to natural language processing, computational linguistics, and speech recognition.
Prentice Hall, 2006.
[21] D. Kamir, N. Soreq, and Y. Neeman. A comprehensive NLP system for Modern
Standard Arabic and Modern Hebrew. In Proceedings of the ACL-02 Workshop
on Computational Approaches to Semitic Languages, pages 1–9. Association for
Computational Linguistics, 2002.
[22] S. Klein and R.F. Simmons. A computational approach to grammatical coding of
English words. Journal of the ACM (JACM), 10(3):334–347, 1963.
[23] L. Kontorovich. Problems in Semitic NLP: Hebrew vocalization using HMMs. 2001.
[24] M. Levinger, A. Itai, and U. Ornan. Learning morpho-lexical probabilities from
an untagged corpus with an application to Hebrew. Computational Linguistics,
21(3):383–404, 1995.
[25] J.K. Low, H.T. Ng, and W. Guo. A maximum entropy approach to Chinese
word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese
Language Processing, pages 161–164, Jeju Island, Korea, 2005.
[26] Melingo. Kolan. http://www.melingo.co.il/kolan.htm.
[27] K. Müller. Automatic detection of syllable boundaries combining the advantages
of treebank and bracketed corpora training. In Proceedings of the 39th Annual
Meeting of the Association for Computational Linguistics, pages 410–417. Association
for Computational Linguistics, 2001.
[28] R. Nelken and S.M. Shieber. Arabic diacritization using weighted finite-state
transducers. Computational Approaches to Semitic Languages, 8:79, 2005.
[29] N. Neser. Hanikud halachah lema'ase. Mesada (in Hebrew), 1976.
[30] Z. Qun and C. Yu. Research on a Chinese word segmentation algorithm based on
special identifiers. Computing and Intelligent Systems, pages 377–385, 2011.
[31] S. Asif. The tablets of verbs for the Hebrew language (Luahot ha-pe'alim ba-safa
ha-'Ivrit). Prolog Publishing House Ltd. (in Hebrew), 2009.
[32] H. Safadi, O. Dakkak, and N. Ghneim. Computational methods to vocalize Arabic
texts. In Second Workshop on Internationalizing SSML, 2006.
[33] M. Saimaiti and Z. Feng. A syllabification algorithm and syllable statistics of
written Uyghur. 2008.
[34] S. Shoval. Nikuda, 2010. http://www.nikuda.co.il/.
[35] K. Sima'an, A. Itai, Y. Winter, A. Altman, and N. Nativ. Building a tree-bank
of modern hebrew text. Traitement Automatique des Langues, 42(2), 2001.
[36] Torah Educational Software. Auto nikud plus.
http://www.jewishstore.com/Software/AutoNikud.htm.
[37] M. Spiegel and J. Volk. Hebrew vowel restoration with neural networks. In Class
of 2003 Senior Conference on Natural Language Processing. Citeseer, 2003.
[38] P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009.
[39] P. Taylor, A. Black, and R. Caley. The architecture of the Festival speech synthesis
system. In The Third ESCA Workshop on Speech Synthesis, pages 147–151.
Citeseer, 1998.
[40] D. Yarowsky. Decision lists for lexical ambiguity resolution: Application to accent
restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the
Association for Computational Linguistics, pages 88–95. Association for Computational
Linguistics, 1994.
[41] Y. Zamir. HOCR – a Hebrew optical character recognition library, 2008.
http://hocr.berlios.de/index.html.
[42] I. Zitouni and R. Sarikaya. Arabic diacritic restoration approach based on maximum
entropy models. Computer Speech & Language, 23(3):257–276, 2009.