Automatic Hebrew Vocalization
By: Eran Tomer
Advisor: Prof. Michael Elhadad
Natural Language Processing
The field of computational linguistics attempts to model and study languages using computational techniques.
The diverse challenges confronted by computational linguistics researchers include:
- Machine translation
- Automatic text summarization
- Speech-to-text
- Text-to-speech
Outline: Introduction | Motivation & Objectives | Research Questions | Previous work | Background | Generation | Syllable segmentation | Verb classification | Discussion & Future work
Hebrew Natural Language Processing
Accomplishing NLP tasks for Hebrew is made difficult by two factors:
- Lack of large-scale, annotated resources
  - Supervised learning is generally hard to apply
- High ambiguity rate
  - A given Hebrew word may have an astonishing number of different meanings and pronunciations, e.g. ספר, שלט, שערה, משנה
Related Work
- The Hebrew TreeBank: 5,000 segmented and morphologically tagged sentences
- Mila: various corpora, lexicons and some NLP tools
- Word segmentation example: התה שכאן מאתמול ("the tea that has been here since yesterday") is segmented as ה + תה, ש + כאן, מ + אתמול
- Morphological tagging example (אכול את התפוח, "eat the apple"):
  - אכול: Verb, Singular, Masculine, 2nd person, Imperative
  - את: Preposition
  - ה: Determiner
  - תפוח: Noun, Singular, Masculine
Motivation
- Development of a Hebrew text-to-speech (TTS) system
  - A vocalized and syllabified word may be used as a normalized form for a Hebrew TTS system
- Generation of vocalized text for teaching
  - Vocalized inflected words are widely used for teaching, yet they are difficult to obtain (they do not exist in dictionaries)
- Improving automatic translation systems
  - אתה תמונה לתפקיד is machine-translated as "You picture the job" (intended: "You will be appointed to the position")
  - אני עובד משני עד רביעי is machine-translated as "I work two to four" (intended: "I work from Monday to Wednesday")
Objectives
- Generation
  - Automatically producing fully vocalized verb inflections with the corresponding morphological attributes
- Syllable segmentation
  - Automatically segmenting vocalized words into syllables
- Unknown verb classification
  - Classifying verbs into their corresponding patterns
  - Automatically selecting an inflection schema for an unknown verb
Research Questions
- The Hebrew verb
  - How complex must the computational model be for full morphological and vocalization generation of verbs?
  - How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon?
- Syllable segmentation
  - How complex is syllable segmentation?
  - What level of knowledge is required for successful segmentation?
Previous Work
Vocalization
- Free systems: Snopi Automatic Nikud; Nikuda
- Academic systems: Kontorovich (2001); Gal (2002); Spiegel & Volk (2003)
- Commercial systems: Nakdan Text (Melingo); Auto Nikud; Nakdanit
Syllable segmentation
- Academic systems: Finkel & Stump (2002)
- Commercial systems: Kolan (Melingo)
Generation
- Free systems: Hspell (2002)
- Academic systems: Finkel & Stump (2002); Dannélls & Camilleri (2010)
Background – Hebrew Vocalization
- Vowels vs. consonants
  - Consonant letters are either vocalized by Shva (◌ְ) or left non-vocalized at the end of a word
  - There are two types of Shva: Na and Nach
  - A letter that functions as a vowel is vocalized by one of the following vowel and semi-vowel signs:
    - A: Kamats (◌ָ), Patah (◌ַ), Hataf Patah (◌ֲ)
    - E: Segol (◌ֶ), Tsere (◌ֵ), Hataf Segol (◌ֱ)
    - I: Hirik (◌ִ)
    - O: Holam (◌ֹ), Holam Male (וֹ), Kamats Katan (◌ָ), Hataf Kamats (◌ֳ)
    - U: Kubuts (◌ֻ), Shuruk (וּ)
- Example: נְבַלְבֵּל (1st person, plural) and יְבַלְבֵּל (3rd person, singular); in נְבַלְבֵּל the first Shva is Na and the second is Nach
Background – Hebrew Vocalization
Diacritic signs may change the pronunciation of letters:
- Dagesh
  - Dagesh (◌ּ) emphasizes letters, yet in modern Hebrew it affects only בּ/ב, כּ/כ and פּ/פ
  - Dagesh Kal vocalizes ב, ג, ד, כ, פ, ת
    - at the beginning of a word
    - after a Shva Nach
  - Dagesh Hazak vocalizes any letter other than א, ה, ח, ע, ר
    - following certain linguistic phenomena
    - in some noun/verb patterns
- Mapik
  - Mapik (◌ּ) denotes a consonantal (emphasized) Hey at the end of the word
- Shin dots
  - Shin dots distinguish the pronunciation of ש as SH (שׁ) or S (שׂ)
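The two Dagesh Kal conditions listed above can be sketched as a small predicate. This is only an illustration: the word representation (a list of `(letter, mark)` pairs with mark names such as `"shva_nach"`) and the function name `takes_dagesh_kal` are assumptions made for this example, not part of the presented system, and final-form letters are ignored.

```python
# Letters that can take a Dagesh Kal, as listed on the slide.
BEGADKEFAT = set("בגדכפת")

def takes_dagesh_kal(word, i):
    """Return True if the letter at position i takes a Dagesh Kal.

    `word` is a hypothetical representation: a list of (letter, mark) pairs,
    where mark is a vowel name, "shva_na", "shva_nach", or None.
    """
    letter, _ = word[i]
    if letter not in BEGADKEFAT:
        return False
    if i == 0:
        # Condition 1: the letter opens the word.
        return True
    # Condition 2: the previous letter carries a Shva Nach.
    return word[i - 1][1] == "shva_nach"

# midbar: the bet follows a Shva Nach, the dalet follows a vowel.
midbar = [("מ", "hirik"), ("ד", "shva_nach"), ("ב", "kamats"), ("ר", None)]
print([takes_dagesh_kal(midbar, i) for i in range(len(midbar))])
```

Dagesh Hazak is deliberately left out, since the slide ties it to linguistic phenomena and patterns that a local rule like this cannot see.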
Background – Hebrew Vocalization
- Syllables
  - Hebrew words are composed of syllables; a syllable is a phonological unit that is pronounced in one effort
- Stress
  - Hebrew words follow one of two stress schemes:
    - Milel (מלעיל): the penultimate syllable is stressed
    - Milra (מלרע): the last syllable is stressed
- Deficient spelling vs. plene spelling
  - In many cases there is more than one valid way to spell a given Hebrew word
Background – Hebrew Vocalization
The syllables and vowels rule (כלל ההברות והתנועות)

Require: a stressed/non-stressed syllable (s)
if s is a non-stressed syllable then
    if s is an open syllable then
        vocalize s with a long vowel
    else
        vocalize s with a short vowel
else
    in most cases s should be vocalized with a long vowel, yet the number of exceptions is considerable

Sound group   Long vowel                 Short vowel
A             Kamats                     Patah
E             Tsere, Tsere Male (with י)  Segol
I             Hirik Male (with י)         Hirik
U             Shuruk (וּ)                 Kubuts
O             Holam (וֹ)                  Kamats Katan
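The syllables and vowels rule above maps directly onto a two-level decision. The sketch below is a minimal illustration; the function name and the returned labels are chosen for this example, and the "usually long" branch is left open because the rule itself notes a considerable number of exceptions for stressed syllables.

```python
def expected_vowel_length(stressed: bool, open_syllable: bool) -> str:
    """Vowel length predicted by the syllables and vowels rule."""
    if not stressed:
        # Non-stressed: open syllables take a long vowel, closed ones a short vowel.
        return "long" if open_syllable else "short"
    # Stressed: long in most cases, with many exceptions.
    return "usually long"

# Illustrative calls: a non-stressed closed syllable, a non-stressed open
# syllable, and a stressed syllable.
print(expected_vowel_length(stressed=False, open_syllable=False))
print(expected_vowel_length(stressed=False, open_syllable=True))
print(expected_vowel_length(stressed=True, open_syllable=False))
```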
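Background – Hebrew Vocalization
Examples
- עַכְבָּר: syllable segmentation עכ-בר, stressed syllable בר
- נָהָר: syllable segmentation נ-הר, stressed syllable הר
- לַיְלָה: syllable segmentation לי-לה, stressed syllable לי (exception)
- דֶּלֶת: syllable segmentation ד-לת, stressed syllable ד (exception)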
Stressed syllable?
  Yes: usually long
  No: open syllable?
    Yes: long
    No: short
Background – Hebrew Vocalization
Verbs
- Morphological attributes
  - Tense: Past, Beinoni (Participle), Present, Future, Imperative
  - Person: First, Second, Third
  - Number: Singular, Plural
  - Gender: Masculine, Feminine, Both
- Patterns
  - Verb (פועל)
    - Light patterns (הבניינים הקלים): Paal, Nifal, Hifil, Hufal
    - Heavy patterns (הבניינים הכבדים): Piel, Pual, Hitpael
Background – Hebrew Vocalization
- The Hebrew paradigms
  - Hebrew verbs are clustered into several paradigms, characterized by the manner in which their verbs inflect:
    - Complete paradigms (גזרות השלמים)
    - Crippled paradigms (גזרות נחות)
    - Defective paradigms (גזרות חסרות)
    - etc.
- Inflection tables
  - Paradigms are further partitioned into about 300 specific inflection tables, which describe the inflections of specific verb families
Background – Hebrew Vocalization
Inflection tables - example
Datasets
- Verbs list
  - Over 4,000 manually gathered verbs
  - Morphology: deficient spelling, past tense, masculine, singular, 3rd person
  - Shin dots are indicated
  - The corresponding inflection table is indicated for each verb
- Morphologically analyzed corpora
  - About 50 million fully morphologically disambiguated words
  - Material from the "Haaretz" newspaper, the "Tapuz" website, the "Knesset" discussions and other resources
Generation
- Method
  - We implemented 264 inflection tables, each of which:
    - Takes: a verb (v) from our verb list dataset, together with its corresponding inflection table
    - Returns: the vocalized inflections of v with appropriate morphological tags
- Results
  - A list of more than 240,000 vocalized verbs with appropriate morphological attributes
- Evaluation
  - A sample of over 15,000 inflected verbs was manually validated, with 99.4% accuracy
Generation – results sample
C-20, פצפץ:
פִּצְפַּצְתִּי, PAST+FIRST+MF+SINGULAR+COMPLETE
פִּצְפַּצְתָּ, PAST+SECOND+M+SINGULAR+COMPLETE
פִּצְפַּצְתְּ, PAST+SECOND+F+SINGULAR+COMPLETE
פִּצְפֵּץ, PAST+THIRD+M+SINGULAR+COMPLETE
פִּצְפְּצָה, PAST+THIRD+F+SINGULAR+COMPLETE
פִּצְפַּצְנוּ, PAST+FIRST+MF+PLURAL+COMPLETE
פִּצְפַּצְתֶּם, PAST+SECOND+M+PLURAL+COMPLETE
פִּצְפַּצְתֶּן, PAST+SECOND+F+PLURAL+COMPLETE
פִּצְפְּצוּ, PAST+THIRD+M+PLURAL+COMPLETE
פִּצְפְּצוּ, PAST+THIRD+F+PLURAL+COMPLETE
…
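To make the shape of an inflection table concrete, here is a deliberately simplified sketch that reproduces only the unvocalized letters and the tags of the past-tense sample above. Everything in it is an assumption made for illustration: the real tables also produce full vocalization, which this sketch ignores, and the names `PAST_SUFFIXES` and `inflect_past` are hypothetical.

```python
# Final-form letters and their regular counterparts, needed when a suffix
# is attached after what used to be the last letter of the base form.
FINAL_TO_REGULAR = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}

# Hypothetical unvocalized past-tense table: (suffix, morphological tag).
PAST_SUFFIXES = [
    ("תי", "PAST+FIRST+MF+SINGULAR"),
    ("ת", "PAST+SECOND+M+SINGULAR"),
    ("ת", "PAST+SECOND+F+SINGULAR"),
    ("", "PAST+THIRD+M+SINGULAR"),
    ("ה", "PAST+THIRD+F+SINGULAR"),
    ("נו", "PAST+FIRST+MF+PLURAL"),
    ("תם", "PAST+SECOND+M+PLURAL"),
    ("תן", "PAST+SECOND+F+PLURAL"),
    ("ו", "PAST+THIRD+M+PLURAL"),
    ("ו", "PAST+THIRD+F+PLURAL"),
]

def inflect_past(base, paradigm="COMPLETE"):
    """Return (unvocalized form, tag) pairs for a base form (past, 3rd, m, sg)."""
    stem = base
    if base and base[-1] in FINAL_TO_REGULAR:
        # The last letter is no longer final once a suffix follows it.
        stem = base[:-1] + FINAL_TO_REGULAR[base[-1]]
    forms = []
    for suffix, tag in PAST_SUFFIXES:
        form = stem + suffix if suffix else base
        forms.append((form, tag + "+" + paradigm))
    return forms

for form, tag in inflect_past("פצפץ"):
    print(form, tag)
```

A table of this kind is plain data plus a small amount of string handling, which is consistent with the idea of implementing one table per verb family; the vocalization patterns would be additional data attached to each row.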
Syllable segmentation
- Method
  - Syllable segmentation requires Shva classification:
    - Shva Na marks a syllable start*
    - Shva Nach denotes a syllable end*
    - Each syllable includes exactly one vowel*
    * According to the Even-Shoshan dictionary
  - We implemented two Shva classification schemes:
    - Heuristic approach (Rabbi Eliyahu Behor)
    - Shva classification according to the base tense form
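Once every Shva has been classified, the three statements above are enough to drive a simple left-to-right segmenter. The sketch below is one possible reading, not the presented implementation: the input representation (a list of `(letters, mark)` units whose mark is `"vowel"`, `"shva_na"`, `"shva_nach"` or `None`) is an assumption, and vowel letters such as ו and י are assumed to appear as unmarked units after the consonant that carries the vowel.

```python
def segment(units):
    """Split a word into syllables given per-letter marks.

    units: list of (letters, mark), mark in {"vowel", "shva_na", "shva_nach", None}.
    A Shva Na opens a new syllable; a Shva Nach or an unmarked letter stays in
    the current syllable; a second full vowel forces a new syllable.
    """
    syllables = []
    current = []
    has_vowel = False  # does the current syllable already hold its one vowel?
    for letters, mark in units:
        if mark == "shva_na":
            starts_new = bool(current)
        elif mark == "vowel":
            starts_new = has_vowel
        else:
            starts_new = False
        if starts_new:
            syllables.append("".join(current))
            current = []
            has_vowel = False
        current.append(letters)
        if mark == "vowel":
            has_vowel = True
    if current:
        syllables.append("".join(current))
    return syllables

akhbar = [("ע", "vowel"), ("כ", "shva_nach"), ("ב", "vowel"), ("ר", None)]
print("-".join(segment(akhbar)))
```

The segmenter itself stays trivial; all of the difficulty is pushed into deciding which Shva is Na and which is Nach.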
Syllable segmentation
Heuristic approach
- By Behor, a Shva is a Shva Na if:
  - It vocalizes the first letter of the word
  - It follows another Shva and it is not at the word end
  - It follows a long, stressed vowel (stress is needed)
  - It vocalizes a letter with Dagesh Hazak (Dagesh type is needed)
  - It vocalizes the first of two identical letters (many exceptions)
- By our (adapted) heuristic, a Shva is a Shva Na if:
  - It vocalizes the first letter of the word
  - It follows another Shva and it is not at the word end
  - It follows a long vowel
- A Shva is a Shva Nach if:
  - It is followed by another Shva
- In any other case, we use Shva Nach as the default
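The adapted heuristic above can be written as a short rule cascade. The following is a sketch under stated assumptions: the word is a list of `(letters, mark)` units in which a vowel letter is folded into the unit of the consonant it follows, the set `LONG_VOWELS` follows the usual long/short division, and checking the "followed by another Shva" rule before the long-vowel rule is a choice made here, since the slide does not state a priority.

```python
# Assumed names for the long vowels; adjust to the representation in use.
LONG_VOWELS = {"kamats", "tsere", "hirik_male", "shuruk", "holam"}

def classify_shvas(word):
    """Map the index of each Shva in `word` to "na" or "nach".

    word: list of (letters, mark); mark is a vowel name, "shva", or None.
    """
    result = {}
    last = len(word) - 1
    for i, (_, mark) in enumerate(word):
        if mark != "shva":
            continue
        prev_mark = word[i - 1][1] if i > 0 else None
        next_mark = word[i + 1][1] if i < last else None
        if i == 0:
            kind = "na"      # vocalizes the first letter of the word
        elif next_mark == "shva":
            kind = "nach"    # followed by another Shva
        elif prev_mark == "shva" and i != last:
            kind = "na"      # follows another Shva, not at the word end
        elif prev_mark in LONG_VOWELS:
            kind = "na"      # follows a long vowel
        else:
            kind = "nach"    # default
        result[i] = kind
    return result

yilmedu = [("י", "hirik"), ("ל", "shva"), ("מ", "shva"), ("דו", "shuruk")]
print(classify_shvas(yilmedu))
```

Note what the adapted version drops relative to Behor's rules: stress, Dagesh type and identical-letter checks, all of which need information beyond the vocalized string.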
Syllable segmentation
- Shva classification according to the base tense form
  - Through our generation mechanism, we can correlate verb inflections with their corresponding base-tense form
  - A Shva present in the base-tense form is a Shva Nach
  - Otherwise the Shva is a Shva Na
- Matching inflections to base-tense forms
  - We use a dynamic programming string matching algorithm
  - Operation costs were customized to be character dependent, respecting the Hebrew inflectional model
- Example alignment (operation sequence): I I C R C C C C C C C C I R
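The dynamic programming matching described above is in the family of weighted edit-distance alignments. The sketch below shows that general technique with a pluggable, character-dependent cost function; it is not the presented algorithm, and the concrete weights, the `AFFIX_LETTERS` set and the extra delete operation `D` are assumptions made for illustration.

```python
def align(base, inflected, cost):
    """Weighted edit-distance alignment of a base form to an inflection.

    Returns (total_cost, ops) where ops uses C (copy), R (replace),
    I (insert a letter of the inflection) and D (delete a letter of the base).
    `cost(op, a, b)` returns the price of one operation.
    """
    n, m = len(base), len(inflected)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + cost("D", base[i - 1], None)
        back[i][0] = "D"
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + cost("I", None, inflected[j - 1])
        back[0][j] = "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = base[i - 1], inflected[j - 1]
            diag = "C" if a == b else "R"
            candidates = [
                (dist[i - 1][j - 1] + cost(diag, a, b), diag),
                (dist[i][j - 1] + cost("I", None, b), "I"),
                (dist[i - 1][j] + cost("D", a, None), "D"),
            ]
            dist[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    # Walk the back-pointers from the end to recover the operation sequence.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("C", "R"):
            i, j = i - 1, j - 1
        elif op == "I":
            j -= 1
        else:
            i -= 1
    return dist[n][m], "".join(reversed(ops))

# Hypothetical weights: letters that commonly act as affixes are cheap to insert.
AFFIX_LETTERS = set("אהוימנת")

def example_cost(op, a, b):
    if op == "C":
        return 0.0
    if op == "I":
        return 0.5 if b in AFFIX_LETTERS else 1.0
    if op == "D":
        return 1.0
    return 1.5  # replace

print(align("כתב", "כתבתי", example_cost))
```

With the alignment in hand, each Shva in the inflection can be traced back to a copied position in the base form (or to none), which is the information the base-tense classification rule needs.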
Syllable segmentation
- Results
  - Thanks to our generation model, we obtain 240k highly accurate vocalized verbs
  - We applied our two approaches to obtain two lists of verbs segmented into syllables:
    - By our heuristic approach (based on Behor's heuristic)
    - By our customized string matching algorithm
- Evaluation
  - A sample of 300 segmented verbs was validated:
    - 81% word accuracy and 85.92% syllable accuracy for the heuristic
    - 99.33% word accuracy and 99.5% syllable accuracy for the string matching approach
Syllable segmentation – results sample
גלם:
גו-למ-תי, ג-למ-תי
גו-למ-ת, ג-למ-ת
גו-למת, ג-למת
גו-למ-תם, ג-למ-תם
גו-למ-תן, ג-למ-תן
גו-למו, ג-למו
גו-לם, ג-לם
גו-למה, ג-למה
גו-למ-נו, ג-למ-נו
Verb classification to patterns
- Method
  - We implemented a classifier (SVM) which:
    - Takes: a non-vocalized verb (v)
    - Returns: the pattern corresponding to v
  - The SVM uses:
    - Dataset: over 2,700 verbs from our verb list; 70% are used for training and 30% for testing
    - Features: word length, letter positions, guttural letter positions
- Evaluation
  - 90.25% of the verbs were classified correctly into their corresponding Hebrew pattern
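The three feature families listed above (word length, letter positions, guttural letter positions) could be encoded along the following lines. This is a hedged sketch: the feature names, the dictionary encoding and the choice of א, ה, ח, ע as the guttural set are assumptions for illustration, not the encoding used in the presented classifier.

```python
# Assumed guttural set; some grammars also group ר with these letters.
GUTTURALS = set("אהחע")

def extract_features(verb):
    """Encode a non-vocalized verb as a sparse feature dictionary."""
    features = {"length": len(verb)}
    for pos, letter in enumerate(verb):
        # Which letter sits at which position.
        features["letter@%d=%s" % (pos, letter)] = 1
        # Whether that position holds a guttural letter.
        if letter in GUTTURALS:
            features["guttural@%d" % pos] = 1
    return features

print(extract_features("העביר"))
```

Dictionaries of this kind can be turned into fixed-length vectors (for example with scikit-learn's `DictVectorizer`) before being passed to an SVM; that step is omitted here to keep the example free of external dependencies.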
Unknown verb classification to inflection tables
- Method
  - We implemented a classifier (SVM) which:
    - Takes: a non-vocalized verb (v)
    - Returns: the inflection table corresponding to v
  - The SVM uses:
    - Dataset: over 2,700 verbs from our verb list; 70% are used for training and 30% for testing
    - Features: word length, letter positions, guttural letter positions, and corpus-level features (from the 50M-word morphologically disambiguated corpus)
- Evaluation
  - Without corpus-level features: 68.63% accuracy
  - With corpus-level features: 70.08% accuracy
Discussion
The Hebrew verb inflectional model
- Q: How complex must the computational model be for full morphological and vocalization generation of verbs?
  - A: By implementing 264 inflection tables we achieve 99.4% accuracy
- Q: How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon?
  - A: The 264 implemented inflection tables include many exception tables, each describing the inflectional model of only a few verbs
  - Our more general model, unknown verb classification, yields 70% accuracy (selecting 1 inflection table out of the total 264 tables)
  - For comparison, the baseline of always selecting the most frequent inflection table yields only 34% accuracy
  - A rough estimate shows that over 93% of the verbs in a large corpus exist in our dataset; moreover, most unknown verbs are either misspelled or falsely tagged as verbs
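The most-frequent-table baseline mentioned above is computed by always predicting the single most common label seen in training. A minimal sketch follows; the label values in the usage example are made up for illustration and do not come from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

# Hypothetical inflection-table labels.
train = ["T1", "T1", "T2"]
test = ["T1", "T2", "T1", "T3"]
print(majority_baseline_accuracy(train, test))
```

Reporting this number next to the classifier's accuracy shows how much of the result comes from the skewed distribution of inflection tables rather than from the features.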
Discussion
Syllable segmentation
- Q: How complex is syllable segmentation?
  - A: Contrary to traditional grammars, a few simple rules do not provide highly accurate segmentation
  - We achieved 99.3% word accuracy and 99.5% syllable accuracy through Shva classification
- Q: What level of knowledge is required for successful syllable segmentation?
  - A: Using the vocalized word only, we achieve correct word segmentation with 81% accuracy
  - Using the base tense form as well improves word accuracy to 99.3%
- This improvement suggests:
  - Hebrew phonology uses a constructive process, which derives inflections from base tense forms
  - Inflections are not generated in a pipeline process, in which morphology would first generate inflections that are later segmented into phonological units
Future work
- Generation
  - Implementing rare inflection tables
  - Implementing inflection tables for nouns
- Syllable segmentation
  - Searching for optimal Hebrew string matching weights
  - Machine learning of syllable segmentation
Future work
- Unknown verb classification
  - Using vocalized corpora to extract corpus-level features
  - Performing feature selection
  - Classification of vocalized verbs into inflection tables
  - Classification of inflections into inflection tables
  - Exploring the SVM parameters
- Automatic vocalization
  - We hope to obtain a substantial vocalized corpus (the Aviv encyclopedia), which will enable:
    - Setting a baseline for automatic vocalization using a modern vocalized corpus
    - Improving the baseline through supervised learning
The End