Automatic Hebrew Vocalization

Post on 05-Jan-2016


Automatic Hebrew Vocalization (ניקוד אוטומטי)

By: Eran Tomer
Advisor: Prof. Michael Elhadad

Natural Language Processing

The computational linguistics field attempts to model and study languages using computational techniques.

The diverse challenges confronted by computational linguistics researchers include:
- Machine translation
- Automatic text summarization
- Speech-to-text
- Text-to-speech
- Etc.

Outline: Introduction, Motivation & Objectives, Research Questions, Previous work, Background, Generation, Syllable segmentation, Verb classification, Discussion & Future work

Hebrew Natural Language Processing

Two factors make NLP tasks difficult for Hebrew:
- Lack of large-scale annotated resources: supervised learning is generally hard to apply.
- High ambiguity rate: a given Hebrew word may have a large number of different meanings and pronunciations, e.g. ספר, שלט, שערה, משנה.


Related Work

- The Hebrew TreeBank: 5,000 segmented and morphologically tagged sentences
- Mila: various corpora, lexicons and some NLP tools

Word segmentation, e.g. התה שכאן מאתמול ("the tea that has been here since yesterday"):
- התה: ה + תה
- שכאן: ש + כאן
- מאתמול: מ + אתמול

Morphological tagging, e.g. אכול את התפוח ("eat the apple"):
- אכול: Verb, Singular, Masculine, 2nd person, Imperative
- את: Preposition
- ה: Determiner
- תפוח: Noun, Singular, Masculine


Motivation

- Development of a Hebrew Text-To-Speech system: a vocalized and syllabified word may be used as a normalized form for a Hebrew TTS system.
- Generation of vocalized text for teaching: vocalized inflected words are widely used for teaching but are difficult to obtain (they do not exist in dictionaries).
- Improving automatic translation systems. Examples of mistranslated unvocalized input:
  - אתה תמונה לתפקיד is translated as "You picture the job" (intended: "You will be appointed to the position")
  - אני עובד משני עד רביעי is translated as "I work two to four" (intended: "I work from Monday to Wednesday")


Objectives

- Generation: automatically producing fully vocalized verb inflections with the corresponding morphological attributes
- Syllable segmentation: automatically segmenting vocalized words into syllables
- Unknown verb classification:
  - Classifying verbs into their corresponding patterns
  - Automatically selecting an inflection schema for an unknown verb


Research Questions

The Hebrew verb
- How complex must the computational model be for full morphological and vocalization generation of verbs?
- How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon?

Syllable segmentation
- How complex is syllable segmentation?
- What level of knowledge is required for successful segmentation?


Previous Work

Vocalization
- Free systems: Snopi Automatic Nikud, Nikuda
- Academic systems: Kontorovich (2001), Gal (2002), Spiegel & Volk (2003)
- Commercial systems: Nakdan Text (Melingo), Auto Nikud, Nakdanit

Syllable segmentation
- Academic systems: Finkel & Stump (2002)
- Commercial systems: Kolan (Melingo)

Generation
- Free systems: Hspell (2002)
- Academic systems: Finkel & Stump (2002), Dannélls & Camilleri (2010)


Background – Hebrew Vocalization

Vowels vs. consonants
- Consonant letters are either vocalized by a Shva or left non-vocalized at the end of a word.
- There are two types of Shva: Na and Nach.
- A letter that functions as a vowel is vocalized by one of the following vowel and semi-vowel signs:

  Sound   Signs
  A       Kamats, Patah, Hataf Patah
  E       Segol, Tsere, Hataf Segol
  I       Hirik
  O       Holam, Holam Male (with ו), Kamats Katan, Hataf Kamats
  U       Kubuts, Shuruk (with ו)

Examples of Shva Na and Shva Nach:
- נְבַלְבֵל (1st person, plural): the first Shva is Na, the second is Nach
- יְבַלְבֵל (3rd person, singular): the first Shva is Na, the second is Nach


Background – Hebrew Vocalization

Diacritic signs may change the pronunciation of letters.

Dagesh
- A Dagesh emphasizes letters, yet in modern Hebrew it affects only the pairs כּ/כ, בּ/ב and פּ/פ.
- Dagesh Kal marks the letters ב, ג, ד, כ, פ, ת:
  - At the beginning of a word
  - After a Shva Nach
- Dagesh Hazak marks any letter other than א, ה, ח, ע, ר:
  - Following certain linguistic phenomena
  - In some noun/verb patterns

Mapik
- A Mapik denotes a consonantal (pronounced) Hey at the end of a word.

Shin dots
- The Shin dots distinguish the pronunciation of ש as SH (שׁ) or S (שׂ).


Background – Hebrew Vocalization

Syllables
- Hebrew words are composed of syllables. A syllable is a phonological unit that is pronounced in one effort.

Stress
- Hebrew words follow one of two stress schemes:
  - Milel (מלעיל): the penultimate syllable is stressed
  - Milra (מלרע): the last syllable is stressed

Deficient spelling vs. plene spelling
- In many cases there is more than one valid way to spell a given Hebrew word.


Background – Hebrew Vocalization

The syllables and vowels rule (כלל ההברות והתנועות)

  Require: a stressed or non-stressed syllable s
  if s is a non-stressed syllable then
      if s is an open syllable then
          vocalize s with a long vowel
      else
          vocalize s with a short vowel
  else
      in most cases s should be vocalized with a long vowel, yet the number of exceptions is considerable

  Sound group   Long vowel                    Short vowel
  A             Kamats                        Patah
  E             Tsere (also with י)           Segol
  I             Hirik with י                  Hirik
  U             Shuruk (with ו)               Kubuts
  O             Holam (also with ו)           Kamats Katan
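The rule above can be written as a small function. This is a minimal sketch only: the `Syllable` type and the returned labels are illustrative names, not part of the original work, and the stressed case is returned as "usually long" because the slides note that it has many exceptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    """Hypothetical container for the two properties the rule looks at."""
    stressed: bool  # True if the syllable carries the word stress
    open: bool      # True if the syllable ends in a vowel (no closing consonant)


def expected_vowel_length(s: Syllable) -> str:
    """Return the vowel length predicted by the syllables and vowels rule."""
    if not s.stressed:
        # Non-stressed syllables: open -> long vowel, closed -> short vowel.
        return "long" if s.open else "short"
    # Stressed syllables are long in most cases, with many exceptions.
    return "usually long"
```

For example, the first syllable of עכבר (עכ) is non-stressed and closed, so the sketch predicts a short vowel, while the first syllable of נהר (נ) is non-stressed and open, so it predicts a long vowel.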


Background – Hebrew Vocalization

Examples

  Word    Vocalized   Syllable segmentation   Stressed syllable   Note
  עכבר    עַכְבָר       עכ-בר                   בר
  נהר     נָהָר        נ-הר                    הר
  לילה    לַיְלָה       לי-לה                   לי                  exception
  דלת     דֶלֶת        ד-לת                    ד                   exception


Stressed syllable?
  Yes: usually long
  No: open syllable?
    Yes: long
    No: short

Background – Hebrew Vocalization

Verbs: morphological attributes
- Tense: Past, Beinoni (Participle), Present, Future, Imperative
- Person: First, Second, Third
- Number: Singular, Plural
- Gender: Masculine, Feminine, Both

Verbs: patterns
- Verb (פועל)
  - Light patterns (הבניינים הקלים): Paal, Nifal, Hifil, Hufal
  - Heavy patterns (הבניינים הכבדים): Piel, Pual, Hitpael


Background – Hebrew Vocalization

The Hebrew paradigms
- Hebrew verbs are clustered into several paradigms, characterized by the way their verbs inflect:
  - Complete paradigms (גזרות השלמים)
  - Crippled paradigms (גזרות נחות)
  - Defective paradigms (גזרות חסרות)
  - Etc.

Inflection tables
- Paradigms are further partitioned into about 300 specific inflection tables, each describing the inflections of a specific verb family.


Background – Hebrew Vocalization

[Figure: example inflection table]


Datasets

Verbs list
- Over 4k manually gathered verbs
- Morphology: deficient spelling, past, masculine, singular, 3rd person
- Shin dots are indicated
- The corresponding inflection table is indicated for each verb

Morphologically analyzed corpora
- About 50 million fully morphologically disambiguated words
- Material from the "Haaretz" newspaper, the "Tapuz" website, the "Knesset" discussions and other resources


Generation

Method
- We implemented 264 inflection tables, each of which:
  - Takes: a verb (v) from our verb list dataset, together with its corresponding inflection table
  - Returns: the vocalized inflections of v with the appropriate morphological tags

Results
- A list of more than 240,000 vocalized verbs with the appropriate morphological attributes

Evaluation
- A sample of over 15,000 inflected verbs was manually validated, with 99.4% accuracy


Generation – results sample

C-20, פצפץ:
  פִצְפַצְתִי, PAST+FIRST+MF+SINGULAR+COMPLETE
  פִצְפַצְתָ, PAST+SECOND+M+SINGULAR+COMPLETE
  פִצְפַצְתְ, PAST+SECOND+F+SINGULAR+COMPLETE
  פִצְפֵץ, PAST+THIRD+M+SINGULAR+COMPLETE
  פִצְפְצָה, PAST+THIRD+F+SINGULAR+COMPLETE
  פִצְפַצְנו, PAST+FIRST+MF+PLURAL+COMPLETE
  פִצְפַצְתם, PAST+SECOND+M+PLURAL+COMPLETE
  פִצְפַצְתן, PAST+SECOND+F+PLURAL+COMPLETE
  פִצְפְצו, PAST+THIRD+M+PLURAL+COMPLETE
  פִצְפְצו, PAST+THIRD+F+PLURAL+COMPLETE
  …
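To illustrate the idea of an inflection table as a function from a base form to tagged inflections, here is a heavily simplified sketch. It is not the authors' implementation: it handles only the unvocalized past-tense suffixes visible in the C-20 sample, and the names `PAST_TABLE`, `FINAL_TO_REGULAR` and `inflect` are hypothetical. The real tables also produce the vocalization.

```python
# Final-form letters and their regular (non-final) counterparts.
FINAL_TO_REGULAR = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}

# Toy past-tense table: (unvocalized suffix, morphological tags),
# modelled on the tags shown in the C-20 results sample.
PAST_TABLE = [
    ("תי", "PAST+FIRST+MF+SINGULAR+COMPLETE"),
    ("ת", "PAST+SECOND+M+SINGULAR+COMPLETE"),
    ("ת", "PAST+SECOND+F+SINGULAR+COMPLETE"),
    ("", "PAST+THIRD+M+SINGULAR+COMPLETE"),
    ("ה", "PAST+THIRD+F+SINGULAR+COMPLETE"),
    ("נו", "PAST+FIRST+MF+PLURAL+COMPLETE"),
    ("תם", "PAST+SECOND+M+PLURAL+COMPLETE"),
    ("תן", "PAST+SECOND+F+PLURAL+COMPLETE"),
    ("ו", "PAST+THIRD+M+PLURAL+COMPLETE"),
    ("ו", "PAST+THIRD+F+PLURAL+COMPLETE"),
]


def inflect(base, table=PAST_TABLE):
    """Return (form, tags) pairs for a base form (past, 3rd person, masculine, singular)."""
    # A final-form last letter becomes regular once a suffix follows it.
    stem = base[:-1] + FINAL_TO_REGULAR.get(base[-1], base[-1])
    return [((stem + suffix) if suffix else base, tags) for suffix, tags in table]
```

Calling `inflect("פצפץ")` returns ten pairs, starting with the unvocalized counterpart of the first line of the sample above.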


Syllable segmentation

Method
- Syllable segmentation requires Shva classification:
  - Shva Na marks a syllable start*
  - Shva Nach denotes a syllable end*
  - Each syllable includes exactly one vowel*
  * According to the Even-Shoshan dictionary
- We implemented two Shva classification schemes:
  - A heuristic approach (Rabbi Eliyahu Behor)
  - Shva classification according to the base tense form
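Once every Shva has been classified, the three conditions above are enough to cut a word into syllables. The following is a minimal sketch under simplifying assumptions, not the authors' code: a word is given as a list of (letter, mark) pairs, where the mark is a vowel name, "shva", or None for an unmarked letter, and `shva_classes` maps the index of each Shva to "na" or "nach". The representation and the function name `segment` are hypothetical.

```python
def segment(word, shva_classes):
    """Split a vocalized word into syllables joined by '-'.

    word: list of (letter, mark) pairs; mark is a vowel name, "shva" or None.
    shva_classes: dict mapping the index of each Shva to "na" or "nach".
    """
    syllables, current, has_vowel = [], "", False
    for i, (letter, mark) in enumerate(word):
        is_vowel = mark not in (None, "shva")
        is_na = mark == "shva" and shva_classes.get(i) == "na"
        # A syllable holds exactly one vowel, so a second vowel opens a new
        # syllable; a Shva Na also opens a new syllable.
        if current and has_vowel and (is_vowel or is_na):
            syllables.append(current)
            current, has_vowel = "", False
        # Unmarked letters and Shva Nach letters stay in the current syllable.
        current += letter
        has_vowel = has_vowel or is_vowel
    if current:
        syllables.append(current)
    return "-".join(syllables)
```

With the Shva of עכבר classified as Nach, the sketch yields עכ-בר, and for נהר (no Shva) it yields נ-הר, matching the segmentation examples in the background slides.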


Syllable segmentation

Heuristic approach

According to Behor, a Shva is a Shva Na if:
- It vocalizes the first letter of the word
- It follows another Shva and is not at the end of the word
- It follows a long, stressed vowel (stress information is needed)
- It vocalizes a letter with Dagesh Hazak (the Dagesh type is needed)
- It vocalizes the first of two identical letters (many exceptions)

According to our adapted heuristic, a Shva is a Shva Na if:
- It vocalizes the first letter of the word
- It follows another Shva and is not at the end of the word
- It follows a long vowel

A Shva is a Shva Nach if:
- It is followed by another Shva

In any other case, we use Shva Nach as the default.
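The adapted heuristic can be sketched as follows. The input representation (a list of (letter, mark) pairs with vowel names as strings), the set `LONG_VOWELS` and the function name are assumptions made for illustration. The slides do not state which rule wins when two rules apply, so checking the Shva Nach rule first is also an assumption.

```python
# Assumed names for the long vowels of the syllables and vowels rule.
LONG_VOWELS = {"kamats", "tsere", "hirik_male", "shuruk", "holam"}


def classify_shvas(word):
    """Classify every Shva in a word as 'na' or 'nach'.

    word: list of (letter, mark) pairs; mark is a vowel name, "shva" or None.
    Returns a dict mapping the index of each Shva to its class.
    """
    result = {}
    last = len(word) - 1
    for i, (_, mark) in enumerate(word):
        if mark != "shva":
            continue
        prev_mark = word[i - 1][1] if i > 0 else None
        next_mark = word[i + 1][1] if i < last else None
        if next_mark == "shva":
            result[i] = "nach"   # followed by another Shva
        elif i == 0:
            result[i] = "na"     # vocalizes the first letter of the word
        elif prev_mark == "shva" and i != last:
            result[i] = "na"     # follows another Shva, not at the word end
        elif prev_mark in LONG_VOWELS:
            result[i] = "na"     # follows a long vowel
        else:
            result[i] = "nach"   # default
    return result
```

For נְבַלְבֵל the sketch classifies the first Shva as Na (first letter) and the second as Nach (default), in line with the Na/Nach example in the background slides.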


Syllable segmentation

Shva classification according to the base tense form
- Through our generation mechanism, we can correlate verb inflections with their corresponding base-tense forms
- A Shva that is present in the base-tense form is a Shva Nach
- Otherwise the Shva is a Shva Na

Matching inflections to base-tense forms
- We use a dynamic programming string matching algorithm
- Operation costs were customized to be character dependent, respecting the Hebrew inflectional model
- Example alignment, one operation per aligned character: I I C R C C C C C C C C I R
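A dynamic programming string matcher with character-dependent costs can be sketched as a weighted edit distance. The slides do not give the actual costs, so the cost functions here are caller-supplied parameters and the affix-letter example below is purely hypothetical. The sketch returns only the total cost; recovering the operation sequence would need an additional backtrace over the table.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost, del_cost):
    """Minimum total cost of turning source into target.

    sub_cost(a, b), ins_cost(b) and del_cost(a) return the cost of replacing
    a with b, inserting b and deleting a; costs may depend on the characters.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + del_cost(source[i - 1]),                   # delete
                dist[i][j - 1] + ins_cost(target[j - 1]),                   # insert
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),  # copy/replace
            )
    return dist[-1][-1]


# Hypothetical costs: letters that often appear in affixes are cheap to insert.
AFFIX_LETTERS = set("תינוהמא")


def cheap_affix_insert(ch):
    return 0.5 if ch in AFFIX_LETTERS else 1.0


def unit_substitution(a, b):
    return 0.0 if a == b else 1.0
```

With these example costs, matching the base form כתב against the inflection כתבתי costs 1.0 (two cheap insertions), whereas uniform unit costs would give 2.0.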


Syllable segmentation

Results
- Thanks to our generation model, we obtain 240k highly accurate vocalized verbs
- We applied our two approaches to obtain two lists of verbs segmented into syllables:
  - By our heuristic approach (based on Behor's heuristic)
  - By our customized string matching algorithm

Evaluation
- A sample of 300 segmented verbs was validated:
  - Heuristic approach: 81% word accuracy and 85.92% syllable accuracy
  - String matching approach: 99.33% word accuracy and 99.5% syllable accuracy


Syllable segmentation – results sample

Verb: גלם

  גו-למ-תי, ג-למ-תי, גו-למ-ת, ג-למ-ת, גו-למת, ג-למת
  גו-למ-תם, ג-למ-תם, גו-למ-תן, ג-למ-תן, גו-למו, ג-למו
  גו-לם, ג-לם, גו-למה, ג-למה, גו-למ-נו, ג-למ-נו


Verb classification to patterns

Method
- We implemented a classifier (SVM) which:
  - Takes: a non-vocalized verb (v)
  - Returns: the pattern corresponding to v
- The SVM uses:
  - Dataset: over 2,700 verbs from our verb list, 70% used for training and 30% for testing
  - Features: word length, letter positions, guttural letter positions

Evaluation
- 90.25% of the verbs were classified correctly into their corresponding Hebrew pattern
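The three feature families listed above could be extracted along the following lines. This is a sketch under stated assumptions: the slides do not specify the encoding, so the feature names, the guttural set `GUTTURALS` (taken here to be א, ה, ח, ע) and the length cap `MAX_LEN` are all hypothetical.

```python
GUTTURALS = set("אהחע")  # assumption: letters treated as guttural in this sketch
MAX_LEN = 8              # assumption: positions beyond this are ignored


def extract_features(verb):
    """Map a non-vocalized verb to a dict of simple surface features."""
    features = {"length": len(verb)}
    for pos, letter in enumerate(verb[:MAX_LEN]):
        features[f"letter_{pos}"] = letter                  # which letter sits at each position
        features[f"guttural_{pos}"] = letter in GUTTURALS   # whether that position holds a guttural
    return features
```

A dictionary like this would still have to be turned into numeric vectors (for instance by one-hot encoding the letter features) before it could be passed to an SVM.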


Unknown verb classification to inflection tables

Method
- We implemented a classifier (SVM) which:
  - Takes: a non-vocalized verb (v)
  - Returns: the inflection table corresponding to v
- The SVM uses:
  - Dataset: over 2,700 verbs from our verb list, 70% used for training and 30% for testing
  - Features: word length, letter positions, guttural letter positions, and corpus-level features (from the 50M-word morphologically disambiguated corpus)

Evaluation
- Without corpus-level features: 68.63% accuracy
- With corpus-level features: 70.08% accuracy


Discussion

The Hebrew verb inflectional model
- Q: How complex must the computational model be for full morphological and vocalization generation of verbs?
  A: By implementing 264 inflection tables we achieve 99.4% accuracy.
- Q: How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon?
  A: The 264 implemented inflection tables include many exception tables, each describing the inflectional model of only a few verbs.
- Our more general model, unknown verb classification, yields 70% accuracy (selecting 1 inflection table out of the total 264 tables).
- For comparison, the baseline of always choosing the most frequent inflection table yields only 34% accuracy.
- A rough estimate shows that over 93% of the verbs in a large corpus exist in our dataset; moreover, most unknown verbs are either misspelled or falsely tagged as verbs.
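The most-frequent-table baseline mentioned above is simple to compute. A minimal sketch, assuming the labels are inflection table identifiers and that the most frequent label is taken from the training portion (the slides do not say how the baseline was computed; the function name is illustrative):

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)
```

A classifier is only informative to the extent that it beats this number, which is why the 70% result is reported next to the 34% baseline.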


Discussion

Syllable segmentation
- Q: How complex is syllable segmentation?
  A: Contrary to what traditional grammars suggest, a few simple rules do not provide highly accurate segmentation. We achieved 99.3% word accuracy and 99.5% syllable accuracy through Shva classification.
- Q: What level of knowledge is required for successful syllable segmentation?
  A: Using the vocalized word only, we achieve correct word segmentation with 81% accuracy. Using the base tense form as well improves word accuracy to 99.3%.

This improvement suggests:
- Hebrew phonology uses a constructive process, which derives inflections from base tense forms
- Inflections are not generated in a pipeline process, in which morphology would first generate inflections that are later segmented into phonological units


Future work

Generation
- Implementing rare inflection tables
- Implementing inflection tables for nouns

Syllable segmentation
- Searching for optimal Hebrew string matching weights
- Machine learning of syllable segmentation


Future work

Unknown verb classification
- Using vocalized corpora to extract corpus-level features
- Performing feature selection
- Classification of vocalized verbs into inflection tables
- Classification of inflections into inflection tables
- Exploring the SVM parameters

Automatic vocalization
- We hope to obtain a substantial vocalized corpus (the Aviv encyclopedia), which will enable:
  - Setting a baseline for automatic vocalization using a modern vocalized corpus
  - Improving the baseline through supervised learning


The End
