
Page 1: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Capturing Word-level Dependencies in Morpheme-based Language Modeling

Martha Yifiru Tachbelie and Wolfgang Menzel

University of Hamburg, Department of Informatics, Natural Language Systems Group

Page 2: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Outline

Language Modeling
Morphology of Amharic
Language Modeling for Amharic
− Capturing word-level dependencies
Language Modeling Experiment
− Word segmentation
− Factored data preparation
− The language models
Speech recognition experiment
− The baseline speech recognition system
− Lattice re-scoring experiment
Conclusion and future work

Page 3: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Language Modeling

Language models are fundamental to many natural language processing applications

Statistical language models – the most widely used ones
− provide an estimate of the probability of a word sequence W for a given task (a minimal sketch follows below)
− require large training data
   ➢ data sparseness problem
   ➢ OOV problem
   ➢ both are serious for morphologically rich languages

Languages with rich morphology:
− high vocabulary growth rate => high perplexity and a large number of OOV words
➢ Sub-word units are used in language modeling
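To make the word-sequence probability estimate and the sparseness issue concrete, here is a minimal Python sketch (unsmoothed MLE bigrams; illustrative only, not the toolkit used in this work):

```python
# A minimal, unsmoothed bigram sketch. The <s>/</s> markers and the MLE
# estimate are generic conventions, not the setup used in this work.
from collections import Counter

def bigrams(tokens):
    """Yield adjacent token pairs."""
    return zip(tokens, tokens[1:])

def train_bigram(sentences):
    """Collect history (unigram) and bigram counts from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks[:-1])          # count each token as a history
        bi.update(bigrams(toks))
    return uni, bi

def sequence_prob(sentence, uni, bi):
    """P(W) = prod_t P(w_t | w_{t-1}) with MLE estimates.
    Any unseen bigram (or OOV history) yields probability 0 -- exactly
    the sparseness/OOV problem the slide points out."""
    toks = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for w1, w2 in bigrams(toks):
        p *= bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0
    return p
```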

Page 4: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Morphology of Amharic

Amharic is one of the morphologically rich languages
− spoken mainly in Ethiopia
− the second most widely spoken Semitic language

Exhibits root-pattern, non-concatenative morphology
− e.g. the root sbr

Uses different affixes to create inflectional and derivational word forms
➢ Data sparseness and OOV are serious problems
➔ Sub-word based language modeling has been recommended (Solomon, 2006)

Page 5: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Language Modeling for Amharic

Sub-word based language models have been developed

A substantial reduction in the OOV rate has been obtained

Morphemes have been used as the modeling unit
− loss of word-level dependencies

Possible solutions:
− higher order n-grams => increased model complexity
➢ factored language modeling

Page 6: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Capturing Word-level Dependencies

In an FLM, a word is viewed as a bundle (vector) of K parallel features or factors
− factors are linguistic features
− some of the factors can define the word
➢ the probability can be calculated on the basis of these factors

In Amharic, roots carry the lexical meaning of a word
➢ root-based models to capture word-level dependencies

W_n ≡ { f_n^1, f_n^2, ..., f_n^K }
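A small sketch of this factor-bundle view; the class is a hypothetical illustration, and only the factor names come from the slides:

```python
# Illustrative sketch of the factor-bundle view W_n = {f_n^1, ..., f_n^K}.
# Factor names mirror the slides (word, POS, prefix, root, pattern, suffix);
# the class itself is hypothetical, not from the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FactoredWord:
    word: str
    pos: Optional[str] = None
    prefix: Optional[str] = None
    root: Optional[str] = None
    pattern: Optional[str] = None
    suffix: Optional[str] = None

# A root-based model conditions on the 'root' factor of the preceding
# words, which is how word-level dependencies are retained even when
# the surface unit is a morpheme.
```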

Page 7: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Morphological Analysis

A morphological analyser is needed
− previous attempts exist (Bayou, 2000; Bayu, 2002; Amsalu and Gibbon, 2005)
− they suffer from lack of data and cannot be used for our purpose

Unsupervised morphology learning tools
➢ not applicable for this study

Manual segmentation
− 72,428 word types found in a corpus of 21,338 sentences have been segmented
   − polysemous or homonymous
   − geminated or non-geminated

Page 8: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Factored Data Preparation

Each word is considered as a bundle of features: word, POS, prefix, root, pattern and suffix

− format: W-word:POS-noun:PR-prefix:R-root:PA-pattern:SU-suffix (see the sketch below)

− a given tag-value pair may be missing; the tag then takes the special value 'null'

When roots are considered in language modeling:

− words not derived from roots will be excluded
➢ stems of these words are considered instead
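A hedged sketch of this conversion; the helper function and the example word are hypothetical, only the field layout follows the slide:

```python
# Each word becomes "W-<word>:POS-<pos>:PR-<prefix>:R-<root>:PA-<pattern>:SU-<suffix>",
# with 'null' standing in for any missing factor.
FACTOR_TAGS = ["POS", "PR", "R", "PA", "SU"]

def to_factored(word, pos=None, prefix=None, root=None, pattern=None, suffix=None):
    values = dict(zip(FACTOR_TAGS, [pos, prefix, root, pattern, suffix]))
    parts = ["W-" + word] + [f"{tag}-{values[tag] or 'null'}" for tag in FACTOR_TAGS]
    return ":".join(parts)

# Example (hypothetical transliteration):
# to_factored("sebere", pos="verb", root="sbr")
#   -> "W-sebere:POS-verb:PR-null:R-sbr:PA-null:SU-null"
```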

Page 9: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Factored Data Preparation -- cont.

[Flow diagram: text corpus + manually segmented word list → factored representation → factored data]

Page 10: Capturing Word-level Dependencies in Morpheme-based Language Modeling

The Language Models

The corpus is divided into training, development and evaluation test sets (80:10:10)

● Root-based models of order 2 to 5 have been developed
● smoothed with the Kneser-Ney smoothing technique
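An illustrative training sketch for such models; NLTK is used here only as a stand-in (the slides do not name the toolkit actually used), and train_roots is an assumed variable holding the root sequences:

```python
# Kneser-Ney smoothed n-gram models of order 2 to 5 over root sequences.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

def train_root_lm(train_roots, order):
    """train_roots: list of sentences, each a list of root strings."""
    train_ngrams, vocab = padded_everygram_pipeline(order, train_roots)
    lm = KneserNeyInterpolated(order)
    lm.fit(train_ngrams, vocab)
    return lm

# One model per order, as in the experiment:
# models = {n: train_root_lm(train_roots, n) for n in range(2, 6)}
```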

Page 11: Capturing Word-level Dependencies in Morpheme-based Language Modeling

The Language Models -- cont.

Perplexity of root-based models on development test set

The largest improvement is from bigram to trigram
Only 295 OOV words
The best model has (the sketch below relates logprob and perplexity):
− a logprob of -53102.3
− a perplexity of 204.95 on the evaluation test set

Root n-gram    Perplexity
Bigram         278.57
Trigram        223.26
Quadrogram     213.14
Pentagram      211.93
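How the reported logprob and perplexity figures relate, assuming the common convention PPL = 10^(-logprob / N) with a base-10 logprob and N scored tokens (e.g. SRILM's convention); the token count is not reported on the slide and is only implied here:

```python
import math

logprob = -53102.3      # best root-based model, evaluation test set
perplexity = 204.95

n_tokens = -logprob / math.log10(perplexity)
print(round(n_tokens))                  # roughly 23,000 scored tokens

print(10 ** (-logprob / n_tokens))      # recovers ~204.95
```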

Page 12: Capturing Word-level Dependencies in Morpheme-based Language Modeling

The Language Models -- cont.

Word-based models
− trained on the same training data
− smoothed with Kneser-Ney smoothing

The largest improvement is from bigram to trigram
2,672 OOV words
The best model has a logprob of -61106.0

Word n-gram    Perplexity
Bigram         1148.76
Trigram        989.95
Quadrogram     975.41
Pentagram      972.58

Page 13: Capturing Word-level Dependencies in Morpheme-based Language Modeling

The Language Models -- cont.

Word-based models that use an additional feature in the n-gram history have also been developed
− e.g. W/W2,R2,W1,R1 estimates P(w_t | w_t-2, r_t-2, w_t-1, r_t-1)

Root-based models seem better than all the others, but might be less constraining
➢ Speech recognition experiment – lattice rescoring

Language model       Perplexity
W/W2,POS2,W1,POS1    885.81
W/W2,PR2,W1,PR1      857.61
W/W2,R2,W1,R1        896.59
W/W2,PA2,W1,PA1      958.31
W/W2,SU2,W1,SU1      898.89

Page 14: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment

The baseline speech recognition system (Abate, 2006)

Acoustic model:
− trained on a 20-hour read speech corpus
− a set of intra-word triphone HMMs with 3 emitting states and 12 Gaussian mixtures

The language model:
− trained on a corpus of 77,844 sentences (868,929 tokens, 108,523 types)
− a closed-vocabulary backoff bigram model
− smoothed with the absolute discounting method
− perplexity of 91.28 on a test set of 727 sentences (8,337 tokens)

Page 15: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

Performance:
− the 5k development test set (360 sentences read by 20 speakers) has been used to generate the lattices
− lattices have been generated from the 100 best alternatives for each sentence
− the best path transcription has been decoded (see the rescoring sketch below)
➢ 91.67% word recognition accuracy
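A minimal sketch of this kind of n-best / lattice rescoring: each hypothesis keeps its acoustic score, the language-model score is recomputed with the new model, and the best-scoring path is selected. The weighting and score scales below are assumptions, not the exact setup of the experiment:

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=10.0):
    """hypotheses: iterable of (word_sequence, acoustic_logprob) pairs.
    lm_score: function mapping a word sequence to its log-probability
    under the new (factored or root-based) language model."""
    best_words, best_total = None, float("-inf")
    for words, acoustic in hypotheses:
        total = acoustic + lm_weight * lm_score(words)
        if total > best_total:
            best_words, best_total = words, total
    return best_words

# The rescored best-path transcriptions are then scored against the
# reference transcriptions to obtain word recognition accuracy (WRA).
```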

Page 16: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

To make the results comparable
− root-based and factored language models have been developed on the corpus used in the baseline system

[Flow diagram: corpus used in the baseline system → factored version → root-based and factored language models]

Page 17: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

Perplexity of root-based models trained on the corpus used in the baseline speech recognition system

Root n-gram    Perplexity   Logprob
Bigram         113.57       -18628.9
Trigram        24.63        -12611.8
Quadrogram     11.20        -9510.29
Pentagram      8.72         -8525.42

Page 18: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

Perplexity of factored models

Language model       Perplexity   Logprob
W/W2,POS2,W1,POS1    10.61        -9298.57
W/W2,PR2,W1,PR1      10.67        -9322.02
W/W2,R2,W1,R1        10.36        -9204.7
W/W2,PA2,W1,PA1      10.89        -9401.08
W/W2,SU2,W1,SU1      10.70        -9330.96

Page 19: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

Word lattice to factored lattice

[Flow diagram: word lattice → factored version → factored lattice; rescored with the factored word bigram model (FBL); best path transcription decoded]

➢ 91.60% word recognition accuracy

Page 20: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

WRA with factored models

Language model                 WRA in %
Factored word bigram (FBL)     91.60
FBL + W/W2,POS2,W1,POS1        93.60
FBL + W/W2,PR2,W1,PR1          93.82
FBL + W/W2,R2,W1,R1            93.65
FBL + W/W2,PA2,W1,PA1          93.68
FBL + W/W2,SU2,W1,SU1          93.53

Page 21: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Speech Recognition Experiment -- cont.

WRA with root-based language models

Language model                 WRA in %
Factored word bigram (FBL)     91.60
FBL + Bigram                   90.77
FBL + Trigram                  90.87
FBL + Quadrogram               90.99
FBL + Pentagram                91.14

Page 22: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Conclusion and Future Work

Root-based models have low perplexity and high logprob
− but they did not improve word recognition accuracy

Future work:
− improve these models by adding other word features while still maintaining word-level dependencies
− explore other ways of integrating the root-based models into a speech recognition system

Page 23: Capturing Word-level Dependencies in Morpheme-based Language Modeling

Thank you