
04/21/23 1

STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS

Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić

Department of Informatics, University of Rijeka

{lnacinovic, smarti, ivoi}@inf.uniri.hr


Introduction

• Statistical language modelling estimates the regularities in natural languages: the probabilities of word sequences, which are usually derived from large collections of text material

• Employed in:
– Speech recognition
– Optical character recognition
– Handwriting recognition
– Machine translation
– Spelling correction
– ...


N-gram language models

• The most widely used LMs
– Based on the probability of a word given the preceding sequence of n-1 words
– Bigram models (2-grams)
  • determine the probability of a word given the previous word
– Trigram models (3-grams)
  • determine the probability of a word given the previous two words
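As a rough illustration of what a bigram or trigram model is built from, the sketch below extracts and counts n-grams from a token list. The helper name `ngrams` and the English toy sentence are assumptions made for this example only; the real corpus is Croatian weather text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy text standing in for the weather-domain corpus.
tokens = "sunny and warm in the north and warm in the south".split()

bigram_counts = Counter(ngrams(tokens, 2))   # counts of word pairs
trigram_counts = Counter(ngrams(tokens, 3))  # counts of word triples

print(bigram_counts[("and", "warm")])
print(trigram_counts[("warm", "in", "the")])
```

These counts are the raw material for every probability estimate and smoothing technique discussed on the following slides.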


Language model perplexity

• The most common metric for evaluating a language model is the probability that the model assigns to test data, or the measures derived from it:
– cross-entropy
– perplexity


Cross-entropy

• The cross-entropy of a model p(T) on data T:

$$H_p(T) = -\frac{1}{W_T} \log_2 p(T)$$

• W_T - the length of the text T measured in words
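A minimal sketch of the cross-entropy computation, assuming the model's probability of the whole text, p(T), is the product of the conditional probabilities it assigns to each word. The function name and the probability values are made up for illustration.

```python
import math


def cross_entropy(word_probs):
    """H_p(T) = -(1 / W_T) * log2 p(T), where p(T) is the product of word_probs."""
    log2_p_T = sum(math.log2(p) for p in word_probs)  # log2 of the product
    return -log2_p_T / len(word_probs)                # W_T = number of words


# Hypothetical probabilities assigned by some model to a 4-word test text.
probs = [0.5, 0.25, 0.125, 0.125]
H = cross_entropy(probs)
print(H)  # bits per word
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long test texts.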


Perplexity

• The reciprocal value of the (geometric) average probability assigned by the model to each word in the test set T

• The perplexity PP_p(T) of a model is related to cross-entropy by the equation

$$PP_p(T) = 2^{H_p(T)}$$

• Lower cross-entropies and perplexities are better
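The two descriptions of perplexity (reciprocal of the average per-word probability, and 2 raised to the cross-entropy) can be checked against each other. The sketch below assumes the average is the geometric mean; the function name and probability values are illustrative only.

```python
import math


def perplexity(word_probs):
    """Reciprocal of the geometric mean of the per-word probabilities."""
    mean_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    geometric_mean = 2 ** mean_log2
    return 1.0 / geometric_mean


# Hypothetical per-word probabilities on a 4-word test text.
probs = [0.5, 0.25, 0.125, 0.125]
H = -sum(math.log2(p) for p in probs) / len(probs)  # cross-entropy in bits
PP = perplexity(probs)

print(PP, 2 ** H)  # the two values agree up to rounding
```

A model that assigns every word probability 1/k has perplexity k, which is why perplexity is often read as an effective branching factor.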


Smoothing

• Data sparsity problem
– N-gram models are trained from a finite corpus
– some perfectly acceptable N-grams are missing: probability = 0

• Solution: smoothing techniques
– adjust the maximum likelihood estimate of probabilities to produce more accurate probabilities
– adjust low probabilities, such as zero probabilities, upward, and high probabilities downward


Smoothing techniques used in our research

• Additive smoothing

• Absolute discounting

• Witten-Bell technique

• Kneser-Ney technique


Additive smoothing

• One of the simplest types of smoothing
• We add a factor δ to every count, where 0 < δ ≤ 1
• Formula for additive smoothing:

$$p_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\delta + c(w_{i-n+1}^{i})}{\delta |V| + \sum_{w_i} c(w_{i-n+1}^{i})}$$

• V - the vocabulary (set of all words considered)
• c - the number of occurrences
• Values of the δ parameter used in our research: 0.1, 0.5 and 1
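A minimal sketch of additive smoothing for the bigram case (n = 2), assuming counts are held in a `Counter` keyed by word pairs. The function name and the English toy text are hypothetical.

```python
from collections import Counter


def additive_bigram_prob(word, prev, bigram_counts, vocab, delta=1.0):
    """p_add(word | prev) = (delta + c(prev word)) / (delta * |V| + sum_w c(prev w))."""
    context_total = sum(bigram_counts[(prev, w)] for w in vocab)
    return (delta + bigram_counts[(prev, word)]) / (delta * len(vocab) + context_total)


# Hypothetical toy text standing in for the corpus.
tokens = "rain in the north rain in the south sun in the west".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))

p_seen = additive_bigram_prob("north", "the", bigram_counts, vocab, delta=0.5)
p_unseen = additive_bigram_prob("rain", "the", bigram_counts, vocab, delta=0.5)
print(p_seen, p_unseen)  # the unseen bigram now has non-zero probability
```

Every word in the vocabulary receives the same extra δ, so with a large vocabulary and sparse counts a lot of probability mass moves to unseen events.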


Absolute discounting

• When there is little data for directly estimating an n-gram probability, useful information can be provided by the corresponding (n-1)-gram

• Absolute discounting - the higher-order distribution is created by subtracting a fixed discount D from each non-zero count:

$$p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$$

• Values of D used in our research: 0.3, 0.5, 1
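The sketch below shows absolute discounting for bigrams, with the maximum likelihood unigram model as the lower-order distribution. The slide does not say how the weight (1 - λ) is set; the code assumes the common choice of giving the lower-order model exactly the mass removed by discounting, D times the number of distinct followers of the context divided by the context count, so that the probabilities sum to one. All names and the toy text are hypothetical.

```python
from collections import Counter


def absolute_discount_bigram(word, prev, bigram_counts, unigram_counts, D=0.5):
    """Interpolated absolute discounting for bigrams over ML unigrams."""
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    context_total = sum(followers.values())
    p_unigram = unigram_counts[word] / sum(unigram_counts.values())
    if context_total == 0:
        return p_unigram  # unseen context: use the lower-order model alone
    discounted = max(followers.get(word, 0) - D, 0) / context_total
    # Assumed weight (1 - lambda): the mass freed by subtracting D from
    # each of the distinct followers of `prev`.
    backoff_weight = D * len(followers) / context_total
    return discounted + backoff_weight * p_unigram


tokens = "rain in the north rain in the south sun in the west".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

print(absolute_discount_bigram("north", "the", bigram_counts, unigram_counts))
print(absolute_discount_bigram("rain", "the", bigram_counts, unigram_counts))
```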


Witten-Bell technique

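• The number of different words in the corpus is used to help determine the probability of words that never occur in the corpus

• Example for bigram:

$$\sum_{i:\, c(w_x w_i) = 0} p(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}$$

• T(w_x) - the number of different word types observed after w_x
• N(w_x) - the number of bigram tokens that begin with w_x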
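A small sketch of the bigram Witten-Bell estimate: the total mass T/(N + T) is reserved for words never seen after the context. It assumes that this reserved mass is shared equally among the unseen words, which is one textbook formulation; toolkits typically interpolate with a lower-order model instead. Function and variable names are hypothetical.

```python
from collections import Counter


def witten_bell_bigram(word, prev, bigram_counts, vocab):
    """Witten-Bell bigram estimate with the unseen mass split uniformly."""
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    N = sum(followers.values())  # bigram tokens that begin with `prev`
    T = len(followers)           # distinct word types seen after `prev`
    Z = len(vocab) - T           # word types never seen after `prev`
    if N == 0:
        return 1.0 / len(vocab)  # unseen context: uniform (an assumption)
    if word in followers:
        return followers[word] / (N + T)
    return T / (Z * (N + T))


tokens = "rain in the north rain in the south sun in the west".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))

unseen = [w for w in vocab if bigram_counts[("the", w)] == 0]
unseen_mass = sum(witten_bell_bigram(w, "the", bigram_counts, vocab) for w in unseen)
print(unseen_mass)  # equals T / (N + T) for the context "the"
```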


Kneser-Ney technique

• An extension of absolute discounting

• The lower-order distribution that one combines with a higher-order distribution is built in a novel manner:
– the lower-order distribution is a significant factor in the combined model only when few or no counts are present in the higher-order distribution, so it is built to perform well in exactly those situations
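The slide does not give the formula, so the following is only a sketch of the standard interpolated Kneser-Ney estimate for bigrams as it is usually presented (for example by Chen and Goodman). It assumes the lower-order distribution is the continuation probability: the number of distinct words that a word has been seen to follow, divided by the number of distinct bigram types. The discount value and all names are illustrative.

```python
from collections import Counter


def continuation_prob(word, bigram_counts):
    """Share of distinct bigram types that end in `word`."""
    distinct_contexts = sum(1 for (_, w) in bigram_counts if w == word)
    return distinct_contexts / len(bigram_counts)


def kneser_ney_bigram(word, prev, bigram_counts, D=0.75):
    """Interpolated Kneser-Ney for bigrams with a single discount D."""
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    context_total = sum(followers.values())
    p_cont = continuation_prob(word, bigram_counts)
    if context_total == 0:
        return p_cont  # unseen context: lower-order distribution only
    discounted = max(followers.get(word, 0) - D, 0) / context_total
    backoff_weight = D * len(followers) / context_total
    return discounted + backoff_weight * p_cont


tokens = "rain in the north rain in the south sun in the west".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

# "the" is frequent but only ever follows "in", so its continuation
# probability is lower than that of "in", which follows two different words.
print(continuation_prob("the", bigram_counts), continuation_prob("in", bigram_counts))
```

The design point is that a word's lower-order probability depends on how many different contexts it appears in, not on how often it occurs.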


Smoothing implementation

• 2-gram, 3-gram and 4-gram language models were built

• Corpus: 290 480 words
– 2 398 1-grams
– 18 694 2-grams
– 23 021 3-grams
– 29 736 4-grams

• On each of these models four different smoothing techniques were applied


Corpus

• Major part developed from 2002 until 2005 and some parts added later

• Includes the vocabulary related to weather, bio and maritime forecast, river water levels and weather reports

• Divided into 10 parts
– 9/10 used for building language models
– 1/10 used for evaluating those models in terms of their estimated perplexities
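A possible way to produce such a 9/10 versus 1/10 split is sketched below. The slides do not say how the ten parts were formed, so the round-robin assignment of sentences, the function names and the placeholder sentences are all assumptions.

```python
def split_into_parts(sentences, n_parts=10):
    """Distribute sentences round-robin into n_parts roughly equal parts."""
    return [sentences[i::n_parts] for i in range(n_parts)]


def train_test_split(sentences, test_part=0, n_parts=10):
    """Hold out one part for evaluation and train on the remaining ones."""
    parts = split_into_parts(sentences, n_parts)
    test = parts[test_part]
    train = [s for i, part in enumerate(parts) if i != test_part for s in part]
    return train, test


# Placeholder sentences standing in for the corpus.
sentences = [f"sentence {i}" for i in range(50)]
train, test = train_test_split(sentences, test_part=9)
print(len(train), len(test))
```

Keeping the held-out part out of training is what makes the estimated perplexities a measure of how well the model generalises to unseen text.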


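Results: perplexities of the LMs

| Model  | Without smoothing | Additive δ=0.1 | Additive δ=0.5 | Additive δ=1 | Abs. disc. D=0.3 | Abs. disc. D=0.5 | Abs. disc. D=1 | Witten-Bell | Kneser-Ney |
|--------|-------------------|----------------|----------------|--------------|------------------|------------------|----------------|-------------|------------|
| 2-gram | 19.87             | 28.8           | 51.6           | 73.5         | 19.61            | 19.64            | 21.6           | 19.75       | 18.96      |
| 3-gram | 8.45              | 30.04          | 86.9           | 144.2        | 8.17             | 8.22             | 9.30           | 8.25        | 7.63       |
| 4-gram | 6.04              | 42.9           | 142.6          | 239.87       | 5.64             | 5.71             | 6.76           | 5.76        | 5.24       |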


Conclusion

• In this paper we described the process of language model building from the Croatian weather-domain corpus

• We built models of different order:
– 2-grams
– 3-grams
– 4-grams


• We applied four different smoothing techniques:
– additive smoothing
– absolute discounting
– Witten-Bell technique
– Kneser-Ney technique

• We estimated and compared perplexities of those models

• Kneser-Ney smoothing gives the best results: the lowest perplexity for every model order (18.96 for 2-grams, 7.63 for 3-grams and 5.24 for 4-grams)


Further work

• Prepare a more balanced corpus of Croatian text and thus build a more complete language model

• Other LMs
– Class-based

• Other smoothing techniques


References

• Chen, Stanley F.; Goodman, Joshua. An empirical study of smoothing techniques for language modelling. Cambridge, MA: Computer Science Group, Harvard University, 1998

• Chou, Wu; Juang, Biing-Hwang. Pattern recognition in speech and language processing. CRC Press, 2003

• Jelinek, Frederick. Statistical Methods for Speech Recognition. Cambridge, MA: The MIT Press, 1998

• Jurafsky, Daniel; Martin, James H. Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, New Jersey: Prentice Hall, 2000

• Manning, Christopher D.; Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999

• Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza hrvatskoga govora konteksno ovisnim skrivenim Markovljevim modelima, doktorska disertacija. Zagreb, FER, 2007

• Milharčič, Grega; Žibert, Janez; Mihelič, France. Statistical Language Modeling of SiBN Broadcast News Text Corpus.//Proceedings of 5th Slovenian and 1st international Language Technologies Conference 2006/Erjavec, T.; Žganec Gros, J. (ed.). Ljubljana, Jožef Stefan Institute, 2006

• Stolcke, Andreas. SRILM – An Extensible Language Modeling Toolkit.//Proceedings Intl. Conf. on Spoken Language Processing. Denver, 2002, vol.2, pp. 901-904


SRILM toolkit

• The models were built and evaluated using the SRILM toolkit

• http://www.speech.sri.com/projects/srilm/

• ngram-count -text TRAINDATA -lm LM

• ngram -lm LM -ppl TESTDATA
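The slide gives only the two basic commands. The sketch below assembles command lines for the different model orders and smoothing techniques without running them. The helper `build_commands` is hypothetical, and the smoothing flags are taken from the SRILM `ngram-count` manual as an assumption about how such experiments could be set up; the slides do not state which options the authors actually used.

```python
import shlex


def build_commands(order, smoothing_flags, train="TRAINDATA", test="TESTDATA", lm="LM"):
    """Return (training command, perplexity command) as argument lists."""
    count_cmd = ["ngram-count", "-order", str(order), "-text", train, "-lm", lm]
    count_cmd += smoothing_flags
    ppl_cmd = ["ngram", "-order", str(order), "-lm", lm, "-ppl", test]
    return count_cmd, ppl_cmd


# Assumed mapping from technique to ngram-count options (not from the slides).
smoothing_options = {
    "additive, delta=0.5": ["-addsmooth", "0.5"],
    "absolute discounting, D=0.5": ["-cdiscount", "0.5"],
    "Witten-Bell": ["-wbdiscount"],
    # -kndiscount selects modified Kneser-Ney; -ukndiscount the original form.
    "Kneser-Ney": ["-kndiscount"],
}

for name, flags in smoothing_options.items():
    count_cmd, ppl_cmd = build_commands(3, flags)
    print(name)
    print("  " + shlex.join(count_cmd))
    print("  " + shlex.join(ppl_cmd))
```

With SRILM installed, each list could be passed to `subprocess.run(cmd, check=True)` to train the model and report its perplexity on the held-out part.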


Language model

• Speech recognition – converting an acoustic signal into a sequence of words

• Language modelling provides a statistical model of the word sequences that can occur in speech

• The language model estimates the probability Pr(W) for all possible word strings W = (w_1, w_2, ..., w_i)


[Figure: System diagram of a generic speech recognizer based on statistical models]


• Bigram language models (2-grams)
– Central goal: to determine the probability of a word given the previous word

• Trigram language models (3-grams)
– Central goal: to determine the probability of a word given the previous two words

The simplest way to approximate this probability is to compute:

$$p_{ML}(w_i \mid w_{i-2} w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}$$

– This value is called the maximum likelihood (ML) estimate
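A minimal sketch of the maximum likelihood trigram estimate computed from raw counts. The function name and toy text are hypothetical. It also shows the data sparsity problem that motivates smoothing: a trigram that does not occur in the training text gets probability zero.

```python
from collections import Counter


def ml_trigram_prob(word, prev2, prev1, trigram_counts, bigram_counts):
    """p_ML(word | prev2 prev1) = c(prev2 prev1 word) / c(prev2 prev1)."""
    history_count = bigram_counts[(prev2, prev1)]
    if history_count == 0:
        return 0.0  # history never seen: the ML estimate is undefined, use 0
    return trigram_counts[(prev2, prev1, word)] / history_count


tokens = "rain in the north rain in the south sun in the west".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

print(ml_trigram_prob("north", "in", "the", trigram_counts, bigram_counts))
print(ml_trigram_prob("rain", "in", "the", trigram_counts, bigram_counts))
```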


• Linear interpolation - simple method for combining the information from lower-order n-gram models in estimating higher-order probabilities


• A general class of interpolated models is described by Jelinek and Mercer:

$$p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{interp}(w_i \mid w_{i-n+2}^{i-1})$$

• The nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood model and the (n-1)th-order smoothed model

• Given fixed p_ML, it is possible to search efficiently for the factors $\lambda_{w_{i-n+1}^{i-1}}$ that maximize the probability of some data using the Baum–Welch algorithm
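The recursion can be illustrated for the bigram case, where the smoothed bigram is a mixture of the ML bigram and the ML unigram. This sketch simplifies the model above in one important way: it uses a single fixed λ for all contexts instead of context-dependent factors trained with Baum-Welch. Names and the toy text are hypothetical.

```python
from collections import Counter


def interpolated_bigram(word, prev, bigram_counts, unigram_counts, lam=0.7):
    """lam * p_ML(word | prev) + (1 - lam) * p_ML(word), with one fixed lam."""
    p_unigram = unigram_counts[word] / sum(unigram_counts.values())
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if context_total == 0:
        return p_unigram  # unseen context: only the lower-order model applies
    p_ml = bigram_counts[(prev, word)] / context_total
    return lam * p_ml + (1 - lam) * p_unigram


tokens = "rain in the north rain in the south sun in the west".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

print(interpolated_bigram("north", "the", bigram_counts, unigram_counts))
print(interpolated_bigram("rain", "the", bigram_counts, unigram_counts))
```

In practice λ would be estimated on held-out data rather than fixed by hand.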


• In absolute discounting, instead of multiplying the higher-order maximum-likelihood distribution by a factor $\lambda_{w_{i-n+1}^{i-1}}$, the higher-order distribution is created by subtracting a fixed discount D from each non-zero count:

$$p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$$

• Values of D used in research: 0.3, 0.5, 1
