Language Model in Turkish IR Melih Kandemir F. Melih Özbekoğlu Can Şardan Ömer S. Uğurlu

Post on 17-Jan-2016

Language Model in Turkish IR

Melih Kandemir, F. Melih Özbekoğlu, Can Şardan, Ömer S. Uğurlu

Outline

• Indexing problem and proposed solution

• Previous Work

• System Architecture

• Language Modeling Concept

• Evaluation of the System

• Conclusion

Indexing Problem

• “A Language Modeling Approach to Information Retrieval”, Jay M. Ponte and W. Bruce Croft, 1998

• The indexing model is an important part of a probabilistic retrieval model

• Current indexing models have not led to improved retrieval results

Indexing Problem

• Failure because of unwarranted assumptions:

• 2-Poisson model
– “elite” documents

• N-Poisson model
– Mixture of more than 2 Poisson distributions

Proposed Solution

• Retrieval based on probabilistic language modeling

• A language model is a probability distribution that captures the statistical regularities of language generation

• A language model is inferred for each document

Proposed Solution

• Estimate probability of generating the query

• Documents are ranked according to these probabilities

• Users have a reasonable idea of the terms likely to occur in documents of interest

• tf and idf are integral parts of the language model

Previous Work

• Robertson–Sparck Jones model and Croft–Harper model
– They focus on relevance

• Fuhr integrated indexing and retrieval models
– Used statistics as heuristics

• Wong and Yao used utility theory and information theory

Previous Work

• Kalt’s approach is the most similar
– Maximum likelihood estimator is used
– Collection statistics are integral parts of the model
– Documents are members of language classes

System Overview

[Diagram. Components shown: USER, UI, Application Server hosting LM-Search (Indexer, Query Evaluator), Document Repository, and Index DB (PostgreSQL) accessed via JDBC. Output label: Different Resultsets.]

System Architecture

[Diagram: Document Repository → Stemming & Term Selection (No Stemming, First 5, Lemmatiser) → Inverted Index Generation (tf.idf, Language Model) → Index DB]

tf.idf vs. Language model

• Different result sets

• GUI for seeing differences between results

[Diagram labels: tf.idf, LM]

Vocabulary Extraction

• No stemmer
– Turkish is agglutinative
– Expectation: low precision

• First 5 characters
– As effective as more complex solutions

• Lemmatiser
– Expectation: high precision
– Zemberek2 (MPL license)
  • Open source software
  • Java interface, easy to use
  • Finds the stems of words
  • The first valid stem will be used
  • Word sense disambiguation (using WordNet or POS) may be added in the future

[Diagram: Stemming & Term Selection, with three options: No Stemming, First 5, Lemmatiser]
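The three term-selection options can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, `stem_lookup` is a stand-in callable for whatever lemmatiser is plugged in (it is not the Zemberek2 API), and the Turkish-aware lower-casing is an added assumption that the slides do not mention.

```python
def turkish_lower(word: str) -> str:
    """Lower-case a word, mapping I -> ı and İ -> i as Turkish requires."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def no_stemming(word: str) -> str:
    """Option 1: index the surface form as it is (only case-folded)."""
    return turkish_lower(word)


def first_n(word: str, n: int = 5) -> str:
    """Option 2: keep only the first n characters (n = 5 on the slide)."""
    return turkish_lower(word)[:n]


def lemmatise(word: str, stem_lookup) -> str:
    """Option 3: take the first valid stem returned by a lemmatiser.

    stem_lookup is a hypothetical callable: word -> list of candidate stems.
    If it returns nothing, fall back to the case-folded surface form.
    """
    folded = turkish_lower(word)
    candidates = stem_lookup(folded)
    return candidates[0] if candidates else folded
```

For example, `first_n("kitaplardan")` keeps `"kitap"`, which happens to be the stem here; for shorter or longer stems the truncation will under- or over-cut.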

Database

Index DB

Language Modeling : Inverted Index Implementation

An example inverted index for m terms:

t1   cft1   [d1, P(t1|Md1), tf(t1,d1)]  ...  [dn, P(t1|Mdn), tf(t1,dn)]
t2   cft2   [d3, P(t2|Md3), tf(t2,d3)]  ...  [dp, P(t2|Mdp), tf(t2,dp)]
...
tm   cftm   [d4, P(tm|Md4), tf(tm,d4)]  ...  [dk, P(tm|Mdk), tf(tm,dk)]

If a document does not contain a term, the term's probability can be calculated using cft

f̄t = mean term frequency of t in documents containing it

pavg = mean probability of t in documents containing it

cft = frequency of t in all documents
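A minimal in-memory sketch of such an index, assuming documents arrive as lists of already-normalised terms. The layout (a dict per term holding `cf` and a postings map) is an illustrative choice, not the schema of the PostgreSQL index, and the stored probability is just the raw ratio tf/dl standing in for P(t|Md).

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build term -> {"cf": int, "postings": {doc_id: (prob, tf)}}.

    docs: dict mapping doc_id -> list of terms.
    prob is tf / document length, a placeholder for P(t|Md).
    """
    index = defaultdict(lambda: {"cf": 0, "postings": {}})
    for doc_id, terms in docs.items():
        doc_len = len(terms)
        for term, tf in Counter(terms).items():
            entry = index[term]
            entry["cf"] += tf                      # collection frequency of the term
            entry["postings"][doc_id] = (tf / doc_len, tf)
    return dict(index)
```

Documents that do not contain a term simply have no posting for it, which is why the collection frequency `cf` is kept at the term level.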

The Baseline Approach : tf.idf

We will use the traditional tf.idf term weighting approach as the baseline model:

• Robertson’s tf score

• Standard idf score
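The two formulas on this slide are missing from the transcript. The sketch below assumes the forms used for the baseline in Ponte and Croft (1998), namely tf / (tf + 0.5 + 1.5 · dl / avgdl) for the tf score and log((N + 0.5) / df) / log(N + 1) for the idf score. Both formulas and all function names are assumptions, not recovered from the slide.

```python
import math


def tf_score(tf, doc_len, avg_doc_len):
    """Length-normalised tf score (assumed form, see lead-in)."""
    return tf / (tf + 0.5 + 1.5 * doc_len / avg_doc_len)


def idf_score(num_docs, df):
    """Normalised idf score (assumed form, see lead-in); df must be >= 1."""
    return math.log((num_docs + 0.5) / df) / math.log(num_docs + 1.0)


def tfidf_weight(tf, doc_len, avg_doc_len, num_docs, df):
    """Baseline weight of one term in one document."""
    return tf_score(tf, doc_len, avg_doc_len) * idf_score(num_docs, df)
```

With these forms the tf component saturates as tf grows, and the idf component shrinks as a term appears in more documents.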

Language Modeling : Definition

• An alternative approach to indexing and retrieval

• Definition of a language model: a probability distribution that captures the statistical regularities of the generation of language

• Intuition: users have a reasonable idea of terms that are likely to occur in documents of interest and will choose query terms that distinguish these documents from others in the collection

Language Modeling : The Approach

The following assumptions are not made:

• Term distributions in the documents are parametric

• Documents are members of pre-defined classes

• “Query generation probability” rather than “probability of relevance”

Language Modeling : The Approach

P(t | Md) : Probability that term t is generated by Md, the language model of document d

Language Modeling : Theory

Maximum likelihood estimate of the probability of term t under the term distribution for document d:

$$p_{ml}(t \mid M_d) = \frac{tf(t,d)}{dl_d}$$

tf(t,d) : raw frequency of term t in document d

dld : total number of terms in document d

Language Modeling : Theory

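An additional, more robust estimate from a larger amount of data:

$$p_{avg}(t) = \frac{\sum_{d \,:\, t \in d} p_{ml}(t \mid M_d)}{df_t}$$

pml(t | Md) : maximum likelihood estimate of term t in document d

pavg : mean probability of term t in documents containing it

dft : number of documents that contain term t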
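A small worked sketch of these two estimates, assuming each document is a non-empty list of terms. The function names are illustrative.

```python
from collections import Counter


def p_ml(term, doc_terms):
    """Maximum likelihood estimate: raw term frequency over document length."""
    return Counter(doc_terms)[term] / len(doc_terms)


def p_avg(term, docs):
    """Mean of p_ml over the documents that contain the term.

    docs: list of term lists. Returns 0.0 if no document contains the term.
    """
    containing = [d for d in docs if term in d]   # len(containing) is df_t
    if not containing:
        return 0.0
    return sum(p_ml(term, d) for d in containing) / len(containing)
```

For the three toy documents `["a", "b"]`, `["a", "a", "b", "c"]` and `["c"]`, the term `"c"` has maximum likelihood estimates 0.25 and 1.0 in the two documents containing it, so its mean probability is 0.625.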

Language Modeling : Theory

The risk function:

f̄t : mean term frequency of term t in documents that contain it
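The formula itself is missing from the transcript. In Ponte and Croft (1998), which this talk follows, the risk for term t in document d is modelled with a geometric distribution: R = (1 / (1 + f̄t)) × (f̄t / (1 + f̄t))^tf(t,d). The sketch below assumes that form; the function name is illustrative.

```python
def risk(tf, mean_tf):
    """Geometric-distribution risk for a term (assumed form, see lead-in).

    tf: raw frequency of the term in the document.
    mean_tf: mean frequency of the term in documents that contain it.
    """
    ratio = mean_tf / (1.0 + mean_tf)
    return (1.0 / (1.0 + mean_tf)) * ratio ** tf
```

The value lies between 0 and 1 and is later used as a mixing weight between the document-level estimate and the collection-wide mean.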

Language Modeling : The Ranking Formula

The probability of term t being produced by document d, given the document model Md:

The probability of producing the query Q for a given document model Md:
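Both formulas are missing from the transcript. The sketch below assumes the forms given in Ponte and Croft (1998). For a term that occurs in the document, the estimate is p_ml^(1 − R) × p_avg^R. For a term that does not occur, it is cft / cs, where cs is the total number of term occurrences in the collection. The query probability is the product of these estimates over query terms, times the product of (1 − estimate) over all other vocabulary terms. All names are illustrative, documents are assumed non-empty, and the clamping constant `EPS` is a practical guard added here, not part of the model.

```python
import math
from collections import Counter

EPS = 1e-12  # guard against log(0); an addition for this sketch only


def build_stats(docs):
    """docs: dict doc_id -> non-empty list of terms."""
    tf = {d: Counter(terms) for d, terms in docs.items()}
    dl = {d: len(terms) for d, terms in docs.items()}
    cf = Counter()
    for counts in tf.values():
        cf.update(counts)
    cs = sum(dl.values())  # total term occurrences in the collection
    p_avg = {}
    for t in cf:
        containing = [d for d in docs if tf[d][t] > 0]
        p_avg[t] = sum(tf[d][t] / dl[d] for d in containing) / len(containing)
    return tf, dl, cf, cs, p_avg


def p_term(t, d, stats):
    """Estimated probability of term t under the model of document d."""
    tf, dl, cf, cs, p_avg = stats
    if tf[d][t] == 0:
        return cf[t] / cs                      # back off to collection statistics
    p_ml = tf[d][t] / dl[d]
    mean_tf = p_avg[t] * dl[d]                 # mean frequency scaled to this document length
    r = (1.0 / (1.0 + mean_tf)) * (mean_tf / (1.0 + mean_tf)) ** tf[d][t]
    return p_ml ** (1.0 - r) * p_avg[t] ** r


def log_p_query(query, d, stats):
    """Log-probability of generating the query from the model of document d."""
    cf = stats[2]
    query_terms = set(query)
    score = 0.0
    for t in cf:                               # out-of-vocabulary query terms are ignored
        p = min(max(p_term(t, d, stats), EPS), 1.0 - EPS)
        score += math.log(p) if t in query_terms else math.log(1.0 - p)
    return score


def rank(query, docs):
    """Return doc ids sorted from most to least likely to generate the query."""
    stats = build_stats(docs)
    return sorted(docs, key=lambda d: log_p_query(query, d, stats), reverse=True)
```

Working in log space avoids underflow when the vocabulary is large, since the product runs over every term in the collection and not only over the query terms.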

Language Modeling : Inverted Index Implementation

An example inverted index for m terms:

t1   cft1   [d1, P(t1|Md1)]  ...  [dn, P(t1|Mdn)]
t2   cft2   [d3, P(t2|Md3)]  ...  [dp, P(t2|Mdp)]
...
tm   cftm   [d4, P(tm|Md4)]  ...  [dk, P(tm|Mdk)]

Evaluation

• Perform recall/precision experiments
– Recall/precision results
– Non-interpolated average precision
– Precision figures for the top N documents
  • For several values of N
– R-Precision
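The listed measures can be sketched as follows, assuming a ranked list of document ids and a set of ids judged relevant for the query. The function names are illustrative.

```python
def precision_at(ranked, relevant, n):
    """Fraction of the top n ranked documents that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Non-interpolated average precision.

    Averages the precision observed at the rank of each relevant document;
    relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

For a ranking `["a", "b", "c", "d"]` with relevant set `{"a", "c"}`, precision at 2 is 0.5 and average precision is (1/1 + 2/3) / 2.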

Other Metrics

• Compare the baseline (tf.idf) results to our language model
– Percent change between the two result sets
– I / D
  • I : number of queries whose performance improved
  • D : number of queries whose performance changed
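A short sketch of both comparisons, assuming one score per query (for example average precision) for each system, in the same query order. The function names are illustrative.

```python
def percent_change(baseline, new):
    """Relative change of the new score over the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


def improved_over_changed(baseline_scores, new_scores):
    """Return (I, D): queries that improved, and queries that changed at all."""
    pairs = list(zip(baseline_scores, new_scores))
    improved = sum(1 for b, n in pairs if n > b)
    changed = sum(1 for b, n in pairs if n != b)
    return improved, changed
```

Reporting I / D alongside the percent change shows whether a gain is spread across many queries or driven by a few.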

Document Repository

• Milliyet (2001-2005)

• XML file (1.1 GB)

• 408,304 news articles

• Ready for indexing

• XML Schema ......(FIXME)

Document Source

Summary

• Indexing and stemming
– Zemberek2 lemmatiser
– Java environment

• Data
– News archive from 2001 to 2005, from Milliyet

• Evaluation
– Methods will be compared according to performance over recall/precision values

Conclusion

• First language modelling approach to Turkish IR

• The LM approach
– Non-parametric
– Fewer assumptions
– Relaxed

• Expected to perform better than the baseline tf.idf method

Thanks for listening…

Any Questions?