
Mono & Cross Language Experiments on Persian Text

Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

Database Research Group

School of Electrical and Computer Engineering

University of Tehran


18 Sep 2008

Persian@CLEF 2008

Outline

Persian Language

Persian Test Collections

Hamshahri in CLEF 2008

UT Participants

Using Part of Speech Tagging in Persian Information Retrieval

Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

Local Cluster Analysis Using Part of Speech Tagging

Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text

Cross Language Experiments at Persian@CLEF 2008

Next Year

The Persian Language

A branch of Indo-European Languages

Official Language of Iran, Afghanistan and Tajikistan

Its morphological analysis is comparatively difficult

The word “خبر” has two plural forms:

• Persian rules: “خبرها”

• Arabic rules: “اخبار”

Writing Style Issues:

e.g. “میشود” and “می شود” are the same

e.g. “کتابها” and “کتاب ها” are the same

KASRE:

e.g. “چراغ علی خانه را سوزاند” has two different meanings:

• CheraghAli burned the house

• Ali’s lantern burned the house

Some Processing Issues

Encoding
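The writing-style variants listed earlier, together with encoding differences, can be reduced with a small normalizer. The following is a minimal sketch under stated assumptions: it assumes that the encoding issue includes Arabic code points for yeh and kaf appearing in Persian text, and it uses a naive rule that glues the prefix “می” and the plural suffix “ها” to the neighbouring word. The function name `normalize` and both rules are illustrative and are not taken from the slides.

```python
import re

# Assumption: Arabic yeh (U+064A) and kaf (U+0643) should be mapped to
# their Persian counterparts (U+06CC, U+06A9).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})
ZWNJ = "\u200c"            # zero-width non-joiner
MI = "\u0645\u06cc"        # the verbal prefix "می"
HA = "\u0647\u0627"        # the plural suffix "ها"


def normalize(text):
    """Map Arabic letter variants to Persian and glue 'می' / 'ها' to their word."""
    text = text.translate(ARABIC_TO_PERSIAN)
    # "می" written as a separate token (space or ZWNJ) is joined to the next word.
    text = re.sub(rf"(?<!\S){MI}[ {ZWNJ}]+(?=\S)", MI, text)
    # "ها" written as a separate token is joined to the previous word.
    text = re.sub(rf"(?<=\S)[ {ZWNJ}]+{HA}(?!\S)", HA, text)
    return text
```

Applied to both the corpus and the queries, such a step makes “می شود” and “میشود” index to the same term. The rule is deliberately crude: a standalone “می” that is not a verbal prefix would also be glued, so a real system would need a more careful check.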

Persian in the Middle East

Source: Internet World Stats, http://internetworldstats.com/

December 31, 2007

User Population Growth on the Web (2000-2009)

Persian Test Collections

IR Domain

Ghavanin (domain specific)

Hamshahri (news) WEB: http://ece.ut.ac.ir/dbrg/hamshahri

NLP Domain

Bijankhan (2 Million Word) WEB: http://ece.ut.ac.ir/dbrg/bijankhan

News articles of Hamshahri newspaper from year 1996 to 2002

Size of the documents varies from short news (under 1 KB) to rather long articles (e.g. 140 KB)

22 assessors

Evaluation based on DIRECT System

Collection size: 564 MB (Unicode text)

No. of documents: 166,774

No. of unique terms: 417,339

Average length of documents: 380 terms

No. of categories: 9

No. of topics: 50 bilingual

Implementation of our methods

We submitted top 100 for each run

[Figure: processing pipeline. The Hamshahri corpus and the queries each pass through POS tagging (with the Bijankhan tagged collection of documents as train data) and simple stemming, giving a stemmed and tagged corpus and stemmed and tagged queries. The query is then refined by giving parts of speech their corresponding weights before retrieval.]

Using Part of Speech Tagging in Persian Information Retrieval

Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

Config. | Corpus | Query
1 | Tagged | Title with equal weighting for all POS tags
2 | Stemmed and tagged | Stemmed title with equal weighting for all POS tags
3 | Stemmed | Stemmed title without POS tagging
4 | Stemmed | Stemmed title plus description
5 | Stemmed (stop words removed) | Stemmed title plus description (stop words removed)
6 | Tagged | Title plus description with equal weighting for all POS tags
7 | Tagged | Title with various weighting schemes for different POS tags
8 | Normal | Title (neither stemmed nor tagged)


20 less used tags omitted, others equal weight

Noun=3, Verb=2

Noun=3, Verb=0

Adv=0

Noun=0, Verb=2

Noun=0, Verb=0

Noun=0, Verb=0

Average precision 0.2745 0.2635 0.2597 0.1108 0.1198 0.0977

R-Precision 0.3097 0.3104 0.2888 0.1256 0.1186 0.1111
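The weighting schemes above (for example Noun=3, Verb=2) can be read as per-tag multipliers on the query terms. Here is a minimal sketch of that reading, assuming a query that is already POS-tagged; the tag labels `"N"`, `"V"` and `"ADV"`, the function name `weight_query` and the default weight of 1 are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def weight_query(tagged_query, tag_weights, default_weight=1.0):
    """Turn (term, tag) pairs into {term: weight}, dropping zero-weight tags."""
    weighted = defaultdict(float)
    for term, tag in tagged_query:
        weight = tag_weights.get(tag, default_weight)
        if weight > 0:  # a weight of 0 removes the term from the query
            weighted[term] += weight
    return dict(weighted)


# Hypothetical tagged query and a scheme in the style "Noun=3, Verb=2, Adv=0".
query = [("tehran", "N"), ("visit", "V"), ("quickly", "ADV")]
scheme = {"N": 3, "V": 2, "ADV": 0}
weighted_query = weight_query(query, scheme)
```

The resulting term weights would then be passed to the retrieval engine as query-term boosts; how the engine consumes them depends on the toolkit.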


Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

Weighting Model | Description
BB2 | Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
BM25 | The BM25 probabilistic model
DFR_BM25 | The DFR version of BM25
IFB2 | Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2 | Inverse document frequency model for randomness, Laplace succession for first normalization, and Normalization 2 for term frequency normalization
PL2 | Poisson estimation for randomness, Laplace succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF | The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf
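Of the models listed, BM25 is compact enough to write out. The following is a minimal sketch of the textbook Okapi BM25 score for a single query term, not Terrier's exact implementation; the defaults `k1=1.2` and `b=0.75` are conventional values and an assumption here, and the function name is illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    # Robertson/Sparck Jones style idf; it can go negative for very common terms.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term-frequency saturation with document-length normalization.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Collection statistics from the Hamshahri slide: 166,774 documents,
# average length 380 terms. The document frequency of 1000 is made up.
score = bm25_term_score(tf=3, doc_len=380, avg_doc_len=380,
                        n_docs=166774, doc_freq=1000)
```

A document's score for a query is the sum of this value over the query terms it contains.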

Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/

Weighting Model | Average Precision | R-Precision
BB2 | 0.3854 | 0.4167
BM25 | 0.3562 | 0.4009
DFR_BM25 | 0.4006 | 0.4347
IFB2 | 0.4017 | 0.4328
In_expB2 | 0.3997 | 0.4329
In_expC2 | 0.4190 | 0.4461
InL2 | 0.3832 | 0.4200
PL2 | 0.4314 | 0.4548
TF_IDF | 0.3574 | 0.4017
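The two measures reported in these results can be computed per topic as follows. This is a minimal sketch of the standard definitions for a single topic; the function names and the toy ranking are illustrative, and the reported figures are averages over all topics.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


# Toy topic: two relevant documents, found at ranks 1 and 3.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c"}
ap = average_precision(ranking, relevant)   # (1/1 + 2/3) / 2
rp = r_precision(ranking, relevant)         # 1 relevant doc in the top 2
```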

And two other variations of the OWA (ordered weighted averaging) operator: IOWA and NOWA
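The base operator behind these variants is presumably ordered weighted averaging (OWA): the scores that the individual systems give a document are sorted in descending order and combined with position weights that sum to 1. The sketch below shows plain OWA fusion only; the weights, the use of 0 for documents missing from a run, and the assumption that scores are already normalized to a common range are all hypothetical, and the IOWA and NOWA variants are not implemented.

```python
def owa(scores, weights):
    """Ordered weighted average: weights apply to positions after sorting."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per score")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


def fuse(runs, weights):
    """Fuse several runs ({doc_id: normalized score}) into one ranked list."""
    docs = set().union(*runs)
    fused = {doc: owa([run.get(doc, 0.0) for run in runs], weights) for doc in docs}
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Three hypothetical runs over two documents.
runs = [{"d1": 0.9, "d2": 0.1}, {"d1": 0.2, "d2": 0.8}, {"d1": 0.5, "d2": 0.7}]
ranking = fuse(runs, weights=[0.5, 0.3, 0.2])
```

Because the weights attach to sorted positions rather than to particular systems, the choice of weight vector moves the operator between a max-like and a min-like combination.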

Retrieval Method | Toolkit | Average Precision | R-Precision | Dif
TF_IDF with unstemmed single terms | Terrier | 0.3847 | 0.4122 |
PL2 with 4gram terms | Terrier | 0.3669 | 0.3939 |
Indri with stemmed terms | Lemur | 0.3955 | 0.4149 |
IOWA | | 0.4515 | 0.4708 | +5.6
NOWA | | 0.4522 | 0.4736 | +5.67

Post hoc Results

Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text

Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri

[Figure: system architecture. A POS tagger (MLE and TNT) is trained on the Bijankhan collection and applied to the Hamshahri clear collection to produce the Hamshahri tagged collection. Content-less tags are removed so that only useful tags remain. The retrieval engine returns retrieved results, which are clustered into a relevant cluster and an irrelevant cluster; cluster analysis then yields the reranked results.]


But the result was not good on the test set

Cross Language Experiments at Persian@CLEF 2008

Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian

Run | tot-ret | rel-ret | MAP | Retrieval Model | Tool
Using Light Stemmer | 5161 | 1967 | 26.89 | Vector Space | Lucene
Without Stemmer | 5161 | 1991 | 27.08 | Vector Space | Lucene
3Grams | 5161 | 1901 | 26.07 | Language Modeling | Lemur
Term-Based | 5161 | 2035 | 28.14 | Language Modeling | Lemur
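The 3Grams run indexes character n-grams instead of whole terms. A minimal sketch of one way to produce them, assuming n-grams are taken within each whitespace-separated token and that tokens shorter than n are kept whole; the slides do not say how the n-grams were actually generated, and the function names are illustrative.

```python
def char_ngrams(token, n):
    """Overlapping character n-grams of one token; short tokens are kept whole."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_terms(text, n=3):
    """Index terms for a text: the n-grams of each whitespace-separated token."""
    terms = []
    for token in text.split():
        terms.extend(char_ngrams(token, n))
    return terms


trigrams = ngram_terms("persian text", n=3)
```

Indexing n-grams sidesteps stemming, since inflected forms of a word share most of their n-grams.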

Probabilistic Structured Queries (PSQ)

Combinatorial Translation Probability (CTP)
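As PSQ is commonly described, the term frequency of a source-language query term in a target-language document is estimated as the sum of the frequencies of its translations, each weighted by its translation probability. The sketch below illustrates only that step; the translation table and its probabilities are made up for the example, the function name is hypothetical, and the CTP method from the slides is not shown.

```python
def psq_term_frequency(translations, doc_term_freqs):
    """Expected frequency of a query term in a document, given translations.

    translations: {target_term: probability of translating to that term}
    doc_term_freqs: {target_term: frequency in the document}
    """
    return sum(prob * doc_term_freqs.get(term, 0)
               for term, prob in translations.items())


# Hypothetical table for the English query term "news", using the two
# plural-related forms mentioned earlier in the slides.
news_translations = {"خبر": 0.7, "اخبار": 0.3}
doc = {"خبر": 2, "اخبار": 1, "تهران": 4}
estimated_tf = psq_term_frequency(news_translations, doc)  # 0.7*2 + 0.3*1
```

The estimated frequency can then be used by a monolingual ranking function in place of an observed term frequency.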

Query Translation

Query Translation Results

[Figure: curves plotted against recall, one per run]

All Meanings; MAP 6.73

First Meaning; MAP 12.4

PSQ_CTP+4Grams; MAP 14.46

Document Translation

Using the Shiraz machine translation system from the Computing Research Laboratory (CRL) of New Mexico State University (NMSU)

Took 10 days to translate 130,000+ docs from Persian to English

Document Translation & Hybrid Results

[Figure: curves plotted against recall, one per run]

Document Translation; MAP 12.88

Monolingual; MAP 27.08

Query Translation; MAP 14.46

Hybrid; MAP 16.19
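A common way to read cross-language results is as a percentage of the monolingual baseline. The short calculation below does this for the MAP values reported for the four runs; only the variable and function names are introduced here.

```python
MONOLINGUAL_MAP = 27.08

cross_language_map = {
    "Document Translation": 12.88,
    "Query Translation": 14.46,
    "Hybrid": 16.19,
}


def percent_of_monolingual(map_value, baseline=MONOLINGUAL_MAP):
    """Cross-language MAP expressed as a percentage of the monolingual MAP."""
    return round(100 * map_value / baseline, 1)


relative = {run: percent_of_monolingual(value)
            for run, value in cross_language_map.items()}
# Document Translation: 47.6, Query Translation: 53.4, Hybrid: 59.8
```

On these figures the hybrid run recovers roughly 60% of monolingual effectiveness, ahead of either translation strategy on its own.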

Next Year

Ham2 for the Next Year

Extended Version of Hamshahri Collection

2 times larger (~1.5 GB)

<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>دوشنبه 11 دي 1385 - سال چهاردهم - شماره 4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[فارس: مدير كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت: آيين نام</TEXT></DOC><DOC>

Questions?

Thanks For Your Attention

Database Research Group

http://ece.ut.ac.ir/dbrg
