Mono & Cross Language Experiments on Persian Text
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Database Research Group
School of Electrical and Computer Engineering
University of Tehran
18 Sep 2008
Persian@CLEF 2008
Outline
• Persian Language
• Persian Test Collections
• Hamshahri in CLEF 2008
• UT Participants
  • Using Part of Speech Tagging in Persian Information Retrieval
  • Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
  • Local Cluster Analysis Using Part of Speech Tagging
  • Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
  • Cross Language Experiments at Persian@CLEF 2008
• Next Year
The Persian Language
• A branch of the Indo-European languages
• Official language of Iran, Afghanistan and Tajikistan
• Its morphological analysis is comparatively difficult
• The word “خبر” (khabar, "news") has two plural forms:
  • Persian rules: “خبرها”
  • Arabic rules: “اخبار”
Some Processing Issues
• Writing style issues:
  • e.g. “میشود” and “می شود” are the same
  • e.g. “کتابها” and “کتاب ها” are the same
• KASRE (a short vowel that is usually not written):
  • e.g. “چراغ علی خانه را سوزاند” has two different meanings:
    • CheraghAli burned the house
    • Ali's lantern burned the house
• Encoding
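The writing-style issue (one word written with or without a space) is usually handled by normalizing text before indexing. The following is a minimal sketch of that idea, not the preprocessing used for the runs in this talk. It covers only the two cases shown on the slide, the prefix “می” and the plural suffix “ها”; the function name and the decision to join the parts with no separator are assumptions made for illustration.

```python
import re

# Separators that may appear between a word and its affix:
# an ordinary space or the zero-width non-joiner (U+200C).
SEP = r"[ \u200c]+"

# Only the two affixes from the slide are handled (an assumption of this sketch).
PREFIX = "می"   # verbal prefix, as in the slide's first example
SUFFIX = "ها"   # plural suffix, as in the slide's second example

_prefix_re = re.compile(r"\b(" + PREFIX + r")" + SEP + r"(?=\w)")
_suffix_re = re.compile(r"(?<=\w)" + SEP + r"(" + SUFFIX + r")\b")


def normalize(text):
    """Attach a detached prefix or suffix to its word so variants match."""
    text = _prefix_re.sub(r"\1", text)   # prefix + separator + word -> joined
    text = _suffix_re.sub(r"\1", text)   # word + separator + suffix -> joined
    return text


# The spaced and unspaced spellings map to the same string.
assert normalize("می شود") == normalize("میشود")
assert normalize("کتاب ها") == normalize("کتابها")
```

A real normalizer would need a longer affix list and some care with words that happen to start or end with the same letters; this sketch only shows the mechanism.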
Persian in the Middle East
User Population Growth on the Web (2000-2009)
Source: Internet World Stats, http://internetworldstats.com/, December 31, 2007
Persian Test Collections
IR Domain
• Ghavanin (domain specific)
• Hamshahri (news), web: http://ece.ut.ac.ir/dbrg/hamshahri
NLP Domain
• Bijankhan (2 million words), web: http://ece.ut.ac.ir/dbrg/bijankhan
Hamshahri in CLEF 2008
• News articles from the Hamshahri newspaper, 1996 to 2002
• Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB)
• 22 assessors
• Evaluation based on the DIRECT system
Collection size: 564 MB (Unicode text)
No. of documents: 166,774
No. of unique terms: 417,339
Average length of documents: 380 terms
No. of categories: 9
No. of topics: 50 (bilingual)
Implementation of our methods
• We submitted the top 100 results for each run
Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
[Figure: system architecture. The Hamshahri corpus is POS-tagged, with the Bijankhan tagged collection of documents as training data, and then passed through simple stemming to give a stemmed and tagged corpus. User queries go through the same POS tagging and simple stemming to give stemmed and tagged queries. The query is refined using parts of speech with corresponding weights, and retrieval runs against the Hamshahri tagged document collection.]
Config. | Corpus | Query
1 | Tagged | Title with equal weighting for all POS tags
2 | Stemmed and tagged | Stemmed title with equal weighting for all POS tags
3 | Stemmed | Stemmed title without POS tagging
4 | Stemmed | Stemmed title plus description
5 | Stemmed (stop words removed) | Stemmed title plus description (stop words removed)
6 | Tagged | Title plus description with equal weighting for all POS tags
7 | Tagged | Title with various weighting schemes for different POS tags
8 | Normal | Title (neither stemmed nor tagged)
Weighting scheme | Average precision | R-Precision
20 less used tags omitted, others equal weight | 0.2745 | 0.3097
Noun=3, Verb=2, Adj=1, Adv=1 | 0.2635 | 0.3104
Noun=3, Verb=0, Adj=3, Adv=0 | 0.2597 | 0.2888
Noun=0, Verb=2, Adj=0, Adv=0 | 0.1108 | 0.1256
Noun=0, Verb=0, Adj=1, Adv=0 | 0.1198 | 0.1186
Noun=0, Verb=0, Adj=0, Adv=1 | 0.0977 | 0.1111
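The weighting schemes assign each query term a weight according to its part of speech. A minimal sketch of how such weights could enter a term-matching score is given below. The function names, the simple weighted term-frequency score, and the English placeholder terms are assumptions for illustration; the slides do not state the scoring function actually used.

```python
from collections import Counter

# One of the schemes from the table: Noun=3, Verb=2, Adj=1, Adv=1.
POS_WEIGHTS = {"Noun": 3, "Verb": 2, "Adj": 1, "Adv": 1}


def weight_query(tagged_query, pos_weights=POS_WEIGHTS):
    """Map each (term, tag) pair to a weight; tags with weight 0 are dropped."""
    weights = Counter()
    for term, tag in tagged_query:
        weights[term] += pos_weights.get(tag, 0)
    return {term: w for term, w in weights.items() if w > 0}


def score(doc_terms, query_weights):
    """Hypothetical score: POS weight times term frequency, summed over query terms."""
    tf = Counter(doc_terms)
    return sum(w * tf[term] for term, w in query_weights.items())


# Placeholder English terms stand in for stemmed, tagged Persian query terms.
query = [("persian", "Adj"), ("retrieval", "Noun"), ("improves", "Verb")]
doc = ["retrieval", "persian", "retrieval"]
print(score(doc, weight_query(query)))
```

Setting a tag's weight to 0, as in the verb-only or adjective-only schemes, simply removes terms with that tag from the query.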
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Weighting Model | Description
BB2 | Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
BM25 | The BM25 probabilistic model
DFR_BM25 | The DFR version of BM25
IFB2 | Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2 | Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2 | Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF | The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf
Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
Weighting Model | Average Precision | R-Precision
BB2 | 0.3854 | 0.4167
BM25 | 0.3562 | 0.4009
DFR_BM25 | 0.4006 | 0.4347
IFB2 | 0.4017 | 0.4328
In_expB2 | 0.3997 | 0.4329
In_expC2 | 0.4190 | 0.4461
InL2 | 0.3832 | 0.4200
PL2 | 0.4314 | 0.4548
TF_IDF | 0.3574 | 0.4017
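The result tables report average precision and R-precision. For reference, the standard per-topic definitions of both measures can be sketched as follows; the function names and the toy ranking are illustrative, and the ranked list is assumed to contain no duplicate documents. The figures in the tables would be these values averaged over all topics.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the topic."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(average_precision(ranked, relevant), r_precision(ranked, relevant))
```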
Results are merged with the OWA (ordered weighted averaging) operator and two other variations of this operator: IOWA and NOWA
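IOWA and NOWA are variations of the ordered weighted averaging (OWA) operator. As a point of reference, here is a minimal sketch of plain OWA fusion, not of the IOWA or NOWA variants used for the submitted runs. The function names, the per-run score dictionaries, the example weights, and the assumption that scores are already normalized to a comparable range are all illustrative.

```python
def owa(scores, weights):
    """Ordered weighted average: weights apply to the scores sorted in
    descending order, not to any particular source."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


def fuse(runs, weights):
    """Fuse several runs (dicts of doc id -> normalized score) into one ranking.
    A document missing from a run gets score 0.0 for that run (an assumption)."""
    docs = set().union(*runs)
    fused = {d: owa([run.get(d, 0.0) for run in runs], weights) for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))


runs = [
    {"d1": 0.9, "d2": 0.1},
    {"d1": 0.2, "d2": 0.8},
    {"d2": 0.7},
]
print(fuse(runs, [0.5, 0.3, 0.2]))
```

Because the weights attach to rank positions within the sorted scores, a weight vector that favors the first positions rewards documents that at least one system scores highly, while a flat vector reduces to a plain average.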
Post hoc Results
Retrieval Method | Toolkit | Average Precision | R-Precision | Dif (average precision points over the best single run)
TF_IDF with unstemmed single terms | Terrier | 0.3847 | 0.4122 |
PL2 with 4gram terms | Terrier | 0.3669 | 0.3939 |
Indri with stemmed terms | Lemur | 0.3955 | 0.4149 |
IOWA | | 0.4515 | 0.4708 | +5.6
NOWA | | 0.4522 | 0.4736 | +5.67
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri
[Figure: system architecture. Preprocessing: a POS tagger (MLE and TNT) is trained on the Bijankhan Collection and applied to the Hamshahri Clear Collection as test data, giving one Hamshahri Tagged Collection by MLE and one by TNT; content-less tags are removed and the useful tags are kept. Retrieval: the retrieval engine returns the retrieved results. Postprocessing: the retrieved results are clustered into a relevant cluster and an irrelevant cluster, and cluster analysis produces the reranked results.]
But the result was not good on the test set.
Cross Language Experiments at Persian@CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Run | tot-ret | rel-ret | MAP | Retrieval Model | Tool
Using Light Stemmer | 5161 | 1967 | 26.89 | Vector Space | Lucene
Without Stemmer | 5161 | 1991 | 27.08 | Vector Space | Lucene
3Grams | 5161 | 1901 | 26.07 | Language Modeling | Lemur
4Grams | 5161 | 1950 | 26.70 | Language Modeling | Lemur
5Grams | 5161 | 1983 | 27.13 | Language Modeling | Lemur
Term-Based | 5161 | 2035 | 28.14 | Language Modeling | Lemur
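The 3Grams, 4Grams and 5Grams runs index character n-grams instead of whole terms. A minimal sketch of character n-gram tokenization follows. Whether the original runs let n-grams span word boundaries or how they treated short words is not stated, so the choices here (n-grams within each word, short words kept whole) and the function name are assumptions.

```python
def char_ngrams(text, n):
    """Split text on whitespace and emit overlapping character n-grams per word.
    Words with at most n characters are kept whole (an assumption of this sketch)."""
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("retrieval", 4))
```

N-gram indexing is a common language-independent substitute for stemming, since inflected forms of a word share most of their n-grams.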
Query Translation
• Probabilistic Structured Queries (PSQ)
• Combinatorial Translation Probability (CTP)
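In probabilistic structured queries, the frequency of a source-language query term in a target-language document is estimated as the translation-probability-weighted sum of the frequencies of its candidate translations. The sketch below shows only that core estimate. The function name, the toy translation table, and the transliterated placeholder terms are assumptions, and the CTP method for estimating the probabilities is not covered.

```python
from collections import Counter


def psq_term_frequency(query_term, doc_terms, translation_probs):
    """Estimated tf of a source-language term in a target-language document:
    sum over candidate translations f of p(f | query_term) * tf(f, document)."""
    tf = Counter(doc_terms)
    candidates = translation_probs.get(query_term, {})
    return sum(p * tf[f] for f, p in candidates.items())


# Hypothetical translation table: English query term -> {Persian term: probability}.
translation_probs = {"book": {"ketab": 0.7, "daftar": 0.3}}
doc = ["ketab", "ketab", "daftar", "khabar"]
print(psq_term_frequency("book", doc, translation_probs))
```

The estimated frequencies can then be fed into any term-weighting model in place of observed term frequencies.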
Query Translation Results
[Figure: precision-recall curves for the query translation runs]
• All Meanings: MAP 6.73
• First Meaning: MAP 12.4
• PSQ_CTP+4Grams: MAP 14.46
Document Translation
• Using the Shiraz machine translation system from the Computing Research Laboratory (CRL) of New Mexico State University (NMSU)
• It took 10 days to translate 130,000+ documents from Persian to English
Document Translation & Hybrid Results
[Figure: precision-recall curves for the four approaches]
• Document Translation: MAP 12.88
• Monolingual: MAP 27.08
• Query Translation: MAP 14.46
• Hybrid: MAP 16.19
Ham2 for the Next Year
• Extended version of the Hamshahri collection
• 2 times larger (~1.5 GB)
<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>دوشنبه 11 دي 1385 - سال چهاردهم - شماره 4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[مديركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد: آيين نامه خريد كتاب اصلاح شد]]></TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[فارس: مدير كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت: آيين نام]]>
</TEXT>
</DOC>
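Documents in this format can be read with a standard XML parser. The following is a minimal sketch using Python's standard library. It assumes that a collection file is a plain sequence of DOC elements with no XML declaration (so a wrapper root can be added), and the sample string, its placeholder title and body, and the function names are illustrative rather than taken from the collection.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<DOC><DOCID>HAM2-851011-001</DOCID>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT><CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[sample title]]></TITLE>
<TEXT><image>/1385/851011/news/008505.jpg</image><![CDATA[sample body]]></TEXT></DOC>"""


def _body(text_el):
    """Collect the text of TEXT itself plus the text following each child
    element, which skips the image path stored inside <image>."""
    if text_el is None:
        return ""
    parts = [text_el.text or ""] + [child.tail or "" for child in text_el]
    return "".join(parts).strip()


def parse_docs(raw):
    """Parse a sequence of DOC elements into a list of dictionaries."""
    root = ET.fromstring("<ROOT>" + raw + "</ROOT>")  # wrapper root is an assumption
    docs = []
    for doc in root.iter("DOC"):
        docs.append({
            "docid": doc.findtext("DOCID", default="").strip(),
            "date": doc.findtext("DATE", default="").strip(),
            "categories": {c.get(XML_LANG): (c.text or "").strip()
                           for c in doc.findall("CAT")},
            "title": doc.findtext("TITLE", default="").strip(),
            "text": _body(doc.find("TEXT")),
        })
    return docs


docs = parse_docs(SAMPLE)
print(docs[0]["docid"], docs[0]["categories"]["en"])
```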
Questions?
Thanks for your attention
Database Research Group
http://ece.ut.ac.ir/dbrg