AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil - English

Pattabhi R.K Rao and Sobha L

AU-KBC Research Centre, MIT Campus, Chennai

FIRE 2008 – Tamil – English CLIR

• Problem Definition
  – Ad-hoc cross-lingual document retrieval task of FIRE
  – The task is to retrieve relevant documents in English for a given Indian language query
  – We worked on a Tamil – English cross-lingual information retrieval system

Our Approach

• The main components in our CLIR system are
  – Query Language Analyser
  – Named Entity Recognizer
  – Query Translation engine
  – Query Expansion
  – Ranking

Query Language Analyser – Tamil Morphological Analyser

• The morphological analyser analyses each word to give the morphs of the word
• E.g.: patiwwAnY -> pati(V) + ww(Past) + AnY(3SM)
• For nouns, the inflections mark the case, such as dative or accusative
• For verbs, the inflections carry information on person, number, gender, tense, aspect and modality
• Uses a paradigm-based approach
• Implemented as a Finite State Machine
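
The segmentation in the example above can be illustrated with a minimal sketch. The suffix tables below are toy assumptions that cover only the slide's example word; the names ROOTS, TENSE_SUFFIXES, PNG_SUFFIXES, analyse and format_analysis are hypothetical, and the actual analyser is a paradigm-based finite state machine whose data is not shown in these slides.

```python
# Toy suffix tables; the entries are illustrative assumptions, not the
# analyser's actual paradigm data.
ROOTS = {"pati": "V"}             # verb root from the slide's example
TENSE_SUFFIXES = {"ww": "Past"}   # tense marker
PNG_SUFFIXES = {"AnY": "3SM"}     # person-number-gender marker


def analyse(word):
    """Return a list of (morph, tag) pairs, or None if no analysis is found."""
    for root, pos in ROOTS.items():
        if not word.startswith(root):
            continue
        rest = word[len(root):]
        for tense, tense_tag in TENSE_SUFFIXES.items():
            if not rest.startswith(tense):
                continue
            ending = rest[len(tense):]
            if ending in PNG_SUFFIXES:
                return [(root, pos), (tense, tense_tag),
                        (ending, PNG_SUFFIXES[ending])]
    return None


def format_analysis(morphs):
    """Render the analysis in the notation used on the slide."""
    return " + ".join(f"{morph}({tag})" for morph, tag in morphs)


if __name__ == "__main__":
    print(format_analysis(analyse("patiwwAnY")))
```

Each nested loop plays the role of one state transition (root, then tense, then agreement marker); a word that fails at any step gets no analysis.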

Named Entity Recognizer (NER)

• Generic engine that uses Conditional Random Fields (CRFs)
• Trained on a 100,000-word corpus from various domains
• Uses a hierarchical tagset
• Performs with 80% recall and 89% precision

Query Translation

• Uses a bilingual dictionary based approach
• The Tamil – English bilingual dictionary is of size 150K
• For named entities that require transliteration, a transliteration engine is used
• Tamil to English transliteration is a tough task
  – Tamil has few consonants
• Transliteration is done using a statistical system based on an n-gram approach
• The statistical system works with an accuracy of 81%

Query Expansion

• The query terms are expanded using
  – Thesaurus
  – Ontology
• Query expansion is done at two places
  – Before query translation
  – After query translation
• Synonyms are obtained using WordNet

Query Expansion (2)

• Ontology is used to obtain more world knowledge

[Figure: ontology tree]
• Festivals
  – Hindu: Holi, Diwali, Dussera
  – Muslim: Ramazan
  – Christian: Christmas

What is there in the Ontology

• Descriptions about the entity
  – Ex: Holi - Festival of colours, Good over Evil
  – Deepavali - Festival of Lights, crackers, etc.
• We have an ontology of this type for 100 entities
  – Festivals, Sports, Country, Natural Calamities, Person Names, etc.
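
How such entity descriptions could feed query expansion can be sketched as follows. This is only an illustration under stated assumptions: the ONTOLOGY dictionary layout and the expand_query function are hypothetical, the single entry is taken from the Holi example on the slide, and the real ontology format is not described in these slides.

```python
# Hypothetical ontology fragment built from the Holi example on the slide.
ONTOLOGY = {
    "Holi": {
        "category": "Festivals",
        "descriptors": ["festival of colours", "good over evil"],
    },
}


def expand_query(terms, ontology=ONTOLOGY):
    """Append the category and descriptors of every term found in the ontology.

    Original terms keep their order and come first; duplicates are skipped.
    """
    expanded = list(terms)
    for term in terms:
        entry = ontology.get(term)
        if entry is None:
            continue  # no world knowledge for this term
        for extra in [entry["category"]] + entry["descriptors"]:
            if extra not in expanded:
                expanded.append(extra)
    return expanded


if __name__ == "__main__":
    print(expand_query(["Holi", "celebrations"]))
```

Keeping the original terms first makes it easy to tell user-given terms from expansion terms later, for example when different weights are applied at ranking time.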

Ranking

• The standard Okapi (BM25) ranking algorithm is used, customized to suit our needs
• A parameter called the boost factor is added to the standard algorithm for calculating the score
• The NEs in the query are given a boost factor of 1.5, and the original query terms are given a boost factor of 1.25
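
One way a boost factor could be combined with BM25 is sketched below. The slides give only the boost values (1.5 and 1.25); everything else here is an assumption: that the boost multiplies each term's BM25 contribution, that expansion terms get a boost of 1.0, the k1 and b defaults, the non-negative IDF variant, and the function name bm25_boosted.

```python
import math
from collections import Counter

NE_BOOST = 1.5          # from the slides: named entities in the query
ORIGINAL_BOOST = 1.25   # from the slides: original (user-given) query terms
DEFAULT_BOOST = 1.0     # assumption: expansion terms keep the plain BM25 weight


def bm25_boosted(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query with per-term boosts.

    query_terms: list of (term, kind) pairs, where kind is "ne", "original"
                 or "expanded".
    doc_tokens:  list of tokens of the document being scored.
    corpus:      list of token lists, used for document frequency and
                 average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(doc_tokens)
    boosts = {"ne": NE_BOOST, "original": ORIGINAL_BOOST}
    score = 0.0
    for term, kind in query_terms:
        df = sum(1 for doc in corpus if term in doc)
        if df == 0 or tf[term] == 0:
            continue  # term contributes nothing to this document
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += boosts.get(kind, DEFAULT_BOOST) * idf * tf[term] * (k1 + 1) / norm
    return score


if __name__ == "__main__":
    docs = [["tsunami", "relief", "fund"], ["iraq", "war"], ["tsunami", "waves"]]
    query = [("tsunami", "ne"), ("relief", "original"), ("fund", "expanded")]
    print(bm25_boosted(query, docs[0], docs))
```

With a multiplicative boost, a matching named entity contributes 1.5 times what the same term would contribute as an expansion term, and an original query term contributes 1.25 times as much.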

Ranking (2)

• The boost factor parameter sets the weightage for particular terms in the query
• NEs get more weightage than other terms: they are given 0.5 times more weightage
• Original query terms are given 0.25 times more weightage, to retain the importance of the user-given query terms

Experiments – Results (1)

• We have submitted two runs
• For query 29, “assistance after Tsunami”, on expanding the query for the terms “assistance” and “Tsunami”, we obtain “financial assistance, relief material, manpower help, rebuilding infrastructure, government assistance, non-governmental organizations assistance, relief fund, natural calamity, Tsunami, high sea waves”
• This expansion of the query has helped in increasing the recall; the MAP score for this query is 0.46
• For query IDs 27 and 59, the system did not perform well

Experiments – Results (2)

• Query 27, “Sino Indian relationship”, is too broad, and the query expansion is not done well because the ontology lacks the required knowledge; what constitutes a “relationship” needs to be defined
• Query 59, “American citizens fight against Iraq war”, is too specific, and the document collection has more documents on the Iraq war in general than on this particular topic. The terms “Iraq War” get more weight than the terms “fight against”

Experiments – Results (3)

MAP     R-prec  P@5     P@10    P@20    Recall
0.4821  0.4862  0.7280  0.6960  0.6360  0.8912

Overall results of the Tamil – English cross-lingual information retrieval system.
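
For reference, the measures in the table can be computed from a ranked result list and a set of relevance judgements as in the sketch below. These are the standard textbook definitions, not the evaluation script used for FIRE; the function names and the document IDs in the usage lines are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved at all."""
    return sum(1 for doc in relevant if doc in ranked) / len(relevant)


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]   # hypothetical system output
    relevant = {"d1", "d3"}             # hypothetical relevance judgements
    print(precision_at_k(ranked, relevant, 2))
    print(average_precision(ranked, relevant))
    print(recall(ranked, relevant))
```

MAP rewards placing relevant documents early in the ranking, while recall only checks whether they were retrieved, which is why query expansion can raise recall without raising MAP by the same amount.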

MAP Score For Each Query

[Figure: MAP score for each query. x-axis: Query ID; y-axis: MAP score (0 to 1)]

Recall For Each Query

[Figure: recall for each query. x-axis: Query ID; y-axis: Recall (percentage)]

Conclusion

• A query language analyser is used
• The two submitted runs differ in MAP score: 0.3921 and 0.4821
• The use of the query expansion module helps in increasing the recall
• The results obtained are encouraging
  – MAP – 0.4821
  – P@10 – 0.6960
  – Recall – 0.8912

Thank you!