Overview of RISOT: Retrieval of Indic Script OCR’d Text
Utpal Garain, Indian Statistical Institute, Kolkata
Tamaltaru Pal, Indian Statistical Institute, Kolkata
Jiaul Paik, Indian Statistical Institute, Kolkata
Kripa Ghosh, Indian Statistical Institute, Kolkata
David Doermann, University of Maryland, College Park, USA
Douglas W. Oard, University of Maryland, College Park, USA
Task
o Evaluate retrieval of automatically recognized text from machine-printed text
o Goals
    o Support experimentation with retrieval from printed documents
    o Evaluate IR effectiveness for retrieval based on Indic script OCR
    o Provide a venue where IR and OCR researchers can work together
RISOT 2011
o Bengali newspaper articles
    o About half the FIRE 2008/2010 collection
    o 62,875 documents
        o Text
        o Rendered image
        o OCR’d text
o 66 topics
RISOT 2011
o Two teams participated
o Techniques
    o OCR error modeling
    o Query-time stemming
o Best absolute OCR results came from stemming + error modeling
    o 83% of the TEXT MAP for TD queries
o Best same-team relative MAP
    o 90% of TEXT
    o 88% for P@10
Further experiments on RISOT 2011 Data
o N-gram statistics were used
o Stemming beats words or n-grams
    o Statistically significant improvement over words for T and TD; Clean and OCR; with and without error model
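The slides do not say how the n-grams were formed. As a minimal sketch, assuming overlapping character n-grams taken within each word, the following shows why n-gram terms tolerate single-character OCR errors. The function names are hypothetical, and the example words are borrowed from the "tobacco" example in the 2012 slides. Python slices strings by code point, so for Bengali or Devanagari text a real system would likely want to segment by grapheme cluster instead.

```python
from typing import List


def char_ngrams(term: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a term.

    Terms no longer than n are returned whole (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("n must be positive")
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def ngram_tokens(text: str, n: int) -> List[str]:
    """Tokenize on whitespace, then replace each word by its character n-grams."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(char_ngrams(word, n))
    return tokens


clean = char_ngrams("tobacco", 3)  # ['tob', 'oba', 'bac', 'acc', 'cco']
noisy = char_ngrams("1obacco", 3)  # ['1ob', 'oba', 'bac', 'acc', 'cco']
shared = set(clean) & set(noisy)   # 4 of the 5 trigrams still match
print(sorted(shared))
```

A whole-word index would treat "tobacco" and "1obacco" as unrelated terms, while the trigram view still shares four of five index terms.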
Run       Q   Doc    Term  Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-S    TD  Clean  Stem  --     0.4229  --    0.4413  0.3554  0.3940
TD-O-S-M  TD  OCR    Stem  Multi  0.3619  86%   0.3973  0.3207  0.3379
TD-O-S-E  TD  OCR    Stem  One    0.3521  83%   0.3858  0.3008  0.3294
TD-O-S    TD  OCR    Stem  --     0.2915  69%   0.3109  0.2489  0.2832

Run       Q   Doc    Term  Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-W    TD  Clean  Word  --     0.3449  82%   0.3826  0.3152  0.3250
TD-O-W-M  TD  OCR    Word  Multi  0.3434  81%   0.3577  0.2962  0.3131
TD-O-W-E  TD  OCR    Word  One    0.3251  77%   0.3388  0.2694  0.3068
TD-O-W    TD  OCR    Word  --     0.2293  54%   0.2717  0.2217  0.2336

Run       Q   Doc    Term    Model  MAP     MAP%  P@5     P@10    Rprec
TD-O-3-E  TD  OCR    3-gram  One    0.3285  78%   0.3419  0.2709  0.2925
TD-O-3    TD  OCR    3-gram  --     0.3072  73%   0.3239  0.2707  0.2903
TD-O-4-E  TD  OCR    4-gram  One    0.2972  70%   0.3140  0.2651  0.2717
TD-O-2-E  TD  OCR    2-gram  One    0.2795  66%   0.2870  0.2000  0.2635
TD-O-4    TD  OCR    4-gram  --     0.2708  64%   0.3000  0.2489  0.2631
TD-O-5-E  TD  OCR    5-gram  One    0.2686  64%   0.2930  0.2465  0.2460
TD-O-5    TD  OCR    5-gram  --     0.2451  58%   0.2739  0.2283  0.2339
TD-O-2    TD  OCR    2-gram  --     0.1984  47%   0.2478  0.1924  0.2085
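The MAP% column is not defined on the slides. The reported values are consistent with each run's MAP divided by the MAP of the clean-text stemmed run (TD-C-S, 0.4229), rounded to a whole percent. The sketch below reproduces the column for the stem and word runs under that assumption; the function and variable names are illustrative only.

```python
# MAP values copied from the stem and word tables above.
MAP_SCORES = {
    "TD-C-S": 0.4229,
    "TD-O-S-M": 0.3619,
    "TD-O-S-E": 0.3521,
    "TD-O-S": 0.2915,
    "TD-C-W": 0.3449,
    "TD-O-W-M": 0.3434,
    "TD-O-W-E": 0.3251,
    "TD-O-W": 0.2293,
}

# Assumption: the reference run is clean text with stemming.
BASELINE_RUN = "TD-C-S"


def relative_map(scores: dict, baseline: str) -> dict:
    """Express each run's MAP as a rounded percentage of the baseline MAP."""
    base = scores[baseline]
    return {run: round(100 * value / base) for run, value in scores.items()}


for run, pct in relative_map(MAP_SCORES, BASELINE_RUN).items():
    print(f"{run:10s} {MAP_SCORES[run]:.4f} {pct:3d}%")
```

Read this way, the multi-error model with stemming (TD-O-S-M) recovers 86% of the clean stemmed MAP, against 69% for stemming alone on OCR'd text.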
CLIR
o English queries, Bengali collection (OCR’d)
o Dictionary-based translation
o Transliteration of OOVs
o Additional resources
    o Stemming
    o OCR error modeling
CLIR Results
Run  Q   Retrieval Condition  Processing              MAP     MAP%  P@5     P@10    Rprec
T1   TD  Mono+Text            --                      0.3205  100%  0.3762  0.3182  0.3083
O1   TD  Mono                 --                      0.2689  84%   0.2420  0.2420  0.4166
O2   TD  CLIR                 DQT                     0.0813  25%   0.1025  0.0854  0.0679
O3   TD  CLIR                 DQT (Manual Selection)  0.0848  26%   0.1150  0.0938  0.0864
O4   TD  CLIR                 DQT + OOV               0.1866  58%   0.2529  0.2063  0.1901
O5   TD  CLIR                 DQT+OOV+OEM             0.2650  83%   0.3338  0.2723  0.2509
O6   TD  CLIR                 DQT+OOV+OEM+Stem        0.2915  91%   0.3672  0.2996  0.2760
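The result tables report MAP, P@5, P@10 and Rprec without defining them. For reference, here is a minimal sketch of the standard definitions of these measures for a single ranked list. The document IDs and relevance judgments are made up for illustration, and the slides do not say which evaluation tool produced the reported numbers.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_average_precision(topics: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over all topics."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Hypothetical ranking and judgments for one topic.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(ranked, relevant, 5))  # 2 relevant in top 5
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 3
print(r_precision(ranked, relevant))        # 1 relevant in top 3
```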
Additions in 2012
o Devanagari (Hindi) dataset
    o 94,432 articles from two newspapers
    o Subset of FIRE data
        o Text
        o Rendered image
        o OCR’d text
    o 28 topics
o Tasks
    o OCR post-processing
    o Retrieval from Bengali OCR’d text
    o Retrieval from Devanagari (Hindi) OCR’d text
RISOT Runs
o One team participated
    o ISI team
    o Kripabandhu Ghosh and Anirban Chakraborty
o Method
    o Did not use the previous OCR error modeling technique
    o Assumed that clean text is not available
    o Co-occurrence-based synonym searching
        o tobacc, 1obacco, etc. are synonyms of tobacco
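The slides give only the idea behind the co-occurrence-based synonym search, not its details. The following is one possible sketch, not the participating team's actual method: it treats a vocabulary term as an OCR variant of a query term when the two strings are similar and they share co-occurring context words. The similarity measure (`difflib.SequenceMatcher`), both thresholds, the function names, and the toy English documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from typing import Dict, List


def build_cooccurrence(docs: List[str]) -> Dict[str, Counter]:
    """Count, for each term, the other terms that appear in the same document."""
    cooc: Dict[str, Counter] = defaultdict(Counter)
    for doc in docs:
        terms = set(doc.split())
        for term in terms:
            for other in terms:
                if term != other:
                    cooc[term][other] += 1
    return cooc


def find_variants(term: str, cooc: Dict[str, Counter],
                  min_similarity: float = 0.8, min_shared: int = 1) -> List[str]:
    """Return terms that look like `term` and share context words with it.

    Both thresholds are arbitrary choices for this sketch.
    """
    context = set(cooc.get(term, Counter()))
    variants = []
    for candidate in list(cooc):
        if candidate == term:
            continue
        similarity = SequenceMatcher(None, term, candidate).ratio()
        shared = len(context & set(cooc[candidate]))
        if similarity >= min_similarity and shared >= min_shared:
            variants.append(candidate)
    return sorted(variants)


# Toy OCR'd collection (hypothetical); no clean text is consulted.
docs = [
    "tobacco smoking causes cancer",
    "tobacc smoking harms health",
    "1obacco smoking cancer risk",
    "cricket match score today",
]
cooc = build_cooccurrence(docs)
print(find_variants("tobacco", cooc))  # ['1obacco', 'tobacc']
```

A query containing "tobacco" could then be expanded with the returned variants before retrieval, which needs only the OCR'd collection itself.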
RISOT Results
                         MAP     P@5
Clean text               0.2567  0.3485
OCR’d text               0.1791  0.2738
OCR’d text + processing  0.1974  0.2831

o OCR error modeling gave a larger improvement
RISOT Future
o Next RISOT will introduce image degradation
    o Module of OCRopus
    o LAMP, UMD tool
o How to attract more teams
    o Involvement of OCR consortium
o Better OCR
o Better error modeling
o Summer code projects
o Once in two years