
Overview of RISOT: Retrieval of Indic Script OCR’d Text

Utpal Garain, Indian Statistical Institute, Kolkata
Tamaltaru Pal, Indian Statistical Institute, Kolkata
Jiaul Paik, Indian Statistical Institute, Kolkata
Kripa Ghosh, Indian Statistical Institute, Kolkata
David Doermann, University of Maryland, College Park, USA
Douglas W. Oard, University of Maryland, College Park, USA

Task

o Evaluate retrieval of automatically recognized text from machine-printed text
o Goals
    o Support experimentation on retrieval from printed documents
    o Evaluate IR effectiveness for retrieval based on Indic script OCR
    o Provide a venue where IR and OCR researchers can work together

RISOT 2011

o Bengali newspaper articles
    o About half the FIRE 2008/2010 collection
    o 62,875 documents
        o Text
        o Rendered image
        o OCR’d text
o 66 topics

RISOT 2011

o Two teams participated
o Techniques
    o OCR error modeling
    o Query-time stemming
o Best absolute OCR result came from stemming + error modeling
    o 83% of the TEXT MAP for TD queries
o Best same-team relative MAP: 90% of TEXT (88% for P@10)
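The slides name OCR error modeling as a technique but do not describe how the participants implemented it. As a rough illustration only, one way an error model can be used at query time is to expand each query term with the forms an OCR engine might have produced. The confusion table and the function names below are hypothetical, and the sketch handles single-character substitutions only.

```python
# Hypothetical confusion table: clean character -> strings an OCR engine
# might emit instead. A real table would be estimated from aligned clean
# and OCR'd text; these entries are illustrative assumptions.
CONFUSIONS = {
    "t": ["1", "l"],
    "o": ["0"],
    "m": ["rn"],
}


def one_error_variants(term, confusions=CONFUSIONS):
    """Return the term plus every variant with exactly one substitution."""
    variants = {term}
    for i, ch in enumerate(term):
        for wrong in confusions.get(ch, []):
            variants.add(term[:i] + wrong + term[i + 1:])
    return variants


def expand_query(terms, confusions=CONFUSIONS):
    """Map each query term to the sorted list of forms to search for."""
    return {t: sorted(one_error_variants(t, confusions)) for t in terms}


if __name__ == "__main__":
    print(expand_query(["tobacco"]))
```

Expanding the query rather than correcting the collection leaves the OCR'd index untouched, at the cost of longer queries.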

Further experiments on RISOT 2011 Data

o N-gram statistics were used
o Stemming beats words or n-grams
    o Statistically significant improvement over words for T and TD queries, clean and OCR’d text, with and without the error model
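The experiments compare stems, words and character n-grams as indexing terms. A minimal sketch of overlapping character n-gram tokenization follows, assuming whitespace tokenization and slicing by Unicode code point. For Indic scripts a real system would more likely segment by grapheme cluster, which this sketch does not attempt, and the function names are illustrative.

```python
def char_ngrams(token, n):
    """Return the overlapping character n-grams of a token.

    Tokens no longer than n are returned whole so they stay searchable.
    """
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def index_terms(text, n):
    """Split on whitespace and emit the n-grams of every token."""
    terms = []
    for token in text.split():
        terms.extend(char_ngrams(token, n))
    return terms


if __name__ == "__main__":
    clean = set(char_ngrams("tobacco", 3))
    noisy = set(char_ngrams("1obacco", 3))
    # A single OCR error changes only the n-grams that overlap it.
    print(sorted(clean & noisy))
```

Because one misrecognized character damages only the n-grams that contain it, the remaining n-grams can still match, which is the usual motivation for trying n-grams on noisy text.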

Run       Q   Doc    Term  Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-S    TD  Clean  Stem  --     0.4229  --    0.4413  0.3554  0.3940
TD-O-S-M  TD  OCR    Stem  Multi  0.3619  86%   0.3973  0.3207  0.3379
TD-O-S-E  TD  OCR    Stem  One    0.3521  83%   0.3858  0.3008  0.3294
TD-O-S    TD  OCR    Stem  --     0.2915  69%   0.3109  0.2489  0.2832

Run       Q   Doc    Term  Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-W    TD  Clean  Word  --     0.3449  82%   0.3826  0.3152  0.3250
TD-O-W-M  TD  OCR    Word  Multi  0.3434  81%   0.3577  0.2962  0.3131
TD-O-W-E  TD  OCR    Word  One    0.3251  77%   0.3388  0.2694  0.3068
TD-O-W    TD  OCR    Word  --     0.2293  54%   0.2717  0.2217  0.2336

Run       Q   Doc  Term    Model  MAP     MAP%  P@5     P@10    Rprec
TD-O-3-E  TD  OCR  3-gram  One    0.3285  78%   0.3419  0.2709  0.2925
TD-O-3    TD  OCR  3-gram  --     0.3072  73%   0.3239  0.2707  0.2903
TD-O-4-E  TD  OCR  4-gram  One    0.2972  70%   0.3140  0.2651  0.2717
TD-O-2-E  TD  OCR  2-gram  One    0.2795  66%   0.2870  0.2000  0.2635
TD-O-4    TD  OCR  4-gram  --     0.2708  64%   0.3000  0.2489  0.2631
TD-O-5-E  TD  OCR  5-gram  One    0.2686  64%   0.2930  0.2465  0.2460
TD-O-5    TD  OCR  5-gram  --     0.2451  58%   0.2739  0.2283  0.2339
TD-O-2    TD  OCR  2-gram  --     0.1984  47%   0.2478  0.1924  0.2085
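The result tables report MAP, P@5, P@10 and Rprec. The slides do not say which evaluation tool produced them, so the following is only a sketch of the standard definitions of these measures, with illustrative function names and made-up document identifiers.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by the number of
    relevant documents for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_average_precision(runs):
    """runs: iterable of (ranked_list, relevant_set) pairs, one per topic."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]      # made-up ranking for one topic
    relevant = {"d1", "d3"}                # made-up relevance judgments
    print(average_precision(ranked, relevant))
    print(precision_at_k(ranked, relevant, 2))
    print(r_precision(ranked, relevant))
```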

CLIR

o English queries, Bengali collection (OCR’d)
o Dictionary-based translation
o Transliteration of OOVs
o Additional resources
    o Stemming
    o OCR error modeling
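The CLIR setup combines dictionary-based translation with transliteration of out-of-vocabulary (OOV) query words. The sketch below shows only the control flow of such a pipeline. The dictionary entries are placeholders rather than real Bengali translations, the character map is a toy assumption, and real transliteration is considerably more involved than a character-by-character lookup.

```python
# Toy bilingual dictionary: source word -> candidate target terms.
# The target strings are placeholders, not real Bengali translations.
DICTIONARY = {
    "election": ["TGT_election_1", "TGT_election_2"],
    "result": ["TGT_result"],
}

# Toy character map standing in for a real transliteration model.
CHAR_MAP = {"a": "A", "k": "K"}


def transliterate(word, char_map=CHAR_MAP):
    """Map a word character by character; unmapped characters are kept."""
    return "".join(char_map.get(ch, ch) for ch in word)


def translate_query(words, dictionary=DICTIONARY, char_map=CHAR_MAP):
    """Translate known words with the dictionary, transliterate the rest."""
    translated = []
    for word in words:
        key = word.lower()
        if key in dictionary:
            # Keep every candidate translation of the word.
            translated.extend(dictionary[key])
        else:
            # OOV word (often a proper name): fall back to transliteration.
            translated.append(transliterate(key, char_map))
    return translated


if __name__ == "__main__":
    print(translate_query(["Election", "Kolkata"]))
```

Keeping all dictionary candidates is one design choice among several. Selecting a single candidate, as the "Manual Selection" run in the results suggests, is another.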

CLIR Results

Run  Q   Retrieval Condition  Processing              MAP     MAP%  P@5     P@10    Rprec
T1   TD  Mono+Text            --                      0.3205  100%  0.3762  0.3182  0.3083
O1   TD  Mono                 --                      0.2689  84%   0.2420  0.2420  0.4166
O2   TD  CLIR                 DQT                     0.0813  25%   0.1025  0.0854  0.0679
O3   TD  CLIR                 DQT (Manual Selection)  0.0848  26%   0.1150  0.0938  0.0864
O4   TD  CLIR                 DQT+OOV                 0.1866  58%   0.2529  0.2063  0.1901
O5   TD  CLIR                 DQT+OOV+OEM             0.2650  83%   0.3338  0.2723  0.2509
O6   TD  CLIR                 DQT+OOV+OEM+Stem        0.2915  91%   0.3672  0.2996  0.2760

Abbreviations: DQT = dictionary-based query translation; OOV = transliteration of out-of-vocabulary terms; OEM = OCR error modeling.
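The MAP% column appears to express each run's MAP as a percentage of the clean-text monolingual baseline (T1). The short check below, using the MAP values copied from the CLIR results, reproduces the reported percentages under that assumption. The function name is illustrative.

```python
def relative_map(run_map, baseline_map):
    """Return a run's MAP as a whole-number percentage of the baseline MAP."""
    return round(100 * run_map / baseline_map)


# MAP values copied from the CLIR results; T1 is the clean-text baseline.
BASELINE = 0.3205
RUNS = {
    "O1": 0.2689,
    "O2": 0.0813,
    "O3": 0.0848,
    "O4": 0.1866,
    "O5": 0.2650,
    "O6": 0.2915,
}

if __name__ == "__main__":
    for run, value in RUNS.items():
        print(run, f"{relative_map(value, BASELINE)}%")
```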

Addition in 2012

o Devanagari (Hindi) dataset
    o 94,432 articles from two newspapers
        o Subset of FIRE data
        o Text
        o Rendered image
        o OCR’d text
    o 28 topics
o Tasks
    o OCR post-processing
    o Retrieval from Bengali OCR’d text
    o Retrieval from Devanagari (Hindi) OCR’d text

RISOT Runs

o One team participated
    o ISI team
        o Kripabandhu Ghosh and Anirban Chakraborty
o Method
    o Did not use the previous OCR error modeling technique
    o Assumed that clean text is not available
    o Co-occurrence based synonym searching
        o tobacc, 1obacco, etc. are treated as synonyms of tobacco
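The slides give only the name of the 2012 method (co-occurrence based synonym searching) and the tobacco example. As a guess at the general idea, and not the team's actual algorithm, the sketch below treats a term as an OCR variant of a query term when the two are close in edit distance and also co-occur with similar words. The thresholds, the document-level co-occurrence window, and all function names are assumptions. No clean text is needed, which matches the stated setting.

```python
from collections import Counter, defaultdict
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def context_vectors(docs):
    """For each term, count the other terms that share a document with it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        terms = set(doc.split())
        for term in terms:
            for other in terms:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def ocr_synonyms(term, docs, max_dist=2, min_sim=0.5):
    """Terms that look like `term` and occur in similar contexts."""
    vectors = context_vectors(docs)
    target = vectors.get(term, Counter())
    return sorted(
        other for other in list(vectors)
        if other != term
        and edit_distance(term, other) <= max_dist
        and cosine(target, vectors[other]) >= min_sim
    )


if __name__ == "__main__":
    docs = [  # made-up OCR'd snippets
        "tobacco smoking causes cancer",
        "1obacco smoking causes cancer",
        "tobacc smoking causes harm",
        "tobago island travel",
    ]
    print(ocr_synonyms("tobacco", docs))
```

In the made-up snippets, "tobago" is within the edit-distance threshold of "tobacco" but shares no context with it, so the co-occurrence test is what filters it out.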

RISOT Results

Condition                MAP     P@5
Clean text               0.2567  0.3485
OCR’d text               0.1791  0.2738
OCR’d text + Processing  0.1974  0.2831

o OCR error modeling (not used in this run) gave a larger improvement

RISOT Future

o Next RISOT will introduce image degradation
    o Module of OCRopus
    o LAMP, UMD tool
o How to attract more teams
    o Involvement of OCR consortium
        o Better OCR
        o Better error modeling
    o Summer code projects
o Once in two years
