Recognition and Retrieval from Document Image Collections

Million Meshesha (Roll No.: 200299004)

Centre for Visual Information Technology,

International Institute of Information Technology,

Hyderabad, India

Advisor: Dr. C. V. Jawahar

Introduction

• Emergence of large digital libraries such as UDL and DLI.
– One-million-book archival activities at the Mega Scanning Center, IIIT-H.

• Involvement of Google, Yahoo and Microsoft in massive digitization projects.

The aim of digitization is easier preservation and making documents freely accessible across the globe.

• Global effort to digitize and archive large collections of multimedia data.
– Most of them are printed books.

We need to design efficient means of access to the content.

The Direct Approach

• Recognition-based access to documents
– Easy to integrate into a standard IR framework
– Success of text image retrieval mainly depends on the performance of OCRs

[Figure: recognition-based retrieval pipeline. Blocks: Document Images; Optical Character Recognition (Preprocessing and Segmentation, Feature Extraction, Classification, Post-processing); Text Documents; Database; Search engine; Textual Query; Cross-lingual Retrieval.]

Challenges

• State-of-the-art OCR engines recognize documents printed in Latin and some Oriental scripts, with few errors per page on high-quality images.

• Unavailability of robust OCRs for indigenous scripts of African and Indian languages.

• Challenges in developing OCRs for scripts with complex shapes and a large number of characters.

• Lack of specialized recognizers for large document image collections.

• Diversity and quantity of documents archived in digital libraries.

Alternate Approach: Recognition-Free

[Figure: recognition-free retrieval pipeline. Blocks: Document Images; Preprocessing and Segmentation; Feature Extraction; Clustering and Indexing; Database; Search engine; Textual Query; Rendering; Cross-lingual Retrieval; retrieved Document Images.]

Comparison of the Two Approaches

Recognition-based | Recognition-free
Needs recognition before retrieval, e.g. text search engines | Retrieves without explicit recognition, e.g. CBIR, CBVR
Less offline processing (excluding recognition) | High offline processing
Fast and efficient algorithms | Slow and inefficient schemes
Compact representation | Bulky representation
Content/language dependent | More content/language independent
Challenging to build (because of recognizers) | Relatively easy to build with a certain level of acceptable performance

Review of OCR Systems

• Conventional OCRs follow sequential steps:
– Preprocessing: thresholding, normalization, skew detection/correction, noise removal algorithms
– Document Layout Analysis: text/image block identification, geometric layout analysis
– Segmentation: line segmentation, word segmentation, component analysis
– Feature Extraction: structural features like shape, contour etc.; transformation domain features like DFT, DCT; global and local features
– Classification: Bayesian statistical classifier, SVM classifier, neural network
– Post Processing: lexical information, dictionary and punctuation rules, statistical information

"Anatomy of a Versatile Page Reader", H. Baird, Proc. of IEEE, Vol. 80, No. 7, July 1992.

"Omnidocument Technologies", M. Bokser, Proc. of IEEE, Vol. 80, No. 7, July 1992.

Review of Recognition-Free

• Manmatha et al.:
– Proposed the word spotting idea for matching word images from handwritten historical manuscripts.
– Used dynamic time warping (DTW) for word image matching.
– Selected profile features for matching handwritten word images.

• Chaudhury et al.:
– Exploited the structural characteristics of the Indian scripts to access them at word level.
– Employed geometric features, and suffix trees for indexing.

• Trenkle and Vogt:
– Experimented on word-level image matching.
– Extracted features at the baseline, concavities, line segments, junctions, dots and stroke directions, and computed a distance metric.

• Srihari et al.:
– Spotted words in document images of Devanagari, Arabic and Latin scripts.
– Used Gradient, Structural and Concavity (GSC) features.
– Implemented a correlation similarity measure for word spotting.

• A. K. Jain and Anoop M. Namboodiri:
– Employed DTW-based word spotting for indexing and retrieval of on-line documents.
– Extracted features such as the height of the sample point, and the direction and curvature of strokes.

T. Rath and R. Manmatha, "Word Image Matching Using Dynamic Time Warping", Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2, pp. 521--527, 2003.

Santanu Chaudhury, Geetika Sethi, Anand Vyas and Gaurav Harit, "Devising Interactive Access Techniques for Indian Language Document Images", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, Pp. 885-889

J. M. Trenkle and R. C. Vogt, "Word Recognition for Information Retrieval in the Image Domain", Symposium on Document Analysis and Information Retrieval, pp. 105-122, 1993.

S. N. Srihari, H. Srinivasan, C. Huang and S. Shetty, "Spotting Words in Latin, Devanagari and Arabic Scripts," Vivek: Indian Journal of Artificial Intelligence

A.K. Jain and Anoop M. Namboodiri, "Indexing and Retrieval of On-line Handwritten Documents", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 655-659

Major Contributions

1. Study indigenous African scripts for document understanding.
• First attempt to introduce the challenges in the recognition and retrieval of indigenous African scripts.

2. Design an OCR for recognizing printed Amharic documents.
• Tested on real-life document images (books, magazines and newspapers).

3. Propose an architecture for a self-adaptable book recognizer.
• Demonstrated its application on document images of books.

4. Propose efficient matching and feature extraction schemes.
• Performance analysis on datasets of word-form variants, degradations and printing variations in word images.

5. Construct an indexing scheme by applying IR principles for efficient searching in document images.
• Evaluated its efficiency on document images of books and newspapers.

African Scripts

• Africa is the second-largest continent in the world, next to Asia.
• There are around 2500 languages spoken in Africa, which are either:
– Languages installed by conquerors of the past, written in modifications of the Latin and Arabic scripts.
– Indigenous languages with their own scripts, e.g. Amharic (Ethiopia), Vai (West Africa), Bassa (Liberia), Mende (Sierra Leone), etc.
• Document image analysis and understanding research is very limited for indigenous African scripts.
– A few attempts are available for the Amharic script.
– Other indigenous scripts are not yet studied:
  characters are complex in shape;
  their existence is not known by most researchers;
  most are not used as official languages.

[Figure: samples of the Mende, Vai and Bassa scripts.]

Amharic Language/Script

• Large number of characters
– More than 300 characters
• Vowel formation
• Existence of visually similar characters
• Frequently occurring characters
• Amharic word morphology
– Has a rich word morphology
• Amharic (like Hindi) is a verb-final language; modifiers usually precede the nouns they modify.
– Word order in English sentences: Subject-Verb-Object
– Word order in Amharic and Hindi: Subject-Object-Verb

Recognition from “A” Document Image

• Preprocessing
– Binarization: convert gray pixels into binary.
– Skew detection and correction: ensure that the page is aligned properly.
– Noise removal: remove artifacts in the image.

• Segmentation
– Line segmentation: identify lines in a text.
– Word segmentation: identify words in a text line.
– Character segmentation: detect each character in a segmented word.

• Feature extraction
– Consider the entire component image as a feature.
– PCA: used for dimensionality reduction; reduces to the character/connected-component sub-space.
– LDA: extracts optimal discriminant vectors and reduces to the classification sub-space.

• Classification
– DDAG-based architecture for multi-class SVMs.
– Support Vector Machines (SVMs) at each node.

[Figure: DDAG for four classes. Root node 1,4; second level 1,3 and 2,4; third level 1,2, 2,3 and 3,4; leaves 1, 2, 3, 4.]

The Amharic OCR is developed on top of an OCR for Indian languages.

C. V. Jawahar, MNSSK Pavan Kumar, SS Ravi Kiran: A Bilingual OCR for Hindi-Telugu Documents and its Applications. ICDAR 2003: 408-412.

D. H. Foley and J. W. Sammon. An optimal set of discriminant vectors. IEEE Trans. on Computing, 24:271-278, 1975.
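The DDAG evaluation described on this slide can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names `ddag_predict` and `make_nearest_mean` are hypothetical, and simple nearest-mean rules stand in for the pairwise SVMs so that the example has no external dependencies. Each node compares the first and last remaining classes and eliminates the loser, so four classes need three pairwise decisions, following the node order 1,4 then 1,3 or 2,4 shown in the figure.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# A pairwise classifier maps a feature vector to the winning label of its pair.
PairwiseClassifier = Callable[[Sequence[float]], Hashable]


def ddag_predict(x: Sequence[float],
                 classes: Sequence[Hashable],
                 pairwise: Dict[Tuple[Hashable, Hashable], PairwiseClassifier]) -> Hashable:
    """Walk the decision DAG: test first vs. last class, drop the loser, repeat."""
    candidates: List[Hashable] = list(classes)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        winner = pairwise[(first, last)](x)
        if winner == first:
            candidates.pop()      # 'last' is eliminated
        else:
            candidates.pop(0)     # 'first' is eliminated
    return candidates[0]


def make_nearest_mean(a, mean_a, b, mean_b) -> PairwiseClassifier:
    """Stand-in for a trained binary SVM (assumption): pick the closer class mean."""
    def classify(x: Sequence[float]):
        dist_a = sum((xi - mi) ** 2 for xi, mi in zip(x, mean_a))
        dist_b = sum((xi - mi) ** 2 for xi, mi in zip(x, mean_b))
        return a if dist_a <= dist_b else b
    return classify


# Toy one-dimensional class means, invented purely for illustration.
means = {1: [0.0], 2: [10.0], 3: [20.0], 4: [30.0]}
labels = sorted(means)
pairwise = {
    (a, b): make_nearest_mean(a, means[a], b, means[b])
    for i, a in enumerate(labels)
    for b in labels[i + 1:]
}
predicted = ddag_predict([19.0], labels, pairwise)  # visits nodes (1,4), (2,4), (2,3)
```

With N classes the DAG holds N(N-1)/2 pairwise classifiers but evaluates only N-1 of them per sample, which is what makes it attractive for scripts with several hundred characters.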

Experimental Results

Document                                            Accuracy (%)
LaserJet printouts
  Fonts (PowerGeez, Visual Geez, Agafari, Alpas)    96.51
  Sizes (10, 12, 14, 16)                            98.49
  Styles (Normal, Bold, Italic)                     95.65
Real-life documents
  Books                                             91.45
  Newspapers                                        88.23
  Magazines                                         90.37

[Figure: sample degradations labelled Blob, Cut and Merge.]

Comments

• Present-day OCRs do not improve their performance over time.
– Performance on the first and last pages of a book is statistically identical.

• OCRs are designed to convert a single document image into a textual representation.

• Omni-font OCRs are rare even for English.
– Performance degrades with poor quality, unseen fonts, etc.

An OCR for a collection (e.g. a book) has to be different from an OCR designed for an isolated single page.

Can we design a recognizer for document image collections, say, a book recognizer?

Our Strategy

• Enable the OCR to learn from its experience through feedback that comes from the postprocessor during normal operation.
– The conventional open-loop system of a classifier followed by a post-processor is closed.

• Learn from both correctly classified and misclassified examples.

• Extend knowledge gained from one page to other pages.
– Iterate and perfect on a page (or a set of pages).

• Improve performance over time on document image collections that vary in font, size, style and quality.

Apply machine learning procedures to build an intelligent OCR.

Comparison

Conventional OCRs
• Designed for a single page
• No feedback; top-down serial process
• Failures are costly: any error at an intermediate level results in wrong output of the system
• Offline training
• Performance is static or declines

Our new approach: Book recognizer
• Designed for multiple pages
• Feedback-based flexible design
• Any error at an intermediate level can be corrected by using proper feedback
• More of online learning
• Performance improves over time

Self-adaptable OCR Design

[Figure: architecture of the self-adaptable OCR. Blocks: Document Images, Recognizer, Classifier, Model, Model Base, Recognized Texts, Post Processor, Validator, Refined Samples, Sample Database, Sampler (filtered, selected and rejected samples), Labeler (labeled samples). Example word images shown in the figure: "lnformation", "lntormatlon", "told", "iold", "idol".]

Annotations in the figure:
• Produces error-corrected words; such words are candidates for feedback.
• Detection of outliers; validation in image space.
• Label unlabelled data.
• Add samples to their proper class.
• Pass new samples for training.
• Incremental learning.

Learning online

• Experiment on a poor-quality book.

• Initial accuracy was very low (below 70%).

• Within a few iterations of learning, the recognition accuracy improved to about 95%.

[Figure: recognition accuracy across learning iterations. Initial accuracy = 65.24%; 2nd iteration = 88.24%; further iterations = 91.08% and 94.82%; final accuracy = 95.26%.]

Results on font and style variations

Further Issue

• OCR is a long-term solution.
– It needs some time to come up with a workable system.

• But our problem is immediate.
– A number of documents are already archived and ready for use.

Can we access the content of document images without explicit recognition?

Word Spotting

Collection    Query        Matching Score
Professor     University   10.38
Alexander     University   14.44
Smith         University   12.21
until         University    9.32
recently      University   16.43
head          University   17.34
chemistry     University   14.56
Columbia      University   15.10
University    University    0.51
American      University   18.71
Chemical      University   14.32
Society       University   12.13
died          University   19.11
native        University   18.10

Word Search by Word Spotting

[Figure: word search by word spotting. Blocks: Query (e.g. "Christian"), Render, Feature Extraction, Matching.]

Efficient Matching Scheme

• Matching techniques:
– Cross correlation
– Dynamic Time Warping (DTW)

• DTW aligns and finds the best match between pairs of word images of different sizes.

• Trace back to identify the optimal warping path (OWP).

Performance analysis shows that DTW outperforms cross correlation:

                     Recall    Precision   F-score
DTW                  89.58%    90.81%      90.19%
Cross correlation    76.43%    78.83%      77.61%
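The DTW matching and trace-back steps named on this slide can be sketched as follows. This is a textbook DTW under stated assumptions, not the exact variant used in the thesis: the function name `dtw_match` is hypothetical, each word image is assumed to be already converted to a sequence of per-column feature vectors, the local cost is Euclidean distance, and the three standard step directions are allowed.

```python
import numpy as np


def dtw_match(a, b):
    """Return (total cost, optimal warping path) between two feature sequences.

    a, b: sequences of shape (n,) or (n, d); one row per image column.
    The path is a list of (index in a, index in b) pairs from start to end.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)

    # Local cost between every pair of columns.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # Cumulative cost table with an extra border row and column of infinities.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])

    # Trace back from the end to recover the optimal warping path (OWP).
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(D[n, m]), path


# Two toy sequences of different length; the repeated value is absorbed by warping.
total, owp = dtw_match([1, 2, 3], [1, 2, 2, 3])
normalized_score = total / len(owp)
```

Dividing the total cost by the path length keeps scores comparable between short and long words, which matters when ranking every word image in a collection against one query.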

Challenges in Word Image Search

• Degradation of documents
– Cuts, blobs, salt and pepper, erosion of border pixels, etc.

• Print variations
– A word image may vary in size, style, font and quality.

• Morphological variation
– A word may have different variants.

“Stemming” of Word Images

• Two possible variants of a word:
(i) formed by adding a prefix and/or suffix to the root word, e.g. 'connect': 'connects', 'connecting', 'reconnect', ...
(ii) synonymous words, e.g. 'connect': 'join', 'attach', ...

• It is observed that most word-form variations take place either at the beginning or at the end of the word.

• This needs a matching algorithm that can “penalize” mismatches at the beginning or at the end.

We propose a novel DTW-based partial matching scheme.

DTW-based Morphological Matcher

Partition the OWP (of length L) into beginning, middle and end regions of length k = L/3 each
for i = 1 to k do
    if there is matching cost concentration at the beginning
        reduce the extra cost from the total matching score
    else break
end for
for i = L down to 2k + 1 do
    if there is matching cost concentration at the end
        reduce the extra cost from the total matching score
    else break
end for
Normalize the matching score by the length of the optimal warping path.
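The partial matching procedure above can be sketched in Python as follows. The slide does not define "matching cost concentration" or "extra cost", so this sketch makes two explicit assumptions: a step counts as concentrated when its local cost exceeds `ratio` times the mean cost of the middle region, and the extra cost is the amount by which it exceeds that mean. The function name `partial_match_score` and the parameter `ratio` are hypothetical.

```python
from typing import Sequence


def partial_match_score(step_costs: Sequence[float], ratio: float = 2.0) -> float:
    """Discount cost concentrated at either end of the optimal warping path.

    step_costs: local matching cost of each step along the OWP, in path order.
    ratio: assumed threshold factor relative to the mean middle-region cost.
    """
    costs = [float(c) for c in step_costs]
    length = len(costs)
    if length == 0:
        raise ValueError("the warping path must contain at least one step")

    k = length // 3                           # size of the beginning and end regions
    middle = costs[k:length - k]              # never empty, because length - 2k >= 1
    baseline = sum(middle) / len(middle)      # typical cost where the words agree
    threshold = ratio * baseline
    total = sum(costs)

    # Scan inward from the beginning; stop at the first well-matched step.
    for i in range(k):
        if costs[i] > threshold:
            total -= costs[i] - baseline      # remove only the extra cost
        else:
            break

    # Scan inward from the end in the same way.
    for i in range(length - 1, length - 1 - k, -1):
        if costs[i] > threshold:
            total -= costs[i] - baseline
        else:
            break

    return total / length                     # normalize by the OWP length


# Toy path costs: a well-matched stem followed by a mismatching suffix.
suffix_variant = [0.1] * 6 + [5.0] * 3
plain_score = sum(suffix_variant) / len(suffix_variant)
partial_score = partial_match_score(suffix_variant)
```

In this toy case the suffix dominates the plain normalized score, while the partial score falls back to roughly the cost of the matching stem, which is the behaviour wanted for variants such as 'connect' and 'connecting'.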

Performance of partial matching (%)

               Before                          After
Item           Recall   Precision   F-score    Recall   Precision   F-score
Font           83.35    91.83       86.95      95.90    98.20       97.03
Size           87.38    91.39       89.30      96.80    99.42       98.09
Style          75.62    80.25       77.84      88.94    94.73       91.69
Degradation    85.82    88.49       87.04      91.74    96.26       93.92

Degraded Words

[Figure: examples of degraded word images. Panels: Salt and Pepper, Blobs, Cuts, Complex script, Historic documents.]

Degradation Modeling

• Cuts and breaks
• Blobs
• Salt and pepper
• Erosion of boundary pixels

We built datasets using our degradation models for English, Hindi and Amharic.
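The four degradation types listed above can be imitated with very simple operations on a binary word image. The following is only a sketch of the idea, not the degradation models used to build the datasets: all function names and parameters are hypothetical, and the image is assumed to be a NumPy array of dtype uint8 with 1 for ink and 0 for background.

```python
import numpy as np


def salt_and_pepper(img, prob, rng):
    """Flip each pixel independently with probability prob."""
    flips = rng.random(img.shape) < prob
    return np.where(flips, 1 - img, img).astype(img.dtype)


def cut(img, row, thickness=1):
    """Erase a horizontal band of ink, imitating a cut or break."""
    out = img.copy()
    out[row:row + thickness, :] = 0
    return out


def blob(img, center, radius):
    """Add a filled disc of ink, imitating an ink blob."""
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    out = img.copy()
    out[mask] = 1
    return out


def erode_boundary(img):
    """Remove ink pixels that touch the background in the 4-neighbourhood."""
    padded = np.pad(img, 1, constant_values=0)
    keep = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
            padded[1:-1, :-2] & padded[1:-1, 2:])
    return img & keep


# A 5x5 ink square on a 7x7 canvas serves as a toy "word image".
rng = np.random.default_rng(0)
square = np.zeros((7, 7), dtype=np.uint8)
square[1:6, 1:6] = 1
noisy = salt_and_pepper(square, 0.05, rng)
broken = cut(square, row=3)
blotted = blob(np.zeros((7, 7), dtype=np.uint8), center=(3, 3), radius=1)
thinned = erode_boundary(square)
```

Applying such functions with varying parameters to clean rendered words is one way to obtain labelled degraded samples without manual annotation.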

Invariant Feature Selection

• Investigate various features:
– Profiles (upper, lower, projection, transition)
– Statistical moments (mean, standard deviation, skew)
– Region-based moments (zero-order moment, first-order moment, central moment)
– Transform-domain (Fourier) representations

• Global vs. local features
– Global features: compute a single value.
– Local features: compute a 1D representation following vertical strips of a word.
– Local features perform better than global features.

• For better performance, combine local features of profiles, moments and transform-domain representations.

                   Recall    Precision   F-score
Global features    53.32%    50.24%      51.73%
Local features     82.92%    80.53%      81.71%
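To make the notion of local profile features concrete, the sketch below computes the four profiles named above for every column (a one-pixel vertical strip) of a binary word image. It is an illustration under assumptions, not the thesis feature extractor: the function name `local_profiles` is hypothetical, ink pixels are assumed to be 1, and columns with no ink are assigned the image height for the upper and lower profiles.

```python
import numpy as np


def local_profiles(word):
    """Return an array of shape (width, 4): one feature row per image column.

    Columns of the result: upper profile (distance from the top edge to the
    first ink pixel), lower profile (distance from the bottom edge to the last
    ink pixel), projection profile (ink pixel count) and transition profile
    (number of background-to-ink transitions going down the column).
    """
    word = np.asarray(word).astype(bool)
    height, width = word.shape
    has_ink = word.any(axis=0)

    upper = np.where(has_ink, word.argmax(axis=0), height)
    lower = np.where(has_ink, word[::-1].argmax(axis=0), height)
    projection = word.sum(axis=0)

    # Prepend a background row so that ink in the top row counts as a transition.
    padded = np.vstack([np.zeros((1, width), dtype=bool), word])
    transitions = (~padded[:-1] & padded[1:]).sum(axis=0)

    return np.stack([upper, lower, projection, transitions], axis=1).astype(float)


# Toy 4x3 image; the last column is empty.
toy_word = [[0, 1, 0],
            [1, 1, 0],
            [0, 0, 0],
            [1, 1, 0]]
features = local_profiles(toy_word)
global_ink_count = float(np.asarray(toy_word).sum())  # a global feature: one value
```

The resulting sequence of per-column vectors is exactly the kind of input a DTW matcher consumes, whereas a global feature such as the total ink count collapses the whole word into a single number.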

Invariant Feature Selection

• To test the performance of the combined features, the DTW matching algorithm is modified.

• Combined local features of profiles and moments are invariant to degradations and printing variations.

Test results on degraded word images (%)

                  Hindi                          Amharic                        English
Degradation       Recall  Precision  F-score     Recall  Precision  F-score     Recall  Precision  F-score
Cuts              92.34   92.41      92.37       93.72   94.93      94.32       93.76   88.15      90.87
Salt and pepper   93.28   93.17      93.20       96.88   97.11      96.99       96.56   96.02      96.29
Blobs             85.95   92.33      89.03       89.46   93.48      91.43       89.79   86.43      88.08
Erosion           92.77   92.58      92.67       94.91   95.72      95.31       92.38   93.29      92.83

Information Retrieval from “Document Images”

• Users expect more than just searching for documents that contain their query word.
– This expectation comes from the popularity of text search.

• Retrieve relevant documents in ranked order.

• Remove the effects of stopwords in the retrieval process.

• Fast search and efficient delivery of documents.

• How can we meet users' requirements?

Construct an indexing scheme to organize word images following IR principles.

Mapping IR techniques for Document Image Retrieval

Module | Purpose | Algorithm(s) used in text search engines | Algorithm(s) used in current work
Stemming | Group word variants | Language modeling, e.g. Porter algorithm | Morphological matching using DTW
Stopword detection | Remove common words | Stop word list | Inverse document frequency (IDF)
Relevance measurement | Rank documents | Term frequency (TF) | Modified TF/IDF
Clustering | Group index terms | --- | Improved hierarchical clustering
Indexing data structure | Organize index lists | Inverted index and signature file | Inverted index
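The last three rows of this mapping (IDF-based stopword detection, TF/IDF relevance ranking and an inverted index) can be illustrated with a small sketch. Because the slides do not spell out the "modified TF/IDF", the code uses the standard formulation as an assumption; the function names, the cluster identifiers `c1` to `c3` and the page names are invented for illustration. Each index term is taken to be a cluster of word images rather than a text token.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs maps a document id to the list of cluster ids of its word images.

    Returns (postings, idf): postings[term][doc] is the term frequency and
    idf[term] is the inverse document frequency.
    """
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, count in Counter(terms).items():
            postings[term][doc_id] = count / len(terms)
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(plist)) for term, plist in postings.items()}
    return dict(postings), idf


def search(term, postings, idf, stop_idf=0.0):
    """Return documents ranked by TF x IDF; low-IDF terms act as stopwords."""
    if term not in postings or idf[term] <= stop_idf:
        return []
    scores = {doc: tf * idf[term] for doc, tf in postings[term].items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection: three pages described by the clusters of their word images.
pages = {
    "page1": ["c1", "c2", "c1"],
    "page2": ["c1", "c3"],
    "page3": ["c1", "c2", "c2", "c2"],
}
postings, idf = build_inverted_index(pages)
ranked = search("c2", postings, idf)       # page3 uses c2 most often
stopword_hits = search("c1", postings, idf)  # c1 occurs on every page
```

Cluster `c1` appears on every page, so its IDF is zero and it is treated as a stopword, while a query that falls into cluster `c2` returns the pages ordered by how prominently that cluster occurs.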

Indexing Document Images

[Figure: indexing pipeline. Blocks: Word Images; IR Measures and Clustering (Stemming, Stopword Detection, Relevance Measure); Index terms; Template (Keywords); Inverted Indexing; Index list.]

Clustered English Words

[Figure: clustered English word images.]

Clustered words vary in fonts, sizes, styles, forms and quality.

Clustered Amharic Words

[Figure: clustered Amharic word images.]

Test results on datasets of the various fonts, sizes and styles (%)

           Hindi                          Amharic
Type       Recall  Precision  F-score     Recall  Precision  F-score
Fonts      91.28   92.85      91.88       93.05   93.87      93.45
Styles     83.29   84.01      83.64       89.09   89.80      89.44
Sizes      94.59   96.94      95.74       95.99   96.34      96.26

[Figure: sample word images in styles (Normal, Bold, Italic), sizes (10, 12, 14, 16) and fonts (PowerGeez, VisualGeez, Agafari, Alpas).]

Performance: Precision vs. Recall graph

• The graph shows the effectiveness of our scheme.

• It increases both precision and recall, moving the entire curve up and out to the right.

Concluding Remarks

• African scripts
– Introduced indigenous African scripts for the first time.
– Initial attempt to recognize Amharic documents, with good results; to be extended to other indigenous African scripts.
– Engineering effort is needed to make it applicable to real-life situations.

• Recognizer design
– New attempt to propose a self-adaptable recognizer for document image collections with the help of machine learning algorithms.
– Encouraging results for developing a recognizer for large document image collections.
– Further work is needed to extend the framework to many of the complex Indian and African scripts.

• Document image indexing and retrieval
– Proposed a DTW-based partial matching scheme to perform morphological matching.
– Designed a feature extraction scheme that is invariant to degradation and printing variations.
– Applied IR principles and constructed a clustering and indexing scheme.
– System-related issues need to be solved for practical online retrieval from a large corpus.

Million Meshesha and C. V. Jawahar, "Indigenous Scripts of African Languages", African Journal of Indigenous Knowledge Systems, Vol. 6, No. 2, pp. 132-142, 2007.

Million Meshesha and C. V. Jawahar, "Optical Character Recognition of Amharic Documents", African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53-66, June 2007.

Million Meshesha and C. V. Jawahar, "Self-Adaptable Recognizer for Document Image Collections", In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.

Million Meshesha and C. V. Jawahar, "Matching Word Images for Content-based Retrieval from Printed Document Images", International Journal of Document Analysis and Recognition (IJDAR) (in press).

Million Meshesha and C. V. Jawahar, "Indexing Word Images for Recognition-free Retrieval from Printed Document Databases", Information Sciences: An International Journal (revised & submitted).

Scope for Future Work

• Develop an online system for searching hundreds of books over the Web

• Recognition and retrieval of complex documents (such as camera-based, handwritten, etc.).

• Apply advanced image preprocessing techniques to enhance image quality for large collection of document images.

• Retrieval of documents in presence of OCR errors and scope for hybrid approaches.

Publications: Conference Papers

• Million Meshesha and C. V. Jawahar, "Self-Adaptable Recognizer for Document Image Collections", In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.

• A. Balasubramanian, Million Meshesha, C. V. Jawahar, "Retrieval from Document Image Collections", In Proceedings of 7th IAPR Workshop on Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872), 2006, pp. 1-12.

• Sachin Rawat, K. S. Sesh Kumar, Million Meshesha, Indiraneel Deb Sikdar, A. Balasubramanian and C. V. Jawahar, "Semi-automatic Adaptive OCR for Digital Libraries", In Proceedings of 7th IAPR Workshop on Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872), 2006, pp. 13-24.

• K. Pramod Sankar, Million Meshesha, C. V. Jawahar, "Annotation of Images and Videos based on Textual Content without OCR", In Workshop on Computation Intensive Methods for Computer Vision, Part of 9th European Conference on Computer Vision (ECCV), Austria, 2006.

• Million Meshesha and C. V. Jawahar, "Recognition of Printed Amharic Documents", In Proceedings of 8th International Conference of Document Analysis and Recognition (ICDAR), Seoul, Korea, Sep 2005, Volume 1, pp. 784-788.

• C. V. Jawahar, Million Meshesha, A. Balasubramanian, "Searching in Document Images", In Proceedings of Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2004, pp. 622-627.

Publications: Journal Articles

• Million Meshesha and C. V. Jawahar, "Matching Word Images for Content-based Retrieval from Printed Document Images", International Journal of Document Analysis and Recognition (IJDAR) (in press).

• C. V. Jawahar, A. Balasubrahmanian, Million Meshesha and Anoop Namboodiri, "Retrieval of Online Handwriting by Synthesis and Matching", Pattern Recognition (in press).

• Million Meshesha and C. V. Jawahar, "Optical Character Recognition of Amharic Documents", African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53-66, June 2007.

• Million Meshesha and C. V. Jawahar, "Indigenous Scripts of African Languages", African Journal of Indigenous Knowledge Systems, Vol. 6, No. 2, pp. 132-142, 2007.

• Million Meshesha and C. V. Jawahar, "Indexing Word Images for Recognition-free Retrieval from Printed Document Databases", Information Sciences: An International Journal (revised & submitted).

Thank you
