
Engineering Applications of Artificial Intelligence 23 (2010) 872–879
doi:10.1016/j.engappai.2010.03.002

A Document Image Retrieval System

Konstantinos Zagoris a, Ergina Kavallieratou b, Nikos Papamarkos a,*

a Image Processing and Multimedia Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
b Department of Information and Communication Systems Engineering, University of the Aegean, Samos 83100, Greece
* Corresponding author. Tel.: +30 25410 79585. E-mail address: [email protected] (N. Papamarkos).

ARTICLE INFO

Article history:

Received 9 February 2008

Received in revised form 9 March 2010

Accepted 10 March 2010

Available online 13 April 2010

Keywords:

Document retrieval

Word spotting

Segmentation

Information retrieval

Feature extraction


ABSTRACT

In this paper, a system is presented that locates words in document image archives. The technique performs word matching directly on the document images, bypassing character recognition and using word images as queries. It makes use of document image processing techniques in order to extract powerful features for the description of the word images. The features used for the comparison are capable of capturing the general shape of the query while discarding details due to noise or different fonts. In order to demonstrate the effectiveness of our system, we used a collection of noisy documents and compared our results with those of a commercial optical character recognition (OCR) package.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, the world has experienced phenomenal growth in the volume of multimedia data, and especially of document images, whose numbers have increased thanks to the ease of creating such images with scanners or digital cameras. Huge quantities of document images are thus created and stored in image archives without any indexing information. In order to exploit these collections satisfactorily, it is necessary to develop effective techniques for retrieving the document images.

A detailed survey on document image retrieval up to 1997 can be found in Doermann (1998). Historically, the first approach to the problem was an index of descriptors for each document, provided manually by experts (Salton, 1989).

Next, with improvements in the character recognition field, optical character recognition (OCR) packages were applied to documents in order to convert them to text. These techniques transform the characters contained in the image into machine-editable text. Edwards et al. (2004) described an approach to transcribing and retrieving Medieval Latin manuscripts with generalized Hidden Markov Models, whose hidden states correspond to characters and the spaces between them. One training instance is used per character, together with character n-grams, yielding a transcription accuracy of 75%. Tan et al. (2002) described an approach to retrieving machine-printed documents with a textual query, not necessarily in ASCII notation.


They describe both the query and the words occurring in the document images with features, which may then be matched in order to identify query term occurrences. A disadvantage of the above approaches is their considerably low noise tolerance, which yields low retrieval scores (Ishitani, 2001).

More recently, with improvements in the document image processing (DIP) field, techniques that make use of images instead of OCR were also introduced. Leydier et al. (2005) used DIP techniques to create a pattern dictionary of each document and then performed word spotting, selecting the gradient angle as feature and applying a matching algorithm. Kolcz et al. (2000) described an approach for retrieving handwritten documents using word image templates. Their word image comparison algorithm is based on matching the provided templates to segmented manuscript lines from the Archive of the Indies collection. Konidaris et al. (2007) propose a technique for keyword-guided word spotting in historical printed documents. They create synthetic image words as queries and perform word segmentation using dynamic parameters and hybrid feature extraction. Finally, they use user feedback to optimize the retrieval. Matching of entire words in printed documents is also performed by Balasubramanian et al. (2006). In this approach, a dynamic time warping (DTW) based partial matching scheme is used to overcome the morphological differences between words. A similar technique is used in the case of historical documents (Rath and Manmatha, 2003), where noisy handwritten document images are preprocessed into one-dimensional feature sets and compared using the DTW algorithm. Rath et al. (2004) presented a method for retrieving large collections of handwritten historical documents using statistical models. Using a word image matching algorithm, they clustered occurrences of the


same word in a collection of handwritten documents. When clusters that contain index terms are labeled, a partial index can be built for the document corpus, which can then be used for ASCII querying. The limitation of this approach is the language-dependent statistical model it requires.

Lu and Tan (2004) presented an approach capable of searching for a word portion in document images. A feature string is synthesized according to the character sequence in the user-specified word, and each word image extracted from the documents is represented by a corresponding feature string. An inexact string matching technique is then utilized to measure the similarity between the two feature strings. Unfortunately, although it bypasses OCR, the above procedure suffers from the same problems with noisy documents.

In this paper, we propose a Document Image Retrieval System (DIRS) based on word spotting, which has high noise tolerance and is language independent. The proposed technique addresses the document retrieval problem with a word matching procedure that operates directly on the document images, bypassing OCR and using word images as queries. The entire system consists of an offline and an online procedure. In the offline procedure, the document images are analyzed in order to locate the word limits inside them. A set of features, capable of capturing the word shape while discarding detailed differences due to noise or font variations, is then calculated from these words, and the results are stored in a database. In the online procedure, the user enters a query word; the proposed system creates an image of it and extracts the same set of features. These features are then used to find similar words through a matching procedure. Finally, the documents that contain these similar words are presented to the user. Experimental results show that the proposed document retrieval system achieves high mean precision and mean recall rates when applied to a collection of noisy documents. Comparative experiments against a commercial OCR package also favored the proposed system, while experiments with different font sizes and styles did not change the performance significantly.

The rest of this paper is organized as follows. Section 2 describes the overall structure of the system, while Section 3 presents and analyzes the features that are used.

Fig. 1. The overall structure of the Document Image Retrieval System.

The matching procedure of our system is described in detail in Section 4. Sections 5 and 6 describe the implementation of the proposed system and present test results, respectively. Finally, in Section 7 we draw conclusions and outline future directions of our research.

2. The Document Image Retrieval System (DIRS)

The overall structure of the proposed system is presented in Fig. 1. It consists of two parts: the offline and the online procedure. An analytical description of the stages of both procedures follows.

2.1. Preprocessing stage

In the offline operation, the document images are analyzed in order to localize the word limits, and the results are stored in a database. This procedure consists of three main stages. Initially, the document images pass through the preprocessing stage, which consists of a median 5×5 filter (Fig. 2(b)), in order to cope with the presence of noise, e.g. in historical or badly maintained documents, and a binarization method (Fig. 2(c)). Median filtering is a nonlinear signal processing technique developed by Tukey (1974) that is useful for noise suppression in images (Pratt, 2007). The binarization is achieved using the well-known Otsu technique (Otsu, 1979), which selects a threshold from the histogram of the image by minimizing the intra-class variance (equivalently, maximizing the inter-class variance) between background and foreground pixels.
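As a minimal illustration of this stage, the two preprocessing steps can be sketched with the OpenCV library (an assumption; the paper's own implementation is in C#). The input file name is hypothetical.

import cv2

# Load the document image as a single-channel gray-level image.
img = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)

# 5x5 median filter for noise suppression.
denoised = cv2.medianBlur(img, 5)

# Otsu binarization: the threshold is selected automatically from the
# gray-level histogram, so the threshold value 0 passed here is ignored.
_, binary = cv2.threshold(denoised, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("binary.png", binary)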

2.2. Word segmentation

The word segmentation stage follows the preprocessing stage. Its primary goal is to detect the word limits and to filter out noise and punctuation marks. This is accomplished using the connected components labeling and filtering method.

Firstly, all the connected components (CCs) are identified (Fig. 3(a)). Since the noise in some documents can significantly alter the shape of the extracted word images, noise and punctuation marks must be rejected. To achieve this, the most common height of the CCs (CCch) is calculated, and CCs with height less than 70% of CCch are rejected. Kavallieratou et al. (2002) have shown that the height of a word can reach double the mean character size due to the presence of ascenders and descenders. This means that the applied CC filtering can only reject areas of punctuation marks, noise, etc. (Fig. 3(b)).

Fig. 2. The preprocessing stage: (a) the original noisy document; (b) after applying the median 5×5 filter; and (c) after applying the Otsu technique.

Fig. 3. The connected components labeling and filtering method: (a) the CCs of the document; (b) the remaining CCs of the document; (c) the expanded CCs; and (d) the final extracted words after the merging of the expanded CCs.

Given these findings of Kavallieratou et al. (2002) on mean character size, and having filtered out accents, noise and punctuation marks, it is rare for CCs of the same word to be separated by more than 20% of CCch, or for different words to be closer than that. Therefore, in order to merge the CCs and create the word blocks, the left and right sides of the CCs are expanded by 20% of CCch, as Fig. 3(c) depicts. Finally, to locate the words, the overlapping CCs are merged (Fig. 3(d)).
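The following sketch illustrates this stage under the stated 70% and 20% rules, assuming OpenCV and a binary image whose foreground pixels are white; the function and variable names are illustrative, not taken from the paper's implementation.

import cv2
import numpy as np
from collections import Counter

def overlap(a, b):
    ax, ay, aw, ah = a; bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def union(a, b):
    ax, ay, aw, ah = a; bx, by, bw, bh = b
    x, y = min(ax, bx), min(ay, by)
    return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)

def segment_words(binary):
    # Label the connected components and collect their bounding boxes.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = [tuple(int(v) for v in stats[i][:4]) for i in range(1, n)]  # (x, y, w, h)

    # Most common CC height (CCch); drop CCs shorter than 70% of it
    # (punctuation, accents, residual noise).
    cc_ch = Counter(h for _, _, _, h in boxes).most_common(1)[0][0]
    boxes = [b for b in boxes if b[3] >= 0.7 * cc_ch]

    # Expand each box left and right by 20% of CCch, then merge overlapping
    # boxes until stable; each surviving box is a word candidate.
    pad = int(0.2 * cc_ch)
    boxes = [(x - pad, y, w + 2 * pad, h) for x, y, w, h in boxes]
    merged = True
    while merged:
        merged, out = False, []
        for b in boxes:
            for i, o in enumerate(out):
                if overlap(b, o):
                    out[i] = union(b, o)
                    merged = True
                    break
            else:
                out.append(b)
        boxes = out
    return boxes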

3. Features

The proposed system is based on seven features, extracted from every word, that are capable of capturing word similarities while discarding small differences due to residual noise or different font styles. They are carefully selected in order to describe both the contour and the region shape of the word.

The seven-feature set is as follows:

(1) Width to height ratio: The width to height ratio of the word carries important information about the word shape.

(2) Word area density: This feature represents the percentage of black pixels within the word bounding box. It is calculated using the following relation:

$$E = \frac{100\,BP}{IW \cdot IH} \qquad (1)$$

where BP is the number of black pixels in the word bounding box, and IW and IH are the width and height of the word bounding box in pixels.

(3) Center of gravity: This feature represents the Euclidean distance from the word's center of gravity to the upper left corner of the bounding box. To calculate it, the horizontal and vertical centers of gravity are determined by the following equations:

$$C_x = \frac{M(1,0)}{M(0,0)} \qquad (2)$$

$$C_y = \frac{M(0,1)}{M(0,0)} \qquad (3)$$

where $C_x$ is the horizontal and $C_y$ the vertical center of gravity, and $M(p,q)$ are the geometrical moments of rank $p+q$:

$$M_{pq} = \sum_x \sum_y \left(\frac{x}{width}\right)^p \left(\frac{y}{height}\right)^q f(x,y) \qquad (4)$$

Here x and y index the image pixels. Since the image is binary, f(x, y) is 1 when pixel (x, y) is black and 0 when it is white. Dividing x and y by the width and height of the image, respectively, normalizes the geometrical moments and makes them invariant to the size of the word. Finally, the center of gravity feature is defined as the Euclidean distance from the upper left corner of the image:

$$COG = \sqrt{C_x^2 + C_y^2} \qquad (5)$$
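The three scalar features above reduce to a few lines of NumPy. The sketch below assumes word is a binary array with foreground pixels equal to 1; it is an illustration, not the paper's code.

import numpy as np

def scalar_features(word):
    h, w = word.shape
    ratio = w / h                            # (1) width to height ratio
    density = 100.0 * word.sum() / (w * h)   # (2) word area density, Eq. (1)

    # (3) center of gravity via the normalized moments of Eqs. (2)-(4):
    # M(0,0) counts black pixels, M(1,0) sums x/width, M(0,1) sums y/height.
    ys, xs = np.nonzero(word)
    cx = (xs / w).sum() / len(xs)
    cy = (ys / h).sum() / len(ys)
    cog = np.hypot(cx, cy)                   # Eq. (5): distance to top-left corner
    return ratio, density, cog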

(4) Vertical projection: This feature is a vector of twenty (20) elements, extracted from the smoothed and normalized vertical projection of the word image (Fig. 4). These elements correspond to the first twenty (20) coefficients of the discrete cosine transform (DCT) of the smoothed and normalized vertical projection. The smoothed and normalized vertical projection has common width and height properties for all the word images, and its calculation consists of the following steps:

Step 1: The vertical projection VP_orig[i] (Fig. 4(b)) is defined as the number of black pixels residing in each column i.

Step 2: Using Eq. (6), a new projection VP[i] is produced, which has width l and maximum height h. In the proposed method, l and h are set equal to the mean width and height, respectively, of all the word bounding boxes. For our experimental database, l = 167 and h = 50:

$$VP[i] = \frac{h}{h_{max}}\, VP_{orig}\!\left[\, i\,\frac{l_{orig}}{l} \,\right] \qquad (6)$$

where l_orig is the width of the original projection VP_orig[i] and h_max is the height of the word bounding box.

Step 3: The final normalized vertical projection, depicted in Fig. 4(c), is created by applying a 5×1 mean mask to the projection VP[i]. In this way, the final projection is more robust to changes in font size and type.

Fig. 4. A visual representation of the vertical projection calculation: (a) original image; (b) the vertical projection of the original image; and (c) the smoothed and normalized vertical projection.
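A sketch of this feature, assuming SciPy is available for the DCT; the default values l = 167 and h = 50 are the ones reported for the experimental database.

import numpy as np
from scipy.fftpack import dct

def vertical_projection_feature(word, l=167, h=50, n_coeffs=20):
    vp_orig = word.sum(axis=0).astype(float)     # Step 1: black pixels per column

    # Step 2, Eq. (6): resample to width l and rescale so the maximum
    # height is h (h_max is the height of the word bounding box).
    idx = np.arange(l) * len(vp_orig) // l
    vp = h * vp_orig[idx] / word.shape[0]

    # Step 3: 5x1 mean mask for robustness to font size and type.
    vp = np.convolve(vp, np.ones(5) / 5, mode="same")

    return dct(vp, norm="ortho")[:n_coeffs]      # first 20 DCT coefficients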

(5) Top–bottom shape projections: As shown in Fig. 5, the top–bottom shape projections can be considered signatures of the word shape. These signatures lead to a feature vector of 50 elements, where the first 25 values are the first 25 DCT coefficients of the smoothed and normalized top shape projection (Fig. 5(d)) and the remaining 25 values are the first 25 DCT coefficients of the smoothed and normalized bottom shape projection (Fig. 5(e)).

In order to calculate the top shape projection, the word image is scanned from top to bottom. As shown in Fig. 5(b), the first time a black pixel is found, all the following pixels of the same column are converted to black.

The bottom shape projection is found similarly. As shown in Fig. 5(c), the word image is scanned from bottom to top and all the pixels are converted to black until a black pixel is found. The smoothed and normalized shape projections (Fig. 5(d) and (e)) are calculated as described in the vertical projection paragraph.

Fig. 5. A visual representation of the top–bottom shape projection calculations: (a) original image; (b) original top shape projection; (c) original bottom shape projection; (d) the smoothed and normalized top shape projection; and (e) the smoothed and normalized bottom shape projection.
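The two silhouette images can be produced as below (an illustrative NumPy sketch); their column-wise projections are then smoothed, resampled and DCT-compressed exactly as in the vertical projection feature, keeping 25 coefficients each.

import numpy as np

def shape_silhouettes(word):
    h, w = word.shape
    top = np.zeros_like(word)
    bottom = np.zeros_like(word)
    for col in range(w):
        rows = np.nonzero(word[:, col])[0]
        if len(rows):
            # Top scan: from the first black pixel downward, fill with black.
            top[rows[0]:, col] = 1
            # Bottom scan: from the lowest black pixel down to the image
            # bottom, every pixel the upward scan visited becomes black.
            bottom[rows[-1]:, col] = 1
        else:
            bottom[:, col] = 1   # no black pixel found: the whole column fills
    return top, bottom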

(6) Upper grid features: The upper grid features (UGF) form a ten-element vector with binary values, extracted from the upper part of each word image. To calculate it, the image's horizontal projection is first extracted, and from it the upper part of the word is determined by the following algorithm:

Step 1: Smooth the horizontal projection with a mean 5×1 mask.

Step 2: Starting from the top, find the position i in the horizontal projection histogram V(i) where $V(i) \geq \frac{2}{3}H$, as depicted in Fig. 6(a). H is the maximum value of the horizontal projection (max{V(i)}). If the position i is below the half of the word, then the word has no upper part.



Fig. 6. A visual representation of the upper grid and down grid feature extraction procedures: (a) the original word-box image and its horizontal projection; (b) the extracted upper part and its separation into 10 parts; (c) the number of black pixels in each part, [0, 0, 0, 1195, 0, 0, 0, 0, 0, 0]; (d) the final UGF vector, [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]; (e) the extracted lower part and its separation into 10 parts; (f) the number of black pixels in each part, [0, 0, 0, 0, 0, 0, 0, 598, 50, 33]; and (g) the final DGF vector, [0, 0, 0, 0, 0, 0, 0, 1, 1, 0].

Fig. 7. The matching process.


Step 3: Find the position k ∈ (0, i) in the V(i) horizontal projection histogram where V(k) − V(k−1) ≤ 0. Then k defines the upper part of the word. If k has a very small value (2 or 3), the word has no upper part.

The upper part of the word is then separated into ten similar parts, as depicted in Fig. 6(b). The number of black pixels is counted for each part. If this number is larger than the height of the extracted image, the corresponding value of the vector is set to 1; otherwise it is set to 0. Fig. 6(b) and (c) illustrate an example: the height of the extracted image is 43, and the obtained feature vector is shown in Fig. 6(d).

(7) Down grid features: As the name suggests, these are similar to the UGF but are extracted from the lower part of the word image. The down grid features (DGF) are calculated using the same method as the UGF extraction, but this time the search starts from the bottom of the V(i) horizontal projection histogram. The output is again a ten-element vector with binary values. Fig. 6(e) and (f) give an example of the DGF extraction; this time the height of the extracted image in our experimental set was 50 pixels. Fig. 6(g) shows the final feature vector.
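Both grids follow the same recipe, sketched below: the extracted part (upper or lower) is split into ten vertical zones, and a zone is set to 1 when its black-pixel count exceeds the height of the extracted image. Names are illustrative.

import numpy as np

def grid_features(part):
    # `part` is the binary upper (UGF) or lower (DGF) portion of the word image.
    h, w = part.shape
    bounds = np.linspace(0, w, 11, dtype=int)    # ten similar parts
    counts = [int(part[:, bounds[i]:bounds[i + 1]].sum()) for i in range(10)]
    return [1 if c > h else 0 for c in counts]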

Fig. 8. The structure of the descriptor.

4. Comparison

The matching procedure identifies the word images of the documents that are most similar to the query word through the extracted feature vectors. Fig. 7 illustrates the comparison technique.

First, a descriptor is created from the seven extracted features, as shown in Fig. 8. The first element is the width to height ratio, the second the word area density and the third the center of gravity feature. The following twenty (20) elements are those extracted from the vertical projection feature and the next fifty (50) those from the top–bottom shape projection features. Finally, the last twenty (20) elements are those extracted from the upper and down grid features, divided by 10 in order to prevent them from overpowering the other features. The remaining feature values are normalized between 0 and 1.


Table 1. The 30 words used in the evaluation.

1. details 2. potential 3. religion 4. technology 5. advent

6. smoothing 7. culture 8. world 9. between 10. further

11. number 12. Greek 13. might 14. century 15. homage

16. period 17. taxes 18. living 19. growth 20. churches

21. neural 22. foreign 23. smaller 24. extensively 25. eventually

26. diplomatic 27. demands 28. political 29. region 30. break



Next, the Minkowski L1 distance is computed between the descriptor of the query word image and the descriptor of each word in the database:

$$MD(i) = \sum_{k=1}^{93} \left| Q(k) - W(k,i) \right| \qquad (7)$$

where MD(i) is the Minkowski distance of word i, Q(k) is the query descriptor and W(k, i) is the descriptor of word i.

Then the similarity rate of the remaining words is computed. The rate is a normalized value between 0 and 100, which depicts how similar the words of the database are to the query word. The similarity rate for each word is defined as

$$R_i = 100\left(1 - \frac{MD(i)}{\max(MD)}\right) \qquad (8)$$

where R_i is the rate value of word i, MD(i) is the Minkowski distance of word i and max(MD) is the maximum Minkowski distance found in the document database.

Finally, the system presents the documents that contain these words in descending order with respect to the corresponding rate. In our implementation, the documents presented to the user are those with a similarity rate above 70.
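A sketch of this whole matching step (Eqs. (7) and (8) plus the rate-70 cut-off), assuming the descriptors are stored as a hypothetical NumPy array db of shape (number of words, 93):

import numpy as np

def rank_words(query, db, threshold=70.0):
    md = np.abs(db - query).sum(axis=1)      # Eq. (7): L1 distance per word
    rates = 100.0 * (1.0 - md / md.max())    # Eq. (8): similarity rate, 0-100
    order = np.argsort(-rates)               # descending similarity
    return [(int(i), float(rates[i])) for i in order if rates[i] > threshold]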

The query word image is an artificial image depicting the user's query word. It is created by the proposed system with a font height equal to the average height of all the word boxes obtained through the word segmentation stage of the offline operation; in the implemented DIRS, for our experimental set, this average height is 50. The font type of the query image is Arial. However, the smoothing and normalization of the various features described above suppress small differences between font types.

Finally, the created query image is processed in exactly the same way as the document word images, including the preprocessing and feature extraction procedures.
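Query-image synthesis can be sketched with the Pillow library (an assumption; the actual system renders through .NET). The font file path is hypothetical, and the point size only approximates the 50-pixel target height.

from PIL import Image, ImageDraw, ImageFont

def render_query(word, height=50, font_path="arial.ttf"):
    font = ImageFont.truetype(font_path, height)
    x0, y0, x1, y1 = font.getbbox(word)          # tight bounding box of the text
    img = Image.new("L", (x1 - x0, y1 - y0), color=255)
    ImageDraw.Draw(img).text((-x0, -y0), word, font=font, fill=0)
    return img                                   # black word on white background

The rendered image is then passed through the same preprocessing and feature extraction as the stored word images.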

Fig. 9. A screenshot of the implemented DIRS.

5. Implementation

The proposed system is implemented with Visual Studio 2008 and is based on the Microsoft .NET Framework 3.5. The programming language used is C#.

The document images included in the database were created artificially from various texts, and noise was then added; this made it possible to implement, in parallel, a text search engine that makes it easier to verify and evaluate the search results of the DIRS. Furthermore, the database used by the implemented DIRS is Microsoft SQL Server 2005.

The web address of the implemented system is http://orpheus.ee.duth.gr/irs2_5. Fig. 9 shows a screenshot of the implemented system.

6. Evaluation

The precision and recall metrics have been used to evaluate the performance of the proposed system. Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the database. Precision is the ratio of the number of relevant records retrieved to the total number of records, relevant and irrelevant, that are retrieved. In our evaluation, the precision and recall values are expressed as percentages.
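Expressed over sets of document identifiers, a minimal sketch of these two metrics (illustrative names):

def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    return precision, recall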

The evaluation of the proposed system was based on 100 document images. In order to calculate the precision and recall values, 30 searches were made using random words; Table 1 shows those words. The precision and recall values obtained are depicted in Figs. 10 and 11, respectively. The mean



precision and the mean recall values for this experiment are 87.8% and 99.26%, respectively.

The overall time for the above-mentioned 30 searches on our AMD Athlon 64 4200+ test server with 2 GB of RAM is 11.53 s, while the mean time per search is approximately 0.38 s.

Fig. 10. The variation of the precision coefficients of the proposed DIRS for 30 searches. The mean precision is 87.8%.

Fig. 11. The variation of the recall coefficients of the proposed DIRS for 30 searches. The mean recall is 99.26%.

Fig. 12. The variation of the precision and recall coefficients of the FineReader 9.0 OCR program for 30 searches. The mean precision is 76.67% and the mean recall is 58.42%.

Fig. 14. The variation of the precision and recall coefficients for 30 searches with the query font "Tahoma". The mean precision is 89.44% and the mean recall is 88.05%.

Fig. 13. The variation of the precision and recall coefficients for 30 searches with the font height of the query image double that of the proposed DIRS. The mean precision is 89.315% and the mean recall is 97.593%.

Furthermore, the same 100 document images were processed with the FineReader 9.0 (ABBYY FineReader, 2007) OCR program, which translated all the characters of the document images into machine-editable text documents. These text documents were then searched for the same 30 words of Table 1. The precision and recall values obtained are depicted in Fig. 12. The mean precision and mean recall values are 76.667% and 58.421%, respectively, considerably lower than those of the proposed system.

An experiment was then made to test how the features of the proposed system react to the scaling of the word. The font height of the query word was changed to double (100 pixels) instead of being equal to the average height (50 pixels) of all the word boxes obtained through the word segmentation stage of the offline operation. The precision and recall values obtained are depicted in Fig. 13. The mean precision and mean recall values are 87.3% and 97.59%, respectively, nearly the same as those of Figs. 10 and 11.

Furthermore, in order to test the robustness of the features with respect to font type, the query font was changed to "Tahoma" and the same 30 searches were made (Table 1). These two fonts have many differences without compromising the shape of the word. The precision and recall values obtained are depicted in Fig. 14. The mean precision and mean recall values are 89.44% and 88.05%, respectively.

7. Conclusion

A document image retrieval system was presented here. The proposed system makes use of document image processing techniques in order to extract powerful features for the description of the word images. Seven meaningful features were used,


namely the width to height ratio, the word area density, the center of gravity feature, 20 DCT coefficients extracted from the vertical projection, 50 DCT coefficients extracted from the top–bottom shape projections and 20 elements extracted from the upper and down grids. These features were selected in such a way that they describe the shape of the query words satisfactorily while at the same time suppressing small differences due to noise and to font size and type.

Our experiments were performed on a collection of noisy documents and gave a mean precision of 87.8% and a mean recall of 99.26%. The same experiment performed on the same database with a commercial OCR package gave lower results, while experiments with different font sizes and styles did not produce significant changes in performance.

Acknowledgments

This paper is part of the 03ED679 research project, implemented within the framework of the Reinforcement Programme of Human Research Manpower (PENED) and co-financed by National and Community Funds, 25% from the Greek Ministry of Development – General Secretariat of Research and Technology and 75% from the E.U. – European Social Fund.

References

ABBYY FineReader. http://finereader.abbyy.com/, 2007.

Balasubramanian, A., Meshesha, M., Jawahar, C.V. Retrieval from document image collections. In: Proceedings of DAS 2006, 2006, pp. 1–12.

Doermann, D., 1998. The indexing and retrieval of document images: a survey. Computer Vision and Image Understanding 70 (3), 287–298.

Edwards, J., Teh, Y.W., Forsyth, D., Bock, R., Maire, M., Vesom, G. Making Latin manuscripts searchable using gHMM's. In: Proceedings of the 18th Annual Conference on Neural Information Processing Systems, Cambridge, USA, 2004, pp. 385–392.

Ishitani, Y. Model-based information extraction method tolerant of OCR errors for document images. In: Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, USA, 2001, pp. 908–915.

Kavallieratou, E., Fakotakis, N., Kokkinakis, G., 2002. An off-line unconstrained handwriting recognition system. International Journal of Document Analysis and Recognition 4, 226–242.

Kolcz, A., Alspector, J., Augusteijn, M., Carlson, R., Popescu, G.V., 2000. A line-oriented approach to word spotting in handwritten documents. Pattern Analysis & Applications 3 (2), 153–168.

Konidaris, T., Gatos, B., Ntzios, K., Pratikakis, I., Theodoridis, S., Perantonis, S.J., 2007. Keyword-guided word spotting in historical printed documents using synthetic data and user feedback. International Journal on Document Analysis and Recognition 9 (2), 167–177.

Leydier, Y., LeBourgeois, F., Emptoz, H. Textual indexation of ancient documents. In: Proceedings of DocEng '05, November 2–4, 2005, pp. 111–117.

Lu, Y., Tan, C.L., 2004. Information retrieval in document image databases. IEEE Transactions on Knowledge and Data Engineering 16 (11), 1398–1410.

Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9 (1), 62–66.

Pratt, W.K., 2007. Digital Image Processing, fourth ed. John Wiley & Sons, Inc., New York, NY.

Rath, T.M., Manmatha, R. Word image matching using dynamic time warping. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2003, pp. 521–527.

Rath, T.M., Manmatha, R., Lavrenko, V. A search engine for historical manuscript images. In: Proceedings of the ACM SIGIR Conference, 2004, pp. 369–376.

Salton, G., 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.

Tan, C.L., Huang, W., Yu, Z., Xu, Y., 2002. Imaged document text retrieval without OCR. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 838–844.

Tukey, J.W., 1974. Exploratory Data Analysis. Addison-Wesley, Reading, MA.