
    Extracting Text from WWW Images

    Jiangying Zhou Daniel Lopresti

Panasonic Information and Networking Technology Laboratory
Two Research Way, Princeton, NJ 08540, USA

Abstract

In this paper, we examine the problem of locating and extracting text from images on the World Wide Web. We describe a text detection algorithm which is based on color clustering and connected component analysis. The algorithm first quantizes the color space of the input image into a number of color classes using a parameter-free clustering procedure. It then identifies text-like connected components in each color class based on their shapes. Finally, a post-processing procedure aligns text-like components into text lines. Experimental results suggest this approach is promising despite the challenging nature of the input data.

1 Introduction

A considerable portion of the text on the World Wide Web (WWW) is embedded in images [4]. Moreover, a significant fraction of this image text does not appear in coded form anywhere on the page (see, for example, [5]). Such HTML documents are currently not being indexed properly: image text is not accessible via existing search engines. To make this information available, techniques must be developed for assessing the textual content of Web images.

One of the difficulties to be faced in detecting and extracting text embedded in images is the prodigious use of color combined with complex backgrounds (see Figure 1). Text in WWW images can be any color and is often intermingled with differently-colored objects in the background.

Figure 1: A WWW image containing text.¹

In this paper, we present an algorithm for extracting text from Web images. The algorithm first quantizes the color space of the input image into a number of color classes using a parameter-free clustering procedure. Pixels are assigned to the color classes closest to their original colors. The algorithm then identifies text-like connected components in each color class based on their shapes. Finally, character components are aligned to form text lines.

¹The images in the figures throughout this paper are originally in color.

The rest of the paper is organized as follows: in Section 2, we give a brief survey of related research. Section 3 describes our approach to text extraction. Section 4 presents experimental results. Finally, we conclude the paper in Section 5.

2 Review of Related Work

Locating text in color images has been addressed in different contexts in the literature. Most of these methods, however, are designed to process scanned images at a much higher resolution than what is available on the WWW. In a paper by Zhong, Karu, and Jain, the authors discussed two methods for automatically locating text in CD cover images [9]. Huang et al. [3] proposed a method based on grouping colors into clusters for foreground/background segmentation of color images. In a paper by Doermann, Rivlin, and Weiss [1], the authors described methods for extracting text (often stylized) from logos and trademark images.

The issue of locating and extracting text from WWW images was discussed in an earlier paper by Lopresti and Zhou, which explored the idea of adapting traditional paper-based document analysis techniques to the World Wide Web [4]. In that paper, we proposed using a connected component analysis for text detection. Another related work is the genetic-algorithm-based technique proposed by Zhou, Lopresti, and Zhou for extracting text from grayscale scanned images [11].

3 Text Extraction

Conceptually, our text extraction algorithm follows a paradigm similar to Zhong et al.'s first method. Our approach assumes that the foreground color is roughly uniform for a given character. In other words, the color of a single character varies slowly, and its edges are distinct. Like Zhong et al.'s method, we first quantize the color space of the input image into color classes by a clustering procedure. We adopt a clustering method which is intuitive and parameter-free. Pixels are then assigned to the color classes closest to their original colors. For each color class, we analyze the shape of its connected components and classify them as character-like or non-character-like. Here we use a set of features which are different from Zhong et al.'s for classification. Finally, a post-processing procedure performs a clean-up operation by analyzing the layout of character components.


3.1 Color Clustering

The clustering method we use is based on the Euclidean minimum spanning tree (EMST) technique. The EMST-based clustering approach is a well-known method and has been researched extensively over the years. It has been shown that the method works well on a variety of distributions [8, 2].

Given N points in an M-dimensional Euclidean space, the Euclidean minimum spanning tree problem is usually formulated as a problem in graph theory: let the N points be nodes, and let there be a weighted edge joining every pair of nodes such that the weight of the edge equals the distance between the two points. The Euclidean minimum spanning tree is the tree connecting all nodes such that the sum total of the distances (weights) is a minimum. Several robust and efficient algorithms are available to compute the EMST from a given point set [7].

It is easy to see how an EMST can be used as a basis for clustering. By definition, each edge in the EMST of a set of points S is the shortest edge connecting two partitions, P and (S - P), of S. Intuitively, points in the same cluster should be connected by shorter edges than points in different clusters. Hence, by removing the longest edges from the EMST we separate the points of S into disjoint clusters.

To apply the EMST clustering algorithm, we view each color as a point in a three-dimensional R-G-B space. For a Web image in GIF format, there are at most 256 unique colors; hence, the number of nodes in an EMST is no more than 256. The distance between two colors is computed as the Euclidean distance between their R-G-B components. Once the EMST is constructed, we compute the average distance of the edges. Edges whose distances are larger than the average are then removed from the EMST.

The EMST thus trimmed may contain several disjoint sub-trees, each of which corresponds to a color cluster. For each cluster, we select the color that occurs most frequently in the image as the representative color of the cluster.
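To make the procedure concrete, the following Python sketch (ours, not the authors' code; all function and variable names are invented) clusters a GIF palette with an EMST and then quantizes the image, with SciPy's graph routines standing in for the EMST algorithms of [7]:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def emst_color_clusters(palette, counts):
    """Cluster palette colors by trimming an EMST.

    palette: (N, 3) array of unique R-G-B colors (N <= 256 for GIF).
    counts:  (N,) pixel frequency of each palette color.
    Returns a cluster label per color and one representative color
    (the most frequent member) per cluster.
    """
    p = palette.astype(float)
    dist = np.sqrt(((p[:, None, :] - p[None, :, :]) ** 2).sum(axis=2))
    mst = minimum_spanning_tree(dist).toarray()   # EMST over the palette
    mst[mst > mst[mst > 0].mean()] = 0            # drop longer-than-average edges
    n_clusters, labels = connected_components(mst, directed=False)
    reps = np.array([palette[labels == c][counts[labels == c].argmax()]
                     for c in range(n_clusters)])
    return labels, reps

def quantize_pixels(image, reps):
    """Assign each pixel of an (H, W, 3) image to the nearest
    representative color; returns an (H, W) class-index map."""
    d = ((image[..., None, :].astype(float) - reps.astype(float)) ** 2).sum(-1)
    return d.argmin(-1)
```

Because a GIF palette has at most 256 entries, the O(N²) distance matrix is tiny, so the exact choice of EMST algorithm matters little here.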

3.2 Text Detection

After the color space is quantized, the next step is to find the connected components belonging to each color class. Here we adopt a procedure based on line adjacency graph analysis [6]. In addition to finding connected components in a bitmap, this procedure can extract features such as how many holes a connected component contains, how many upward ends or downward ends it has, etc.
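The line adjacency graph machinery of [6] also yields the shape features used below; as a rough stand-in for the component-finding part only, standard 8-connected labeling per color class would look like this (a sketch with invented names, not the paper's procedure):

```python
import numpy as np
from scipy import ndimage

def components_per_class(class_map, n_classes):
    """Yield (color class, bounding box) for every connected
    component in the bitmap of each color class.

    class_map: (H, W) array of color-class indices, e.g. the
    output of quantize_pixels above."""
    eight = np.ones((3, 3), dtype=int)          # 8-connectivity structure
    for c in range(n_classes):
        labeled, n = ndimage.label(class_map == c, structure=eight)
        for ys, xs in ndimage.find_objects(labeled):
            yield c, (xs.start, ys.start, xs.stop - 1, ys.stop - 1)
```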

The algorithm then classifies connected components into two classes: text-like and non-text-like. The classification is based on the observation that connected components of characters have different shape characteristics than those from, say, graphical objects. A text component is usually composed of a few strokes, with a relatively uniform width, and typically a small number of holes. A connected component from a non-text region, on the other hand, is irregular, with varying-size segments and many holes (see Figure 2).

    Figure 2: Components from text and halftone regions.

Currently, the classification process consists of a sequence of conditional tests (a code sketch follows the list). A component is not text if:

• it is very small or very large, or
• it is very long, or
• it has very thick strokes, or
• it has many holes, or
• it has many branches (ends).
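As a concrete illustration of these tests, here is one plausible rendering in Python. The feature names and every threshold below are invented for the sketch; the paper only says its parameters were tuned empirically and reports no values:

```python
from dataclasses import dataclass

@dataclass
class ComponentFeatures:
    """Shape features of one connected component (names are ours)."""
    width: int           # bounding-box width in pixels
    height: int          # bounding-box height in pixels
    stroke_width: float  # estimated mean stroke thickness
    holes: int           # number of enclosed background regions
    ends: int            # upward plus downward ends (branch count proxy)

def is_text_like(f, img_area, min_area=8, max_area_frac=0.25,
                 max_elongation=10.0, max_stroke_frac=0.5,
                 max_holes=4, max_ends=8):
    """Apply the five rejection tests; all thresholds are hypothetical."""
    area = f.width * f.height
    short, long_side = min(f.width, f.height), max(f.width, f.height)
    if area < min_area or area > max_area_frac * img_area:  # very small or large
        return False
    if long_side > max_elongation * short:                  # very long
        return False
    if f.stroke_width > max_stroke_frac * short:            # very thick strokes
        return False
    if f.holes > max_holes:                                 # many holes
        return False
    if f.ends > max_ends:                                   # many branches (ends)
        return False
    return True
```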

As an example, Figure 3 shows the result of the classification procedure applied to the image of Figure 1. Sixty connected components are detected as text-like in this example; 21 of these are part of the text in the image.

Figure 3: Result of the classification process.

Finally, the algorithm performs a clean-up operation. It analyzes the geometric layout of the text components. Currently, we assume that there is only horizontal text in the image. In order to eliminate spurious components, we align co-linear text by a horizontal projection. The projection is done by counting the number of bounding boxes covering each row. We then analyze the peaks and valleys along the projection profile and separate text components into horizontal text lines accordingly. For each text line, we merge neighboring components to form words. Words which are too short are eliminated. Figure 4 shows the final result of the text detection algorithm applied to Figure 1.
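The alignment step can be sketched as follows. The sketch reduces the peak/valley analysis to cutting at zero-coverage rows of the profile, which is our simplification of the description above; names are invented:

```python
import numpy as np

def group_into_lines(boxes, img_height):
    """Group text-like components into horizontal text lines.

    boxes: list of (x0, y0, x1, y1) component bounding boxes.
    Builds a horizontal projection profile (how many boxes cover
    each pixel row) and splits lines at uncovered rows (valleys)."""
    profile = np.zeros(img_height, dtype=int)
    for _, y0, _, y1 in boxes:
        profile[y0:y1 + 1] += 1                 # rows this box covers
    bands, start = [], None
    for y, coverage in enumerate(profile):
        if coverage > 0 and start is None:
            start = y                           # a covered band begins
        elif coverage == 0 and start is not None:
            bands.append((start, y - 1))        # a valley ends the band
            start = None
    if start is not None:
        bands.append((start, img_height - 1))
    return [[b for b in boxes if b[1] <= hi and b[3] >= lo]
            for lo, hi in bands]
```

Within each line, one would then merge horizontally adjacent components into words and discard words that are too short, as described above.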

4 Experimental Evaluation

We tested our text extraction procedure on 262 images chosen from real WWW pages. These were selected to cover a variety of graphic design styles commonly seen on the Web. Of the 262 samples, 68 contained fewer than 10 characters, while 53 contained more than 50 characters. The average number of characters per image was 33.



    Figure 4: Result of the text detection algorithm.

Table 1: Character detection rates on the 262 test images.

    Detection rate    Images
    0% - 10%          58
    10% - 20%         18
    20% - 30%         11
    30% - 40%         13
    40% - 50%         19
    50% - 60%         17
    60% - 70%         22
    70% - 80%         25
    80% - 90%         20
    90% - 100%        59

    Average detection rate = 47%

To assess performance, we manually transcribed the original text in each test image, as well as the character components correctly identified by the extraction process in the corresponding output image. We then compared the two transcriptions and computed a character detection rate for each sample. Table 1 summarizes the results.
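The paper does not state the metric as a formula; from the description above, the comparison amounts to

```latex
\text{detection rate} =
  \frac{\#\,\text{characters correctly extracted}}
       {\#\,\text{characters in the original image}}
```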

We found that on about 1/5 of the samples, the detection algorithm extracted more than 90% of the text from the input. Figures 5-10 show some of the successful test results. For each example, the upper image is the input and the lower image is the result. (The connected components detected as text-like are shown in their original colors against a background color not appearing anywhere in the original image.) On the other hand, our method failed almost completely on 1/5 of the samples. On the remaining cases, the algorithm was able to detect a varying amount of text.

A close examination of the results reveals that the causes of failure are many-fold. Some of the samples we used in our test are challenging even for human readers. These include cases where text is wrapped around curves or angled; embossed text; outline and handwritten fonts; very small (nearly unreadable) type sizes; etc. Figures 11-14 show some of the more difficult samples we encountered.

The color clustering procedure may fail when colors in the input image form a long, chain-like spanning tree. Figure 15 shows an example where two seemingly opposite colors (black and white) were grouped into one cluster due to the gradually changing intermediate colors in the image. This example shows that trimming the EMST by removing long edges alone may not be sufficient to determine the correct clusters; additional constraints are needed.

    Figure 5: Test example 1.

    Figure 6: Test example 2.

Our connected component classification procedure relies on several heuristics to weed out non-text components. Unfortunately, these must be tuned empirically. For our experiment, we tried to develop one good set of parameters that would work across a wide range of input images (the same settings were used for all the test samples). However, finding a set of rules which give acceptable results in all situations is difficult. The classification process currently assumes that characters are separable, and that text is horizontal. Since Web images may contain highly stylized graphics, this assumption often causes the classification process to miss text that has broken or touching characters, or that is curved or angled. In Figure 16, we show an example where the text is both angled and touching. Clearly, a way needs to be found to avoid relying on such heuristics during classification.

5 Conclusions

Coded (ASCII) text constitutes only a fraction of the information present in Web pages; a significant amount of the text is embedded in images.


    Figure 7: Test example 3.

    Figure 8: Test example 4.

In this paper, we have described a method for detecting such text.

Our approach is based on color clustering followed by a connected component analysis. The algorithm works reasonably well given the complexity of the input data, suggesting that such techniques could prove useful in Web-based information retrieval applications. We are currently working on improving its robustness.

A related issue that deserves mention is the recognition of WWW image text. This would be a difficult problem even if the text could be located perfectly. One reason for this is that it is typically rendered at a low spatial resolution (72 dpi). We are currently investigating methods for addressing this. For example, in an earlier paper we proposed using a fuzzy n-tuple classifier [10]. The idea is to utilize the information contained in the color bits to compensate for the loss of information due to the low spatial resolution.


    Figure 10: Test example 6.


Figure 11: Difficult example 1 - embossed text.

Initial experimental results suggest that this may be a promising approach as well.

References

[1] D. Doermann, E. Rivlin, and I. Weiss. Logo recognition. Technical Report CAR-TR-688, Document Processing Group, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, October 1993.

[2] R. Graham and P. Hell. On the history of the minimum spanning tree problem. Annals of the History of Computing, 7, 1985.

[3] Q. Huang, B. Dom, D. Steele, J. Ashley, and W. Niblack. Foreground/background segmentation of color images by integration of multiple cues. In Proceedings of Computer Vision and Pattern Recognition, pages 246-249, 1995.

[4] D. Lopresti and J. Y. Zhou. Document analysis and the World Wide Web. In J. Hull and S. Taylor, editors, Proceedings of the Workshop on Document Analysis Systems, pages 417-424, Malvern, Pennsylvania, October 1996.

Figure 9: Test example 5.

Figure 12: Difficult example 2 - wrapped text.


Figure 13: Difficult example 3 - outline font.

Figure 16: Angled and touching text.

    Figure 14: Difficult example 4 - stylized font.


[5] The New York Times. http://www.nytimes.com/.

[6] T. Pavlidis. Algorithms for Graphics and Image Processing. Computer Science Press, Rockville, MD, 1982.

[7] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction, chapter 5. Springer-Verlag, 1985.

[8] C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20(1), 1971.

[9] Y. Zhong, K. Karu, and A. K. Jain. Locating text in complex color images. Pattern Recognition, 28(10):1523-1535, 1995.

[10] J. Y. Zhou, D. Lopresti, and Z. Lei. OCR for World Wide Web images. In Proceedings of the IS&T/SPIE International Symposium on Electronic Imaging, pages 58-65, San Jose, California, February 1997.

[11] J. Y. Zhou, D. Lopresti, and J. Zhou. A genetic approach to the analysis of complex text formatting. In Proceedings of the IS&T/SPIE International Symposium on Electronic Imaging, pages 126-137, San Jose, California, January 1996.

Figure 15: Bad clustering. Left: input; right: result.

