Source: staff.itee.uq.edu.au/lovell/aprs/wdic2003/CDROM/141.pdf

Detection of Unknown Forms from Document Images

Andrew Busch, Wageeh W. Boles, Sridha Sridharan, Vinod Chandran
[email protected] [email protected] [email protected] [email protected]

Research Concentration in Speech, Audio and Video Technology, Queensland University of Technology, Brisbane, Qld, Australia

Abstract

This paper presents a novel technique for distinguishing images of forms from other document images. The proposed algorithm detects regions which are likely to be used for text entry, such as lines, boxes, and character entry fields, and calculates a probability of the document being a form based on the presence of such structures. Experimental results from testing on both filled and unfilled forms, as well as a selection of non-form documents, are presented. All document images are assumed to have been scanned at a known resolution.

INTRODUCTION

The extraction and processing of information contained in printed forms is a task of great importance in many areas of business and government alike. To date, the vast majority of form processing has been done manually, with human operators performing all of the associated tasks up to and including data entry. In recent times, a large amount of research has been undertaken in the fields of form identification, field location and data extraction. All of the research to date, however, makes the assumption that the image to be analysed is indeed a form, which may not always be the case in many applications. For this reason, this paper presents a technique for classifying a document image as either a form or non-form, and identifying likely field areas within any forms detected.

Previous work in the field of form field detection has provided an excellent starting point for this research. The technique proposed by Wang and Srihari [1] removes isolated characters, then searches for intersections of line segments. Yuan et al. [2] present a method of detecting fields in forms that relies on segmentation algorithms to find text and straight lines, and uses adjacency graphs to detect possible entry fields in form images with no text entered. Xingyuan et al. [3] propose a more robust technique which detects rectangular fields and lines regardless of text or other markings, but does not explicitly detect other form structures.

A number of techniques have also been proposed to remove the effects of noise and poor image acquisition, which can often cause unwanted line breaks, false intersections and broken junctions [4-6].

The work presented in this paper is in two parts. The first section describes a technique for detecting the primitive data entry structures that distinguish forms from other documents, namely lines, bounded rectangular areas, checkboxes, and character cell fields, or ‘tooth’ structures. In the second section we attempt to determine if an unknown document is likely to be a form. Using the presence of the previously detected structures, combined with the amount of text found in the document, a form probability score is proposed as an indication of the likelihood of the candidate document being a form.

Results from experiments over 100 form and 200 non-form document images from a variety of sources are presented.

DETECTION OF FORM STRUCTURES

An initial investigation of documents contained in [7] has identified four major structural elements which can be used to identify forms. These are: horizontal lines (either solid or dotted), bounded rectangles, small checkboxes, and character cells or ‘tooth’ structures (Fig 1). Examination of all training data has shown that every form document contains one or more such structures. Detecting such structures in complex document images, however, is not a trivial problem. Attempting to segment a document image and classify regions is problematic due to frequent overlapping of neighboring regions, especially when dealing with completed forms. More traditional shape recognition techniques such as the generalized Hough Transform [8] are also inaccurate in the presence of noise, and also quite slow computationally. As all of the desired regions consist entirely of vertical and horizontal lines, our approach to the detection problem begins with finding all such lines in the candidate image. Once these lines are found, each is further processed to determine if it is a likely form structure.

Line Detection

We define a ‘line’ in a document image to be a contiguous or near-contiguous sequence of n ‘on’ pixels in the horizontal (vertical) direction, where n is directly proportional to the resolution of the image. As the smallest lines of interest are approximately the same width as a character, n is chosen to correspond with a distance of 2mm in the original document. To detect such a sequence, we employ a one-dimensional summing filter in the horizontal (vertical) direction defined by the equation


S(x, y) = \frac{1}{n} \sum_{i=x}^{x+n} I(i, y)    (1)

where I(x,y) is the original binary image. By applying a threshold to the resulting image, the starting points of all possible lines can be found.

L_H = S(x, y) \geq \tau    (2)

Binary morphological operations can then be used to extend these starting points across all n pixels in the line segment. Figure 2 shows the result of line detection on a typical form image.
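As a concrete illustration, the summing filter of Eq. (1) and the threshold of Eq. (2) can be sketched in a few lines of NumPy. The function names, the cumulative-sum formulation, and the edge handling (windows clipped at the image border) are our own assumptions, not details from the paper:

```python
import numpy as np

def summing_filter(image, n):
    """S(x, y) = (1/n) * sum_{i=x}^{x+n} I(i, y), applied along each row.

    image: 2-D 0/1 array indexed [y, x]; windows are clipped at the
    right-hand edge rather than wrapping (an assumed edge policy).
    """
    cols = image.shape[1]
    # pad one zero on the left and n zeros on the right, then take a
    # cumulative sum so each window sum is a difference of two entries
    c = np.cumsum(np.pad(image, ((0, 0), (1, n)), constant_values=0), axis=1)
    return (c[:, n + 1 : n + 1 + cols] - c[:, :cols]) / n

def line_starts(image, n, tau):
    """L_H: candidate line starting points where S(x, y) >= tau (Eq. 2)."""
    return summing_filter(image, n) >= tau
```

A threshold close to 1 then demands an almost fully ‘on’ window; a binary dilation along the row direction can play the part of the morphological extension step.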

The detection process thus outlined is successful at detecting regions likely to contain lines, but also gives rise to a number of false positives. In particular, large black regions in the document such as images, thick vertical lines and large sections of text are often falsely detected as horizontal lines. To remove such regions we first segment the line image L_H into connected components, and calculate the height and width of each component. Components which do not satisfy a minimum width and width:height ratio are removed. This process also has the effect of removing valid horizontal lines which are connected to thick vertical lines; however, as such lines are almost always borders or part of images, this is not undesirable.
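This false-positive filter can be sketched with SciPy's connected-component labelling; the threshold parameters min_width and min_ratio are illustrative placeholders, as the paper does not state its exact values:

```python
import numpy as np
from scipy import ndimage

def filter_line_components(LH, min_width, min_ratio):
    """Keep only connected components of the candidate-line mask LH whose
    bounding box satisfies the minimum width and width:height ratio tests."""
    labels, _ = ndimage.label(LH)
    keep = np.zeros(LH.shape, dtype=bool)
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        height = sl[0].stop - sl[0].start
        width = sl[1].stop - sl[1].start
        if width >= min_width and width / height >= min_ratio:
            keep |= labels == i  # retain this component's pixels only
    return keep
```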

Vertical lines by themselves do not constitute a possible text entry field. For this reason, all vertical lines which do not at some point cross a valid horizontal line are also removed. To allow for noise, small breaks in lines, and scanning errors, we relax this constraint somewhat, allowing vertical lines which are close (within n pixels) to either a horizontal line or another valid vertical line to be kept as well.

Line Grouping

Once all possible lines have been detected, we then attempt to combine these lines to form one of the four form structures.

In order to detect character cells or ‘tooth’ structures, each horizontal line is analysed for vertical line crossings, or near crossings. Such crossings must extend significantly in the vertical direction, since we assume that the horizontal line represents the bottom of the tooth structure. We then look for a periodic structure within these crossings, constrained by likely cell size. Due to noise, handwriting or other markings within the structure, it is possible that extra vertical crossings unrelated to the structure are present. In order to allow for this, an algorithm has been developed as follows:

for every vertical line crossing not already part of a structure:
    search for more crossings within search distance x
    for each such crossing found:
        search the line at the same distance ±5%
        if another crossing is found:
            recalculate the mean distance and search again
    if #crossings > 4, a structure is found

The search distance x is proportional to the resolution of the document, and we have used a range of n-5n in our experiments with good results. In order to reduce the false detection rate, we have also enforced a criterion whereby a structure is not considered valid if more than half of its crossings are not fully joined. Finally, we search for a top bounding line, which is defined as a horizontal line within n-5n of the original line, which crosses (or nearly crosses) each vertical segment of the structure. If two or more such lines are found, only the closest to the baseline is taken. Any such line found will still be considered for the baseline of further tooth structures.
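The crossing-search algorithm can be prototyped directly on a sorted list of crossing positions along one horizontal line. The representation (scalar x-positions), the handling of the ±5% tolerance, and the greedy run-building below are our reading of the pseudocode, not the authors' implementation:

```python
def find_tooth_runs(crossings, x_max, tol=0.05, min_count=4):
    """Find near-periodic runs of vertical-line crossings along one
    horizontal line.  x_max is the maximum initial search distance
    (the paper uses the range n to 5n); a run is accepted when it
    contains more than min_count crossings."""
    crossings = sorted(crossings)
    used, runs = set(), []
    for i, start in enumerate(crossings):
        if start in used:
            continue
        for j in range(i + 1, len(crossings)):
            period = crossings[j] - start
            if period > x_max:
                break  # list is sorted, so later candidates are farther still
            run, mean = [start, crossings[j]], float(period)
            while True:
                target = run[-1] + mean
                # look for a crossing at the mean spacing, within +/- 5%
                nxt = [c for c in crossings
                       if c > run[-1] and abs(c - target) <= tol * mean]
                if not nxt:
                    break
                run.append(nxt[0])
                # recalculate the mean distance, then search again
                mean = (run[-1] - run[0]) / (len(run) - 1)
            if len(run) > min_count:
                runs.append(run)
                used.update(run)
                break
    return runs
```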

For the detection of rectangles and boxes, we use a similar algorithm to that proposed in [3], whereby each set of candidate lines is checked to determine whether it forms an enclosed area. In order to prevent rectangles being found in locations already covered by previously detected tooth structures, baselines of such structures are only considered as the top of a rectangle. Small breaks in the perimeter of rectangles are permitted, so long as they do not exceed 5% of the total distance. Rectangles that are completely covered by other rectangles are then removed. Regions whose area exceeds a certain size threshold are also removed, as these are unlikely to be text entry fields, and are more likely borders or frames.
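The enclosure test with the 5% break tolerance can be sketched as follows; the segment representation and the way gaps are totalled against the perimeter are assumptions for illustration:

```python
def encloses_area(top, bottom, left, right, max_break=0.05):
    """Decide whether four axis-aligned segments enclose a rectangle,
    tolerating perimeter breaks of up to 5% of the total perimeter.
    Segments are (start, end) coordinate pairs: horizontal segments as
    ((x0, y), (x1, y)), vertical segments as ((x, y0), (x, y1))."""
    (tx0, _), (tx1, _) = top
    (bx0, _), (bx1, _) = bottom
    (_, ly0), (_, ly1) = left
    (_, ry0), (_, ry1) = right
    width = max(tx1, bx1) - min(tx0, bx0)
    height = max(ly1, ry1) - min(ly0, ry0)
    perimeter = 2 * (width + height)
    # total uncovered length: how far each side falls short of its full span
    gaps = ((width - (tx1 - tx0)) + (width - (bx1 - bx0)) +
            (height - (ly1 - ly0)) + (height - (ry1 - ry0)))
    return gaps <= max_break * perimeter
```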

A checkbox is defined as a special case of rectangle, where the following three criteria are met:

• The sides are of equal length (square)

• Side length is within a given range (we use n-5n)

• Sides do not significantly extend beyond the corners of the rectangle

Figure 1. Four common form structures: (a) tooth structure, (b) checkbox, (c) rectangle, (d) line

All horizontal lines that do not form part of any of the above structures are considered lines.
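The three checkbox criteria translate directly into a predicate on a detected rectangle. In this sketch the slack fraction tol and the overhang representation are assumed details, not values given in the paper:

```python
def is_checkbox(width, height, n, overhangs, tol=0.1):
    """Apply the three checkbox criteria to a detected rectangle.

    width, height : side lengths of the rectangle in pixels
    n             : the document's base line length (2 mm of scan)
    overhangs     : lengths by which sides extend past each corner
    tol           : slack fraction; an assumed value, not from the paper
    """
    side = max(width, height)
    square = abs(width - height) <= tol * side               # criterion 1
    in_range = n <= width <= 5 * n and n <= height <= 5 * n  # criterion 2
    modest_overhang = all(o <= tol * side for o in overhangs)  # criterion 3
    return square and in_range and modest_overhang
```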

FORM CLASSIFICATION

The classification of documents into form and non-form classes is achieved using a score based on the presence of previously detected form structures combined with the amount of text contained in the document. Examination of a large number of forms has revealed that most do not contain as much text as other documents of a similar size. Thus, the presence of text in a document image has a negative impact on the probability of that document being a form. Numerous algorithms exist for the segmentation and extraction of printed text from documents, but for accuracy we have manually measured the amount of text present in each test document. As we are only interested in the body text of the document, any large segments such as headlines or titles are not included. We thus define the form probability score as:

p = w_1 d_{tooth} + w_2 d_{box} + w_3 d_{rect} + w_4 d_{line} - w_5 d_{text}    (3)

where d_{type} represents the total horizontal lineal distance covered by the given structure type, and w is a weighting vector. A positive value of p indicates that the document is likely to be a form. In order to obtain a true likelihood estimate, this value can be normalised, such that:

\hat{p} = \frac{p}{d_{tooth} + d_{box} + d_{rect} + d_{line} + d_{text}}    (4)

In order to calculate the weighting vector we have processed a large number of both form and non-form documents, and examined the relationship between the amounts of each structure present. By constructing plots of d_{text} vs d_{type} for each structure type, it can be seen that there exists an almost linear separation between form and non-form documents. We then find the gradient of this line and use it to calculate the corresponding weighting coefficient in w, assuming w_5 = 1. Figure 3 shows an example of such a plot. It should be noted that we could find no non-form documents containing the tooth structure, meaning that the value of w_1 would approach infinity. For this reason we have made this coefficient very large, approximately ten times the value of the next highest coefficient.

Figure 2. Results of form structure detection: (a) original document image, (b) detected vertical line segments, (c) detected horizontal line segments, (d) final form structures.

Figure 3. Plot of d_{line} vs d_{text} for a selection of form (+) and non-form (*) document images.
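One way to realise the gradient estimate is a least-squares fit of d_text against d_type through the origin; the paper does not state how the separating line's gradient is obtained, so the fit method here is an assumption:

```python
import numpy as np

def weight_from_gradient(d_type, d_text):
    """Estimate the weighting coefficient for one structure type as the
    gradient of the (assumed) separating line in the d_text-vs-d_type
    plane, fitted through the origin by least squares (with w5 = 1)."""
    x = np.asarray(d_type, dtype=float)
    y = np.asarray(d_text, dtype=float)
    return float(x @ y / (x @ x))
```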

RESULTS

Experiments were conducted in two stages, using a set of 100 form and 200 non-form images acquired from a variety of sources, including the University of Washington database [7]. Firstly, the form structure detection algorithm was applied to all form images, and results compared to those calculated manually. Overall, 2443 of 2567 (95%) form structures were successfully detected as the correct type. Of those structures that were not detected successfully, approximately two thirds were due to misclassification of one structure as another, with the remainder missed entirely. An additional 181 form structures were falsely detected, with almost all of these being small lines. A typical form image with all detected structures is shown in Figure 2. Table 1 shows the confusion matrix for this experiment.

Table 1. Confusion matrix for detection of form structures

Actual type    Detected type
               tooth   rect.   box    line   missed
tooth            229       0     0       3        0
rect               0     767     9      38        0
box                0      10   318       4       14
line               1      15     1    1253       19
none               0       9     4     168        x

The second stage of experiments involved calculating the normalised form probability score for each test document using the detected structures and known text amounts. Those documents obtaining a positive score were classified as forms, with the remainder classified as non-forms. From a total of 300 (100 form, 200 non-form) document images, 258 were correctly classified. Of those that were misclassified, 6 form images were missed, and 36 non-form images falsely detected. The overall error rate of the test was approximately 14%. Total processing time for both structure detection and form classification is approximately 5 seconds on a Pentium 3 600MHz computer.

CONCLUSIONS AND FUTURE RESEARCH

This paper has presented a technique to distinguish form documents from other types by identifying common structures usually present in such images. Experimental results have shown our algorithm for detecting such structures to be accurate and robust, with over 95% of structures detected correctly. Classification of form and non-form documents is accomplished by comparing the total number of such structures to the amount of text in the document, creating a form probability score. This statistic has been shown to perform well, with almost all form images correctly identified and a false detection rate of under 15%.

Future research will aim to more accurately model thetypical line and rectangle structure in forms by examiningsurrounding text. This should greatly reduce the numberof false positive results.

REFERENCES

[1] D. Wang and S. N. Srihari, "Analysis of form images," Proc. of First International Conference on Document Analysis and Recognition, 1991.

[2] J. Yuan, Y. Tang, and C. Y. Suen, "Four directional adjacency graphs (FDAG) and their application in locating fields in forms," Proc. of Third International Conference on Document Analysis and Recognition, 1995.

[3] L. Xingyuan, D. Doermann, W.-G. Oh, and W. Gao, "A robust method for unknown forms analysis," Proc. of Fifth International Conference on Document Analysis and Recognition, 1999.

[4] H. Shinjo, K. Nakashima, M. Koga, K. Marukawa, Y. Shima, and E. Hadano, "A method for connecting disappeared junction patterns on frame lines in form documents," Proc. of Fourth International Conference on Document Analysis and Recognition, 1997.

[5] O. Hori and D. Doermann, "Robust table-form structure analysis based on box-driven reasoning," Proc. of Third International Conference on Document Analysis and Recognition, 1995.

[6] H. Fujisawa and Y. Nakano, "Segmentation methods for character recognition: from segmentation to document structure analysis," Proceedings of the IEEE, vol. 80, pp. 1079-1092, 1992.

[7] I. Phillips, S. Chen, and R. Haralick, "CD-ROM Document Database Standard," Proc. of Second International Conference on Document Analysis and Recognition, 1993.

[8] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes," Pattern Recognition, vol. 13, pp. 111-122, 1981.