
AdaBoost Learning for Detecting and Recognizing Text

X. Chen
Department of Statistics
University of California at Los Angeles
Los Angeles, CA 90095

Alan Yuille
Department of Statistics
University of California at Los Angeles
Los Angeles, CA 90095
[email protected]

In Computer Vision and Pattern Recognition, Vol. 2, pp. 366-373, June 2004


AdaBoost Learning for Detecting and Reading Text in City Scenes

X. Chen and A.L. Yuille
Department of Statistics
University of California at Los Angeles
Los Angeles, CA 90095
[email protected]

Preliminary version. To appear as an oral presentation in Computer Vision and Pattern Recognition 2004. Not for distribution.

Abstract

This paper gives an algorithm for detecting and reading text in natural images. The algorithm is intended for use by blind and visually impaired subjects walking through city scenes. We first obtain a dataset of city images taken by blind and normally sighted subjects. From this dataset, we manually label and extract the text regions. Next we perform statistical analysis of the text regions to determine which image features are reliable indicators of text and have low entropy (i.e. the feature response is similar for all text images). We obtain weak classifiers by using joint probabilities for feature responses on and off text. These weak classifiers are used as input to an AdaBoost machine learning algorithm to train a strong classifier. In practice, we trained a cascade with 4 strong classifiers containing 79 features. An adaptive binarization and extension algorithm is applied to those regions selected by the cascade classifier. A commercial OCR program is used to read the text or reject it as a non-text region. The overall algorithm has a success rate of over 90% (evaluated by complete detection and reading of the text) on the test set, and the unread text is typically small and distant from the viewer.

1. Introduction

This paper presents an algorithm for detecting and reading text in city scenes. This text includes stereotypical forms, such as street signs, hospital signs, and bus numbers, as well as more variable forms such as shop signs, house numbers, and billboards. Our database of city images was taken in ZZZZ, partly by normally sighted viewers and partly by blind volunteers who were accompanied by sighted guides (for safety reasons), using automatic camera settings and with little practical knowledge of where the text was located in the image. The databases have been labelled to enable us to train part of our algorithm and to evaluate the algorithm's performance.

The first, and most important, component of the algorithm is a strong classifier which is trained by the AdaBoost learning algorithm [4], [19], [20] on labelled data. AdaBoost requires specifying a set of features from which to build the strong classifier. This paper selects this feature set guided by the principle of informative features (the feature set used in [19] is not suitable for this problem). We calculate joint probability distributions of these feature responses on and off text, so weak classifiers can be obtained as log-likelihood ratio tests. The strong classifier is applied to sub-regions of the image (at multiple scales) and outputs text candidate regions. In this application, there are typically between 2-5 false positives in images of 2,048 x 1,536 pixels. The second component is an extension and binarization [12] algorithm that acts on the text region candidates. It takes the text regions as inputs, extends these regions so as to include text that the strong classifier did not detect, and binarizes them (ideally, so that the text is white and the background is black). The third component is an OCR software program which acts on the binarized regions (the OCR software gave far worse performance when applied directly to the image). The OCR software either determines that the regions are text, and reads them, or rejects them as non-text.

The performance is as follows: (I) Speed. The current algorithm runs in under 9 seconds on images of size 2,048 by 1,536. The biggest delay is applying the strong classifier to the image in the detection stage. By using multiscale techniques, it should be possible to reduce the run time. (II) Quality of results. We are able to detect text of almost all forms with a false negative rate of 2.8%. We are able to read the detected text correctly at a rate of 93.0% (correctness is measured per complete word and not per letter). We incorrectly read non-text as text in 10% of cases, but only 1% remains incorrectly read after we prune out text which does not form coherent words. (Many of the remaining errors correspond to outputting "111" due to vertical structures in the image.)



2 Previous Work

There has been recent successful work on detecting text in images. Some has concentrated on detecting individual letters [1], [6], [7]. More relevant work is reported in [23], [10], [22], [8], [9]. In particular, Lucas et al. [10] report on a performance analysis of text detection algorithms on a standardized database. It is hard to do a direct comparison to these papers. None of these methods use AdaBoost learning, and the details of the algorithms evaluated by Lucas et al. are not given. The performance we report in this paper is better than those reported in Lucas et al., but the datasets are different and more precise comparisons on the same datasets are needed. We will be making our dataset available for testing.

3 The Datasets

We used two image datasets, one used for training the AdaBoost learning algorithm and the other used for testing it.

The training dataset consisted of 162 images, of which 41 were taken by scientists from XXXX (name withheld for confidentiality) and the rest taken by blind volunteers under the supervision of scientists from XXXX.

The test dataset of 117 images was taken entirely by blind volunteers. Briefly, the blind volunteers were equipped with a Nikon camera mounted on the shoulder or the stomach. They walked round the streets of YYYY taking photographs. Two observers from XXXX accompanied the volunteers to assure their safety but took no part in taking the photographs. The camera was set to the default automatic setting for focus and contrast gain control.

From the dataset construction, see figure (1), we noted that: (I) Blind volunteers could keep the camera approximately horizontal. (II) They could hold the camera fairly steady, so there was very little blur. (III) The automatic contrast gain control of the cameras was almost always sufficient to allow the images to have good contrast.

4. Selection of Features for AdaBoost

The AdaBoost algorithm is a method for combining a set of weak classifiers to make a strong classifier. The weak classifiers correspond to image features. Typically a large set of features is specified in advance and the algorithm selects which ones to use and how to combine them.

The problem is that the choice of feature set is critical to the success and transparency of the algorithm. The set of features used for face detection by Viola and Jones [19] consists of a subset of Haar basis functions. But there was no rationale for this choice of feature set apart from computational efficiency. Also, there are important differences

Figure 1: Example images in the training dataset taken by blind volunteers (top two panels) and by scientists from XXXX (bottom two panels). The blind volunteers are, of course, poor at centering the signs and at keeping the camera horizontally aligned.



between text and face stimuli, because the spatial variation per pixel of text images is far greater than for faces. Facial features, such as eyes, are in approximately the same spatial position for any face and have similar appearance. But the positions of letters in text are varied and the shapes of letters differ. For example, PCA analysis of text, see figure (2), has far more non-zero eigenvalues than for faces (where Pentland reported that 15 eigenfaces capture over ninety percent of the variance [14]).

[Figure: curve of cumulative energy captured (0.1 to 1.0) against the number of eigenvalues (0 to 400).]

Figure 2: PCA on our dataset of text images (40 x 20 pixels). Observe that about 150 components are required to capture 90 percent of the variance. Faces require only 15 components to achieve this variance [14].
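To make the measurement behind figure (2) concrete, here is a minimal sketch of how such an energy curve can be computed. The name `patches`, a hypothetical (N, 800) array of vectorized 40 x 20 text windows, is our own assumption:

```python
import numpy as np

def pca_energy_curve(patches):
    """patches: (N, 800) array of vectorized 40 x 20 text windows.
    Returns the cumulative fraction of variance captured by the
    top-k principal components, for k = 1, 2, ..."""
    X = patches - patches.mean(axis=0)        # center the data
    s = np.linalg.svd(X, compute_uv=False)    # PCA spectrum via SVD
    return np.cumsum(s ** 2) / np.sum(s ** 2)

# Components needed for 90% of the variance (about 150 for text here):
# k90 = int(np.searchsorted(pca_energy_curve(patches), 0.90)) + 1
```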

Ideally, we should select informative features which give similar results on all text regions, and hence have low entropy, and which are also good for discriminating between text and non-text. Statistical analysis of the dataset of training text images shows that there are many statistical regularities.

For example, we align samples from our text dataset (precise alignment is unnecessary) and analyze the response of the moduli of the x and y derivative filters at each pixel. The means of the derivatives have an obvious pattern, see figure (3), where the derivatives are small in the background regions above and below the text. The x derivatives tend to be large in the central (i.e. text) region, while the y derivatives are large at the top and bottom of the text and small in the central region. But the variances of the x derivatives are very large within the central region (because letters have different shapes and positions). The y derivatives, however, tend to have low variance, and hence low entropy.

Our first set of features is based on these observations. By averaging over regions we obtain features which have lower entropy. Based on the observation in figure (3), we designed block patterns inside the sub-window corresponding to the horizontal and vertical derivatives. We also designed three symmetrical block patterns, see figure (4), which are chosen so that there is (usually) a text element within each sub-window. This gives features based on the block-wise mean and STD of the intensity and of the moduli of the x and y derivative filters.

[Figure: four surface plots over the 40 x 20 window, showing the means (top row) and standard deviations (bottom row) of the moduli of the horizontal and vertical derivatives.]

Figure 3: The means of the moduli of the x (top left) and y (top right) derivative filters have this pattern. Observe that the average is different directly above/below the text compared to the response on the text. The y derivative is small everywhere. The x derivatives tend to have large variance (bottom left) and the y derivatives have small variance (bottom right).

Figure 4: Block patterns. Features which compute properties averaged within these regions will typically have low entropy, because the fluctuations shown in the previous figure have been averaged out.
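As an illustration of this first feature class, the sketch below computes block-wise means and standard deviations of the intensity and of the derivative moduli. The three-strip block layout is our own illustrative choice, not the paper's exact block patterns:

```python
import numpy as np

def block_features(window):
    """Block-wise means and STDs of intensity and of the moduli of
    the x and y derivatives for one candidate window (2-D array).
    The horizontal-strip layout below is illustrative only."""
    img = window.astype(float)
    dy, dx = np.gradient(img)                 # rows = y, cols = x
    channels = [img, np.abs(dx), np.abs(dy)]
    h = img.shape[0]
    strips = [(0, h // 4), (h // 4, 3 * h // 4), (3 * h // 4, h)]
    feats = []
    for c in channels:
        for r0, r1 in strips:                 # above / on / below the text
            feats.append(c[r0:r1].mean())     # averaging lowers entropy
            feats.append(c[r0:r1].std())
    return np.array(feats)
```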

We build weak classifiers from these features by computing probability distributions. Formally, a good feature f(I) will determine two probability distributions P(f(I) | text) and P(f(I) | non-text). We can obtain a weak classifier by using the log-likelihood ratio test. This is made easier if we can find tests for which P(f(I) | text) is strongly peaked (i.e. has low entropy because it gives similar results for every image of text), provided this peak occurs at a place where P(f(I) | non-text) is small. Such tests are computationally cheap to implement because they only involve checking the value of f(I) within a small range.

We also have a second class of features which are more complicated. These include tests based on the histograms of the intensity, the gradient direction, and the intensity gradient. In ideal text images, we would be able to classify pixels as text or background directly from the intensity histogram, which should have two peaks corresponding to the text and background mean intensities. But, in practice, the histograms typically have only a single peak, see figure (5)



(top right). But by computing a joint histogram over the intensity and the intensity derivative, see figure (5) (bottom left), we are able to estimate the text and background mean intensities. These joint histograms are useful tests for distinguishing between text and non-text.

[Figure: original image (top left); intensity histogram (top right); joint histogram of intensity and modulus of intensity derivative (bottom left); the same histogram in profile (bottom right).]

Figure 5: The original image (top left) has an intensity histogram (top right) with only a single peak. But the joint histogram of intensity and intensity gradient shows two peaks (bottom left), also shown in profile (bottom right). The intensity histogram is contaminated by edge pixels, which have high intensity gradient and intensity values intermediate between the background and foreground mean intensities. The intensity gradient information helps remove this contamination.

Our third, and final, class of features is based on performing edge detection, by intensity gradient thresholding, followed by edge linking. These features are more computationally expensive than the previous tests, so we only use them later in the AdaBoost cascade, see the next section. Such features count the number of extended edges in the image. These are also properties with low entropy, since there will typically be a fixed number of long edges whatever the letters in the text region.
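A rough sketch of this kind of feature is given below. OpenCV's Canny detector stands in for the paper's gradient-threshold edge detector, connected components of the edge map approximate edge linking, and the length threshold is an arbitrary illustrative value:

```python
import cv2
import numpy as np

def count_extended_edges(window, min_length=20):
    """Count 'extended' edges in a grayscale uint8 window. Canny is a
    stand-in for gradient thresholding; connected components of the
    edge map approximate linked edges, and only components with at
    least min_length edge pixels are counted."""
    edges = cv2.Canny(window, 100, 200)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
    areas = stats[1:, cv2.CC_STAT_AREA]       # skip label 0 (background)
    return int(np.sum(areas >= min_length))
```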

In summary, we had: (i) 79 first-class features, including 4 intensity mean features, 12 intensity standard deviation features, and 24 derivative features; (ii) 14 second-class features (histograms); and (iii) 25 third-class features (based on edge linking).

Ideally, we would learn the joint distributions P(f_1(I), ..., f_n(I) | text) and P(f_1(I), ..., f_n(I) | non-text) over all features. In practice, this is impossible because of the dimensionality of the feature set and because we do not know which set of features should be chosen. We would need an immense amount of training data.

[Figure: a two-dimensional joint histogram over a pair of feature responses.]

Figure 6: Joint histograms of the first features that AdaBoost selected.

Instead, we use both single features and joint distributions of pairs of features, followed by log-likelihood ratio tests, as our weak classifiers; see figure (6). These are then combined by standard AdaBoost techniques. It is worth noting that all the weak classifiers selected by AdaBoost are based on joint distributions, indicating that they are more discriminative and making the learning process less greedy.
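A sketch of a pair-of-features weak classifier of this kind, using a 2-D joint histogram; the bin count and variable names are our own assumptions:

```python
import numpy as np

def make_joint_llr_classifier(F_text, F_nontext, n_bins=16, eps=1e-6):
    """Weak classifier from a pair of features via their joint
    distributions. F_text / F_nontext: (N, 2) arrays of the pair's
    responses on text / non-text windows."""
    lo = np.minimum(F_text.min(axis=0), F_nontext.min(axis=0))
    hi = np.maximum(F_text.max(axis=0), F_nontext.max(axis=0))
    edges = [np.linspace(lo[d], hi[d], n_bins + 1) for d in (0, 1)]
    p_on, _, _ = np.histogram2d(F_text[:, 0], F_text[:, 1],
                                bins=edges, density=True)
    p_off, _, _ = np.histogram2d(F_nontext[:, 0], F_nontext[:, 1],
                                 bins=edges, density=True)
    llr = np.log(p_on + eps) - np.log(p_off + eps)   # joint LLR table

    def classify(pair):
        i, j = (int(np.clip(np.searchsorted(edges[d], pair[d]) - 1,
                            0, n_bins - 1)) for d in (0, 1))
        return 1 if llr[i, j] > 0 else -1

    return classify
```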

The result of this feature selection approach is that our final strong classifier, see the next section, uses far fewer filters than Viola and Jones' face detection classifier [19]. This helps the transparency of the system.

5. AdaBoost

The AdaBoost algorithm [4] has been shown to be arguably the most effective method for detecting target objects in images [19]. Its performance on detecting faces [19] compares favourably with other successful algorithms for detecting faces [3, 15, 17, 21, 24] and for detecting text [8], [16], [9], [1].

The standard AdaBoost algorithm learns a "strong classifier" H(x) by combining a set of T "weak classifiers" {h_t(x)} using a set of weights {α_t}:

H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).

The selection of features and weights is learned through supervised training off-line [4]. Formally, AdaBoost uses a set of input data {(x_i, y_i) : i = 1, ..., N}, where x_i is the input, in this case the image windows described below, and y_i is the classification, with y_i = 1 indicating text and y_i = -1 non-text. The algorithm uses a set of weak classifiers denoted by {h(x)}. These weak classifiers correspond to a decision of text or non-text based on simple tests of visual cues (see the next paragraph). These weak classifiers are only required to make the correct classification slightly over half the time.



The AdaBoost algorithm proceeds by defining a set of weights D_t(i) on the samples. At t = 1 the samples are equally weighted, so D_1(i) = 1/N. The update rule consists of three stages. First, the weights are updated by

D_{t+1}(i) = D_t(i) e^{-y_i α_t h_t(x_i)} / Z_t[α_t, h_t],

where Z_t is a normalization factor chosen so that Σ_i D_{t+1}(i) = 1. The algorithm selects the α_t and h_t that minimize Z_t[α_t, h_t]. The process then repeats, and the output is the strong classifier H(x) = sign( Σ_t α_t h_t(x) ). It can be shown that this classifier converges to the optimal classifier as the number of weak classifiers increases [4].

AdaBoost requires a set of classified data with image windows labelled manually as text or non-text. We performed this labelling for the training dataset and divided each text window into several overlapping text segments with a fixed width-to-height ratio of 2:1. This led to a total of 7,132 text segments which were used as positive examples, see figure (7). The negative examples were obtained by a bootstrap process similar to Drucker et al. [2]. First we selected negative examples by randomly sampling windows from the image dataset. After training with these samples, we applied the AdaBoost algorithm to classify all windows in the training images (at a range of sizes). Those misclassified as text were then used as negative examples for retraining AdaBoost. The image regions most easily confused with text were vegetation, repetitive structures such as railings or building facades, and some chance patterns.

The previous section described the weak classifiers weused for training AdaBoost.

Figure 7: Positive examples used for training AdaBoost. Observe the low quality of some of the examples.

We used standard AdaBoost training methods to learn the strong classifier [4], [5], combined with Viola and Jones' cascade approach, which uses asymmetric weighting [19]. The cascade approach enables the algorithm to rule out most of the image as text locations with a few tests (so we do not have to apply all the tests everywhere in the image). This makes the algorithm extremely fast when applied to the test dataset and yields an order-of-magnitude speed-up over standard AdaBoost [19]. Our algorithm had a total of 4 cascade layers, with 2, 10, 30, and 50 tests respectively. The overall algorithm uses 92 different feature tests. The first three layers of the cascade only use mean, STD, and derivative-modulus features, since these can be easily calculated from integral images [19]. The computation-intensive features, histograms and edge linking, involve all pixels inside the sub-window, so we only allow them to be selected in the last layer.
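In code, a cascade of this kind reduces to sequential early rejection. A minimal sketch, where the layer interfaces (scoring callables and per-layer thresholds) are our assumptions:

```python
def cascade_classify(window, layers):
    """layers: list of (strong_classifier, threshold) pairs, cheapest
    first. A window is rejected as soon as one layer's score falls
    below its threshold, so most non-text windows exit after the
    first few tests."""
    for strong, thresh in layers:
        if strong(window) < thresh:
            return False        # rejected as non-text
    return True                 # passed every layer: text candidate
```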

In the test stage, we applied the AdaBoost strong classifier H(x) to windows of the input images at a range of scales. There was a total of 14 different window sizes, ranging from 20 by 10 to 212 by 106, with a scaling factor of 1.2. Each window was classified by the algorithm as text or non-text. There was often overlap between windows classified as text; we merged these regions by taking the union of the text windows. (Much of the run time of our algorithm is due to applying the strong classifier to the image windows at these different scales. A multiscale approach should speed this up significantly.)
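For concreteness, the window-size schedule can be generated as follows; the rounding scheme is our assumption, and the exact largest size depends on how intermediate sizes are rounded:

```python
def window_sizes(w0=20, h0=10, factor=1.2, n=14):
    """Generate the n sliding-window sizes, starting at w0 x h0 and
    growing by `factor` per step (here 20 x 10 up to roughly the
    paper's 212 x 106, depending on rounding)."""
    sizes, w, h = [], float(w0), float(h0)
    for _ in range(n):
        sizes.append((int(round(w)), int(round(h))))
        w *= factor
        h *= factor
    return sizes
```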

In our test stage, AdaBoost gave very high performance with low false positives and false negatives (in agreement with previous work on faces [19]). When applied to over 20,000,000 image windows, taken from 35 images, the total number of false positives was just 118 and the number of false negatives was 27. By altering the threshold we could reduce the number of false negatives to 5, but at the price of raising the number of false positives, see table (1). We decided not to alter the threshold, so as to keep the number of false positives down to an average of 4 per image (almost all of which are eliminated at the reading stage).

Object | Thresh. | False Pos. | False Neg. | Images | Subwindows
Text   |  0.00   |    118     |     27     |   35   | 20,183,316
Text   | -0.05   |   1879     |      5     |   35   | 20,183,316

Table 1: Performance of AdaBoost at different thresholds. Observe the excellent overall performance and the trade-off between false positives and false negatives.

We illustrate these results by showing the windows that AdaBoost classifies as text for typical images in the test dataset, see figure (8).

6 Extension and Binarization

Our next stage produces binarized text regions to be used as inputs to the OCR reading stage. (It is possible to run OCR directly on intensity images, but we obtain substantially worse performance if we do so.) In addition to binarization, we must extend the text regions found by the AdaBoost strong classifier, because these regions sometimes miss letters or digits at the start and end of the text.

We start by applying adaptive binarization [12] to the text regions detected by the AdaBoost strong classifier. This is followed by a connected component algorithm [13] which detects letter and digit candidates and enables us to estimate



Figure 8: Results of AdaBoost on typical test images (taken by blind subjects). The boxes display areas that AdaBoost classifies as text. Observe that text is detected at a range of scales and at non-horizontal orientations. Note the small number of false positives. The boxes will be expanded and binarized in the next processing stage.

their size and the spacing between them. These estimates are used to extend the search for text into regions directly to the left, right, above, and below the regions detected by AdaBoost. Binarization is then applied in these extended text regions.

More precisely, we use Niblack's adaptive binarization algorithm [12], which was reported by Wolf [22] to be the most successful binarization algorithm (jointly with Yanowitz-Bruckstein's method [25]). This requires selecting both a threshold for the local sub-windows of the text region and an appropriate size for the sub-window. The sub-windows have vertical and horizontal lengths in the range 3-11 pixels. At each point in the text region, the sub-window is chosen to be the smallest possible sub-window whose variance is above a fixed threshold. The binarization is then performed by selecting a threshold T(x, y) based on the mean m(x, y) and standard deviation s(x, y) within the sub-window:

T(x, y) = m(x, y) + k s(x, y)

The value k adjusts how much of the foreground object edges is taken as part of the objects. (For example, the sign of k is chosen to check for cases where the foreground is brighter or darker than the background.)

We show results for the extension and binarization algorithms in figure (9), using the text regions shown in figure (8).

7 Text Reading

We applied commercial OCR software to the extended text regions (produced by AdaBoost followed by extension and binarization). This was used both to read the text and to discard false positive text regions.

Overall, the AdaBoost strong classifier (plus extension/binarization) detected 97.2% of the visible text in our test dataset (text that could be detected by a normally sighted viewer). See figure (10) for typical examples of the text that AdaBoost fails to detect. Most of these errors correspond to text which is blurred or badly shadowed. Others occur because we do not train AdaBoost to detect vertical text or individual letters. (Our training examples were horizontal segments usually containing two or three letters/digits.)

For the 286 extended text regions correctly detected by the AdaBoost strong classifier (plus extension/binarization), we obtained a correct reading rate of 93.0% (the proportion of words correctly read). This required a preprocessing stage to scale the text region. The 7% of errors are caused by small text areas. See figure (11) for examples of text that we can read successfully and figure (12) for text that we cannot read.



Figure 9: Extension (left column) and binarization of the text regions shown in figure (8). Note that our OCR software is not yet able to read Chinese or Spanish, so we treat these as non-text.

Figure 10: Examples of text that the AdaBoost strong classifier fails to detect. Some are blurred, badly shaded, or in highly non-standard fonts. Others are not detected because we did not train AdaBoost to detect individual letters/digits or vertical text.

Figure 11: Examples of text that can be correctly detected and read. First row: road signs and street numbers. Second and third rows: commercial and informational signs. Fourth row: bus signs and bus stops. Fifth row: house numbers and phone numbers.

Figure 12: Examples of text that we can detect with the AdaBoost strong classifier but cannot read correctly. These correspond to small and blurred text, but improvements in our binarization process might make some of them readable.

The OCR algorithm will also sometimes misclassify the false positive text regions found by AdaBoost and classify them as text; see the example in figure (13). This occurred for about 10% of false positive text regions, but the text read often made no grammatical sense and can be removed (though this requires an additional stage after the OCR software). The most common remaining errors are text strings like "111" or "Ill", which correspond to vertical edges in the image caused, for example, by iron railings.

Figure 13: Examples of OCR output on non-text. Only the bottom window is incorrectly read as text, "Ill".

The results presented here use the ABBYY FineReader software. Other OCR software we tested, including TOcr and Readiris Pro 8, gives almost identical performance.

8 Summary

This paper used the AdaBoost algorithm to learn a strong classifier for detecting text in unconstrained city scenes. The key element was the choice of the feature set, which



was selected to contain features with low entropy on the positive training examples (so that they give similar responses for any text input). In addition, we used log-likelihood ratio tests on the joint probability distributions of pairs of features. The resulting system is small compared to that used by Viola and Jones for face detection [19], requiring only 92 filters and 4 layers of cascade.

The resulting strong classifier was very effective on our dataset taken by blind users. This database, with ground truth, will be made available to other researchers. The detection stage produced only a small number (2-5) of false positives per image of size 2,048 x 1,536.

To demonstrate the effectiveness of our approach, we used it as a front end to a system which included an extension and binarization algorithm followed by a commercial OCR system. The resulting performance was highly effective.

The algorithm currently runs in 9 seconds on a 2,048 x 1,536 image (which compares favourably with the speeds of AdaBoost classifiers for faces when the size of the image is taken into account). We anticipate that multi-scale processing will enable us to significantly reduce the run time.

Our future work involves developing alternative text reading software. Although current OCR algorithms are, as we have shown, very effective, they remain a black box and we cannot modify or improve them. Instead, we will continue developing our own reading algorithms based on deformable templates [6], [1]. These algorithms have the additional advantage that they use generative models [18] and can be applied directly to the image intensity without requiring binarization.

References

[1] S. Belongie, J. Malik, and J. Puzicha. "Matching shapes". In Proc. of the IEEE Intl. Conf. on Computer Vision, 2001.

[2] H. Drucker, R. Schapire, and P. Simard. "Boosting Performance in Neural Networks". International Journal of Pattern Recognition and Artificial Intelligence, Vol. 7, no. 4, pp. 705-719, 1993.

[3] F. Fleuret and D. Geman. "Coarse-to-Fine Face Detection". International Journal of Computer Vision, 2000.

[4] Y. Freund and R. Schapire. "Experiments with a new boosting algorithm". In Proc. of the Thirteenth Int. Conf. on Machine Learning, pp. 148-156, 1996.

[5] J. Friedman, T. Hastie, and R. Tibshirani. "Additive logistic regression: a statistical view of boosting". Dept. of Statistics, Stanford University Technical Report, 1998.

[6] S. Gold, A. Rangarajan, C.-P. Lu, S. Pappu, and E. Mjolsness. "New algorithms for 2D and 3D point matching: pose estimation and correspondence". Pattern Recognition, 31(8), 1998.

[7] M. Revow, C.K.I. Williams, and G.E. Hinton. "Using generative models for handwritten digit recognition". IEEE Trans. PAMI, 18, pp. 592-606, 1996.

[8] A.K. Jain and B. Yu. "Automatic Text Localization in Images and Video Frames". Pattern Recognition, 31(12), pp. 2055-2076, 1998.

[9] H. Li, D. Doermann, and O. Kia. "Automatic Text Detection and Tracking in Digital Video". IEEE Transactions on Image Processing, 9(1):147-156, 2000.

[10] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. "ICDAR 2003 Robust Reading Competitions". In 7th International Conference on Document Analysis and Recognition (ICDAR 2003), 2003.

[11] J. Malik, S. Belongie, T. Leung, and J. Shi. "Contour and Texture Analysis for Image Segmentation". IJCV, 43(1):7-27, June 2001.

[12] W. Niblack. An Introduction to Digital Image Processing. pp. 115-116, Prentice Hall, 1986.

[13] T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Berlin-Heidelberg-New York, 1977.

[14] M. Turk and A. Pentland. "Eigenfaces for recognition". Journal of Cognitive Neuroscience, vol. 3, pp. 71-86, 1991.

[15] H. Rowley, S. Baluja, and T. Kanade. "Neural network-based face detection". IEEE Trans. Patt. Anal. Mach. Intell., Vol. 20, pp. 22-38, 1998.

[16] T. Sato, T. Kanade, E. Hughes, and M. Smith. "Video OCR for Digital News Archives". In IEEE International Workshop on Content-Based Access of Image and Video Databases, 1998.

[17] H. Schneiderman and T. Kanade. "A Statistical Method for 3D Object Detection Applied to Faces and Cars". In Computer Vision and Pattern Recognition, 2000.

[18] Z.W. Tu and S.C. Zhu. "Image Segmentation by Data-Driven Markov Chain Monte Carlo". IEEE Trans. PAMI, vol. 24, no. 5, May 2002.

[19] P. Viola and M. Jones. "Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade". In Proceedings NIPS'01, 2001.

[20] P. Viola, M. Jones, and D. Snow. "Detecting Pedestrians using Patterns of Motion and Appearance". In International Conference on Computer Vision, 2003.

[21] M. Weber, W. Einhäuser, M. Welling, and P. Perona. "Viewpoint-Invariant Learning and Detection of Human Heads". In Proc. 4th IEEE Int. Conf. on Automatic Face and Gesture Recognition, 2000.

[22] C. Wolf and J.-M. Jolion. "Extraction and Recognition of Artificial Text in Multimedia Documents". http://rfv.insa-lyon.fr/~wolf/papers/tr-rfv-2002-01.pdf.

[23] V. Wu, R. Manmatha, and E. M. Riseman. "Finding text in images". In Proc. of the 2nd ACM Conf. on Digital Libraries, pp. 3-12, 1997.

[24] M.-H. Yang, N. Ahuja, and D. Kriegman. "Face Detection Using Mixtures of Linear Subspaces". In Proc. 4th IEEE Int. Conf. on Automatic Face and Gesture Recognition, 2000.

[25] S. D. Yanowitz and A. M. Bruckstein. "A New Method for Image Segmentation". Computer Vision, Graphics, and Image Processing (CVGIP), Vol. 46, no. 1, pp. 82-95, 1989.
