
Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees

Weilin Huang¹,², Yu Qiao¹, and Xiaoou Tang²,¹

¹ Shenzhen Key Lab of Comp. Vis. and Pat. Rec., Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China

² Department of Information Engineering, The Chinese University of Hong Kong, China

Abstract. Maximally Stable Extremal Regions (MSERs) have achieved great success in scene text detection. However, this low-level pixel operation inherently limits its capability for handling complex text information efficiently (e.g. connections between text or background components), leading to difficulty in distinguishing texts from background components. In this paper, we propose a novel framework to tackle this problem by leveraging the high capability of a convolutional neural network (CNN). In contrast to recent methods using a set of low-level heuristic features, the CNN is capable of learning high-level features to robustly identify text components from text-like outliers (e.g. bikes, windows, or leaves). Our approach takes advantage of both MSERs and sliding-window based methods. The MSERs operator dramatically reduces the number of windows scanned and enhances detection of low-quality texts, while the sliding window with the CNN is applied to correctly separate the connections of multiple characters in components. The proposed system achieves strong robustness against a number of extreme text variations and serious real-world problems. It was evaluated on the ICDAR 2011 benchmark dataset and achieved over 78% in F-measure, which is significantly higher than previous methods.

Keywords: Maximally Stable Extremal Regions (MSERs), convolutional neural network (CNN), text-like outliers, sliding window.

1 Introduction

With the rapid evolution and popularization of high-performance mobile and wearable devices in recent years, scene text detection and localization have gained increasing attention due to their wide variety of potential applications. Although recent progress in computer vision and machine learning has substantially improved its performance, scene text detection is still an open problem. The challenge comes from the extreme diversity of text patterns and highly complicated background information. For example, text appearing in a natural image can be very small or have low contrast against the background color, and even regular texts can be distorted significantly by strong lighting, occlusion, or blurring. Furthermore, a large amount of noise and text-like outliers, such as

D. Fleet et al. (Eds.): ECCV 2014, Part IV, LNCS 8692, pp. 497–511, 2014. © Springer International Publishing Switzerland 2014


Fig. 1. The text detection pipeline of our method. The input image is shown in (a). We first apply the MSERs operator on the input image to generate a number of text component candidates (b). We then apply the CNN classifier to generate a component confidence map (c). The components with positive confidence scores are used for constructing text-lines, which are scored by the mean confidence values of the components they include. The final detection results are generated by a simple thresholding (d).

windows, leaves, and bricks, can be included in the image background, and often cause many false alarms in detection.

There are mainly two groups of methods for scene text detection in the literature: sliding-window based and connected component based methods. The sliding-window based methods detect text information by sliding a sub-window at multiple scales through all locations of an image [11,3,9,28,29,18,1]. Text and non-text information is then distinguished by a trained classifier, which often uses manually designed low-level features extracted from the window, such as SIFT and Histograms of Oriented Gradients (HoG) [6]. The main challenges lie in the design of local features to handle the large variance of texts, and in the computational demand of scanning a large number of windows, which may grow to N² for an image with N pixels. Hand-crafted features like SIFT and HoG are effective for describing image content, but they are not optimized for text detection.

The connected component based methods have achieved the state-of-the-art performance in scene text detection. They first separate text and non-text pixels by running a fast low-level filter, and then group the text pixels with similar properties (e.g. intensity, stroke width, or color) to construct component candidates [23,24,22,34,7,31,10,32,2]. The stroke width transform (SWT) [7,31,10] and Maximally Stable Extremal Regions (MSERs) [16,23,24,22,34] are two representative low-level filters applied to scene text detection with great success. The main advantages of these methods are their computational efficiency, detecting text components in a one-pass computation of complexity O(N), and their effective pixel segmentations, which greatly facilitate the subsequent recognition task.

It has been shown that MSERs based methods have a high capability for detecting most text components in an image [24]. However, they also generate a large number of non-text components at the same time, leading to high ambiguity between text and non-text in MSERs components.


Robustly separating them has been a key issue for improving the performance of MSERs based methods. Efforts have been devoted to handling this problem, but most current methods for MSERs pruning focus on developing low-level features, such as heuristic characteristics or geometric properties, to filter out non-text components [23,24,22,34]. These low-level features are not robust or discriminative enough to distinguish true texts from text-like outliers, which often have heuristic or geometric properties similar to true texts. Besides, the MSERs methods are based on pixel-level operations, and hence are highly sensitive to noise or single-pixel corruption. This may lead to incorrect component connections, such as a single component including multiple characters, which significantly affect the performance of text-line construction in the subsequent step.

In order to tackle these inherent problems, this paper aims to develop a robust text detection system by embedding a high-capability deep learning method into the MSERs model, taking the advantages of both MSERs and sliding-window methods. The main contributions of the paper are:

1. We apply a deep convolutional neural network to learn high-level features from the MSERs components. This high-capability classifier correctly distinguishes texts from a large number of non-text components, and shows high discriminative ability and strong robustness against complicated background components (see Fig. 1 and 2), therefore greatly improving the capability of the MSERs based methods.

2. We incorporate the CNN classifier with a sliding-window model and non-maximal suppression (NMS) to handle the multiple-character connection problem of the MSERs, and also to recover missing characters, as shown in Fig. 3. Our method provides better character candidates than previous MSERs methods. This improvement is a crucial technique for the bottom-up scheme of constructing text-lines.

3. Our system has the advantages of both MSERs and sliding-window methods. Compared to traditional sliding-window methods, our method not only reduces the number of search windows, but also enhances the detection of low-contrast texts by using MSERs, as shown in Fig. 2 (a) and (b).

4. Our method achieves state-of-the-art results on the widely cited ICDAR 2011 benchmark with over 78% in F-measure, improving on current results by a large margin.

The rest of the paper is organized as follows. Section 2 describes the details of the proposed system. Experimental results are presented in Section 3, followed by the conclusions in Section 4.

2 Our System

The proposed text detection system includes three main steps, as shown in Fig. 1. Text components are first generated by applying the MSERs detector on the input image. Then, each MSERs component is assigned a confidence value by a trained CNN classifier.


Finally, the text components with high confidence scores are employed for constructing the final text-lines. In addition, we propose a novel approach to enhance character separation by applying a sliding window with the CNN classifier to error-connected MSERs components. Details of the system are presented below.

2.1 MSERs Component Generation

MSERs define an extremal region as a connected component of an image whose pixels have intensity contrast against its boundary pixels [16,25]. The intensity contrast is measured by increasing intensity values, and controls the region areas. A low contrast value generates a large number of low-level regions, which are separated by small intensity differences between pixels. When the contrast value increases, a low-level region can be accumulated with current-level pixels or merged with other lower-level regions to construct a higher-level region. Therefore, an extremal region tree can be constructed when the largest contrast is reached (e.g. 255 in a gray-scale image). An extremal region is defined as a maximally stable extremal region (MSER) if its variation is lower than both its parent's and its child's [16,25]. Therefore, an MSER can be considered as a special extremal region whose size remains unchanged over a range of thresholds.

MSERs has become one of the most widely used region detectors. For text region detection, it can be assumed that pixel intensity or color within a single text letter is uniform, while intensity contrast between text and background regions typically exists. Each individual character can thus be detected as an extremal region or an MSER. Two promising properties make the MSER detector effective for text detection. First, the MSER detector is computationally fast and can be computed in time linear in the number of pixels in an image [25]. Second, it is a powerful detector with high capability for detecting low-quality texts, such as those with low contrast, low resolution, or blurring. With this capability, MSER is able to detect most scene texts in natural images, leading to high recall in detection. However, this capability of the MSER is penalized by an increasing number of false detections, which substantially increases the difficulty of identifying true texts among a large number of non-text false alarms; this is one of the main challenges for current MSERs based methods. Previous work often balances the two factors by using an MSERs threshold, which can be varied from 1 to 255 for a gray-scale image.
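For concreteness, this generation step can be sketched with OpenCV's MSER implementation, which we use here only as a stand-in for the detector in the paper (the image filename and the use of OpenCV are our assumptions). The delta parameter plays the role of the MSERs threshold discussed above and is set to its lowest value, 1, to maximize recall:

```python
import cv2

# A minimal sketch of MSERs component generation, assuming OpenCV's MSER
# detector as a stand-in for the paper's implementation.
image = cv2.imread("scene.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
mser.setDelta(1)                             # lowest stability threshold, as in the paper
regions, boxes = mser.detectRegions(gray)    # pixel lists and bounding boxes

# A second pass on the inverted image captures white-on-black text,
# mirroring the two-pass scheme described in Section 3.2.
regions_inv, boxes_inv = mser.detectRegions(255 - gray)
```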

For a text detection system, our goal is to detect as many text components as possible in this step, because it is difficult or impossible to recover missed characters in the subsequent processes. The MSERs threshold is set to its lowest value, 1, which makes it possible to capture the most challenging cases, as shown in the top row of Fig. 2. As shown, although there are a number of false detections, the true text characters in highly difficult cases are also correctly detected. This makes it possible to construct a robust system that correctly detects those challenging texts and results in a high recall. But at the same time, it requires a powerful classifier to identify those low-quality texts among a large number of non-text components.


A high-capability classifier based on a deep convolutional neural network is presented below.

2.2 Deep Convolutional Neural Network

Deep learning has been applied to a number of challenging tasks in computer vision, with breakthrough improvements achieved in the last several years. It has been shown that a deep network is capable of learning meaningful high-level features and semantic representations for visual recognition through a hierarchical architecture with multiple layers of feature convolutions. The deep structure of the CNN allows it to refine feature representations and abstract semantic meaning gradually. The traditional CNN has achieved great success on digit and hand-written character recognition [12,13]. Scene text detection in natural images is a high-level visual task, which is difficult to solve completely with a set of low-level operations or manually designed features. In contrast to previous works, which often use a set of heuristic features to distinguish text and non-text components [23,24,22,34,31,10], we take advantage of deep learning and adapt a deep convolutional neural network to robustly classify the generated MSERs components.

Fig. 2. The performance of the MSERs detector and the CNN classifier for low-contrast (a) and low-quality texts (b), and text-like outliers (c). The top row shows the MSERs detections on text areas. The middle row shows the confidence maps generated by the CNN classifier; pixels belonging to multiple components are displayed with their higher confidence scores. The bottom row shows the detection results.



The structure of the CNN text classifier is similar to that applied in [29,4]. The network includes two convolutional layers, each of which comprises a convolution and an average pooling step. The numbers of filters in the two layers are 64 and 96, respectively. The input patch has a fixed size of 32 × 32. Similar to [29,4,5,19], the first layer is trained in an unsupervised way, using a variant of K-means described in [4] to learn a set of filters $D \in \mathbb{R}^{k \times n_1}$ from a set of 8 × 8 patches randomly extracted from training patches, where $k$ is the dimension of the convolution kernel ($k = 64$ for an 8 × 8 kernel) and $n_1 = 64$ is the number of filters in the first layer of our system. The response $r$ of the first layer is computed as [29]

$$r = \max\{0, |D^{T}x| - \theta\} \qquad (1)$$

where $x \in \mathbb{R}^{64}$ is an input vector for an 8 × 8 convolutional kernel, and $\theta = 0.5$.

The resulting first-layer response maps have size 25 × 25 × 64. Average pooling with a window size of 5 × 5 is then applied to the response maps, yielding reduced maps of size 5 × 5 × 64. The second layer is stacked upon the first layer; its sub-window sizes for computing responses and for average pooling are 4 × 4 and 2 × 2, respectively. The final output of the two layers is a 96-dimensional feature vector, which is input to an SVM to generate the final confidence score of the MSERs component. The parameters of the second layer are fully connected and are trained by back-propagating the SVM classification error.
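The first-layer computation in Eq. (1), together with the 5 × 5 average pooling, can be sketched in NumPy as follows. The filter bank `D` below is a random placeholder standing in for the filters learned with the K-means variant of [4], and the helper names are ours:

```python
import numpy as np

k, n1, theta = 64, 64, 0.5       # kernel dimension, filter count, offset in Eq. (1)
D = np.random.randn(k, n1)       # placeholder for the learned K-means filters

def first_layer_responses(patch):
    """Dense responses of Eq. (1) over a 32 x 32 grayscale patch."""
    out = np.zeros((25, 25, n1))
    for i in range(25):
        for j in range(25):
            x = patch[i:i + 8, j:j + 8].reshape(k)              # 8 x 8 receptive field
            out[i, j] = np.maximum(0, np.abs(D.T @ x) - theta)  # Eq. (1)
    return out                   # 25 x 25 x 64 response maps

def average_pool(maps, win=5):
    """5 x 5 average pooling, reducing 25 x 25 x 64 maps to 5 x 5 x 64."""
    h, w, c = maps.shape
    return maps.reshape(h // win, win, w // win, win, c).mean(axis=(1, 3))

pooled = average_pool(first_layer_responses(np.random.rand(32, 32)))
```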

Given an MSER component, we apply the trained CNN classifier to decide whether it is a text component by assigning a confidence score to it. In our experiments, we discarded the MSERs components that include very small numbers of pixels (e.g. less than 0.01% of the total number of pixels in an image), and kept all other components as input to the CNN classifier. For each retained MSER component, we computed the aspect ratio of its bounding box. If the width of the box is larger than its height, we directly resized the image component to the size of 32 × 32; otherwise, we extracted a square patch with the same center as the bounding box and with side length equal to the height of the box, and then resized it to 32 × 32. This alignment scheme makes the input patches consistent with the synthetic training samples used in our experiments, which were originally generated by Wang et al. [28,29]. Examples of both cases are shown in Fig. 3. The final confidence maps for three challenging images are shown in the middle row of Fig. 2. As shown, our CNN classifier generally assigns higher scores to text components, even for MSERs containing very low-quality text characters (see Fig. 2 (a) and (b)), and at the same time assigns low confidence scores to text-like outliers (such as the masks and bricks in Fig. 2 (b) and (c)), demonstrating strong robustness and a highly discriminative capability for filtering out non-text components.
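The alignment step can be sketched as below. The helper name `align_component` is hypothetical, and the clipping behaviour at image borders is our assumption, since the paper does not spell it out:

```python
import cv2

def align_component(gray, box, size=32):
    """Normalize a retained MSER component to a 32 x 32 input patch."""
    x, y, w, h = box                          # bounding box of the component
    if w >= h:
        patch = gray[y:y + h, x:x + w]        # wide box: resize directly
    else:
        cx = x + w // 2                       # tall box: square crop sharing the
        x0 = max(0, cx - h // 2)              # box centre, side equal to height
        patch = gray[y:y + h, x0:x0 + h]
    return cv2.resize(patch, (size, size))
```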

The performance of the SWT methods depends heavily on the edge detector, which is often not reliable in many challenging cases. Compared to the SWT based methods, the MSER operator is capable of detecting more true text components, often leading to a higher recall.


Fig. 3. MSERs component alignment to size 32 × 32, and synthetic training samples

But at the same time, the MSERs methods also generate a larger number of non-text components, which means that the high performance of the MSERs methods depends heavily on a powerful component classifier. Thus we designed the CNN network, leveraging its high learning capability, to improve the performance of the MSERs methods. Besides, in our system the MSERs method and the CNN classifier strongly complement each other. Compared to general sliding-window methods, the MSER operator provides two promising properties: it not only reduces the number of search windows dramatically, by two orders of magnitude, but also provides a significant enhancement on low-quality texts, which are difficult to detect correctly with a general sliding-window method. In our experiments, the average number of MSERs components per image input to the CNN classifier is 516 on the ICDAR 2011 database.

2.3 Component Splitting with MSERs Tree

As pointed out in the literature, most connected component based methods suffer from inherent limitations of low-level filters, which easily cause erroneous connections between multiple characters or with background components in some difficult cases, such as low-quality or severely distorted texts [24,2,10]. In order to tackle this problem, we propose a high-level approach incorporating CNN scores and the MSERs tree structure with a sliding-window model.

We define an error-connected component as an MSER component including multiple text characters. As mentioned, the sliding-window model is computationally expensive to apply. We show that only a small number of the MSERs components need be considered as error-connected components, by selecting them using the output CNN scores and the structure of the MSERs tree. It can be observed that an error-connected component generally has three notable characteristics.


– First, it often has a high aspect ratio, where the width of the bounding box is much larger than its height (e.g. the example in the top row of Fig. 3).

– Second, differing from other non-text components such as long horizontal lines or bars, which are generally scored with negative confidence values by our CNN classifier, an error-connected component actually contains some text information (multiple characters), though the evidence is not very strong, because our CNN classifier is trained on single-character components.

– Third, although components at high levels of the MSERs trees often include multiple text characters (such as the components at the roots of the trees), most of these components can be further separated correctly into their children components, which often have higher confidence scores than their parents. Thus, we do not consider such components as error-connected components.

Therefore, the error-connected components are defined as the components having high aspect ratios (e.g. width/height > 2 in our experiments) and positive CNN scores that either (1) cannot be further separated in their MSERs trees, or (2) have no component in their children sets with a higher CNN score than their own. The first situation covers texts with multiple characters truly connected, which cannot be separated by most low-level filters. The components in the second situation often involve erroneous separations (resulting in low or negative confidence scores for all their children components), which are caused by challenging cases.
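In code, the selection rule could look like the following sketch. Here `node` is a hypothetical MSERs-tree node carrying a CNN score, a bounding box, and its children; none of these interfaces are defined in the paper:

```python
def is_error_connected(node, aspect_thresh=2.0):
    """Select components for splitting per the two conditions above."""
    x, y, w, h = node.box
    if w / float(h) <= aspect_thresh or node.score <= 0:
        return False                      # requires a high aspect ratio and a
                                          # positive CNN confidence score
    if not node.children:
        return True                       # condition (1): cannot be split further
    # condition (2): no child scores higher than the component itself
    return all(c.score <= node.score for c in node.children)
```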

To illustrate the proposed splitting scheme, we selected an error-connected MSER component sample, as shown in Fig. 4. The component has a high aspect ratio and a positive CNN score, which is higher than the scores of its children. We applied a sliding window with our CNN classifier to scan through the component, which returns a one-dimensional sequence of continuous confidence scores. Finally, the non-maximal suppression (NMS) method [20] was applied to the continuous scores to estimate the location of each single character. The details of the component splitting method are described in Algorithm 1.

As shown in Fig. 4, the proposed high-level component splitting method effectively handles the component connection problem of the MSERs methods, and often generates better component candidates with high confidence values for subsequent text-line construction and recognition. Note that, by integrating the MSERs tree structure and the CNN confidence map to carefully choose the error-connected components, only a small number of components are selected for splitting. Each component is scanned just once by a sliding window at a single scale (32 × 32), and the component size is often small; therefore, the increase in computational cost due to the proposed splitting method is relatively trivial. With the powerful CNN classifier and the efficient splitting scheme, a large number of non-text components are identified correctly, and only a small number of text components (those with positive confidence scores in our experiments) are retained to construct the final text-lines.


Fig. 4. Error-connected component splitting by the sliding-window model with the CNN classifier

Algorithm 1. Error-connected Component Splitting

Require: Selected error-connected components, MSERs tree structure, and CNN confidence map.
Ensure: Revised MSERs tree and CNN confidence map.

1: Given N error-connected components
2: for k = 1 → N do
3:     Get the confidence score of the current component, $W_k$
4:     Normalize the size of the component to 32 × X
5:     Use a sliding window (32 × 32) with the CNN to compute confidence scores $S_k$
6:     Apply NMS [20] to estimate the peak values $P_k$ in $S_k$ as
7:
$$P_k(x) = \begin{cases} S_k(x) & \text{if } S_k(x) \ge S_k(x + \Delta x),\ \Delta x < \Theta \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
8:     Generate new components at the locations $x$ where $P_k(x) > 0$: $\{P_k^1, P_k^2, \ldots, P_k^m\}$
9:     if $\max(P_k^1, P_k^2, \ldots, P_k^m) > W_k$ then
10:        Replace the children set of the current component with the newly generated ones
11:        Update the confidence map with the new scores and locations, as shown in Fig. 4
12:    end if
13: end for
14: return Revised MSERs tree and new CNN confidence map
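Steps 4-8 of Algorithm 1 can be sketched as follows. Here `cnn_score` stands in for the trained CNN classifier applied to a single 32 × 32 window, and treating the NMS radius `Theta` as a count of window positions is our assumption:

```python
import numpy as np

def split_component(patch, cnn_score, Theta=16):
    """patch: a component normalized to 32 x X; returns peak (location, score) pairs."""
    H, W = patch.shape
    # Step 5: slide a 32 x 32 window to obtain the 1-D confidence profile S_k
    S = np.array([cnn_score(patch[:, x:x + 32]) for x in range(W - 31)])
    peaks = []
    for x in range(len(S)):
        lo, hi = max(0, x - Theta), min(len(S), x + Theta + 1)
        # Steps 6-8 (Eq. (2)): keep positive local maxima as character candidates
        if S[x] > 0 and S[x] == S[lo:hi].max():
            peaks.append((x, S[x]))
    return peaks
```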

Text-line construction is now simple and straightforward. Similar to previous work in [7,10], we first group two neighboring components into a pair if they have similar geometric and heuristic properties, such as similar intensities, colors, heights, and aspect ratios. Then, pairs sharing the same component and having similar orientations are merged sequentially to construct the final text-lines.


The process ends when no pairs can be merged further. Finally, text-lines are broken into separate words by computing the horizontal distances between consecutive characters. This differs from Yin et al.'s method [34], which has difficulty separating text and non-text MSERs components discriminatively using heuristic features, so that a large number of non-text components are retained to construct the text-lines, leading to a large number of false alarms in the resulting text-lines (e.g., as indicated in [34], only 9% of the final text-lines are true texts). It consequently requires a further, computationally costly post-processing step to filter out the false alarms using sophisticated machine learning algorithms [34]. In contrast, our system discards false alarms by simply thresholding the average confidence scores of the text-lines.
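The pairing test behind this grouping can be sketched as below. The component attributes and all thresholds are illustrative assumptions, not the paper's tuned values:

```python
def is_pair(a, b):
    """Group two neighboring components if their properties are similar."""
    similar_height = min(a.h, b.h) / max(a.h, b.h) > 0.7        # similar heights
    similar_intensity = abs(a.mean_intensity - b.mean_intensity) < 30
    close_enough = abs(a.cx - b.cx) < 2.0 * max(a.w, b.w)       # horizontal proximity
    return similar_height and similar_intensity and close_enough
```

Pairs passing such a test are then merged transitively when they share a component and have similar orientations, and the resulting text-lines are kept or discarded by thresholding their average component confidence scores, as described above.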

3 Experiments and Results

We evaluated the proposed method on two widely cited benchmarks for scene text detection: the ICDAR 2005 [15,14] and ICDAR 2011 [26] Robust Reading Competition databases.

3.1 Datasets and Evaluation Method

The ICDAR 2005 dataset includes 509 color images with sizes varying from 307 × 93 to 1280 × 960; 258 images are included in the training set, while 251 images are used for testing. There are 229 training images and 255 testing images in the ICDAR 2011 dataset. Detection performance is evaluated at the word level in both datasets, whose test sets contain 1114 and 1189 words in total, respectively.

For evaluation, we followed the ICDAR 2011 competition evaluation protocol, which was proposed by Wolf et al. [30]. This evaluation method measures object-level precision and recall based on constraints on detection quality. It evaluates both the quantity and quality of rectangle matches over all images in the database, and considers not only one-to-one matching but also one-to-many and many-to-one matchings. The quality of a detection or matching is controlled by two parameters that penalize partial matches more than larger detections. Specifically, the evaluation is computed by Precision, Recall, and F-measure, defined below:

$$\text{Precision} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{|D^i|} M_D(D^i_j, G^i)}{\sum_{i=1}^{N} |D^i|} \qquad (3)$$

$$\text{Recall} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{|G^i|} M_G(G^i_j, D^i)}{\sum_{i=1}^{N} |G^i|} \qquad (4)$$

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$

where $N$ is the total number of images in a dataset, and $|D^i|$ and $|G^i|$ are the numbers of detection and ground-truth rectangles in the $i$-th image. $M_D(D^i_j, G^i)$ and $M_G(G^i_j, D^i)$ are the matching scores for detection rectangle $D^i_j$ and ground-truth rectangle $G^i_j$. Their values are set to 1 for a one-to-one matching, 0.8 for a one-to-many matching, and 0 for no matching. Two rectangles are considered matched when their overlap ratio is higher than a defined threshold, which controls the quality of the matching.
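As a concrete illustration, a simplified version of Eqs. (3)-(5) restricted to one-to-one matches can be written as below; the full protocol of Wolf et al. [30] additionally assigns 0.8 to one-to-many and many-to-one matchings, which this sketch omits:

```python
def iou(a, b):
    """Overlap ratio of two rectangles given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def evaluate(detections, ground_truths, overlap=0.5):
    """detections / ground_truths: per-image lists of rectangles."""
    md = mg = nd = ng = 0
    for dets, gts in zip(detections, ground_truths):
        nd, ng = nd + len(dets), ng + len(gts)
        md += sum(any(iou(d, g) > overlap for g in gts) for d in dets)  # Eq. (3) numerator
        mg += sum(any(iou(d, g) > overlap for d in dets) for g in gts)  # Eq. (4) numerator
    precision, recall = md / nd, mg / ng
    return precision, recall, 2 * precision * recall / (precision + recall)  # Eq. (5)
```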

3.2 Experiments and Results

The CNN classifier was trained on 15,000 synthetic samples generated by Wang et al. [29]: 5,000 positive and 10,000 negative samples, all resized to 32 × 32. Some examples are shown in Fig. 3. The training sets of the two benchmark datasets were not used for training in our experiments, which demonstrates the strong generalization power of the proposed system. In our experiments, the MSERs operator was run twice on each image, handling both black-to-white and white-to-black texts. Each MSER component was classified by the trained CNN classifier, and only components with positive confidence scores were used for text-line construction. A text-line with an average component score lower than 1.2 was considered a false detection. The final detected bounding boxes for each image are the non-overlapping combination of the boxes from both passes. The full evaluation results on the two databases are presented in Tables 1 and 2, along with detection results on several challenging images displayed in Fig. 5.

The proposed method achieved excellent performance on both databases, with significant improvements in Precision, Recall, and F-measure. On the more recent ICDAR 2011 dataset, our method improves on the closest previous performance by 2-3% in all three measures and reaches an F-measure over 78%. Note that the evaluation scheme of the SFT-TCD method [10] did not follow the standard protocol of ICDAR 2011: it was evaluated on each single image, and the mean values over all images in the dataset were reported. The improvements of our method mainly come from two factors. On the one hand, the powerful MSERs detector is able to detect most true texts, resulting in a high Recall. On the other hand, the high capability of the CNN classifier together with the high-level splitting scheme robustly identifies true text components from non-text ones, leading to a large improvement in Precision.

Fig. 5 shows successful detection results on a number of challenging cases, which indicate that our system is highly robust to large variations in texts, including small font sizes, low contrast, and blurring. The images also show that our system is robust against strong lighting and highly noisy backgrounds. Fig. 6 shows two failure cases from our experiments. In the left image, our method missed a number of true text-lines, mainly because of the strong masks covering the texts, which significantly break the low-level structure of the texts. The text components in the right image do not carry strong text information and are easily confused with the background.


Fig. 5. Successful text detection results under extreme variations and significant distortions

Fig. 6. Failure cases


Table 1. Experimental results on the ICDAR 2005 dataset

Method                      Year   Precision   Recall   F-measure
Our method                  –      0.84        0.67     0.75
SFT-TCD [10]                2013   0.81        0.74     0.72
Yao et al. [31]             2012   0.69        0.66     0.67
Chen et al. [2]             2012   0.73        0.60     0.66
Epshtein et al. [7]         2010   0.73        0.60     0.66
Yi and Tian [33]            2013   0.71        0.62     0.63
Neumann and Matas [23]      2011   0.65        0.64     0.63
Zhang and Kasturi [35]      2010   0.73        0.62     –
Yi and Tian [32]            2011   0.71        0.62     0.62
Becker et al. [14]          2005   0.62        0.67     0.62
Minetto et al. [17]         2010   0.63        0.61     0.61
Chen and Yuille [3]         2004   0.60        0.60     0.58

Table 2. Experimental results on the ICDAR 2011 dataset

Method                      Year   Precision   Recall   F-measure
Our method                  –      0.88        0.71     0.78
Yin et al. [34]             2014   0.86        0.68     0.76
Neumann and Matas [21]      2013   0.85        0.68     0.75
SFT-TCD [10]                2013   0.82        0.75     0.73
Neumann and Matas [22]      2013   0.79        0.66     0.72
Shi et al. [27]             2013   0.83        0.63     0.72
Neumann and Matas [24]      2012   0.73        0.65     0.69
Gonzalez et al. [8]         2012   0.73        0.56     0.63
Yi and Tian [32]            2011   0.67        0.58     0.62
Neumann and Matas [23]      2011   0.69        0.53     0.60

4 Conclusions

We have presented a robust system for scene text detection and localization in natural images. Our main contribution lies in effectively leveraging the high capacity of a deep learning model to tackle two main problems of current MSERs methods for text detection, endowing our system with strong robustness and a highly discriminative capability to distinguish texts from a large number of non-text components. A sliding-window model was integrated with the CNN classifier to further improve text character detection in challenging images. Our method has achieved state-of-the-art performance on two benchmark datasets, convincingly verifying the effectiveness of the proposed method.


Acknowledgments. This work is supported by the National Natural Science Foundation of China (91320101), the Shenzhen Basic Research Program (JCYJ20120903092050890, JCYJ20120617114614438, JCYJ20130402113127496), the 100 Talents Programme of the Chinese Academy of Sciences, and the Guangdong Innovative Research Team Program (No. 201001D0104648280). Yu Qiao is the corresponding author.

References

1. Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: Reading text in uncontrolled conditions. In: ICCV (2013)

2. Chen, H., Tsai, S., Schroth, G., Chen, D., Grzeszczuk, R., Girod, B.: Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In: ICIP (2012)

3. Chen, X., Yuille, A.: Detecting and reading text in natural scenes. In: CVPR (2004)

4. Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., Wu, D.J., Ng, A.Y.: Text detection and character recognition in scene images with unsupervised feature learning. In: ICDAR (2011)

5. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: AISTATS (2011)

6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)

7. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: CVPR (2010)

8. Gonzalez, A., Bergasa, L., Yebes, J., Bronte, S.: Text location in complex images. In: ICPR (2012)

9. Hanif, S., Prevost, L.: Text detection and localization in complex scene images using constrained AdaBoost algorithm. In: ICDAR (2009)

10. Huang, W., Lin, Z., Yang, J., Wang, J.: Text localization in natural images using stroke feature transform and text covariance descriptors. In: ICCV (2013)

11. Kim, K., Jung, K., Kim, J.: Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 25, 1631–1639 (2003)

12. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W.: Handwritten digit recognition with a back-propagation network. In: NIPS (1989)

13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998)

14. Lucas, S.: ICDAR 2005 text locating competition results. In: ICDAR (2005)

15. Lucas, S., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: ICDAR 2003 robust reading competitions. In: ICDAR (2003)

16. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: BMVC (2002)

17. Minetto, R., Thome, N., Cord, M., Fabrizio, J., Marcotegui, B.: SnooperText: A multiresolution system for text detection in complex visual scenes. In: ICIP (2010)

18. Mishra, A., Alahari, K., Jawahar, C.V.: Top-down and bottom-up cues for scene text recognition. In: CVPR (2012)

19. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)

20. Neubeck, A., Van Gool, L.: Efficient non-maximum suppression. In: ICPR (2006)

21. Neumann, L., Matas, J.: On combining multiple segmentations in scene text recognition. In: ICDAR (2013)

22. Neumann, L., Matas, J.: Scene text localization and recognition with oriented stroke detection. In: ICCV (2013)

23. Neumann, L., Matas, J.: Text localization in real-world images using efficiently pruned exhaustive search. In: ICDAR (2011)

24. Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: CVPR (2012)

25. Nister, D., Stewenius, H.: Linear time maximally stable extremal regions. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 183–196. Springer, Heidelberg (2008)

26. Shahab, A., Shafait, F., Dengel, A.: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: ICDAR (2011)

27. Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S.: Scene text detection using graph model built upon maximally stable extremal regions. Pattern Recognition Letters 34, 107–116 (2013)

28. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV (2011)

29. Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: ICPR (2012)

30. Wolf, C., Jolion, J.-M.: Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal on Document Analysis and Recognition 8, 280–296 (2006)

31. Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations in natural images. In: CVPR (2012)

32. Yi, C., Tian, Y.: Text string detection from natural scenes by structure-based partition and grouping. IEEE Trans. Image Processing 20, 2594–2605 (2011)

33. Yi, C., Tian, Y.: Text extraction from scene images by character appearance and structure modeling. Computer Vision and Image Understanding 117, 182–194 (2013)

34. Yin, X.C., Yin, X., Huang, K., Hao, H.W.: Robust text detection in natural scene images. IEEE Trans. Pattern Analysis and Machine Intelligence (to appear)

35. Zhang, J., Kasturi, R.: Character energy and link energy-based text extraction in scene images. In: ACCV (2010)