

A Comprehensive Method for Arabic Video Text Detection, Localization, Extraction and Recognition

M. Ben Halima, H. Karray, and A. M. Alimi

REGIM: REsearch Group on Intelligent Machines, University of Sfax, National School of Engineers (ENIS), BP 1173, Sfax, 3038, Tunisia
{mohamed.benhlima, hichem.karray, adel.alimi}@ieee.org

Abstract. With the rapid growth of the number of TV channels, the internet and online information services, more and more information becomes available and accessible. Digitization enhances the preservation of records and makes access to documents easier. However, when the quantity of documents becomes large, digitization alone is not enough to ensure efficient access. Indeed, we need to extract semantic information to help users find what they need quickly. The text embedded in video sequences is highly valuable for indexing and search systems. However, this text is difficult to detect and recognize because of the variability of its size, the low resolution of the characters and the complexity of the backgrounds. To address these shortcomings, we propose a two-task system: first, we extract the textual information from video sequences; second, we recognize this text. Our system is tested on a diverse database composed of several Arabic news broadcasts. The obtained results are encouraging and demonstrate the qualities of our approach.

Keywords: Arabic VideoText, Segmentation, Extraction, Recognition.

1 Introduction

News programs are important because they provide TV viewers with an idea of what is happening around the world. Moreover, more and more news video archives are published on the internet (www.INA.fr). For this reason, news text summarization has become a crucial research area. In this paper, we propose to segment news into stories and then extract the text from every story to prepare the recognition process.

Text recognition, even when applied to lines potentially containing text, remains a difficult problem because of the variety of fonts and colors and the presence of complex backgrounds. Recognition is addressed by a segmentation step followed by an optical character recognition step in a multiple-hypothesis framework. News programs are audio-visual documents; they are generally composed of a set of semantically independent stories. For this reason, before extracting textual information from news programs, the segmentation into stories must be done as a first step. The following figure shows example frames taken from the video broadcasts of Aljazeera, Tunis 7, France 24 and Alarabia.

G. Qiu et al. (Eds.): PCM 2010, Part II, LNCS 6298, pp. 648–659, 2010. © Springer-Verlag Berlin Heidelberg 2010


Fig. 1. Examples of text in video frames

Recognition of text in video is more difficult than many other OCR applications (e.g., reading printed matter) because of degradation, such as background noise, and deformation, such as variation in fonts. Several methods have been proposed for text recognition in videos. However, most of them do not address the problem of the Arabic script.

In this paper, we present a new system for text extraction and recognition from Arabic news videos. Our system integrates text detection, localization, tracking over multiple frames, extraction and recognition.

Fig. 2. Global Overview of our System

The rest of the paper is organized as follows: in Section 2, we discuss works related to news segmentation. In Section 3, we present how we detect and localize the textual information in video sequences. In Section 4, we present how we extract the text. In Section 5, we present how we recognize it. Finally, we conclude with directions for future work.

2 Video Segmentation

The segmentation into stories is an essential step in any work done on news video sequences. The story is the most important semantic unit of every news program. Story segmentation is an active research field with several categories of work. However, the majority of proposed approaches are based on detecting anchor shots. Indeed, the anchor shot is the only repetitive shot in a news program.

In this step we use the tools proposed by W. Mahdi et al. [1]. According to this work, an input video is first divided into shots. Shot clustering is then performed within sequences, so that scene transitions can be detected through classification of the temporal relations between the shot clusters. These relations are generated through a temporal-clusters graph built for the cluster chain. To reduce the possibility of undesirable merges, an additional constraint is imposed on clustering: shots which belong to different sequences are considered dissimilar and are never merged into the same cluster.

Fig. 3. Structure of a news program

3 Text Detection and Extraction

After video segmentation into stories, we detect and extract text transcriptions from every story. Works on text detection and extraction may be generally grouped into four categories: connected component methods [2][3][4], texture classification methods [5][6], edge detection methods [7][8][9][10][11], and correlation based methods [12][13].

3.1 Text Localization

Text detection is defined as the task of locating text strings in a complex background, without recognizing individual characters. In computer vision applications, there are many advantages to detecting text before performing segmentation and recognition. First, the text in a video frame does not cover the majority of the pixels. Second, precise text localization provides information about the extent of the text, which is very useful for segmenting text regions from the background. In addition, the background inside a located text box is generally less complicated than the rest of the image. Third, the characteristics of text strings can be exploited when searching for text: since text strings have typical shapes and are aligned, locating them is easier and more robust than locating individual characters.

3.1.1 Text Box Localization
The localization methods are therefore expected to be fast and, ideally, not to misclassify text regions as non-text regions (background).


A block of text refers to a box in the image that contains one or more lines of text. Let R denote the set of all sites (pixels) in an input image I. For any site r ∈ R in the image I, we denote by T(r) = True the event indicating that the pixel r belongs to a block of text. Text block extraction can then be addressed by estimating, at each site r ∈ R of the image I, the probability p(T(r) = True | I), and then grouping the pixels with high probabilities into regions.

To obtain a fast algorithm, we exploit the fact that blocks of text contain short edges in the vertical and horizontal directions, and that these edges are connected to each other because of the connections between letters. Let C_v and C_h refer to the vertical and horizontal edge maps of the image, such that for all r ∈ R we have:

C_v(r) = 1 if r is a point of a vertical edge, 0 otherwise    (1)

C_h(r) = 1 if r is a point of a horizontal edge, 0 otherwise    (2)

Then we connect the vertical edges in the horizontal direction and the horizontal contours in the vertical direction using an operator of mathematical morphology, namely dilation.

For this reason, two different structuring elements, ES_v and ES_h, are used depending on the type of contour (vertical or horizontal). The dilated images ID_v and ID_h are defined as follows:

ID_v = C_v ⊕ ES_v    (3)

ID_h = C_h ⊕ ES_h    (4)

We chose a size of 5×1 for ES_v and a size of 3×6 for ES_h for detecting text in images that are normalized to a resolution of 320×240. We estimate the probability p(T(r) = True | I) as follows:

p(T(r) = True | I) = ID_v(r) ∧ ID_h(r)    (5)

False alarms resulting from this procedure are often oblique stripes, corners, and groups of small patterns.
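As an illustration, the localization pipeline of equations (1)–(5) can be sketched as follows. This is a minimal, unoptimized sketch: the edge detector (simple finite differences with an arbitrary threshold) and the exact orientation of the structuring elements are our assumptions, not the paper's stated choices.

```python
# Hedged sketch of the edge-dilation text detector (eqs. 1-5).
# Assumptions: edges come from finite differences with an arbitrary
# threshold; ES_v and ES_h are taken as a horizontal 1x5 run and a
# 3x6 block respectively.

def edge_maps(img, thresh=50):
    """Binary vertical/horizontal edge maps C_v and C_h (eqs. 1 and 2)."""
    h, w = len(img), len(img[0])
    cv = [[0] * w for _ in range(h)]
    ch = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # strong horizontal gradient -> vertical edge point
            if x + 1 < w and abs(img[y][x + 1] - img[y][x]) > thresh:
                cv[y][x] = 1
            # strong vertical gradient -> horizontal edge point
            if y + 1 < h and abs(img[y + 1][x] - img[y][x]) > thresh:
                ch[y][x] = 1
    return cv, ch

def dilate(mask, se_rows, se_cols):
    """Binary dilation with an se_rows x se_cols structuring element (eqs. 3-4)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    ry, rx = se_rows // 2, se_cols // 2
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - ry), min(h, y + ry + 1)):
                    for xx in range(max(0, x - rx), min(w, x + rx + 1)):
                        out[yy][xx] = 1
    return out

def text_candidates(img):
    """Eq. 5: a pixel is a text candidate where both dilated maps agree."""
    cv, ch = edge_maps(img)
    id_v = dilate(cv, 1, 5)  # link vertical edges in the horizontal direction
    id_h = dilate(ch, 3, 6)  # link horizontal edges in the vertical direction
    h, w = len(img), len(img[0])
    return [[id_v[y][x] & id_h[y][x] for x in range(w)] for y in range(h)]
```

On a real frame one would first normalize the image to 320×240 and then group the candidate pixels into connected regions.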


3.1.2 Text Line Extraction
To standardize the size of the text, we need to extract text lines from the candidate text blocks. This task can be performed by detecting the baselines above and below the horizontally aligned text. The detection of these lines also serves two basic purposes. First, it eliminates false alarms. Second, it refines the localization of text in candidate regions where the text is connected to certain background items.

The detection of baselines begins by computing the projection H(y) on the Y axis, where H(y) denotes the number of text pixels projected onto row y. The following figure illustrates a block of text and its projection on the Y axis:

Fig. 4. Example of horizontal projection
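The projection step illustrated in Fig. 4 is straightforward to express in code. A minimal sketch, assuming a binary text box where 1 marks a text pixel; the 20% of-peak threshold used to bound the line is our choice, not the paper's:

```python
# H(y): count text pixels per row, then keep the band of rows whose
# count exceeds a fraction of the peak (threshold is an assumption).

def horizontal_projection(box):
    return [sum(row) for row in box]

def text_line_band(box, ratio=0.2):
    """Return (top, bottom) row indices bounding the text line, per H(y)."""
    h = horizontal_projection(box)
    peak = max(h)
    rows = [y for y, v in enumerate(h) if v >= ratio * peak]
    return (rows[0], rows[-1]) if rows else (None, None)
```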

3.2 Text Extraction

After localizing the text in the frame, the next step is to segment and binarize it. First, we convert the image to grayscale. Second, for each pixel in the detected text box, we create a vector composed of two features: the standard deviation and the entropy of its eight neighboring pixels, calculated as follows:

std(p) = ( (1 / (N·N)) Σ_{i=1}^{N} Σ_{j=1}^{N} ( f(i, j) − f̄ )² )^{1/2}    (6)

and,

ent(p) = − Σ_{i=1}^{N} Σ_{j=1}^{N} f(i, j) · log( f(i, j) )    (7)

where f(i, j) denotes the gray level of the pixel at position (i, j). Third, we use the fuzzy C-means algorithm to classify the pixels into a "text" class and a "background" class. Finally, we binarize the text image obtained by marking the text pixels in black.
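A hedged sketch of the two per-pixel features of eqs. (6) and (7), computed over a 3×3 window (the pixel plus its eight neighbors). Normalizing gray levels to [0, 1] before the entropy term is our assumption, made only to keep the logarithm well defined; the fuzzy C-means clustering step is not reproduced here.

```python
# Per-pixel standard deviation (eq. 6) and entropy-style feature (eq. 7)
# over the 3x3 neighbourhood of (y, x). Border pixels use a truncated window.
import math

def window3(img, y, x):
    h, w = len(img), len(img[0])
    return [img[yy][xx] for yy in range(y - 1, y + 2) for xx in range(x - 1, x + 2)
            if 0 <= yy < h and 0 <= xx < w]

def std_feature(img, y, x):
    vals = window3(img, y, x)
    mean = sum(vals) / len(vals)
    return math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))

def entropy_feature(img, y, x):
    # Normalisation by 255 is an assumption; the paper applies f*log(f) directly.
    vals = [v / 255.0 for v in window3(img, y, x)]
    return -sum(v * math.log(v) for v in vals if v > 0)
```

Each pixel of the text box would then be described by the vector (std_feature, entropy_feature) before clustering.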


(a) Localized text

(b) Segmented text frame

Fig. 5. Example of text extraction and binarization

4 Arabic Text Recognition

Recently, several methods have been proposed for the recognition of text in videos. However, most of them are not applied to the problem of the Arabic script. We implemented an optical character recognition system comprising five successive stages: pre-processing, segmentation, feature extraction, classification and post-processing.

4.1 Structure of the Arabic Scripts

The alphabet of the Arabic language has 28 consonants, 15 of which carry one to three dots that differentiate between otherwise similar characters. The dots and the Hamza (ء) are called secondary (complementary) characters. They are located above the primary character, as in "alif" (أ), below it, as in "ba" (ب), or in the middle, as in "jeem" (ج). Four characters can carry the secondary character Hamza (ء): alif (أ, إ), waw (ؤ), ya (ئ) and kaf (ك).

The Arabic script is written and read from right to left. Letters change shape according to their position in the word (at the beginning, in the middle or at the end). An Arabic letter can thus be written in two to four different forms:

− When the letter is not attached to any other, it is in the "isolated" position.
− When it is attached only to the letter in front of it (on the left), it is in the "initial" position.
− When it is attached only to the letter behind it (on the right), it is in the "final" position.
− When it is attached at both ends at once, it is in the "medial" position.

Text recognition, even from the detected text lines, remains a challenging problem due to the variety of fonts and colors, the presence of complex backgrounds and the short length of the text strings.


Table 1. Example of Different Forms of Arabic Alphabets

Letter | Isolated | Initial | Middle | Final | Other shapes
Alef   | ا        |          |        | ـا    | ى, ـى
Ba’    | ب        | بـ       | ـبـ    | ـب    |
Ta’    | ت        | تـ       | ـتـ    | ـت    | ة, ـة
Tha’   | ث        | ثـ       | ـثـ    | ـث    |
H’a’   | ح        | حـ       | ـحـ    | ـح    |
Dal    | د        |          |        | ـد    |
Ra’    | ر        |          |        | ـر    |
Seen   | س        | سـ       | ـسـ    | ـس    |
Dhad   | ض        | ضـ       | ـضـ    | ـض    |
Dha’   | ظ        | ظـ       | ـظـ    | ـظ    |
Ain    | ع        | عـ       | ـعـ    | ـع    |
Fa’    | ف        | فـ       | ـفـ    | ـف    |
Kaf    | ك        | كـ       | ـكـ    | ـك    |
Lam    | ل        | لـ       | ـلـ    | ـل    |
Meem   | م        | مـ       | ـمـ    | ـم    |

4.2 Pre-processing

4.2.1 Text Enhancement
We have chosen the method of Wolf, which uses the content of all frames in which the same text appears to produce an enhanced image. This is done in a robust way, based on statistics computed on the gray level of each pixel over the duration of its appearance. To automate this procedure, we assume that each text appears only once.

Fig. 6. Bilinear interpolation

We chose bilinear interpolation, which computes the gray level of a pixel as a weighted average of the gray levels of its neighbors. The weight of each neighbor (i_x, i_y) is the inverse of its squared distance to the pixel (x, y):

p(x, y) = 1 / ( (x − i_x)² + (y − i_y)² )    (8)
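The weighting of eq. (8) can be sketched as follows, assuming the four integer-coordinate neighbors surrounding the target position are averaged (the paper does not state which neighbors are used):

```python
# Inverse-squared-distance interpolation of a gray level (eq. 8),
# using the four pixels around a real-valued position (x, y).
import math

def interpolate(img, x, y):
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    num = den = 0.0
    for iy in (y0, y0 + 1):
        for ix in (x0, x0 + 1):
            d2 = (x - ix) ** 2 + (y - iy) ** 2
            if d2 == 0:                  # exactly on a pixel centre
                return float(img[iy][ix])
            w = 1.0 / d2                 # weight of eq. (8)
            num += w * img[iy][ix]
            den += w
    return num / den
```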


4.2.2 Text Normalization

In different images, text may occur with various widths and heights. To obtain consistent features across all text images and to reduce the variance of character size, the height of each text image is normalized: it is scaled to a predefined size (26 pixels).
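A minimal sketch of this normalization step, using nearest-neighbor sampling; scaling the width to preserve the aspect ratio is our assumption, since the paper only fixes the height at 26 pixels:

```python
# Scale a text image (list of pixel rows) to a fixed height of 26 pixels
# with nearest-neighbour sampling, keeping the aspect ratio.

def normalize_height(img, target_h=26):
    h, w = len(img), len(img[0])
    new_w = max(1, round(w * target_h / h))
    return [[img[y * h // target_h][x * w // new_w] for x in range(new_w)]
            for y in range(target_h)]
```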

Fig. 7. Example of character normalization

4.3 Script Segmentation

The segmentation process extracts the basic scripts from the text line. The segmentation of a text line into its scripts is based on the topological relation between the text line, represented as a graph, and the baseline of the text.

The baseline plays an essential role in Arabic writing. Most characters connect to each other on the baseline. Baseline detection is done by finding the peak in the horizontal density histogram of the text line.

After baseline detection and diacritic elimination, a segmentation stage is necessary. This segmentation is based on a vertical projection. The segmenter should avoid cases of under-segmentation and/or over-segmentation. Each segment will be recognized using a base of segments [14].

Fig. 8. Example of text segmentation
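The vertical-projection part of this segmenter can be sketched as follows; the baseline (T-junction) check described in the text is omitted for brevity:

```python
# Vertical projection V(x) of a binary word image: column sums. Columns
# where V(x) falls to (or below) a small threshold are candidate cut
# points between segments.

def vertical_projection(word):
    h, w = len(word), len(word[0])
    return [sum(word[y][x] for y in range(h)) for x in range(w)]

def cut_points(word, thresh=0):
    v = vertical_projection(word)
    cuts, in_gap = [], False
    for x, val in enumerate(v):
        if val <= thresh and not in_gap:
            cuts.append(x)       # first column of a new gap
            in_gap = True
        elif val > thresh:
            in_gap = False
    return cuts
```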

The segmentation of Arabic words into individual characters is a crucial step in recognizing Arabic text extracted from news sequences. Most recognition errors stem from segmentation errors. To minimize errors, the segmenter has to check whether there is an intersection pixel between the baseline and the letter before segmenting (T-junction). This condition gives the following decomposition of letters:

− The letter "ل" at the end of a word is considered as a whole segment.
− The letter "س" will be segmented into three elements.
− The letters "ب, ث, ف, ق" will be segmented into two elements by the segmenter if they are at the end of a word.
− The letters "ص, ض" will be broken into two segments.
− Other letters will be segmented as follows:
− The characters "ب, ث, ف, د, ذ" will be detected by their sizes and positions relative to the subword. Indeed, they are always at the end of a word or subword.


Such an error is corrected by connecting the last segment, given its size, to the previous segment.

− For the characters "س, ش, ص, ض", the generated segments bear a strong resemblance to the characters "ب, ت, ث, ن, ي" at the beginning or middle of a word or subword. A segmentation fault is corrected by testing for diacritics above or below the segment. If the segment contains no diacritics, it is connected to the next segment. The problem here concerns the "سـ" character. However, since the last segment of this character has no diacritic, it will be connected directly to the character that precedes it. And since an Arabic character is fragmented into at most three segments, the maximum number of connections should not exceed two.

For the characters "س, ش, ص, ض", the last segment obtained is similar to the character "ن". The difference lies in the diacritic above. A segmentation fault is corrected by testing the diacritic mark: if no diacritic appears above the segment (so it is not the character "ن"), the segment is connected to the previous one.

4.4 Feature Extraction

The objective of feature extraction is to capture the essential characteristics of a letter; it is generally considered one of the most difficult problems in pattern recognition. The extracted features are divided into two parts: the first part distinguishes between similar letters, while the second distinguishes between dissimilar letters. For feature identification, we use an approach similar to that proposed by J. Shanbehzadeh et al. [15].

In the first part, we employ information about the letters' dots to generate features. The number of connected components of each letter is taken as the first feature and the number of dots as the second (e.g., ب ت ث ي ن). Dots located above the line are encoded as 1, dots below the line as −1.

In the second part, features are extracted from the main body of the letter:

− Occlusion extraction: occlusions correspond to the internal contours of the primary strokes of the characters.
− Projection features: in addition to the horizontal and vertical projections, we compute projections along the diagonal and the anti-diagonal.
− Transition features: the number of 0-to-1 transitions along each row, column, diagonal and anti-diagonal.
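For instance, the transition features can be computed as follows (rows and columns shown; the diagonal scans follow the same pattern):

```python
# Count 0->1 transitions along each row and each column of a binary
# glyph image; the per-row and per-column counts form feature vectors.

def transitions(seq):
    return sum(1 for a, b in zip(seq, seq[1:]) if a == 0 and b == 1)

def transition_features(glyph):
    h, w = len(glyph), len(glyph[0])
    rows = [transitions(row) for row in glyph]
    cols = [transitions([glyph[y][x] for y in range(h)]) for x in range(w)]
    return rows, cols
```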

4.5 Classification

Classification in an OCR system is the main decision making stage in which the extracted features of a test set are compared to those of the model set. Based on the features extracted from a pattern, classification attempts to identify the pattern as a member of a certain class. When classifying a pattern, classification often produces a set of hypothesized solutions instead of generating a unique solution.


Supervised classification methods can be used to identify a sample pattern. We use the k-nearest-neighbour algorithm (k = 10). The degree of membership of a new segment x to class y is computed from the distance:

d(x, y) = Σ_{i=1}^{n} |x_i − y_i| / (x_i + y_i)    (9)
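A sketch of this classification step, using the distance of eq. (9) (a Canberra-style distance) and a majority vote among the k nearest training segments. Feature vectors are assumed non-negative, so each denominator is only zero when both coordinates are; such terms are skipped here, which is our assumption:

```python
# k-NN classification with the distance of eq. (9).
from collections import Counter

def d(x, y):
    """Canberra-style distance; zero-sum coordinates are skipped."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def knn_classify(train, query, k=10):
    """train: list of (feature_vector, label) pairs; majority vote of k nearest."""
    nearest = sorted(train, key=lambda item: d(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```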

4.6 Post-processing

The last step is post-processing. It improves recognition by refining decisions previously made in the classification stage and by recognizing words using context. It is ultimately responsible for issuing the best solution and is often implemented as a set of techniques based on character frequencies, lexica and other contextual information.

5 Experimental Results

To validate the text extraction system, we used a varied database composed of news sequences from TV7 Tunisia (tunisiatv.com), Aljazeera (aljazeera.net) and Alarabia (alarabiya.net), which are specialized channels presenting news continually. We used the standard precision and recall measures to evaluate the segmentation method. It is very difficult to argue that recall is more important than precision or vice versa; usually there is a trade-off between the two. For a better comparison, we therefore also computed a score combining precision and recall.

Recall = (Number of correctly detected text regions) / (Number of all ground-truth text regions)    (10)

Precision = (Number of correctly detected text regions) / (Number of detected text regions)    (11)

Score = (2 × Precision × Recall) / (Precision + Recall)    (12)
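For clarity, eqs. (10)–(12) as code, where `right`, `detected` and `ground_truth` are counts of text regions:

```python
# Evaluation measures of eqs. (10)-(12).

def recall(right, ground_truth):
    return right / ground_truth

def precision(right, detected):
    return right / detected

def score(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall
```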

Table 2. Evaluation results of news story segmentation

Channel     | Duration | Recall | Precision | Score
TV7 Tunisia | 10 hours | 92.47% | 91.23%    | 91.85%
Al Jazeera  | 10 hours | 94.53% | 92.67%    | 93.59%
Al Arabiya  | 10 hours | 94.51% | 91.33%    | 92.89%


We notice that the Al Jazeera and Al Arabia channels present the best recall and precision rates because they have the best graphic text quality. In order to investigate the effectiveness of our recognition sub-system in recognizing Arabic text extracted from various news sequences, a series of tests was performed using the KNN classifier with K=3, K=5 and K=10. Table 3 shows the recognition results.

Table 3. Evaluation results of text recognition

Channel       | K=3    | K=5    | K=10
TV7 Tunisia   | 80.58% | 82.73% | 86.14%
Al Jazeera TV | 83.66% | 88.10% | 92.91%
Al Arabia TV  | 83.72% | 90.89% | 96.51%

We note that the recognition rates for texts extracted from Al Arabiya and Al Jazeera TV are better than those from Tunis 7 TV. This is explained by the fact that the text extracted from Al Arabiya and Al Jazeera is clearer and more readable. We find it difficult to recognize the small characters that appear frequently in news videos. The best recognition rate is obtained in all cases for K=10.

6 Conclusion and Perspectives

The OCR techniques used do not perform very well when the extracted characters have too low a resolution; the development of new OCR techniques to recognize low-resolution characters is still necessary. Another aspect is reducing computation for mobile image-text recognition applications. Most mobile devices, such as mobile phones, have less computational power and memory than a desktop computer. In order to build a video text extraction application on these devices, the algorithms proposed in this paper need to be optimized or even modified to reduce their computational cost.

Acknowledgment

The authors would like to acknowledge the financial support of this work by grants from General Direction of Scientific Research (DGRST), Tunisia, under the ARUB program.

References

1. Mahdi, W., Chen, L., Fontaine, D.: Improving the Spatial-Temporal Clue Based Segmentation by the Use of Rhythm. In: Nikolaou, C., Stephanidis, C. (eds.) ECDL 1998. LNCS, vol. 1513, pp. 169–181. Springer, Heidelberg (1998)
2. Jain, A.K., Yu, B.: Automatic text location in images and video frames. Pattern Recognition 31(12), 2055–2076 (1998)
3. Lee, C.M., Kankanhalli, A.: Automatic extraction of characters in complex scene images. International Journal of Pattern Recognition and Artificial Intelligence 9(1), 67–82 (1995)
4. Lienhart, R., Stuber, F.: Automatic text recognition in digital videos. In: Proceedings of SPIE Image and Video Processing IV, vol. 2666, pp. 180–188 (1996)
5. Karray, H., Ellouze, M., Alimi, M.A.: Indexing Video Summaries for Quick Video Browsing. In: Computer Communications and Networks 2009, pp. 77–95 (2009)
6. Wu, V., Manmatha, R., Riseman, E.M.: TextFinder: an automatic system to detect and recognize text in images. IEEE Trans. PAMI 21, 1224–1229 (1999)
7. Agnihotri, L., Dimitrova, N.: Text detection for video analysis. In: Workshop on Content-Based Access to Image and Video Libraries, in conjunction with CVPR, Colorado (1999)
8. Gao, X., Tang, X.: Automatic news video caption extraction and recognition. In: Proc. of Intelligent Data Engineering and Automated Learning 2000, pp. 425–430 (2000)
9. Garcia, C., Apostolidis, X.: Text detection and segmentation in complex color images. In: Proc. of IEEE International Conf. on Acoustics, Speech, and Signal Processing, vol. 4, pp. 2326–2329 (2000)
10. Agnihotri, L., Dimitrova, N., Soletic, M.: Multi-layered Videotext Extraction Method. In: IEEE International Conference on Multimedia and Expo (ICME), pp. 213–216 (2002)
11. Hua, S., Chen, X.-R., et al.: Automatic Location of Text in Video Frames. In: Intl. Workshop on Multimedia Information Retrieval (MIR 2001), pp. 24–27 (2001)
12. Karray, H., Alimi, A.M.: Detection and Extraction of the Text in a video sequence. In: Proc. IEEE 12th International Conference on Electronics, Circuits and Systems (ICECS 2005), vol. 2, pp. 474–478 (2005)
13. Kherallah, M., Karray, H., Ellouze, M., Alimi, A.M.: Toward an Interactive Device for Quick News Story Browsing. In: ICPR 2008, pp. 1–4 (2008)
14. Ben Halima, M., Karray, H., Alimi, A.M.: Arabic Text Recognition in Video Sequences. In: The 2010 International Conference on Informatics, Cybernetics, and Computer Applications (ICICCA 2010) (July 2010)
15. Shanbehzadeh, J., Pezashki, H., Sarrafzadeh, A.: Feature Extraction from Farsi Handwritten Letters. In: Proceedings of Image and Vision Computing New Zealand 2007, pp. 35–40 (2007)
