[ieee 2009 second international conference in visualisation (viz) - barcelona, spain...



Segmentation of Printed Urdu Scripts Using Structural Features

Hamna Malik
Lahore College for Women University, Lahore, Pakistan

Muhammad Abuzar Fahiem
Lahore College for Women University, Lahore, Pakistan
University of Engineering and Technology, Lahore, Pakistan
[email protected]

Abstract

Character segmentation forms the basis for optical character recognition. In this paper, we propose a character segmentation approach for printed Urdu script. Urdu is cursive by nature and its script is written from right to left. Both of these factors make segmentation more difficult and require special attention. Our approach is based on structural features and overcomes problems such as over segmentation and under segmentation that are present in previous approaches. We have achieved an accuracy rate of 99.4%, which is better than previous approaches. The approach may be very useful for developing an optical character recognition system for the Urdu language.

Keywords - Character segmentation, Cursive scripts, Ligature, Structural features, Urdu alphabets

1. Introduction

Urdu is a cursive language with a total of 39 letters, written from right to left. Each letter can take one of four shapes (isolated, initial, medial or final) depending upon its position in a word, as shown in Fig. 1. Each such shape is known as a glyph, and the glyphs within a word are connected with each other.

Fig. 1. Different shapes of the Urdu letter ‘Ghaeen’ (a) is the initial shape (b) is the medial shape (c) is the final shape (d) is the isolated shape

A ligature is a connected component of pixels, and there are two types of ligatures. A primary ligature is any shape other than ‘dots’, ‘hamza’ and ‘tuppa’, while these three are known as secondary ligatures (shown in Fig. 2).

Fig. 2. Ligatures (a) Urdu word (b) Seven ligatures (c) Three primary ligatures (d) Four secondary ligatures.

Optical character recognition (OCR) is an important research area, and it needs more attention when a cursive language is to be recognized. Segmentation forms the basis for OCR and our research is focused on it. Various approaches have previously been proposed for Arabic text segmentation [1,2,3,4]. A very good survey is presented in [5] and gray scale segmentation is discussed in [6]. Some approaches use neural networks [7,8] while others use hidden Markov models [9].

In this paper, we perform segmentation of printed Urdu text on the basis of the structural features of its letters. Section 2 describes our approach, and Section 3 concludes with a comparison of our approach with previous ones.

2. Our Approach

Our approach is divided into two stages as described below.

2.1. Line segmentation

2009 Second International Conference in Visualisation

978-0-7695-3734-4/09 $25.00 © 2009 IEEE

DOI 10.1109/VIZ.2009.12

191


Line segmentation deals with the detection of text lines in the image. For this purpose the image is scanned row by row, from right to left and from top to bottom, in search of a text pixel. It is then determined whether this pixel belongs to a primary ligature or a secondary ligature: the Freeman chain code (FCC) of the ligature is compared with the precomputed FCCs of the secondary ligatures. In case of a 90% match the ligature is secondary; otherwise it is a primary ligature. If it is a primary ligature, it is marked as the start of a text line, and the row scan continues until the next empty row, which marks the end of the text line. However, if the text pixel belongs to a secondary ligature, it is necessary to determine whether the ligature belongs to the text line above it or to the text line below it. For this purpose, its distance from the text line above and from the text line below is calculated, and the ligature is assigned to the nearer one. The result of the line segmentation stage is shown in Fig. 3.
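The row scan and the nearest-line rule can be illustrated with a minimal Python sketch. It assumes a binary image given as a list of rows of 0/1 values. The function names are hypothetical, the FCC-based primary/secondary test is left out, and measuring distance as the vertical gap between row ranges is an assumption, since the distance measure is not defined above.

```python
def find_row_bands(image):
    """Return (top, bottom) row ranges separated by empty rows."""
    bands = []
    start = None
    for r, row in enumerate(image):
        has_text = any(row)
        if has_text and start is None:
            start = r                      # first text row: a band starts
        elif not has_text and start is not None:
            bands.append((start, r - 1))   # empty row: the band ends
            start = None
    if start is not None:                  # band touching the bottom edge
        bands.append((start, len(image) - 1))
    return bands


def vertical_gap(a, b):
    """Vertical gap between two (top, bottom) ranges; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def assign_secondary(ligature_band, line_bands):
    """Index of the text line nearest to a secondary ligature (assumed rule)."""
    return min(range(len(line_bands)),
               key=lambda i: vertical_gap(ligature_band, line_bands[i]))


image = [
    [0, 0],
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 0],
    [0, 1],
    [0, 0],
]
lines = find_row_bands(image)
nearest = assign_secondary((3, 3), lines)
```

Here a secondary ligature occupying row 3 lies one row below the first band and two rows above the second, so it is assigned to the first line.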

Fig. 3. Line segmentation (a) Original Image (b) Image after line segmentation

2.2. Character Segmentation

In character segmentation, the position of individual characters in a word is determined. The text is skeletonized and a label matrix is constructed which contains the identifiers of all ligatures in the image. Character segmentation can be performed on the basis of primary ligatures alone, and in Urdu the first character of any word is always part of some primary ligature. Primary ligatures at the start of a word are identified by scanning the text line column by column and extracting the first text pixel that follows each empty column.
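The column scan for word-initial primary ligatures can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the label matrix is a list of rows in which 0 is background and positive integers are ligature identifiers, secondary ligatures are assumed to have been filtered out already, and columns are visited from right to left to follow the writing direction. The function name is hypothetical.

```python
def word_start_labels(labels):
    """Return (column, ligature_id) for each text column that follows
    an empty column, scanning from right to left."""
    height = len(labels)
    width = len(labels[0])
    starts = []
    prev_empty = True  # the area beyond the right edge counts as empty
    for c in range(width - 1, -1, -1):
        ids = [labels[r][c] for r in range(height) if labels[r][c]]
        if ids and prev_empty:
            starts.append((c, ids[0]))  # first text pixel, top to bottom
        prev_empty = not ids
    return starts


labels = [
    [0, 2, 0, 0, 1],
    [2, 2, 0, 1, 1],
]
starts = word_start_labels(labels)
```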

After extracting the primary ligature identifiers, the font size is identified on the basis of the ligature’s width and height, as shown in Table 1.

Table 1. Font Size

Font Size    Height (pixels)     Width (pixels)
Large        >= 17               >= 65
Medium       >= 9 and < 17       >= 35 and < 65
Small        < 9                 < 35
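The thresholds in the font size table translate directly into a small classifier. The sketch below is illustrative only. The table does not say what happens when the height and width fall into different classes, so falling back to the height alone in that case is an assumption, and the function name is hypothetical.

```python
def classify_font_size(height, width):
    """Classify a primary ligature's font size from its height and width in pixels."""
    if height >= 17 and width >= 65:
        return "large"
    if 9 <= height < 17 and 35 <= width < 65:
        return "medium"
    if height < 9 and width < 35:
        return "small"
    # Assumption: when height and width disagree, decide by height alone.
    if height >= 17:
        return "large"
    if height >= 9:
        return "medium"
    return "small"
```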

Every primary ligature passes through the stages described in the following sections.

2.2.1. Base Segmentation

Base segmentation starts by scanning the rectangular boundary of the ligature column by column. A character is assumed to start when a vertical scan contains a non-consecutive sequence of the ligature’s pixels; we call this the column overlapping state (COS). The character is assumed to end when a vertical scan contains a consecutive sequence of the ligature’s pixels; we call this the anti-column overlapping state (ACOS). COS becomes true when it is first encountered and is disabled when ACOS is found. ACOS is ignored if COS is false.
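The COS/ACOS rule amounts to a two-state scan over the columns of the ligature. The following minimal sketch assumes the ligature is a binary matrix (a list of rows of 0/1 values) cropped to its bounding box, and that columns are visited from right to left; the scan direction and the function names are assumptions. A column with two or more separate runs of pixels is treated as COS, and a column with exactly one run as ACOS.

```python
def count_runs(column):
    """Number of separate vertical runs of ligature pixels in a column."""
    runs = 0
    prev = 0
    for value in column:
        if value and not prev:
            runs += 1
        prev = value
    return runs


def base_segments(ligature):
    """Return (start_col, end_col) pairs, one per COS-to-ACOS transition."""
    height = len(ligature)
    width = len(ligature[0])
    segments = []
    cos = False
    start = None
    for c in range(width - 1, -1, -1):
        runs = count_runs([ligature[r][c] for r in range(height)])
        if runs >= 2 and not cos:
            cos = True                    # COS: a character starts here
            start = c
        elif runs == 1 and cos:
            cos = False                   # ACOS: the character ends here
            segments.append((start, c))
        # ACOS while COS is false is ignored, as in the text
    return segments


ligature = [
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]
segments = base_segments(ligature)
```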

It is observed that over segmentation may arise as the font size increases, as shown in Fig. 4; to overcome it, the next stage is executed.

Fig. 4. Effect of font size on base segmentation (a) Large font size image (b) Small font size image (c) Over segmentation of (a) (d) Over segmentation of (b)

Over segmentation caused by unnecessary pixels mostly affects the shapes of ‘Seen’ and ‘Sheen’. Because this over segmentation arises in large and medium fonts, a COS lasting up to three consecutive columns in a large font, or up to two consecutive columns in a medium font, is ignored during base segmentation.
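One way to read this rule is as a filter over per-column COS flags. The sketch below takes that reading, which is an assumption: it clears every stretch of consecutive COS columns whose length does not exceed the tolerance for the font size. The names `COS_TOLERANCE` and `suppress_short_cos` are hypothetical.

```python
# Longest COS stretch (in columns) that is ignored for each font size.
COS_TOLERANCE = {"large": 3, "medium": 2, "small": 0}


def suppress_short_cos(cos_flags, font_size):
    """Clear stretches of consecutive COS columns no longer than the tolerance."""
    limit = COS_TOLERANCE[font_size]
    out = list(cos_flags)
    i = 0
    n = len(out)
    while i < n:
        if out[i]:
            j = i
            while j < n and out[j]:
                j += 1                     # j is one past the end of the stretch
            if j - i <= limit:
                for k in range(i, j):
                    out[k] = False         # too short: treat as noise
            i = j
        else:
            i += 1
    return out


flags = [True, True, False, True, True, True, True, False, True]
filtered = suppress_short_cos(flags, "large")
```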

The segmentation process of this stage heavily over segments the last character of a ligature. This over segmentation is illustrated in Fig. 5.

Fig. 5. Over segmentation of last character of a ligature (a) Image after line segmentation (b) Image after base segmentation (c) Desired segmentation

Fig. 6 shows the correctly segmented characters after base segmentation.


Fig. 6. Correctly Segmented Characters after base segmentation

2.2.2. Over Segmentation

This stage is responsible for overcoming over segmentation of the last character in a ligature. For this purpose, the leftmost column of the rectangular boundary of a ligature is scanned; the row with the first occurrence of a ligature pixel is marked as the candidate row and that pixel is marked as the candidate pixel. A candidate pixel always belongs to the last character of a ligature. It is observed that a pair pixel always exists opposite to this candidate pixel in the last character of a ligature. Thus, all invalid segmentation lines causing over segmentation on the left of a pair pixel are removed to correctly segment the last character.

However, ‘Alif’ has to be recognized on the basis of its structural features, as it has no pair pixel. For that, the baseline of the ligature [10] is determined as the row having the maximum number of pixels in the ligature. Next, the difference between the height of the ligature and the distance of the baseline from the top of the ligature is calculated. If this difference is more than 65% of the ligature’s height, it is taken as the reference height; otherwise the ligature’s height is taken as the reference height. Finally, if the sum of pixels in the last three columns of the ligature is 92% of the reference height, the last character is ‘Alif’.
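The ‘Alif’ test can be sketched in a few lines. The sketch assumes a binary ligature matrix (list of rows of 0/1 values) cropped to its bounding box, takes the "last three columns" to be the three leftmost columns because the script runs right to left, measures the baseline distance as a zero-based row index, and reads "is 92%" as "at least 92%". All of these readings, and the function names, are assumptions.

```python
def baseline_row(ligature):
    """Row index with the maximum number of ligature pixels."""
    counts = [sum(row) for row in ligature]
    return counts.index(max(counts))


def reference_height(ligature):
    """Height below the baseline if it exceeds 65% of the height, else the height."""
    height = len(ligature)
    diff = height - baseline_row(ligature)
    return diff if diff > 0.65 * height else height


def last_char_is_alif(ligature):
    """True if the three leftmost columns hold at least 92% of the reference height."""
    ref = reference_height(ligature)
    last_cols = sum(sum(row[:3]) for row in ligature)
    return last_cols >= 0.92 * ref


# A vertical stroke in the leftmost column standing on a full baseline row.
stroke = [[1, 0, 0, 0, 0, 0] for _ in range(9)] + [[1, 1, 1, 1, 1, 1]]
# Only a baseline row, with no vertical stroke.
flat = [[0, 0, 0, 0, 0, 0] for _ in range(9)] + [[1, 1, 1, 1, 1, 1]]
```

With these inputs, `last_char_is_alif(stroke)` holds because the leftmost columns contain 12 pixels against a reference height of 10, while `last_char_is_alif(flat)` does not.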

After the execution of this stage, some shapes are still segmented incorrectly, for example with a wrong segmentation point before a semicircle, as shown in Fig. 7. This problem is handled in the following section.

Fig. 7. Wrong segmentation (a) Original image (b) Wrongly segmented last character

2.2.3. Semicircle Segmentation

This stage removes the incorrect segmentation points before the semicircles of ‘Sheen’, ‘Soaad’, and ‘Zoaad’. For these shapes, the last segment must have only two peaks and the right peak must be higher than the left one. Moreover, there must be no secondary ligature in the last segment, to ensure that the last segment is not a ‘Noon’ shape or any other shape. However, a last character without a secondary ligature can be ‘Laam’, so it is checked that the number of pixels above the baseline is less than the number of pixels below it. Another check, that there is no dot below the baseline in the second last segment and no ‘Yey’, ensures that the shape is not ‘Noon-Ghunnah’. These facts are illustrated in Fig. 8.

A problem arises when the ‘Seen’ shape occurs at the start or in the middle of a ligature, which causes under segmentation, so the next stage is employed.

Fig. 8. Semicircle segmentation (a) Four peak points in the last segment (b) Secondary ligature in the last segment (c) ‘Laam’ in the last segment (d) ‘Yey’ before ‘Noon-Ghunnah’

2.2.4. Under Segmentation

The occurrence of the ‘Seen’ shape at the start or in the middle of a ligature causes under segmentation, so a segmentation point should be marked after ‘Seen’. Such an occurrence is detected by using the fact that ‘Seen’ has three vertical lines without any secondary ligature. The identification of these lines depends on the font size. For a small font size, the lines start from the baseline or one row above it. For a medium or large font size, they start from the baseline or at most two rows above it. These vertical lines can never start below the baseline. Moreover, the first vertical line must have a sum of pixels greater than 20% of the ligature height, and the difference between the number of pixels of the first vertical line and that of the second, as well as of the third, should not be greater than 80%.

2.2.5. Fine Tuning

It is observed from experiments that a valid character must have both horizontal and vertical strokes. If, after the execution of the above four stages, a segment has only horizontal strokes, then its segmentation point is removed.

3. Conclusion

We tested our approach on images in the ‘Batool’ font, scanned at 300 dpi, and found excellent results. Sample results for small and large font sizes are shown in Fig. 9. We have achieved an accuracy rate of 99.4%, which is better than the previous approaches presented in Table 3. The 0.6% inaccuracy is due to under segmentation of the ‘Laam’ shape when it occurs at the start or in the middle of a ligature.

Fig. 9. Sample results (a) Sample image in small font size (b) Line segmentation of (a) image (c) Character segmentation of (b) image (d) Sample image in large font size (e) Line segmentation of (d) image (f) Character segmentation of (e) image

Table 3. Comparison of segmentation approaches

Approach        % Accuracy
[11]            96.9%
[12]            99.3%
[13]            90% - 100%
[14]            86%
[15]            69.72%
Our Approach    99.4%

References:

[1] H. Al-Yousefi and S.S Udpa, Recognition of Arabic Characters, IEEE Transactions on Pattern Analysis and Machine Intelligence. 14(1992) 853-857


[2] D. Motawa, A. Amin and R.Sabourin, Segmentation of Arabic Cursive Script, Proceedings of the 4th International Conference on Document Analysis and Recognition. (1997)

[3] A. Amin and H. B. Al. Sadoun, A New Segmentation Technique of Arabic Text, Proceedings of the 11th IAPR International Conference on Pattern Recognition, and Pattern Recognition Methodology and Systems. (1992)

[4] A. M. Elgammal and M. A. Ismail, A Graph-Based Segmentation and Feature Extraction Framework for Arabic Text Recognition, Proceedings of the 6th International Conference on Document Analysis and Recognition. (2001)

[5] A. Amin, Off line Arabic Character Recognition - A Survey, Proceedings of the 4th International Conference on Document Analysis and Recognition. (1992)

[6] S.W Lee, D-J Lee, and H-S Park, A New Methodology for Gray-Scale Character Segmentation and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence. 18 (1996) 1045-1050

[7] A. Amin and W. Mansoor, Recognition of Printed Arabic Text using Neural Networks, Proceedings of the 4th International Conference on Document Analysis and Recognition. (1997)

[8] M. Blumenstein and B. Verma, Neural-based Solutions for the Segmentation and Recognition of Difficult Handwritten Words from a Benchmark Database, Proceedings of the 5th International Conference on Document Analysis and Recognition. (1999)

[9] M. Mohamed and P. Gader, Handwritten Word Recognition Using Segmentation-Free Hidden Markov Modeling and Segmentation-Based Dynamic Programming Techniques, IEEE Transactions on Pattern Analysis and Machine Intelligence. 18 (1996) 548-554

[10] N. Arica, and F. T. Yarman-Vural, Optical Character Recognition for Cursive Handwriting, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2002) 801-813

[11] U. Pal and A. Sarkar, Recognition of Printed Urdu Script, Proceedings of the 7th International Conference on Document Analysis and Recognition. (2003)

[12] K. R. Pakker, H. Miled, and Y. Lecourtier, A New Approach for Latin/Arabic Character Segmentation, Proceedings of the 3rd International Conference on Document Analysis and Recognition. (1995)

[13] B.A Najoua and E. Noureddine, A Robust Approach for Arabic Printed Character Segmentation, Proceedings of the 3rd International Conference on Document Analysis and Recognition. (1995)

[14] T. Sari, L. Souici and M. Sellami, Off-line Handwritten Arabic Character Segmentation Algorithm: ACSA, Proceedings of 8th International Workshop on Frontiers in Handwriting Recognition. (2002)

[15] A. M. Zeki, The Segmentation Problem in Arabic Character Recognition - The State Of The Art, Proceedings of 1st International Conference on Information and Communication Technologies. (2005)
