final a two stage character segmentation technique

Post on 25-Nov-2014

TWO STAGE CHARACTER SEGMENTATION FOR PRINTED TELUGU TEXT

Under the guidance of

M.Sirisha (Asst.prof)

S.Padmavathi(07H71A0431) K.Gafoor raja(08H75A0403)

MD.Jasmin(07H71A0423) J.Suresh(07H71A0459)

T.Sekhar(07H71A0450)

Introduction:

Optical character recognition (OCR) deals with the recognition of optically processed characters.

Character recognition provides a solution for processing large volumes of data automatically in a large variety of scientific and business applications.

Not much work has been reported on the development of Optical Character Recognition (OCR) systems for Telugu text. Therefore, it is an area of current research.

A compound character may contain one or more connected symbols.

Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands.

Therefore, systems developed for documents of other scripts, like Roman, cannot be used directly for the Telugu language.

Block Diagram

[Block diagram. Stages shown: Text document, Pre-processing, Line Segmentation, Word Segmentation, Character Segmentation, User Input]

Segmentation:

Segmentation uses the classical approach, in which the scanned image is dissected into individual building blocks to be recognized as characters.

It is one of the decision stages in an OCR system: incorrectly segmented characters will not be recognized properly, so the recognition rate is reduced.

The two stages involved in segmentation are:

1) Only the suffixes are segmented from the word, using connected component processing.

2) The remaining characters of the word are then easily segmented using the traditional vertical projection profile.

• The major strength of the proposed two-stage method is that it works faster than the classical single-stage method, which segments characters using connected component analysis only.

Segmentation Methodology:

The method starts by segmenting the lines of the scanned document using the horizontal projection profile.

The words are then segmented using the vertical projection profile.

If subscript characters are present in a word, they are extracted using the connected component method.

If no subscript characters are present, the main characters are segmented using the vertical projection profile.
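The subscript-extraction step can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows (1 = ON pixel) and treats a connected component as a subscript when it lies entirely at or below a row called `threshold_row`. That rule, the 8-connectivity choice, and all function names are assumptions made for illustration; the slides do not specify how the threshold is chosen.

```python
from collections import deque


def connected_components(image):
    """Return a list of components, each a list of (row, col) ON pixels.

    Pixels are grouped with 8-connectivity (an assumption for this sketch).
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ON pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_subscripts(image, threshold_row):
    """Split a word image into (image without subscripts, subscript components).

    Hypothetical rule: a component whose topmost pixel is at or below
    threshold_row is treated as a subscript.
    """
    main = [row[:] for row in image]
    subscripts = []
    for pixels in connected_components(image):
        top = min(y for y, _ in pixels)
        if top >= threshold_row:
            subscripts.append(pixels)
            for y, x in pixels:
                main[y][x] = 0
    return main, subscripts


# Toy word: two main characters (rows 0-1) and one subscript (row 3).
word = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
main, subscripts = remove_subscripts(word, threshold_row=3)
```

In this toy word the subscript is cleanly separated from the main characters; a subscript that touches a main character would merge into one component and would need a different rule.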

Types of segmentation required:

(1) Line Segmentation:

The white space between text lines is used to segment the lines.

To separate the text lines, the horizontal projection profile of the text document image is computed.

The horizontal projection profile is the histogram of the number of ON pixels along every row of the image.
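As a rough illustration of line segmentation with the horizontal projection profile, the sketch below counts ON pixels per row and cuts the page wherever the count drops to zero. It assumes a clean binary image stored as a list of rows (1 = ON pixel); the function names are hypothetical, and a real scan would likely need a small noise tolerance instead of an exact zero test.

```python
def horizontal_projection(image):
    """Count the ON pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def find_runs(profile):
    """Return (start, end) index pairs of consecutive non-zero profile entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def segment_lines(image):
    """Cut the page into text lines at rows that contain no ON pixels."""
    profile = horizontal_projection(image)
    return [image[top:bottom + 1] for top, bottom in find_runs(profile)]


# Toy page: two text lines separated by an empty row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
lines = segment_lines(page)
```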

Line segmentation

(2) Word Segmentation:

Spacing between the words is used for word segmentation, since the spacing between words is greater than the spacing between characters.

• The spacing between the words is found by taking the vertical projection profile (VPP) of an input text line.

• The vertical projection profile is the sum of ON pixels along every column of the image.
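The word-segmentation idea can be sketched as follows: compute the VPP of a text line and start a new word only when the run of empty columns is at least `min_gap` wide, so that narrower gaps between characters are ignored. The `min_gap` parameter and the function names are assumptions for illustration; the slides do not say how the gap threshold is chosen.

```python
def vertical_projection(image):
    """Sum the ON pixels (value 1) in every column of a binary image."""
    return [sum(column) for column in zip(*image)]


def segment_words(line_image, min_gap):
    """Return (first_col, last_col) pairs for each word in a text line.

    A gap of at least min_gap empty columns separates two words;
    shorter gaps are treated as spacing between characters.
    """
    profile = vertical_projection(line_image)
    words = []
    start = None      # first column of the current word
    last_ink = None   # most recent column that contains an ON pixel
    for col, value in enumerate(profile):
        if value == 0:
            continue
        if start is None:
            start = col
        elif col - last_ink - 1 >= min_gap:
            words.append((start, last_ink))
            start = col
        last_ink = col
    if start is not None:
        words.append((start, last_ink))
    return words


# Toy line: a 1-column gap inside the first word, a 3-column gap between words.
line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 1],
]
words = segment_words(line, min_gap=3)
```

With `min_gap=3` the narrow gap at column 2 is kept inside the first word; lowering `min_gap` to 1 would split the line at every empty column, which is the behaviour wanted for character segmentation rather than word segmentation.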


Word Segmentation:

(3) Character Segmentation:

Spacing between the characters can be used for segmentation.

The VPP is also used for character segmentation. However, the vertical projection profile of a word sometimes has no zero-valued valleys, because subscript characters are present.
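The missing-valley problem can be shown on a toy example. The sketch below splits a word at zero-valued columns of its VPP and compares a word without subscripts to the same word with a subscript that spans the gap between two characters. The images and function names are made up for illustration and assume a binary image stored as a list of rows (1 = ON pixel).

```python
def vertical_projection(image):
    """Sum the ON pixels (value 1) in every column of a binary image."""
    return [sum(column) for column in zip(*image)]


def split_at_zero_valleys(image):
    """Return (first_col, last_col) pairs separated by zero-valued VPP columns."""
    profile = vertical_projection(image)
    segments, start = [], None
    for col, value in enumerate(profile):
        if value > 0 and start is None:
            start = col
        elif value == 0 and start is not None:
            segments.append((start, col - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments


# Two characters with an empty column between them.
word_plain = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Same word, with a subscript in the bottom row that covers the gap.
word_with_subscript = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]

plain_segments = split_at_zero_valleys(word_plain)
subscript_segments = split_at_zero_valleys(word_with_subscript)
```

The plain word splits into two characters, while the word with the subscript comes back as a single segment because column 2 is no longer empty. This is why the subscripts have to be taken out before the VPP is applied.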

1) A word without subscripts:

2) A word with subscripts:

Fig 1. A word with subscripts and the threshold level.

Fig 2. The word with its subscripts removed.

RESULTS

Input Image:

Fig. 1: Input Image for Line Segmentation

Line Segmentation:

Fig. 2: First Line After Line Segmentation

Fig. 3: Second Line After Line Segmentation

Fig. 4: Third Line After Line Segmentation

Fig. 5: Input Image For Word Segmentation

Word Segmentation:

Fig. 6: First Word After Word Segmentation

Fig. 7: Second Word After Word Segmentation

Fig. 8: Third Word After Word Segmentation

Fig. 9: Fourth Word After Word Segmentation

Fig. 10: Fifth Word After Word Segmentation

Character segmentation:

Fig 1: Character 1

Fig 2: Character 2

Fig 3: Character 3

Fig 4: Character 4

Fig 5: Character 5

Fig 6: Character 6

Document matching system:

The given document is compared with the original document stored in the database. If the two are the same, the system reports an exact match; otherwise it reports a duplicate.

• Document speaking system

• Document Database System

• Full-text Search

• Processing Documents with Signatures, Company Stamps

• Re-creation of Document Logical Structure and Formatting

• Retention of Fonts and Font Styles
