

JOURNAL OF COMPUTING, VOLUME 2, ISSUE 11, NOVEMBER 2010, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ 

WWW.JOURNALOFCOMPUTING.ORG 107

Multi-Script Line Identification System for Indian Languages

Prakash K. Aithal, Rajesh Gopakumar, and Dinesh U. Acharya 

Abstract — India is a multilingual, multi-script country with 18 official languages and 12 scripts. For Optical Character Recognition (OCR) of a multi-lingual document, the script must be identified before the text lines are fed to the OCRs of the individual scripts. This paper presents a simple and efficient technique for identifying the script of Kannada, Malayalam, Telugu, Tamil, Gujarati, Hindi and English text lines in a printed document. The proposed system uses the horizontal projection profile, the vertical projection profile and top pitch information to distinguish the seven scripts. The knowledge base of the system is developed from 50 different document images containing about 250 text lines of each script. The system is tested on 50 different document images containing about 250 text lines of each script, and an overall classification rate of 97.64% is achieved.

Index Terms — Multilingual Indian scripts, script identification, horizontal projection profile, vertical projection profile, top pitch.


1 INTRODUCTION

In multi-lingual document analysis, it is important to identify the script automatically before feeding each text line of the document to the respective OCR system. Quite a few results have already been reported in the literature on identifying the scripts in multi-lingual, multi-script documents containing Roman and other Oriental scripts such as Chinese, Korean, Japanese, Arabic and Hindi. The classifiers used include statistical analysis, linear discriminant analysis, cluster analysis and template matching, based on features such as texture, upward concavities, optical densities and characteristic shapes or symbols [17].

The earliest work we have found on text-line-wise script identification in Indian documents was reported by Pal and Chaudhuri. The method uses projection profiles, statistical and topological features, and stroke features for decision-tree-based classification of printed Latin, Urdu, Devnagari and Bengali script lines. Later, they proposed an automatic system for identifying Latin, Chinese, Arabic, Devnagari and Bengali text lines in printed documents [18].

In India, a multi-lingual, multi-script country, there are 18 official languages: Kannada, Malayalam, Telugu, Tamil, Gujarati, Marathi, Rajasthani, Urdu, Oriya, Bengali, Gurumukhi, Sanskrit, Nepali, Kashmiri, Assamese, Konkani, Hindi and English. Several languages share a script. The Devnagari script is used to write Hindi, Rajasthani, Marathi, Sanskrit and Nepali, and the Bangla script is used to write Bengali and Assamese. There are therefore 12 scripts in India.

In the context of Indian language document analysis, most of the literature is due to Pal and Chaudhuri. They separate text lines of multi-script documents automatically by extracting features from profiles, water reservoir concepts and contour tracing [1, 2]. Santanu Choudhury, Gaurav Harit, Shekar Madnani and R. B. Shet have proposed a method for identifying Indian languages that combines a Gabor-filter-based technique with a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu [4]. Chanda and Pal have proposed an automatic technique for word-wise identification of Devnagari, English and Urdu scripts in a single document [6]. Gopal Datt Joshi et al. have proposed script identification from Indian documents [7]. Word-level script identification in bilingual documents through discriminating features has been developed by B. V. Dhandra et al. [8]. A neural-network-based system for script identification (Kannada, Hindi and English) in Indian documents has been proposed by Basavaraj Patil et al. [9]. Lijun Zhou, Yue Lu and Chew Lim Tan have developed a method for Bangla and English script identification based on the analysis of connected component profiles [10]. Vijaya and Padma have developed methods for English, Hindi and Kannada script identification using discriminating features and top and bottom profile based features [11].

This paper deals with line-wise script identification for printed Indian documents containing Kannada, Malayalam, Telugu, Tamil, Gujarati, Hindi and English text. Script identification is based on features extracted from the horizontal projection profile, the vertical projection profile and the top pitch information of the script line.

The rest of the paper is organized as follows. Section 2 describes the segmentation method. Section 3 discusses feature extraction. Section 4 presents the rule-based classifier. Finally, Section 5 analyzes the results.

Prakash K. Aithal is with the Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal-576104, Karnataka, India.

Rajesh Gopakumar is with the Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal-576104, Karnataka, India.

Dr. U. Dinesh Acharya is with the Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal-576104, Karnataka, India.

2 SEGMENTATION

White space between text lines is used to segment the text lines. Line segmentation is carried out by calculating the horizontal projection profile of the whole document. The horizontal projection profile is the histogram of the number of ON (black) pixels along every row of the image. The profile exhibits valleys of zero height corresponding to the white space between text lines, and line segmentation is done at these points. Fig. 1 shows the horizontal profile for a sample document. White space between words is wider than white space between characters, so the vertical projection profile is used for word-level and character-level segmentation. Fig. 2 shows the word-level segmentation of an English text line, and Fig. 3 shows the character-level segmentation of the same line.
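The segmentation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the page is already binarized into a 2-D array with 1 for ON (ink) pixels, and the function names and the `min_gap` parameter are hypothetical. A small `min_gap` yields character-level spans, a larger one yields word-level spans.

```python
import numpy as np

def horizontal_profile(binary):
    """Number of ON pixels in every row of a binary image (1 = ink)."""
    return binary.sum(axis=1)

def vertical_profile(binary):
    """Number of ON pixels in every column of a binary image (1 = ink)."""
    return binary.sum(axis=0)

def nonzero_runs(profile):
    """Return (start, end) pairs, end exclusive, of maximal runs where profile > 0."""
    on = np.concatenate(([0], (np.asarray(profile) > 0).astype(int), [0]))
    diff = np.diff(on)
    starts = np.flatnonzero(diff == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(diff == -1)    # 1 -> 0 transitions
    return list(zip(starts.tolist(), ends.tolist()))

def segment_lines(binary):
    """Cut the page at zero-height valleys of the horizontal profile."""
    return [binary[s:e, :] for s, e in nonzero_runs(horizontal_profile(binary))]

def segment_columns(line, min_gap=1):
    """Column spans of a text line separated by at least min_gap empty columns.

    min_gap is an assumed tuning parameter: gaps narrower than it are
    treated as belonging to the same unit (character or word).
    """
    merged = []
    for s, e in nonzero_runs(vertical_profile(line)):
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

For example, `segment_lines(page)` returns one sub-image per text line, and calling `segment_columns` on each line with two different gap thresholds gives the character and word boundaries.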

3 FEATURE EXTRACTION

The proposed system classifies the scripts using features derived from the horizontal projection profile, the vertical projection profile and top pitch information. The following features are extracted:

1) The first and second maxima of the horizontal projection profile (V1 and V2 respectively).

2) The positions of the first and second maxima.

3) Character height (H).

4) Top pitch information of each text line (the top pitch is the first three rows of the text line).

Using the above features, the seven scripts are identified.
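The features listed above could be computed roughly as follows. This is a hedged sketch: the paper does not say whether the "first and second maxima" are the two largest row counts or two distinct peaks, so the code assumes the two largest row values, and the function name `line_features`, the key names `P1`/`P2` and the `pitch_rows` parameter are hypothetical. The input is assumed to be a binary image (1 = ink) with at least two rows.

```python
import numpy as np

def line_features(image, pitch_rows=3):
    """Profile-based features of a binary text-line (or character) image."""
    profile = image.sum(axis=1).astype(int)   # horizontal projection profile
    p1 = int(np.argmax(profile))              # row of the largest value
    rest = profile.copy()
    rest[p1] = -1                             # mask it to find the runner-up
    p2 = int(np.argmax(rest))
    rows = np.flatnonzero(profile > 0)        # rows that contain ink
    height = int(rows[-1] - rows[0] + 1) if rows.size else 0
    return {
        "V1": int(profile[p1]),               # first maximum
        "V2": int(profile[p2]),               # second maximum
        "P1": p1,                             # position of the first maximum
        "P2": p2,                             # position of the second maximum
        "H": height,                          # height of the inked region
        "top_pitch": image[:pitch_rows, :],   # first three rows by default
    }
```

Returning a plain dictionary keeps the sketch short; a rule-based classifier can then read `V1`, `V2` and the positions directly.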

4 RULE-BASED CLASSIFIER 

The proposed system uses a rule-based classifier for script identification. The rules for identifying each script are as follows:

1) Hindi: If V1 > 1.5*V2, the line is Hindi; otherwise it belongs to one of the other languages. Fig. 9 shows the ranges for Hindi and non-Hindi text lines.

2) English: If the vertical run length of a character is at least 75% of the total height of the character, the character is an English character. If more than 20% of the characters in a text line are English characters, the line is English; otherwise it belongs to one of the other languages. The vertical run length of an English script line is given in Fig. 4, and that of a Telugu script line in Fig. 5.

3) Kannada: If more than 45% of the characters have an ON top pitch value, the line is Kannada. (That is, a character with a top modifier has an ON top pitch value, so the rule fires when more than 45% of the characters have a top modifier.) Fig. 6 shows the top pitch of a Kannada document.

4) Telugu: If less than 45% of the characters have an ON top pitch value, the line is Telugu. The top pitch of a Telugu document is shown in Fig. 7. The top pitch of Gujarati, Malayalam and Tamil is shown in Fig. 8.

5) Tamil: If the number of histograms crossing the mean between the first and second maxima is 4, 1 or 0, the line is Tamil. The proposed system uses a minimum distance classifier for this purpose. Fig. 12 shows the minimum distance index for a document containing Tamil and English text lines.

6) Malayalam: Malayalam is identified as follows:

i) Calculate the valley mean Vm (the mean of the projection profile between the first and second maxima, including both).

ii) Find the value of the point Vp (the point immediately after the first maximum in the horizontal projection profile).

iii) Compare Vp with Vm; if the result falls between 0.88 and 1.2, the line is Malayalam.

7) Gujarati: If V1 and V2 both fall in the same half, the line is Gujarati. Fig. 10 shows the positions of the first and second maxima for a Gujarati document. Fig. 11 shows the positions of the first and second maxima for a document containing Kannada, Tamil, Malayalam and Telugu.

5 EXPERIMENTAL RESULTS 

The dataset includes 3500 text lines from 100 different document images. The document images were downloaded from e-newspapers (Udayavani, Eenadu, Dinamalar, Malayala Manorama, Gujarat Samachar, Navbharat Times and Times of India for Kannada, Telugu, Tamil, Malayalam, Gujarati, Hindi and English respectively). The knowledge base (rules) of the system was developed from 50 document images containing about 1750 text lines across all seven scripts. The proposed system was tested on 50 different document images containing about 250 text lines of each script, and an overall classification rate of 97.64% was achieved.

Fig. 1. Horizontal projection profile of a sample document with all 7 scripts (English, Hindi, Gujarati, Kannada, Malayalam, Tamil and Telugu respectively).


TABLE 1 CONFUSION MATRIX FOR SCRIPT IDENTIFICATION

Kan - Kannada, Tel - Telugu, Tam - Tamil, Mal - Malayalam, Guj - Gujarati, Hin - Hindi, Eng - English.

Fig. 2. Word level segmentation of English text line

Fig. 3. Character level segmentation of English text line

TABLE 2 COMPARISON OF WORK WITH OTHER CONTEMPORARY WORKS

 

Fig. 4. Vertical run length of English text line

Fig. 5. Vertical run length of Telugu text line

Fig. 6. Top pitch of Kannada document

Fig. 7. Top pitch of Telugu document


6 CONCLUSION 

In this paper, a simple and efficient algorithm is proposed for script identification of Kannada, Telugu, Tamil, Malayalam, Gujarati, Hindi and English text lines in printed documents. The approach is based on the analysis of the horizontal projection profile, the vertical projection profile and top pitch information. The system does not require any training and exhibits an overall accuracy of 97.64%. The work could be extended to word-level script identification and to all Indian scripts.

REFERENCES 

[1] U. Pal and B. B. Chaudhuri, “Script line separation from Indian multi-script documents,” Proc. of fifth Intl. Conf. on Document Analysis and Recognition (IEEE Computer Society Press), pp. 406-409, 1999.

[2] U. Pal, S. Sinha and B. B. Chaudhuri, “Multi-script line identification from Indian documents,” Proc. of seventh Intl. Conf. on Document Analysis and Recognition (ICDAR 2003), vol. 2, pp. 880-884, 2003.

[3] M. C. Padma and P. Nagabhushan, “Identification and separation of text words of Kannada, Hindi and English languages through discriminating features,” Proc. of second National Conference on Document Analysis and Recognition, Karnataka, India, pp. 252-260, 2003.

[4] Santanu Choudhury, Gaurav Harit, Shekar Madnani and R. B. Shet, “Identification of Scripts of Indian Languages by Combining Trainable Classifiers,” ICVGIP, Bangalore, India, Dec. 20-22, 2000.

[5] T. N. Tan, “Rotation Invariant Texture Features and their use in Automatic Script Identification,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 7, pp. 751-756, July 1998.

[6] S. Chanda and U. Pal, “English, Devanagari and Urdu Text Identification,” Proc. Intl. Conf. on Document Analysis and Recognition, pp. 538-545, 2005.

[7] Gopal Datt Joshi, Saurabh Garg and Jayanthi Sivaswamy, “Script Identification from Indian Documents,” LNCS 3872, DAS, pp. 255-267, 2006.

[8] B. V. Dhandra, Mallikarjun Hangarge, Ravindra Hegadi and V. S. Malemath, “Word Level Script Identification in Bilingual Documents through Discriminating Features,” IEEE - ICSCN 2007, Chennai, India, pp. 630-635, Feb. 2007.

[9] S. Basavaraj Patil and N. V. SubbaReddy, “Neural network based system for script identification in Indian documents,” Sadhana, vol. 27, part 1, pp. 83-97, February 2002.

Fig. 8. Top pitch of Gujarati, Malayalam and Tamil script

Fig. 9. Range V2/V1 for Hindi and non-Hindi scripts

Fig. 10. Gujarati script lines where both first and second maxima fall in the same half

Fig. 11. Kannada, Tamil, Malayalam and Telugu script lines where first and second maxima fall in different halves

Fig. 12. Minimum distance index for a document containing English and Tamil script lines


[10] Lijun Zhou, Yue Lu and Chew Lim Tan, “Bangla/English Script Identification Based on Analysis of Connected Component Profiles,” Proc. of seventh DAS, pp. 243-254, 2006.

[11] P. A. Vijaya and M. C. Padma, “Text line identification from a multilingual document,” Proc. of Intl. Conf. on Digital Image Processing (ICDIP 2009), Bangkok, pp. 302-305, March 2009.

[12] Ahmed M. Elgammal and Mohamed A. Ismail, “Techniques for language identification for hybrid Arabic-English document images,” Proc. of the Sixth Intl. Conf. on Document Analysis and Recognition, Seattle, pp. 1100-1104, September 2001.

[13] Rangachar Kasturi, Lawrence O’Gorman and Venu Govindaraju, “Document image analysis - a primer,” Sadhana, vol. 27, part 1, pp. 3-22, February 2002.

[14] D. Dhanya, A. G. Ramakrishnan and Peeta Basa Pati, “Script Identification in printed bilingual documents,” Sadhana, vol. 27, part 1, pp. 73-82, February 2002.

[15] M. C. Padma and P. A. Vijaya, “Identification and separation of Text words of Kannada, Telugu, Tamil, Hindi and English languages through visual discriminating features,” Proc. of Intl. Conf. on Advances in Computer Vision and Information Technology (ACVIT-2007), Aurangabad, India, pp. 1283-1291, 2007.

[16] Prakash K. Aithal, Rajesh G., Dinesh U. Acharya, Krishnamoorthi M. and Subbareddy N. V., “Script Identification for a Multi-Lingual Document,” Proc. National Conference on Recent Trends in Emerging Technologies, Nitte, Karnataka, India, pp. 78-80, 2010.

[17] Prakash K. Aithal, Rajesh G., Dinesh U. Acharya, Krishnamoorthi M. and Subbareddy N. V., “Text Line Script Identification for a Tri-Lingual Document,” ICCCN464, 2010 Second International Conference on Computing, Communication and Networking Technologies, Karur, Tamil Nadu, India.

[18] D. Ghosh, T. Dube and A. P. Shivaprasad, “Script Recognition: A Review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.

Mr. Prakash K. Aithal received his B.E. degree in Computer Science & Engineering from Kuvempu University in 2000 and his M.Tech degree in Computer Science and Engineering from Manipal University in 2010. He is currently working as a Lecturer in the Department of Computer Science and Engineering at Manipal Institute of Technology. He is a life member of the Indian Society for Technical Education (ISTE).

Mr. Rajesh Gopakumar received his B.E. degree in Computer Science & Engineering from Gulbarga University in 1999 and his M.Tech degree in Systems Analysis and Computer Applications from NITK Surathkal (Deemed) in 2005. He is currently working as a Senior Lecturer in the Department of Computer Science and Engineering at Manipal Institute of Technology and is pursuing a PhD in Computer Science and Engineering at Manipal University. He is a life member of the Indian Society for Technical Education (ISTE).

Dr. U. Dinesh Acharya received his B.E. degree in Electrical and Electronics from the University of Mysore in 1983. He received his M.Tech degree in Computer Science and Engineering in 1996 from Mangalore University and his PhD in Computer Science & Engineering in 2008 from Manipal University. He is currently working as a Professor in Computer Science and Engineering at Manipal Institute of Technology, Manipal. He is currently guiding PhD research in Pattern Recognition, Knowledge Based Systems and Database Systems. He is a life member of the Indian Society for Technical Education (ISTE).