Post on 04-Jul-2020

Development of Arabic OCR: Opportunities and Challenges

Team members in UofG: Qiying He (Tina), SangYu Lee, Leonardo Nunes Parente

Team members in IUG: Ghadeer Abu-Oda, Shadia Baroud

What is OCR?

OCR = Optical Character Recognition

Content

1. Situation & Problems: background in Gaza, the Arabic language, and existing problems of Arabic OCR
2. Solutions: Hidden Markov Model, open software, and their advantages and disadvantages
3. Evaluation & Future Work: best solution, limitations, and future trends

Part 1: Situation & Problems

Background

- Blind people
- Free software
- Cannot afford new apps
- ATC in IUG [1]

ATC = Assistive Technology Centre

Complexity of Arabic

- Cursiveness: 28 characters, of which 22 are cursive and 6 are non-cursive.
- Shapes: a character can have up to 4 shapes depending on its position (Table 1) [2].
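The position-dependent shapes can be inspected directly in Unicode, which encodes them in the Arabic Presentation Forms-B block. The following is a minimal sketch, not part of the original slides: the function name `positional_forms` and the choice of BEH and ALEF as sample letters are illustrative assumptions.

```python
import unicodedata
from collections import defaultdict


def positional_forms(first=0xFE70, last=0xFEFC):
    """Map each base letter to the positional forms Unicode encodes for it.

    The default range is the Arabic Presentation Forms-B block.
    """
    forms = defaultdict(set)
    for code in range(first, last + 1):
        # Decompositions look like "<initial> 0628"; unassigned code
        # points return an empty string.
        parts = unicodedata.decomposition(chr(code)).split()
        # Keep only "one tag + one base letter" entries, which skips
        # ligatures and spacing diacritics.
        if len(parts) == 2 and parts[0].startswith("<"):
            forms[int(parts[1], 16)].add(parts[0].strip("<>"))
    return forms


forms = positional_forms()
# BEH joins on both sides; ALEF does not join to the following letter.
for base in (0x0628, 0x0627):
    print(unicodedata.name(chr(base)), sorted(forms[base]))
```

BEH is listed with four forms (isolated, initial, medial, final), while ALEF has only an isolated and a final form, which is why an OCR system has to treat one letter as several visual classes.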

Problems with OCR

A. Most apps focus on English or other Latin-based languages

B. Arabic OCR is still at an early stage (inaccurate)

C. Not many techniques exist for handwritten Arabic recognition

Part 2: Solutions

Solution 1: Statistical Methods

- Hidden Markov Model (HMM)

A. What is HMM?

A tool for representing probability distributions over sequences of observations [3]

B. Why?

- Based on a “process-focused approach”: suitable for recognising handwriting
- High accuracy rate

Algorithm             Accuracy rate
Logistic Regression   89.4% [4]
Linear SVM            85.4% [4]
kNN (3)               89.5% [4]
HMM                   92.1% [5]

[6]

Solution 1: Statistical Methods

- Hidden Markov Model (HMM)

C. How does it work?

A pattern is assigned to the model with the highest posterior probability (i.e. the model that best explains the pattern) [6]

D. Evaluation

- One of the most suitable algorithms for handwriting recognition
- Can be further developed by adapting appropriate software
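The decision rule "assign the pattern to the model with the highest posterior probability" can be sketched with the forward algorithm for discrete HMMs. This is a toy illustration under stated assumptions, not the system described in [5] or [6]: the two models "A" and "B", their probabilities, the equal priors, and the two-symbol observation alphabet are all made up for the example.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) using the scaled forward algorithm."""
    n = len(start)
    # alpha[i] is proportional to P(o_1..o_t, state_t = i).
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def classify(obs, models, priors):
    """Pick the model maximising log P(obs | model) + log P(model)."""
    scores = {
        name: forward_log_likelihood(obs, *params) + math.log(priors[name])
        for name, params in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical left-to-right models over two stroke symbols (0 and 1).
START = [1.0, 0.0]
TRANS = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "A": (START, TRANS, [[0.9, 0.1], [0.1, 0.9]]),  # mostly 0s, then 1s
    "B": (START, TRANS, [[0.1, 0.9], [0.9, 0.1]]),  # mostly 1s, then 0s
}
PRIORS = {"A": 0.5, "B": 0.5}

label, scores = classify([0, 0, 1, 1], MODELS, PRIORS)
print(label, scores)
```

Because the evidence term P(obs) is the same for every model, comparing likelihood times prior is enough to find the highest posterior. A real recogniser would use features extracted from the image rather than two hand-picked symbols.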

Solution 2: OCR Software

- Tesseract

A. What is Tesseract?

OCR engine for various operating systems, developed by HP in 1995 [7]

B. Why?

- Price: free

Software         Price
Sakhr            £650.00
OmniPage (Pro)   £292.00
ABBYY            £100.00
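As a rough illustration of how Tesseract is typically driven, the sketch below wraps its command-line form `tesseract IMAGE OUTPUTBASE -l LANG`. It assumes that Tesseract and its Arabic language data (`ara`) are installed; the helper names and the file names `page.png` and `page` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="ara"):
    """Build the argument list for `tesseract IMAGE OUTPUTBASE -l LANG`."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="ara"):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang), check=True
    )
    # By default Tesseract writes its result to OUTPUTBASE.txt.
    return output_base + ".txt"


# Hypothetical usage; nothing is executed until run_ocr is called.
print(build_tesseract_command("page.png", "page"))
```

Calling `run_ocr("page.png", "page")` on a machine with Tesseract installed would write the recognised text to `page.txt`.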

Solution 2: OCR Software

- Tesseract

B. Why?

- Open Source Software (OSS): more opportunities to incorporate software users’ input

                       Character   Word
Change of error rate   -7.31%      -5.339%

C. Evaluation

- Easy accessibility: no cost & open for input
- Need for more participation & better accuracy for Arabic
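The character and word figures in the error-rate table are most naturally read as character error rate and word error rate. The slides do not say how they were measured, so the sketch below is an assumption: it uses the common definition, edit distance between the reference text and the OCR output divided by the reference length. The sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def character_error_rate(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


def word_error_rate(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


# Invented example: the OCR output gets one character wrong.
REFERENCE = "كتاب جديد"
HYPOTHESIS = "كتاب جديذ"
print(character_error_rate(REFERENCE, HYPOTHESIS))
print(word_error_rate(REFERENCE, HYPOTHESIS))
```

A single wrong character spoils a whole word, so the word error rate is usually the higher of the two figures for the same output.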

Part 3: Evaluation & Future Work

Free collaborative Arabic OCR software

- Online community
- Developers: university students
- Base: Tesseract + HMM

Wikipedia:

- 35 million articles in 288 different languages [9]

Linux:

- Since 2005: 12,000 developers from more than 1,200 companies [8]
- Android
- Ubuntu
- Debian
- Fedora

NEMLAR (Network for Euro-Mediterranean Language Resources) project [10] (2003-2005):

- Partners: Egypt, Jordan, Lebanon, Morocco, Tunisia, West Bank & Gaza Strip, Denmark, France, Greece and the Netherlands
- BLARK (Basic Language Resource Kit) for Arabic

MEDAR project [10] (2008-2010):

- Machine Translation
- Multilingual Information Retrieval for Arabic
- Supported by the European Commission's ICT programme

Limitations

- Lack of interest among other Arab countries in developing free software
- Programmers uninterested in participating in the project

Future approaches

- Crowdsourcing and database
- Text-to-speech

References

[1] Elaydi, H. and Shehada, H. (2007) A Source of Inspiration: ATC for Visually Impaired Students at the Islamic University of Gaza. ICTA, 7: 12-14.

[2] Asebriy, Z., Bencharef, O., Raghay, S. et al. (2014) Comparative systems of handwriting Arabic character recognition. In: 2014 Second World Conference on Complex Systems (WCCS). IEEE, pp. 90-93.

[3] Srihari, S. N. [no date]. Hidden Markov Models. [PowerPoint slides]. Presented at a CSE 574 lecture at the University at Buffalo.

[4] George, M. [no date]. Optical Character Recognition: Classification of Handwritten Digits and Computer Fonts.

[5] Cao, H. et al. (2014) Progress in the Raytheon BBN Arabic Offline Handwriting Recognition. International Conference on Frontiers in Handwriting Recognition.

[6] RWTH-OCR (2007) Arabic Handwriting Recognition. [online] Available from: https://www-i6.informatik.rwth-aachen.de/~dreuw/arabic.php.

[7] Smith, R. [no date]. The Tesseract open source ocr system. [online] Available from: http://static.googleusercontent.com/…/pubs/archive/33418.pdf

[8] Corbet, J., Kroah-Hartman, G. and McPherson, A. (2015) The Linux Foundation Releases Linux Development Report. Available at: http://www.linuxfoundation.org/ (Accessed: 25 August 2015).

[9] Safer, M. (2015) Wikipedia cofounder Jimmy Wales on 60 Minutes. Available at: http://www.cbsnews.com/…/wikipedia-jimmy-wales-morley-safe…/ (Accessed: 29 August 2015).

[10] MEDAR, Speech and Language Technologies for Arabic (no date) Available at: http://www.medar.info/index.php (Accessed: 29 August 2015).

Thank You!

Q & A
