brailleocr: an open source document to braille converter application

An Open Source Tesseract based Tool for Extracting Text from Images with Application in Braille Translation

Pijush Chakraborty [Roll: 32], Calcutta Institute of Engineering and Management

CS681 Seminar

Introduction

Contribution of the Application in real life:
o Our application integrates an OCR engine with Braille translation.
o To our knowledge, BrailleOCR is currently the only application that supports conversion of an image document to Braille format.
o It will help in converting large documents to Braille format and thereby help many visually impaired people.
o Project site: code.google.com/p/brailleocr
o DOI IJCA Paper reference: 10.5120/11664-7254

Open Source APIs used:
o Tesseract Engine [Open-source OCR Engine]
o Tess4J API [JNA Wrapper for using Tesseract with Java]
o JOrtho API [Java open-source spell checking API]
o Swing Graphics API

Introduction: Use of our Application

Introduction: BrailleOCR GUI

Methodology

Conversion of an Image Document to Braille consists of the following steps:

Methodology: Steps to be Followed

Fig. 1. Steps to be Followed

Pre Processing Step

Pre Processing Steps:
◦ Conversion to grayscale
◦ Conversion of grayscale image to binary
◦ The second sub-step is handled by Tesseract using adaptive threshold.

Reason for Grayscale conversion:
◦ Increases the accuracy in the Recognition step as stated in Ref. [2].
◦ Table 1 gives the accuracy rate for certain input images.

Pre Processing: Image Type

Input Image        No. of Images    Accuracy
Color Image        10               89%
Grayscale Image    10               93%

Table 1: Accuracy of Tesseract

Different Algorithms available:
◦ Averaging
◦ Luminosity method

Luminosity method Benefits:
◦ Human perception is more sensitive to green than to red, and more sensitive to red than to blue.
◦ The weight of the green color component is therefore the highest, followed by red and then blue, i.e. weight of color channel ∝ sensitivity.

Algorithm Used:
The color image can be represented as a discrete function f(xi,yj), 0 <= i < N, 0 <= j < M, where N is the height of the image and M is the width of the image.

    for i = 0 to N-1
        for j = 0 to M-1
            gr(xi,yj) = 0.299*r(xi,yj) + 0.587*g(xi,yj) + 0.114*b(xi,yj)

Here gr(xi,yj) is the grayscale image pixel, r(xi,yj) is the red channel, g(xi,yj) is the green channel and b(xi,yj) is the blue channel.
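The luminosity loop above can be sketched in a few lines. This is only an illustration in Python (BrailleOCR itself is Java-based); the image is assumed to be a plain list of rows of (r, g, b) tuples, and the function names are hypothetical. The averaging method is included for comparison.

```python
def to_grayscale(image):
    """Luminosity method: weight each channel by perceptual sensitivity.

    `image` is assumed to be a list of rows, each row a list of (r, g, b)
    tuples with values in 0..255 (an illustrative representation only).
    """
    gray = []
    for row in image:                      # i = 0 .. N-1
        gray_row = []
        for (r, g, b) in row:              # j = 0 .. M-1
            gray_row.append(round(0.299 * r + 0.587 * g + 0.114 * b))
        gray.append(gray_row)
    return gray


def to_grayscale_average(image):
    """Averaging method: every channel gets the same weight."""
    return [[round((r + g + b) / 3) for (r, g, b) in row] for row in image]


if __name__ == "__main__":
    pixels = [[(255, 0, 0), (0, 255, 0), (0, 0, 255)]]
    print(to_grayscale(pixels))          # [[76, 150, 29]]
    print(to_grayscale_average(pixels))  # [[85, 85, 85]]
```

Pure green maps to a much brighter gray than pure blue under the luminosity weights, while averaging treats all three primaries alike.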

Pre Processing: Grayscale Conversion

Pre Processing: Implementing the Algorithm

Fig. 2. Scanned Image

Fig. 3. Grayscale Image

Text Extraction Step

What is Optical Character Recognition?
◦ Conversion of a scanned image document to machine-encoded text.
◦ Useful for keeping backups of important documents in text format.

Brief History:
◦ 1929-1975: OCR without electronic computers
◦ 1985-2000: Development in OCR for computers
◦ 2000-2013: Developments of industrial standard OCR

Text Extraction: What is OCR?

Fig. 4. OCR implementation

Tesseract is widely regarded as one of the most accurate open source OCR engines.

Developed at HP between 1984 and 1994.

HP released Tesseract as open source in 2005, and Google has since taken over the project. Project site: code.google.com/p/tesseractocr

Google recently launched Tesseract v3.0.

Used from Java applications through the JNA wrapper Tess4J. Project site: tess4j.sourceforge.net

Text Extraction: Tesseract History

Get outlines by connected component analysis.

Organize outlines into blobs.

Organize blobs into text lines.

Characters are chopped and features are extracted
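As a rough illustration of the first step, connected component analysis on a binary image can be sketched as below. This is not Tesseract's implementation (which works on outlines and is far more elaborate); the function name, the 0/1 list-of-lists image format and the choice of 8-connectivity are assumptions made for the sketch.

```python
from collections import deque


def connected_components(binary):
    """Return one bounding box (top, left, bottom, right) per component.

    `binary` is assumed to be a list of rows of 0/1 values, where 1 marks
    an ink pixel. Pixels touching horizontally, vertically or diagonally
    (8-connectivity) belong to the same component.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
    ]
    print(connected_components(page))
    # [(0, 0, 1, 1), (1, 3, 2, 3), (4, 1, 4, 1)]
```

Each bounding box plays the role of a candidate blob that later stages would group into text lines.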

Text Extraction: Tesseract Architecture

Fig. 5. Architecture

Features are extracted using polygonal approximation.

The extracted features are matched against prototypes to find the closest patterns.

Recognition runs in two passes: the adaptive classifier learns from the first pass, so the second pass gives better results.

Text Extraction: Tesseract Character Recognition

Fig. 6. Prototype Matching

Post Processing Step

Why Post Processing?
◦ Corrects errors in the previous step
◦ Gives error free text for Braille Conversion
◦ Spell checking systems provide the best results for the post processing step.

JOrtho API
◦ JOrtho is an open source Java spell checking API that gives suggestions for commonly misspelled words in the text.
◦ The key algorithms include phonetic matching algorithms such as Soundex
◦ Project site: jortho.sourceforge.net

Post Processing: Correcting the Text

Soundex Code:
◦ The Soundex code of a word is a letter followed by three digits, computed using the algorithm below.

Algorithm:
◦ Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
◦ Replace consonants with digits as follows (after the first letter):
    b, f, p, v = 1
    c, g, j, k, q, s, x, z = 2
    d, t = 3
    l = 4
    m, n = 5
    r = 6
◦ Two adjacent letters with the same number are coded as a single number. Two letters with the same number separated by 'h' or 'w' are also coded as a single number.

Post Processing: Soundex Algorithm

Example: “Metacalt” and “Metacalf” return the same code, M324, as they are phonetically similar.
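A minimal sketch of the Soundex rules above, written in Python for brevity. It follows the common American Soundex variant, padding with zeros or truncating so that the code always has three digits; the `suggest` helper and its word list are hypothetical and only hint at how a spell checker might use the code, not how JOrtho is implemented.

```python
# Digit assigned to each consonant group (vowels, y, h and w get no digit).
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code (letter + 3 digits) of `word`."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")     # vowels and y map to ""
        if code and code != previous:
            digits.append(code)
        previous = code              # a vowel resets the "same code" check
    return (first.upper() + "".join(digits) + "000")[:4]


def suggest(word, dictionary):
    """Hypothetical helper: dictionary words that sound like `word`."""
    target = soundex(word)
    return [entry for entry in dictionary if soundex(entry) == target]


if __name__ == "__main__":
    print(soundex("Metacalt"), soundex("Metacalf"))       # M324 M324
    print(suggest("Metacalt", ["Metacalf", "Braille"]))   # ['Metacalf']
```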

Fig. 7. Spell Checking

Braille Translation Step

History of Braille:
◦ Invented by Louis Braille in the 19th century
◦ Accepted throughout the world as a form of written communication for blind individuals
◦ There have been some modifications to the Braille system, such as the inclusion of contractions (contracted words).

Use of Braille:
◦ Braille is the primary reading and writing system used by the visually impaired.
◦ Helps in increasing literacy among the visually impaired.
◦ In the modern world, Braille technologies are supported by various electronic devices.

Braille Cell:
◦ Braille cells are 6-dot cells in which each dot is either raised or flat.
◦ 64 possible combinations (2^6).
◦ Used in refreshable Braille displays

What is Braille?

Fig. 9. six-dot Braille cell

Fig. 8 Braille Refreshable Display

Braille Details:
◦ Grade 1 and Grade 2 are the most commonly used.
◦ Grade 1 Braille includes single letters and numbers, while Grade 2 Braille also includes contractions for words such as for, with, you, etc.
◦ Numbers (1 to 9, 0) are denoted by the letters (a to i, j) preceded by the number sign cell.
◦ Compound letters (e.g. and, with, wh, the, th…) have separate Braille representations.
◦ Uppercase letters are preceded by a Braille cell denoting a capital letter.

Braille: Braille Types

Fig. 10. Braille representations

Braille ASCII:
◦ Subset of ASCII character set.
◦ Contains all 64 Braille representations (6-dot cell).
◦ Maps one-to-one ASCII input to Braille code.
◦ Supported by all Braille embossers.
◦ It uses ASCII codes to send information to Braille displays.
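The one-to-one mapping can be illustrated with a 64-character lookup table. The table below is the commonly published (North American) Braille ASCII ordering, listed in the order of the Unicode patterns U+2800 to U+283F; it is reproduced here as an assumption and should be checked against Ref. [10]. The function names are hypothetical.

```python
# Braille ASCII characters in dot-pattern order: position k corresponds to
# the Unicode Braille pattern U+2800 + k (assumed ordering, see lead-in).
BRAILLE_ASCII = (" A1B'K2L@CIF/MSP"
                 "\"E3H9O6R^DJG>NTQ"
                 ",*5<-U8V.%[$+X!&"
                 ";:4\\0Z7(_?W]#Y)=")


def braille_ascii_to_unicode(text):
    """Convert Braille ASCII text to Unicode Braille patterns."""
    cells = []
    for ch in text.upper():
        index = BRAILLE_ASCII.find(ch)
        if index < 0:
            raise ValueError(f"not a Braille ASCII character: {ch!r}")
        cells.append(chr(0x2800 + index))
    return "".join(cells)


def unicode_to_braille_ascii(cells):
    """Convert 6-dot Unicode Braille patterns back to Braille ASCII."""
    out = []
    for cell in cells:
        index = ord(cell) - 0x2800
        if not 0 <= index < 64:
            raise ValueError(f"not a 6-dot Braille pattern: {cell!r}")
        out.append(BRAILLE_ASCII[index])
    return "".join(out)


if __name__ == "__main__":
    print(braille_ascii_to_unicode("ABC"))   # ⠁⠃⠉
```

Because every table entry is distinct, the conversion is reversible in both directions.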

Braille Patterns:
◦ Braille Patterns are Unicode characters that represent Braille cells.
◦ The block consists of the 256 combinations of the 8-dot Braille cell. We require only 64.
◦ Braille embossers and Braille displays have recently been upgraded to support Unicode Braille.
◦ The Unicode Braille set ranges from U+2800 to U+28FF, though we need only U+2800 to U+283F.
◦ In our application, we have focused on the Unicode Braille representation.

Braille Translation: Electronic Braille

Braille Code Example:
String: “6 dot Braille Cells for 64 combinations”
Braille:

The flowchart below (Fig. 11) gives the complete translation algorithm.

Braille Translation: Algorithm

Fig. 11. Flow Chart for Translation
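Since the flowchart itself is not reproduced in this text, the following is only a sketch of what a Grade 1 translation loop might look like, not the application's actual algorithm. It assumes the standard letter patterns (a to j, then k to t by adding dot 3, then u, v, x, y, z by adding dots 3 and 6, with w as a special case), a number sign (dots 3-4-5-6) before each run of digits and a capital sign (dot 6) before each uppercase letter. Contractions and punctuation are not handled; all names are hypothetical.

```python
def dots_to_cell(dots):
    """'145' -> Unicode Braille character with dots 1, 4 and 5 raised."""
    return chr(0x2800 + sum(1 << (int(d) - 1) for d in dots))


DECADE = ["1", "12", "14", "145", "15", "124", "1245", "125", "24", "245"]

LETTERS = {}
for i, dots in enumerate(DECADE):
    LETTERS[chr(ord("a") + i)] = dots_to_cell(dots)        # a..j
    LETTERS[chr(ord("k") + i)] = dots_to_cell(dots + "3")  # k..t
for letter, dots in zip("uvxyz", DECADE):
    LETTERS[letter] = dots_to_cell(dots + "36")            # u, v, x, y, z
LETTERS["w"] = dots_to_cell("2456")

# Digits 1..9 reuse the cells for a..i, and 0 reuses the cell for j.
DIGITS = {str((i + 1) % 10): LETTERS[chr(ord("a") + i)] for i in range(10)}

NUMBER_SIGN = dots_to_cell("3456")
CAPITAL_SIGN = dots_to_cell("6")
BLANK = dots_to_cell("")


def translate(text):
    """Translate plain text to Unicode Braille, one character at a time."""
    out = []
    in_number = False
    for ch in text:
        if ch in DIGITS:
            if not in_number:            # number sign once per digit run
                out.append(NUMBER_SIGN)
                in_number = True
            out.append(DIGITS[ch])
            continue
        in_number = False
        if ch.lower() in LETTERS:
            if ch.isupper():
                out.append(CAPITAL_SIGN)
            out.append(LETTERS[ch.lower()])
        elif ch == " ":
            out.append(BLANK)
        else:
            out.append(ch)               # unsupported characters pass through
    return "".join(out)


if __name__ == "__main__":
    print(translate("6 dot Braille Cells for 64 combinations"))
```

A fuller translator would also insert a letter sign when one of the letters a to j directly follows a number, and would add the Grade 2 contractions.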

Implementation

Extracting Text and correcting errors.

Implementation: BrailleOCR

Fig. 12. Extracting Text and Correcting Errors

Translation to Braille

Implementation: Braille Conversion

Fig. 13. Converting Text to Braille

Conclusion

We have shown the process of integrating the Tesseract OCR engine with Braille translation.

Our future plan is to make the application multilingual so that it can also support Bharati Braille, which covers Bengali, Hindi, Gujarati and other Indian languages.

We will also provide better support for Grade 2 Braille, as Grade 2 Braille is common nowadays.

Project Site: code.google.com/p/brailleocr

Conclusion and Future Plans

[1] Tesseract Project Site: code.google.com/p/tesseractocr
[2] Chirag Patel, Atul Patel, Dharmendra Patel, Optical Character Recognition using Tool Tesseract: A Case Study, IJCA, October 2012
[3] Pijush Chakraborty and Arnab Mallik, An Open Source Tesseract based Tool for Extracting Text from Images with Application in Braille Translation for the Visually Impaired, IJCA, April 2013
[4] R. Smith, An Overview of the Tesseract OCR Engine, Proc. Ninth Int. Conference on Document Analysis and Recognition, IEEE Computer Society (2007)
[5] Ray Smith, Tesseract OCR Engine, OSCON 2007
[6] Tess4J Project Site: http://tess4j.sourceforge.net/
[7] JOrtho Project Site: http://jortho.sourceforge.net/
[8] Soundex Reference: http://en.wikipedia.org/wiki/Soundex
[9] The Rules of Unified English Braille, International Council on English Braille (ICEB), June 2001
[10] Braille ASCII: http://en.wikipedia.org/wiki/Braille_ASCII
[11] BrailleOCR Project Site: code.google.com/p/brailleocr

References:

Questions?

Thank You!
