Sindhi Optical Character Recognition

Post on 04-Jan-2016

Page 1: Sindhi Optical Character Recognition

Sindhi Optical Character Recognition

By: Mutee U Rahman, Muhammad Rafi, Waleed Butt

سنڌي عڪسي اکرن جي سڃاڻپ

Page 2: Sindhi Optical Character Recognition

Summary of the Project

◦A total of 15 main bodies are considered.

◦Due to complications, diacritics are not considered.

◦Tesseract and Decision Tree training models were generated and tested.

◦Accuracy is calculated by counting the correctly generated IDs.
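
The accuracy measure described above (counting correctly generated IDs) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `id_accuracy` and the sample ID lists are assumptions made for the example.

```python
def id_accuracy(expected_ids, generated_ids):
    """Return (correct, total, percent) for two equal-length ID sequences."""
    if len(expected_ids) != len(generated_ids):
        raise ValueError("expected and generated ID lists must have the same length")
    total = len(expected_ids)
    # A string counts as correct when the generated ID equals the expected ID.
    correct = sum(1 for e, g in zip(expected_ids, generated_ids) if e == g)
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent


# Hypothetical example: four test strings, the last one misrecognized.
expected = [502, 503, 504, 507]
generated = [502, 503, 504, 516]
print(id_accuracy(expected, generated))  # (3, 4, 75.0)
```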

Page 3: Sindhi Optical Character Recognition

Data Description

Data Set-I

15 main bodies

35 Tokens of Training Data

10 Tokens of Testing Data

Page 4: Sindhi Optical Character Recognition

Data Set-II

56 random MBs

Page 5: Sindhi Optical Character Recognition

Tesseract Recognition Results on Data-Set I (Test Data)

Syllable | ID | Total Strings | Correctly Recognized
با | 502 | 10 | 10
بد | 503 | 10 | 10
ٻو | 504 | 10 | 10
د | 505 | 10 | 10
ني | 506 | 10 | 10
۽ | 507 | 10 | 1
هو | 508 | 10 | 10
جي | 509 | 10 | 10
خو | 510 | 10 | 10
۾ | 511 | 10 | 10
ن | 512 | 10 | 10
ر | 513 | 10 | 10
سا | 514 | 10 | 10
س | 515 | 10 | 10
و | 516 | 10 | 10
ي | 517 | 10 | 10
Subtotal | | 160 | 151

Accuracy: 94.375%
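
As a check on the arithmetic, the subtotal and accuracy for the Tesseract results on Data-Set I can be recomputed from the per-ID counts. The counts are copied from the slide; the variable names are assumptions made for this sketch.

```python
# Correctly recognized strings per syllable ID, copied from the slide.
# Each ID was tested with 10 strings.
tesseract_correct = {
    502: 10, 503: 10, 504: 10, 505: 10,
    506: 10, 507: 1,
    508: 10, 509: 10, 510: 10, 511: 10, 512: 10, 513: 10,
    514: 10, 515: 10, 516: 10, 517: 10,
}
STRINGS_PER_ID = 10

total = STRINGS_PER_ID * len(tesseract_correct)
correct = sum(tesseract_correct.values())
accuracy = 100.0 * correct / total

print(total, correct, accuracy)  # 160 151 94.375
```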

Page 6: Sindhi Optical Character Recognition

Tesseract Accuracy Results on Data-Set II (Data File)

100% accuracy on the random data file

Syllable | ID | Total Strings
با | 502 | 5
بد | 503 | 0
ٻو | 504 | 1
د | 505 | 4
ني | 506 | 6
۽ | 507 | 5
هو | 508 | 0
جي | 509 | 3
خو | 510 | 5
۾ | 511 | 6
ن | 512 | 7
ر | 513 | 6
سا | 514 | 4
س | 515 | 3
و | 516 | 5
ي | 517 | 1

Total 56

Page 7: Sindhi Optical Character Recognition

Decision Tree Results

Syllable | ID | Total Strings | Correctly Recognized
با | 502 | 10 | 10
بد | 503 | 10 | 10
ٻو | 504 | 10 | 9
د | 505 | 10 | 9
ني | 506 | 10 | 8
۽ | 507 | 10 | 1
هو | 508 | 10 | 8
جي | 509 | 10 | 9
خو | 510 | 10 | 10
۾ | 511 | 10 | 9
ن | 512 | 10 | 10
ر | 513 | 10 | 9
سا | 514 | 10 | 10
س | 515 | 10 | 10
و | 516 | 10 | 9
ي | 517 | 10 | 8
Subtotal | | 160 | 139

Accuracy: 86.875%
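
To see where the Decision Tree loses accuracy, the per-ID counts above can be turned into an error breakdown. The counts are taken from the slide; the names `dt_correct`, `errors`, and `worst_first` are assumptions made for this sketch.

```python
# Correctly recognized strings per syllable ID for the Decision Tree,
# copied from the slide. Each ID was tested with 10 strings.
dt_correct = {
    502: 10, 503: 10, 504: 9, 505: 9,
    506: 8, 507: 1,
    508: 8, 509: 9, 510: 10, 511: 9, 512: 10, 513: 9,
    514: 10, 515: 10, 516: 9, 517: 8,
}
STRINGS_PER_ID = 10

# Keep only the IDs with at least one misrecognized string.
errors = {sid: STRINGS_PER_ID - c for sid, c in dt_correct.items() if c < STRINGS_PER_ID}
# Sort so the IDs with the most errors come first.
worst_first = sorted(errors.items(), key=lambda kv: kv[1], reverse=True)

total_errors = sum(errors.values())
accuracy = 100.0 * sum(dt_correct.values()) / (STRINGS_PER_ID * len(dt_correct))

print(worst_first[0])   # (507, 9)
print(total_errors)     # 21
print(accuracy)         # 86.875
```

ID 507 (۽) accounts for 9 of the 21 errors, and it is also the only ID on which Tesseract makes errors on Data-Set I.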

Page 8: Sindhi Optical Character Recognition

Preprocessing

Line Segmentation

◦Sample pages with different numbers of lines were given as input.

◦All lines were extracted correctly (100%).
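
The slides do not say how lines were segmented. One common approach is a horizontal projection profile: rows containing ink belong to a text line, and blank rows separate lines. The sketch below assumes a binarized page given as a list of rows of 0 (background) and 1 (ink); the function name `segment_lines` and the toy page are assumptions for illustration only.

```python
def segment_lines(image):
    """Return (top, bottom) row ranges of text lines in a binary image.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    The bottom index is exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                  # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))   # a blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line runs to the bottom edge
    return lines


# Toy page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```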

Page 9: Sindhi Optical Character Recognition

Preprocessing

Line Segmentation

◦Pages with different numbers of lines were given for line segmentation.

◦All lines were extracted correctly (100%).

Page 10: Sindhi Optical Character Recognition

Preprocessing

Syllable/Ligature Segmentation

◦Syllables/ligatures were extracted from every page.

◦Syllable/ligature segmentation accuracy: 80%
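
The segmentation method is not described in the slides. A frequently used technique for cursive scripts is connected-component labelling on a binarized text line, with the resulting boxes ordered right to left to match the reading direction of Sindhi. The following is a sketch under those assumptions; `connected_components` and the toy line image are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return inclusive bounding boxes (left, top, right, bottom) of
    8-connected ink blobs, ordered right to left."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] and not seen[y][x]:
                # Breadth-first flood fill from an unvisited ink pixel.
                seen[y][x] = True
                queue = deque([(x, y)])
                left = right = x
                top = bottom = y
                while queue:
                    cx, cy = queue.popleft()
                    left, right = min(left, cx), max(right, cx)
                    top, bottom = min(top, cy), max(bottom, cy)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            nx, ny = cx + dx, cy + dy
                            if (0 <= nx < width and 0 <= ny < height
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((nx, ny))
                boxes.append((left, top, right, bottom))
    # Right-to-left reading order: largest right edge first.
    boxes.sort(key=lambda b: b[2], reverse=True)
    return boxes


# Toy text line containing three separate blobs.
line = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
print(connected_components(line))
# [(5, 0, 5, 2), (3, 1, 3, 1), (0, 0, 1, 1)]
```

With this kind of labelling, dots and other diacritics come out as separate small components, so they need their own handling step.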

Page 11: Sindhi Optical Character Recognition

Preprocessing

Main Body (MB)

◦We selected 15 MBs from the Sindhi alphabet.

◦We were not able to isolate diacritics, hence the MBs are not always correctly identifiable.

Total Main Bodies | Correctly Classified as Main Bodies | % Accuracy
15 | 12 | 80%

Page 12: Sindhi Optical Character Recognition

Preprocessing

Diacritics

◦We were not able to extract diacritics from the text.

Page 13: Sindhi Optical Character Recognition

Conclusion

Tesseract accuracy is 94.4% and DT accuracy is 86.9% on Dataset-I.

On Dataset-II, Tesseract accuracy is 100%.

Line extraction accuracy is 100%; syllable/ligature segmentation accuracy is 80%.