Optical Character Recognition with a Neural Network Model for Coptic
Kirill Bulert So Miyagawa Marco Buechler
December 8, 2017
DH2017, Montreal, Canada
Virtual Short Paper
Coptic
About Coptic
• The final stage of the Ancient Egyptian language (third century)
• Multiple dialects (most important: Bohairic, Sahidic)
• Many manuscripts in Sahidic Coptic
Alphabet
• Simple: ca. 30 unique characters
• No upper/lower case in historic texts
• Diacritics add some complexity, but not much
• Based on the Greek alphabet
Coptic projects
• SFB 1136 (Goettingen): https://www.uni-goettingen.de/de/sfb-1136/521113.html
• Digital Edition of the Coptic Old Testament (Goettingen): http://coptot.manuscriptroom.com/
• Coptic SCRIPTORIUM (Georgetown/Pacific): http://copticscriptorium.org/
OCR Systems
Tesseract vs Ocropy, free OCR frameworks
Tesseract
• Font-based learning
• Single characters are decomposed
• Decomposed parts are matched against given text
Input
• Fonts
Problem
• Few Coptic fonts available, not all historic variations covered
Ocropy/Ocropus
• Neural nets with online learning
• Model accuracy proportional to ground truth size
• But size of ground truth is limited
Input
• Ground truth and corresponding images
Problem
• Limited data for learning
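Since Ocropy takes ground truth together with the corresponding images, a quick consistency check before training can help. The sketch below assumes the common Ocropy layout in which each line image `NAME.bin.png` sits next to a transcription `NAME.gt.txt`; the function name `find_unpaired_lines` and the directory passed to it are hypothetical, not part of Ocropy.

```python
from pathlib import Path


def find_unpaired_lines(root):
    """Return line images under `root` that have no matching .gt.txt file."""
    missing = []
    for image in sorted(Path(root).rglob("*.bin.png")):
        # "010001.bin.png" -> "010001.gt.txt" (assumed naming convention)
        stem = image.name[: -len(".bin.png")]
        if not (image.parent / f"{stem}.gt.txt").exists():
            missing.append(image)
    return missing
```

With limited training data, every line image without a transcription is a line the model cannot learn from, so listing the gaps first shows how much ground truth is actually usable.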
Pre-Processing
GIGO principle
• Training: Data → training → model
• Garbage in, garbage out
• Cleaner images improve results tremendously
• Dust and stains can be cleaned algorithmically, but …
• What if the text itself is the noise?
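As an illustration of cleaning dust algorithmically, here is a minimal despeckling sketch. It is not what ScanTailor does internally; it assumes an already binarized page given as a list of rows (1 = ink, 0 = background) and simply removes 4-connected ink blobs below a size threshold. The function name `despeckle` and the parameter `min_size` are hypothetical.

```python
from collections import deque


def despeckle(image, min_size=3):
    """Remove connected ink blobs smaller than `min_size` pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one 4-connected component starting at (y, x).
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase blobs that are too small to be part of a glyph.
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned
```

A size threshold cannot tell a speck of dust from an equally small mark that belongs to the text, so it only addresses the first kind of noise.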
Original Page
• Flawless, almost …
Original Page
• The remaining flaws are easily removable with ScanTailor
Foreign language
• No multilingual OCR models for Coptic
Annotations
• Special characters might not be part of any model (⸤ ⸥)
• Not all annotations wanted
Language specific variations
• Might also not be included in a model
Previous work
• Coptic models created by Moheb for Tesseract (2013)
• Trained with several Coptic fonts, no support for non-Coptic letters
• No support for diacritics
• Non-Coptic letters get replaced with similar Coptic letters
The good, the bad, and the problematic
• Even clean scans still contain noise
Results
Without line numbers
Accuracy Input
• Without time consuming pre-processing
• Ocropy model trained on 10 pages more accurate
• Non-multilingual Tesseract model less accurate
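The slides do not state how accuracy was computed. One common definition, assumed here, is character accuracy: one minus the edit distance between OCR output and ground truth, divided by the ground truth length. The names `edit_distance` and `character_accuracy` are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Share of ground truth characters recognised correctly (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Python strings are sequences of code points, so a base letter and a combining diacritic count as two characters under this measure unless the text is normalised first.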
Without line numbers
Accuracy
• Without non-Coptic letters
• Difference results mostly from diacritics
Input
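A comparison "without non-Coptic letters" requires filtering both ground truth and OCR output first. The sketch below is one possible filter, not necessarily the one used for these results: it keeps letters from the Unicode Coptic block (U+2C80 to U+2CFF) and the Coptic letters in the Greek and Coptic block (U+03E2 to U+03EF), and leaves spaces, punctuation and combining marks untouched. `is_coptic` and `keep_coptic` are hypothetical names.

```python
def is_coptic(ch):
    """True if `ch` lies in one of the two Unicode ranges holding Coptic letters."""
    code = ord(ch)
    return 0x2C80 <= code <= 0x2CFF or 0x03E2 <= code <= 0x03EF


def keep_coptic(text):
    """Drop letters outside the Coptic ranges; keep every non-letter as is."""
    return "".join(ch for ch in text if not ch.isalpha() or is_coptic(ch))
```

Because combining marks are not letters, diacritics survive this step, which matches the slide's observation that the remaining difference comes mostly from diacritics.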
Without line numbers
Accuracy
• Diacritics removed
• Pure Coptic Tesseract model outperforms Ocropy's mixed model
Input
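Removing diacritics before scoring can be sketched with the standard library alone. The version below assumes that diacritics are encoded as Unicode nonspacing marks (category `Mn`), such as the combining overline U+0305; the function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove nonspacing combining marks while keeping the base characters."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)
```

Applying the same function to ground truth and OCR output keeps the comparison fair, since neither side is penalised for marks the other lacks.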
Workload comparison
• Utilising OCR decreases human workload
Roundup
• Utilisation of OCR beneficial for most clean documents
• Tesseract best for monolingual documents with limited fonts and font variations
• Ocropy best for large documents with multiple languages
Road ahead
• Approached by Google for collaboration on OCR for Coptic
• Create data set for Coptic OCR testing
• Transition from typeset to handwritten Coptic texts
• Combination of different models
Thank you. Questions?
Get our Ocropy models at https://github.com/somiyagawa/CopticOCR-1/tree/master/Besa
Presentation: Kirill Bulert, So Miyagawa
Team (in alphabetical order): Kirill Bulert, Marco Büchler, So Miyagawa
Visit us: http://www.etrap.eu, http://www.uni-goettingen.de/de/517150.html
OCR References
1. Uwe Springmann's OCR Workshop: http://www.cis.uni-muenchen.de/ocrworkshop/program.html
2. ScanTailor for pre-processing: http://scantailor.org/
3. Ocropy/Ocropus: https://github.com/tmbdev/ocropy
4. Kraken, an Ocropy fork: http://kraken.re/
5. Tesseract OCR: https://github.com/tesseract-ocr/
6. Moheb's Coptic Pages: http://www.moheb.de/ocr.html
Coptic References
1. SFB 1136 (Goettingen): https://www.uni-goettingen.de/de/sfb-1136/521113.html
2. Digital Edition of the Coptic Old Testament (Goettingen): http://coptot.manuscriptroom.com/
3. Coptic SCRIPTORIUM (Georgetown/Pacific): http://copticscriptorium.org/
Licence
The LaTeX theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.