bne impact ocr_part1

Upload: impact-centre-of-competence

Post on 22-Nov-2014

DESCRIPTION

Presentation introducing general OCR features in relation to the IMPACT project, given by Clemens Neudecker during the demo session held at the BNE on 5 October 2011.

TRANSCRIPT

Page 1: Bne impact ocr_part1

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

IMPACT OCR in a nutshell

Clemens Neudecker, National Library of the Netherlands

IMPACT Demo Day, Biblioteca Nacional de España

Page 2: Bne impact ocr_part1

OCR Process

Binarisation

= transform greyscale or colour images to bitonal (b/w) in order to separate foreground (text) from background

Segmentation

= detection of layout elements in hierarchical order (blocks/regions, lines, words, glyphs)

Pattern Matching (Recognition)

= matching of character shapes with internal font database (classifiers)
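To make the recognition step concrete, here is a minimal sketch of template matching, assuming tiny 3x5 bitmaps and a nearest-template rule based on pixel differences. The `TEMPLATES` table, the `hamming` and `classify` helpers, and the sample glyph are hypothetical illustrations; the slides do not describe how the FineReader classifiers work internally.

```python
from typing import Dict, List

Glyph = List[str]  # rows of '#' (ink) and '.' (background)

# Hypothetical 3x5 templates standing in for an "internal font database".
TEMPLATES: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyph: Glyph) -> str:
    """Return the label of the template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


# A 'T' whose stem lost one pixel, as might happen after poor binarisation.
noisy_t = ["###", ".#.", ".#.", "...", ".#."]
label = classify(noisy_t)
```

Real engines use far richer features and trained classifiers; the sketch only shows why clean binarisation and segmentation matter, since every lost or added pixel moves a glyph away from its template.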

Page 3: Bne impact ocr_part1

ABBYY FineReader

• Main OCR technology provider in IMPACT
• OCR technology experts for 30 years
• IMPACT uses FineReader Engine (SDK)

Page 4: Bne impact ocr_part1

Binarisation

Page 5: Bne impact ocr_part1

Adaptive Binarisation

Original scan

Prev. binarization

New binarization
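The difference between a single global threshold and an adaptive one can be sketched as follows. This is a minimal local-mean threshold written for illustration; the function names, the `radius` and `offset` parameters, and the sample pixel values are assumptions, and it is not the algorithm used in FineReader or developed in IMPACT.

```python
from typing import List

Image = List[List[int]]  # greyscale pixels, 0 = black, 255 = white


def global_threshold(img: Image, t: int = 128) -> Image:
    """Mark a pixel as foreground (1) when it is darker than one fixed value."""
    return [[1 if p < t else 0 for p in row] for row in img]


def adaptive_threshold(img: Image, radius: int = 1, offset: int = 10) -> Image:
    """Mark a pixel as foreground (1) when it is clearly darker than its neighbourhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if img[y][x] < local_mean - offset else 0)
        out.append(row)
    return out


# Hypothetical scan: background fades from light to dark, with two dark strokes.
scan_row = [220, 200, 90, 160, 140, 120, 20, 80, 60]
page = [scan_row[:] for _ in range(3)]

fixed = global_threshold(page)
local = adaptive_threshold(page)
```

On this sample the fixed threshold turns the darker right-hand background into foreground, while the local rule keeps only the two strokes, which is the kind of improvement the before/after images on the slide illustrate.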

Page 6: Bne impact ocr_part1

IMPACT Binarisation

Original

State of the Art

IMPACT

Page 7: Bne impact ocr_part1

Segmentation

Blocks/Regions

Words

Glyphs
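A classic way to obtain such a hierarchy from a bitonal image is projection profiles: empty rows separate lines, empty columns separate glyphs, and wider column gaps separate words. The sketch below assumes a page given as strings of `#` (ink) and `.` (background); the `runs` and `segment` helpers and the `word_gap` parameter are hypothetical, block/region detection is left out, and nothing here reflects how FineReader segments pages.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) with end exclusive


def runs(flags: List[bool]) -> List[Span]:
    """Return the index span of every consecutive run of True values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def segment(page: List[str], word_gap: int = 2) -> List[Dict]:
    """Split a bitonal page into lines, then words, then glyph column spans."""
    layout = []
    row_has_ink = ["#" in row for row in page]
    for top, bottom in runs(row_has_ink):  # each run of inked rows is a text line
        band = page[top:bottom]
        col_has_ink = [any(row[x] == "#" for row in band) for x in range(len(band[0]))]
        words, current = [], []
        for span in runs(col_has_ink):  # each run of inked columns is a glyph
            if current and span[0] - current[-1][1] >= word_gap:
                words.append(current)  # a wide gap closes the current word
                current = []
            current.append(span)
        if current:
            words.append(current)
        layout.append({"rows": (top, bottom), "words": words})
    return layout


page = [
    "##.#...##",
    "##.#...##",
    ".........",
    "#.#......",
]
layout = segment(page)
```

Touching glyphs, skewed lines and multi-column layouts all break this simple approach, which is why segmentation errors such as misclassified regions or lost text occur in practice.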

Page 8: Bne impact ocr_part1

IMPACT Segmentation example

Pre-Impact FR Engine 9 FR Engine 10

Part of column was misclassified as image

Page 9: Bne impact ocr_part1

IMPACT Segmentation example

v. 9 v. 10

Linear word order errors

Page 10: Bne impact ocr_part1

IMPACT Segmentation example

v. 9 v. 10

Lost text

Page 11: Bne impact ocr_part1

Fraktur recognition

Page 12: Bne impact ocr_part1

Languages and Dictionaries

Goal:

• Develop an interface so that external dictionaries can be integrated into the FineReader Engine

2008 - 2009:

• External Dictionary beta interface
• Same quality as with internal dictionaries possible

2010 - 2011:

• Make interface work reliably
• Teach partners how to use it
• Support for any language, any time period
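The idea behind an external dictionary can be sketched generically: a word list supplied from outside the engine is used to choose between competing recognition candidates. The helpers `load_dictionary` and `pick_candidate` and the short word list are hypothetical and purely illustrative; they do not represent the FineReader Engine external dictionary interface, whose API is not described in these slides.

```python
from typing import Iterable, List, Set


def load_dictionary(words: Iterable[str]) -> Set[str]:
    """Normalise an external word list to lower case for lookup."""
    return {w.strip().lower() for w in words if w.strip()}


def pick_candidate(candidates: List[str], dictionary: Set[str]) -> str:
    """Prefer the first candidate found in the dictionary, else keep the top-ranked one.

    `candidates` is assumed to be ordered from most to least likely.
    """
    for cand in candidates:
        if cand.lower() in dictionary:
            return cand
    return candidates[0]


# Illustrative historical spellings that a modern lexicon would not contain.
historical = load_dictionary(["vnd", "seyn", "thun"])

chosen = pick_candidate(["fcyn", "seyn"], historical)
fallback = pick_candidate(["xyz", "abc"], historical)
```

A lexicon matched to the language and period of the material lets the recogniser accept spellings that a modern dictionary would reject, which is the motivation for supporting any language and any time period.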

Page 13: Bne impact ocr_part1

ALTO: New native export format

• Available since FRE 10 R2
• Supports most recent schema: ALTO v. 2.0
• Line coordinates available
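As an illustration of what line coordinates in ALTO make possible, the sketch below reads the text and bounding box of each `TextLine` with Python's standard library. The `SAMPLE` document is a hand-written, abbreviated fragment (not engine output and not schema-validated), and `read_lines` is a hypothetical helper; the namespace is assumed to be the ALTO v2 one.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the ALTO v2 schema.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written, abbreviated sample for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40">
            <String CONTENT="IMPACT" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP WIDTH="20"/>
            <String CONTENT="OCR" HPOS="300" VPOS="200" WIDTH="100" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml: str):
    """Return (text, (hpos, vpos, width, height)) for every TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # Join the words of the line; SP elements only carry spacing.
        text = " ".join(s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS))
        box = tuple(float(line.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append((text, box))
    return lines


lines = read_lines(SAMPLE)
```

With coordinates at line level, a viewer can highlight a whole recognised line on the page image without recomputing it from the individual word boxes.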

Page 14: Bne impact ocr_part1

Thank you! Questions?