Europeana Newspapers -
Refinement Workshop
WP2 – Introduction to Refinement
Belgrade, 13 June 2013
Clemens Neudecker (@cneudecker)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Overview of Refinement Dataset
• Introduction to Refinement: Workflow & Technologies
• Questions & Answers
Objectives
• Analysis of available digital newspaper collections of project partners and identification of subsets suitable for refinement
• Definition of requirements and minimum quality of digitised newspapers for refinement to enable advanced services in Europeana
• Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies
• Providing recommendations on best practices for the refinement of digitised newspaper collections with full text (and ingest to Europeana)
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple & standardised workflow with clear checkpoints
• Diverse partners supplying content with different digitisation & access policies
• Large variety of content in terms of file formats, fonts, languages, etc.
The data
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Refinement Workflow steps
Master List
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method
• Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
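The idea of converting grey-level pixels to bitonal can be illustrated with a minimal sketch. Note that this is not the GPP method used by BCT: it substitutes plain Otsu global thresholding as a simple stand-in, and the function names `otsu_threshold` and `binarise` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    Stand-in for illustration only; BCT itself uses the GPP method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(pixels, threshold):
    """Map each grey pixel to 0 (dark, at or below threshold) or 1 (light)."""
    return [0 if p <= threshold else 1 for p in pixels]


# Toy "scan": three dark ink pixels and three light paper pixels.
grey = [10, 12, 15, 200, 210, 220]
bitonal = binarise(grey, otsu_threshold(grey))
```

Storing one bit per pixel instead of 8 or 24 is what makes the reduction in master image size possible.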
Tools (FRT)
• FRT = File Rename Tool
• Purpose: Support content holders in preparing their data in the correct format
• Background: Ensure folder structure and file naming requirements for automated processing are met
Tools (FAT)
• FAT = File Analyzer Tool
• Purpose: Final quality check of data before refinement
• Background: Assure content and refinement partners that all preparation steps have been executed successfully
Refinement: OCR@UIBK
• OCR = Optical Character Recognition
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• Result: METS/ALTO package containing images, metadata & full text
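To show how the full text in such a package might be consumed, here is a minimal sketch that reads the recognised words from an ALTO file using only the Python standard library. The sample document and the helper names `local_name` and `alto_text_lines` are illustrative assumptions; real ALTO files produced in the project carry coordinates, styles and other attributes that are omitted here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml):
    """Return one string per ALTO TextLine, joined from its String/@CONTENT."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [child.get("CONTENT", "")
                 for child in elem
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


# Hypothetical, heavily simplified ALTO fragment for illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Belgrade"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

for line in alto_text_lines(SAMPLE):
    print(line)
```

Matching on local element names rather than a fixed namespace is a deliberate choice, since the ALTO namespace URI differs between schema versions.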
Refinement: OLR@CCS
• OLR = Optical Layout Recognition
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• Result: METS/ALTO package containing images, metadata & full text
Refinement: NER@KB
• NER = Named Entity Recognition
• Number of pages to be refined: 2 million
• Technologies: Stanford CRF-NER
• Languages supported: German, Dutch, English (+ French, Latvian)
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of named entities: Person, Location, Organization
• Feedback cycle with manual training step for better results
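A sequence tagger of this kind labels individual tokens, so multi-word names have to be assembled afterwards. The sketch below shows one simple way to do that. It does not call Stanford CRF-NER or the europeananp-ner code; the function name `group_entities`, the label spellings (`PERSON`, `LOCATION`, `O` for "no entity") and the sample input are assumptions made for illustration.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens sharing a non-'O' label into entity spans.

    tagged_tokens: list of (word, label) pairs, as a token-level tagger
    might emit them. Returns a list of (entity_text, label) pairs.
    """
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label == current_label and label != "O":
            # Same entity continues: extend the open span.
            current_words.append(word)
            continue
        if current_label not in (None, "O"):
            # Label changed: close the span that was open.
            entities.append((" ".join(current_words), current_label))
        current_words, current_label = [word], label
    if current_label not in (None, "O"):
        entities.append((" ".join(current_words), current_label))
    return entities


# Hypothetical tagger output for one sentence.
tagged = [
    ("Clemens", "PERSON"), ("Neudecker", "PERSON"),
    ("spoke", "O"), ("in", "O"),
    ("Belgrade", "LOCATION"),
]
entities = group_entities(tagged)
```

Two adjacent entities of the same type would be merged by this simple rule; taggers that emit begin/inside markers avoid that ambiguity.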