dh101 2013/2014 course 7 - ocr, printed text recognition, handwriting recognition, ornaments...

Digital Humanities 101 - 2013/2014 - Course 7 Digital Humanities Laboratory Andrea Mazzei and Frédéric Kaplan andrea.mazzei,frederic.kaplan@epfl.ch

Upload: frederic-kaplan

Post on 06-May-2015


Page 1: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation

Digital Humanities 101 - 2013/2014 - Course 7

Digital Humanities Laboratory

Andrea Mazzei and Frédéric Kaplan

andrea.mazzei,frederic.kaplan@epfl.ch


A Job offer

•Running an OCR transcription of 320 pages

•about 60 hours of work

•25 CHF / hour.


Results of the peer grading process


New projects


Venetian opera staging and machinery

•A project that seeks ways to better understand and visualize opera staging

based on evidence found in historical sources (treatises, music prints, etc.)

•Rosand, E. 1990. Opera in Seventeenth-Century Venice : The Creation of a Genre.

Berkeley : University of California Press.

•Bjurström, P. 1962. Giacomo Torelli and Baroque Stage Design. Stockholm :

Almqvist and Wiksell.

•Leclerc, H. 1987. Venise et l’avènement de l’opéra public à l’âge baroque. Paris :

A. Colin.

•Larson, O. K. 1980. Giacomo Torelli, Sir Philip Skippon, and Stage Machinery for

the Venetian Opera, Theatre Journal, Vol. 32, No. 4, pp. 448-457.

www.jstor.org/stable/3207407


Venetian storytelling in the Middle Ages

•Marin Sanudo was a historical writer. Unlike other writers of his

time, he kept a diary noting all the events that happened in Venice. Of

course, it is not the only diary written in Venice. Imagine how to use

this personal information.


Looking at music printing typefaces

•A project that looks at the different music typefaces used in Venetian

prints. Typical questions are : the size of the typeface, when they were

used, for what repertoire, what printers used them, etc.

•Agee, R. 1998. The Gardano Music Printing Firms, 1569-1611.

Rochester, University of Rochester Press.

•Bernstein, J. 1998. Music Printing in Renaissance Venice. The Scotto

Press (1539-1572). Oxford, Oxford University Press.

•Bernstein, J. 2001. Print Culture and Music in Sixteenth-Century Venice.

Oxford, Oxford University Press.


Music at San Marco

•A project that can look at how the cappella di San Marco evolved over

time : how many musicians, where they played in the Basilica, what they

played, etc.

•Selfridge-Field, E. 1994. Venetian instrumental music from Gabrieli to

Vivaldi. New York : Dover.

•Moretti, L. 2004. Jacopo Sansovino and Adrian Willaert at St Mark’s,

Early Music History, Vol. 23, pp. 153-184.


Venetian music prints in libraries today

•A project that looks at the production of music prints in Venice and

where they are held today in libraries and archives around the world

•The Répertoire International des Sources Musicales, Series A/I on music

prints. http://www.rism.info [will be made available digitally for the

project]


Semester 1 : Content of each course

• (1) 19.09 Introduction to the course / Live Tweeting and Collective note

taking

• (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment

• (3) 2.10 Introduction to the Venice Time Machine project / Zotero

•9.10 No course

• (4) 16.10 Digitization techniques / Deadline first assignment

• (5) 23.10 Datafication / Presentation of projects

• (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first

assignment


• (7) 6.11 Pattern recognition / OCR / Semantic disambiguation

• (8) 13.11 Historical Geographical Information Systems, Procedural modelling

/ City Engine / Deadline Project selection

• (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap

• (10) 27.11 Cultural heritage interfaces and visualisation / Museographic

experiences

•4.12 Group work on the projects

•11.12 Oral exam / Presentation of projects / Deadline Project blog

•18.12 Oral exam / Presentation of projects


Today’s course

•Printed Text Recognition

•Handwriting Recognition

•Ornament Recognition

•Text Mining and semantic disambiguation : Extracting named entities

(people, places, etc.) in a text using Wikipedia


Part I : Printed Text Recognition


OCR : Optical Character Recognition

A system that fully recognizes all the printed characters on a document

simply by scanning it.


Mori et al. (1992). Historical review of OCR R&D

•1940 : The first version of OCR

•1950 : The first OCR machines appear

•1960 - 1965 : First generation OCR : NOF, Farrington 360, IBM 1418.

They all used a special font

•1965 - 1975 : Second generation OCR : IBM 1287, NEC, Toshiba. They

could also recognize constrained hand-printed alpha-numerals.

•1975 - 1985 : Third generation OCR : IBM 1975, Poor print quality or

handwritten characters. 275 fonts. Handwriting recognition.

•1986 - Today : OCR to the people

Eikvil, L. (1993). Optical Character Recognition


OCR capabilities

The recognition performance depends on the type and number of fonts

recognized.

•Fixed font : the system can recognize only one font

•Multi font : the system can recognize multiple fonts

•Omni font : the system can recognize most nonstylized fonts without

having to maintain huge databases of specific font information


Omni-font OCR Overview Of Processing


Preprocessing : Text Lines Straightening

Zhang, Z., & Tan, C. L. (2002, June). Straightening warped text lines using polynomial regression. In Proceedings of the 2002 International Conference on Image Processing (Vol. 3, pp. 977-980). IEEE.


Preprocessing : Character Detection

• Image binarization using local adaptive thresholding

•Character detection using region growing-based methods. PROBLEM !

Eikvil, L. (1993). Optical Character Recognition
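The binarization step above can be sketched as follows. This is only a minimal illustration, not the method of any particular OCR engine: it assumes a grayscale image stored as a list of rows of 0-255 integers, and the names `adaptive_threshold`, `window` and `offset` are hypothetical. Each pixel is compared with the mean of its own neighbourhood instead of one global threshold, which is what makes the threshold local and adaptive.

```python
def adaptive_threshold(image, window=3, offset=0):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes ink (1) when it is darker than the mean of its local
    window minus `offset`; otherwise it is background (0).
    """
    height, width = len(image), len(image[0])
    half = window // 2
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            neighbours = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            binary[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return binary


# A single dark pixel on a light background.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
print(adaptive_threshold(page, window=3, offset=10))
```

A larger `window` tolerates slow changes in illumination across the page, while `offset` keeps light background noise from being marked as ink.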


Segmentation Problems : Touching and fragmented characters

•Joints will occur if the document is a dark photocopy or if it is scanned

at a low threshold.

•Joints are common if the fonts are serifed.

•The characters may be split if the document stems from a light

photocopy or is scanned at a high threshold


Segmentation Problems : Distinguishing noise from text

Dots and accents may be mistaken for noise, and vice versa.


Segmentation Problems : Mistaking graphics for text

This leads to non-text being sent to recognition, or to text not being sent at all


Feature Extraction

From each character several features can be extracted :

•Rasterized pixels

•Geometric moment invariant

•Morphological features


Feature Extraction : Zoning

The character image is divided into M×N zones, and the average gray level of

each zone is used as a feature.

Trier, Ø. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods

for character recognition - a survey. Pattern Recognition, 29(4), 641-662
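Zoning can be sketched in a few lines. The example assumes a grayscale character image stored as a list of rows, at least M rows high and N columns wide; the function name `zoning_features` is hypothetical.

```python
def zoning_features(image, m, n):
    """Return the average gray level of each of the m x n zones, row by row."""
    height, width = len(image), len(image[0])
    features = []
    for r in range(m):
        # Row range covered by this band of zones.
        y0, y1 = r * height // m, (r + 1) * height // m
        for c in range(n):
            x0, x1 = c * width // n, (c + 1) * width // n
            zone = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(zone) / len(zone))
    return features


character = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 100, 50, 50],
    [100, 100, 50, 50],
]
print(zoning_features(character, 2, 2))  # [0.0, 255.0, 100.0, 50.0]
```

The resulting M×N numbers form the feature vector passed on to the classifier.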


Feature Extraction : Projection Profile


Feature Extraction : Structural Analysis

Strokes, bays, end-points, intersections between lines and loops.

High tolerance to noise and style variations.


Classification

The principal approaches to decision-theoretic recognition are minimum

distance classifiers, statistical classifiers and neural networks.
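As an illustration of the first family, a minimum distance classifier assigns a character to the class whose prototype feature vector is nearest. The sketch below assumes Euclidean distance and one prototype per class; the prototype values and the name `nearest_class` are made up for the example.

```python
import math


def nearest_class(features, prototypes):
    """Return the label whose prototype vector is closest to `features`."""
    return min(prototypes, key=lambda label: math.dist(features, prototypes[label]))


# Hypothetical two-dimensional prototypes for two character classes.
prototypes = {"I": [0.0, 1.0], "O": [1.0, 0.0]}
print(nearest_class([0.1, 0.8], prototypes))  # I
```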


Matching


Optimum statistical classifiers.

•Bayesian classifier. Given an unknown symbol described by its feature

vector, the probability that the symbol belongs to the class c is computed

for all classes c = 1...N . The symbol is then assigned the class which

gives the maximum probability.

• ...
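The Bayesian decision rule above can be sketched as follows. The slide does not say how the class probabilities are modelled, so this example makes a loud assumption: each class has a prior and independent Gaussian features (a naive Bayes model). The names `bayes_classify` and `gaussian_log_likelihood` and all the numbers are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log-density of x under independent Gaussians (an assumed model)."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, mean, var)
    )


def bayes_classify(x, classes):
    """classes maps label -> (prior, mean vector, variance vector).

    The posterior is proportional to prior * likelihood, so comparing
    log(prior) + log(likelihood) is enough to find the most probable class.
    """
    def log_posterior(label):
        prior, mean, var = classes[label]
        return math.log(prior) + gaussian_log_likelihood(x, mean, var)

    return max(classes, key=log_posterior)


classes = {
    "c": (0.5, [0.2, 0.8], [0.01, 0.01]),
    "e": (0.5, [0.6, 0.4], [0.01, 0.01]),
}
print(bayes_classify([0.25, 0.75], classes))  # c
```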


Post Processing : Grouping

From symbols to strings using symbol proximity

Eikvil, L. (1993). Optical Character Recognition


Post Processing : Error Detection and Correction

•Use of rules defining the syntax of the word. Ex. In English the k never

appears after the h.

•Use of dictionaries. If the word is not in the dictionary, an error has been

detected, and may be corrected by changing the word into the most

similar word.
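Both checks above can be sketched together. The example assumes that "most similar" means smallest edit (Levenshtein) distance, which the slide does not specify; the word list, the forbidden pair and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def violates_rules(word, forbidden_pairs=("hk",)):
    """Syntax rule from the slide: 'k' never appears right after 'h'."""
    return any(pair in word for pair in forbidden_pairs)


def correct(word, dictionary):
    """Return the word itself if known, else the closest dictionary word."""
    if word in dictionary:
        return word
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))


dictionary = ["house", "horse", "mouse"]
print(correct("hause", dictionary))  # house
print(violates_rules("thkat"))       # True
```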


Self-learning

Modern OCR systems enlarge their character database when new fonts

are encountered. Character recognition relies on this previously built

database, which contains the important features of the characters that

are already known. The database must be able to expand itself as more

and more new characters are met, so that recognition ability keeps

increasing.


Handwriting Recognition (HWR)


Offline HWR : Many difficult problems

•Stroke ordering

•Broken lines

•Merged blobs


From Offline to Simulated Online

It is not reliable

•What order were the strokes written in ?

•Doubled-up line segments ?

• Ink blobs ?

•Spurious joins between letters ?

•Missing joins ?


Segmentation : Strokes Extraction


Segmentation : Segments Fitting

Robustly cut letters into segments

Match multiple segments to detect letters

Easier than matching whole letter

Hutchison L. Handwriting Recognition for Genealogical Records


Analytical Approach

It treats a word as a collection of simpler sub-units such as characters

•Segmentation of the word into these units

• Identification of the units

•Word-level interpretation using a predefined lexicon


Problems with the Analytical Approach

• segmentation ambiguity : deciding where to segment the word image

•variability of segment shape : determining the identity of each segment


Holistic Matching

Treats the word as a single, indivisible entity and attempts to recognize it

using features of the word as a whole.


Advantages of Holistic Matching

Coarticulation effect, i.e., the changes in the appearance of a character

as a function of the shapes of neighboring characters


Orthogonality of holistic features : holistic features carry information about

the word that is clearly orthogonal to the knowledge of the characters in it, so it

stands to reason that introducing this knowledge should improve recognition


Evidence from psychological studies : psychological studies of

reading point to the fact that humans do not, in general, read words

letter by letter.


Dynamic Global Search

Assemble word spelling from possible letter readings


Result 1


Result 2


Result 3


ABBYY FineReader : A Case Study


Scanned Document


Image Rotation Adjustment



First Extraction


Synthesizing the Table


Second Extraction


Retrieval of the ornaments from the Hand-Press Period


Problem Statement

For millions of intact books and tens of millions of loose pages, the

provenance of the manuscripts may be in doubt or completely unknown


Manual Solution

Human experts can recover the provenance by examining

linguistic, cultural and/or stylistic clues.

However, such experts are rare and this investigation is a time-consuming

process.


Automatic Solution

By comparing the initial letters in the manuscript to annotated initial

letters whose origin is known, the provenance can be determined.

This process can be automated.


What are the Challenges ?


Ornament Segmentation

Ornament(s) detection and localization with respect to the page reference system.

Baudrier, E., Busson, S., Corsini, S., Delalandre, M., Landré, J., &

Morain-Nicolier, F. (2009, July). Retrieval of the ornaments from the hand-press

period : an overview. In Document Analysis and Recognition, 2009. ICDAR’09. 10th

International Conference on (pp. 496-500). IEEE.


A Compression-Based Distance Measure for Texture

The distance between a window W and an annotated initial letter IL is

defined as :

$$\mathrm{dist}_{CK1}(W, IL) = \frac{\mathrm{mpegSize}(W, IL) + \mathrm{mpegSize}(IL, W)}{\mathrm{mpegSize}(W, W) + \mathrm{mpegSize}(IL, IL)} - 1$$

The first image supplied to mpegSize is assigned as an I frame

and the second becomes a P frame.

Campana, B. J., & Keogh, E. J. (2010). A compression-based

distance measure for texture. Statistical Analysis and Data

Mining, 3(6), 381-398
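The structure of the CK1 formula can be tried out without a video encoder. The sketch below is not the CK1 measure itself: it deliberately swaps the MPEG encoder for `zlib` applied to the concatenated raw bytes of the two inputs, so it only illustrates how the four compressed sizes are combined. The names `compressed_size` and `ck1_like_distance` are hypothetical.

```python
import zlib


def compressed_size(first, second):
    """Stand-in for mpegSize: compressed size of `first` followed by `second`."""
    return len(zlib.compress(first + second, 9))


def ck1_like_distance(w, il, size=compressed_size):
    """Combine four compressed sizes exactly as in the CK1 formula."""
    return (size(w, il) + size(il, w)) / (size(w, w) + size(il, il)) - 1


window = bytes(range(256)) * 4    # a varied byte pattern
initial = b"\x00" * 1024          # a flat byte pattern
print(ck1_like_distance(window, window))   # 0.0
print(ck1_like_distance(window, initial))
```

Comparing an input with itself gives exactly 0, and swapping the two arguments leaves the value unchanged, which mirrors the symmetric form of the original formula.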


Properties of the CK1 Distance Measure

Efficient, robust and parameter-free texture similarity measure.

Rotation, Colour and Illumination Invariant.


Gabor Filters

Images are convolved with each filter.

The mean and standard deviation of each response are concatenated into a feature vector of length 48.

Feature vectors are compared using the Euclidean distance.

Wang, X., Ding, X., & Liu, C. (2005). Gabor filters-based feature extraction for

character recognition. Pattern recognition, 38(3), 369-379
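The last two steps can be sketched on their own, assuming the filter responses have already been computed elsewhere (the Gabor convolution itself is not shown). A vector of length 48 built from one mean and one standard deviation per response implies 24 responses; that count is an inference from the slide, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def response_features(responses):
    """Concatenate (mean, standard deviation) of each filter response.

    `responses` is a list of filter responses, each flattened to a list
    of numbers. With 24 responses the result has length 48.
    """
    features = []
    for response in responses:
        features.extend([mean(response), pstdev(response)])
    return features


def texture_distance(features_a, features_b):
    """Euclidean distance between two feature vectors."""
    return math.dist(features_a, features_b)


# Toy data standing in for 24 real filter responses.
responses = [[0.0, 2.0]] * 24
features = response_features(responses)
print(len(features))  # 48
```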


Data Sets


Experimental Results


Part II : Text mining and semantic disambiguation


Case study : Extracting named entities (people, places, etc.) in a text using Wikipedia


Using Wikipedia

•A Unique ID : A Wikipedia article is identified by a unique name, which is

the article title itself. The respective URL of a Wikipedia article can be

created by concatenating the words in the article title and appending it

to the URL root of the Wikipedia


•Redirections : Some entities can have multiple names. In order to address

this issue, Wikipedia has some article titles that do not have a

substantive article and are only redirected to a different Wikipedia article

with another title. This mechanism is called redirection. Redirections are

used for other purposes such as spelling resolution (e.g. the article title

Oranges is redirected to Orange) and abbreviation resolution (e.g. the

article title UCLA is redirected to University of California, Los Angeles).


•Disambiguation pages : A disambiguation page is created for ambiguous

entity names and it enumerates all the possible articles for that name. For

example, the disambiguation page for Paris enumerates 25 places called

Paris (in America, Canada and Europe), 33 people having Paris as name

or surname, 10 television series and films, whose title contains the word

Paris, etc.


•Outgoing links : In the body text of the Wikipedia article there are

references (links) to other articles. The references are within pairs of

double square brackets.

• Infobox : An infobox is a fixed-format table designed to be added to the

top right-hand corner of articles to consistently present a summary of

some unifying aspect that the articles share and sometimes to improve

navigation to other interrelated articles.
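Two of the Wikipedia features described above, the unique ID and the outgoing links, can be sketched as follows. The example assumes the English Wikipedia URL root and plain wikitext as input; the regular expression handles only simple links and is an illustration, not a full wikitext parser. The function names are hypothetical.

```python
import re
from urllib.parse import quote

WIKIPEDIA_ROOT = "https://en.wikipedia.org/wiki/"

# [[Target]], [[Target|label]] and [[Target#Section]] all yield "Target".
LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def article_url(title, root=WIKIPEDIA_ROOT):
    """Join the words of the title with underscores and append to the root."""
    return root + quote(title.replace(" ", "_"), safe="/(),'")


def outgoing_links(wikitext):
    """Return the titles referenced inside pairs of double square brackets."""
    return [target.strip() for target in LINK.findall(wikitext)]


print(article_url("Marin Sanudo"))
print(outgoing_links("[[Venice]] was ruled by [[Doge of Venice|the Doge]]."))
```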


3 steps

•Data extraction : A (sequence of) word(s) is extracted from a ”Le

Temps” article (e.g. Le Paris). The right boundaries are then set in the extracted

data (e.g. from ”Le Paris”, ”Paris” is retained).

•Disambiguation : Retrieve all the Wikipedia articles whose title contains

the word ”Paris” (e.g. Paris (France), Paris (Texas), Paris Hilton, Paris

(mythology), etc). Find the Wikipedia article that maximizes the

agreement between the content extracted from Wikipedia and the

context of the ”Le Temps” article.

•Entity classification : Classify the entity as place, person, company, etc,

based on the chosen Wikipedia article


Disambiguation strategy


(1) Data extraction

•The first step is the extraction of possible named entities. This step is based on the fact that named entities consist of capitalized words. The rules that we apply for the extraction of possible named mentions in the text are the following :

•Retrieve all the capitalized words (e.g. England)

•Retrieve recursively terms T0 of the form T1 Particle T2, where Particle is one of a set of linking words (e.g. a possessive

pronoun), and the terms T1 and T2 are capitalized words or sequences of capitalized words

(e.g. University of Edinburgh, European Society of Athletic Therapy and Training)

• In French, some entities can contain non-capitalized words, after some specific words.

Therefore, we retrieve non-capitalized words if they follow a word that is contained

in a predefined set of words (e.g. Union, Bibliothèque, etc). For example the Union

soviétique is considered as an entity.
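The first two extraction rules can be sketched with a regular expression. The particle list and the list of leading articles below are hypothetical, chosen only to cover the examples on the slides, and the French rule for non-capitalized words is left out to keep the sketch short.

```python
import re

CAPITALIZED = r"[A-ZÀ-ÖØ-Þ][\w'’-]*"
PARTICLES = "of|and|de|du|des"              # hypothetical particle list
LEADING_ARTICLES = {"The", "Le", "La", "Les"}  # hypothetical boundary fix

# A capitalized word, optionally extended by "[Particle] Capitalized" repeats.
ENTITY = re.compile(rf"{CAPITALIZED}(?:\s+(?:(?:{PARTICLES})\s+)?{CAPITALIZED})*")


def candidate_entities(text):
    """Return candidate named entities found by the capitalization rules."""
    candidates = []
    for match in ENTITY.finditer(text):
        words = match.group().split()
        # Boundary fix: drop a leading article ("Le Paris" -> "Paris").
        if words[0] in LEADING_ARTICLES:
            words = words[1:]
        if words:
            candidates.append(" ".join(words))
    return candidates


print(candidate_entities("The University of Edinburgh is in Scotland, not England."))
# ['University of Edinburgh', 'Scotland', 'England']
```

Sentence-initial words are capitalized too, which is why a boundary step such as the article filter is needed before disambiguation.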


(2) Disambiguation

•The disambiguation process employs a vector space model, in which a

vectorial representation of the processed article is compared with the

vectorial representations of the Wikipedia entities.

•The vectorial representation of the processed article (article vector) is a

vector having all the possible entities of the specific article obtained

during the previous step, while the vectorial representation of a Wikipedia

article (Wikipedia vector) is a vector having all the outgoing links in the

body text of the article.

•Once a Wikipedia article is identified as the most similar to the processed

article, the article vector is updated by adopting the features of the

chosen Wikipedia vector.
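The comparison of vectors can be sketched with sets, treating each vector as binary (an entity or link is either present or absent). The slides do not name the similarity measure, so cosine similarity is an assumption here, and the candidate data is invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def disambiguate(article_entities, candidates):
    """Pick the candidate whose outgoing links best match the article.

    `candidates` maps a Wikipedia title to the set of its outgoing links.
    Returns the chosen title and the updated article vector.
    """
    best = max(candidates, key=lambda title: cosine(article_entities, candidates[title]))
    # The article vector adopts the features of the chosen Wikipedia vector.
    return best, article_entities | candidates[best]


article = {"Paris", "France", "Seine", "Louvre"}
candidates = {
    "Paris (France)": {"France", "Seine", "Louvre", "Europe"},
    "Paris (Texas)": {"Texas", "United States", "Lamar County"},
    "Paris Hilton": {"Hilton Hotels", "New York City"},
}
title, updated = disambiguate(article, candidates)
print(title)  # Paris (France)
```

Updating the article vector after each decision lets later mentions in the same article benefit from the context already resolved.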


(3) Entity classification

•The last step is to classify the entities into persons, places, companies,

etc.

•Ex : Is the entity a place ? If the Wikipedia article contains an infobox,

then we retrieve it and we search for specific tags in it that can classify

the entity as a place.

• If the Wikipedia article does not have an infobox, then we use the first

sentence of the body text.
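The place test can be sketched as below. The infobox is assumed to be already parsed into a dictionary, and both the tag names and the cue words are hypothetical; the slides do not list the actual tags used.

```python
import re

PLACE_TAGS = {"coordinates", "country", "population_total"}  # hypothetical tags
PLACE_WORDS = {"city", "town", "village", "capital", "river"}  # hypothetical cues


def is_place(infobox, first_sentence):
    """Classify an entity as a place from its infobox, else its first sentence.

    `infobox` is a dict of tag -> value, or None when the article has none.
    """
    if infobox is not None:
        return any(tag in infobox for tag in PLACE_TAGS)
    words = set(re.findall(r"\w+", first_sentence.lower()))
    return bool(words & PLACE_WORDS)


print(is_place({"coordinates": "48.85, 2.35", "country": "France"}, ""))  # True
print(is_place(None, "Paris Hilton is an American media personality."))    # False
```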


Partial results

•We have implemented the algorithm and tested it on a subset of the

database

•Our current estimate of the proportion of entities retrieved is 85 %

•Main issue : Some entities are not in Wikipedia.


From Wikipedia to Wikipast

•The first principle of Wikipedia is that it is an encyclopedia. Not all

entities are allowed. Sourcing is important but secondary

•Ongoing discussion with Wikimedia to create an alternative to

Wikipedia, allowing a page on any person, place, etc. from the past as long

as it is clearly sourced.
