Automatically Identifying Record Patterns from the Extracted Data Fields of Genealogical Microfilm

Kenneth Tubbs, David W. Embley

Post on 20-Dec-2015


Problem

• Searching through microfilm by hand is tedious.

• Extraction by hand requires large amounts of time and manpower.

Algorithm

Input

• XML Input File (Preprocessed Microfilm Image)

• Genealogical Ontology

Method

1. Identify Structure

2. Match Attributes

3. Check Constraints

4. Evaluate Candidates

Output

• Record Patterns

External Preprocessing

Input Features

1. Coordinates of each zone.

2. Printed text of each zone.

3. Whether or not each zone is empty.

XML Input File

<zone rectangle="66,55,223,11" printed_text="NAME and Surname of each Person" empty="0" />
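A minimal sketch of reading such zones, assuming the zone elements sit inside a root element (here a hypothetical `<zones>` wrapper) and that `empty="1"` marks an empty zone. The slides do not say what the four rectangle numbers mean, so they are kept as an uninterpreted tuple.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Zone:
    rect: tuple   # the four integers of the rectangle attribute, uninterpreted
    text: str     # printed text of the zone
    empty: bool   # True when the zone is marked empty


def parse_zones(xml_text):
    """Collect every <zone> element in the document into a Zone record."""
    root = ET.fromstring(xml_text)
    zones = []
    for el in root.iter("zone"):
        rect = tuple(int(v) for v in el.get("rectangle").split(","))
        zones.append(Zone(rect, el.get("printed_text", ""), el.get("empty") == "1"))
    return zones


# The <zones> wrapper is an assumption; the zone line is the one from the slide.
SAMPLE = """<zones>
  <zone rectangle="66,55,223,11" printed_text="NAME and Surname of each Person" empty="0" />
</zones>"""

zones = parse_zones(SAMPLE)
print(zones[0])
```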

Identify Structure

• Identify Table Primitives

• Evaluate Primitives

• Factor Table Primitives

Identify Table Primitives

Row: [label:value+] right, height

(Figure: example table label "Name".)

Column: [label:value+] down, width

(Figure: example table label "Name".)
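One possible reading of the two primitive patterns, sketched under loud assumptions: each zone rectangle is taken to be (x, y, width, height), a row primitive collects the zones to the right of a label that share its top edge and height, and a column primitive collects the zones below a label that share its left edge and width. The zone data is made up for illustration, and exact equality stands in for whatever tolerance real scans would need.

```python
from collections import namedtuple

# Assumed field order; the slides do not define the rectangle layout.
Zone = namedtuple("Zone", "x y w h text")


def row_primitive(label, zones):
    """Zones to the right of `label` with the same top edge and height."""
    values = [z for z in zones
              if z.x > label.x and z.y == label.y and z.h == label.h]
    return sorted(values, key=lambda z: z.x)


def column_primitive(label, zones):
    """Zones below `label` with the same left edge and width."""
    values = [z for z in zones
              if z.y > label.y and z.x == label.x and z.w == label.w]
    return sorted(values, key=lambda z: z.y)


# Illustrative zones only, not taken from a real microfilm page.
zones = [
    Zone(0, 0, 50, 10, "Name"),
    Zone(0, 10, 50, 10, "value 1"),
    Zone(0, 20, 50, 10, "value 2"),
    Zone(50, 0, 30, 10, "Age"),
]
name = zones[0]
column_values = [z.text for z in column_primitive(name, zones)]
row_values = [z.text for z in row_primitive(name, zones)]
print(column_values, row_values)
```

Both primitives are generated for the same label here; the row reading wrongly treats "Age" as a value of "Name", which is the kind of candidate the evaluation step is meant to score low.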


Evaluate Primitives

Primitive Confidence Level

$\mathrm{Confidence}(Label_i, Values)$ = the confidence that is associated with each $\mathrm{Confidence}(Label_i, Value_j)$, $j \in J$

where $J$ is the set of values associated with $Label_i$ for the primitive.

$\mathrm{Confidence}(Label_i, Value_j)$ = [equation garbled in the source; its terms are $NumberOfValuesPerLabel$, $LabelSize$, and $ValueSize$, with subscripts $i$, $j$, and $n \in N$]

where $N$ is the set of labels associated with $Value_j$.

Factor Table Primitives

A B C D E F

[A B C D E F] or

[A] [B C D E F] or

[E] [A B C D F] or

Others.
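The candidate factorings listed above can be enumerated mechanically. The sketch below is an assumption about one simple family only: the unfactored form plus every form in which a single label is factored out of the rest. The "Others" on the slide (for example multi-label or nested factorings) are not covered.

```python
def single_label_factorings(labels):
    """Yield the unfactored grouping, then each grouping with one label factored out."""
    yield [list(labels)]
    for i, label in enumerate(labels):
        rest = labels[:i] + labels[i + 1:]
        yield [[label], list(rest)]


labels = ["A", "B", "C", "D", "E", "F"]
factorings = list(single_label_factorings(labels))
for f in factorings:
    print(f)
```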

• An expert user assigns probabilities to types of factorings.

Example

[column:column+] left, .90

[row:column+] below, .85

Match Attributes

• Identify possible mappings from the microfilm table to the genealogical ontology.

Identify Possible Mappings

1. Identical Matches

2. Synonym Matches

3. Composite Matches

Mapping types

Mapping type   Printed Text   Genealogical Ontology
Identical      Name           Name
Synonym        Sex            Gender
Composite      Female Age     Female, Age
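The three mapping types can be sketched as a small lookup, assuming a hand-made synonym list (`SYNONYMS` is hypothetical) and assuming a composite match means every word of the printed text names an ontology attribute. The slides do not say how the matches are actually computed.

```python
# Hypothetical synonym list: printed text (lowercase) -> ontology attribute.
SYNONYMS = {"sex": "Gender"}


def possible_mappings(printed_text, ontology_attrs):
    """Return (mapping type, ontology attributes) candidates for one table label."""
    by_lower = {a.lower(): a for a in ontology_attrs}
    key = printed_text.strip().lower()
    found = []
    if key in by_lower:
        found.append(("identical", [by_lower[key]]))
    if key in SYNONYMS and SYNONYMS[key] in ontology_attrs:
        found.append(("synonym", [SYNONYMS[key]]))
    words = key.split()
    if len(words) > 1 and all(w in by_lower for w in words):
        found.append(("composite", [by_lower[w] for w in words]))
    return found


ONTOLOGY = ["Name", "Gender", "Female", "Age"]
for label in ["Name", "Sex", "Female Age"]:
    print(label, possible_mappings(label, ONTOLOGY))
```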


Evaluate Mapping

• Edit distance between words
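The slide does not say which edit distance is used. As an illustration, the sketch below assumes the common Levenshtein distance with unit costs for insertion, deletion, and substitution.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("surname", "name"))
```

A smaller distance between a printed label and an ontology attribute name would then suggest a more confident mapping; how the distance is turned into a confidence is not stated in the slides.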


Check Constraints

• The algorithm evaluates the factoring of each record against a genealogical ontology.


Identify Records

Label     Number of Values
Address   3
Name      14
Age       13
Gender    14

Table(Address, Name) = 14 / 3 = 4.67
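The observed ratio above is just the value count of one label divided by the value count of the label it is factored under. A minimal sketch, with `table_ratio` as a hypothetical helper name:

```python
# Value counts copied from the table on the slide.
value_counts = {"Address": 3, "Name": 14, "Age": 13, "Gender": 14}


def table_ratio(factored, nested, counts):
    """Average number of `nested` values per `factored` value in the table."""
    return counts[nested] / counts[factored]


ratio = table_ratio("Address", "Name", value_counts)
print(round(ratio, 2))
```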


Genealogical Ontology

• The genealogical ontology is created by an expert user.

• The cardinalities are assigned to the ontology by recording the cardinalities of a corpus of microfilm.


Genealogical Ontology

Ontology(Address, Name) = 1 * 4.3 * 1.1 = 4.73

(Figure: genealogical ontology diagram with the object sets Family, Address, Person, Name, Age, and Gender; the relationships are annotated with cardinalities, including 1, 1.1, 1.2, 1.3, and 4.3.)
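The expected ratio is the product of the cardinalities along the path that connects the two attributes in the ontology. The sketch below takes the three factors from the slide as given; which relationship each factor belongs to is not recoverable from this transcript, so the list is passed in directly rather than looked up in a graph.

```python
import math


def ontology_ratio(cardinalities):
    """Expected ratio: product of the cardinalities along an ontology path."""
    return math.prod(cardinalities)


# Factors as written on the slide for the path from Address to Name.
expected = ontology_ratio([1, 4.3, 1.1])
print(round(expected, 2))
```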


Evaluate Factoring

Ontology(Address, Name) = 1.0 * 4.3 * 1.1 = 4.73

Table(Address, Name) = 14 / 3 = 4.67

Distance Classifier

Distance_From_Ontology = 1 / (4.73 - 4.67)^2 = 277.8

Distance_From_No_Factoring = 1 / (1 - 4.67)^2 = .0742
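Both scores are inverse squared differences, so a larger score means the observed table ratio is closer to that hypothesis. The sketch below recomputes them; choosing the hypothesis with the larger score is an assumption about how the classifier is applied, and `inverse_square_score` is a hypothetical name.

```python
def inverse_square_score(expected, observed):
    """1 / (expected - observed)^2; infinite when the two ratios coincide."""
    diff = expected - observed
    return float("inf") if diff == 0 else 1.0 / diff ** 2


observed = 4.67                                          # Table(Address, Name)
score_factored = inverse_square_score(4.73, observed)    # ontology ratio
score_unfactored = inverse_square_score(1.0, observed)   # ratio with no factoring

best = "factored" if score_factored > score_unfactored else "not factored"
print(round(score_factored, 1), round(score_unfactored, 4), best)
```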


Evaluate Candidates

• For every combination of primitives, attribute mappings, and factorings, compute the product of their confidences.

• Select the most confident combination.
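The selection step can be sketched as an exhaustive search over combinations. This is a simplification: it treats the three choices as independent lists, whereas the real mappings and factorings may depend on the chosen primitive. All confidences below are made up for illustration, except that .90 and .85 reuse the factoring example.

```python
from itertools import product


def best_candidate(primitives, mappings, factorings):
    """Each argument is a list of (name, confidence) pairs.

    Returns the combination with the highest product of confidences.
    """
    best, best_conf = None, -1.0
    for p, m, f in product(primitives, mappings, factorings):
        conf = p[1] * m[1] * f[1]
        if conf > best_conf:
            best, best_conf = (p[0], m[0], f[0]), conf
    return best, best_conf


# Illustrative confidences only.
primitives = [("row", 0.6), ("column", 0.8)]
mappings = [("identical", 0.9), ("synonym", 0.7)]
factorings = [("[column:column+] left", 0.90), ("[row:column+] below", 0.85)]

choice, confidence = best_candidate(primitives, mappings, factorings)
print(choice, round(confidence, 3))
```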

(Figure, shown over several frames: a candidate tree with Primitive 1, Primitive 2, and Primitive 3, their Attribute mappings, the F entries under each mapping, and a Confidence for each combination.)


Output

• Record Patterns

  – Attributes of each record.

  – Geometry of each record.

• Attribute mappings from the table to the ontology.

Microfilm Queries

• A web form provides the interface to query the microfilm database.

• Individuals can enter keywords (such as first and last name), and the system locates the appropriate records among the indexed documents.

Web Query

(Figure: web query form filled in with "Eyre" and "John".)

Query Results

Click an image to select a result document.

Query Results

Relevant region of the document is displayed.
