Post on 20-Dec-2015
Automatically Identifying Record Patterns from the
Extracted Data Fields of Genealogical Microfilm
Kenneth Tubbs
David W. Embley
Problem
• Searching through microfilm by hand is tedious.
• Extraction by hand requires large amounts of time and manpower.
Algorithm
• Input: XML Input File (Preprocessed Microfilm Image); Genealogical Ontology
• Method: Identify Structure, Match Attributes, Check Constraints, Evaluate Candidates
• Output: Record Patterns
External Preprocessing
Input Features
1. Coordinates of each zone.
2. Printed text of each zone.
3. Whether or not each zone is empty.
XML Input File
<zone rectangle="66,55,223,11" printed_text="NAME and Surname of each Person" empty="0" />
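A zone element like the one above can be read with a standard XML parser. The following is a minimal sketch, not the authors' code: the `Zone` class and `parse_zone` function are hypothetical names, the four rectangle numbers are kept in file order without assuming what each one means, and treating `empty="0"` as "not empty" is an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Zone:
    rectangle: tuple   # four integers in file order; their meaning is not assumed here
    printed_text: str
    empty: bool


def parse_zone(xml_text: str) -> Zone:
    """Parse a single <zone .../> element into a Zone record."""
    elem = ET.fromstring(xml_text)
    rectangle = tuple(int(v) for v in elem.attrib["rectangle"].split(","))
    # Assumption: empty="0" means the zone contains data, anything else means empty.
    return Zone(rectangle, elem.attrib["printed_text"], elem.attrib["empty"] != "0")


zone = parse_zone(
    '<zone rectangle="66,55,223,11" '
    'printed_text="NAME and Surname of each Person" empty="0" />'
)
```

A full input file would presumably contain many such elements under a common root; iterating over them with `root.iter("zone")` would follow the same pattern.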
Identify Structure
• Identify Table Primitives
• Evaluate Primitives
• Factor Table Primitives
Identify Table Primitives
Row:
[label:value+] right, height
Column:
[label:value+] down, width
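One possible reading of the row and column patterns is geometric: a row primitive is a label followed by one or more values to its right that share its height, and a column primitive is a label followed by values below it that share its width. The sketch below illustrates that reading only. It assumes each zone is given as (x, y, width, height) with y growing downward, and the `Zone` tuple, the function names, and the tolerance `tol` are all hypothetical.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    x: int
    y: int
    width: int
    height: int
    text: str


def column_primitive(label, zones, tol=2):
    """Values for [label:value+] down, width: zones below the label with matching x and width."""
    values = [z for z in zones
              if z is not label
              and z.y > label.y
              and abs(z.x - label.x) <= tol
              and abs(z.width - label.width) <= tol]
    return sorted(values, key=lambda z: z.y)


def row_primitive(label, zones, tol=2):
    """Values for [label:value+] right, height: zones right of the label with matching y and height."""
    values = [z for z in zones
              if z is not label
              and z.x > label.x
              and abs(z.y - label.y) <= tol
              and abs(z.height - label.height) <= tol]
    return sorted(values, key=lambda z: z.x)


# Made-up zones for illustration.
name = Zone(10, 10, 100, 12, "Name")
zones = [
    name,
    Zone(10, 30, 100, 12, "John"),
    Zone(10, 50, 100, 12, "Mary"),
    Zone(120, 10, 40, 12, "Age"),
    Zone(120, 30, 40, 12, "34"),
]
column_values = [z.text for z in column_primitive(name, zones)]
row_values = [z.text for z in row_primitive(name, zones)]
```

Both readings are produced for the same label on purpose: the same zone can be a candidate label for a row and for a column, and the later evaluation step decides between them.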
Evaluate Primitives
Primitive Confidence Level
Confidence(Primitive_i) is computed from Confidence(Label_i, Value_j) over all j ∈ J, where J is the set of Values associated with Label_i for the primitive.
Confidence(Label_i, Value_j) is computed from NumberOfValuesPerLabel, LabelSize, and ValueSize over all n ∈ N, where N is the set of Labels associated with Value_j.
Factor Table Primitives
A B C D E F
[A B C D E F] or
[A] [B C D E F] or
[E] [A B C D F] or
Others.
• An expert user assigns probabilities to types of factorings.
Example
[column:column+] left, .90
[row:column+] below, .85
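The factorings listed for A B C D E F can be enumerated mechanically. The sketch below covers only the unfactored grouping and the groupings in which exactly one label is factored out; the "Others" case (for example, several labels factored out at once) is deliberately left out, and the function names are hypothetical.

```python
def single_label_factorings(labels):
    """Yield the unfactored grouping, then every grouping with one label factored out."""
    labels = list(labels)
    yield [labels]
    for i, label in enumerate(labels):
        rest = labels[:i] + labels[i + 1:]
        yield [[label], rest]


def show(factoring):
    """Render a factoring in bracket notation, e.g. [A] [B C D E F]."""
    return " ".join("[" + " ".join(group) + "]" for group in factoring)


factorings = [show(f) for f in single_label_factorings("ABCDEF")]
```

Each enumerated factoring would then be weighted by the probability the expert user assigned to its type.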
Match Attributes
• Identify Possible Mappings from the Microfilm Table to the Genealogical Ontology.
Identify Possible Mappings
Mapping types
Match type             Printed Text    Genealogical Ontology
1. Identical Matches   Name            Name
2. Synonym Matches     Sex             Gender
3. Composite Matches   Female Age      Female, Age
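The three mapping types can be illustrated with a small classifier. This is a sketch under stated assumptions rather than the method used in the system: the `ONTOLOGY` set contains only attribute names that appear on the slides, the `SYNONYMS` table is hypothetical, and splitting a composite label on whitespace is an assumed simplification.

```python
ONTOLOGY = {"Name", "Age", "Gender", "Address"}   # attribute names taken from the slides
SYNONYMS = {"Sex": "Gender"}                      # hypothetical synonym table


def classify_mapping(printed_text):
    """Return (match type, mapped terms) for a printed column label."""
    if printed_text in ONTOLOGY:
        return ("identical", [printed_text])
    if printed_text in SYNONYMS:
        return ("synonym", [SYNONYMS[printed_text]])
    parts = printed_text.split()
    if len(parts) > 1:
        # Assumption: a composite label is split into its individual words.
        return ("composite", parts)
    return ("none", [])


results = {text: classify_mapping(text) for text in ["Name", "Sex", "Female Age"]}
```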
Evaluate Mapping
• Edit distance between words
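The slide does not say which edit distance is used. As an illustration, the sketch below implements the common Levenshtein distance (insertions, deletions, and substitutions each cost 1); that choice, and the function name, are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


same = edit_distance("name", "name")
close = edit_distance("surname", "name")
```

A smaller distance between a printed label and an ontology term would translate into a higher mapping confidence; how the distance is converted into a confidence is not specified on the slide.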
Check Constraints
• The algorithm evaluates each factoring of each record against a genealogical ontology.
Identify Records
Label      Number of Values
Address    3
Name       14
Age        13
Gender     14
Table (Address, Name) = 14 / 3 = 4.67
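The Table (Address, Name) figure is the number of Name values divided by the number of Address values. A minimal sketch of that arithmetic, using the counts from the table and a hypothetical `table_ratio` helper:

```python
value_counts = {"Address": 3, "Name": 14, "Age": 13, "Gender": 14}


def table_ratio(counts, factored_label, repeated_label):
    """Observed number of repeated-label values per factored-label value."""
    return counts[repeated_label] / counts[factored_label]


address_name = table_ratio(value_counts, "Address", "Name")
```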
Genealogical Ontology
• The genealogical ontology is created by an expert user.
• The cardinalities are assigned to the ontology by recording the cardinalities of a corpus of microfilm.
Ontology (Address, Name) = 1 * 4.3 * 1.1 = 4.73
[Figure: ontology diagram connecting Family, Address, Person, Name, Age, and Gender, with a cardinality on each relationship.]
Evaluate Factoring
Ontology (Address, Name) = 1.0 * 4.3 * 1.1 = 4.73
Table (Address, Name) = 14 / 3 = 4.67
Distance Classifier
Distance_From_Ontology = 1 / (4.73 - 4.67)^2 ≈ 277.8
Distance_From_No_Factoring = 1 / (1 - 4.67)^2 ≈ .0742
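The distance classifier scores each hypothesis by the inverse squared difference between its expected ratio and the ratio observed in the table, so a larger score means a closer fit. The sketch below reproduces the arithmetic with the table ratio rounded to two decimals, as on the slide; the function and variable names are hypothetical.

```python
def inverse_square_score(expected, observed):
    """1 / (expected - observed)^2; undefined when the two values are equal."""
    return 1.0 / (expected - observed) ** 2


ontology_ratio = 1.0 * 4.3 * 1.1      # expected Names per Address when factored
observed_ratio = round(14 / 3, 2)     # 4.67, rounded as on the slide

score_factored = inverse_square_score(ontology_ratio, observed_ratio)
score_unfactored = inverse_square_score(1.0, observed_ratio)

# The hypothesis with the larger score is the better fit.
best = "factored" if score_factored > score_unfactored else "not factored"
```

With these inputs the factored hypothesis scores roughly 278 against roughly 0.074 for no factoring, so Address is treated as factored out of the Name records. Using the unrounded 14 / 3 would give a somewhat lower factored score but the same decision.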
Evaluate Candidates
• For every combination of primitives, attribute mappings, and factorings, compute the product of their confidences.
• Select the most confident combination.
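The two steps above amount to an exhaustive search over combinations. The sketch below is a simplification: it treats the three choices as independent lists, whereas in the system the available mappings and factorings presumably depend on the chosen primitive. All names and confidence values are made up for illustration.

```python
from itertools import product
from math import prod


def best_combination(primitives, mappings, factorings):
    """Return the (primitive, mapping, factoring) triple with the largest confidence product."""
    candidates = product(primitives.items(), mappings.items(), factorings.items())
    return max(candidates, key=lambda combo: prod(conf for _, conf in combo))


# Hypothetical candidates, each with an illustrative confidence.
primitives = {"Primitive 1": 0.9, "Primitive 2": 0.6}
mappings = {"Sex -> Gender": 0.8, "Sex -> Age": 0.1}
factorings = {"[Address] [Name Age Gender]": 0.95, "[Address Name Age Gender]": 0.3}

best = best_combination(primitives, mappings, factorings)
best_names = [name for name, _ in best]
```

Because every combination is scored, the search grows multiplicatively with the number of candidates at each level; pruning low-confidence branches early would be one way to keep it manageable.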
[Figure, shown over several slides: candidate tree for Primitive 1, Primitive 2, and Primitive 3. Each primitive has Attribute mappings, and each mapping has factorings (F) with an associated Confidence.]
Output
• Record Patterns
  – Attributes of each record.
  – Geometry of each record.
• Attribute mappings from the table to the ontology.
Microfilm Queries
• A web form provides the interface to query the microfilm database.
• Individuals can enter keywords (such as first and last name), and the system locates the appropriate records among the indexed documents.