Table Interpretation
by Sibling Page Comparison
Cui Tao & David W. Embley
Data Extraction Group Department of Computer Science
Brigham Young University
Supported by NSF
Table Interpretation (in context)
Context: Table Understanding
• Table Recognition
• Table Interpretation
• Table Conceptualization
• Table Understanding
Applications
• Not only “understanding” with respect to community knowledge
• But also creation or augmentation of community knowledge
Challenging Conceptual-Modeling Work
Table Interpretation with Sibling Pages: TISP
Within the table-understanding context, TISP covers two steps:
• Table Recognition
• Table Interpretation
TISP: Table Recognition and Interpretation
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
Locate Table Labels
Examples:
• Identification.Gene model(s).Gene Model
• Identification.Gene model(s).2
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
Conceptual Table Interpretation
Wang Notation [Wang96]:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
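One way to picture the Wang-notation result is as a mapping from a tuple of dotted label paths to a cell value. The sketch below is only illustrative: the dictionary layout and the `lookup` helper are assumptions made for this example, not part of TISP.

```python
# Hypothetical representation: each key is a tuple of label paths
# (one path per table dimension); the value is the cell content.
associations = {
    (
        "Identification.Gene model(s).Protein",
        "Identification.Gene model(s).2",
    ): "WP:CE28918",
}


def lookup(table, *label_paths):
    """Return the value associated with the given label paths, or None."""
    return table.get(tuple(label_paths))


value = lookup(
    associations,
    "Identification.Gene model(s).Protein",
    "Identification.Gene model(s).2",
)
print(value)
```

Keying on the full tuple keeps a value tied to every label that indexes it, which is the property the notation expresses.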
Table Ontology
Technique Details
• Unnest tables
• Match tables in sibling pages
  • “Perfect” match (table for layout: discard)
  • “Reasonable” match (sibling table)
• Determine/Use Table-Structure Pattern
  • Discover pattern
  • Pattern usage
  • Dynamic pattern adjustment
Simple Tree Matching Algorithm [Yang91]
[Figure: matched sibling-table trees with regions marked Labels and Values]
Match Score Categorization:
• Exact/Near-Exact
• Sibling-Table
• False
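The matching step can be sketched with the standard dynamic-programming formulation of simple tree matching. The `Node` class, the normalization by average tree size, and the two thresholds in `categorize` are assumptions chosen for illustration; the slides do not give the cut-off values TISP uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def size(node):
    """Count the nodes in a tree."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Size of a maximum top-down, order-preserving matching of two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def categorize(a, b, exact=0.95, sibling=0.5):
    """Map a normalized score to a category (thresholds are placeholders)."""
    score = simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)
    if score >= exact:
        return "exact/near-exact"
    if score >= sibling:
        return "sibling-table"
    return "false"


def row(label, value):
    """Build a <tr> with a label cell and a value cell (toy data)."""
    return Node("tr", [Node("td", [Node(label)]), Node("td", [Node(value)])])


page1 = Node("table", [row("Name", "A"), row("Size", "1")])
page2 = Node("table", [row("Name", "B"), row("Size", "2")])
layout = Node("table", [Node("tr", [Node("td", [Node("img")])])])

print(categorize(page1, page2))   # labels match, values differ
print(categorize(page1, page1))   # identical trees
print(categorize(page1, layout))  # unrelated structure
```

In this toy data the label text nodes match across the two pages while the value text nodes do not, which is why the score lands between the two thresholds.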
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …
Pattern combinations are also possible.
Pattern Usage
(Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data]
(Location.Genomic Position) = X:13518823..13515773 bp
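As a rough illustration of pattern usage, the sketch below applies the row-wise label/value pattern to a small HTML fragment and emits dotted label paths. The HTML fragment, the `RowCollector` class, and the `apply_label_value_pattern` helper are assumptions built around the two values shown above; TISP's actual extraction code is not given in the slides.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def apply_label_value_pattern(html, prefix):
    """Treat every two-cell row as (label, value) under the given prefix."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return {
        f"{prefix}.{label.strip()}": value.strip()
        for label, value in (r for r in parser.rows if len(r) == 2)
    }


# Hypothetical markup for the "Location" table.
html = """
<table>
<tr><th>Genetic Position</th><td>X:12.69 +/- 0.000 cM [mapping data]</td></tr>
<tr><th>Genomic Position</th><td>X:13518823..13515773 bp</td></tr>
</table>
"""

extracted = apply_label_value_pattern(html, "Location")
for path, value in extracted.items():
    print(f"({path}) = {value}")
```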
Dynamic Pattern Adjustment
Initial pattern:
<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+
Adjusted pattern:
<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+ | <tr>(<(td|th)> {L})^6 (<tr>(<(td|th)> {V})^6)+
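The adjustment from a five-column pattern to a five-or-six-column pattern can be sketched as follows. This is a simplification under stated assumptions: a pattern is reduced to a set of accepted column counts, and a new count is accepted whenever a table is still regular (one label row, then value rows of the same width). The slides do not say how TISP decides that an adjustment is safe.

```python
def is_regular(rows):
    """True if rows form one label row plus value rows of equal width."""
    return len(rows) >= 2 and all(len(r) == len(rows[0]) for r in rows)


def interpret(rows, widths):
    """Extract records; return them with the (possibly extended) widths."""
    if not is_regular(rows):
        raise ValueError("table does not fit the header-row pattern")
    width = len(rows[0])
    if width not in widths:
        # Dynamic adjustment: add the new column count as an alternative.
        widths = widths | {width}
    labels, *value_rows = rows
    records = [dict(zip(labels, values)) for values in value_rows]
    return records, widths


# Placeholder labels and values; the pattern was learned with 5 columns.
widths = frozenset({5})
six_column_table = [
    ["L1", "L2", "L3", "L4", "L5", "L6"],
    ["v1", "v2", "v3", "v4", "v5", "v6"],
]

records, widths = interpret(six_column_table, widths)
print(sorted(widths))
print(records[0]["L6"])
```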
TISP Evaluation
Applications
• Commercial: car ads
• Scientific: molecular biology
• Geopolitical: US states and countries
Data: > 2,000 tables, 275 sibling tables, 35 web sites
Evaluation
• Initial two sibling pages
  • Correct separation of data tables from layout tables?
  • Correct pattern recognition?
• Remaining tables in site
  • Information properly extracted?
  • Able to detect and adjust for pattern variations?
Experimental Results
Table recognition: correctly discarded 157 of 158 layout tables
Pattern recognition: correctly found 69 of 72 structure patterns
Extraction and adjustments: 5 path adjustments and 34 label adjustments all correct
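For readers who prefer rates, the counts above convert to percentages as in this small sketch. The dictionary keys are descriptive names invented for the example; the counts are the ones reported on the slide.

```python
# (correct, total) pairs taken from the reported results.
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
    "path and label adjustments": (5 + 34, 5 + 34),
}

rates = {name: correct / total for name, (correct, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```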
Discovered Difficulties
• Abundance of null entries
• Multiple tables as a single table
  • Recognize and group
  • Use box model [Gatterbauer07]
• Factored labels
Table Understanding
• Table Recognition
  • Data table vs. table for layout
  • Adjust (group table components, defactor labels, …)
• Table Interpretation
  • Populate table ontology
  • Additional table-ontology elements (title, footnotes, …)
• Table Conceptualization
  • Capture table semantics
  • Reverse engineer as a conceptual model
• Table Understanding
  • Embed within a community ontology
  • Alternatively, augment community knowledge
fleck velter    gonsity (ld/gg)    hepth (gd)
burlam          1.2                120
falder          2.3                230
multon          2.5                400
repeat:
1. recognize table
2. interpret table
3. conceptualize table
4. merge
5. adjust
until ontology developed
Knowledge Generation
[Figure: conceptual-model diagram with object sets fleck, velter, hepth, and gonosity connected by "has" relationship sets annotated with participation constraints 1 and 1:*]
TANGO (Table Analysis for Generating Ontologies) repeatedly turns raw tables into conceptual mini-ontologies and integrates them into a growing ontology.
[Figure label: Growing Ontology]
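The conceptualize-and-merge part of the loop can be pictured with a deliberately small sketch. It assumes, purely for illustration, that a mini-ontology is a set of (object set, "has", attribute) triples and that merging is set union; the second table and both helper functions are hypothetical, and real TANGO merging and adjustment involve far more than this.

```python
def conceptualize(object_set, attributes):
    """Turn one interpreted table into a mini-ontology of 'has' triples."""
    return {(object_set, "has", attribute) for attribute in attributes}


def merge(ontology, mini_ontology):
    """Integrate a mini-ontology; shared triples are not duplicated."""
    return ontology | mini_ontology


# First entry follows the example table; the second is a made-up
# overlapping table used only to show that merging is incremental.
interpreted_tables = [
    ("fleck", ["velter", "gonsity", "hepth"]),
    ("fleck", ["velter", "hepth"]),
]

ontology = set()
for object_set, attributes in interpreted_tables:
    ontology = merge(ontology, conceptualize(object_set, attributes))

for triple in sorted(ontology):
    print(triple)
```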