1
From Tessellations to Table Interpretation
R. C. Jandhyala¹, M. Krishnamoorthy¹,
G. Nagy¹, R. Padmanabhan¹,
S. Seth², W. Silversmith¹
¹DocLab, Rensselaer Polytechnic Institute
²Computer Science and Engineering, University of Nebraska-Lincoln
(Supported by NSF Grants # 044114854 and 0414644, and Rensselaer Center for Open Source Software)
2
Goal: Construction of a narrow-domain ontology from semi-structured web data
(“table understanding”)
3
Outline
A B C D
Tilings (rectangular tessellations) X-Y trees (1984)
Tables
Wang Categories (1996)
Grammars
4
Outline
A B C D
Tilings (rectangular tessellations) X-Y trees (1984)
Tables
Wang Categories (1996)
Grammars
5
Web tables
• Cannot precisely define human-understandable tables.
• Convert to smaller set of admissible tables.
• Why? Algorithmic ease.
6
Admissible Tables
• Have stub, headings and data cells.
7
Factor out layout-equivalent tables
8
Outline
A B C D
Tilings (rectangular tessellations) X-Y trees (1984)
Tables
Wang Categories (1996)
Grammars
9
Rectangular Tessellations
• Partition of an isothetic rectangle into rectangles.
• Uniquely defined by junction points (location and type).
• Number of tessellations increases rapidly with table size.
10
XY Tessellations
• Special case of rectangular tessellations.
• Successive horizontal and vertical cuts.
• Easily represented by trees.
11
A tiling and its X-Y Tree (aka slicing structure, puzzle tree, tree map)
[Figure: a tiling and its X-Y tree; internal nodes labeled V, H, V]
12
Non-slicing structures – No XY tree
In fact, X-Y tilings are a vanishingly small fraction of all tilings. Restricting to them costs nothing in practice, because tables never contain this “spiral” structure.
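A minimal sketch of how one might test whether a tiling has an XY tree at all. It is not the authors' code: the rectangle encoding `(x0, y0, x1, y1)`, the function names, and the convention that a "V" node holds side-by-side children (separated by vertical cut lines) are all assumptions made for illustration. The function returns `None` for a non-slicing tiling such as the spiral.

```python
def _split(rects, axis):
    """Group rectangles into strips separated by cut lines that cross no rectangle.

    axis 0 looks for vertical cut lines (compares x), axis 1 for horizontal ones.
    """
    lo, hi = axis, axis + 2
    groups, current, reach = [], [], 0
    for r in sorted(rects, key=lambda rect: rect[lo]):
        # Every earlier rectangle ends at or before this one starts: legal cut.
        if current and r[lo] >= reach:
            groups.append(current)
            current = []
        reach = r[hi] if not current else max(reach, r[hi])
        current.append(r)
    groups.append(current)
    return groups


def xy_tree(rects):
    """Return a nested ("V" | "H", children) tree, or None if the tiling is non-slicing."""
    if len(rects) == 1:
        return rects[0]
    for label, axis in (("V", 0), ("H", 1)):
        groups = _split(rects, axis)
        if len(groups) > 1:
            children = [xy_tree(g) for g in groups]
            if any(child is None for child in children):
                return None
            return (label, children)
    return None  # no through-cut in either direction


# Hypothetical examples: a 3x3 "spiral" (pinwheel) and a simple slicing tiling.
spiral = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3), (1, 1, 2, 2)]
slicing = [(0, 0, 3, 1), (0, 1, 1, 3), (1, 1, 3, 3)]
print(xy_tree(spiral))
print(xy_tree(slicing))
```

The spiral has no line that crosses the whole rectangle without cutting a tile, so the recursion stops immediately; the slicing example yields one horizontal cut followed by one vertical cut.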
13
Fundamental Idea
Use XY trees to automate table processing and understanding.
14
Table to XY tree – EX2XY
• Applicable to any XY tessellation.
• Input – Excel Table
  – Copy and paste or Import.
  – Edit to make admissible.
• Output – XY tree
  – as XML for portability.
  – as parenthesized string for grammars.
15
Example
(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
16
After import into Excel
17
After Editing
18
Output - XML
…<block id='1.1.2.1' range='17,2:30,2'>
<content> Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars) </content>
</block>…
19
Outline
A B C D
Tilings (rectangular tessellations) X-Y trees (1984)
Tables
Wang Categories (1996)
Grammars
20
Table Grammars
• Can characterize entire families of tables.
• Developed grammar for one family.
• Input - Nested parenthesized notation.
• Output – Accept/Reject as example of family.
21
Grammar
• For parsing column headers
  S := A                      (Rule 1)
  A := {B}                    (Rule 2)
  B := c [X] B | c [X]        (Rules 3 and 4)
  X := c X | A X | A | c      (Rules 5, 6, 7 and 8)
• S is the start symbol.
• A generates all admissible column headers.
• B generates category trees.
• c is a root category.
• X generates sub-categories.
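The eight rules above can be sketched as a small recursive-descent recognizer. This is an illustrative reading, not the authors' parser: it assumes the terminals are the single characters `c`, `{`, `}`, `[`, `]`, that whitespace is insignificant, and the function names are made up. Rules 3 and 4 become "one or more `c [X]` units" and rules 5 to 8 become "one or more items, each `c` or an `A`".

```python
def _parse_a(tokens, i):
    """A := { B }. Return the index just past A, or -1 on failure."""
    if i < len(tokens) and tokens[i] == "{":
        j = _parse_b(tokens, i + 1)
        if j != -1 and j < len(tokens) and tokens[j] == "}":
            return j + 1
    return -1


def _parse_b(tokens, i):
    """B := c [X] B | c [X], i.e. one or more 'c [X]' units."""
    units = 0
    while i + 1 < len(tokens) and tokens[i] == "c" and tokens[i + 1] == "[":
        j = _parse_x(tokens, i + 2)
        if j == -1 or j >= len(tokens) or tokens[j] != "]":
            return -1
        i = j + 1
        units += 1
    return i if units else -1


def _parse_x(tokens, i):
    """X := c X | A X | A | c, i.e. one or more items, each 'c' or an A."""
    items = 0
    while i < len(tokens):
        if tokens[i] == "c":
            i += 1
        elif tokens[i] == "{":
            i = _parse_a(tokens, i)
            if i == -1:
                return -1
        else:
            break
        items += 1
    return i if items else -1


def accepts(sentence):
    """Accept/Reject: True if the sentence is derivable from S := A."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    if any(tok not in "c{}[]" for tok in tokens):
        return False
    return _parse_a(tokens, 0) == len(tokens)


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))
print(accepts("{c c}"))
```

Because every alternative starts with a distinct terminal, one token of lookahead is enough and no backtracking is needed.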
22
Table Grammars
• Cannot check if table is consistent.
• Need further geometric alignment and lexical checks.
23
Outline
A B C D
Tilings (rectangular tessellations) X-Y trees (1984)
Tables
Wang Categories (1996)
Grammars
24
Logical Structure of Tables
• How to interpret a table?
  – Describe relationship between header cells and content cells [Wang, U. Waterloo, 1996].
• Wang notation
  – Elegant description.
  – Dimensionality: Number of category trees.
  – Cartesian product maps categories to data.
25
Layout independent Wang Notation
Tables with different layouts but the same information have the same Wang Notation
26
Wang Category Trees for either table
• characteristic
  – gonsity
  – hepth
• fleck
  – burlam
  – falder
  – multon
• Any data cell can be designated by a path through each category tree.
• Leaves correspond to row or column headings.
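A small sketch of the "path through each category tree" idea. The nested-dictionary encoding and the function name `leaf_paths` are assumptions for illustration; the labels are the placeholder words from this slide, with "fleck" read as the root of the second tree.

```python
from math import prod


def leaf_paths(tree, prefix=()):
    """List every root-to-leaf path of a category tree given as nested dicts."""
    paths = []
    for label, subtree in tree.items():
        path = prefix + (label,)
        if subtree:
            paths.extend(leaf_paths(subtree, path))
        else:
            paths.append(path)  # a leaf: corresponds to a row or column heading
    return paths


# Assumed encoding of the two category trees on this slide.
trees = [
    {"characteristic": {"gonsity": {}, "hepth": {}}},
    {"fleck": {"burlam": {}, "falder": {}, "multon": {}}},
]

paths = [leaf_paths(tree) for tree in trees]
# One path per tree designates one data cell.
one_cell = (paths[0][1], paths[1][2])
# Number of designatable data cells: product of the leaf counts.
cell_count = prod(len(p) for p in paths)
print(one_cell, cell_count)
```

With two leaves in the first tree and three in the second, six data cells can be addressed, whatever the physical layout of the table.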
27
“Real” Table Understanding
• Analyzing logical structure not sufficient.
• Need additional information from title, footnotes, captions, etc.
• Semantic analysis of the labels also important – need external knowledge.
28
Does Wang Notation always exist?
• Not always!
• Inconsistent tables do not have Wang Notation.
• Others can be edited using virtual headers.
29
XY tree to Wang Notation Algorithm
• Input – XY trees.
• Output – XML version of Wang Notation.
• Checks for table consistency.
30
Algorithm
• Locate principal regions - stub, headers and content cells.
• Extract Wang categories.
• Compute Cartesian product of category paths.
• Match each key to the content of a delta cell.
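The first two steps can be sketched on a plain grid, under strong simplifying assumptions that the slides do not state: the table is already a rectangular list of rows, the number of header rows and stub columns is known, and spanning header cells have been unmerged by repeating their label. The real algorithm works on XY trees; the function names, the toy grid, and the placeholder delta values `d11` ... `d23` are hypothetical.

```python
def split_regions(grid, header_rows, stub_cols):
    """Split a grid into stub, column header, row header and delta cells."""
    stub = [row[:stub_cols] for row in grid[:header_rows]]
    column_header = [row[stub_cols:] for row in grid[:header_rows]]
    row_header = [row[:stub_cols] for row in grid[header_rows:]]
    delta = [row[stub_cols:] for row in grid[header_rows:]]
    return stub, column_header, row_header, delta


def column_paths(column_header):
    """Read one category path per column, top to bottom."""
    return [tuple(col) for col in zip(*column_header)]


def row_paths(row_header):
    """Read one category path per row, left to right."""
    return [tuple(row) for row in row_header]


def is_consistent(rows, cols, delta):
    """Paths must be unique and the delta region must be rows x cols."""
    unique = len(set(rows)) == len(rows) and len(set(cols)) == len(cols)
    shape_ok = len(delta) == len(rows) and all(len(r) == len(cols) for r in delta)
    return unique and shape_ok


# Toy grid with placeholder delta cells.
grid = [
    ["",        "fleck",  "fleck",  "fleck"],
    ["",        "burlam", "falder", "multon"],
    ["gonsity", "d11",    "d12",    "d13"],
    ["hepth",   "d21",    "d22",    "d23"],
]
stub, col_hdr, row_hdr, delta = split_regions(grid, header_rows=2, stub_cols=1)
cols = column_paths(col_hdr)
rows = row_paths(row_hdr)
print(rows, cols, is_consistent(rows, cols, delta))
```

The uniqueness test is one plausible form of the consistency check: two identical header paths would make a key ambiguous, so no Wang notation could be assigned.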
31
Conclusions
• Admissible layouts identified for ease of processing.
• Algorithms developed for
  – extracting XY trees from tables.
  – extracting Wang notation from XY trees.
• Family of tables identified using a grammar.
32
Future work
• Augmentations - captions, aggregates, units, etc.
• Expand the grammar.
• Automate conversion of table to admissible formats.
(http://www40.statcan.ca/l01/cst01/agri111a-eng.htm)
33
THANK YOU
34
Goal: construction of a narrow-domain ontology from semi-structured web data
(“table understanding”)
• Currently multon is the best choice for rapitting velters. It is about 25% better than burlam or falder, which have the same girby (hepth/gonsity ratio).
• Check another table to see whether elmer is even better.
• NOT TODAY!
35
H-first tree can be transformed into V-first tree (and vice-versa)
[Figure: tree with root H, child V, etc.]
36
EX2XY: Algorithm
• Two workhorses:
  – Vertical_cut – returns leftmost sub-rectangle of a given rectangle.
  – Horizontal_cut – returns topmost sub-rectangle of a given rectangle.
37
EX2XY: Algorithm (contd.)
• Used in a pair of procedures P1 and P2.
• P1 cuts vertically and submits first sub-rectangle to P2 for horizontal cuts.
• P2 works symmetrically: it cuts horizontally and submits first sub-rectangle to P1 for vertical cuts.
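A hedged sketch of the two workhorses and the mutually recursive pair. It is not the EX2XY source: it assumes the table is a grid of cell ids in which a merged cell repeats its id over every position it covers, it returns cut positions rather than rectangles, and it hands every strip (not only the first) to the partner procedure in one pass. The `stuck` flag and the tuple encoding of the tree are likewise illustrative choices.

```python
def vertical_cut(grid, top, left, bottom, right):
    """Return the right edge (exclusive) of the leftmost sub-rectangle."""
    for c in range(left + 1, right):
        # A cut between columns c-1 and c is legal if no cell spans it.
        if all(grid[r][c - 1] != grid[r][c] for r in range(top, bottom)):
            return c
    return right


def horizontal_cut(grid, top, left, bottom, right):
    """Return the bottom edge (exclusive) of the topmost sub-rectangle."""
    for r in range(top + 1, bottom):
        if all(grid[r - 1][c] != grid[r][c] for c in range(left, right)):
            return r
    return bottom


def single_cell(grid, top, left, bottom, right):
    ids = {grid[r][c] for r in range(top, bottom) for c in range(left, right)}
    return len(ids) == 1


def p1(grid, top, left, bottom, right, stuck=False):
    """Cut vertically, then submit each strip to p2 for horizontal cuts."""
    if single_cell(grid, top, left, bottom, right):
        return grid[top][left]
    strips, start = [], left
    while start < right:
        end = vertical_cut(grid, top, start, bottom, right)
        strips.append((top, start, bottom, end))
        start = end
    if len(strips) == 1:
        if stuck:
            raise ValueError("no legal cut: not an XY tessellation")
        return p2(grid, top, left, bottom, right, stuck=True)
    return ("V", [p2(grid, *box) for box in strips])


def p2(grid, top, left, bottom, right, stuck=False):
    """Cut horizontally, then submit each strip to p1 for vertical cuts."""
    if single_cell(grid, top, left, bottom, right):
        return grid[top][left]
    strips, start = [], top
    while start < bottom:
        end = horizontal_cut(grid, start, left, bottom, right)
        strips.append((start, left, end, right))
        start = end
    if len(strips) == 1:
        if stuck:
            raise ValueError("no legal cut: not an XY tessellation")
        return p1(grid, top, left, bottom, right, stuck=True)
    return ("H", [p1(grid, *box) for box in strips])


# Hypothetical 2x3 table: "a" spans two rows, "b" spans two columns.
table = [["a", "b", "b"],
         ["a", "c", "d"]]
tree = p1(table, 0, 0, 2, 3)
print(tree)
```

The recursion fails with `ValueError` exactly when neither procedure can find a legal cut in a region that still holds more than one cell, which is the spiral case.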
38
Parenthesized notation
• P-notation has 1:1 correspondence with general trees.
• For the table above, the XY tree sentence is:
Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]}.
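The 1:1 correspondence can be illustrated by parsing the sentence into a nested tree and printing it back. This is only a sketch: the tuple encoding `(opening_bracket, children)` and the function names are assumptions, and no meaning is attached here to braces versus square brackets beyond keeping them distinct.

```python
CLOSERS = {"{": "}", "[": "]"}


def parse_p(sentence):
    """Parse P-notation into nested (bracket, children) tuples with 'c' leaves."""
    stack = [("", [])]  # artificial holder for the single top-level node
    for ch in sentence:
        if ch.isspace():
            continue
        if ch in CLOSERS:
            stack.append((ch, []))
        elif ch in CLOSERS.values():
            if len(stack) < 2 or CLOSERS[stack[-1][0]] != ch:
                raise ValueError("mismatched bracket")
            node = stack.pop()
            stack[-1][1].append(node)
        elif ch == "c":
            stack[-1][1].append("c")
        else:
            raise ValueError("unexpected symbol: " + ch)
    if len(stack) != 1 or len(stack[0][1]) != 1:
        raise ValueError("sentence must contain exactly one complete tree")
    return stack[0][1][0]


def to_p(tree):
    """Inverse of parse_p: write a tree back as a P-notation string."""
    if tree == "c":
        return "c"
    kind, children = tree
    return kind + " ".join(to_p(child) for child in children) + CLOSERS[kind]


sxy = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse_p(sxy)
print(to_p(tree) == sxy)
```

Round-tripping the sentence through `parse_p` and `to_p` returns the original string, which is what a one-to-one encoding requires.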
39
A table with six Wang dimensions
Table 5. Average temperatures, N and S hemisphere, degrees C

                                    LATITUDE    10º    10º    20º    20º
                                    WATER/LAND  water  land   water  land
HEMISPHERE  YEAR  SEASON  TIME
S           1900  summer  noon                  32     35     37     39
S           1900  summer  midnight              28     32     33     35
S           1900  winter  noon                  21     25     28     28
S           1900  winter  midnight              18     22     24     26
S           2000  summer  noon                  33     37     37     40
S           2000  summer  midnight              29     32     34     35
S           2000  winter  noon                  21     25     27     28
S           2000  winter  midnight              20     22     25     26
N           1900  summer  noon                  30     33     35     38
N           1900  summer  midnight              26     29     30     35
N           1900  winter  noon                  21     24     25     26
N           1900  winter  midnight              17     20     17     22
N           2000  summer  noon                  30     33     36     39
N           2000  summer  midnight              26     29     30     35
N           2000  winter  noon                  22     23     24     26
N           2000  winter  midnight              18     21     18     22
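For this table, the Cartesian product of the six categories can be matched to the delta cells as sketched below. The values are copied from the table; the dictionary layout, the category name `surface` for WATER/LAND, and the reliance on row-major header order are assumptions of the sketch rather than part of XY2WANG.

```python
from itertools import product

# Four row categories and two column categories, in header order.
row_cats = {
    "hemisphere": ["S", "N"],
    "year": [1900, 2000],
    "season": ["summer", "winter"],
    "time": ["noon", "midnight"],
}
col_cats = {"latitude": [10, 20], "surface": ["water", "land"]}

# Delta cells, row by row, as they appear in the table.
delta = [
    [32, 35, 37, 39], [28, 32, 33, 35], [21, 25, 28, 28], [18, 22, 24, 26],
    [33, 37, 37, 40], [29, 32, 34, 35], [21, 25, 27, 28], [20, 22, 25, 26],
    [30, 33, 35, 38], [26, 29, 30, 35], [21, 24, 25, 26], [17, 20, 17, 22],
    [30, 33, 36, 39], [26, 29, 30, 35], [22, 23, 24, 26], [18, 21, 18, 22],
]

# product() varies the last category fastest, matching the table's row order.
row_keys = list(product(*row_cats.values()))
col_keys = list(product(*col_cats.values()))

# Each six-part key designates exactly one delta cell.
wang = {
    r + c: delta[i][j]
    for i, r in enumerate(row_keys)
    for j, c in enumerate(col_keys)
}

print(len(wang))
print(wang[("S", 2000, "winter", "noon", 20, "water")])
```

Six two-valued categories give 2^6 = 64 keys, which equals the 16 x 4 delta cells, so every cell receives exactly one key.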
40
XY2WANG: Other features
• Handles more complex scenarios:
  – Higher dimensionality.
  – Deeper nesting of headers.
  – Repetitive headers.
41
(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
42
43
Raghav’s Experiment
44
45
46
• Average total time to process a table - 231 seconds.
• Average table size - 587 cells before preprocessing.
• Average preprocessing time - 104 seconds.
• 3-category tables took approximately 27 seconds more than 2-category tables.
47
• Tables with aggregates and footnotes take more time to process.
• Strong correlation between processing time and table size.
• For future: automatically segmenting augmentations, categories and delta cells using visual cues.