1 From Tessellations to Table Interpretation R. C. Jandhyala 1 , M. Krishnamoorthy 1 , G. Nagy 1 , R. Padmanabhan 1 , S. Seth 2 , W. Silversmith 1 1 DocLab, Rensselaer Polytechnic Institute 2 Computer Science and Engineering, University of Nebraska-Lincoln (Supported by NSF Grants # 044114854 and 0414644, and Rensselaer Center for Open Source Software)


From Tessellations to Table Interpretation

R. C. Jandhyala¹, M. Krishnamoorthy¹, G. Nagy¹, R. Padmanabhan¹, S. Seth², W. Silversmith¹

¹ DocLab, Rensselaer Polytechnic Institute
² Computer Science and Engineering, University of Nebraska-Lincoln

(Supported by NSF Grants # 044114854 and 0414644, and Rensselaer Center for Open Source Software)


Goal: Construction of a narrow-domain ontology from semi-structured web data

(“table understanding”)


Outline

• Tilings (rectangular tessellations)
• X-Y trees (1984)
• Tables
• Wang Categories (1996)
• Grammars


Web tables

• Cannot precisely define human-understandable tables.
• Convert to smaller set of admissible tables.
• Why? Algorithmic ease.


Admissible Tables

• Have stub, headings and data cells.


Factor out layout-equivalent tables


Rectangular Tessellations

• Partition of an isothetic rectangle into rectangles.
• Uniquely defined by junction points (location and type).
• Number of tessellations increases rapidly with table size.


XY Tessellations

• Special case of rectangular tessellations.
• Successive horizontal and vertical cuts.
• Easily represented by trees.
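One way to picture the tree representation is the minimal sketch below. It is an illustration only, not the authors' code: the class name `XYNode` and the label convention ("V" for children placed side by side, "H" for children stacked top to bottom) are assumptions made here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class XYNode:
    """One node of an X-Y tree (hypothetical representation).

    kind is "V" (children side by side, separated by vertical cuts),
    "H" (children stacked, separated by horizontal cuts),
    or "leaf" (a rectangle that is not cut any further).
    """
    kind: str
    children: List["XYNode"] = field(default_factory=list)


def leaf() -> XYNode:
    return XYNode("leaf")


def count_leaves(node: XYNode) -> int:
    """Number of rectangles in the tiling described by the tree."""
    if node.kind == "leaf":
        return 1
    return sum(count_leaves(child) for child in node.children)


def depth(node: XYNode) -> int:
    """Number of nested cut levels below this node."""
    if node.kind == "leaf":
        return 0
    return 1 + max(depth(child) for child in node.children)


# A left column beside a right part; the right part is split into a top band
# and a bottom band; the bottom band is split into two cells.
tiling = XYNode("V", [leaf(), XYNode("H", [leaf(), XYNode("V", [leaf(), leaf()])])])
print(count_leaves(tiling), depth(tiling))
```

Each internal node records only the direction of the cut, so the same tree describes every tiling that differs merely in where the cuts fall.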


A tiling and its X-Y Tree (aka slicing structure, puzzle tree, tree map)

[Figure: a tiling and its X-Y tree, with nodes labeled V, H, V.]


Non-slicing structures – No XY tree

In fact, X-Y tilings are an infinitesimal fraction of all tilings. This helps, because tables never contain this “spiral” structure.
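Whether a given tiling has an X-Y tree can be tested by searching for edge-to-edge cuts recursively. The sketch below is an illustration under stated assumptions, not the authors' method: rectangles are hypothetical `(x0, y0, x1, y1)` tuples, and the input is assumed to tile its bounding box exactly.

```python
def is_xy_tiling(rects):
    """True if the rectangles can be separated by recursive edge-to-edge cuts.

    rects: list of (x0, y0, x1, y1) tuples that exactly tile their bounding box.
    Any cut that no rectangle straddles may be taken first, so the first one
    found is used.
    """
    if len(rects) <= 1:
        return True
    for axis in (0, 1):  # 0: vertical cut at some x, 1: horizontal cut at some y
        lo, hi = axis, axis + 2
        start = min(r[lo] for r in rects)
        end = max(r[hi] for r in rects)
        for cut in sorted({r[hi] for r in rects}):
            if not (start < cut < end):
                continue  # the outer border is not a cut
            if any(r[lo] < cut < r[hi] for r in rects):
                continue  # some rectangle straddles this line
            first = [r for r in rects if r[hi] <= cut]
            second = [r for r in rects if r[lo] >= cut]
            return is_xy_tiling(first) and is_xy_tiling(second)
    return False  # no cut exists: a non-slicing structure


# A slicing tiling: left column, then a top band over two cells.
slicing = [(0, 0, 1, 2), (1, 0, 3, 1), (1, 1, 2, 2), (2, 1, 3, 2)]

# The "spiral": four strips wound around a central square.
spiral = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3), (1, 1, 2, 2)]

print(is_xy_tiling(slicing), is_xy_tiling(spiral))
```

In the spiral, every candidate cut line is blocked by one of the four strips, which is why no X-Y tree can be built for it.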


Fundamental Idea

Use XY trees to automate table processing and understanding.


Table to XY tree – EX2XY

• Applicable to any XY tessellation.
• Input – Excel table
  – Copy and paste or import.
  – Edit to make admissible.
• Output – XY tree
  – as XML for portability.
  – as parenthesized string for grammars.


Example

(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)


After import into Excel


After Editing


Output - XML

…<block id='1.1.2.1' range='17,2:30,2'>

<content> Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars) </content>

</block>…
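A block like the one above can be read back with Python's standard library. This is a sketch under assumptions: the slide does not define the attributes, so reading `id` as a dotted path in the XY tree and `range` as `row,column:row,column` is a guess made here for illustration.

```python
import xml.etree.ElementTree as ET

# The <block> element shown on the slide, without the surrounding ellipses.
snippet = """<block id='1.1.2.1' range='17,2:30,2'>
  <content> Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars) </content>
</block>"""


def parse_block(xml_text):
    """Turn one <block> element into a plain dictionary."""
    elem = ET.fromstring(xml_text)
    first, last = elem.get("range").split(":")
    return {
        # Assumption: the id is a dotted path of child indices in the XY tree.
        "path": [int(part) for part in elem.get("id").split(".")],
        # Assumption: each corner of the range is "row,column".
        "first": tuple(int(n) for n in first.split(",")),
        "last": tuple(int(n) for n in last.split(",")),
        "content": elem.findtext("content").strip(),
    }


block = parse_block(snippet)
print(block["path"], block["first"], block["last"])
```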


Table Grammars

• Can characterize entire families of tables.
• Developed grammar for one family.
• Input – Nested parenthesized notation.
• Output – Accept/Reject as example of family.


Grammar

• For parsing column headers

S := A (Rule 1)
A := {B} (Rule 2)
B := c [X] B | c [X] (Rules 3 and 4)
X := c X | A X | A | c (Rules 5, 6, 7 and 8)

• S is start symbol.
• A generates all admissible column headers.
• B generates category trees.
• c is a root category.
• X generates sub-categories.
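The eight rules can be checked with a small recursive-descent recognizer. The sketch below is one possible implementation, not the authors' parser; it reads Rules 3 and 4 as "one or more `c [X]` groups" and Rules 5 to 8 as "one or more items, each a `c` or an `A`", which generates the same language. The function name `accepts` is chosen here.

```python
def accepts(sentence):
    """Return True if the sentence is generated from S by Rules 1 to 8."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {pos}")
        pos += 1

    def parse_a():  # A := {B}
        expect("{")
        parse_b()
        expect("}")

    def parse_b():  # B := c [X] B | c [X]
        while True:
            expect("c")
            expect("[")
            parse_x()
            expect("]")
            if peek() != "c":
                break

    def parse_x():  # X := c X | A X | A | c
        if peek() not in ("c", "{"):
            raise SyntaxError(f"expected 'c' or '{{' at token {pos}")
        while peek() in ("c", "{"):
            if peek() == "c":
                expect("c")
            else:
                parse_a()

    try:
        parse_a()  # S := A
        return pos == len(tokens)
    except SyntaxError:
        return False


print(accepts("{c [c c]}"))  # a root category with two sub-categories
print(accepts("{c c}"))      # sub-categories missing their brackets
```

One token of lookahead is enough here: inside X a `c` or `{` always continues the list, and inside B a `c` always starts another category tree.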


Table Grammars

• Cannot check if table is consistent.

• Need further geometric alignment and lexical checks.


Logical Structure of Tables

• How to interpret a table?
  – Describe relationship between header cells and content cells [Wang, U. Waterloo, 1996].
• Wang notation
  – Elegant description.
  – Dimensionality: number of category trees.
  – Cartesian product maps categories to data.


Layout independent Wang Notation

Different layout and same information means same Wang Notation


Wang Category Trees for either table

• characteristic
  – gonsity
  – hepth
• fleck
  – burlam
  – falder
  – multon

• Any data cell can be designated by a path through each category tree.
• Leaves correspond to row or column headings.
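The paths through a category tree can be enumerated as follows. This is a sketch under assumptions: the nested-dictionary encoding is a choice made here, and the tree shapes (`characteristic` over `gonsity` and `hepth`, `fleck` over `burlam`, `falder` and `multon`) are a reading of the slide rather than something it states outright.

```python
def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of a category tree given as nested dicts."""
    for label, subtree in tree.items():
        path = prefix + (label,)
        if subtree:
            yield from leaf_paths(subtree, path)
        else:
            yield path  # a leaf: corresponds to a row or column heading


# Hypothetical encoding of the two category trees on the slide.
categories = [
    {"characteristic": {"gonsity": {}, "hepth": {}}},
    {"fleck": {"burlam": {}, "falder": {}, "multon": {}}},
]

paths = [list(leaf_paths(tree)) for tree in categories]
for tree_paths in paths:
    print(tree_paths)
```

With two leaves in the first tree and three in the second, picking one path from each tree designates one of six data cells.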


“Real” Table Understanding

• Analyzing logical structure not sufficient.
• Need additional information from title, footnotes, captions, etc.
• Semantic analysis of the labels also important – need external knowledge.


Does Wang Notation always exist?

• Not always!
• Inconsistent tables do not have Wang Notation.
• Others can be edited using virtual headers.


XY tree to Wang Notation Algorithm

• Input – XY trees.
• Output – XML version of Wang Notation.
• Checks for table consistency.


Algorithm

• Locate principal regions - stub, headers and content cells.
• Extract Wang categories.
• Compute Cartesian product of category paths.
• Match each key to the content of a delta cell.
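The last two steps can be sketched for a two-dimensional table as below. This is an illustration, not the XY2WANG code: it assumes the category paths have already been extracted, that the delta cells are given row by row in the same order as the row and column paths, and the function name `key_to_delta` is hypothetical. The four values are a small slice of the temperature table in the backup slides.

```python
from itertools import product

# Leaf paths of two category trees, already extracted (assumed input).
row_paths = [("TIME", "noon"), ("TIME", "midnight")]
col_paths = [("WATER/LAND", "water"), ("WATER/LAND", "land")]

# Delta (data) cells, row by row.
delta = [
    [32, 35],
    [28, 32],
]


def key_to_delta(row_paths, col_paths, delta):
    """Map each key (one path per category tree) to the content of a delta cell."""
    if len(delta) != len(row_paths) or any(len(row) != len(col_paths) for row in delta):
        raise ValueError("inconsistent table: delta region does not match the headers")
    keys = product(enumerate(row_paths), enumerate(col_paths))
    return {(rp, cp): delta[i][j] for (i, rp), (j, cp) in keys}


table = key_to_delta(row_paths, col_paths, delta)
print(table[(("TIME", "midnight"), ("WATER/LAND", "land"))])
```

The size check is a crude stand-in for the consistency test: a table whose data region does not match the Cartesian product of its category paths is rejected.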


Conclusions

• Admissible layouts identified for ease of processing.
• Algorithms developed for
  – extracting XY trees from tables.
  – extracting Wang notation from XY trees.
• Family of tables identified using a grammar.


Future work

• Augmentations - captions, aggregates, units, etc.
• Expand the grammar.
• Automate conversion of table to admissible formats.

(http://www40.statcan.ca/l01/cst01/agri111a-eng.htm)


THANK YOU


Goal: construction of a narrow-domain ontology from semi-structured web data

(“table understanding”)

• Currently multon is the best choice for rapitting velters. It is about 25% better than burlam or falder, which have the same girby (hepth/gonsity ratio).

• Check another table to see whether elmer is even better.

• NOT TODAY!


H-first tree can be transformed into V-first tree (and vice-versa)

[Figure: tree diagrams with nodes labeled H and V.]


EX2XY: Algorithm

• Two workhorses:
  – Vertical_cut – returns leftmost sub-rectangle of a given rectangle.
  – Horizontal_cut – returns topmost sub-rectangle of a given rectangle.


EX2XY: Algorithm (contd.)

• The two cut routines are used in a pair of procedures, P1 and P2.
• P1 cuts vertically and submits first sub-rectangle to P2 for horizontal cuts.
• Similarly, P2 cuts horizontally and submits first sub-rectangle to P1 for vertical cuts.
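The alternation between the two cut routines can be sketched as follows. This is a reconstruction under assumptions, not the EX2XY source: the table is a hypothetical grid in which a merged cell repeats the same identifier, rectangles are half-open `(r0, c0, r1, c1)` tuples, and P1 and P2 are folded into one function `split` with a `vertical` flag.

```python
def ex2xy(grid):
    """Build an XY tree from a grid of cell ids (a merged cell repeats its id)."""

    def is_cell(r0, c0, r1, c1):
        return len({grid[i][j] for i in range(r0, r1) for j in range(c0, c1)}) == 1

    def vertical_cut(r0, c0, r1, c1):
        # Leftmost sub-rectangle that a full-height cut separates.
        for j in range(c0 + 1, c1):
            if all(grid[i][j - 1] != grid[i][j] for i in range(r0, r1)):
                return (r0, c0, r1, j)
        return (r0, c0, r1, c1)  # no cut possible

    def horizontal_cut(r0, c0, r1, c1):
        # Topmost sub-rectangle that a full-width cut separates.
        for i in range(r0 + 1, r1):
            if all(grid[i - 1][j] != grid[i][j] for j in range(c0, c1)):
                return (r0, c0, i, c1)
        return (r0, c0, r1, c1)  # no cut possible

    def split(rect, vertical, stuck=False):
        if is_cell(*rect):
            return "c"
        r0, c0, r1, c1 = rect
        pieces = []
        while (c0 < c1) if vertical else (r0 < r1):
            cut = vertical_cut if vertical else horizontal_cut
            piece = cut(r0, c0, r1, c1)
            pieces.append(piece)
            if vertical:
                c0 = piece[3]
            else:
                r0 = piece[2]
        if len(pieces) == 1:  # this direction found no cut
            if stuck:
                raise ValueError("not an XY tessellation")
            return split(rect, not vertical, stuck=True)
        # Hand each piece to the other procedure, as P1 and P2 do.
        return ("V" if vertical else "H", [split(p, not vertical) for p in pieces])

    return split((0, 0, len(grid), len(grid[0])), vertical=True)


grid = [
    ["a", "b", "b"],
    ["a", "c", "d"],
]
tree = ex2xy(grid)
print(tree)
```

For this grid the result has a V root, an H node below it, and a second V node below that, with four leaf cells in all.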


Parenthesized notation

• P-notation has 1:1 correspondence with general trees.

• For above table, the XY tree sentence is:

Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]}.
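The correspondence with trees can be illustrated by parsing a sentence into nested nodes and writing it back out. The sketch below is an illustration only: the tuple encoding `(opening bracket, children)` is a choice made here, and no claim is made about which bracket type stands for which cut direction.

```python
OPEN = {"{": "}", "[": "]"}


def parse(sentence):
    """Parse a P-notation sentence into "c" leaves and (bracket, children) nodes."""
    tokens = [ch for ch in sentence if not ch.isspace()]

    def node(i):
        tok = tokens[i]
        if tok == "c":
            return "c", i + 1
        close = OPEN[tok]  # KeyError on an unexpected symbol
        children, i = [], i + 1
        while tokens[i] != close:
            child, i = node(i)
            children.append(child)
        return (tok, children), i + 1

    tree, end = node(0)
    if end != len(tokens):
        raise ValueError("trailing symbols after the tree")
    return tree


def to_string(tree):
    """Write a tree back out in P-notation."""
    if tree == "c":
        return "c"
    opener, children = tree
    return opener + " ".join(to_string(child) for child in children) + OPEN[opener]


sxy = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse(sxy)
print(to_string(tree) == sxy)
```

Parsing and re-serializing returns the original sentence, which is the round trip that a 1:1 correspondence implies.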


A table with six Wang dimensions

Table 5. Average temperatures, N and S hemisphere, degrees C

                                   LATITUDE    10º    10º    20º    20º
                                   WATER/LAND  water  land   water  land
HEMISPHERE  YEAR  SEASON  TIME
S           1900  summer  noon                 32     35     37     39
S           1900  summer  midnight             28     32     33     35
S           1900  winter  noon                 21     25     28     28
S           1900  winter  midnight             18     22     24     26
S           2000  summer  noon                 33     37     37     40
S           2000  summer  midnight             29     32     34     35
S           2000  winter  noon                 21     25     27     28
S           2000  winter  midnight             20     22     25     26
N           1900  summer  noon                 30     33     35     38
N           1900  summer  midnight             26     29     30     35
N           1900  winter  noon                 21     24     25     26
N           1900  winter  midnight             17     20     17     22
N           2000  summer  noon                 30     33     36     39
N           2000  summer  midnight             26     29     30     35
N           2000  winter  noon                 22     23     24     26
N           2000  winter  midnight             18     21     18     22


XY2WANG: Other features

• Handles more complex scenarios:
  – Higher dimensionality.
  – Deeper nesting of headers.
  – Repetitive headers.


(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)


Raghav’s Experiment


• Average total time to process a table - 231 seconds.
• Average table size - 587 cells before preprocessing.
• Average preprocessing time - 104 seconds.
• 3 category tables took approximately 27 seconds more than 2 category tables.


• Tables with aggregates and footnotes - more time to process.

• Strong correlation between processing time and table size.

• For future: automatically segmenting augmentations, categories and delta cells using visual cues.