wikitables deri talk
Extending DBpedia (LOD) using WikiTables
Emir Muñoz
Unit for Reasoning and Querying
Linked Open Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
October 12, 2012 -- E. Muñoz
Linked Open Data
• DBpedia, an export of Wikipedia’s structured data
DBpedia provides an RDF version of all of Wikipedia's structured data (infoboxes)
But not yet of the ordinary Wikipedia tables (wikitables)
Tables as a source of LOD
http://en.wikipedia.org/wiki/Dublin
Caption as another row
Column header represents types of information
The values represent instances of those types
http://en.wikipedia.org/wiki/Galway
Infoboxes (attr-value)
Tables are inherently concise as well as information rich
Reasoning over Wikipedia Tables
http://en.wikipedia.org/wiki/Dublin
Recovering Table Semantics …
Dublin is twinned with the following places:
Reasoning over Wikipedia Tables
dbpedia.org/resource/San_Jose,_California
dbpedia.org/resource/Liverpool
dbpedia.org/resource/Matsue,_Shimane
dbpedia.org/resource/Barcelona
dbpedia.org/resource/Beijing
dbpedia.org/resource/United_States
dbpedia.org/resource/United_Kingdom
dbpedia.org/resource/Japan
dbpedia.org/resource/Spain
dbpedia.org/resource/People's_Republic_of_China
dbpedia.org/property/city dbpedia.org/property/nation dbpedia.org/property/since
http://en.wikipedia.org/wiki/Dublin
Entity annotation for cells, mappings to DBpedia resources
(xsd:integer)
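A minimal sketch of the cell-annotation step, assuming each cell carries a link to an English Wikipedia article. It relies on the DBpedia convention that resource URIs reuse the Wikipedia page title. The helper name `wikipedia_url_to_dbpedia` is hypothetical, and redirects and disambiguation pages are not handled here.

```python
from urllib.parse import urlparse, unquote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wikipedia_url_to_dbpedia(url):
    """Map an English Wikipedia article URL to a DBpedia resource URI.

    Hypothetical helper: returns None when the URL is not an article link.
    """
    parsed = urlparse(url)
    prefix = "/wiki/"
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith(prefix):
        return None
    # Decode percent-escapes (e.g. %27 -> ') and use underscores for spaces.
    title = unquote(parsed.path[len(prefix):]).replace(" ", "_")
    return DBPEDIA_RESOURCE + title if title else None


print(wikipedia_url_to_dbpedia("http://en.wikipedia.org/wiki/Liverpool"))
```

Literal cells such as the "since" year would instead be typed (xsd:integer on the slide) rather than mapped to a resource.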
dbpedia.org/ontology/country dbpedia.org/property/subdivisionName
is dbpedia.org/ontology/country of
http://en.wikipedia.org/wiki/Dublin
Extracting relations
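Row-level extraction can be sketched as follows: for every ordered pair of annotated cells in a row, look up the predicates that connect them. Checking both orders covers the inverse case ("is ... of"). The in-memory list `KB` is a toy stand-in for DBpedia built from the triples on the slide; a real run would presumably query DBpedia instead, and the function names are hypothetical.

```python
from itertools import permutations


def relations_in_row(row, triples):
    """Return every known (s, p, o) whose subject and object are cells of one row."""
    index = {}
    for s, p, o in triples:
        index.setdefault((s, o), []).append(p)
    found = []
    # Ordered pairs, so a relation is found whichever cell is the subject.
    for s, o in permutations(row, 2):
        for p in index.get((s, o), []):
            found.append((s, p, o))
    return found


def to_ntriple(s, p, o):
    """Serialize one triple of URIs as an N-Triples line."""
    return "<%s> <%s> <%s> ." % (s, p, o)


DBR = "http://dbpedia.org/resource/"
KB = [  # toy stand-in for DBpedia (assumption)
    (DBR + "Liverpool", "http://dbpedia.org/ontology/country", DBR + "United_Kingdom"),
    (DBR + "Liverpool", "http://dbpedia.org/property/subdivisionName", DBR + "United_Kingdom"),
]
row = [DBR + "Liverpool", DBR + "United_Kingdom"]
lines = [to_ntriple(*t) for t in relations_in_row(row, KB)]
print("\n".join(lines))
```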
Reasoning over Wikipedia Tables
• <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_States> .
• <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> .
• <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_Kingdom> .
• <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_Kingdom> .
• <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Japan> .
• <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Japan> .
• <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Spain> .
• <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Spain> .
• <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/People's_Republic_of_China> .
• <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/People's_Republic_of_China> .
Reasoning over Wikipedia Tables
• Let’s analyze these cases …
• Liverpool
• Matsue
• Beijing
Not that simple…
• Web tables usually don’t have explicit semantics by themselves.
• Main issues:
– Complex tables with spans
– Captions inside the table as another row
– Not well-formed tables (i.e., not a matrix)
– We need filters (e.g., min 2 columns, 2 rows)
• We extract relations at row level, and also between the page's main entity and the resources in the table
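The filter mentioned in the list above can be sketched like this. It assumes a table has already been parsed into a list of rows of cell strings; whether the header counts toward the two-row minimum is not stated on the slide, so here every row counts. The function name `passes_filter` is hypothetical.

```python
def passes_filter(table, min_rows=2, min_cols=2):
    """Keep only tables that are big enough and form a proper matrix.

    `table` is a list of rows, each row a list of cell strings (assumed shape).
    """
    if len(table) < min_rows:
        return False
    width = len(table[0])
    if width < min_cols:
        return False
    # Well-formed: every row has the same number of cells.
    return all(len(row) == width for row in table)


print(passes_filter([["City", "Nation"], ["Liverpool", "United Kingdom"]]))
```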
Parsing: Extracting Tables
http://en.wikipedia.org/wiki/People%27s_Republic_of_China
Caption as another row
Table split
Rowspans with pictures
First step: parsing Wiki format
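One way to cope with rowspans and colspans is to expand them into a plain grid before any further processing. The sketch below assumes cells have already been parsed into `(text, rowspan, colspan)` tuples and repeats the spanned text in every covered position; both the input shape and the repeat-the-value policy are assumptions, not something the slides specify.

```python
def expand_spans(rows):
    """Expand spanned cells into a rectangular grid (list of lists).

    rows: list of rows; each cell is a (text, rowspan, colspan) tuple.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a rowspan from a row above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1 if grid else 0
    n_cols = max(c for _, c in grid) + 1 if grid else 0
    # Positions never covered by any cell are left as empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


print(expand_spans([[("A", 2, 1), ("B", 1, 1)], [("C", 1, 1)]]))
```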
Parsing: Extracting Tables
• Problems with parsing the cell’s content
http://en.wikipedia.org/wiki/Danny_Kaye
Parsing: Extracting Tables
Same-page link
Many different formats
Anchor text vs. content text
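The anchor-text versus link-target distinction can be illustrated on wiki markup, where a link is written `[[target]]` or `[[target|anchor text]]` and a target starting with `#` points to the same page. This is a rough regex sketch for simple links only; it ignores templates, nested markup, and external links, and the function name is hypothetical.

```python
import re

# [[target]] or [[target|anchor text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def parse_wikilinks(cell):
    """Extract link target, anchor text, and a same-page flag from a cell."""
    links = []
    for m in WIKILINK.finditer(cell):
        target = m.group(1).strip()
        # Without an explicit anchor, the target doubles as the visible text.
        anchor = m.group(2).strip() if m.group(2) is not None else target
        links.append({
            "target": target,
            "anchor": anchor,
            "same_page": target.startswith("#"),
        })
    return links


print(parse_wikilinks("[[Liverpool]], [[United Kingdom|UK]]"))
```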
http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s
Extracting Relations
A table containing tables
http://en.wikipedia.org/wiki/AFC_Ajax
Extracting Relations
• Also relations between the main entity and the entities in the table
dbpedia.org/resource/AFC_Ajax
14 dbpedia.org/ontology/team
14 dbpedia.org/property/clubs
11 dbpedia.org/property/currentclub
3 dbpedia.org/property/youthclubs
His DBpedia page makes no mention of AFC Ajax
http://en.wikipedia.org/wiki/AFC_Ajax
16 players
dbpedia.org/resource/Christian_Eriksen
Disambiguation page dbpedia.org/resource/Ajax
http://en.wikipedia.org/wiki/AFC_Ajax
Our Dataset
• enwiki dump from 2012-09-03 02:17:37
• 8.6 GB of Wikipedia pages that comprise
– 10,531,986 documents (HTML pages)
– Only 413,256 HTML pages contain tables
– 2,989,098 tables
– 905,929 tables after the filter
• 27.7% of all tables
– 0.46 tables per page (or 2.15 discarding pages without tables)
Methodology
Ranking of Relationships
• The current ranking function is naïve
http://en.wikipedia.org/wiki/AFC_Ajax
16 players
freq  relationship                       score
14    dbpedia.org/ontology/team          0.875
14    dbpedia.org/property/clubs         0.875
11    dbpedia.org/property/currentclub   0.6875
3     dbpedia.org/property/youthclubs    0.1875
$\mathit{score} = \dfrac{f_{rel}}{n_{rows}}$
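The naive ranking can be reproduced in a few lines, reading the score as the frequency of a relation divided by the number of rows (16 players for the AFC Ajax table). The frequencies are the ones shown on the slide; the function and variable names are illustrative.

```python
def score(freq, n_rows):
    """Naive ranking score: relation frequency over the number of table rows."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    # Nothing here caps the result at 1 if freq happens to exceed n_rows.
    return freq / n_rows


N_ROWS = 16  # players in the AFC Ajax table
counts = {
    "dbpedia.org/ontology/team": 14,
    "dbpedia.org/property/clubs": 14,
    "dbpedia.org/property/currentclub": 11,
    "dbpedia.org/property/youthclubs": 3,
}
# Highest score first; ties keep their original order.
ranked = sorted(
    ((rel, score(freq, N_ROWS)) for rel, freq in counts.items()),
    key=lambda pair: -pair[1],
)
for rel, s in ranked:
    print(rel, s)
```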
Ranking of Relationships
• In cases like this one it does not work well, and 𝑠𝑐𝑜𝑟𝑒 ∉ [0,1]
http://en.wikipedia.org/wiki/Danny_Kaye
Ongoing Work and Challenges
• Improve the ranking function for relations.
• Store the 5.5M DBpedia (transitive) redirects locally (optimizing time).
• Statistical analysis of Wikipedia tables
– Number of columns, rows
– Headers, Captions
– External and internal links
• The next big challenge is evaluation.
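For the redirects item in the list above, a local store can be as simple as a dictionary from a redirecting URI to its target, followed transitively. The sketch below uses placeholder keys rather than real DBpedia redirects, and guards against cycles; how the redirects are actually stored is not described in the slides.

```python
def resolve_redirect(uri, redirects):
    """Follow redirects transitively until a non-redirecting URI is reached."""
    seen = set()
    # The `seen` set stops the walk if the redirect chain loops back on itself.
    while uri in redirects and uri not in seen:
        seen.add(uri)
        uri = redirects[uri]
    return uri


# Placeholder data (assumption), not actual DBpedia redirects.
redirects = {"A": "B", "B": "C"}
print(resolve_redirect("A", redirects))
```

Resolving every key once and saving the flattened mapping would turn each later lookup into a single dictionary access.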
What’s next?
• Some ideas in mind:
– Use the extracted relations to classify WikiTables
– Define a similarity function for WikiTables
English Italian
What’s next?
http://en.wikipedia.org/wiki/Electronegativity
What does this number mean?
There is no reference for those numbers here!
What’s next?
http://en.wikipedia.org/wiki/Electronegativity
http://en.wikipedia.org/wiki/Chlorine
Chlorous acid is a chlorite
http://dbpedia.org/page/Chlorous_acid
Open problems
• Handle multiple entities in the same cell
• Improve the ranking function
• Handle redirects before querying DBpedia
• How to evaluate the outcome
Thanks! Q & A
Thanks! Emir Muñoz
Unit for Reasoning and Querying