Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014

Linked Data for Information Extraction Challenge 2014

Tasks and Results

Robert Meusel and Heiko Paulheim


Task

- The training dataset was created from HTML pages annotated with Microformats hCard.

- The data is a subset of the WebDataCommons Microformats Dataset.

- The original data is provided by the Common Crawl Foundation, the largest publicly available collection of web crawls.

Creation of an information extraction system that scrapes structured information from HTML web pages.


The Common Crawl Foundation (CC)

- Non-profit foundation dedicated to building and maintaining an open crawl of the Web

- 9 crawl corpora from 2008 to 2014 available so far

- Crawling strategies:

• Earlier crawls used BFS (with link discovery), seeded with a large list of ranked seeds (PageRank)

• Current crawls are gathered using a seed list of more than 6 billion URLs from the blekko search index

• As a result, all crawls represent the popular part of the Web

- Data availability:

• CC provides three different datasets for each crawl

• All data can be freely downloaded from AWS S3


The WebDataCommons Project

- Extracts information annotated with the markup languages Microformats, Microdata and RDFa

- So far, three datasets have been gathered, from the crawls of 2010, 2012, and 2013

Extraction of Structured Data from the Common Crawl Corpora

RDFa

Microdata

Microformats


Extracting the Data

- Webmasters mark up their information directly within the HTML page using one of the three markup languages

- Using Any23 (http://any23.apache.org/), this information is extracted as RDF triples

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .

_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de .

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .

_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .

_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .

…

Any23
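
To make the triple format above concrete, here is a minimal Python sketch that splits one such line into subject, predicate and object and decodes the \uXXXX escapes. It is only an illustration: the regular expression covers just the forms shown in the example (blank nodes, IRIs, literals with an optional language tag) and is not a full N-Triples parser, and the names `parse_triple` and `unescape` are hypothetical helpers, not part of Any23.

```python
import re

# Matches one simple N-Triples statement: subject, predicate, object, final dot.
# Assumption: only the forms shown in the slide example need to be handled.
TRIPLE = re.compile(
    r'^(_:\w+|<[^>]*>)\s+'                                   # subject: blank node or IRI
    r'<([^>]*)>\s+'                                          # predicate: IRI
    r'(<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+)?)\s*\.$'     # object: IRI or literal
)


def unescape(text):
    """Decode \\uXXXX escapes (for example \\u00DF) into characters."""
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)), text)


def parse_triple(line):
    """Return (subject, predicate, object) for one simple N-Triples line."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple N-Triples statement: {line!r}")
    subject, predicate, obj = match.groups()
    return subject, predicate, unescape(obj)


name_line = (r'_:node1 <http://schema.org/Product/name> '
             r'"Predator Instinct FG Fu\u00DFballschuh"@de .')
price_line = r'_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .'

print(parse_triple(name_line))
print(parse_triple(price_line))
```

For real data, an RDF library with a complete N-Triples or N-Quads parser would be the safer choice; the sketch only shows what the three positions of a statement contain.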


The Original Dataset of 2013

- Over 1.7 million domains using at least one markup language

- Over 17 billion quads with over 4 billion records (typed entities)

- hCard is the most dominant format among domains


Extraction of Challenge Dataset

- Selected a subset of over 10k web pages from the corpus, including over 450k extracted triples (annotated with Microformats hCard)

• Training: 9 877 web pages / 373 501 triples

• Test: 2 379 web pages / 85 248 triples


Creation of the Gold Standard

- Input: Annotated HTML Pages & Triples (extracted with Any23)

- After extraction of the triples, all hCard tags are replaced

• Replacement by randomly generated tags

• Stable per page, but different across pages

• Replacement of comments, as CMS systems like to add comments such as <!-- here is the name of the company -->

- Output

• Training: Annotated HTML Page, Cleaned HTML Page, Triples

• Testing: Cleaned HTML Page, Triples (not public)
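
The cleaning step described above can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the script used for the challenge: the list of hCard class names is a small hypothetical sample, the sketch assumes double-quoted `class` attributes, and it simply drops comments, whereas the slides only say that comments are replaced.

```python
import random
import re
import string

# Assumption: a handful of hCard class names for illustration only;
# the list actually used for the challenge is not given in the slides.
HCARD_CLASSES = ["vcard", "fn", "org", "adr", "tel", "email", "url"]


def random_token(rng, length=8):
    """Return a random lowercase token used as a replacement class name."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def clean_page(html, rng):
    """Replace hCard class names with per-page random tokens and drop comments."""
    # One mapping per call: stable within a page, different across pages.
    mapping = {name: random_token(rng) for name in HCARD_CLASSES}

    def rewrite_class_attr(match):
        names = match.group(2).split()
        replaced = " ".join(mapping.get(name, name) for name in names)
        return match.group(1) + replaced + match.group(3)

    # Assumption: class attributes are written with double quotes.
    html = re.sub(r'(class\s*=\s*")([^"]*)(")', rewrite_class_attr, html)
    # Comments may reveal the content type, so the sketch removes them.
    return re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)


rng = random.Random(0)
page = ('<div class="vcard"><!-- here is the name of the company -->'
        '<span class="fn org">ACME</span><span class="fn">Bob</span></div>')
print(clean_page(page, rng))
```

Calling `clean_page` once per page gives the behaviour from the slide: the same hCard class always maps to the same token inside one page, while a second page gets a fresh mapping.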


Overview: Dataset Creation and Evaluation Process


Evaluation

- Methodology: A triple extracted by the submission and a triple extracted by Any23 from the original (annotated) test HTML page are considered equal if they belong to the same page and have the same predicate and object.

- Baseline: Each page has at least one statement declaring that there is one VCard, so the baseline emits exactly this statement for every page

_:1 rdf:type hcard:Vcard .
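
The matching rule and the baseline above can be illustrated with a small Python sketch. Several points are assumptions rather than facts from the slides: scores are micro-averaged over all pages, triples are stored as sets of (predicate, object) pairs so duplicates within a page collapse, and the page identifiers and the `hcard:fn` predicate are made up for the example.

```python
def evaluate(gold, predicted):
    """Micro-averaged precision, recall and F1 over all pages.

    Both arguments map a page identifier to a set of (predicate, object)
    pairs. Subjects are ignored, following the matching rule that two
    triples are equal if page, predicate and object agree.
    """
    true_pos = sum(len(gold.get(page, set()) & pairs)
                   for page, pairs in predicted.items())
    n_pred = sum(len(pairs) for pairs in predicted.values())
    n_gold = sum(len(pairs) for pairs in gold.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold standard for two pages.
gold = {
    "page1": {("rdf:type", "hcard:Vcard"), ("hcard:fn", "ACME")},
    "page2": {("rdf:type", "hcard:Vcard")},
}

# Baseline: one VCard type statement for every page.
baseline = {page: {("rdf:type", "hcard:Vcard")} for page in gold}

precision, recall, f1 = evaluate(gold, baseline)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this toy data the baseline is always right about the type statement but misses every other triple, which is the pattern the challenge baseline is meant to expose: high precision, limited recall.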


Challenge Results

- We received one submission (which will be presented in a few minutes)

- The submission outperforms the baseline for Recall and F-Measure

- The Gold Standard is not perfect: the data also contains names and other attributes without a type (whenever webmasters did not model one). Even a perfect extraction system would therefore not reach a precision of 1.


Outlook: LD4IE Challenge 2015

- Include more classes (e.g. Microdata and/or RDFa)

- Add negative examples to generate a more realistic setting

• Today, systems can assume that there is something to extract in every test sample

• Challenge: making sure that the negative examples contain no unmarked data

- Improve the representativeness of the challenge dataset

• Widespread CMS systems automatically mark up articles, posts, etc.

• Eliminate such bias, if present, in future challenges

