Data on the (Semantic) Web

Agenda (75 min)

• Data on the Web
  – Extracting data
  – Publishing data
    • Linked Data
    • Metadata in HTML
    • SPARQL endpoints
• Crawling and extraction
• Indexing RDF data
  – Database-style indexing
  – IR-style indexing

IR view of the Web

• Web-accessible resources
  – Documents (typically HTML)
  – Multimedia
• Search engines index natural-language (NL) text
  – Most of the structure in HTML is discarded
  – Multimedia is indexed by surrounding text
• Additional information on web graph, usage
• See Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Data on the Web

• Most pages on the Web are generated from structured data
  – Data is stored in relational databases (typically)
  – Queried through web forms
  – Presented as tables or simply as unstructured text
• The structure and semantics (meaning) of the data are not directly accessible to search engines
• Two solutions
  – Extraction using Information Extraction (IE) techniques (implicit metadata)
  – Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)

Information Extraction methods

• Named Entity Recognition (NER) and disambiguation
  – OpenCalais, Zemanta
• Extraction of triples
  – TextRunner, NELL
  – Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007.
  – Wu and Weld. Autonomously Semantifying Wikipedia. CIKM 2007.
• Filling web forms automatically (form-filling)
  – Madhavan et al. Google's Deep-Web Crawl. VLDB 2008.
• Extraction from HTML tables
  – Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Wrapper induction
  – Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997.

http://www.opencalais.com/

http://www.zemanta.com/

http://www.cs.washington.edu/research/textrunner/indexTRTypes.html

http://www.nytimes.com/2010/10/05/science/05compute.html
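
As a rough illustration of extraction from HTML tables, the sketch below turns a table into one record per row using only Python's standard library. It assumes the first row is a header row and that tables are not nested; the sample table and the names `TableExtractor` and `table_to_records` are made up for this example and are not taken from WebTables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of every <tr> as one list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Assumption: the first row holds the column names."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample table.
SAMPLE = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Paris</td><td>France</td></tr>
  <tr><td>Lima</td><td>Peru</td></tr>
</table>
"""
records = table_to_records(SAMPLE)
```

Real web tables are far messier (layout tables, missing headers, merged cells), which is where the quality trade-offs of extraction start to show.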

Information Extraction

• A tale of many trade-offs
  – Less or no training data means lower quality
  – The more complex the model to learn, the more training data is needed
  – The deeper the analysis, the slower the processing
  – The more narrowly trained, the more likely to break
  – Populating a knowledge base is easier than ad-hoc extraction
• However, a complete and correct semantic representation of the content may not be needed for all tasks

Publishing data on the Web

• Pre-Semantic Web technologies have been inadequate
  – Existing formats are not appropriate for serendipitous reuse
    • HTML: structure is lost due to a mix of presentation and content
    • XML: captures structure, but not semantics
  – Lack of protocols to talk to databases over the Web
• Motivation has been lacking
  – Publishers are interested to the extent that they benefit from sharing data, e.g. because it drives traffic back to their site

What the Semantic Web provides

• Data format: RDF
  – Designed for object-relationship data
  – Identification of objects by URIs
  – Multiple serializations: RDF/XML, Turtle, N3, N-Triples, TriX etc.
• Schema language: OWL
  – Description Logic based
  – Extensible using rule languages such as RIF
• Query language and protocol: SPARQL
• The principles of Linked Data
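
To make the RDF data model concrete, here is a minimal sketch that represents statements as (subject, predicate, object) triples and writes them out in an N-Triples-like form. The `URI` and `Literal` classes, the example.org identifiers, and the simplified escaping are assumptions for illustration; a real application would normally use an RDF library instead.

```python
from typing import NamedTuple


class URI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str


def term_to_nt(term):
    """Serialize one term; only plain string literals are handled here."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    escaped = (
        term.value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """One 'subject predicate object .' line per triple."""
    return "\n".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} ." for s, p, o in triples
    )


FOAF = "http://xmlns.com/foaf/0.1/"
# Hypothetical object identifiers.
alice = URI("http://example.org/people/alice")
bob = URI("http://example.org/people/bob")

triples = [
    (alice, URI(FOAF + "name"), Literal("Alice")),  # object is a literal value
    (alice, URI(FOAF + "knows"), bob),              # object is another resource
]
ntriples = to_ntriples(triples)
```

Because both objects and relationships are named by URIs, triples from different sources can be merged simply by concatenating them.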

Methods for publishing RDF data

• Multiple ways of publishing RDF data
  – SPARQL endpoints
  – Linked Data
  – Metadata in HTML documents
  – Data feeds
  – GRDDL
  – Automated tools
• Each requires different treatment in crawling and extraction

SPARQL endpoints

• SPARQL is a standard query language and protocol for accessing RDF stores via HTTP
  – Also possible to expose a traditional RDBMS via a wrapper
• Advantages:
  – Most flexible and best performing access from a consumer perspective
• Disadvantages:
  – Higher maintenance
  – Discovery is problematic
• Tools:
  – Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.)
  – RDB-to-RDF mappers such as D2RQ and Triplify
  – SPARQL query builders
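
Since SPARQL is both a query language and an HTTP protocol, a query can be sent as an ordinary GET request with the query text in the `query` parameter. The sketch below only builds such a request and does not send it; the endpoint URL is a placeholder, not a real service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Encode the query as the 'query' URL parameter and ask for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
# urllib.request.urlopen(request) would send it; that step is left out here
# because the endpoint above is only a placeholder.
```

This also hints at the discovery problem: a crawler cannot find the data unless it already knows the endpoint address and which queries to ask.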

Linked Data

• A web of interlinked RDF documents
  – Each document describes the characteristics of a single object, and links to related objects
    • Most important: links to the same object in different data sets (sameAs)
• Guidelines for proper configuration of web servers to serve such documents
• Rapidly growing community
  – Focus on public datasets (government, scientific)
  – See linkeddata.org
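
One way a consumer might use sameAs links is to group all URIs that denote the same object before merging their descriptions. The sketch below does this with a small union-find structure; the links and the example.org / example.com URIs are invented for illustration, and treating sameAs as fully transitive is an assumption that noisy real-world data does not always justify.

```python
def same_as_clusters(links):
    """Group URIs connected by sameAs links into sorted clusters."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical sameAs statements collected from three data sets.
SAME_AS = [
    ("http://dbpedia.org/resource/Berlin", "http://example.org/geo/berlin"),
    ("http://example.org/geo/berlin", "http://example.com/places/42"),
    ("http://example.org/geo/paris", "http://dbpedia.org/resource/Paris"),
]
clusters = same_as_clusters(SAME_AS)
```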

The even larger picture: entire datasets connected

Linked Data

• Advantages:
  – No change to the publishing of the HTML documents
  – Data can be published by third party (e.g. DBpedia)
• Disadvantages:
  – Web servers need to be configured to properly handle URIs that identify concepts instead of documents
  – Search engines need to be extended to crawl linked data
  – Data is not always linked to documents
• Tools
  – Linked Data browsers (Tabulator, Marbles etc.)
  – RDB-to-RDF mappers (D2RQ, Triplify)

http://dig.csail.mit.edu/2005/ajar/ajaw/tab

http://www5.wiwiss.fu-berlin.de/marbles/

http://www4.wiwiss.fu-berlin.de/bizer/d2rq/

http://triplify.org/
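
The server configuration issue can be sketched as a small decision function: a URI that identifies a concept is not served directly but redirected (HTTP 303) to a document about it, chosen by the client's Accept header. The `/resource/`, `/data/` and `/page/` path layout below is an assumption for this example, and the Accept handling ignores q-values for brevity.

```python
# Media types that signal an RDF-aware client (a deliberately short list).
RDF_TYPES = ("application/rdf+xml", "text/turtle")


def negotiate(path, accept):
    """Return (status, location) for a request to a concept URI.

    Assumed layout: /resource/X names the concept, /data/X is the RDF
    document about it, /page/X is the HTML document about it.
    """
    prefix = "/resource/"
    if not path.startswith(prefix):
        return 404, None
    name = path[len(prefix):]
    wants_rdf = any(media_type in accept for media_type in RDF_TYPES)
    target = "/data/" if wants_rdf else "/page/"
    return 303, target + name


rdf_response = negotiate("/resource/Berlin", "text/turtle")
html_response = negotiate("/resource/Berlin", "text/html,application/xhtml+xml")
```

A crawler that wants the data therefore has to send an RDF media type in its Accept header and follow the redirect.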

Metadata in HTML

• Microformats, RDFa, Microdata
• Advantages:
  – Data and document are always in sync
  – Browser plug-in friendly
  – Search engine friendly
  – Copy-paste friendly
• Tools:
  – XML editors (e.g. Oxygen)
  – RDFa Distiller
  – RDFa bookmarklet, Ubiquity RDFa plugin
  – Optimus microformat parser
• Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…

http://www.w3.org/2007/08/pyRdfa/

http://www.w3.org/2006/07/SWD/RDFa/impl/js/

http://ubiquity-rdfa.googlecode.com/svn/trunk/install-checker.html

http://microformatique.com/optimus/

Microformats (μf)

• Agreements on the way to encode certain kinds of data in HTML
  – Reuse of semantic-bearing HTML elements
  – Based on existing standards
  – Minimality: designed to solve particular problems
• Microformats exist for a limited set of objects
  – hCard, hResume, hProduct, hRecipe
• Varying degrees of support and stability
  – hCard and rel-tag are widely supported
• Community centered around microformats.org
  – Specifications and discussions are hosted there

Example: the hCard microformat

<cite class="vcard">

<a class="fn url" rel="friend colleague met"

href="http://meyerweb.com/">Eric Meyer</a>

</cite> wrote a post (<cite>

<a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">

Tax Relief</a></cite>) about an unintentionally humorous letter

he received from the <span class="vcard">

<a class="fn org url" href="http://irs.gov/">

Internal Revenue Service</a> </span>.

<div class="vcard">

<a class="email fn" href="mailto:[email protected]">Joe Friday</a>

<div class="tel">+1-919-555-7878</div>

<div class="title">Area Administrator, Assistant</div>

</div>
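
A consumer could read hCards with a small class-attribute scanner along these lines. This is a simplified sketch, not a conforming microformats parser: it knows only a handful of properties, takes `url` and `email` from `href` when present, and the sample markup is adapted from the slide with a plain `fn` span standing in for the e-mail link.

```python
from html.parser import HTMLParser

HCARD_PROPS = {"fn", "org", "tel", "title", "email", "url"}
HREF_PROPS = {"url", "email"}  # value comes from href when available
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None  # the hCard being read, or None
        self._open = []    # stack of (tag, is_root, text_props, text_parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        is_root = self._card is None and "vcard" in classes
        if is_root:
            self._card = {}
        text_props = set()
        if self._card is not None:
            for prop in classes & HCARD_PROPS:
                if prop in HREF_PROPS and attrs.get("href"):
                    self._card[prop] = attrs["href"]
                else:
                    text_props.add(prop)
        self._open.append((tag, is_root, text_props, []))

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for _, _, text_props, parts in self._open:
            if text_props:
                parts.append(data)

    def handle_endtag(self, tag):
        if not self._open or self._open[-1][0] != tag:
            return  # ignore stray end tags in this sketch
        _, is_root, text_props, parts = self._open.pop()
        text = " ".join("".join(parts).split())
        for prop in text_props:
            self._card[prop] = text
        if is_root:
            self.cards.append(self._card)
            self._card = None


SAMPLE = """
<span class="vcard"><a class="fn org url" href="http://irs.gov/">
Internal Revenue Service</a></span>
<div class="vcard">
  <span class="fn">Joe Friday</span>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""
parser = HCardParser()
parser.feed(SAMPLE)
parser.close()
cards = parser.cards
```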

Microformats: limitations

• No shared syntax
  – Each microformat has a separate syntax tailored to the vocabulary
• No formal schemas
  – Limited reuse, extensibility of schemas
  – Unclear which combinations are allowed
• No datatypes
• No namespaces, unique identifiers (URIs)
  – No interlinking
  – Mapping between instances is required

RDFa

• W3C standard for embedding RDF data in HTML documents
  – A set of new HTML attributes
    • Despite the extension of HTML, RDFa does not require XHTML
  – A specification of how to extract the data from these attributes
• RDFa can be used to embed data in HTML headers or to annotate parts of the body of HTML documents
• RDFa is just a syntax; you have to choose a vocabulary separately

Differences in usage

• Microformats are the first choice for most publishers because they are simple
• If none fits your needs perfectly, you need RDFa
  – Microformats have a fixed schema: you cannot add your own attributes
• Example: a social networking site with user profiles
  – hCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections
  – You either live without this, or go with RDFa

Example: Facebook’s Open Graph Protocol

• Open Graph Protocol
  – RDF vocabulary to be used in conjunction with RDFa
• Simplifies the work of developers by restricting the freedom in RDFa
  – Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
  – Only HTML <head> accepted
  – http://opengraphprotocol.org/
• Facebook as consumer
  – Facebook indexes OGP data whenever someone ‘likes’ a page with OGP data
  – Social recommendation (‘like’ button) provides publishers with a way to promote their content on Facebook
    • Shows up in profiles and news feed; the user is subscribing to a channel of future feeds from the web page they liked
  – Facebook Graph API allows 3rd party developers to access the data
    • http://developers.facebook.com/docs/api

http://opengraphprotocol.org/

Example: Facebook’s Open Graph Protocol

<html xmlns:og="http://opengraphprotocol.org/schema/">

<head>

<title>The Rock (1996)</title>

<meta property="og:title" content="The Rock" />

<meta property="og:type" content="movie" />

<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />

<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …

</head> ...

</html>
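
Because OGP restricts itself to `<meta>` elements in the `<head>`, a consumer can read it with very little code. The sketch below collects `og:` properties using Python's standard HTML parser; it matches the `og:` prefix literally instead of resolving the declared namespace, which is a simplification, and it is not a description of how Facebook's own indexer works.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs found in <head>."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "meta" and self._in_head:
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.properties[prop[len("og:"):]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


# The markup from the slide, without the elided parts.
PAGE = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
</head>
</html>
"""
parser = OpenGraphParser()
parser.feed(PAGE)
parser.close()
og = parser.properties
```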

Microdata

• HTML5 is currently under standardization at the W3C
• Introduces Microdata
  – Similar to microformats
    • Some predefined vocabularies with central registration
  – Some of the flexibility of RDFa
  – Introduce new terms using reverse domain names or full URIs
• Semantic HTML elements such as <time>, <video>, <article>…

Microdata example

<div itemscope itemid="http://www.yahoo.com/resource/person">

<p>My name is <span itemprop="name">Neil</span>.</p>

<p>My band is called

<span itemprop="band">Four Parts Water</span>.

I was born on

<time itemprop="birthday" datetime="2009-05-10">

May 10th 2009

</time>.

<img itemprop="image" src="me.png" alt="me">

</p>

</div>
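
Extracting the properties of a Microdata item can be sketched as follows. The parser handles a single, flat item only (no nested `itemscope`), reads the value from `datetime`, `src`, `href` or `content` for the elements that carry it there, and falls back to text content otherwise; the `@id` key for the item identifier is a naming choice made for this example.

```python
from html.parser import HTMLParser

# Elements whose property value lives in an attribute rather than in the text.
ATTR_VALUE = {"img": "src", "a": "href", "time": "datetime", "meta": "content"}
VOID_TAGS = {"img", "meta", "br", "hr", "link", "input"}


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.item = {}
        self._open = []  # stack of (tag, itemprop or None, text_parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and "itemid" in attrs:
            self.item["@id"] = attrs["itemid"]
        prop = attrs.get("itemprop")
        value_attr = ATTR_VALUE.get(tag)
        if prop and value_attr and attrs.get(value_attr) is not None:
            self.item[prop] = attrs[value_attr]
            prop = None  # value already taken from the attribute
        if tag not in VOID_TAGS:
            self._open.append((tag, prop, []))

    def handle_data(self, data):
        for _, prop, parts in self._open:
            if prop:
                parts.append(data)

    def handle_endtag(self, tag):
        if not self._open or self._open[-1][0] != tag:
            return  # ignore stray end tags in this sketch
        _, prop, parts = self._open.pop()
        if prop:
            self.item[prop] = " ".join("".join(parts).split())


# Sample modeled on the slide's example.
SAMPLE = """
<div itemscope itemid="http://www.yahoo.com/resource/person">
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts Water</span>.
I was born on
<time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
<img itemprop="image" src="me.png" alt="me">
</p>
</div>
"""
parser = MicrodataParser()
parser.feed(SAMPLE)
parser.close()
item = parser.item
```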

The state of metadata in HTML

• 5-10% of web pages contain some explicit metadata
  – Depending on how you count…
• Too many competing approaches
  – Too many formats: microformats vs RDFa vs Microdata
  – Too many schemas: publishers may need to use multiple different vocabularies or microformats to satisfy everyone
