Evolving the Web into a Global Database - Advances and Applications

Upload: chris-bizer

Post on 23-Aug-2014

DESCRIPTION

Invited talk (keynote lecture at the award ceremony of the Carl Adam Petri Prize), KIT, Karlsruhe, January 2014.

TRANSCRIPT

Page 1: Evolving the Web into a Global Database - Advances and Applications

Bizer: Evolving the Web into a global Database – Advances and Applications, 30.1.2014 Slide 1

Prof. Dr. Christian Bizer

Evolving the Web into a global Database

- Advances and Applications -

Page 2: Evolving the Web into a Global Database - Advances and Applications

Data and Web Science Group @ University of Mannheim

3 Professors

• Prof. Dr. Heiner Stuckenschmidt

• Prof. Dr. Simone Paolo Ponzetto

• Prof. Dr. Christian Bizer

5 Post-Doctoral Researchers

18 PhD Students

http://dws.informatik.uni-mannheim.de/

1. Research methods for integrating and mining large amounts of heterogeneous information within enterprise and open Web contexts.

2. Empirically analyze the content and structure of the Web.

Page 3: Evolving the Web into a Global Database - Advances and Applications

Querying the classic Web

[Figure: DB, HTML]

Page 4: Evolving the Web into a Global Database - Advances and Applications

Long standing Goal

Query the Web like a single, global database

Page 5: Evolving the Web into a Global Database - Advances and Applications

2001 Article: The Semantic Web

Envisions three things to happen:

1. People publish data in structured form in addition to HTML pages on the Web.

2. Common vocabularies / ontologies are used to represent data.

3. People implement cool applications that do smart things with the available data.

Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web. Scientific American, May 2001.

Page 6: Evolving the Web into a Global Database - Advances and Applications

13 Years Later

There are 1.3 million publications about the Semantic Web on Google Scholar, but

1. Do people publish structured data on the Web?

2. Do people agree on common vocabularies / ontologies?

3. What are the cool applications that exploit the data?

Page 7: Evolving the Web into a Global Database - Advances and Applications

Outline

1. Linked Data

2. HTML-embedded Data

3. The Role of Wikipedia

4. Conclusions

Page 8: Evolving the Web into a Global Database - Advances and Applications

1. Linked Data

Set of best practices for publishing structured data on the Web in the form of a single global data graph:

• by using RDF to publish structured data on the Web

• by setting links between data items within different data sources.

[Figure: data sources A, B, C, D and E, each publishing RDF and connected by RDF links]
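As a minimal sketch of these two practices, the following Python snippet represents a data source as a list of RDF triples and serializes them in N-Triples style. The URIs under `example.org` and the helpers `term` and `to_ntriples` are hypothetical and only illustrate the idea; a real publisher would typically use an RDF library.

```python
# Minimal sketch: RDF triples as (subject, predicate, object, is_literal) tuples.
# All example.org URIs are hypothetical placeholders.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def term(value, is_literal=False):
    """Render a URI as <...> or a plain literal as "..." (N-Triples style)."""
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def to_ntriples(triples):
    """Serialize the tuples, one statement per line."""
    return "\n".join(
        f"{term(s)} {term(p)} {term(o, lit)} ." for s, p, o, lit in triples
    )


source_a = [
    # Practice 1: publish structured data as RDF.
    ("http://example.org/a/person/1", FOAF_NAME, "Christian Bizer", True),
    # Practice 2: set a link to a data item in a different data source.
    ("http://example.org/a/person/1", OWL_SAME_AS,
     "http://example.org/b/author/42", False),
]

print(to_ntriples(source_a))
```

The second triple is what turns two isolated data sources into parts of one graph: a consumer that follows the link can merge what both sources say about the same person.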

Page 9: Evolving the Web into a Global Database - Advances and Applications

Global Identifiers and Links as Integration Hints

Publishing identity links on the Web:

<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4> owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .

Publishing vocabulary links on the Web:

<http://xmlns.com/foaf/0.1/Person> owl:equivalentClass <http://dbpedia.org/ontology/Person> .
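One way a consumer might exploit identity links is to group all URIs connected by `owl:sameAs` into clusters that denote the same real-world entity. The sketch below does this with a small union-find structure. The function name `cluster_same_as` and the URIs under `example.org` are made up for illustration (the extra link only shows transitivity) and are not from the talk.

```python
# Sketch: cluster URIs that are connected by owl:sameAs links (union-find).
def cluster_same_as(links):
    """Return a list of sets; each set holds URIs declared identical."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


links = [
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # Hypothetical third identifier, added only to show transitivity.
    ("http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer",
     "http://example.org/people/bizer"),
    # Hypothetical unrelated pair that forms its own cluster.
    ("http://example.org/people/x", "http://example.org/people/y"),
]

for cluster in cluster_same_as(links):
    print(sorted(cluster))
```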

Page 10: Evolving the Web into a Global Database - Advances and Applications

Effort Distribution between Publisher and Consumer

Publisher or third parties provide identity/vocabulary links

Consumer mines missing identity/vocabulary links

[Figure: effort distribution between publisher and consumer]

Page 11: Evolving the Web into a Global Database - Advances and Applications

W3C Linking Open Data Project

Grassroots community effort started in 2007 to

• publish existing open license datasets as Linked Data on the Web

• interlink things between different data sources

• maintain a data set catalog on the CKAN DataHub

Page 12: Evolving the Web into a Global Database - Advances and Applications

LOD Datasets on the Web: September 2011

• 295 data sets

• 31.6 billion RDF triples

• 503 million RDF links

Page 13: Evolving the Web into a Global Database - Advances and Applications

Distribution by Topical Domain (September 2011)

Domain          Data Sets   Triples           Percent    RDF Links      Percent
Media           25          1,841,852,061     5.82 %     50,440,705     10.01 %
Geographic      31          6,145,532,484     19.43 %    35,812,328     7.11 %
Government      49          13,315,009,400    42.09 %    19,343,519     3.84 %
Library         87          2,950,720,693     9.33 %     139,925,218    27.76 %
Cross-domain    41          4,184,635,715     13.23 %    63,183,065     12.54 %
Life sciences   41          3,036,336,004     9.60 %     191,844,090    38.06 %
User content    20          134,127,413       0.42 %     3,449,143      0.68 %
SUM             295         31,634,213,770               503,998,829

Newer statistics

• LODstats (University of Leipzig, 2014): 928 data sets

• LDspider Crawl (University of Mannheim, 2013): 850 data sets

Page 14: Evolving the Web into a Global Database - Advances and Applications

Ontological Agreement

Out of the 295 data sources

• 102 (35%) only use terms from common vocabularies

• 105 (36%) only use proprietary terms

• 88 (29%) mix common and proprietary terms

Popular Vocabularies

Vocabulary        # Data Sets
Dublin Core       92 (31.19 %)
FOAF              81 (27.46 %)
SKOS              58 (19.66 %)
GEO               25 (8.47 %)
AKT               17 (5.76 %)
BIBO              14 (4.75 %)
Music Ontology    13 (4.41 %)
SIOC              10 (3.39 %)
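The three-way breakdown above (common only, proprietary only, mixed) can be reproduced for any data set by checking which namespaces its terms come from. A rough sketch follows. The list of "common" namespaces and the example term URIs are illustrative assumptions, not the criteria used in the original analysis.

```python
# Sketch: classify a data set by the vocabularies its terms come from.
# COMMON_NAMESPACES is an assumed, hand-picked list for illustration.
COMMON_NAMESPACES = (
    "http://purl.org/dc/terms/",             # Dublin Core
    "http://xmlns.com/foaf/0.1/",            # FOAF
    "http://www.w3.org/2004/02/skos/core#",  # SKOS
)


def vocabulary_usage(term_uris):
    """Return 'common', 'proprietary' or 'mixed' for a collection of term URIs."""
    flags = {uri.startswith(COMMON_NAMESPACES) for uri in term_uris}
    if not flags:
        raise ValueError("data set uses no terms")
    if flags == {True}:
        return "common"
    if flags == {False}:
        return "proprietary"
    return "mixed"


# Hypothetical term lists for three data sets.
print(vocabulary_usage(["http://xmlns.com/foaf/0.1/name"]))
print(vocabulary_usage(["http://example.org/vocab#price"]))
print(vocabulary_usage(["http://xmlns.com/foaf/0.1/name",
                        "http://example.org/vocab#price"]))
```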

Page 17: Evolving the Web into a Global Database - Advances and Applications

Uptake in the Government Domain

Goals

• Make data available to the public and other government agencies

• Ease data integration by providing unique identifiers and by setting links

W3C Government Linked Data Working Group

Page 18: Evolving the Web into a Global Database - Advances and Applications

Uptake in the Libraries Community

Institutions publishing Linked Data

• Library of Congress (subject headings)

• German National Library (PND dataset and subject headings)

• Swedish National Library (Libris - catalog)

• Hungarian National Library (OPAC and Digital Library)

• Europeana Digital Library (4 million artifacts)

Goals:

1. Integrate library catalogs on a global scale

2. Interconnect resources between repositories (by topic, by location, by historical period, by ...)

Page 19: Evolving the Web into a Global Database - Advances and Applications

Industry Uptake

Media Industry

• British Broadcasting Corporation

• New York Times

• Wolters Kluwer

• Springer

Pharmaceutical Industry

• Johnson & Johnson

• Eli Lilly and Company

• AstraZeneca

IT Industry

• IBM

Page 20: Evolving the Web into a Global Database - Advances and Applications

2. HTML-embedded Data

Websites semantically mark up the content of their HTML pages using:

• Microformats

• Microdata

• RDFa

Page 21: Evolving the Web into a Global Database - Advances and Applications

Schema.org

Major search engines have asked site owners since 2011 to mark up data to enrich search results.

200+ Types: Event, Organization, Person, Place, Product, Review

Encoding: Microdata or alternatively RDFa
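To make the encoding concrete, the sketch below parses a small, made-up HTML snippet that marks up a product with Schema.org Microdata and collects the `itemprop` values using Python's standard `html.parser` module. The class name `MicrodataSketch` is hypothetical. It handles only flat text properties (no nested items, no `content` or `href` attributes, no void elements such as `<br>`) and is meant as an illustration, not a complete Microdata extractor.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collect the item type and flat text properties of Microdata markup."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._open = []  # one entry per open element: itemprop name or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and self.item_type is None:
            self.item_type = attrs.get("itemtype")
        self._open.append(attrs.get("itemprop"))

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element if it carries itemprop.
        if self._open and self._open[-1] and data.strip():
            name = self._open[-1]
            self.properties[name] = self.properties.get(name, "") + data.strip()


# Made-up page fragment for illustration.
html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Apple MacBook Air 11-in</span>
  <span itemprop="description">Faster Flash Storage</span>
</div>
"""

parser = MicrodataSketch()
parser.feed(html)
parser.close()
print(parser.item_type)
print(parser.properties)
```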

Page 22: Evolving the Web into a Global Database - Advances and Applications

Open Graph Protocol

allows site owners to determine how entities are described in Facebook

relies on RDFa for encoding data in HTML pages

available since April 2010
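As an illustration of what such markup looks like to a consumer, the sketch below collects `og:` properties from `<meta>` elements in a page head. The HTML snippet, its values and the class name `OpenGraphSketch` are invented for the example; the `property`/`content` attribute pair follows the convention published by the Open Graph Protocol.

```python
from html.parser import HTMLParser


class OpenGraphSketch(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.og[prop] = attrs["content"]


# Made-up page for illustration.
page = """
<html><head>
<meta property="og:title" content="Example Movie" />
<meta property="og:type" content="video.movie" />
<meta name="description" content="not Open Graph" />
</head><body></body></html>
"""

parser = OpenGraphSketch()
parser.feed(page)
parser.close()
print(parser.og)
```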

Page 23: Evolving the Web into a Global Database - Advances and Applications

The Common Crawl

Page 24: Evolving the Web into a Global Database - Advances and Applications

The WebDataCommons.org Project

extracts all Microformat, Microdata, RDFa data from the Common Crawl

analyzes and provides the extracted data for download

Two extraction runs

• 2009/2010 CC Corpus: 2.5 billion HTML pages, 5.1 billion RDF triples

• 2012 CC Corpus: 3.0 billion HTML pages, 7.3 billion RDF triples

Used 100 machines on Amazon EC2

• approx. 3000 machine hours (spot instances of type c1.xlarge), 550 EUR
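A quick back-of-the-envelope reading of these figures, as a sketch: it assumes the 3000 machine hours were spread evenly over the 100 machines and that the 550 EUR covers exactly those hours. The slide states neither assumption explicitly.

```python
machines = 100
machine_hours = 3000   # approximate figure from the slide
total_cost_eur = 550

# Wall-clock time if the work was spread evenly over all machines.
wall_clock_hours = machine_hours / machines
# Average price paid per machine hour.
cost_per_machine_hour = total_cost_eur / machine_hours

print(f"{wall_clock_hours:.0f} h wall-clock, "
      f"{cost_per_machine_hour:.3f} EUR per machine hour")
```

Under these assumptions the extraction takes roughly 30 hours of wall-clock time at about 0.18 EUR per machine hour.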

Joint effort in the context of the EU project

Page 25: Evolving the Web into a Global Database - Advances and Applications

Websites providing Structured Data (2012)

2.29 million websites (PLDs) out of 40 million provide Microformat, Microdata or RDFa data (5.65%)

369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%)

Google, October 2013: 15% of all websites provide structured data.

Page 26: Evolving the Web into a Global Database - Advances and Applications

Breakdown by Encoding Format and Site Popularity

Grouped by Alexa Website Popularity Rank (rank based on amount of page views)

Page 27: Evolving the Web into a Global Database - Advances and Applications

RDFa Topics (CC 2012)

Topics

• CMS and Blog metadata

• Product data

• Ratings/Reviews

• Company listings

[Figure: top RDFa classes]

og = Facebook's Open Graph Protocol

Page 28: Evolving the Web into a Global Database - Advances and Applications

Microdata Topics (CC 2012)

Topics

• CMS and Blog metadata

• Navigational metadata

• Products and offers

• Business listings

• Ratings

• Places

• Events

[Figure: top Microdata classes]

schema = Schema.org

datavoc = Google's Rich Snippet Vocabulary

Page 29: Evolving the Web into a Global Database - Advances and Applications

Class / Property Distribution

A small set of classes / properties is used.

Strong focus on Schema.org and Facebook vocabularies

Page 30: Evolving the Web into a Global Database - Advances and Applications

Looking Deeper into the E-Commerce Data

Microdata (2012)

Example Names:

• Apple MacBook Air MC968/A 11.6-Inch Laptop

• Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7

Example Description:

• Faster Flash Storage with 64 GB Solid State Drive and USB 3.0 …

Page 31: Evolving the Web into a Global Database - Advances and Applications

Usage of Schema.org Data @ Google

Rich snippets within search results

Page 32: Evolving the Web into a Global Database - Advances and Applications

Usage of Open Graph Protocol Data @ Facebook

allows site owners to determine how entities are described in Facebook

Page 33: Evolving the Web into a Global Database - Advances and Applications

Valuable Resource for Comparison Shopping Sites

We analyzed 1.9 million product offers from 9200 shops.

We trained a classifier for 9 product categories on product descriptions from Amazon.

Page 34: Evolving the Web into a Global Database - Advances and Applications

Identity Resolution for Electronic Products

We trained a parser for product descriptions on offers for electronic products from Amazon.

We used the Silk framework for identity resolution.
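Identity resolution of this kind boils down to deciding whether two offers describe the same product. The sketch below is not Silk's API and not the parser used in the talk. It is a simple stand-in that compares the two example product names from the e-commerce slide by Jaccard similarity over lowercased tokens. The function names and the threshold of 0.15 are arbitrary choices for illustration.

```python
import re


def tokens(text):
    """Lowercase a product name and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def same_product(a, b, threshold=0.15):
    """Treat two offers as the same product if similarity reaches the threshold."""
    return jaccard(a, b) >= threshold


offer_1 = "Apple MacBook Air MC968/A 11.6-Inch Laptop"
offer_2 = "Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7"
unrelated = "Canon EOS 600D Digital Camera"  # hypothetical third offer

print(round(jaccard(offer_1, offer_2), 2), same_product(offer_1, offer_2))
print(round(jaccard(offer_1, unrelated), 2), same_product(offer_1, unrelated))
```

Plain token overlap is fragile: "11.6-Inch" and "11-in" share only the token "11". That is one reason a trained parser that first extracts attributes such as brand, model and screen size is a more robust basis for matching.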

Page 35: Evolving the Web into a Global Database - Advances and Applications

Linked Data vs. HTML-embedded Data

LOD Cloud                               Microdata, Microformats, RDFa
< 1000 sources                          millions of sources
covers wider range of specific topics   focused on search engines and Facebook
contains more complex data structures   very simple and shallow data structures
partial ontology agreement              strong ontology agreement
data integration eased by RDF links     data integration requires NLP techniques

Page 36: Evolving the Web into a Global Database - Advances and Applications

3. The Role of Wikipedia

[Figure labels: Title, Description, Cross-language links, Geo-coordinates, Images, Infoboxes]

Page 37: Evolving the Web into a Global Database - Advances and Applications

Extracting Knowledge from Wikipedia

Page 38: Evolving the Web into a Global Database - Advances and Applications

The DBpedia Knowledge Base - Version 3.9

Describes 4.00 million things, out of which 3.22 million are classified in a consistent ontology using 529 classes and 2217 different properties

• 832,000 persons

• 639,000 places

• 209,000 organizations

• 116,000 music albums

Altogether 2.46 billion pieces of information (RDF triples)

• 24,000,000 links to external web pages

• 27,200,000 external links into other RDF datasets

DBpedia Internationalization

• data from 119 Wikipedia language editions is provided for download

• for 24 popular languages, cleaned infobox data is provided
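Because DBpedia is published as RDF, it can be queried much like a database. As a sketch, the snippet below builds (but does not send) a SPARQL query URL asking for a person's birth date. It assumes the public endpoint at `https://dbpedia.org/sparql`, the resource `dbr:Michael_Douglas` and the property `dbo:birthDate`; these should be checked against the current DBpedia release before relying on them. The helper names are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public DBpedia endpoint


def birth_date_query(resource_name):
    """Build a SPARQL query for the dbo:birthDate of a DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        f"SELECT ?date WHERE {{ dbr:{resource_name} dbo:birthDate ?date }}"
    )


def query_url(query):
    """Encode the query as an HTTP GET URL using the 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = birth_date_query("Michael_Douglas")
print(query)
print(query_url(query))
```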

Page 40: Evolving the Web into a Global Database - Advances and Applications

Applications of Google's Knowledge Graph

1. Answer fact queries: “birthdate michael douglas”

2. Compare things: “compare eiffel tower vs empire state building”

Page 41: Evolving the Web into a Global Database - Advances and Applications

Applications of Google's Knowledge Graph

3. Enrich search results with infoboxes and lists

• Infoboxes might also contain Microdata/RDFa data, e.g. concerts of a band

4. Rank search results using the new Hummingbird ranking algorithm

Page 42: Evolving the Web into a Global Database - Advances and Applications

DBpedia as Background Knowledge for Data Mining

Which factors correlate with unemployment in France?

Page 43: Evolving the Web into a Global Database - Advances and Applications

Unemployment Table with additional Attributes

Page 44: Evolving the Web into a Global Database - Advances and Applications

RapidMiner Linked Open Data Extension

Allows you to

1. link a local table to DBpedia and other LOD data sources

2. extend the local table with additional attributes

3. mine the extended tables using all RapidMiner features
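The first two steps can be sketched in plain Python. This is not the RapidMiner extension's API: the link step here is naive exact label matching, the `lod` dictionary stands in for attributes that would be fetched from DBpedia or another LOD source, and every name and number in it is made up for illustration.

```python
# Local table: one row per region (values are invented for illustration).
local_table = [
    {"region": "Region A", "unemployment": 9.0},
    {"region": "Region B", "unemployment": 12.5},
    {"region": "Region C", "unemployment": 7.0},  # has no match in the source
]

# Stand-in for a LOD source: label -> (URI, extra attributes). Hypothetical.
lod = {
    "Region A": ("http://example.org/resource/Region_A", {"population_growth": 0.3}),
    "Region B": ("http://example.org/resource/Region_B", {"population_growth": 0.9}),
}


def link_and_extend(rows, source, key="region"):
    """Step 1: link each row to a URI; step 2: add the linked attributes."""
    extended = []
    for row in rows:
        new_row = dict(row)  # leave the original table untouched
        match = source.get(row[key])
        if match is not None:
            uri, attributes = match
            new_row["uri"] = uri
            new_row.update(attributes)
        extended.append(new_row)
    return extended


for row in link_and_extend(local_table, lod):
    print(row)
```

Rows without a match keep their original columns only, so a mining step run afterwards has to cope with missing values.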

Page 45: Evolving the Web into a Global Database - Advances and Applications

Finding Correlations

Use additional attributes to find interesting correlations

Example correlations for unemployment in France:

• African islands, Islands in the Indian Ocean, Outermost regions of the EU (positive)

• Population growth (positive)

• Disposable income (negative)

• Energy consumption (negative)

• Fast food restaurants (positive)

• Hospital beds/inhabitants (negative)

• Police stations (positive)
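Whether an added attribute correlates positively or negatively with the target can be checked with a plain Pearson correlation coefficient. The sketch below assumes this measure for illustration (the slide does not say which one was used) and runs it on invented numbers, not on the French unemployment data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented toy values: one target column and two added attributes.
unemployment = [8.0, 9.5, 11.0, 13.0]
population_growth = [0.2, 0.4, 0.5, 0.9]      # rises with the target
disposable_income = [21.0, 19.5, 18.0, 16.0]  # falls as the target rises

for name, column in [("population growth", population_growth),
                     ("disposable income", disposable_income)]:
    r = pearson(unemployment, column)
    sign = "positive" if r > 0 else "negative"
    print(f"{name}: r = {r:+.2f} ({sign})")
```

A correlation found this way says nothing about causation; attributes such as "fast food restaurants" or "police stations" are best read as hints for further analysis.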

Page 46: Evolving the Web into a Global Database - Advances and Applications

Conclusions

1. Publication of Structured Data

• There is more data than most people from research and industry like

• Exciting test-bed for data profiling and data integration techniques

• Not even the research focus has moved to the integration of 1000s of sources

2. Ontological Agreement

• Application-pull helps (Google et al.)

• But data source-specific attributes are also important (e.g. in life science or statistics domain)

3. Applications

• The big players are moving

• There is a lot of experimentation in industry, but many efforts are still in the prototype stage

Page 47: Evolving the Web into a Global Database - Advances and Applications

Thanks

Mannheim Linked Open Data Meetup

Free beer and food

Talks by Springer, Wolters Kluwer, Semantic Web Company, LOD2 project participants, DWS group members

Sunday, February 23, 2014, 6:30 PM

http://www.meetup.com/OpenKnowledgeFoundation/Mannheim-DE/1092882/
