sirenentity-oriented search infrastructure and tools · 2011-07-30 · semsearchentity ranking and...

5
The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data EOS 2011 S. Campinas, D. Ceccarelli, T.E. Perry, R. Delbru, K. Balog, G. Tummarello Introduction Web of Data Data Collection Statistics Tools Conclusion Outline Introduction Web of Data Overview Web of Data Model Data Collection Sindice-2011 Entity Extraction Sindice-DE Sindice-ED Statistics Data Heterogeneity Term Distribution Dataset Coverage Tools SIREn TREC Entity Tools Conclusion 1 / 25 Introduction Web of Data Data Collection Statistics Tools Conclusion Towards Entity-Oriented Search Retrieving entities? people, organisation, products, locations Emphasis of the research community as well as the commercial sector on the entity-oriented search International Benchmarking campaigns foster research in that direction TREC Expert finding and Related Entity finding tasks INEX Entity Ranking track SemSearch Entity Ranking and Lists search task 2 / 25 Introduction Web of Data Data Collection Statistics Tools Conclusion In need for research material: the Web of Data The Web of Data fits the need of entity-oriented search: Data naturally centered around entities Provides a semi-structured description of the data To foster research and development in that direction: Sindice-2011 Entity-oriented dataset representative of the entities found on the Web of Data SIREn Entity-oriented search infrastructure and tools for experiments 3 / 25 Introduction Web of Data Data Collection Statistics Tools Conclusion Overview 4 / 25 Introduction Web of Data Data Collection Statistics Tools Conclusion Overview 4 / 25 Introduction Web of Data Data Collection Statistics Tools Conclusion Web of Data: Model 5 / 25 Introduction Web of Data Data Collection Statistics Tools Conclusion The Sindice-2011 Dataset Sindice crawls the Web for documents containing semantic markups since 2009 Sindice-2011 is a up-to-date snapshot of the Web of Data Large coverage of domains: e-commerce, social networks, publications, . . . Highly heterogeneous collection: many ontologies and predicates as well as data formats are used 6 / 25

Upload: others

Post on 24-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SIREnEntity-oriented search infrastructure and tools · 2011-07-30 · SemSearchEntity Ranking and Lists search task 2 / 25 The Sindice-2011 Dataset for Entity-Oriented Search in

The Sindice-2011 Dataset for Entity-Oriented Search in theWeb of Data

EOS 2011

S. Campinas, D. Ceccarelli, T.E. Perry,R. Delbru, K. Balog, G. Tummarello

Introduction Web of Data Data Collection Statistics Tools Conclusion

Outline

Introduction

Web of DataOverviewWeb of Data Model

Data CollectionSindice-2011Entity ExtractionSindice-DE

Sindice-EDStatistics

Data HeterogeneityTerm DistributionDataset Coverage

ToolsSIREnTREC Entity Tools

Conclusion

1 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Towards Entity-Oriented Search

Retrieving entities? people, organisation, products, locations

Emphasis of the research community as well as thecommercial sector on the entity-oriented search

International Benchmarking campaigns foster research in thatdirection

TREC Expert finding and Related Entity finding tasksINEX Entity Ranking track

SemSearch Entity Ranking and Lists search task

2 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

In need for research material: the Web of Data

The Web of Data fits the need of entity-oriented search:

Data naturally centered around entitiesProvides a semi-structured description of the data

To foster research and development in that direction:

Sindice-2011 Entity-oriented dataset representative of theentities found on the Web of Data

SIREn Entity-oriented search infrastructure and toolsfor experiments

3 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Overview

4 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Overview

4 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Web of Data: Model

5 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

The Sindice-2011 Dataset

Sindice crawls the Web for documents containing semanticmarkups since 2009

Sindice-2011 is a up-to-date snapshot of the Web of Data

Large coverage of domains: e-commerce, social networks,publications, . . .Highly heterogeneous collection: many ontologies andpredicates as well as data formats are used

6 / 25

Page 2: SIREnEntity-oriented search infrastructure and tools · 2011-07-30 · SemSearchEntity Ranking and Lists search task 2 / 25 The Sindice-2011 Dataset for Entity-Oriented Search in

Introduction Web of Data Data Collection Statistics Tools Conclusion

The Sindice-2011 Dataset: Two Views

The Sindice-2011 dataset provides two views on the entities

Sindice-DE a Document-Entity view where an entity ispresented within a context, i.e., the document

Sindice-ED a Entity-Document view providing a globaldescription of an entity across documents

7 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-DE is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

http://renaud.delbru.fr/foaf.rdf

8 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-DE is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

8 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-DE is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

Me

Me

Diego

Renaud

Stephane

knows

knows

name

8 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-DE is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

http://renaud.delbru.fr/foaf.rdf

Me

Me

Diego

Renaud

Stephane

knows

knows

name

Diego

Stephane27

Italy

8 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-DE is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

http://renaud.delbru.fr/foaf.rdf

Me

Me

Diego

Renaud

Stephane

knows

knows

name

Diego

Stephane27

Italy

Diego

Me

Diego

Stephane

27

Italy

knows

knows

country

age

Diego

8 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-ED is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

9 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-ED is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

Me

Diego

Renaud

Stephane

Me @ http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

knows

knows

name

9 / 25

Page 3: SIREnEntity-oriented search infrastructure and tools · 2011-07-30 · SemSearchEntity Ranking and Lists search task 2 / 25 The Sindice-2011 Dataset for Entity-Oriented Search in

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-ED is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

Me @ http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

knows

knows

name

Me

Giovanni

Delbru

StephaneItaly

knows

knows

country

surname

http://test.com/renaud.rdf

9 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-ED is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

Me @ http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

knows

knows

name

Me

Giovanni

Delbru

StephaneItaly

knows

knows

country

surname

http://test.com/renaud.rdf

Me

Delbru

Stephane

9 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

How Sindice-ED is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

Me @ http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

knows

knows

name

Me

Giovanni

Delbru

StephaneItaly

knows

knows

country

surname

http://test.com/renaud.rdf

Me

Delbru

Stephane

Me

Delbru

Stephane knows

Surname

Me @ http://test.com/renaud.rdf

9 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Sindice-2011 Statistics

We provide statistics about the Sindice-2011

We organize statistics in three groups:1 data heterogeneity2 entity size3 term distribution

10 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity

230 million distinctdocuments

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

11 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity

230 million distinct documents

270,404 distinct second-leveldomains

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

12 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity

230 million distinct documents

270,404 distinct second-leveldomains

1.7 billion entities

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

13 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity

230 million distinct documents

270,404 distinct second-leveldomains

1.7 billion entities

11 billion of statements

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

14 / 25

Page 4: SIREnEntity-oriented search infrastructure and tools · 2011-07-30 · SemSearchEntity Ranking and Lists search task 2 / 25 The Sindice-2011 Dataset for Entity-Oriented Search in

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity

230 million distinct documents

270,404 distinct second-leveldomains

1.7 billion entities

11 billion of statements

using more than 600 thousandontologies

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

15 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity

230 million distinct documents

270,404 distinct second-leveldomains

1.7 billion entities

11 billion of statements

using more than 600 thousandontologies

. . . and more than 5 millionpredicates

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

16 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity - Data Formats

Half (142 million) of thedocuments are containingRDFa markups.

Half (130 million) of thedocuments are containingMicroformats markups.

A third (87 million) arecontaining plain RDF

All standardised dataformats are well covered.

105 106 107 108

RDFARDF

HCARDXFNADR

LICENSEGEOICBM

HCALENDARHREVIEWHRESUMEGEOURL

HLISTING

Number of documents using the format (log)

17 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Data heterogeneity - Predicate Distribution

Frequency of use ofpredicate URIs follows apower-law distribution

A large number of predicates(98%) are used only once

Distribution tail is sparse: afew predicates are used in alarge proportion ofdocuments

100 101 102 103 104 105 106 107

10−8

10−6

10−4

10−2

100

number of documents (log)

probability(log

)

α = 1.45

18 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Entity Size

Sindice-DE

25% of entities have three of fourtriples75% have ten or fewer triplesAverage entity size is 5.32 triples

Sindice-ED

20% of entities have three or fourtriples75% have ten or fewer triples.Average entity size is 14.19triples

100 101 102 10310−8

10−6

10−4

10−2

entity size in triples (log)

probab

ility(log

)

α = 3.6

100 101 102 103 104 105 106

10−8

10−6

10−4

10−2

100

entity size in triples (log)

probability(log

)

α = 2.6

19 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Term Distribution

Distribution of literal terms overpredicates (top) and thedistribution of URIs over predicates(bottom)

Probability, for each literal term orURI, to appear in exactly n distinctpredicates

URIs often tend to appear jointlywith a small number of predicates,while in literals this behavior is lesspronounced

100 101 102 103 104 105 106

10−10

10−8

10−6

10−4

10−2

100

number of distinct predicates (log)

probab

ility(log

)

α = 2.36

100 101 102 103 104 105

10−10

10−8

10−6

10−4

10−2

100

number of distinct predicates (log)

probab

ility(log

)

α = 3.22

20 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Coverage

We checked all the URIs in the relevance assessment files fromthe 2010 and 2011 editions of the Semantic Search Challengeand from the ELC task of the 2010 TREC Entity track

The Sindice-2011 Dataset covers 99% of these URIs

The missing ones are either:1 not exist anymore on the Web2 synthetic URIs that were generated to replace blank nodes

with URIs in the BTC-2009 dataset

21 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Semantic Information Retrieval Engine

Field-based techniques (e.g., Lucene) inefficient with highlyheterogeneous structured data

SIREn, an IR search engine for the Web of Data

Based on XML IR techniqueBuilt for dealing with highly heterogeneous structured datasources

No limit in the number of attributesExtended support for structured query

> Delbru, R., Toupikov, N., Catasta, M., Fuller, R.,Tummarello, G.: SIREn: Efficient Search on Semi-Structured Documents. In: Lucene in Action, SecondEdition. Manning Publications Co. (2009)

> http://siren.sindice.com

22 / 25

Page 5: SIREnEntity-oriented search infrastructure and tools · 2011-07-30 · SemSearchEntity Ranking and Lists search task 2 / 25 The Sindice-2011 Dataset for Entity-Oriented Search in

Introduction Web of Data Data Collection Statistics Tools Conclusion

TREC Entity Tools

Search infrastructure and set of tools developed on top ofSIREn

Provide out of the box functionalities to process, index andsearch Sindice-2011 dataset

One can develop and test new entity retrieval model

Two-step retrieval processTF-IDF ranking baseline provided, but possibilities toimplement its own rankingQuery expansion techniques that combines full-text andstructured queries

> https://github.com/rdelbru/trec-entity-tool

23 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Conclusion

Sindice-2011, a dataset to foster research on entity-orientedsearch

Accurate reflection of the current Web of DataA corpus of 1.7 Billion entities

A search infrastructure tool based on SIREn, the SemanticInformation Retrieval Search Engine, in order to easedevelopment and research

24 / 25

Introduction Web of Data Data Collection Statistics Tools Conclusion

Thank You!

Questions ?

25 / 25