sirenentity-oriented search infrastructure and tools · 2011-07-30 · semsearchentity ranking and...

The Sindice-2011 Dataset for Entity-Oriented Search in theWeb of Data

EOS 2011

S. Campinas, D. Ceccarelli, T.E. Perry,R. Delbru, K. Balog, G. Tummarello

Introduction Web of Data Data Collection Statistics Tools Conclusion

Outline

Introduction

Web of DataOverviewWeb of Data Model

Data CollectionSindice-2011Entity ExtractionSindice-DE

Sindice-EDStatistics

Data HeterogeneityTerm DistributionDataset Coverage

ToolsSIREnTREC Entity Tools

Conclusion

1 / 25


Towards Entity-Oriented Search

Retrieving entities? people, organisation, products, locations

Emphasis of the research community as well as thecommercial sector on the entity-oriented search

International Benchmarking campaigns foster research in thatdirection

TREC Expert finding and Related Entity finding tasksINEX Entity Ranking track

SemSearch Entity Ranking and Lists search task

2 / 25


In need for research material: the Web of Data

The Web of Data fits the need of entity-oriented search:

Data naturally centered around entitiesProvides a semi-structured description of the data

To foster research and development in that direction:

Sindice-2011 Entity-oriented dataset representative of theentities found on the Web of Data

SIREn Entity-oriented search infrastructure and toolsfor experiments

3 / 25


Overview

4 / 25


Overview

4 / 25


Web of Data: Model

5 / 25


The Sindice-2011 Dataset

Sindice crawls the Web for documents containing semanticmarkups since 2009

Sindice-2011 is a up-to-date snapshot of the Web of Data

Large coverage of domains: e-commerce, social networks,publications, . . .Highly heterogeneous collection: many ontologies andpredicates as well as data formats are used

6 / 25


The Sindice-2011 Dataset: Two Views

The Sindice-2011 dataset provides two views on the entities

Sindice-DE a Document-Entity view where an entity ispresented within a context, i.e., the document

Sindice-ED a Entity-Document view providing a globaldescription of an entity across documents

7 / 25


How Sindice-DE is produced

http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


8 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Diego

Renaud

Stephane

8 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Diego

Renaud

Stephane

Me

Me

Diego

Renaud

Stephane

knows

knows

name

8 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Me

Diego

Renaud

Stephane

knows

knows

name

Diego

Stephane27

Italy

8 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Me

Diego

Renaud

Stephane

knows

knows

name

Diego

Stephane27

Italy

Diego

Me

Diego

Stephane

27

Italy

knows

knows

country

age

Diego

8 / 25


How Sindice-ED is produced


Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

9 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name

Me

Diego

Renaud

Stephane

Me @ http://renaud.delbru.fr/foaf.rdf

Me

Diego

Renaud

Stephane

knows

knows

name

9 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Diego

Renaud

Stephane

knows

knows

name

Me

Giovanni

Delbru

StephaneItaly

knows

knows

country

surname

http://test.com/renaud.rdf

9 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Diego

Renaud

Stephane

knows

knows

name

Me

Giovanni

Delbru

StephaneItaly

knows

knows

country

surname


Me

Delbru

Stephane

9 / 25




Me

Diego

Renaud

Stephane27

Italy

knows

knowsknows

country

age

name


Me

Diego

Renaud

Stephane

knows

knows

name

Me

Giovanni

Delbru

StephaneItaly

knows

knows

country

surname


Me

Delbru

Stephane

Me

Delbru

Stephane knows

Surname

Me @ http://test.com/renaud.rdf

9 / 25


Sindice-2011 Statistics

We provide statistics about the Sindice-2011

We organize statistics in three groups:1 data heterogeneity2 entity size3 term distribution

10 / 25


Data heterogeneity

230 million distinctdocuments

Description Statistic

Documents 230,643,642

Domains 4,316,438

Second-level domains 270,404

Entities 1,738,089,611

Statements 11,163,347,487

Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)

Blank nodes (as objects) 1,733,769,572 (15.53%)

Unique ontologies 663,062

Unique predicate URIs 5,396,273

Unique literal words 114,138,096

Unique URIs (as objects) 409,589,991

11 / 25


Data heterogeneity

230 million distinct documents

270,404 distinct second-leveldomains



Domains 4,316,438


Entities 1,738,089,611


Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)






12 / 25


Data heterogeneity



1.7 billion entities



Domains 4,316,438


Entities 1,738,089,611


Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)






13 / 25


Data heterogeneity




11 billion of statements



Domains 4,316,438


Entities 1,738,089,611


Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)






14 / 25


Data heterogeneity





using more than 600 thousandontologies



Domains 4,316,438


Entities 1,738,089,611


Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)






15 / 25


Data heterogeneity





using more than 600 thousandontologies

. . . and more than 5 millionpredicates



Domains 4,316,438


Entities 1,738,089,611


Bytes 1,274,254,991,725

Literals 3,968,223,334 (35.55%)

URIs (as objects) 5,461,354,581 (48.92%)






16 / 25


Data heterogeneity - Data Formats

Half (142 million) of thedocuments are containingRDFa markups.

Half (130 million) of thedocuments are containingMicroformats markups.

A third (87 million) arecontaining plain RDF

All standardised dataformats are well covered.

105 106 107 108

RDFARDF

HCARDXFNADR

LICENSEGEOICBM

HCALENDARHREVIEWHRESUMEGEOURL

HLISTING

Number of documents using the format (log)

17 / 25


Data heterogeneity - Predicate Distribution

Frequency of use ofpredicate URIs follows apower-law distribution

A large number of predicates(98%) are used only once

Distribution tail is sparse: afew predicates are used in alarge proportion ofdocuments

100 101 102 103 104 105 106 107

10−8

10−6

10−4

10−2

100

number of documents (log)

probability(log

)

α = 1.45

18 / 25


Entity Size

Sindice-DE

25% of entities have three of fourtriples75% have ten or fewer triplesAverage entity size is 5.32 triples

Sindice-ED

20% of entities have three or fourtriples75% have ten or fewer triples.Average entity size is 14.19triples

100 101 102 10310−8

10−6

10−4

10−2

entity size in triples (log)

probab

ility(log

)

α = 3.6

100 101 102 103 104 105 106

10−8

10−6

10−4

10−2

100

entity size in triples (log)

probability(log

)

α = 2.6

19 / 25


Term Distribution

Distribution of literal terms overpredicates (top) and thedistribution of URIs over predicates(bottom)

Probability, for each literal term orURI, to appear in exactly n distinctpredicates

URIs often tend to appear jointlywith a small number of predicates,while in literals this behavior is lesspronounced

100 101 102 103 104 105 106

10−10

10−8

10−6

10−4

10−2

100

number of distinct predicates (log)

probab

ility(log

)

α = 2.36

100 101 102 103 104 105

10−10

10−8

10−6

10−4

10−2

100

number of distinct predicates (log)

probab

ility(log

)

α = 3.22

20 / 25


Coverage

We checked all the URIs in the relevance assessment files fromthe 2010 and 2011 editions of the Semantic Search Challengeand from the ELC task of the 2010 TREC Entity track

The Sindice-2011 Dataset covers 99% of these URIs

The missing ones are either:1 not exist anymore on the Web2 synthetic URIs that were generated to replace blank nodes

with URIs in the BTC-2009 dataset

21 / 25


Semantic Information Retrieval Engine

Field-based techniques (e.g., Lucene) inefficient with highlyheterogeneous structured data

SIREn, an IR search engine for the Web of Data

Based on XML IR techniqueBuilt for dealing with highly heterogeneous structured datasources

No limit in the number of attributesExtended support for structured query

> Delbru, R., Toupikov, N., Catasta, M., Fuller, R.,Tummarello, G.: SIREn: Efficient Search on Semi-Structured Documents. In: Lucene in Action, SecondEdition. Manning Publications Co. (2009)

> http://siren.sindice.com

22 / 25


TREC Entity Tools

Search infrastructure and set of tools developed on top ofSIREn

Provide out of the box functionalities to process, index andsearch Sindice-2011 dataset

One can develop and test new entity retrieval model

Two-step retrieval processTF-IDF ranking baseline provided, but possibilities toimplement its own rankingQuery expansion techniques that combines full-text andstructured queries

> https://github.com/rdelbru/trec-entity-tool

23 / 25


Conclusion

Sindice-2011, a dataset to foster research on entity-orientedsearch

Accurate reflection of the current Web of DataA corpus of 1.7 Billion entities

A search infrastructure tool based on SIREn, the SemanticInformation Retrieval Search Engine, in order to easedevelopment and research

24 / 25


Thank You!

Questions ?

25 / 25