sirenentity-oriented search infrastructure and tools · 2011-07-30 · semsearchentity ranking and...
TRANSCRIPT
The Sindice-2011 Dataset for Entity-Oriented Search in theWeb of Data
EOS 2011
S. Campinas, D. Ceccarelli, T.E. Perry,R. Delbru, K. Balog, G. Tummarello
Introduction Web of Data Data Collection Statistics Tools Conclusion
Outline
Introduction
Web of DataOverviewWeb of Data Model
Data CollectionSindice-2011Entity ExtractionSindice-DE
Sindice-EDStatistics
Data HeterogeneityTerm DistributionDataset Coverage
ToolsSIREnTREC Entity Tools
Conclusion
1 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Towards Entity-Oriented Search
Retrieving entities? people, organisation, products, locations
Emphasis of the research community as well as thecommercial sector on the entity-oriented search
International Benchmarking campaigns foster research in thatdirection
TREC Expert finding and Related Entity finding tasksINEX Entity Ranking track
SemSearch Entity Ranking and Lists search task
2 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
In need for research material: the Web of Data
The Web of Data fits the need of entity-oriented search:
Data naturally centered around entitiesProvides a semi-structured description of the data
To foster research and development in that direction:
Sindice-2011 Entity-oriented dataset representative of theentities found on the Web of Data
SIREn Entity-oriented search infrastructure and toolsfor experiments
3 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Overview
4 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Overview
4 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Web of Data: Model
5 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
The Sindice-2011 Dataset
Sindice crawls the Web for documents containing semanticmarkups since 2009
Sindice-2011 is a up-to-date snapshot of the Web of Data
Large coverage of domains: e-commerce, social networks,publications, . . .Highly heterogeneous collection: many ontologies andpredicates as well as data formats are used
6 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
The Sindice-2011 Dataset: Two Views
The Sindice-2011 dataset provides two views on the entities
Sindice-DE a Document-Entity view where an entity ispresented within a context, i.e., the document
Sindice-ED a Entity-Document view providing a globaldescription of an entity across documents
7 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-DE is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
http://renaud.delbru.fr/foaf.rdf
8 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-DE is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane
8 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-DE is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane
Me
Me
Diego
Renaud
Stephane
knows
knows
name
8 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-DE is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
http://renaud.delbru.fr/foaf.rdf
Me
Me
Diego
Renaud
Stephane
knows
knows
name
Diego
Stephane27
Italy
8 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-DE is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
http://renaud.delbru.fr/foaf.rdf
Me
Me
Diego
Renaud
Stephane
knows
knows
name
Diego
Stephane27
Italy
Diego
Me
Diego
Stephane
27
Italy
knows
knows
country
age
Diego
8 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-ED is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
9 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-ED is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
Me
Diego
Renaud
Stephane
Me @ http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane
knows
knows
name
9 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-ED is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
Me @ http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane
knows
knows
name
Me
Giovanni
Delbru
StephaneItaly
knows
knows
country
surname
http://test.com/renaud.rdf
9 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-ED is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
Me @ http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane
knows
knows
name
Me
Giovanni
Delbru
StephaneItaly
knows
knows
country
surname
http://test.com/renaud.rdf
Me
Delbru
Stephane
9 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
How Sindice-ED is produced
http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane27
Italy
knows
knowsknows
country
age
name
Me @ http://renaud.delbru.fr/foaf.rdf
Me
Diego
Renaud
Stephane
knows
knows
name
Me
Giovanni
Delbru
StephaneItaly
knows
knows
country
surname
http://test.com/renaud.rdf
Me
Delbru
Stephane
Me
Delbru
Stephane knows
Surname
Me @ http://test.com/renaud.rdf
9 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Sindice-2011 Statistics
We provide statistics about the Sindice-2011
We organize statistics in three groups:1 data heterogeneity2 entity size3 term distribution
10 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity
230 million distinctdocuments
Description Statistic
Documents 230,643,642
Domains 4,316,438
Second-level domains 270,404
Entities 1,738,089,611
Statements 11,163,347,487
Bytes 1,274,254,991,725
Literals 3,968,223,334 (35.55%)
URIs (as objects) 5,461,354,581 (48.92%)
Blank nodes (as objects) 1,733,769,572 (15.53%)
Unique ontologies 663,062
Unique predicate URIs 5,396,273
Unique literal words 114,138,096
Unique URIs (as objects) 409,589,991
11 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity
230 million distinct documents
270,404 distinct second-leveldomains
Description Statistic
Documents 230,643,642
Domains 4,316,438
Second-level domains 270,404
Entities 1,738,089,611
Statements 11,163,347,487
Bytes 1,274,254,991,725
Literals 3,968,223,334 (35.55%)
URIs (as objects) 5,461,354,581 (48.92%)
Blank nodes (as objects) 1,733,769,572 (15.53%)
Unique ontologies 663,062
Unique predicate URIs 5,396,273
Unique literal words 114,138,096
Unique URIs (as objects) 409,589,991
12 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity
230 million distinct documents
270,404 distinct second-leveldomains
1.7 billion entities
Description Statistic
Documents 230,643,642
Domains 4,316,438
Second-level domains 270,404
Entities 1,738,089,611
Statements 11,163,347,487
Bytes 1,274,254,991,725
Literals 3,968,223,334 (35.55%)
URIs (as objects) 5,461,354,581 (48.92%)
Blank nodes (as objects) 1,733,769,572 (15.53%)
Unique ontologies 663,062
Unique predicate URIs 5,396,273
Unique literal words 114,138,096
Unique URIs (as objects) 409,589,991
13 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity
230 million distinct documents
270,404 distinct second-leveldomains
1.7 billion entities
11 billion of statements
Description Statistic
Documents 230,643,642
Domains 4,316,438
Second-level domains 270,404
Entities 1,738,089,611
Statements 11,163,347,487
Bytes 1,274,254,991,725
Literals 3,968,223,334 (35.55%)
URIs (as objects) 5,461,354,581 (48.92%)
Blank nodes (as objects) 1,733,769,572 (15.53%)
Unique ontologies 663,062
Unique predicate URIs 5,396,273
Unique literal words 114,138,096
Unique URIs (as objects) 409,589,991
14 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity
230 million distinct documents
270,404 distinct second-leveldomains
1.7 billion entities
11 billion of statements
using more than 600 thousandontologies
Description Statistic
Documents 230,643,642
Domains 4,316,438
Second-level domains 270,404
Entities 1,738,089,611
Statements 11,163,347,487
Bytes 1,274,254,991,725
Literals 3,968,223,334 (35.55%)
URIs (as objects) 5,461,354,581 (48.92%)
Blank nodes (as objects) 1,733,769,572 (15.53%)
Unique ontologies 663,062
Unique predicate URIs 5,396,273
Unique literal words 114,138,096
Unique URIs (as objects) 409,589,991
15 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity
230 million distinct documents
270,404 distinct second-leveldomains
1.7 billion entities
11 billion of statements
using more than 600 thousandontologies
. . . and more than 5 millionpredicates
Description Statistic
Documents 230,643,642
Domains 4,316,438
Second-level domains 270,404
Entities 1,738,089,611
Statements 11,163,347,487
Bytes 1,274,254,991,725
Literals 3,968,223,334 (35.55%)
URIs (as objects) 5,461,354,581 (48.92%)
Blank nodes (as objects) 1,733,769,572 (15.53%)
Unique ontologies 663,062
Unique predicate URIs 5,396,273
Unique literal words 114,138,096
Unique URIs (as objects) 409,589,991
16 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity - Data Formats
Half (142 million) of thedocuments are containingRDFa markups.
Half (130 million) of thedocuments are containingMicroformats markups.
A third (87 million) arecontaining plain RDF
All standardised dataformats are well covered.
105 106 107 108
RDFARDF
HCARDXFNADR
LICENSEGEOICBM
HCALENDARHREVIEWHRESUMEGEOURL
HLISTING
Number of documents using the format (log)
17 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Data heterogeneity - Predicate Distribution
Frequency of use ofpredicate URIs follows apower-law distribution
A large number of predicates(98%) are used only once
Distribution tail is sparse: afew predicates are used in alarge proportion ofdocuments
100 101 102 103 104 105 106 107
10−8
10−6
10−4
10−2
100
number of documents (log)
probability(log
)
α = 1.45
18 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Entity Size
Sindice-DE
25% of entities have three of fourtriples75% have ten or fewer triplesAverage entity size is 5.32 triples
Sindice-ED
20% of entities have three or fourtriples75% have ten or fewer triples.Average entity size is 14.19triples
100 101 102 10310−8
10−6
10−4
10−2
entity size in triples (log)
probab
ility(log
)
α = 3.6
100 101 102 103 104 105 106
10−8
10−6
10−4
10−2
100
entity size in triples (log)
probability(log
)
α = 2.6
19 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Term Distribution
Distribution of literal terms overpredicates (top) and thedistribution of URIs over predicates(bottom)
Probability, for each literal term orURI, to appear in exactly n distinctpredicates
URIs often tend to appear jointlywith a small number of predicates,while in literals this behavior is lesspronounced
100 101 102 103 104 105 106
10−10
10−8
10−6
10−4
10−2
100
number of distinct predicates (log)
probab
ility(log
)
α = 2.36
100 101 102 103 104 105
10−10
10−8
10−6
10−4
10−2
100
number of distinct predicates (log)
probab
ility(log
)
α = 3.22
20 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Coverage
We checked all the URIs in the relevance assessment files fromthe 2010 and 2011 editions of the Semantic Search Challengeand from the ELC task of the 2010 TREC Entity track
The Sindice-2011 Dataset covers 99% of these URIs
The missing ones are either:1 not exist anymore on the Web2 synthetic URIs that were generated to replace blank nodes
with URIs in the BTC-2009 dataset
21 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Semantic Information Retrieval Engine
Field-based techniques (e.g., Lucene) inefficient with highlyheterogeneous structured data
SIREn, an IR search engine for the Web of Data
Based on XML IR techniqueBuilt for dealing with highly heterogeneous structured datasources
No limit in the number of attributesExtended support for structured query
> Delbru, R., Toupikov, N., Catasta, M., Fuller, R.,Tummarello, G.: SIREn: Efficient Search on Semi-Structured Documents. In: Lucene in Action, SecondEdition. Manning Publications Co. (2009)
> http://siren.sindice.com
22 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
TREC Entity Tools
Search infrastructure and set of tools developed on top ofSIREn
Provide out of the box functionalities to process, index andsearch Sindice-2011 dataset
One can develop and test new entity retrieval model
Two-step retrieval processTF-IDF ranking baseline provided, but possibilities toimplement its own rankingQuery expansion techniques that combines full-text andstructured queries
> https://github.com/rdelbru/trec-entity-tool
23 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Conclusion
Sindice-2011, a dataset to foster research on entity-orientedsearch
Accurate reflection of the current Web of DataA corpus of 1.7 Billion entities
A search infrastructure tool based on SIREn, the SemanticInformation Retrieval Search Engine, in order to easedevelopment and research
24 / 25
Introduction Web of Data Data Collection Statistics Tools Conclusion
Thank You!
Questions ?
25 / 25