A Survey of Entity Ranking over RDF Graphs
DESCRIPTION
An increasing amount of valuable semi-structured data has become available online. In this talk, we overview the state of the art in entity ranking over structured data ("linked data").

TRANSCRIPT
A Survey of Entity Ranking over RDF Graphs
Nikita Zhiltsov
Kazan Federal University, Russia
November 29, 2013
1 / 60
Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion
2 / 60
Motivation
- An increasing amount of valuable semi-structured data has become available online, e.g.:
  - RDF graphs: the Linking Open Data (LOD) cloud
  - Web pages enhanced with microformats, RDFa, etc.: CommonCrawl, Web Data Commons
  - Google: Freebase Annotations of the ClueWeb Corpora
- More than half of the queries in real query logs have an entity-centric user intent
- Examples from industry: Google Knowledge Graph, Facebook Graph Search, Yandex Islands
3 / 60
Overview of Semantic Search Approaches
T. Tran, P. Mika. Semantic Search: Systems, Concepts, Methods and Communities behind It
7 / 60
Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion
8 / 60
In this talk, we focus on entity ranking over RDF graphs given a keyword search query
9 / 60
Key Issues in Entity Ranking
- Ambiguity in names
- Related entities from heterogeneous data sources
- Complex queries with clarifying terms
10 / 60
Key Issues in Entity Ranking: Ambiguity in Names
Given the query "university of michigan":
- University of Michigan, Ann Arbor (the intended entity)
- Central Michigan University, Michigan Technological University, Michigan State University (spurious name matches)
11 / 60
Key Issues in Entity Ranking: Related Entities from Heterogeneous Data Sources
Given the query "harry potter movie":
Semantic link information can effectively enhance the term context
12 / 60
Key Issues in Entity Ranking: Complex Queries with Clarifying Terms
Given the query "shobana masala", the user intent is likely about Shobana Chandrakumar, an Indian actress starring in movies of the Masala genre
13 / 60
Ad-hoc Object Retrieval in the Web of Data
Jeffrey Pound, Peter Mika, Hugo Zaragoza
WWW 2010
14 / 60
Query Categories
- Entity query (~40%*), e.g. "1978 cj5 jeep"
- Type query† (~12%), e.g. "doctors in barcelona"
- Attribute query (~5%), e.g. "zip code atlanta"
- Other query (~36%)
  - however, ~14% of them contain a context entity or type
* estimated on real query logs from Yahoo!
† a.k.a. list search query
15 / 60
Repeatable and Reliable Search System Evaluation using Crowdsourcing
Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson,
Thanh D. Tran
SIGIR 2011
16 / 60
Data Collection
- Billion Triples Challenge 2009 RDF data set
- The size of the uncompressed data is 247 GB: 1.4B triples describing 114 million objects
- It was composed by combining the crawls of multiple RDF search engines
17 / 60
Query Set Preparation
1. Emulate top queries
  - Given the Microsoft Live Search log containing queries repeated by at least 10 different users
  - Sample 50 queries prefiltered with a NER and a gazetteer
2. Emulate long-tailed queries
  - Given the Yahoo! Search Query Log Tiny Sample v1.0 (4,500 queries)
  - Sample and manually filter out ambiguous queries ⇒ 42 queries
3. ⇒ a list of 92 queries
21 / 60
Crowdsourcing Judgements
- A purpose-built rendering tool to present the search results
- The evaluation (MT1) and its repetition (MT2) were conducted 6 months apart
- Using Amazon Mechanical Turk HITs
- Each HIT consists of 12 query-result pairs: 10 real ones and 2 from a "gold standard" annotated by experts
- 64 workers for MT1 and 69 workers for MT2
22 / 60
Analysis of Results: Repeatability
- The level of agreement is the same for the two pools
- The rank order of the systems is unchanged
24 / 60
Targeting Evaluation Measures I
All the measures are usually computed on the top-10 search results (k = 10); a Python sketch of P@k and AP follows below.
1. P@k (precision at k):

P@k(\pi, l) = \frac{\sum_{t \le k} I\{l_{\pi(t)} = 1\}}{k}

2. MAP (mean average precision):

AP(\pi, l) = \frac{\sum_{k=1}^{m} P@k \cdot I\{l_{\pi(k)} = 1\}}{m_1},

where m is the number of retrieved results and m_1 is the number of relevant results for the query

MAP = mean of AP over all queries
25 / 60
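The two measures above translate almost directly into code. Below is a minimal Python sketch, assuming the input is a ranked list of binary relevance labels (1 = relevant, 0 = not relevant) for one query; the function and variable names are illustrative, not from the paper.

def precision_at_k(labels, k=10):
    # P@k: fraction of relevant results among the top-k positions of the ranking
    return sum(1 for label in labels[:k] if label == 1) / k

def average_precision(labels, num_relevant):
    # AP: P@k averaged over the positions k that hold a relevant result,
    # normalized by m_1, the total number of relevant entities for the query
    if num_relevant == 0:
        return 0.0
    hits = [precision_at_k(labels, k)
            for k, label in enumerate(labels, start=1) if label == 1]
    return sum(hits) / num_relevant

# MAP is then the mean of average_precision over all queries in the benchmark.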
Targeting Evaluation Measures II
3. NDCG: normalized discounted cumulative gain (a Python sketch follows below)

DCG@k(\pi, l) = \sum_{j=1}^{k} G(l_{\pi(j)}) \cdot \eta(j),

where G(\cdot), the gain of a document rating, is usually G(z) = 2^z - 1, \eta(j) = \frac{1}{\log(j+1)}, and l_{\pi(j)} \in \{0, 1, 2\}

NDCG@k(\pi, l) = \frac{1}{Z_k} DCG@k(\pi, l),

where Z_k is the DCG@k of the ideal ranking
26 / 60
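A matching sketch for NDCG@k under the same assumptions, but with graded labels in {0, 1, 2}; the gain and discount functions are the ones given on the slide, and Z_k is computed from the ideal (label-sorted) ranking.

import math

def dcg_at_k(labels, k=10):
    # DCG@k with gain G(z) = 2^z - 1 and discount eta(j) = 1 / log(j + 1)
    return sum((2 ** label - 1) / math.log(j + 1)
               for j, label in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k=10):
    # Normalize by Z_k, the DCG@k of the ideal ordering of the same labels
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0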
Analysis of Results: Reliability

Metric   Difference
MAP      1.8%
NDCG     3.5%
P@10     12.8%

- In this setting, experts rate more results as negative than workers do
- P@10 is more fragile than MAP and NDCG
27 / 60
Yahoo! SemSearch Challenge (YSC) 2010 & 2011
http://semsearch.yahoo.com
28 / 60
Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion
29 / 60
Entity Search Track Submission by Yahoo! Research Barcelona
Roi Blanco, Peter Mika, Hugo Zaragoza
SSW at WWW 2010
30 / 60
YSC 2010 Winner Approach
- Only RDF S-P-O triples with literal objects are considered
- Triples are filtered by predicates from a predefined list of 300 predicates
- Triples about the same subject are grouped into a pseudo document with multiple fields
- The BM25F ranking formula is applied (the weighting scheme w_c is handcrafted; a Python sketch follows below):

BM25F = \sum_{t \in q \cap d} \frac{tf(t, d)}{k_1 + b \cdot tf(t, d)} \cdot idf(t),

tf(t, d) = \sum_{c \in d} w_c \cdot tf_c(t, d)
31 / 60
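A minimal Python sketch of the fielded scoring above; an entity's pseudo document is a dict of fields, each mapping terms to counts. The handcrafted field weights and the k1/b values are placeholders here, and the slide's simplified saturation formula is used as written (full BM25F also length-normalizes the per-field term frequencies).

import math

def bm25f_score(query_terms, doc_fields, field_weights, doc_freq, num_docs,
                k1=1.2, b=0.75):
    # doc_fields: field name -> {term: count}; field_weights: field name -> w_c
    score = 0.0
    for t in set(query_terms):
        # weighted term frequency pooled over the fields of the pseudo document
        tf = sum(field_weights.get(field, 0.0) * counts.get(t, 0)
                 for field, counts in doc_fields.items())
        if tf == 0 or doc_freq.get(t, 0) == 0:
            continue
        idf = math.log(num_docs / doc_freq[t])
        score += tf / (k1 + b * tf) * idf   # saturating tf, as on the slide
    return score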
Sindice BM25MF at SemSearch 2011
Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati, Diego Ceccarelli, Giovanni Tummarello
SSW at WWW 2011
32 / 60
YSC 2011 Winner Approach I
- URI resolution for triple objects
- Extended BM25F approach with additional normalization of term frequencies per predicate type:
  - The weighting scheme is handcrafted
- The proportion of query terms in entity literals
33 / 60
YSC 2011 Winner Approach III
Star-shaped query matching the entity:
35 / 60
On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data
Robert Neumayer, Krisztian Balog, Kjetil Nørvåg
ECIR 2012
37 / 60
Approach to entity representation I
RDF graph example:
38 / 60
Approach to entity representation II
a) Unstructured Entity Model; b) Structured Entity Model:
39 / 60
Main Findings
- Two generative language models (LMs) for the task:
  - Unstructured Entity Model
  - Structured Entity Model
- The evaluation on the YSC data shows that representing relations as a mixture of predicate-type LMs can contribute significantly to overall performance
40 / 60
LM Retrieval Framework
P(e|q) = \frac{P(q|e) P(e)}{P(q)} \stackrel{rank}{=} P(q|e) P(e),

where P(e|q) is the probability of entity e being relevant given query q

Further assumptions: (i) P(e) is uniform; (ii) query terms are i.i.d.

Let \theta_e be the entity model that predicts how likely the entity would produce a given term t; then the query likelihood is

P(q|\theta_e) = \prod_{t \in q} P(t|\theta_e)^{tf(t,q)}
41 / 60
Unstructured Entity Model
Idea: Collapse all text values of properties associated with the entity into a single document and apply standard IR techniques

The entity model is a Dirichlet-smoothed multinomial distribution (a Python sketch follows below):

P(t|\theta_e) = \frac{tf(t, e) + \mu P(t|\theta_c)}{|e| + \mu}
42 / 60
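A sketch of this query-likelihood scoring in Python, assuming simple term-count dictionaries for the entity's collapsed document and for the whole collection; mu = 2000 is only a common default, not the value from the paper.

import math

def dirichlet_term_prob(term, entity_tf, entity_len, coll_tf, coll_len, mu=2000.0):
    # P(t | theta_e): entity counts smoothed with the collection model P(t | theta_c)
    p_coll = coll_tf.get(term, 0) / coll_len
    return (entity_tf.get(term, 0) + mu * p_coll) / (entity_len + mu)

def query_log_likelihood(query_terms, entity_tf, entity_len, coll_tf, coll_len, mu=2000.0):
    # log P(q | theta_e) under the i.i.d. query-term assumption
    score = 0.0
    for t in query_terms:
        p = dirichlet_term_prob(t, entity_tf, entity_len, coll_tf, coll_len, mu)
        score += math.log(max(p, 1e-12))  # guard against terms unseen in the collection
    return score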
Structured Entity Model: Folding Predicates
Group RDF triples by the following predicate types pt:
- Name, e.g. literal values of foaf:name, rdfs:label
- Attributes, i.e. the remaining datatype properties
- OutRelations: resolving "object" (O) URIs of S-P-O triples and taking their names
- InRelations: resolving "subject" (S) URIs of S-P-O triples and taking their names
43 / 60
Structured Entity Model: Mixture of Language Models
Each group has its own LM P(t|\theta_e^{pt}):

P(t|\theta_e^{pt}) = \frac{tf(t, pt, e) + \mu_{pt} P(t|\theta_c^{pt})}{|pt, e| + \mu_{pt}}

Then, the entity model is a linear mixture of the predicate-type LMs (a Python sketch follows below):

P(t|\theta_e) = \sum_{pt} P(t|\theta_e^{pt}) P(pt)
44 / 60
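The mixture itself is a one-liner once each predicate-type model is available. A sketch, reusing a Dirichlet-smoothed per-field estimate like the one above; the mixture weights P(pt) shown in the comment are illustrative, not the paper's tuned values.

def mixture_term_prob(term, field_models, field_priors):
    # P(t | theta_e) = sum over pt of P(t | theta_e^pt) * P(pt)
    # field_models: pt -> callable returning P(t | theta_e^pt); field_priors: pt -> P(pt)
    return sum(field_priors[pt] * field_models[pt](term) for pt in field_models)

# Example mixture weights (illustrative only):
# field_priors = {"Name": 0.4, "Attributes": 0.3, "OutRelations": 0.2, "InRelations": 0.1}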
Comparative Evaluation
Model  MAP             P@10            NDCG
YSC 2010
UEM    0.207           0.314           0.383
SEM    0.282 (+36.2%)  0.400 (+27.4%)  0.494 (+29.0%)
YSC 2011
UEM    0.207           0.188           0.295
SEM    0.261 (+26.1%)  0.242 (+28.7%)  0.400 (+35.6%)

The multi-fielded document approach improves the targeted measures by 26-36%
45 / 60
Combining N-gram Retrieval with Weights Propagation on Massive RDF Graphs
He Hu, Xiaoyang Du
FSKD 2012
46 / 60
Approach I
- Considering 2- to 5-grams while indexing entity URIs as well as literals
- Thinking of URIs as hierarchical names
- Computing the entity-query similarity scores (a Python sketch follows below):

sim_{URI}(Q) = \frac{ngram\_hit\_count}{(||Q| - |URI.path|| + 1) \cdot (URI.depth + 1)}

sim_{LITERAL}(Q) = \frac{ngram\_hit\_count}{||Q| - |LITERAL.length|| + 1}
47 / 60
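A direct transcription of the two similarity scores into Python. The slide does not show whether |Q|, |URI.path| and URI.depth are measured in tokens or characters, so the arguments below are simply assumed to be the precomputed quantities.

def sim_uri(ngram_hit_count, query_len, uri_path_len, uri_depth):
    # n-gram hits normalized by the query/path length gap and the URI path depth
    return ngram_hit_count / ((abs(query_len - uri_path_len) + 1) * (uri_depth + 1))

def sim_literal(ngram_hit_count, query_len, literal_len):
    # n-gram hits normalized by the query/literal length gap only
    return ngram_hit_count / (abs(query_len - literal_len) + 1)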
Approach II
- Ranking score:

Score_{URI}(Q) = 1 - e^{-sim(Q)}

- Taking advantage of iterative PageRank-like weight propagation (a Python sketch follows below):

W_{URI\_hit}(i+1) = \alpha \cdot W_{URI\_hit}(i)

W_{URI\_unhit}(i+1) = \frac{(1-\alpha) \cdot W_{URI\_hit\_neighbors}(i)}{N_{URI\_hit\_neighbors}}

- Improvement of up to 80% w.r.t. the plain n-gram ranker
48 / 60
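A sketch of the score transformation and of one reading of the propagation rule: URIs hit by the query keep a fraction alpha of their weight, while unhit URIs receive the average weight of their hit neighbors scaled by (1 - alpha). The alpha value and the iteration count are placeholders, not the paper's settings.

import math

def ranking_score(sim):
    # Map a similarity value into a bounded (0, 1) ranking score
    return 1.0 - math.exp(-sim)

def propagate_weights(weights, neighbors, hit_uris, alpha=0.85, iterations=10):
    # weights: uri -> current weight; neighbors: uri -> iterable of adjacent uris
    for _ in range(iterations):
        updated = {}
        for uri in weights:
            if uri in hit_uris:
                updated[uri] = alpha * weights[uri]
            else:
                hit_nb = [n for n in neighbors.get(uri, []) if n in hit_uris]
                updated[uri] = ((1 - alpha) * sum(weights[n] for n in hit_nb) / len(hit_nb)
                                if hit_nb else 0.0)
        weights = updated
    return weights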
Combining Inverted Indices and Structured Search
for Ad-hoc Object Retrieval
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
SIGIR 2012
49 / 60
Structured Inverted Index
Consider the following property values as fields:
- URI: tokens from the entity URI, e.g. http://dbpedia.org/page/Barack_Obama ⇒ 'barack', 'obama', etc.
- Labels: values of a list of manually selected datatype properties
- Attributes: other properties
BM25F is used as the ranking function (a sketch of the field construction follows below)
51 / 60
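A minimal sketch of how an entity's triples could be folded into these three fields before BM25F indexing; the set of label predicates is an illustrative subset, not the paper's manually selected list.

import re

LABEL_PREDICATES = {            # illustrative subset of "label-like" predicates
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://xmlns.com/foaf/0.1/name",
}

def fielded_document(entity_uri, triples):
    # triples: iterable of (subject, predicate, literal_object) for this entity
    local_name = entity_uri.rsplit("/", 1)[-1]          # e.g. "Barack_Obama"
    doc = {"URI": [t.lower() for t in re.findall(r"[A-Za-z0-9]+", local_name)],
           "Labels": [], "Attributes": []}
    for _, predicate, obj in triples:
        field = "Labels" if predicate in LABEL_PREDICATES else "Attributes"
        doc[field].extend(obj.lower().split())
    return doc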
Graph-based Entity Search
1. Given a query q, obtain a list of entities Retr = {e_1, e_2, ..., e_n} ranked by the BM25F scores
2. Use the top-N elements as seeds for graph traversal
3. To get StructRetr = {e'_1, ..., e'_m}, exploit promising LOD properties‡ as well as Jaro-Winkler string similarity scores JW(q, e') > \tau
4. Combine the two rankings (a Python sketch follows below):

finalScore(q, e') = \lambda \times BM25(q, e) + (1 - \lambda) \times JW(q, e')

‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect
52 / 60
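A sketch of steps 2-4 in Python. graph_neighbors and jaro_winkler are passed in as callables rather than tied to any particular library, and the tau, top_n and lambda values are placeholders, not the tuned settings from the paper.

def graph_expand(query, retrieved, graph_neighbors, jaro_winkler,
                 top_n=10, tau=0.8, lam=0.5):
    # retrieved: list of (entity, bm25f_score) pairs already ranked by BM25F
    expanded = {}
    for seed, bm25f_score in retrieved[:top_n]:
        for neighbor in graph_neighbors(seed):   # owl:sameAs, dbpedia:redirect, ...
            jw = jaro_winkler(query, neighbor)
            if jw > tau:                         # keep only close string matches
                score = lam * bm25f_score + (1 - lam) * jw
                expanded[neighbor] = max(expanded.get(neighbor, 0.0), score)
    return sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)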
Evaluation
- The graph-based approach (S1_1) outperforms BM25 scoring with a 25% improvement in MAP on the 2010 data set
- No significant improvement over the baseline on the 2011 data set; this may be explained by the scarcity of the exploited predicates (owl:sameAs volume < 0.7%)
53 / 60
Improving Entity Search over Linked Data by Modeling Latent Semantics
Nikita Zhiltsov, Eugene Agichtein
CIKM 2013
54 / 60
Key Contributions
- A tensor-factorization-based approach to incorporating semantic link information into the ranking model
- Outperforms the state-of-the-art baseline in NDCG/MAP/P@10
- A thorough evaluation of the proposed techniques by acquiring thousands of manual labels to augment the YSC benchmark data set
⇒ more details in the next talk
55 / 60
Negative Results
Ideas from standard IR that do not work out:
- WordNet-based query expansion [Tonon et al., SIGIR 2012]
- Pseudo-relevance feedback [Tonon et al., SIGIR 2012]
- Query suggestions from a commercial search engine [Tonon et al., SIGIR 2012]
- Direct application of centrality measures, such as PageRank and HITS [Campinas et al., SSW WWW 2010; Dali et al., 2012]
57 / 60
Outline
1 Introduction
2 Task Statement and Evaluation Methodology
3 Approaches
4 Conclusion
58 / 60
Wrap up
- Entity search over RDF graphs, a.k.a. ad-hoc object retrieval, has emerged as a new task in IR
- There is a robust and consistent evaluation methodology for it
- State-of-the-art approaches revolve around applications of well-known IR methods
- There is a lack of approaches for leveraging semantic links
- Lots of data: scalability really matters
59 / 60