semantic web search - searching documents and semantic data on the web

KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH)

1

Semantic Web Search Searching Documents and Semantic Data on the WebPresentation at Information Sciences Institute, USC

Semantic Search Group at the AIFB InstituteThanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner, Veli Bicer, Yongtao Ma and Rudi Studer.

http://sites.google.com/site/kimducthanh





Structure

2

• Motivation• Previous and current work• Keyword query processing• Keyword query result ranking• Conclusion


MOTIVATION

Besides documents, there is an increasing amount of structured data on the Web such as RDF, RDFa and Linked Data! How can we leverage this for enhancing the search experience?

3


RDFa

4

…<div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3>

Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:

<div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div></div>…

adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/

http://www.w3.org/TR/xhtml-rdfa-primer/


RDFa

5


content

content




Semantic Data

6

source: http://linkeddata.org/

http://linkeddata.org/


Linked Data

7




Addressing Complex Information Needs “Information about a friend of Alice, who shared an apartment with

her in Berlin and knows someone in the field of Semantic Search working at KIT”.

Alice


trouble with bob

creator

Bob

knowssunset.jpg

creator

titleBeautiful Sunset

knows Thanhworks

author

KIT loca

ted

Germany

Semantic Search

year

2009

Germany

author

Peter

worksFluidOps 34age

<friend of Alice>

<shared apartment in Berlin with Alice> <knows someone in the field of Semantic Search working at KIT>

9


Data Sources in SemanticSearch@AIFB Demo

English Wikipedia

Data from Linked Open Data DBpedia YAGO Many more

Live data from Data.gov (US Government) E.g. live data about earthquakes

10


Search Intent Interpretation, Refinement and Exploration

Vorlesung Knowledge Discovery - Institut AIFBFacets

Term Completions

Keywords

Query Completions

13


Result Inspection, Analysis and Browsing

14


OVERVIEW OF WORK

15


Search Concepts

Hybrid Search: Structured queries combined with keywords on structured and unstructured data in possibly remote (Linked Data) sources

Query interpretation: Translation of keywords to hybrid queries

Keyword search (translated hybrid query) combined with faceted search: starting with keywords and then iterative refinement process based on operations on facets

BACK-END

FRONT-END

16


Previous and Current Work

Semi-structured RDF data management [ISWC09] [TKDE12] Inverted index for RDF data management Structure index

Linked data management [ESWC10][ISWC10] [ESWC11][ISWC11] Keyword query routing to find relevant sources / relevant

combination of sources “Explorative” query processing and adaptive query optimization Combining local and remote Linked Data

Search frontends [ICDE09][CIKM11] [SIGIR11][ISWC2011] [Dexa11] Ontology and entity result summarization Faceted and keyword search

Current work: hybrid data search

Tran Thanh: Schema-agnostic Search17


KEYWORD QUERY PROCESSING [ICDE09]

18


DB-style Keyword SearchKeyword query processing / translation

), dD,Q,F,R(q ji

“Articles of researchers at Stanford with Turing Award” Turing Award“„Stanford Article

1) Query 1 1) Result 12) Query 2 2) Result 2

Set of Queries

Specification

Selection

19

Set of Results

Keywords might produce large number of matching elements in the data graph

The data graph might be large in size Search complexity increases substantially with

the size of the graph Large number of results


Query Space Schema graph Query space

Main Idea Query space: more compact representation of the data graph

Online construction of query space out of schema graph Match keywords against labels of resources to find keyword elements Connect keyword elements with elements of schema to obtain query space

Online top-k query graph exploration

20

Exploration on much reduced summary model called query space

Substantially decrease complexity Top-k procedure for graph exploration to compute

only top-k results


Top-k Query Graph Exploration on Query SpacePaths and their costs The resulting query graph

• Cost-directed exploration of Steiner graphs• Explore all possible distinct paths starting from keyword elements• At each exploration, take current path with lowest cost • When a connecting element is found, merge paths to construct the query

graph and add it to candidate list • Top-k terminates when highest cost of the candidate list (the cost of the k-

ranked query graph) is found to be lower than the lowest possible cost that can achieved with paths in the queues yet to be explored

• Result: best k query interpretations to be shown to the user

21


Evaluation – Performance• Comparison with bidirectional search [V. Kacholia et al.] and

search based on graph indexing (1000 BFS, 1000 METIS, 300 BFS, 300 METIS in [H. He et al.])

• Query computation + processing time until finding 10 answers • Outperforms bidirectional search by at least one order of

magn. • Performance comparable with indexing based approaches, but

requires less space

22

Query Performance on DBLP DataQ1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10

1

10

100

1000

10000

100000

Our SolutionBidirect1000 BFS1000 METIS300BFS300METIS


KEYWORD QUERY RESULT RANKING[CIKM11]

23


IR-based Ranking Schemes

TF*IDF based: Discover, EASE, SPARK [Liu et al, SIGMOD06]

24

nidfndl

ntfrvWeight ),(

))ln(1ln(1 tfntf

avdldlssndl /)1(

df

Nnidf

1ln

Qrv

QvWeightrvWeightrScore,

),(),()(

JRTr

rScoreJRTScore )()(

24


Proximity-based Ranking Schemes

EASE, XRANK, BLINKS, etc. EASE

Proximity between a pair of keywords

Overall score of a JRT is aggregation on the score of keyword pairs XRANK

Ranking of XML documents / elements Proximity here is defined based on w, the smallest text window in

n that contains all search keywords

25


Prestige-based Ranking Schemes

Based on graph structure, i.e. PageRank-like methods to determine node prestige XRank [Guo et al, SIGMOD03] ObjectRank [Balmin et al, VLDB04] : considers both

global ObjectRank and keyword-specific ObjectRank The probability that edges of different types will be

visited are not uniform: requires manual fine-tuning to set the importance of different types of edges

Naive: indegree

26


27

Introduction

Recent study shows that the effectiveness of most works are below the expectations (Coffman and Weaver, CIKM 2010)

Problems: Proximity does not directly model relevance Ad-hoc TF/IDF normalization does not capture the nature

of keyword search results well (small document length, skewed word occurrence statistics)

PageRank not directly applicable


28

Overview of the Approach

Keyword query is short an ambiguous, while data (and results) provide rich structure information that can be exploited!

Principled approach to relevance based on language models and PRF estimate model from content and structure of PRF results

Adopt relevance model as a fine-grained model representing both content and structure of relevant document and queries (relevance class)


29

Relevance Models [SIGIR 01]

Explicit notion of relevance Queries and documents are samples from a latent

representation space, i.e. the relevance model underlying the information need


P(w|Q) w

.077 palestinian

.055 israel

.034 jerusalem

.033 protest

.027 raid

.011 clash

.010 bank

.010 west

.010 troop

…

sample probabilities

UMM

k

iik MqPMwPMPqqwP

11 )|()|()()...,(

Palestinian

Israeli

raids

???

q1q2q3

w

M

M

M

Relevance Models

)...(

)...,()...|()|(

1

11

k

kk qqP

qqwPqqwPRwP

30


31

Ranking with Relevance Models

Probability ranking principle

See relevance model as query expansion Rank of document is based on the cross-entropy of its

model and the relevance model

Vw

DwPRwPDRH )|(log)|()||(

)|()1(||

),()|( CwP

D

DwnDwP DD

Dw NwP

RwP

NDP

RDP

)|(

)|(

)|(

)|(


32

Edge-Specific Relevance Models

Given a query Q={q1,…,qn}, a set of PRF resources are retrieved from an inverted keyword index: E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}

Based on PRF results, an edge specific relevance model is constructed for each unique edge e based on:


33

Edge Specific Resource Models

Edge-specific resource model:

Smoothing with model for the entire resource The score of a resource calculated based on cross-entropy

of edge-specific RM and edge-specific ResM:

Alpha allows to control the importance of edges


34

Ranking JRTs

Ranking aggregated JRTs: The cross entropy between the edge-specific RM (Query Model) and

geometric mean of combined edge-specific ResM:

The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms)


35

Experiments

Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases Queries: 50 queries for each dataset including “TREC style” queries and

“single resource” queries Metrics: Three metrics are used: (1) the number of top-1 relevant

results, (2) Reciprocal rank and (3) Mean Average Precision (MAP) Baselines: BANKS , Bidirectional (proximity) , Efficient , SPARK,

CoveredDensity (TF-IDF). RM-S: Our approach


36

Experiments – Single Resource Queries

- Proximity-based approaches perform well - Minimizing compactness results in single resources being ranked high - TF-IDF normalization not as aggressive, not as effective

Reciprocal rank for single resource queries


37

Experiments – TREC-style Queries

- TF-IDF based approaches performed better- Our approach outperformed existing approaches also in this category,

providing more stable performance over the entire precision-recall curve

Precision-recall for TREC-style queries on Wikipedia


38

Experiment – All Queries

- Our approach consistently shows superior performance- Encouraging, given that this is first study that use a general

framework for evaluating keyword search ranking

MAP scores for all queries


Conclusions / Future Work

Front-to-backend work on using structured data for enhancing the search experience

From backend data management to frontend search concepts

Current work / future directions Managing hybrid data Hybrid query processing / interfaces Ranking hybrid results

39


References (1)

Günter Ladwig, Thanh TranSIHJoin: Querying Remote and Local Linked DataIn 8th Extended Semantic Web Conference (ESWC'11). Heraklion, Greece, June, 2011 (full research paper, 23% acceptance rate).

Thanh Tran, Lei Zhang, Rudi StuderSummary Models for Routing Keywords to Linked Data SourcesIn Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November, 2010 (full research paper, 20% acceptance rate).

Günter Ladwig, Thanh TranLinked Data Query Processing StrategiesIn Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November, 2010 (full research paper, 20% acceptance rate).

Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer Ontology-based Interpretation of Keywords for Semantic Search In Proceedings of the 6th International Semantic Web Conference (ISWC'07), pp. 523-536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).

40

https://sites.google.com/site/kimducthanh/chaoban/RoutingToLinkedData.pdf?attredirects=0













http://www.aifb.uni-karlsruhe.de/WBS/dtr/papers/paper.pdf











References (2)

Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).

Haofen Wang, Duc Thanh Tran, Chang Liu CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support In Proceedings of the 17th Conference on Information and Knowledge Management (CIKM'08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).

Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu, Yue PanSemplore: A Scalable IR Approach to Search the Web of DataIn Journal of Web Semantics, 2009 (Impact Factor 3.4).

Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu Snippet Generation for Semantic Web Search Engines In Proceedings of the 3rd Asian Semantic Web Conference (ASWC'08). December 2008 (full research paper, 31% acceptance rate).

Thanh Tran, Günter LadwigStructure Index for RDFIn SemData@VLDB Workshop (SemData'10). Singapore, September, 2010.

41

http://www.aifb.uni-karlsruhe.de/WBS/dtr/papers/keywordtopk.pdf













http://www.aifb.uni-karlsruhe.de/WBS/dtr/papers/ce2poster.pdf









http://www.aifb.uni-karlsruhe.de/WBS/dtr/papers/semplore.pdf












http://www.aifb.uni-karlsruhe.de/WBS/dtr/papers/snippet.pdf








https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxraW1kdWN0aGFuaHxneDpmODk1ZDAxNzJmNmQ0NjQ





Thanks!

Tran Duc [email protected]

http://sites.google.com/site/kimducthanh/

42


Backups

43


Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5-16.

Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and opportunities. In VLDB, page 1368.

Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with relevance oriented ranking. In ICDE, pages 517-528.

Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching and Browsing in Databases using BANKS. In ICDE, pages 431-440.

Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using Relevance Models. In CIKM.

Bizer, G., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S. (2009): DBpedia - A crystallization point for the Web of Data. J. Web Sem. (WS) 7(3):154-165

Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data graphs. PVLDB, 1(1):1189-1204.

Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost connected trees in databases. In ICDE, pages 836-845.

Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data graphs. In SIGMOD, pages 927-940.

Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search over XML documents. In SIGMOD.

He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pages 305-316.

Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in databases. ACM Trans. Database Syst., 33(1):1-40


Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In VLDB.

Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.

Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173-182.

Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In CIKM.

Lavrenko, V. Croft, W.B. (2001): Relevance-Based Language Models. In SIGIR, pages 120-127. Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search

method for unstructured, semi-structured and structured data. In SIGMOD. Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational

databases. In SIGMOD, pages 563-574. Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational

databases. In SIGMOD, pages 115-126. Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS. In SIGMOD,

pages 681-694. Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across

heterogeneous relational databases. In ICDE, pages 346-355. Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the Search

Process. In Journal of Web Semantics, 2011. Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph Candidates

for Efficient Keyword Search on RDF. In ICDE. Vagelis Hristidis, L. G. and Papakonstantinou, Y. (2003). Efficient ir-style keyword search over

relational databases. In VLDB.

semantic web search - searching documents and semantic data on the web

Education

hybrid data search kit

live data

kit kit

unstructured data

search concepts hybrid

hybrid query

faceted search

data graph model query