Bidirectional Expansion for Keyword Search on Graph Databases http://www.cse.iitb.ac.in/banks/ Varun Kacholia Shashank Pandit Soumen Chakrabarti S. Sudarshan Rushi Desai Hrishikesh Karambelkar


Post on 24-Dec-2015


Page 1: Bidirectional Expansion for Keyword Search on Graph Databases  Varun Kacholia Shashank Pandit Soumen Chakrabarti S. Sudarshan

Bidirectional Expansion for Keyword Search on Graph Databases

http://www.cse.iitb.ac.in/banks/

Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar

Page 2:

Keyword Search on Graph Representation of Data

- Keyword search on relational, XML, HTML, etc. data
  - BANKS, Discover, DBXplorer, XRank, etc.
- Need to find a (closely) connected set of nodes that together match all given keywords
- Focus of our work
  - Search algorithms to find connections between nodes

Page 3:

Outline

- Data, Query and Response Models
- Backward Search Algorithm
- Bidirectional Search Algorithm
- Experiments
- Related Work
- Conclusions

Page 4:

Graph Data Model

- Data modeled as a directed weighted graph: BANKS [ICDE’02]
  - Can model relational, XML, HTML, etc. data
- E.g., DBLP database
  - Node = tuple
  - Edge = foreign key reference

[Figure: DBLP example graph with author nodes (Sudarshan, Prasan Roy, Soumen), writes nodes, and paper nodes (Multi-Query Optimization; BANKS: Keyword search…)]
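The tuple-and-foreign-key mapping can be illustrated with a small sketch. The table contents, identifiers (`a1`, `p1`, `w1`), the uniform edge weight, and the choice to point edges from the referencing `writes` tuple to the referenced tuple are all assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

# Hypothetical DBLP-like tuples; identifiers and weights are illustrative only.
authors = {"a1": "Sudarshan", "a2": "Prasan Roy"}
papers = {"p1": "Multi-Query Optimization"}
writes = [("w1", "a1", "p1"), ("w2", "a2", "p1")]  # (id, author_fk, paper_fk)

def build_graph(authors, papers, writes, weight=1.0):
    """Return (labels, out_edges): one node per tuple, one edge per foreign key."""
    labels = {**authors, **papers}
    out_edges = defaultdict(list)
    for wid, author_id, paper_id in writes:
        labels[wid] = "writes"
        # Assumed direction: referencing tuple -> referenced tuple.
        out_edges[wid].append((author_id, weight))
        out_edges[wid].append((paper_id, weight))
    return labels, dict(out_edges)

labels, out_edges = build_graph(authors, papers, writes)
```

Each `writes` tuple becomes a node with two outgoing edges, so the two authors are connected only through the shared paper node.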

Page 5:

Graph Data Model (2)

- E.g., XML data

<proceedings>
  <paper id="1">
    <title>Databases</title>
  </paper>
  <paper id="2">
    <title>Keyword Search</title>
    <cite ref="1">Databases</cite>
  </paper>
</proceedings>

[Figure: graph of the XML above with nodes proceedings, paper (@id = 1), paper (@id = 2), title, title, and cite]
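One way to turn the XML example into a graph is sketched below using Python's standard `xml.etree.ElementTree`. The edge kinds `"contains"` and `"ref"`, the parent-to-child direction, and the function name `xml_to_graph` are assumptions for illustration, not something the slides define.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<proceedings>
  <paper id="1"><title>Databases</title></paper>
  <paper id="2">
    <title>Keyword Search</title>
    <cite ref="1">Databases</cite>
  </paper>
</proceedings>"""

def xml_to_graph(text):
    """Return (nodes, edges); nodes[i] is a tag name, edges are (src, dst, kind)."""
    nodes, edges, by_id, refs = [], [], {}, []

    def visit(elem, parent):
        idx = len(nodes)
        nodes.append(elem.tag)
        if parent is not None:
            edges.append((parent, idx, "contains"))  # assumed: parent -> child
        if "id" in elem.attrib:
            by_id[elem.attrib["id"]] = idx
        if "ref" in elem.attrib:
            refs.append((idx, elem.attrib["ref"]))
        for child in elem:
            visit(child, idx)

    visit(ET.fromstring(text), None)
    # Resolve id/ref links after the whole tree has been numbered.
    for src, target_id in refs:
        edges.append((src, by_id[target_id], "ref"))
    return nodes, edges

nodes, edges = xml_to_graph(XML_TEXT)
```

The `cite` element adds the one edge that is not part of the containment tree, which is why a graph rather than a tree is needed.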

Page 6:

Response Model

- Response: minimal, rooted tree connecting keyword nodes
  - Undirected: Discover, DBXplorer
  - Directed: BANKS
- E.g., Sudarshan Roy

[Figure: answer tree with paper node "Multi-Query Optimization", two writes nodes, and author nodes "Sudarshan" and "Prasan Roy"]

Page 7:

Response Ranking

- Edge Score = EA
  - Smaller tree => higher score
  - E.g., BANKS: EA = 1 / (Σ edge weights)
- Node Score = NA
  - Measure of authority of nodes in tree
  - E.g., BANKS: NA = Σ (leaf and root node authorities)
- Overall score = f(EA, NA)
  - E.g., BANKS: f(EA, NA) = EA · NA
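A minimal sketch of the ranking arithmetic follows. It assumes the symbols lost in extraction are sums (edge score as the reciprocal of the summed edge weights, node score as the summed authorities of the leaves and root) and that the two are combined by a plain product. The function names are hypothetical.

```python
def edge_score(edge_weights):
    """EA: reciprocal of the total edge weight; assumes the tree has at least one edge."""
    return 1.0 / sum(edge_weights)

def node_score(authorities):
    """NA: assumed to be the sum of leaf and root node authorities."""
    return sum(authorities)

def overall_score(edge_weights, authorities):
    """f(EA, NA), assumed here to be the product EA * NA."""
    return edge_score(edge_weights) * node_score(authorities)

# A tree with four unit-weight edges and three scored nodes (illustrative values).
score = overall_score([1, 1, 1, 1], [0.5, 0.25, 0.25])
```

Adding an edge lowers EA, so of two trees covering the same keyword nodes, the smaller one ranks higher.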

Page 8:

Finding Answer Trees

- Backward Expanding Search: BANKS [ICDE’02]
- Intuition: travel backwards from keyword nodes till you hit a common node

Query: sudarshan roy

[Figure: author nodes "Sudarshan" and "Prasan Roy", writes nodes, and paper node "MultiQuery Optimization"]

Page 9:

Backward Search: Algorithm

- Algorithm
  - Run concurrent single-source shortest-path iterators from each node matching a keyword
    - Traverse the graph edges in reverse direction
    - Output next nearest node on each get-next() call
  - Do best-first search across iterators
    - Output node if in the intersection of sets of nodes reached from each keyword
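The steps above can be sketched as one Dijkstra generator per keyword node, merged best-first through a heap. This is a simplified illustration, not the BANKS implementation: the names `reverse_dijkstra` and `backward_search` are hypothetical, and the example graph adds `paper -> writes` edges (an assumption) so that a common root exists.

```python
import heapq
from collections import defaultdict
from math import inf

def reverse_dijkstra(in_edges, source):
    """Yield (distance, node) in nondecreasing distance, following edges backwards."""
    dist, done, heap = {source: 0.0}, set(), [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        yield d, v
        for u, w in in_edges.get(v, ()):
            if d + w < dist.get(u, inf):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))

def backward_search(out_edges, keyword_nodes):
    """keyword_nodes: one set of matching nodes per keyword. Yields (root, distances)."""
    in_edges = defaultdict(list)
    for u, nbrs in out_edges.items():
        for v, w in nbrs:
            in_edges[v].append((u, w))
    frontier, counter = [], 0  # entries: (distance, tie-breaker, node, keyword, iterator)
    for i, nodes in enumerate(keyword_nodes):
        for s in sorted(nodes):
            it = reverse_dijkstra(in_edges, s)
            d, v = next(it)
            heapq.heappush(frontier, (d, counter, v, i, it))
            counter += 1
    reached = defaultdict(lambda: [inf] * len(keyword_nodes))
    emitted = set()
    while frontier:
        d, _, v, i, it = heapq.heappop(frontier)  # best-first across iterators
        reached[v][i] = min(reached[v][i], d)
        if v not in emitted and inf not in reached[v]:
            emitted.add(v)  # v has now been reached from every keyword
            yield v, list(reached[v])
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], counter, nxt[1], i, it))
            counter += 1

# Illustrative graph; the "mqo" -> writes edges are an added assumption.
example = {
    "w1": [("sudarshan", 1.0), ("mqo", 1.0)],
    "w2": [("roy", 1.0), ("mqo", 1.0)],
    "mqo": [("w1", 1.0), ("w2", 1.0)],
}
roots = list(backward_search(example, [{"sudarshan"}, {"roy"}]))
```

With many matching nodes per keyword this creates one iterator per node, which is the cost the later slides set out to reduce.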

Page 10:

Backward Search: Limitations

- Wasteful exploration of graph:
  - Frequently occurring keywords
  - “Hub” nodes in the graph (high in-degree)

[Figure: example graph for “Shashank Sudarshan Database” with nodes Database, Shashank, Sudarshan; schema legend: author, paper, writes]

Page 11:

Bidirectional Search: Motivation

Page 12:

Bidir Search: Intuition

- First cut solution:
  - Don’t go backward if keyword matches many nodes
  - Don’t go backward if node points to a hub
  - Instead explore forward from other keywords

Page 13:

Bidir Search: Example

[Figure: example graph for “Shashank Sudarshan Database” with nodes Database, Shashank, Sudarshan; schema legend: author, paper, writes]

Page 14:

Bidir Search: Issues

- What should threshold for not expanding be?
  - Our solution: prioritize expansion of nodes based on spreading activation, to penalize frequent keywords and bushy trees
- How to manage exploration in both directions?

Page 15:

Bidir Search: Spreading Activation

- Spreading Activation
  - Node with highest activation explored first
  - Every node given an initial activation
    - Gives low activation to frequently occurring keywords

[Figure: keyword “John” matching five nodes, each with initial activation 1/5]
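The “John” figure suggests that each matching node starts with activation 1/(number of nodes matching that keyword). The sketch below encodes that reading; summing the contributions when a node matches several keywords is an assumption, as is the function name.

```python
def initial_activation(keyword_nodes):
    """keyword_nodes: one collection of matching nodes per keyword."""
    activation = {}
    for nodes in keyword_nodes:
        for v in nodes:
            # A keyword matching many nodes gives each of them a small share.
            activation[v] = activation.get(v, 0.0) + 1.0 / len(nodes)
    return activation

johns = ["john1", "john2", "john3", "john4", "john5"]
activation = initial_activation([johns, ["rare_node"]])
```

The five “John” nodes each start at 1/5, while the single node matching the rare keyword starts at 1 and is therefore explored first.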

Page 16:

Bidir Search: Spreading Activation

- Spreading Activation
  - Node with highest activation explored first
  - Activation spread to neighbors (μ = 0.3)
    - Gives low activation to neighbors of hubs

[Figure: activation spreading from a node with activation 1/5. Labels shown: 0.3 x 1/5 at one node; 0.7 x 1/5 x 1/4 at each of four nodes; remaining nodes labeled 1/5 or 0; four labels of 1]
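The figure's numbers can be reproduced with a small sketch. It reads the labels as: the expanded node keeps μ × (its activation) and the remaining (1 − μ) share is split equally among its neighbors. Both the role of μ as the retained fraction and the equal split are assumptions inferred from the figure labels, not stated on the slide.

```python
def spread(activation, node, neighbors, mu=0.3):
    """Spread activation from `node` to `neighbors` in place and return the dict."""
    if not neighbors:
        return activation
    total = activation.get(node, 0.0)
    share = (1.0 - mu) * total / len(neighbors)  # e.g. 0.7 x 1/5 x 1/4
    activation[node] = mu * total                # e.g. 0.3 x 1/5
    for n in neighbors:
        activation[n] = activation.get(n, 0.0) + share
    return activation

act = spread({"v": 1 / 5}, "v", ["n1", "n2", "n3", "n4"])
```

Because the share is divided by the number of neighbors, a hub with many neighbors passes only a small activation to each of them.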

Page 17:

Bidir Search: Iterators

- How to manage exploration in both directions?
  - Single backward iterator + single forward iterator with suitable data structures
    - E.g., to keep track of parents of nodes
  - Details in full paper

[Figure: example graph with nodes 1 to 7 and keyword nodes “A” and “B”; each node is labeled [Dist from “A”, Dist from “B”], with labels including [0,∞], [∞,0], [1,∞], [∞,1], [2,∞] and [∞,∞], some of them shown being updated]

Page 18:

Bidir Search: Algorithm

- Algorithm
  - Activate matching nodes; insert into backward iterator
  - while (iterators are not empty)
    - Choose iterator for expansion in best-first manner
    - Explore node with highest activation
    - Spread activation to neighbors
    - Update path weights (and other data structures)
      - Propagate values to ancestors if necessary
    - Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
  - Stop when top-k results are produced
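The loop above can be sketched in simplified form as follows. This is an illustration under several assumptions rather than the authors' implementation (the slides defer details to the full paper): activation is a single number per node, μ is treated as the retained fraction, shares are split equally, the search runs until both queues are empty instead of stopping at top-k, and the example graph adds `paper -> writes` edges. All names are hypothetical.

```python
import heapq
from collections import defaultdict
from math import inf

def bidir_search(out_edges, keyword_nodes, mu=0.3):
    """Return {root: [distance to each keyword]} for nodes that reach all keywords."""
    k = len(keyword_nodes)
    in_edges = defaultdict(list)
    for u, nbrs in out_edges.items():
        for v, w in nbrs:
            in_edges[v].append((u, w))
    dist = defaultdict(lambda: [inf] * k)
    act = defaultdict(float)
    touched, done_b, done_f = set(), set(), set()
    backward, forward = [], []  # max-heaps on activation (stored negated)

    # Activate matching nodes; insert into backward iterator.
    for i, nodes in enumerate(keyword_nodes):
        for v in nodes:
            dist[v][i] = 0.0
            act[v] += 1.0 / len(nodes)
            touched.add(v)
    for v in touched:
        heapq.heappush(backward, (-act[v], v))

    def relax(u, v, w):
        """Update path weights of u via edge u -> v; propagate to known ancestors."""
        stack = [(u, v, w)]
        while stack:
            a, b, wt = stack.pop()
            touched.add(a)
            improved = False
            for i in range(k):
                if dist[b][i] + wt < dist[a][i]:
                    dist[a][i] = dist[b][i] + wt
                    improved = True
            if improved:
                stack.extend((p, a, pw) for p, pw in in_edges[a] if p in touched)

    def spread(v, nbrs, queue, done):
        if not nbrs:
            return
        share = (1.0 - mu) * act[v] / len(nbrs)
        act[v] *= mu
        for n in nbrs:
            act[n] += share
            if n not in done:
                heapq.heappush(queue, (-act[n], n))

    while backward or forward:
        # Choose the iterator whose best node has the highest activation.
        use_b = bool(backward) and (not forward or backward[0] <= forward[0])
        queue, done = (backward, done_b) if use_b else (forward, done_f)
        _, v = heapq.heappop(queue)
        if v in done:
            continue  # stale duplicate entry
        done.add(v)
        if use_b:
            for u, w in in_edges[v]:
                relax(u, v, w)
            spread(v, [u for u, _ in in_edges[v]], backward, done_b)
            if v not in done_f:
                heapq.heappush(forward, (-act[v], v))  # for future forward exploration
        else:
            for t, w in out_edges.get(v, ()):
                relax(v, t, w)
            spread(v, [t for t, _ in out_edges.get(v, ())], forward, done_f)
    return {v: d for v, d in dist.items() if inf not in d}

example = {
    "w1": [("sudarshan", 1.0), ("mqo", 1.0)],
    "w2": [("roy", 1.0), ("mqo", 1.0)],
    "mqo": [("w1", 1.0), ("w2", 1.0)],
}
answers = bidir_search(example, [{"sudarshan"}, {"roy"}])
```

Only two priority queues are kept regardless of how many nodes match the keywords, which is the main structural difference from the one-iterator-per-node backward search.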

Page 19:

Bidir Search: top-k results

- Results need not be generated “in-order”
- Naïve solution
  - Store results in an intermediate heap
  - Output top k results after mk total results have been generated (m ~ 10)
- Can do better
  - Compute upper bound on score of next result; output answers with a higher score
  - Similar to NRA algorithm (Fagin et al., PODS’01)
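The naïve buffering scheme can be sketched as below: collect up to m·k results in a heap as they are generated, then emit the k best. The function name and the `(score, answer)` result format are assumptions for illustration.

```python
import heapq

def naive_top_k(results, k, m=10):
    """results: iterable of (score, answer) in generation order (not score order)."""
    heap = []
    for n, (score, answer) in enumerate(results, start=1):
        # Negate the score so the best result sorts first; n breaks ties.
        heapq.heappush(heap, (-score, n, answer))
        if n >= m * k:
            break  # stop after m*k results have been buffered
    return [(-neg, answer) for neg, _, answer in heapq.nsmallest(k, heap)]

generated = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
best_two = naive_top_k(generated, k=2, m=2)
```

With a smaller m the buffer fills sooner, so answers appear earlier but a high-scoring result generated late may be missed.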

Page 20:

Experiments

- Datasets
  - DBLP, IMDB ~ 2 million nodes, 9 million edges
  - US Patent DB ~ 4 million nodes, 15 million edges
- Workload
  - Keywords randomly picked from results of SQL join statements
- Search algorithms
  - MI-Bkwd: original backward search
    - Iterator for every node matching a keyword
  - SI-Bkwd: backward search with single backward iterator
  - Bidirec: bidirectional search
- Time taken/nodes explored
  - Measured when 10th answer is generated (or last answer if #answers < 10)
- Origin size
  - #nodes matched by keywords in the query

Page 21:

Experiments (2)

- MI-Bkwd versus SI-Bkwd
  - SI-Bkwd gain increases with origin size, # keywords

Page 22:

Experiments (3)

- SI-Bkwd versus Bidirec
  - Bidirec gain increases with origin size, # keywords

Page 23:

Experiments (4)

- Precision/Recall experiments
  - Relevant answers are well-defined; can be generated through SQL statements
- Both MI-Backward and Bidirectional show similar performance
  - Recall ~ 100%
  - Precision ~ 100% at near full recall
    - Few irrelevant answers produced before generating all relevant answers
- Bidirectional runs faster, yet minimal loss of relevant results!

Page 24:

Experiments (5)

- Comparison with Sparse: Hristidis et al. [VLDB’03]
  - Generate join expressions leading to query results
  - Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
  - For top-k results: automatically determine required number of join expressions
- Sparse-LB
  - Manually generate required join expressions
  - Sparse needs to do at least this much (and usually a lot more!)
- Bidirectional versus Sparse-LB
  - Bidirectional outperforms by a factor of ~ 3 (esp. when #joins is large)

Page 25:

Experiments (6)

- SI-Bkwd versus Bidirec: by origin size
  - Bidirec gains more with unbalanced origin sizes

A = (T,S,S,S)

B = (M,M,M,M)

C = (M,L,L,L)

D = (M,M,L,L)

E = (T,L,L,L)

F = (T,S,M,L)

G = (T,M,L,L)

H = (T,T,T,L)

Page 26:

Discussion

- Bidirectional search as dynamic per-tuple join ordering
  - Related work in this area: Eddies
- Bidirectional search
  - Schema-less
  - Prioritization based on activation instead of selectivity
  - Generate answers in relevance order

Page 27:

Related Work

- Keyword querying on relational data: Discover (UCSD), DBXplorer (Microsoft)
  - Use SQL generation, without in-memory data structures
  - Issues: generate join plans, re-use common sub-expressions, etc.
- Keyword querying on XML
  - XRank (Cornell), Schema-Free XQuery (Michigan), …
  - Tree model is too limited
- ObjectRank

Page 28:

Conclusions

- Graph model
  - Convenient common denominator representation
- Schema-free querying leads to graph search
- Purely backward strategy inadequate
- Bidirectional search with spreading activation performs much better
  - Dynamically choose join order on per-tuple basis

Page 29:

Thank You!

Questions??

Page 30:

Future of Keyword Search in DBs

- Next generation of intelligent search will require context information
  - E.g. search email, files, calendar, ..
  - Information integration will be important
  - Graph structured data will be a key component
- Is there a killer app?
  - Deep web?
- Display of answers
  - Users don’t want to see schema details
  - Can we leverage off existing (Web) apps?

Page 31:

BANKS Future Work

- Applications of BANKS
  - Soumen Chakrabarti, Sunita Sarawagi and students
    - Exploit BANKS to integrate different sources of data
    - Extract information, infer soft links
  - BANKS for personal information management
    - SPIN: Search Personal Information Networks
- Ongoing/future work on BANKS:
  - More sysadmin/user control on ranking
    - One size does not fit all
    - BANKS provides infrastructure
  - Characterize bidirectional search better
    - And find other applications
  - Security

Page 32:

Bidir Search: top-k results (2)

- Compute upper bound on score of next result; output answers with a higher score
- Computing the bound
  - m_i = minimum path length explored backward from keyword i
  - unseen answer node: 1/(m_1 + m_2 + … + m_n)
  - visited answer node: suppose reached from first x keywords with distance d_i
    - 1/[(d_1 + d_2 + … + d_x) + (m_{x+1} + m_{x+2} + … + m_n)]
  - combine this with max node prestige
- We simply use 1/(m_1 + m_2 + … + m_n)!
  - Experiments show no significant loss in using this heuristic
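The two bounds can be written out directly. In this sketch `m` is the list of per-keyword minimum explored path lengths and `d` holds the known distances from the first x keywords; the function names, the `(score, answer)` buffer format, and returning infinity when nothing has been explored yet are assumptions. The node-prestige factor mentioned above is left out.

```python
from math import inf

def unseen_bound(m):
    """Upper bound on the edge score of an answer rooted at an unseen node."""
    total = sum(m)
    return inf if total == 0 else 1.0 / total

def visited_bound(d, m):
    """Bound for a node already reached from the first len(d) keywords."""
    x = len(d)
    total = sum(d) + sum(m[x:])
    return inf if total == 0 else 1.0 / total

def releasable(buffered, m):
    """Buffered (score, answer) pairs that beat the simple heuristic bound."""
    bound = unseen_bound(m)
    return sorted((r for r in buffered if r[0] > bound), reverse=True)

m = [2, 3]
ready = releasable([(0.25, "t1"), (0.1, "t2")], m)
```

As the backward search advances, each m_i grows, the bound shrinks, and more buffered answers can be output safely.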