Bidirectional Expansion for Keyword Search on Graph Databases

http://www.cse.iitb.ac.in/banks/

Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar

Keyword Search on Graph Representation of Data

Keyword search on relational, XML, HTML, etc. data
  BANKS, Discover, DBXplorer, XRank, etc.
Need to find a (closely) connected set of nodes that together match all given keywords
Focus of our work
  Search algorithms to find connections between nodes

Outline

Data, Query and Response Models
Backward Search Algorithm
Bidirectional Search Algorithm
Experiments
Related Work
Conclusions

Graph Data Model

Data modeled as a directed weighted graph: BANKS [ICDE’02]
Can model relational, XML, HTML, etc. data
E.g., DBLP database
  Node = tuple
  Edge = foreign key reference

[Figure: example DBLP graph with author nodes (Sudarshan, Prasan Roy, Soumen), writes nodes, and paper nodes (Multi-Query Optimization; BANKS: Keyword search…)]
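The tuple-and-foreign-key model can be sketched as a pair of adjacency maps. This is a minimal illustration, not the BANKS data structure: the `Graph` class, the node identifiers (`a1`, `w1`, `p1`, ...), the unit edge weights, and the whole-word keyword match are all assumptions made for the example.

```python
from collections import defaultdict

class Graph:
    """Directed weighted graph with forward and reverse adjacency maps."""

    def __init__(self):
        self.out_edges = defaultdict(dict)  # u -> {v: weight}
        self.in_edges = defaultdict(dict)   # v -> {u: weight}
        self.text = {}                      # node -> searchable text

    def add_node(self, node, text=""):
        self.text[node] = text

    def add_edge(self, u, v, weight=1.0):
        self.out_edges[u][v] = weight
        self.in_edges[v][u] = weight

    def matching_nodes(self, keyword):
        """Nodes whose text contains the keyword as a whole word."""
        kw = keyword.lower()
        return {n for n, t in self.text.items() if kw in t.lower().split()}

g = Graph()
g.add_node("a1", "Sudarshan")
g.add_node("a2", "Prasan Roy")
g.add_node("p1", "Multi-Query Optimization")
g.add_node("w1")
g.add_node("w2")
# Each writes tuple references one author and one paper (foreign keys).
g.add_edge("w1", "a1")
g.add_edge("w1", "p1")
g.add_edge("w2", "a2")
g.add_edge("w2", "p1")
```

Keeping the reverse map alongside the forward one makes it cheap to walk edges in either direction, which the search algorithms on the later slides rely on.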

Graph Data Model (2)

E.g., XML data

  <proceedings>
    <paper id="1">
      <title>Databases</title>
    </paper>
    <paper id="2">
      <title>Keyword Search</title>
      <cite ref="1">Databases</cite>
    </paper>
  </proceedings>

[Figure: the corresponding graph, with nodes proceedings, paper (@id = 1), paper (@id = 2), title, title, and cite]
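One way to turn the XML example into graph edges is sketched below, using Python's standard `xml.etree.ElementTree`. The function name `xml_to_edges`, the integer node ids, and the edge kinds `"contains"` and `"ref"` are illustrative assumptions; the slides do not say how BANKS labels or weights these edges.

```python
import xml.etree.ElementTree as ET

XML = """<proceedings>
  <paper id="1"><title>Databases</title></paper>
  <paper id="2"><title>Keyword Search</title><cite ref="1">Databases</cite></paper>
</proceedings>"""

def xml_to_edges(xml_text):
    """Return (labels, edges): one node per element, numbered in document order."""
    labels, edges = {}, []
    by_xml_id, refs = {}, []

    def visit(elem, parent):
        node = len(labels)
        labels[node] = elem.tag
        if "id" in elem.attrib:
            by_xml_id[elem.attrib["id"]] = node
        if "ref" in elem.attrib:
            refs.append((node, elem.attrib["ref"]))
        if parent is not None:
            edges.append((parent, node, "contains"))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), None)
    # Resolve id/ref pairs once every element has been numbered.
    for node, target in refs:
        edges.append((node, by_xml_id[target], "ref"))
    return labels, edges

labels, edges = xml_to_edges(XML)
```

The `cite` element ends up with an edge to the paper it references, which is what makes the result a graph rather than a tree.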

Response Model

Response: Minimal, rooted tree connecting keyword nodes
  Undirected: Discover, DBXplorer
  Directed: BANKS
E.g., query “Sudarshan Roy”

[Figure: answer tree connecting the paper Multi-Query Optimization through two writes nodes to the authors Sudarshan and Prasan Roy]

Response Ranking

Edge Score = EA
  Smaller tree => higher score
  E.g., BANKS: $\mathrm{EA} = 1 / \sum (\text{edge weights})$
Node Score = NA
  Measure of authority of nodes in tree
  E.g., BANKS: $\mathrm{NA} = \sum (\text{leaf and root node authorities})$
Overall score = f(EA, NA)
  E.g., BANKS: $f(\mathrm{EA}, \mathrm{NA}) = \mathrm{EA} \cdot \mathrm{NA}$
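A worked instance of the ranking may help. The sketch below assumes the BANKS examples on this slide read as EA = 1 / (sum of edge weights), NA = sum of leaf and root node authorities, and f = EA · NA; the function name `tree_score` and the sample numbers are made up for illustration.

```python
def tree_score(edge_weights, leaf_and_root_authorities):
    """Score of an answer tree under the assumed BANKS-style formulas."""
    total_weight = sum(edge_weights)
    if total_weight <= 0:
        raise ValueError("this sketch assumes a tree with at least one positive-weight edge")
    ea = 1.0 / total_weight               # smaller tree => higher edge score
    na = sum(leaf_and_root_authorities)   # authority of root and leaves
    return ea * na

# Four unit-weight edges, root authority 0.5, two leaves with authority 0.25 each.
score = tree_score([1, 1, 1, 1], [0.5, 0.25, 0.25])
```

Here EA = 1/4 and NA = 1.0, so the overall score is 0.25; removing an edge from the tree would raise the score.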

Finding Answer Trees

Backward Expanding Search: BANKS [ICDE’02]
Intuition: travel backwards from keyword nodes till you hit a common node
Query: sudarshan roy

[Figure: backward expansion from the author nodes Sudarshan and Prasan Roy, through writes nodes, to the common paper node Multi-Query Optimization]

Backward Search: Algorithm

Algorithm
  Run concurrent single-source shortest-path iterators from each node matching a keyword
    Traverse the graph edges in reverse direction
    Output next nearest node on each get-next() call
  Do best-first search across iterators
    Output node if in the intersection of sets of nodes reached from each keyword
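The steps above can be sketched with one reverse Dijkstra generator per keyword node and a heap that always advances the iterator with the nearest next node. This is a simplified reading of the slide, not the BANKS implementation: the names `dijkstra_reverse` and `backward_search`, the `in_edges` map (`v -> {u: weight}` for each edge u → v), and the choice to report only roots and distances rather than full trees are assumptions.

```python
import heapq
from itertools import count

def dijkstra_reverse(in_edges, source):
    """Yield (distance, node) in nondecreasing distance, following edges backwards."""
    dist, done, heap = {source: 0.0}, set(), [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        yield d, v
        for u, w in in_edges.get(v, {}).items():
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))

def backward_search(in_edges, keyword_nodes):
    """keyword_nodes: one set of matching nodes per keyword.
    Yields (root, [distance to each keyword]) as roots are discovered."""
    n, tie, frontier = len(keyword_nodes), count(), []
    for i, nodes in enumerate(keyword_nodes):
        for s in nodes:
            it = dijkstra_reverse(in_edges, s)
            d, v = next(it)
            heapq.heappush(frontier, (d, next(tie), i, v, it))
    reached, emitted = {}, set()   # reached: node -> {keyword index: distance}
    while frontier:
        # Best-first across iterators: take the globally nearest next node.
        d, _, i, v, it = heapq.heappop(frontier)
        by_kw = reached.setdefault(v, {})
        by_kw[i] = min(d, by_kw.get(i, float("inf")))
        if len(by_kw) == n and v not in emitted:
            emitted.add(v)
            yield v, [by_kw[j] for j in range(n)]
        nxt = next(it, None)       # the iterator's get-next() call
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], next(tie), i, nxt[1], it))

# Edges p1 -> w1 -> a1 and p1 -> w2 -> a2, stored as reverse adjacency.
in_edges = {"a1": {"w1": 1.0}, "a2": {"w2": 1.0},
            "w1": {"p1": 1.0}, "w2": {"p1": 1.0}}
answers = list(backward_search(in_edges, [{"a1"}, {"a2"}]))
```

Because every matching node gets its own iterator, a keyword that matches many nodes creates many iterators, which is the cost the following slides set out to reduce.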

Backward Search: Limitations

Wasteful exploration of graph:
  Frequently occurring keywords
  “Hub” nodes in the graph (high in-degree)

[Figure: example graph for the query “Shashank Sudarshan Database”, with nodes labeled Shashank, Sudarshan, and Database and a schema legend (author, writes, paper)]

Bidirectional Search: Motivation

Bidir Search: Intuition

First cut solution:
  Don’t go backward if keyword matches many nodes
  Don’t go backward if node points to a hub
  Instead explore forward from other keywords

Bidir Search: Example

[Figure: example graph for the query “Shashank Sudarshan Database”, with nodes labeled Database, Shashank, and Sudarshan and a schema legend (author, writes, paper)]

Bidir Search: Issues

What should the threshold for not expanding be?
  Our solution: prioritize expansion of nodes based on spreading activation to penalize frequent keywords and bushy trees
How to manage exploration in both directions?

Bidir Search: Spreading Activation

Spreading Activation
  Node with highest activation explored first
  Every node given an initial activation
    Gives low activation to frequently occurring keywords

[Figure: the keyword “John” matches five nodes; each is labeled with activation 1/5]

Bidir Search: Spreading Activation

Spreading Activation
  Node with highest activation explored first
  Activation spread to neighbors (μ = 0.3)
    Gives low activation to neighbors of hubs

[Figure: a node with activation 1/5 and four neighbors with activation 0, connected by edges labeled 1. After spreading, the node is labeled 0.3 x 1/5 and each neighbor 0.7 x 1/5 x 1/4.]
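The two spreading-activation slides can be reproduced numerically. The sketch below reads the figure labels as follows: each of the nodes matching a keyword starts with 1 / (number of matching nodes), and an expanded node keeps a fraction μ while the remaining 1 − μ is divided equally among its neighbors. That reading, the equal split, the additive combination of activations, and the names `initial_activation` and `spread` are assumptions; the full paper may define these differently.

```python
def initial_activation(keyword_nodes):
    """keyword_nodes: one set of matching nodes per keyword.
    A keyword matching many nodes gives each of them little activation."""
    act = {}
    for nodes in keyword_nodes:
        for node in nodes:
            act[node] = act.get(node, 0.0) + 1.0 / len(nodes)
    return act

def spread(act, node, neighbors, mu=0.3):
    """Node keeps mu of its activation; the rest is split among its neighbors.
    A hub with many neighbors therefore passes little to each one."""
    if not neighbors:
        return
    current = act.get(node, 0.0)
    share = (1.0 - mu) * current / len(neighbors)
    for nb in neighbors:
        act[nb] = act.get(nb, 0.0) + share
    act[node] = mu * current

# "John" matches five nodes; j1 then spreads to its four neighbors.
act = initial_activation([{"j1", "j2", "j3", "j4", "j5"}])
spread(act, "j1", ["n1", "n2", "n3", "n4"])
```

After the call, `act["j1"]` is 0.3 x 1/5 and each neighbor holds 0.7 x 1/5 x 1/4, matching the labels in the figure.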

How to manage exploration in both directions?

Single backward iterator + single forward iterator w/ suitable data structures
  E.g., to keep track of parents of nodes

Details in full paper

Bidir Search: Iterators

[Figure: example graph with nodes numbered 1 to 7 and keywords “A” and “B”; each node is labeled [Dist from “A”, Dist from “B”], with ∞ for distances not yet known]

Bidir Search: Algorithm

Algorithm
  Activate matching nodes; insert into backward iterator
  while (iterators are not empty)
    Choose iterator for expansion in best-first manner
    Explore node with highest activation
    Spread activation to neighbors
    Update path weights (and other data structures)
      Propagate values to ancestors if necessary
    Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
  Stop when top-k results are produced

Bidir Search: top-k results

Results need not be generated “in-order”
Naïve solution
  Store results in an intermediate heap
  Output top k results after mk total results have been generated (m ~ 10)
Can do better
  Compute upper bound on score of next result; output answers with a higher score
  Similar to NRA algorithm (Fagin et al., PODS’01)
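The naïve solution amounts to buffering results in a heap and cutting off after m·k of them. A minimal sketch, assuming results arrive as `(score, answer)` pairs; the function name `naive_top_k` and the sample stream are illustrative.

```python
import heapq

def naive_top_k(result_stream, k, m=10):
    """Buffer results in a heap; return the k best once m*k have been generated
    (or the stream ends)."""
    heap = []
    for seen, (score, answer) in enumerate(result_stream, start=1):
        # Negate the score for a max-heap; `seen` breaks ties without comparing answers.
        heapq.heappush(heap, (-score, seen, answer))
        if seen >= m * k:
            break
    return [(-neg, answer) for neg, _, answer in heapq.nsmallest(k, heap)]

stream = [(0.2, "t1"), (0.9, "t2"), (0.5, "t3"), (0.7, "t4")]
top2 = naive_top_k(stream, k=2, m=2)
```

With `k=2, m=2` all four results are buffered and the two best are returned in score order. With a smaller `m` the cutoff comes earlier, so a good late result can be missed, which is why the slide goes on to an upper-bound approach.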

Experiments

Datasets
  DBLP, IMDB ~ 2 million nodes, 9 million edges
  US Patent DB ~ 4 million nodes, 15 million edges
Workload
  Keywords randomly picked from results of SQL join statements
Search algorithms
  MI-Bkwd: original backward search
    Iterator for every node matching a keyword
  SI-Bkwd: backward search with single backward iterator
  Bidirec: bidirectional search
Time taken/nodes explored
  Measured when 10th answer is generated (or last answer if #answers < 10)
Origin size
  #nodes matched by keywords in the query

Experiments (2)

MI-Bkwd versus SI-Bkwd
  SI-Bkwd gain increases with origin size, # keywords

Experiments (3)

SI-Bkwd versus Bidirec
  Bidirec gain increases with origin size, # keywords

Experiments (4)

Precision/Recall experiments
  Relevant answers are well-defined; can be generated through SQL statements
  Both MI-Backward and Bidirectional show similar performance
    Recall ~ 100%
    Precision ~ 100% at near full recall
    Few irrelevant answers produced before generating all relevant answers
  Bidirectional runs faster, yet minimal loss of relevant results!

Experiments (5)

Comparison with Sparse: Hristidis et al. [VLDB’03]
  Generate join expressions leading to query results
  Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
  For top-k results: automatically determine required number of join expressions
Sparse-LB
  Manually generate required join expressions
  Sparse needs to do at least this much (and usually a lot more!)
Bidirectional versus Sparse-LB
  Bidirectional outperforms by a factor of ~ 3 (esp. when #joins is large)

Experiments (6)

SI-Bkwd versus Bidirec: by origin size
  Bidirec gains more with unbalanced origin sizes

A = (T,S,S,S)

B = (M,M,M,M)

C = (M,L,L,L)

D = (M,M,L,L)

E = (T,L,L,L)

F = (T,S,M,L)

G = (T,M,L,L)

H = (T,T,T,L)

Discussion

Bidirectional search as dynamic per-tuple join ordering
  Related work in this area: Eddies
Bidirectional search
  Schema-less
  Prioritization based on activation instead of selectivity
  Generate answers in relevance order

Related Work

Keyword querying on relational data:
  Discover (UCSD), DBXplorer (Microsoft)
    Use SQL generation, without in-memory data structures
    Issues: generate join plans, re-use common sub-expressions, etc.
Keyword querying on XML
  XRank (Cornell), Schema-Free XQuery (Michigan), …
    Tree model is too limited
ObjectRank

Conclusions

Graph model
  Convenient common denominator representation
Schema-free querying leads to graph search
Purely backward strategy inadequate
Bidirectional search with spreading activation performs much better
  Dynamically choose join order on per-tuple basis

Thank You!

Questions??

Future of Keyword Search in DBs

Next generation of intelligent search will require context information
  E.g. search email, files, calendar, ...
Information integration will be important
Graph structured data will be a key component
Is there a killer app?
  Deep web?
Display of answers
  Users don’t want to see schema details
Can we leverage off existing (Web) apps?

BANKS Future Work

Applications of BANKS
  Soumen Chakrabarti, Sunita Sarawagi and students
    Exploit BANKS to integrate different sources of data
    Extract information, infer soft links
  BANKS for personal information management
    SPIN: Search Personal Information Networks
Ongoing/future work on BANKS:
  More sysadmin/user control on ranking
    One size does not fit all
    BANKS provides infrastructure
  Characterize bidirectional search better
    And find other applications
  Security

Bidir Search: top-k results (2)

Compute upper bound on score of next result; output answers with a higher score
Computing the bound
  $m_i$ = minimum path length explored backward from keyword $i$
  unseen answer node: $1/(m_1 + m_2 + \dots + m_n)$
  visited answer node: suppose reached from first $x$ keywords with distance $d_i$
    $1/[(d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n)]$
  combine this with max node prestige
We simply use $1/(m_1 + m_2 + \dots + m_n)$!
  Experiments show no significant loss in using this heuristic
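The bound and the release rule can be written out directly. This sketch covers only the edge-score part of the bound; the slide's step of combining it with the maximum node prestige is left out. The function names and the `(score, answer)` buffer format are assumptions.

```python
def bound_unseen(m):
    """1 / (m_1 + ... + m_n), where m[i] is the minimum path length explored
    backward from keyword i."""
    total = sum(m)
    return float("inf") if total == 0 else 1.0 / total

def bound_visited(d_known, m_rest):
    """Node already reached from the first x keywords at distances d_known;
    the remaining keywords contribute at least m_rest."""
    total = sum(d_known) + sum(m_rest)
    return float("inf") if total == 0 else 1.0 / total

def releasable(buffered, m):
    """Buffered (score, answer) pairs that beat the heuristic bound, best first."""
    bound = bound_unseen(m)
    return sorted((sa for sa in buffered if sa[0] > bound), key=lambda sa: -sa[0])

ready = releasable([(0.5, "t1"), (0.2, "t2")], m=[1, 2])
```

With m = [1, 2] the bound is 1/3, so the answer scored 0.5 can be output while the one scored 0.2 stays buffered until the frontier moves further out and the bound drops.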
