bidirectional expansion for keyword search on graph databases varun kacholia shashank pandit soumen...
TRANSCRIPT
Bidirectional Expansion for Keyword Search
on Graph Databases
http://www.cse.iitb.ac.in/banks/
Varun Kacholia Shashank PanditSoumen Chakrabarti S. SudarshanRushi Desai Hrishikesh Karambelkar
Keyword Search on Graph Representation of Data
Keyword search on relational, XML, HTML, etc. data BANKS, Discover, DBXplorer, XRank,
etc. Need to find a (closely) connected
set of nodes that together match all given keywords
Focus of our work Search algorithms to find connections
between nodes
Outline
Data, Query and Response Models Backward Search Algorithm Bidirectional Search Algorithm Experiments Related Work Conclusions
Graph Data Model Data modeled as a directed weighted
graph: BANKS [ICDE’02] Can model relational, XML, HTML, etc. data
E.g., DBLP database Node = tuple Edge = foreign key reference
Multi-Query Optimization
Sudarshan Prasan Roy
writes
author
paper
Soumen
BANKS: Keyword search…
Graph Data Model (2) E.g., XML data <proceedings> <paper id=“1”> <title>Databases</title> </paper> <paper id=“2”> <title>Keyword Search</title> <cite ref=“1”>Databases</cite> </paper> </proceedings>
titletitle
proceedings
paper (@id = 1)
paper (@id = 2)
cite
Response Model
Response: Minimal, rooted tree connecting keyword nodes Undirected: Discover, DBXplorer Directed: BANKS
Multi-Query Optimization
Sudarshan Prasan Roy
writes writes
author author
paper
E.g., Sudarshan Roy
Response Ranking Edge Score = EA
Smaller tree => higher score E.g., BANKS: EA = 1/ ( edge weights)
Node Score = NA
Measure of authority of nodes in tree E.g., BANKS: NA = (leaf and root node
authorities)
Overall score = f (EA, NA) E.g., BANKS: f (EA, NA)EA . NA
Finding Answer Trees Backward Expanding Search: BANKS [ICDE02]
Intuition: travel backwards from keyword nodes till you hit a common node
Sudarshan Prasan Royauthors
MultiQuery Optimizationpaper
Query: sudarshan roy
writes
Backward Search: Algorithm
Algorithm Run concurrent single source shortest
path iterators from each node matching a keyword
Traverse the graph edges in reverse direction Output next nearest node on each get-next()
call Do best-first search across iterators Output node if in the intersection of sets
of nodes reached from each keyword
Backward Search: Limitations
Wasteful exploration of graph: Frequently occurring keywords “Hub” nodes in the graph (high in-
degree)
…
Database
Shashank Sudarshan
… author
paper
writes
Schema Legend
“Shashank Sudarshan Database”
Bidirectional Search: Motivation
Bidir Search: Intuition
First cut solution: Don’t go backward if keyword matches
many nodes Don’t go backward if node points to a
hub Instead explore forward from other
keywords
Bidir Search: Example
…
…
author
paper
writes
Schema Legend
“Shashank Sudarshan Database”
Database
Shashank Sudarshan
…
Bidir Search: Issues
What should threshold for not expanding be? Our solution: prioritize expansion of
nodes based on spreading activation to penalize frequent keywords and bushy
trees
How to manage exploration in both directions?
Bidir Search: Spreading Activation
Spreading Activation Node with highest activation explored
first Every node given an initial activation
Gives low activation to frequently occurring keywords
“John”
1/5 1/5 1/5 1/5 1/5
Bidir Search: Spreading Activation
Spreading Activation Node with highest activation explored first
Activation spread to neighbors (μ = 0.3)
Gives low activation to neighbors of hubs
0.3 x 1/5
0.7 x 1/5 x 1/4
0.7 x 1/5 x 1/4
0.7 x 1/5 x 1/4
1/5
0
0
0
0 0.7 x 1/5 x 1/4
1
1
1
1
How to manage exploration in both directions?
Single backward iterator + single forward iterator w/ suitable datastructures E.g., to keep track of parents of nodes
Details in full paper
Bidir Search: Iterators
…
1[0,∞] [∞,0]
[1,∞]
[∞,∞]
[∞,1][∞,1]
“A” “B”
[2,3 ∞] 7
3
2
4 5
6 [∞,∞ 2][Dist from “A”, Dist from “B”]
[2,∞]
Bidir Search: Algorithm Algorithm
Activate matching nodes; insert into backward iterator
while (iterators are not empty) Choose iterator for expansion in best-first manner Explore node with highest activation Spread activation to neighbors Update path weights (and other datastructures)
Propagate values to ancestors if necessary Insert nodes explored in the backward direction into
the forward iterator /* for future forward exploration */
Stop when top-k results are produced
Bidir Search: top-k results
Results need not be generated “in-order”
Naïve solution Store results in an intermediate heap Output top k results after mk total results
have been generated (m ~ 10) Can do better
Compute upper bound on score of next result; output answers with a higher score
Similar to NRA algorithm (Fagin et al., PODS’01)
Experiments Datasets
DBLP, IMDB ~ 2 million nodes, 9 million edges US Patent DB ~ 4 million nodes, 15 million edges
Workload Keywords randomly picked from results of SQL join
statements Search algorithms
MI-Bkwd: original backward search Iterator for every node matching a keyword
SI-Bkwd: backward search with single backward iterator Bidirec: bidirectional search
Time taken/nodes explored Measured when 10th answer is generated (or last
answer if #answers < 10) Origin size
#nodes matched by keywords in the query
Experiments (2) MI-Bkwd versus SI-Bkwd
SI-Bkwd gain increases with origin size, # keywords
Experiments (3) SI-Bkwd versus Bidirec
Bidirec gain increases with origin size, # keywords
Experiments (4) Precision/Recall experiments
Relevant answers are well-defined; can be generated through SQL statements
Both MI-Backward and Bidirectional show similar performance
Recall ~ 100% Precision ~ 100% at near full recall Few irrelevant answers produced before
generating all relevant answers Bidirectional runs faster, yet minimal
loss of relevant results!
Experiments (5) Comparison with Sparse: Hristidis et al. [VLDB’03]
Generate join expressions leading to query results
Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
For top-k results: automatically determine required number of join expressions
Sparse-LB Manually generate required join expressions Sparse needs to do at least this much (and usually a
lot more!) Bidirectional versus Sparse-LB
Bidirectional outperforms by a factor of ~ 3 (esp. when #joins is large)
Experiments (6) SI-Bkwd versus Bidirec: by origin size
Bidirec gains more with unbalanced origin sizes
A = (T,S,S,S)
B = (M,M,M,M)
C = (M,L,L,L)
D = (M,M,L,L)
E = (T,L,L,L)
F = (T,S,M,L)
G = (T,M,L,L)
H = (T,T,T,L)
Discussion Bidirectional search as dynamic per-
tuple join ordering Related work in this area: Eddies
Bidirectional search Schema-less Prioritization based on activation instead of
selectivity Generate answers in relevance order
Related Work Keyword querying on relational data:
Discover (UCSD), DBExplorer (Microsoft) Use SQL generation, without in-memory
data structures Issues: generate join plans, re-use
common sub-expressions, etc. Keyword querying on XML
XRank (Cornell), Schema-Free XQuery (Michigan), …
Tree model is too limited ObjectRank
Conclusions Graph model
Convenient common denominator representation
Schema-free querying leads to graph search Purely backward strategy inadequate Bidirectional search with spreading
activation performs much better Dynamically choose join order on per-
tuple basis
Thank You!
Questions??
Future of Keyword Search in DBs
Next generation of intelligent search will require context information E.g. search email, files, calendar, .. Information integration will be important Graph structured data will be a key
component Is there a killer app?
Deep web? Display of answers
Users don’t want to see schema details Can we leverage off existing (Web) apps?
BANKS Future Work Applications of BANKS
Soumen Chakrabarti, Sunita Sarawagi and students Exploit BANKS to integrate different sources of
data Extract information, Infer soft links
BANKS for personal information management SPIN: Search Personal Information Networks
Ongoing/future work on BANKS: More sysadmin/user control on ranking
One size does not fit all BANKS provides infrastructure
Characterize bidirectional search better And find other applications
Security
Bidir Search: top-k results (2)
Compute upper bound on score of next result; output answers with a higher score
Computing the bound mi = minimum path length explored backward
from keyword i unseen answer node: 1/(m1 + m2 + … + mn ) visited answer node: suppose reached from
first x keywords with distance di 1/[(d1 + d2 + … + dx ) + (mx+1 + mx+2 + … + mn )]
combine this with max node prestige We simply use 1/(m1 + m2 + … + mn )!
Experiments show no significant loss in using this heuristic