Querying Big Social Graphs
1
QSX: Querying Social Graphs
Approximate query answering
Query-driven approximation
Data-driven approximation
Graph systems
2
Make big graphs small
Input: A class Q of queries
Question: Can we effectively find, given any query Q ∈ Q and any (possibly big) graph G, a small GQ such that
Q(G) = Q(GQ)?
Is it always feasible to compute exact query answers?
Distributed query processing; boundedly evaluable graph queries; query-preserving graph compression; query answering using views; bounded incremental evaluation
[Diagram: evaluating Q directly on G vs. on the much smaller GQ]
3
Approximate query answering
1. Query-driven approximation
① From intractable to low polynomial time
② Top-k query answering
2. Data-driven approximation: resource-bounded query answering
Querying big data within our available resources
We can’t always find exact query answers in big data:
Some queries are not parallel scalable and the data can’t be made small enough
We may have constrained resources – we cannot afford unlimited resources
Applications may demand real-time response
What can we do?
Query-driven approximation
4
5
Revised graph simulation
Relaxing the semantics of query answering
Effectiveness: capture more sensible matches in social graphs
Efficiency: from intractable to low polynomial time
Subgraph isomorphism
NP-complete; exponentially many matches
Quadratic/cubic time; a match relation of size at most |VQ||V|
Use “cheaper” queries whenever possible
Works better for social network analysis
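To make the quadratic/cubic-time semantics concrete, here is a minimal sketch of plain graph simulation, the relation the revised semantics build on. This is a naive fixed-point version for illustration, not the optimized algorithm from the literature; the input encoding (label dicts, edge lists) is my own assumption:

```python
from collections import defaultdict

def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Compute the maximum simulation relation from pattern Q to graph G.

    q_nodes/g_nodes: dict node -> label; q_edges/g_edges: lists of (src, dst).
    Returns a dict: query node -> set of graph nodes that match it.
    """
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    # Initialise candidates by label matching.
    sim = {u: {v for v, lab in g_nodes.items() if lab == q_nodes[u]}
           for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for (u, u2) in q_edges:
            # v can match u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim
```

Unlike subgraph isomorphism, the result is a single (maximum) match relation rather than exponentially many subgraphs.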
Top-k query answering
6
Traditional query answering: compute Q(G)
Top-k query answering:
Input: Query Q, a graph G and a positive integer k.
Output: A top-ranked set of k matches of a designated node uo
It is expensive to compute when G is large
The result Q(G) may be excessively large for users to inspect – even larger than G
How many matches do you check when you use, e.g., Google?
Early termination: return diversified top-k matches without
computing Q(G)
Finding best candidates
7
[Figure: collaboration network G with project managers PM1–PM4, a business analyst BA, programmers PRG1–PRG4, DB developers DB1–DB3, designers UD1–UD2 and testers ST1–ST4; pattern graph Q with Project Manager* linked to Programmer, DB manager and Tester]
Query: find good PM (project manager) candidates collaborated with
PRG (programmer), DB (database developer) and ST (software tester).
Collaboration network G
“query focus”
Complete matching relation:
(project manager, PM1), (project manager, PM2), (project manager, PM3), (project manager, PM4)
(programmer, PRG1), (programmer, PRG2), (programmer, PRG3), (programmer, PRG4)
(DB manager, DB1), (DB manager, DB2), (DB manager, DB3)
(tester, ST1), (tester, ST2), (tester, ST3), (tester, ST4)
Pattern graph Q
Querying collaborative networks: we just want top-ranked PMs
8
Input: graph G = (V, E, fA), pattern Q = (VQ, EQ, fv, uo)
Output: Q(G, uo) = { v | (uo, v) ∈ Q(G) }
Graph pattern matching with output node
Output: k nodes vs. the entire set Q(G)
Output node
Matches of the output node
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: top-k matches in Q(G, uo)
[Figure: pattern Q with output node PM* linked to PRG, DB and ST, matched against pm1 … pmn, prg1 … prg3, db1 … db3, st1 … stm; the top-2 matches of the output node are highlighted]
How to rank the answers?
Top-k answers
9
Top-k matching: top-k matches that maximize the total relevance
[Figure: the match subgraph around PM2, with PRG2–PRG4, DB2–DB3 and ST2–ST4]
Based on relevance alone: O(|G||Q| + |G|^2) time
Relevant set R(u, v) for a match v of a query node u: all descendants of v that are matches of descendants of u
• a unique, maximum relevant set
Relevance function: the more reachable matches, the better
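A minimal sketch of ranking by relevance, under simplifying assumptions: the matches come from a precomputed simulation relation, and the relevance of a match v of the output node uo is just the number of reachable graph nodes that match query nodes reachable from uo (the actual relevance function may weight these differently):

```python
from collections import defaultdict, deque

def rank_by_relevance(sim, q_reachable, g_edges, u_o):
    """Rank the matches of output node u_o by |R(u_o, v)|: the reachable
    graph nodes that match query nodes reachable from u_o.

    sim: query node -> set of matching graph nodes (e.g. from simulation).
    q_reachable: set of query nodes reachable from u_o (u_o excluded).
    """
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    # Union of the matches of the relevant query nodes.
    targets = set().union(*(sim[u] for u in q_reachable)) if q_reachable else set()
    scores = {}
    for v in sim[u_o]:
        seen, queue = set(), deque([v])
        while queue:  # BFS over the descendants of v
            x = queue.popleft()
            for y in g_succ[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        scores[v] = len(seen & targets)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```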
Early termination
10
Ranking match results: Relevance
Top-k graph pattern matching: social impact
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: top-k matches in Q(G, uo)
[Figure: pattern with output node PM* linked to PRG, DB and ST, matched against pm1 … pmn, prg1 … prg3, db1 … db3, st1 … stm; the top-2 relevant matches are highlighted]
Result diversification
11
Diversity function: set difference of the relevant sets
Diversified Top-k matching: find a set S of k matches for output node
such that F(S) is maximized
Diversification: bi-objective optimization: relevance and diversity
To avoid overly homogeneous matches: balance relevance with diversity
Diversified top-k matching: NP-complete
• more expensive than simulation
12
Ranking match results: Diversity
Diversified top-k graph pattern matching: social diversity
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: top-k diversified matches in Q(G, uo)
[Figure: pattern with output node PM* linked to PRG, DB and ST, matched against pm1 … pmn, prg1 … prg3, db1 … db3, st1 … stm; the top-2 diversified matches are highlighted]
δd(pm1, pm2) = (m+5)/(m+6), δd(pm2, pm3) = 3/(m+2), δd(pm1, pm3) = 1
Top-2 diversified matches
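The δd values in the example are set-difference diversity scores. One plausible reading, sketched here (the exact normalisation used on the slide may differ), is the symmetric difference of two matches' relevant sets over their union, i.e. the Jaccard distance:

```python
def diversity(rel_set_a, rel_set_b):
    """Set-difference diversity of two matches: the symmetric difference
    of their relevant sets, normalised by the size of their union.
    Disjoint relevant sets give 1.0; identical relevant sets give 0.0."""
    union = rel_set_a | rel_set_b
    if not union:
        return 0.0
    return len(rel_set_a ^ rel_set_b) / len(union)
```

Under this reading, δd(pm1, pm3) = 1 says the two matches have entirely disjoint relevant sets.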
13
The complexity
Relevance alone: O((|V| + |Q|)(|E| + |V|)) time
Diversification based on both relevance and diversity
• NP-complete (decision problem)
• APX-hard
• O((|V| + |Q|)(|E| + |V|)) time with approximation ratio 2
Early termination: stop as soon as top-k matches are found
1.8 times faster than the traditional simulation algorithm
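The 2-approximation algorithm itself is not spelled out in the slides; as a stand-in, a generic greedy for bi-objective max-sum diversification looks like the sketch below (the λ balance parameter and the rel/dist inputs are illustrative assumptions, not the paper's exact objective):

```python
def diversified_topk(matches, rel, dist, k, lam=0.5):
    """Greedy sketch of diversified top-k selection (max-sum style).

    matches: candidate matches of the output node; rel: match -> relevance;
    dist: (match, match) -> diversity (e.g. set difference of relevant sets).
    Greedily adds the match with the best marginal gain towards
    F(S) = lam * sum of relevances + (1 - lam) * sum of pairwise diversities.
    """
    chosen = []
    rest = set(matches)
    while rest and len(chosen) < k:
        best = max(rest, key=lambda v: lam * rel[v] +
                   (1 - lam) * sum(dist(v, w) for w in chosen))
        chosen.append(best)
        rest.remove(best)
    return chosen
```

Greedy selection of this kind is the standard way to get constant-factor guarantees for max-sum diversification objectives.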
Top-k query answering:
Input: pattern Q, data graph G and a positive integer k.
Output: top-k matches in Q(G, uo)
without computing the entire Q(G, uo)
quadratic time
Even with diversification
Data-driven approximation for bounded resources
14
15
Traditional approximation theory
Traditional approximation algorithms T for an NPO (NP optimization problem):
for each instance x, T(x) computes a feasible solution y
quality metric f(x, y)
performance ratio η: for all x,
Does it work when it comes to querying big data?
OPT(x): the optimal solution; η ≥ 1
Minimization: OPT(x) ≤ f(x, y) ≤ η · OPT(x)
Maximization: (1/η) · OPT(x) ≤ f(x, y) ≤ OPT(x)
16
The approximation theory revisited
Traditional approximation algorithms T for an NPO:
• for each instance x, T(x) computes a feasible solution y
• quality metric f(x, y)
• performance ratio η (minimization): for all x,
A quest for revising approximation algorithms for querying big data
Approximation: for even low PTIME problems, not just NPO
Quality metric: the answer to a query is typically a set, not a number
Approach: it does not help much if T(x) conducts computation on “big” data x directly!
OPT(x) ≤ f(x, y) ≤ η · OPT(x)
Big data?
Data-driven: Resource bounded query answering
17
Input: A class Q of queries, a resource ratio α ∈ [0, 1), and a performance ratio η ∈ (0, 1]
Question: Develop an algorithm that, given any query Q ∈ Q and dataset D,
• accesses a fraction Dα of D such that |Dα| ≤ α|D|
• computes Q(Dα) as approximate answers to Q(D), and
• guarantees accuracy(Q, D, α) ≥ η
[Diagram: dynamic reduction maps D to a fraction Dα; the approximation evaluates Q over Dα, and Q(Dα) approximates Q(D)]
Accessing α|D| amount of data in the entire process
18
Resource bounded query answering
Resource bounded: resource ratio α ∈ [0, 1)
• decided by our available resources: time, space, …
In combination with other tricks for making big data small
Dynamic reduction: given Q and D, find Dα for Q
• contrast this with synopses: find a single D′ for all Q
• histogram, wavelets, sketches, sampling, …
better reduction ratio
access schema, distributed, views, …
19
Accuracy metrics
Performance ratio for approximate query answering
Performance ratio: F-measure of precision and recall
to cope with the set semantics of query answers
precision(Q, D, α) = |Q(Dα) ∩ Q(D)| / |Q(Dα)|
recall(Q, D, α) = |Q(Dα) ∩ Q(D)| / |Q(D)|
accuracy(Q, D, α) = 2 · precision(Q, D, α) · recall(Q, D, α) / (precision(Q, D, α) + recall(Q, D, α))
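These definitions translate directly into code; a small helper, assuming both answers are given as plain sets:

```python
def accuracy(approx, exact):
    """F-measure of an approximate answer set against the exact one,
    following the precision/recall definitions above."""
    approx, exact = set(approx), set(exact)
    if not approx or not exact:
        return 0.0
    inter = len(approx & exact)
    precision = inter / len(approx)
    recall = inter / len(exact)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The F-measure is used precisely because query answers are sets: a single ratio cannot capture both missing answers (recall) and spurious answers (precision).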
20
Personalized social search
We can make big graphs of PB size fit into our memory!
Graph Search, Facebook:
Find me all my friends who live in Edinburgh and like cycling
Find me restaurants in London my friends have been to
Find me photos of my friends in New York
We can do personalized social search with α = 0.0015%!
1.5 × 10^-5 × 1 PB (10^15 B) = 1.5 × 10^10 B = 15 GB
We are making big graphs of PB size as small as 15GB!
Localized patterns
with 100% accuracy!
Add to this access schema, distributed, views, …
Localized queries
Localized queries: can be answered locally
◦ Graph pattern queries: revised simulation queries
◦ matching relation over the dQ-neighborhood of a personalized node
[Figure: personalized node Michael linked to hiking groups hg1 … hgm, cycling clubs cc1–cc3 and cycling fans cl1 … cln; Michael is the unique match of the personalized query node]
Personalized social search, ego network analysis, …
Michael: “find cycling fans who know both my friends in cycling clubs and my friends in hiking groups”
21
Resource-bounded simulation
22
Query-guided search over G, driven by dynamically updated local auxiliary information for each node v (degree, number of neighbors, <label, frequency> of the neighborhood):
Boolean guarded condition: label matching – can v match u at all?
Cost function c(u, v): if v is included, the number of additional nodes that must also be included – the budget
Potential function p(u, v): the estimated probability that v matches u, from the number of candidate matches in the neighborhood of v
Bound b: an upper bound on the number of nodes to be visited
The search is guided by the query and bounded by the budget
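The budgeted, potential-guided traversal can be sketched generically as a best-first search that stops after visiting a bounded number of nodes (the potential function here is a stand-in for p(u, v), and the sketch omits the guard and cost bookkeeping of the actual algorithm):

```python
import heapq
from collections import defaultdict

def bounded_guided_search(g_edges, source, potential, bound):
    """Best-first traversal that expands at most `bound` nodes,
    always preferring the frontier node with the highest `potential`
    score. Returns the set of visited nodes."""
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    visited = {source}
    # Max-heap via negated priorities.
    frontier = [(-potential(w), w) for w in g_succ[source]]
    heapq.heapify(frontier)
    while frontier and len(visited) < bound:
        _, v = heapq.heappop(frontier)
        if v in visited:
            continue
        visited.add(v)
        for w in g_succ[v]:
            if w not in visited:
                heapq.heappush(frontier, (-potential(w), w))
    return visited
```

Setting bound = α|G| is what caps the amount of data the whole process may access.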
Resource-bounded simulation
23
[Figure: resource-bounded simulation on the example: each candidate is tested with the guarded condition (TRUE/FALSE) and scored, e.g. cost = 1, potential = 3, bound = 2 for one cycling-club candidate and cost = 1, potential = 2, bound = 2 for another; in total, bound = 16 and visited = 16]
Match relation: (Michael, Michael), (hiking group, hgm), (cycling club, cc1),(cycling club, cc3), (cycling lover, cln-1), (cycling lover, cln)
Dynamic data reduction and query-guided search
Accuracy
24
Varying α (× 10^-5): accuracy on the Yahoo graph
89%–100% accuracy for simulation queries; 100% accuracy when α > 0.0015%
100% accuracy when accessing 10^-5 × 1 PB (10^15 B) = 10^10 B = 10 GB
Efficiency of resource bounded simulation
25
Varying α (× 10^-5) on the Yahoo graph
10^-5 × 1 PB (10^15 B) = 10^10 B = 10 GB
26
Non-localized queries
[Figure: is Michael connected to Eric? The connection runs through cc1 and a long chain of cycling fans cl3, cl7, cl9, …, far outside Michael's neighbourhood]
Reachability
• Input: A directed graph G, and a pair of nodes s and t in G
• Question: Does there exist a path from s to t in G?
Non-localized: t may be far from s
Does dynamic reduction work for non-localized queries?
Is Michael connected to Eric via social links?
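For contrast, the exact baseline is a plain BFS, which in the worst case traverses the whole graph – exactly what resource-bounded evaluation must avoid:

```python
from collections import defaultdict, deque

def reachable(g_edges, s, t):
    """Plain BFS reachability: does a path from s to t exist?
    O(|V| + |E|) time, but it may visit the entire graph."""
    succ = defaultdict(set)
    for v, w in g_edges:
        succ[v].add(w)
    seen, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            return True
        for w in succ[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```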
Resource-bounded reachability
27
Reduction: build a small tree index GQ with |GQ| ≤ α|G| from the big graph G, in O(|G|) preprocessing time
Reachability queries are then answered over GQ rather than G
Approximation: experimentally verified; no false positives; query time O(α|G|)
Yes, dynamic reduction works for non-localized queries
Preprocessing: landmarks
28
[Figure: the same Michael–Eric graph annotated with landmarks, e.g. cc1, cl3, cl4, cl6 and cl16]
Landmarks
◦ a landmark node covers a certain number of node pairs
◦ reachability of the pairs it covers can be computed from landmark labels
[Figure: landmark labels, e.g. cc1 stores “I can reach cl3” and cln-1 stores “cl3 can reach me”; reachability from cc1 to cln-1 follows from cl3's labels]
A revision of 2-hop covers
Search the landmark index (of size ≤ α|G|) instead of G
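The landmark-label test is in the spirit of 2-hop covers: keep, for each node, the landmarks it can reach (L_out) and the landmarks that can reach it (L_in). A sketch, assuming a complete labelling in which every node also appears in its own labels:

```python
def reaches(l_out, l_in, v, w):
    """2-hop-cover style reachability test: v can reach w iff some
    landmark x is reachable from v (x in l_out[v]) and reaches w
    (x in l_in[w]). Assumes the labelling covers all reachable pairs."""
    return bool(l_out[v] & l_in[w])
```

For the slide's example (cc1 → cl3 → cln-1 with landmark cl3), cc1's out-label contains cl3 and cln-1's in-label contains cl3, so the test succeeds without touching G.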
Hierarchical landmark Index
29
Landmark index
◦ landmark nodes are selected to encode pairwise reachability
◦ hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
[Figure: a tree of landmarks, with cc1, cl7, …, cln-1, …, cl16 at the top level and children such as cl3, cl5, cl6, cl4 and cl9 below]
A node v can reach v′ if there exist v1, v2, v3 in the index such that v reaches v1, v2 reaches v′, and v1 and v2 are connected to v3 (to and from) at the same level (by the encoding)
Hierarchical landmark Index
30
Guided search on the landmark index:
Boolean guarded condition (v, vp, v′): whether v can possibly reach v′ via vp, derived from the landmark labels/encoding, cover sizes and topological rank/range
Cost function c(v): the number of unvisited landmarks in the subtree rooted at v
Potential p(v): the total cover size of the unvisited landmarks among the children of v
Resource-bounded reachability
31
[Figure: bi-directed guided traversal of the landmark index with local auxiliary information: the search “drills down” into promising subtrees (e.g. condition = ?, cost = 9, potential = 46, then cost = 2, potential = 9) and “rolls up” when the condition is FALSE, until the condition evaluates to TRUE]
Drill down and roll up
Accuracy
32
Varying α (× 10^-4): accuracy on the Yahoo graph
Achieves 100% accuracy when α > 0.05%
100% accuracy when accessing 10^-4 × 1 PB (10^15 B) = 10^11 B = 100 GB
Efficiency: resource bounded reachability
33
RBreach is 62.5 times faster than BFS and 5.7 times faster than BFS-OPT
Varying α (× 10^-4) on the Yahoo graph
10^-4 × 1 PB (10^15 B) = 10^11 B = 100 GB
Graph systems
34
35
Systems based on vertex-centric models
Giraph (Apache)
– Model: BSP (Pregel, Google)
– On top of Hadoop (MapReduce)
– Scalability: Facebook used Giraph, with some performance improvements, to analyze one trillion (10^12) edges using 200 machines in 4 minutes
– Download: http://www.apache.org/dyn/closer.cgi/giraph/
Think like a vertex
GraphLab (CMU, Apache license)
– Initially for machine learning, and beyond now
– Libraries of algorithms
– https://dato.com/products/create/open_source.html
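The "think like a vertex" model can be illustrated with a toy, single-machine simulation of BSP supersteps computing connected components by minimum-label propagation. Real Giraph/Pregel programs are written as a vertex compute() method and run distributed across workers; this sketch only mimics the superstep structure:

```python
def vertex_centric_min_label(vertices, edges, max_supersteps=50):
    """Toy BSP simulation: in each superstep every vertex adopts the
    minimum label it hears from its neighbours, so each connected
    component converges to its minimum vertex id. The computation
    halts when no vertex changes (all vertices "vote to halt")."""
    label = {v: v for v in vertices}
    neighbours = {v: set() for v in vertices}
    for v, w in edges:            # treat edges as undirected
        neighbours[v].add(w)
        neighbours[w].add(v)
    for _ in range(max_supersteps):
        # Superstep: messages sent at the end of the previous round.
        messages = {v: [label[w] for w in neighbours[v]] for v in vertices}
        changed = False
        for v in vertices:
            m = min(messages[v], default=label[v])
            if m < label[v]:
                label[v] = m
                changed = True
        if not changed:           # global halt
            break
    return label
```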
36
Graph databases
– Neo4j (Neo Technology, GNU public license)
• Graph database: data is stored as nodes, edges and attributes
• Optimized for local traversal
• Indexing
• Download: http://neo4j.com//
Database style
GraphX: a Spark API for graphs (AMPLab, UC Berkeley)
– Based on Spark (in-memory primitives, as opposed to Hadoop's two-stage disk-based MapReduce; claimed 100 times faster; supports SQL)
– Graph parallel (Pregel, GraphLab) and data parallel (Spark, tables)
– Download: http://spark.apache.org/graphx/
Summing up
37
38
Approximate query answering
Challenges: to get real-time answers
Big data and costly queries; limited resources
Yes, we can query big data within bounded resources!
Combined with techniques for making big data small. Two approaches:
Query-driven approximation
• Cheaper queries
• Retain sensible answers
Data-driven approximation
• Dynamic data reduction
• Query-guided search
Reduce data of PB size to GB
Beyond graphs: relational queries
SPC: α = 7.9 × 10^-4
SQL: α = 1.9 × 10^-3
Summary and review
What is query-driven approximation? When can we use the approach?
Why does the traditional approximation scheme not work very well for query answering in big data?
What is data-driven dynamic approximation? Does it work on localized queries? Non-localized queries?
What is query-guided search?
39
40
Project (1)
Revise subgraph queries to include a designated output node, as query focus.
Develop ranking (relevance) function to rank matches of the output node in a subgraph query. Justify the design of your ranking function.
Develop an algorithm that, given a graph G, a subgraph query Q with a designated output node vo and a positive integer k, computes top-k matches of vo based on your ranking function.
Give correctness and complexity analyses of your algorithms. Experimentally evaluate your algorithms, especially their scalability with the size of graphs.
A research and development project
Recall graph pattern matching via subgraph isomorphism (Lecture 3),referred to as subgraph queries in the sequel.
41
Project (2)
Develop a resource-bounded algorithm for evaluating personalized subgraph queries.
Give correctness and complexity analyses of your algorithm. Implement your algorithm. Experimentally evaluate your algorithm, including accuracy and scalability.
A development project
Revise graph pattern matching via subgraph isomorphism (Lecture 3) by including a designated personalized node, referred to as personalized subgraph queries in the sequel.
42
Project (3)
A research and development project
Recall keyword search with Steiner trees (Lecture 2). You may add practical restrictions to keyword search, e.g., a designated keyword is required to match a small set of nodes in a graph.
Develop a resource-bounded algorithm for keyword search. Give correctness and complexity analyses of your algorithm. Implement your algorithm. Experimentally evaluate your algorithm, including accuracy and scalability.
43
Papers for you to review
• G. Gou and R. Chirkova. Efficient algorithms for exact ranked twig-pattern matching over graphs. In SIGMOD, 2008. http://dl.acm.org/citation.cfm?id=1376676
• H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB, 2008. http://www.vldb.org/pvldb/1/1453899.pdf
• R. T. Stern, R. Puzis, and A. Felner. Potential search: A bounded-cost search algorithm. In ICAPS, 2011. (search Google Scholar)
• S. Zilberstein, F. Charpillet, P. Chassaing, et al. Real-time problem solving with contract algorithms. In IJCAI, 1999. (search Google Scholar)
• W. Fan, X. Wang, and Y. Wu. Diversified top-k graph pattern matching. VLDB, 2014. (query-driven approximation)
• W. Fan, X. Wang, and Y. Wu. Querying big graphs within bounded resources. SIGMOD, 2014. (data-driven approximation)