Querying Big Social Graphs
1
QSX: Querying Social Graphs
Approximate query answering
Query-driven approximation
Data-driven approximation
Graph systems
2
Make big graphs small
Input: A class Q of queries
Question: Can we effectively find, given any query Q ∈ Q and any (possibly big) graph G, a small GQ such that
Q(G) = Q(GQ)?
Is it always feasible to compute exact query answers?
Distributed query processing; boundedly evaluable graph queries; query-preserving graph compression; query answering using views; bounded incremental evaluation
[Diagram: evaluating Q directly on G vs. on the much smaller GQ]
3
Approximate query answering
1. Query-driven approximation
① From intractable to low polynomial time
② Top-k query answering
2. Data-driven approximation: resource-bounded query answering
Querying big data within our available resources
We can’t always find exact query answers in big data:
Some queries are not parallel scalable and the data can’t be made small enough
We may have constrained resources – we cannot afford unlimited resources
Applications may demand real-time response
What can we do?
Query-driven approximation
4
5
Revised graph simulation
Relaxing the semantics of query answering
Effectiveness: capture more sensible matches in social graphs
Efficiency: from intractable to low polynomial time
Subgraph isomorphism
NP-complete; exponentially many matches
Quadratic/cubic time; a match relation of size at most |VQ||V|
Use “cheaper” queries whenever possible
Works better for social network analysis
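To make the quadratic/cubic-time semantics concrete, here is a minimal sketch of plain graph simulation, the relation the revised semantics build on. This is a naive fixed-point version for illustration, not the optimized algorithm from the literature; the input encoding (label dicts, edge lists) is my own assumption:

```python
from collections import defaultdict

def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Compute the maximum simulation relation from pattern Q to graph G.

    q_nodes/g_nodes: dict node -> label; q_edges/g_edges: lists of (src, dst).
    Returns a dict: query node -> set of graph nodes that match it.
    """
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    # Initialise candidates by label matching.
    sim = {u: {v for v, lab in g_nodes.items() if lab == q_nodes[u]}
           for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for (u, u2) in q_edges:
            # v can match u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim
```

Unlike subgraph isomorphism, the result is a single (maximum) match relation rather than exponentially many subgraphs.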
Top-k query answering
6
Traditional query answering: compute Q(G)
Top-k query answering:
Input: Query Q, a graph G and a positive integer k.
Output: A top-ranked set of k matches of a designated node uo
It is expensive to compute when G is large
The result Q(G) may be excessively large for users to inspect – even larger than G
How many matches do you check when you use, e.g., Google?
Early termination: return diversified top-k matches without
computing Q(G)
Finding best candidates
7
[Figure: collaboration network G with project managers PM1–PM4, a business analyst BA, programmers PRG1–PRG4, DB developers DB1–DB3, designers UD1–UD2 and testers ST1–ST4; pattern graph Q with Project Manager* linked to Programmer, DB manager and Tester]
Query: find good PM (project manager) candidates collaborated with
PRG (programmer), DB (database developer) and ST (software tester).
Collaboration network G
“query focus”
Complete matching relation:
(project manager, PM1), (project manager, PM2), (project manager, PM3), (project manager, PM4)
(programmer, PRG1), (programmer, PRG2), (programmer, PRG3), (programmer, PRG4)
(DB manager, DB1), (DB manager, DB2), (DB manager, DB3)
(tester, ST1), (tester, ST2), (tester, ST3), (tester, ST4)
Pattern graph Q
Querying collaborative networks: we just want top-ranked PMs
8
Input: graph G = (V, E, fA), pattern Q = (VQ, EQ, fv, uo)
Output: Q(G, uo) = { v | (uo, v) ∈ Q(G) }
Graph pattern matching with output node
Output: k nodes vs. the entire set Q(G)
Output node
Matches of the output node
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: top-k matches in Q(G, uo)
[Figure: pattern Q with output node PM* linked to PRG, DB and ST, matched against pm1 … pmn, prg1 … prg3, db1 … db3, st1 … stm; the top-2 matches of the output node are highlighted]
How to rank the answers?
Top-k answers
9
Top-k matching: top-k matches that maximize the total relevance
[Figure: the match subgraph around PM2, with PRG2–PRG4, DB2–DB3 and ST2–ST4]
Based on relevance alone: O(|G||Q| + |G|^2) time
Relevant set R(u, v) for a match v of a query node u: all descendants of v that are matches of descendants of u
• a unique, maximum relevant set
Relevance function: the more reachable matches, the better
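A minimal sketch of ranking by relevance, under simplifying assumptions: the matches come from a precomputed simulation relation, and the relevance of a match v of the output node uo is just the number of reachable graph nodes that match query nodes reachable from uo (the actual relevance function may weight these differently):

```python
from collections import defaultdict, deque

def rank_by_relevance(sim, q_reachable, g_edges, u_o):
    """Rank the matches of output node u_o by |R(u_o, v)|: the reachable
    graph nodes that match query nodes reachable from u_o.

    sim: query node -> set of matching graph nodes (e.g. from simulation).
    q_reachable: set of query nodes reachable from u_o (u_o excluded).
    """
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    # Union of the matches of the relevant query nodes.
    targets = set().union(*(sim[u] for u in q_reachable)) if q_reachable else set()
    scores = {}
    for v in sim[u_o]:
        seen, queue = set(), deque([v])
        while queue:  # BFS over the descendants of v
            x = queue.popleft()
            for y in g_succ[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        scores[v] = len(seen & targets)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```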
Early termination
10
Ranking match results: Relevance
Top-k graph pattern matching: social impact
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: top-k matches in Q(G, uo)
[Figure: pattern with output node PM* linked to PRG, DB and ST, matched against pm1 … pmn, prg1 … prg3, db1 … db3, st1 … stm; the top-2 relevant matches are highlighted]
Result diversification
11
Diversity function: set difference of the relevant sets
Diversified Top-k matching: find a set S of k matches for output node
such that F(S) is maximized
Diversification: bi-objective optimization: relevance and diversity
To avoid overly homogeneous matches: balance relevance with diversity
Diversified top-k matching: NP-complete
• more expensive than simulation
12
Ranking match results: Diversity
Diversified top-k graph pattern matching: social diversity
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: top-k diversified matches in Q(G, uo)
[Figure: pattern with output node PM* linked to PRG, DB and ST, matched against pm1 … pmn, prg1 … prg3, db1 … db3, st1 … stm; the top-2 diversified matches are highlighted]
δd(pm1, pm2) = (m+5)/(m+6), δd(pm2, pm3) = 3/(m+2), δd(pm1, pm3) = 1
Top-2 diversified matches
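The δd values in the example are set-difference diversity scores. One plausible reading, sketched here (the exact normalisation used on the slide may differ), is the symmetric difference of two matches' relevant sets over their union, i.e. the Jaccard distance:

```python
def diversity(rel_set_a, rel_set_b):
    """Set-difference diversity of two matches: the symmetric difference
    of their relevant sets, normalised by the size of their union.
    Disjoint relevant sets give 1.0; identical relevant sets give 0.0."""
    union = rel_set_a | rel_set_b
    if not union:
        return 0.0
    return len(rel_set_a ^ rel_set_b) / len(union)
```

Under this reading, δd(pm1, pm3) = 1 says the two matches have entirely disjoint relevant sets.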
13
The complexity
Relevance alone: O((|V| + |Q|)(|E| + |V|)) time
Diversification based on both relevance and diversity
• NP-complete (decision problem)
• APX-hard
• O((|V| + |Q|)(|E| + |V|)) time with approximation ratio 2
Early termination: stop as soon as top-k matches are found
1.8 times faster than the traditional simulation algorithm
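The 2-approximation algorithm itself is not spelled out in the slides; as a stand-in, a generic greedy for bi-objective max-sum diversification looks like the sketch below (the λ balance parameter and the rel/dist inputs are illustrative assumptions, not the paper's exact objective):

```python
def diversified_topk(matches, rel, dist, k, lam=0.5):
    """Greedy sketch of diversified top-k selection (max-sum style).

    matches: candidate matches of the output node; rel: match -> relevance;
    dist: (match, match) -> diversity (e.g. set difference of relevant sets).
    Greedily adds the match with the best marginal gain towards
    F(S) = lam * sum of relevances + (1 - lam) * sum of pairwise diversities.
    """
    chosen = []
    rest = set(matches)
    while rest and len(chosen) < k:
        best = max(rest, key=lambda v: lam * rel[v] +
                   (1 - lam) * sum(dist(v, w) for w in chosen))
        chosen.append(best)
        rest.remove(best)
    return chosen
```

Greedy selection of this kind is the standard way to get constant-factor guarantees for max-sum diversification objectives.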
Top-k query answering:
Input: pattern Q, data graph G and a positive integer k.
Output: top-k matches in Q(G, uo)
without computing the entire Q(G, uo)
quadratic time
Even with diversification
Data-driven approximation for bounded resources
14
15
Traditional approximation theory
Traditional approximation algorithms T for an NPO (NP optimization problem):
for each instance x, T(x) computes a feasible solution y
quality metric f(x, y)
performance ratio η: for all x,
Does it work when it comes to querying big data?
OPT(x): the optimal solution; η ≥ 1
Minimization: OPT(x) ≤ f(x, y) ≤ η · OPT(x)
Maximization: (1/η) · OPT(x) ≤ f(x, y) ≤ OPT(x)
16
The approximation theory revisited
Traditional approximation algorithms T for an NPO:
• for each instance x, T(x) computes a feasible solution y
• quality metric f(x, y)
• performance ratio η (minimization): for all x,
A quest for revising approximation algorithms for querying big data
Approximation: for even low PTIME problems, not just NPO
Quality metric: the answer to a query is typically a set, not a number
Approach: it does not help much if T(x) conducts computation on “big” data x directly!
OPT(x) ≤ f(x, y) ≤ η · OPT(x)
Big data?
Data-driven: Resource bounded query answering
17
Input: A class Q of queries, a resource ratio α ∈ [0, 1), and a performance ratio η ∈ (0, 1]
Question: Develop an algorithm that, given any query Q ∈ Q and dataset D,
• accesses a fraction Dα of D such that |Dα| ≤ α|D|
• computes Q(Dα) as approximate answers to Q(D), and
• guarantees accuracy(Q, D, α) ≥ η
[Diagram: dynamic reduction maps D to a fraction Dα; the approximation evaluates Q over Dα, and Q(Dα) approximates Q(D)]
Accessing α|D| amount of data in the entire process
18
Resource bounded query answering
Resource bounded: resource ratio α ∈ [0, 1)
• decided by our available resources: time, space, …
In combination with other tricks for making big data small
Dynamic reduction: given Q and D, find Dα for Q
• contrast this with synopses: find a single D′ for all Q
• histogram, wavelets, sketches, sampling, …
better reduction ratio
access schema, distributed, views, …
19
Accuracy metrics
Performance ratio for approximate query answering
Performance ratio: F-measure of precision and recall
to cope with the set semantics of query answers
precision(Q, D, α) = |Q(Dα) ∩ Q(D)| / |Q(Dα)|
recall(Q, D, α) = |Q(Dα) ∩ Q(D)| / |Q(D)|
accuracy(Q, D, α) = 2 · precision(Q, D, α) · recall(Q, D, α) / (precision(Q, D, α) + recall(Q, D, α))
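These definitions translate directly into code; a small helper, assuming both answers are given as plain sets:

```python
def accuracy(approx, exact):
    """F-measure of an approximate answer set against the exact one,
    following the precision/recall definitions above."""
    approx, exact = set(approx), set(exact)
    if not approx or not exact:
        return 0.0
    inter = len(approx & exact)
    precision = inter / len(approx)
    recall = inter / len(exact)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The F-measure is used precisely because query answers are sets: a single ratio cannot capture both missing answers (recall) and spurious answers (precision).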
20
Personalized social search
We can make big graphs of PB size fit into our memory!
Graph Search, Facebook:
Find me all my friends who live in Edinburgh and like cycling
Find me restaurants in London my friends have been to
Find me photos of my friends in New York
We can do personalized social search with α = 0.0015%!
1.5 × 10^-5 × 1 PB (10^15 B) = 1.5 × 10^10 B = 15 GB
We are making big graphs of PB size as small as 15GB!
Localized patterns
with 100% accuracy!
Add to this access schema, distributed, views, …
Localized queries
Localized queries: can be answered locally
◦ Graph pattern queries: revised simulation queries
◦ matching relation over the dQ-neighborhood of a personalized node
[Figure: personalized node Michael linked to hiking groups hg1 … hgm, cycling clubs cc1–cc3 and cycling fans cl1 … cln; Michael is the unique match of the personalized query node]
Personalized social search, ego network analysis, …
Michael: “find cycling fans who know both my friends in cycling clubs and my friends in hiking groups”
21
Resource-bounded simulation
22
Query-guided search over G, driven by dynamically updated local auxiliary information for each node v (degree, number of neighbors, <label, frequency> of the neighborhood):
Boolean guarded condition: label matching – can v match u at all?
Cost function c(u, v): if v is included, the number of additional nodes that must also be included – the budget
Potential function p(u, v): the estimated probability that v matches u, from the number of candidate matches in the neighborhood of v
Bound b: an upper bound on the number of nodes to be visited
The search is guided by the query and bounded by the budget
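The budgeted, potential-guided traversal can be sketched generically as a best-first search that stops after visiting a bounded number of nodes (the potential function here is a stand-in for p(u, v), and the sketch omits the guard and cost bookkeeping of the actual algorithm):

```python
import heapq
from collections import defaultdict

def bounded_guided_search(g_edges, source, potential, bound):
    """Best-first traversal that expands at most `bound` nodes,
    always preferring the frontier node with the highest `potential`
    score. Returns the set of visited nodes."""
    g_succ = defaultdict(set)
    for v, w in g_edges:
        g_succ[v].add(w)
    visited = {source}
    # Max-heap via negated priorities.
    frontier = [(-potential(w), w) for w in g_succ[source]]
    heapq.heapify(frontier)
    while frontier and len(visited) < bound:
        _, v = heapq.heappop(frontier)
        if v in visited:
            continue
        visited.add(v)
        for w in g_succ[v]:
            if w not in visited:
                heapq.heappush(frontier, (-potential(w), w))
    return visited
```

Setting bound = α|G| is what caps the amount of data the whole process may access.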
Resource-bounded simulation
23
[Figure: resource-bounded simulation on the example: each candidate is tested with the guarded condition (TRUE/FALSE) and scored, e.g. cost = 1, potential = 3, bound = 2 for one cycling-club candidate and cost = 1, potential = 2, bound = 2 for another; in total, bound = 16 and visited = 16]
Match relation: (Michael, Michael), (hiking group, hgm), (cycling club, cc1),(cycling club, cc3), (cycling lover, cln-1), (cycling lover, cln)
Dynamic data reduction and query-guided search
Accuracy
24
Varying α (× 10^-5): accuracy on the Yahoo graph
89%–100% accuracy for simulation queries; 100% accuracy when α > 0.0015%
100% accuracy when accessing 10^-5 × 1 PB (10^15 B) = 10^10 B = 10 GB
Efficiency of resource bounded simulation
25
Varying α (× 10^-5) on the Yahoo graph
10^-5 × 1 PB (10^15 B) = 10^10 B = 10 GB
26
Non-localized queries
[Figure: is Michael connected to Eric? The connection runs through cc1 and a long chain of cycling fans cl3, cl7, cl9, …, far outside Michael's neighbourhood]
Reachability
• Input: A directed graph G, and a pair of nodes s and t in G
• Question: Does there exist a path from s to t in G?
Non-localized: t may be far from s
Does dynamic reduction work for non-localized queries?
Is Michael connected to Eric via social links?
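For contrast, the exact baseline is a plain BFS, which in the worst case traverses the whole graph – exactly what resource-bounded evaluation must avoid:

```python
from collections import defaultdict, deque

def reachable(g_edges, s, t):
    """Plain BFS reachability: does a path from s to t exist?
    O(|V| + |E|) time, but it may visit the entire graph."""
    succ = defaultdict(set)
    for v, w in g_edges:
        succ[v].add(w)
    seen, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            return True
        for w in succ[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```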
Resource-bounded reachability
27
Reduction: build a small tree index GQ with |GQ| ≤ α|G| from the big graph G, in O(|G|) preprocessing time
Reachability queries are then answered over GQ rather than G
Approximation: experimentally verified; no false positives; query time O(α|G|)
Yes, dynamic reduction works for non-localized queries
Preprocessing: landmarks
28
[Figure: the same Michael–Eric graph annotated with landmarks, e.g. cc1, cl3, cl4, cl6 and cl16]
Landmarks
◦ a landmark node covers a certain number of node pairs
◦ reachability of the pairs it covers can be computed from landmark labels
[Figure: landmark labels, e.g. cc1 stores “I can reach cl3” and cln-1 stores “cl3 can reach me”; reachability from cc1 to cln-1 follows from cl3's labels]
A revision of 2-hop covers
Search the landmark index (of size ≤ α|G|) instead of G
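The landmark-label test is in the spirit of 2-hop covers: keep, for each node, the landmarks it can reach (L_out) and the landmarks that can reach it (L_in). A sketch, assuming a complete labelling in which every node also appears in its own labels:

```python
def reaches(l_out, l_in, v, w):
    """2-hop-cover style reachability test: v can reach w iff some
    landmark x is reachable from v (x in l_out[v]) and reaches w
    (x in l_in[w]). Assumes the labelling covers all reachable pairs."""
    return bool(l_out[v] & l_in[w])
```

For the slide's example (cc1 → cl3 → cln-1 with landmark cl3), cc1's out-label contains cl3 and cln-1's in-label contains cl3, so the test succeeds without touching G.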
Hierarchical landmark Index
29
Landmark index
◦ landmark nodes are selected to encode pairwise reachability
◦ hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
[Figure: a tree of landmarks, with cc1, cl7, …, cln-1, …, cl16 at the top level and children such as cl3, cl5, cl6, cl4 and cl9 below]
A node v can reach v′ if there exist v1, v2, v3 in the index such that v reaches v1, v2 reaches v′, and v1 and v2 are connected to v3 (to and from) at the same level (by the encoding)
Hierarchical landmark Index
30
Guided search on the landmark index:
Boolean guarded condition (v, vp, v′): whether v can possibly reach v′ via vp, derived from the landmark labels/encoding, cover sizes and topological rank/range
Cost function c(v): the number of unvisited landmarks in the subtree rooted at v
Potential p(v): the total cover size of the unvisited landmarks among the children of v
Resource-bounded reachability
31
[Figure: bi-directed guided traversal of the landmark index with local auxiliary information: the search “drills down” into promising subtrees (e.g. condition = ?, cost = 9, potential = 46, then cost = 2, potential = 9) and “rolls up” when the condition is FALSE, until the condition evaluates to TRUE]
Drill down and roll up
Accuracy
32
Varying α (× 10^-4): accuracy on the Yahoo graph
Achieves 100% accuracy when α > 0.05%
100% accuracy when accessing 10^-4 × 1 PB (10^15 B) = 10^11 B = 100 GB
Efficiency: resource bounded reachability
33
RBreach is 62.5 times faster than BFS and 5.7 times faster than BFS-OPT
Varying α (× 10^-4) on the Yahoo graph
10^-4 × 1 PB (10^15 B) = 10^11 B = 100 GB
Graph systems
34
35
Systems based on vertex-centric models
Giraph (Apache)
– Model: BSP (Pregel, Google)
– On top of Hadoop (MapReduce)
– Scalability: Facebook used Giraph, with some performance improvements, to analyze one trillion (10^12) edges using 200 machines in 4 minutes
– Download: http://www.apache.org/dyn/closer.cgi/giraph/
Think like a vertex
GraphLab (CMU, Apache license)
– Initially for machine learning, and beyond now
– Libraries of algorithms
– https://dato.com/products/create/open_source.html
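The "think like a vertex" model can be illustrated with a toy, single-machine simulation of BSP supersteps computing connected components by minimum-label propagation. Real Giraph/Pregel programs are written as a vertex compute() method and run distributed across workers; this sketch only mimics the superstep structure:

```python
def vertex_centric_min_label(vertices, edges, max_supersteps=50):
    """Toy BSP simulation: in each superstep every vertex adopts the
    minimum label it hears from its neighbours, so each connected
    component converges to its minimum vertex id. The computation
    halts when no vertex changes (all vertices "vote to halt")."""
    label = {v: v for v in vertices}
    neighbours = {v: set() for v in vertices}
    for v, w in edges:            # treat edges as undirected
        neighbours[v].add(w)
        neighbours[w].add(v)
    for _ in range(max_supersteps):
        # Superstep: messages sent at the end of the previous round.
        messages = {v: [label[w] for w in neighbours[v]] for v in vertices}
        changed = False
        for v in vertices:
            m = min(messages[v], default=label[v])
            if m < label[v]:
                label[v] = m
                changed = True
        if not changed:           # global halt
            break
    return label
```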
36
Graph databases
– Neo4j (Neo Technology, GNU public license)
• Graph database: data is stored as nodes, edges and attributes
• Optimized for local traversal
• Indexing
• Download: http://neo4j.com//
Database style
GraphX: a Spark API for graphs (AMPLab, UC Berkeley)
– Based on Spark (in-memory primitives, as opposed to Hadoop's two-stage disk-based MapReduce; claimed 100 times faster; supports SQL)
– Graph parallel (Pregel, GraphLab) and data parallel (Spark, tables)
– Download: http://spark.apache.org/graphx/
Summing up
37
38
Approximate query answering
Challenges: to get real-time answers
Big data and costly queries; limited resources
Yes, we can query big data within bounded resources!
Combined with techniques for making big data small. Two approaches:
Query-driven approximation
• Cheaper queries
• Retain sensible answers
Data-driven approximation
• Dynamic data reduction
• Query-guided search
Reduce data of PB size to GB
Beyond graphs: relational queries
SPC: α = 7.9 × 10^-4
SQL: α = 1.9 × 10^-3
Summary and review
What is query-driven approximation? When can we use the approach?
Why does the traditional approximation scheme not work very well for query answering in big data?
What is data-driven dynamic approximation? Does it work on localized queries? Non-localized queries?
What is query-guided search?
39
40
Project (1)
Revise subgraph queries to include a designated output node, as query focus.
Develop ranking (relevance) function to rank matches of the output node in a subgraph query. Justify the design of your ranking function.
Develop an algorithm that, given a graph G, a subgraph query Q with a designated output node vo and a positive integer k, computes top-k matches of vo based on your ranking function.
Give correctness and complexity analyses of your algorithms. Experimentally evaluate your algorithms, especially their scalability with the size of graphs.
A research and development project
Recall graph pattern matching via subgraph isomorphism (Lecture 3),referred to as subgraph queries in the sequel.
41
Project (2)
Develop a resource-bounded algorithm for evaluating personalized subgraph queries.
Give correctness and complexity analyses of your algorithm. Implement your algorithm. Experimentally evaluate your algorithm, including accuracy and scalability.
A development project
Revise graph pattern matching via subgraph isomorphism (Lecture 3) by including a designated personalized node, referred to as personalized subgraph queries in the sequel.
42
Project (3)
A research and development project
Recall keyword search with Steiner trees (Lecture 2). You may add practical restrictions to keyword search, e.g., a designated keyword is required to match a small set of nodes in a graph.
Develop a resource-bounded algorithm for keyword search. Give correctness and complexity analyses of your algorithm. Implement your algorithm. Experimentally evaluate your algorithm, including accuracy and scalability.
43
Papers for you to review
• G. Gou and R. Chirkova. Efficient algorithms for exact ranked twig-pattern matching over graphs. In SIGMOD, 2008. http://dl.acm.org/citation.cfm?id=1376676
• H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB, 2008. http://www.vldb.org/pvldb/1/1453899.pdf
• R. T. Stern, R. Puzis, and A. Felner. Potential search: A bounded-cost search algorithm. In ICAPS, 2011. (search Google Scholar)
• S. Zilberstein, F. Charpillet, P. Chassaing, et al. Real-time problem solving with contract algorithms. In IJCAI, 1999. (search Google Scholar)
• W. Fan, X. Wang, and Y. Wu. Diversified top-k graph pattern matching. VLDB, 2014. (query-driven approximation)
• W. Fan, X. Wang, and Y. Wu. Querying big graphs within bounded resources. SIGMOD, 2014. (data-driven approximation)