A query language for analyzing networks

Anton Dries (based on joint work with Siegfried Nijssen)

Upload: larca-upc

Post on 09-Jul-2015


DESCRIPTION

Information networks are a popular way to represent information, especially in domains where the emphasis lies on the structural relationships between entities rather than on their features. Notable examples are online social networks and road networks. This focus on network topology has led to the development of specialized graph databases. However, few of these databases offer a high-level declarative interface suited to analyzing information networks. In this talk I present our work on developing a query language for analyzing networks. I will focus on the general principles we followed in designing this language and on the main challenges in developing it into a scalable tool for network analysis.

TRANSCRIPT

Page 1: A query language for analyzing networks

A query language for analyzing networks

Anton Dries (based on joint work with Siegfried Nijssen)

Page 2: A query language for analyzing networks

Idea

Declarative language for manipulating and analyzing information networks

“Query language” – cf. SQL

with special focus on querying connections

simplicity / expressivity / flexibility

Page 3: A query language for analyzing networks

Information networks

Objects (“nodes”)

Connections between objects (“edges”)

Focus on structure (“topology”)

a.k.a. “large single graph”

Page 4: A query language for analyzing networks

Information networks

HTTP://SPIKEDMATH.COM/382.HTML

Page 5: A query language for analyzing networks

Information networks

Examples:

World Wide Web

Social networks

Bibliographical

Transportation

Biological

Page 6: A query language for analyzing networks

Process

Top-down approach:

Common tasks

Query language

Operational model (algebra)

Implementation & Optimization

Data management & storage

Page 7: A query language for analyzing networks

Process

Top-down approach:

Common tasks

Query language [CIKM 2009]

Operational model (algebra) [MLG 2010]

Implementation & Optimization: ?

Data management & storage: graph databases (DEX, Neo, ...)

Page 8: A query language for analyzing networks

Common tasks

Feature-based queries

Structure-based queries

Aggregation

Basic graph problems e.g. degree, shortest path

Network analysis (e.g. centrality measures)

...

Mainly path-based queries

Page 9: A query language for analyzing networks

BiQL

“The BISON Query Language”

Page 10: A query language for analyzing networks

[Figure: example bibliographic network. Nodes: publication, author, and keyword (data mining, graphs, machine learning, probabilities). Edges: "author of" from authors to publications and "has keyword" from publications to keywords.]

Page 11: A query language for analyzing networks

[Figure: the bibliographic network (publication, author, and keyword nodes; "author of" and "has keyword" edges) next to the derived co-authorship network, in which author nodes are linked by "co-author" edges.]

Page 12: A query language for analyzing networks

Manipulation

“query language”

SQL-style: loosely based on SQL syntax

One type of query: create set of (new) objects

CREATE/UPDATE Domain<Vars> { Properties }
FROM Path Expression
WHERE Constraints

Page 13: A query language for analyzing networks

Example

[Figure: the bibliographic network (publication, author, and keyword nodes; "author of" and "has keyword" edges) next to the derived co-authorship network of "co-author" edges between authors.]

CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B

Page 14: A query language for analyzing networks

Example

[Figure: the bibliographic network (publication, author, and keyword nodes; "author of" and "has keyword" edges) next to the derived co-authorship network of "co-author" edges between authors.]

CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B

“path expression” – structural selection

“object creation” – output specification

(+ other operations)

Page 15: A query language for analyzing networks

Structural selection

Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B,
Publication P −> HasKeyword −> Keyword K

[Figure: pattern graph in which Author A and Author B are each linked by AuthorOf to Publication P, and Publication P is linked by HasKeyword to Keyword K.]

Author A −> CoAuthor −> Author B −> CoAuthor −> Author C −> CoAuthor −> Author A

[Figure: pattern graph in which Author A, Author B, and Author C form a triangle of CoAuthor links.]

Page 16: A query language for analyzing networks

Structural selection

Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

regular expressions

list variables

each expansion of the regular expression should lead to a valid (simple) path expression defining the same variables

Page 17: A query language for analyzing networks

Structural selection

Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1, n2, n3, n4 and edges e1 (n1 to n2), e2 (n2 to n3), e3 (n1 to n3), e4 (n2 to n4), e5 (n3 to n4).]

Node A −> Edge [E] −> Node B

(A,E,B) = (n1, [e1], n2) (n1, [e3], n3) (n2, [e2], n3) (n2, [e4], n4) (n3, [e5], n4)

Node A −> Edge [E] −> Node −> Edge [E] −> Node B

(A,E,B) = (n1, [e1,e2], n3) (n1, [e1,e4], n4) (n1, [e3,e5], n4)
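The expansion of the starred path expression can be imitated in a few lines of Python. This is only a sketch: the names `EDGES` and `match_paths` are made up for illustration, the edge directions are those implied by the one-edge tuples on the slide, and the `max_edges` bound is an added assumption that keeps the enumeration finite. On this graph the two-edge expansion also produces (n2, [e2,e5], n4), which the slide does not list.

```python
# Edge name -> (source node, target node), as implied by the one-edge tuples.
EDGES = {
    "e1": ("n1", "n2"),
    "e2": ("n2", "n3"),
    "e3": ("n1", "n3"),
    "e4": ("n2", "n4"),
    "e5": ("n3", "n4"),
}


def match_paths(edges, max_edges):
    """Return all (A, E, B) tuples for paths of 1..max_edges edges.

    E is the list variable: the tuple of edge names along the path.
    """
    outgoing = {}
    for name, (src, dst) in edges.items():
        outgoing.setdefault(src, []).append((name, dst))

    results = []
    # Expansion with zero repetitions of the starred group: single edges.
    frontier = [(src, (name,), dst) for name, (src, dst) in sorted(edges.items())]
    for _ in range(max_edges):
        results.extend(frontier)
        # One more repetition of (Node -> Edge [E] ->): extend each path by an edge.
        frontier = [
            (a, e + (name,), nxt)
            for (a, e, b) in frontier
            for name, nxt in outgoing.get(b, [])
        ]
    return results


if __name__ == "__main__":
    for row in match_paths(EDGES, 2):
        print(row)
```

Because nothing prevents a path from revisiting a node, the same loop would never terminate on a cyclic graph without the bound.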

Page 18: A query language for analyzing networks

Output specification

CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B

Reading the first line, CREATE CoAuthor<A,B> { A <−>, B <−> }:

CREATE (or UPDATE): update/create objects

CoAuthor: put them in this domain

<A,B>: for each combination of values

{ A <−>, B <−> }: with these properties

Page 19: A query language for analyzing networks

Output specification

UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1, n2, n3, n4 and edges e1 (n1 to n2), e2 (n2 to n3), e3 (n1 to n3), e4 (n2 to n4), e5 (n3 to n4).]

Matches (A, E, B):

(n1, [e1], n2) (n1, [e3], n3) (n2, [e2], n3) (n2, [e4], n4) (n3, [e5], n4)

(n1, [e1,e2], n3) (n1, [e1,e4], n4) (n1, [e3,e5], n4)

Grouped by <A>:

n1: ([e1], n2) ([e3], n3) ([e1,e2], n3) ([e1,e4], n4) ([e3,e5], n4)

n2: ([e2], n3) ([e4], n4)

n3: ([e5], n4)

Page 20: A query language for analyzing networks

Output specification

UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1..n4 and edges e1..e5.]

Grouped by <A>:

n1: ([e1], n2) ([e3], n3) ([e1,e2], n3) ([e1,e4], n4) ([e3,e5], n4)

n2: ([e2], n3) ([e4], n4)

n3: ([e5], n4)

Page 21: A query language for analyzing networks

Output specification

UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1..n4 and edges e1..e5.]

Grouped by <A>:

n1: ([e1], n2) ([e3], n3) ([e1,e2], n3) ([e1,e4], n4) ([e3,e5], n4)

n2: ([e2], n3) ([e4], n4)

n3: ([e5], n4)

Within each <A>, grouped by <B>:

n1: n2: ([e1]); n3: ([e3]) ([e1,e2]); n4: ([e1,e4]) ([e3,e5])

n2: n3: ([e2]); n4: ([e4])

n3: n4: ([e5])

Page 22: A query language for analyzing networks

Output specification

UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1..n4 and edges e1..e5.]

Within each <A>, grouped by <B>:

n1: n2: ([e1]); n3: ([e3]) ([e1,e2]); n4: ([e1,e4]) ([e3,e5])

n2: n3: ([e2]); n4: ([e4])

n3: n4: ([e5])

Page 23: A query language for analyzing networks

Output specification

UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1..n4 and edges e1..e5.]

Within each <A>, grouped by <B>, with count<B>:

n1: n2: ([e1]); n3: ([e3]) ([e1,e2]); n4: ([e1,e4]) ([e3,e5]). count: 3

n2: n3: ([e2]); n4: ([e4]). count: 2

n3: n4: ([e5]). count: 1

Page 24: A query language for analyzing networks

Output specification

UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1..n4 and edges e1..e5.]

Within each <A>, grouped by <B>, with count<B>:

n1: n2: ([e1]); n3: ([e3]) ([e1,e2]); n4: ([e1,e4]) ([e3,e5]). count: 3

n2: n3: ([e2]); n4: ([e4]). count: 2

n3: n4: ([e5]). count: 1

UPDATE:

n1 nr_reach: 3

n2 nr_reach: 2

n3 nr_reach: 1
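The grouping steps of this walkthrough (group matches by A, collect the distinct B per group, count them) can be mimicked with a dictionary of sets. The sketch below is illustrative only: `MATCHES` copies the (A, E, B) tuples listed on the slides, and `nr_reach` is a hypothetical helper name, not part of BiQL.

```python
from collections import defaultdict

# (A, E, B) tuples as listed on the slides.
MATCHES = [
    ("n1", ("e1",), "n2"),
    ("n1", ("e3",), "n3"),
    ("n2", ("e2",), "n3"),
    ("n2", ("e4",), "n4"),
    ("n3", ("e5",), "n4"),
    ("n1", ("e1", "e2"), "n3"),
    ("n1", ("e1", "e4"), "n4"),
    ("n1", ("e3", "e5"), "n4"),
]


def nr_reach(matches):
    """For each A, count the distinct B values, mirroring count<B> under <A>."""
    reached = defaultdict(set)
    for a, _edges, b in matches:
        reached[a].add(b)  # several paths to the same B collapse into one entry
    return {a: len(bs) for a, bs in reached.items()}


if __name__ == "__main__":
    print(nr_reach(MATCHES))
```

The counts depend only on which B are reachable, not on how many paths lead there, which is why n1 gets 3 even though five of the listed tuples start at n1.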

Page 25: A query language for analyzing networks

Object properties

Attribute definition

strength: count<P>
start: min<P>(P.year)

Link definition

A −>, B −>
P <−

Page 26: A query language for analyzing networks

Examples

Page 27: A query language for analyzing networks

Co-authorship

adding a new relationship

CREATE CoAuthor<A,B> { A −>, B −>, <− P,
    start: min<P>(P.year), end: max<P>(P.year), strength: count<P> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B

[Figure: authors A and B share publications P1 (year: 2008), P2 (year: 2008), and P3 (year: 2010); the resulting CoAuthor object between A and B has strength: 3, start: 2008, end: 2010.]
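As a rough illustration of what this query computes, the following Python sketch derives the three aggregates for every ordered pair of distinct authors. The data mirrors the figure; the names `YEAR`, `AUTHOR_OF`, and `co_authorship` are invented for the example, and dropping pairs with A = B is an assumption of the sketch rather than something the slide states.

```python
from collections import defaultdict

# Toy data mirroring the figure: publication years and (author, publication) links.
YEAR = {"P1": 2008, "P2": 2008, "P3": 2010}
AUTHOR_OF = [
    ("A", "P1"), ("B", "P1"),
    ("A", "P2"), ("B", "P2"),
    ("A", "P3"), ("B", "P3"),
]


def co_authorship(author_of, year):
    """Build one record per ordered author pair (A, B) sharing a publication."""
    authors_by_pub = defaultdict(set)
    for author, pub in author_of:
        authors_by_pub[pub].add(author)

    shared = defaultdict(set)  # (A, B) -> set of shared publications P
    for pub, authors in authors_by_pub.items():
        for a in authors:
            for b in authors:
                if a != b:  # assumption: no self co-authorship
                    shared[(a, b)].add(pub)

    return {
        pair: {
            "strength": len(pubs),                 # count<P>
            "start": min(year[p] for p in pubs),   # min<P>(P.year)
            "end": max(year[p] for p in pubs),     # max<P>(P.year)
        }
        for pair, pubs in shared.items()
    }


if __name__ == "__main__":
    print(co_authorship(AUTHOR_OF, YEAR)[("A", "B")])
```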

Page 28: A query language for analyzing networks

Size of neighborhood

transitive closure

UPDATE <A> { netsize: count<B> }
FROM Author A −> (CoAuthor [co] <− Author −>)* CoAuthor [co] <− Author B
WHERE length(co) < 4
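One way to read this query is "count the authors within three co-author hops of A". A breadth-first search with a hop limit gives that number directly. The sketch below rests on assumptions: `CO_AUTHORS` is a made-up symmetric adjacency map, `neighborhood_size` is a hypothetical helper, and A itself is left out of the count, which the query as written may or may not do.

```python
from collections import deque

# Hypothetical co-author chain: a - b - c - d - e (symmetric links).
CO_AUTHORS = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}


def neighborhood_size(co_authors, start, max_hops=3):
    """Count authors reachable from start in at most max_hops co-author links."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        author, hops = queue.popleft()
        if hops == max_hops:
            continue  # corresponds to WHERE length(co) < 4 when max_hops is 3
        for other in co_authors.get(author, ()):
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return len(seen) - 1  # assumption: the start author is not counted


if __name__ == "__main__":
    print({a: neighborhood_size(CO_AUTHORS, a) for a in CO_AUTHORS})
```

Unlike enumerating every path, the search visits each author once, which hints at why a specialised operator can beat generic pattern matching for this kind of query.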

Page 29: A query language for analyzing networks

Distance

based on shortest path

CREATE Connection<A,B> { A −>, −> B, distance: min<E>(length(E)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Variants of the distance attribute:

distance: min<E>(length(E))

distance: min<E>(sum(E.weight))

distance: max<E>(product(E.probability))

Page 30: A query language for analyzing networks

Centrality measures

closeness centrality

$C_C(v) = \frac{1}{\sum_{t \in V} \mathrm{dist}(v, t)}$

UPDATE <A> { closeness: 1/sum<B>(min<AB>(AB.distance)) }
FROM Node A −> Connection AB −> Node B

degree centrality

$C_D(v) = \frac{\deg(v)}{n - 1}$

UPDATE <A> { Cdegree: count<B>/(count<N>-1) }
FROM Node A −− Edge −− Node B, Node N
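For comparison, both measures are easy to compute directly on a small graph. The sketch below assumes an undirected, unweighted, connected graph stored as an adjacency map; `ADJ`, `bfs_distances`, `degree_centrality`, and `closeness_centrality` are names chosen for this example only.

```python
from collections import deque

# Hypothetical path graph a - b - c.
ADJ = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def degree_centrality(adj, v):
    """C_D(v) = deg(v) / (n - 1)."""
    return len(adj[v]) / (len(adj) - 1)


def closeness_centrality(adj, v):
    """C_C(v) = 1 / sum of dist(v, t) over all other nodes t (graph assumed connected)."""
    dist = bfs_distances(adj, v)
    return 1 / sum(d for t, d in dist.items() if t != v)


if __name__ == "__main__":
    for v in sorted(ADJ):
        print(v, degree_centrality(ADJ, v), closeness_centrality(ADJ, v))
```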

Page 31: A query language for analyzing networks

Query execution

Page 32: A query language for analyzing networks

Operational model

Query algebra operators:

Evaluate path expression (graph –> tuple)

Relational algebra (tuple –> tuple)

Construction operator (tuple –> graph)

Used by prototype implementation

Page 33: A query language for analyzing networks

Operational model

“Pattern match” operator is too broad

Enumerates all paths

exponential

e.g. even when only shortest path is requested

Need for atomic graph operations (open question)

Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Page 34: A query language for analyzing networks

Pattern matching

Homomorphism matching (no cycle check)

more efficient than isomorphism

cycles could lead to unbounded solutions

Use constraints and algebraic solutions to avoid infinite processing

operator interaction – “pattern match” operator not atomic enough

Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Page 35: A query language for analyzing networks

Avoiding unbounded solutions

CREATE Distance<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

CREATE ConnectionWeight<A,B> { A −>, −> B, distance: sum<E>(product(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

CREATE PathCount<A,B> { A −>, −> B, numP: count<E> }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Page 36: A query language for analyzing networks

Fletcher’s algorithm

FOR k = 1..n
    FOR i = 1..n
        FOR j = 1..n
            C_{k,i,j} = C_{k-1,i,j} ⊕ (C_{k-1,i,k} ⊙ C_{k-1,k,k}* ⊙ C_{k-1,k,j})
    C_{k,k,k} = e⊙ ⊕ C_{k,k,k}

where

(S, ⊕, ⊙, e⊕, e⊙): an algebraic semiring

n: number of nodes in the graph

C_{0,i,j}: weighted adjacency matrix

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...: closure operator

[FLETCHER, 1980] [BATAGELJ, 1994]
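To make the recurrence concrete, here is a small Python sketch that takes the semiring as a parameter. It is an illustration under assumptions, not the prototype's code: the `Semiring` class, the `fletcher` function, and the edge weights in `W` are invented for the example, and the (min, +) instance assumes non-negative weights so that the closure is always 0.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # the ⊕ operation
    times: Callable[[float, float], float]  # the ⊙ operation
    zero: float                             # e⊕, used for "no edge"
    one: float                              # e⊙, the empty path
    star: Callable[[float], float]          # closure a*


def fletcher(weights: List[List[float]], sr: Semiring) -> List[List[float]]:
    """Apply the recurrence for k = 1..n to a weighted adjacency matrix."""
    n = len(weights)
    c = [row[:] for row in weights]
    for k in range(n):
        prev = [row[:] for row in c]   # C_{k-1}
        loop = sr.star(prev[k][k])     # C_{k-1,k,k}*
        for i in range(n):
            for j in range(n):
                via_k = sr.times(sr.times(prev[i][k], loop), prev[k][j])
                c[i][j] = sr.plus(prev[i][j], via_k)
        c[k][k] = sr.plus(sr.one, c[k][k])
    return c


# Shortest paths: (R+, min, +, inf, 0); a* = 0 for any a >= 0.
SHORTEST = Semiring(plus=min, times=lambda a, b: a + b,
                    zero=math.inf, one=0.0, star=lambda a: 0.0)

INF = SHORTEST.zero
# Hypothetical weights on the n1..n4 example graph: e1=1, e2=1, e3=5, e4=5, e5=1.
W = [
    [INF, 1, 5, INF],
    [INF, INF, 1, 5],
    [INF, INF, INF, 1],
    [INF, INF, INF, INF],
]

if __name__ == "__main__":
    for row in fletcher(W, SHORTEST):
        print(row)
```

Swapping in a different `Semiring` instance changes the problem being solved while the three nested loops stay the same.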

Page 37: A query language for analyzing networks

Fletcher’s algorithm

Dynamic programming approach

At step k: C_{k,i,j} contains the solution over paths from i to j whose intermediate nodes are limited to nodes 1...k

Some examples ...

Page 38: A query language for analyzing networks

Fletcher’s algorithm

Floyd-Warshall shortest path algorithm

(S, ⊕, ⊙, e⊕, e⊙) = (ℝ+, min, +, ∞, 0)

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...

C_{k,k}* = min(0, C_{k,k}, 2C_{k,k}, 3C_{k,k}, ...) = 0   (C_{k,k} >= 0)

FOR k = 1..n
    FOR i = 1..n
        FOR j = 1..n
            C_{k,i,j} = min(C_{k-1,i,j}, C_{k-1,i,k} + C_{k-1,k,j})
    C_{k,k,k} = 0

Page 39: A query language for analyzing networks

Fletcher’s algorithm

sum of all path weights

(S, ⊕, ⊙, e⊕, e⊙) = ([0,1], +, ·, 0, 1)

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...

C_{k,k}* = 1 + C_{k,k} + C_{k,k}^2 + C_{k,k}^3 + ... = 1 / (1 - C_{k,k})   (|C_{k,k}| < 1)

FOR k = 1..n
    FOR i = 1..n
        FOR j = 1..n
            C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
    C_{k,k,k} = 1 + C_{k,k,k}

Page 40: A query language for analyzing networks

Fletcher’s algorithm

number of paths

(S, ⊕, ⊙, e⊕, e⊙) = (ℕ, +, ·, 0, 1)

a* = 1 + a + a^2 + a^3 + ...

C_{k,k}* = 1   (C_{k,k} = 0): no cycle k–>k

C_{k,k}* = ∞   (C_{k,k} > 0): cycle k–>k

FOR k = 1..n
    FOR i = 1..n
        FOR j = 1..n
            C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
    C_{k,k,k} = 1 + C_{k,k,k}
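The (+, ·) instantiations for counting paths and for summing path weights can be tried out with a compact version of the recurrence. Everything here is a sketch: `fletcher`, `mul`, `count_star`, `geometric_star`, and the two test matrices are illustrative names and data, and `mul` treats 0 as absorbing so that an infinite closure never meets a zero factor.

```python
import math


def fletcher(weights, plus, times, one, star):
    """Closure of a square matrix under the semiring given by plus, times, one, star."""
    n = len(weights)
    c = [row[:] for row in weights]
    for k in range(n):
        prev = [row[:] for row in c]
        loop = star(prev[k][k])
        for i in range(n):
            for j in range(n):
                c[i][j] = plus(prev[i][j], times(times(prev[i][k], loop), prev[k][j]))
        c[k][k] = plus(one, c[k][k])
    return c


def add(a, b):
    return a + b


def mul(a, b):
    # 0 annihilates, even against an infinite closure (avoids 0 * inf = nan).
    return 0 if a == 0 or b == 0 else a * b


def count_star(a):
    # Number of paths: no cycle through k gives 1, a cycle gives infinitely many.
    return 1 if a == 0 else math.inf


def geometric_star(a):
    # Sum of path weights: 1 + a + a^2 + ... = 1 / (1 - a), valid for |a| < 1.
    return 1 / (1 - a)


# 0/1 adjacency matrix of the acyclic n1..n4 example graph.
DAG = [[0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
path_counts = fletcher(DAG, add, mul, 1, count_star)

# Hypothetical two-node cycle with probability 0.5 on each edge.
CYCLE = [[0.0, 0.5], [0.5, 0.0]]
weight_sums = fletcher(CYCLE, add, mul, 1.0, geometric_star)

if __name__ == "__main__":
    print(path_counts)
    print(weight_sums)
```

On the cyclic example the paths from the first node to the second have weights 0.5, 0.5^3, 0.5^5, and so on, so the closed form 1 / (1 - a) is what lets the algorithm return a finite answer without enumerating them.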

Page 41: A query language for analyzing networks

Fletcher’s algorithm

Generalized algorithm for several connectivity problems

O(n^3) time complexity, O(n^3) or O(n^2) space complexity

for many problems: best known time complexity (exact, for arbitrary graphs)

also in the presence of cycles (thanks to the C_{k-1,k,k}* term)

Applicability depends on constraints on path

Page 42: A query language for analyzing networks

Fletcher’s algorithm

CREATE Connection<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
WHERE A.color = ‘blue’

(S, ⊕, ⊙, e⊕, e⊙) = (ℝ, min, +, ∞, 0)

if e1e2 matches path expression then e1 and e2 must match path expression

=> has to compute all pair shortest paths

Page 43: A query language for analyzing networks

Conclusion

A query language for analyzing networks

Focused on path-based analysis

Raises interesting questions

Some ideas on implementation and optimization

Page 44: A query language for analyzing networks

Future work

Need for atomic graph operations

Fletcher’s algorithm:

interaction with constraints

complex path expressions (not just Node-Edge-Node)

Approximate answers: O(n^3) is very bad

Other metrics: flow-based, PageRank, ... mining

Page 45: A query language for analyzing networks

Thank you