Shortest paths on large graphs:Systems, Algorithms, Applications

Andrey Gubichev

TU München

January 2012

Andrey Gubichev Shortest paths on large graphs 1 / 53

Outline

Introduction

Systems

Algorithms

Applications

• Semantic Web

• Social Search

Everything is a graph

[Figures: Internet Graph, Richardson; Web Graph; Social Network; Wikipedia, Tulip; Proteins, Bordalier Inst]

RDF: format for graph data

[Figure: RDF graph around Marie Curie. Nodes: Marie Curie, Maria Sklodowska, U Paris, Warsaw, Poland, 1867, 1934, Pierre Curie, Henri Becquerel, Nobel Prize Chemistry, Nobel Prize Physics. Edge labels: bornIn, bornOn, diedOn, bornAs, marriedTo, in, hasWon, almamater, adviser.]

RDF:

(id1, Name, "Marie Curie")
(id1, bornOn, 1867)
(id1, bornIn, id2)
(id2, Name, "Warsaw")
(id2, locatedIn, id3)
(id3, Name, "Poland")

(G.Weikum, WSDM’09)

• pay-as-you-go: schema-agnostic, schema-later

• RDF triples form ER graph
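A minimal sketch of how the triples on this slide form a graph, assuming each triple is held as a plain Python tuple. The helper name `build_graph` is illustrative and not from the talk.

```python
from collections import defaultdict

# The six example triples from the slide, as (subject, predicate, object) tuples.
triples = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", 1867),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
    ("id2", "locatedIn", "id3"),
    ("id3", "Name", "Poland"),
]


def build_graph(triples):
    """Group triples by subject: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


graph = build_graph(triples)
print(graph["id2"])  # outgoing edges of the node that stands for Warsaw
```

No schema is declared anywhere: a new predicate is just a new tuple, which is the "schema-later" property the bullets refer to.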

RDF: a lot of data out there

Linked Data Project, linkeddata.org

Linked Data: extract explicit knowledge (ER-oriented facts) from the world's best information sources (Wikipedia, Web, Web 2.0)

SPARQL: a query language

Select ?c
Where
{
  ?p isa scientist.
  ?p bornIn ?t.
  ?p hasWon ?a.
  ?t locatedIn ?c.
  ?a Name NobelPrize.
}

• SQL-like syntax

• triple patterns

• common variables form joins

SPARQL: a query language for RDF

Select ?c
Where
{
  ?p isa scientist.
  ?p bornIn ?t.
  ?p hasWon ?a.
  ?t locatedIn ?c.
  ?a Name NobelPrize.
  Filter (?t < 1900)
}

• SQL-like syntax

• triple patterns

• common variables form joins

• filter predicates

SPARQL: a query language

Select Distinct ?c
Where
{
  ?p ?r1 ?t.
  ?t ?r2 ?c.
  ?c isa Country.
  ?p bornOn ?b.
  Filter (?b > 1945)
}

• SQL-like syntax

• triple patterns

• common variables form joins

• filter predicates

• wildcard joins
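To illustrate how triple patterns with common variables turn into joins, here is a small sketch of a pattern matcher. It is an assumption-laden toy: the data is hypothetical and deliberately incomplete, the function names `match_pattern` and `evaluate` are not from any SPARQL engine, and it uses naive nested-loop joins rather than the techniques a real engine would use.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on a conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # variable position
            if extended.setdefault(term, value) != value:
                return None                           # clashes with an earlier binding
        elif term != value:                           # constant position
            return None
    return extended


def evaluate(patterns, triples):
    """Evaluate the patterns left to right; shared variables act as join conditions."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical toy data (not from the talk); "becquerel" has no hasWon triple here.
triples = [
    ("curie", "isa", "scientist"),
    ("curie", "bornIn", "warsaw"),
    ("curie", "hasWon", "prize1"),
    ("warsaw", "locatedIn", "poland"),
    ("prize1", "Name", "NobelPrize"),
    ("becquerel", "isa", "scientist"),
    ("becquerel", "bornIn", "paris"),
    ("paris", "locatedIn", "france"),
]

# The first query of this section: countries where Nobel-winning scientists were born.
query = [
    ("?p", "isa", "scientist"),
    ("?p", "bornIn", "?t"),
    ("?p", "hasWon", "?a"),
    ("?t", "locatedIn", "?c"),
    ("?a", "Name", "NobelPrize"),
]

countries = {b["?c"] for b in evaluate(query, triples)}
print(countries)
```

A wildcard join is the same mechanism with a variable in the predicate position, for example `("?p", "?r1", "?t")`.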

RDF & SPARQL Engines

Giant triples table: Sesame/OpenRDF, YARS2 (DERI)

S    P       O
id1  Name    Marie Curie
id1  bornOn  1867
id1  bornIn  id2
id2  Name    Warsaw
...

Clustered property tables: Jena (HP Labs), Oracle RDF MATCH

Person
S    Name     bornOn  bornIn  ...
id1  Marie C  1867    id3     ...
id2  Henri B  1852    id9     ...
...

Town
S    Name    Country
id3  Warsaw  id11
...

Property table: C-Store (MIT), MonetDB (CWI)

bornOn
S    O
id1  1867
id5  1852
...

Advisor
S    O
id1  id5
...

Why a new engine?

Three main things in database design:

1. Performance

2. Performance

3. Performance

Scalable Semantic Web: RDF-3X Engine [T. Neumann et al.: VLDB’08]

• tuning-free system architecture: giant triple table

• map literals into ids (dictionary) and precompute exhaustive indexing for SPO triples: SPO, SOP, OPS, OSP, PSO, POS, SP*, SO*, OS*, PO*, OP*, S*, P*, O*

• very high compression, index-only store

• directly store indexes in clustered B+ trees

• can choose any order for scan and join

• also store two mapping indexes: literal → id, id → literal

• efficient merge joins with order-preservation

Dictionary mapping of the triples table (literal → ID):

S    P       O                 S  P  O
id1  Name    Marie Curie   →   1  3  4
id1  bornOn  1867          →   1  5  6
id1  bornIn  id2           →   1  7  2
id2  Name    Warsaw        →   2  3  8
...                            ...

ID triples in (P, O, S) order:

P  O  S
3  4  1
3  8  2
5  6  1
7  2  1
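The dictionary mapping and the sorted permutations can be sketched in a few lines. This is only an illustration under stated assumptions: ids are handed out to subjects first so that the numbers match the slide's example, a sorted Python list stands in for a clustered B+ tree, and none of the names (`build_dictionary`, `scan_predicate`) come from RDF-3X itself.

```python
from bisect import bisect_left

triples = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
]


def build_dictionary(triples):
    """Map every term to an integer id (subjects first, then order of appearance)."""
    ids = {}
    for s, _, _ in triples:
        ids.setdefault(s, len(ids) + 1)
    for triple in triples:
        for term in triple:
            ids.setdefault(term, len(ids) + 1)
    return ids


ids = build_dictionary(triples)
encoded = [tuple(ids[term] for term in t) for t in triples]

# Two of the six permutations; each one is simply the same data in another sort order.
spo = sorted(encoded)
pos = sorted((p, o, s) for s, p, o in encoded)


def scan_predicate(pos_index, predicate_id):
    """Range scan on the POS order: all entries whose first component is predicate_id."""
    lo = bisect_left(pos_index, (predicate_id,))
    hi = bisect_left(pos_index, (predicate_id + 1,))
    return pos_index[lo:hi]


print(pos)
print(scan_predicate(pos, ids["Name"]))
```

Because every permutation is kept sorted, a scan on any prefix returns its results in order, which is what makes order-preserving merge joins possible.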

RDF-3X Query Optimization [T. Neumann et al.: VLDB’08]

• bottom-up dynamic programming for plan enumeration

• exploit numerous indexes, order-preservation

• cost model based on selectivity estimation

Evaluation [T. Neumann et al.: SIGMOD’09]

• Queries like: find a Polish scientist with a French advisor, both of whom got some awards

• YAGO knowledge base: 40 million triples

• Billion Triple dataset, Uniprot (845 million): similar results

Try it out!

RDF-3X is freely available: http://code.google.com/p/rdf3x/

What is missing?

What kinds of queries CAN we answer?

• Find the latitude and longitude of the Eiffel Tower

• Find politicians who are also scientists

What kinds of queries can we NOT answer?

• Find common things between Angela Merkel and Arnold Schwarzenegger

• Find all European-born Nobel prize winners

Why?

They require path traversals over the RDF graph.

Why is SPARQL not enough?

Sometimes we need to form join chains of unknown length (e.g., we need the transitive closure of the predicate).

Example triples:

Humboldt bornIn Berlin.
Berlin locatedIn Germany.

Example triples:

Einstein bornIn Ulm.
Ulm locatedIn Baden-Württemberg.
Baden-Württemberg locatedIn Germany.

Were they both born in Germany? Yes.

How to figure that out?

Einstein --bornIn--> Ulm --locatedIn--> Baden-Württemberg --locatedIn--> Germany

Humboldt --bornIn--> Berlin --locatedIn--> Germany

How to find all scientists that were born in Germany?

SPARQL

?person bornIn ?place. ?place locatedIn Germany.
UNION
?person bornIn ?place. ?place locatedIn ?place1. ?place1 locatedIn Germany.
UNION
...

SPARQL with paths

?person bornIn ?place. ?place ??path Germany.
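What the path variable has to compute can be sketched as a traversal that follows `locatedIn` edges for as many hops as needed. This is a toy under stated assumptions: triples live in a Python list, the Curie/Warsaw/Poland triples are added here as a negative example and are not on the slide, and the function names are hypothetical.

```python
from collections import deque

triples = [
    ("Humboldt", "bornIn", "Berlin"),
    ("Berlin", "locatedIn", "Germany"),
    ("Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Baden-Württemberg"),
    ("Baden-Württemberg", "locatedIn", "Germany"),
    # Added for illustration only: a person who must not match.
    ("Curie", "bornIn", "Warsaw"),
    ("Warsaw", "locatedIn", "Poland"),
]


def reachable(start, predicate, target, triples):
    """Breadth-first search that follows only edges labelled `predicate`."""
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for s, p, o in triples:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return False


def born_in(region, triples):
    """People whose birthplace reaches `region` over any number of locatedIn hops."""
    return sorted(
        s for s, p, o in triples
        if p == "bornIn" and reachable(o, "locatedIn", region, triples)
    )


print(born_in("Germany", triples))
```

The chain length never appears in the query, which is exactly what the UNION formulation cannot express.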

SPARQL with path variables

Introduced by K. Anyanwu et al. (WWW’07)

• Example: select ??p ?obj where {?place ??path Germany} (path triple)

• ??p: there exists a path from place to Germany in the RDF graph

• we consider only shortest paths

• we can specify filter (conditions) on ??p

• we can join such path patterns with regular patterns

Example

select ?name where {
  ?m type Mountain.
  ?m hasName ?name.
  ?m ??location Europe.
  filter(ContainsOnly(??location, locatedIn))
}

How to execute SPARQL with path variables? [A. Gubichev et al.: WebDB’11]

We build upon RDF-3X. Two goals:

• Query Optimization: How to estimate cardinality of path triples?

• Physical Level: How to perform path scan efficiently?

Can we do better?

• Dijkstra’s algorithm is fine, but let’s consider approximate algorithms (trade quality for speed)

• Let’s change the setting for now: shortest paths on a social network

Social network:

• a set of people

• a social relationship linking them

Problem Statement

Exact shortest path:

• V: users, E: "friend of" relationships

• Graph G(V, E): directed, unweighted, static

• Given u, v ∈ V, find the shortest path from u to v

Approximate shortest path:

• Graph is disk-resident

• Offline step: do some precomputation, store on disk

• Online step: for u, v ∈ V quickly find some path from u to v

• Approximation error: $\frac{|\text{approximate}| - |\text{exact}|}{|\text{exact}|}$

Different approaches

Exact SP

• Dijkstra: very slow

• A*: works well for road networks, slow for online social networks (OSN)

• Hierarchy-based decomposition: works well for road networks, slow for OSN

Approximate SP

• Different types of preprocessing: keep distances from all nodes to a small subset of nodes (random, with high degree or centrality)

• Poor results for OSN: average error is ≥ 10%

• Find just the distance, not the path itself

Precomputation

Step 1: Set $r = \lfloor \log |V| \rfloor$

Step 2: Sample r + 1 sets of nodes (uniformly, at random) of sizes $1, 2, 2^2, 2^3, \dots, 2^r$

Step 3: For every u ∈ V and for every set S

1. Find the closest nodes to u in S (landmarks):

landmark h ∈ S : dist(u, h) = dist(u, S)

landmark h′ ∈ S : dist(h′, u) = dist(S, u)

2. Find the distance from u to h and from h′ to u

Precomputation - WSDM’10 approach [A. Das Sarma et al.: WSDM’10]

[Figure: node u linked to its landmarks h1 ∈ S1, h2 ∈ S2, ..., hr ∈ Sr; the edges are labelled with distances 2, 3, ..., 1]

Sketch in RDF:

〈u〉 〈2〉 〈h1〉
〈u〉 〈3〉 〈h2〉
· · ·
〈u〉 〈1〉 〈hr〉

Precomputation - our approach [A. Gubichev et al.: CIKM’10]

[Figure: node u linked to its landmarks h1 ∈ S1, h2 ∈ S2, ..., hr ∈ Sr; the paths pass through intermediate nodes x and y]

Sketch in RDF:

〈u〉 〈x〉 〈h1〉
〈u〉 〈x y〉 〈h2〉
· · ·
〈u〉 〈 〉 〈hr〉

Precomputation

Step 1: Set $r = \lfloor \log |V| \rfloor$

Step 2: Sample r + 1 sets of nodes (uniformly, at random) of sizes $1, 2, 2^2, 2^3, \dots, 2^r$

Step 3: For every u ∈ V and for every set S

1. Find the closest nodes to u in S (landmarks):

landmark h ∈ S : dist(u, h) = dist(u, S)

landmark h′ ∈ S : dist(h′, u) = dist(S, u)

2. Find the path from u to h and from h′ to u

3. Store the paths (RDF): 〈u〉 〈path〉 〈h〉, 〈h′〉 〈path′〉 〈u〉

Step 4: Repeat Steps 2-3 k times (we use k = 2).
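The four steps can be sketched in memory as follows. This is an illustration under several assumptions, not the implementation from the talk: the graph is a dict of adjacency lists instead of a disk-resident RDF store, the logarithm is taken to base 2 (the slide does not state the base), one multi-source BFS per sampled set replaces a per-node search, and all function names are hypothetical.

```python
import math
import random
from collections import deque


def bfs_from_set(adj, sources):
    """Multi-source BFS; parent[x] is the previous node on a shortest route from the set to x."""
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def walk_to_source(parent, node):
    """Follow parent pointers from `node` back to the set member that reached it."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def precompute_sketches(adj, k=2, seed=0):
    rng = random.Random(seed)
    nodes = sorted(set(adj) | {v for vs in adj.values() for v in vs})
    reverse = {}
    for u, vs in adj.items():
        for v in vs:
            reverse.setdefault(v, []).append(u)
    r = int(math.floor(math.log2(len(nodes))))           # Step 1 (base 2 assumed)
    to_landmark = {u: [] for u in nodes}                  # paths u -> h
    from_landmark = {u: [] for u in nodes}                # paths h' -> u
    for _ in range(k):                                    # Step 4
        for i in range(r + 1):                            # Step 2: sizes 1, 2, ..., 2^r
            sample = rng.sample(nodes, min(2 ** i, len(nodes)))
            backward = bfs_from_set(reverse, sample)      # finds dist(u, S)
            forward = bfs_from_set(adj, sample)           # finds dist(S, u)
            for u in nodes:                               # Step 3
                if u in backward:
                    to_landmark[u].append(walk_to_source(backward, u))
                if u in forward:
                    from_landmark[u].append(walk_to_source(forward, u)[::-1])
    return to_landmark, from_landmark


# Tiny directed cycle used only as a smoke test.
adj = {1: [2], 2: [3], 3: [4], 4: [1]}
to_landmark, from_landmark = precompute_sketches(adj, k=2, seed=1)
print(to_landmark[1])
```

Searching the reversed graph from the sampled set yields, for every u at once, the nearest landmark that u can reach; the parent pointers then spell out the path from u to that landmark in the original edge direction.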

Sketch

Sketch for a node u consists of

1. Landmarks h1,...,hkr

2. Paths from u to landmarks

3. Paths from landmarks to u

Sketch for u consists of two trees (u is the root)

We keep sketches for every u ∈ V


SKETCH algorithm: online part [A. Das Sarma et al.: WSDM’10]

Input: nodes s, d ∈ V

1. Load all the distances from s

2. Load all the distances to d

3. Find common landmarks

4. Construct the paths

5. Select the shortest distance

Output: distance from s to d
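A sketch of the online step for the distance-only variant, assuming the two sketches have already been loaded as dictionaries from landmark to distance. The landmark names and numbers below are hypothetical, and `estimate_distance` is an illustrative name.

```python
def estimate_distance(dist_from_s, dist_to_d):
    """Upper bound on dist(s, d): best detour through a landmark present in both sketches."""
    common = dist_from_s.keys() & dist_to_d.keys()
    if not common:
        return None  # the sketches share no landmark, so no estimate is available
    return min(dist_from_s[h] + dist_to_d[h] for h in common)


# Hypothetical sketches: landmark -> distance.
dist_from_s = {"h1": 3, "h2": 2}
dist_to_d = {"h1": 4, "h2": 3, "h3": 1}

estimate = estimate_distance(dist_from_s, dist_to_d)
print(estimate)
```

The estimate can never be shorter than the true distance, because every candidate is the length of a real route through a landmark.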

[Figure: nodes s and d connected through common landmarks, with edge distances 3, 4, 2, 3]

SKETCH algorithm with paths [A. Gubichev et al.: CIKM’10]

Input: nodes s, d ∈ V

1. Load all the paths from s

2. Load all the paths to d

3. Find common landmarks

4. Construct the paths

5. Select the shortest path

Output: path from s to d: 〈s x y h z d〉

[Figure: path from s to d through nodes x, y, landmark h, and node z]

Datasets

• Slashdot: 77K nodes, undirected

• YouTube: 1.1 million nodes

• Flickr: 1.7 million nodes

• WikiTalk: 2.2 million nodes

• Twitter: 2.4 million nodes

• Orkut: 3 million nodes, undirected

Sources: Stanford, MPI, Telefonica Research

Approximation error of the Sketch algorithm

$\text{Error} = \frac{|\text{approximate}| - |\text{exact}|}{|\text{exact}|}$

Dataset (#nodes)   Sketch error
Slashdot (77K)     46%
YouTube (1.1M)     30%
Flickr (1.7M)      28%
WikiTalk (2.2M)    55%
Twitter (2.4M)     51%
Orkut (3M)         71%

First modification

We find the path, not just the distance!

Are there cycles?

Construct a shorter path

[Figure: a path from s to d that passes through node a twice; cutting out the cycle leaves a shorter path from s through a to d]
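Cycle elimination on a concatenated path can be sketched as a single pass that cuts back to the first occurrence of any repeated node. This is one straightforward way to do it, written for illustration; the talk does not say how the step is implemented, and `remove_cycles` is a hypothetical name.

```python
def remove_cycles(path):
    """Return the path with every loop cut out, so each node appears at most once."""
    result = []
    position = {}                      # node -> its index in `result`
    for node in path:
        if node in position:
            cut = position[node] + 1   # keep the first occurrence, drop the loop after it
            for dropped in result[cut:]:
                del position[dropped]
            del result[cut:]
        else:
            position[node] = len(result)
            result.append(node)
    return result


# s reaches the landmark via a, and the landmark's path to d passes through a again.
print(remove_cycles(["s", "a", "x", "h", "y", "a", "d"]))
```

The pass is linear in the path length and needs no access to the graph, which is consistent with the modification adding no noticeable query time.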

Approximation error of the first modification

No time overhead!

Dataset (#nodes)   Sketch error   Sketch I error
Slashdot (77K)     46%            26%
YouTube (1.1M)     30%            12%
Flickr (1.7M)      28%            11%
WikiTalk (2.2M)    55%            31%
Twitter (2.4M)     51%            38%
Orkut (3M)         71%            48%

Second modification

Are there any "hidden" connections?

If yes, construct a shorter path

[Figure: a path from s to d; a direct edge between two nodes of the path, marked "?", would give a shorter path]

How to check it?

1. For every node in the path load the list of friends from the original dataset

2. For every pair of nodes from the path check whether they are friends

Number of nodes in the path is usually small!
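The check can be sketched as a greedy pass over the path: from each node, jump to the farthest later node on the path that is a direct friend. The greedy strategy and the name `shortcut` are assumptions made for this example; `friends` stands in for the friend lists loaded from the original dataset.

```python
def shortcut(path, friends):
    """Shorten `path` using direct edges between its own nodes (greedy, farthest jump first)."""
    result = [path[0]]
    i = 0
    while i < len(path) - 1:
        j = i + 1                                   # the existing next hop is always allowed
        for k in range(len(path) - 1, i + 1, -1):   # try the farthest nodes first
            if path[k] in friends.get(path[i], ()):
                j = k
                break
        result.append(path[j])
        i = j
    return result


# Hypothetical friend lists: s also knows c directly, a "hidden" connection.
friends = {"s": {"a", "c"}, "a": {"b"}, "b": {"c"}, "c": {"d"}}
print(shortcut(["s", "a", "b", "c", "d"], friends))
```

Because paths are short, the number of pairs to check stays small even though it grows quadratically with the path length.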

Approximation error of the second modification

Dataset (#nodes)   Sketch error   Sketch I error   Sketch II error
Slashdot (77K)     46%            26%              0.6%
YouTube (1.1M)     30%            12%              0.6%
Flickr (1.7M)      28%            11%              0.3%
WikiTalk (2.2M)    55%            31%              0.2%
Twitter (2.4M)     51%            38%              0.8%
Orkut (3M)         71%            48%              0.6%

Tree algorithm

Paths from a node to landmarks form a tree

[Figure: tree rooted at s, branching out to the landmarks]

• Load paths from s and to d

• Start BFS from s and d

• For every visited node load a list of friends

• For every pair of visited nodes check:

1. are they equal? (s3, d1)

2. are they friends? (s1, d)

• Form a new path and put it into the queue Q

• Don’t go too deep: terminate if level_s + level_d > Q.top.length

[Figure: tree rooted at s with nodes s1 to s5 and tree rooted at d with nodes d1 to d4; s3 coincides with d1, and the search stops once level_s + level_d = 4 > 2]
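A simplified sketch of the pairwise check at the heart of the Tree algorithm. It is not the algorithm as presented: it compares all pairs of tree nodes at once instead of expanding level by level, so it has no queue Q and no early-termination rule. The names and the toy data are hypothetical.

```python
def best_tree_path(paths_from_s, paths_to_d, friends):
    """Join a prefix of an s-path with a suffix of a d-path where nodes are equal or friends."""
    prefix = {}                       # node in the s tree -> shortest known path s..node
    for path in paths_from_s:
        for i, node in enumerate(path):
            if node not in prefix or i + 1 < len(prefix[node]):
                prefix[node] = path[: i + 1]
    suffix = {}                       # node in the d tree -> shortest known path node..d
    for path in paths_to_d:
        for i, node in enumerate(path):
            tail = path[i:]
            if node not in suffix or len(tail) < len(suffix[node]):
                suffix[node] = tail
    best = None
    for a, head in prefix.items():
        for b, tail in suffix.items():
            if a == b:                            # check 1: the nodes are equal
                candidate = head + tail[1:]
            elif b in friends.get(a, ()):         # check 2: there is an edge a -> b
                candidate = head + tail
            else:
                continue
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


# Hypothetical sketches: the landmark route is s, s1, h, d1, d, but s1 knows d directly.
paths_from_s = [["s", "s1", "h"], ["s", "s2"]]
paths_to_d = [["h", "d1", "d"]]
friends = {"s1": {"d"}}

print(best_tree_path(paths_from_s, paths_to_d, friends))
```

The level-by-level expansion in the slides reaches the same kind of candidates but can stop early, once the two search depths together exceed the length of the best path found so far.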

Approximation error of the Tree algorithm

Dataset    Sketch error   Sketch I error   Sketch II error   Tree error
Slashdot   46%            26%              0.6%              0
YouTube    30%            12%              0.6%              0.06%
Flickr     28%            11%              0.3%              0.04%
WikiTalk   55%            31%              0.2%              0
Twitter    51%            38%              0.8%              0.03%
Orkut      71%            48%              0.6%              0.1%

Experimental setup

• Pick 100 nodes (uniformly at random) from the OSN

• For each node compute the shortest path tree (Dijkstra)

• The result is {(x, y, dist) | x, y ∈ V, dist = dist(x, y)}

• Group triples by distance and randomly choose 50 triples from every group

• For every chosen triple (x, y, dist): find approximate shortest paths from x to y and compare their lengths with dist
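The sampling procedure can be sketched as follows. Assumptions: the graph fits in a dict of adjacency lists, plain BFS replaces Dijkstra (equivalent on an unweighted graph), and the function names and the four-node chain are made up for the example.

```python
import random
from collections import defaultdict, deque


def bfs_distances(adj, source):
    """Exact hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sample_test_pairs(adj, n_sources=100, per_group=50, seed=0):
    """Pick random sources, group (x, y, dist) triples by distance, sample from each group."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    groups = defaultdict(list)
    for x in rng.sample(nodes, min(n_sources, len(nodes))):
        for y, d in bfs_distances(adj, x).items():
            if d > 0:
                groups[d].append((x, y, d))
    return {d: rng.sample(ts, min(per_group, len(ts))) for d, ts in groups.items()}


def approximation_error(approx_length, exact_length):
    """(|approximate| - |exact|) / |exact|, as defined in the problem statement."""
    return (approx_length - exact_length) / exact_length


# Toy directed chain 0 -> 1 -> 2 -> 3.
chain = {0: [1], 1: [2], 2: [3], 3: []}
pairs = sample_test_pairs(chain, n_sources=4, per_group=2, seed=0)
print(sorted(pairs))
print(approximation_error(5, 4))
```

Grouping by distance before sampling keeps long paths from being drowned out by the far more numerous short ones.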

Implementation details

• Datasets in RDF: 〈user1〉 〈friend-of〉 〈user2〉

• Precomputed paths in RDF:

〈u〉 〈path〉 〈h〉

〈h′〉 〈path′〉 〈u〉

• RDF-3X for datasets and precomputed data

• C++

• Laptop: 2.0 GHz Intel Core 2 Duo, 4 GB RAM, 3 MB L2 cache

Time

Dataset (#nodes)   Sketch (sec)   Sketch II (sec)   Tree (sec)   Dijkstra (sec)   Dijkstra (queue)
Flickr (1.7M)      1.2            2.1               1.9          73               696K
WikiTalk (2.2M)    0.7            1.4               1.7          101              2M
Twitter (2.4M)     1.9            3.9               4.0          119              1.1M
Orkut (3M)         1.1            2.6               2.7          503              2.5M

Disk space

Disk space for precomputed data, GB

Dataset    Dataset size   Sketch with distances   Sketch with paths
Flickr     0.57           2.3                     4.4
WikiTalk   0.22           1.9                     2.1
Twitter    1.3            3.4                     6.1
Orkut      5.6            6.0                     7.4

Number of shortest paths

We find several shortest paths:

Dataset (#nodes)   Sketch II   Tree
Flickr (1.7M)      33.3        55.6
WikiTalk (2.2M)    18.6        50.7
Twitter (2.4M)     45.5        92
Orkut (3M)         9.5         30

Application #1: Semantic Web

• SPARQL v1.1: SPARQL + path traversal

• Querying the DB of entire human knowledge (everything that Wikipedia knows)

Small World

Milgram 1967

• People are given letters, asked to forward to one friend

• Source: random Omahans; Target: stockbroker in Sharon, MA

• Of completed chains, averaged 6 hops to reach target

Shortest paths on Social Networks

Shortest paths are interesting...

• per se:

  • what is the distance between you and Angela Merkel?

  • for geeks: Erdős number

• as an important primitive for

  • social network analysis (diameter, centrality, etc.)

  • social search

• Of course, we can do one-to-many shortest paths algo

John searches Mary

Ranking:

1. Mary A

2. Mary B

3. Mary C

M. Potamias et al. CIKM 2009
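The ranking in the example can be sketched as a one-to-many search from the person who issues the query, followed by a sort on distance. The friendship graph below is invented for illustration, and `rank_by_distance` is a hypothetical helper rather than anything from the cited work.

```python
from collections import deque


def bfs_distances(adj, source):
    """One-to-many hop distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def rank_by_distance(adj, searcher, candidates):
    """Order candidates by graph distance from the searcher; unreachable ones go last."""
    dist = bfs_distances(adj, searcher)
    return sorted(candidates, key=lambda c: (dist.get(c, float("inf")), c))


# Hypothetical friendship graph.
friend_graph = {
    "John": ["Mary A", "Bob"],
    "Bob": ["Mary B"],
    "Mary B": ["Carol"],
    "Carol": ["Mary C"],
}

ranking = rank_by_distance(friend_graph, "John", ["Mary C", "Mary B", "Mary A"])
print(ranking)
```

On a large network the exact search would be replaced by the approximate sketch-based distances discussed earlier in the talk.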

Acknowledgements

• Srikanta Bedathur

• Gerhard Weikum

• Josep M. Pujol

• Thomas Neumann

• Sihem Amer-Yahia

Thank you! Questions?