
Jaccard Coefficients as a Potential Graph Benchmark

graphanalysis.org/IPDPS2016-GABB/Kogge.pdf

GABB: May 23, 2016

Peter M. Kogge

McCourtney Prof. of CSE, Univ. of Notre Dame

IBM Fellow (retired)

Please Sir, I want more


Outline

• Motivation

• Jaccard Coefficients

• A MapReduce Baseline

• Variations

• A Possible Heuristic

• Suggested Benchmarks

• Key Questions


Graph 500

• Start with a root, find reachable vertices

• Simplifications: only 1 kind of edge, no weights

• Performance metric: TEPS: Traversed Edges/sec

[Figure: example graph with vertices 0, 1, 2, 3, 5, 7, 8, 9 and edges e0 to e8]

Starting at 1: 1, 0, 3, 2, 9, 5

Scale = log2(# vertices)

Level    Scale  Size    Vertices (Billion)  TB     Bytes/Vertex
10       26     Toy     0.1                 0.02   281.8048
11       29     Mini    0.5                 0.14   281.3952
12       32     Small   4.3                 1.1    281.472
13       36     Medium  68.7                17.6   281.4752
14       39     Large   549.8               141    281.475
15       42     Huge    4398.0              1,126  281.475
Average                                            281.5162
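The BFS kernel and the TEPS metric described above can be sketched roughly as follows. The toy adjacency list is hypothetical (it is not the graph from the figure), and counting every scanned adjacency entry as one traversed edge is an assumption made for illustration; the official Graph 500 rules may count edges differently.

```python
from collections import deque
import time

def bfs(adj, root):
    """Return (visit order of reachable vertices, adjacency entries scanned)."""
    seen = {root}
    order = [root]
    queue = deque([root])
    edges_scanned = 0
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            edges_scanned += 1          # assumption: one scan = one traversed edge
            if w not in seen:
                seen.add(w)
                order.append(w)
                queue.append(w)
    return order, edges_scanned

# Hypothetical toy graph: one kind of edge, no weights.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: [5], 5: [4]}

start = time.perf_counter()
order, edges_scanned = bfs(adj, root=1)
elapsed = time.perf_counter() - start

# TEPS = traversed edges per second for this single search.
teps = edges_scanned / elapsed if elapsed > 0 else float("inf")
print(order, edges_scanned)
```

Vertices 4 and 5 are never reached from root 1, which illustrates that the work is bounded by the edges of the reachable component, i.e. O(E).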


Limitations of BFS as a Benchmark

Has provided a rich set of new algorithms & implementations, BUT:

• Complexity is only O(E)

– E = # edges

• Only a batch algorithm

– Must investigate virtually the entire static graph

• No natural incremental variant

• Little non-academic applicability

– Commercial apps are much more focused on limited neighborhoods


Big Graph Relationship Problems

• Tough Problem: Find vertex pairs that “share some common property”

• Related graph problem: computing “Jaccard coefficients”

• Commercial version: “Non-obvious Relationship Problems” (NORA)

Example questions:

– Which 2 entities share >1 addresses?

– What are the pairs of people that follow the same set of Twitter feeds?

[Figure: example relationship graph. Legend: Entities (e.g. people); Alias (e.g. last name); Primary Basis (e.g. street address); Secondary Basis (e.g. Apt. #); Dedup values (e.g. zip code, state). Annotation: “This is Mayo guy”.]

From Burkhardt & Waring, “An NSA Big Graph Experiment”: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf


Sample Real World Problem

• 40+ TB of Raw Data

• Periodically clean up & combine to 4-7 TB

• Weekly “Boil the Ocean” to precompute answers to all standard queries

– Does W have financial difficulties?

– Does X have legal problems?

– Has Y had significant driving problems?

– Who has shared addresses with Z?

– …

Auto Insurance Co: “Tell me about giving auto policy to Jane Doe” in < 0.1 sec

• Look up answers to precomputed queries for “Jane Doe” and combine

• Answer: “Jane Doe has no indicators, but she has shared multiple addresses with Joe Scofflaw, who has the following negative indicators ….”

[Figure label: Relationships]


A 2012 Relationship

• Given: 14.2 billion records from

– 800 million entities (North American people, businesses)

– 100 million addresses

• Goal: given an entity, find all other entities that

– Share at least 2 addresses in common

– Or have one address in common and a last name that is “close”

• Matching last names requires processing to check for typos (“Levenshtein distance”)

• The above is one of dozens of relationships
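As a rough illustration of the relationship rule above, the sketch below pairs entities that share at least 2 addresses, or share 1 address and have last names within a small edit distance. The record layout `(entity_id, last_name, address)`, the function names, and the cutoff `max_dist=1` for a “close” name are all assumptions for the example, not details of the production system.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def related_pairs(records, max_dist=1):
    """records: iterable of (entity_id, last_name, address) tuples.
    Assumes one last name per entity (the last one seen wins)."""
    lname = {}
    by_addr = defaultdict(set)
    for eid, name, addr in records:
        lname[eid] = name
        by_addr[addr].add(eid)
    shared = defaultdict(int)                 # (id1, id2) -> # shared addresses
    for ids in by_addr.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {
        (a, b)
        for (a, b), n in shared.items()
        if n >= 2 or levenshtein(lname[a].lower(), lname[b].lower()) <= max_dist
    }

records = [
    (1, "Smith", "A"), (1, "Smith", "B"),
    (2, "Jones", "A"), (2, "Jones", "B"),
    (3, "Smyth", "A"),
    (4, "Brown", "B"),
]
print(sorted(related_pairs(records)))
```

Entities 1 and 2 qualify through two shared addresses; entities 1 and 3 qualify through one shared address plus a one-character name difference.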


2012 Processing Flow

[Figure: processing-flow diagram. Stages and data volumes recoverable from the diagram:]

• Input: 14.2B recs, 325 B/rec, 4.6 TB

• Project to {(ID, lname, adr)}: 14.2B recs, 100+ B/rec, 1.5 TB

• Join on Address: 1.6T recs, 200+ B/rec, 300+ TB

• Compute adr hash, compare lnames, init score to 3, project to {(ID1, ID2, adrhash, score, lname_match)}: 1.6T recs, 30 B/rec, 48+ TB

• Sort & Remove Duplicates: 1.5T recs, 30 B/rec, 45 TB

• Hash ID1,2 & Distribute (send between nodes via TCP/IP datagrams)

• Group by ID pairs & Sum scores, Lname_match, giving {(ID1, ID2, score, lname_match)}: 1.2T recs, 16 B/rec, 20 TB

• Select on Score & Lname_match: 12B recs, 16 B/rec, 200 GB

Other labels in the diagram: 800M distinct IDs; 400M distinct IDs; “h”, “t”, “J”, “D”


Jaccard Coefficient Γ(u, v)

N(u) = set of neighbors of u

Γ(u,v) = fraction of neighbors of u and v that are in common

Γ(u,v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

[Figure: vertices u and v with their neighbors. Green and purple edges lead to common neighbors; blue edges lead to non-common neighbors.]


Jaccard Coefficient Γ(u, v)

d(u) = # of neighbors of u

γ(u, v) = # of common neighbors

Γ(u,v) = fraction of all neighbors that are shared

Example: γ(u, v) = 2; d(u) = 3; d(v) = 4:

Γ(u,v) = 2/5 = 0.4

In General:

Γ(u,v) = γ(u, v) / (d(u) + d(v) - γ(u, v))

[Figure: vertices u and v with their neighbors. Green and purple edges lead to common neighbors; blue edges lead to non-common neighbors.]
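A minimal sketch of the definition above, using Python sets for N(u) and N(v). The neighbor labels are made up; they are chosen only so that d(u) = 3, d(v) = 4 and γ(u, v) = 2, matching the worked example.

```python
def jaccard(n_u, n_v):
    """Γ(u,v) = γ / (d(u) + d(v) - γ), with γ = |N(u) ∩ N(v)|."""
    gamma = len(n_u & n_v)
    union = len(n_u) + len(n_v) - gamma
    return gamma / union if union else 0.0   # convention: 0 when both sets are empty

N_u = {"a", "b", "c"}            # d(u) = 3 (hypothetical neighbor labels)
N_v = {"b", "c", "d", "e"}       # d(v) = 4
print(jaccard(N_u, N_v))         # 2 common neighbors out of 5 distinct ones
```

The denominator d(u) + d(v) - γ(u, v) equals |N(u) ∪ N(v)|, so the two forms of the formula agree.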


Computing a Single γ

• γ(u,v) = # of common neighbors

• Do not need to “visit” neighbors, only enumerate them

• Algorithm requires comparing 2 lists of lengths d(u) & d(v)

– Let d = max(d(u), d(v))

• If lists are sorted, then O(d)

• If lists not sorted, then O(d*log(d))

– Clever hash algorithms may reduce to almost O(d)
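The two cases above can be sketched as follows: a linear merge over sorted neighbor lists, and a hash-set lookup for unsorted lists. Both functions assume each list holds distinct vertex ids; the function names are illustrative.

```python
def common_sorted(a, b):
    """Count common elements of two sorted lists in O(d(u) + d(v)) steps."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def common_hashed(a, b):
    """Count common elements of two unsorted lists with a hash set
    (expected time roughly linear in the list lengths)."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    lookup = set(large)
    return sum(1 for x in small if x in lookup)

print(common_sorted([1, 4, 7], [2, 4, 7, 9]))
print(common_hashed([7, 1, 4], [9, 4, 2, 7]))
```

Neither routine touches the neighbors' own adjacency data, which is the sense in which neighbors are enumerated rather than visited.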


Computing all γs

• Apply prior single γ to all pairs

– O(V^2*d) to O(V^2*d*log(d))

– Factor of 2 reduction if only do (u,v) where u<v

• Avoid computing all clearly 0 terms

– For each u, explore each w where (u,w) is an edge

– Compute γ(u,v) for each v where (v,w) is an edge & u<v

– O(V*d^3) to O(V*d^4)

• Backwards: compute γs incrementally

– Group edges so for each w we have {x|(x,w)}

– For each x & y in this set, increment γ(x,y)

– O(V*d^2): Avoid O(V^2) initialization by dynamic creation & initialization

• Sparse Matrix Equivalency: A = adjacency matrix

– γ = A*A^T; O(nnz(A))
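The “backwards” approach above might look like the sketch below: group each vertex w's neighbors, then increment γ(x, y) for every pair x < y in that group, creating dictionary entries on demand so that no O(V^2) table is initialized. The edge list is a hypothetical undirected toy graph, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def all_gammas(edges):
    """edges: iterable of undirected (x, w) pairs.
    Returns ({(x, y): γ(x, y)} for x < y with γ > 0, neighbor sets)."""
    nbrs = defaultdict(set)
    for x, w in edges:
        nbrs[x].add(w)
        nbrs[w].add(x)
    gamma = defaultdict(int)                 # entries created only when nonzero
    for w, group in nbrs.items():            # group = {x | (x, w) is an edge}
        for x, y in combinations(sorted(group), 2):
            gamma[(x, y)] += 1               # w is a common neighbor of x and y
    return dict(gamma), nbrs

def all_jaccards(edges):
    """Γ(x, y) = γ / (d(x) + d(y) - γ) for every nonzero γ."""
    gamma, nbrs = all_gammas(edges)
    return {(x, y): g / (len(nbrs[x]) + len(nbrs[y]) - g)
            for (x, y), g in gamma.items()}

edges = [(0, 1), (0, 2), (1, 2), (1, 3)]
gamma, nbrs = all_gammas(edges)
print(gamma)
print(all_jaccards(edges))
```

Each off-diagonal entry produced here equals the corresponding entry of the adjacency matrix multiplied by its transpose, since that product sums A[x][w]*A[y][w] over all w.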


Jaccard via MapReduce

• 1 MapReduce step: 3 phases

– Map some function over all data to (key,value) stream

– Group pairs by key

– Reduce each group

• Two reported Jaccard implementations

– 3-steps with V = 1 million edges on 50 node cluster

– 5-steps with V = 64 million vertices & 1 billion edges

  • Burkhardt, “Asking Hard Graph Questions,” Beyond Watson Workshop, Feb. 2014.

  • RMAT matrices, average d(i) = 16

  • 1000 node system, each with 12 cores & 64GB
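To make the map / group / reduce phases concrete, here is a single-process sketch that computes the γ counts in two chained steps. It is an illustration only and is not the 3-step or 5-step implementation reported above; `map_reduce` and the step functions are hypothetical helpers, and a real system would distribute the grouping across nodes.

```python
from collections import defaultdict
from itertools import combinations

def map_reduce(records, mapper, reducer):
    """One MapReduce step: map to (key, value), group by key, reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

# Step 1: build each vertex's neighbor list, then emit every neighbor pair.
def step1_map(edge):
    u, v = edge
    yield u, v
    yield v, u

def step1_reduce(w, neighbors):
    for x, y in combinations(sorted(set(neighbors)), 2):
        yield (x, y), 1                      # w is one common neighbor of x, y

# Step 2: sum the 1s for each pair to get γ(x, y).
def step2_map(record):
    yield record                             # already a (pair, 1) tuple

def step2_reduce(pair, ones):
    yield pair, sum(ones)

edges = [(0, 1), (0, 2), (1, 2), (1, 3)]     # hypothetical toy graph
pairs = map_reduce(edges, step1_map, step1_reduce)
gammas = dict(map_reduce(pairs, step2_map, step2_reduce))
print(gammas)
```

Turning γ into Γ would need a further step that joins each pair with the degrees d(x) and d(y), which is omitted here.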


Measurements & Trends

[Figure: two log-log plots of measured vs. modeled values against Vertices (1.E+04 to 1.E+08). One panel: Time (Sec), 1.E+02 to 1.E+05. Other panel: Coefficients, 1.E+07 to 1.E+13.]

Modeled time (sec) = 136 + 10^-5 * V^1.16

Modeled coefficients = 16 * V^1.42

JACS (Jaccard Coefficients / Sec) = 1.6E6 * V^0.26
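The JACS expression appears to follow from dividing the modeled coefficient count by the modeled time, with the fixed 136-second term dropped. The symbols T(V) and C(V) below are introduced here for illustration, and neglecting the constant is an assumption that should hold only for large V.

```latex
\begin{align*}
T(V) &\approx 136 + 10^{-5}\, V^{1.16} \quad \text{(seconds)} \\
C(V) &\approx 16\, V^{1.42} \quad \text{(coefficients)} \\
\text{JACS}(V) &= \frac{C(V)}{T(V)}
  \approx \frac{16\, V^{1.42}}{10^{-5}\, V^{1.16}}
  = 1.6 \times 10^{6}\, V^{0.26} \\
\text{JACS}(10^{8}) &\approx 1.6 \times 10^{6} \times 10^{2.08}
  \approx 1.9 \times 10^{8} \ \text{coefficients/sec}
\end{align*}
```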


Relevant Batch Variations

• Consider graphs with 2 vertex classes

– A = set of “people”, B = set of “addresses”

– An edge from a in A to b in B means person a “resided at” address b

– γ(a1,a2) = # of common addresses between persons a1 and a2

• Real world: for |A| = 8E8, |B| = 1E8

– |γ| = 1.2E12 ≈ |A|^1.35

• Variations:

– >2 classes of vertices

– Exclude paths thru high in-degree neighbors

– Weight paths on basis of properties (e.g. same last name)

– Add threshold
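Two of the variations above, excluding high in-degree neighbors and applying a threshold, can be sketched for the people/addresses case as follows. The parameter names `max_addr_degree` and `threshold`, and the sample data, are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations

def shared_address_pairs(resided_at, max_addr_degree=None, threshold=1):
    """resided_at: iterable of (person, address) edges from class A to class B.
    Returns {(p, q): # common addresses} for pairs meeting the threshold."""
    by_addr = defaultdict(set)
    for person, addr in resided_at:
        by_addr[addr].add(person)
    gamma = defaultdict(int)
    for addr, people in by_addr.items():
        # Skip addresses with too many residents (e.g. a dormitory).
        if max_addr_degree is not None and len(people) > max_addr_degree:
            continue
        for p, q in combinations(sorted(people), 2):
            gamma[(p, q)] += 1
    return {pair: g for pair, g in gamma.items() if g >= threshold}

resided_at = [
    ("ann", "a1"), ("bob", "a1"),
    ("ann", "a2"), ("bob", "a2"), ("cat", "a2"),
    ("ann", "dorm"), ("bob", "dorm"), ("cat", "dorm"), ("dan", "dorm"),
]
print(shared_address_pairs(resided_at))
print(shared_address_pairs(resided_at, max_addr_degree=3, threshold=2))
```

Skipping a high-degree address removes a number of candidate pairs that grows with the square of its degree, which is why this exclusion matters for the size of the output.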


Dynamic Variations

• Many applications with dynamic graphs

– Vertices, edges added/deleted dynamically

• Question: which γs change with edge addition or deletion

– Perhaps just those that cross threshold, in either direction
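One way to answer the question above is to note that adding or deleting edge (u, v) changes γ(u, x) only for the other neighbors x of v, and γ(v, y) only for the other neighbors y of u. The class below is a sketch under that observation; its name, its methods, and the rule of reporting only threshold crossings are assumptions for illustration, and self-loops are assumed absent.

```python
from collections import defaultdict

class DynamicGamma:
    """Maintains γ counts under edge insertions/deletions and reports
    the pairs whose γ crosses a threshold in either direction."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.nbrs = defaultdict(set)
        self.gamma = {}                       # (x, y) with x < y -> γ > 0

    def _bump(self, u, v, delta):
        """v gained/lost u as a neighbor: adjust γ(u, x) for other x in N(v)."""
        crossed = []
        for x in self.nbrs[v]:
            if x == u:
                continue
            key = (u, x) if u < x else (x, u)
            before = self.gamma.get(key, 0)
            after = before + delta
            if after:
                self.gamma[key] = after
            else:
                self.gamma.pop(key, None)
            if (before >= self.threshold) != (after >= self.threshold):
                crossed.append((key, after))
        return crossed

    def add_edge(self, u, v):
        if v in self.nbrs[u]:
            return []
        crossed = self._bump(u, v, +1) + self._bump(v, u, +1)
        self.nbrs[u].add(v)
        self.nbrs[v].add(u)
        return crossed

    def remove_edge(self, u, v):
        if v not in self.nbrs[u]:
            return []
        self.nbrs[u].discard(v)
        self.nbrs[v].discard(u)
        return self._bump(u, v, -1) + self._bump(v, u, -1)

dg = DynamicGamma(threshold=2)
for edge in [("a", "w1"), ("b", "w1"), ("a", "w2")]:
    dg.add_edge(*edge)
print(dg.add_edge("b", "w2"))                # pairs that just reached γ = 2
```

Each update costs on the order of d(u) + d(v) dictionary operations. Note that Γ, unlike γ, also changes for other pairs involving u or v because their degrees change.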


Other Variations

• Don’t compute/store all possibly O(V^2) γs, but store just statistics of the set of γs for each vertex

– Number of non-zero, largest, average, etc.

– Reduces storage to O(V)

– Serve as starting point for more complex queries

• Expand “neighbors” beyond “1 hop”

– Real commercial applications at “1.5” hops
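The per-vertex statistics idea can be sketched as a single pass that folds each nonzero γ into running totals for both of its endpoints. For simplicity this example takes an already-built dictionary of γ values as input; a real implementation would presumably fold values in as they are generated so the full set is never stored. The statistic names are illustrative.

```python
from collections import defaultdict

def gamma_stats(gamma):
    """gamma: {(x, y): γ(x, y)} with γ > 0.
    Returns per-vertex count of nonzero γs, largest γ, and average γ."""
    acc = defaultdict(lambda: [0, 0, 0])      # vertex -> [count, largest, total]
    for (x, y), g in gamma.items():
        for v in (x, y):
            entry = acc[v]
            entry[0] += 1
            entry[1] = max(entry[1], g)
            entry[2] += g
    return {
        v: {"nonzero": c, "largest": m, "average": t / c}
        for v, (c, m, t) in acc.items()
    }

stats = gamma_stats({(0, 1): 1, (0, 2): 3, (1, 2): 2})
print(stats[0])
```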


A Possible Heuristic

• Goal: constant time elimination of γs that are zero or less than some threshold

• Build bit vector for each vertex that hashes vertex ids into bits

– Bit i = 1 if one or more neighbors fall into set i

• Estimate γ(u,v) by ANDing u & v’s bit vectors

– If results are 0, then γ = 0

• Can also estimate upper bound on γ(u,v)

– Simple function of # of 1s in bit vectors

• If benchmark uses a threshold, this estimate can prevent computation when bound<threshold

• Based on ideas from SpGEMM package
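The zero test in this heuristic can be sketched as below. The 64-bit width and the hash “vertex id modulo width” are assumptions chosen to keep the example deterministic, and vertex ids are assumed to be non-negative integers. The upper-bound function is not specified above, so it is left out.

```python
BITS = 64   # assumed signature width

def signature(neighbors, bits=BITS):
    """Bit i is set if one or more neighbors hash into bucket i."""
    sig = 0
    for n in neighbors:
        sig |= 1 << (n % bits)    # assumed hash: vertex id modulo width
    return sig

def gamma_filtered(n_u, n_v, bits=BITS):
    """Exact γ(u,v), skipping the set intersection when the signatures
    share no bit (which proves γ = 0). In practice the signatures would
    be precomputed once per vertex rather than rebuilt on every call."""
    if (signature(n_u, bits) & signature(n_v, bits)) == 0:
        return 0
    return len(n_u & n_v)

N_u = {1, 2, 3}
N_w = {10, 20}     # no shared bucket with N_u: eliminated by one AND
N_x = {65}         # 65 % 64 == 1 collides with vertex 1: a false positive
N_v = {3, 70}      # genuinely shares neighbor 3

print(signature(N_u) & signature(N_w))
print(gamma_filtered(N_u, N_x), gamma_filtered(N_u, N_v))
```

A zero AND is conclusive, while a nonzero AND only means the pair might share a neighbor, so the exact computation is still required in that case.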


Suggested Benchmarks

• Simple batch: compute all non-zeros

– Perhaps use same RMAT as for BFS

– Performance Metric JACS

• Multi-class benchmark

– One metric: time to compute all γs as above

– Optionally append Jaccard statistics as properties to each vertex of a class

• Real-time event detection

– Start with precomputed γ statistics

– Input stream of additions/deletions

– Check for changes in γs

– Performance metric: change throughput


Key Questions for More Precise Benchmark Definitions

• Formal definition of batch and incremental

• Where do non-zeros go? File, properties, …

• Tie data sets to real applications

• Explore real-world distribution of γs and Γs to understand growth rates better

• Build in options such as thresholds

• Develop complexity models, esp. parallel

• Verify correct solutions

• Develop reference implementations


Acknowledgements

• Funded in part by the Univ. of Notre Dame, Notre Dame, IN, USA