Neighborhood Based Fast Graph Search in Large Networks

Arijit Khan, Nan Li, Xifeng Yan, Ziyu Guan
Computer Science, UC Santa Barbara
{arijitkhan, nanli, xyan, ziyuguan}@cs.ucsb.edu

Supriyo Chakraborty
UC Los Angeles

Shu Tao
IBM TJ Watson


Post on 24-Feb-2016



Motivation (RDF Query)

Which actors have appeared in both a John Waters movie and a Steven Spielberg movie?

ER diagram: Director (Name) and Actor (Name) are linked to Movie (Title) through the direct and act relationships.

Writing a SPARQL query requires knowing how the entities are connected in the graph data.

SPARQL query:

    SELECT ?actorName WHERE {
      ?actor ?actorName.
      ?director1 S. Spielberg.
      ?director1Movie ?actor; ?director1.
      ?director2 J. Waters.
      ?director2Movie ?actor; ?director2.
    }

RDF Query

Query graph: an unknown node (?) connected to J. Waters and S. Spielberg.
Matching subgraph: J. Waters, Cry-Baby, Darren E. Burrows, Amistad, S. Spielberg.

How the entities are connected is less important than how closely they are connected.

Approximate Graph Matching

Find the athlete from Romania who won gold in the 3000m and bronze in the 1500m at the 1984 Olympics.

Query graph: ?, Romania, 1984, 3000m, Gold, 1500m, Bronze
Matching subgraph: Maricica Puica, Romania, 1984, 3000m, Gold, 1500m, Bronze

Graph Edit Distance: 7
# Missing Edges: 4
Maximum Common Subgraph Size: 3

Still a close approximate match of the query graph!

Graph Alignment

Align the nodes of two graphs based on their attributes.
Applications: name disambiguation and database schema matching.
Example graphs: LinkedIn and Twitter.

Roadmap

Problem Formulation

Search Algorithm

Indexing

Query Optimization

Experimental Results

Conclusion

Problem Formulation: Difficulties with the # of Edge Mismatches or Graph Edit Distance

Figure: query graph Q and target graph G with node labels a, b, c, and two candidate matches f1 and f2.

# Missing Edges: 1 (for both f1 and f2)
Graph Edit Distance: 2 (for f1), 1 (for f2)

f1 is a better match than f2 considering the proximity of the labels.
Graph edit distance and # of missing edges are not scalable for large graphs.

Problem Formulation: Problem with Shape-Preserving Approximate Query Matching

Approximate query matching techniques that preserve the shape of the query graph might not be appropriate.
If two labels are close in the query graph, they should also be close in the matching subgraph.

A Good Subgraph Matching Algorithm Should Have

- If the query graph Q is subgraph isomorphic to the target graph G, then the cost of matching Q in G must be 0.
- The farther apart the labels are in G compared to Q, the higher the cost of matching.

Problem with Random Walk Based Methods

Random walk based models (e.g., Personalized PageRank) do not satisfy these requirements.

Random walk probabilities (figure: query graph Q, target graph G, match f):

                  G      Q
  Green, Yellow   0.75   0.67
  Green, Blue     0.25   0.33

Convert the label distribution in the neighborhood of each node u into a multi-dimensional vector R(u).

Information Propagation Model

Example of Neighborhood Vectorization

h = 2, α = 0.5

R_Q(v1) = {}, R_Q(v2) = {}
R_f1(u1) = {}, R_f1(u2) = {}
R_f2(u1) = {}, R_f2(u2) = {}
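The slides do not preserve the propagation formula, so the following is only a minimal sketch under an assumed rule: a label carried by a node i hops away from u (1 <= i <= h) adds alpha**i to u's vector. The function name `neighborhood_vector`, the adjacency-dict representation, and the toy path graph are illustrative assumptions, not the authors' code.

```python
from collections import deque


def neighborhood_vector(adj, labels, u, h=2, alpha=0.5):
    """Return {label: strength} for node u, looking at most h hops away.

    Assumed rule (not taken from the slides): each node at hop distance i
    contributes alpha**i for every label it carries.
    """
    dist = {u: 0}
    queue = deque([u])
    vector = {}
    while queue:
        x = queue.popleft()
        if dist[x] == h:          # do not expand beyond h hops
            continue
        for y in adj[x]:
            if y not in dist:     # first visit gives the shortest distance
                dist[y] = dist[x] + 1
                queue.append(y)
                for label in labels.get(y, ()):
                    vector[label] = vector.get(label, 0.0) + alpha ** dist[y]
    return vector


# Toy path graph: 1(a) - 2(b) - 3(c)
adj = {1: [2], 2: [1, 3], 3: [2]}
labels = {1: {"a"}, 2: {"b"}, 3: {"c"}}
print(neighborhood_vector(adj, labels, 1))  # {'b': 0.5, 'c': 0.25}
```

With h = 2 and alpha = 0.5, a label one hop away gets strength 0.5 and a label two hops away gets 0.25, which matches the magnitudes that appear in the cost arithmetic on the following slide.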

Problem Definition

Neighborhood Based Cost Function: the positive difference between the neighborhood vectors.

For the example above (h = 2, α = 0.5):

C_N(f1) = 0
C_N(f2) = (0.5 - 0.25) + (0.5 - 0.25) = 0.5

Neighborhood Based Top-k Similarity Search: Given a target graph G and a query graph Q, find the top-k embeddings with respect to the cost C_N.
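As a rough illustration of the "positive difference" cost, the sketch below sums, over the labels in a query node's vector, how much strength the matched target node is missing; surplus strength in the target is not penalized. The vectors in the example are hypothetical values chosen only to reproduce the arithmetic C_N(f2) = (0.5 - 0.25) + (0.5 - 0.25) = 0.5; the slide's actual vectors were not preserved, and the names `node_cost` and `embedding_cost` are assumptions.

```python
def node_cost(r_query, r_target):
    """Positive difference between a query vector and a target vector."""
    return sum(max(0.0, strength - r_target.get(label, 0.0))
               for label, strength in r_query.items())


def embedding_cost(f, r_q, r_f):
    """Cost of an embedding f, given as {query node: matched target node}."""
    return sum(node_cost(r_q[v], r_f[u]) for v, u in f.items())


# Hypothetical vectors (not the ones from the slide).
r_q = {"v1": {"b": 0.5}, "v2": {"a": 0.5}}
r_f1 = {"u1": {"b": 0.5}, "u2": {"a": 0.5}}    # labels as close as in Q
r_f2 = {"u1": {"b": 0.25}, "u2": {"a": 0.25}}  # labels one hop farther away
f = {"v1": "u1", "v2": "u2"}

print(embedding_cost(f, r_q, r_f1))  # 0.0
print(embedding_cost(f, r_q, r_f2))  # 0.5
```

Because only shortfalls count, an exact embedding always has cost 0, which is the first property stated on the next slide.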

Cost Function Properties

For an exact embedding f_e, C_N(f_e) = 0.

The neighborhood based cost function can have false positives. (Figure: a false positive f with C_N(f) = 0 for h = 1.)

Given a graph G and a query graph Q in which every node has a distinct label, any inexact embedding f has C_N(f) > 0, for all h > 0 and α > 0.

Neighborhood Based Top-k Similarity Search is NP-hard.

Given two graphs Q and G with the same number of nodes, it can be determined in polynomial time whether G itself is an embedding f of Q with C_N(f) = 0.


Search Algorithm

Step 1: Match a node u of the target graph G with a node v of the query graph Q if L(v) ⊆ L(u) and cost(u, v) is at most a predefined cost threshold ε.

Step 2: Discard the labels of the unmatched nodes in the target graph.

Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat until no further node can be discarded.

Example (query graph Q with nodes v1 to v4, target graph G with nodes u1 to u6; h = 1, α = 0.5, ε = 0):

1st round: cost(u1, v1) = 0, cost(u5, v1) = 0, cost(u2, v3) = 0.5, ...
  match(v1) = {u1, u5}
  match(v2) = {u3}
  match(v3) = {u6}
  match(v4) = {u4}

2nd round: cost(u1, v1) = 0.5, cost(u5, v1) = 0, ...
  match(v1) = {u5}
  match(v2) = {u3}
  match(v3) = {u6}
  match(v4) = {u4}
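The three steps can be sketched as the loop below. This is a simplified illustration, not the authors' implementation: it assumes that a label i hops away contributes alpha**i, that cost is the sum of positive differences, and that a node is kept when its cost is at most eps. It only computes the surviving candidate sets; assembling and ranking full embeddings from them is not covered here. The graphs and all function names are made up for the example.

```python
from collections import deque


def vectorize(adj, labels, u, h, alpha):
    """Neighborhood vector of u: a label i hops away adds alpha**i (assumed)."""
    dist, queue, vec = {u: 0}, deque([u]), {}
    while queue:
        x = queue.popleft()
        if dist[x] == h:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
                for label in labels[y]:
                    vec[label] = vec.get(label, 0.0) + alpha ** dist[y]
    return vec


def cost(q_vec, g_vec):
    """Sum of positive differences between a query and a target vector."""
    return sum(max(0.0, s - g_vec.get(l, 0.0)) for l, s in q_vec.items())


def iterative_prune(g_adj, g_labels, q_adj, q_labels, h=1, alpha=0.5, eps=0.0):
    q_vec = {v: vectorize(q_adj, q_labels, v, h, alpha) for v in q_adj}
    active = {u: set(ls) for u, ls in g_labels.items()}
    while True:
        # Step 3 (and the first round): propagate only the surviving labels.
        g_vec = {u: vectorize(g_adj, active, u, h, alpha)
                 for u in g_adj if active[u]}
        # Step 1: label containment plus cost within the threshold.
        match = {v: {u for u in g_vec
                     if q_labels[v] <= active[u]
                     and cost(q_vec[v], g_vec[u]) <= eps}
                 for v in q_adj}
        matched = set().union(*match.values())
        # Step 2: unmatched nodes lose their labels but stay in the graph.
        pruned = {u: (ls if u in matched else set())
                  for u, ls in active.items()}
        if pruned == active:      # nothing more can be discarded
            return match
        active = pruned


# Query: a - b - c.  Target: u1(a) - u2(b) - u3(c) plus a decoy u4(a) - u5(b).
q_adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
q_labels = {"v1": {"a"}, "v2": {"b"}, "v3": {"c"}}
g_adj = {"u1": ["u2"], "u2": ["u1", "u3"], "u3": ["u2"],
         "u4": ["u5"], "u5": ["u4"]}
g_labels = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"}, "u4": {"a"}, "u5": {"b"}}

result = iterative_prune(g_adj, g_labels, q_adj, q_labels)
print(result)  # {'v1': {'u1'}, 'v2': {'u2'}, 'v3': {'u3'}}
```

In this toy run u4 survives the first round (its neighbor u5 still carries label b), but once u5 is discarded for lacking a neighboring c, u4 loses its support and is dropped in the second round, mirroring how u1 is dropped in the slide's example.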


Indexing

Index the neighborhood vectors for the first round of matching.

Two types of indexing:
- Label based (hashing of node labels)
- Neighborhood based

Example (h = 2, α = 0.5, ε = 0). The figure shows a query graph Q (nodes v1 to v4), a target graph G (nodes u1 to u6) with labels a, b, c, the neighborhood vectors R_Q(v1) and R_G(u1) to R_G(u6), and the index structure, which keeps one sorted node list per label:

  a: u2, u1, u4, u6, u5, u3
  b: u1, u5, u6, u2, u3, u4

The Threshold Algorithm scans these lists to find the matches of v1; the scan shown in the figure yields cost = 0, cost = 0, and cost = 0.25 > ε.
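One plausible way to combine per-label sorted lists with a threshold-style scan is sketched below. It is an assumption-laden illustration rather than the index from the talk: lists are scanned in descending strength, and the scan stops once even the best possible unseen node would exceed the cost threshold. The vectors, the names `build_index` and `candidates`, and the stopping rule are all made up for this example; the label-based hash index is omitted.

```python
def cost(q_vec, g_vec):
    """Sum of positive differences between a query and a target vector."""
    return sum(max(0.0, s - g_vec.get(l, 0.0)) for l, s in q_vec.items())


def build_index(vectors):
    """label -> list of (strength, node), sorted by descending strength."""
    index = {}
    for node, vec in vectors.items():
        for label, strength in vec.items():
            index.setdefault(label, []).append((strength, node))
    for postings in index.values():
        postings.sort(key=lambda p: -p[0])
    return index


def candidates(index, vectors, q_vec, eps=0.0):
    """Nodes whose cost against q_vec is at most eps, found by a sorted scan."""
    lists = {label: index.get(label, []) for label in q_vec}
    seen, result, depth = set(), set(), 0
    while True:
        bound, progressed = 0.0, False
        for label, q in q_vec.items():
            postings = lists[label]
            if depth < len(postings):
                strength, node = postings[depth]
                progressed = True
                if node not in seen:
                    seen.add(node)
                    if cost(q_vec, vectors[node]) <= eps:
                        result.add(node)
            else:
                strength = 0.0   # list exhausted: unseen nodes lack the label
            # Any unseen node has at most `strength` for this label.
            bound += max(0.0, q - strength)
        if bound > eps:
            return result        # no unseen node can still qualify
        if not progressed:
            return set(vectors)  # even an empty neighborhood is within eps
        depth += 1


# Hypothetical neighborhood vectors for four target nodes.
vectors = {
    "u1": {"a": 1.0, "b": 0.5},
    "u2": {"a": 0.5, "b": 0.75},
    "u3": {"a": 0.75},
    "u4": {"a": 1.25, "b": 0.75},
}
index = build_index(vectors)
q_vec = {"a": 0.5, "b": 0.75}
print(sorted(candidates(index, vectors, q_vec)))  # ['u2', 'u4']
```

The early stop is what makes the sorted lists useful: in the run above the scan ends as soon as the current entry in the b list drops to 0.5, because every unseen node would then miss at least 0.25 of the required b strength.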

Dynamic Update

Insertion or deletion of nodes or edges incurs local changes in the neighborhood vectors of only a few nodes.

The index structure consists of sorted lists of nodes, ordered by the label association values in their neighborhood vectors.

The index can be implemented using a priority queue, which makes local updates easy.

Query Optimization

Non-discriminative labels increase the number of node matches in the initial rounds of the search algorithm.

Eliminate non-discriminative labels initially; add them back in the final stage of the search algorithm.

Labels with a heavy-head distribution are more discriminative than those with a heavy-tail distribution.

Figure: A_u(l) versus |u| for a heavy-head (discriminative) distribution and a heavy-tail (non-discriminative) distribution, with regions marked Pruned and Not Pruned.


Experimental Results

Data Sets:

  Dataset     # of Nodes   # of Edges   # of Labels   Avg. # of Labels/Node
  FreeBase    172,015      579,869      159,514       1
  Intrusion   200,858      703,020      1,000         25
  DBLP        684,911      7,764,604    683,927       1
  WebGraph    10M          213M         10,000        1

Efficiency:

                               FreeBase    Intrusion   DBLP         WebGraph
  2-hop Indexing (off-line)    280.0 sec   227.0 sec   1733.0 sec   5,125.0 sec
  Top-1 Search* (on-line)      0.06 sec    1.6 sec     0.02 sec     0.11 sec

*Query graph is a subgraph of the target graph; # of nodes in query graph = 50.

Robustness Results

Error Ratio: # of incorrectly identified nodes of the target graph in all top-1 matches, divided by the # of nodes in all the query graphs in a query set.

Noise Ratio: # of edges added, divided by the total number of nodes in the query graphs.

Figure: Robustness Results (FreeBase). Series: Diameter 2, 100 nodes; Diameter 3, 150 nodes; Diameter 4, 200 nodes.
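The two ratios can be written out as a short sketch. It assumes each query has a known ground-truth mapping from query nodes to target nodes (plausible when queries are extracted from the target graph and then perturbed), and it counts a query node as incorrectly identified when the top-1 match maps it to a different target node or leaves it unmapped. The function names and the sample data are hypothetical.

```python
def error_ratio(top1_matches, ground_truth):
    """Wrongly mapped query nodes divided by all query nodes in the set."""
    wrong = 0
    total = 0
    for match, truth in zip(top1_matches, ground_truth):
        total += len(truth)
        wrong += sum(1 for v, u in truth.items() if match.get(v) != u)
    return wrong / total


def noise_ratio(edges_added, query_graph_sizes):
    """Edges added as noise divided by the total number of query nodes."""
    return edges_added / sum(query_graph_sizes)


# Two hypothetical queries with 3 and 2 nodes; one node is mapped wrongly.
truth = [{"v1": "u1", "v2": "u2", "v3": "u3"}, {"v1": "u7", "v2": "u8"}]
found = [{"v1": "u1", "v2": "u2", "v3": "u9"}, {"v1": "u7", "v2": "u8"}]

print(error_ratio(found, truth))   # 0.2
print(noise_ratio(10, [50, 50]))   # 0.1
```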

Convergence Results

Noise Ratio: # of edges added, divided by the total number of nodes in the query graphs.

Figure: Convergence Results (DBLP). Series: Diameter 2, 100 nodes; Diameter 3, 150 nodes; Diameter 4, 200 nodes.

Scalability Results

Figure: Scalability Results (WebGraph).

Query graph is a subgraph of the target graph.

# of nodes in query graph = 50.

Indexing is performed for h = 2 hops.


Conclusion

New measure of graph similarity based on neighborhood structure.

Information propagation model to convert a large graph into multi-dimensional vectors.

Efficient and scalable search algorithm based on iterative pruning, using the neighborhood vectors.

Efficient indexing and query optimization techniques.

How to match the labels when they are not exactly the same in the two graphs?

Thank You! Questions?