presented by ozgur d. sahin. outline introduction neighborhood functions anf algorithm modifications...

23
Presented by Ozgur D. Sahin

Post on 21-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Presented by Ozgur D. Sahin

Page 2: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Outline

Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Page 3: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Introduction & Motivation

Graph-based data is becoming more importatnt Internet modeling, academic citations, phone records,

movie databases, CAD circuits Example Questions:

How robust is the Internet to failures? What are the most influential database papers? What is the best opening move in tic-tac-toe? Are phone call patterns in Asia similar to those in the U.S.?

Goal: Quickly answer questions on graph- represented data

Page 4: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Answering Questions

We can answer these questions if we can compute following three properties related to connectivity and neighborhood structure: Graph Similarity: Decide if two graphs have similar

connectivity/neighborhood structure Subgraph Similarity: Compare how two subgraphs of a

given graph are connected Vertex Importance: Assign an importance to each node

based on its connectivity

This paper provides such a tool: ANF (Approximate Neighborhood Function)

Page 5: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Challenges

Following properties should be satisfied: Error Guarantees: Accurate estimates Fast: Scale linearly with n (# of nodes) and m (# of edges)

Low Storage Adapts to available memory Parallelizable Sequential scan of the edge file Estimates per node

Page 6: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Definitions - Neighborhood Functionsdist(u,v): # of edges on the shortest path from u to v

Define following neighborhood functions:

Page 7: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Definitions - Neighborhood FunctionsGeneralize these two definitions to deal with subgraphs:

Page 8: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Basic ANF Algorithm

N(h) can be computed by a graph traversal Graph traversal accesses edges in random order Running time is O(nm)

Access edges in sequential order: M(x,h) is the set of nodes within distance h of node x

Page 9: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Basic ANF Algorithm

How to compute the number of distinct elements in the set M(x,h): A dictionary data structure: O(n2log n) time/space Use bits to mark membership: O(n2) space Use ‘probabilistic counting algorithm’

Approximate set sizes using ‘log n+r’ bits

Page 10: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Probabilistic counting algorithm Approximate set sizes using ‘log n+r’ bits Instead of one bit per node, give half the nodes bit 0, a

quarter of them bit 1, and so on (A node is given bit i with probability 1/2i+1)

The approximation of the size of a set is proportional to 2b, where b is the least bit that has not been set in the bit representation of this set

Use k parallel approximations M(x,h) is represented by k(log n+r) bits

Page 11: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Basic ANF Algorithm Consider a ring with 5 nodes

Example for k=3 and r=0 Bit 0 is the leftmost bit in each 3-bit mask

M(2,1) is the union of M(2,0), M(1,0), and M(3,0): M(2,1)=M(2,0) OR M(1,0) OR M(3,0)

IN(2,1) is computed from the average of the least zero bit positions: Avg=(2+1+1)/3=4/3 IN(2,1) = (24/3)/0.77359 = 3.25

Page 12: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Basic ANF Algorithm

Page 13: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Modifications

M(x,h) uses M(y,h-1) but not M(y,h-2), so just keep the M(y,h-1) during iteration h.

Include a mark bit to handle generalized neighborhood functions

Break bit masks into smaller pieces if they are larger than the available memory

Page 14: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Leading Ones Compression

As ANF runs, most bit masks will have many leading 1’s

Compress bit masks by including a counter of the leading ones

Bit shuffling of k parallel bit masks enables further compression: 11010,11100 1111011000

Provides up to 23% speed-up

Page 15: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Experiments

Data Sets: 3 real (Router, Cornell, Cora) and 4 synthetic

Evaluation Metric:

Page 16: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Experiments - Accuracy

k=64: - ANF achieves less than 7% error

- ANF’s error is independent of the data set

Page 17: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Experiments - Time

Page 18: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Experiments - Scalability

Page 19: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Data Mining with ANF

ANF tool can be used to answer graph mining problems: Best opening move for Tic-Tac-Toe game Clustering movie classes Measuring the robustness of the Internet

Use summarized statistics derived from neighborhood function: Many real graphs follow a power law:

N(h) hH, where H is defined as the ‘hop exponent’ Use ‘individual hop exponent’ as a measure of importance

Page 20: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Tic-Tac-Toe

Show: The best opening move is the center square Each possible board configuration is a node and

there is an edge from board x to board y if it is a possible move

Compute individual neighborhood functions for each of the 9 possible first moves

Page 21: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Clustering Movies

Consider IMDB (Internet Movie Data Base) where each movie is identified as being in one or more classes (such as documentaries, dramas, comedies, etc)

Construct a graph for each class and cluster similar ones

Page 22: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Internet Router Data How robust the Internet is to router failures

Delete some number of routers and measure connectivity

-Random failures do not disrupt the Internet

-Targeted failures can dramatically disrupt it

Page 23: Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions

Conclusions

ANF uses an efficient and accurate approximation algorithm

ANF tool provides several advantages including following: Accurate Fast Low storage requirements Parallelizable

ANF makes it possible to answer many interesting questions