presented by ozgur d. sahin. outline introduction neighborhood functions anf algorithm modifications...
Post on 21-Dec-2015
216 views
TRANSCRIPT
Presented by Ozgur D. Sahin
Outline
Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions
Introduction & Motivation
Graph-based data is becoming more importatnt Internet modeling, academic citations, phone records,
movie databases, CAD circuits Example Questions:
How robust is the Internet to failures? What are the most influential database papers? What is the best opening move in tic-tac-toe? Are phone call patterns in Asia similar to those in the U.S.?
Goal: Quickly answer questions on graph- represented data
Answering Questions
We can answer these questions if we can compute following three properties related to connectivity and neighborhood structure: Graph Similarity: Decide if two graphs have similar
connectivity/neighborhood structure Subgraph Similarity: Compare how two subgraphs of a
given graph are connected Vertex Importance: Assign an importance to each node
based on its connectivity
This paper provides such a tool: ANF (Approximate Neighborhood Function)
Challenges
Following properties should be satisfied: Error Guarantees: Accurate estimates Fast: Scale linearly with n (# of nodes) and m (# of edges)
Low Storage Adapts to available memory Parallelizable Sequential scan of the edge file Estimates per node
Definitions - Neighborhood Functionsdist(u,v): # of edges on the shortest path from u to v
Define following neighborhood functions:
Definitions - Neighborhood FunctionsGeneralize these two definitions to deal with subgraphs:
Basic ANF Algorithm
N(h) can be computed by a graph traversal Graph traversal accesses edges in random order Running time is O(nm)
Access edges in sequential order: M(x,h) is the set of nodes within distance h of node x
Basic ANF Algorithm
How to compute the number of distinct elements in the set M(x,h): A dictionary data structure: O(n2log n) time/space Use bits to mark membership: O(n2) space Use ‘probabilistic counting algorithm’
Approximate set sizes using ‘log n+r’ bits
Probabilistic counting algorithm Approximate set sizes using ‘log n+r’ bits Instead of one bit per node, give half the nodes bit 0, a
quarter of them bit 1, and so on (A node is given bit i with probability 1/2i+1)
The approximation of the size of a set is proportional to 2b, where b is the least bit that has not been set in the bit representation of this set
Use k parallel approximations M(x,h) is represented by k(log n+r) bits
Basic ANF Algorithm Consider a ring with 5 nodes
Example for k=3 and r=0 Bit 0 is the leftmost bit in each 3-bit mask
M(2,1) is the union of M(2,0), M(1,0), and M(3,0): M(2,1)=M(2,0) OR M(1,0) OR M(3,0)
IN(2,1) is computed from the average of the least zero bit positions: Avg=(2+1+1)/3=4/3 IN(2,1) = (24/3)/0.77359 = 3.25
Basic ANF Algorithm
Modifications
M(x,h) uses M(y,h-1) but not M(y,h-2), so just keep the M(y,h-1) during iteration h.
Include a mark bit to handle generalized neighborhood functions
Break bit masks into smaller pieces if they are larger than the available memory
Leading Ones Compression
As ANF runs, most bit masks will have many leading 1’s
Compress bit masks by including a counter of the leading ones
Bit shuffling of k parallel bit masks enables further compression: 11010,11100 1111011000
Provides up to 23% speed-up
Experiments
Data Sets: 3 real (Router, Cornell, Cora) and 4 synthetic
Evaluation Metric:
Experiments - Accuracy
k=64: - ANF achieves less than 7% error
- ANF’s error is independent of the data set
Experiments - Time
Experiments - Scalability
Data Mining with ANF
ANF tool can be used to answer graph mining problems: Best opening move for Tic-Tac-Toe game Clustering movie classes Measuring the robustness of the Internet
Use summarized statistics derived from neighborhood function: Many real graphs follow a power law:
N(h) hH, where H is defined as the ‘hop exponent’ Use ‘individual hop exponent’ as a measure of importance
Tic-Tac-Toe
Show: The best opening move is the center square Each possible board configuration is a node and
there is an edge from board x to board y if it is a possible move
Compute individual neighborhood functions for each of the 9 possible first moves
Clustering Movies
Consider IMDB (Internet Movie Data Base) where each movie is identified as being in one or more classes (such as documentaries, dramas, comedies, etc)
Construct a graph for each class and cluster similar ones
Internet Router Data How robust the Internet is to router failures
Delete some number of routers and measure connectivity
-Random failures do not disrupt the Internet
-Targeted failures can dramatically disrupt it
Conclusions
ANF uses an efficient and accurate approximation algorithm
ANF tool provides several advantages including following: Accurate Fast Low storage requirements Parallelizable
ANF makes it possible to answer many interesting questions