ACFOCS 2004: Using Graphs in Unstructured and Semistructured Data Mining (Soumen Chakrabarti)
Post on 19-Dec-2015
- Slide 1
- Slide 2
- ACFOCS 2004, Chakrabarti. Using Graphs in Unstructured and Semistructured Data Mining. Soumen Chakrabarti, IIT Bombay, www.cse.iitb.ac.in/~soumen
- Slide 3
- Acknowledgments: C. Faloutsos (CMU); W. Cohen (CMU); IBM Almaden (many colleagues); IIT Bombay (many students); S. Sarawagi (IIT Bombay); S. Sudarshan (IIT Bombay).
- Slide 4
- Graphs are everywhere: phone network, Internet, Web; databases, XML, email, blogs; web of trust (epinion); text and language artifacts (WordNet); commodity distribution networks. [Figures: Internet map [lumeta.com]; food web [Martinez1991]; protein interactions [genomebiology.com].]
- Slide 5
- Why analyze graphs? What properties do real-life graphs have? How important is a node? What is importance? Who is the best customer to target in a social network? Who spread a raging rumor? How similar are two nodes? How do nodes influence each other? Can I predict some property of a node based on its neighborhood?
- Slide 6
- Outline, some more detail. Part 1 (Modeling graphs): What do real-life graphs look like? What laws govern their formation, evolution and properties? What structural analyses are useful? Part 2 (Analyzing graphs): modeling data analysis problems using graphs; proposing parametric models; estimating parameters; applications from Web search and text mining.
- Slide 7
- Modeling and generating realistic graphs
- Slide 8
- Questions. What do real graphs look like (edges, communities, clustering effects)? What properties of nodes and edges are important to model (degree, paths, cycles, and so on)? What local and global properties are important to measure? How to artificially generate realistic graphs?
- Slide 9
- Modeling: why care? Algorithm design: can a skewed degree distribution make our algorithm faster? Extrapolation: how well will PageRank work on the Web 10 years from now? Sampling: make sure a scaled-down algorithm shows the same performance/behavior on large-scale data. Deviation detection: is this page trying to spam the search engine?
- Slide 10
- Laws: degree distributions. Q: the average degree is ~10; what is the most probable degree? [Figure: count vs. degree plot, degree 10 marked, curve shape shown as "??".]
- Slide 11
- Laws: degree distributions (continued). Q: the average degree is ~10; what is the most probable degree? [Figure: count vs. degree plots, degree 10 marked.]
- Slide 12
- Power-law: outdegree exponent $O$. The plot is linear in log-log scale [FFF99]: $\text{freq} = \text{degree}^{-2.15}$, so $O = -2.15$. Exponent = slope. [Figure: log-log plot of frequency vs. outdegree, Nov 97, slope -2.15.]
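A minimal sketch, not from the slides, of how a power-law exponent can be read off as the slope of a log-log plot. The function name `loglog_slope` and the synthetic data are assumptions for illustration; a least-squares line on log-log axes is a simple but statistically crude estimator.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

# Synthetic frequencies that follow freq = degree^(-2.15) exactly.
degrees = list(range(1, 101))
freqs = [d ** -2.15 for d in degrees]
slope = loglog_slope(degrees, freqs)
print(round(slope, 2))
```

On real degree counts the points scatter around the line, especially in the sparse high-degree tail, so the fitted slope is only an estimate of the exponent.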
- Slide 13
- Power-law: rank exponent $R$. The plot is a line in log-log scale. Exponent = slope, $R = -0.74$. Rank: nodes in decreasing outdegree order. [Figure: log-log plot of outdegree vs. rank, Dec 98.]
- Slide 14
- Eigenvalues. Let $A$ be the adjacency matrix of the graph. An eigenvalue $\lambda$ satisfies $A v = \lambda v$, where $v$ is some nonzero vector. Eigenvalues are strongly related to graph topology. [Figure: example graph on nodes A, B, C, D.]
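As an illustration of the definition, the following sketch approximates the largest eigenvalue of a small adjacency matrix by power iteration. The 4-node graph (a triangle with one pendant edge) is a hypothetical example, not the graph drawn on the slide, and `power_iteration` is a name chosen here.

```python
import math

def power_iteration(adj, iters=500):
    """Approximate the largest-magnitude eigenvalue of a symmetric 0/1 matrix."""
    n = len(adj)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient v^T A v (v has unit length).
        av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(av, v))
    return lam, v

# Hypothetical undirected graph on A, B, C, D: triangle A-B-C plus edge C-D.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
lam, v = power_iteration(A)
# Largest deviation of A v from lam * v; close to zero once converged.
residual = max(
    abs(sum(A[i][j] * v[j] for j in range(4)) - lam * v[i]) for i in range(4)
)
print(round(lam, 3), residual < 1e-9)
```

Power iteration needs the dominant eigenvalue to be unique in magnitude, which is why the example avoids a bipartite graph.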
- Slide 15
- Power-law: eigenvalues, exponent $E$. Eigenvalues in decreasing order. Exponent = slope, $E = -0.48$. [Figure: log-log plot of eigenvalue vs. rank of decreasing eigenvalue, Dec 98.]
- Slide 16
- The Node Neighborhood. $N(h)$ = # of pairs of nodes within $h$ hops. Let average degree = 3. How many neighbors should I expect within 1, 2, ..., $h$ hops? Potential answer: 1 hop -> 3 neighbors; 2 hops -> 3 * 3; $h$ hops -> $3^h$.
- Slide 17
- The Node Neighborhood (continued). $N(h)$ = # of pairs of nodes within $h$ hops. With average degree 3, the potential answer of $3^h$ neighbors within $h$ hops overcounts: WE HAVE DUPLICATES!
- Slide 18
- The Node Neighborhood (continued). $N(h)$ = # of pairs of nodes within $h$ hops. The potential answer of $3^h$ neighbors within $h$ hops also relies on the average degree. Avg degree: meaningless!
- Slide 19
- Power-law: hop-plot exponent $H$. Pairs of nodes as a function of hops: $N(h) = h^H$. $H = 4.86$ (Dec 98); $H = 2.83$ (router level, 95). [Figures: log-log plots of # of pairs vs. hops.]
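A minimal sketch of how a hop-plot can be computed, assuming an unweighted graph stored as an adjacency dictionary. Here $N(h)$ counts ordered pairs of distinct nodes; whether self-pairs are included is a convention the transcript does not settle, and `hop_plot` is a name chosen for this example.

```python
from collections import deque

def hop_plot(adj, max_h):
    """Return [N(1), ..., N(max_h)]: ordered pairs (u, v), u != v, within h hops."""
    counts = [0] * (max_h + 1)
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_h:
                continue  # do not explore beyond max_h hops
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    counts[dist[w]] += 1
                    queue.append(w)
    # Turn per-distance counts into cumulative counts.
    for h in range(2, max_h + 1):
        counts[h] += counts[h - 1]
    return counts[1:]

# Hypothetical example: a ring of 10 nodes, every node has degree 2.
n = 10
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(hop_plot(ring, 3))  # [20, 40, 60]
```

On the ring, $N(h)$ grows linearly in $h$ rather than as $2^h$, because breadth-first search counts each reachable node once instead of once per path.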
- Slide 20
- Observation. Q: What is the intuition behind the hop exponent? A: The "intrinsic" (= fractal) dimensionality of the network: $N(h) \sim h^1$, $N(h) \sim h^2$, ...
- Slide 21
- Any other laws? The Web looks like a "bow-tie" [Kumar+1999]: IN, SCC, OUT, tendrils, disconnected components.
- Slide 22
- Generators. How to generate graphs from a realistic distribution? Difficulty: simultaneously preserving many local and global properties seen in realistic graphs. Erdős-Rényi: switch on each edge independently with some probability. Problem: the degree distribution is not power-law. Alternatives: degree-based; process-based ("preferential attachment").
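A minimal sketch of the Erdős-Rényi generator, assuming an undirected graph without self-loops. The function name, the parameter values, and the fixed seed are choices made for this example. The degree histogram it produces is bunched around the mean, which is the "not power-law" problem noted above.

```python
import random
from collections import Counter

def erdos_renyi(n, p, seed=0):
    """Undirected G(n, p): each of the n*(n-1)/2 possible edges is on w.p. p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

n, p = 1000, 0.01            # expected degree (n - 1) * p, about 10
edges = erdos_renyi(n, p)
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
histogram = Counter(degree[u] for u in range(n))
avg_degree = 2 * len(edges) / n
print(avg_degree, max(histogram))   # mean degree and largest degree seen
```

Because each degree is a sum of independent coin flips, very large degrees are exponentially unlikely, unlike the heavy tails observed in the Web and Internet graphs.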
- Slide 23
- Degree-based generator. Fix the degree distribution (e.g., Zipf). Assign degrees to nodes. Add matching edges to satisfy degrees. No direct control over other properties. "ACL model" [AielloCL2000].
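The three steps above can be sketched as follows. This is a generic random stub-matching scheme under stated assumptions (truncated Zipf degrees, self-loops and multi-edges allowed), not necessarily the exact ACL construction; the function names and parameters are hypothetical.

```python
import random

def zipf_degrees(n, exponent, max_degree, rng):
    """Sample n degrees with Pr(d) proportional to d^(-exponent), 1 <= d <= max_degree."""
    support = list(range(1, max_degree + 1))
    weights = [d ** -exponent for d in support]
    degrees = rng.choices(support, weights=weights, k=n)
    if sum(degrees) % 2:      # total degree must be even to pair up all stubs
        degrees[0] += 1
    return degrees

def match_stubs(degrees, rng):
    """Pair up degree 'stubs' at random; self-loops and multi-edges may occur."""
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

rng = random.Random(0)
degrees = zipf_degrees(1000, 2.0, 100, rng)
edges = match_stubs(degrees, rng)

# Check that the realized degrees match the assigned ones
# (a self-loop contributes 2 to its node).
realized = [0] * len(degrees)
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
print(realized == degrees)
```

The degree sequence is met exactly, but nothing else (clustering, path lengths, cores) is controlled, which is the limitation the slide points out.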
- Slide 24
- Process-based: preferential attachment. Start with a clique with $m$ nodes. Add one node $v$ at every time step; $v$ makes $m$ links to old nodes. Suppose old node $u$ has degree $d(u)$. Let $p_u = d(u) / \sum_w d(w)$. $v$ invokes a multinomial distribution defined by the set of $p$'s and links to whichever $u$'s show up. At time $t$, there are $m+t$ nodes and $mt$ links. What is the degree distribution?
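A minimal simulation sketch of this process. Two details are assumptions of the example rather than statements from the slide: the $m$ draws are made with replacement (so multi-edges can occur), and $m \ge 2$ is required so that the initial clique has nonzero degrees. The edge count below also includes the initial clique's own edges.

```python
import random
from collections import Counter
from itertools import combinations

def preferential_attachment(m, steps, seed=0):
    """Grow a graph from an m-clique; each new node makes m degree-biased links."""
    if m < 2:
        raise ValueError("need m >= 2 so that the initial clique has edges")
    rng = random.Random(seed)
    edges = list(combinations(range(m), 2))
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks node u with probability d(u) / sum_w d(w).
    stubs = [x for e in edges for x in e]
    for t in range(steps):
        v = m + t
        targets = [rng.choice(stubs) for _ in range(m)]  # m draws with replacement
        for u in targets:
            edges.append((u, v))
            stubs.extend((u, v))
    return edges

m, steps = 3, 5000
edges = preferential_attachment(m, steps)
degree = Counter(x for e in edges for x in e)
print(len(degree), len(edges), max(degree.values()))
```

Keeping a flat list of edge endpoints is a common trick: it turns degree-proportional sampling into a uniform draw, at the cost of memory linear in the number of links.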
- Slide 25
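- Preferential attachment: analysis. $k_i(t)$ = degree of node $i$ at time $t$, a discrete random variable that we approximate as a continuous random variable. Let $\kappa_i(t) = E(k_i(t))$, the expectation over random linking choices. At time $t$, the infinitesimal expected growth rate of $\kappa_i(t)$ is, by linearity of expectation, $\frac{d\kappa_i}{dt} = m \cdot \frac{\kappa_i(t)}{2mt} = \frac{\kappa_i(t)}{2t}$, where $m$ is the number of degrees to add and $2mt$ is the total degree at time $t$. The initial condition is $\kappa_i(t_i) = m$, where $t_i$ is the time at which node $i$ was born.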
- Slide 26
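- Preferential attachment, continued. The expected degree of each node grows as the square root of its age: $\kappa_i(t) = m\sqrt{t/t_i}$, where $t_i$ is the time at which node $i$ was born. Let the current time be $t$. A node must be old enough for its degree to be large; for $\kappa_i(t) > k$, we need $t_i < m^2 t / k^2$. Therefore, the fraction of nodes with degree larger than $k$ is $\frac{m^2 t / k^2}{m + t} \approx \frac{m^2}{k^2}$, so $\Pr(\text{degree} = k) \approx \text{const}/k^3$ (data closer to 2).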
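The step from the tail fraction to the $1/k^3$ law can be written out explicitly. This is a sketch of the standard mean-field argument, with $\kappa_i(t)$ the expected degree of node $i$ at time $t$ and $t_i$ its birth time; it assumes one node is born per time step and $m \ll t$.

```latex
\begin{align*}
\kappa_i(t) = m\sqrt{t/t_i} > k
  &\iff t_i < \frac{m^2 t}{k^2}, \\
\Pr(\text{degree} > k)
  &\approx \frac{m^2 t / k^2}{m + t} \approx \frac{m^2}{k^2}, \\
\Pr(\text{degree} = k)
  &\approx -\frac{d}{dk}\,\frac{m^2}{k^2} = \frac{2m^2}{k^3}.
\end{align*}
```

The constant in the last line is $2m^2$, and the exponent 3 does not depend on $m$.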
- Slide 27
- Bipartite cores. Basic preferential attachment does not explain dense/complete bipartite cores (100,000s in an O(20 million)-page crawl), nor does it account for the influence of search engines. The story isn't over yet. [Figures: a 2:3 core (n:m core); plot of number of cores vs. log m, curves labeled n=2 and n=7.]
- Slide 28
- Other process-based generators. Copying model [KumarRRTU2000]: new node $v$ picks an old reference node $r$ u.a.r.; $v$ adds $k$ new links to old nodes; for the $i$-th link: w.p. $\alpha$ add a link to an old node picked u.a.r.; w.p. $1-\alpha$ copy the $i$-th link from $r$. More difficult to analyze. The reference node suggests compression techniques! H.O.T.: connect to closest, high-connectivity neighbor [Fabrikant+2002]. "Winner does not take all" [Pennock+2002].
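A minimal simulation sketch of the copying model as described in the transcript. The symbol `alpha` for the random-link probability, the seed graph (k + 1 nodes that all link to each other), and the function name are assumptions of this example.

```python
import random

def copying_model(n, k, alpha, seed=0):
    """Return outlinks[v] (k targets each) for a graph grown by link copying."""
    rng = random.Random(seed)
    # Assumed seed graph: k + 1 nodes, each linking to the other k.
    outlinks = [[u for u in range(k + 1) if u != v] for v in range(k + 1)]
    for v in range(k + 1, n):
        r = rng.randrange(v)                     # reference node, uniform over old nodes
        links = []
        for i in range(k):
            if rng.random() < alpha:
                links.append(rng.randrange(v))   # uniform random old node
            else:
                links.append(outlinks[r][i])     # copy the i-th link of r
        outlinks.append(links)
    return outlinks

outlinks = copying_model(n=5000, k=3, alpha=0.2)
indegree = [0] * len(outlinks)
for links in outlinks:
    for u in links:
        indegree[u] += 1
print(max(indegree))
```

Copying a link from a random reference node favors targets that already have many inlinks, so popular pages gain links faster without any explicit degree-proportional sampling.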
- Slide 29
- Reference-based graph compression. Well-motivated: pack the graph into limited fast memory for, e.g., query-time Web analysis. Standard approach: assign integer IDs to URLs in lexicographic order; delta or gamma encoding of outlink IDs. If link-copying is rampant, we should be able to compress Outlinks(u) by recording a reference node r and the difference between Outlinks(r) and Outlinks(u), the "correction". Finding r: what's optimal? What's practical? [Adler, Mitzenmacher, Boldi, Vigna 2002-2004]
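A minimal sketch of the standard approach: store gaps between sorted outlink IDs and write each gap in Elias gamma code. Bit strings are used instead of packed bytes for readability, and the convention of measuring the first gap from -1 (so that every gap is at least 1) is an assumption of this example.

```python
def gamma_encode(n):
    """Elias gamma code of an integer n >= 1, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def encode_outlinks(ids):
    """Gap-encode sorted, distinct outlink IDs, then gamma-code each gap."""
    bits, prev = [], -1
    for x in sorted(set(ids)):
        bits.append(gamma_encode(x - prev))   # gap >= 1, also for the first ID
        prev = x
    return "".join(bits)

def decode_outlinks(bits):
    """Invert encode_outlinks: read gamma codes and accumulate the gaps."""
    ids, prev, i = [], -1, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        prev += int(bits[i:i + zeros + 1], 2)
        ids.append(prev)
        i += zeros + 1
    return ids

code = encode_outlinks([3, 4, 9, 20])
print(code, len(code))            # 18 bits for four links
print(decode_outlinks(code))      # [3, 4, 9, 20]
```

Lexicographic URL ordering helps because links tend to point to nearby URLs on the same site, which keeps the gaps, and therefore the gamma codes, short.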
- Slide 30
- Reference-based compression, contd. r is a candidate reference for u if the overlap between Outlinks(r) and Outlinks(u) is large enough. Given G, construct G' in which there is a directed edge from r to u with edge cost = number of bits needed to write down Outlinks(u) relative to Outlinks(r). Add a dummy node z that has no outlinks in G; z is connected to each u in G', with cost(z,u) = #bits to write Outlinks(u) without a reference. Compute the shortest path tree rooted at z. In practice, pick a "recent" r: 2.58 bits/link.
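A minimal sketch of the practical heuristic of picking a recent reference, not of the shortest-path-tree construction. Several things here are assumptions: the correction is taken to be the symmetric difference of the two outlink sets, the cost is the number of IDs to write (a stand-in for bits), and candidates are the `window` preceding node IDs. Restricting references to earlier IDs keeps the reference chain acyclic.

```python
def pick_references(outlinks, window=8):
    """For each node u, pick a reference among the `window` preceding node IDs.

    Cost is a stand-in for bits: the number of IDs that must be written, i.e.
    len(outlinks[u]) with no reference, or the size of the symmetric
    difference with the reference's outlinks otherwise.
    """
    choice = {}
    for u in sorted(outlinks):
        best_ref, best_cost = None, len(outlinks[u])
        for r in range(max(0, u - window), u):
            if r in outlinks:
                cost = len(outlinks[u] ^ outlinks[r])
                if cost < best_cost:
                    best_ref, best_cost = r, cost
        choice[u] = (best_ref, best_cost)
    return choice

# Hypothetical toy graph: node 2 copies most of node 0's links.
outlinks = {
    0: {10, 11, 12, 13},
    1: {20, 21},
    2: {10, 11, 12, 14},
    3: {30},
}
choice = pick_references(outlinks)
print(choice[2])   # (0, 2): reference node 0, two corrections
```

A reference of `None` plays the role of the dummy node z: the outlinks are written out in full.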
- Slide 31
- Summary: power laws are everywhere. Bible: rank vs. word frequency. Length of file transfers [Bestavros+]. Web hit counts [Huberman]. Click-stream data [Montgomery+01]. Lotka's law of publication count (CiteSeer data). [Figures: log(freq) vs. log(rank), with the words "a" and "the" marked; log(count) vs. log(#citations), with J. Ullman marked.]
- Slide 32
- Resources. Generators: R-MAT (email address at cs.cmu.edu); BRITE, www.cs.bu.edu/brite/; INET, topology.eecs.umich.edu/inet. Visualization tools: Graphviz, www.graphviz.org; Pajek, vlado.fmf.uni-lj.si/pub/networks/pajek. Kevin Bacon web site: www.cs.virginia.edu/oracle. Erdős numbers etc.
- Slide 33
- R-MAT: Recursive MATrix generator. Goals: power-law in- and out-degrees; power-law eigenvalues; small diameter ("six degrees of separation"); simple, few para