TRANSCRIPT
A foray into graph miningNeil Shah
April 15th, 2019
(Graph) data is prevalent
• 2.5 exabytes of data produced every day
• 90% generated in the last 2 years
• Data is produced as the product of a highly interconnected world
1.3 billion users1 billion daily mobile views
244 million users480 million products
187 million daily actives3.5 billion daily snaps
(Graph) data shapes perspectives
• Movie recommendation
• Search engine ranking
• Product purchasing
• Social platform interaction
What’s in a graph?
• Graphs consist of nodes, edges and attributes
• ex: Facebook social network where
  • nodes = individuals
  • edges = friendship
  • attributes = gender (node), # of messages exchanged (edge)
• Graphs can easily model relationships between entities
  • Who-follows-whom on a social network
  • Who-buys-what on an e-commerce platform
  • Who-calls-whom using a certain cellular provider
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection
• Takeaways
Graph preliminaries – directionality
[Figure: a users-by-users graph on nodes u1–u11, shown first undirected and then directed]
Graph preliminaries – degree
• Degree: # of adjacent edges
• Degree(u7) = 2
[Figure: the users-by-users example graph; u7 has two adjacent edges]
Graph preliminaries – out- and in-degree
• Degree: # of adjacent edges
  • Out-degree: # of outgoing edges
  • In-degree: # of incoming edges
• Out-degree(u4) = 1
• In-degree(u6) = 2
[Figure: the directed users-by-users example graph]
Graph preliminaries – weighted degree
• Weighted degree: total sum of adjacent edge weights
  • i.e. “how many times did two users communicate”
• Weighted-degree(u6) = 7
[Figure: the weighted users-by-users example graph with edge weights 3, 4, 12, 9, 1, 6]
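To make the degree notions concrete, here is a minimal sketch in Python. The directed, weighted edge list is hypothetical, chosen so the counts match the slides' examples:

```python
from collections import defaultdict

# Hypothetical directed, weighted edge list: (source, target, weight)
edges = [("u1", "u2", 3), ("u2", "u6", 4), ("u4", "u6", 1),
         ("u6", "u11", 2), ("u3", "u7", 12), ("u5", "u7", 9)]

out_deg = defaultdict(int)
in_deg = defaultdict(int)
weighted_deg = defaultdict(int)  # sum of adjacent edge weights, ignoring direction

for u, v, w in edges:
    out_deg[u] += 1
    in_deg[v] += 1
    weighted_deg[u] += w
    weighted_deg[v] += w

# total degree = out-degree + in-degree for a directed graph
degree = {n: out_deg[n] + in_deg[n] for n in set(out_deg) | set(in_deg)}
```

With these edges, Out-degree(u4) = 1, In-degree(u6) = 2, and Weighted-degree(u6) = 4 + 1 + 2 = 7, as in the slides.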
Graph preliminaries – ego(net)
• Ego: single, central node
• Ego network (egonet): nodes and edges within one “hop” from ego
• Egonet(u7) =
  • Nodes {u7, u3, u5}
  • Edges {u7-u3, u7-u5}
[Figure: the users-by-users example graph with u7’s egonet highlighted]
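The egonet definition above can be sketched directly; the adjacency list is hypothetical, chosen to reproduce the Egonet(u7) example:

```python
# Undirected adjacency list for a small hypothetical graph
adj = {
    "u1": {"u3"},
    "u3": {"u1", "u7"},
    "u5": {"u7"},
    "u7": {"u3", "u5"},
}

def egonet(adj, ego):
    """Nodes and edges within one hop of the ego."""
    nodes = {ego} | adj.get(ego, set())
    # keep only edges whose both endpoints lie inside the egonet
    edges = {frozenset((u, v))
             for u in nodes for v in adj.get(u, set())
             if v in nodes}
    return nodes, edges

nodes, edges = egonet(adj, "u7")
```

Note that u1 is excluded: it is two hops from u7, so neither it nor the u1–u3 edge belongs to the egonet.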
Graph preliminaries – connectivity
• Two nodes are connected if there is a path between them.
• A graph is fully connected if all node pairs are connected.
• u1 and u8 are connected
• u3 and u5 are connected
• u1 and u9 are not connected
• This graph is not fully connected
[Figure: the users-by-users example graph, which has more than one connected component]
Graph preliminaries – node and edge types
• A heterogeneous graph has multiple node and/or edge types.
• Users and products• Who-buys-what and who-rates-what
[Figure: bipartite users-by-products graph with users u1–u6 and products p1–p5]
Graph preliminaries – matrix representation
• Graph connectivity can be summarized in an adjacency matrix.
  • A_{i,j} = # (or weight) of edges from node i to j
  • A is usually very sparse (makes compact representations possible!)
[Figure: the users-by-users example graph and its sparse 0/1 users × users adjacency matrix]
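A minimal sketch of the adjacency-matrix idea over a hypothetical edge list; the dict-of-keys variant stores only the nonzeros, which is what makes sparse graphs compact:

```python
# Hypothetical directed edges among 5 nodes, indexed 0..4
edges = [(0, 1), (1, 2), (3, 2), (2, 4), (2, 4)]  # duplicate edge is counted

# Dense adjacency matrix: A[i][j] = # of edges from node i to node j
n = 5
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] += 1

# Sparse (dict-of-keys) representation: store only nonzero cells
A_sparse = {}
for i, j in edges:
    A_sparse[(i, j)] = A_sparse.get((i, j), 0) + 1
```

The dense matrix holds 25 cells here; the sparse dict holds only 4 entries, and the gap widens dramatically for real graphs with millions of nodes.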
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection
• Takeaways
Key question: What does a graph “look like”?
• At first look… large, unwieldy and seemingly random.
• Spoiler: In actuality, most real-world graphs are far from random.
Trace-route paths on the internet (Lyon ’03)
A quick detour: “Random” graphs
• Erdős–Rényi random graph model: graph G(n, p)
  • n = number of nodes
  • p = probability of an edge between two nodes (independent edges)
• Expected # of edges: E[#edges] = p · n(n−1)/2
• Degree distribution is binomial: P(deg = k) = C(n−1, k) · p^k · (1−p)^(n−1−k)
Babaoglu ’18
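A sketch of sampling from G(n, p) and checking the expected edge count; the parameters are illustrative:

```python
import random

def erdos_renyi(n, p, seed=None):
    """Sample an undirected G(n, p) graph as a set of edges (i, j), i < j."""
    rng = random.Random(seed)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}

n, p = 200, 0.05
g = erdos_renyi(n, p, seed=42)

# E[#edges] = p * C(n, 2): each of the n(n-1)/2 pairs is an edge independently
expected_edges = p * n * (n - 1) / 2
```

For n = 200 and p = 0.05 the expectation is 995 edges; a sampled graph's edge count should land close to that.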
What about real graphs?
• X-axis: degree; Y-axis: frequency/probability
• Degree distributions of real graphs are not “random”
• What exactly are they, then?
[Plots: log(# posts) vs. log(# users); log(# visitors) vs. log(# sites); log(# peers) vs. log(# routers)]
Faloutsos ‘99, Viswanath ‘09, Adamic ‘02
The “scale-free” property
• Real-world graphs are often scale-free, meaning that their degree distribution obeys a power law: p(k) ∝ k^(−γ)
• Scaling the input by a multiple simply results in proportional scaling of the whole function
• Power laws are linear on log-log scales
• Typically 2 ≤ γ ≤ 3
Scale-freeness is evident in many domains
Newman ‘05
Why are many real graphs scale-free?
• Hypothesis: preferential attachment, or a “rich-get-richer” effect
• Generative process to construct a network:
  • Start with m₀ nodes, each with at least 1 edge
  • At each timestep, add a new node with m edges connecting it to m already existing nodes
  • Probability of the new node connecting to node i depends on the degree dᵢ as P(i) = dᵢ / Σⱼ dⱼ
• Many real-world variants of this effect: academic citations, recommendation, virality
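The generative process above can be sketched as follows. The repeated-endpoints list is a standard trick: a node appears in it once per adjacent edge, so uniform sampling from the list is automatically degree-proportional. All parameters are illustrative:

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph where each new node attaches to m existing nodes
    with probability proportional to their current degree."""
    rng = random.Random(seed)
    # start with a small clique of m + 1 nodes so every node has >= 1 edge
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # node i appears in `endpoints` degree(i) times
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:            # m distinct degree-biased picks
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((t, new))
            endpoints += [t, new]
    return edges

edges = preferential_attachment(500, 2, seed=0)
```

Early nodes accumulate many more edges than late arrivals (the “rich get richer”), producing the heavy-tailed degree distribution discussed above.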
Real graphs have “small-world” effects
• How “far apart” are nodes in real graphs?
• Interestingly, not very far! The typical number is 6. You may have heard of the “six degrees of separation”
• Milgram ‘69: avg. # of hops for a letter to travel from Nebraska to Boston was 6.2 (sample size 64)
• Leskovec ‘08: distance between node pairs on MSN messenger has mode 6 (sample size 180M nodes and 1.3B edges)
What causes the small-world effect?
• Hypothesis: The abundance of hubs, or high-degree nodes
• Even though most nodes aren’t connected to most other nodes, they are connected to hubs, which facilitate short paths
How do real graphs “grow” over time?
• Consider a time-evolving graph G
  • If it has n(t) nodes and e(t) edges at time t…
  • Suppose that n(t + 1) = 2n(t)
  • What is e(t + 1)?
• Not only is it > 2e(t); the growth is actually superlinear and follows a power law e(t) ∝ n(t)^α, with 1 ≤ α ≤ 2 generally
Real graphs exhibit densification
Avg. out-degree increases over time Power-law in # edges vs. # nodes (over time)
Moreover, the graph diameter shrinks
• Graph diameter = max(distance between node pairs)
• Leskovec ‘05 shows that diameter actually shrinks over time, instead of growing. In other words, nodes tend to get closer
• Hypothesis: Once again due to prevalence and growth of hubs
Much more work done on graph behaviors
• Generative graph models (Leskovec ‘05)
• Patterns in sizes of connected components (Kang ‘10)
• Node in-degree (popularity) over time (McGlohon ‘07)
• Duration of calls in phone-call networks (Vaz de Melo ‘10)
• Temporal structure evolution (Shah ‘15)
…
the list goes on
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection
• Takeaways
Key question: how can we leverage graphs for recommendation/ranking tasks?
• Measuring webpage importance
• Link prediction and recommendation
  • Local methods
  • Global methods
PageRank for large-scale search engines
• Key problem: how to prioritize/curate a large (ever-growing) hyperlinked body of pages by importance and relevance?
• Key idea: leverage the hyperlink citation graph (page-links-page) to rank page importance according to connectivity patterns
• 150 million web pages → 1.7 billion links

Backlinks and forward links:
• A and B are C’s backlinks
• C is A’s and B’s forward link
Content adapted from Li ‘09
Simplified PageRank
• u: a web page
• B_u: the set of u’s backlinks
• N_v: the number of forward links of page v
• c: the normalization factor that makes R a probability distribution

R(u) = c · Σ_{v ∈ B_u} R(v) / N_v

• Simplified PageRank is the stationary probability distribution of a random walk on the graph: a surfer keeps clicking successive pages at random.
Idea: each page equally distributes its own PageRank to its forward links, recursively.
“An important page has many important pages pointing to it”
Simplified PageRank
PageRank calculation: first, second, … iterations
[Figure: 3-page example with Yahoo, Amzn, and MS. The adjacency matrix is transposed and column-normalized, which accounts for each page distributing its rank equally among its forward links. Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each.” Starting from uniform initial PageRank scores, the matrix is applied once per iteration; the scores converge after some iterations.]
Problem with Simplified PageRank
A loop:
During each iteration, the loop accumulates rank but never distributes rank to other pages!
The problem in practice
[Figure: the same 3-page example, but with Microsoft linking only to itself. Read as “Microsoft gives all of its PageRank to Microsoft.” Iterating the transposed, column-normalized adjacency matrix from the initial PageRank scores drains all rank into the loop.]
All roads lead to Microsoft
A modified solution: (true) PageRank
• This subtle modification solves the problem of “sinks”
• PageRanks converge to the dominant eigenvector of the appropriately configured/normalized adjacency matrix, due to Markov chain theory! Cool!
• Modified PageRank is the same as the simple model, with the exception of the surfer having a random jump probability:

R′(u) = c · Σ_{v ∈ B_u} R′(v) / N_v + (1 − c) · E(u)

E(u): a distribution of ranks over web pages that the surfer can jump to when he/she “gets bored” after clicking on successive links.
A modified solution: PageRank
[Figure: the 3-page example with a 20% random jump probability; the transposed, column-normalized adjacency matrix is combined with the jump term, and iteration from the initial PageRank scores now converges to a sensible distribution.]
PageRank converges quickly and produces empirically good results
• PR (322 million links): 52 iterations
• PR (161 million links): 45 iterations
• Scaling factor is roughly linear in log n
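The computation above can be sketched as a short power iteration. The 3-page graph is hypothetical, mirroring the Yahoo/Amazon/Microsoft example; note that Microsoft, linking only to itself, would be a rank sink without the random jump:

```python
def pagerank(out_links, d=0.8, iters=50):
    """Power iteration for PageRank; 1 - d is the random-jump probability."""
    nodes = list(out_links)
    n = len(nodes)
    ranks = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # every node starts each round with its share of the random jump
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            share = d * ranks[u] / len(out_links[u])
            for v in out_links[u]:     # u splits its rank among forward links
                nxt[v] += share
        ranks = nxt
    return ranks

# Hypothetical 3-page web mirroring the slides' example
links = {
    "yahoo": ["yahoo", "amazon"],
    "amazon": ["yahoo", "microsoft"],
    "microsoft": ["microsoft"],   # a rank "sink" without random jumps
}
ranks = pagerank(links, d=0.8)
```

Even with a 20% jump probability, the self-loop lets Microsoft keep the largest share of the rank; without the jump it would absorb all of it.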
Key question: how can we leverage graphs for recommendation/ranking tasks?
• Measuring webpage importance
• Link prediction and recommendation
  • Local methods
  • Global methods
Exploiting local structure for predicting links
• Key problem: given what we know about interactions in a graph G, which nodes should we recommend to a user u to promote engagement?
• Key idea: measure affiliation between u and other nodes using u’s graph neighborhood!
Rich literature in previous measures
Liben-Nowell ‘04
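A few classic local measures from that literature (common neighbors, Jaccard, Adamic–Adar), sketched over a hypothetical friendship graph:

```python
import math

adj = {  # hypothetical undirected friendship graph
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"c"},
}

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def adamic_adar(adj, u, v):
    # down-weights common neighbors that are high-degree hubs
    # (assumes common neighbors have degree > 1, so log is nonzero)
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

score = common_neighbors(adj, "b", "d")   # candidate link b-d
```

Each score ranks candidate (u, v) pairs; the highest-scoring non-edges become the recommendations.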
Key question: how can we leverage graphs for recommendation/ranking tasks?
• Measuring webpage importance
• Link prediction and recommendation
  • Local methods
  • Global methods
Exploiting global structure for predicting links
• Key problem: given what we know about interactions in a graph G, which nodes should we recommend to a user u to promote engagement?
• Key idea: measure affiliation between u and other nodes via a latent factor model/embedding that compactly encodes “interests”
Singular value decomposition
• Used for low-rank matrix approximation
• Rank-k SVD reduces matrix A into k latent factors/dense blocks/communities
• U and V capture “involvement” of nodes
• Σ denotes factor “strength”

A ≈ U Σ Vᵀ, where A is n×m, U is n×k, Σ is a k×k diagonal matrix of singular values σ₁ ≥ σ₂ ≥ … ≥ σ_k, and Vᵀ is k×m
Singular value decomposition
[Figure: for an n users × m videos matrix, A ≈ σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + σ₃u₃v₃ᵀ + …, with latent factors interpretable as, e.g., “music lovers”/“artist spotlights”, “adrenaline junkies”/“action movies”, “dabbling cooks”/“baking shows”]
Recommendation from latent factors
• SVD effectively constructs vector embeddings in a k-dimensional space which “summarize” user/item affinities towards latent factors
• Compute vector similarity between user-user or user-item vectors (depending on application)
  • Cosine similarity/dot product are common choices
Koren ‘09
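A sketch of this pipeline using NumPy's SVD on a hypothetical users × videos matrix with two obvious taste clusters; the truncated factors yield user embeddings whose cosine similarity separates the clusters:

```python
import numpy as np

# Hypothetical 4 users x 5 videos matrix with two taste clusters:
# users 0-1 watch videos 0-1, users 2-3 watch videos 2-4
A = np.array([
    [5, 4, 0, 0, 0],
    [4, 5, 0, 0, 0],
    [0, 0, 5, 4, 0],
    [0, 0, 4, 5, 1],
], dtype=float)

# Rank-k truncated SVD: A ~= U_k diag(s_k) V_k^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# k-dimensional user embeddings, scaled by factor strength
user_vecs = U[:, :k] * s[:k]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_same = cos(user_vecs[0], user_vecs[1])  # same taste cluster
sim_diff = cos(user_vecs[0], user_vecs[2])  # different clusters
```

Users 0 and 1 land essentially on the same latent direction (cosine near 1), while users 0 and 2 are nearly orthogonal, so recommendations would flow within, not across, the clusters.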
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and prediction
  • Clustering
  • Anomaly detection
• Takeaways
Graph clustering for knowledge extraction
• Key problem: what can we learn about group dynamics from graph interactions? Are there natural “clusters” of behaviors?
• Key idea: tightly-knit graph interactions form graph clusters, which can indicate community behaviors. These are useful for
  • Behavioral understanding
  • Computational load balancing
  • Graph compression
  • Visualization
[Figure: advertiser–query bipartite graph with visible clusters]
Finding graph clusters
• Given a graph G, we want to find clusters
• Need to:
  • Formalize the notion of a cluster
  • Design an algorithm that will find sets of nodes that are good clusters
Content adapted from Leskovec ‘10
Clustering objective functions
• Essentially all objectives use the intuition that a good cluster S has
  • Many edges internally
  • Few edges pointing outside
• Simplest objective function: conductance
• Small conductance corresponds to good clusters
• There are many other formalizations of roughly this intuition
• Graph objectives are generally hard to optimize directly. Greedy/approximate algorithms are common
Clustering objective functions
• Single-criterion (considers either internal or external)
  • Modularity: m − E(m)
  • Modularity Ratio: m / E(m)
  • Volume: Σ_u d(u) = 2m + c
  • Edges cut: c
• Multi-criterion (considers both)
  • Conductance: c / (2m + c)
  • Expansion: c / n
  • Density: 1 − m/n²
  • Cut Ratio: c / (n(N − n))
  • Normalized Cut: c/(2m + c) + c/(2(M − m) + c)
  • Max-ODF: max fraction of edges of a node pointing outside S
  • Average-ODF: avg. fraction of edges of a node pointing outside S
  • Flake-ODF: fraction of nodes with more than ½ of their edges inside S

n: # nodes in S; m: # edges in S; c: # edges pointing outside S; N: # nodes in G; M: # edges in G
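Conductance as defined above, c / (2m + c), is easy to compute directly; the two-triangle edge list is hypothetical:

```python
# Hypothetical undirected edge list with two obvious clusters
edges = [("a", "b"), ("b", "c"), ("a", "c"),   # cluster {a, b, c}
         ("d", "e"), ("e", "f"), ("d", "f"),   # cluster {d, e, f}
         ("c", "d")]                           # single bridging edge

def conductance(edges, S):
    """phi(S) = c / (2m + c): cut edges over total degree inside S."""
    S = set(S)
    m = sum(1 for u, v in edges if u in S and v in S)     # internal edges
    c = sum(1 for u, v in edges if (u in S) != (v in S))  # cut edges
    return c / (2 * m + c)

good = conductance(edges, {"a", "b", "c"})   # a natural cluster
bad = conductance(edges, {"a", "b", "d"})    # a set straddling both clusters
```

The natural cluster scores 1/7 while the straddling set scores 5/7, matching the intuition that small conductance means a good cluster.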
Multiple types of clustering algorithms
• Global spectral
  • Compute graph Laplacian matrix L = D − A
  • Find the eigenvector for the 2nd smallest eigenvalue of L
  • Split by sign to get a partitioning of nodes (related to graph “cut”)
  • Recurse to get more clusters
• Local spectral
  • Pick a random seed node
  • Build local clusters around seed nodes based on random walk/PageRank
  • Prune cluster from graph and repeat
Flow-based algorithms
• METIS: multi-level graph partitioning
  • If it’s too expensive to partition a big graph… coarsen it into a smaller graph
  • If it’s still too big, keep coarsening
  • Compute a partition and uncoarsen the graph
  • Improve heuristically
    • Swap vertices
    • Local search
Measuring clustering algorithm performance
• How to quantify performance:
  • What is the score of clusters across a range of sizes?
• Network Community Profile (NCP) (Leskovec ‘08)
  • The score of the best cluster of size k
NCPs for a real graph (LiveJournal)
• 500-node communities from Local Spectral
• 500-node communities from METIS
• Interestingly, Local Spectral clusters are more compact and tighter, despite having higher (worse) conductance than METIS!
NCPs for various objectives (Local Spectral)
• Multiple objectives can be pretty similar
  • Conductance, Expansion, Normalized Cut, Cut-ratio, Avg-ODF
• Max-ODF prefers small clusters; Flake-ODF prefers large clusters
• Internal density not very good (large clusters are very sparse)
You should know…
• Many types of clustering objectives and algorithms -- can use NCP to analyze them
• Not many “good” large clusters – real graphs are complicated!
• Different objectives bias for various aspects (cluster size, internal and external connectivity)
• Overemphasis on clustering objectives can actually lead to “bad”-looking clusters according to human intuition
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and prediction
  • Clustering
  • Anomaly detection
• Takeaways
Graph-based anomaly detection
• Key problem: what kinds of anomalous behaviors exist in real graphs, and can we find such anomalies automatically?
• Key idea: we can identify various types of “anomalous” behaviors by building null/normal models and penalizing excessive deviation
  • Node-based anomalies
  • Group anomalies (too large, too dense to be a real community)
Anomalies in graphs: important applications
• Email networks• Spammers
• Computer networks• Hackers/port scanning
• Phone-call networks• Telemarketers
• Social networks• Fake engagement
Major goal
• How to go from a graph to a quantitative model/pattern?
Local, egonet-based anomaly detection
• What does a typical node look like?
  • Can’t say much about just a node in isolation
  • Let’s consider the egonets!
• For each node,
  • extract egonet (= 1-step-away neighbors)
  • extract features (# edges, total weight, etc.)
  • extract patterns (norms)
  • compare with the rest of the population (detect anomalies)
Content adapted from Akoglu ‘10
What is anomalous?
• Not obvious!
What is anomalous?
• Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc.
• Near-clique: tightly connected people, terrorist groups?, discussion group, etc.
• Heavy vicinity: too much money wrt number of accounts, high donation wrt number of donors, etc.
• Dominant heavy link: single-minded, tight company
Basic features to study
• N_i: number of neighbors (degree) of ego i
• E_i: number of edges in egonet i
• W_i: total weight of egonet i
• λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
Obs. 1: Egonet Density Power Law
E_i ∝ N_i^α, with 1 ≤ α ≤ 2
Differentiates “dense” from “sparse” neighborhoods
Obs. 2: Egonet Weight Power Law
W_i ∝ E_i^β, with β ≥ 1
Differentiates “heavy” from “light” neighborhoods
Obs. 3: Egonet λw-Weight Power Law
λ_{w,i} ∝ W_i^γ, with 0.5 ≤ γ ≤ 1
Differentiates “uniform” distribution from “dominant” heavy edges
Scoring node anomalies
Anomaly ≈ violates our “laws” + far away from most points
• score_dist = distance to the fitting line
• score_outl = outlierness score
• score = func(score_dist, score_outl)
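The score_dist idea can be sketched as follows: fit the power law in log-log space by least squares, then score each egonet by its distance to the fitting line. The (N_i, E_i) pairs are made up, with one planted near-clique whose edge count is far above the trend:

```python
import math

# Hypothetical (degree N_i, egonet edges E_i) pairs; the last node is a
# near-clique whose E_i sits far above the power-law trend
pairs = [(2, 2), (4, 5), (8, 11), (16, 24), (32, 50), (10, 45)]

# Least-squares fit of log E = alpha * log N + c
xs = [math.log(n) for n, e in pairs]
ys = [math.log(e) for n, e in pairs]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
c = my - alpha * mx

# score_dist: distance of each point from the fitting line in log space
scores = [abs(y - (alpha * x + c)) for x, y in zip(xs, ys)]
most_anomalous = scores.index(max(scores))
```

The fitted exponent lands in the 1 ≤ α ≤ 2 range from Obs. 1, and the planted near-clique gets by far the largest distance score.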
Triaging anomalies
• Can interpret the type of anomaly
• Can sort nodes by their outlierness scores
Interesting results: Blog post-to-post graph
• Part of a group of posts that all link to each other
• A post linking to many other posts indiscriminately
Interesting results: Committee-to-candidate donations graph
• $87M - DNC
• $25M - RNC
Interesting results: Author-to-conference publishing graph
• Has published 40 papers, but all to the same conference (and nowhere else)
• Have published hundreds of papers, to almost as many conferences!
Group anomalies on graphs
[Figure: example egonets (Bob’s, Carol’s, Alice’s) with Alice highlighted]
Content adapted from Shin ‘16
Fraud forms dense blocks
[Figure: accounts × restaurants adjacency matrix; fraudulent ratings form dense blocks]
Tensor modeling for attributed graphs
• Natural dense blocks are sparse on the time axis (formed gradually)
• Suspicious dense blocks are also dense on the time axis (due to synchronous behavior)
• Suspicious dense blocks are therefore denser than natural dense blocks in the tensor model
[Figure: accounts × restaurants × timestamps adjacency tensor; a cell indicates that account i rates restaurant j at time t]
Applications
• Dense blocks signal anomalies/fraud in many multi-attribute graphs
  • TCP dumps: src IP × dst IP × timestamp
  • Wikipedia revision history: user × page × timestamp
  • Time-evolving social network: src user × dst user × timestamp
How to find dense blocks in such tensors?
• Exact solutions are combinatorial and intractable
• Greedy solutions and heuristics are practical (i.e. greedily optimize a “suspiciousness” metric)
• What metric?
Assume a block (subtensor) B, of dimensions n₁ × n₂ × n₃, in a 3-way tensor R:
• Size(B) = n₁ + n₂ + n₃
• Vol(B) = n₁ × n₂ × n₃
• Mass(B) = sum of entries in B

Some notable choices:
• Traditional density: ρ_t(B, R) = Mass(B) / Vol(B) (maximized by a single entry with max. value)
• Arithmetic avg. degree: ρ_a(B, R) = Mass(B) / Size(B)
• Geometric avg. degree: ρ_g(B, R) = Mass(B) / ∛Vol(B)
Detecting a single dense block
• Greedy search method: start from the entire tensor
• Repeatedly remove a slice so as to maximize the density ρ of what remains
• Output: return the densest block seen so far
[Worked example: starting from a small matrix with ρ = 2.9, successive slice removals raise the density to ρ = 3, then 3.3, then 3.6; the block with ρ = 3.6 is returned]

Handling multiple blocks
• Remove found blocks before finding others: find & remove, find & remove, …, then restore
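A toy sketch of the greedy single-block search on a 2-way “tensor” (a matrix), using the arithmetic-average-degree density Mass/Size and removing the minimum-mass slice at each step; the matrix is hypothetical:

```python
def densest_block(rows):
    """Greedy dense-block search on a nonnegative matrix: peel off the
    minimum-mass row or column, tracking density = Mass / Size."""
    cells = {(i, j): v for i, row in enumerate(rows)
             for j, v in enumerate(row) if v > 0}
    rows_left = {i for i, _ in cells}
    cols_left = {j for _, j in cells}
    best, best_density = None, -1.0
    while rows_left and cols_left:
        mass = sum(cells[(i, j)] for i in rows_left for j in cols_left
                   if (i, j) in cells)
        density = mass / (len(rows_left) + len(cols_left))
        if density > best_density:
            best_density = density
            best = (set(rows_left), set(cols_left))

        def slice_mass(axis, k):
            if axis == 0:   # mass of row k restricted to surviving columns
                return sum(v for (i, j), v in cells.items()
                           if i == k and j in cols_left)
            return sum(v for (i, j), v in cells.items()
                       if j == k and i in rows_left)

        candidates = ([(slice_mass(0, i), 0, i) for i in rows_left]
                      + [(slice_mass(1, j), 1, j) for j in cols_left])
        _, axis, k = min(candidates)      # peel the minimum-mass slice
        (rows_left if axis == 0 else cols_left).discard(k)
    return best, best_density

M = [[5, 3, 0],
     [4, 6, 1],
     [2, 0, 0],
     [1, 0, 1]]
block, density = densest_block(M)
```

On this matrix the search peels away the light rows and the light column, and the densest block seen is the top-left 2×2 block with density 18/4 = 4.5. The real M-Zoom algorithm generalizes the same loop to N-way tensors.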
Algorithm details
• Theorem 1 [Remove Minimum Mass First]: among slices in the same mode, removing the slice with minimum mass is always best
  • e.g., for slice masses 12 > 9 > 2, remove the mass-2 slice first
• Theorem 2 [Approximation Guarantee]: the block B returned by the greedy search satisfies
  ρ(B, R) ≥ (1/N) · ρ(B*, R)
  where ρ is the density metric, R the input tensor, N its order, and B* the densest block
Practical discoveries
TCP connections forming the densest blocks are network attacks
[Figure: first three blocks found in the src IP × dst IP × timestamp tensor]
Practical discoveries
First three blocks found by M-Zoom in the user × page × timestamp tensor
• Page edit wars: 11 users revised 10 pages 2,305 times within 16 hours
Takeaways
• Graphs provide a means of describing interactions between objects
• Almost all real graphs are “non-random” and obey various patterns
• Considerable literature in graph mining focuses on learning to leverage large-scale interaction patterns to
  • Recommend users new content based on what they might like
  • Identify interesting group behaviors and community norms
  • Discover abnormalities that correspond to fraud or “audit-worthy” events