TRANSCRIPT
A foray into graph miningNeil Shah
April 15th, 2019
(Graph) data is prevalent
• 2.5 exabytes of data produced every day
• 90% generated in the last 2 years
• Data is produced as the product of a highly interconnected world
1.3 billion users1 billion daily mobile views
244 million users480 million products
187 million daily actives3.5 billion daily snaps
(Graph) data shapes perspectives
• Movie recommendation
• Search engine ranking
• Product purchasing
• Social platform interaction
What’s in a graph?
• Graphs consist of nodes, edges and attributes
• ex: Facebook social network where
  • nodes = individuals
  • edges = friendship
  • attributes = gender (node), # of messages exchanged (edge)
• Graphs can easily model relationships between entities
  • Who-follows-whom on a social network
  • Who-buys-what on an e-commerce platform
  • Who-calls-whom using a certain cellular provider
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection
• Takeaways
Graph preliminaries – directionality
[Figure: a users-by-users graph on nodes u1–u11, shown first undirected and then directed]
Graph preliminaries – degree
• Degree: # of adjacent edges
• Degree(u7) = 2
[Figure: the users-by-users example graph; u7 has two adjacent edges]
Graph preliminaries – out- and in-degree
• Degree: # of adjacent edges
  • Out-degree: # of outgoing edges
  • In-degree: # of incoming edges
• Out-degree(u4) = 1
• In-degree(u6) = 2
[Figure: the directed users-by-users example graph]
Graph preliminaries – weighted degree
• Weighted degree: total sum of adjacent edge weights
  • i.e. “how many times did two users communicate”
• Weighted-degree(u6) = 7
[Figure: the weighted users-by-users example graph with edge weights 3, 4, 12, 9, 1, 6]
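To make the degree notions concrete, here is a minimal sketch in Python. The directed, weighted edge list is hypothetical, chosen so the counts match the slides' examples:

```python
from collections import defaultdict

# Hypothetical directed, weighted edge list: (source, target, weight)
edges = [("u1", "u2", 3), ("u2", "u6", 4), ("u4", "u6", 1),
         ("u6", "u11", 2), ("u3", "u7", 12), ("u5", "u7", 9)]

out_deg = defaultdict(int)
in_deg = defaultdict(int)
weighted_deg = defaultdict(int)  # sum of adjacent edge weights, ignoring direction

for u, v, w in edges:
    out_deg[u] += 1
    in_deg[v] += 1
    weighted_deg[u] += w
    weighted_deg[v] += w

# total degree = out-degree + in-degree for a directed graph
degree = {n: out_deg[n] + in_deg[n] for n in set(out_deg) | set(in_deg)}
```

With these edges, Out-degree(u4) = 1, In-degree(u6) = 2, and Weighted-degree(u6) = 4 + 1 + 2 = 7, as in the slides.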
Graph preliminaries – ego(net)
• Ego: single, central node
• Ego network (egonet): nodes and edges within one “hop” from ego
• Egonet(u7) =
  • Nodes {u7, u3, u5}
  • Edges {u7-u3, u7-u5}
[Figure: the users-by-users example graph with u7’s egonet highlighted]
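The egonet definition above can be sketched directly; the adjacency list is hypothetical, chosen to reproduce the Egonet(u7) example:

```python
# Undirected adjacency list for a small hypothetical graph
adj = {
    "u1": {"u3"},
    "u3": {"u1", "u7"},
    "u5": {"u7"},
    "u7": {"u3", "u5"},
}

def egonet(adj, ego):
    """Nodes and edges within one hop of the ego."""
    nodes = {ego} | adj.get(ego, set())
    # keep only edges whose both endpoints lie inside the egonet
    edges = {frozenset((u, v))
             for u in nodes for v in adj.get(u, set())
             if v in nodes}
    return nodes, edges

nodes, edges = egonet(adj, "u7")
```

Note that u1 is excluded: it is two hops from u7, so neither it nor the u1–u3 edge belongs to the egonet.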
Graph preliminaries – connectivity
• Two nodes are connected if there is a path between them.
• A graph is fully connected if all node pairs are connected.
• u1 and u8 are connected
• u3 and u5 are connected
• u1 and u9 are not connected
• This graph is not fully connected
[Figure: the users-by-users example graph, which has more than one connected component]
Graph preliminaries – node and edge types
• A heterogeneous graph has multiple node and/or edge types.
• Users and products• Who-buys-what and who-rates-what
[Figure: bipartite users-by-products graph with users u1–u6 and products p1–p5]
Graph preliminaries – matrix representation
• Graph connectivity can be summarized in an adjacency matrix.
  • A_{i,j} = # (or weight) of edges from node i to j
  • A is usually very sparse (makes compact representations possible!)
[Figure: the users-by-users example graph and its sparse 0/1 users × users adjacency matrix]
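A minimal sketch of the adjacency-matrix idea over a hypothetical edge list; the dict-of-keys variant stores only the nonzeros, which is what makes sparse graphs compact:

```python
# Hypothetical directed edges among 5 nodes, indexed 0..4
edges = [(0, 1), (1, 2), (3, 2), (2, 4), (2, 4)]  # duplicate edge is counted

# Dense adjacency matrix: A[i][j] = # of edges from node i to node j
n = 5
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] += 1

# Sparse (dict-of-keys) representation: store only nonzero cells
A_sparse = {}
for i, j in edges:
    A_sparse[(i, j)] = A_sparse.get((i, j), 0) + 1
```

The dense matrix holds 25 cells here; the sparse dict holds only 4 entries, and the gap widens dramatically for real graphs with millions of nodes.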
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection
• Takeaways
Key question: What does a graph “look like”?
• At first look… large, unwieldy and seemingly random.
• Spoiler: In actuality, most real-world graphs are far from random.
Trace-route paths on the internet (Lyon ’03)
A quick detour: “Random” graphs
• Erdős–Rényi random graph model: graph G(n, p)
  • n = number of nodes
  • p = probability of an edge between two nodes (independent edges)
• Expected # of edges: E[#edges] = p · n(n−1)/2
• Degree distribution is binomial: P(deg = k) = C(n−1, k) · p^k · (1−p)^(n−1−k)
Babaoglu ’18
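A sketch of sampling from G(n, p) and checking the expected edge count; the parameters are illustrative:

```python
import random

def erdos_renyi(n, p, seed=None):
    """Sample an undirected G(n, p) graph as a set of edges (i, j), i < j."""
    rng = random.Random(seed)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}

n, p = 200, 0.05
g = erdos_renyi(n, p, seed=42)

# E[#edges] = p * C(n, 2): each of the n(n-1)/2 pairs is an edge independently
expected_edges = p * n * (n - 1) / 2
```

For n = 200 and p = 0.05 the expectation is 995 edges; a sampled graph's edge count should land close to that.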
What about real graphs?
• X-axis: degree; Y-axis: frequency/probability
• Degree distributions of real graphs are not “random”
• What exactly are they, then?
[Plots: log(# posts) vs. log(# users); log(# visitors) vs. log(# sites); log(# peers) vs. log(# routers)]
Faloutsos ‘99, Viswanath ‘09, Adamic ‘02
The “scale-free” property
• Real-world graphs are often scale-free, meaning that their degree distribution obeys a power law: p(k) ∝ k^(−γ)
• Scaling the input by a multiple simply results in proportional scaling of the whole function
• Power laws are linear on log-log scales
• Typically 2 ≤ γ ≤ 3
Scale-freeness is evident in many domains
Newman ‘05
Why are many real graphs scale-free?
• Hypothesis: preferential attachment, or a “rich-get-richer” effect
• Generative process to construct a network:
  • Start with m₀ nodes, each with at least 1 edge
  • At each timestep, add a new node with m edges connecting it to m already existing nodes
  • Probability of the new node connecting to node i depends on the degree dᵢ as P(i) = dᵢ / Σⱼ dⱼ
• Many real-world variants of this effect: academic citations, recommendation, virality
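The generative process above can be sketched as follows. The repeated-endpoints list is a standard trick: a node appears in it once per adjacent edge, so uniform sampling from the list is automatically degree-proportional. All parameters are illustrative:

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph where each new node attaches to m existing nodes
    with probability proportional to their current degree."""
    rng = random.Random(seed)
    # start with a small clique of m + 1 nodes so every node has >= 1 edge
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # node i appears in `endpoints` degree(i) times
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:            # m distinct degree-biased picks
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((t, new))
            endpoints += [t, new]
    return edges

edges = preferential_attachment(500, 2, seed=0)
```

Early nodes accumulate many more edges than late arrivals (the “rich get richer”), producing the heavy-tailed degree distribution discussed above.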
Real graphs have “small-world” effects
• How “far apart” are nodes in real graphs?
• Interestingly, not very far! The typical number is 6. You may have heard of the “six degrees of separation”
• Milgram ‘69: avg. # of hops for a letter to travel from Nebraska to Boston was 6.2 (sample size 64)
• Leskovec ‘08: distance between node pairs on MSN messenger has mode 6 (sample size 180M nodes and 1.3B edges)
What causes the small-world effect?
• Hypothesis: The abundance of hubs, or high-degree nodes
• Even though most nodes aren’t connected to most other nodes, they are connected to hubs, which facilitate short paths
How do real graphs “grow” over time?
• Consider a time-evolving graph G
  • If it has n(t) nodes and e(t) edges at time t…
  • Suppose that n(t + 1) = 2n(t)
  • What is e(t + 1)?
• Not only is it > 2e(t); the growth is actually superlinear and follows a power law e(t) ∝ n(t)^α, with 1 ≤ α ≤ 2 generally
Real graphs exhibit densification
Avg. out-degree increases over time Power-law in # edges vs. # nodes (over time)
Moreover, the graph diameter shrinks
• Graph diameter = max(distance between node pairs)
• Leskovec ‘05 shows that diameter actually shrinks over time, instead of growing. In other words, nodes tend to get closer
• Hypothesis: Once again due to prevalence and growth of hubs
Much more work done on graph behaviors
• Generative graph models (Leskovec ‘05)
• Patterns in sizes of connected components (Kang ‘10)
• Node in-degree (popularity) over time (McGlohon ‘07)
• Duration of calls in phone-call networks (Vaz de Melo ‘10)
• Temporal structure evolution (Shah ‘15)
…
the list goes on
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection
• Takeaways
Key question: how can we leverage graphs for recommendation/ranking tasks?
• Measuring webpage importance
• Link prediction and recommendation
  • Local methods
  • Global methods
PageRank for large-scale search engines
• Key problem: how to prioritize/curate a large (ever-growing) hyperlinked body of pages by importance and relevance?
• Key idea: leverage the hyperlink citation graph (page-links-page) to rank page importance according to connectivity patterns
• 150 million web pages → 1.7 billion links

Backlinks and forward links:
• A and B are C’s backlinks
• C is A’s and B’s forward link
Content adapted from Li ‘09
Simplified PageRank
• u: a web page
• B_u: the set of u’s backlinks
• N_v: the number of forward links of page v
• c: the normalization factor that makes R a probability distribution

R(u) = c · Σ_{v ∈ B_u} R(v) / N_v

• Simplified PageRank is the stationary probability distribution of a random walk on the graph: a surfer keeps clicking successive pages at random.
Idea: each page equally distributes its own PageRank to its forward links, recursively.
“An important page has many important pages pointing to it”
Simplified PageRank
PageRank calculation: first, second, … iterations
[Figure: 3-page example with Yahoo, Amzn, and MS. The adjacency matrix is transposed and column-normalized, which accounts for each page distributing its rank equally among its forward links. Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each.” Starting from uniform initial PageRank scores, the matrix is applied once per iteration; the scores converge after some iterations.]
Problem with Simplified PageRank
A loop:
During each iteration, the loop accumulates rank but never distributes rank to other pages!
The problem in practice
[Figure: the same 3-page example, but with Microsoft linking only to itself. Read as “Microsoft gives all of its PageRank to Microsoft.” Iterating the transposed, column-normalized adjacency matrix from the initial PageRank scores drains all rank into the loop.]
All roads lead to Microsoft
A modified solution: (true) PageRank
• This subtle modification solves the problem of “sinks”
• PageRanks converge to the dominant eigenvector of the appropriately configured/normalized adjacency matrix, due to Markov chain theory! Cool!
• Modified PageRank is the same as the simple model, with the exception of the surfer having a random jump probability:

R′(u) = c · Σ_{v ∈ B_u} R′(v) / N_v + (1 − c) · E(u)

E(u): a distribution of ranks over web pages that the surfer can jump to when he/she “gets bored” after clicking on successive links.
A modified solution: PageRank
[Figure: the 3-page example with a 20% random jump probability; the transposed, column-normalized adjacency matrix is combined with the jump term, and iteration from the initial PageRank scores now converges to a sensible distribution.]
PageRank converges quickly and produces empirically good results
• PR (322 million links): 52 iterations
• PR (161 million links): 45 iterations
• Scaling factor is roughly linear in log n
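The computation above can be sketched as a short power iteration. The 3-page graph is hypothetical, mirroring the Yahoo/Amazon/Microsoft example; note that Microsoft, linking only to itself, would be a rank sink without the random jump:

```python
def pagerank(out_links, d=0.8, iters=50):
    """Power iteration for PageRank; 1 - d is the random-jump probability."""
    nodes = list(out_links)
    n = len(nodes)
    ranks = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # every node starts each round with its share of the random jump
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            share = d * ranks[u] / len(out_links[u])
            for v in out_links[u]:     # u splits its rank among forward links
                nxt[v] += share
        ranks = nxt
    return ranks

# Hypothetical 3-page web mirroring the slides' example
links = {
    "yahoo": ["yahoo", "amazon"],
    "amazon": ["yahoo", "microsoft"],
    "microsoft": ["microsoft"],   # a rank "sink" without random jumps
}
ranks = pagerank(links, d=0.8)
```

Even with a 20% jump probability, the self-loop lets Microsoft keep the largest share of the rank; without the jump it would absorb all of it.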
Key question: how can we leverage graphs for recommendation/ranking tasks?
• Measuring webpage importance
• Link prediction and recommendation
  • Local methods
  • Global methods
Exploiting local structure for predicting links
• Key problem: given what we know about interactions in a graph G, which nodes should we recommend to a user u to promote engagement?
• Key idea: measure affiliation between u and other nodes using u’s graph neighborhood!
Rich literature in previous measures
Liben-Nowell ‘04
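A few classic local measures from that literature (common neighbors, Jaccard, Adamic–Adar), sketched over a hypothetical friendship graph:

```python
import math

adj = {  # hypothetical undirected friendship graph
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"c"},
}

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def adamic_adar(adj, u, v):
    # down-weights common neighbors that are high-degree hubs
    # (assumes common neighbors have degree > 1, so log is nonzero)
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

score = common_neighbors(adj, "b", "d")   # candidate link b-d
```

Each score ranks candidate (u, v) pairs; the highest-scoring non-edges become the recommendations.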
Key question: how can we leverage graphs for recommendation/ranking tasks?
• Measuring webpage importance
• Link prediction and recommendation
  • Local methods
  • Global methods
Exploiting global structure for predicting links
• Key problem: given what we know about interactions in a graph G, which nodes should we recommend to a user u to promote engagement?
• Key idea: measure affiliation between u and other nodes via a latent factor model/embedding that compactly encodes “interests”
Singular value decomposition
• Used for low-rank matrix approximation
• Rank-k SVD reduces matrix A into k latent factors/dense blocks/communities
• U and V capture “involvement” of nodes
• Σ denotes factor “strength”

A ≈ U Σ Vᵀ, where A is n×m, U is n×k, Σ is a k×k diagonal matrix of singular values σ₁ ≥ σ₂ ≥ … ≥ σ_k, and Vᵀ is k×m
Singular value decomposition
[Figure: for an n users × m videos matrix, A ≈ σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + σ₃u₃v₃ᵀ + …, with latent factors interpretable as, e.g., “music lovers”/“artist spotlights”, “adrenaline junkies”/“action movies”, “dabbling cooks”/“baking shows”]
Recommendation from latent factors
• SVD effectively constructs vector embeddings in a k-dimensional space which “summarize” user/item affinities towards latent factors
• Compute vector similarity between user-user or user-item vectors (depending on application)
  • Cosine similarity/dot product are common choices
Koren ‘09
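A sketch of this pipeline using NumPy's SVD on a hypothetical users × videos matrix with two obvious taste clusters; the truncated factors yield user embeddings whose cosine similarity separates the clusters:

```python
import numpy as np

# Hypothetical 4 users x 5 videos matrix with two taste clusters:
# users 0-1 watch videos 0-1, users 2-3 watch videos 2-4
A = np.array([
    [5, 4, 0, 0, 0],
    [4, 5, 0, 0, 0],
    [0, 0, 5, 4, 0],
    [0, 0, 4, 5, 1],
], dtype=float)

# Rank-k truncated SVD: A ~= U_k diag(s_k) V_k^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# k-dimensional user embeddings, scaled by factor strength
user_vecs = U[:, :k] * s[:k]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_same = cos(user_vecs[0], user_vecs[1])  # same taste cluster
sim_diff = cos(user_vecs[0], user_vecs[2])  # different clusters
```

Users 0 and 1 land essentially on the same latent direction (cosine near 1), while users 0 and 2 are nearly orthogonal, so recommendations would flow within, not across, the clusters.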
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and prediction
  • Clustering
  • Anomaly detection
• Takeaways
Graph clustering for knowledge extraction
• Key problem: what can we learn about group dynamics from graph interactions? Are there natural “clusters” of behaviors?
• Key idea: tightly-knit graph interactions form graph clusters, which can indicate community behaviors. These are useful for
  • Behavioral understanding
  • Computational load balancing
  • Graph compression
  • Visualization
[Figure: advertiser–query bipartite graph with visible clusters]
Finding graph clusters
• Given a graph G, we want to find clusters
• Need to:
  • Formalize the notion of a cluster
  • Design an algorithm that will find sets of nodes that are good clusters
Content adapted from Leskovec ‘10
Clustering objective functions
• Essentially all objectives use the intuition that a good cluster S has
  • Many edges internally
  • Few edges pointing outside
• Simplest objective function: conductance
• Small conductance corresponds to good clusters
• There are many other formalizations of roughly this intuition
• Graph objectives are generally hard to optimize directly. Greedy/approximate algorithms are common
Clustering objective functions
• Single-criterion (considers either internal or external)
  • Modularity: m − E(m)
  • Modularity Ratio: m / E(m)
  • Volume: Σ_u d(u) = 2m + c
  • Edges cut: c
• Multi-criterion (considers both)
  • Conductance: c / (2m + c)
  • Expansion: c / n
  • Density: 1 − m/n²
  • Cut Ratio: c / (n(N − n))
  • Normalized Cut: c/(2m + c) + c/(2(M − m) + c)
  • Max-ODF: max fraction of edges of a node pointing outside S
  • Average-ODF: avg. fraction of edges of a node pointing outside S
  • Flake-ODF: fraction of nodes with more than ½ of their edges inside S

n: # nodes in S; m: # edges in S; c: # edges pointing outside S; N: # nodes in G; M: # edges in G
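Conductance as defined above, c / (2m + c), is easy to compute directly; the two-triangle edge list is hypothetical:

```python
# Hypothetical undirected edge list with two obvious clusters
edges = [("a", "b"), ("b", "c"), ("a", "c"),   # cluster {a, b, c}
         ("d", "e"), ("e", "f"), ("d", "f"),   # cluster {d, e, f}
         ("c", "d")]                           # single bridging edge

def conductance(edges, S):
    """phi(S) = c / (2m + c): cut edges over total degree inside S."""
    S = set(S)
    m = sum(1 for u, v in edges if u in S and v in S)     # internal edges
    c = sum(1 for u, v in edges if (u in S) != (v in S))  # cut edges
    return c / (2 * m + c)

good = conductance(edges, {"a", "b", "c"})   # a natural cluster
bad = conductance(edges, {"a", "b", "d"})    # a set straddling both clusters
```

The natural cluster scores 1/7 while the straddling set scores 5/7, matching the intuition that small conductance means a good cluster.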
Multiple types of clustering algorithms
• Global spectral
  • Compute graph Laplacian matrix L = D − A
  • Find the eigenvector for the 2nd smallest eigenvalue of L
  • Split by sign to get a partitioning of nodes (related to graph “cut”)
  • Recurse to get more clusters
• Local spectral
  • Pick a random seed node
  • Build local clusters around seed nodes based on random walk/PageRank
  • Prune cluster from graph and repeat
Flow-based algorithms
• METIS: multi-level graph partitioning
  • If it’s too expensive to partition a big graph… coarsen it into a smaller graph
  • If it’s still too big, keep coarsening
  • Compute a partition and uncoarsen the graph
  • Improve heuristically
    • Swap vertices
    • Local search
Measuring clustering algorithm performance
• How to quantify performance:
  • What is the score of clusters across a range of sizes?
• Network Community Profile (NCP) (Leskovec ‘08)
  • The score of the best cluster of size k
NCPs for a real graph (LiveJournal)
• 500-node communities from Local Spectral
• 500-node communities from METIS
• Interestingly, Local Spectral clusters are more compact and tighter, despite having higher (worse) conductance than METIS!
NCPs for various objectives (Local Spectral)
• Multiple objectives can be pretty similar
  • Conductance, Expansion, Normalized Cut, Cut-ratio, Avg-ODF
• Max-ODF prefers small clusters; Flake-ODF prefers large clusters
• Internal density not very good (large clusters are very sparse)
You should know…
• Many types of clustering objectives and algorithms -- can use NCP to analyze them
• Not many “good” large clusters – real graphs are complicated!
• Different objectives bias for various aspects (cluster size, internal and external connectivity)
• Overemphasis on clustering objectives can actually lead to “bad”-looking clusters according to human intuition
Roadmap
• Preliminaries
• Notable graph properties
• Cool applications
  • Recommendation and prediction
  • Clustering
  • Anomaly detection
• Takeaways
Graph-based anomaly detection
• Key problem: what kinds of anomalous behaviors exist in real graphs, and can we find such anomalies automatically?
• Key idea: we can identify various types of “anomalous” behaviors by building null/normal models and penalizing excessive deviation
  • Node-based anomalies
  • Group anomalies (too large, too dense to be a real community)
Anomalies in graphs: important applications
• Email networks• Spammers
• Computer networks• Hackers/port scanning
• Phone-call networks• Telemarketers
• Social networks• Fake engagement
Major goal
• How to go from a graph to a quantitative model/pattern?
Local, egonet-based anomaly detection
• What does a typical node look like?
  • Can’t say much about just a node in isolation
  • Let’s consider the egonets!
• For each node,
  • extract egonet (= 1-step-away neighbors)
  • extract features (# edges, total weight, etc.)
  • extract patterns (norms)
  • compare with the rest of the population (detect anomalies)
Content adapted from Akoglu ‘10
What is anomalous?
• Not obvious!
What is anomalous?
• Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc.
• Near-clique: tightly connected people, terrorist groups?, discussion group, etc.
• Heavy vicinity: too much money wrt number of accounts, high donation wrt number of donors, etc.
• Dominant heavy link: single-minded, tight company
Basic features to study
• N_i: number of neighbors (degree) of ego i
• E_i: number of edges in egonet i
• W_i: total weight of egonet i
• λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
Obs. 1: Egonet Density Power Law
E_i ∝ N_i^α, with 1 ≤ α ≤ 2
Differentiates “dense” from “sparse” neighborhoods
Obs. 2: Egonet Weight Power Law
W_i ∝ E_i^β, with β ≥ 1
Differentiates “heavy” from “light” neighborhoods
Obs. 3: Egonet λw-Weight Power Law
λ_{w,i} ∝ W_i^γ, with 0.5 ≤ γ ≤ 1
Differentiates “uniform” distribution from “dominant” heavy edges
Scoring node anomalies
Anomaly ≈ violates our “laws” + far away from most points
• score_dist = distance to the fitting line
• score_outl = outlierness score
• score = func(score_dist, score_outl)
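The score_dist idea can be sketched as follows: fit the power law in log-log space by least squares, then score each egonet by its distance to the fitting line. The (N_i, E_i) pairs are made up, with one planted near-clique whose edge count is far above the trend:

```python
import math

# Hypothetical (degree N_i, egonet edges E_i) pairs; the last node is a
# near-clique whose E_i sits far above the power-law trend
pairs = [(2, 2), (4, 5), (8, 11), (16, 24), (32, 50), (10, 45)]

# Least-squares fit of log E = alpha * log N + c
xs = [math.log(n) for n, e in pairs]
ys = [math.log(e) for n, e in pairs]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
c = my - alpha * mx

# score_dist: distance of each point from the fitting line in log space
scores = [abs(y - (alpha * x + c)) for x, y in zip(xs, ys)]
most_anomalous = scores.index(max(scores))
```

The fitted exponent lands in the 1 ≤ α ≤ 2 range from Obs. 1, and the planted near-clique gets by far the largest distance score.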
Triaging anomalies
• Can interpret the type of anomaly
• Can sort nodes by their outlierness scores
Interesting results: Blog post-to-post graph
• Part of a group of posts that all link to each other
• A post linking to many other posts indiscriminately
Interesting results: Committee-to-candidate donations graph
• $87M - DNC
• $25M - RNC
Interesting results: Author-to-conference publishing graph
• Has published 40 papers, but all to the same conference (and nowhere else)
• Have published hundreds of papers, to almost as many conferences!
Group anomalies on graphs
[Figure: example egonets (Bob’s, Carol’s, Alice’s) with Alice highlighted]
Content adapted from Shin ‘16
Fraud forms dense blocks
[Figure: accounts × restaurants adjacency matrix; fraudulent ratings form dense blocks]
Tensor modeling for attributed graphs
• Natural dense blocks are sparse on the time axis (formed gradually)
• Suspicious dense blocks are also dense on the time axis (due to synchronous behavior)
• Suspicious dense blocks are therefore denser than natural dense blocks in the tensor model
[Figure: accounts × restaurants × timestamps adjacency tensor; a cell indicates that account i rates restaurant j at time t]
Applications
• Dense blocks signal anomalies/fraud in many multi-attribute graphs
  • TCP dumps: src IP × dst IP × timestamp
  • Wikipedia revision history: user × page × timestamp
  • Time-evolving social network: src user × dst user × timestamp
How to find dense blocks in such tensors?
• Exact solutions are combinatorial and intractable
• Greedy solutions and heuristics are practical (i.e. greedily optimize a “suspiciousness” metric)
• What metric?
Assume a block (subtensor) B, of dimensions n₁ × n₂ × n₃, in a 3-way tensor R:
• Size(B) = n₁ + n₂ + n₃
• Vol(B) = n₁ × n₂ × n₃
• Mass(B) = sum of entries in B

Some notable choices:
• Traditional density: ρ_t(B, R) = Mass(B) / Vol(B) (maximized by a single entry with max. value)
• Arithmetic avg. degree: ρ_a(B, R) = Mass(B) / Size(B)
• Geometric avg. degree: ρ_g(B, R) = Mass(B) / ∛Vol(B)
Detecting a single dense block
• Greedy search method: start from the entire tensor
• Repeatedly remove a slice so as to maximize the density ρ of what remains
• Output: return the densest block seen so far
[Worked example: starting from a small matrix with ρ = 2.9, successive slice removals raise the density to ρ = 3, then 3.3, then 3.6; the block with ρ = 3.6 is returned]

Handling multiple blocks
• Remove found blocks before finding others: find & remove, find & remove, …, then restore
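A toy sketch of the greedy single-block search on a 2-way “tensor” (a matrix), using the arithmetic-average-degree density Mass/Size and removing the minimum-mass slice at each step; the matrix is hypothetical:

```python
def densest_block(rows):
    """Greedy dense-block search on a nonnegative matrix: peel off the
    minimum-mass row or column, tracking density = Mass / Size."""
    cells = {(i, j): v for i, row in enumerate(rows)
             for j, v in enumerate(row) if v > 0}
    rows_left = {i for i, _ in cells}
    cols_left = {j for _, j in cells}
    best, best_density = None, -1.0
    while rows_left and cols_left:
        mass = sum(cells[(i, j)] for i in rows_left for j in cols_left
                   if (i, j) in cells)
        density = mass / (len(rows_left) + len(cols_left))
        if density > best_density:
            best_density = density
            best = (set(rows_left), set(cols_left))

        def slice_mass(axis, k):
            if axis == 0:   # mass of row k restricted to surviving columns
                return sum(v for (i, j), v in cells.items()
                           if i == k and j in cols_left)
            return sum(v for (i, j), v in cells.items()
                       if j == k and i in rows_left)

        candidates = ([(slice_mass(0, i), 0, i) for i in rows_left]
                      + [(slice_mass(1, j), 1, j) for j in cols_left])
        _, axis, k = min(candidates)      # peel the minimum-mass slice
        (rows_left if axis == 0 else cols_left).discard(k)
    return best, best_density

M = [[5, 3, 0],
     [4, 6, 1],
     [2, 0, 0],
     [1, 0, 1]]
block, density = densest_block(M)
```

On this matrix the search peels away the light rows and the light column, and the densest block seen is the top-left 2×2 block with density 18/4 = 4.5. The real M-Zoom algorithm generalizes the same loop to N-way tensors.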
Algorithm details
• Theorem 1 [Remove Minimum Mass First]: among slices in the same mode, removing the slice with minimum mass is always best
  • e.g., for slice masses 12 > 9 > 2, remove the mass-2 slice first
• Theorem 2 [Approximation Guarantee]: the block B returned by the greedy search satisfies
  ρ(B, R) ≥ (1/N) · ρ(B*, R)
  where ρ is the density metric, R the input tensor, N its order, and B* the densest block
Practical discoveries
TCP connections forming the densest blocks are network attacks
[Figure: first three blocks found in the src IP × dst IP × timestamp tensor]
Practical discoveries
First three blocks found by M-Zoom in the user × page × timestamp tensor
• Page edit wars: 11 users revised 10 pages 2,305 times within 16 hours
Takeaways
• Graphs provide a means of describing interactions between objects
• Almost all real graphs are “non-random” and obey various patterns
• Considerable literature in graph mining focuses on learning to leverage large-scale interaction patterns to
  • Recommend users new content based on what they might like
  • Identify interesting group behaviors and community norms
  • Discover abnormalities that correspond to fraud or “audit-worthy” events