
1

Inferring Networks of Diffusion and Influence

Manuel Gomez Rodriguez 1,2   Jure Leskovec 1   Andreas Krause 3

1 Stanford University   2 MPI for Biological Cybernetics   3 California Institute of Technology

2

Hidden and implicit networks

Many social or information networks are implicit or hard to observe:

Hidden/hard-to-reach populations: e.g., the network of needle sharing among drug injection users

Implicit connections: e.g., the network of information propagation in online news media

But we can observe the results of the processes taking place on such (invisible) networks:

Virus propagation: drug users get sick, and we observe when they see a doctor

Information networks: we observe when media sites mention information

Question: Can we infer the hidden networks?

3

Inferring the Network

There is a directed social network over which diffusions take place:

But we do not observe the edges of the network

We only see the time when a node gets infected:
Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)
Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)

Task: Want to infer the underlying network

[Figure: the hidden directed network over nodes a–e, and the two observed cascades c1 and c2 spreading over it]

4

Examples and Applications

Virus propagation:
The process: viruses propagate through the network (it's hidden)
We observe: when people get sick
But NOT who infected them

Word of mouth & viral marketing:
The process: recommendations and influence propagate (it's hidden)
We observe: when people buy products
But NOT who influenced them

Can we infer the underlying network?

5

Our Problem Formulation

Plan for the talk:

1. Define a continuous time model of diffusion

2. Define the likelihood of the observed cascades given a network

3. Show how to efficiently compute the likelihood of cascades

4. Show how to efficiently find a graph G that maximizes the likelihood

Note: There is a super-exponential number of graphs, O(N^(N·N)). Our method finds a near-optimal graph in O(N^2)!

6


Cascade Diffusion Model

Continuous time cascade diffusion model: cascade c reaches node u at time tu and spreads to u's neighbors:

With probability β the cascade propagates along edge (u, v), and we determine the infection time of node v as

tv = tu + Δ,  e.g. Δ ~ Exponential or Power-law

We assume each node v has only one parent!

[Figure: a cascade spreading from node a with delays Δ1, Δ2 to b and c, and onward with delays Δ3, Δ4 to e and f]

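The generative model above can be sketched in a few lines of Python. This is a minimal illustrative simulation, not the authors' code: the adjacency-list graph, the unit-rate exponential delay Δ, and the `max_time` horizon are all assumptions made for the example.

```python
import random

def simulate_cascade(graph, source, beta=0.5, max_time=100.0):
    """Simulate one cascade of the continuous-time diffusion model.

    graph: dict node -> list of out-neighbors (directed edges).
    With probability beta the cascade crosses edge (u, v); on success,
    t_v = t_u + Delta with Delta ~ Exponential(1) (assumed rate).
    Each node keeps only its earliest infection, i.e. a single parent.
    """
    infection_time = {source: 0.0}
    frontier = [source]
    while frontier:
        u = frontier.pop()
        for v in graph.get(u, []):
            if random.random() < beta:
                t_v = infection_time[u] + random.expovariate(1.0)
                # keep only the earliest infection time for v
                if t_v < max_time and t_v < infection_time.get(v, float("inf")):
                    infection_time[v] = t_v
                    frontier.append(v)
    # the cascade is the list of (node, infection time), ordered in time
    return sorted(infection_time.items(), key=lambda p: p[1])
```

With beta = 1.0 every edge fires, so the cascade covers all nodes reachable from the source.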
7

Likelihood of a Single Cascade

[Figure: cascade c spreading over nodes a, b, c, e, with the external influence node m attached to every node with weight ε]

Probability that cascade c propagates from node u to node v:

Pc(u, v) ∝ P(tv − tu),  with tv > tu

Probability that cascade c propagates in a tree pattern T:

P(c | T) = ∏(u,v) ∈ T Pc(u, v)

Since not all nodes get infected by the diffusion process, we introduce an external influence node m with Pc(m, v) = ε.

Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)

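Given a fixed tree pattern and the infection times, the cascade log-likelihood is just a sum of per-edge log-probabilities. The sketch below assumes an exponential transmission model P(Δ) = rate·e^(−rate·Δ) and a hypothetical `parents` map; both are illustrative choices, not the paper's exact parameterization.

```python
import math

def tree_log_likelihood(parents, times, eps=1e-6, rate=1.0):
    """Log-likelihood of one cascade propagating along a fixed tree T.

    parents: dict child -> parent node; "m" marks the external
             influence node, which infects any node with prob. eps.
    times:   dict node -> infection time.
    """
    ll = 0.0
    for v, u in parents.items():
        if u == "m":
            # external influence edge (m, v)
            ll += math.log(eps)
        else:
            # real edge (u, v): exponential delay, requires t_v > t_u
            dt = times[v] - times[u]
            assert dt > 0, "a parent must be infected earlier"
            ll += math.log(rate) - rate * dt
    return ll
```

For the tree pattern on the slide, one plausible parent map is m → a, a → b, a → c, b → e.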
8

Finding the Diffusion Network

There are many possible propagation trees that are consistent with the observed data:

c: (a, 1), (c, 2), (b, 3), (e, 4)

Need to consider all possible propagation trees T supported by the graph G:

[Figure: three different propagation trees over nodes a–e, each consistent with the observed infection times]

Likelihood of a set of cascades C:

P(C | G) = ∏c ∈ C P(c | G),  where P(c | G) sums over all propagation trees T supported by G

Want to find the graph:

Ĝ = argmaxG P(C | G)

Good news: Computing P(c | G) is tractable. Even though there are O(n^n) possible propagation trees, the Matrix Tree Theorem can compute this sum in O(n^3)!

Bad news: We actually want to search over graphs, and there is a super-exponential number of graphs!

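The Matrix Tree Theorem step can be made concrete. The sketch below implements the weighted directed version (Tutte's theorem): the total weight of all spanning arborescences rooted at r equals the determinant of the weighted Laplacian with r's row and column removed, an O(n^3) computation. The edge weights and node names here are invented for the example.

```python
def det(M):
    """Determinant via Gaussian elimination with partial pivoting, O(n^3)."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        if abs(M[p][i]) < 1e-12:
            return 0.0
        if p != i:
            M[i], M[p] = M[p], M[i]
            d = -d
        d *= M[i][i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n):
                M[r][c] -= f * M[i][c]
    return d

def arborescence_weight(weights, root):
    """Total weight of spanning arborescences rooted at `root`
    (directed Matrix Tree / Tutte theorem).

    weights: dict (u, v) -> weight of directed edge u -> v.
    Builds L with L[v][v] = weighted in-degree of v and L[u][v] = -w(u, v),
    then deletes the root's row and column and takes the determinant.
    """
    nodes = sorted({x for e in weights for x in e})
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    L = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        L[idx[u]][idx[v]] -= w
        L[idx[v]][idx[v]] += w
    keep = [i for i in range(n) if i != idx[root]]
    return det([[L[i][j] for j in keep] for i in keep])
```

On a 3-node example the determinant can be checked against brute-force enumeration of the arborescences.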
9

An Alternative Formulation

We consider only the most likely tree. The maximum log-likelihood for a cascade c under a graph G:

Fc(G) = maxT log P(c | T)

Log-likelihood of G given a set of cascades C:

FC(G) = ∑c ∈ C Fc(G)

The problem is still intractable (NP-hard), but we present an algorithm that finds near-optimal networks in O(N^2)

10

Max Directed Spanning Tree

Given a cascade c, what is the most likely propagation tree?

T̂ = argmaxT ∑(u,v) ∈ T log Pc(u, v)

This is a maximum directed spanning tree (MDST). The sub-graph of G induced by the nodes in the cascade c is a DAG, because edges point forward in time.

So greedy parent selection for each node gives the globally optimal tree: each node just picks an in-edge of maximum weight.

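Because earlier-infected nodes are the only possible parents, the MDST reduces to an independent best-in-edge choice per node. A sketch, where `log_pc` stands in for the (assumed) per-edge log-probability and "m" is the external influence node:

```python
def most_likely_tree(times, log_pc, log_eps):
    """Most likely propagation tree for one cascade (a max directed
    spanning tree). The time-ordered graph is a DAG, so picking the
    best in-edge per node is globally optimal.

    times:   dict node -> infection time.
    log_pc:  function (u, v) -> log-probability that u infected v.
    log_eps: log-probability of the external influence node "m".
    """
    parents = {}
    for v, tv in times.items():
        best_u, best_w = "m", log_eps      # default: external influence
        for u, tu in times.items():
            if tu < tv:                    # only earlier nodes can be parents
                w = log_pc(u, v)
                if w > best_w:
                    best_u, best_w = u, w
        parents[v] = best_u
    return parents
```

With an exponential model (log Pc(u, v) ∝ −(tv − tu)), each node simply picks its most recently infected predecessor.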
11

Objective function is Submodular

Theorem: The log-likelihood Fc(G) of cascade c is monotonic and submodular in the edges of the graph G: for A ⊆ B ⊆ V×V and an edge e,

Fc(A ∪ {e}) − Fc(A) ≥ Fc(B ∪ {e}) − Fc(B)

i.e., the gain of adding an edge to a "small" graph A is at least the gain of adding it to a "large" graph B.

Proof: Consider a single cascade c and an edge e into node s with weight x. Let w be the max-weight in-edge of s in A, and w' the max-weight in-edge of s in B. Since A ⊆ B, we know w ≤ w'. Now:

Fc(A ∪ {e}) − Fc(A) = max(w, x) − w ≥ max(w', x) − w' = Fc(B ∪ {e}) − Fc(B)

Then the total log-likelihood FC(G) = ∑c Fc(G) is monotonic and submodular too.

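The diminishing-returns step of the proof, max(w, x) − w ≥ max(w', x) − w' whenever w ≤ w', is easy to verify numerically. In this tiny sketch, `marginal_gain` is the change in a node's contribution to Fc when a new in-edge of weight x arrives:

```python
def marginal_gain(best_in_edge, x):
    """Change in a node's term of Fc when an in-edge of weight x is
    added and the node's current best in-edge has weight best_in_edge.
    Equals max(0, x - best_in_edge), which only shrinks as the
    current best grows: diminishing returns."""
    return max(best_in_edge, x) - best_in_edge

# check the inequality over a small grid of w <= w' and arbitrary x
for x in [-3.0, -1.0, 0.0, 2.0]:
    for w in [-2.0, 0.0, 1.0]:
        for w_prime in [w, w + 1.0, w + 5.0]:
            assert marginal_gain(w, x) >= marginal_gain(w_prime, x)
```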
12

Finding the Diffusion Graph

Use greedy hill-climbing to maximize FC(G): for i = 1 … k, at every step pick the edge that maximizes the marginal improvement.

[Figure: greedy edge selection over nodes a–e, showing the marginal gains of the candidate edges and the localized updates of the affected gains after each pick]

Benefits:

1. Approximation guarantee (≈ 0.63 of OPT)

2. Tight on-line bounds on the solution quality

3. Speed-ups: lazy evaluation (by submodularity) and localized updates (by the structure of the problem)

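The lazy-evaluation speed-up exploits submodularity: a cached marginal gain can only shrink as edges are added, so a stale value is an upper bound and most re-evaluations can be skipped. A hypothetical sketch of the selection loop (the `marginal_gain` oracle is assumed given; this is not the full NetInf implementation):

```python
import heapq

def greedy_select(candidate_edges, marginal_gain, k):
    """Greedy hill-climbing with lazy evaluation.

    candidate_edges: iterable of candidate edges.
    marginal_gain:   function (edge, chosen_set) -> improvement in the
                     objective from adding edge to chosen_set.
    Keeps a max-heap of possibly stale gains; an entry is accepted only
    if it was re-evaluated in the current round and still tops the heap.
    """
    chosen = set()
    # heap of (-gain, round_stamp, edge); stamp 0 means "initial estimate"
    heap = [(-marginal_gain(e, chosen), 0, e) for e in candidate_edges]
    heapq.heapify(heap)
    for step in range(1, k + 1):
        while heap:
            neg_gain, stamp, e = heapq.heappop(heap)
            if stamp == step:
                # gain is fresh for this round and maximal: take the edge
                chosen.add(e)
                break
            # stale entry: recompute lazily and push back with fresh stamp
            heapq.heappush(heap, (-marginal_gain(e, chosen), step, e))
        else:
            break   # no candidates left
    return chosen
```

With a modular (additive) gain function the loop reduces to picking the k highest-weight edges, which makes it easy to sanity-check.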
13

Experimental Setup

We validate our method on:

How many edges of G can we find? (Precision-Recall break-even point)

How many cascades do we need?

How fast is the algorithm?

How well do we optimize the likelihood FC(G)?

Synthetic data: generate a graph G on k edges, generate cascades, record node infection times, reconstruct G

Real data: MemeTracker, 172M news articles (Aug '08 – Sept '09), 343M textual phrases (quotes)

14

Small Synthetic Example

Small synthetic network; the baseline picks the k strongest edges.

[Figure: the true network, the baseline network, and the network inferred by our method]

15

Synthetic Networks

Performance does not depend on the network structure:

Synthetic networks: Forest Fire, Kronecker, etc.

Transmission time distribution: Exponential, Power-law

Break-even point of > 90%

[Figures: 1024-node hierarchical Kronecker network with exponential transmission model; 1000-node Forest Fire network (α = 1.1) with power-law transmission model]

16

How good is our graph?

We achieve ≈ 90% of the best possible network!

17

How many cascades do we need?

With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!

18

Running Time

Lazy evaluation and localized updates give a speed-up of two orders of magnitude!

We can infer a network of 10k nodes in several hours

19

Real Data: Information diffusion

MemeTracker dataset: 172M news articles, Aug '08 – Sept '09, 343M textual phrases (quotes)

http://memetracker.org

We infer the network of information diffusion: Who tends to copy (repeat after) whom

20

Real Network

We use the hyperlinks between sites to generate the edges of a ground-truth G.

From the MemeTracker dataset, we have the timestamps of:

1. cascades of hyperlinks: the time when a site creates a link

2. cascades of (MemeTracker) textual phrases: the time when a site mentions the information

Can we infer the hyperlink network from the times when sites created links? From the times when sites mentioned information?

[Figure: the hyperlink network over sites a, c, e, f, and the cascades observed on it]

21

Real Network: Performance

[Figures: the 500-node hyperlink network inferred using hyperlink cascades, and using MemeTracker cascades]

Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!

22

Diffusion Network

5,000 news sites: blogs and mainstream media

[Figure: the inferred diffusion network]

23

Diffusion Network (small part)

[Figure: a small part of the inferred diffusion network, distinguishing blogs from mainstream media]

24

Networks and Processes

We infer hidden networks based on diffusion data (timestamps):

Problem formulation in a maximum likelihood framework

The problem is NP-hard to solve exactly

We develop an approximation algorithm that:

is efficient: it runs in O(N^2)

is invariant to the structure of the underlying network

gives a near-optimal network with a tight quality bound

Future work: learn both the network and the diffusion model; applications to other domains: biology, neuroscience, etc.

25

Thanks!

For more (Code & Data): http://snap.stanford.edu/netinf