StaticGreedy: Solving the Scalability-Accuracy Dilemma in Influence Maximization

Suqi Cheng
Research Center of Web Data Sciences & Engineering
Institute of Computing Technology, Chinese Academy of Sciences
http://www.nascgroup.org/~chengsuqi

Authors: Suqi Cheng, Huawei Shen, Junming Huang, Guoqing Zhang, Xueqi Cheng

Outline

• Background
• Preliminaries
• Motivation
• StaticGreedy algorithm
• Experiments

Information Cascade

• An action or idea is adopted by users one after another due to social influence
– it cascades through social relationships
• Main applications
– Word-of-mouth marketing
– Outbreak detection
– Popularity prediction

[Figure: social network]

Word-of-Mouth Marketing

• To promote a product by seeding a few users; users adopting the product will recommend it

• Advantages: efficient; cost-effective

[Figure: the company gives a free product/discount to seed users, who influence follow-up activated users]

How to select the optimal seed users?

Influence Maximization for Viral Marketing

• Objective function
– Influence spread I(S): expected number of activated (influenced/adopted) nodes
– Maximize I(S)
• Input:
– A social influence graph G=(V, E)
– An information cascade model
– An integer k, |S| ≤ k
• Output: A seed set S

Information Cascade Model

• Independent cascade (IC) model
– each edge (u, v) has a propagation probability p(u, v)
– each newly activated node u independently activates its out-neighbor v with probability p(u, v)
– a discrete-time model
• Influence spread estimation on the IC model
– Monte Carlo simulation
– Heuristic methods

[Figure: social influence graph with a propagation probability (between 0.1 and 0.5) on each edge] [Leskovec, 2008]
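The IC model and its Monte Carlo estimation can be illustrated with a minimal Python sketch. The graph representation (a dict from node to a list of `(out_neighbor, probability)` pairs) and the function names `simulate_ic` and `estimate_spread` are assumptions made for this example, not part of the original slides.

```python
import random

def simulate_ic(graph, seeds, rng):
    """Run one independent cascade and return the set of activated nodes.

    graph: dict mapping node -> list of (out_neighbor, probability) pairs
    (a hypothetical representation chosen for this sketch).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:  # one loop iteration per discrete time step
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # A newly activated u gets exactly one chance to activate v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def estimate_spread(graph, seeds, num_runs, seed=0):
    """Monte Carlo estimate of I(S): mean cascade size over num_runs runs."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(num_runs))
    return total / num_runs
```

With probabilities of 1.0 on every edge the estimate equals the number of nodes reachable from the seeds; for smaller probabilities the estimate varies between runs unless the random seed is fixed.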

Difficulties in Influence Maximization

Difficulty 1: The influence maximization problem is NP-hard. [Kempe, KDD’03]

Existing solutions

• Greedy approximation algorithm [Kempe, KDD’03]: accurate but inefficient
– (1-1/e-ε)-approximation
– iteratively selects the node with the largest marginal influence spread
– guaranteed by the submodularity and monotonicity of the influence spread function
• Heuristics (Degree, PageRank, Betweenness): efficient but inaccurate

Difficulties in Influence Maximization

Difficulty 2: Computing the influence spread exactly is #P-hard. [Chen, KDD’10]

Existing solutions

• Heuristic methods: efficient but inaccurate
– DegreeDiscount [Chen, KDD’09]
– CGA [Wang, KDD’10]
– PMIA [Chen, KDD’10]
– IRIE [Jung, ICDM’12]
• Monte Carlo simulation: accurate but time-consuming
– CELF optimization [Leskovec, KDD’07]
– NewGreedy [Chen, KDD’09]
– CELF++ optimization [Goyal, WWW’11]

A scalability-accuracy dilemma!

Our work

• Objective: to propose an influence maximization algorithm that solves the scalability-accuracy dilemma

Algorithm                               Accuracy      Scalability
Approximate algorithms
  Greedy [Kempe, KDD’03]                guaranteed    low
  GreedyCELF [Leskovec, KDD’07]         guaranteed    low
  GreedyCELF++ [Goyal, WWW’11]          guaranteed    low
  NewGreedy/MixedGreedy [Chen, KDD’09]  guaranteed    low
  StaticGreedy [Cheng, CIKM’13]         guaranteed    high
Heuristics
  Degree                                unguaranteed  high
  PageRank [Page, 1999]                 unguaranteed  high
  DegreeDiscount [Chen, KDD’09]         unguaranteed  high
  PMIA [Chen, KDD’10]                   unguaranteed  high
  IRIE [Jung, ICDM’12]                  unguaranteed  high
  SP1M [Kimura, PKDD’06]                unguaranteed  relatively low

Preliminaries-1

• Social influence graph: G=(V, E), n=|V|, m=|E|

• Influence spread: I(S)

• Marginal influence spread: M(v|S) = I(S ∪ {v}) - I(S)
• Greedy approximation algorithm
– iteratively selects the node with the largest marginal influence spread
– provides a (1-1/e-ε)-approximation
• Properties of I(S) under the independent cascade model
– submodularity: I(S ∪ {v}) - I(S) ≥ I(T ∪ {v}) - I(T) for all v ∈ V and S ⊆ T ⊆ V
– monotonicity: I(S ∪ {v}) ≥ I(S)

[Diagram: the properties of I(S) guarantee the approximation ratio of the greedy algorithm; computing I(S) requires influence spread estimation]
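The greedy selection by marginal influence spread can be sketched as follows. This is a minimal illustration that assumes a caller-supplied `spread` function returning an estimate of I(S); the name `greedy_select` and the function signature are hypothetical.

```python
def greedy_select(nodes, spread, k):
    """Pick up to k seeds, each time adding the node with the largest marginal gain.

    spread: function taking a list of seeds and returning an estimate of I(S)
    (assumed to be supplied by the caller).
    """
    seeds = []
    for _ in range(k):
        base = spread(seeds)
        best, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = spread(seeds + [v]) - base  # M(v|S) = I(S ∪ {v}) - I(S)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # fewer than k candidate nodes
            break
        seeds.append(best)
    return seeds
```

Each iteration calls `spread` once per candidate node, which is why the cost of a single influence spread estimate dominates the running time of the greedy algorithm.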

Preliminaries-2

• Monte Carlo simulation for influence spread estimation
– approximates the true value of influence spread by realizations

Method: simulation
– An instance: modeling the information cascade process
– Advantage: relatively low time complexity
– Disadvantage: estimates one seed set at a time

Method: snapshot [Chen, KDD’09]
– An instance: removing each edge (u, v) from G with probability 1-p(u, v)
– Advantage: can estimate any seed set simultaneously
– Disadvantage: relatively high time complexity

The two methods are equivalent.
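The snapshot method can be sketched in a few lines of Python. The edge-list format `(u, v, p)` and the helper names `sample_snapshot`, `reachable`, and `spread_on_snapshots` are assumptions for this illustration.

```python
import random
from collections import deque

def sample_snapshot(edges, rng):
    """Keep each edge (u, v, p) with probability p, i.e. remove it with 1 - p."""
    snapshot = {}
    for u, v, p in edges:
        if rng.random() < p:
            snapshot.setdefault(u, []).append(v)
    return snapshot

def reachable(snapshot, seeds):
    """Nodes reachable from the seed set in one snapshot (breadth-first search)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def spread_on_snapshots(snapshots, seeds):
    """Average reachable-set size over the snapshots, an estimate of I(S)."""
    return sum(len(reachable(s, seeds)) for s in snapshots) / len(snapshots)
```

Because a snapshot is an ordinary graph, the same list of snapshots can be reused to evaluate any seed set, which is the advantage listed for this method.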

Motivation

• In existing greedy algorithms
– submodularity and monotonicity of the estimated influence spread function are not guaranteed

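[Figure: influence graph, snapshot 1 (iteration 1), snapshot 2 (iteration 2)]

$I(S_0 \cup \{v_4\}) - I(S_0) = I(\{v_4\}) - I(\emptyset) = 1$

$I(S_1 \cup \{v_4\}) - I(S_1) = I(\{v_2, v_4\}) - I(\{v_2\}) = 3$

Submodularity is broken: the marginal gain of v_4 grows from 1 to 3 although S_0 ⊆ S_1.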

– caused by using different Monte Carlo simulation results across different influence spread estimations
– a very large value of R is required, e.g. R=20000 (R: number of Monte Carlo simulations per estimation)
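The effect can be reproduced with a small Python sketch. The two snapshots below are hypothetical (they are not the graphs drawn on the slide) and were chosen so that the marginal gains come out as 1 and 3.

```python
def reach_count(snapshot, seeds):
    """Number of nodes reachable from seeds in a snapshot (adjacency dict)."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

# Hypothetical snapshots of one influence graph (assumed for illustration).
snapshot_1 = {"v2": ["v3"]}                      # v4 activates nobody
snapshot_2 = {"v2": ["v3"], "v4": ["v5", "v6"]}  # v4 activates v5 and v6

# Iteration 1 estimates with snapshot_1, iteration 2 with snapshot_2.
gain_iter1 = reach_count(snapshot_1, ["v4"]) - reach_count(snapshot_1, [])
gain_iter2 = reach_count(snapshot_2, ["v2", "v4"]) - reach_count(snapshot_2, ["v2"])

# Reusing snapshot_2 for both estimates keeps the gains consistent.
gain_static_empty = reach_count(snapshot_2, ["v4"]) - reach_count(snapshot_2, [])
gain_static_v2 = gain_iter2
```

Here `gain_iter1` is smaller than `gain_iter2` even though the seed set grew, whereas the two gains computed on the same snapshot satisfy the submodularity inequality.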

StaticGreedy algorithm

• Core idea: always use the same snapshots for influence spread estimation
– the estimated influence spread function is then guaranteed to be submodular and monotone
– a small value of R is required, e.g. R=100

Part 1: Generate R static snapshots

Part 2: Greedy selection
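The two parts can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the edge-list format `(u, v, p)`, the function name `static_greedy`, and the per-snapshot `covered` bookkeeping are choices made for this example.

```python
import random

def static_greedy(nodes, edges, k, num_snapshots, seed=0):
    """Select up to k seeds using one fixed set of snapshots for every estimate."""
    rng = random.Random(seed)

    # Part 1: generate R static snapshots once.
    snapshots = []
    for _ in range(num_snapshots):
        adj = {}
        for u, v, p in edges:
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        snapshots.append(adj)

    def new_reach(adj, start, covered):
        """Nodes reachable from start that are not already covered."""
        if start in covered:
            return set()
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj.get(u, []):
                if w not in seen and w not in covered:
                    seen.add(w)
                    stack.append(w)
        return seen

    # Part 2: greedy selection, always evaluated on the same snapshots.
    seeds = []
    covered = [set() for _ in snapshots]  # nodes already activated, per snapshot
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            total = sum(len(new_reach(a, v, c)) for a, c in zip(snapshots, covered))
            gain = total / num_snapshots
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        seeds.append(best)
        for a, c in zip(snapshots, covered):
            c |= new_reach(a, best, c)
    return seeds
```

Since every marginal gain is a coverage count on fixed graphs, the estimated spread in this sketch is monotone and submodular by construction.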

Performance analysis: Convergence rate

• provides a (1-1/e-ε)-approximation with a small value of R

$d_{R,k} = \frac{I(S_k^*) - I(S_{R,k})}{I(S_k^*)}$

[Figure: d_{R,k} (y-axis) versus log R (x-axis); seed set size = 50]

• NetHEPT: a benchmark network
• uniform independent cascade (UIC) model: p(u, v) = p = 0.01
• weighted independent cascade (WIC) model: p(u, v) = 1/(# of in-neighbors of v)

Performance analysis: Scalability

$R_{\min} = \min\{R \mid d_{R,k} \le 0.005\}$

[Figure, panel "Minimal R required": log R_min versus seed set size. R is significantly reduced.]

[Figure, panel "Running time": log running time (sec) versus seed set size. Running time is significantly reduced.]

[Figure annotations: ≈10^3 times, ≈10^2 times]

Performance analysis: Complexity

$R' \approx 10^{-2} R$

$m' = \sum_{(u,v) \in E} p(u, v) \le m$

n: number of nodes in the social influence graph
m: number of edges in the social influence graph
m’: expected number of edges in a snapshot

Speed up StaticGreedy

• A dynamic update strategy
– calculates the marginal gain in an efficient incremental manner
  • at each step t, for each snapshot: M(v) ← M(v) - |R(v) ∩ R(v_t*)|, R(v) ← R(v) - R(v) ∩ R(v_t*)
– trades space for time

R(v): reachable nodes from v in the snapshot

[Figure: initial snapshot with nodes v1 to v8 and marginal gains M(v1)=4, M(v2)=3, M(v3)=2, M(v4)=1, M(v5)=1, M(v6)=1, M(v7)=2, M(v8)=1; v1 is highlighted]
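The update rule can be sketched for a single snapshot as follows. The adjacency-dict format, the function names, and the example graph in the usage note are assumptions for illustration; with several snapshots the same update would be applied per snapshot and the gains summed.

```python
def reachable_sets(adj, nodes):
    """R(v) for every node v in one snapshot (adjacency dict)."""
    result = {}
    for s in nodes:
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in adj.get(u, []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result[s] = seen
    return result

def select_with_dynamic_update(adj, nodes, k):
    """Greedy selection on one snapshot with incremental gain updates."""
    reach = reachable_sets(adj, nodes)        # stored R(v): space traded for time
    gain = {v: len(reach[v]) for v in nodes}  # initially M(v) = |R(v)|
    seeds = []
    for _ in range(min(k, len(nodes))):
        best = max((v for v in nodes if v not in seeds), key=lambda v: gain[v])
        seeds.append(best)
        covered = set(reach[best])            # copy R(v*) before it is modified
        for v in nodes:
            overlap = reach[v] & covered
            gain[v] -= len(overlap)           # M(v) <- M(v) - |R(v) ∩ R(v*)|
            reach[v] -= overlap               # R(v) <- R(v) - R(v) ∩ R(v*)
    return seeds, gain
```

For example, on the hypothetical snapshot `{"v1": ["v3", "v4"], "v3": ["v6"], "v2": ["v4", "v5"]}` the first pick is v1 with gain 4, after which the gain of v2 drops from 3 to 2 because v4 is already covered.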

Speed up StaticGreedy

[Figure: the snapshot after selecting v* = v1; the marginal gains of the affected nodes are updated directly, with annotated decrements -1, -4, -2, -1, -1]

Experiments: setup

• Algorithms:
– Our algorithms: StaticGreedyCELF, StaticGreedyDU
– Baselines: CELFGreedy, SP1M, PMIA, Degree, DegreeDiscount
• Tested datasets
• Independent cascade models
– uniform independent cascade (UIC) model: p(u, v) = p = 0.01
– weighted independent cascade (WIC) model: p(u, v) = 1/(# of in-neighbors of v)
• Metrics: influence spread, running time
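The two probability assignments can be written as a short Python sketch. The edge-list format of `(u, v)` pairs and the function names are assumptions for this illustration.

```python
from collections import Counter

def uic_probabilities(edges, p=0.01):
    """UIC model: every edge gets the same propagation probability p."""
    return {(u, v): p for u, v in edges}

def wic_probabilities(edges):
    """WIC model: p(u, v) = 1 / (number of in-neighbors of v)."""
    # Deduplicate edges so parallel edges do not inflate the in-neighbor count.
    in_degree = Counter(v for _, v in set(edges))
    return {(u, v): 1.0 / in_degree[v] for u, v in edges}
```

Under the WIC model the incoming probabilities of each node sum to 1, so high-degree nodes are individually harder to activate than under the UIC model.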

Experiments: influence spread

• StaticGreedy achieves better accuracy than other heuristics

[Figure: influence spread on NetPHY and DBLP, each under the UIC model and the WIC model]

Experiments: running time

• StaticGreedy runs >10^3 times faster than CELFGreedy
• StaticGreedy has comparable scalability to state-of-the-art heuristics
• StaticGreedyDU always runs faster than StaticGreedyCELF

[Figure: log running time (sec) under the UIC model and the WIC model]

Conclusion

• Essential reason for the inefficiency of existing greedy algorithms
– a risk of unguaranteed submodularity and monotonicity
– caused by different Monte Carlo simulations across different estimations
– a very large value of R is required → guaranteed accuracy + inefficiency
• StaticGreedy algorithm
– guaranteed submodularity and monotonicity
– using the same Monte Carlo simulations across different estimations
– a small value of R is required → guaranteed accuracy + high scalability
– runs >10^3 times faster than conventional greedy algorithms
• A dynamic update strategy to speed up StaticGreedy
– about 10 times faster

Thank you!

Q & A
