Scalable Vaccine Distribution in Large Graphs given Uncertain Data

Yao Zhang, B. Aditya Prakash

Department of Computer Science

Virginia Tech

CIKM, Shanghai, November 6, 2014

Outline

• Motivation
• Problem Definition
• Our Proposed Methods
• Experiments
• Conclusion

Zhang and Prakash, CIKM2014

Propagation on networks

• Information spreads over social networks
  • E.g., millions of photos and messages are shared
• Virus outbreaks spread over population networks
  • E.g., WHO estimates 5,000 to 10,000 new Ebola cases weekly in West Africa by the first week of December

[Figures from leverage.com and the Economist]


Motivation I: Diffusion models – Social Media

• In social media, information spreads over friendship networks
  • E.g., a rumor spreads over the Facebook friendship network
• Independent cascade model (IC) [Kempe+, KDD03]
  • Weights βij: propagation prob. from i to j
  • Each node has only one chance to infect its neighbors
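The IC model above can be illustrated with a short simulation. This is a minimal sketch, not the authors' code: the function name `simulate_ic`, the dict-of-dicts graph layout, and the toy graph are assumptions made for illustration.

```python
import random
from collections import deque


def simulate_ic(graph, seeds, rng):
    """Run one IC cascade; graph maps node i to {j: beta_ij}."""
    infected = set(seeds)
    frontier = deque(infected)
    while frontier:
        i = frontier.popleft()
        # Node i gets a single chance to infect each healthy neighbor j.
        for j, beta_ij in graph.get(i, {}).items():
            if j not in infected and rng.random() < beta_ij:
                infected.add(j)
                frontier.append(j)
    return infected


# Hypothetical toy graph: the rumor always crosses 1->2 and 2->4, never 1->3.
toy_graph = {1: {2: 1.0, 3: 0.0}, 2: {4: 1.0}, 3: {}, 4: {}}
ic_result = simulate_ic(toy_graph, seeds=[1], rng=random.Random(0))
print(sorted(ic_result))
```

Because each node leaves the frontier after one pass over its neighbors, it never gets a second infection attempt, which is the defining property of IC.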

[Figure: rumor spreading, with edge weights β12 and β13]

Motivation I: Diffusion models – Epidemiology

• In epidemiology, viruses spread over population contact networks
  • E.g., Ebola, chickenpox, etc. may spread when people come into contact
• SIR model [Anderson+ 1991]
  • Susceptible-Infectious-Recovered
  • Weights βij: propagation prob. from i to j
  • Recovery prob. δ for each infected node
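A discrete-time simulation makes the roles of βij and δ concrete. This is a sketch under stated assumptions: time advances in synchronous steps, infection attempts happen before recovery within a step, and the names `simulate_sir` and `max_steps` are hypothetical.

```python
import random


def simulate_sir(graph, seeds, delta, rng, max_steps=1000):
    """Discrete-time SIR; returns every node that was ever infected."""
    infectious, recovered = set(seeds), set()
    for _ in range(max_steps):
        if not infectious:
            break
        newly_infected = set()
        for i in infectious:
            for j, beta_ij in graph.get(i, {}).items():
                susceptible = j not in infectious and j not in recovered
                if susceptible and rng.random() < beta_ij:
                    newly_infected.add(j)
        # Each node that was infectious during this step recovers with prob. delta.
        still_infectious = {i for i in infectious if rng.random() >= delta}
        recovered |= infectious - still_infectious
        infectious = still_infectious | newly_infected
    return infectious | recovered


# Hypothetical toy chain: 1 -> 2 -> 3 always transmit, 3 -> 4 never does.
sir_graph = {1: {2: 1.0}, 2: {3: 1.0}, 3: {4: 0.0}, 4: {}}
sir_result = simulate_sir(sir_graph, seeds=[1], delta=1.0, rng=random.Random(0))
print(sorted(sir_result))
```

With δ = 1 every node is infectious for exactly one step, so the run behaves like IC; smaller δ gives an infected node repeated chances to infect its neighbors.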

[Figure: Ebola spreading, with edge weights β12 and β13 and recovery prob. δ]

Motivation II: Immunization

• Epidemiology
  • Centers for Disease Control (CDC)
  • Which people to vaccinate to control the spread of Ebola?
• Social Media
  • Twitter
  • Which people to warn to stop rumors like “Wall Street crashing”?

Common abstract goal: “find best nodes to remove”


Immunization Strategies

• Pre-emptive Strategy
  • Choose nodes before the epidemic starts
  • Netshield [Tong+ 2010]
  • Minimize the largest eigenvalue of the graph, which governs the epidemic threshold [Prakash+ 2011] above which a lot of people get infected

[Figure: which nodes to vaccinate?]


• Data-aware Strategy
  • Choose nodes knowing current infections (which nodes are infected)
  • DAVA-fast algorithm [Zhang and Prakash 2014]

However…

Motivation III: Real Data is Uncertain

• Epidemiology
  • Public-health surveillance
  • Surveillance Pyramid [Nishiura+, PLoS ONE 2011]
  • Each level has a certain probability of missing some truly infected people

We don’t know exactly who is infected

[Figure: surveillance pyramid with levels CDC, Lab, and Hospital, plus CNN headlines; several cases are marked “Not sure”]

• Social Media
  • Twitter: because of uniform sampling [Morstatter+, ICWSM 2013], relevant ‘infected’ tweets may be missed

We don’t know exactly who is infected

[Figure: Tweets → Sampling → Sampled Tweets; some tweets are missing after sampling]


How to design an immunization strategy in the presence of uncertainty?

• Not sure if some nodes are infected
• More realistic intervention

Challenge

• Cannot vaccinate/warn people who are already infected

We call it the Uncertain Data-Aware Vaccination Problem (this paper)



Outline

• Motivation
• Problem Definition
  • Uncertainty Models
  • Problem Formulation
• Our Proposed Methods
• Experiments
• Conclusion


Uncertainty Models

• Uniform
  • Identical prob. to be infected
  • E.g., Twitter API
• Surveillance
  • Each node takes a prob. from a set P
  • E.g., Surveillance pyramid
• Prop-Deg
  • The prob. to be infected is proportional to a node’s degree
  • E.g., people with more connections have higher prob. to be infected
• General
  • Each node has its own infection prob.
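The first three models can be written as functions that map nodes to infection probabilities; the General model is then any such mapping. This is an illustrative sketch: the function names are hypothetical, the Surveillance model is assumed to draw uniformly at random from P, and Prop-Deg uses pi = di/dmax as in the experimental setup.

```python
import random


def uniform_model(nodes, p):
    # Every node gets the same infection prob. p.
    return {v: p for v in nodes}


def surveillance_model(nodes, prob_set, rng):
    # Assumption: each node draws its prob. uniformly at random from P.
    choices = sorted(prob_set)
    return {v: rng.choice(choices) for v in nodes}


def prop_deg_model(neighbors):
    # p_i = d_i / d_max; "or 1" guards against a graph with no edges.
    d_max = max(len(nbrs) for nbrs in neighbors.values()) or 1
    return {v: len(nbrs) / d_max for v, nbrs in neighbors.items()}


# Hypothetical toy graph given as adjacency lists.
toy_neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
uniform_probs = uniform_model(toy_neighbors, 0.6)
surveillance_probs = surveillance_model(toy_neighbors, {0.1, 0.5}, random.Random(0))
prop_deg_probs = prop_deg_model(toy_neighbors)
print(prop_deg_probs)
```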

[Figure: Tweets → Sampling → Sampled Tweets]

We assume factorizable distributions, i.e., the infection probabilities of different nodes are independent.


Problem Formulation

Uncertain Data-Aware Vaccination Problem (UDAV)

Given: graph G(V,E), uncertainty model U, infected node set I

Find: the best set S of k nodes to vaccinate

Such that: the final expected epidemic size is minimized

[Figure: which two nodes to vaccinate? Uncertain nodes are marked “?” with infection prob. 0.5 and 0.8]

Formally: the expected epidemic size is the expected number of infected nodes after vaccination in Gi, where each Gi is a “possible” world
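On a tiny graph the expected epidemic size can be computed exactly by enumerating the possible worlds. This is a minimal sketch under simplifying assumptions that are not from the slides: every edge transmits with probability 1, nodes are infected independently, and the names `epidemic_size`, `expected_epidemic_size`, and `brute_force_udav` are hypothetical.

```python
from itertools import combinations, product


def epidemic_size(graph, infected, vaccinated):
    """Count infected nodes when every edge transmits (deterministic spread)."""
    blocked = vaccinated - infected  # a vaccine does not help an infected node
    seen, stack = set(infected), list(infected)
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen and v not in blocked:
                seen.add(v)
                stack.append(v)
    return len(seen)


def expected_epidemic_size(graph, probs, vaccinated):
    """Average epidemic_size over all possible worlds, weighted by probability."""
    uncertain = sorted(probs)
    total = 0.0
    for statuses in product([False, True], repeat=len(uncertain)):
        weight, infected = 1.0, set()
        for v, is_infected in zip(uncertain, statuses):
            weight *= probs[v] if is_infected else 1.0 - probs[v]
            if is_infected:
                infected.add(v)
        total += weight * epidemic_size(graph, infected, vaccinated)
    return total


def brute_force_udav(graph, probs, k):
    # Try every set of k nodes; only feasible for very small graphs.
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    return min(combinations(nodes, k),
               key=lambda s: expected_epidemic_size(graph, probs, set(s)))


# Hypothetical toy graph: a and b may be infected with prob. 0.5 and 0.8.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
toy_probs = {"a": 0.5, "b": 0.8}
best = brute_force_udav(toy_graph, toy_probs, k=1)
print(best, expected_epidemic_size(toy_graph, toy_probs, set(best)))
```

Vaccinating a or b gains nothing here: in the worlds where the vaccinated node is already infected the vaccine is wasted, and in the others it was never going to spread anything. The cut node c is the better choice.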


Complexity of UDAV

• NP-hard, and cannot be approximated within an absolute error
  • A special case of UDAV (equal to the deterministic case) is NP-hard [Zhang+ 2014]


Overview of proposed methods

• UDAV is a stochastic optimization problem
• Sampling-based method
  • the Sample Average Approximation (SAA) framework
• Expectation-based method
  • the expected “situation”

Hedging Uncertainty

[Figure: which two nodes to vaccinate? Uncertain nodes are marked “?” with infection prob. 0.5 and 0.8]


Outline

• Motivation
• Problem Definition
• Our Proposed Methods
  • Sample-Cascade
  • Expect-Max
• Experiments
• Conclusion


Sample-Cascade: Idea

Idea: sample deterministic cases, and take the average

UDAV can also be formulated as maximizing the expected benefit, where the benefit in each sample is that of vaccinating the healthy node set Si in the deterministic graph Gi

[Figure: working on the sampled graphs, Sample 1 ... Sample L; two uncertain nodes with prob. 0.5 and 0.8 give 4 “possible” worlds]


Sample-Cascade

Issue 1: how to approximate the benefit (see paper for details)

Solution: use its lower bound (Lemma 1), the expected benefit on the dominator tree of Gi

[Figure: working on trees, the dominator trees of the sampled graphs]

Dominator tree: u is an ancestor of v in the tree if every path from the root to v contains u (see [Lengauer and Tarjan, 1979]). Here, the root is the set of infected nodes.
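A dominator tree can be computed with a simple fixed-point iteration. This is an illustrative sketch, not the near-linear algorithm of Lengauer and Tarjan: the virtual `ROOT` node standing for the infected set, the function names, and the use of subtree size as a stand-in for benefit are assumptions.

```python
ROOT = "__root__"  # hypothetical virtual root standing for the infected set


def immediate_dominators(graph, infected):
    succ = {u: list(nbrs) for u, nbrs in graph.items()}
    succ[ROOT] = list(infected)
    # Collect the nodes reachable from the root.
    order, seen, stack = [], {ROOT}, [ROOT]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in succ.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    preds = {v: [] for v in order}
    for u in order:
        for v in succ.get(u, ()):
            preds[v].append(u)
    # Iterate dom(v) = {v} | intersection of dom(p) over predecessors p.
    dom = {v: set(order) for v in order}
    dom[ROOT] = {ROOT}
    changed = True
    while changed:
        changed = False
        for v in order[1:]:
            new_dom = set.intersection(*(dom[p] for p in preds[v])) | {v}
            if new_dom != dom[v]:
                dom[v], changed = new_dom, True
    # The immediate dominator is the strict dominator closest to v.
    return {v: max(dom[v] - {v}, key=lambda d: len(dom[d])) for v in order[1:]}


def subtree_sizes(idom):
    # Number of nodes each node dominates, itself included.
    size = {v: 1 for v in idom}
    for v in idom:
        u = idom[v]
        while u != ROOT:
            size[u] += 1
            u = idom[u]
    return size


# Hypothetical toy graph: c is reachable through both a and b.
toy_graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
idom = immediate_dominators(toy_graph, infected=["s"])
sizes = subtree_sizes(idom)
print(idom, sizes)
```

In the toy graph neither a nor b dominates c, because c stays reachable through the other one. Removing c cuts off d as well, so c has a subtree of size 2.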


Sample-Cascade

• Algorithm:
  1. Sample graphs Gi from G and U, and build the dominator tree of each Gi
  2. Select the node a* with the maximum expected benefit over the dominator trees
  3. Remove a* from G
  4. Go to Step 2 until |S| = k

[Figure: working on trees, the dominator trees of the sampled graphs]
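The greedy loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: it recomputes the epidemic size in every sampled world directly instead of using the dominator-tree lower bound, it samples live edges to obtain deterministic graphs, and all names are hypothetical.

```python
import random


def sample_world(graph, probs, rng):
    """Sample one deterministic world: an infected set and a set of live edges."""
    infected = {v for v, p in probs.items() if rng.random() < p}
    live = {u: [v for v, beta in nbrs.items() if rng.random() < beta]
            for u, nbrs in graph.items()}
    return live, infected


def infected_count(live, infected, vaccinated):
    blocked = vaccinated - infected  # vaccination only protects healthy nodes
    seen, stack = set(infected), list(infected)
    while stack:
        u = stack.pop()
        for v in live.get(u, ()):
            if v not in seen and v not in blocked:
                seen.add(v)
                stack.append(v)
    return len(seen)


def sample_cascade_sketch(graph, probs, k, num_samples, rng):
    worlds = [sample_world(graph, probs, rng) for _ in range(num_samples)]
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    chosen = set()
    for _ in range(k):
        def average_size(candidate):
            trial = chosen | {candidate}
            return sum(infected_count(live, inf, trial) for live, inf in worlds) / len(worlds)
        # Greedy step: add the node that leaves the smallest average epidemic.
        chosen.add(min((v for v in nodes if v not in chosen), key=average_size))
    return chosen


# Hypothetical toy graph with edge weights beta_ij = 1.0.
toy_graph = {"a": {"c": 1.0}, "b": {"c": 1.0}, "c": {"d": 1.0}, "d": {"e": 1.0}, "e": {}}
toy_probs = {"a": 0.5, "b": 0.8}
chosen = sample_cascade_sketch(toy_graph, toy_probs, k=1, num_samples=50, rng=random.Random(0))
print(chosen)
```

The direct recomputation costs one graph traversal per candidate per sample, which is the cost the dominator trees are meant to avoid.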

Sample-Cascade

Issue 2: number of samples l

Solution: (Hoeffding's inequality) worst case l = O(|V|^2)

Running time: O(l(k|E| + k|V| + |V| log |V|))

Accurate, but too slow for large networks!
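Hoeffding's inequality bounds how far the mean of l independent samples, each lying in an interval of width R, can deviate from its expectation: the probability of an error of at least ε is at most 2·exp(−2lε²/R²). Solving for l gives the helper below. Reading R as growing with |V| for a fixed ε is one way to arrive at a quadratic worst case; that reading, and the function name, are assumptions rather than the paper's derivation.

```python
import math


def hoeffding_num_samples(value_range, epsilon, failure_prob):
    """Smallest l with 2 * exp(-2 * l * epsilon**2 / value_range**2) <= failure_prob."""
    return math.ceil(value_range ** 2 * math.log(2.0 / failure_prob) / (2.0 * epsilon ** 2))


# Doubling the range of the sampled quantity roughly quadruples the sample count.
print(hoeffding_num_samples(10.0, 1.0, 0.05), hoeffding_num_samples(20.0, 1.0, 0.05))
```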

Outline

• Motivation
• Problem Definition
• Our Proposed Methods
  • Sample-Cascade
  • Expect-Max
• Experiments
• Conclusion


Expect-Max: Idea

Idea: construct the expected “situation” (graph)

• Create a “super node” and add an edge from it to each possibly infected node (edge weights 0.5, 0.8, and 1.0 in the example)
• Lemma: when the budget = 1, UDAV can be exactly solved on the expected graph GE
• How to calculate it?

See more details in the paper

[Figure: original graph with uncertain nodes (prob. 0.5 and 0.8) and the expected graph GE with a super node]
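The construction can be sketched in a few lines. This is an assumption-laden illustration based on the figure: the super node gets one outgoing edge per possibly infected node, weighted by that node's infection probability, and the names `SUPER` and `build_expected_graph` are hypothetical.

```python
SUPER = "__super__"  # hypothetical name for the super node


def build_expected_graph(graph, probs):
    """Copy the weighted graph and attach a super node to every possibly infected node."""
    expected = {u: dict(nbrs) for u, nbrs in graph.items()}
    # Assumption: the edge weight equals the node's infection prob.
    expected[SUPER] = {v: p for v, p in probs.items() if p > 0.0}
    return expected


# Hypothetical toy graph; x is definitely infected, a and b are uncertain.
toy_graph = {"x": {"c": 0.3}, "a": {"c": 0.3}, "b": {"c": 0.3}, "c": {}}
toy_probs = {"a": 0.5, "b": 0.8, "x": 1.0}
expected_graph = build_expected_graph(toy_graph, toy_probs)
print(expected_graph[SUPER])
```

With a single source node, the uncertain problem turns into a deterministic one on GE in which the super node is the only infected node.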


Calculating Benefit on the Expected Graph

• We propose two methods to calculate it
  • Using the dominator tree: Expect-Dom
  • Using the drop of the first eigenvalue: Expect-Eig


Expect-Dom

Idea: use the benefit on the dominator tree of GE to approximate the benefit on the expected graph GE

Steps:
1. GE = construct the expected graph
2. T = build a dominator tree of GE
3. Select v with max. benefit on T
4. Remove v from G

[Figure: expected graph GE (edge weights 0.5, 0.8, 1.0) and the dominator tree constructed from it]



Expect-Eig

Idea: use the drop of the first eigenvalue to approximate the benefit on the expected graph GE

• The drop of the first eigenvalue measures the change in the epidemic threshold (can be computed fast [Tong+, ICDM 2010])
• Lemma: the number of newly infected nodes is bounded by the first eigenvalue (details in the paper)

[Figure: expected graph GE (edge weights 0.5, 0.8, 1.0), before and after calculating the eigenvalue drop]
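The eigenvalue drop of a node can be computed by brute force with power iteration. This sketch is not the fast approximation of [Tong+, ICDM 2010]: it assumes an undirected (symmetric) weighted graph and recomputes the first eigenvalue after deleting each node; the function names are hypothetical.

```python
import math


def first_eigenvalue(adj, nodes, iters=200):
    """Largest eigenvalue of the symmetric adjacency matrix restricted to `nodes`."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Power iteration on A + I; the shift avoids oscillation on bipartite graphs.
        y = {v: x[v] + sum(w * x[u] for u, w in adj.get(v, {}).items() if u in x)
             for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[v] * w * x[u]
               for v in nodes for u, w in adj.get(v, {}).items() if u in x)


def eigen_drop(adj, v):
    nodes = list(adj)
    return first_eigenvalue(adj, nodes) - first_eigenvalue(adj, [u for u in nodes if u != v])


# Hypothetical star graph: center c with three leaves.
star = {"c": {"l1": 1.0, "l2": 1.0, "l3": 1.0},
        "l1": {"c": 1.0}, "l2": {"c": 1.0}, "l3": {"c": 1.0}}
best_node = max(star, key=lambda v: eigen_drop(star, v))
print(best_node, eigen_drop(star, best_node))
```

Removing the center disconnects the star and drops the first eigenvalue from √3 to 0, while removing a leaf only lowers it to √2.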


Expect-Eig

Steps:
1. GE = construct the expected graph
2. Select v with max. drop of the first eigenvalue
3. Remove v from G



Expect-Dom vs. Expect-Eig

• Let α be the support of U
  • the percentage of nodes that may be initially infected

[Figure: example graph with uncertain nodes (prob. 0.5 and 0.8), α = 0.5]

• As α increases,
  • Observation I: Expect-Dom becomes worse
    • Intuition: when α is equal to 0, we have the deterministic case of UDAV (can be solved by DAVA-fast [Zhang and Prakash 2014])
  • Observation II: Expect-Eig becomes better
    • Intuition: as α increases, we have more and more uncertainty, which is close to the pre-emptive case (can be solved by Netshield [Tong+ 2010])

More formal justification in the paper


Expect-Max: a hybrid algorithm

Idea: put Expect-Dom and Expect-Eig together

• Pick the better one between Expect-Dom and Expect-Eig
• They are complementary for different distributions and different networks (we don’t know where the cross point is)

Running time (subquadratic): O(k(|V|+|E|)+|V|log|V|+T)
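The selection step amounts to scoring both candidate sets and keeping the one with the smaller estimated epidemic size. The slides do not say how the two are compared, so the `evaluate` callback and the scores below are hypothetical placeholders.

```python
def expect_max_sketch(candidates, evaluate):
    """Return (name, node set) of the candidate with the smallest estimated epidemic size."""
    best_name = min(candidates, key=lambda name: evaluate(candidates[name]))
    return best_name, candidates[best_name]


# Hypothetical outputs of the two methods and made-up scores for illustration.
candidates = {"Expect-Dom": {"c"}, "Expect-Eig": {"d"}}
scores = {frozenset({"c"}): 1.3, frozenset({"d"}): 2.2}
winner = expect_max_sketch(candidates, lambda s: scores[frozenset(s)])
print(winner)
```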


Extending to SIR

• Our methods can be extended to the SIR model
  • Idea: use an equivalent IC model whose propagation probability is derived from the SIR parameters

See paper for details


Experiments: datasets

• Social Media
  • AS router graph: OREGON
  • Hyperlink network: STANFORD
  • Peer-to-peer network: GNUTELLA
  • Friendship network: BRIGHTKITE
• Epidemiology
  • PORTLAND and MIAMI: large urban social-contact graphs used in national smallpox modeling studies [Eubank+, 2004]

        KARATE  OREGON  STANFORD  GNUTELLA  BRIGHTKITE   PORTLAND     MIAMI
|V|     34      633     8,929     10,876    59,228       0.5 million  0.6 million
|E|     156     2,172   53,829    39,994    0.2 million  1.6 million  2.1 million
Model   IC      IC      IC        IC        IC           SIR          SIR


Experiments: setup

• Uncertainty models
  • Uniform: p = 0.6
  • Surveillance: p is chosen from {0.1, 0.5}
  • Prop-Deg: pi = di/dmax
• Settings
  • Pick 5% of nodes uniformly at random as infected
  • Number of samples: 500

See more details in the paper


Experiments: baselines

• OPTIMAL: brute-force algorithm which tries all possible cases (optimal; only run on KARATE)
• RANDOM: choose nodes uniformly at random from W
• DEGREE: choose the top-k nodes from W according to weighted degrees
• PAGERANK: choose the top-k nodes from W with the highest PageRank
• PER-PRANK: choose the top-k nodes from W with the highest personalized PageRank with respect to infected nodes
• DAVA-fast: a fast data-aware immunization method in the presence of already infected nodes [Zhang+, SDM 14]

W: the set of nodes that are not definitely infected (0 <= p < 1)
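The two simplest baselines can be sketched directly from their descriptions. This is an illustration with assumptions: weighted degree is taken to be the sum of outgoing edge weights, nodes without an entry in `probs` are treated as having p = 0, and the function names are hypothetical.

```python
import random


def candidate_set(nodes, probs):
    # W: nodes that are not definitely infected (0 <= p < 1).
    return [v for v in nodes if probs.get(v, 0.0) < 1.0]


def random_baseline(w, k, rng):
    return rng.sample(w, k)


def degree_baseline(graph, w, k):
    # Assumption: weighted degree = sum of outgoing edge weights.
    weighted_degree = {v: sum(graph.get(v, {}).values()) for v in w}
    return sorted(w, key=lambda v: weighted_degree[v], reverse=True)[:k]


# Hypothetical toy graph; d is definitely infected and therefore excluded from W.
toy_graph = {"a": {"b": 0.5, "c": 0.5}, "b": {"c": 0.9}, "c": {},
             "d": {"a": 1.0, "b": 1.0, "c": 1.0}}
toy_probs = {"a": 0.5, "d": 1.0}
w = candidate_set(sorted(toy_graph), toy_probs)
top_degree = degree_baseline(toy_graph, w, k=2)
random_pick = random_baseline(w, k=2, rng=random.Random(0))
print(w, top_degree, random_pick)
```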


Results: Sample-Cas

• Sample-Cas saves at least 90% of the nodes that OPTIMAL saves (close to optimal)

[Figure: Sample-Cas compared to OPTIMAL; higher is better]


Results: Expect-Max: α matters

[Figure: panels STANFORD and BRIGHTKITE]

• R > 1: Expect-Dom is better
• R < 1: Expect-Eig is better
• R = 1: cross point (different for different networks and different distributions)

This is why we use Expect-Max


Results: Effectiveness

[Figure: panels GNUTELLA (IC) and MIAMI (SIR); annotation: 10K nodes; higher is better]

Sample-Cas and Expect-Max consistently outperform the baseline algorithms.

(See more results in the paper)


Results: Scalability

[Figure: running time (sec.); lower is better; some runs did not finish within 24 hours]


Conclusion

Uncertain Data-Aware Vaccination

Given: graph and uncertainty model

Find: ‘best’ k nodes for vaccination

• Uncertainty models
  • Uniform, Surveillance, Prop-Deg, General
• Proposed Methods
  • Sample-Cas: sampling graphs (slow, accurate)
  • Expect-Max: constructing the expected graph (fast, subquadratic)

[Figure: sampling of uncertain nodes (prob. 0.5 and 0.8) and the expected graph (edge weights 0.5, 0.8, 1.0)]


Any questions?

Code at: http://people.cs.vt.edu/~yaozhang

Yao Zhang, B. Aditya Prakash

