

Multiple Source Detection without Knowing the Underlying Propagation Model

Zheng Wang, Chaokun Wang,∗ Jisheng Pei, Xiaojun Ye

School of Software, Tsinghua University, Beijing 100084, P.R. China

Tsinghua National Laboratory for Information Science and Technology (TNList)

{zheng-wang13, pjs07}@mails.tsinghua.edu.cn; {chaokun, yexj}@tsinghua.edu.cn

Abstract

Information source detection, which is the reverse problem of information diffusion, has attracted considerable research effort recently. Most existing approaches assume that the underlying propagation model is fixed and given as input, which may limit their application range. In this paper, we study the multiple source detection problem when the underlying propagation model is unknown. Our basic idea is source prominence, namely that nodes surrounded by larger proportions of infected nodes are more likely to be infection sources. As such, we propose a multiple source detection method called Label Propagation based Source Identification (LPSI). Our method lets infection status iteratively propagate in the network as labels, and finally uses local peaks of the label propagation result as source nodes. In addition, both the convergent and iterative versions of LPSI are given. Extensive experiments are conducted on several real-world datasets to demonstrate the effectiveness of the proposed method.

Introduction

Information diffusion is one of the most important topics in social network research (Centola 2010; Wang et al. 2016). Recently, people are more interested in its reverse problem: Given a snapshot of a partially infected network, can we identify the infection sources? The answer to this problem has vast applications in mitigating the damage of epidemics caused by infectious diseases, rumor spreading in social media, and so forth.

The abovementioned multiple source detection problem has attracted many researchers (Prakash, Vreeken, and Faloutsos 2012; Luo, Tay, and Leng 2013; Zang et al. 2015; Chen, Zhu, and Ying 2016). These studies deal with this problem differently according to the types of the used propagation models such as the Susceptible-Infected (SI) model (Anderson, May, and Anderson 1992) and Susceptible-Infected-Recovered (SIR) model (Allen 1994), i.e., they all assume that the underlying propagation model is fixed and known.

However, in practice, identifying the correct propagation model always needs prior knowledge, which limits the application range of source detection methods. For instance, it is hard to choose an appropriate propagation model for a new infectious disease or online rumor. Moreover, it is difficult to acquire the true values of parameters in the pre-selected underlying propagation model. Therefore, it is necessary but challenging to detect infection sources without knowing the underlying propagation model.

[Figure 1 image not reproduced. It shows a partially infected network with nodes 0 to 15 (legend: infected nodes, uninfected nodes), an "Assign Labels" step that gives +1 to infected nodes and -1 to uninfected nodes, and a "Label Propagation" step whose resulting label values range from -0.47 to 1.10.]

Figure 1: The framework of our LPSI approach to the multiple source detection problem.

∗Corresponding author: Chaokun Wang.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this paper, we address the multiple source detection problem based on the idea of source prominence, namely that nodes surrounded by larger proportions of infected nodes are more likely to be sources. To better understand the basic idea, one can imagine that a part of the network has been infected by an infection source. On the one hand, at the margin of the infected region, nodes tend to have fewer infected neighbors. On the other hand, in the center of the infected region, nodes tend to have more infected neighbors. This intuition is reasonable in most existing propagation models, such as the SI and SIR models.

Inspired by the primary idea of source prominence, we propose a multiple source detection method called Label Propagation based Source Identification (LPSI). Our approach tries to automatically identify actual source nodes without knowing the underlying propagation model. The general process is as follows. We first assign positive labels to the infected nodes, and negative labels to the uninfected nodes in the network. Then we iteratively propagate label information among nodes based on a probability matrix which is generated according to the network structure. Finally, we get the convergence result, where “source” nodes are shown as local peaks with the highest “infected” label values.

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Example: As an example, Fig. 1 depicts a partially infected network, in which a sub-network has been infected by a stochastic process starting from two sources (nodes 1 and 11). The red nodes are infected nodes and the white ones are uninfected. At first, we assign positive labels (+1) to infected ones, and negative labels (-1) to uninfected ones. After that, their label values are propagated and updated iteratively, and the propagation result at iteration step 10 is shown at the bottom right of Fig. 1. Finally, we get the convergence result, where nodes 1 and 11 are two local maximum points. As a result, we consider these two nodes as infection sources.

Contributions: The major contributions of this paper are summarized as follows:

• We present and formalize the multiple source detection problem without knowing the underlying propagation model, which has rarely been mentioned in the literature.

• We propose the Label Propagation based Source Identification (LPSI) method for the above problem. In addition, both the convergent and iterative versions of LPSI are brought forward.

• Extensive experiments conducted on real-world datasets demonstrate the effectiveness and efficiency of our methods in identifying the actual source nodes.

Preliminaries

In this section, we briefly review some propagation models proposed so far, and formulate our problem.

Propagation Models

According to (Easley and Kleinberg 2010), existing propagation models could be categorized as either infection models or influence models, with respect to their intended applications.

Infection Models. To describe the transmission of communicable disease through individuals, various infection (epidemic) models are proposed, such as the Susceptible-Infected (SI) model (Anderson, May, and Anderson 1992) and the Susceptible-Infected-Recovered (SIR) model (Allen 1994).

In the SI model, each node is in one of two states: susceptible (S) or infected (I). Once a node is infected, it stays infected forever. In each discrete time step, each infected node tries to infect its susceptible (uninfected) neighbors independently with probability p, which reflects the strength of the disease spread. In the SIR model, in contrast, each node has three possible states: susceptible (S), infected (I) and recovered (R). The infection process is similar except that infected nodes can recover with probability q. In addition, recovered nodes cannot be infected any more.
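The two infection models can be illustrated with a small simulation. This is a minimal sketch, not the simulator used in the experiments: the function name `simulate_infection`, the adjacency-list input, the stopping rule based on a target infected fraction, and the order of infection and recovery within one time step are all assumptions made for illustration.

```python
import random


def simulate_infection(adj, sources, p, q=0.0, min_fraction=0.3,
                       max_steps=1000, seed=0):
    """Simulate a discrete-time SI (q == 0) or SIR (q > 0) process.

    adj maps each node to a list of its neighbours (undirected graph).
    Returns (infected, recovered) once at least `min_fraction` of the nodes
    have been infected at some point, the infection dies out, or
    `max_steps` is reached.
    """
    rng = random.Random(seed)
    infected = set(sources)
    recovered = set()
    target = min_fraction * len(adj)
    for _ in range(max_steps):
        if len(infected) + len(recovered) >= target or not infected:
            break
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # Only susceptible neighbours can be infected, each with
                # probability p and independently of the others.
                if v not in infected and v not in recovered and rng.random() < p:
                    newly_infected.add(v)
        # Assumption: nodes infected before this step recover with
        # probability q after they have tried to infect their neighbours.
        newly_recovered = {u for u in infected if rng.random() < q}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return infected, recovered


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(simulate_infection(path, [0], p=0.6, q=0.2, min_fraction=0.6))
```

Setting `q=0.0` gives the SI model, because no node ever leaves the infected state.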

Influence Models. In order to model how users influence each other in a social network, researchers have proposed many influence models, such as the Independent-Cascade (IC) model (Goldenberg, Libai, and Muller 2001) and the Linear Threshold (LT) model (Kempe, Kleinberg, and Tardos 2003).

In both IC and LT models, every node is represented as a binary variable with either active (infected) or inactive (susceptible) status. The major difference between these two models is the way an active node influences its neighbors. In the IC model, when a node first becomes active at a time step, it has exactly one chance to independently influence (infect) its susceptible neighbors, and cannot activate neighbors in subsequent rounds. In the LT model, in contrast, the sum of incoming edge influence degrees on any node is assumed to be at most 1, and every node has an activation threshold chosen uniformly at random from [0, 1]. At each time step, a node is activated by its activated neighbors if the sum of their influence degrees exceeds its threshold. In both models, the influence propagates until no more nodes can become active.
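The single-chance activation rule of the IC model can be sketched as follows. This is an illustrative sketch only; the function name `simulate_ic`, the adjacency-list input and the use of one shared activation probability `p` for every edge are assumptions.

```python
import random


def simulate_ic(adj, sources, p, seed=0):
    """Independent-Cascade sketch on an undirected graph.

    Each newly activated node gets exactly one chance to activate each of
    its inactive neighbours, succeeding independently with probability p.
    Returns the set of active nodes once no further activation is possible.
    """
    rng = random.Random(seed)
    active = set(sources)
    frontier = list(sources)  # nodes activated in the previous round
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        # Nodes in the old frontier never try again.
        frontier = next_frontier
    return active


if __name__ == "__main__":
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(sorted(simulate_ic(ring, [0], p=0.5)))
```

Because only the latest frontier is processed in each round, the loop ends as soon as a round activates no new node, which matches the statement that influence propagates until no more nodes can become active.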

Multiple Source Detection Problem

The multiple source detection problem studied in this paper can be formulated as follows: Given a social network $G = (V, E)$ and an infection node vector $Y = (Y_1, \ldots, Y_{|V|})$, where $Y_i = 1$ indicates that node $i$ is infected and $Y_i = -1$ otherwise, the goal is to find the original infection source set $S \subset V$.

Note that in the above definition, we do not assume the underlying propagation model is known. As such, our method is independent of the propagation model, which gives it a broader application range than state-of-the-art methods.

The LPSI Method

We formally present our LPSI method (in Alg. 1) in this section. The aim of LPSI is to identify both the number of original infection sources and the source nodes themselves in a partially infected network, which can be achieved by the following three steps.

Step 1: Assign labels to the partially infected network. The label vector $G^t$, initialized with the infection node vector $Y$, holds the labels of the nodes at time $t$ in network $G$. In other words, at the beginning, we assign positive labels (+1) and negative labels (-1) to infected and uninfected nodes in the network, respectively.

Step 2: Label propagation on the network. Before starting label propagation, we build a weight matrix which decides the label propagation probability among nodes. We build the weight matrix $W$ on the edge set $E$, where $W_{ij} = 1$ indicates that there is an edge between node $i$ and node $j$ (and $W_{ij} = 0$ otherwise). This matrix is further symmetrically normalized as $S$ (Line 2 in Alg. 1), in which $S_{ij}$ represents the label propagation probability from node $j$ to node $i$.
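The symmetric normalization of Line 2 in Alg. 1 can be written out directly. The sketch below uses dense Python lists for readability; the function name `normalized_matrix` and the edge-list input are assumptions, and a practical implementation would likely use a sparse representation.

```python
import math


def normalized_matrix(num_nodes, edges):
    """Return S = D^(-1/2) W D^(-1/2) for an undirected, unweighted graph.

    edges is an iterable of (i, j) pairs with 0 <= i, j < num_nodes.
    """
    W = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0
    # D is diagonal, with D_ii equal to the sum of the i-th row of W.
    degree = [sum(row) for row in W]
    S = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(num_nodes):
            if W[i][j]:
                S[i][j] = W[i][j] / math.sqrt(degree[i] * degree[j])
    return S


if __name__ == "__main__":
    # Path graph 0 - 1 - 2: the middle node has degree 2, the ends degree 1.
    for row in normalized_matrix(3, [(0, 1), (1, 2)]):
        print(row)
```

For an unweighted graph each nonzero entry is $1 / \sqrt{d_i d_j}$, so edges between high-degree nodes carry less label information than edges between low-degree nodes.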

Based on the matrix $S$, we propagate labels on the network iteratively. In each iteration, each node gets a fraction of label information from its neighborhood, and retains some label information of its initial state. Therefore the label value of node $i$ at time $t+1$ becomes:

$$G_i^{t+1} = \alpha \sum_{j: j \in N(i)} S_{ij} G_j^t + (1 - \alpha) Y_i \tag{1}$$

where $0 < \alpha < 1$ is the fraction of label information that node $i$ gets from its neighbors, and $N(i)$ represents the neighborhood of node $i$. We can stop this iteration when convergence is reached. Here, “convergence” means the label values of nodes will not change in several successive iterations of label propagation (the convergence analysis can be found in the next section).

Algorithm 1 Label Propagation based Source Identification (LPSI)
Input: The infected network $G = (V, E)$, parameter $\alpha$; the initial infection node vector $Y$.
Output: The source set $S$.
 1: Form the weight matrix $W$ defined by $W_{ij} = 1$ if there exists an edge connecting nodes $i$ and $j$;
 2: Construct the matrix $S = D^{-1/2} W D^{-1/2}$, where $D$ is a diagonal matrix with its $(i,i)$-element equal to the sum of the $i$-th row of $W$;
 3: $G^{t=0} \leftarrow Y$;
 4: while $G^t$ does not reach the convergence $G^*$ do
 5:     for each node $i$ do
 6:         $G_i^{t+1} = \alpha \sum_{j: j \in N(i)} S_{ij} G_j^t + (1 - \alpha) Y_i$;
 7:     end for
 8:     $t = t + 1$;
 9: end while
10: $S = \{\}$;
11: for each original infected node $i$ do
12:     if $G_i^*$ is larger than the $G^*$ values of all of $i$'s neighbors then
13:         $S = S \cup \{i\}$;
14:     end if
15: end for
16: return $S$;
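The propagation loop of Eq. 1 (Lines 3 to 9 of Alg. 1) can be sketched on an adjacency list. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function name `lpsi_propagate` is hypothetical, the loop runs a fixed number of sweeps instead of testing convergence, and $S_{ij}$ is computed on the fly as $1/\sqrt{d_i d_j}$ for neighbouring nodes.

```python
import math


def lpsi_propagate(adj, infected, alpha=0.5, iterations=5):
    """Run `iterations` synchronous label propagation sweeps (Eq. 1).

    adj maps each node to a list of neighbours; infected is the set of
    nodes observed as infected. Returns a dict of label values.
    """
    # Step 1: +1 for infected nodes, -1 for uninfected nodes.
    Y = {v: (1.0 if v in infected else -1.0) for v in adj}
    deg = {v: len(adj[v]) for v in adj}
    G = dict(Y)
    for _ in range(iterations):
        # The comprehension reads the previous G throughout, so all nodes
        # are updated from the labels at time t, as in Eq. 1.
        G = {
            i: alpha * sum(G[j] / math.sqrt(deg[i] * deg[j]) for j in adj[i])
            + (1 - alpha) * Y[i]
            for i in adj
        }
    return G


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    labels = lpsi_propagate(path, infected={0, 1, 2}, alpha=0.5, iterations=5)
    for node in sorted(labels):
        print(node, round(labels[node], 3))
```

Each sweep touches every edge twice, which is where the per-iteration cost proportional to the number of edges comes from.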

Step 3: Sources Identification. Suppose the label vector $G^t$ finally converges to $G^*$ at the end of the above label propagation process. A node $i$ is identified as a source node if it satisfies the following two conditions: 1) node $i$ is an infected node initially, i.e., $Y_i = 1$; and 2) its final label value $G_i^*$ is larger than those of its neighbors.

The first condition reflects the fact that infected nodes are more likely to be sources than uninfected ones. Although this selection may miss a few recovered source nodes under some propagation models such as the SIR model, it avoids confusion with nodes which have never been infected. The second condition ensures that the detected sources are local maximum points in the label propagation result, which is consistent with the primary idea of source prominence. As such, these local maxima are determined by both node infection status (labels) and network structure. More explanations about the local maxima can be found in the next section.
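The two conditions translate into a short filter over the infected nodes. The sketch below assumes the label values are already available as a dict; the function name `identify_sources` and the hand-picked label values in the example are illustrative only.

```python
def identify_sources(adj, infected, labels):
    """Return infected nodes whose label is strictly larger than the labels
    of all their neighbours (Lines 10 to 16 of Alg. 1).

    Note: an infected node without neighbours passes the test vacuously.
    """
    sources = set()
    for i in infected:  # condition 1: Y_i = 1
        if all(labels[i] > labels[j] for j in adj[i]):  # condition 2
            sources.add(i)
    return sources


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    # Hypothetical label values, chosen by hand for illustration.
    labels = {0: 0.5, 1: 0.9, 2: 0.4, 3: -0.6}
    print(identify_sources(path, infected={0, 1, 2}, labels=labels))
```

The strict inequality means that two adjacent nodes with exactly equal label values are both rejected; how ties should be handled is not specified in the text, so this choice is an assumption.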

Algorithm Analysis

In this section, we analyze the properties of convergence and local maxima in the label propagation process of LPSI. In addition, we discuss the relationship between the idea of source prominence and propagation models.

Convergence Analysis

In our LPSI method, the iteration equation of the label propagation (Eq. 1) can be rewritten as $G^{t+1} = \alpha S G^t + (1 - \alpha) Y$. By the initial condition that $G^0 = Y$, we have:

$$G^t = (\alpha S)^t Y + (1 - \alpha) \sum_{i=0}^{t-1} (\alpha S)^i Y. \tag{2}$$

As proved in (Zhou et al. 2004; Wang and Zhang 2008), the parameter $0 < \alpha < 1$ and the normalized matrix $S$ lead to $\lim_{t \to \infty} (\alpha S)^t = 0$ and $\lim_{t \to \infty} \sum_{i=0}^{t-1} (\alpha S)^i = (I - \alpha S)^{-1}$, where $I$ is an $n \times n$ identity matrix. Consequently, the iteration will converge to:

$$G^* = (1 - \alpha)(I - \alpha S)^{-1} Y. \tag{3}$$

Therefore, the label propagation iteration in Algorithm 1 will finally converge. In addition, Eq. 3 shows that we can obtain the convergence result directly without any iterations.
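The agreement between the iteration and Eq. 3 can be checked numerically on a toy graph. The sketch below avoids an explicit matrix inverse and instead solves the equivalent linear system $(I - \alpha S) G^* = (1 - \alpha) Y$ with a small Gaussian elimination routine; the helper names and the three-node path graph are assumptions chosen for illustration.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def closed_form(S, Y, alpha):
    """Eq. 3, computed as the solution of (I - alpha S) G = (1 - alpha) Y."""
    n = len(Y)
    A = [[(1.0 if i == j else 0.0) - alpha * S[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, [(1 - alpha) * y for y in Y])


def iterate(S, Y, alpha, steps):
    """Matrix form of Eq. 1, applied `steps` times starting from G^0 = Y."""
    n = len(Y)
    G = Y[:]
    for _ in range(steps):
        G = [alpha * sum(S[i][j] * G[j] for j in range(n)) + (1 - alpha) * Y[i]
             for i in range(n)]
    return G


# Path graph 0 - 1 - 2 with degrees 1, 2, 1; nodes 0 and 1 are infected.
r = 1 / math.sqrt(2)
S = [[0.0, r, 0.0], [r, 0.0, r], [0.0, r, 0.0]]
Y = [1.0, 1.0, -1.0]

if __name__ == "__main__":
    print(iterate(S, Y, 0.5, 5))
    print(iterate(S, Y, 0.5, 200))
    print(closed_form(S, Y, 0.5))
```

The error of the iteration shrinks by roughly a factor of $\alpha$ per sweep, because the eigenvalues of $S$ lie in $[-1, 1]$; this is consistent with a handful of iterations already being close to the converged result.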

Local Maxima in Label Propagation

In our LPSI method, the convergence label vector $G^*$ minimizes the following cost function (Zhou et al. 2004):

$$Q(G) = \frac{1}{2} \left( \sum_{i,j=1}^{n} W_{ij} \left\| \frac{G_i}{\sqrt{D_{ii}}} - \frac{G_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \left\| G_i - Y_i \right\|^2 \right). \tag{4}$$

The first term of the right-hand side in the cost function is the smoothness constraint, which means that label values (i.e., infection status) should not change too much between connected nodes. In this constraint, the difference between two nodes is further normalized by their degrees, which is consistent with the basic idea of source prominence, i.e., the nodes surrounded by larger proportions of infected nodes tend to have higher infected label values. Similarly, the nodes surrounded by larger proportions of uninfected nodes tend to have lower infected (i.e., higher uninfected) label values. Thus, the infected label values of nodes increase as they get closer to source nodes. Intuitively, a “source” is likely to be a local maximum point surrounded by a group of neighboring nodes, whose infected label values decrease with respect to their distance from the source.

The second term of the right-hand side in Eq. 4 is the fitting constraint, which means that the final label propagation result $G^*$ should not change too much from the original label assignment (i.e., initial infection status). The trade-off between these two constraints is captured by the parameter $\mu$, which is in one-to-one correspondence with the parameter $\alpha$ in Eq. 3 (Zhou et al. 2004).
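The correspondence between $\mu$ and $\alpha$ can be made explicit with a short derivation. This is a sketch of the standard argument from Zhou et al. (2004), under two assumptions that are not stated in the text above: each undirected edge is counted once in the smoothness term, and no node is isolated, so that the smoothness term equals $G^\top (I - S) G$.

```latex
% Assumption: each undirected edge is counted once in the smoothness term,
% which then equals G^T (I - S) G for the symmetrically normalized S.
\begin{align*}
Q(G) &= \tfrac{1}{2} \left( G^\top (I - S)\, G + \mu\, \| G - Y \|^2 \right) \\
\nabla Q(G^*) &= (I - S)\, G^* + \mu\, (G^* - Y) = 0 \\
(1 + \mu)\, G^* - S\, G^* &= \mu\, Y \\
G^* &= \frac{\mu}{1 + \mu} \left( I - \frac{1}{1 + \mu}\, S \right)^{-1} Y
\end{align*}
% Matching with Eq. 3: alpha = 1 / (1 + mu), hence 1 - alpha = mu / (1 + mu).
```

If the double sum instead runs over ordered pairs $(i, j)$, each edge is counted twice and the same steps give $\alpha = 2 / (2 + \mu)$. In either case a larger $\mu$ (a stronger fitting constraint) corresponds to a smaller $\alpha$.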

Source Prominence vs. Propagation Models

In this section we revisit the aforementioned two types of propagation models, i.e., infection models and influence models. Regardless of the propagation model, nodes close to source nodes have a higher probability of getting infected (activated) than nodes far away from source nodes. This can be explained by the fact that the infection initially starts from source nodes, and is further propagated to the rest of the network. Clearly, this phenomenon exists in the general propagation process. Therefore, the source prominence effect should hold in most existing propagation models, which is further verified in our later experiments. In addition, the same intuition has been adopted in several existing studies, such as (Prakash, Vreeken, and Faloutsos 2012), (Shah and Zaman 2011) and (Zang et al. 2015).

Two Versions of Our Method

Based on the above convergence analysis, in this section, we first present the convergent version of our method. Then, to balance accuracy and efficiency, we further give the iterative version of our method.

The Convergent Version

LPSI_con is the convergent version of the LPSI method. Here, “convergent” means that we take the convergence result of the label propagation (Lines 3 to 9 in Alg. 1), which can be obtained either by the iteration procedure or by Eq. 3.

Time Complexity. Algorithm 1 first builds two matrices (Lines 1 and 2), and the running time is $O(|E|)$, where $E$ is the edge set. Then it gets the convergence result of label propagation (Lines 3 to 9), and the running time is $O(N^3)$ (see footnote 1), where $N$ is the node number. Finally, it finds local maxima (Lines 11 to 15), and the running time is $O(L \cdot N)$, where $L$ is the average number of neighbors per node. Consequently, the overall complexity of LPSI_con is $O(N^3)$.

The Iterative Version

LPSI_iter is the iterative version of the LPSI method. Here, “iterative” means that the propagation result is obtained in a few iterations (Lines 3 to 9 in Alg. 1), i.e., the iteration terminates after a fixed number of steps regardless of whether convergence has been reached.

As shown above, getting the convergence result of label propagation has high complexity ($O(N^3)$), which may be prohibitive for some practical applications. However, to capture the intuition of source prominence, the convergence result may not be necessary. The question then becomes: how many iteration steps are needed to capture this intuition? In our later experiments, we show that a small iteration number (5 in our evaluation) is enough.

Time Complexity. The time complexity of the iterative version of label propagation (Lines 3 to 9 in Alg. 1) is $O(t \cdot L \cdot N)$, where $L$ is the average number of neighbors per node and $t$ is the number of iterations. In addition, the time complexity of the other steps in Alg. 1 is $O(|E| + L \cdot N)$. Consequently, the overall time complexity of LPSI_iter is $O(t \cdot L \cdot N)$. Note that $L \cdot N$ can be described by the edge number $|E|$. Therefore, LPSI_iter has a linear complexity with respect to the number of edges.
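As a rough worked example of the gap between the two versions, the sketch below plugs the Ego-Facebook sizes reported in Table 1 into the two complexity expressions. It counts abstract operations only and ignores constant factors; it uses the identity $L \cdot N = 2|E|$ for an undirected graph, and the function names are illustrative.

```python
def convergent_cost(num_nodes):
    """Order-of-magnitude cost of the dense inversion behind Eq. 3: N^3."""
    return num_nodes ** 3


def iterative_cost(num_edges, steps):
    """Order-of-magnitude cost of `steps` sweeps: t * L * N = t * 2|E|."""
    return steps * 2 * num_edges


# Ego-Facebook: 4,039 nodes and 88,234 edges; 5 iterations as in the evaluation.
dense = convergent_cost(4039)
sparse = iterative_cost(88234, steps=5)

if __name__ == "__main__":
    print(dense, sparse, dense // sparse)
```

On these sizes the two counts differ by more than four orders of magnitude, although real running times also depend on constant factors and on the linear algebra routines used.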

Experiments

Experimental Setup

Datasets. As stated in Table 1, we use the following three real-world datasets:

1. KARATE (Zachary 1977) is a social network of friendships between 34 members of a karate club at a US university in the 1970s.

2. Jazz (Gleiser and Danon 2003) is a network of Jazz bands performing from 1912 to 1940.

3. Ego-Facebook (Leskovec and Mcauley 2012) is a Facebook graph dataset obtained from survey participants.

Table 1: Datasets
Dataset         #Nodes    #Edges    #avg(degree)
KARATE          34        78        4.6
Jazz            198       2,742     27.7
Ego-Facebook    4,039     88,234    43.7

Footnote 1: Here we actually calculate the convergence by Eq. 3, and adopt the fact that the time complexity of matrix inversion is close to $O(N^3)$ (Zhu, Lafferty, and Rosenfeld 2005).

Propagation models. As mentioned above, existing propagation models can be categorized into infection models and influence models. To evaluate our method extensively, in each of these categories, we consider two representative propagation models. We test two different infection models: the SI model and the SIR model. As in (Zhu and Ying 2016; Luo 2015; Zhu and Ying 2014), the infection probability p is chosen uniformly from (0, 1) for the SI model, and an extra recovery probability q is chosen uniformly from (0, p) for the SIR model.

In addition, we evaluate two different influence models: the IC model and the LT model. In the IC model, the infection probability p is chosen uniformly from (0, 1). In the LT model, as in (Kempe, Kleinberg, and Tardos 2003), we treat the infection weights among nodes as follows. If nodes $u$, $v$ have degrees $d_u$ and $d_v$, then the infection weight of edge $(u, v)$ is $1/d_v$, and edge $(v, u)$ has weight $1/d_u$. Furthermore, the threshold of each node is uniformly chosen from a small interval [0, 0.5], so as to infect a large part of the network (see footnote 2).
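The LT setting described above can be sketched as follows. With edge weight $1/d_v$ on every edge into $v$, the incoming influence on $v$ is simply the fraction of its neighbours that are already active. The function name `simulate_lt`, the adjacency-list input and the synchronous update order are assumptions made for illustration.

```python
import random


def simulate_lt(adj, sources, max_threshold=0.5, seed=0):
    """Linear Threshold sketch with edge weight 1/d_v on each edge (u, v).

    Every node draws a threshold uniformly from [0, max_threshold] and
    becomes active once the active fraction of its neighbours exceeds it.
    Returns the set of active nodes when no further node can be activated.
    """
    rng = random.Random(seed)
    threshold = {v: rng.uniform(0.0, max_threshold) for v in adj}
    active = set(sources)
    while True:
        newly_active = {
            v for v in adj
            if v not in active and adj[v]
            and sum(u in active for u in adj[v]) / len(adj[v]) > threshold[v]
        }
        if not newly_active:
            return active
        active |= newly_active


if __name__ == "__main__":
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    print(sorted(simulate_lt(star, [1])))
```

Lowering `max_threshold` makes activation easier, which reflects the reason given for the interval [0, 0.5]: the aim is to let a small source set infect a large part of the network.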

Comparing Methods. We test the convergent version (LPSI_con) and the iterative version (LPSI_iter) of our LPSI method under both SI and SIR models. Under the SI model, we compare these two versions with NetSleuth (Prakash, Vreeken, and Faloutsos 2012). Under the SIR model, we compare these two versions with Zang’s method (Zang et al. 2015). In addition, a variant of Zang’s method (denoted as Zang_si; see footnote 3) is tested under the SI model. To date, these comparison methods are the latest and best-known solutions for the multiple source detection problem.

Since there are few (comparable) works under the IC and LT models, we only test LPSI_con and LPSI_iter here. An overview of this comparison can be found in Table 2.

In both LPSI_con and LPSI_iter, we set the parameter α = 0.5. In addition, to show the effectiveness of LPSI_iter, its iteration number is set to a small one (5 in this study). All parameters in other methods are adjusted to achieve the best performance.

Footnote 2: (Chen, Wang, and Yang 2009) and (Kempe, Kleinberg, and Tardos 2003) have shown that if the threshold is chosen from [0, 1], it is hard for a small set of sources to infect a large part of the network.

Footnote 3: The original Zang’s method is designed only for the SIR model, in which recovered nodes should be identified first. We can ignore this recovery step and use the remaining steps to detect sources under the SI model.

[Figure 2 plots not reproduced. Six panels plot F-score against the source number K: (a) KARATE, (b) Jazz and (c) Ego-Facebook under the SI model, comparing LPSI_con, LPSI_iter, NetSleuth and Zang_si; (d) KARATE, (e) Jazz and (f) Ego-Facebook under the SIR model, comparing LPSI_con, LPSI_iter and Zang. K = 2, 3, 5 for KARATE and Jazz; K = 3, 5, 10 for Ego-Facebook.]

Figure 2: Source detection accuracies under infection models, i.e., SI model (row 1) and SIR model (row 2).

[Figure 3 plots not reproduced. Six panels plot F-score against the source number K for LPSI_con and LPSI_iter: (a) KARATE, (b) Jazz and (c) Ego-Facebook under the IC model; (d) KARATE, (e) Jazz and (f) Ego-Facebook under the LT model. K = 2, 3, 5 for KARATE and Jazz; K = 3, 5, 10 for Ego-Facebook.]

Figure 3: Source detection accuracies under influence models, i.e., IC model (row 1) and LT model (row 2).

Table 2: Comparing Methods under Infection Models
Method       SI model    SIR model    IC model    LT model
LPSI_con     √           √            √           √
LPSI_iter    √           √            √           √
NetSleuth    √           -            -           -
Zang         -           √            -           -
Zang_si      √           -            -           -

All algorithms are implemented in Matlab. The programs run on a server with an Intel(R) Core(TM) i7-2600 3.40GHz CPU and 32 GB memory.

Experimental Settings. For an extensive comparison, we compare these methods on all datasets with different source numbers. For the small datasets KARATE and Jazz, we vary the source number K = 2, 3, 5. For the larger dataset Ego-Facebook, we vary the source number K = 3, 5, 10, where 10 is known to be the largest source number that has been evaluated (Prakash, Vreeken, and Faloutsos 2012; Zang et al. 2015).

All reported results are averaged over 500 independent runs. In each run, we first randomly generate a set of sources in the dataset. As in (Prakash, Vreeken, and Faloutsos 2012), we then simulate an infection until at least 30% (see footnote 4) of the network is infected, and give the resulting footprint as input. Finally, we use different methods to detect the source set so as to evaluate their performance.

Evaluation of Source Detection

We compare the source detection accuracy of different methods. The standard recall, precision and F-score metrics are used to validate this accuracy by comparing the detected source set with the actual source set. Figures 2 and 3 show the experimental results under infection models and influence models, respectively. Due to space constraints, we only show the F-score of these methods, and the complete listing of results is available on the author’s homepage.

Footnote 4: Since KARATE is a small dataset compared to the tested source numbers, we set the max infect rate to 50% for this dataset.
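For reference, the three metrics can be computed from the detected and actual source sets as follows. This is a minimal sketch using the standard set-based definitions; the function name `detection_scores` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def detection_scores(detected, actual):
    """Return (precision, recall, f_score) for a detected source set."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Two true sources (nodes 1 and 11) and one false alarm (node 5).
    print(detection_scores(detected={1, 5, 11}, actual={1, 11}))
```

Under these definitions only exact hits count: a detected node adjacent to a true source contributes nothing to precision or recall.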

Evaluation under Infection Models. Figure 2 shows the experimental results under the SI model and SIR model. The first observation is that even without knowing the underlying infection model, LPSI_con and LPSI_iter still significantly outperform NetSleuth and Zang’s methods. This superiority becomes more remarkable as the network size increases. On the small datasets (KARATE and Jazz), in terms of F-score, LPSI_con and LPSI_iter outperform the other two methods by 50% to 200% in relative terms. On the larger dataset (Ego-Facebook), the improvement is more pronounced (around 100% to 500% in relative terms). The reason may be that our LPSI method is designed based on the idea of source prominence, which holds under commonly used infection models. In contrast, the other two methods fail to identify the correct sources in most cases. As stated in (Prakash, Vreeken, and Faloutsos 2012), NetSleuth tends to return the most likely “sources” which could reproduce the given infected network, rather than the real sources. On the other hand, experimental results in (Zang et al. 2015) also show that Zang’s method can hardly locate the real sources, but could only locate “approximate sources” near the real sources even when the source number is given.

The second observation is that LPSI_con and LPSI_iter can handle the multiple source detection problem under both SI and SIR models. As shown in Fig. 2, when the epidemic model switches from the SI model to the SIR model, the F-score results suffer only a slight decline. This indicates that source prominence exists under both infection models, although it seems a little less significant under the SIR model.

[Figure 4: Scalability on synthetic data. x-axis: # of Network Nodes (×1000); y-axis: Elapsed Time (s); curves: LPSI_con, LPSI_iter, NetSleuth, Zang_si.]

The third observation is that even under a small iteration number setting (5 in our experiments), LPSI_iter is still competitive with LPSI_con. This means that the source prominence could be easily captured by LPSI_iter in a few iteration steps. It also indicates that this iterative version has good time performance, which is verified in the following experiments.
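The iterative version discussed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the update takes the local and global consistency form of (Zhou et al. 2004), with symmetric degree normalization, initial labels +1 for infected and −1 for uninfected nodes, and sources chosen as infected nodes whose final label strictly exceeds that of every neighbor. The function name `lpsi_iter` and the toy graph are hypothetical.

```python
import math


def lpsi_iter(adj, infected, alpha=0.5, steps=5):
    """Sketch of iterative label propagation for source detection.

    adj: dict mapping node -> set of neighbours (undirected graph).
    infected: set of infected nodes.
    Assumed update: g <- alpha * S g + (1 - alpha) * y,
    with S_vu = 1 / sqrt(deg(v) * deg(u)) for each edge (v, u).
    """
    deg = {v: len(nb) for v, nb in adj.items()}
    y = {v: 1.0 if v in infected else -1.0 for v in adj}
    g = dict(y)
    for _ in range(steps):
        new_g = {}
        for v in adj:
            spread = sum(g[u] / math.sqrt(deg[v] * deg[u]) for u in adj[v])
            new_g[v] = alpha * spread + (1 - alpha) * y[v]
        g = new_g
    # Local peaks: infected nodes whose label exceeds all neighbours' labels.
    sources = [v for v in adj
               if v in infected and all(g[v] > g[u] for u in adj[v])]
    return g, sources


# Toy graph: a hub (0) with four infected neighbours; node 5 is uninfected.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0, 5}, 5: {4}}
labels, sources = lpsi_iter(adj, infected={0, 1, 2, 3, 4}, alpha=0.5, steps=5)
print(sources)  # -> [0]
```

Each step visits every edge twice, so a fixed number of steps (5 here, matching the setting mentioned above) keeps the cost proportional to the number of edges.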

Evaluation under Influence Models

Figure 3 shows the performance of LPSI_con and LPSI_iter under the IC model and LT model. We can see that our proposed methods still reach similar performance under these two influence models, which indicates that the source prominence could also be easily captured by LPSI_iter under influence models.

The other observation is that the performance of LPSI_con and LPSI_iter does not vary much between infection models and influence models. For instance, on the KARATE and Ego-Facebook datasets, the F-score values of these two methods are very similar across both types of models. Although the performance of these two methods declines on the Jazz dataset under the LT model, we also find that the detection accuracy becomes similar to that under the IC model when the activation threshold in the LT model is uniformly chosen from other ranges such as [0, 0.2] or [0, 0.4]. These results indicate that the source prominence effect exists under general propagation models.
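The role of the activation threshold range can be illustrated with a small LT simulation. This is a sketch under stated assumptions: each neighbor contributes an equal weight of 1/deg(v), thresholds are drawn uniformly from `theta_range`, and a node needs at least one active neighbor before it can activate. The function name `simulate_lt` and its parameters are hypothetical and do not reproduce the paper's experimental setup.

```python
import random


def simulate_lt(adj, seeds, theta_range=(0.0, 1.0), seed=0):
    """Run a linear threshold cascade until no further node activates.

    adj: dict mapping node -> set of neighbours (undirected graph).
    Assumption: equal influence weights, i.e. a node activates once the
    fraction of its active neighbours reaches its threshold.
    """
    rng = random.Random(seed)
    lo, hi = theta_range
    theta = {v: rng.uniform(lo, hi) for v in adj}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v in active or not adj[v]:
                continue
            n_active = sum(1 for u in adj[v] if u in active)
            if n_active > 0 and n_active / len(adj[v]) >= theta[v]:
                active.add(v)  # activation is permanent (monotone process)
                changed = True
    return active
```

Narrow ranges such as (0, 0.2) or (0, 0.4) make activation easier, so the cascade spreads more widely than with thresholds drawn from a range that includes large values.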

Scalability

Scalability analysis is performed on synthetic scale-free networks (Barabasi and Albert 1999) under the SI model. The performance under other models is similar, so we omit it here. We vary the number of nodes in the network and measure the elapsed time. Figure 4 shows the results. Zang’s method is the most computationally intensive algorithm, as it involves leading-eigenvector-based community detection (Zang et al. 2015; Wang et al. 2015) and betweenness centrality calculation. As expected, the convergent version LPSI_con is also time-consuming. In contrast, the time costs of both LPSI_iter and NetSleuth increase similarly and slowly as the network size grows. This is because they both have linear complexity with respect to the number of edges of the network. In addition, LPSI_iter always maintains a significantly lower time cost than NetSleuth.
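The linear dependence on the number of edges can be checked on a small preferential-attachment generator. The sketch below is illustrative only: the parameters `n` and `m` and the function names are assumptions, and the generator settings used in the paper's experiments are not specified in this section.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Preferential attachment: each new node links to m distinct earlier nodes.

    Returns an adjacency dict node -> set of neighbours with m * (n - m) edges.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    targets = list(range(m))   # the first new node links to the m seed nodes
    repeated = []              # every node appears once per incident edge
    for v in range(m, n):
        for u in targets:
            adj[v].add(u)
            adj[u].add(v)
        repeated.extend(targets)
        repeated.extend([v] * m)
        chosen = set()
        while len(chosen) < m:  # sample proportionally to degree
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return adj


def sweep_cost(adj):
    """Neighbour reads in one label-propagation sweep, which equals 2|E|."""
    return sum(len(nb) for nb in adj.values())


for n in (1000, 2000, 4000):
    print(n, sweep_cost(barabasi_albert(n, m=3)))
```

With `m` fixed, the per-sweep cost grows proportionally to the number of nodes, which is consistent with the slow growth reported for the methods with linear complexity.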

Impact of Parameter α

The parameter α in our method (Eq. 1) is introduced to control the effects from neighbors during the label (infection status) propagation process. Therefore, we investigate the impact of α by analyzing how changes in its value affect the performance of LPSI_con in terms of F-score. (The performance of LPSI_iter is similar, so we omit it here due to space limitations.) Figure 5 shows the results on Ego-Facebook (#source=5) under all four mentioned propagation models.

[Figure 5: Parameter α in LPSI_con on Ego-Facebook with K=5. x-axis: α; y-axis: F-score; curves: SI Model, SIR Model, IC Model, LT Model.]

We observe that under every propagation model setting, the F-score values decrease when α approaches 0 or 1. On the other hand, the performance with α ∈ [0.2, 0.6] is always stable and preferred. These observations agree with the intuition that we should consider both the initial infection status and the effects from neighbors for source detection.
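The behavior at the two extremes can be read off the update rule. The derivation below is a sketch that assumes Eq. 1 has the standard label propagation form of (Zhou et al. 2004), with S the symmetrically normalized weight matrix and Y the initial infection vector; these symbols are assumptions here, since Eq. 1 is not restated in this section.

```latex
% Assumed form of Eq. 1 (not restated in this section):
%   S = D^{-1/2} W D^{-1/2},  Y_i = +1 if node i is infected, -1 otherwise.
\begin{align}
G^{t+1} &= \alpha S G^{t} + (1-\alpha)\, Y, \qquad G^{0} = Y, \\
\alpha = 0 &: \quad G^{t} = Y \ \text{for all } t, \\
\alpha = 1 &: \quad G^{t+1} = S\, G^{t} = S^{t+1} Y, \\
0 < \alpha < 1 &: \quad G^{*} = \lim_{t\to\infty} G^{t} = (1-\alpha)\,(I - \alpha S)^{-1}\, Y .
\end{align}
```

Under this assumption, at α = 0 every infected node keeps the same label, so no infected node stands out from its infected neighbors. At α = 1 the anchoring term (1 − α)Y vanishes and the labels are driven increasingly by graph structure rather than by the observed infection status. Intermediate values balance the two terms.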

Related Work

The information source detection problem has been extensively studied recently. In general, existing work can be divided into two categories. The first category focuses on the single source detection problem. (Shah and Zaman 2010; 2011) introduced and formalized the problem of identifying the single source of an epidemic under the SI model. (Zhu and Ying 2013) studied the single source detection problem under the SIR model. (Zhu and Ying 2014) further studied this problem with sparse observations, and (Shen et al. 2016) considered the infection time information.

The second category focuses on the multiple source detection problem. (Lappas et al. 2010) studied the problem of identifying K effectors under the IC model, in which the source number K should be specified manually. (Luo, Tay, and Leng 2013) considered the multiple source detection problem under the SI model, when the number of infection sources is bounded. The work most related to ours is (Prakash, Vreeken, and Faloutsos 2012) and (Zang et al. 2015), which could automatically identify the source number as well as the actual source nodes under the SI model and SIR model, respectively.

In contrast to these approaches, which assume that the underlying propagation model is fixed and given as input, this work considers the multiple source detection problem when the propagation model is unknown.

Conclusion

In this paper, we study the multiple source detection problem when the underlying propagation model is unknown. Based on the idea of source prominence, we introduce a multiple source detection method LPSI. In addition, both the convergent and iterative versions of LPSI are given. Extensive experimental results show that even without knowing the underlying propagation model, these two versions still attain high accuracy in detecting the source nodes. In particular, the iterative version of LPSI achieves high scalability as well as superior performance. These inspiring results indicate that this work expands the application range of multiple source detection methods.

Acknowledgements

Special thanks to Dr. B. Aditya Prakash, who provided the source code of NetSleuth. This work is supported in part by the National Natural Science Foundation of China (No. 61373023, No. 61170064).

References

Allen, L. J. 1994. Some discrete-time SI, SIR, and SIS epidemic models. Mathematical Biosciences 124(1):83–105.

Anderson, R. M.; May, R. M.; and Anderson, B. 1992. Infectious diseases of humans: dynamics and control, volume 28. Wiley Online Library.

Barabasi, A.-L., and Albert, R. 1999. Emergence of scaling in random networks. Science 286(5439):509–512.

Centola, D. 2010. The spread of behavior in an online social network experiment. Science 329(5996):1194–1197.

Chen, W.; Wang, Y.; and Yang, S. 2009. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 199–208. ACM.

Chen, Z.; Zhu, K.; and Ying, L. 2016. Detecting multiple information sources in networks under the SIR model. IEEE Transactions on Network Science and Engineering 3(1):17–31.

Easley, D., and Kleinberg, J. 2010. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press.

Gleiser, P. M., and Danon, L. 2003. Community structure in jazz. Advances in Complex Systems 6(04):565–573.

Goldenberg, J.; Libai, B.; and Muller, E. 2001. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters 12(3):211–223.

Kempe, D.; Kleinberg, J.; and Tardos, E. 2003. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 137–146. ACM.

Lappas, T.; Terzi, E.; Gunopulos, D.; and Mannila, H. 2010. Finding effectors in social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 1059–1068. ACM.

Leskovec, J., and Mcauley, J. J. 2012. Learning to discover social circles in ego networks. In Advances in neural information processing systems, 539–547.

Luo, W.; Tay, W. P.; and Leng, M. 2013. Identifying infection sources and regions in large networks. IEEE Transactions on Signal Processing 61(11):2850–2865.

Luo, W. 2015. Identifying infection sources in a network. Ph.D. Dissertation, Nanyang Technological University.

Prakash, B. A.; Vreeken, J.; and Faloutsos, C. 2012. Spotting culprits in epidemics: How many and which ones? In IEEE 12th International Conference on Data Mining (ICDM), 11–20. IEEE.

Shah, D., and Zaman, T. 2010. Detecting sources of computer viruses in networks: theory and experiment. In ACM SIGMETRICS Performance Evaluation Review, volume 38, 203–214. ACM.

Shah, D., and Zaman, T. 2011. Rumors in a network: Who’s the culprit? IEEE Transactions on Information Theory 57(8):5163–5181.

Shen, Z.; Cao, S.; Wang, W.-X.; Di, Z.; and Stanley, H. E. 2016. Locating the source of diffusion in complex networks by time-reversal backward spreading. Physical Review E 93(3):032301.

Wang, F., and Zhang, C. 2008. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20(1):55–67.

Wang, M.; Wang, C.; Yu, J. X.; and Zhang, J. 2015. Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. Proceedings of the VLDB Endowment 8(10):998–1009.

Wang, Z.; Wang, C.; Pei, J.; Ye, X.; and Yu, P. S. 2016. Causality based propagation history ranking in social networks. In IJCAI.

Zachary, W. W. 1977. An information flow model for conflict and fission in small groups. Journal of Anthropological Research 452–473.

Zang, W.; Zhang, P.; Zhou, C.; and Guo, L. 2015. Locating multiple sources in social networks under the SIR model: A divide-and-conquer approach. Journal of Computational Science.

Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Scholkopf, B. 2004. Learning with local and global consistency. Advances in neural information processing systems 16(16):321–328.

Zhu, K., and Ying, L. 2013. Information source detection in the SIR model: A sample path based approach. In Information Theory and Applications Workshop (ITA), 2013, 1–9. IEEE.

Zhu, K., and Ying, L. 2014. A robust information source estimator with sparse observations. Computational Social Networks 1(1):1–21.

Zhu, K., and Ying, L. 2016. Information source detection in the SIR model: a sample-path-based approach. IEEE/ACM Transactions on Networking 24(1):408–421.

Zhu, X.; Lafferty, J.; and Rosenfeld, R. 2005. Semi-supervised learning with graphs. Carnegie Mellon University, Language Technologies Institute, School of Computer Science.
