


Computer Networks 57 (2013) 1630–1643

Contents lists available at SciVerse ScienceDirect

Computer Networks

journal homepage: www.elsevier.com/locate/comnet

Outbound SPIT filter with optimal performance guarantees

1389-1286/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.comnet.2013.02.013

⇑ Corresponding author. Tel.: +32 4 366 5605. E-mail addresses: [email protected] (T. Jung), [email protected] (S. Martin), [email protected] (M. Nassar), [email protected] (D. Ernst), [email protected] (G. Leduc).

¹ Note that this article is an expanded journal version of the earlier conference paper [5], which appeared in the proceedings of the International Conference on Autonomous Infrastructure, Management and Security (AIMS’12).

Tobias Jung a,⇑, Sylvain Martin a, Mohamed Nassar b, Damien Ernst a, Guy Leduc a

a Montefiore Institute, Department of Electrical Engineering and Computer Science, University of Liège, Belgium

b INRIA Grand Est – LORIA Research Center, France

Article info

Article history:
Received 18 September 2012
Received in revised form 4 February 2013
Accepted 18 February 2013
Available online 28 February 2013

Keywords:
Security
Internet telephony
SPAM
Sequential probability ratio test

Abstract

This paper¹ presents a formal framework for identifying and filtering SPIT calls (SPam in Internet Telephony) in an outbound scenario with provable optimal performance. In so doing, our work is largely different from related previous work: our goal is to rigorously formalize the problem in terms of mathematical decision theory, find the optimal solution to the problem, and derive concrete bounds for its expected loss (number of mistakes the SPIT filter will make in the worst case). This goal is achieved by considering an abstracted scenario amenable to theoretical analysis, namely SPIT detection in an outbound scenario with pure sources. Our methodology is to first define the cost of making an error (false positive and false negative), apply Wald’s sequential probability ratio test to the individual sources, and then determine analytically error probabilities such that the resulting expected loss is minimized. The benefits of our approach are: (1) the method is optimal (in a sense defined in the paper); (2) the method does not rely on manual tuning and tweaking of parameters but is completely self-contained and mathematically justified; (3) the method is computationally simple and scalable. These are desirable features that would make our method a component of choice in larger, autonomic frameworks.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Over the last few years, Voice over IP (VoIP) has gained momentum as the natural complement to email, although its adoption is still young. The technologies employed in VoIP are broadly similar to those used for email, and a large portion is actually identical. As a result, one can easily produce hundreds of concurrent calls per second from a single machine, replaying a pre-encoded message as soon as the callee accepts the call. This application of SPAM over Internet Telephony, also known as SPIT, is considered by many VoIP experts to be a severe potential threat to the usability of the technology [15]. More concerning, many of the defensive measures that are effective against email SPAM do not translate directly to SPIT mitigation: unlike with email SPAM, where the content of a message is text and is available for analysis before the decision is made whether to deliver it or flag it as SPAM, the content of a phone call is a voice stream and is only available once the call has actually been answered.

The simplest guard against SPIT would be to enforce strongly authenticated identities (maintaining caller identities on a secure, central server) together with personalized white lists (allowing only friends to call) and a consent framework (having unknown users first ask for permission to be added to the list). However, this is not supported by current communication protocols and also seems infeasible in practice. Thus a number of different approaches have been suggested to address SPIT prevention, most of which derive from experience with e-mail or web SPAM defense. They range from reputation-based [6,1] and call-frequency-based [14] dynamic black-listing and fingerprinting [22], to challenging suspicious calls with CAPTCHAs [13,16,11], to the use of more sophisticated machine learning. For example, [9,7] suggest SVMs for anomaly detection in call histories, [8] decision trees for classification, and [21] semi-supervised learning, a variant of k-means clustering with features optimized from partially labeled data, to cluster and discriminate SPIT from regular calls.

These methods provide interesting building blocks but, in our opinion, suffer from two main shortcomings. First, they do not provide performance guarantees, in the sense that it is difficult to estimate the number of SPIT calls that will get through and the number of regular calls that will be erroneously stopped. Second, they require a lot of hand-tuning to work well, which cannot be sustained in future networks.

The initial motivation for this paper was to explore whether SPIT filters could be designed that do not suffer from these two shortcomings. To this end, we start by considering an abstracted scenario amenable to theoretical analysis, in which we make essentially two simplifying assumptions:

1. we are dealing with pure source SPIT detection in an outbound scenario,

2. we can extract features from calls (such as, for example, call duration) whose distribution for SPIT and regular calls is known in advance.

Here, “outbound scenario” means that our SPIT detector will be located in, or at the edge of, the network where the source resides, and will check all outgoing calls originating from within the network. Technically, this means that we are able to easily map calls to sources and that we can observe multiple calls from each source. By “pure source” we mean that a source produces either only SPIT or only NON-SPIT calls over a certain observation horizon. By “known in advance” we mean that the filter requires knowledge about the world in the form of a generative model for the features of both SPIT and NON-SPIT calls. As in practice the “true” generative model will, of course, never be available and needs to be estimated from data, this requirement means that we need to have labeled² instances of both SPIT and NON-SPIT calls. Assuming that these requirements can all be fulfilled, we have been able to design a SPIT filter which requires no tuning and no user feedback and which is optimal in a sense that will be defined later in this paper.

This paper reports on this SPIT filter and is organized in two parts: a theoretical one in Section 3 and a practical one in Section 4. The theoretical part starts with Section 3.1, which describes precisely and in mathematical terms the context in which we will design the SPIT filter. Section 3.2 shows how a SPIT filter with the desired properties can then be derived from a simple statistical test, and Section 3.3 provides analytical expressions to compute its performance. Monte Carlo simulations in Sections 3.4 and 3.5 then examine the theoretical performance of the SPIT filter. The practical part starts with Section 4.1, which describes how the SPIT filter could be integrated as one module into a larger hierarchical SPIT prevention framework. The primary purpose of this section is to demonstrate that the assumptions we make in Section 3 are well justified and can be easily dealt with in a real-world application. (Note that a detailed description of the system architecture is not the goal of this paper.) For example, Section 4.2 describes how the assumption that the distributions for SPIT and NON-SPIT must be known in advance can be dealt with using maximum likelihood estimation from labeled prior data (which in addition allows us to elegantly address the problem of nonstationary attackers). Then, using learned distributions, Section 4.3 demonstrates for data extracted from a large database of real-world voice calls that the performance of our SPIT filter remains in accordance with the theoretical performance bounds derived analytically in Section 3 and degrades gracefully as the learned distribution departs from the model.

² As it turns out and will be explained later, it is at the time of this writing unfortunately quite difficult to acquire actual samples of SPIT calls.

2. Related work

To systematically place our work in the context of related prior work, we have to consider it along two axes. The first axis deals with (low-level) detection algorithms: here the questions are what abstract object we want to work with (e.g., SIP headers, stream data, call histories), how to represent this object such that computational algorithms can be applied (e.g., what features), and what precise algorithm is applied to arrive at a decision, which can be a binary classification (NON-SPIT/SPIT), a score (interpreted as how likely it is to be NON-SPIT/SPIT), or something else. The second axis deals with larger SPIT detection frameworks in which the (low-level) detection algorithm is only a small piece. The framework manages and controls the complete flow and encompasses detection, countermeasures, and self-healing. The formal framework for SPIT filtering we propose in this paper clearly belongs to the first category and only addresses low-level detection.

The Progressive Multi Gray-leveling (PMGL) proposed in [14] is a low-level detector that monitors call patterns and assigns a gray level to each caller based on the number of calls placed over a short and a long term. If the gray level is higher than a certain threshold, the call is deemed SPIT and the caller is blocked. The PMGL is similar to what we are doing in that it attempts to identify sources that are compromised by botnets in an outbound scenario. The major weaknesses of the PMGL are: (1) that it relies on a weak feature, as spitters can easily adapt and reduce their calling rate to remain below the detection threshold, and high calling rates can also have other valid causes such as a natural disaster; (2) that it relies on “carefully” hand-tuned detection thresholds to work, which makes good performance in the real world questionable and, in our opinion, is not a very desirable property as it does not come with any worst-case bounds or performance guarantees. Our approach is exactly the opposite, as it starts from a mathematically justified scenario and explicitly provides performance guarantees and worst-case bounds. Our approach is also more generic because it can work with any feature representation: while we suggest that call duration is a better choice than call rate, our framework will work with whatever feature representation a network operator might think is a good choice (given their data).

³ “Block” may seem like a rather severe option, which is justified here because of our focus on abstract modeling. In practice, as we will describe in Section 4, the SPIT filter would ideally run in parallel with other detection algorithms and thus only recommend a course of action.

⁴ It should be noted that, alternatively, one could imagine modeling this scenario as an optimal stopping problem with just two actions: (1) accept next call; (2) block all future calls. Doing this, however, would require a different mathematical approach, namely dynamic programming over belief states. While technically there is no problem in doing it, it is not the scenario we consider in this paper.

In [21] a low-level detector based on semi-supervised clustering is proposed. The authors use a large number of call features, and because most of these features become available only after a call is accepted, their detector is also primarily meant to classify pure sources (as ours is). The algorithm they propose is more complex and computationally more demanding than ours. In addition, their algorithm also relies on hand-tuned parameters and is hard to study analytically; thus again it is impossible to obtain performance guarantees or derive worst-case bounds for it. Performance-wise, it is hard to compare the results precisely because of the different experimental setups, but our algorithm compares favorably and achieves a very high accuracy.

The authors of [9,7] propose to use support vector machines for identifying a varied set of VoIP misbehaviors, including SPIT. Their approach works on a different representation of the problem (call histories) with a different goal in mind and thus is not directly comparable with ours. While it also cannot offer performance guarantees and worst-case bounds, in some respects it is more general than what we propose; it also describes both detection and remediation mechanisms in a larger framework.

In SEAL [13], the authors propose a complete framework for SPIT prevention which is organized in two stages (passive and active). The passive stage performs low-level detection and consists of simple, unintrusive and computationally cheap tests, which, however, will only be successful in some cases and can be easily fooled otherwise. The purpose of the passive stage is to screen incoming calls and flag those that could be SPIT. The active stage then performs the more complex, intrusive and computationally expensive tests, which can identify SPIT with very high probability (these tests interact more actively with the caller). SEAL is very similar to what we sketch in Section 4 of this paper. On the other hand, the low-level detection performed in SEAL is rather basic: while it is more widely applicable than ours, it essentially consists only of comparing a weighted sum of features against a threshold. As with all the other related work, weights and thresholds again need to be carefully determined by hand; and again, since the problem is not modeled mathematically, performance guarantees and worst-case bounds are impossible to derive.

Finally, it should be noted that, while the problem of SPIT detection can, in some sense, be related to the problem of anomaly detection and prevention of DoS attacks in VoIP networks (see, for example, the work in [12,23,4,24]), it is not the same. The reason is that these security threats are typically specific attacks aimed at disrupting the normal operation of the network. SPIT, on the other hand, operates on the social level and may consist of unwanted advertisements or phishing attempts, but not per se of malicious code. As a consequence, techniques from anomaly detection and prevention of DoS attacks cannot be directly applied to SPIT detection.

3. A SPIT filter with theoretically optimal performance

This section describes a formal framework for an outbound SPIT filter for which it is possible to prove optimality and provide performance guarantees. Note that this section is written from a theoretical point of view. In Section 4 we outline how one could implement the filter in a real-world scenario.

3.1. Problem statement

As shown in Fig. 1, we assume the following situation: the SPIT filter receives and monitors incoming calls from a number of different sources. The sources are independent of each other and will each place numerous calls over time. We assume that, over a given observation horizon, each source will either only produce regular calls or only produce SPIT calls. As the sources are independent, we can run a separate instance of the SPIT filter for each, and thus in the following we will only deal with the case of a single source.

Every time a call arrives at the filter, the filter can essentially do one of two things: accept the call and pass it on to the recipient, or block³ the call. Each call is associated with certain features, and we assume that we can observe these features only if the call is accepted (which would be the case, for example, with call duration as a feature). The fundamental assumption we make is that the distribution over the features will be different depending on whether a call is a SPIT call or a regular call, and, in this Section 3, that these distributions are known in advance. To quantify the performance of the filter, we consider three types of costs: (1) the cost of erroneously accepting a SPIT call; (2) the cost of erroneously blocking a NON-SPIT call; and (3) the basic cost of running the filter and extracting the features, regardless of whether the call is SPIT or NON-SPIT.

Our goal is to decide, after observing a few calls, whether or not the source sends out SPIT. More precisely, we look for a decision policy that initially accepts all calls, thus refining the belief about whether or not the source is SPIT, and then at some point decides to either block or accept all future calls from the source. Seen as a sequential decision problem, the SPIT filter has three⁴ possible actions: (1) accept the next call, which reveals its features and thus refines the belief about the type of the source, (2) block all future calls, and (3) accept all future calls.

In doing so, we arrive at a well-defined concept of loss, which we define as the total cost accumulated over the observation horizon. Within this framework, every conceivable SPIT filter algorithm will then have a performance number: its expected loss. The particular SPIT filter that we are going to describe below will be one that minimizes this expected loss.

Fig. 1. Sketch of the simplified problem. The SPIT filter operates in an outbound fashion and checks all outgoing calls originating from within the network. The filter treats the source of a call as being pure over a certain observation horizon. In this scenario, each source (which will correspond to a registered user) will try to sequentially place calls over time which are either all SPIT (if the source is compromised by a SPIT bot) or all NON-SPIT (if the source is a regular user). The SPIT filter then has to decide for each source individually whether, given the calls that originated from that source in the past, it is a regular user or a SPIT bot.

Note that minimizing the expected loss is an example of the typical exploration versus exploitation dilemma. On the one hand, since the decision to accept or block all future calls is terminal, we want to be very certain about its correctness to avoid making an expensive error. On the other hand, as long as we are observing we are automatically accepting all calls and thus will increase our loss, both because of the basic cost of running the filter and because of the potential cost of having accepted a SPIT call. To minimize our expected loss, we therefore also want to observe as few samples as possible.

To address the problem mathematically, we employ Wald’s sequential probability ratio test for simple hypotheses introduced in [18]. The sequential probability ratio test (SPRT) has the remarkable property that, among all sequential test procedures, it minimizes the expected number of samples for a given level of certainty, regardless of which hypothesis is true (the optimality of the SPRT was proved in [19]). In addition, the SPRT comes with bounds for the expected stopping time and thus allows us to derive concrete expressions for the expected loss as a function of the characteristics of the particular problem (meaning we can express the loss as a function of the parameters of the distributions for SPIT and NON-SPIT). Finally, the SPRT requires only simple algebraic operations to carry out and thus is easy to implement and computationally cheap to run.

3.2. SPIT detection via the SPRT

The SPRT is a test of one simple statistical hypothesis against another which operates in an online fashion and processes the data sequentially. At every time step a new observation is processed and one of the following three decisions is made: (1) the hypothesis being tested is accepted, (2) the hypothesis being tested is rejected and the alternative hypothesis is accepted, (3) the test statistic is not yet conclusive and further observations are necessary. If the first or the second decision is made, the test procedure is terminated. Otherwise the process continues until at some later time the first or second decision is made.

Two kinds of misclassification errors may arise: deciding to accept all future calls when the source is SPIT, or deciding to block all future calls when the source is NON-SPIT. Different costs may be assigned to each kind; the performance optimization process described in Section 3.3 builds on these costs.

To model the SPIT detection problem with the SPRT, we now proceed as follows. Assume we can make sequential observations from one source of a priori unknown type, SPIT or NON-SPIT. Let $x_t$ denote the features of the $t$-th call we observe, modeled by the random variable $X_t$. The $X_t$ are i.i.d. with common distribution (or density) $p_X$. The calls all originate from one source, which can either be of type SPIT with distribution $p_{\text{SPIT}}(x) = p(x \mid \text{source} = \text{SPIT})$ or of type NON-SPIT with distribution $p_{\text{NON-SPIT}}(x) = p(x \mid \text{source} = \text{NON-SPIT})$. Initially, the type of the source we are receiving calls from is not known; we write $p(\text{SPIT})$ for the prior probability of a source being SPIT and $p(\text{NON-SPIT})$ for the prior probability of it being NON-SPIT (note that $p(\text{NON-SPIT}) = 1 - p(\text{SPIT})$).

In order to learn the type of the source, we observe calls $x_1, x_2, \ldots$ and test the hypothesis

\[
H_0 : p_X = p_{\text{SPIT}} \quad \text{versus} \quad H_1 : p_X = p_{\text{NON-SPIT}}. \tag{1}
\]

(Note again that in this formulation we assume that the densities $p_{\text{SPIT}}$ and $p_{\text{NON-SPIT}}$ are both known, so that we can readily evaluate the expressions $p(x \mid \text{source} = \text{SPIT})$ and $p(x \mid \text{source} = \text{NON-SPIT})$ for any given $x$.)

At time $t$ we observe $x_t$. Let

\[
\lambda_t := \frac{p(x_1, \ldots, x_t \mid \text{NON-SPIT})}{p(x_1, \ldots, x_t \mid \text{SPIT})} = \prod_{i=1}^{t} \frac{p(x_i \mid \text{NON-SPIT})}{p(x_i \mid \text{SPIT})} \tag{2}
\]

be the ratio of the likelihoods of each hypothesis after $t$ observations $x_1, \ldots, x_t$. Since the $X_i$ are independent, we can factor the joint distribution on the left side to obtain the right side. In practice, it will be more convenient for numerical reasons to work with the log-likelihoods. Doing this allows us to write a particularly simple recursive update for the log-likelihood ratio $\Lambda_t := \log \lambda_t$, that is

\[
\Lambda_t := \Lambda_{t-1} + \log \frac{p(x_t \mid \text{NON-SPIT})}{p(x_t \mid \text{SPIT})}. \tag{3}
\]
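The numerical reason for preferring the log form can be illustrated with a small sketch. The per-call likelihood ratios below are hypothetical values chosen only to provoke the effect: their running product underflows in double precision, whereas the running sum of logarithms stays well within range.

```python
import math

# Hypothetical per-call likelihood ratios (not taken from the paper):
# 200 calls that each favour SPIT by a factor of 1000.
ratios = [1e-3] * 200

# Direct product, as in the definition of the likelihood ratio.
product = 1.0
for r in ratios:
    product *= r  # eventually underflows to 0.0 in double precision

# Recursive log-domain update: add the log of each per-call ratio.
log_ratio = 0.0
for r in ratios:
    log_ratio += math.log(r)

print(product, log_ratio)
```

Once the product has collapsed to zero it can no longer be compared meaningfully against a threshold, while the log-domain value can still be compared against the logarithm of that threshold.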

After each update we examine which of the following three cases applies and act accordingly:

\[
A < \lambda_t < B \;\Rightarrow\; \text{continue monitoring}, \tag{4}
\]

\[
\lambda_t \ge B \;\Rightarrow\; \text{accept } H_1 \text{ (decide NON-SPIT)}, \tag{5}
\]

\[
\lambda_t \le A \;\Rightarrow\; \text{accept } H_0 \text{ (decide SPIT)}. \tag{6}
\]

Thresholds $A$ and $B$ with $0 < A < 1 < B < \infty$ depend on the desired accuracy or error probabilities of the test:

\[
\alpha := P\{\text{accept } H_1 \mid H_0 \text{ true}\} = P\{\text{decide NON-SPIT} \mid \text{source} = \text{SPIT}\}, \tag{7}
\]

\[
\beta := P\{\text{reject } H_1 \mid H_1 \text{ true}\} = P\{\text{decide SPIT} \mid \text{source} = \text{NON-SPIT}\}. \tag{8}
\]

Note that $\alpha$ and $\beta$ need to be specified in advance such that certain accuracy requirements are met (see the next section, where we consider the expected loss of the procedure). The threshold values $A$ and $B$ and the error probabilities $\alpha$ and $\beta$ are connected in the following way (cf. [18], Eqs. (3.18) and (3.19)):

\[
\beta \le A(1 - \alpha) \quad \text{and} \quad \alpha \le (1 - \beta)/B. \tag{9}
\]

Note that the inequalities arise because of the discrete nature of making observations (i.e., at $t = 1, 2, \ldots$), which results in $\lambda_t$ not being able to hit the boundaries exactly. In practice we will neglect this excess and treat the inequalities as equalities (cf. [18], p. 131ff):

\[
A = \beta/(1 - \alpha) \quad \text{and} \quad B = (1 - \beta)/\alpha. \tag{10}
\]
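The test procedure of Eqs. (3)–(6) with the thresholds of Eq. (10) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are ours, and the exponential call-duration densities with mean durations of 10 s and 120 s are assumptions made purely for the example.

```python
import math
from typing import Callable, Iterable, Tuple


def sprt_thresholds(alpha: float, beta: float) -> Tuple[float, float]:
    """Return (log A, log B) with A and B taken from Eq. (10)."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def sprt_filter(
    calls: Iterable[float],
    log_p_spit: Callable[[float], float],
    log_p_nonspit: Callable[[float], float],
    alpha: float,
    beta: float,
) -> Tuple[str, int]:
    """Run the SPRT on a stream of call features from one source.

    Returns (decision, number of calls observed); the decision is
    "UNDECIDED" if the stream ends before a threshold is crossed.
    """
    log_a, log_b = sprt_thresholds(alpha, beta)
    llr = 0.0  # log-likelihood ratio before any observation
    t = 0
    for t, x in enumerate(calls, start=1):
        llr += log_p_nonspit(x) - log_p_spit(x)  # recursive update, Eq. (3)
        if llr >= log_b:
            return "NON-SPIT", t  # Eq. (5)
        if llr <= log_a:
            return "SPIT", t  # Eq. (6)
    return "UNDECIDED", t


def exp_logpdf(rate: float) -> Callable[[float], float]:
    """Log-density of an exponential distribution (illustrative assumption)."""
    return lambda x: math.log(rate) - rate * x


# Hypothetical models: SPIT calls last 10 s on average, regular calls 120 s.
log_p_spit = exp_logpdf(1.0 / 10.0)
log_p_nonspit = exp_logpdf(1.0 / 120.0)
print(sprt_filter([5.0, 8.0, 3.0, 12.0, 6.0], log_p_spit, log_p_nonspit, 0.01, 0.01))
```

Comparing the log-likelihood ratio against log A and log B is equivalent to the comparisons in Eqs. (4)–(6), since the logarithm is monotone.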

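Let $T$ be the random time at which the sequence of the $\lambda_t$ leaves the open interval $(A, B)$ and a decision is made that terminates the sampling process. (Note that the stopping time $T$ is a random quantity due to the randomness of the sample generation.) The SPRT provides the following pair of inequalities for the expected stopping time in both cases (cf. [18], Eqs. (4.80) and (4.81))

\[
E_{X_i \sim p_{\text{SPIT}}}[T] \ge \frac{1}{\kappa_0} \left[ \alpha \log \frac{1 - \beta}{\alpha} + (1 - \alpha) \log \frac{\beta}{1 - \alpha} \right], \tag{11}
\]

\[
E_{X_i \sim p_{\text{NON-SPIT}}}[T] \ge \frac{1}{\kappa_1} \left[ \beta \log \frac{\beta}{1 - \alpha} + (1 - \beta) \log \frac{1 - \beta}{\alpha} \right] \tag{12}
\]

(which we can treat as equalities when we use Eq. (10)). The constants $\kappa_i$ with $\kappa_0 < 0 < \kappa_1$ are the Kullback–Leibler information numbers defined by

\[
\kappa_0 = E_{x \sim p_{\text{SPIT}}} \left[ \log \frac{p(x \mid \text{NON-SPIT})}{p(x \mid \text{SPIT})} \right], \tag{13}
\]

\[
\kappa_1 = E_{x \sim p_{\text{NON-SPIT}}} \left[ \log \frac{p(x \mid \text{NON-SPIT})}{p(x \mid \text{SPIT})} \right]. \tag{14}
\]

The constants $\kappa_i$ can be interpreted as a measure of how difficult it is to distinguish between $p_{\text{SPIT}}$ and $p_{\text{NON-SPIT}}$. The smaller they are in absolute value, the more difficult the problem.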
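To give a feeling for the magnitudes involved, the following sketch evaluates Eqs. (11)–(14) for one concrete case. It assumes, purely for illustration, that the feature is call duration and that both densities are exponential; the closed-form expressions for the two constants hold only under that assumption, and the function names and mean durations (10 s and 120 s) are hypothetical.

```python
import math


def kl_numbers_exponential(rate_spit: float, rate_nonspit: float):
    """Constants of Eqs. (13)-(14) for two exponential densities.

    Assumption for illustration only: x | SPIT ~ Exp(rate_spit) and
    x | NON-SPIT ~ Exp(rate_nonspit), with different rates.
    """
    r = rate_nonspit / rate_spit
    kappa0 = math.log(r) + 1.0 - r        # expectation under p_SPIT, negative
    kappa1 = math.log(r) + 1.0 / r - 1.0  # expectation under p_NON-SPIT, positive
    return kappa0, kappa1


def expected_stopping_times(alpha, beta, kappa0, kappa1):
    """Right-hand sides of Eqs. (11) and (12), treated as equalities."""
    log_a = math.log(beta / (1.0 - alpha))  # log A from Eq. (10)
    log_b = math.log((1.0 - beta) / alpha)  # log B from Eq. (10)
    t_spit = (alpha * log_b + (1.0 - alpha) * log_a) / kappa0
    t_nonspit = (beta * log_a + (1.0 - beta) * log_b) / kappa1
    return t_spit, t_nonspit


# Hypothetical mean durations: 10 s for SPIT calls, 120 s for regular calls.
kappa0, kappa1 = kl_numbers_exponential(1.0 / 10.0, 1.0 / 120.0)
t_spit, t_nonspit = expected_stopping_times(0.01, 0.01, kappa0, kappa1)
print(kappa0, kappa1, t_spit, t_nonspit)
```

If the two rates coincide, both constants are zero and the expressions are undefined, which matches the interpretation that the two hypotheses then cannot be distinguished.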

3.3. Theoretical performance of the SPIT filter

We will now look at the performance of our SPIT filter and derive expressions for its expected loss. Let us assume we are going to receive a finite number $N$ of calls and that $N$ is sufficiently large such that the test will always stop before the calls are exhausted. We consider the following scalar costs per call:

$c_0$: basic cost of running the filter

$c_1$: cost of erroneously accepting SPIT

$c_2$: cost of erroneously blocking NON-SPIT

Let us recall the decision-making policy the filter implements: at the beginning, all calls are accepted (observing samples $x_i$) until the test becomes sufficiently certain about its prediction. Once the test has become sufficiently certain, the filter executes the following rule: if the test returns that the source is SPIT, then all future calls from it will be blocked. If the test says that the source is NON-SPIT, then all future calls from it will be accepted. Let $L$ denote the loss incurred by this policy (note that $L$ is a random quantity). To compute the expected loss $E[L]$, we have to divide $N$ into two parts: the first part, from $1$ to $T$, corresponds to the running time of the test, where all calls are automatically accepted ($T < N$ being the random stopping time with expectation given in Eqs. (11) and (12)); the second part, from $T + 1$ to $N$, corresponds to the time after the test.

If $H_0$ is true, that is, the source is SPIT, the loss $L$ will be the random quantity

\[
L \mid \text{source} = \text{SPIT} = (c_0 + c_1) T + \alpha c_1 (N - T), \tag{15}
\]

where $(c_0 + c_1) T$ is the cost of the test, $\alpha$ the probability of making the wrong decision, and $c_1 (N - T)$ the cost of making the wrong decision for the remaining calls. Taking expectations gives

\[
E_{X_i \sim p_{\text{SPIT}}}[L] = \alpha c_1 N + [c_0 + c_1 (1 - \alpha)] \, E_{X_i \sim p_{\text{SPIT}}}[T]. \tag{16}
\]

Likewise, if $H_1$ is true, that is, the source is NON-SPIT, our loss will be the random quantity

\[
L \mid \text{source} = \text{NON-SPIT} = c_0 T + \beta c_2 (N - T), \tag{17}
\]

where $c_0 T$ is the cost of the test, $\beta$ the probability of making the wrong decision, and $c_2 (N - T)$ the cost of making the wrong decision for the remaining calls. Taking expectations gives

\[
E_{X_i \sim p_{\text{NON-SPIT}}}[L] = \beta c_2 N + [c_0 - \beta c_2] \, E_{X_i \sim p_{\text{NON-SPIT}}}[T]. \tag{18}
\]

The total expected loss takes into consideration both cases and is simply

$$E[L] = p(\mathrm{SPIT}) \cdot E_{X_i \sim p_{\mathrm{SPIT}}}[L] + p(\text{NON-SPIT}) \cdot E_{X_i \sim p_{\mathrm{NON\text{-}SPIT}}}[L], \tag{19}$$

where $p(\mathrm{SPIT})$ and $p(\text{NON-SPIT})$ are the prior probabilities for a source sending out SPIT or regular calls, respectively.
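The step from the random losses in Eqs. (15) and (17) to the expectations in Eqs. (16) and (18) is only linearity of expectation, treating $\alpha$, $\beta$, the costs and $N$ as constants. A short worked version, where $E[T]$ abbreviates the expected stopping time under the respective source distribution:

```latex
\begin{align*}
E_{X_i \sim p_{\mathrm{SPIT}}}[L]
  &= (c_0 + c_1)\,E[T] + \alpha c_1 \bigl(N - E[T]\bigr) \\
  &= \alpha c_1 N + \bigl[c_0 + c_1(1 - \alpha)\bigr]\,E[T], \\[4pt]
E_{X_i \sim p_{\mathrm{NON\text{-}SPIT}}}[L]
  &= c_0\,E[T] + \beta c_2 \bigl(N - E[T]\bigr) \\
  &= \beta c_2 N + \bigl[c_0 - \beta c_2\bigr]\,E[T].
\end{align*}
```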

Fig. 2. A dependency graph for the SPIT filter. The sketch shows how the various objects are related and which quantities are required to perform which computational step. Note that the computational steps inside the gray box are fully automated.

To summarize this part, let us consider the situation shown in Fig. 2 where we want to apply the filter in practice. Recall that in order to run the filter (see Eqs. (3)–(6)), we need four objects: $p_{\mathrm{SPIT}}$, $p_{\mathrm{NON\text{-}SPIT}}$, $\alpha$, and $\beta$. The first two are obvious: any specific problem is fully characterized by the joint distribution of features and class, which we can factor into $p_{\mathrm{SPIT}}$, $p_{\mathrm{NON\text{-}SPIT}}$, and the priors $p(\mathrm{SPIT})$ and $p(\text{NON-SPIT})$. These distributions can be estimated from data as is described in Section 4. Knowing these distributions, we can compute $\kappa_0, \kappa_1$. We assume that the costs $c_0, c_1, c_2$ and the number of calls $N$ are defined externally. Looking at Eq. (19), we see that, given all this information, the expected loss will be a function of $\alpha, \beta$. To make this visually more clear, we will write the expected loss as a function of two variables, $L(\alpha, \beta)$. One natural way of choosing $\alpha, \beta$ for the SPIT filter now is to look for the setting $\alpha^*, \beta^*$ that minimizes the expected loss $L$. Doing this results in the generally nonconvex optimization problem

$$\min_{\alpha, \beta}\; L(\alpha, \beta) \quad \text{s.t.} \quad \alpha \ge 0,\ \beta \ge 0, \tag{20}$$

which has to be solved by iterative techniques. In practice, to ensure reasonable results one will also bound the variables using box constraints, such as $\alpha, \beta \in [10^{-6}, 10^{-1}]$.
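As an illustration of Eq. (20), the following minimal sketch evaluates the expected loss of Eq. (19) and minimizes it by an exhaustive search over a log-spaced grid inside the box $[10^{-6}, 10^{-1}]^2$. Two points are assumptions and not taken from this section: the expected stopping times use Wald's standard approximation, which we assume is what Eqs. (11), (12) state (it does reproduce the entries of Table 1), and the grid search merely stands in for a proper iterative solver. All function names are hypothetical.

```python
import math


def wald_stopping_times(alpha, beta, kappa0, kappa1):
    """Approximate E[T] for a SPIT and a NON-SPIT source (Wald's approximation,
    assumed here to be the content of Eqs. (11), (12))."""
    lower = math.log(beta / (1.0 - alpha))   # level at which SPIT is decided
    upper = math.log((1.0 - beta) / alpha)   # level at which NON-SPIT is decided
    t_spit = ((1.0 - alpha) * lower + alpha * upper) / kappa0
    t_non = (beta * lower + (1.0 - beta) * upper) / kappa1
    return t_spit, t_non


def expected_loss(alpha, beta, kappa0, kappa1, c0, c1, c2, n_calls, p_spit):
    """Expected loss of Eq. (19), built from Eqs. (16) and (18)."""
    t_spit, t_non = wald_stopping_times(alpha, beta, kappa0, kappa1)
    loss_spit = alpha * c1 * n_calls + (c0 + c1 * (1.0 - alpha)) * t_spit
    loss_non = beta * c2 * n_calls + (c0 - beta * c2) * t_non
    return p_spit * loss_spit + (1.0 - p_spit) * loss_non


def grid_search(kappa0, kappa1, c0, c1, c2, n_calls, p_spit,
                lo=1e-6, hi=1e-1, steps=61):
    """Exhaustive search over a log-spaced grid inside the box [lo, hi]^2."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    best = None
    for alpha in grid:
        for beta in grid:
            loss = expected_loss(alpha, beta, kappa0, kappa1,
                                 c0, c1, c2, n_calls, p_spit)
            if best is None or loss < best[0]:
                best = (loss, alpha, beta)
    return best  # (smallest loss found, alpha, beta)


if __name__ == "__main__":
    # Setting with lambda1/lambda0 = 0.1, c0 = 0, c2 = c1 = 1, N = 20, p(SPIT) = 0.5.
    print(grid_search(-1.40258, 6.69741, 0.0, 1.0, 1.0, 20, 0.5))
```

A grid is coarse but cannot get stuck in a local minimum of the nonconvex objective; its result could also serve as a starting point for a local solver.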

3.4. Example: exponential duration distribution

In this section we assume that the distributions $p_{\mathrm{SPIT}}$ and $p_{\mathrm{NON\text{-}SPIT}}$ are of a certain parametric form for which it is possible to compute the derived quantities analytically and in closed form. Specifically, we assume that both are exponential distributions with parameters $\lambda_0, \lambda_1 > 0$, that is, are given by

$$p(x \mid \mathrm{SPIT}) = \lambda_0 \exp(-\lambda_0 x), \qquad p(x \mid \text{NON-SPIT}) = \lambda_1 \exp(-\lambda_1 x) \tag{21}$$

for $x > 0$. While this section is primarily meant as a numerical example to illustrate the behavior of an SPRT-based SPIT filter theoretically, it is not an altogether unreasonable scenario to assume for a real-world SPIT filter. For example, we could assume that the only observable feature $x$ of (accepted) calls is their duration (see Section 4). In this case SPIT calls will have a shorter duration than regular calls because after a callee answers the call, they will hang up as soon as they realize it is SPIT. The majority of regular calls, on the other hand, will tend to have a longer duration. While this certainly simplifies the situation from the real world (e.g., it is generally assumed that call duration follows a more complex and heavy-tailed distribution [2]), we can imagine that both durations can be modeled by an exponential distribution with an average (expected) length of SPIT calls of $1/\lambda_0$ minutes and an average length of NON-SPIT calls of $1/\lambda_1$ minutes $\left(\frac{1}{\lambda_1} > \frac{1}{\lambda_0}\right)$.

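First, let us consider the expected stopping time from Eqs. (11) and (12). From Eqs. (13) and (14) we have that the Kullback–Leibler information number for Eq. (21) is given by

$$\begin{aligned}
\kappa_0 &= \int_0^\infty \log\!\left(\frac{\lambda_1 \exp(-\lambda_1 x)}{\lambda_0 \exp(-\lambda_0 x)}\right) \cdot \lambda_0 \exp(-\lambda_0 x)\,dx \\
&= \log\frac{\lambda_1}{\lambda_0} \int_0^\infty \lambda_0 \exp(-\lambda_0 x)\,dx + (\lambda_0 - \lambda_1) \int_0^\infty x \cdot \lambda_0 \exp(-\lambda_0 x)\,dx \\
&= \log\frac{\lambda_1}{\lambda_0} + 1 - \frac{\lambda_1}{\lambda_0}.
\end{aligned} \tag{22}$$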

(On the second line, the first integral is an integral over a density and thus is equal to one; the second integral is the expectation under $p_{\mathrm{SPIT}}$ and thus is equal to $1/\lambda_0$.) Similarly we obtain for $\kappa_1$ the expression

$$\kappa_1 = \log\frac{\lambda_1}{\lambda_0} - 1 + \frac{\lambda_0}{\lambda_1}. \tag{23}$$

Note that for more complex forms of distributions we may no longer be able to evaluate $\kappa_i$ in closed form.
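The closed forms in Eqs. (22) and (23) are easy to evaluate numerically. The sketch below is only an illustration; the function names are hypothetical, and the conversion from mean durations uses the fact that an exponential distribution with rate $\lambda$ has mean $1/\lambda$.

```python
import math


def ratio_from_means(mean_spit, mean_non_spit):
    """Return lambda1 / lambda0, using lambda = 1 / (mean call duration)."""
    return mean_spit / mean_non_spit


def kl_numbers(ratio):
    """Return (kappa0, kappa1) for a rate ratio lambda1 / lambda0 > 0."""
    if ratio <= 0.0:
        raise ValueError("the ratio lambda1/lambda0 must be positive")
    kappa0 = math.log(ratio) + 1.0 - ratio        # Eq. (22)
    kappa1 = math.log(ratio) - 1.0 + 1.0 / ratio  # Eq. (23)
    return kappa0, kappa1


if __name__ == "__main__":
    for r in (0.9, 0.5, 0.1):
        k0, k1 = kl_numbers(r)
        print(f"ratio={r:.2f}  kappa0={k0:.5f}  kappa1={k1:.5f}")
```

The values obtained this way can be compared against the $\kappa_0$ and $\kappa_1$ columns of Table 1.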

As we can see, $\kappa_i$ only depends on the ratio $\lambda_1/\lambda_0$. Thus for fixed accuracy parameters $\alpha, \beta$ the expected stopping time in Eqs. (11), (12) will also only depend on the ratio $\lambda_1/\lambda_0$. The closer the ratio is to zero, the fewer samples will be needed (the problem becomes easier); the closer the ratio is to one, the more samples will be needed (the problem becomes harder). Of course this result is intuitively clear: the ratio $\lambda_1/\lambda_0$ determines how similar the distributions are.

Table 1. How does the difficulty of the problem, expressed in terms of the ratio $\lambda_1/\lambda_0$, affect the expected number of samples until stopping $E_{\mathrm{SPIT}}[T]$ and $E_{\mathrm{NON\text{-}SPIT}}[T]$ for different settings of the accuracy parameters $\alpha, \beta$.

| $\lambda_1/\lambda_0$ | $\kappa_0$ | $\kappa_1$ | $E_{\mathrm{SPIT}}[T]$ ($\alpha,\beta = 0.05$) | $E_{\mathrm{NON}}[T]$ ($\alpha,\beta = 0.05$) | $E_{\mathrm{SPIT}}[T]$ ($\alpha,\beta = 0.01$) | $E_{\mathrm{NON}}[T]$ ($\alpha,\beta = 0.01$) | $E_{\mathrm{SPIT}}[T]$ ($\alpha,\beta = 0.001$) | $E_{\mathrm{NON}}[T]$ ($\alpha,\beta = 0.001$) |
|---|---|---|---|---|---|---|---|---|
| 0.99 | -0.00005 | 0.00005 | 52646.2 | 52294.7 | 89463.4 | 88865.9 | 136938.9 | 136024.5 |
| 0.95 | -0.00129 | 0.00133 | 2049.0 | 1980.1 | 3481.9 | 3364.9 | 5329.7 | 5150.5 |
| 0.90 | -0.00536 | 0.00575 | 494.3 | 460.8 | 840.0 | 783.0 | 1285.8 | 1198.6 |
| 0.70 | -0.05667 | 0.07189 | 46.7 | 36.8 | 79.4 | 62.6 | 121.6 | 95.8 |
| 0.50 | -0.19314 | 0.30685 | 13.7 | 8.6 | 23.3 | 14.6 | 35.6 | 22.4 |
| 0.30 | -0.50397 | 1.12936 | 5.2 | 2.3 | 8.9 | 3.9 | 13.6 | 6.1 |
| 0.10 | -1.40258 | 6.69741 | 1.8 | 0.3 | 3.2 | 0.6 | 4.9 | 1.0 |
| 0.01 | -3.61517 | 94.39486 | 0.7 | <0.1 | 1.2 | <0.1 | 1.9 | 0.1 |

In Table 1 we examine numerically the impact of the difficulty of the problem, in terms of the ratio $\lambda_1/\lambda_0$, on the expected number of samples until stopping for different settings of accuracy $\alpha, \beta$. For instance, an average NON-SPIT call duration of 2 min as opposed to an average duration of SPIT calls of 12 s leads to $\lambda_1/\lambda_0 = 0.1$, and distributions that are sufficiently dissimilar to arrive with high accuracy at the correct decision within a very short observation horizon: with accuracy $\alpha, \beta = 0.001$ the filter has to observe on average 1.0 calls if the source is NON-SPIT and 4.9 calls if the source is SPIT to make the correct decision in at least 99.9% of all cases. (Notice that the stopping time is not symmetric.) On the other hand, with $\lambda_1/\lambda_0 > 0.5$ the similarity between SPIT and NON-SPIT becomes too strong, which is an indication that another feature or collection of features should be chosen to discriminate the two (see Section 4). In Table 2 we examine numerically how the ratio $\lambda_1/\lambda_0$, the setting of the cost $c_2 = k c_1$ (and $c_0 = 0$), and the prior probabilities affect what choice of accuracy thresholds $\alpha^*, \beta^*$ is optimal and how this affects the combined expected stopping time. Note that in this table we consider two different "worlds": one where 50% of all sources are SPIT bots, and one where only 1% are SPIT bots.

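Table 2. How do the characteristics of the problem, expressed as the ratio $\lambda_1/\lambda_0$ and prior $p(\mathrm{SPIT})$, and the choice of the cost terms $c_2 = k c_1$ (and $c_0 = 0$) affect the optimal thresholds $\alpha^*, \beta^*$ and combined expected stopping time $E[T] = p(\mathrm{SPIT})\,E_{\mathrm{SPIT}}[T] + (1 - p(\mathrm{SPIT}))\,E_{\mathrm{NON\text{-}SPIT}}[T]$ ($N = 20$). Part 1: $p(\mathrm{SPIT}) = 0.5$.

| $\lambda_1/\lambda_0$ | Quantity | $c_2 = c_1$ | $c_2 = 10c_1$ | $c_2 = 1000c_1$ |
|---|---|---|---|---|
| 0.1 | $\alpha^*$ | 1.01e-06 | 1.00e-06 | 1.00e-06 |
| 0.1 | $\beta^*$ | 3.92e-02 | 3.96e-03 | 3.97e-05 |
| 0.1 | $E[T]$ | 2.13 | 2.99 | 4.64 |
| 0.2 | $\alpha^*$ | 1.00e-06 | 1.00e-06 | 9.98e-02 |
| 0.2 | $\beta^*$ | 8.04e-02 | 8.60e-03 | 5.25e-05 |
| 0.2 | $E[T]$ | 4.15 | 5.79 | 5.75 |
| 0.3 | $\alpha^*$ | 1.00e-06 | 7.66e-02 | 9.99e-02 |
| 0.3 | $\beta^*$ | 9.40e-02 | 9.46e-03 | 8.94e-05 |
| 0.3 | $E[T]$ | 7.74 | 5.10 | 9.10 |
| 0.4 | $\alpha^*$ | 1.62e-06 | 2.37e-06 | 7.11e-06 |
| 0.4 | $\beta^*$ | 9.78e-02 | 7.74e-02 | 1.02e-02 |
| 0.4 | $E[T]$ | 13.63 | 14.04 | 17.23 |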

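Next we will compute the log-likelihood ratio $\Lambda_t$. From Eq. (3) we have

$$\Lambda_t = \sum_{i=1}^{t} \log\frac{p(x_i \mid \text{NON-SPIT})}{p(x_i \mid \mathrm{SPIT})} = \sum_{i=1}^{t} \left[\log\frac{\lambda_1}{\lambda_0} + (\lambda_0 - \lambda_1)\,x_i\right]. \tag{24}$$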

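The decision regions for the SPRT from Eq. (4) are thus

$$\log\frac{\beta}{1 - \alpha} \;<\; t \cdot \log\frac{\lambda_1}{\lambda_0} + (\lambda_0 - \lambda_1)\sum_{i=1}^{t} x_i \;<\; \log\frac{1 - \beta}{\alpha} \tag{25}$$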

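or, equivalently,

$$\log\frac{\beta}{1 - \alpha} + t \cdot \log\frac{\lambda_0}{\lambda_1} \;<\; (\lambda_0 - \lambda_1)\sum_{i=1}^{t} x_i \;<\; \log\frac{1 - \beta}{\alpha} + t \cdot \log\frac{\lambda_0}{\lambda_1}. \tag{26}$$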

From the latter we can see that the boundaries of the decision regions are straight and parallel lines (as a function of the number $t$ of samples). Running the SPRT can now be graphically visualized as shown in Fig. 3: the log-likelihood ratio $\Lambda_t$ starts for $t = 1$ in the middle region between the decision boundaries and, with each new sample it observes from the unknown source, does a random walk over time. Eventually it will cross over one of the lines, after which the corresponding decision is made. For a fixed value of $\alpha, \beta$, changing the ratio $\lambda_0/\lambda_1$ changes the slope of the decision boundaries. For a fixed value of $\lambda_0, \lambda_1$, changing the accuracy $\alpha, \beta$ shifts the decision boundaries upward and downward.

Table 2 (continued). Part 2: $p(\mathrm{SPIT}) = 0.01$ ($N = 20$).

| $\lambda_1/\lambda_0$ | Quantity | $c_2 = c_1$ | $c_2 = 10c_1$ | $c_2 = 1000c_1$ |
|---|---|---|---|---|
| 0.1 | $\alpha^*$ | 1.15e-06 | 1.01e-06 | 5.85e-02 |
| 0.1 | $\beta^*$ | 4.08e-04 | 4.01e-05 | 1.00e-06 |
| 0.1 | $E[T]$ | 2.07 | 2.11 | 0.51 |
| 0.2 | $\alpha^*$ | 5.37e-02 | 9.99e-02 | 9.75e-02 |
| 0.2 | $\beta^*$ | 5.94e-04 | 5.31e-05 | 1.00e-06 |
| 0.2 | $E[T]$ | 1.29 | 1.06 | 1.10 |
| 0.3 | $\alpha^*$ | 9.99e-02 | 9.98e-02 | 6.80e-04 |
| 0.3 | $\beta^*$ | 9.03e-04 | 9.04e-05 | 4.09e-05 |
| 0.3 | $E[T]$ | 2.12 | 2.17 | 6.59 |
| 0.4 | $\alpha^*$ | 7.37e-03 | 8.04e-03 | 1.03e-02 |
| 0.4 | $\beta^*$ | 1.58e-03 | 1.55e-03 | 1.57e-03 |
| 0.4 | $E[T]$ | 8.49 | 8.34 | 7.91 |

Fig. 3. An example run of SPRT. The log-likelihood ratio $\Lambda_t$ starts at $t = 1$ in the region between the decision boundaries and, with each new sample observed from the unknown source, performs a random walk over time. Eventually it will cross over one of the decision boundaries and either enter the region marked "decide SPIT" or enter the region marked "decide NON-SPIT".
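The random walk of $\Lambda_t$ between the two boundaries can be written as a short loop. The following is a minimal sketch for the exponential case, not the authors' implementation: the function name and the "UNDECIDED" return value are hypothetical, and the assignment of the lower boundary to "decide SPIT" is inferred from $H_0$ being the SPIT hypothesis, since Eq. (4) is not reproduced in this section.

```python
import math


def sprt_exponential(durations, lam0, lam1, alpha, beta):
    """Run the SPRT of Eqs. (24), (25) on a sequence of call durations.

    lam0 is the rate of the SPIT model, lam1 the rate of the NON-SPIT model.
    Returns (decision, number of samples used).
    """
    lower = math.log(beta / (1.0 - alpha))   # crossing it means "decide SPIT"
    upper = math.log((1.0 - beta) / alpha)   # crossing it means "decide NON-SPIT"
    step_const = math.log(lam1 / lam0)
    llr = 0.0
    for t, x in enumerate(durations, start=1):
        llr += step_const + (lam0 - lam1) * x   # one summand of Eq. (24)
        if llr <= lower:
            return "SPIT", t
        if llr >= upper:
            return "NON-SPIT", t
    # Samples exhausted before either boundary was crossed.
    return "UNDECIDED", len(durations)


if __name__ == "__main__":
    # Hypothetical durations, in units where the mean SPIT call length is 1.
    print(sprt_exponential([0.1] * 10, 1.0, 0.1, 0.01, 0.01))
    print(sprt_exponential([10.0], 1.0, 0.1, 0.01, 0.01))
```

Only the running sum `llr` has to be stored between calls, so the test can be updated online as each call ends.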

3.5. Experiment: perfectly known distribution

To examine the (theoretical) performance of SPRT, we ran a large number of Monte Carlo simulations for various settings of problem difficulty $\lambda_1/\lambda_0$ ($\lambda_0$ was set to 1, $\lambda_1$ was varied between 0.9 and 0.1) and accuracy $\alpha, \beta$ ($\alpha, \beta = 0.05$, $\alpha, \beta = 0.01$, $\alpha, \beta = 0.001$). For each setting we performed 50,000 independent runs and recorded, for each run, how many samples were necessary for SPRT to reach a decision and whether or not that decision was wrong. The experiment examines separately the case where the source is SPIT and the case where it is NON-SPIT. The result of the simulation is shown in Fig. 4 and confirms the expected stopping time computed analytically in Table 1. In addition, the results show that the actual number of mistakes made (the height of the bars in the figure) is in many cases notably smaller than the corresponding error probabilities $\alpha, \beta$ (the dashed horizontal lines in the figure), which are merely upper bounds.
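A simulation of this kind can be sketched in a few lines. The code below is an illustrative stand-in rather than the setup used for Fig. 4: the function name, the seed and the reduced number of runs are our own choices, and samples are drawn with Python's `random.expovariate`.

```python
import math
import random


def simulate(lam_true, lam0, lam1, alpha, beta, runs, seed=0):
    """Monte Carlo estimate of the mean stopping time and of the fraction of
    runs that end with the decision SPIT, for samples drawn from Exp(lam_true)."""
    rng = random.Random(seed)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    step_const = math.log(lam1 / lam0)
    total_samples = 0
    spit_decisions = 0
    for _ in range(runs):
        llr, t = 0.0, 0
        while lower < llr < upper:           # continue while undecided
            x = rng.expovariate(lam_true)    # one simulated call duration
            llr += step_const + (lam0 - lam1) * x
            t += 1
        total_samples += t
        if llr <= lower:
            spit_decisions += 1
    return total_samples / runs, spit_decisions / runs


if __name__ == "__main__":
    # SPIT source (lam_true = lam0 = 1), ratio lambda1/lambda0 = 0.5.
    mean_t, frac_spit = simulate(1.0, 1.0, 0.5, 0.05, 0.05, runs=5000)
    print(f"mean stopping time: {mean_t:.2f}")
    print(f"empirical error rate: {1.0 - frac_spit:.4f}")
```

Because $\Lambda_t$ typically overshoots the boundary it crosses, the simulated mean stopping time need not coincide exactly with a value computed from an approximation that ignores overshoot.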

4. Network operator’s perspective

Having so far described our SPIT filter from a purely theoretical point of view, we now discuss the steps necessary to deploy it in the real world. Note that in what follows it is neither our intent nor within the scope of the paper to describe in detail the architecture of a fully functional SPIT prevention system.

The section is structured as follows. First we will sketch how the SPIT filter, which should more appropriately be seen as a SPIT detector, could be integrated into a larger SPIT prevention system as one building block among many others. We will make suggestions on how the problem-dependent parts of the SPRT can be instantiated by specifying:

• sources: how to map calls to sources such that the pure source assumption is fulfilled
• features: what call features to use such that SPIT and NON-SPIT calls are well represented
• actions: what action to take if the SPRT indicates a source is likely to send out SPIT

We will then explain how the distribution of the features that discriminate SPIT from NON-SPIT can be learned from labeled data by first assuming that the distribution is of a certain parameterized form and then estimating these parameters from the data via maximum likelihood. In the second part of the section we use data extracted from a large database of real-world voice calls and demonstrate empirically that the performance of the SPIT filter under real-world conditions with a priori unknown distribution is also very good.

4.1. Integration into a SPIT prevention system

For a network operator, a SPIT prevention system such as the one we propose and sketch in Fig. 5 must both maintain and guarantee an acceptable level of service under adverse operating conditions and have a low maintenance cost. To achieve this goal, we adopt the overall strategy presented in [17], which proposes a hierarchical system consisting of two modular layers: a basic service layer and a diagnostic layer. The basic service layer manages and processes call requests and as a whole serves to protect against attacks in VoIP networks, among which SPIT is regarded as just one particular threat. For the prevention of SPIT the basic layer is made up of two subcomponents (a conceptually similar setup was also proposed for SEAL [13]): always-on detection and on-demand protection.

Fig. 4. Monte Carlo simulation of SPRT with the results averaged over 50,000 independent runs and error bars denoting one standard deviation. The top row shows the average number of calls necessary before SPRT stops for various settings of accuracy $\alpha, \beta$. The red dots indicate the respective expected stopping time. The bottom row shows the proportion of SPRT runs ending up making the wrong decision (that is, accepting SPIT or blocking NON-SPIT) for various settings of accuracy $\alpha, \beta$ (shown as horizontal lines). As can be seen, in many instances the height of the bars is notably below the corresponding horizontal line, meaning that the actual error rate can be much smaller than what the accuracy parameters $\alpha, \beta$ would suggest, which are merely an upper bound. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Always-on detection consists of passive modules which essentially extract and make use of information which is "already there" and thus have zero or very low computational costs. On the other hand, these modules are only weak detectors in that they are successful only under restrictive conditions. If the always-on detection component cannot establish with high certainty that a call is NON-SPIT, in which case it would be allowed to pass through unharmed, the call is internally forwarded to the on-demand protection component.

The on-demand protection component consists of active modules which require additional processing and can have medium to high computational costs; e.g., digit-based audio CAPTCHAs [16] or hidden Turing tests and computational puzzles [11]. These modules are meant to protect the network with high accuracy. However, because of the cost involved (they interfere with natural communication, consume and block resources, and may to some degree annoy human callers), they are triggered only individually and on demand. An intelligent decision policy controls the precise way a call gets probed by the various security tests, such that resource consumption is minimized, until a final decision SPIT or NON-SPIT can be made with high certainty.

4.1.1. Deployment as outbound SPIT filter

Logically, the SPIT filter we propose would be located within the always-on component. Physically, the SPIT filter would be located at the proxy servers which form the gateway between one's own network and the outside world. The SPIT filter will then act as an outbound filter: it will perform self-moderation of outgoing calls and unveil the presence of a SPIT botnet within one's own network before other networks take countermeasures. Previous experience with e-mail has shown that outbound filters are critical to keep control over one's traffic. They ensure that the whole network's address space will not end up on a black list as soon as a single one of its child systems is enrolled into a botnet.

4.1.2. Defining sources

For an outbound filter, the definition of sources (that is, the mapping of individual call requests to the appropriate state slot in our filter) is straightforward since sources correspond to registered users and/or customers. The amount of information required per source (one additional number) and the number of potential sources itself are low enough to accommodate most operators' needs without having to rely on aggregation.

Fig. 5. A hierarchical and modular SPIT prevention system. The SPRT-based SPIT filter we propose in this paper is one particular module in the always-on component of the basic service layer.

4.1.3. Defining features

The choice of which call features to use in our SPIT filter is crucial for its performance. While the SPIT filter is theoretically guaranteed to work with any choice of single feature or combination of features (under which the distributions for SPIT and NON-SPIT are non-identical), for practical reasons features with the following properties are highly desired:

• Good separation of the distributions $p_{\mathrm{SPIT}}$ and $p_{\mathrm{NON\text{-}SPIT}}$ as quantified by the Kullback–Leibler information numbers $\kappa_1, \kappa_0$ from Eqs. (13) and (14). This ensures that the filter will be able to stop a source from sending out further SPIT as quickly as possible.
• Hard to manipulate for spitters.
• Availability of data, e.g., from old logfiles.
• Easy access to the feature during runtime, meaning that the feature should be easily observable during normal operation without requiring extra machinery.

We believe that in this regard a good choice of features in SPIT detection are features which capture the reaction of users to SPIT rather than features that capture the technical properties of SPIT bots. Indeed, a SPIT call is: (1) undesirable and likely has a shorter duration, as the call would be hung up by the callee with higher probability; (2) likely to be playing back a pre-recorded message such that double-talk⁵ may occur; (3) unexpected, with possibly longer ringing time and a higher rate of unanswered/refused calls; (4) automated, with likely shorter time-to-speech and fewer pauses during the call. Although these features are more likely to be affected by cultural or social habits, they are much harder to manipulate for a spitter than features such as inter-arrival time or port number. They are also less likely to be affected by the technical characteristics of one specific botnet, and thus could more easily take the moving nature of SPIT attacks into account.

⁵ Double-talk means that caller and callee talk at the same time. As is described in [20], this can be computed directly from the packet header information and does not require expensive processing of the voice stream.

In this paper, we argue that call duration might be a good feature (also because it simplifies calculation).

Of course, other choices of features are also possible. In fact, the theoretical framework in Section 3 allows one to do feature selection. In practice, one would thus start by identifying a set of all possible candidate features. Given data, one would then compute the $\kappa_0$ and $\kappa_1$ Kullback–Leibler information numbers either from parametric density estimation (as shown below in Section 4.2), or in more complicated cases from non-parametric density estimation such as, e.g., histograms. Knowing the respective $\kappa_1, \kappa_0$ allows one to rank the subsets and ultimately to select the features that achieve minimum expected regret, as the worst-case false positive and false negative rates can be explicitly computed using the equations presented in Section 3.3.

4.1.4. Defining decisions

Finally we have to talk about the actions the SPIT filter can take. In Section 3 we have assumed that once a source has been identified as a SPIT bot, all subsequent calls are to be blocked. Conversely, once a source has been identified as a regular user, all subsequent calls are to be allowed through. It is clear that in practice this decision rule alone will not be sufficient. However, recall from the beginning of this section that our SPIT filter is meant to be only one particular module in a larger SPIT prevention system (see Fig. 5). Thus the outcome of the SPRT should be seen as another feature by itself, based on which a higher-level decision-making policy would then act. (The specific details of this high-level decision-making policy are outside the scope of the paper.)

⁶ http://reality.media.mit.edu/download.php.

⁷ The earlier work described in [10] set out to precisely change that. In it the authors describe a methodology for creating SPIT traffic and also provide a common data set for use in benchmark comparisons. However, the data set they provide is generated from "emulated users based on a social model"; in essence, the authors use common tools to generate the SPIT traffic, where the relevant features, such as call duration, inter-arrival time, and behavior upon receiving a call, are all modeled by sampling from distributions. For example, the call duration was generated from an exponential distribution the parameter of which was specified by hand (which amounts to the same as what we do here).

4.2. Example: learning the distribution from labeled data

Maximum likelihood estimation is one standard tool from statistics to learn distributions from labeled data. Assume we are given $n$ calls either all labeled as SPIT or all labeled as NON-SPIT (without loss of generality we assume they are all SPIT). To estimate the distribution $p_{\mathrm{SPIT}}$ necessary to perform the SPRT, we proceed as follows. First, we extract the feature representation from each call, yielding $x_1, \ldots, x_n$. Next, we make an assumption about the form of $p_{\mathrm{SPIT}}$; for example, in this paper we assume that $x_i$ is the call duration and that an exponential distribution with (unknown) parameter $\lambda$ describes the data well. To find the parameter $\lambda$ that best explains the data (under the assumption that the data is drawn i.i.d. from an exponential distribution) we then consider the likelihood of the data as a function of $\lambda$ and maximize it (or equivalently, its logarithm):

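$$\begin{aligned}
\max_{\lambda > 0} L(\lambda) &:= \log p(x_1, \ldots, x_n \mid \lambda) = \log\left(\prod_{i=1}^{n} p(x_i \mid \lambda)\right) \\
&= \sum_{i=1}^{n} \log\bigl(\lambda \exp(-\lambda x_i)\bigr) = n \log(\lambda) - \lambda \sum_{i=1}^{n} x_i.
\end{aligned} \tag{27}$$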

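The best parameter $\lambda_{\mathrm{ML}}$ is then found by setting the derivative of $L(\lambda)$ to zero, yielding $\frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$, or

$$\lambda_{\mathrm{ML}} = \frac{n}{\sum_{i=1}^{n} x_i}. \tag{28}$$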

To run the SPIT filter we would thus take $p(x \mid \mathrm{SPIT}) := \lambda_{\mathrm{ML}} \exp(-\lambda_{\mathrm{ML}} x)$ in Eq. (21), with $p(x \mid \text{NON-SPIT})$ learned analogously.
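In code, Eq. (28) is a one-line estimator. The sketch below is only illustrative: the function name is hypothetical and the toy durations are made up, not taken from the data set used in Section 4.3.

```python
def fit_exponential_rate(durations):
    """Maximum likelihood rate of an exponential model, Eq. (28): n / sum(x_i)."""
    if len(durations) == 0:
        raise ValueError("need at least one call duration")
    if any(x <= 0 for x in durations):
        raise ValueError("call durations must be positive")
    return len(durations) / sum(durations)


# Hypothetical toy durations in seconds, all labeled as SPIT.
spit_durations = [12.0, 25.0, 31.0, 44.0]
lam_spit = fit_exponential_rate(spit_durations)
print(f"lambda_ML = {lam_spit:.5f} per second, mean = {1.0 / lam_spit:.2f} s")
```

The estimate is the reciprocal of the sample mean, so it can be maintained incrementally from a running count and a running sum of durations.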

It should be noted that while the above procedure leading to Eq. (28) is very basic and is only applicable for the exponential distribution we consider in the paper, the maximum likelihood procedure itself is more widely applicable and can also be used to fit more accurate but more complex density models to the data (e.g., mixture models). For further information on this subject, we refer the interested reader to the vast literature on density estimation in statistics and machine learning.

4.3. Evaluation with learned distributions

As said above, in the real world we cannot assume that we know the generating distributions $p_{\mathrm{SPIT}}$ and $p_{\mathrm{NON\text{-}SPIT}}$. Instead we have to build a reasonable estimate for the distributions from labeled data. The natural question we have to answer then is: what happens if the learned distribution used in the SPIT filter does not exactly match the true but unknown distribution generating the data we observe? (Remember, the case where they do match was examined in Section 3.5.)

To examine this point with real-world data, we used call data from 106 subjects collected from mobile phones over several months by the MIT Media Lab and made publicly available⁶ in [3]. The dataset gives detailed information for each call and comprises about 100,000 regular voice calls. Ideally we would have liked to perform the evaluation based on real-world data for both SPIT and NON-SPIT. Unfortunately, this dataset only contains information about regular calls and not SPIT, and at the time of writing, no other such dataset for SPIT is publicly available.⁷ In the following we will again consider call duration as the feature for our filter to discriminate SPIT from NON-SPIT. To obtain SPIT calls from the dataset, we artificially divide it into two smaller datasets: one that is designated as SPIT and one that is designated as NON-SPIT. The set of SPIT calls is obtained by taking 20% of all calls whose call duration is <80 s; the remaining calls are assigned to the set of NON-SPIT calls. Having thus prepared the data, our general experimental procedure is as follows (also refer to Fig. 2):

1. We first need to estimate the generating distributions $p_{\mathrm{SPIT}}$ and $p_{\mathrm{NON\text{-}SPIT}}$. To do this, we assume that the estimates, $\hat{p}_{\mathrm{SPIT}}$ and $\hat{p}_{\mathrm{NON\text{-}SPIT}}$, are exponential distributions the parameters of which can be estimated via maximum likelihood as in Eq. (28). In the remainder, we then use the estimates $\hat{p}_{\mathrm{SPIT}}$ and $\hat{p}_{\mathrm{NON\text{-}SPIT}}$ as a proxy for the true but unknown distributions $p_{\mathrm{SPIT}}$ and $p_{\mathrm{NON\text{-}SPIT}}$. Note that because the true distribution generating the data is unlikely to be exactly an exponential, we will introduce an estimation error which can negatively affect the performance of the SPIT filter (the theoretical bounds we derived in Section 3 only apply if the data is generated from the true distribution). However, if the true distribution is "close" to an exponential, then we can expect that the result obtained from using only the learned distribution will also be "close". In Fig. 6 we give a visual comparison of the data and the fitted model; the figure shows a histogram plot of the actual distribution of call duration in the data and the idealized distribution from the fitted model. As can be seen from the figure, the fit is good but not perfect: in particular for the NON-SPIT case the model underestimates calls with a short duration (which will negatively affect the performance of the SPRT filter by making the optimization select too optimistic error thresholds).

2. Performing this step, we obtain the exponential distributions $\hat{p}_{\mathrm{SPIT}}$ with mean $\lambda_0^{-1} = 30.23$ s and $\hat{p}_{\mathrm{NON\text{-}SPIT}}$ with mean $\lambda_1^{-1} = 129.64$ s so that the ratio becomes $\lambda_1/\lambda_0 = 0.23$.

[Figure 6: two histogram panels, "Distribution of NON-SPIT calls" (left) and "Distribution of SPIT calls" (right); x-axis: call duration [sec]; y-axis: frequency counts; legend: Data, Model.]

Fig. 6. Histogram of the distribution of the feature "call duration" in the data set for NON-SPIT calls (left panel) and SPIT calls (right panel). The red curve shows the frequency count that we would have obtained if the data were truly generated by an exponential distribution with parameter $\lambda_{\mathrm{ML}}$ learned via maximum likelihood (see Section 4.2). The figure shows to what extent the real-world data and the learned distribution model agree; for example, it can be seen that in the NON-SPIT case the model does not fit the data perfectly and underestimates calls with short duration (which in turn affects the SPRT filter and makes the optimization select too optimistic error thresholds). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3. From this $\hat{p}_{\mathrm{SPIT}}$ and $\hat{p}_{\mathrm{NON\text{-}SPIT}}$ we then compute $\kappa_0, \kappa_1$ from Eqs. (22) and (23).

4. We then systematically examine the behavior of the SPIT filter when we vary the remaining design-specific cost parameters $c_1, c_2, N$ ($c_0 = 0$) and prior probability $p(\mathrm{SPIT})$. For each setting of these parameters, the following steps are repeated:

   (a) We first compute the $\alpha^*, \beta^*$ that is optimal for each particular setting by solving numerically Eq. (20) using MATLAB's inbuilt interior point solver.

   (b) We then simulate the SPIT filter by running 1,000,000 independent trials for this setting. Each such trial consists of first randomly determining the type of the source (a Bernoulli event generated from the prior probability $p(\mathrm{SPIT})$) and then drawing calls uniformly at random from the corresponding dataset we prepared above. The average success rate obtained over all these trials is then reported in Table 3.
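The data preparation and a single trial of step 4(b) can be sketched as follows. This is an illustration under several assumptions of ours, none confirmed by the text: the 20% of short calls are picked uniformly at random, calls are resampled with replacement, a trial that has not stopped after $N$ calls is reported as "UNDECIDED", and the synthetic durations stand in for the real data set. All names are hypothetical.

```python
import math
import random


def split_dataset(durations, rng, threshold=80.0, fraction=0.2):
    """Designate a random `fraction` of the calls shorter than `threshold`
    seconds as SPIT; every other call is designated NON-SPIT."""
    short = [i for i, d in enumerate(durations) if d < threshold]
    chosen = set(rng.sample(short, int(fraction * len(short))))
    spit = [d for i, d in enumerate(durations) if i in chosen]
    non_spit = [d for i, d in enumerate(durations) if i not in chosen]
    return spit, non_spit


def run_trial(rng, p_spit, spit, non_spit, lam0, lam1, alpha, beta, n_calls):
    """One trial: draw the source type from the prior, then resample calls
    from the matching set until the SPRT stops or n_calls is reached."""
    is_spit = rng.random() < p_spit          # Bernoulli draw from p(SPIT)
    pool = spit if is_spit else non_spit
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = 0.0
    for t in range(1, n_calls + 1):
        x = rng.choice(pool)                 # uniform draw with replacement
        llr += math.log(lam1 / lam0) + (lam0 - lam1) * x
        if llr <= lower:
            return is_spit, "SPIT", t
        if llr >= upper:
            return is_spit, "NON-SPIT", t
    return is_spit, "UNDECIDED", n_calls


rng = random.Random(0)
calls = [float(d) for d in range(1, 401)]    # synthetic durations in seconds
spit, non_spit = split_dataset(calls, rng)
lam0 = len(spit) / sum(spit)                 # Eq. (28) on the SPIT set
lam1 = len(non_spit) / sum(non_spit)         # Eq. (28) on the NON-SPIT set
print(run_trial(rng, 0.5, spit, non_spit, lam0, lam1, 0.01, 0.01, 20))
```

Repeating `run_trial` many times and comparing the decision with `is_spit` would give empirical error rates of the kind reported in Table 3.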

The results show that overall the performance depends on two factors: the prior probability of a source being SPIT and the choice of the cost terms $c_1, c_2$ together with the number of calls $N$. If the prior probability for SPIT is very small (i.e., we expect that the majority of sources are NON-SPIT), the optimization procedure automatically selects a higher and less accurate threshold $\alpha$ for SPIT and, similarly, a lower and thus more accurate threshold $\beta$ for NON-SPIT. This in turn increases the error rate for SPIT but decreases the error rate for NON-SPIT. For example, for $N = 20$ and $c_2 = 100c_1$, in the case $p(\mathrm{SPIT}) = 0.5$ (i.e., 50% of all sources are SPIT), the average relative error of the filter for SPIT is 0.16% and the average relative error for NON-SPIT is 1.94%. In the same situation for a world where $p(\mathrm{SPIT}) = 0.01$ (i.e., 1% of all sources are SPIT), the average relative error for SPIT increases to 6.29% and the average relative error for NON-SPIT decreases to 0.32%. The second factor affecting performance is the choice of the cost terms, which dictate if, all else being equal, it is more important to avoid erroneously accepting SPIT or erroneously blocking NON-SPIT.

Finally, we can see from these empirical results on real data that the empirical error rate and stopping time are (in some cases notably) higher than what we would have expected from the theoretical analysis in Section 3. This, however, should not come as a surprise. The reason for this discrepancy is of course that in our experiments the true distribution generating the data is unknown and that the estimated distribution we use as its surrogate does not perfectly agree with it (see Fig. 6). Still, we can see that the performance degrades rather gracefully with the estimation error and is still quite accurate. In practice, one would also use more sophisticated (and more accurate) methods to estimate the underlying distributions from the data.

5. Summary

In this paper, we presented the first theoretical approach to SPIT filtering that is based on a rigorous mathematical formulation of the underlying problem and, in consequence, allows one to derive performance guarantees in terms of worst-case cumulative misclassification cost (the expected loss) and thus on the number of samples that are required to establish with the required level of confidence that a source is indeed a spitter. The method is optimal under the assumption of knowing the generating distributions, does not rely on manual tuning and tweaking of parameters, and is computationally simple and scalable. These are desirable features that make it a component of choice in a larger, autonomic framework.

Moreover, we have outlined the procedure that needs to be followed to apply this SPIT filter as an outbound filter in a realistic SPIT prevention system, including which potential call features to use and how the best feature could be found from the candidates via automated feature selection.


Table 3. Examining the empirical performance of the SPRT-based SPIT filter on the data set when systematically varying the design parameters c2 = k · c1 and N. Two scenarios are considered: one where the prior pSPIT is set to 0.5 (50% of all sources are SPIT), and one where it is set to 0.01 (1% of all sources are SPIT). Each entry in the table consists of five values: α*, β* are the parameters minimizing the loss; T is the empirical stopping time (followed by the expected stopping time E[T]); and RErr is the relative error for SPIT and NON-SPIT (i.e., what percentage of SPIT calls was erroneously accepted and what percentage of NON-SPIT calls was erroneously blocked).

| N | Value | p(SPIT) = 0.5, c2 = c1 | p(SPIT) = 0.5, c2 = 2c1 | p(SPIT) = 0.5, c2 = 5c1 | p(SPIT) = 0.5, c2 = 100c1 | p(SPIT) = 0.01, c2 = c1 | p(SPIT) = 0.01, c2 = 2c1 | p(SPIT) = 0.01, c2 = 5c1 | p(SPIT) = 0.01, c2 = 100c1 |
|---|---|---|---|---|---|---|---|---|---|
| 5 | α* | 1.00e-06 | 1.00e-06 | 1.00e-06 | 4.61e-02 | 4.56e-02 | 4.31e-02 | 4.40e-02 | 4.38e-02 |
| 5 | β* | 1.00e-01 | 9.09e-02 | 5.43e-02 | 6.49e-03 | 6.61e-03 | 6.65e-03 | 6.54e-03 | 6.69e-03 |
| 5 | T / E[T] | 3.4/3.3 | 3.4/3.3 | 3.8/3.7 | 4.1/3.1 | 3.3/1.1 | 3.3/1.1 | 3.3/1.1 | 3.3/1.1 |
| 5 | RErr (SPIT) | 11.247% | 12.315% | 23.752% | 87.108% | 87.280% | 89.652% | 86.753% | 88.650% |
| 5 | RErr (NON) | 26.259% | 25.157% | 18.221% | 1.844% | 1.927% | 1.965% | 1.897% | 1.959% |
| 10 | α* | 1.00e-06 | 1.00e-06 | 1.00e-06 | 1.00e-01 | 1.00e-01 | 1.00e-01 | 1.00e-01 | 1.74e-03 |
| 10 | β* | 1.00e-01 | 3.21e-02 | 1.65e-02 | 8.60e-05 | 8.68e-04 | 1.74e-04 | 8.69e-05 | 9.56e-05 |
| 10 | T / E[T] | 4.1/3.3 | 5.1/4.1 | 5.6/4. | 6.9/4. | 4.1/0.8 | 4.2/0.8 | 4.3/0.9 | 5.5/2.3 |
| 10 | RErr (SPIT) | 0.030% | 0.272% | 0.788% | 70.421% | 20.134% | 53.386% | 70.780% | 68.744% |
| 10 | RErr (NON) | 33.439% | 20.856% | 15.883% | 0.671% | 3.539% | 1.149% | 0.624% | 0.624% |
| 15 | α* | 1.00e-06 | 1.00e-06 | 1.00e-06 | 1.00e-01 | 9.98e-02 | 1.00e-01 | 1.00e-01 | 1.00e-01 |
| 15 | β* | 8.79e-02 | 1.87e-02 | 9.45e-03 | 5.78e-05 | 5.84e-04 | 1.17e-04 | 5.84e-05 | 1.00e-06 |
| 15 | T / E[T] | 4.5/3.4 | 6.0/4.4 | 6.5/4.8 | 7.9/5.0 | 4.5/0.8 | 4.6/0.9 | 4.7/0.9 | 4.8/0.9 |
| 15 | RErr (SPIT) | 0.000% | 0.000% | 0.006% | 6.593% | 1.034% | 3.448% | 6.442% | 63.821% |
| 15 | RErr (NON) | 33.480% | 18.918% | 14.569% | 1.642% | 4.638% | 2.211% | 1.600% | 0.096% |
| 20 | α* | 1.00e-06 | 1.00e-06 | 1.00e-06 | 1.00e-01 | 1.00e-06 | 6.50e-02 | 9.99e-02 | 1.00e-01 |
| 20 | β* | 6.35e-02 | 1.31e-02 | 6.60e-03 | 4.35e-05 | 6.70e-04 | 9.55e-05 | 4.39e-05 | 1.00e-06 |
| 20 | T / E[T] | 5.0/3.6 | 6.5/4.6 | 7.1/5.0 | 8.1/5.1 | 8.78/4.7 | 5.0/1.0 | 4.8/0.9 | 4.9/0.9 |
| 20 | RErr (SPIT) | 0.000% | 0.000% | 0.000% | 0.163% | 0.000% | 0.000% | 0.209% | 6.293% |
| 20 | RErr (NON) | 30.660% | 17.132% | 13.134% | 1.940% | 5.679% | 2.516% | 1.857% | 0.320% |
| 100 | α* | 1.00e-06 | 1.00e-06 | 1.00e-06 | 1.00e-06 | 1.01e-06 | 1.01e-06 | 1.01e-06 | 1.00e-06 |
| 100 | β* | 1.13e-02 | 2.25e-03 | 1.13e-03 | 1.13e-05 | 1.14e-04 | 2.28e-05 | 1.14e-05 | 1.00e-06 |
| 100 | T / E[T] | 6.9/4.7 | 8.2/5.6 | 8.6/6.0 | 11.5/8.5 | 9.7/4.7 | 10.0/4.7 | 10.1/4.7 | 10.2/4.7 |
| 100 | RErr (SPIT) | 0.000% | 0.000% | 0.000% | 0.000% | 0.000% | 0.000% | 0.000% | 0.000% |
| 100 | RErr (NON) | 16.886% | 9.432% | 7.659% | 1.458% | 3.166% | 1.923% | 1.476% | 0.610% |

1642 T. Jung et al. / Computer Networks 57 (2013) 1630–1643

In particular, we have sketched how the generating distributions can be learned from data. The difficulty of the problem of successfully detecting SPIT is then only related to how similar/dissimilar the generating distributions are. This difficulty can be quantitatively expressed in terms of the Kullback–Leibler information numbers κ1, κ0, which in turn can be calculated analytically or approximately from the learned distributions. Taken together, this means that the worst case performance of the SPIT filter can be computed in real-world operation (and can thus potentially be used to tune the other hyperparameters of the whole system).
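To make the link between the Kullback–Leibler information numbers and the difficulty of the problem concrete, the following minimal sketch computes both numbers for a pair of learned discrete distributions and plugs them into Wald's classical approximation of the expected stopping time (threshold overshoot ignored). The function names, the parameter names `err_accept_spit` and `err_block_non`, and the toy distributions are assumptions made for illustration; which of the two numbers corresponds to which symbol in the paper is not assumed here.

```python
import math


def kl_number(p, q):
    """Kullback-Leibler information number KL(p || q), in nats.

    p and q are dicts over the same discrete feature values; q must be
    positive wherever p is.
    """
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)


def expected_calls(p_non, p_spit, err_accept_spit, err_block_non):
    """Wald's approximation of the expected number of calls until a decision.

    Returns (calls for a SPIT source, calls for a NON-SPIT source).
    """
    upper = math.log((1 - err_accept_spit) / err_block_non)
    lower = math.log(err_accept_spit / (1 - err_block_non))
    drift_spit = kl_number(p_spit, p_non)  # mean log-ratio step for a SPIT source
    drift_non = kl_number(p_non, p_spit)   # mean step size (negated) for NON-SPIT
    calls_spit = ((1 - err_accept_spit) * upper
                  + err_accept_spit * lower) / drift_spit
    calls_non = -(err_block_non * upper
                  + (1 - err_block_non) * lower) / drift_non
    return calls_spit, calls_non


if __name__ == "__main__":
    # Toy learned distributions over a two-valued call feature (made up).
    non_spit = {"short": 0.2, "long": 0.8}
    for name, spit in (("dissimilar", {"short": 0.8, "long": 0.2}),
                       ("similar", {"short": 0.3, "long": 0.7})):
        calls_spit, calls_non = expected_calls(non_spit, spit, 0.01, 0.01)
        print(f"{name}: about {calls_spit:.1f} calls for SPIT, "
              f"{calls_non:.1f} for NON-SPIT")
```

The more similar the two distributions, the smaller the information numbers and the more calls must be observed for the same error probabilities.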

Our experimental evaluation verifies that our approach is feasible, efficient ("efficient" meaning that only very few calls need to be observed from a source to identify SPIT), and able to produce highly accurate results even when the generating distribution is not a priori specified but inferred from data.

Acknowledgements

Sylvain Martin acknowledges the financial support of the Belgian National Fund of Scientific Research (FNRS). Tobias Jung acknowledges financial support from a research fellowship of ULg. This work is also partially funded by EU project ResumeNet, FP7-224619.

References

[1] N. Chaisamran, T. Okuda, G. Blanc, S. Yamaguchi, Trust-based VoIPspam detection based on call duration and human relationships, in:Proc. of the 11th Int. Symp. on Applications and the Internet (SAINT),2011.

[2] D.E. Duffy, A.A. Mcintosh, M. Rosenstein, W. Willinger, Statisticalanalysis of CCSN/SS7 traffic data from working CCS subnetworks, in:IEEE JSAC, 1994.

[3] N. Eagle, A. Pentland, D. Lazer, Inferring social network structureusing mobile phone data, Proceedings of the National Academy ofSciences (PNAS) 106 (36) (2009) 15274–15278.

[4] D. Geneiatakis, C. Lambrinoudakis, A lightweight protectionmechanism against signaling attacks in a sip-based VoIPenvironment, Telecommunication Systems 36 (4) (2008) 153–159.

[5] T. Jung, S. Martin, D. Ernst, G. Leduc, SPRT for SPIT: using thesequential probability ratio test for spam in VoIP prevention, in:Proc. of 6th Int. Conf. on Autonomous Infrastructure, Managementand Security (AIMS), Lecture Notes in Computer Science, Springer,2012.

[6] P. Kolan and R. Dantu, Socio-technical defense against voicespamming, in: ACM Transactions on Autonomous and AdaptiveSystems (TAAS), 2007.

[7] M. Nassar, O. Dabbebi, R. Badonnel, O. Festor, Risk management inVoIP infrastructure using support vector machines, in: Internationalconference on Network and Service Management (CNSM’10), 2010,pp. 48–55.

Page 14: Outbound SPIT filter with optimal performance guarantees · 2013. 8. 16. · Outbound SPIT filter with optimal performance guarantees Tobias Junga,⇑, Sylvain Martina, Mohamed Nassarb,

T. Jung et al. / Computer Networks 57 (2013) 1630–1643 1643

[8] M. Nassar, S. Martin, G. Leduc, O. Festor, Using decision trees forgenerating adaptive spit signatures, in: Proc. of the 4th InternationalConference on Security of Information and Networks (SIN 2011),2011.

[9] M. Nassar, R. State, O. Festor, Monitoring sip traffic using supportvector machines, in: Proceedings of the 11th InternationalSymposium on Recent Advances in Intrusion Detection, RAID ’08,Springer-Verlag, Berlin, Heidelberg, 2008, pp. 311–330.

[10] M. Nassar, R. State, O. Festor, Labeled VoIP data-set for intrusiondetection evaluation, in: Proceedings of the 16th EUNICE/IFIP WG6.6, 2010.

[11] J. Quittek, S. Niccolini, S. Tartarelli, M. Stiemerling, M. Brunner, T.Ewald. Detecting SPIT calls by checking human communicationpatterns, in: IEEE International Conference on Communications (ICC2007), June 2007.

[12] K. Rieck, S. Wahl, P. Laskov, P. Domschitz, K.-R. Müller, Self-learningsystem for detection of anomalous sip messages, in: Principles,Systems and Applications of IP Telecommunications, 2ndInternational Conference, IPTComm, 2008, pp. 90–106.

[13] R. Schlegel, S. Niccolini, S. Tartarelli, M. Brunner, SPIT preventionframework, in: IEEE GLOBECOM’06, 2006, pp. 1–6.

[14] D. Shin, J. Ahn, C. Shim, Progressive multi gray-leveling: a voice spamprotection algorithm, IEEE Network 20 (2006) 18–24.

[15] Y. Soupionis, G. Marias, S. Ehlert, Y. Rebahi, S. Dritsas, M.Theoharidou, G. Tountas, D. Gritzalis, A. Bergmann, T. Golubenco,M. Hoffmann, SPAM over Internet Telephony Detection sERvice FinalReport, September 2008. <http://projectspider.org/documents/Spider_D4.2_public.pdf>.

[16] Y. Soupionis, G. Tountas, D. Gritzalis, Audio CAPTCHA for SIP-basedVoIP, in: Emerging Challenges for Security, Privacy and Trust, IFIPAdvances in Information and Communication Technology, vol. 297,2009, pp. 25–38.

[17] J. Sterbenz, D. Hutchison, E.K. Çetinkaya, A. Jabbar, J.P. Rohrer, M.Schöller, P. Smith, Resilience and survivability in communicationnetworks: strategies, principles, and survey of disciplines, ComputerNetworks 54 (2010) 1245–1265.

[18] A. Wald, Sequential tests of statistical hypotheses, Annals ofMathematical Statistics 16 (1945) 117–186.

[19] A. Wald, J. Wolfowitz, Optimum character of the sequentialprobability test, Annals of Mathematical Statistics 19 (1948) 326–339.

[20] C.-C. Wu, K.-T. Chen, Y.-C. Chang, C.-L. Lei, Detecting VoIP trafficbased on human conversation patterns, in: Henning Schulzrinne,Radu State, Saverio Niccolini (Eds.), Principles, Systems andApplications of IP Telecommunications. Services and Security forNext Generation Networks, Lecture Notes in Computer Science, vol.5310, Springer, Berlin/Heidelberg, 2008, pp. 280–295.

[21] Y.-S. Wu, S. Bagchi, N. Singh, R. Wita, Spam detection in voice-over-ip calls through semi-supervised clustering, in: Proceedings of the2009 Dependable Systems Networks, 2009, pp. 307–316.

[22] H. Yan, K. Sripanidkulchai, H. Zhang, Z.-Y. Shae, D. Saha,Incorporating active fingerprinting into spit prevention systems,in: Third Annual Security Workshop (VSW’06), 2006.

[23] G. Zhang, S. Ehlert, T. Magedanz, D. Sisalem, Denial of service attackand prevention on SIP VoIP infrastructures using DNS flooding, in:Principles, Systems and Applications of IP Telecommunications, 1stInternational Conference, IPTComm, 2007.

[24] G. Zhang, S. Fischer-Hübner, S. Ehlert, Blocking attacks on SIP VoIPproxies caused by external processing, Telecommunication Systems45 (1) (2010) 61–76.

Tobias Jung is currently a postdoc researcher at the Department of Computer Science and Electrical Engineering at the University of Liege, Belgium. He received his Ph.D. in Computer Science from the University of Mainz, Germany, and has previously been a postdoc researcher at the University of Texas at Austin. His research interests are prediction and intelligent control, with an emphasis on reinforcement learning and related techniques.

Sylvain Martin received his degree in Computer Science in 2001 with the highest distinction (summa cum laude) and joined the Research Unit in Networking at the University of Liege a couple of weeks later, first on the ARTHUR project of the Walloon Region and then as a fellow researcher of FNRS (Belgian science fund). He presented his Ph.D. in 2007, which investigates the potential of active networks, especially their implementation on Intel IXP network processors. Since then, he has worked as a post-doctoral researcher on the European projects 4WARD (at University of Basel), ANA and ResumeNet (University of Liege). He is currently employed as an FNRS post-doctoral researcher, with interests in programmable network devices, autonomic networks, clean slate architectures, peer-to-peer systems and the application of AI techniques to network control.

Mohamed Nassar is currently a postdoc researcher at Qatar University. He has a Ph.D. degree (2009) from Nancy University, France. He has worked as a research engineer at Inria France, then at Ericsson, Ireland. His research interests are networks and services management and security.

Damien Ernst (M’98) received the M.Sc. and Ph.D. degrees from the University of Liège, Liège, Belgium, in 1998 and 2003, respectively. He is currently an Associate Professor at the University of Liège, where he is affiliated with the Systems and Modeling Research Unit. He is also the holder of the EDF-Luminus chair on Smart Grids. His main research interests include power system control and reinforcement learning.

Guy Leduc is a full professor in the EECS department of the University of Liège, Belgium, and has been the head of the Research Unit in Networking (RUN) since 1997. He graduated as an electrical (electronics) engineer in 1983 and got his Ph.D. in computer science in 1991. His research field is computer networks, and his main research interests are Network Coordinate Systems, overlays, traffic engineering, resilience, multimedia, congestion control, and autonomic/active/programmable networks. His research unit is or has been involved in European projects such as mPlane on measurements, ECODE on cognitive networking, ResumeNet on resilient networking, ANA on autonomic networking, TOTEM on an open-source toolbox for traffic engineering, and the e-NEXT European network of excellence. He is a former chairman of the IFIP Technical Committee (TC6) on Communications Systems, a steering committee member of the IFIP Networking Conference, and an area editor of the IEEE Transactions on Network and Service Management and Elsevier Computer Communications journals.