on the performance of secure user-centric voip communication

Computer Networks 70 (2014) 330–344

Contents lists available at ScienceDirect

Computer Networks

journal homepage: www.elsevier .com/locate /comnet

On the performance of secure user-centric VoIP communication

http://dx.doi.org/10.1016/j.comnet.2014.06.0091389-1286/� 2014 Elsevier B.V. All rights reserved.

⇑ Corresponding author. Tel.: +33 299847581.E-mail addresses: [email protected] (P.A. Frangoudis),

[email protected] (G.C. Polyzos).

Pantelis A. Frangoudis a,⇑, George C. Polyzos b

a INRIA Rennes-Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes, Franceb Mobile Multimedia Laboratory, Department of Informatics, School of Information Sciences and Technology, Athens University of Economics and Business,11362 Athens, Greece

a r t i c l e i n f o

Article history:Received 10 January 2014Received in revised form 12 June 2014Accepted 16 June 2014Available online 26 June 2014

Keywords:Wireless networkingVoIPVirtual private networksNetwork securityUser-centric networking

a b s t r a c t

Motivated by the increased Wi-Fi coverage in metropolitan areas and the emergence ofuser-centric wireless access schemes, we focus on the provision of secure, user-centricvoice services and explore their potential performance-wise, by designing a VoIPcommunications scheme tailored to open-access wireless environments, but also withwider applicability, and experimenting with it to estimate its upper bounds on VoIPcapacity, under constraints posed by user-centrism; operation at low-cost and on user-controlled equipment, minimal dependence on centralized entities, and tackling specificsecurity challenges. We identify quality degradation factors and quantify their importanceby simple analysis and experimentation, showing that typical user Wi-Fi equipment cansustain a satisfactory number of concurrent secure VoIP sessions with acceptable Qualityof Experience and, at the same time, protection from malicious user activity can be offeredto access providers, while a level of roaming privacy can be guaranteed.

� 2014 Elsevier B.V. All rights reserved.

1. Introduction

The traditional view of communications has recentlybeen disrupted by the evident user empowerment in allaspects of the communication process. Traditionally, theoperator-centric view dominated, where users had a pas-sive role as service consumers. This view seems to changeand the factors that have led to this shift are numerous.

Hand-in-hand with the revolution in current Internetusage trends, where we witness a vast increase in the vol-ume and popularity of user-generated content, a new com-munication paradigm, where users have a central role, hasemerged. With the advent of low-cost, ubiquitous, andeasy to install and configure wireless equipment, but mostimportantly, protocols that operate in unlicensedspectrum, users can effectively acquire the dual role of

becoming service consumers and providers at the sametime. This fact has the potential of bringing up new disrup-tive technologies, where users can enjoy low-cost wirelessconnectivity via the infrastructure provided by a heteroge-neous crowd of micro-operators.

In this work, building on advances in user-centric wire-less access [1,2], we focus on providing user-centric voice(and multimedia) communication services in such an envi-ronment. We assume an underlying access substratewhere nomadic users enjoy community-based Internetaccess by potentially anonymous, and most probablyuntrusted, Wi-Fi hotspot operators and wish to set upend-to-end multimedia communication in a user-centricway, in the sense that critical operations such as signalingand security management are carried out on user-con-trolled software and equipment, with minimal dependenceon centralized infrastructures. It should be noted that,although our work is positioned in the context of commu-nity-based Wi-Fi access, our approach to setting up end-to-end VoIP communication is more generic and could beapplied to other wireless access infrastructures as well.

http://crossmark.crossref.org/dialog/?doi=10.1016/j.comnet.2014.06.009&domain=pdf

http://dx.doi.org/10.1016/j.comnet.2014.06.009

mailto:[email protected]

mailto:[email protected]

http://dx.doi.org/10.1016/j.comnet.2014.06.009

http://www.sciencedirect.com/science/journal/13891286

http://www.elsevier.com/locate/comnet

P.A. Frangoudis, G.C. Polyzos / Computer Networks 70 (2014) 330–344 331

We face both security and performance challenges. Trustcannot be assumed among access providers and roamingusers, thus data confidentiality cannot be guaranteed. Atthe same time, access providers need to be protected frommalicious activities on behalf of roaming users, to whichthey may be held legally liable. On the other hand, perfor-mance of VoIP over user-centric wireless networks is a keyissue, in part due to the unpredictable nature of wirelesscommunications, where delay sensitive applications likeInternet telephony are known to suffer. Poor signal condi-tions, but also contention for access to the medium andinterference brought by spectrum scarcity, dense and anar-chic Wi-Fi deployment, and poorly-configured wirelessequipment, account for that. The need for security has alsoan adverse effect: important space and processing overheadis imposed by mechanisms applied to protect communica-tions. These performance penalties are intensified by thelimitations and challenges of a user-centric solution; mini-mal dependence on centralized infrastructure and operationon low-cost, resource-constrained, user-provided equip-ment, i.e., embedded WLAN devices and mobile terminals.

We approach the above challenges performance-wise,by designing and experimenting with a secure voice com-munications scheme which makes use of tunneling tech-nologies, and evaluating service quality adopting aQuality-of-Experience (QoE)-oriented stance. Our purposeis to estimate the maximum number of simultaneous VoIPcalls of acceptable quality – as a user would perceive it –that a typical home WLAN can sustain, by measuringhow the use of security mechanisms and Wi-Fi operationaffect voice quality. The contributions of this work aresummarized below:

� We present solutions for VoIP communication servicesset up in purely user-centric fashion and operating fullyon user equipment, designed for, but not limited to, auser-centric/community-based wireless access environ-ment, and also discuss various design alternatives.� We study the tradeoff between security and perfor-

mance under the constraints posed by low-end userequipment. To the best of our knowledge, we are the firstto provide a combined study of the overheads imposed bysecurity mechanisms both at the processing and the pro-tocol level under the premise that both call endpoints areconnected to Wi-Fi links and cryptographic operationsare carried out on low-cost, off-the-shelf user equip-ment, thus shedding light on the practical performancelimits of secure user-centric VoIP communication.

This article is structured as follows: Section 2 provides areview of the state of the art in relevant research fields.Section 3 presents a user-centric secure VoIP servicedesigned for user-centric wireless access, but also dis-cusses design alternatives. In Section 4 we focus on theperformance evaluation of this service; we use a simpleanalytic model to estimate upper bounds on VoIP capacityfor our tunneling-based architecture (Section 4.2), describeour experimental methodology and testbed (Section 4.3)and present performance results (Section 4.4). In Section 5we summarize and discuss our conclusions.

2. Related work

2.1. User-centric wireless access

User-provided wireless access schemes have recentlyreceived research and commercial attention. Based on theprivate contributions of individuals who operate Wi-Fiequipment, architectures and systems are being proposedwith the aim of building resource sharing communitiesto achieve wide wireless coverage [1]. In this section weoverview approaches in this direction, since our designassumes that such a scheme will be used for Wi-Fi access.

Large wireless communities such as guifi.net [3] and theAthens Wireless Metropolitan Network [4], counting thou-sands of participants, are nowadays a reality. Various com-munication services, and, notably, intra-community VoIP,are run on top of them, while at the same time they actas testbeds for research and experimentation [5]. Origi-nally, research in the area focused on the technical aspectsof building wireless community networks, but attentionsoon shifted towards socioeconomic and incentive aspects.A critical aspect of user-provided wireless networks is todesign mechanisms to encourage contribution while limit-ing attacks by selfish users who aim at free riding. In ourprior work [2], we proposed a fully decentralized schemeto this end. Our multimedia communications architecturewas designed with this scheme in mind.

FON [6] has followed a different architectural approach.It acts as a mediator for the development of a Wi-Fi sharingcommunity, centrally dealing with user authentication andaccounting. The role of such mediators as community pro-viders is modeled by Biczók et al. [7,8]. They analyze theirinteractions with users and ISPs in global-scale wirelesscommunity networks and explore the space of availableparameters (e.g., roaming cost, ISP’s profit share) to deter-mine the benefits of each player when joining the commu-nity. The authors present interesting results regarding therole of ISPs: Arguing that ISP endorsement is importantfor the global scaling of wireless communities, they findthat depending on parameters set by the mediator, theywill either fully support or abandon (i.e., prohibit Wi-Fisharing) the community. This conclusion appears to be clo-sely related to the terms of use adopted by ISPs regardingbroadband connection sharing over Wi-Fi.

Two significant issues pertinent to wireless communi-ties are studied by Manshaei et al. [9]. First, they studyhow initial community network coverage and user payoffsand fees affect the evolution of the community. Second,they focus on the competition between licensed wirelessaccess providers and community-based ones, which is animportant step towards answering whether wireless com-munities can be a viable alternative (or complement) tolicensed cellular networks.

Ai et al. [10] focus on a critical usability aspect; theirscheme only requires software updates at the client sideand no firmware upgrades at the AP side. However, theirdesign depends on a central server.

The legal aspects of user-provided wireless accessshould not be neglected. MacSíthigh [11] discusses howthe adoption of open wireless access is hindered by a

332 P.A. Frangoudis, G.C. Polyzos / Computer Networks 70 (2014) 330–344

diverse set of legal provisions. Legal issues can influenceuser participation and the adoption of community wirelesstechnologies and should be carefully considered in systemdesign. Often, the terms of service of ISPs explicitly statethat sharing one’s broadband Internet connection is pro-hibited, although sharing-friendly ISPs exist [12]. Also,sharing one’s Internet access with anonymous visitorsmay hold the host liable for potential malicious activitiesby them. Our technical approach to this problem (see Sec-tion 3) is based on tunneling: Users visiting untrusted Wi-Fi hotspots can tunnel their Internet traffic through VirtualPrivate Network (VPN) gateways. On the one hand, theyare protected from eavesdropping and, on the other hand,Wi-Fi owners are not held responsible for illegal activitiesby visitors, since the attacker can be traced back to the VPNgateway. Sastry et al. [13] and Heer et al. [14,15] presentsimilar solutions.

2.2. VoIP capacity in wireless networks

A significant body of research addresses performanceissues of VoIP services over wireless networks, and IEEE802.11 in particular. Garg and Kappes [16] identified thateven for low-bitrate codecs, such as G.729a (8 Kbps), thenumber of VoIP sessions that a Wi-Fi cell can simulta-neously sustain is surprisingly small. This is due to the factthat a VoIP stream is typically composed of small packetsand the overhead for their transmission is very high. Theypropose a simple analytic model to calculate VoIP capacity,which we adapt for our secure peer-to-peer VoIP scheme(see Section 4.2) to estimate an upper bound on the num-ber of VoIP sessions we can sustain.

In the same spirit, Hole and Tobagi [17] propose a sim-ilar simple mathematical model of VoIP capacity and use aQoE-driven evaluation methodology. They validate thefindings of Garg and Kappes using simulation and expandon various codec settings and scenarios, putting morefocus on delay constraints and quality requirements ofvoice sessions.

With the advent of IEEE 802.11e [18], which added traf-fic prioritization extensions, efforts have been put in thedirection of improving VoIP capacity by appropriately tun-ing parameters such as the duration of a transmissionopportunity (TXOP), minimum contention window value(CWmin) and packet buffer size [19,20].

2.3. Security vs. performance

Communications increasingly necessitate security mea-sures, especially when carried out over untrusted wirelessnetworks, as is the case for roaming users visiting foreignWi-Fi hotspots. Typically, traffic is protected by means ofVPN mechanisms, such as IPsec [21] or OpenVPN [22].However, security mechanisms incur performance over-head, which is particularly important for delay- andloss-sensitive VoIP communication. In our work, we areparticularly interested on the effects of security on VoIPperformance, especially when both call endpoints areattached to Wi-Fi links and security functionality is imple-mented on resource-constrained home Wi-Fi equipment;

such scenarios have not received significant researchattention, to the best of our knowledge.

Voice over IPsec in wireline networks is experimentallystudied by Barbieri et al. [23], where a header compressionmethod called cIPsec is also proposed and evaluated. IPsecencryption and packetization overhead are studied viaanalysis and simulation by Xenakis et al. [24], while Milt-chev et al. [25] experimentally compare the performanceof IPsec and application layer security protocols.

A work more closely related to ours is due to Nascimen-to et al. [26], who present experiments on the effects ofIPsec the quality of VoIP over wireless networks. In partic-ular, they measure the maximum number of simultaneousVoIP sessions that can be supported in a single wirelesscell, comparing two different encryption algorithms (AESand 3DES) and two different wireless technologies (IEEE802.11b and Bluetooth). Our work however focuses on adifferent architecture and configuration.

2.4. VoIP quality assessment

Numerous models for assessing the quality of voice ser-vices have been proposed. Jelassi et al. [27] provide adetailed review of relevant approaches.

Subjective methodologies [28] involve experimentswith human subjects who rate voice (or conversational)quality. Objective methodologies such as PESQ (PerceptualEvaluation of Speech Quality) [29], on the other hand,attempt to estimate user-perceived quality by objectivemeasurements, without involving human subjects: an ori-ginal input signal is compared to a degraded output signalas a result of its transmission through a communicationsystem. PESQ derives quality ratings using psychoacousticfundamentals.

While fairly accurate, methodologies based on signalcomparisons fail to capture the effects of various operatingparameters (e.g., codec settings, dejitter buffer implemen-tation) and phenomena in the end-to-end path (networkdelay, packet loss, delay variation). Parametric modelshave thus been proposed, aiming at estimating user-per-ceived voice quality and its dependence on such parame-ters. A widely used parametric model for conversationalvoice quality estimation is the ITU-T E-model [30]. It isbased on modeling the results from a large number of sub-jective tests, on a wide range of transmission parameters.The output E-model is a scalar quality rating value knownas the ‘‘Rating Factor, R’’.

Since the E-model involves fairly complex expressions,Cole and Rosenbluth [31] have proposed a methodologyto reduce the E-model to transport-level metrics andderive a simplified E-model approximation. We haveadopted this methodology, since our purpose is to quantifythe effects of various phenomena pertinent either to thenetwork topology (i.e., the fact that we focus on a scenariowhere both call endpoints are attached to Wi-Fi links), net-work devices in the end-to-end path (resource-constrainedresidential Wi-Fi routers), or to security services we apply(high-overhead tunneling mechanisms) and correlatethem with perceived voice quality; codec settings andother environmental factors were considered fixed or notstudied in our work. In Appendix A, following the approach


of Cole and Rosenbluth, we derive an updated approxima-tion of the E-model, based on the most recent version ofthe standard.

2.5. Our prior work

In our prior work [2], we briefly reported on experimen-tal results obtained for a user-centric, tunneling-basedVoIP service, albeit in a different setup, i.e., with a differentWi-Fi standard (IEEE 802.11b) and different tunnelingtechnologies (L2TP/IPsec), and our evaluation mainlyfocused on the performance effects of the Wi-Fi sharingprotocol we had proposed. Here, we study more genericsettings, providing design alternatives and a more detailedexperimental analysis, with an in-depth look into the per-formance implications of various quality degradation fac-tors, and backed up by a theoretical discussion onVoWLAN capacity.

1 It should be noted that, at the same time, this device may be offeringInternet access to other stations, e.g., other wireless community members.

3. Service architecture

3.1. Environment, design goals and threat model

We target an open-access wireless networking scenario,where users enjoy Internet connectivity over communityWi-Fi hotspots, expecting VoIP and mobile multimedia toemerge as popular applications and complement GSM/3Gservices in citywide areas, where Wi-Fi and cellular nowa-days have comparable coverage. We design a peer-to-peersecure VoIP service tailored to such a networking environ-ment, drawing our inspiration and some basic assumptionsfrom our prior work [2], where we presented a decentral-ized community-based Wi-Fi sharing scheme based on ser-vice reciprocity. Our design is more generic, however, andcould be applied independently of the underlying accesssolution: the same principles would apply for usersattached to commercial Wi-Fi hotspots or Campus WLANs,since access is decoupled from other services, and one canconsider that what differentiates access solutions from oneanother are the service negotiation and accounting proce-dures, with services (e.g., VoIP) being deployed indepen-dently on top of the access substrate.

Summarizing our design goals, we focused on (i) opera-tion at low cost, (ii) purely on user-provided equipment,(iii) with minimal dependence on centralized infrastruc-ture, (iv) offering secure communication and (v) privacyenhancements. We quantify the tradeoff among security,low cost and performance in Section 4.

We build on the basic assumption that between anaccess provider (visited AP) and a consumer (roaminguser), no trust is assumed and no form of cooperationand interaction is expected apart from the basic Wi-Fi pro-tocol exchange and routing user traffic to/from the Inter-net. We further assume that a user’s trust anchor is theequipment installed at his premises, i.e., his home Wi-Firouter. There is the risk for a visitor that the traffic for-warded by the AP on his behalf is intercepted, and, onthe other hand, the visitor may engage in malicious acts,masking behind the provider’s home network.

3.2. A tunneling-based scheme

To tackle these threats, a design based on tunneling is anatural approach: after being granted wireless access, theuser sets up a VPN tunnel to his home network andsecurely relays all his Internet traffic through it, protectingit from eavesdropping. For security purposes, it is reason-able to assume that a visited AP only forwards a user’s traf-fic from/to a specific IP address (or a restricted numberthereof), which serves as the visitor’s trusted gateway.There are various technical means to achieve this: forinstance, this address can be explicitly negotiated betweenthe AP and the client after wireless access has been suc-cessfully established, or the AP can simply allow trafficonly to/from a single destination across a user session, thatof the first outgoing IP packet.

This approach mandates that the user operates atrusted VPN gateway. To lower infrastructure cost, offer auser-centric feel, and considering the capabilities of mod-ern home Wi-Fi equipment, we have proposed that thisVPN functionality be built into the firmware of the user’shome router.1 Obviously, for improved performance, and ifthe user can afford it, he can operate the VPN gateway onseparate equipment, and even outside his home network.

With this approach, confidentiality of the user’s traffic isensured, while the service provider cannot be held liablefor illegal activities carried out by the visitor, since poten-tial attacks, download or distribution of illegal content, orother malicious behavior will appear to have been per-formed from the visitors’s home network (or the networkwhere his VPN gateway is located).

3.3. Rendezvous and call setup

Given that visitors have set up secure tunnels to theirhome networks, we now show how a voice (or multime-dia) call can be set up between two users accessing theInternet via foreign Wi-Fi hotspots. In order to spare usersthe need for extra equipment acting as a VPN gateway, wehave built this functionality into the AP’s firmware. Ourbasic goal is to demonstrate that a secure multimedia ses-sion can be set up in a user-centric manner, with minimaldependence on centralized infrastructures. Fig. 1 showsthe proposed scheme.

A major challenge one has to tackle is for the caller todiscover where to initiate the call to. Call endpoints arereachable via their home VPN gateways. Thus, the callerneeds to discover the IP address of the callee’s home. Publichome IP addresses are typically assigned by the users’ ISPsand, more often than not, they are dynamically allocatedfrom the ISP’s DHCP pool. Here we present a solution forpeer discovery based on the exchange of GSM SMS textmessages, based on the following assumptions:

� At any time, a user is aware of the IP address of hishome VPN gateway, which he can communicate to itspeer. This is necessary, not only for user discovery, butalso to set up a tunnel in the first place.

Fig. 1. A user-centric secure VoIP service for roaming users.


� A user who wishes to initiate a call to the other endknows his peer’s GSM mobile phone number. This is areasonable assumption, given that either the two usersknow each other from prior contact and haveexchanged such information, or the caller is aware ofthe callee’s phone number via an external channel. Inany case, this model follows the traditional GSM para-digm. Also, it is reasonable to assume that cellularphone owners have active subscriptions with GSMoperators, and are thus reachable over GSM.

This is a point where we relax our decentralizationassumption: we rely on the centralized GSM infrastructure,but only for reasons of service discovery, and exploit anexternal communication channel that is ubiquitous andalready in use by peers.

Peer discovery works as follows: suppose that users U1and U2 wish to establish a voice call. U1 and U2 are asso-ciated with APs V1 and V2 respectively and tunnel all theirInternet traffic to their home gateways, H1 and H2 respec-tively. In order to initiate the call, U1 sends an SMS to U2,informing him of his home gateway’s (H1) IP address. TheSMS can also convey other session parameters, such as theport on which his gateway can receive a media stream,codec parameters, and security-related information. (SeeSection 3.6 for a discussion on achieving end-to-end secu-rity.) Then, U2 responds directly with the voice stream,which is first tunneled to H2, then routed to H1, and,finally, tunneled to U1.

It should be noted that the voice application at the U1side should be prepared to receive an incoming stream atthe agreed upon port. Also, the appropriate state shouldhave been set up at both home gateways so that incomingpackets with the peer’s home gateway’s IP as the sourceare forwarded to the appropriate tunnel endpoint. (A homeVPN gateway may be managing multiple tunnels to roam-ing stations at the same time.)

Various alternative technologies exist for implementingtunnels. In our measurement-based evaluation we haveused OpenVPN [22], a popular and easy to deploy SSL/TLS-based solution. In our earlier experiments [2], though,we had applied an L2TP/IPsec-based scheme [32], which ismore complex (protocol-wise) and with higher per-packetspace overhead. VPN tunneling imposes an important data

and processing overhead, the effects of which on voicequality are studied in Section 4.

3.4. Roaming privacy

One should note that our specific tunneling-baseddesign offers users a level of location privacy. Since eachcall endpoint only discloses the IP address of his home net-work, it is not possible for the other end to infer theapproximate location of his peer. This comes with a perfor-mance penalty, though, since each packet needs to beencrypted and decrypted twice. This overhead is quantifiedin Section 4.

3.5. Design alternatives

Rendezvous between two peers who wish to communi-cate can be carried out using various mechanisms and pro-tocols and each has advantages and disadvantages in itsown right. In any case, users need to discover each other’sservice parameters, namely the IP address and port wherethey listen for incoming multimedia traffic and potentiallynegotiate security parameters to achieve an end-to-endencrypted channel. In our system, we have opted for callinitiation based on the exchange of GSM SMS messages,but other options are possible.

3.5.1. Dynamic DNSIn a DNS-based alternative, instead of initiating a call

using the callee’s mobile phone number, if we assume thata user’s home IP address can be resolved from a DNS name,a caller can directly stream his data to his peer’s home. Theprocedure then follows the same way. It should be noticedthat since VPN gateways would normally reside in users’home networks, typically accessible via a dynamic IPaddress, user equipment should dynamically update thename-IP address binding. Many dynamic DNS servicesexist [33] and most home network equipment have thecapability to perform such updates built into theirfirmware.

Relying on DNS hides one vulnerability similar to whenusing GSM SMS text messages, since it is still implied that(i) users trust the Dynamic DNS name service, and (ii)name servers that peers use to resolve each other’s hostnames are trusted. These conditions have to do with howcalls are set up: the caller tunnels a request to resolvethe callee’s name to his home name server through hisVPN gateway. If this name server is under the control ofan adversary, it is possible that the callee’s name resolvesto a host controlled by the adversary, who could then eas-ily launch a man-in-the-middle attack. An end-to-endsecurity scheme where users authenticate each other isnecessary to combat such threats.

3.5.2. Using the session initiation protocol (SIP)A multimedia call could also be established using the

Session Initiation Protocol (SIP) [34]. Peer discovery, how-ever, is again an important issue. Users are identified usingSIP URIs, which mandate the existence of SIP registrarswhere users are listed and, typically, SIP proxies in eachuser’s domain. In the decentralized scenario we study, this


would mean that either each peer should maintain its ownSIP server or that some peers should host this service forthe community and others should trust these peers andagree upon using their service. Typically, the caller com-municates with a SIP proxy in its domain, which usesDNS to resolve a SIP proxy in the callee’s domain. DNS pro-cedures can also be used to discover service parameters(such as the transport protocol to use or the port the otherserver listens to). Locating SIP servers is specified in RFC3263 [35].

Applying a SIP-based solution has some advantages:first, it is a standards-compliant approach, with SIP beingheavily used and tested. Also, many SIP phones or imple-mentations of SIP user agent software are readily available.On the downside, the need for every peer to operate SIPservers increases management effort and home gatewaycomplexity, while it is still necessary to rely on DNS.

3.5.3. Peer-to-peer SIPThe problem of locating call endpoints in a peer-to-peer

manner is being studied within the P2PSIP IETF WorkingGroup [36]. For reasons of scalability, resource limitationsthat do not allow for the installation of SIP servers, but alsofor reasons of trust and reliability, e.g., when an organiza-tion or individual does not want to rely on external central-ized service infrastructure, a distributed alternative to SIPis being standardized. The core concept is to rely on aP2P network (a Distributed Hash Table (DHT) such asChord [37]) built on the equipment of the users of the sys-tem to distribute the core functionality of SIP, such as userregistration, proxying, and locating users. The REsourceLOcation And Discovery (RELOAD) protocol has been spec-ified by the Working Group for signaling in a P2PSIP net-work. Cirani et al. [38], on the other hand, have proposeda discovery architecture similar in nature, but without uti-lizing the RELOAD protocol. Again, instead of using SIPservers to locate call endpoints, a DHT is used as the IPaddress/port lookup structure.

Interestingly, the security aspects of P2PSIP architec-tures have recently received attention. For a thoroughreview of relevant issues, the reader is referred to the workof Tuceda et al. [39].

Applying the above approaches in a decentralized, user-centric open access scheme requires that either a networkof special nodes form the DHT, or, some of the home gate-ways join it. This, in turn, implies the formation of a com-munity where the incentives for participation are unclear:in order to build a reliable DHT on top of user Wi-Fi equip-ment, cooperation is necessary on the peers’ behalf, whowould have to maintain a set of records and respond tolookup requests for multimedia calls they are not involvedwith. A mechanism to stimulate cooperation at the DHTlevel would thus be necessary.

3.6. End-to-end security and avoiding man-in-the-middleattacks

Providing end-to-end security in a peer-to-peer mannerin the multimedia communications architecture that wehave proposed is yet to be tackled. One would notice thatalthough the two endpoints of the multimedia call have

set up secure VPN connections with their trusted homegateways, the path between the two gateways is stillunsecured.

If we assume that the IP address of the home gateway ofthe caller has been communicated to the callee via a GSMSMS, it is straightforward for the GSM operator to sniff onthe multimedia call, performing a simple man-in-the-mid-dle (MITM) attack: it can modify the contents of the SMSpointing to a gateway of its own. The callee responds withthe multimedia stream that comes unsecured from theVPN gateway of the callee to the gateway that belongs tothe GSM operator. Then, the operator forwards the trafficto the caller and none of the call endpoints is aware thattheir traffic is being intercepted.

To combat such an attack and achieve end-to-endsecure and private peer-to-peer multimedia communica-tion, the unprotected path of the call has to be secured. Asimple means is that of conveying the caller’s public keyto the callee during the call setup phase (GSM SMSexchange), and using it to exchange a shared key for trafficencryption. However, the callee has to verify that the pub-lic key of the caller has not been changed by an adversary(e.g., the GSM operator) performing a MITM attack. Thus,after communication has been set up and assuming a voiceor video call, the two parties can use some form of vocalacknowledgment to verify that the public key exchangedis the appropriate. This way, an end-to-end secure VoIPor video call can be set up without resorting to trusted cer-tification authorities.

Methods and systems that implement this type of in-band key exchange have been proposed in the literature.The ZRTP [40] protocol uses public key cryptography with-out resorting to a centralized PKI and, thus, without theneed for certificates. During call setup, the two parties per-form Diffie–Hellman key exchange in the media path andset up a shared secret. Then, a Short Authentication String(SAS) based on the shared secret is derived, which appearson the displays of the participants’ devices. To ensure thata man-in-the-middle attack is not being performed, bothusers should confirm the same SAS value.

All ZRTP messages are multiplexed with the mediastream, making it independent of the signaling protocolused for call setup (e.g., SIP). The fact that it is purelypeer-to-peer makes is suitable for application to our archi-tecture. Vocal verification of a key exchanged between thetwo parties engaging in a phone conversation is a conceptpresented as early as 1996 [41].

One should note the performance implications of apply-ing such end-to-end security schemes to our servicedesign. While the two peers maintain VPN tunnels withtheir home networks to protect themselves from untrustedvisited APs and suffering the overhead of tunneling, theyalso have to encrypt their data end-to-end before tunnel-ing them home. Therefore, there is redundant use of secu-rity mechanisms. Along the mobile user-home gatewaypath data are encrypted twice: application data are firstencrypted with the shared key established between users(call endpoints) when the media session was initiated.Then, data are encrypted again and tunneled home, wherethey are decapsulated, sent to the peer’s home gateway,encrypted and tunneled to the peer, where they are


decapsulated. The resulting data are eventually deliveredto the application, where they are decrypted using theshared key. It is important that future research effortsemphasize on removing redundancy across security layers,and optimizing the end-to-end media delivery path.

4. Performance evaluation

4.1. Motivation and approach

The inherent performance limitations of VoIP overWLAN, coupled with shifting demanding security opera-tions to low-end user equipment, raise questions aboutthe capacity of the proposed user-centric communicationschemes. Depending on the scenario, there is a criticalnumber of concurrent VoIP connections, beyond whichuser experience dramatically deteriorates for all connec-tions, and communication collapses.2 We focus on identify-ing this threshold for different scenarios and study itsdependence on various parameters. Such parameters, whichaffect voice quality, and, in turn, VoIP capacity, range fromthe operation of wireless communication technologies andthe capabilities of user equipment, to codec configuration(e.g., codec type, packet payload size) and the applicationof security mechanisms.

Being able to identify capacity limits and the effects ofquality degradation factors provides understanding aboutthe capabilities and limits of user-centric VoIP communi-cation, and is also important for network planning andmanagement purposes. For example, in our design, auser-provided Wi-Fi router can serve as a trust anchor(VPN gateway) for multiple users, who can simultaneouslytunnel their VoIP traffic through it. Based on the estimatedVoIP capacity, the owner of this device can apply appropri-ate VPN admission control schemes to limit load and main-tain satisfactory quality levels for the VoIP calls which aretunneled through it. Also, he can opt for upgrading itsequipment based on the target number of users that it issupposed to serve, or take any appropriate managementdecisions to provide QoE guarantees and improvements.Such mechanisms and actions are outside the scope of thisarticle.

We selected to apply a mix of analysis and experimen-tation to evaluate the performance of our scheme: we firstestimate upper bounds on the number of VoIP calls thatcan be supported in a community-based Wi-Fi accessscheme using a simple throughput model, and then pres-ent experimental results, where we derive the same capac-ity metric carrying out user experience measurements forvarious settings, quantifying the effects of different qualitydegradation factors.

These two methodologies have complementary advan-tages: on the one hand, with a simple analytic model, wecan study the effects of multiple parameters, getting a bet-ter understanding of how the underlying networking tech-nologies are related with VoIP capacity, without resorting

2 In the following sections, the term VoIP capacity denotes the maximumnumber of simultaneous VoIP connections that can be handled in a specificscenario.

to time consuming testbed measurements. On the otherhand, since the theoretical performance model cannot cap-ture significant aspects, such as the effects of CPU-expen-sive VPN operations on user experience, and cannotdirectly provide delay and loss estimates based on whichQoE is calculated, we perform testbed experiments, at thesame time validating our theoretical results.

4.2. Upper bounds on VoWLAN capacity

4.2.1. AnalysisPrior work [16] reveals that even though the bitrate of a

VoIP call may be small, the overhead imposed by the IEEE802.11 PHY and MAC mechanisms and packet headers issuch that an unexpectedly low number of concurrent VoIPsessions can be sustained by a Wi-Fi cell. In this section, weadapt the simple analytical model proposed by Garg andKappes [16] for our scenario which involves two wirelesslast hops, to estimate an upper bound on the number ofconcurrent voice calls of acceptable quality. Here we focuson the effects of the PHY and MAC layers, as well as thepacket overhead imposed by VPN mechanisms.

The IEEE 802.11 MAC protocol typically applies the Dis-tributed Coordination Function (DCF), a CSMA/CA mecha-nism for arbitrating channel access. DCF dictates that fora station to transmit a packet, it should sense the mediumidle for a specified time duration called DCF interframespace (DIFS). If so, the station enters a contention phase,where it senses the medium for a random number of slots.For each idle slot, it decrements a counter and transmis-sion starts when the timer reduces to zero. If the mediumis sensed busy at a contention slot, the station ‘‘freezes’’ itscounter and needs to sense the medium idle again for aDIFS period before reentering the contention phase. Thenumber of slots the medium should be sensed idle isdrawn uniformly at random within a Contention Window(CW). It is possible that the backoff counters of two stationsreach zero at the same slot. The stations will transmit andcollision will occur. In a collision event, which is identifiedby the lack of an acknowledgment during a short periodafter the transmission, called the short interframe space(SIFS), stations retry transmission by drawing a new back-off counter value, doubling the size of the CW, after theyhave waited for a DIFS period. A retry limit is specified,after which a frame is considered dropped.

The above procedure imposes significant MAC-layeroverhead, which is more evident when small packets aretransmitted, as is typically the case for VoIP services. Also,for each transmission, there is some physical layer over-head: a preamble used for synchronization and the Physi-cal Layer Convergence Protocol (PLCP) header transmittedat a low rate precede each frame. Two preamble typesare specified. The short preamble is 72 bits and is transmit-ted at 1 Mbps, but is not compatible with legacy IEEE802.11 systems, therefore the long preamble is typicallyused in IEEE 802.11b. When using the short preamble,the PLCP header (48 bits) is sent at 2 Mbps and the totalPHY layer overhead adds up to 96 ls. In contrast, usingthe long preamble (144 bits), the header (48 bits) is trans-mitted at 1 Mbps and the total PHY layer overhead is192 ls. In a Wi-Fi BSS operating in pure IEEE 802.11g mode


(when no legacy IEEE 802.11b devices exist) the physicallayer overhead adds up to 20 ls.

For IEEE 802.11b and IEEE 802.11g, fixed PHY, MACand network (IP) layer overhead is presented in Table 1,where frame transmissions, except for the preambleand the PLCP header, are carried out at 11 and54 Mbps, respectively. Depending on the higher-layerprotocols used, overhead due to protocol headersincreases.

A simple throughput model for our setup is given [16]by

S ¼ TP

TP þ TOverhead� R; ð1Þ

where TP is the time required to transmit the audio pay-load at rate R and TOverhead is the time overhead due toPHY, MAC and other higher-layer protocols. Our systemhandles constant bitrate, bidirectional flows. Let Ra denotethe bitrate of a voice call. For G.729a, Ra ¼ 16 Kbps (8 Kbpsfor each direction). The total number of simultaneous voicesessions is thus given by

nmax ¼TP � R

ðTP þ TOverheadÞ � Ra

� �; ð2Þ

where

TOverhead¼2�ðTDIFSþTDCFþTSIFSþTACKþTHþ2�TPHYÞþTP: ð3Þ

TH stands for the overhead imposed by MAC, IP, UDP, RTPand potential security related headers (OpenVPN or L2TP/IPsec). Note that TOverhead is the time overhead for two wire-less transmissions, therefore Eq. (3) also includes the ‘‘seri-alization’’ time (TP) for retransmitting the packet payloadover the second wireless hop.

The only component in the above equation that varieswith the number of stations and traffic is the overheadimposed by the DCF mechanism. In IEEE 802.11b, andunder various assumptions, Garg and Kappes [16] haveapproximated it as TDCF ¼ 8:5� TSLOT þ TW � Pc , whereTW ¼ TPHY þ TH þ TDIFS þ TSIFS þ TP is the time wasted for apacket that has suffered collision and Pc is the collisionprobability calculated as Pc ¼ 0:03. In the following results,

Table 1Fixed overheads for the transmission of an IP datagram over IEEE 802.11. A variab

Overhead Size (bit) IEEE 802

PHY 192Slot time (TSLOT ) 20MAC header 272 24.7SIFS 10DIFSa 50ACK 112 10IP header 160 14.5UDP header 64 5.8RTP header 96 8.7Security (OpenVPN)b 680 61.8Security (L2TP/IPsec)b 736 66.9

a TDIFS ¼ TSIFS þ 2� TSLOT :b The space overhead due to tunneling varies, since padding is added to unenc

byte unencrypted IP datagrams carrying 20 bytes of audio payload, a UDP and a

we make the following simplifying approximations toderive an upper bound on VoIP capacity:

� The time a station spends waiting for the medium tobecome idle is composed only of idle slots countedduring the backoff phase. Namely, we ignore the casewhen a station freezes its backoff counter whensensing a transmission and the DIFS time that it has tosense the channel idle before restarting its backoffprocedure. Since the transmission of a VoIP packetoccurs rarely (e.g., once every 20 ms) and occupieslittle time (approximately 34 ls for a 20-byte audiopacket, including IP/UDP/RTP protocol headers,using IEEE 802.11g), it is reasonable to assume thatwhen operating below capacity, and given that VoIPtraffic sources are not synchronized, the probabilitythat a packet is generated during an ongoingtransmission is low. With this approximation, we getTDCF � CWmin

2 � TSLOT ¼ 7:5� TSLOT for IEEE 802.11g. Itshould be noted that this approximation becomes lessaccurate as the size of packets increases (e.g., due tothe space overhead imposed by security mechanisms).� There are no collisions when operating below capacity.

We numerically evaluate Eq. (2) and compare the max-imum number of simultaneous voice connections that canbe achieved when using unencrypted VoIP streams withthe case when they are secured with OpenVPN or L2TP/IPsec. In the latter cases, the total overhead due to protocolheaders is shown in Table 1. (Details on the structure of aVPN-secured packet are discussed in Section 4.3.3.) Wealso compare the use of G.729a with G.711, in each casesending 20 ms of audio payload per packet (20 vs. 160bytes). G.711 trades bandwidth requirements for slightlybetter audio quality. For example, for plain unencryptedG.729a (bitrate Ra ¼ 16 Kbps) over IEEE 802.11g(R ¼ 54 Mbps), where TH ¼ 272þ160þ64þ96 bits

54�106 bps¼ 10:96 ls,

TPHY ¼ 20 ls, TDCF � 7:5� TSLOT ¼ 67:5 ls, TSIFS ¼ 10 ls,TDIFS ¼ 28 ls, TACK ¼ 2 ls, and TP ¼ 2:96 ls (total overheadadding up to TOverhead ¼ 319:88 ls), the approximated VoIP

capacity is nmax ¼ TP�RðTPþTOverheadÞ�Ra

j k¼ 30 calls. The respective

number for G.711 is 27 calls. Our results are summarized inTable 2.

le delay is also introduced due to the DCF mechanism.

.11b (@11 Mbps) (ls) IEEE 802.11g (@54 Mbps) (ls)

20951028231.21.812.613.6

rypted data based on their size. The above values refer to the case for 60-n RTP header.

Table 2Estimated maximum number of simultaneous VoIP calls in our wireless-to-wired-to-wireless scenario.

G.729a G.711

IEEE 802.11ba – Unencrypted 7 6IEEE 802.11b – OpenVPN 6 5IEEE 802.11b – L2TP/IPsec 6 5IEEE 802.11gb – Unencrypted 30 27IEEE 802.11g – OpenVPN 28 25IEEE 802.11g – L2TP/IPsec 28 25

a For IEEE 802.11b, we use the approximation of Garg and Kappes [16],where TDCF ¼ 8:5� TSLOT þ Pc � TW .

b For IEEE 802.11g, we use TDCF ¼ CWmin2 � TSLOT .

10 20 30 40 500

50

100

150

200VoIP capacity for increasing audio payload

Num

ber o

f VoI

P se

ssio

ns

Audio payload (ms) per packet

G.729a−11bG.729a−VAD−11bG.711−11bG.711−VAD−11bG.729a−11gG.729a−VAD−11gG.711−11gG.711−VAD−11g

Fig. 2. Evolution of VoIP capacity as the audio payload per packetincreases, for the G.729a and G.711 codecs, also with the application ofVAD for silence suppression. In the experiments that follow, we have setthe audio payload per packet to 20 ms.

4


4.2.2. Silence suppressionA technique that can significantly improve capacity at

the expense of implementation complexity is silence sup-pression by means of Voice Activity Detection (VAD). Inthis case, packets are not transmitted if no voice activityis detected, thus reducing bandwidth demands. When thestart of a silence period is detected, the transmitter signalsthe receiver with special packets so that the latter gener-ates comfort noise. During a talk-spurt, packets are sent ata constant rate.

The performance benefits of using VAD cannot be easilyquantified, because they heavily depend on the actualvoice activity during a session. A simplifying assumptionwould be that VAD reduces required bandwidth by 50%.However, studies indicate that a voice activity detectorcreates silence and activity periods of different durations[42,43], thus the Voice Activity Factor (VAF),3 which deter-mines the required bandwidth, may be less than 0.5.

ITU-T Recommendation P.59 [44] describes a method togenerate artificial conversational speech based on a 4-stateMarkov model. It reports the temporal parameters of con-versational speech, averaging the results of prior studieswhich where based on the analysis of speech samples(among which the popular work of Brady [42]). In particu-lar, the reported average duration of a talk-spurt is 1.004 sand the respective duration of a silence period is 1.587,yielding a VAF of a ¼ 0:38. Eq. (2) then becomes

nmax ¼TP � R

ðTP þ TOverheadÞ � a� Ra

� �: ð4Þ

As shown in Fig. 2, under the assumptions that (i)a ¼ 0:38, (ii) the G.729a codec is used, (iii) each packet car-ries 20 ms of audio payload, and (iv) IEEE 802.11g at54 Mbps is used, applying VAD techniques improves VoIPcapacity from 30 to 81 simultaneous calls. For the samesettings, IEEE 802.11b (11 Mbps) capacity could improvefrom 7 to 18 calls.

4.2.3. The effects of the payload sizeA further option to increase VoIP capacity is to increase

the size of the audio payload, for instance by packing mul-tiple samples in a single packet. However, increasing audiopayload can arguably make quality worse, due to increased

3 The Voice Activity Factor, denoted as a, is the ratio of the time thesource is active to the total time.

delay, loss and jitter. Most commercial implementationsuse small payload sizes (10–30 bytes). Fig. 2 shows theincrease in voice capacity as the audio payload sizeincreases in our scenario which involves two wirelesshops. Note that while G.729a applies compression suchthat 10 ms of audio are encoded using 10 bytes, forG.711, 10 ms of audio expand to 80 bytes.

4.3. Experimental methodology

Our calculations in the previous section cannot capturethe effects of cryptographic operations on VoIP capacityand involve simplifications about MAC layer behavior. Toaddress these issues, but also to offer a user-centric viewof how voice quality is perceived and study how transportlevel characteristics are associated with the achieved QoE,we performed a set of testbed experiments.

Again, we wish to estimate the VoIP capacity of oursecure tunneling-based architecture when critical securityfunctionality is built into typical home Wi-Fi devices.

The distinctive characteristics of our approach are that:

� Both call endpoints are assumed to be attached to(potentially community-operated) Wi-Fi APs.� Users tunnel their traffic to trusted VPN gateways. In

our case, these gateways are collocated with each user’shome Wi-Fi router (built into its firmware).

Thus, the requirements for a typical home WLAN AP arethe following:

� Route mobile client’s traffic using NAT.4

In a typical WLAN setting, wireless clients are given private IPaddresses via DHCP and the AP uses NAT to route traffic from/to theclients. In our architecture, most APs are expected to be connected with asingle DSL line and, thus, have only one public IP address.

Fig. 3. Testbed setup for our VoIP quality measurements. Time synchro-nization between measurement endpoints is carried out over Ethernet.


� Act as the home VPN gateway for its owner,5 while thelatter is visiting other (untrusted) APs. Encryption mech-anisms and packet expansion due to additional headerscause processing and data overhead.

Our experimental methodology and testbed setup haveto consider the above requirements and study their com-bined effect on VoIP performance.

4.3.1. Testbed descriptionVoIP call endpoints are laptops running Linux 2.6.38.

We have implemented a set of measurement tools to gen-erate constant-bitrate bidirectional UDP streams to emu-late VoIP traffic, and have selected OpenVPN [22] toimplement VPN tunnels, due to its configuration simplicityand popularity.

Since we wish to test the capabilities of standard low-cost WLAN equipment, we have used Linksys WRT54GLwireless routers running the Openwrt [45] Kamikaze 8.09distribution (Linux 2.4.35). Linksys WRT54GL APs arebased on the Broadcom BCM5352 chipset, which includesa 200 MHz MIPS processor, 16 MB RAM and 4 MB of per-manent storage (flash) where the firmware is also stored.A Broadcom IEEE 802.11b/g Wi-Fi adapter (controlled bya proprietary driver) is also included.

Transmission rate was fixed at 54 Mbps to disable rateadaptation, and the RTS/CTS mechanism was disabled.Our testbed operated in pure IEEE 802.11g mode (i.e., thephysical layer overhead was 20 ls; TSLOT ¼ 9 ls andCWmin ¼ 15). To avoid the effects of interference and con-tention for channel access, the APs operated in differentnon-overlapping channels (1 and 11) and we verified viasite surveys that no other Wi-Fi networks were operatingat our testbed location during our experiments.

To estimate the quality of voice calls we needed accu-rate measurements of network delay. We achieved thisby comparing the timestamps generated at the transmitterand the receiver end for each voice packet. Transmittersand receivers were synchronized using NTP: one of thetwo laptops was operating an NTP server and the gigabitEthernet interfaces of the two hosts were connectedback-to-back, so that VoIP (wireless) traffic was isolatedand the required millisecond-level accuracy was achieved.In our experiments, packets were sent at fixed 20 ms inter-vals, so synchronization is considered fairly accurate.

Our experimental testbed setup is shown in Fig. 3.

4.3.2. VoIP quality assessmentWe emulated voice conversations by setting up bidirec-

tional UDP flows between two laptop PCs. We imple-mented our own traffic generators, sending 50 packetsper second with 20 bytes of audio payload each and 12bytes for the RTP header. This traffic pattern correspondsto the G.729a codec, which is used by many availableWi-Fi VoIP phones. The 20 bytes of packet payload contain20 ms of voice. Each host was connected to a different IEEE

5 Note that, at the same time, multiple tunnels with different endpointsmay be active: more than one users may be using the same gateway at thesame time, since a home AP may be shared by more than one (e.g., familymembers).

802.11g WLAN AP and each voice call lasted for at least120 s. We assume that at the receiver end there is a dejitterbuffer to ensure smooth playout at the expense of a con-stant 60 ms delay.

We initiated parallel VoIP calls between the two laptopsand collected delay and loss information for each packet atthe receiver end for one of the two call directions. Since ourtestbed is symmetric, the same results were observed forthe opposite direction.

Our results reflect the perceived voice quality for a sin-gle call in the presence of simultaneous calls. For the VPNexperiments, we also set up VPN tunnels between the lap-tops and the APs each one was connected to.

To estimate VoIP QoE, we used the evaluation method-ology of Cole and Rosenbluth [31] to reduce ITU-T’s E-model to transport level metrics (see Section 2.4 andAppendix A) and derive a score that represents the subjec-tive quality of a voice call based only on network delay, jit-ter and packet loss information, which are directlymeasurable in our testbed. For the codec configurationdescribed above, this score (R-factor) is given by Eq. (A.1).For a call of acceptable quality, the R-factor value shouldbe greater than 70.

4.3.3. Security parametersThe home wireless router operates also as a VPN gate-

way. In our prior work [2] we experimented with anL2TP/IPsec solution, building the Openswan [46] IPsecimplementation into the router’s firmware. The L2TP pro-tocol was used for implementing tunnels and IPsec ESP(Encapsulating Security Payload) [47] was used to securethem [32]. IPsec operated in transport mode using theAES-CBC algorithm (128bit keys) for data encryption.Preshared keys were used for authentication.

The above solution was complex protocol-wise, but alsoas far as configuration and maintenance were concerned.As to its space overhead, the original IP packet (IP, UDPand RTP headers and voice payload, adding up to 60 bytes)is encapsulated in a PPP frame (4-byte header), which iscarried within an L2TP tunnel, thus an 8-byte L2TP headerand an 8-byte UDP header are prepended to it. The


resulting packet will be encrypted using the AES-CBC algo-rithm and encapsulated in an ESP header and trailer. Theinput data of the encryption algorithm also include theESP ‘‘pad length’’ and ‘‘next header’’ fields (1 byte each)and their total length is 82 bytes. Before encryption, theyare padded to become a multiple of the 16-byte AES-CBCblock size, raising their size to 96 bytes. ESP packet con-tents also include the AES initialization vector (16 bytes),sequence number (4 bytes) and SPI index (4 bytes), as wellas an integrity check value of 12 bytes (HMAC-MD5).Finally, the 132-byte packet is prepended with an IPheader, adding up to a total of 152 bytes (compared tothe 60 bytes of the unencrypted voice packet). Fig. 4 showsthe format of a tunneled VoIP packet using L2TP/IPsec.

Note that, in practice, NAT traversal would be used forIPsec communication, since users are typically in privateLANs, and this would add to the space overhead (additionalUDP encapsulation for traversing NAT).

Instead of L2TP/IPsec, in this article we opted for anOpenVPN-based solution, maintaining the same securitylevel (128-bit AES-CBC, but also certificate-based SSL/TLSsession authentication and key exchange and HMAC-SHA1 packet authentication) and exploiting the popularity,configuration simplicity and wide availability of OpenVPN.In this case, the size of a 60-byte IP datagram expands to145 bytes (compared to 152 bytes when using L2TP/IPsec).The structure of a packet tunneled over UDP using Open-VPN [48] and the above combination of encryption andauthentication schemes is shown in Fig. 5. Note that thesize of the data to encrypt is 64 bytes (audio payload plusthe 4-byte ID field). Since this is a multiple of the AES-CBCblock size, a 16-byte padding is necessary.

4.4. Results

In this section we present our measurement results. Thefigures presented here include end-to-end network delay,packet loss and dejitter buffer loss statistics for three typesof experiments; (i) plain unencrypted VoIP transport,

Fig. 4. Structure of a VoIP packet encry

Fig. 5. Structure of a VoIP packet encr

(ii) emulation of the space overhead introduced by Open-VPN tunneling, and (iii) VPN-secured VoIP transport.

For each experiment, we also calculate user-perceivedVoIP QoE as a function of the above three quantities andfor the codec settings we have assumed. Each data pointis the mean of the measured values for 5 iterations of thesame experiment, i.e., 5 VoIP calls under the same condi-tions, presented with 95% confidence intervals. For eachiteration, the R-factor was calculated using the mean val-ues for delay, network loss and dejitter buffer loss.

For comparison, we also mention our prior resultsobtained in an IEEE 802.11b testbed.

4.4.1. IEEE 802.11g performanceOur first experiment does not involve any security

mechanisms. Instead, it measures transport level charac-teristics and quantifies user QoE for an unencryptedend-to-end VoIP session. The major quality degradationcomponents are due to IEEE 802.11 MAC and PHY layermechanisms. This experiment serves as a baseline casefor the measurements to follow. We find that 30 concur-rent VoIP sessions can be sustained, with mean R-score>75 (Fig. 9). Beyond that point, network delay reachesmore than 100 ms (Fig. 6), which, combined with theintrinsic encoder delay (25 ms) and the playout delayintroduced by the dejitter buffer, but also with a noticeablenetwork (Fig. 7) and dejitter buffer (Fig. 8) loss makes callquality unacceptable. Results of this experiment are shownas circle points in the figures.

The results of this experiment exactly match the upperbound derived using a simple analytic model in Section 4.2(see Table 2).

4.4.2. VPN space overheadIn this experiment we quantify QoE degradation due to

the space overhead imposed by VPN mechanisms. We donot carry out any cryptographic operations. On thecontrary, we transmit packets of fixed size, equal to thesize of 60-byte VoIP datagrams tunneled using OpenVPN.

pted/tunneled using L2TP/IPsec.

ypted/tunneled using OpenVPN.

0 5 10 15 20 25 30 350

50

100

150

200

250

300

# VoIP calls

Del

ay (m

s)

One−way network delay(Does not include codec and dejitter buffer delay)

Plain IEEE 802.11gVPNVPN w/o crypto

Fig. 6. One-way network delay for a VoIP session in the presence ofincreasing numbers of parallel wireless-to-wired-to-wireless calls. Threescenarios are shown: The first curve (circles) shows the case for a plainunencrypted G.729a call. The second (squares) presents a scenario wherewe emulated VoIP calls secured by OpenVPN, without performingencryption and decryption. The third curve (triangles) shows the delayfor an OpenVPN-secured VoIP call using AES-128.

0 5 10 15 20 25 30 350

0.05

0.1

0.15

0.2

0.25

# VoIP calls

Pack

et lo

ss ra

tio

Packet loss in the end−to−end network pathPlain IEEE 802.11gVPNVPN w/o crypto

Fig. 7. Network packet loss for a VoIP session in the presence ofincreasing numbers of parallel wireless-to-wired-to-wireless calls. Sce-narios are the same as with Fig. 6.

0 5 10 15 20 25 30 350

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

# VoIP calls

Pack

et lo

ss ra

tio

Ratio of discarded packets at the dejitter (playout) buffer

Plain IEEE 802.11gVPNVPN w/o crypto

Fig. 8. Percentage of packets dropped at the receiver dejitter buffer due toexcessive jitter. Scenarios are the same as with Fig. 6.

0 5 10 15 20 25 30 350

10

20

30

40

50

60

70

80

90

100

VoIP QoE as the number of concurrentwireless−to−wired−to−wireless calls increases

# VoIP calls

R−f

acto

r (IT

U−T

E−m

odel

)

Plain IEEE 802.11gVPNVPN w/o cryptoQoE threshold (R = 70)

Fig. 9. User QoE expressed in terms of the E-model’s R-factor. Acceptablequality calls score 70 or more (solid straight line).


As discussed in Section 4.3.3, the size of each such IP dat-agram is 145 bytes. There is a noticeable drop in VoIPcapacity in this case: Space overhead accounts for a 30%decrease in VoIP capacity, which is not fully captured byour analytic model. Instead of the measured maximumnumber of 21 concurrent high-quality VoIP sessions (With21 ongoing calls, R-score is still above the quality thresh-old, as shown in Fig. 9), we had estimated that 28 callscould be sustained. This is because our analysis underesti-mated the delay imposed by the IEEE 802.11 DCF. Aspacket size increases, the probability that a node sensesthe medium busy while being in its backoff stage increases.This causes its backoff counter to freeze, moving TDCF

beyond the optimistic CWmin2 � TSLOT value that we had

assumed.

4.4.3. Combined VPN space and processing overheadOur 3rd experiment focuses on the combined effect of

VPN space and processing overhead. Note that we haveassumed that each call endpoint operates a separate VPNgateway at its premises, building this functionality in theWi-Fi router’s firmware. Other options which would offerperformance enhancements are possible (e.g., dedicatinga more powerful device to this purpose or negotiating anend-to-end secure tunnel, which would involve a singleencryption and decryption operation per packet), but oursolution is more generic and spares the need for separateVPN equipment.

This time, performance is severely impacted, but a rea-sonable number of parallel high-quality VoIP sessions is


still possible to support. In particular, 8 such calls can besimultaneously supported, maintaining very high quality(mean R-score approximately 80), as Fig. 9 indicates. How-ever, adding a single secure VoIP call causes all sessions tocollapse: one-way network delay increases to more than220 ms, while 5–10% of the transmitted packets are lost.Dejitter buffer loss was small in all our experiments. Packetloss obviously does not occur due to congested wirelesslinks, since for the same traffic pattern without performingVPN-related cryptographic operations (see Section 4.4.2),loss was negligible. Rather, increased processing require-ments at the APs incur queuing delays, which cause bufferoverflows and thus packet loss.

As a final note, the AP processor is much slower thanthat of the laptops we used in our experiments. Tests wecarried out using the OpenSSL speed utility revealed thatthe cryptographic throughput of encrypting 64-byte blocksusing the AES-CBC algorithm (128 bits) on a LinksysWRT54GL was 2 MBps, compared to 90 MBps on an Inteli3 quad-core CPU (making use of a single core for the speedtest, though) which we used in our experiments. The lap-top was capable of encrypting/decrypting approximately1.4 M packets/s, while 9 concurrent VoIP sessions gener-ated and received 900 packets/s. Therefore, for 9 concur-rent VoIP sessions, the AP seems to be the bottleneck,even though in our testbed the laptop was handling thesame amount of traffic to encrypt/decrypt.

4.4.4. Comparison with IEEE 802.11bIn our prior work [2] we presented results from a similar

set of experiments, albeit using IEEE 802.11b, but also L2TP/IPsec for implementing tunnels. When no security mecha-nism was in place, in our wireless-to-wired-to-wirelessscenario, 7 calls could be sustained. This is in line withour theoretical upper bound of Section 4.2. We also exper-imentally verified the result of Garg and Kappes [16] that 14calls can be admitted given our codec settings in a single-hop wireless-to-wired scenario, using QoE metrics. Theuse of L2TP/IPsec further reduced the number of sustain-able VoIP sessions to 5. As a final note, Nascimento et al.[26] presented experiments showing that, in a different set-ting, where VoIP is encrypted end-to-end using AES-CBCand IPsec ESP in tunnel mode, an IEEE 802.11b WLAN cansupport up to 8 simultaneous G.711 calls (with 30 ms ofaudio per packet) with acceptable QoE. We validated thisresult analytically. Their experimental setting, however,does not correspond to our architecture design, and cannotappropriately capture the processing overhead due to theVPN operations performed on user Wi-Fi routers.

5. Conclusion

In this article, we have delved into performance aspectsof secure, user-centric, voice communication over wirelessaccess. To this end, we have presented a tunneling-basedcommunications scheme, at the same time discussing var-ious design alternatives. Our work has touched upon thetradeoffs posed by the needs for low-cost operation, secu-rity and performance, quantifying the overhead imposedby security mechanisms and operation on low-end userequipment.

We found that operating VPN gateways in typical off-the-shelf Wi-Fi equipment is the major quality degradationcomponent: in our configuration, the number of high-qual-ity VoIP sessions which can be simultaneously sustaineddrops from 30 to 8 when security mechanisms are put ineffect, and this is mainly due to the increased delay andobserved packet loss imposed by the CPU-intensiveencryption and decryption operations at the APs. Ourarchitecture is thus CPU-, rather than bandwidth-con-strained, which is a promising result, as further improvingVoIP capacity depends to a significant extent on increasingthe processing power at VPN gateways, and using tech-niques such as silence suppression to reduce the load ontheir crypto engines. Furthermore, the observation thatpacket loss due to excessive jitter was small, especially inlow traffic load scenarios, since we assumed a static dejit-ter buffer which added a fixed 60 ms delay, motivates theinvestigation of adaptive buffering techniques to reducethe introduced delay and improve user experience.

Overall, our results show that, performance-wise, evenin the constrained environments that we focus, withuser-centric Wi-Fi access solutions with adequate coveragein place, low-cost, secure, user-driven alternatives formobile multimedia could be viable.

Acknowledgements

This work was carried out in part while P.A. Frangoudiswas with the Mobile Multimedia Laboratory, Athens Uni-versity of Economics and Business, and during the tenureof an ERCIM ‘‘Alain Bensoussan’’ Fellowship Programmeat INRIA Rennes-Bretagne Atlantique. The research leadingto these results has received funding from the EuropeanUnion Seventh Framework Programme (FP7/2007–2013)under Grant agreement No. 246016.

Appendix A. Updated expressions for approximating theE-model

The E-model, specified in ITU-T G.107 [30], is based onan algorithm with which the individual transmissionparameters are transformed into different individual‘‘impairment factors,’’ whose effect on user experience isadditive. Its output, the R-factor, is given by

R ¼ R0 � Is � Id � Ie�eff þ A:

R0 is the basic signal-to-noise ratio at the receiver end,including noise sources such as circuit noise and roomnoise, and Is is the combination of impairments that occursimultaneously with the voice signal (e.g., low loudness,non-optimal sidetone, quantization noise). Id representsdelay impairments and Ie�eff is the packet-loss dependenteffective equipment impairment factor, which includesimpairments caused by low bitrate codecs and packet loss.The E-model also includes an ‘‘advantage of access’’ factorA to account for cases where user expectations compensatefor impairments: For example, a user may accept somedecrease in quality in exchange for mobile connectivity.We drop the advantage factor in this discussion.


ITU-T G.107 defines default values for all input param-eters to the E-model algorithm, and recommends usingthese values for all parameters which are not varied duringits application. Choosing default values for all parametersother than delay and equipment impairments, the ratingfactor R is reduced to

R ¼ 93:2� Id � Ie�eff :

Ie�eff is given by

Ie�eff ¼ Ie þ ð95� IeÞPpl

Ppl

BurstRþ Bpl

;

where Ie is the codec-specific equipment impairment fac-tor, BurstR is the packet loss burstiness, Ppl is the packetloss percentage and Bpl is a factor expressing the robustnessof the codec to packet loss. Recommendation ITU-T G.113[49] provides default values for Ie and Bpl for variouscodecs. For G.729a, and assuming random packet loss(BurstR ¼ 1), Ie ¼ 11 and Bpl ¼ 19. Thus, the above equationbecomes

Ie�eff ¼ 11þ 84Ppl

Ppl þ 19:

To derive a simplified expression of Id, Cole and Rosenb-luth [31] used the default values for all other parametersthan network and codec-induced delay in the analyticexpression for Id and produced a set of points, on whichthey fit a simpler curve. We reproduced their approachusing the recent version of the E-model. We calculated Id

as a function of mouth-to-ear delay for delay values upto 400 ms, using the reference implementation of themodel publicly available, and used the segmented [50] soft-ware package to fit a piecewise linear expression to thedata. The delay impairment factor is thus approximated by

Id ¼ 0:337þ 0:024dþ ð0:097d� 16:467ÞHðd� 169:4Þ

where d is the total delay in the mouth-to-ear path andHðxÞ ¼ 1 if x > 0; 0 otherwise.

Combining the expressions of Ie�eff and Id, we derive anexpression of the R-factor for the G.729a codec and assum-ing (non-bursty) random packet loss:

R ¼ 81:863� 84Ppl

Ppl þ 19� 0:024d� ð0:097d

� 16:467ÞHðd� 169:4Þ: ðA:1Þ

Note that Ppl includes packet losses in the network path,but also packets that are discarded due to excessive jitter,i.e.,

Ppl ¼ 100� ðenwk þ ð1� enwkÞedjtÞ;

where enwk is the network packet loss ratio and edjt the ratioof packets discarded at the dejitter buffer. Furthermore, thedelay components are network latency (dnwk), as well asthe delays associated with the codec and dejitter bufferconfiguration. For G.729a, with each IP packet carryingtwo 10 ms audio blocks, the algorithmic and packetizationdelay add up to 25 ms. If we also assume a 60 ms dejitterdejitter buffer, d ¼ dnwk þ 85 (in ms).

References

[1] P.A. Frangoudis, G.C. Polyzos, V.P. Kemerlis, Wireless CommunityNetworks: An Alternative Approach for Broadband NomadicNetwork Access, vol. 49 (5), 2011, pp. 206–213.

[2] E.C. Efstathiou, P.A. Frangoudis, G.C. Polyzos, Controlled Wi-Fisharing in cities: a decentralized approach relying on indirectreciprocity, IEEE Trans. Mobile Comput. 9 (8) (2010) 1147–1160.

[3] guifi.net – Open, Libre and Neutral Telecommunications Network[online]. <http://guifi.net>.

[4] Athens Wireless Metropolitan Network [online]. <http://awmn.net>.[5] EU FP7 ICT Project CONFINE: Community Networks Testbed for the

Future Internet [online]. <http://confine-project.eu/>.[6] FON [online]. <http://en.fon.com>.[7] G. Biczók, L. Toka, A. Vidacs, T.A. Trinh, On incentives in global

wireless communities, in: Proc. 1st ACM Workshop on User-provided Networking, 2009, pp. 1–6.

[8] G. Biczók, L. Toka, A. Gulyás, T.A. Trinh, A. Vidács, Incentivizing theglobal wireless village, Comput. Networks 55 (2) (2011) 439–456.

[9] M.H. Manshaei, J. Freudiger, M. Félegyházi, P. Marbach, J.-P. Hubaux,On wireless social community networks, in: Proc. IEEE INFOCOM,2008, pp. 1552–1560.

[10] X. Ai, V. Srinivasan, C.-K. Tham, Wi-Sh: a simple, robust credit basedWi-Fi community network, in: Proc. IEEE INFOCOM 2009, Rio deJaneiro, Brazil, 2009.

[11] D. MacSíthigh, Law in the Last Mile: Sharing Internet Access ThroughWiFi, SCRIPTed 6 (2) (2009) 355, available from: <http://www.law.ed.ac.uk/ahrc/script-ed/vol6-2/macsithigh.asp>.

[12] Electronic Frontier Foundation – Wireless-Friendly ISPs [online].<http://www.eff.org/pages/wireless-friendly-isps>.

[13] N. Sastry, J. Crowcroft, K. Sollins, Architecting citywide ubiquitousWi-Fi access, in: Proc. ACM HotNets VI, 2007.

[14] T. Heer, S. Li, K. Wehrle, PISA: P2P Wi-Fi internet sharingarchitecture, in: Proc. 7th IEEE International Conference on Peer-to-Peer Computing (P2P ’07), 2007, pp. 251–252.

[15] T. Heer, S. Gotz, E. Weingartner, K. Wehrle, Secure Wi-Fi sharing atglobal scales, in: Proc. International Conference onTelecommunications (ICT ’08), 2008, pp. 1–7.

[16] S. Garg, M. Kappes, Can I add a VoIP call?, in: Proc. IEEE InternationalConference on Communications (ICC ’03), 2003, pp. 779–783.

[17] D. Hole, F. Tobagi, Capacity of an IEEE 802.11b Wireless LANsupporting VoIP, in: Proc. IEEE International Conference onCommunications (ICC ’04), 2004, pp. 196–201.

[18] IEEE 802.11 WG, IEEE Standard for information technology –telecommunications and information exchange between systems –local and metropolitan area networks – specific requirements Part11: wireless LAN Medium Access Control (MAC) and Physical Layer(PHY) Specifications Amendment 8: Medium Access Control (MAC)Quality of Service Enhancements, IEEE 802.11e-2005, The Instituteof Electrical and Electronics Engineers, Inc., New York, USA, 2005.

[19] I. Dangerfield, D. Malone, D. Leith, Understanding 802.11e voicebehaviour via testbed measurements and modeling, in: Proc. 5thInternational Symposium on Modeling and Optimization in Mobile,Ad Hoc and Wireless Networks (WiOpt ’07) Workshops, 2007.

[20] K. Stoeckigt, H. Vu, VoIP capacity – analysis, improvements, andlimits in IEEE 802.11 wireless LAN, IEEE Trans. Vehicular Technol. 59(9) (2010) 4553–4563.

[21] S. Kent, S. Keo, Security Architecture for the Internet Protocol, RFC4301, IETF, December 2005.

[22] OpenVPN: Open Source VPN [online]. <http://openvpn.net>.[23] R. Barbieri, D. Bruschi, E. Rosti, Voice over IPsec: analysis and

solutions, in: Proc. 18th Annual Computer Security ApplicationsConference (ACSAC ’02), 2002, p. 261.

[24] C. Xenakis, N. Laoutaris, L. Merakos, I. Stavrakakis, A genericcharacterization of the overheads imposed by IPsec and associatedcryptographic algorithms, Comput. Networks 50 (17) (2006) 3225–3241.

[25] S. Miltchev, S. Ioannidis, A. Keromytis, A study of the relative costs ofnetwork security protocols, in: Proc. USENIX 2002 Annual TechnicalConference, 2002, pp. 41–48.

[26] A. Nascimento, A. Passito, E. Mota, E. Nascimento, L. Carvalho, Can Iadd a secure VoIP call?, in: Proc. 7th IEEE International Symposiumon a World of Wireless, Mobile and Multimedia Networks(WoWMoM ’06), 2006, pp. 435–437.

[27] S. Jelassi, G. Rubino, H. Melvin, H. Youssef, G. Pujolle, Quality ofexperience of VoIP service: a survey of assessment approaches andopen issues, IEEE Commun. Surveys Tutorials 14 (2) (2012) 491–513.

[28] ITU-T Study Group 12, ITU-T Recommendation P.800: methods forsubjective determination of transmission quality, International

http://refhub.elsevier.com/S1389-1286(14)00240-0/h0195



http://guifi.net

http://awmn.net

http://confine-project.eu/

http://en.fon.com



http://www.law.ed.ac.uk/ahrc/script-ed/vol6-2/macsithigh.asp

http://www.law.ed.ac.uk/ahrc/script-ed/vol6-2/macsithigh.asp

http://www.eff.org/pages/wireless-friendly-isps




http://openvpn.net









Telecommunication Union, available from: <http://www.itu.int/rec/T-REC-P.800/> (August 1996).

[29] ITU-T Study Group 12, ITU-T Recommendation P.862: perceptualevaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networksand speech codecs, International Telecommunication Union,available from: <http://http://www.itu.int/rec/T-REC-P.862/>(February 2001).

[30] ITU-T Study Group 12, ITU-T Recommendation G.107: the E-model, acomputational model for use in transmission planning, InternationalTelecommunication Union, available from: <http://www.itu.int/rec/T-REC-G.107> (February 2014).

[31] R.G. Cole, J.H. Rosenbluth, Voice over IP performance monitoring,ACM Comput. Commun. Rev. 31 (2) (2001) 9–24.

[32] B. Patel, B. Aboba, W. Dixon, G. Zorn, S. Booth, Securing L2TP usingIPsec, RFC 3193, IETF, November 2001.

[33] DynDNS: Managed DNS, Outsourced DNS & Anycast DNS [online].<http://dyn.com/dns/>.

[34] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R.Sparks, M. Handley, E. Schooler, SIP: Session Initiation Protocol, RFC3261, IETF, June 2002.

[35] J. Rosenberg, H. Schulzrinne, Session Initiation Protocol (SIP):Locating SIP Servers, RFC 3263, IETF, June 2002.

[36] P2PSIP IETF Working Group [online]. <http://tools.ietf.org/wg/p2psip/>.

[37] I. Stoica, R. Morris, D. Karger, M.F. Kaashoek, H. Balakrishnan, Chord:a scalable peer-to-peer lookup service for internet applications, in:Proc. ACM SIGCOMM ’01, 2001, pp. 149–160.

[38] S. Cirani, R. Pecori, L. Veltri, A peer-to-peer secure VoIP architecture,in: L. Salgarelli, G. Bianchi, N. Blefari-Melazzi (Eds.), TrustworthyInternet, Springer, Milan, 2011, pp. 105–115.

[39] D. Touceda, J. Sierra, A. Izquierdo, H. Schulzrinne, Survey of attacksand defenses on P2PSIP communications, IEEE Commun. SurveysTutorials 14 (3) (2011) 750–783.

[40] P. Zimmermann, A. Johnston, J. Callas, ZRTP: Media Path KeyAgreement for Unicast Secure RTP, April 2011.

[41] P. Zimmermann. PGPfone Owner’s Manual [online]. <http://philzimmermann.com/docs/pgpfone10b7.pdf>.

[42] P. Brady, A statistical analysis of on-off patterns in 16 conversations,Bell Syst. Tech. J. 47 (1) (1968) 73–91.

[43] H. Lee, C. Un, A study of on-off characteristics of conversationalspeech, IEEE Trans. Commun. 34 (6) (1986) 630–637.

[44] ITU-T Study Group 12, Recommendation P.59: ArtificialConversational Speech, International Telecommunication Union,available from: <http://www.itu.int/rec/T-REC-P.59/> (March 1993).

[45] OpenWRT Linux Distribution [online]. <http://openwrt.org>.

[46] Openswan [online]. <http://www.openswan.org>.[47] S. Kent, R. Atkinson, IP Encapsulating Security Payload (ESP), RFC

2406, IETF, November 1998.[48] J.C. Snader, VPNs Illustrated: Tunnels, VPNs, and IPsec, Addison-

Wesley Professional, 2005.[49] ITU-T Study Group 12, Recommendation G.113: transmission

impairments due to speech processing, International Telecom-munication Union, available from: <http://www.itu.int/rec/T-REC-G.113> (November 2007).

[50] V.M. Muggeo, segmented: an R package to fit regression models withbroken-line relationships, R News 8 (1) (2008) 20–25.

Pantelis A. Frangoudis received his PhD(2012) in Computer Science from theDepartment of Informatics, Athens Universityof Economics and Business. He holds a BSc(2003) and an MSc (2005) in Computer Sci-ence from the same department. Currently, heis a post-doctoral researcher at INRIA Rennes-Bretagne Atlantique. His research interestsinclude wireless networking, Internet multi-media and network security.

George C. Polyzos, Computer Science Profes-sor at AUEB, founded and is leading theMobile Multimedia Laboratory. He was CSEProfessor at UCSD, Computer Systems Labo-ratory co-director, Center for Wireless Com-munications Steering Committee member,and San Diego Supercomputer Center SeniorFellow. Obtained his EE Diploma (NTUA), andMASc (EE) and PhD (CS) from the University ofToronto. His current research interestsinclude Internet architecture and protocols,mobile multimedia communications, wirelessnetworks, ubiquitous computing, and net-work security.

http://www.itu.int/rec/T-REC-P.800/


http://http://www.itu.int/rec/T-REC-P.862/

http://www.itu.int/rec/T-REC-G.107




http://dyn.com/dns/

http://tools.ietf.org/wg/p2psip/

http://tools.ietf.org/wg/p2psip/











http://philzimmermann.com/docs/pgpfone10b7.pdf

http://philzimmermann.com/docs/pgpfone10b7.pdf






http://openwrt.org

http://www.openswan.org








on the performance of secure user-centric voip communication

Documents