secure communication based on ambient audio

13
Secure Communication Based on Ambient Audio Dominik Schu ¨rmann and Stephan Sigg, Member, IEEE Computer Society Abstract—We propose to establish a secure communication channel among devices based on similar audio patterns. Features from ambient audio are used to generate a shared cryptographic key between devices without exchanging information about the ambient audio itself or the features utilized for the key generation process. We explore a common audio-fingerprinting approach and account for the noise in the derived fingerprints by employing error correcting codes. This fuzzy-cryptography scheme enables the adaptation of a specific value for the tolerated noise among fingerprints based on environmental conditions by altering the parameters of the error correction and the length of the audio samples utilized. In this paper, we experimentally verify the feasibility of the protocol in four different realistic settings and a laboratory experiment. The case studies include an office setting, a scenario where an attacker is capable of reproducing parts of the audio context, a setting near a traffic loaded road, and a crowded canteen environment. We apply statistical tests to show that the entropy of fingerprints based on ambient audio is high. The proposed scheme constitutes a totally unobtrusive but cryptographically strong security mechanism based on contextual information. Index Terms—Pervasive computing, data encryption, random number generation, signal analysis, synthesis, and processing, signal processing, location-dependent and sensitive Ç 1 INTRODUCTION A N important factor in the set of security risks is typically the human impact. People are occasionally careless or incompletely understanding the underlying technology. This is especially true for wireless communica- tion. For instance, the communication range or the number of potential communication partners might be underesti- mated. This is natural since humans typically base trust on the situation or context they perceive [1]. Nevertheless, the range of a communication network most likely bridges devices in various contexts. As context, proximity and trust are related [1], a security scheme that utilizes common contextual features among communicating devices might provide a sense of security which is perceived as natural by individuals and reduce the number of human errors related to security. Consider, for instance, a meeting with coworkers of a specific project. Naturally, workers trust the others based on working agreements. Every group member needs the permission to access common information like mobile phone numbers or shared files. Communication between group members, however, should be guarded against access from external devices or individuals. The meeting room defines the borders which shall not be crossed by any confidential data. Context information that is unique inside these borders, such as ambient audio, can be exploited as the seed to generate a common secret for the secure information exchange and authentication. Mobile phones can then synchronize their ID-cards ad hoc without user interaction and secured by their physical proximity. Similarly, access to shared files on computers of coworkers and communication links among coworkers can be secured. Another reason why security cautions might be discarded occasionally is the effort required and inconvenience to establish a secure connection. This is especially true between devices that communicate seldom or for the first time. We propose a mechanism to unobtrusively (zero inter- action) establish an ad hoc secure communication channel between unacquainted devices which is conditioned on the surrounding context. In particular, we consider audio as a source of spatially centered context. We exploit the similarity of features from ambient audio by devices in proximity to create a secure communication channel exclusively based on these features. At no point in the protocol the secret itself or information that could be used to derive audio feature values is made public. In order to do so, we generate synchronized audio fingerprints from ambient sounds and utilize error correcting codes to account for noise in the feature vector. On each commu- nicating device the feature vector is then used to create an identical key. The proposed protocol is noninteractive, unobtrusive, and does not require specific or identical hardware at communication partners. The remainder of this document is structured as follows: In Section 2 we introduce related work on context-based security mechanisms and security with noisy input data. Section 3 discusses the algorithmic background required for ambient audio-based key generation and implementation details. In Section 4, we discuss the noise and entropy of audio-fingerprints achieved in an offline experiment with sampled audio sequences. We show that the similarity in 358 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013 . D. Schu ¨rmann is with TU Braunschweig, 38118 Braunschweig, Germany. E-mail: [email protected]. . S. Sigg is with the Information Systems Architecture Science Research Division, National Institute of Informatics (NII), Tokyo Academic Park, 2-2 Aomi, Koto-ku, 135-0064 Tokyo, Japan. E-mail: [email protected]. Manuscript received 6 June 2011; revised 24 Oct. 2011; accepted 2 Dec. 2011; published online 22 Dec. 2011. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TMC-2011-06-0300. Digital Object Identifier no. 10.1109/TMC.2011.271. 1536-1233/13/$31.00 ß 2013 IEEE Published by the IEEE CS, CASS, ComSoc, IES, & SPS

Upload: s

Post on 08-Dec-2016

214 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Secure Communication Based on Ambient Audio

Secure CommunicationBased on Ambient Audio

Dominik Schurmann and Stephan Sigg, Member, IEEE Computer Society

Abstract—We propose to establish a secure communication channel among devices based on similar audio patterns. Features from

ambient audio are used to generate a shared cryptographic key between devices without exchanging information about the ambient

audio itself or the features utilized for the key generation process. We explore a common audio-fingerprinting approach and account for

the noise in the derived fingerprints by employing error correcting codes. This fuzzy-cryptography scheme enables the adaptation of a

specific value for the tolerated noise among fingerprints based on environmental conditions by altering the parameters of the error

correction and the length of the audio samples utilized. In this paper, we experimentally verify the feasibility of the protocol in four

different realistic settings and a laboratory experiment. The case studies include an office setting, a scenario where an attacker is

capable of reproducing parts of the audio context, a setting near a traffic loaded road, and a crowded canteen environment. We apply

statistical tests to show that the entropy of fingerprints based on ambient audio is high. The proposed scheme constitutes a totally

unobtrusive but cryptographically strong security mechanism based on contextual information.

Index Terms—Pervasive computing, data encryption, random number generation, signal analysis, synthesis, and processing, signal

processing, location-dependent and sensitive

Ç

1 INTRODUCTION

AN important factor in the set of security risks istypically the human impact. People are occasionally

careless or incompletely understanding the underlyingtechnology. This is especially true for wireless communica-tion. For instance, the communication range or the numberof potential communication partners might be underesti-mated. This is natural since humans typically base trust onthe situation or context they perceive [1]. Nevertheless, therange of a communication network most likely bridgesdevices in various contexts.

As context, proximity and trust are related [1], a security

scheme that utilizes common contextual features among

communicating devices might provide a sense of security

which is perceived as natural by individuals and reduce the

number of human errors related to security.Consider, for instance, a meeting with coworkers of a

specific project. Naturally, workers trust the others based on

working agreements. Every group member needs the

permission to access common information like mobile

phone numbers or shared files. Communication between

group members, however, should be guarded against

access from external devices or individuals. The meeting

room defines the borders which shall not be crossed by any

confidential data. Context information that is unique inside

these borders, such as ambient audio, can be exploited as

the seed to generate a common secret for the secureinformation exchange and authentication.

Mobile phones can then synchronize their ID-cards adhoc without user interaction and secured by their physicalproximity. Similarly, access to shared files on computers ofcoworkers and communication links among coworkers canbe secured.

Another reason why security cautions might be discardedoccasionally is the effort required and inconvenience toestablish a secure connection. This is especially true betweendevices that communicate seldom or for the first time.

We propose a mechanism to unobtrusively (zero inter-action) establish an ad hoc secure communication channelbetween unacquainted devices which is conditioned on thesurrounding context. In particular, we consider audio as asource of spatially centered context. We exploit thesimilarity of features from ambient audio by devices inproximity to create a secure communication channelexclusively based on these features. At no point in theprotocol the secret itself or information that could be usedto derive audio feature values is made public. In order to doso, we generate synchronized audio fingerprints fromambient sounds and utilize error correcting codes toaccount for noise in the feature vector. On each commu-nicating device the feature vector is then used to create anidentical key. The proposed protocol is noninteractive,unobtrusive, and does not require specific or identicalhardware at communication partners.

The remainder of this document is structured as follows:In Section 2 we introduce related work on context-basedsecurity mechanisms and security with noisy input data.Section 3 discusses the algorithmic background required forambient audio-based key generation and implementationdetails. In Section 4, we discuss the noise and entropy ofaudio-fingerprints achieved in an offline experiment withsampled audio sequences. We show that the similarity in

358 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013

. D. Schurmann is with TU Braunschweig, 38118 Braunschweig, Germany.E-mail: [email protected].

. S. Sigg is with the Information Systems Architecture Science ResearchDivision, National Institute of Informatics (NII), Tokyo Academic Park, 2-2Aomi, Koto-ku, 135-0064 Tokyo, Japan. E-mail: [email protected].

Manuscript received 6 June 2011; revised 24 Oct. 2011; accepted 2 Dec. 2011;published online 22 Dec. 2011.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TMC-2011-06-0300.Digital Object Identifier no. 10.1109/TMC.2011.271.

1536-1233/13/$31.00 � 2013 IEEE Published by the IEEE CS, CASS, ComSoc, IES, & SPS

Page 2: Secure Communication Based on Ambient Audio

audio fingerprints is sufficient for authentication but cannotbe utilized as secure key directly. In particular, we utilizefuzzy-cryptography schemes to account for noise in theinput data. Section 5 presents four case studies in differentenvironments that exploit the feasibility of the approach invarious settings. The general feasibility of the approach isdemonstrated in Section 5.1 in a controlled office environ-ment. Section 5.2 then shows that the audio context can beseparated between two offices even when a synchronizedaudio source is located in both places. Additionally, westudied the feasibility of ambient audio-based key genera-tion at the side of a heavily trafficked road in Section 5.4 andin a canteen environment in Section 5.3. The entropy of theambient audio-based characteristic binary sequences gen-erated by our method is discussed in Section 6. In Section 7,we draw our conclusion.

2 RELATED WORK

In the literature, several authors consider spontaneousauthentication or the establishing of a secure communica-tion channel among mobile and ad hoc devices based onenvironmental stimuli [2], [3], [4], [5]. So far, shakingprocesses from accelerometer data and RF-channel mea-surements have been utilized as unique context source thatcontains shared characteristic information.

This concept was presented 2001 by Holmquist et al. [4].The authors propose to utilize the accelerometer of theSmart-It [6] device to extract characteristic features fromsimultaneous shaking processes of two devices. Later,Mayrhofer and Gellersen presented an authenticationmechanism based on this principle [7]. The authorsdemonstrated, that an authentication is possible whendevices are shaken simultaneously by a single person,while an authentication was unlikely for a third persontrying to mimic the correct movement pattern remotely.Also, Mayrhofer derived in [8] that the sharing of secretkeys is possible with a similar protocol. The proposedprotocol that can be utilized with arbitrary context featuresrepeatedly exchanges hashes of key-sub-sequences until acommon secret is found. In this instrumentation, exponen-tially quantized fast Fourier transformation (FFT) coeffi-cients of a sequence of accelerometer samples are utilized.In contrast, Bicher et al. describe an approach in whichnoisy acceleration readings can be utilized to establish asecure communication channel among devices [9], [3]. Theyutilize a hash function that maps similar accelerationpatterns to identical key sequences. However, their ap-proach suffers from the required exact synchronizationamong devices so that the authors computed the correcthash-values offline. Additionally, the hash function utilizedrequired that the keys computed exactly match and that theneighborhood around these keys is precisely defined. Whenpatterns are located at the border of one of the region’sneighborhoods, the tolerance for noise in the input is biasedin the direction of the center of this region. Additionally,key generation by simultaneous shaking is not unobtrusive.

We utilize an error correction scheme to account fornoise in the input data which can be fine-tuned for anyHamming distance desired which is centered around thenoisy characteristic sequences generated instead of anartificially defined center value. We implement a Network

Time Protocol (NTP)-based synchronization mechanismthat establishes sufficient synchronization among nodes.

Another sensor class utilized for context-based deviceauthentication is the RF-channel. Varshavsky et al. present atechnique to authenticate colocated devices based on RF-measurements since channel measurements from devices innear proximity are sufficiently similar to authenticatedevices against each other [5]. Hershey et al. utilize physicallayer features to derive secret keys for a pair of devices [10].In the absence of interference and nonlinear components,transmitter and receiver experience identical channel re-sponse [11]. This information is utilized to generate a secretkey among a node pair. Since channel characteristics arespatially sharply concentrated and not predictable at aremote location [12], an eavesdropper is not capable ofguessing information about the secret. This scheme wasvalidated in an indoor environment in [13]. Although weconsider the keys generated by this scheme as strong, it doesnot preserve spatial properties. A device at arbitrary distancecould pretend to be a nearby communication partner.

Kunze and Lukowicz recently demonstrated, that audioinformation indeed suffices to derive spatial information[14]. They combine audio readings with accelerometer datato classify locations of mobile devices. In their work, thenoise emitted by a vibrating mobile phone was utilized todistinguish among 35 specific locations in three differentrooms with over 90 percent accuracy.

Instead, we utilize purely ambient noise to establish asecure communication channel among devices in spatialproximity. We record NTP-synchronized audio samples attwo locations, generate a characteristic audio fingerprintand map this fingerprint to a unique secret key with thehelp of error correcting codes.

The last step is necessary since the similarity betweenfingerprints is typically not sufficient to establish a securechannel. With fuzzy-cryptography schemes, the generationof an identical key based on noisy input data [15] ispossible. Li and Chang analyze the usage of biometric ormultimedia data as part of an authentication process andpropose a protocol [16]. Due to the use of error-tolerantcryptographic techniques, this protocol is robust againstnoise in the input data. The authors utilize a secure sketch[17] to produce public information about an input withoutrevealing it. The input can then be recovered given anothervalue that is close to it. A similar study is presented by Miaoet al. [18]. The authors establish a key distribution based ona fuzzy vault [19] using data measured by devices wornon the human body. The fuzzy vault scheme, also utilizedin [20], enables the decryption of a secret with any key thatis substantially similar to the key used for encryption.

3 AD HOC AUDIO-BASED ENCRYPTION

Originally, audio-fingerprinting was proposed to classifymusic or speech. In our work binary fingerprints fromambient audio are used to establish an encrypted connec-tion based on the surrounding audio context. Due todifferences between fingerprints generated by participatingdevices, a cryptographic protocol is needed that tolerates aspecific amount of noise in these keys.

We propose the following scheme. A set of deviceswilling to establish a common key conditioned on ambient

SCHURMANN AND SIGG: SECURE COMMUNICATION BASED ON AMBIENT AUDIO 359

Page 3: Secure Communication Based on Ambient Audio

audio take synchronized audio samples from their localmicrophones. Each device then computes a binary char-acteristic sequence for the recorded audio: An audio-fingerprint (cf. Section 3.1). This binary sequence isdesigned to fall onto a code-space of an error correctingcode (cf. Section 3.2). In general, a fingerprint will not matchany of the codewords exactly. Fingerprints generated fromsimilar ambient audio resemble but due to noise andinaccuracy in the audio-sampling process, it is unlikely thattwo fingerprints are identical. Devices therefore exploit theerror correction capabilities of the error correcting codeutilized to map fingerprints to codewords (cf. Section 3.3).For fingerprints with a Hamming distance within the errorcorrection threshold of the error correcting code theresulting codewords are identical and then utilized assecure keys (cf. Section 3.4). This scheme is in principle notlimited in the number of devices that participate. Whendevices are synchronized in their local times, they agree ona point in time when audio shall be recorded and proceedwith the fingerprint creation and error correction autono-mously as described above. All similar fingerprints willmap to an identical codeword. As detailed in Section 5.3,the Hamming distance tolerated in fingerprints rises withincreasing distance of devices.

The following sections provide an overview over audio-fingerprinting, our fuzzy commitment implementation,problems we experienced and possible solutions.

3.1 Audio Fingerprinting

Audio fingerprinting is an approach to derive a character-istic pattern from an audio sequence [21]. Generally, thefirst step involves the extraction of features from a piece ofaudio. These features are usually isolated in a time-frequency analysis after application of Fourier or Cosinetransforms. Some authors also utilize wavelet-transforms[22], [23], [24]. Common applications include the retrieval ofa specific music file in an audio database [25], duplicatedetection in such a database [26] as well as identification ofmusic based on short samples [27]. The capabilities ofdetecting similar audio sequences in the presence of heavysignal distortion are prominently demonstrated by applica-tions such as query by humming [28]. The authors utilizeautocorrelation, maximum likelihood, and Cepstrum ana-lysis to describe the pitch of an audio sequence as a Parsonsencoded music contour [29]. Similar audio sequences aredetected by approximate string matching [30]. McNab et al.added rhythm information by analyzing note duration tomatch the beginning of a song [31]. A similar approach ispresented by Prechelt and Typke [32]. They achieved moreaccurate results for query by whistling since the frequencyrange of whistling is much lower than for humming orsinging. In 2002, Chai and Vercoe computed a roughmelodic contour by counting the number of equivalenttransitions in each beat [33]. Notes are detected byamplitude-based note segmentation. Later, Shiffrin et al.showed that songs can be described by Markov-chains [34]where states represent note transitions. Retrieval of songs isthen achieved by the HMM Forward algorithm [35] so thatno database query is required. In 2003, Zhu et al. addressedpractical problems of recently proposed approaches such asthe accuracy of the derived description by utilizing adynamic time-warping mechanism [36].

Most of these studies are based on music-specificproperties such as rhythm information, pitch or melodiccontour. Since such features might be missing in ambientaudio, these methods are not applicable in our case.Haitsma and Kalker presented in [37] an approachapplicable for the classification of general audio sequencesby extracting a binary representation of audio from changesin the energy of successive frequency bands. This systemwas later shown to be highly robust to noise and distortionin audio data [25]. Due to its reported robustness, severalauthors employ slightly modified versions of this approach[27]. Lebosse et al, for instance, add further redundantsubsamples taken from the beginning and the end of anoverlapping time window in order to reduce the number ofbits in the fingerprint representation [38]. Alternatively,Burges et al. enhance the former approach by utilizing adistortion discriminant analysis [39]. Generally, time framestaken from the audio source are mapped successively onsmaller time windows in order to generate a condensedcharacteristic representation of the audio sequence. Analternative approach based on spectral flatness of a signal isproposed Herre et al. [40].

Also, Yang presented a method to utilize characteristicenergy peaks in the signal spectrum in order to extract aunique pattern [41]. A general framework that supports thisscheme was later presented by Yang [42]. Building on theseideas, a similar algorithm was then successfully appliedcommercially by Wang on a huge data base of audiosequences [43], [44].

To create audio fingerprints for our studies, we split anaudio sequence S with length jSj ¼ l and sample rate r upinto n frames F1; . . . ; Fn of identical length d ¼ jFij ¼ r � ln .On each frame a discrete Fourier transformation (DFT)weighted by a Hanning window (HW) is applied:

8i 2 f0; . . . ; n� 1g;Si ¼ DFT HWðFiÞð Þ:

ð1Þ

The frames are divided into m nonoverlapping frequencybands of width

b ¼ maxfreqðSiÞ �minfreqðSiÞm

: ð2Þ

On each band the sum of the energy values is calculatedand stored to an energy matrix E with energy per frame perfrequency band

8j 2 f0; . . . ;m� 1g;Sij ¼ bandfilterb�j;b�ðjþ1ÞðSiÞ;

ð3Þ

Eij ¼Xk

Sij½k�: ð4Þ

Using the matrix E, a fingerprint f is generated, where8i 2 f1; . . . ; n� 1g; 8j 2 f0; . . . ;m� 2g each bit describesthe difference between the energy on frequency bandsbetween two consecutive frames:

fði; jÞ ¼1; ðEði; jÞ �Eði; jþ 1ÞÞ � ðEði� 1; jÞ

�Eði� 1; jþ 1ÞÞ > 00; otherwise:

8<: ð5Þ

360 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013

Page 4: Secure Communication Based on Ambient Audio

The complete algorithm is detailed in the appendix,which can be found on the Computer Society DigitalLibrary at http://doi.ieeecomputersociety.org/10.1109/TMC.2011.271.

For each synchronization, we sampled l ¼ 6:375 secondsof ambient audio at a sample rate of r ¼ 44;100 Hz. We splitthe audio stream into n ¼ 17 frames of d ¼ 0:375 secondseach and divide every frame into m ¼ 33 frequency bands,to obtain a 512 bit fingerprint. Due to the extensiverecording duration, the generated fingerprints show greatrobustness in real world experiments (cf. Section 4 andSection 5). We used a FFT with fixed values on the length ofthe segments as detailed above.

This audio-fingerprinting scheme utilized in our studiesutilizes energy differences between frequency bands, asproposed by Haitsma and Kalker [25]. However, we take amore general approach of classifying ambient audio insteadof music. Commonly, in the literature, the characteristicinformation is found in a smaller frequency band and alogarithmic scaling is suggested to better represent proper-ties of the human auditory system. Since our system is notrestricted to musical recordings, we expect that allfrequency bands are equally important. Therefore, wedivide frames into frequency bands at a linear scale ratherthan a logarithmic one. Additionally, we do not useoverlapping frames since this has not shown improvementsin our case. Also, the entropy and therefore the securityfeatures of the generated fingerprint is likely to becomeimpaired with overlapping frames [45], [46].

3.2 Audio Fingerprints as Cryptographic Keys

To use the audio fingerprints directly as keys for a classicencryption scheme the concurrence of fingerprints gener-ated from related audio sequences has to be 1 with aconsiderably high probability [47]. Since we experienced asubstantial difference in the audio-fingerprints created(cf. Section 4) we consider the application of fuzzy-cryptography schemes. Note that a perfect match infingerprints is unlikely since devices are spatially separated,not exactly synchronized and utilize possibly differentaudio hardware.

The proposed cryptographic protocol shall be feasibleunattended and ad hoc with unacquainted devices. For aneavesdropper in a different audio context it shall becomputationally infeasible to use any intercepted data todecrypt a message or parts of it. Additionally, we want tocontrol the threshold for the tolerated offset betweenfingerprints based on contextual conditions of differentphysical locations.

With fuzzy encryption schemes, a secret & is used to hidethe key � in a set of possible keys K in such a way that onlya similar secret & 0 can find and decrypt the original key �correctly. In our case, the secrets which ought to be similarfor all communicating devices in the same context are audiofingerprints.

A Fuzzy Commitment scheme can, for instance, beimplemented with Reed-Solomon codes [48]. The followingdiscussion provides a short introduction to these codes.

Given a set of possible words A of length m and a set ofpossible codewords C of length n, Reed-Solomon codesRSðq;m; nÞ are initialized as:

A ¼ IFmq ; ð6Þ

C ¼ IFnq ; ð7Þ

with q ¼ pk; p prime; k 2 IN. These codes are mapping aword a 2 A of length m uniquely to a specific codewordc 2 C of length n:

a ����!Encodec: ð8Þ

This step adds redundancy to the original words withn > m, based on polynomials over Galois fields [48].

Decoding utilizes the error correction properties of theReed-Solomon-based encoding function to account fordifferences in the fingerprints created. The decodingfunction maps a set of codewords from one group C ¼fc; c0; c00; . . .g � C to one single original word. It is

~c ����!Decodea 2 A: ð9Þ

The value

t ¼ n�m2

j k; ð10Þ

defines the threshold for the maximum number of bitsbetween codewords that can be corrected in this manner todecode correctly to the same word a [49]. In the followingalgorithms the fingerprints f and f 0 are used in conjunctionwith codewords to make use of this error correctionprocedure. Dependent on the noise in the created finger-prints, t can then be chosen arbitrarily.

3.3 Commit and Decommit Algorithms

We utilize Reed-Solomon error correcting codes in thefollowing scheme to generate a common secret amongdevices. A fingerprint f is used to hide a randomly chosenword a as the basis for a key in a set of possible wordsa 2 A. This is a commit method. A decommit method isconstructed in such a way that only a fingerprint f 0 withmaximum Hamming distance

Hamðf; f 0Þ � t; ð11Þ

can find a again. We use Reed-Solomon RSðq;m; nÞ codes,with q ¼ 2k, k 2 IN, and n < 2k, for our commit anddecommit methods. After initialization, a private word a 2A is randomly chosen. It is then encoded following theReed-Solomon scheme to a specific codeword c. For asubtract-function � in C ¼ IFn

2k , the difference to thefingerprint is calculated as

� ¼ f � c: ð12Þ

Then, a SHA-512 hash [50] h(a) is generated from a.Afterwards, the tuple ð�; h(a)Þ containing the differenceand the hash is made public. Note that the transmission ofh(a) is optional and is only required to check whether thedecommitted a0 on the receiver side equals a. However,provided a sufficiently secure hash function, an eaves-dropper does not learn additional information about thekey a within reasonable time provided that she is ignorantof a fingerprint sufficiently similar to f .

The decommitment algorithm uses the public tupleð�; h(a)Þ together with the secret fingerprint f 0 to verify

SCHURMANN AND SIGG: SECURE COMMUNICATION BASED ON AMBIENT AUDIO 361

Page 5: Secure Communication Based on Ambient Audio

the similarity between f and f 0 and to obtain a shared worda. A codeword c0 is calculated by subtracting f 0 by � in IFn

2k

c0 ¼ f 0 � �: ð13Þ

Afterwards c0 is decoded to a0 as

a0 2 A ����Decodec0 2 C: ð14Þ

From h(a) ¼ h(a0) we can conclude a ¼ a0 with highprobability. This procedure is capable of correcting up to t(cf. (10)) differing bits between the fingerprints. Thedecommitment was then successful and differences be-tween f and f 0 are t at most. The decommitted word a0 isprivately saved.

Participants can use their private words to derive keysfor encryption. A simple example for using a ¼ a0 ¼ða0; . . . ; am�1Þ to generate an encryption key for theAdvanced Encryption Standard (AES) [51] is to sum overblocks of values of a. For example, when m ¼ 256 we wouldsum over blocks with the length 8 and take these valuesmodulo 28 � 1 to represent characters for a string with thelength 32, that can be used as a key �:

Let � ¼ ð�0; . . . ; �31Þ; whereas

�i ¼X7

j¼0

cði�8Þþj

!mod 28 � 1:

In our study, for fingerprints of 512 bits we apply Reed-Solomon codes with RSðq ¼ 210;m; n ¼ 512Þ. Given amaximum acceptable Hamming distance t� (cf. (10))between fingerprints we can then set m flexibly to definethe minimum required fraction u of identical bits infingerprints as

t� ¼ ð1� uÞ � nd e; ð15Þ

m ¼ n� 2 � t�: ð16Þ

Experimentally, we found u ¼ 0:7 as a good tradeoff forcommon audio environments to allow a sufficient amountof differences among the used fingerprints to pair devicessuccessfully while at the same time providing sufficientcryptographic security against an eavesdropper in adifferent audio context (cf. Section 4)

m ¼ 512� 2 � ð1� 0:7Þ � 512d e¼ 204:

ð17Þ

We therefore use Reed-Solomon codes with

RSð210; 204; 512Þ: ð18Þ

The commit and decommit algorithms are furtherdetailed in the appendix, available in the online supple-mental material.

3.4 Synchronizing Communicating Devices

Since audio is time dependent, a tight synchronizationamong devices is required. In particular, we experiencedthat fingerprints created by two devices were sufficientlysimilar only when the synchronization offset among deviceswas within tens of milliseconds. For synchronization, anysufficiently accurate time protocol such as the NTP [52],

[53], the Precision Time Protocol (PTP) [54] or a similar timeprotocol can be utilized. Also, synchronization with GPStime might be a valid option.

When two participants, Alice and Bob, are willing tocommunicate securely with each other, Alice starts theprotocol by requesting a pairing with Bob. Then, theysynchronize their absolute system times using a suffi-ciently accurate time protocol. Afterwards, Alice sends astart time �start to Bob. When their clocks reach �start, therecording of ambient audio is initiated and audiofingerprinting is applied.

In our case studies, synchronization of devices was acritical issue. Since the approach bases the binary finger-prints on energy differences of subsamples of 0.375 secondswidth, a misalignment of several hundreds of millisecondsresults in completely different fingerprints. For best results,the start times of the audio recordings should not differmore than about 0.001 seconds. We successfully tested thiswith a remote NTP server and also with one of the deviceshosting the server.

Still, since NTP is able to synchronize clocks with anerror of several milliseconds [52], [55], some error in thesynchronization of audio samples remains. For instance, theusage of sound subsystems, like GStreamer [56], to recordambient audio introduces new delays.

Fig. 1 illustrates this aspect in the frequency spectrum oftwo NTP-synchronized recordings.

As a solution, we had the decommiting node create 200additional fingerprints by shifting the audio sequence inboth directions in steps of 0.001 seconds. The device thentried to create a common key with each of these fingerprintsand uses the first successful attempt. In this way, we couldcompensate for an error of about 0.2 seconds in the clocksynchronization among nodes.

3.5 Security Considerations and Attack Scenarios

Privacy leakages translate to leaking partial informationabout the used audio fingerprints. This can simplify theattack when further details of the ambient audio of Aliceand Bob is available.

Possible attacks on fuzzy-cryptography are reviewed byScheirer and Boult [57]. In particular, fuzzy commitment isevaluated regarding information leakage by Ignatenko andWillems [58]. It was found that the scheme can leakinformation about the secret key. However, this is attribu-table to helper data, a bit sequence at random distance to the

362 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013

Fig. 1. Synchronization offset of NTP synchronized audio recordings.

Page 6: Secure Communication Based on Ambient Audio

secret key, which is made public in traditional fuzzycommitment schemes. In our case, we do not utilize helperdata and only optionally provide the hash of a data sequencewith similar purpose.

The publicly available distance � between f and c might,however leak information when either the fingerprints f ,the code-sequences c 2 C or the random word a 2 A are notdistributed uniformly at random or have insufficiententropy. Generally, it is important that

1. the random function to generate a has a sufficientlyhigh entropy

2. the codewords c 2 C are independently and uni-formly distributed over all possible bit sequences oflength n

3. The entropy of the generated fingerprints is high.

We address these issues in the following:

1. The choice of a 2 A has to be done by using arandom source with sufficient entropy. In Linux-based systems /dev/urandom should provideenough entropy for using the output for crypto-graphic purposes [59]. For generating h(a) a one-way-function has to be chosen to make sure that noassumptions on a can be made based on h(a). Weutilize SHA-512 which is certified by the NIST andwas extensively evaluated [50].

2. We are using 512 bit fingerprints and the Reed-Solomon codeRSð210; 204; 512Þ. Consequently, sets ofwords and codewords are defined as A ¼ IF204

210 andC ¼ IF512

210 . A word a out of 210204 ¼ 1; 024204 possiblewords is randomly chosen and encoded to c.

3. In order to test the entropy of generated fingerprintswe applied the dieHarder [60] set of statistical tests.Generally, we could not find any bias in thefingerprints created from ambient audio. Section 6discusses the test results in more detail.

A relevant attack scenario valid in our case is that theattacker is in the same audio context as Alice and Bob. Inthis case, no security is provided by the proposed protocol.Although this is a plausible threat, it can hardly be avoidedthat the leaking of contextual information poses a thread toa protocol that is designed to base the secure key generationexclusively on exactly this information. This principle isessential for the desired unobtrusive and ad hoc operation.An overview over possible attack scenarios when theattacker is not inside the same context is listed below.

3.5.1 Brute Force

The set of possible words A has to be large enough. Itshould be computationally infeasible to test every combina-tion to get the used word a. The probability to guess theright a is 1;024�204 in our implementation. Note that evenwith u ¼ 0:6, this probability is still 1;024�102.

3.5.2 Denial-of-Service (DoS)

An attacker could stress the communication while Alice andBob are using the fuzzy pairing. The pairing would fail ifð�; h(a)Þ is not transmitted correctly. DoS preventionsshould be implemented to provide an accurate treatment.As part of these preventions a maximum number of

attempts to pair two devices should be defined. Generally,this type of attack is only possible when ð�; h(a)Þ or � istransmitted. As mentioned in Section 3.3, with a carefulchoice of the fingerprint mechanism the exchange of datacan be avoided.

3.5.3 Man-in-the-Middle

An Eavesdropper Eve could be located in such a way, thatshe can intercept the wireless connection but is not locatedin the same physical context as Alice and Bob. When Eveintercepts the tuple ð�; h(a)Þ, she must generate an audio-fingerprint f that is sufficiently close to the fingerprints fand f 0 of Alice and Bob to intercept successfully. With noknowledge on the audio context, a brute force attack is thenrequired. This has to be done while Alice and Bob arecurrently in the phase of pairing. Therefore, Eve is limitedby a strict time frame. Again, this attack can be preventedby avoiding the transmission of ð�; h(a)Þ or �.

3.5.4 Audio Amplification

An Eavesdropper Eve could be located in physicalproximity where the ambient audio used by Alice andBob to generate their fingerprints is replicated. Eve canutilize a directional microphone to amplify these audiosignals. In fact, this is a security threat which increases thechance that Eve can reconstruct the fingerprint partly tohave a greater probability of guessing the secure secret.Since our scheme inherently relies on contextual informa-tion we cannot completely eliminate this threat. However,we show in Section 5.2 that the acoustic properties in tworooms are at least sufficiently different to prevent a devicewith access to the dominant audio source to be successful inmore than 50 percent of all cases.

4 FINGERPRINT-BASED AUTHENTICATION

In a controlled environment we recorded several audiosamples with two microphones placed at distinct positionsin a laboratory. The samples were played back by a singleaudio source. Microphones were attached to the left andright ports of an audio card on a single computer withaudio cables of equal lengths. They were placed at 1.5, 3,4.5, and 6 m distance to the audio source. For each setting,the two microphones were always located at nonequaldistances. In several experiments, the audio source emittedthe samples at quiet, medium, and loud volume. The audiosamples utilized consisted of several instances of music, aperson clapping her hands, snapping her fingers, speaking,and whistling. Dependent on the specific sample, the mean

SCHURMANN AND SIGG: SECURE COMMUNICATION BASED ON AMBIENT AUDIO 363

TABLE 1Approximate Mean Loudness Experienced for

Several Sample Classes at 1.5 m Distance

Page 7: Secure Communication Based on Ambient Audio

dB for these loudness levels varied slightly. The loudnesslevels for several sample classes experienced in 1.5 mdistance are detailed in Table 1.

For these samples recorded by both microphones wecreated audio fingerprints and compared their Hammingdistances pair-wise. We distinguish between fingerprintscreated for audio sampled simultaneously and nonsimulta-neously. Overall, 7,500 distinct comparisons betweenfingerprints are conducted in various environmental set-tings. From these, 300 comparisons are created forsimultaneously recorded samples.

Fig. 2 depicts the median percentage of identical bits in thefingerprints for audio samples recorded simultaneously andnonsimultaneously for several positions of the microphones

and for several loudness levels. The error bars depict thevariance in the Hamming distance.

First, we observe that the similarity in the fingerprintsis significantly higher for simultaneously sampled audio inall cases. Also, notably, the similarity in the fingerprints ofnonsimultaneously recorded audio is slightly higher than50 percent, which we would expect for a random guess.The small deviation is a consequence of the monotonouselectronic background noise originated by the recordingdevices consisting of the microphones and the audiochipsets.

Additionally, the distance of the microphones to theaudio source has no impact on the similarity of fingerprints.Similarly, we cannot observe a significant effect of the

364 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013

Fig. 2. Hamming distance observed for fingerprints created for recorded audio samples at distinct loudness levels and distances between

microphones and the audio source.

Page 8: Secure Communication Based on Ambient Audio

loudness level. This confirms our expectation since for the

fingerprinting approach not the absolute energy on

frequency bands but changes in energy over time were

considered (cf. Section 3.1). Therefore, changes in the

loudness level as, for instance, by altering the distance to

the audio source or by changing the volume of the audio,

have minor impact on the fingerprints.Table 2 depicts the maximum and minimum Hamming

distance among all experiments. We observe that one of thecomparisons of fingerprints for nonsimultaneously re-corded audio yielded a maximum similarity of 0.6484. Thisvalue is still fairly separated from the minimum bit-similarity observed for fingerprints from simultaneouslyrecorded samples. Also, this event is very seldom in the7,200 comparisons since the mean is sharply concentratedaround the median with a low variance. Therefore, byrepeating this process for a small number of times, wereduce the probability of such an event to a negligible value.For instance, only about 3.8 percent of the comparisonsbetween fingerprints from nonmatching samples havea similarity of more than 0.58; only 0.4583 percent have asimilarity of more than 0.6. Similarly, only 2.33 percent ofthe comparisons of synchronously sampled audio have asimilarity of less than 0.7.

With these results, we conclude that an authenticationbased on audio fingerprints created from synchronizedaudio samples in identical environmental contexts isfeasible. However, since it is unlikely that the fingerprintsmatch in all bits, it is not possible to utilize the audiofingerprints directly as a secret key to establish a securecommunication channel among devices. We thereforeconsidered error correcting codes to account for the noisein the fingerprints created.

5 CASE STUDIES

We implemented the described ambient audio-basedsecure communication scheme in Python and conductedcase studies in four distinct environments. The experimentsfeature differing loudness levels, different backgroundnoise figures as well as distinct common situations. InSection 5.1, we observe how the proposed method canestablish an ad hoc secure communication based on audiofrom ongoing discussions in a general office environment.Since an adversary able to sneak into the audio context of agiven room might be better positioned to guess the securekey, we demonstrate in Section 5.2 that even for anadversary device that is able to establish a similardominant audio context in a different room by listeningto the same FM-radio-channel, the gap in the createdfingerprints is significant. In these two experiments, weutilized artificial audio sources in a sense that they werespecifically placed to create the ambient audio context. InSections 5.3 and 5.4, we describe experiments in commonenvironments where ambient audio was utilized exclu-sively. In Section 5.3, we placed devices at distinct locationsin a canteen and studied the success probability based onthe distance between devices. In Section 5.4, we study thefeasibility of establishing a secure communication channelwith road traffic as background noise. Fig. 3 summarizesall settings considered. To capture audio we utilized thebuild-in microphones of the computers. The only exceptionis the reference scenario Fig. 3a in which simple off-the-shelf external microphones have been utilized. For bothdevices, the manufacturer and audio device types differed.Table 3 details further configuration of the scenariosconducted and the hardware utilized.

SCHURMANN AND SIGG: SECURE COMMUNICATION BASED ON AMBIENT AUDIO 365

TABLE 2Percentage of Identical Bits between Fingerprints

Fig. 3. Environmental settings of the case studies conducted.

TABLE 3Configuration of the Four Scenarios Considered

Page 9: Secure Communication Based on Ambient Audio

5.1 Office Environment

In our first case study, we position two laptops in an officeenvironment. Ambient audio was originated from indivi-duals speaking inside or outside of the office room. Weconducted several sets of experiments with differingpositions of laptop computers and audio sources asdepicted in Fig. 3a. We distinguish four distinct scenarios:

1. 3a1: Both devices inside the office at locations aand b. One to two individuals speaking atlocations 1 to 4.

2. 3a2: One device inside and one outside the office infront of the open office door at locations a and c. Oneto two individuals speaking at locations 1 and 5.

3. 3a3: Both devices in the corridor in front of the officeat locations c and d. One to two individuals speakingat locations 5 to 11.

4. 3a4: One device inside and one outside the office infront of the closed office door at locations a and c.One to two individuals speaking (damped butaudible behind closed door) at locations 1 and 5.

In all cases, the devices were synchronized over NTP.For each synchronization, one device indicated at whichpoint in time it would initiate audio recording. Both devicesthen sample ambient audio at that time and create acommon key following the protocol described in Section 3.1.For each scenario the key synchronization process wasrepeated 10 times with the persons located at differentlocations. From these persons, either person 1, person 2 orboth were talking during the synchronization attempts inorder to provide the audio context.

The settings 3a1 and 3a3 represent the situation of twofriendly devices willing to establish a secure communica-tion channel. The setting 3a2 could constitute the situationin which a person passing by is accidentally witnessing thecommunication and part of the audio context. In setting 3a4,the communication partners might have closed the officedoor intentionally in order to keep information secure frompersons outside the office.

In scenario 3a1, where both devices share the same audiocontext a fraction of 0.9 of all synchronization attempts havebeen successful. Also, for scenario 3a3, the fraction ofsuccessful synchronization attempts was as high as 0.8.Consequently, when both devices are located in the sameaudio context, a successful synchronization is possible withhigh probability.

For scenario 3a2, where the device outside the ajar doorcould partly witness the audio context, we had a successprobability of 0.4. Although this means that less than everysecond approach was successful, this is clearly not acceptablein most cases. Still, this low success probability it isremarkable since the person speaking in the office or on thecorridor was clearly audible at the respective other location.

In scenario 3a4, however, when the audio context wasseparated by the closed door, no synchronization attemptwas successful. Remarkably in this case, the person speak-ing was, although hardly comprehensible, still audible atthe other side of the door.

Finally, we attempted to establish a synchronization inthe scenarios 3a1, 3a2, and 3a3 when only background noisewas present. This means that no sound was emitted from a

source located in the same location as one of the devices.Some distant voices and indistinguishable sounds couldoccasionally be observed. After a total of 12 tries in thesethree scenarios, not a single one resulted in a successfulsynchronization between devices. We conclude that adominant noise source or at least more dominant back-ground noise needs to be present in the same physicalcontext as the devices that want to establish a common key.

5.2 Context Replication with FM-Radio

A straightforward security attack for audio-based encryp-tion could be for the attacker to extract information about theaudio context and use this in order to guess the secret keycreated. We studied this threat by trying to generate a secretkey between two devices in different rooms but with similaraudio contexts. In particular, we placed two FM-radios,tuned to the same frequency in both rooms (cf. Fig. 3b).

The audio context was therefore dominated by thesynchronized music and speech from the FM-radiochannel. No other audio sources have been present in therooms so that additional background noise was negligible.We conducted two experiments in which the devices werefirst located in the same room and then in different roomswith the same distance to the audio source. The loudnesslevel of the audio source was tuned to about 50 dB in bothrooms. Fig. 4 depicts the median bit-similarity achievedwhen the devices were placed in the same room and indifferent rooms, respectively.

We observe that in both cases the variance in the biterrors achieved is below 0.01 percent. When both devicesare placed in the same room, the median Hamming distancebetween fingerprints is only 31.64 percent. We account thishigh similarity and the low variance to the fact thatbackground noise was negligible in this setting since theFM-radio was the dominant audio source.

When the devices are placed in different rooms, thevariance in bit error rates is still low with 0.008 percent. Themedian Hamming distance rose in this case to 36.52 percent.

Consequently, although the dominant audio source inboth settings generated identical and synchronized content,the Hamming distance drops significantly when bothdevices are in an identical room. With sufficient tuning ofthe error correction method conditioned on the Hammingdistance, an eavesdropper can be prevented from stealing

366 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013

Fig. 4. Median percentage of bit errors in fingerprints generated by twomobile devices in an office setting. The audio context was dominated byan FM radio tuned to the same channel.

Page 10: Secure Communication Based on Ambient Audio

the secret key even though information on the audio contextmight be leaking.

5.3 Canteen Environment

We studied the accuracy of the approach in the canteen ofthe TU Braunschweig (cf. Fig. 3c). At different tables, laptopcomputers have been placed. For each configuration weconducted 10 attempts to establish a unique key based onthe fingerprints. We conducted all experiments during 11:30and 14:00 on a business day in a well-populated canteen.The ambient noise in this experiment was approximately at60 dB. Apart from the audible discussion on each table,background noise was characterized by occasional highpitches of clashing cutlery.

Fig. 5 depicts the results achieved. The figure shows themedian percentage of bit errors between the fingerprintsgenerated by both devices.

We observe that generally the percentage of identical bitsin the fingerprint decreases with increasing distance. Withabout 2 m distance the percentage of identical bits is stillquite similar to the similarity achieved when devices areonly 30 cm apart. This is also true when one of the devicesis placed at the next table. However, with a distance ofabout 4 meters and above, the percentage of bit errors arewell separated so that also the error correction could betuned such that a generation of a unique key is not feasibleat this a distance.

5.4 Outdoor Environment

In this instrumentation the two computers were located atthe side of a well trafficked road. The study has beenconducted during the rush hour between 17:00 and 19:00 ata regular working day. The road was frequented bypedestrians, bicycles, cars, lorries, and trams. The datawere measured not far off a headlight so that trafficoccasionally stopped with running motors in front of themeasurements. The loudness level was about 60 dB for bothdevices. The setting is depicted in Fig. 3d. We graduallyincreased the distance among devices. Devices have beenplaced with a distance between their microphones of 0.5, 3,5, 7, and 9 m at one side of the road. Additionally, for oneexperiment devices are placed at opposite sides of the road.For each configuration 10 to 13 experiments have beenconducted. The results are depicted in Fig. 6. The figuredepicts the median Hamming distance and variance for therespective configurations applied.

Not surprisingly, we observe that the Hamming distancebetween fingerprints generated by both devices is lowestwhen devices are placed next to each other. With increasingdistance, the Hamming distance increases slightly but thenstays similar also for greater distances.

At the opposite side of the road, however, the Hammingdistance drops more significantly. When both devices are atthe same side of the road, the probability to guess the secretkey is high even for greater distances between the devices.We believe that this property is attributable to the verymonotonic background noise generated by the vehicles onthe road. The audio context is therefore similar also ingreater distances.

Only when one of the devices is located at the oppositeside of the road, a more significant distinction between thegenerated fingerprints is possible. This may account to thedifferent reflection of audio off surrounding buildings andto the fact that vehicles on the other lane generate a differentdominant audio footprint.

Generally, these results suggest that audio-based keygeneration is hardly feasible in this scenario. Audio-basedgeneration of secret keys is not well suited in an environ-ment with very monotonic and unvaried background noise.Although a light protection from intruders on the differentside of the road is possible, the radius in which similarfingerprints are generated on one side of the road isunacceptably high.

6 ENTROPY OF FINGERPRINTS

Although these results suggest that it is unlikely for a devicein another audio context to generate a fingerprint which issufficiently similar, an active adversary might analyze thestructure of fingerprints created to identify and explore apossible weakness in the encryption key. Such a weaknessmight be constituted by repetitions of subsequences or byan unequal distribution of symbols. A message encryptedwith a key biased in such a way may leak more informationabout the encrypted message than intended.

We estimated the entropy of audio fingerprints gener-ated for audio-sub-sequences by applying statistical tests onthe distribution of bits. In particular, we utilized thedieHarder [60] set of statistical tests. This battery of testscalculates the p-value of a given random sequence with

SCHURMANN AND SIGG: SECURE COMMUNICATION BASED ON AMBIENT AUDIO 367

Fig. 5. Median percentage of bit errors in fingerprints generated by twomobile devices in a canteen environment. Fig. 6. Median percentage of bit errors in fingerprints from two mobile

devices beside a heavily trafficked road.

Page 11: Secure Communication Based on Ambient Audio

respect to several statistical tests. The p-value denotes theprobability to obtain an input sequence by a truly randombit generator [61]. All tests are applied to a set offingerprints of 480 bits length. We utilized all samplesobtained in Sections 4 and 5.

From 7,490 statistical-test-batches consisting of 100

repeated applications of one specific test each, only 173,

or about 2.31 percent resulted in a p-value of less than 0.05.1

Each specific test was repeated at least 70 times. The p-

values are calculated according to the statistical test of

Kuiper [61], [62].Fig. 7 depicts for all test series conducted the fraction of

tests that did not pass a sequence of 100 consecutive runs at

> 5% for Kuiper KS p-values [61] for all 107 distinct tests in

the DieHarder battery of statistical tests. Generally, we

observe that for all test runs conducted, the number of tests

that fail is within the confidence interval with a confidence

value of � ¼ 0:03. The confidence interval was calculated

for m ¼ 107 tests as

1� � 3 �ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffið1� �Þ � �

m

r: ð19Þ

Alternatively, we could not observe any distinction betweenindoor and outdoor settings (cf. Figs. 7a and 7b) andconclude that also the increasing noise figure and differenthardware utilized2 does not impact the test results. Sincemusic might represent a special case due to its structuredproperties and possible repetitions in an audio sequence,we considered it separately from the other samples. Wecould not identify a significant impact of music on theoutcome of the test results (cf. Fig. 7c).

Additionally, we separated audio samples of one audioclass and used them exclusively as input to the statisticaltests. Again, there is no significant change for any of theclasses (cf. Fig. 7d).

We conclude that we could not observe any bias infingerprints based on ambient audio. Consequently, theentropy of fingerprints based on ambient audio can beconsidered as high. An adversary should gain no significantinformation from an encrypted message eavesdropped.

7 CONCLUSION

We have studied the feasibility to utilize contextual informa-tion to establish a secure communication channel among

368 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013

Fig. 7. Illustration of P-Values obtained for audio fingerprints by applying the DieHarder battery of statistical tests.

1. All results are available at http://www.ibr.cs.tu-bs.de/users/sigg/StatisticalTests/TestsFingerprints_110601.tar.gz.

2. Overall, the microphones utilized (two internal, two external) wereproduced by three distinct manufacturers.

Page 12: Secure Communication Based on Ambient Audio

devices. The approach was exemplified for ambient audioand can be similarly applied to alternative features or contextsources. The proposed fuzzy-cryptography scheme is adap-table in its noise tolerance through the parameters of the errorcorrecting code utilized and the audio sample length.

In a laboratory environment, we utilized sets of record-ings for five situations at three loudness levels and fourrelative positions of microphones and audio source. Wederived in 7,500 experiments the expected Hammingdistance among audio fingerprints. The fraction of identicalbits is above 0.75 for fingerprints from the same audiocontext and below 0.55 otherwise. This gap in the Hammingdistance can be exploited to generate a common secretamong devices in the same audio context. We detailed aprotocol utilizing fuzzy-cryptography schemes that doesnot require the transmission of any information on thesecure key. The common secret is instead conditioned onfingerprints from synchronized audio recordings. Thescheme enables ad hoc and unobtrusive generation of asecure channel among devices in the same context. Weconducted a set of common statistical tests and showed thatthe entropy of audio fingerprints based on energy differ-ences in adjacent frequency bands is high and sufficient toimplement a cryptographic scheme.

In four case studies, we verified the feasibility of theprotocol under realistic conditions. The greatest separationbetween fingerprints from identical and nonidentical audio-contexts was observed indoor with low background noiseand a single dominant audio source. In such an environment,we could distinguish devices in the same and in differentaudio contexts. It was even possible to clearly identify adevice that replicated dominant audio from another roomwith an equally tuned FM-radio at similar loudness level.

In a case study conducted in a crowded canteenenvironment, we observed that the synchronization qualitywas generally impaired due to the absence of a dominantaudio source. However, it was still possible to establish aprivacy area of about 2 m inside which the Hammingdistance of fingerprints was distinguishably smaller thanfor greater distances. The worst results have been obtainedin a setting conducted beside a heavily trafficked road. Inthis case, when the noise component becomes dominantand considerably louder, the synchronization quality wasfurther reduced. Additionally, due to the increased loud-ness level, a similar synchronization quality was possiblealso at distances of about 9 m. We conclude that in thisscenario, a secure communication channel based purely onambient audio is hard to establish.

We claim that the synchronization quality in scenarioswith more dominant noise components can be furtherimproved with improved features and fingerprint algo-rithms. Currently, most ideas are lent from fingerprintingalgorithms and features designed to distinguish betweenmusic sequences. Although algorithms have been adaptedto better capture characteristics of ambient audio, webelieve that features and fingerprint generation to classifyambient audio might be further improved. Additionally, theconsideration of additional contextual features such as lightor RF-channel-based should improve the robustness of thepresented approach.

In our implementation we faced difficulties to achievesufficiently accurate (in the order of few milliseconds) time

synchronization among wireless devices. In our currentstudies, we tested several sample windows of NTP-synchro-nized recordings in order to achieve a feasible implementa-tion on standard hardware. However, a more exact timesynchronization would further reduce the accuracy andcomputational complexity of the approach.

ACKNOWLEDGMENTS

This work was supported by a fellowship within the

Postdoctoral Programme of the German Academic Ex-

change Service (DAAD).

REFERENCES

[1] C. Dupuy and A. Torre, Local Clusters, Trust, Confidence andProximity, Series Clusters and Globalisation: The Developmentof Urban and Regional Economies, pp. 175-195, Edward Elgar,2006.

[2] R. Mayrhofer and H. Gellersen, “Spontaneous Mobile DeviceAuthentication Based on Sensor Data,” Information SecurityTechnical Report, vol. 13, no. 3, pp. 136-150, 2008.

[3] D. Bichler, G. Stromberg, M. Huemer, and M. Loew, “KeyGeneration Based on Acceleration Data of Shaking Processes,”Proc. Ninth Int’l Conf. Ubiquitous Computing, J. Krumm, ed., 2007.

[4] L.E. Holmquist, F. Mattern, B. Schiele, P. Schiele, P. Alahuhta, M.Beigl, and H.W. Gellersen, “Smart-Its Friends: A Technique forUsers to Easily Establish Connections Between Smart Artefacts,”Proc. Third Int’l Conf. Ubiquitous Computing, 2001.

[5] A. Varshavsky, A. Scannell, A. LaMarca, and E. de Lara, “Amigo:Proximity-Based Authentication of Mobile Devices,” Int’lJ. Security and Networks, vol. 4, pp. 4-16, 2009.

[6] H.-W. Gellersen, G. Kortuem, A. Schmidt, and M. Beigl, “PhysicalPrototyping with Smart-Its,” IEEE Pervasive Computing, vol. 4,pp. 10-18, 2004.

[7] R. Mayrhofer and H. Gellersen, “Shake Well Before Use:Authentication Based on Accelerometer Data,” Proc. Fifth Int’lConf. Pervasive Computing, pp. 144-161, 2007.

[8] R. Mayrhofer, “The Candidate Key Protocol for Generating SecretShared Keys from Similar Sensor Data Streams,” Proc. FourthEuropean Conf. Security and Privacy in Ad-Hoc and Sensor Networks,pp. 1-15, 2007.

[9] D. Bichler, G. Stromberg, and M. Huemer, “Innovative KeyGeneration Approach to Encrypt Wireless Communication inPersonal Area Networks,” Proc. IEEE GlobeCom, 2007.

[10] J. Hershey, A. Hassan, and R. Yarlagadda, “UnconventionalCryptographic Keying Variable Management,” IEEE Trans.Comm., vol. 43, no. 1, pp. 3-6, Jan. 1995.

[11] G. Smith, “A Direct Derivation of a Single-Antenna ReciprocityRelation for the Time Domain,” IEEE Trans. Antennas andPropagation, vol. 52, no. 6, pp. 1568-1577, June 2004.

[12] M.G. Madiseh, M.L. McGuire, S.S. Neville, L. Cai, and M. Horie,“Secret key Generation and Agreement in Uwb CommunicationChannels,” Proc. IEEE GlobeCom, 2008.

[13] S.T.B. Hamida, J.-B. Pierrot, and C. Castelluccia, “An AdaptiveQuantization Algorithm for Secret Key Generation Using RadioChannel Measurements,” Proc. Third Int’l Conf. New Technologies,Mobility and Security, 2009.

[14] K. Kunze and P. Lukowicz, “Symbolic Object LocalizationThrough Active Sampling of Acceleration and Sound signatures,”Proc.Ninth Int’l Conf. Ubiquitous Computing, 2007.

[15] P. Tuyls, B. Skoric, and T. Kevenaar, Security with Noisy Data.Springer-Verlag, 2007.

[16] Q. Li and E.-C. Chang, “Robust, Short and Sensitive Authentica-tion Tags Using Secure Sketch,” Proc. Eighth Workshop Multimediaand Security, pp. 56-61, 2006.

[17] Y. Dodis, R. Ostrovsky, L. Reyzin, and A. Smith, “FuzzyExtractors: How to Generate Strong Keys from Biometrics andOther Noisy Data,” Proc. Int’l Conf. the Theory and Applicationsof Cryptographic Techniques (EUROCRYPT ’04), pp. 79-100, 2004.

[18] F. Miao, L. Jiang, Y. Li, and Y.-T. Zhang, “Biometrics Based NovelKey Distribution Solution for Body Sensor Networks,” Proc. IEEEAnn. Int’l Conf. Eng. in Medicine and Biology Soc. (EMBC ’09),pp. 2458-2461, 2009.

SCHURMANN AND SIGG: SECURE COMMUNICATION BASED ON AMBIENT AUDIO 369

Page 13: Secure Communication Based on Ambient Audio

[19] A. Juels and M. Sudan, “A Fuzzy Vault Scheme,” Proc. IEEE Int’lSymp. Information Theory, p. 408, 2002.

[20] Y. Dodis, J. Katz, L. Reyzin, and A. Smith, “Robust FuzzyExtractors and Authenticated Key Agreement from Close Secrets,”Proc. 26th Ann. Int’l Conf. Advances in Cryptology (CRYPTO ’06),pp. 232-250, 2006.

[21] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A Review ofAlgorithms for Audio Fingerprinting,” The J. VLSI Signal Proces-sing, vol. 41, no. 3, pp. 271-284, 2005.

[22] S. Baluja and M. Covell, “Waveprint: Efficient Wavelet-BasedAudio Fingerprinting,” Pattern Recognition, vol. 41, no. 11,pp. 3467-3480, 2008.

[23] L. Ghouti and A. Bouridane, “A Robust Perceptual AudioHashing Using Balanced Multiwavelets,” Proc. IEEE Fifth Int’lConf. Acoustics, Speech, and Signal Processing, 2006.

[24] S. Sukittanon and L. Atlas, “Modulation Frequency Features forAudio Fingerprinting,” Proc. IEEE Second Int’l Conf. Acoustics,Speech, and Signal Processing, 2002.

[25] J. Haitsma and T. Kalker, “A Highly Robust Audio FingerprintingSystem,” Proc. Third Int’l Conf. Music Information Retrieval, Oct.2002.

[26] C. Burges, D. Plastina, J. Platt, E. Renshaw, and H. Malvar, “UsingAudio Fingerprinting for Duplicate Detection and ThumbnailGeneration,” Proc. IEEE Int’l Conf. Acoustics, Speech, and SignalProcessing (ICASSP ’05), vol. 3, pp. iii/9-iii12, Mar. 2005.

[27] C. Bellettini and G. Mazzini, “A Framework for Robust AudioFingerprinting,” J. Comm., vol. 5, no. 5, pp. 409-424, 2010.

[28] A. Ghias, J. Logan, D. Chamberlin, and B.C. Smith, “Query byHumming,” Proc. ACM Multimedia, 1995.

[29] D. Parsons, The Directory of Tunes and Musical Themes. CambridgeUniv., 1975.

[30] R.A. Baeza-Yates and C.H. Perleberg, “Fast and PracticalApproximate String Matching,” Proc. Third Ann. Symp. Combina-torial Pattern Matching, 1992.

[31] R.J. McNab, L.A. Smith, I.H. Witten, C.L. Henderson, and S.J.Cunningham, “Towards the Digital Music Library: Tune Retrievalfrom Acoustic Iinput,” Proc. ACM Int’l Conf. Digital Libraries, 1996.

[32] L. Prechelt and R. Typke, “An Interface for Melody Input,” ACMTrans. Computer Human Interactions, vol. 8, pp. 133-149, 2001.

[33] W. Chai and B. Vercoe, “Melody Retrieval on the Web,” Proc.ACM/SPIE Conf. Multimedia Computing and Networking, 2002.

[34] L. Shifrin, B. Pardo, and W. Birmingham, “HMM-Based MusicalQuery Retrieval,” Proc. Joint Conf. Digital Libraries, 2002.

[35] L. Rabiner, “A Tutorial on Hidden Markov Models and SelectedApplications in Speech Recognition,” Proc. IEEE, vol. 77, no. 2,pp. 257-286, Feb. 1989.

[36] Y. Zhu and D. Shasha, “Warping Indexes with Envelope Trans-forms for Query by Humming,” Proc. ACM SIGMOD Int’l Conf.Management of Data, 2003.

[37] J. Haitsma and T. Kalker, “Robust Audio Hashing for ContentIdentification,” Proc. Int’l Workshop Content-Based MultimediaIndexing (CBMI), 2001.

[38] J. Lebosse, L. Brun, and J.-C. Pailles, “A Robust AudioFingerprint’s Based Identification Method,” Proc. Third IberianConf. Pattern Recognition and Image Analysis, Part I (IbPRIA ’07),pp. 185-192, 2007.

[39] C. Burges, J. Platt, and S. Jana, “Distortion Discriminant Analysisfor Audio Fingerprinting,” IEEE Trans. Speech and Audio Proces-sing, vol. 11, no. 3, pp. 165-174, May 2003.

[40] J. Herre, E. Allamanche, and O. Hellmuth, “Robust Matching ofAudio Signals Using Spectral Flatness Features,” Proc. IEEEWorkshop Applications of Signal Processing to Audio and Acoustics,pp. 127-130, 2001.

[41] C. Yang, “MACS: Music Audio Characteristic Sequence Indexingfor Similarity Retrieval,” Proc. IEEE Workshop Applications of SignalProcessing to Audio and Acoustics, pp. 123-126, 2001.

[42] C. Yang, “Efficient Acoustic Index for Music Retrieval withVarious Degrees of Similarity,” Proc. 10th ACM Int’l Conf.Multimedia (MULTIMEDIA ’02), pp. 584-591, http://doi.acm.org/10.1145/641007.641125, 2002.

[43] A. Wang, “The Shazam Music Recognition Service,” Comm. ACM,vol. 49, no. 8, pp. 44-48, 2006.

[44] A. Wang, “An Industrial Strength Audio Search Algorithm,” Proc.Int’l Conf. Music Information Retrieval (ISMIR), 2003.

[45] F. Hao, “On Using Fuzzy Data in Security Mechanisms,” PhDdissertation, Queens College, Cambridge, Apr. 2007.

[46] A.C. Ibarrola and E. Chavez, “A Robust Entropy-Based Audio-Fingerprint,” Proc. Int’l Conf. Multimedia and Expo (ICME ’06),2006.

[47] B. Schneider, Applied Cryptography: Protocols, Algorithms, and SourceCode in C, second ed. John Wiley and Sons, 1996.

[48] I. Reed and G. Solomon, “Polynomial Codes over Certain FiniteFields,” J. Soc. for Industrial and Applied Math., vol. 8, pp. 300-304,1960.

[49] A. Juels and M. Wattenberg, “A Fuzzy Commitment Scheme,”Proc. Sixth ACM Conf. Computer and Comm. Security, pp. 28-36,1999.

[50] Secure Hash Standard (SHS), FIPS PUB 180-4, National Institute ofStandards and Technology, Oct. 2008.

[51] Advanced Encryption Standard (AES), FIPS PUBS 197, NationalInstitute of Standards and Technology, 2001.

[52] D. Mills, J. Martin, J. Burbank, and W. Kasch, “Network TimeProtocol Version 4: Protocol and Algorithms Specification,” IETFRFC 5905, http://www.ietf.org/rfc/rfc5905.txt, June 2010.

[53] D.L. Mills, “Improved Algorithms for Synchronising ComputerNetwork Clocks,” IEEE/ACM Trans. Networking, vol. 3, no. 3,pp. 245-254, June 1995.

[54] S. Meier, H. Weibel, and K. Weber, “IEEE 1588 Syntonization andSynchronization Functions Completely Realized in Hardware,”Proc. IEEE Symp. Int’l Precision Clock Synchronization for Measure-ment, Control and Comm. (ISPCS ’08), 2008.

[55] D.L. Mills, “Precision Synchronisation of Computer NetworkClocks,” ACM Computer Comm. Rev., vol. 24, no. 2, pp. 28-43, Apr.1994.

[56] “GStreamer Documentation,” http://gstreamer.freedesktop.org/documentation, Oct. 2010.

[57] W.J. Scheirer and T.E. Boult, “Cracking Fuzzy Vaults andBiometric Encryption,” Proc. Biometrics Symp., 2007.

[58] T. Ignatenko and F.M.J. Willems, “Information Leakage in FuzzyCommitment Schemes,” IEEE Trans. Information Forensics andSecurity, vol. 5, no. 2, pp. 337-348, June 2010.

[59] K. Fenzi and D. Wreski, “Linux Security HOWTO,” http://www.ibiblio.org/pub/linux/docs/howto/other-formats/pdf/Security-HOWTO.pdf, Jan. 2004.

[60] R.G. Brown, “Dieharder: A Random Number Test Suite,” http://www.phy.duke.edu/~rgb/General/dieharder.php, 2011.

[61] N. Kuiper, “Tests Concerning Random Points on a Circle,” Proc.Koinklijke Nederlandse Akademie van Wetenschappen, vol. 63, pp. 38-47, 1962.

[62] M. Stephens, “The Goodness-of-Fit Statistic v n: Distribution andSignificance Points,” Biometrika, vol. 52, pp. 309-321, 1965.

Dominik Schurmann received the bachelor’s ofscience degree from TU Braunschweig, Ger-many, in 2010. His research interests includeunobtrusive security in distributed systems andcryptographic algorithms in general.

Stephan Sigg received a diploma in computerscience from the University of Dortmund, Ger-many, in 2004 and the PhD degree in 2008 fromthe chair for communication technology at theUniversity of Kassel, Germany. He is currentlywith the Information Systems ArchitectureScience Research Division at the NationalInstitute of Informatics (NII), Japan. His researchinterests include the analysis, development, andoptimization of algorithms for pervasive comput-

ing systems. He is a member of the IEEE Computer Society.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

370 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 12, NO. 2, FEBRUARY 2013