
AmbientSense: A Real-Time Ambient Sound Recognition System for Smartphones

    Mirco Rossi∗

    [email protected]

    Sebastian Feese∗

    [email protected]

    Oliver Amft†

    [email protected]

    Nils Braune∗

    [email protected]

    Sandro Martis∗

    [email protected]

    Gerhard Tröster∗

    [email protected]

Abstract—This paper presents design, implementation, and evaluation of AmbientSense, a real-time ambient sound recognition system on a smartphone. AmbientSense continuously recognizes user context by analyzing ambient sounds sampled from a smartphone's microphone. The phone provides the user with real-time feedback on the recognized context. AmbientSense is implemented as an Android app and works in two modes: in autonomous mode, processing is performed on the smartphone only; in server mode, recognition is done by transmitting audio features to a server and receiving classification results back. We evaluated both modes on a set of 23 daily life ambient sound classes and describe recognition performance, phone CPU load, and recognition delay. The application runs with a fully charged battery up to 13.75 h on a Samsung Galaxy SII smartphone and up to 12.87 h on a Google Nexus One phone. Runtime and CPU load were similar for autonomous and server modes.

    I. Introduction

Sound is a rich source of information that can be used to infer a person's context in daily life. Almost every activity produces some characteristic ambient sound patterns, e.g. speaking, walking, eating, or using the computer. Most locations usually have a specific sound pattern too, e.g. restaurants or streets. Real-time sound context information could be used for various mobile applications. For example, a smartphone can automatically switch to an appropriate profile while in a meeting, or provide information customized to the location of the user. It has been shown that pattern recognition can be used on sound data to automatically infer user activities [1], environments [2], [3] and social events [4]. Inferring user context using microphones has advantages compared to other modalities: microphones are cheap, available in many wearable devices (such as smartphones), and work even if partly cloaked [5].

To date, sound-based context inference has often been analyzed offline or by simulations on a PC. New smartphones with high computational power and Internet connectivity could enable users to rely on sound-based context inferences for mobile and real-time applications. However, for a smartphone implementation several design choices are needed.

∗ Wearable Computing Lab., ETH Zurich. † ACTLab, Signal Processing Systems, TU Eindhoven.

For example, the sound recognition could be realized by processing directly on the phone, or by offloading the context processing to a server. Moreover, recognition delay and CPU load are key to determining the practicality of the approach.

In this work, we propose AmbientSense, a real-time ambient sound recognition system that works on a smartphone. We present the system design and its implementation in two modes: in autonomous mode, the system is executed on the smartphone only, while in server mode the recognition is done by transmitting sound-based features to a server and receiving classification results in return. In our evaluation, we compare the two modes regarding recognition accuracy, runtime, CPU usage, and recognition time when running on different smartphone models.

    II. Related Work

While ambient sound classification is an active research area, no detailed evaluations of real-time smartphone implementations exist that could confirm essential system design parameters. Sound has been shown to be a key modality for recognizing activities of daily living in locations such as the bathroom [1], office and workshop [6], kitchen [7], and public spaces [2]. Mesaros et al. evaluated an ambient sound classification system for a set of over 60 classes [3]. These works did not investigate techniques for real-time operation on resource-limited hardware.

Only a few works addressed the implementation of real-time sound classification on wearable devices. Nomadic Radio [8] was the first hardware implementation incorporating real-time sound classification. Nomadic Radio is a personal contextual notification system that provides timely information while minimizing the number of interruptions to the user. Its auditory contextual recognition system consists of real-time speech detection to infer whether the user is in a conversation and thus should not be disturbed. The first wearable system for classifying several ambient sound patterns was proposed by Stäger et al. [6] and is called SoundButton. SoundButton is a low-power audio classification system implemented as dedicated hardware. It can recognize everyday sounds such as a microwave, a coffee grinder, and passing cars. However, because of the recognition-versus-energy trade-off its recognition accuracy is reduced and cannot compete with the performance of the algorithms presented above. More recently, three smartphone-based systems


were proposed: Miluzzo et al. [7] presented a framework for efficient mobile sensing including sound, which was used to recognize the presence of human voices in real time. Lu et al. [9] presented a speaker identification system on a smartphone. Several system parameters (sampling rate, GMM complexity, smoothing window size, and amount of training data needed) were benchmarked to identify thresholds that balance computation cost with performance. Finally, Lu et al. [10] proposed SoundSense, a smartphone implementation of a sound recognition system for voice, music and clustering of ambient sounds. SoundSense divides the ambient sound of a user into a number of sound clusters which have to be manually labeled by the user.

The novelty of our work is in integrating design, implementation, and evaluation of a smartphone-based system for recognizing ambient sound classes in real time, while maintaining a recognition rate similar to the offline analyses presented above. We evaluated the implementation for both cases, running the system on the smartphone only and with the support of a server, concerning runtime, CPU usage, and recognition time.

    III. AmbientSense Architecture

The system samples ambient sound data and produces a context recognition result using auditory scene models every second. The models are created in a training phase based on an annotated audio data training set. This section details the AmbientSense system architecture. Figure 1 illustrates the main components of AmbientSense.

[Figure 1 diagram: microphone and audio data training set feed the front-end processing (framing, feature extraction, merging, normalization); in the training path, SVMtrain builds the auditory scene models; in the recognition path, SVMclassify uses them to produce the recognition; in server mode the features are sent as a JSON object over an HTTP request.]

Fig. 1: AmbientSense architecture illustrating the main components of the system. The main components are either implemented on the smartphone or on the server, depending on the running mode (detailed in Section IV).

Front-end processing: This component extracts auditory scene-dependent features from the audio signal. Its input is either continuous sound captured from a microphone or audio data stored in a database. Audio data with a sample rate of 16 kHz and a resolution of 16 bit is used. In a first step, the audio data is framed by a sliding window with a window size of 32 ms and 50% overlap (framing). Each window is smoothed with a Hamming filter. In a consecutive step, audio features are extracted from every window (feature extraction). We used the Mel-frequency cepstral coefficients (MFCC), the most widely used audio features in audio classification. These features showed good recognition results for ambient sounds [1], [2]. We extracted the first 13 Mel-frequency cepstral coefficients and removed the first one, resulting in a feature vector of 12 elements. In a next step, the feature vectors extracted within one second of audio data are combined by computing the mean and the variance of each feature vector element (merging). This results in a new feature vector f_s with elements f_s,i, i = 1, ..., 24, for each second s of audio data. In a final step the feature vectors are normalized as F_s,i = (f_s,i − m_i) / σ_i, where m_i are the mean values and σ_i the standard deviation values of all feature vectors of the training set (norm.). The front-end processing outputs a new feature vector F_s every second s.
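As an illustration of this chain, the sketch below merges per-frame MFCCs into the per-second feature vector F_s. It is a minimal sketch, not the authors' code: the MFCC computation is hidden behind a hypothetical mfcc12() helper (the paper delegates it to the FUNF framework), and the training-set means m_i and standard deviations σ_i are assumed to be supplied from the training phase.

```java
// Sketch of the per-second front-end processing: framing, Hamming window,
// per-frame MFCCs, merging (mean + variance), and z-score normalization.
public final class FrontEnd {
    static final int SAMPLE_RATE = 16000;   // 16 kHz, 16-bit audio
    static final int FRAME_LEN = 512;       // 32 ms at 16 kHz
    static final int HOP = FRAME_LEN / 2;   // 50% overlap
    static final int NUM_MFCC = 12;         // 13 MFCCs with the first one dropped

    private final double[] m;               // per-element means over the training set
    private final double[] sigma;           // per-element standard deviations

    FrontEnd(double[] trainMean, double[] trainStd) { this.m = trainMean; this.sigma = trainStd; }

    /** Turns one second of audio (16000 samples) into a normalized 24-element vector F_s. */
    double[] processOneSecond(short[] pcm) {
        int frames = (pcm.length - FRAME_LEN) / HOP + 1;
        double[][] coeffs = new double[frames][];
        double[] mean = new double[NUM_MFCC];
        double[] var = new double[NUM_MFCC];

        for (int f = 0; f < frames; f++) {
            double[] frame = new double[FRAME_LEN];
            for (int n = 0; n < FRAME_LEN; n++) {
                // Hamming window applied to each 32 ms frame
                double w = 0.54 - 0.46 * Math.cos(2 * Math.PI * n / (FRAME_LEN - 1));
                frame[n] = w * pcm[f * HOP + n];
            }
            coeffs[f] = mfcc12(frame);                       // hypothetical helper, 12 MFCCs per frame
            for (int i = 0; i < NUM_MFCC; i++) mean[i] += coeffs[f][i];
        }
        for (int i = 0; i < NUM_MFCC; i++) mean[i] /= frames;
        for (int f = 0; f < frames; f++)
            for (int i = 0; i < NUM_MFCC; i++)
                var[i] += Math.pow(coeffs[f][i] - mean[i], 2) / frames;

        // Merging: 12 means + 12 variances = 24 elements, then F_s,i = (f_s,i - m_i) / sigma_i
        double[] feat = new double[2 * NUM_MFCC];
        for (int i = 0; i < NUM_MFCC; i++) { feat[i] = mean[i]; feat[NUM_MFCC + i] = var[i]; }
        for (int i = 0; i < feat.length; i++) feat[i] = (feat[i] - m[i]) / sigma[i];
        return feat;
    }

    private double[] mfcc12(double[] frame) {
        // Placeholder: in the paper this step is performed by the FUNF framework.
        throw new UnsupportedOperationException("MFCC extraction not shown in this sketch");
    }
}
```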

Classification: Gaussian Mixture Models (GMMs) are the standard modelling technique for acoustic classification [2]. However, Chechik et al. showed that a Support Vector Machine (SVM) can achieve comparable recognition accuracies while reducing the computational complexity [11]: the experiment time (training and recognition phase) was 40 times faster for the SVM compared to the GMM. Thus, an SVM classifier with a Gaussian kernel was used for the recognition, which includes the cost parameter C and the kernel parameter γ (as described in [12]). Both parameters were optimized with a parameter sweep as described in Section V-A. The one-against-one strategy was used and an additional probability estimate model was trained, as provided by the LibSVM library [12].
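A minimal configuration of such a classifier with the LibSVM Java API referenced above might look as follows. The class name, cache size, and stopping tolerance are assumptions; C = 128 and γ = 0.5 are the values found by the sweep in Section V-A.

```java
import libsvm.svm_parameter;

final class ClassifierConfig {
    // SVM with a Gaussian (RBF) kernel, one-against-one multi-class strategy,
    // and an additional probability estimate model, as described in the text.
    static svm_parameter rbfParams() {
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;  // Gaussian kernel
        param.C = 128.0;                        // cost parameter, 2^7
        param.gamma = 0.5;                      // kernel parameter, 2^-1
        param.probability = 1;                  // train the probability estimate model
        param.cache_size = 100;                 // kernel cache in MB (assumed)
        param.eps = 1e-3;                       // termination tolerance (LibSVM default)
        return param;
    }
}
```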

Training and testing phase: In a training phase the feature vectors of the training set, including all auditory scene classes, are computed and the SVMTrain component creates all auditory scene models, which are stored for the recognition. In the testing phase the feature vectors are generated from the continuous audio data captured from the microphone. The SVMClassify component uses the stored auditory scene models to classify the feature vectors. Every second a recognition of the last second of audio data is produced.
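The two phases can be sketched with the LibSVM Java API as follows, reusing the rbfParams() configuration above; class and method names are illustrative, not the paper's.

```java
import libsvm.*;

final class SceneModels {
    private svm_model model;                       // the stored auditory scene models

    // Training phase: features[k] is a normalized 24-element vector, labels[k] its class id.
    void train(double[][] features, double[] labels) {
        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[prob.l][];
        for (int k = 0; k < prob.l; k++) prob.x[k] = toNodes(features[k]);
        model = svm.svm_train(prob, ClassifierConfig.rbfParams());
    }

    // Testing phase: called once per second with the current feature vector F_s.
    double classify(double[] featureVector) {
        double[] probs = new double[svm.svm_get_nr_class(model)];  // per-class probability estimates
        return svm.svm_predict_probability(model, toNodes(featureVector), probs);
    }

    private static svm_node[] toNodes(double[] v) {
        svm_node[] nodes = new svm_node[v.length];
        for (int i = 0; i < v.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;                // LibSVM feature indices start at 1
            nodes[i].value = v[i];
        }
        return nodes;
    }
}
```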

Training set: We tested our system on a set of 23 ambient sound classes listed in Table I. The classes were selected to reflect daily life locations and activities that are hard to identify using existing GPS and activity recognition techniques. We collected a training set of audio data by recording for each class six audio samples from different sound sources (e.g. recordings of six different types of coffee machines).


To record the audio samples we used the internal microphone of an Android Google Nexus One smartphone. The samples were recorded in the city of Zurich and in Thailand in different buildings (e.g. office, home, restaurant) and locations (e.g. beach, streets). For each recording we positioned the smartphone close to the sound-generating source or in the middle of the location, respectively. Each sample has a duration of 30 seconds and was sampled with a sampling frequency of 16 kHz at 16 bit. The audio samples are stored together with their annotated labels in the training set database.

beach               crowd football     shaver
bird                dishwasher         sink
brushing teeth      dog                speech
bus                 forest             street
car                 phone ring         toilet flush
chair               railway station    vacuum cleaner
coffee machine      raining            washing machine
computer keyboard   restaurant

TABLE I: The 23 daily life ambient sound classes used to test our system.
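The recording format described above (16 kHz, 16-bit mono from the phone's microphone) could be captured on Android roughly as shown below. This is a sketch, not the authors' code: class names and the listener interface are illustrative, and error handling is simplified.

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Continuously captures one-second chunks of 16 kHz / 16-bit mono audio.
final class AudioCapture {
    static final int SAMPLE_RATE = 16000;

    interface ChunkListener { void onSecond(short[] pcm); }

    void captureOneSecondChunks(ChunkListener listener) {
        int minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, Math.max(minBuf, 2 * SAMPLE_RATE));
        short[] second = new short[SAMPLE_RATE];            // one second of 16-bit samples
        recorder.startRecording();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                int read = 0;                                // fill exactly one second of audio
                while (read < second.length) {
                    int n = recorder.read(second, read, second.length - read);
                    if (n <= 0) return;                      // recording error, stop capturing
                    read += n;
                }
                listener.onSecond(second);                   // hand the chunk to the front-end
            }
        } finally {
            recorder.stop();
            recorder.release();
        }
    }
}
```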

Running modes: Two running modes have been implemented. In the autonomous mode the whole recognition system runs independently on the smartphone, without any Internet connection required. The recognition results are continuously displayed on the phone as an Android notification. In the server mode, ambient sound capturing and the front-end processing are done on the phone. The resulting features are sent to a server where the classification takes place. After classification, the recognized result is sent back to the phone, where it is displayed as an Android notification. This mode requires either Wi-Fi or 3G Internet connectivity, but allows more complex recognition algorithms to be used on the server.

    IV. Implementation

The AmbientSense system was implemented in an Android smartphone setting. The main components (see Section III) were implemented in Java SE 7 and run in an Android smartphone or PC environment. For the implementation we used the FUNF open sensing framework [13] to derive the MFCC features and the LibSVM library [12] for the SVM modeling and classification. In the rest of this section the specific Android and server implementations are explained in detail.

Android implementation details: Figure 2 shows an illustration of the user interface (UI). The application can either be set to build classifier models using training data stored on the SD card, or to classify ambient sound in one of the two running modes. The UI runs as an Activity, which is a class provided by the Android framework. An Activity is forced into a sleep mode by the Android runtime whenever its UI is not in the foreground. The main components of the recognition system, however, need continuous processing. Thus, they were separated from the UI and implemented in an Android IntentService class.

    (a) Tab for recognition. (b) Tab for training.

Fig. 2: AmbientSense user interface on an Android smartphone. During recognition the ambient sound class is displayed at the top as an Android Notification.

This class provides a background task in a separate thread that keeps running even in sleep mode, when the screen is locked, or when the interface of another application is in the foreground. The recognition result is shown as an Android Notification, which pops up on top of the display (see Figure 2). We used the Notification class of the Android framework to implement the recognition feedback. This provides a minimal and non-intrusive way of notifying the user about the current state, while keeping the ability to run the recognition part in the background.
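The service-plus-notification pattern described here could look roughly like the sketch below. It is not the app's actual code: the class name, icon, and notification id are placeholders, and the capture helper is the AudioCapture sketch from Section III.

```java
import android.app.IntentService;
import android.app.Notification;
import android.app.NotificationManager;
import android.content.Context;
import android.content.Intent;

// Background recognition service: the worker thread keeps processing audio
// while the UI Activity may be asleep or another app is in the foreground.
public class RecognitionService extends IntentService {

    public RecognitionService() { super("RecognitionService"); }

    @Override
    protected void onHandleIntent(Intent intent) {
        // Runs on the IntentService worker thread and loops until interrupted.
        new AudioCapture().captureOneSecondChunks(new AudioCapture.ChunkListener() {
            @Override public void onSecond(short[] pcm) {
                // front-end processing and classification would go here (see Section III)
                showResult("speech");                      // placeholder class label
            }
        });
    }

    private void showResult(String sceneLabel) {
        Notification n = new Notification.Builder(this)
                .setSmallIcon(android.R.drawable.ic_btn_speak_now)   // placeholder icon
                .setContentTitle("AmbientSense")
                .setContentText("Detected: " + sceneLabel)
                .build();                                  // API 16+; use getNotification() on older APIs
        NotificationManager nm =
                (NotificationManager) getSystemService(Context.NOTIFICATION_SERVICE);
        nm.notify(1, n);                                   // reuse one notification id
    }
}
```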

Server mode implementation details: In server mode, the phone sends the computed feature vector to the server every second for classification. This is implemented on the phone side with an additional IntentService handling the HTTP requests as well as the notifications on the screen. Therefore, communication with the server runs asynchronously, enabling non-blocking capturing of audio data and front-end processing. The feature vector is sent Base64-encoded (using the Android standard libraries) in a JSON object as a JSONArray (see http://www.json.org/java/index.html). On the server side, we used the Apache HttpCore NIO library (http://hc.apache.org/httpcomponents-core-ga/httpcore-nio/index.html), which provides HTTP client (as used by Android itself) and server functionality. The server listens for requests on a specified port, parses the data, computes the recognition, and returns the recognized value.
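On the phone side, the per-second upload could be sketched as follows. The server URL, the JSON field name, and the use of HttpURLConnection are assumptions; the paper only states that the vector is sent Base64-encoded inside a JSON object via an HTTP request.

```java
import android.util.Base64;
import org.json.JSONObject;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.ByteBuffer;

// Sends one feature vector to the classification server and returns the class label.
final class FeatureUploader {
    private final String serverUrl;            // hypothetical, e.g. "http://server:8080/classify"

    FeatureUploader(String serverUrl) { this.serverUrl = serverUrl; }

    String classifyRemotely(double[] featureVector) throws Exception {
        // Serialize the 24 doubles and Base64-encode them into a small JSON payload
        ByteBuffer buf = ByteBuffer.allocate(8 * featureVector.length);
        for (double v : featureVector) buf.putDouble(v);
        String payload = new JSONObject()
                .put("features", Base64.encodeToString(buf.array(), Base64.NO_WRAP))
                .toString();

        HttpURLConnection conn = (HttpURLConnection) new URL(serverUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        OutputStream out = conn.getOutputStream();
        out.write(payload.getBytes("UTF-8"));
        out.close();

        // The server answers with the recognized class (only a few bytes)
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String result = in.readLine();
        in.close();
        conn.disconnect();
        return result;
    }
}
```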

    V. Evaluation

We evaluated the autonomous and server modes and compared them concerning recognition accuracy, runtime, CPU usage, and recognition time. For all tests we used two Android smartphone devices, the Samsung Galaxy SII and the Google Nexus One; Table II shows the specifications of both phones.



In all tests a system with the 23 pre-trained classes was used. The collected daily life data set was used for system training (see Section III). For the autonomous mode, Wi-Fi and 3G were deactivated, whereas for the server mode Wi-Fi was activated and 3G was deactivated on the smartphone. The server part was installed on an Intel Athlon 64 PC with 4 GB RAM and a 100 Mbit Ethernet connection, running a Windows 7 64-bit version.

            Samsung Galaxy SII                 Google Nexus One
CPU         1.2 GHz dual-core ARM Cortex-A9    1 GHz ARM Cortex-A8
RAM         1024 MB                            512 MB
Battery     1650 mAh Lithium-ion               1400 mAh Lithium-ion
Wi-Fi       IEEE 802.11n, 300 Mbit/s           IEEE 802.11g, 54 Mbit/s

TABLE II: Specifications of the two testing devices: Samsung Galaxy SII and Google Nexus One.

    A. Recognition Accuracy

The recognition accuracy was calculated by a six-fold leave-one-audio-sample-out cross-validation. The SVM parameters C and γ were optimized with a parameter sweep. For C, a range of values from 2^-5 to 2^15 was evaluated, while γ was swept from 2^-15 to 2^5. This resulted in a recognition accuracy of 58.45% with the parameter set C = 2^7 = 128 and γ = 2^-1 = 0.5, which meets the results of previous work with a similar number of ambient sound classes [2]. Figure 3 displays the confusion matrix of the 23 classes. Twelve classes showed an accuracy higher than 80%, whereas three classes showed accuracies below 20%. Since the recognition algorithm is identical for both running modes, the recognition accuracy holds for both modes.

    Fig. 3: Confusion matrix of the 23 auditory sound classes.
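For illustration, such a sweep could be run with LibSVM's built-in cross-validation roughly as shown below. This is a sketch: it reuses the rbfParams() configuration from Section III, assumes a power-of-two grid step, and uses LibSVM's random fold assignment rather than the paper's leave-one-audio-sample-out folds.

```java
import libsvm.*;

final class ParameterSweep {

    // Sweeps C in 2^-5 .. 2^15 and gamma in 2^-15 .. 2^5 and keeps the best pair.
    static svm_parameter sweep(svm_problem prob, int folds) {
        svm_parameter best = null;
        double bestAccuracy = -1;
        for (int cExp = -5; cExp <= 15; cExp++) {
            for (int gExp = -15; gExp <= 5; gExp++) {
                svm_parameter param = ClassifierConfig.rbfParams();
                param.C = Math.pow(2, cExp);
                param.gamma = Math.pow(2, gExp);

                double[] predicted = new double[prob.l];
                svm.svm_cross_validation(prob, param, folds, predicted);

                int correct = 0;
                for (int k = 0; k < prob.l; k++) if (predicted[k] == prob.y[k]) correct++;
                double accuracy = (double) correct / prob.l;
                if (accuracy > bestAccuracy) { bestAccuracy = accuracy; best = param; }
            }
        }
        return best;               // the paper reports C = 2^7 and gamma = 2^-1 as the optimum
    }
}
```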

    B. Runtime

Runtime was tested by measuring the application running time for five repetitions, each starting from a fully charged phone. During all tests the application was continuously running and generating a recognition result every second; the display was turned off and no other tasks or applications were running. The time was measured until the phone switched off due to low battery. Figure 4 shows the average measured runtime of both running modes for both testing devices. The Galaxy SII showed an average runtime of 13.75 h in autonomous mode and 11.63 h in server mode, whereas the Nexus One showed an average runtime of 11.93 h in autonomous mode and 12.87 h in server mode.

[Figure 4: bar chart of runtime in hours for the Galaxy SII and Nexus One in autonomous and server mode.]

Fig. 4: Average runtime of the testing devices in autonomous and server mode.

    C. CPU Usage

The CPU usage of the two modes was measured with the Android SDK tools. Figure 5 shows that the Galaxy SII used less CPU power in server mode. Since in server mode the SVM recognition is not calculated on the phone, CPU power consumption is lower than in autonomous mode. On the other hand, the Nexus One used more CPU in server mode. The reason for this is the additional Wi-Fi transmission task, which increased the CPU usage of the Nexus One; the Galaxy SII has a dedicated chipset for the Wi-Fi communication processing. Furthermore, the Galaxy SII showed a higher fluctuation in processor load than the Nexus One. This is due to the fact that the Galaxy SII reduces the clock frequency of its CPU when the load is low (see http://www.arm.com/products/processors/cortex-a/cortex-a9.php).

For CPU profiling of the app we used the profiling tool included in the debugger of the Android SDK. We logged the trace files for both running modes to compare the different CPU time allocations. Figure 6 shows the profiling of the autonomous and the server mode on both testing devices for the execution steps Framing, FFT, Cepstrum, SVM, HTTP, and Rest.
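Such a trace can be captured programmatically with the Android SDK's Debug class, e.g. as in the sketch below; the trace name and the helper classes from the earlier sketches are assumptions, and wrapping exactly one recognition makes 100% of the trace correspond to one pass through the processing chain.

```java
import android.os.Debug;

final class Profiling {
    // Records a method trace around a single recognition; the resulting trace file
    // can then be inspected with the Android SDK profiling tools.
    static double traceOneRecognition(FrontEnd frontEnd, SceneModels sceneModels, short[] pcm) {
        Debug.startMethodTracing("ambientsense_one_recognition");
        try {
            double[] features = frontEnd.processOneSecond(pcm);  // Framing, FFT, Cepstrum, Rest
            return sceneModels.classify(features);               // SVM (autonomous mode only)
        } finally {
            Debug.stopMethodTracing();
        }
    }
}
```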



[Figure 5: bar chart of CPU usage in % for the Galaxy SII and Nexus One in autonomous and server mode.]

Fig. 5: CPU usage of the testing devices for autonomous and server mode.

FFT and Cepstrum represent the feature extraction component, whereas Rest includes operations for the merging component and the audio recorder (see Figure 1). The execution steps are ordered in the way they occur in the processing chain. The CPU time of one single step is measured as a percentage of the time it takes to complete the processing chain. The results show that server mode needed less time than autonomous mode, also for the Google Nexus One (note that, in contrast to Figure 5, the CPU profiling includes only the CPU usage of the processing chain). The FFT used to derive the MFCC features took up almost 50% of the CPU time and the calculation of the cepstrum about 35%. The SVM classification used about 14% of the CPU time in autonomous mode and none in server mode, as in server mode the recognition is not done on the phone. Similarly, no CPU time is used for the HTTP client in autonomous mode, because the features do not have to be sent to a server. CPU load could be decreased by ∼80% by moving the front-end processing to the server as well. However, in this case the raw audio data would have to be sent to the server, increasing the data rate from ∼3 kbit/s to 256 kbit/s.
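As a quick check, the quoted data rates follow directly from the audio format and the per-second feature vector size (∼370 bytes, see Section V-D):

```latex
\begin{align*}
\text{raw audio: }\; & 16\,000\ \tfrac{\text{samples}}{\text{s}} \times 16\ \tfrac{\text{bit}}{\text{sample}} = 256\ \text{kbit/s} \\
\text{features: }\;  & \approx 370\ \tfrac{\text{B}}{\text{s}} \times 8\ \tfrac{\text{bit}}{\text{B}} \approx 3\ \text{kbit/s}
\end{align*}
```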

    D. Recognition Time

We define the recognition time as the time the system needs to calculate one recognition (see Figure 7). In autonomous mode this includes just the execution time of the SVM recognition. In server mode the recognition time includes the time for sending the feature vector (∼370 bytes) to the server, the SVM recognition on the server, and sending the result (∼5 bytes) back to the smartphone. The evaluation was done with both devices in autonomous mode, and in server mode over Wi-Fi as well as over the 3G network. The latency over the Wi-Fi connection was measured from a LAN outside the network the classification server is in. The phones were connected to a D-Link DI-524 Wi-Fi router following the 802.11g / 2.4 GHz standard.

[Figure 6: stacked bars of CPU time [%] per execution step (Framing, FFT, Cepstrum, SVM, HTTP, Rest) for the Samsung Galaxy SII and the Google Nexus One in autonomous and server mode.]

Fig. 6: Comparison of the CPU profile between autonomous and server mode of the two testing devices. 100% CPU time corresponds to the data processing for one recognition.

[Figure 7: in autonomous mode the recognition time covers only the classification following the front-end processing; in server mode it covers sending the feature vector as a JSON object (∼370 bytes) over 3G or Wi-Fi, the classification on the server, and returning the result as a JSON object (∼5 bytes), until the result is shown on screen.]

Fig. 7: Definition of recognition time for autonomous and server mode.

For each mode we ran the experiment for 10 min (600 recognitions). In Figure 8, the latency in the 3G network shows a higher mean and standard deviation for both phones. However, a 3G latency of approximately 260 ms does not limit the usability of the application, as this is still within the one-second interval in which the request packets are sent.

    VI. Conclusion

AmbientSense is a real-time ambient sound recognition system for smartphones. The system can run autonomously on an Android smartphone or work in combination with a server to recognize auditory scenes from captured sound. For 23 ambient sound classes, the recognition accuracy of the system was 58.45%, which meets the results of previous efforts in auditory scene analysis that considered a similar number of ambient sound classes, e.g. [2]. Overall, our runtime and energy consumption analysis showed similar results for autonomous and server modes. For the Galaxy SII, the runtime in server mode was ∼2 h lower than in autonomous mode, which could be explained by the network communication usage.


[Figure 8: mean recognition times: autonomous mode 13 ms (Galaxy SII) and 21 ms (Nexus One); Wi-Fi server mode 142 ms and 175.5 ms; 3G server mode 259 ms and 264 ms.]

Fig. 8: Recognition time of the testing devices in autonomous mode, Wi-Fi server mode, and 3G server mode. For each mode, standard deviation, mean, minimum and maximum values are shown.


Our further analysis revealed that ∼80% of the total processing time was spent on feature computation (Framing, FFT and Cepstrum), where the server mode cannot gain an advantage over the autonomous mode. In contrast, only ∼14% of the CPU time is required for computing classification results. Combined with the communication overhead, we concluded that the server mode is not beneficial for auditory scene recognition as in our approach. However, a server mode implementation could be beneficial if more complex classification models are chosen (e.g. modeling MFCC distributions with a Gaussian mixture model). The server mode may also have an advantage if crowd-sourced data is added dynamically and learning is performed online. However, even in this setting the revised models could be updated to be available in autonomous mode.

    References

[1] J. Chen, A. H. Kam, J. Zhang, N. Liu, and L. Shue, "Bathroom activity monitoring based on sound," in Third International Conference on Pervasive Computing, 2005, pp. 47–61.

[2] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14(1), pp. 321–329, 2006.

[3] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, "Acoustic event detection in real-life recordings," in 18th European Signal Processing Conference, 2010, pp. 1267–1271.

[4] M. Rossi, O. Amft, and G. Tröster, "Collaborative personal speaker identification: A generalized approach," Elsevier Pervasive and Mobile Computing (PMC), vol. 8(3), 2012.

[5] T. Franke, P. Lukowicz, K. Kunze, and D. Bannach, "Can a mobile phone in a pocket reliably recognize ambient sounds?" in International Symposium on Wearable Computers (ISWC '09), 2009, pp. 161–162.

[6] M. Stäger, P. Lukowicz, and G. Tröster, "Implementation and evaluation of a low-power sound-based user activity recognition system," in Eighth International Symposium on Wearable Computers (ISWC '04), 2004, pp. 138–141.

[7] E. Miluzzo, N. D. Lane, K. Fodor, R. Peterson, H. Lu, M. Musolesi, S. B. Eisenman, X. Zheng, and A. Campbell, "Sensing meets mobile social networks: the design, implementation and evaluation of the CenceMe application," in Proceedings of the 6th ACM Conference on Embedded Network Sensor Systems, 2008, pp. 337–350.

[8] N. Sawhney and C. Schmandt, "Nomadic radio: Speech and audio interaction for contextual messaging in nomadic environments," ACM Transactions on Computer-Human Interaction, vol. 7(3), pp. 353–383, 2000.

[9] H. Lu, A. J. B. Brush, B. Priyantha, A. K. Karlson, and J. Liu, "SpeakerSense: Energy efficient unobtrusive speaker identification on mobile phones," in The Ninth International Conference on Pervasive Computing, 2011, pp. 188–205.

[10] H. Lu, W. Pan, N. D. Lane, T. Choudhury, and A. T. Campbell, "SoundSense: Scalable sound sensing for people-centric applications on mobile phones," in Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services (MobiSys '09), 2009, pp. 165–178.

[11] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon, "Large-scale content-based audio retrieval from text queries," in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008, pp. 105–112.

[12] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2(3), 2011.

[13] FUNF open sensing framework, http://funf.media.mit.edu/.
