
Cardea: Context-Aware Visual Privacy Protection from Pervasive Cameras

Jiayu Shu*, Rui Zheng†, and Pan Hui‡
HKUST-DT System and Media Laboratory
Hong Kong University of Science and Technology, Hong Kong
Email: *[email protected], †[email protected], ‡[email protected]

Abstract—The growing popularity of mobile and wearable devices with built-in cameras, the bright prospect of camera related applications such as augmented reality and life-logging systems, the increased ease of taking and sharing photos, and advances in computer vision techniques have greatly facilitated people's lives in many aspects, but have also inevitably raised people's concerns about visual privacy. Motivated by recent user studies showing that people's privacy concerns are dependent on the context, in this paper we propose Cardea, a context-aware and interactive visual privacy protection framework that enforces privacy protection according to people's privacy preferences. The framework provides people with fine-grained visual privacy protection using: i) personal privacy profiles, with which people can define their context-dependent privacy preferences; ii) visual indicators, namely face features, with which devices can automatically locate individuals who request privacy protection; and iii) hand gestures, with which people can flexibly interact with cameras to temporarily change their privacy preferences. We design and implement the framework, consisting of a client app on Android devices and a cloud server. Our evaluation results confirm that the framework is practical and effective, with 86% overall accuracy, showing a promising future for context-aware visual privacy protection from pervasive cameras.

I. INTRODUCTION

Nowadays, built-in cameras have been serving as indispensable components of mobile and wearable devices. Cameras with smaller size and higher resolution support a number of services and applications, such as taking photos, mobile augmented reality, and life-logging systems, on devices like smartphones, Microsoft HoloLens [1], Google Glass [2], and Narrative Clip [3]. The trend of embedding cameras in wearables will keep growing; an example is the smart contact lens¹.

However, the ubiquitous presence of cameras, the ease of taking photos and recording video, and the "always on" and "non-overt act" features of these devices threaten individuals' rights to have private or anonymous social lives, raising people's concerns about visual privacy. More specifically, photos and videos captured without permission from bystanders, and then uploaded to social networking sites, can be accessed by everyone online, potentially leading to invasions of privacy. Malicious applications on the device may also leak captured media data².

¹http://money.cnn.com/2016/05/12/technology/eyeball-camera-contact-sony/

²http://www.infosecurity-magazine.com/news/popular-android-camera-app-leaks/

What makes it worse is that recognition technologies can link images to specific people, places, and things, thus revealing far more information than expected and making searchable what was not previously considered searchable [4], [5]. All these possible consequences, whether people have realized them or not, may hinder acceptance of advanced wearable consumer products. A representative example is Google Glass, which has been questioned by the US Congressional Bi-Partisan Privacy Caucus and Data Protection Commissioners around the world concerning privacy risks to the public [6], [7]. They have huge concerns regarding the privacy of non-users/bystanders, and have raised questions such as "How does Google plan to prevent Google Glass from unintentionally collecting data about non-users without consent?" and "Are product lifecycle guidelines and frameworks, such as Privacy by Design, being implemented in connection with its design and commercialization?" From these legal concerns, we are confident that in the future, wearable devices with cameras will be expected to implement Privacy by Design before being released to global markets. Therefore, we base our research on this assumption and aim to develop technology that can enable such a requirement.

In reality, both legal and technical measures have been proposed to address privacy issues raised by unauthorized or unnoticed visual information collection. For instance, Google Glass is banned at places such as banks, hospitals, and bars³. However, prohibiting camera usage does not resolve the issue fundamentally; instead, it sacrifices people's right to capture happy moments even when there is no bystander in the background. As a result, there are growing needs for technical solutions that protect individuals' visual privacy in a world where cameras are becoming pervasive. Some recent attempts use visual markers such as QR codes [8], [9] or colorful hints like hats [10] for individuals to actively express their unwillingness to be captured. However, these visual markers suffer from similar limitations. First, people are unlikely to wear a QR code, despite the technical feasibility of these approaches. Moreover, privacy concerns vary widely among individuals, and people's privacy preferences change from time to time, following patterns that cannot be conveyed by static visual markers. In fact, what individuals are doing, with whom, and where they are, are all factors that determine whether people think their privacy should be protected.

³https://www.searchenginejournal.com/top-10-places-that-have-banned-google-glass/66585/


Therefore, we are looking for a natural, flexible, and fine-grained mechanism for people to express, modify, and control their individualized privacy preferences.

In this paper, we propose a visual privacy protection framework for individuals using: i) personalized privacy profiles, with which people can define their context-dependent privacy preferences using a set of privacy-related factors, including location, scene, and others' presence; ii) face features, for devices to locate individuals who request privacy control; and iii) hand gestures, which help people interact with cameras to temporarily change their privacy preferences. Using this framework, the device will automatically compute context factors, compare them with people's privacy profiles, and finally enforce privacy protection conforming to people's privacy preferences.

The rest of the paper is organized as follows: in Section II we introduce the motivation and challenges of practical visual privacy protection; in Section III we provide a high-level overview of our framework; in Section IV we describe the detailed design and implementation of the system; in Section V we present the evaluation results; in Section VI we discuss related work; and finally in Section VII we conclude the paper and discuss plans for future work.

II. PRACTICAL VISUAL PRIVACY PROTECTION

The goal of our work is to propose an in situ privacy protection approach conforming to the principle of Privacy by Design: devices with recording capability can protect bystanders' visual privacy automatically and immediately, according to their privacy preferences. Our work is motivated by findings from recent user studies, and encounters several challenges that shape the final design.

A. Context-dependent Personal Privacy Preferences

Recently, several studies have tried to understand people's attitudes towards visual privacy concerns raised by pervasive cameras and the rapid development of wearable technologies. Hoyle et al. [11], [12] find that life-loggers are concerned about the privacy of bystanders, and that a combination of factors, including time, location, and objects in the photo, determines the sensitivity of a photo. A user study conducted by Denning et al. [13] also shows that participants' acceptance of being recorded by augmented reality glasses is dependent on a number of elements, including the place and what they are doing when the recording is taken. Based on the above findings, we draw the following conclusions that motivate our work:

• People's privacy concerns are dependent on context. Although location is an important factor in privacy concerns, what individuals are doing and with whom are more essential and crucial factors that directly relate to privacy.

• People's privacy preferences vary from person to person; thus individuals should be able to express their own personal privacy preferences.

• People's privacy preferences may change from time to time; therefore individuals need a way to change such preferences easily.

• Generally, the public holds positive attitudes towards enforcing privacy protection on images to respect others' privacy preferences.

B. Challenges, Principles, and Limitations

Given people's context-dependent privacy concerns and the conclusions above, a practical visual privacy protection framework faces the following challenges.

The first challenge is what elements should be taken into consideration regarding context. A practical protection approach should seek a balance between granularity and representativeness of context, while taking computational complexity into consideration. In our current design, we choose location, scene, and the presence of other people as the elements that define context. Location gives an approximate area scope that requests privacy protection. Scene describes places and usually indicates what individuals are doing. The presence of other people can be a concern when individuals want to keep their social relationships private.

The second challenge is how to inform cameras of bystanders' privacy preferences. Compared with external markers, individuals' faces are more natural and effective visual cues. To this end, people need to provide their face features for individual recognition, and then set their personal privacy profiles by selecting the context elements they are concerned with.

The third challenge is how people can easily modify their privacy preferences. Moreover, bystanders should be able to react to cameras immediately when they find themselves being captured, without the effort of updating profiles. Therefore, we offer hand gestures for individuals to "speak out" their privacy preferences at the capturing moment. Once a certain gesture is detected, the image will be processed to retain or remove the individual's identifiable information accordingly.

The technical focus of our approach is on proving the feasibility of a context-aware and interactive visual privacy protection framework through system design, implementation, and evaluation. However, like other technical solutions [14], [10], [15], [9], [16], our approach only works with compliant devices and users: any non-compliant device can still capture images without respecting people's privacy preferences. Also, we assume the cloud server that stores user profiles is trusted. Though our approach has these limitations, we hope our work can motivate more complete visual privacy protection approaches and finally be integrated into default camera subsystems.

III. CARDEA OVERVIEW

In this section, we first introduce some key concepts. Then we discuss the major components of Cardea.

A. Key Concepts

Bystander, Recorder, User, and Cloud. A bystander is the person who may be captured by pervasive cameras. A recorder is the person who holds a device with a built-in camera. A bystander who worries about his visual privacy can use Cardea to express his privacy preferences. A recorder who respects bystanders' privacy preferences uses the Cardea framework to capture images. Both the bystander and the recorder become users of the privacy protection framework. The cloud listens to recorders' requests and makes sure captured images are compliant with registered bystanders' privacy preferences.

Fig. 1. Cardea overview. The client application allows bystanders to register and set privacy profiles. Users can use the platform to take pictures while respecting others' privacy preferences. The cloud server recognizes users, retrieves privacy profiles, and performs necessary computations.

Privacy Profile. A privacy profile contains context elements, whether hand gestures are enabled or not, and the protection action.

a) Context Elements: Context elements are factors that reflect people's visual privacy concerns. Currently, we consider location, scene, and people in the image. More elements can be included in the profile to describe people's context-dependent privacy preferences.

b) Gesture Interaction: We define "Yes" and "No" gestures. A "Yes" gesture is a victory sign, which means "I would like to appear in the photo". A "No" gesture is a palm, which represents "I do not want to be photographed". These gestures are consistent with people's usual expectations and are commonly used in daily life to express willingness or unwillingness to be captured by others.

c) Privacy Protection Action: This action refers to the removal of identifiable information. In our implementation, we blur the user's face as an example protection action. Other methods, such as replacing the face with an average face, blurring the whole body, or blending the body region into the background, can be integrated into our framework.

B. System Overview

Cardea is composed of the client app and the cloud server. It works based on data exchange and collaborative computing between the client and cloud sides. The major components and interactions are shown in Figure 1.

Cardea Bystander application. Bystanders use the Cardea application to register as users and define their privacy profiles. A bystander is presented with the user interface of the Cardea client application for registration. The application will extract face features from the bystander automatically and then upload them to the cloud to train the face recognition model. After registration, a user can define his context-dependent privacy profile. The profile will also be sent to the cloud and stored in the cloud database for future retrieval. Both features and profiles can be updated.

Fig. 2. Workflow of Cardea. The privacy protection decision is made according to computing results from the local recording device and the remote cloud. The final privacy protection action is performed on the original image.

Cardea Recorder application. Recorders use the Cardea application to take images. After an image is captured, context elements are computed locally on the device. Meanwhile, the application detects faces and extracts face features. As hand gesture recognition is extremely computationally intensive, this task is offloaded to the cloud. To lower the risk of privacy leakage during transmission, detected faces are removed from the image before the compressed image data and face features are sent to the cloud.

Upon receiving synthesized results from the cloud, the final privacy protection action is performed on the image according to the computing results from both the local device and the remote cloud. Details of how privacy protection decisions are made are discussed in Section IV.

Cardea Cloud Server. The function of the cloud server is twofold: a) storing users' profiles and training the user recognition model; and b) responding to clients' requests by recognizing users in the image and performing corresponding tasks.

When the cloud server receives a request from the client application, it first recognizes users in the image using the face recognition model. If any user is recognized, the corresponding privacy profile is retrieved to trigger related tasks. For example, if the "Yes" or "No" gesture is enabled, the gesture recognition task starts. Finally, the computing results on the cloud are synthesized and sent back to the client.

IV. DESIGN AND IMPLEMENTATION

The client application and cloud server are implemented on Android smartphones and a desktop computer, respectively. Next, we discuss the whole decision-making procedure, shown in Figure 2, which involves both the client application and the cloud server.

After an image is captured using Cardea, we first detect faces and extract face features. If there is any face in the image, we then input the extracted face features into the pre-trained face recognition model to recognize users. If any Cardea user is recognized, his privacy profile is retrieved; otherwise, we do nothing to the original image.

The profile describes whether the user enables hand gestures and the location where he is concerned about his visual privacy. The profile may also contain information about selected scenes and other people that the user especially cares about. Among these factors, "Yes" and "No" gestures have the highest priority; that is, if any gesture is detected and enabled in the user's profile, the privacy protection action is determined instantly. In this way, users can temporarily modify their privacy preferences in cases where they are aware of being photographed, without updating their privacy profiles.

If gestures are not enabled or not detected for a user, and the image is captured within the area/location the user specifies, we start the tasks of recognizing the scene and finding whether anyone the user cares about appears in the image, according to his privacy profile. When all tasks are finished, the final privacy protection decision is made and performed on the original image automatically to protect the user's visual privacy.
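Putting these rules together, the per-user decision flow can be sketched as follows. This is a minimal illustration with hypothetical names and dictionary-based inputs; the actual workflow spans the client and cloud as described above:

```python
def decide_protection(profile, detection, context):
    """Sketch of Cardea's per-user decision flow (all names illustrative).

    profile:   {"gestures_enabled": bool, "scenes": set, "avoid": set}
    detection: {"gesture": "YES" | "NO" | None}
    context:   {"in_user_area": bool, "scene_group": str, "people": set}
    """
    # 1. A recognized gesture overrides the stored profile for this capture.
    if profile["gestures_enabled"] and detection["gesture"] == "NO":
        return "BLUR"   # remove identifiable information
    if profile["gestures_enabled"] and detection["gesture"] == "YES":
        return "KEEP"   # user consents to this photo

    # 2. The profile only applies inside the user's specified area.
    if not context["in_user_area"]:
        return "KEEP"

    # 3. A sensitive scene or the presence of specified people triggers protection.
    if context["scene_group"] in profile["scenes"]:
        return "BLUR"
    if profile["avoid"] & context["people"]:
        return "BLUR"
    return "KEEP"
```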

A. User Recognition

User recognition identifies users who request visual privacy protection in the image.

Face Detection. Face detection detects face regions in the whole image; it is the first step of all face image processing. We use the Adaboost cascade classifier [17] implemented in the OpenCV [18] library to detect face regions, which achieves real-time performance. We then filter detected face regions using Dlib [19] to reduce false positives.

Face Feature Extraction. Face feature extraction finds a compact yet discriminative representation that can best describe a face, instead of using raw pixels. Convolutional Neural Networks (CNNs) have achieved state-of-the-art results on many computer vision tasks, including face recognition, which has been studied for years using different image processing techniques [20], [21]. Compared with conventional features such as Local Binary Patterns [22], features extracted from CNN models yield much better performance. Therefore, we use 256-dimensional CNN features extracted from a lightened CNN model [23], which has a small size, fast feature extraction, and a low-dimensional representation. The model is pre-trained on the CASIA-WebFace dataset containing ~0.5M face images from 10,575 subjects [24]. We use the 1 KB output of the "eltwise_fc1" layer of the CNN model as face features. We have also experimented with a deeper CNN model with the VGG architecture [25], which is widely used; however, its computational burden is too heavy compared with the lightened CNN model, with no obvious performance improvement. We use the open source deep learning framework Caffe [26] and its Android library [27] to extract face features using the lightened CNN model on Android smartphones.
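For illustration, extracting such a feature with pycaffe might look like the sketch below. The model file names are placeholders, and the preprocessing follows the lightened CNN convention of 128 × 128 grayscale input (exact details may differ per deployment); the Android client uses the Caffe Android library instead:

```python
import caffe
import cv2
import numpy as np

# Model definition / weights paths are placeholders for the lightened CNN [23].
net = caffe.Net("lightened_cnn.prototxt", "lightened_cnn.caffemodel", caffe.TEST)

def extract_face_feature(face_bgr):
    """Return the 256-d "eltwise_fc1" feature for a cropped face image."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (128, 128)).astype(np.float32) / 255.0
    net.blobs["data"].reshape(1, 1, 128, 128)
    net.blobs["data"].data[0, 0, :, :] = gray
    net.forward()
    return net.blobs["eltwise_fc1"].data[0].copy()
```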

Face Recognition. Face recognition identifies users in the image. We train the face recognition model using a Support Vector Machine (SVM), which can achieve good results without much training data. The model training is a supervised learning process; the input training data are face features from Cardea users. We train the SVM model using LibSVM [28] with a linear kernel, as the number of training samples from each user is smaller than the dimensionality of each input feature.

TABLE I
TEN GENERAL SCENE GROUPS

Group name        | Scenes | Examples
Shopping          | 20     | clothing store, market, supermarket
Travelling        | 9      | airport, bus station, subway platform
Park & street     | 12     | downtown, park, street, alley
Eating & drinking | 18     | bar, bistro, cafeteria, coffee shop, fast-food restaurant, food court
Working & study   | 9      | classroom, conference center, library, office, reading room
Scantily clad     | 12     | beach, swimming pool, water park
Medical care      | 2      | hospital room, nursing home
Religion          | 11     | cathedral, chapel, church, temple
Entertainment     | 5      | amusement park, ballroom, discotheque
All               | 98     |

With the face recognition model, we can get the prediction result for an input face feature vector. The output is p_i, i ∈ {1, 2, ..., n}, where p_i is the probability of being user i. We set a probability threshold Tp, and only when max_i p_i ≥ Tp do we consider the user recognized. This avoids cases in which the input face feature of a non-registered bystander is mistakenly recognized as a registered user. The threshold Tp is chosen based on the experiments described in the next section.
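A minimal sketch of this thresholded prediction, using scikit-learn's SVC (which wraps LibSVM) in place of the paper's LibSVM setup; the function names and the exact threshold value are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def train_recognizer(features, user_ids):
    """Linear-kernel SVM with probability estimates (LibSVM stand-in)."""
    model = SVC(kernel="linear", probability=True)
    model.fit(features, user_ids)  # 256-d CNN features per registered user
    return model

def recognize_user(model, face_feature, tp=0.09):
    """Return a user id, or None for a non-registered bystander.

    tp mirrors the probability threshold Tp (0.08~0.09 in Section V).
    """
    probs = model.predict_proba(face_feature.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return model.classes_[best] if probs[best] >= tp else None
```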

B. Context-aware Computing

Location, scene, and people in the image are the three context elements defined in users' privacy profiles. We acquire these context elements in different ways after the image is captured; together they determine the final privacy protection action performed on the raw image.

Location Filtering. Location provides coarse control for individuals. Users can name a concrete sensitive area in which they may have privacy concerns. By specifying a location such as a campus, a user's privacy control will take effect within that campus area. We obtain the location directly from GPS when the image is taken; a sketch of such an area check follows.
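One simple way to realize this check is a radius test around a named area's center, as sketched below. The haversine formula and the circular-region representation are our assumptions; the paper does not specify how areas are encoded:

```python
import math

def within_area(lat, lon, center_lat, center_lon, radius_m):
    """Return True if (lat, lon) falls inside a circular area.

    Uses the haversine great-circle distance; a circle is just one
    possible representation of a user's specified area.
    """
    r_earth = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat), math.radians(center_lat)
    dphi = math.radians(center_lat - lat)
    dlmb = math.radians(center_lon - lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    dist = 2 * r_earth * math.asin(math.sqrt(a))
    return dist <= radius_m

# e.g. a hypothetical ~1.5 km radius around a campus center:
# within_area(img_lat, img_lon, 22.336, 114.266, 1500)
```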

Scene Classification. Scene context is a complex concept that not only relates to places but also gives clues about what people may be doing. We summarize 9 general scene groups; in this way, people can select a general scene group instead of listing all the places they care about. We fine-tune a pre-trained CNN model to perform scene classification. The detailed procedure is described below.

a) Data Preparation and Preprocessing: The training data for scene classification come from the Places2 dataset [29]. At the time Cardea was built, the Places2 dataset contained 401 categories with more than 8 million training images. Among the 401 scene categories, we choose 98 categories and group them into the 9 general scene groups listed in Table I, based on their contextual similarity. The grouped scenes are either pervasive or common scenes in daily life that may have a number of bystanders, or places where people are more likely to have privacy concerns. This scene subset is composed of 1.9 million training images and 4,900 validation images, 50 validation images per category.

b) Training: The training procedure is a standard fine-tuning process. We first extract features of all training images from the 98 categories. The features are the output of the "fc7" layer, using the pre-trained AlexNet model provided by the Places2 project as a feature extractor [29]. With all features, we then train a Softmax classifier for the 98 scene categories.

The category classifier achieves 55% validation accuracy on the 98 categories. There is no benchmark result on the exact subset we chose, but recent benchmarks give 53~56% validation accuracy on the new Places2 dataset with 365 categories [29]. Both feature extraction and classifier training are implemented using the Caffe library [26], [30].

c) Prediction: To get the probability of each scene group, we simply sum up the probabilities of the categories in the same group. The final prediction is the group with the highest probability among the 9 groups, which can be seen as a hard-coded clustering process.

The prediction accuracy of our scene group model on the validation set reaches 83%. As category prediction probabilities are usually distributed among similar categories belonging to the same group, group prediction results are superior to category prediction results. This is what we want for privacy protection based on a more general scene context description.
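The group-level aggregation amounts to a few lines; a minimal sketch follows (the category-to-group map is abbreviated, and category names are illustrative; group names follow Table I):

```python
# Abbreviated category-to-group map; the real one covers all 98 categories.
GROUP_OF = {"supermarket": "Shopping", "market": "Shopping",
            "airport": "Travelling", "subway_platform": "Travelling",
            "bar": "Eating & drinking", "coffee_shop": "Eating & drinking"}

def predict_group(category_probs):
    """Sum per-category softmax probabilities into scene-group scores.

    category_probs: dict mapping category name -> probability from the
    98-way softmax classifier.
    """
    group_scores = {}
    for cat, p in category_probs.items():
        g = GROUP_OF[cat]
        group_scores[g] = group_scores.get(g, 0.0) + p
    # Final prediction: the group with the highest summed probability.
    return max(group_scores, key=group_scores.get)
```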

Presence of Other People. The third context element we take into account is people in the image. For example, user Alice can upload 10 face features of Bob, with whom she does not want to be photographed. The task is then to determine whether Bob also appears in the image when Alice is captured in it.

A simple similarity matching is adopted, as the number of potential matches (i.e., people in the image other than Alice herself) is small. We apply cosine similarity as the distance metric to obtain a distance value between a pair of feature vectors. Assume we have 10 of Bob's face features and 1 face feature extracted from a person P in the image. We compare P's feature with each of Bob's features; if a distance value does not exceed the distance threshold Td, we regard it as a hit. When the hit ratio over Bob's features reaches the ratio threshold Tr, we conclude that P is Bob, and therefore Alice should be protected. The detailed face matching method is described in Algorithm 1.

Algorithm 1 Face Matching
1: initialize P's feature f0; Bob's features fi, i ∈ {1, ..., N}; distance threshold Td; hit ratio threshold Tr
2: m ← 0                      ▷ number of hits
3: for i = 1 to N do
4:     di ← dis(f0, fi)       ▷ dis(x, y) returns the distance between x and y
5:     if di ≤ Td then
6:         m ← m + 1
7:     end if
8: end for
9: if m/N ≥ Tr then
10:    return true            ▷ P is Bob
11: else
12:    return false           ▷ P and Bob are two persons
13: end if
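Below is a runnable rendering of Algorithm 1, assuming the common convention of cosine distance = 1 − cosine similarity (the paper does not spell out its exact form) and the thresholds reported in Section V:

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity: smaller values mean more similar faces."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def is_same_person(query_feature, stored_features, td=0.5, tr=0.5):
    """Algorithm 1: decide whether P (query) matches Bob (stored features).

    td and tr correspond to the distance threshold Td and hit-ratio
    threshold Tr; ~0.5 works well for both per Section V-A.
    """
    hits = sum(1 for f in stored_features
               if cosine_distance(query_feature, f) <= td)
    return hits / len(stored_features) >= tr
```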

C. Interaction using Hand Gestures

Hand gestures are natural body language. Our goal is to detect and recognize "Yes" and "No" gestures in the image. However, hand gesture recognition in images taken by regular cameras against cluttered backgrounds is a difficult task; conventional skin-color-based hand detectors fail dramatically in complex environments. In our design, we use the state-of-the-art object detection framework, Faster Region-based CNN (Faster R-CNN) [31], to train a hand gesture detector, as described below.

Gesture Recognition Model Training. For the gesture recognition task, we categorize hand gestures into 3 classes: 1) "Yes" gestures; 2) "No" gestures; and 3) "Natural" gestures, i.e., hands in any other pose. The data used to train the gesture recognition model is composed of 13,050 "Natural" gesture instances from 5,628 images in the VGG hand dataset [32], [33], plus 4,712 "Yes" gesture instances and 3,503 "No" gesture instances from 5,900 images crawled from Google and Flickr. Each annotation consists of an axis-aligned bounding rectangle along with its gesture class.

With this comprehensive gesture dataset and the VGG16 pre-trained model provided by the Faster R-CNN library, we fine-tune the "conv3_1" and upper layers together with the region proposal layers and the Fast R-CNN detection layers [34]. The detailed training procedure can be found in [35]. After a gesture is detected, we link it to the nearest face, which requires users to show their hands near their faces when using gestures; a sketch of this association step is shown below.
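A minimal sketch of linking a detected gesture to the nearest face, with bounding boxes as (x, y, w, h) tuples; measuring proximity between box centers is our assumption, since the paper only says "nearest face":

```python
def box_center(box):
    """Center point of an (x, y, w, h) bounding box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def link_gesture_to_face(gesture_box, face_boxes):
    """Return the index of the face whose center is closest to the gesture.

    Works because users are asked to show hands near their faces.
    """
    gx, gy = box_center(gesture_box)

    def dist2(face_box):
        fx, fy = box_center(face_box)
        return (gx - fx) ** 2 + (gy - fy) ** 2

    return min(range(len(face_boxes)), key=lambda i: dist2(face_boxes[i]))
```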

We show the user interface of the Cardea client application in Figure 3(a), together with an example of image processing results in Figure 3(b).

V. EVALUATION

In this section, we present evaluation results along 3 axes: 1) vision micro-benchmarks, evaluating the performance of the different computer vision tasks, including face recognition and matching, scene classification, and hand gesture recognition; 2) overall system performance, measured by final privacy protection decisions against users' privacy preferences; and 3) runtime and energy consumption, showing the processing time of each part and the energy consumed per image.

Fig. 3. Cardea user interface and privacy protection results: (a) Cardea client application user interface; (b) privacy protection example. In (a), jiayu registers as a Cardea user by extracting and uploading her 20 face features. She specifies HKUST and four scene groups for privacy protection; she also enables "Yes" and "No" gestures. In (b), a picture is taken in HKUST; 3 registered users, one "Yes" gesture, and one "No" gesture are recognized, and the scene is correctly predicted as Park & street. jiayu's face is therefore not blurred, due to her "Yes" gesture. Prediction probabilities are also shown in (b).

A. Accuracy of Face Recognition and Matching

We first select 50 subjects from the LFW dataset [36] who each have more than 10 images as registered users. Note that subjects in the CASIA-WebFace database used to train the lightened CNN model do not overlap with those in LFW [37]. For each subject, we extract at least 100 face features from YouTube videos to simulate the process of user registration; in total, we get 5,042 feature vectors. These features are then divided into a training set and a validation set. In addition, we collect a user test set and a non-user test set to evaluate face recognition accuracy. The user test set is composed of 511 face feature vectors from online images of all 50 registered users. The non-user test set consists of 166 face feature vectors from 100 subjects in the LFW database whose names start with "A", serving as non-registered bystanders.

Figure 4(a) shows the recognition accuracy on the validation set and the user test set with different numbers of features per person used for training. Accuracy here refers to the fraction of faces that are correctly recognized. The results show that a model trained with 10~20 features per person can achieve near 100% accuracy on the validation set and over 90% accuracy on the user test set; little improvement is achieved with more training data. Therefore, we train the face recognition model with 20 features per person. Figure 4(b) shows the overall accuracy as the probability threshold Tp varies. For the non-user test set, accuracy means the fraction of faces that are not recognized as registered users. As a result, accuracy increases for the non-user test set but decreases for the validation set and user test set as Tp goes up. To make sure that registered users are correctly recognized and non-registered users are not mistakenly recognized, we choose Tp to be 0.08~0.09, which achieves over 80% recognition accuracy for both users and non-users.

Fig. 4. Face recognition accuracy: (a) training accuracy with different numbers of training features per person; (b) testing accuracy with probability threshold Tp.

Fig. 5. Face matching accuracy with different distance thresholds Td and ratio thresholds Tr: (a) cosine distance; (b) Euclidean distance.

To evaluate face matching performance, we still use the user test set. Subjects who have more than 20 features are regarded as the database group; the rest form the query group. Similar to face recognition accuracy, we break face matching accuracy into two parts: for persons belonging to the database group, we need to correctly match features only to the correct person; for persons not belonging to the database group, we should not match features to anyone. Figure 5 shows the matching accuracy with cosine distance and Euclidean distance, respectively. The results show that cosine distance achieves better performance, with near 100% accuracy in both situations, whether people belong to the database or the query group. The preferable parameters are distance threshold Td ≈ 0.5 and ratio threshold Tr ≈ 0.5.

Overall, the face recognition and matching methods we employ, with appropriate thresholds Tp, Td, and Tr, can effectively and efficiently recognize users and match faces in images. More importantly, only a small number of features are needed for training the face recognition model and for the face matching algorithm.

Fig. 6. Scene classification results: (a) recall (TP vs. FN) per scene group; (b) confusion matrix, reconstructed below (rows: target group; columns: predicted group Sh, Tr, Pa, Ea, Wo, Sc, Me, Re, En).

Shopping          | 118   0   0   1   0   0   0   0   0
Travelling        |   2  88  16   1   0   0   0   0   9
Park & street     |   0   5  83   0   0   1   0   0   0
Eating & drinking |  16   4   0  68  10   0   0   0   0
Working & study   |   7   8   1   7  63   0   0   0   0
Scantily clad     |   1   0   0   0   1  44   0   0   0
Medical care      |   0   3   0   1   0   0  29   0   0
Religion          |   0   0   1   0   0   0   0  24   0
Entertainment     |   2   0   0  10   1   0   0   0  13

B. Performance of Scene Classification

We recruited 8 volunteers and asked them to take pictures "in the wild" belonging to the 9 general scene groups. After manually annotating these pictures, we had 759 images in total, with 638 images falling in the 9 scene groups of interest. We then predict the scene group of each image using the scene classification model we trained.

The number of images and the classification result of each group are shown in Figure 6(a). True positives (TP) are images that are correctly classified, and false negatives (FN) are those classified as other scene groups. Overall, we achieve 0.83 recall (TP/(TP+FN)), and 5 scene groups have recall above 0.90. It is worth mentioning that the recall of the scene groups Scantily clad (Sc), Medical care (Me), and Religion (Re) exceeds 0.95, which provides strong support for protecting users' privacy in sensitive scenes.

We also give the detailed classification confusion matrix in Figure 6(b). It shows that most FN of Eating & drinking (Ea) are classified as Shopping (Sh) or Working & study (Wo), and most FN of Wo are classified as Ea or Sh. The reason is that the boundaries between Sh, Ea, and Wo are not clear; for example, shopping malls have food courts, and people study in coffee shops. The same reason accounts for the confusion between Park & street (Pa) and Travelling (Tr). Moreover, for scene categories such as pub and bar, people may group them into Ea or En. Therefore, a safe strategy is to select more scene groups, for instance both Ea and En, when going to a pub at night.

In general, the evaluation results on images captured "in the wild" demonstrate that most scenes can be correctly classified; the performance is especially satisfactory for sensitive scenes.

TABLE II
CARDEA'S OVERALL PRIVACY PROTECTION PERFORMANCE

Overall accuracy:            86.4%
Protection accuracy:         80.4%
No-protection accuracy:      91.0%
Face recognition accuracy:   98.5%
Scene classification recall: 77.7%
"Yes" gesture recall:        97.9%
"No" gesture recall:         77.3%

C. Performance of Gesture Recognition

We asked our volunteers to take pictures of each other with different distances, angles, lighting conditions, and backgrounds. We manually annotated all images with hand regions and the scene group each belongs to. In total, we got 338 hand gesture images with 208 "Yes" gestures, 211 "No" gestures, and 363 natural hands. The images cover 6 out of the 9 scene groups; images that do not belong to any of our summarized scene groups are categorized into the group Others.

Figure 7(a) and Figure 7(b) show the recall and precision for "Yes" and "No" gestures under different scene groups. The recall and precision of the "Yes" gesture, and the precision of the "No" gesture, reach 1.0 for most scene groups, while the recall of the "No" gesture averages 0.70. For Entertainment (En), the recall is low, resulting from dim illumination in most of the images we tested. In general, gesture recognition performance does not show any marked correlation with scene group. We therefore further investigated recall in terms of gesture size, as low recall threatens users' privacy far more than low precision does.

Figure 7(c) plots the recall of gestures with varying hand region sizes. Each image is resized to about 800 × 600 pixels while keeping its aspect ratio, and hand regions are classified into 10 size intervals, as plotted along the x-axis. The results show that "Yes" gestures achieve more than 0.9 recall for all hand region sizes, while the recall of "No" gestures tends to rise with increasing hand region size. This indicates that "No" gesture recognition can be improved with more training data at smaller sizes.

In summary, the performance of gesture recognition demonstrates the feasibility of integrating gesture interaction into Cardea for flexible privacy preference modification. It performs extremely well for "Yes" gesture recognition, and there is room for improvement of "No" gesture recognition with more training samples.

D. Overall Performance of Cardea Privacy Protection

Fig. 7. Hand gesture recognition results: (a) recall for different scenes; (b) precision for different scenes; (c) recall for different hand region sizes.

After evaluating each vision task separately, we now present Cardea's overall privacy protection performance. Faces in images taken using Cardea end up being protected (e.g., blurred) or remaining unchanged, correctly or incorrectly, depending on protection decisions made based on both the user's privacy profile and the results of the vision tasks. Therefore, we asked 5 volunteers to register as Cardea users and set their privacy profiles. The face recognition model is now trained using 1,100 face feature vectors from 55 people, including the 50 people from the LFW dataset. We took about 300 images and obtained the processed images. As we focus on face recognition rather than face detection, we only keep images in which faces have been successfully detected. In total, we got 224 images for evaluation.

Table II shows the final privacy protection accuracy as well as the performance of each vision task. The protection accuracy shows that 80.4% of faces that require protection are actually protected, and 91.0% of faces that do not ask for protection remain unchanged. Overall, 86.4% of faces are processed correctly, even though scene classification recall and "No" gesture recall do not reach 80%. The reason is that Cardea's protection decision-making process can sometimes compensate for mistakes made in an earlier step. For example, if a user's "No" gesture is not detected, his face can still be protected when the predicted scene is selected in the user's profile.

In summary, Cardea achieves over 85% accuracy for users in the real world. Improvements in each vision component will directly benefit Cardea's overall performance.

E. Runtime and Energy Consumption

We validate the client-side implementation on a Samsung Galaxy Note 4⁴ with a 4 × 2.7 GHz Krait 450 CPU, Qualcomm Snapdragon 805 chipset, 3 GB RAM, and a 16 MP f/2.2 camera. The server side is configured with an Intel i7-5820K CPU, 16 GB RAM, and a GeForce 980 Ti graphics card (6 GB RAM). The client and server communicate via TCP over a Wi-Fi connection.

⁴http://www.gsmarena.com/samsung_galaxy_note_4-6434.php

Figure 8 plots the time taken for Cardea to complete the different vision tasks. We take images with 1 face, 2 faces, and 5 faces at a size of 1920 × 1080. The images are compressed in JPEG format; on average, the data sent is about 950 KB. Note that some vision tasks are not triggered in certain situations, according to the decision workflow explained in Section IV. For example, if no user is recognized in the image, all other tasks are skipped. For the purpose of measurement, we still activate all tasks to illustrate the runtime difference for images captured with varying numbers of faces.

Fig. 8. Task-level runtime of Cardea: face detection & feature extraction and scene classification run on the client; face recognition & matching and gesture recognition run on the server; network time is measured separately.

Among all vision tasks, face processing (i.e., face detection and feature extraction) and scene classification are performed on the smartphone. Face processing takes about 900 milliseconds for 1 face and increases by about 200 milliseconds per additional face, reaching about 1,100 and 1,700 milliseconds for 2 and 5 faces, respectively. This is the most fundamental step that runs locally on the smartphone, due to privacy considerations; owing to the real-time OpenCV face detection implementation and the lightened CNN feature extraction model, it takes less than 1/3 of the overall runtime. Scene classification takes around 900 milliseconds per image, as it is performed only once, independent of the number of people in the image. The face recognition and matching tasks on the server side take less than 30 milliseconds for five people; though this time grows with the number of people, compared with other tasks it barely affects the overall runtime. Gesture recognition also runs on the server and takes about 330 milliseconds; like scene classification, it is performed once on the whole image, regardless of the number of faces. According to our measurements, network transmission accounts for the majority of the overall runtime, due to the unstable network environment.

In general, photographers using Cardea can get one processed image within 5 seconds in the heaviest case (i.e., when a registered user who enables gestures is present and scene classification is triggered on the smartphone). Compared with existing image capture platforms that also provide privacy protection, such as I-pic [16], Cardea offers more efficient image processing functionality.

Next, we measure the energy consumption of taking pictures with Cardea on the Galaxy Note 4 using the Monsoon Power Monitor [38]. The images are also taken at 1920 × 1080 pixels. The first two columns of Table III show the energy consumption for the face processing part only and for the whole process (i.e., from taking a picture to getting the processed image) with 1 face, 2 faces, and 5 faces, respectively. The screen stays on during the whole process; therefore a large portion of the energy consumption is due to the always-on screen. Moreover, we can observe that face processing energy is linear in the number of faces. All the other parts, including scene classification and sending and receiving data, are independent of the number of faces in the image, which is consistent with the runtime measurements.

TABLE III
ENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBERS OF FACES

        | Face recognition (uAh) | Whole process (uAh) | # of images
1 face  | 217.2 (std 3.4)        | 1134.5 (std 45.9)   | ~2800
2 faces | 344.1 (std 13.1)       | 1276.7 (std 113.8)  | ~2500
5 faces | 692.6 (std 36.5)       | 1641.1 (std 66.0)   | ~2000

Using the energy measurements, we also show Cardea's capacity on the Galaxy Note 4 in the last column of Table III. The device has a 3220 mAh battery and can therefore capture about 2000 high-quality images with 5 faces using Cardea.

VI. RELATED WORK

Visual privacy protection has attracted extensive attention in recent years, due to the increasing popularity of mobile and wearable devices with built-in cameras. As a result, several research works have proposed solutions to address these privacy issues.

Halderman et al. [14] propose an approach in which devices in close proximity can encrypt data together during recording, utilizing short-range wireless communication to exchange public keys and negotiate encryption keys. Only by obtaining all of the permissions from the people who encrypted the recording can one decrypt it. Jana et al. [39], [40] present methods in which third-party applications, such as perceptual and augmented reality applications, have access only to higher-level objects, such as a skeleton or a face, instead of raw sensor feeds. Raval et al. [41] propose a system that gives users control to mark secure regions that the camera has access to, preventing cameras from capturing sensitive information. Unlike the above solutions, our work focuses on protecting bystanders' privacy by respecting their privacy preferences when they are captured in photos.

To identify individuals who request privacy protection, advanced techniques have been applied. PriSurv [42] is a video surveillance system that identifies objects using RFID tags to protect the privacy of objects in the video surveillance system. Courteous Glass [15] is a wearable camera integrated with a far-infrared imager that turns off recording when new persons or specific gestures are detected. However, these approaches require extra facilities or sensors that are not currently equipped on devices. Our approach takes advantage of state-of-the-art computer vision techniques that are reliable and effective.

Some other efforts rely on visual markers. Respectful Cameras [10] is a video surveillance system that requires people who wish to remain anonymous to wear colored markers such as hats or vests, so that their faces will be blurred. Bo et al. [8] use QR codes as privacy tags to link individuals with their photo sharing preferences, which empowers photo service providers to exert privacy protection following users' policy expressions and mitigate the public's privacy concerns. Roesner et al. [9] propose a general framework for controlling access to sensor data on continuous sensing platforms, allowing objects to explicitly specify access policies; for example, a person can wear a QR code to communicate her privacy policy, and a depth camera is used to find the person. Unfortunately, using existing markers like QR codes is technically feasible but lacks usability, as few would wear a QR code or other conspicuous markers in public places. Moreover, static policies cannot satisfy people's privacy preferences all the time, as preferences are context-dependent, vary from person to person, and change from time to time. Therefore, we allow individuals to update their privacy profiles at any time, providing other cues to inform cameras of necessary protection.

A recent interesting work, I-pic [16], allows people to broadcast their privacy preferences and appearance information to nearby devices using BLE; this work can be incorporated into our framework. Furthermore, we specify context elements that have not been considered before, such as scene and the presence of others. Besides, we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera, whereas the data broadcast in I-pic may not be received by people who take images, or the received data may be outdated.

VII. CONCLUSION AND FUTURE WORK

In this work, we designed, implemented, and evaluated Cardea, a visual privacy protection system that aims to address individuals' visual privacy issues caused by pervasive cameras. Cardea leverages state-of-the-art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express their context-dependent privacy preferences in terms of location, scenes, and the presence of other people in the image. Besides, people can show "Yes" and "No" gestures to dynamically modify their privacy preferences. We demonstrated the performance of the different vision tasks on pictures taken "in the wild"; overall, Cardea achieves about 86% accuracy on users' requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which proves the feasibility of running Cardea for taking pictures while respecting bystanders' privacy preferences.

Cardea can be enhanced in several aspects. Future work includes improving scene classification and hand gesture recognition performance, compressing CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people's visual privacy in video is another challenging and significant task, which will make Cardea a complete visual privacy protection framework.

REFERENCES

[1] "Microsoft HoloLens," http://www.microsoft.com/microsoft-hololens/en-us.
[2] "Google Glass," http://www.google.com/glass/start/.
[3] "Narrative Clip," http://www.getnarrative.com/.
[4] R. Shaw, "Recognition markets and visual privacy," UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
[5] A. Acquisti, R. Gross, and F. Stutzman, "Face recognition and privacy in the age of augmented reality," Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
[6] "Congress's letter," http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy/.
[7] "Data protection authorities' letter," https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp.
[8] C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, "Privacy.tag: Privacy concern expressed and respected," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163–176.
[9] F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, "World-driven access control for continuous sensing," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169–1181.
[10] J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, "Respectful cameras: Detecting visual markers in real-time to address privacy concerns," in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65–89.
[11] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, "Privacy behaviors of lifeloggers using wearable cameras," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571–582.
[12] R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, "Sensitive lifelogs: A privacy analysis of photos from wearable cameras," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645–1648.
[13] T. Denning, Z. Dehlawi, and T. Kohno, "In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies," in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014, pp. 2377–2386.
[14] J. A. Halderman, B. Waters, and E. W. Felten, "Privacy management for portable recording devices," in Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society. ACM, 2004, pp. 16–24.
[15] J. Jung and M. Philipose, "Courteous glass," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307–1312.
[16] P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, "I-pic: A platform for privacy-compliant image capture," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[18] "OpenCV," http://opencv.org/.
[19] "Dlib," http://dlib.net/.
[20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[21] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, "Labeled faces in the wild: A survey," in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189–248.
[22] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[23] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," arXiv preprint arXiv:1511.02683, 2015.
[24] "CASIA-WebFace dataset," http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[27] sh1r0, "Caffe-android-lib," https://github.com/sh1r0/caffe-android-lib.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places2 dataset project," http://places2.csail.mit.edu/index.html.
[30] BVLC, "Caffe," http://caffe.berkeleyvision.org/.
[31] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[32] "VGG hand dataset," http://www.robots.ox.ac.uk/~vgg/data/hands/.
[33] A. Mittal, A. Zisserman, and P. H. Torr, "Hand detection using multiple proposals," in BMVC. Citeseer, 2011, pp. 1–11.
[34] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[35] "Faster R-CNN (Python implementation)," https://github.com/rbgirshick/py-faster-rcnn.
[36] "LFW dataset," http://vis-www.cs.umass.edu/lfw/.
[37] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.
[38] "Monsoon Power Monitor," https://www.msoon.com/LabEquipment/PowerMonitor/.
[39] S. Jana, A. Narayanan, and V. Shmatikov, "A Scanner Darkly: Protecting user privacy from perceptual applications," in Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 349–363.
[40] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, "Enabling fine-grained permissions for augmented reality applications with recognizers," in Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415–430.
[41] N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, "What you mark is what apps see," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[42] K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, "PriSurv: Privacy protected video surveillance system using adaptive visual abstraction," in International Conference on Multimedia Modeling. Springer, 2008, pp. 144–154.

  • I Introduction
  • II Practical Visual Privacy Protection
    • II-A Contextndashdependent Personal Privacy Preferences
    • II-B Challenges Principles and Limitations
      • III Cardea Overview
        • III-A Key Concepts
        • III-B System Overview
          • IV Design and Implementation
            • IV-A User Recognition
            • IV-B Contextndashaware Computing
            • IV-C Interaction using Hand Gestures
              • V Evaluation
                • V-A Accuracy of Face Recognition and Matching
                • V-B Performance of Scene Classification
                • V-C Performance of Gesture Recognition
                • V-D Overall Performance of Cardea Privacy Protection
                • V-E Runtime and Energy Consumption
                  • VI Related Work
                  • VII Conclusion and Future Work
                  • References
Page 2: Cardea: Context–Aware Visual Privacy Protection from ... · context–aware and interactive visual privacy protection frame-work that enforces privacy protection according to people’s

think their privacy should be protected. Therefore, we are looking for a natural, flexible, and fine–grained mechanism for people to express, modify, and control their individualized privacy preferences.

In this paper, we propose a visual privacy protection framework for individuals using: i) personalized privacy profiles, with which people can define their context–dependent privacy preferences using a set of privacy-related factors including location, scene, and others' presence; ii) face features, for devices to locate individuals who request privacy control; and iii) hand gestures, which help people interact with cameras to temporarily change their privacy preferences. With this framework, the device will automatically compute context factors, compare them with people's privacy profiles, and finally enforce privacy protection conforming to people's privacy preferences.

The rest of the paper is organized as follows: in Section II we introduce the motivation and challenges of practical visual privacy protection; in Section III we provide a high–level overview of our framework; in Section IV we describe the detailed design and implementation of the system; in Section V we present the evaluation results; in Section VI we discuss related work; and finally, in Section VII we conclude the paper and discuss plans for future work.

II. PRACTICAL VISUAL PRIVACY PROTECTION

The goal of our work is to propose an in situ privacy protection approach conforming to the principle of Privacy by Design: devices with recording capability can protect bystanders' visual privacy automatically and immediately, according to their privacy preferences. Our work is motivated by findings from recent user studies, and it encounters several challenges that help shape the final design.

A. Context–dependent Personal Privacy Preferences

Recently, several works have tried to understand people's attitudes towards the visual privacy concerns raised by pervasive cameras and the rapid development of wearable technologies. Hoyle et al. [11], [12] find that life–loggers are concerned about the privacy of bystanders, and that a combination of factors including time, location, and objects in the photo determines the sensitivity of a photo. A user study conducted by Denning et al. [13] also shows that participants' acceptability of being recorded by augmented reality glasses is dependent on a number of elements, including the place and what they are doing when the recording is taken. Based on the above findings, we draw the following conclusions that motivate our work:

• People's privacy concerns are dependent on context. Although location is an important factor in privacy concerns, what individuals are doing and with whom are more essential and crucial factors that directly relate to privacy.

• People's privacy preferences vary from person to person; thus, individuals should be able to express their own personal privacy preferences.

• People's privacy preferences may change from time to time; therefore, individuals need a way to change such preferences easily.

• Generally, the public holds positive attitudes towards enforcing privacy protection on images to respect others' privacy preferences.

B. Challenges, Principles, and Limitations

Given people's context–dependent privacy concerns and the conclusions above, a practical visual privacy protection framework faces the following challenges.

The first challenge is which elements should be taken into consideration regarding context. A practical protection approach should seek a balance between the granularity and representativeness of context, while taking computational complexity into consideration. In our current design, we choose location, scene, and the presence of other people as the elements that define context. Location gives an approximate area within which privacy protection is requested. Scene describes places and usually indicates what individuals are doing. The presence of other people can be a concern when individuals want to keep their social relationships private.

The second challenge is how to inform cameras of bystanders' privacy preferences. Compared with external markers, individuals' faces are more natural and effective visual cues. To this end, people need to provide their face features for individual recognition, and then set their personal privacy profiles by selecting the context elements they are concerned with.

The third challenge is how people can easily modify their privacy preferences. Moreover, bystanders should be able to react to cameras immediately when they find themselves being captured, without the effort of updating profiles. Therefore, we offer hand gestures for individuals to "speak out" their privacy preferences at the capturing moment. Once a certain gesture is detected, the image will be processed to retain or remove the individual's identifiable information accordingly.

The technical focus of our approach is on proving the feasibility of a context–aware and interactive visual privacy protection framework through system design, implementation, and evaluation. However, like other technical solutions [14], [10], [15], [9], [16], our approach only works with compliant devices and users: any non–compliant device can still capture images without respecting people's privacy preferences. Also, we assume the cloud server that stores user profiles is trusted. Though our approach has these limitations, we hope our work can motivate more complete visual privacy protection approaches and eventually be integrated into default camera subsystems.

III. CARDEA OVERVIEW

In this section, we first introduce some key concepts. We then discuss the major components of Cardea.

A. Key Concepts

Bystander, Recorder, User, and Cloud. A bystander is a person who may be captured by pervasive cameras.

Fig. 1. Cardea overview. The client application allows bystanders to register and set privacy profiles. Users can use the platform to take pictures while respecting others' privacy preferences. The cloud server recognizes users, retrieves privacy profiles, and performs necessary computations.

A recorder is a person who holds a device with a built–in camera. A bystander who worries about his visual privacy can use Cardea to express his privacy preferences; a recorder who respects bystanders' privacy preferences uses the Cardea framework to capture images. Both bystanders and recorders become users of the privacy protection framework. The cloud listens to recorders' requests and makes sure captured images comply with registered bystanders' privacy preferences.

Privacy Profile. A privacy profile contains context elements, whether hand gestures are enabled, and the protection action.

a) Context Elements: Context elements are factors that reflect people's visual privacy concerns. Currently, we consider location, scene, and people in the image. More elements can be included in the profile to describe people's context–dependent privacy preferences.

b) Gesture Interaction: We define "Yes" and "No" gestures. A "Yes" gesture is a victory sign, which means "I would like to appear in the photo". A "No" gesture is a palm, which represents "I do not want to be photographed". These gestures are consistent with people's usual expectations and are commonly used in daily life to express willingness or unwillingness to be captured by others.

c) Privacy Protection Action: This action refers to the removal of identifiable information. In our implementation, we blur the user's face as an example protection action. Other methods, such as replacing the face with an average face, blurring the whole body, or blending the body region into the background, can be integrated into our framework.
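To make the profile structure concrete, the sketch below shows one plausible encoding of such a privacy profile. All field names are illustrative assumptions for exposition, not Cardea's actual data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrivacyProfile:
    """Illustrative encoding of a Cardea privacy profile (field names assumed)."""
    user_id: str
    # Context elements: where, in which scenes, and with whom protection applies.
    locations: List[str] = field(default_factory=list)        # e.g. ["HKUST campus"]
    scene_groups: List[str] = field(default_factory=list)     # e.g. ["Medical care"]
    sensitive_people: List[str] = field(default_factory=list) # ids with uploaded face features
    # Gesture interaction: detected "Yes"/"No" gestures override the profile.
    yes_gesture_enabled: bool = True
    no_gesture_enabled: bool = True
    # Protection action applied when the profile matches (blurring in the prototype).
    action: str = "blur_face"

# Example: protection on campus and in medical or religious scenes.
profile = PrivacyProfile(user_id="jiayu",
                         locations=["HKUST campus"],
                         scene_groups=["Medical care", "Religion"])
```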

B. System Overview

Cardea is composed of a client app and a cloud server. It works through data exchange and collaborative computing between the client and cloud sides. The major components and interactions are shown in Figure 1.

Cardea Bystander application. Bystanders use the Cardea application to register as users and define their privacy profiles.

A bystander is presented with the user interface of the Cardea client application for registration. The application will extract

Fig. 2. Workflow of Cardea. The privacy protection decision is made according to computing results from the local recording device and the remote cloud. The final privacy protection action is performed on the original image.

face features from the bystander automatically and then upload them to the cloud to train the face recognition model. After registration, a user can define his context–dependent privacy profile. The profile will also be sent to the cloud and stored in the cloud database for future retrieval. Both features and profiles can be updated.

Cardea Recorder application. Recorders use the Cardea application to take images. After an image is captured, context elements are computed locally on the device. Meanwhile, the application detects faces and extracts face features. As hand gesture recognition is extremely computationally intensive, that task is offloaded to the cloud. To lower the risk of privacy leakage during transmission, detected faces are removed from the image before the compressed image data and face features are sent to the cloud.
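As a rough illustration of this client-side step, the sketch below detects faces, blanks them out of the copy that leaves the device, and JPEG-compresses the result. It uses OpenCV's stock Haar cascade as a stand-in; the exact detector configuration and compression quality are assumptions, not values from the paper.

```python
import cv2

def prepare_upload(image_bgr):
    """Detect faces, blank them out of the outgoing copy, and JPEG-compress it.

    Returns the JPEG bytes to upload and the face boxes kept on the device.
    Cascade file and JPEG quality are assumptions, not values from the paper.
    """
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    outgoing = image_bgr.copy()
    for (x, y, w, h) in faces:
        outgoing[y:y + h, x:x + w] = 0  # face pixels never leave the device

    ok, jpeg = cv2.imencode(".jpg", outgoing, [int(cv2.IMWRITE_JPEG_QUALITY), 90])
    return jpeg.tobytes(), list(faces)
```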

Upon receiving the synthesized results from the cloud, the final privacy protection action is performed on the image according to the computing results from both the local device and the remote cloud. Details of how privacy protection decisions are made are discussed in Section IV.

Cardea Cloud Server. The function of the cloud server is twofold: a) storing users' profiles and training the user recognition model; and b) responding to clients' requests by recognizing users in the image and performing the corresponding tasks.

When the cloud server receives a request from the client application, it first recognizes users in the image using the face recognition model. If there is any recognized user, the corresponding privacy profile is retrieved to trigger related tasks; for example, if the "Yes" or "No" gesture is enabled, the gesture recognition task will start. Finally, the computing results on the cloud are synthesized and sent back to the client.

IV. DESIGN AND IMPLEMENTATION

The client application and cloud server are implemented on Android smartphones and a desktop computer, respectively. Next, we discuss the whole decision making procedure, shown in Figure 2, which involves both the client application and the cloud server.

After an image is captured using Cardea, we first detect faces and extract face features. If there is any face in the image, we then input the extracted face features into the pre–trained face recognition model to recognize users. If any Cardea user is recognized, his privacy profile is retrieved; otherwise, we do nothing to the original image.

The profile describes whether the user enables hand gestures and the location where he is concerned about his visual privacy. The profile may also contain information about selected scenes and other people the user especially cares about. Among these factors, "Yes" and "No" gestures have the highest priority; that is, if any gesture is enabled in the user's profile and detected, the privacy protection action is determined instantly. In this way, users can temporarily modify their privacy preferences, without updating their privacy profiles, for cases in which they are aware of being photographed.

If gestures are not enabled or not detected for a user, and the image is captured within the area/location the user specifies, we start the tasks of recognizing the scene and finding whether anyone the user cares about appears in the image, according to his privacy profile. When all tasks are finished, the final privacy protection decision is made and performed on the original image automatically to protect the user's visual privacy. A sketch of this decision flow is given below.
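The following is one plausible Python rendering of the Fig. 2 decision flow, reusing the illustrative profile fields from the earlier sketch. The two helper predicates are placeholders for the geofence and face-matching steps described in Section IV-B; none of the names here come from Cardea's actual code.

```python
import cv2

def blur_face(image, box):
    """Prototype protection action: Gaussian-blur the face region in place."""
    x, y, w, h = box
    image[y:y + h, x:x + w] = cv2.GaussianBlur(image[y:y + h, x:x + w], (51, 51), 0)

def location_matches(location, areas):
    """Placeholder: the real check is a geofence test (see Section IV-B)."""
    return location in areas

def sensitive_person_present(image, profile):
    """Placeholder for the Algorithm 1 face matching step (Section IV-B)."""
    return False

def decide_and_protect(image, recognized_users, predict_scene, location):
    """One plausible rendering of the Fig. 2 decision flow.

    Each recognized user carries a retrieved profile, a face bounding box,
    and the gesture (if any) linked to the face. Gestures take priority;
    location then gates the scene and presence-of-others checks.
    """
    scene = None
    for user in recognized_users:
        p = user.profile
        if p.yes_gesture_enabled and user.gesture == "yes":
            continue                           # leave the face untouched
        if p.no_gesture_enabled and user.gesture == "no":
            blur_face(image, user.face_box)
            continue
        if not location_matches(location, p.locations):
            continue                           # profile does not apply here
        if scene is None:
            scene = predict_scene(image)       # scene is computed once per image
        if scene in p.scene_groups or sensitive_person_present(image, p):
            blur_face(image, user.face_box)
    return image
```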

A. User Recognition

User recognition identifies the users in the image who request visual privacy protection.

Face Detection. Face detection locates face regions in the whole image; it is the first step for all face image processing. We use the Adaboost cascade classifier [17] implemented in the OpenCV [18] library to detect face regions, which has real–time performance. We then filter detected face regions using Dlib [19] to reduce false positives.

Face Feature Extraction. Face feature extraction finds a compact yet discriminative representation that best describes a face, instead of using raw pixels. Convolutional Neural Networks (CNNs) have achieved state-of-the-art results on many computer vision tasks, including face recognition, which has been studied for years using different image processing techniques [20], [21]. Compared with conventional features such as Local Binary Patterns [22], features extracted from CNN models yield much better performance. Therefore, we use 256–dimensional CNN features extracted from a lightened CNN model [23], which has a small size, fast feature extraction, and a low-dimensional representation. The model is pre-trained on the CASIA-WebFace dataset, which contains ~0.5M face images from 10,575 subjects [24]. We use the 1 KB output of the "eltwise_fc1" layer of the CNN model as face features. We also experimented with a deeper CNN model using the VGG architecture [25], which is widely used in CNN models; however, the computational burden of the VGG model is too heavy compared with the lightened CNN model, with no obvious performance improvement. We use the open source deep learning framework Caffe [26] and its Android library [27] to extract face features with the lightened CNN model on Android smartphones.
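For reference, the snippet below shows how such features can be read out of a Caffe model in Python. The prototxt/weights filenames are placeholders, and the 128×128 grayscale preprocessing is an assumption based on the public lightened CNN release, not a detail given in the paper.

```python
import caffe
import cv2
import numpy as np

# Placeholder paths for the lightened CNN model definition and weights [23], [27].
net = caffe.Net("lightened_cnn.prototxt", "lightened_cnn.caffemodel", caffe.TEST)

def extract_face_feature(face_bgr):
    """Return the 256-d "eltwise_fc1" feature for one cropped face.

    Assumes the model consumes a 128x128 grayscale face scaled to [0, 1],
    as in the public lightened CNN release.
    """
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (128, 128)).astype(np.float32) / 255.0
    net.blobs["data"].reshape(1, 1, 128, 128)
    net.blobs["data"].data[0, 0] = gray
    net.forward()
    return net.blobs["eltwise_fc1"].data[0].copy()  # 256 floats = 1 KB per face
```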

Face Recognition. Face recognition identifies users in the image. We train the face recognition model using

TABLE I
NINE GENERAL SCENE GROUPS

Group name        | # Scenes | Examples
Shopping          | 20       | clothing store, market, supermarket
Travelling        | 9        | airport, bus station, subway platform
Park & street     | 12       | downtown, park, street, alley
Eating & drinking | 18       | bar, bistro, cafeteria, coffee shop, fast-food restaurant, food court
Working & study   | 9        | classroom, conference center, library, office, reading room
Scantily clad     | 12       | beach, swimming pool, water park
Medical care      | 2        | hospital room, nursing home
Religion          | 11       | cathedral, chapel, church, temple
Entertainment     | 5        | amusement park, ballroom, discotheque
All               | 98       |

a Support Vector Machine (SVM), which can achieve good results without much training data. The model training is a supervised learning process; the input training data for the face recognition model are face features from Cardea users. We train the SVM model using LibSVM [28] with a linear kernel, as the number of training samples from each user is smaller than the dimensionality of each input sample.

With the face recognition model, we obtain a prediction for each input face feature vector. The output is p_i, i ∈ {1, 2, ..., n}, where p_i is the probability of the face belonging to user i. We set a probability threshold T_p, and only when max_i p_i ≥ T_p do we consider the user recognized. This avoids cases in which the input face feature of a non–registered bystander is mistakenly recognized as a registered user. The threshold T_p is chosen based on the experiments described in the next section.
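The paper trains with LibSVM [28]; the sketch below uses scikit-learn's LibSVM-backed SVC to show the thresholded prediction rule (probability=True corresponds to LibSVM's '-b 1' Platt-scaled estimates). The value T_p = 0.08 anticipates the range chosen experimentally in Section V.

```python
import numpy as np
from sklearn.svm import SVC

def train_recognizer(features, user_ids):
    """Linear SVM over face features, with probability estimates enabled."""
    model = SVC(kernel="linear", probability=True)
    model.fit(features, user_ids)
    return model

def recognize(model, feature, t_p=0.08):
    """Return the predicted user id, or None for a non-registered bystander."""
    probs = model.predict_proba(feature.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return model.classes_[best] if probs[best] >= t_p else None
```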

B. Context–aware Computing

Location, scene, and people in the image are the three context elements defined in users' privacy profiles. We acquire these context elements in different ways after the image is captured. Together, they decide the final privacy protection action to be performed on the raw image.

Location Filtering. Location provides coarse control for individuals. Users can name a concrete sensitive area in which they may have privacy worries. By specifying a location such as a campus, a user's privacy control will work within the area of the campus. We obtain the location directly from GPS when the image is taken; a minimal geofence check is sketched below.
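Cardea does not specify how named areas are encoded; a center point plus radius is one simple assumption, checked with the haversine distance.

```python
import math

def within_area(lat, lon, area_lat, area_lon, radius_m):
    """Haversine check: is the capture location inside the specified area?"""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat), math.radians(area_lat)
    dp = math.radians(area_lat - lat)
    dl = math.radians(area_lon - lon)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a)) <= radius_m
```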

Scene Classification. Scene context is a complex concept that not only relates to places but also gives clues about what people may be doing. We summarize 9 general scene groups; in this way, people can select a general scene group instead of listing all the places they care about. We fine–tune a pre–trained CNN model to perform scene classification. The detailed procedure is described below.

a) Data Preparation and Preprocessing: The data for training scene classification come from the Places2 dataset [29]. At the time Cardea was built, the Places2 dataset contained 401 categories with more than 8 million training images. Among the 401 scene categories, we choose 98 categories and group them into the 9 general scene groups listed in Table I, based on their contextual similarity. The grouped scenes are either pervasive, common scenes in daily life that may contain a number of bystanders, or places where people are more likely to have privacy concerns. This scene subset comprises 1.9 million training images and 4,900 validation images (50 validation images per category).

b) Training: The training procedure is a standard fine–tuning process. We first extract features of all training images from the 98 categories. The features are the output of the "fc7" layer, using the pre–trained AlexNet model provided by the Places2 project as a feature extractor [29]. With all the features, we then train a Softmax classifier for the 98 scene categories.

The category classifier achieves 55% validation accuracy on the 98 categories. There is no other benchmark result on the subset we chose, but recent benchmarks give 53% ~ 56% validation accuracy on the new Places2 dataset with 365 categories [29]. Both feature extraction and classifier training are implemented using the Caffe library [26], [30].
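This training step amounts to fitting a 98-way linear softmax classifier on fixed "fc7" features. A scikit-learn stand-in for that classifier could look as follows; the paper's actual training runs in Caffe [26], [30], so this is only an equivalent sketch.

```python
from sklearn.linear_model import LogisticRegression

def train_scene_softmax(fc7_features, category_labels):
    """Train a 98-way softmax (multinomial logistic regression) over fixed
    "fc7" features; fc7_features is an (N, 4096) array, category_labels the
    integer scene category of each image."""
    clf = LogisticRegression(multi_class="multinomial", solver="lbfgs",
                             max_iter=1000)
    clf.fit(fc7_features, category_labels)
    return clf
```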

c) Prediction: To get the probability of each scene group, we simply sum the probabilities of the categories in the same group. The final prediction is the group with the highest probability among the 9 groups, which can be seen as a hard–coded clustering process.
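The aggregation step is a few lines; the sketch below assumes a `group_of` mapping from the 98 category indices to the 9 group indices, which is our illustrative name for the Table I grouping.

```python
import numpy as np

def predict_scene_group(category_probs, group_of, n_groups=9):
    """Sum the 98 category probabilities within each of the 9 groups and
    return the argmax group; group_of[c] maps category c to its group index."""
    group_probs = np.zeros(n_groups)
    for c, p in enumerate(category_probs):
        group_probs[group_of[c]] += p
    return int(np.argmax(group_probs)), group_probs
```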

The prediction accuracy of our scene group model on the validation set reaches 83%. As category prediction probabilities are usually distributed among similar categories belonging to the same group, the group prediction results are superior to the category prediction results. This is what we desire for the purpose of privacy protection based on a more general scene context description.

Presence of Other People. The third context element we take into account is the people in the image. For example, user Alice can upload 10 face features of Bob, with whom she does not want to be photographed. The task is then to determine whether Bob also appears in the image when Alice is captured.

Simple similarity matching is adopted, as the number of potential matches (i.e., people in the image other than Alice herself) is small. We apply cosine similarity as the distance metric to obtain a distance value between a pair of feature vectors. Assume we have 10 of Bob's face features and 1 face feature extracted from a person P in the image. We compare P's feature with each of Bob's features; if the value does not exceed the distance threshold T_d, we regard it as a hit. When the hit ratio over Bob's features reaches the ratio threshold T_r, we conclude that P is Bob, and therefore Alice should be protected. The detailed face matching method is described in Algorithm 1.

Algorithm 1 Face Matching
1:  initialize P's feature f0; Bob's features fi, i ∈ {1, ..., N}; distance threshold Td; hit ratio threshold Tr
2:  m ← 0                        ▷ number of hits
3:  for i = 1 to N do
4:      di ← dis(f0, fi)         ▷ dis(x, y) returns the distance between x and y
5:      if di ≤ Td then
6:          m ← m + 1
7:      end if
8:  end for
9:  if m/N ≥ Tr then
10:     return true              ▷ P is Bob
11: else
12:     return false             ▷ P and Bob are two different persons
13: end if
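For concreteness, here is a direct Python rendering of Algorithm 1. The paper uses cosine similarity with Td ≈ 0.5 and Tr ≈ 0.5 (Section V); writing dis(x, y) as 1 − cosine similarity is our assumption about the exact normalization.

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity; smaller values mean more similar features."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def matches(query_feature, enrolled_features, t_d=0.5, t_r=0.5):
    """Algorithm 1: does the queried person P match the enrolled person?"""
    hits = sum(1 for f in enrolled_features
               if cosine_distance(query_feature, f) <= t_d)
    return hits / len(enrolled_features) >= t_r
```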

C. Interaction using Hand Gestures

Hand gestures are natural body language. Our goal is to detect and recognize "Yes" and "No" gestures in the image. However, hand gesture recognition in images taken by regular cameras against cluttered backgrounds is a difficult task: conventional skin color–based hand detectors fail dramatically in complex environments. In our design, we use the state–of–the–art object detection framework, Faster Region–based CNN (Faster R-CNN) [31], to train a hand gesture detector, as described below.

Gesture Recognition Model Training. For the gesture recognition task, we categorize hand gestures into 3 classes: 1) the "Yes" gesture; 2) the "No" gesture; and 3) the "Natural" gesture, i.e., a hand in any other pose. The data used to train the gesture recognition model comprise 13,050 "Natural" gesture instances from 5,628 images in the VGG hand dataset [32], [33], plus 4,712 "Yes" gesture instances and 3,503 "No" gesture instances from 5,900 images crawled from Google and Flickr. Each annotation consists of an axis-aligned bounding rectangle along with its gesture class.

With this comprehensive gesture dataset and the VGG16 pre–trained model provided with the Faster R-CNN library, we fine–tune the "conv3_1" and upper layers together with the region proposal layers and Fast R-CNN detection layers [34]. The detailed training procedure can be found in [35]. After a gesture is detected, we link it to the nearest face, which requires users to show their hands near their faces when using gestures.
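The final association step (linking a detected gesture to the nearest face) can be as simple as a nearest-centre search over bounding boxes. The sketch below is our assumed rendering of that rule; the paper does not spell out the distance measure.

```python
def box_center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def link_gesture_to_face(gesture_box, face_boxes):
    """Return the index of the face whose centre is nearest the gesture box.

    Users are asked to show hands near their faces, so nearest-centre
    matching is usually unambiguous.
    """
    gx, gy = box_center(gesture_box)
    dists = [(gx - fx) ** 2 + (gy - fy) ** 2
             for (fx, fy) in map(box_center, face_boxes)]
    return min(range(len(face_boxes)), key=dists.__getitem__)
```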

We show the user interface of the Cardea client application in Figure 3(a), together with an example of image processing results in Figure 3(b).

V. EVALUATION

In this section, we present evaluation results along 3 axes: 1) vision micro–benchmarks, to evaluate the performance of the different computer vision tasks, including face recognition and matching, scene classification, and hand gesture recognition; 2) overall system performance, based on final privacy protection decisions and users' privacy preferences; and 3) runtime and energy consumption, which show the processing time of each part and the energy consumed per image.

(a) Cardea client application user interface. (b) Privacy protection example.

Fig. 3. Cardea user interface and privacy protection results. In (a), jiayu registers as a Cardea user by extracting and uploading her 20 face features. She specifies HKUST and four scene groups for privacy protection, and also enables the "Yes" and "No" gestures. In (b), a picture is taken at HKUST; 3 registered users, one "Yes" gesture, and one "No" gesture are recognized, and the scene is correctly predicted as Park & street. jiayu's face is therefore not blurred, due to her "Yes" gesture. Prediction probabilities are also shown in (b).

A. Accuracy of Face Recognition and Matching

We first select 50 subjects from the LFW dataset [36] who have more than 10 images each as registered users. Note that the subjects in the CASIA-WebFace database used to train the lightened CNN model do not overlap with those in LFW [37]. For each subject, we extract at least 100 face features from YouTube videos to simulate the process of user registration; in total, we obtain 5,042 feature vectors. These features are then divided into a training set and a validation set. In addition, we collect a user test set and a non–user test set to evaluate face recognition accuracy. The user test set is composed of 511 face feature vectors from online images of all 50 registered users. The non–user test set consists of 166 face feature vectors from 100 subjects in the LFW database whose names start with "A", serving as non–registered bystanders.

Figure 4(a) shows the recognition accuracy on the validation set and user test set with different numbers of features per person used for training. The accuracy refers to the fraction of faces that are correctly recognized. The results show that the model trained with 10 ~ 20 features per person can achieve near 100% accuracy on the validation set and over 90% accuracy on the user test set; little improvement is achieved with more training data. Therefore, we train the face recognition model with 20 features per person. Figure 4(b) shows the overall accuracy as the probability threshold T_p varies. For the non–user test set, accuracy means the fraction of faces that are not recognized as registered users. As a result, accuracy increases for the non–user test set but decreases for the validation set and user test set as T_p goes up. To ensure that registered users are correctly recognized and non-registered users are not mistakenly recognized, we choose T_p to be 0.08 ~ 0.09, which achieves over 80% recognition accuracy for both users and non–users.

To evaluate face matching performance, we again use the user test set. Subjects who have more than 20 features are regarded as the database group; the rest form the query group. As with face recognition accuracy, we break face matching

[Fig. 4. Face recognition accuracy. (a) Training accuracy vs. number of training features per person (validation set and user test set). (b) Testing accuracy vs. probability threshold Tp (validation set, user test set, and non-user test set).]

[Fig. 5. Face matching accuracy with different distance thresholds Td and ratio thresholds Tr (Tr = 0.3, 0.5, 0.7; database (db) and query (qr) groups). (a) Cosine distance. (b) Euclidean distance.]

accuracy into two parts: for persons belonging to the database group, we need to correctly match features only to the correct person; for persons not belonging to the database group, we should not match features to anyone. Figure 5 shows the matching accuracy with cosine distance and Euclidean distance, respectively. The results show that cosine distance achieves better performance, with near 100% accuracy in both situations, i.e., whether people belong to the database group or the query group. The preferable parameters are distance threshold Td ≈ 0.5 and ratio threshold Tr ≈ 0.5.

Overall, the face recognition and matching methods we employ, with appropriate thresholds Tp, Td, and Tr, can effectively and efficiently recognize users and match faces in images. More importantly, only a small number of features are needed

[Fig. 6(a). Recall: numbers of true positive (TP) and false negative (FN) images per scene group.]

Fig. 6(b). Confusion matrix (rows: target group; columns: predicted group):

Target \ Predicted |  Sh  Tr  Pa  Ea  Wo  Sc  Me  Re  En
Shopping           | 118   0   0   1   0   0   0   0   0
Travelling         |   2  88  16   1   0   0   0   0   9
Park & street      |   0   5  83   0   0   1   0   0   0
Eating & drinking  |  16   4   0  68  10   0   0   0   0
Working & study    |   7   8   1   7  63   0   0   0   0
Scantily clad      |   1   0   0   0   1  44   0   0   0
Medical care       |   0   3   0   1   0   0  29   0   0
Religion           |   0   0   1   0   0   0   0  24   0
Entertainment      |   2   0   0  10   1   0   0   0  13

Fig. 6. Scene classification results.

for training the face recognition model and for the face matching algorithm.

B. Performance of Scene Classification

We recruited 8 volunteers and asked them to take pictures "in the wild" belonging to the 9 general scene groups. After manually annotating these pictures, we had 759 images in total, with 638 images falling into the 9 scene groups of interest. We then predict the scene group of each image using the scene classification model we trained.

The number of images and the classification result for each group are shown in Figure 6(a). True positives (TP) are images that are correctly classified, and false negatives (FN) are those classified as other scene groups. Overall, we achieve 0.83 recall (TP/(TP+FN)), and 5 scene groups have recall above 0.90. It is worth mentioning that the recall of the scene groups Scantily clad (Sc), Medical care (Me), and Religion (Re) exceeds 0.95, which provides strong support for protecting users' privacy in sensitive scenes.

We also give the detailed classification confusion matrix in Figure 6(b). It shows that most FN of Eating & drinking (Ea) are classified as Shopping (Sh) or Working & study (Wo), and most FN of Wo are classified as Ea or Sh. The reason is that the boundaries between Sh, Ea, and Wo are not clear: for example, shopping malls have food courts, and people study in coffee shops. The same reason accounts for the confusion between Park & street (Pa) and Travelling (Tr). Moreover, for scene categories such as pub and bar, people may group them into Ea or En. Therefore, a safe strategy is to select more scene groups, for instance both Ea and En, when going to a pub at night.

In general, the evaluation results from images captured "in

TABLE II
CARDEA'S OVERALL PRIVACY PROTECTION PERFORMANCE

Overall accuracy            | 86.4% |
Protection accuracy         | 80.4% | No-protection accuracy | 91.0%
Face recognition accuracy   | 98.5% | "Yes" gesture recall   | 97.9%
Scene classification recall | 77.7% | "No" gesture recall    | 77.3%

the wild" demonstrate that most scenes can be correctly classified; the performance is especially satisfactory for sensitive scenes.

C. Performance of Gesture Recognition

We asked our volunteers to take pictures of each other at different distances and angles, and under different lighting conditions and backgrounds. We manually annotated all images with hand regions and the scene group each belongs to. In total, we collected 338 hand gesture images containing 208 "Yes" gestures, 211 "No" gestures, and 363 natural hands. The images cover 6 of the 9 scene groups; images that do not belong to our summarized scene groups are categorized into the group Others.

Figure 7(a) and Figure 7(b) show the recall and precision for "Yes" and "No" gestures under different scene groups. The recall and precision of the "Yes" gesture, and the precision of the "No" gesture, reach 1.0 for most scene groups, while the recall of the "No" gesture averages 0.70. For Entertainment (En), the recall is low, resulting from dim illumination in most of the images we tested. In general, the performance of gesture recognition does not show any marked correlation with scene group. We therefore further investigated recall as a function of gesture size, since low recall threatens a user's privacy far more than low precision does.

Figure 7(c) plots the recall of gestures with varying hand region sizes. Each image is resized to about 800 × 600 pixels while keeping its aspect ratio, and the gestures are classified into 10 size intervals, plotted along the x–axis. The results show that "Yes" gestures achieve more than 0.9 recall at all hand region sizes. On the other hand, the recall of "No" gestures tends to rise with increasing hand region size, which indicates that the performance of "No" gesture recognition can be improved with more training data at smaller sizes.

In summary, the performance of gesture recognition demonstrates the feasibility of integrating gesture interaction into Cardea for flexible privacy preference modification. It performs extremely well for "Yes" gesture recognition, and there is room for improvement in "No" gesture recognition with more training samples.

D. Overall Performance of Cardea Privacy Protection

After evaluating each vision task separately, we now present Cardea's overall privacy protection performance. Faces in an image taken using Cardea end up being protected (e.g., blurred) or remaining unchanged, correctly or incorrectly, depending on protection decisions made from both the user's privacy profile and the results of the vision tasks. We therefore asked 5 volunteers to register as Cardea users and set their

[Fig. 7. Hand gesture recognition results ("Yes" and "No" gestures). (a) Recall for different scenes. (b) Precision for different scenes. (c) Recall for different hand region sizes (K square pixels).]

privacy profiles. The face recognition model is now trained using 1,100 face feature vectors from 55 people, including the 50 people from the LFW dataset. We took about 300 images and obtained their processed versions. As we focus on face recognition rather than face detection, we only keep images in which faces have been successfully detected; in total, we got 224 images for evaluation.

Table II shows the final privacy protection accuracy, as well as the performance of each vision task. The protection accuracy shows that 80.4% of faces requiring protection are actually protected, and 91.0% of faces not asking for protection remain unchanged. Overall, 86.4% of faces are processed correctly, even though the scene classification recall and "No" gesture recall do not reach 80%. The reason is that Cardea's protection decision making process can sometimes compensate for mistakes made in an earlier step. For example, if a user's "No" gesture is not detected, his face can still be protected when the predicted scene is selected in his profile.

In summary, Cardea achieves over 85% accuracy for users in the real world. Improvements in each vision part will directly benefit Cardea's overall performance in the future.

E. Runtime and Energy Consumption

We validate the client side implementation on a Samsung Galaxy Note 4 (http://www.gsmarena.com/samsung_galaxy_note_4-6434.php), with a 4 × 2.7 GHz Krait 450 CPU, Qualcomm Snapdragon 805 chipset, 3 GB RAM, and a 16 MP f/2.2 camera. The server side is configured with an Intel i7–5820K CPU, 16 GB RAM, and a GeForce 980Ti graphics card (6 GB RAM). The client and server communicate via TCP over a Wi–Fi connection.

Figure 8 plots the time taken for Cardea to complete the different vision tasks. We take images with 1 face, 2 faces, and 5 faces at a size of 1920 × 1080. The images are compressed in JPEG format; on average, the data sent is about 950 KB. Note that some vision tasks will not be triggered in some situations, according to the decision workflow explained in Section IV: for example, if no user is recognized in the image, all other tasks will not start. For the purpose of measurement, we still activate all tasks to illustrate the runtime difference for images captured with varying numbers of faces.

Among all vision tasks, face processing (i.e., face detection and feature extraction) and scene classification are performed

[Fig. 8. Task-level runtime of Cardea for 1, 2, and 5 faces: face detection & feature extraction (client), scene classification (client), face recognition & matching (server), gesture recognition (server), and network.]

on the smartphone. Face processing takes about 900 milliseconds for 1 face and increases by about 200 milliseconds per additional face, reaching about 1100 and 1700 milliseconds for 2 faces and 5 faces, respectively. This is the most fundamental step that runs locally on the smartphone, due to privacy considerations. Owing to the real–time OpenCV face detection implementation and the lightened CNN feature extraction model, it takes less than one third of the overall runtime. Scene classification takes around 900 milliseconds per image, as it is performed only once, independent of the number of people in the image. The face recognition and matching tasks on the server side take less than 30 milliseconds for five people; though this time grows with the number of people, it barely affects the overall runtime compared with the other tasks. Gesture recognition also runs on the server and takes about 330 milliseconds; like scene recognition, it is performed once on the whole image, regardless of the number of faces. According to our measurements, network transmission accounts for the majority of the overall runtime, due to the unstable network environment.

In general, photographers using Cardea can obtain a processed image within 5 seconds even in the heaviest case (i.e., a registered user who enables gestures is present and scene classification is triggered on the smartphone). Compared with existing image capture platforms that also provide privacy protection, such as I-pic [16], Cardea offers more efficient image processing functionality.

Next, we measure the energy consumption of taking pictures with Cardea on the Galaxy Note 4 using the Monsoon Power Monitor [38]. The images are also taken at a size of 1920 × 1080 pixels. The first two columns of Table III

TABLE III
ENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBERS OF FACES

        | Face recognition (uAh) | Whole process (uAh) | # of images
1 face  | 217.2 (std 3.4)        | 1134.5 (std 45.9)   | ~2800
2 faces | 344.1 (std 13.1)       | 1276.7 (std 113.8)  | ~2500
5 faces | 692.6 (std 36.5)       | 1641.1 (std 66.0)   | ~2000

show the energy consumption of the face processing part alone and of the whole process (i.e., from taking a picture to getting the processed image), with 1 face, 2 faces, and 5 faces, respectively. The screen stays on during the whole process, so a large portion of the energy consumption is due to the always–on screen. Moreover, we observe that face processing energy is linear in the number of faces, while all other parts, including scene classification and sending and receiving data, are independent of the number of faces in the image, which is consistent with the runtime measurements.

Using the energy measurements, we also show Cardea's capacity on the Galaxy Note 4 in the last column of Table III. This device has a 3220 mAh battery, and can therefore capture about 2000 high quality images with 5 faces using Cardea.

VI. RELATED WORK

Visual privacy protection has attracted extensive attention in recent years, due to the increasing popularity of mobile and wearable devices with built–in cameras. As a result, a number of research works have proposed solutions to address these privacy issues.

Halderman et al. [14] propose an approach in which closed devices can encrypt data together during recording, utilizing short range wireless communication to exchange public keys and negotiate an encryption key; only by obtaining all of the permissions from the people who encrypted the recording can one decrypt it. Jana et al. [39], [40] present methods in which third–party applications, such as perceptual and augmented reality applications, have access only to higher-level objects, such as a skeleton or a face, instead of raw sensor feeds. Raval et al. [41] propose a system that gives users control to mark secure regions that the camera has access to, so that cameras are prevented from capturing sensitive information. Unlike the above solutions, our work focuses on protecting bystanders' privacy by respecting their privacy preferences when they are captured in photos.

To identify individuals who request privacy protection, advanced techniques have been applied. PriSurv [42] is a video surveillance system that identifies objects using RFID tags to protect the privacy of objects in the video surveillance system. Courteous Glass [15] is a wearable camera integrated with a far-infrared imager that turns off recording when new persons or specific gestures are detected. However, these approaches require extra facilities or sensors that are not currently equipped on devices. Our approach instead takes advantage of state–of–the–art computer vision techniques that are reliable and effective.

Some other efforts rely on visual markers. Respectful Cameras [10] is a video surveillance system that requires people who wish to remain anonymous to wear colored markers such as hats or vests, and their faces will be blurred. Bo et al. [8] use a QR code as a privacy tag to link an individual with his photo sharing preferences, which empowers photo service providers to enforce privacy protection following users' policy expressions, mitigating the public's privacy concerns. Roesner et al. [9] propose a general framework for controlling access to sensor data on continuous sensing platforms, allowing objects to explicitly specify access policies; for example, a person can wear a QR code to communicate her privacy policy, and a depth camera is used to find the person. Unfortunately, using existing markers like QR codes is technically feasible but lacks usability, as few people would wear QR codes or other conspicuous markers in public places. Moreover, static policies cannot satisfy people's privacy preferences all the time, as these preferences are context–dependent, vary from person to person, and change from time to time. Therefore, we allow individuals to update their privacy profiles at any time, and provide additional cues to inform cameras of necessary protection.

I-pic [16], a recent and interesting work, allows people to broadcast their privacy preferences and appearance information to nearby devices using BLE; this work could be incorporated into our framework. Furthermore, we specify context elements that have not been considered before, such as scene and the presence of others. Besides, we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera, whereas the data broadcast in I-pic may not be received by the people taking images, or the received data may be outdated.

VII. CONCLUSION AND FUTURE WORK

In this work, we designed, implemented, and evaluated Cardea, a visual privacy protection system that aims to address individuals' visual privacy issues caused by pervasive cameras. Cardea leverages state–of–the–art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express their context–dependent privacy preferences in terms of location, scenes, and the presence of other people in the image. Besides, people can show "Yes" and "No" gestures to dynamically modify their privacy preferences. We demonstrated the performance of the different vision tasks on pictures taken "in the wild"; overall, Cardea achieves about 86% accuracy on users' requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which demonstrates the feasibility of running Cardea to take pictures while respecting bystanders' privacy preferences.

Cardea can be enhanced in several aspects. Future work includes improving scene classification and hand gesture recognition performance, compressing the CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people's visual privacy in video is another challenging and significant task, which would make Cardea a complete visual privacy protection framework.

REFERENCES

[1] "Microsoft HoloLens," http://www.microsoft.com/microsoft-hololens/en-us.
[2] "Google Glass," http://www.google.com/glass/start.
[3] "Narrative Clip," http://www.getnarrative.com.
[4] R. Shaw, "Recognition markets and visual privacy," UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
[5] A. Acquisti, R. Gross, and F. Stutzman, "Face recognition and privacy in the age of augmented reality," Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
[6] "Congress's letter," http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy.
[7] "Data protection authorities' letter," https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp.
[8] C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, "Privacy.tag: Privacy concern expressed and respected," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163–176.
[9] F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, "World-driven access control for continuous sensing," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169–1181.
[10] J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, "Respectful cameras: Detecting visual markers in real-time to address privacy concerns," in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65–89.
[11] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, "Privacy behaviors of lifeloggers using wearable cameras," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571–582.
[12] R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, "Sensitive lifelogs: A privacy analysis of photos from wearable cameras," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645–1648.
[13] T. Denning, Z. Dehlawi, and T. Kohno, "In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies," in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014, pp. 2377–2386.
[14] J. A. Halderman, B. Waters, and E. W. Felten, "Privacy management for portable recording devices," in Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society. ACM, 2004, pp. 16–24.
[15] J. Jung and M. Philipose, "Courteous glass," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307–1312.
[16] P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, "I-pic: A platform for privacy-compliant image capture," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[18] "OpenCV," http://opencv.org.
[19] "Dlib," http://dlib.net.
[20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[21] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, "Labeled faces in the wild: A survey," in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189–248.
[22] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[23] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," arXiv preprint arXiv:1511.02683, 2015.
[24] "CASIA-WebFace dataset," http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[27] sh1r0, "Caffe-android-lib," https://github.com/sh1r0/caffe-android-lib.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places2 dataset project," http://places2.csail.mit.edu/index.html.
[30] BVLC, "Caffe," http://caffe.berkeleyvision.org.
[31] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[32] "VGG hand dataset," http://www.robots.ox.ac.uk/~vgg/data/hands.
[33] A. Mittal, A. Zisserman, and P. H. Torr, "Hand detection using multiple proposals," in BMVC. Citeseer, 2011, pp. 1–11.
[34] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[35] "Faster R-CNN (Python implementation)," https://github.com/rbgirshick/py-faster-rcnn.
[36] "LFW dataset," http://vis-www.cs.umass.edu/lfw.
[37] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.
[38] "Monsoon Power Monitor," https://www.msoon.com/LabEquipment/PowerMonitor.
[39] S. Jana, A. Narayanan, and V. Shmatikov, "A scanner darkly: Protecting user privacy from perceptual applications," in Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 349–363.
[40] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, "Enabling fine-grained permissions for augmented reality applications with recognizers," in Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415–430.
[41] N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, "What you mark is what apps see," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[42] K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, "PriSurv: Privacy protected video surveillance system using adaptive visual abstraction," in International Conference on Multimedia Modeling. Springer, 2008, pp. 144–154.

  • I Introduction
  • II Practical Visual Privacy Protection
    • II-A Contextndashdependent Personal Privacy Preferences
    • II-B Challenges Principles and Limitations
      • III Cardea Overview
        • III-A Key Concepts
        • III-B System Overview
          • IV Design and Implementation
            • IV-A User Recognition
            • IV-B Contextndashaware Computing
            • IV-C Interaction using Hand Gestures
              • V Evaluation
                • V-A Accuracy of Face Recognition and Matching
                • V-B Performance of Scene Classification
                • V-C Performance of Gesture Recognition
                • V-D Overall Performance of Cardea Privacy Protection
                • V-E Runtime and Energy Consumption
                  • VI Related Work
                  • VII Conclusion and Future Work
                  • References
Page 3: Cardea: Context–Aware Visual Privacy Protection from ... · context–aware and interactive visual privacy protection frame-work that enforces privacy protection according to people’s

Fig 1 Cardea overview The client application allows bystanders to registerand set privacy profile Users can use the platform to take pictures whilerespecting others privacy preferences The cloud server recognizes usersretrieves privacy profiles and performs necessary computations

recorder is the person who holds a device with a builtndashincamera A bystander who worries about his visual privacycan use Cardea to express his privacy preferences A recorderwho respects bystandersrsquo privacy preferences uses Cardeaframework to capture images Both the bystander and recorderbecome users of the privacy protection framework The cloudlistens to recorderrsquos requests and makes sure captured imagesare compliant with registered bystanderrsquos privacy preferences

Privacy Profile Privacy profiles contain context elementswhether enabling hand gestures or not and the protectionaction

a) Context Elements Context elements are factors thatreflect peoplersquos visual privacy concerns Currently we haveconsidered location scene and people in the image Moreelements can be included in the profile to describe peoplersquoscontextndashdependent privacy preferences

b) Gesture Interaction We define ldquoYesrdquo and ldquoNordquogestures A ldquoYesrdquo gesture is a victory sign which means ldquoIwould like to appear in the photordquo A ldquoNordquo gesture is a palmwhich represents ldquoI do not want to be photographedrdquo They areconsistent with peoplersquos usual expectations and is commonlyused in daily lives to express willingness or unwillingness tobe captured by others

c) Privacy Protection Action This action refers toremoval of identifiable information In our implementationwe blur userrsquos face as an example of protection action Othermethods such as replacing the face with an average faceblurring the whole body or blending the body region into thebackground can be integrated into our framework

B System Overview

Cardea is composed of the client app and cloud server Itworks based on data exchange and collaborative computing ofboth the client and cloud sides The major components andinteractions are shown in Figure 1

Cardea Bystander application Bystanders use Cardea ap-plication to register as users and define their privacy profiles

A bystander is presented with the user interface of Cardeaclient application for registration The application will extract

Fig 2 Workflow of Cardea Privacy protection decision is made according tocomputing results from local recording device and remote cloud Final privacyprotection action is performed on the original image

face features from the bystander automatically and then up-load to the cloud to train the face recognition model Afterregistration a user can define his contextndashdependent privacyprofile The profile will also be sent to the cloud and storedin the cloud database for future retrieval Both features andprofiles can be updated

Cardea Recorder application Recorders use Cardea appli-cation to take images After capturing an image context ele-ments will be computed locally on the device Meanwhile theapplication detects faces and extracts face features As handgesture recognition is extremely computationallyndashintensivethe task will be offloaded to the cloud To lower the risk ofprivacy leakage during transmission detected faces will beremoved from the image before sending the compressed imagedata and face features to the cloud

Upon receiving synthesized results from the cloud the finalprivacy protection action will be performed on the imageaccording to the computing results from both the local deviceand remote cloud Details of how to make privacy protectiondecisions are discussed in Section IV

Cardea Cloud Server The function of the cloud serveris twofold a) storing usersrsquo profiles and training the userrecognition model and b) responding to clientsrsquo requests byrecognizing users in the image and perform correspondingtasks

When the cloud server receives a request from the client application, it first recognizes users in the image using the face recognition model. If any user is recognized, the corresponding privacy profile is retrieved to trigger the related tasks; for example, if the "Yes" or "No" gesture is enabled, the gesture recognition task starts. Finally, the computing results on the cloud are synthesized and sent back to the client.

IV. DESIGN AND IMPLEMENTATION

The client application and the cloud server are implemented on Android smartphones and a desktop computer, respectively. Next, we discuss the whole decision-making procedure, shown in Figure 2, which involves both the client application and the cloud server.

After an image is captured using Cardea, we first detect faces and extract face features. If there is any face in the image, we then input the extracted face features into the pre-trained face recognition model to recognize users. If any Cardea user is recognized, his privacy profile is retrieved; otherwise, we do nothing to the original image.

The profile describes whether the user enables hand gestures, and the location where he is concerned about his visual privacy. The profile may also contain information about selected scenes and other people that the user especially cares about. Among these factors, "Yes" and "No" gestures have the highest priority; that is, if a gesture is detected and enabled in the user's profile, the privacy protection action is determined instantly. In this way, users can temporarily modify their privacy preferences, without updating their privacy profiles, for cases in which they are aware of being photographed.

If gestures are not enabled or not detected for a user, and the image is captured within the area/location the user specifies, we start the tasks of recognizing the scene and checking whether anyone the user cares about, according to his privacy profile, appears in the image. When all tasks are finished, the final privacy protection decision is made and performed on the original image automatically to protect the user's visual privacy.
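
This decision flow can be condensed into a short sketch. This is an illustrative simplification, not Cardea's actual code; the parameter names and boolean inputs are ours.

    # Illustrative sketch of Cardea's per-face decision workflow (Figure 2).
    # gesture is "yes", "no", or None; all other flags are precomputed.
    def decide_protection(gestures_enabled, gesture, in_user_area,
                          scene_group, selected_scenes, avoided_person_present):
        """Return True if the recognized user's face should be blurred."""
        if gestures_enabled and gesture == "yes":
            return False                  # explicit opt-in overrides the profile
        if gestures_enabled and gesture == "no":
            return True                   # explicit opt-out overrides the profile
        if not in_user_area:
            return False                  # outside the user's sensitive area
        if scene_group in selected_scenes:
            return True                   # image matches a selected scene group
        return avoided_person_present     # protect if a listed person co-appears

    # Example: no gesture detected, image inside the user's area, and the
    # predicted scene is one of the user's selected groups -> blur the face.
    print(decide_protection(True, None, True, "Medical care", {"Medical care"}, False))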

A. User Recognition

User recognition identifies the users in the image who request visual privacy protection.

Face Detection: Face detection locates face regions in the whole image; it is the first step of all face image processing. We use the Adaboost cascade classifier [17] implemented in the OpenCV library [18] to detect face regions, which achieves real-time performance. We then filter the detected face regions using Dlib [19] to reduce false positives.
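
A minimal sketch of this two-stage detector, assuming OpenCV's stock frontal-face Haar cascade and a Dlib HOG pass over each candidate region as the false-positive filter (the paper does not pin down the exact Dlib filtering step):

    import cv2
    import dlib

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    hog = dlib.get_frontal_face_detector()

    def detect_faces(bgr_image):
        """Fast Adaboost cascade pass, then Dlib confirmation per candidate."""
        gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
        candidates = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        confirmed = []
        for (x, y, w, h) in candidates:
            crop = gray[y:y + h, x:x + w]
            if len(hog(crop, 1)) > 0:     # keep only regions Dlib also accepts
                confirmed.append((x, y, w, h))
        return confirmed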

Face Feature Extraction: Face feature extraction finds a compact yet discriminative representation that best describes a face, instead of using raw pixels. Convolutional Neural Networks (CNNs) have achieved state-of-the-art results on many computer vision tasks, including face recognition, which has been studied for years using different image processing techniques [20], [21]. Compared with conventional features such as Local Binary Patterns [22], features extracted from CNN models yield much better performance. Therefore, we use 256-dimensional CNN features extracted from a lightened CNN model [23], which has a small size, fast feature extraction, and a low-dimensional representation. The model is pre-trained on the CASIA-WebFace dataset, which contains ~0.5M face images from 10,575 subjects [24]. We use the 1 KB output of the "eltwise_fc1" layer of the CNN model as the face features. We also experimented with a deeper CNN model with the VGG architecture [25], which is widely used in CNN models; however, its computational burden is too heavy compared with the lightened CNN model, with no obvious performance improvement. We use the open source deep learning framework Caffe [26] and its Android library [27] to extract face features with the lightened CNN model on Android smartphones.
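
For illustration, the extraction step with pycaffe might look as follows; the prototxt/caffemodel paths are placeholders, and the 128 × 128 grayscale input and "eltwise_fc1" blob name are our assumptions about the lightened CNN model [23]:

    import caffe
    import numpy as np

    net = caffe.Net("lightened_cnn.prototxt", "lightened_cnn.caffemodel", caffe.TEST)

    def extract_face_feature(face_gray):
        """face_gray: 128 x 128 grayscale face crop, values scaled to [0, 1]."""
        blob = face_gray.astype(np.float32)[np.newaxis, np.newaxis, :, :]
        net.blobs["data"].reshape(*blob.shape)
        net.blobs["data"].data[...] = blob
        net.forward()
        # 256 float32 values = 1 KB, matching the feature size quoted above.
        return net.blobs["eltwise_fc1"].data[0].copy()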

Face Recognition: Face recognition identifies users in the image. We train the face recognition model using a Support Vector Machine (SVM).

TABLE I
NINE GENERAL SCENE GROUPS

Group name          Scenes   Examples
Shopping            20       clothing store, market, supermarket
Travelling          9        airport, bus station, subway platform
Park & street       12       downtown, park, street, alley
Eating & drinking   18       bar, bistro, cafeteria, coffee shop, fastfood restaurant, food court
Working & study     9        classroom, conference center, library, office, reading room
Scantily clad       12       beach, swimming pool, water park
Medical care        2        hospital room, nursing home
Religion            11       cathedral, chapel, church, temple
Entertainment       5        amusement park, ballroom, discotheque
All                 98

An SVM can achieve good results without much training data. The model training is a supervised learning process; the input training data for the face recognition model are face features from Cardea users. We train the SVM model using LibSVM [28] with a linear kernel, as the number of training samples per user is smaller than the dimensionality of each input vector.

With the face recognition model, we get a prediction for each input face feature vector. The output is a set of probabilities {p1, p2, ..., pn}, where pi is the probability of being user i. We set a probability threshold Tp; the user is considered recognized only when the maximum pi is at least Tp. This avoids cases in which the face feature of a non-registered bystander is mistakenly recognized as a registered user. The threshold Tp is chosen based on experiments described in the next section.
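
As a sketch, the thresholded prediction can be written with scikit-learn's SVC, which wraps LIBSVM [28]; Tp = 0.09 here anticipates the experiments in Section V-A:

    import numpy as np
    from sklearn.svm import SVC

    T_p = 0.09   # probability threshold, chosen experimentally (Section V-A)

    def train_recognizer(features, user_ids):
        """features: (num_samples, 256) face features; user_ids: labels."""
        model = SVC(kernel="linear", probability=True)
        model.fit(features, user_ids)
        return model

    def recognize(model, face_feature):
        probs = model.predict_proba(face_feature.reshape(1, -1))[0]
        best = int(np.argmax(probs))
        if probs[best] >= T_p:
            return model.classes_[best]   # max pi >= Tp: registered user
        return None                       # otherwise: non-registered bystander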

B. Context-aware Computing

Location, scene, and people in the image are the three context elements defined in users' privacy profiles. We acquire these context elements in different ways after the image is captured; together they determine the final privacy protection action performed on the raw image.

Location Filtering: Location provides coarse-grained control for individuals. Users can name a concrete sensitive area in which they have privacy worries; by specifying a location such as a campus, the user's privacy control takes effect within that area. We obtain the location directly from GPS when the image is taken.
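
The paper leaves the area representation open; one plausible sketch stores each sensitive area as a (latitude, longitude, radius) circle and tests the image's GPS fix against it:

    from math import radians, sin, cos, asin, sqrt

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance in meters between two GPS fixes."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 \
            + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371000 * asin(sqrt(a))

    def in_sensitive_area(gps, area):
        """area: (center_lat, center_lon, radius_m), e.g. a campus circle."""
        lat, lon = gps
        c_lat, c_lon, radius_m = area
        return haversine_m(lat, lon, c_lat, c_lon) <= radius_m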

Scene Classification: Scene context is a complex concept that not only relates to places but also gives clues about what people may be doing. We summarize 9 general scene groups; in this way, people can select a general scene group instead of listing all the places they care about. We fine-tune a pre-trained CNN model to perform the scene classification. The detailed procedure is described below.

a) Data Preparation and Preprocessing: The training data for scene classification comes from the Places2 dataset [29]. At the time Cardea was built, Places2 contained 401 categories with more than 8 million training images. Among the 401 scene categories, we choose 98 and group them into the 9 general scene groups listed in Table I, based on their contextual similarity. The grouped scenes are either pervasive or common scenes in daily life that may contain a number of bystanders, or places where people are more likely to have privacy concerns. This scene subset comprises 1.9 million training images and 4,900 validation images, 50 validation images per category.

b) Training: The training procedure is a standard fine-tuning process. We first extract features of all training images from the 98 categories; the features are the output of the "fc7" layer of the pre-trained AlexNet model provided by the Places2 project, used as a feature extractor [29]. With all the features, we then train a Softmax classifier over the 98 scene categories.

The category classifier achieves 55% validation accuracy on the 98 categories. There is no benchmark result on the exact subset we chose, but recent benchmarks report 53%~56% validation accuracy on the new Places2 dataset with 365 categories [29]. Both feature extraction and classifier training are implemented with the Caffe library [26], [30].

c) Prediction: To get the probability of each scene group, we simply sum the probabilities of the categories in the same group. The final prediction is the group with the highest probability among the 9 groups, which can be seen as a hard-coded clustering process.

The prediction accuracy of our scene group model on the validation set reaches 83%. As category prediction probabilities are usually distributed among similar categories belonging to the same group, group predictions are more accurate than category predictions, which is exactly what we want for privacy protection based on a more general scene context description.
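
A sketch of this group prediction step, with an illustrative (not the paper's) assignment of the 98 category indices to the 9 groups of Table I:

    import numpy as np

    GROUPS = {"Shopping": range(0, 20),         "Travelling": range(20, 29),
              "Park & street": range(29, 41),   "Eating & drinking": range(41, 59),
              "Working & study": range(59, 68), "Scantily clad": range(68, 80),
              "Medical care": range(80, 82),    "Religion": range(82, 93),
              "Entertainment": range(93, 98)}

    def predict_group(category_probs):
        """category_probs: length-98 softmax output of the category classifier."""
        p = np.asarray(category_probs)
        group_probs = {g: p[list(idx)].sum() for g, idx in GROUPS.items()}
        return max(group_probs, key=group_probs.get)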

Presence of Other People: The third context element we take into account is the people in the image. For example, user Alice can upload 10 face features of Bob, with whom she does not want to be photographed; the task is then to determine whether Bob also appears in the image whenever Alice is captured.

Simple similarity matching is adopted, since the number of potential matches (i.e., the people in the image other than Alice herself) is small. We apply cosine similarity as the distance metric between a pair of feature vectors. Assume we have 10 of Bob's face features and 1 face feature extracted from a person P in the image: we compare P's feature with each of Bob's features, and if a distance does not exceed the distance threshold Td, we count it as a hit. When the hit ratio over Bob's features reaches the ratio threshold Tr, we conclude that P is Bob, and therefore Alice should be protected. The detailed face matching method is given in Algorithm 1.

Algorithm 1 Face Matching
 1: initialize: P's feature f0; Bob's features fi, i ∈ {1, ..., N}; distance threshold Td; hit ratio threshold Tr
 2: m ← 0                          ▷ number of hits
 3: for i = 1 to N do
 4:     di ← dis(f0, fi)           ▷ dis(x, y) returns the distance between x and y
 5:     if di ≤ Td then
 6:         m ← m + 1
 7:     end if
 8: end for
 9: if m/N ≥ Tr then
10:     return true                ▷ P is Bob
11: else
12:     return false               ▷ P and Bob are two persons
13: end if
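
A direct, runnable rendering of Algorithm 1, using the cosine distance and the thresholds (Td ≈ 0.5, Tr ≈ 0.5) that Section V-A finds preferable:

    import numpy as np

    def cosine_distance(x, y):
        return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def is_same_person(p_feature, stored_features, t_d=0.5, t_r=0.5):
        """True if the hit ratio over the stored features reaches t_r."""
        hits = sum(1 for f in stored_features
                   if cosine_distance(p_feature, f) <= t_d)
        return hits / len(stored_features) >= t_r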

C. Interaction using Hand Gestures

Hand gestures are natural body language. Our goal is to detect and recognize "Yes" and "No" gestures in the image. However, hand gesture recognition in images taken by regular cameras against cluttered backgrounds is a difficult task, and conventional skin color-based hand detectors fail dramatically in complex environments. In our design, we use the state-of-the-art object detection framework Faster Region-based CNN (Faster R-CNN) [31] to train a hand gesture detector, as described below.

Gesture Recognition Model Training: For the gesture recognition task, we categorize hand gestures into 3 classes: 1) "Yes" gestures; 2) "No" gestures; and 3) "Natural" gestures, i.e., a hand in any other pose. The data used to train the gesture recognition model comprises 13,050 "Natural" gesture instances from 5,628 images in the VGG hand dataset [32], [33], plus 4,712 "Yes" gesture instances and 3,503 "No" gesture instances from 5,900 images crawled from Google and Flickr. Each annotation consists of an axis-aligned bounding rectangle along with its gesture class.

With this comprehensive gesture dataset and the VGG16 pre-trained model provided with the Faster R-CNN library, we fine-tune the "conv3_1" and upper layers together with the region proposal layers and the Fast R-CNN detection layers [34]; the detailed training procedure can be found in [35]. After a gesture is detected, we link it to the nearest face, which requires users to show their hands near their faces when using gestures.
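
The gesture-to-face linking is not spelled out further in the paper; a simple sketch, assuming "nearest" means the smallest distance between bounding-box centers:

    def center(box):
        x, y, w, h = box
        return (x + w / 2.0, y + h / 2.0)

    def link_gesture_to_face(gesture_box, face_boxes):
        """Return the index of the face nearest to the detected gesture."""
        gx, gy = center(gesture_box)
        dists = [(fx - gx) ** 2 + (fy - gy) ** 2
                 for fx, fy in map(center, face_boxes)]
        return min(range(len(face_boxes)), key=dists.__getitem__)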

We show the user interface of the Cardea client application in Figure 3(a), together with an example of the image processing results in Figure 3(b).

V. EVALUATION

In this section, we present evaluation results along 3 axes: 1) vision micro-benchmarks evaluating the performance of the individual computer vision tasks, including face recognition and matching, scene classification, and hand gesture recognition; 2) overall system performance, measured by final privacy protection decisions against users' privacy preferences; and 3) runtime and energy consumption, showing the processing time of each part and the energy consumed per image.

Fig. 3. Cardea user interface and privacy protection results: (a) Cardea client application user interface; (b) privacy protection example. In (a), jiayu registers as a Cardea user by extracting and uploading her 20 face features. She specifies HKUST and four scene groups for privacy protection, and also enables "Yes" and "No" gestures. In (b), a picture is taken at HKUST; 3 registered users and one "Yes" and one "No" gesture are recognized. The scene is correctly predicted as Park & street. Therefore, jiayu's face is not blurred, due to her "Yes" gesture. Prediction probabilities are also shown in (b).

A. Accuracy of Face Recognition and Matching

We first select 50 subjects from the LFW dataset [36] who have more than 10 images each as registered users. Note that subjects in the CASIA-WebFace database used to train the lightened CNN model do not overlap with those in LFW [37]. For each subject, we extract at least 100 face features from YouTube videos to simulate the process of user registration; in total, we get 5,042 feature vectors. These features are then divided into a training set and a validation set. In addition, we collect a user test set and a non-user test set to evaluate face recognition accuracy. The user test set is composed of 511 face feature vectors from online images of all 50 registered users. The non-user test set consists of 166 face feature vectors from 100 subjects in the LFW database whose names start with "A", serving as non-registered bystanders.

Figure 4(a) shows the recognition accuracy on the validation set and the user test set with different numbers of features per person used for training; here accuracy is the fraction of faces that are correctly recognized. The results show that a model trained with 10~20 features per person achieves nearly 100% accuracy on the validation set and over 90% accuracy on the user test set, and little improvement is gained with more training data. Therefore, we train the face recognition model with 20 features per person. Figure 4(b) shows the overall accuracy as the probability threshold Tp varies. For the non-user test set, accuracy means the fraction of faces that are not recognized as registered users; consequently, as Tp goes up, accuracy increases for the non-user test set but decreases for the validation set and the user test set. To make sure that registered users are correctly recognized and non-registered users are not mistakenly recognized, we choose Tp to be 0.08~0.09, which achieves over 80% recognition accuracy for both users and non-users.

To evaluate face matching performance, we again use the user test set. Subjects with more than 20 features are regarded as the database group; the rest form the query group. As with face recognition accuracy, we break face matching accuracy into two parts.

Fig. 4. Face recognition accuracy. (a) Training accuracy (%) vs. number of training features per person (0-20), for the validation set and user test set. (b) Testing accuracy (%) vs. probability threshold Tp (0.00-0.20), for the validation set, user test set, and non-user test set.

Fig. 5. Face matching accuracy (%) with different distance thresholds Td and ratio thresholds Tr ∈ {0.3, 0.5, 0.7}, for the database (db) and query (qr) groups. (a) Cosine distance (Td from 0.0 to 1.0). (b) Euclidean distance (Td from 60 to 140).

For persons in the database group, features must be matched to the correct person; for persons not in the database group, features should not be matched to anyone. Figure 5 shows the matching accuracy with cosine distance and Euclidean distance, respectively. The results show that cosine distance achieves the better performance, with nearly 100% accuracy both for people in the database group and for those in the query group. The preferable parameters are distance threshold Td ≈ 0.5 and ratio threshold Tr ≈ 0.5.

Overall, the face recognition and matching methods we employ, with appropriate thresholds Tp, Td, and Tr, can effectively and efficiently recognize users and match faces in images. More importantly, only a small number of features are needed for training the face recognition model and for the face matching algorithm.

Fig. 6. Scene classification results. (a) Recall: numbers of TP and FN images per predicted group. (b) Confusion matrix of target groups (Shopping, Travelling, Park & street, Eating & drinking, Working & study, Scantily clad, Medical care, Religion, Entertainment) against predicted groups (Sh, Tr, Pa, Ea, Wo, Sc, Me, Re, En).

B. Performance of Scene Classification

We recruited 8 volunteers and asked them to take pictures "in the wild" belonging to the 9 general scene groups. After manually annotating these pictures, we had 759 images in total, 638 of which fall into the 9 scene groups of interest. We then predict the scene group of each image using the scene classification model we trained.

The number of images and the classification results of each group are shown in Figure 6(a). True positives (TP) are images that are correctly classified, and false negatives (FN) are those classified into other scene groups. Overall, we achieve 0.83 recall (TP/(TP+FN)), and 5 scene groups have recall above 0.90. It is worth mentioning that the recall of the scene groups Scantily clad (Sc), Medical care (Me), and Religion (Re) exceeds 0.95, which provides strong support for protecting users' privacy in sensitive scenes.

We also give the detailed classification confusion matrix in Figure 6(b). It shows that most FN of Eating & drinking (Ea) are classified as Shopping (Sh) or Working & study (Wo), and most FN of Wo are classified as Ea or Sh. The reason is that the boundaries between Sh, Ea, and Wo are not clear-cut; for example, shopping malls have food courts, and people study in coffee shops. The same reason accounts for the confusion between Park & street (Pa) and Travelling (Tr). Moreover, for scene categories such as pub and bar, people may group them into either Ea or En. Therefore, a safe strategy is to select more scene groups, for instance both Ea and En, when going to a pub at night.

In general, the evaluation results from images captured "in the wild" demonstrate that most scenes can be correctly classified; the performance is especially satisfactory for the sensitive scenes.

TABLE II
CARDEA'S OVERALL PRIVACY PROTECTION PERFORMANCE

Overall accuracy               86.4%
Protection accuracy            80.4%    No protection accuracy    91.0%
Face recognition accuracy      98.5%    "Yes" gesture recall      97.9%
Scene classification recall    77.7%    "No" gesture recall       77.3%

C. Performance of Gesture Recognition

We asked our volunteers to take pictures of each other at different distances, angles, lighting conditions, and backgrounds. We manually annotated all images with hand regions and the scene group each belongs to. In total, we got 338 hand gesture images, with 208 "Yes" gestures, 211 "No" gestures, and 363 natural hands. The images cover 6 of the 9 scene groups; images not belonging to any of our summarized scene groups are placed in the group Others.

Figure 7(a) and Figure 7(b) show the recall and precision of "Yes" and "No" gestures under different scene groups. The recall and precision of the "Yes" gesture, as well as the precision of the "No" gesture, reach 1.0 for most scene groups, while the recall of the "No" gesture averages 0.70. For Entertainment (En), recall is low, owing to the dim illumination in most of the images we tested. In general, gesture recognition performance does not show any marked correlation with scene group. We therefore further investigated recall as a function of gesture size, since low recall threatens a user's privacy far more than low precision does.

Figure 7(c) plots the recall of gestures for varying hand region sizes. Each image is resized to about 800 × 600 square pixels while keeping its aspect ratio, and the hand regions are classified into 10 size intervals, plotted along the x-axis. The results show that "Yes" gestures achieve more than 0.9 recall at all hand region sizes, while the recall of "No" gestures tends to rise with increasing hand region size. This indicates that "No" gesture recognition could be improved with more training data at smaller sizes.

In summary, the performance of gesture recognition demonstrates the feasibility of integrating gesture interaction into Cardea for flexible privacy preference modification. It performs extremely well for "Yes" gesture recognition, and there is room to improve "No" gesture recognition with more training samples.

D. Overall Performance of Cardea Privacy Protection

After evaluating each vision task separately, we now present Cardea's overall privacy protection performance. Faces in images taken using Cardea end up protected (e.g., blurred) or unchanged, correctly or incorrectly, depending on protection decisions made from both the user's privacy profile and the results of the vision tasks. We therefore asked 5 volunteers to register as Cardea users and set their privacy profiles.

Fig. 7. Hand gesture recognition results for "Yes" and "No" gestures. (a) Recall for different scene groups (Others, Ea, Sh, Wo, Tr, Pa, En). (b) Precision for different scene groups. (c) Recall for different hand region sizes (2-4 up to >26 K square pixels).

The face recognition model is now trained using 1,100 face feature vectors from 55 people, including the 50 from the LFW dataset. We took about 300 images and obtained their processed versions. As we focus on face recognition rather than face detection, we keep only images in which faces were successfully detected, leaving 224 images for evaluation.

Table II shows the final privacy protection accuracy as well as the performance of each vision task. The protection accuracy shows that 80.4% of the faces requiring protection are actually protected, and 91.0% of the faces not asking for protection remain unchanged. Overall, 86.4% of faces are processed correctly, even though the scene classification recall and the "No" gesture recall do not reach 80%. The reason is that Cardea's protection decision-making process can sometimes compensate for mistakes made in an earlier step; for example, if a user's "No" gesture is missed, his face can still be protected when the predicted scene is selected in his profile.

In summary, Cardea achieves over 85% accuracy for users in the real world. Improvements to each vision part will directly benefit Cardea's overall performance in the future.

E. Runtime and Energy Consumption

We validate the client-side implementation on a Samsung Galaxy Note 4, with a 4 × 2.7 GHz Krait 450 CPU, Qualcomm Snapdragon 805 chipset, 3 GB RAM, and a 16 MP f/2.2 camera. The server side is configured with an Intel i7-5820K CPU, 16 GB RAM, and a GeForce 980 Ti graphics card (6 GB RAM). The client and server communicate via TCP over a Wi-Fi connection.

Figure 8 plots the time taken for Cardea to complete the different vision tasks. We take images with 1 face, 2 faces, and 5 faces at a size of 1920 × 1080; the images are compressed in JPEG format, and on average the data sent is about 950 KB. Note that some vision tasks are not triggered in certain situations, according to the decision workflow explained in Section IV; for example, if no user is recognized in the image, no other tasks start. For the purpose of measurement, we nevertheless activate all tasks to illustrate the runtime difference across images captured with varying numbers of faces.

Among all the vision tasks, face processing (i.e., face detection and feature extraction) and scene classification are performed on the smartphone.

4. http://www.gsmarena.com/samsung_galaxy_note_4-6434.php

Fig. 8. Task-level runtime (milliseconds) of Cardea for images with 1, 2, and 5 faces, broken down into face detection & feature extraction (client), scene classification (client), face recognition & matching (server), gesture recognition (server), and network transmission.

Face processing takes about 900 milliseconds for 1 face and increases by about 200 milliseconds per additional face; it therefore reaches about 1100 and 1700 milliseconds for 2 and 5 faces, respectively. This is the most fundamental step, and it runs on the smartphone locally due to privacy considerations. Owing to the real-time OpenCV face detection implementation and the lightened CNN feature extraction model, it takes less than 1/3 of the overall runtime. Scene classification takes around 900 milliseconds per image, as it is performed only once, independent of the number of people in the image. The face recognition and matching tasks on the server side take less than 30 milliseconds for five people; though this time grows with the number of people, compared with the other tasks it barely affects the overall runtime. Gesture recognition also runs on the server and takes about 330 milliseconds; like scene classification, it is performed once on the whole image, regardless of the number of faces. According to our measurements, network transmission accounts for the majority of the overall runtime, due to the unstable network environment.

In general, photographers using Cardea can get one processed image within 5 seconds even in the heaviest case (i.e., a registered user who enables gestures is present, and scene classification is triggered on the smartphone). Compared with existing image capture platforms that also provide privacy protection, such as I-pic [16], Cardea offers more efficient image processing functionality.

Next, we measure the energy consumption of taking pictures with Cardea on the Galaxy Note 4, using the Monsoon Power Monitor [38]. The images are again taken at a size of 1920 × 1080 pixels.

TABLE III
ENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBERS OF FACES

           Face recognition (uAh)   Whole process (uAh)   # of images
1 face     217.2 (std 3.4)          1134.5 (std 45.9)     ~2800
2 faces    344.1 (std 13.1)         1276.7 (std 113.8)    ~2500
5 faces    692.6 (std 36.5)         1641.1 (std 66.0)     ~2000

The first two data columns of Table III show the energy consumption of the face processing part only and of the whole process (i.e., from taking a picture to getting the processed image) with 1 face, 2 faces, and 5 faces, respectively. The screen stays on during the whole process, so a large portion of the energy consumption is due to the always-on screen. Moreover, we observe that the face processing energy is linear in the number of faces, while all other parts, including scene classification and sending and receiving data, are independent of the number of faces in the image, consistent with the runtime measurements.

Using these energy measurements, we also show Cardea's capacity on the Galaxy Note 4 in the last column of Table III. The device has a 3220 mAh battery and can therefore capture about 2000 high-quality images with 5 faces using Cardea.
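
As a sanity check, dividing the battery capacity by the whole-process energy per image from Table III reproduces this figure:

    3220 mAh = 3,220,000 uAh, and 3,220,000 / 1641.1 ≈ 1962 ≈ 2000 images with 5 faces.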

VI. RELATED WORK

Visual privacy protection has attracted extensive attention in recent years due to the increasing popularity of mobile and wearable devices with built-in cameras. As a result, several research works have proposed solutions to address these privacy issues.

Halderman et al. [14] propose an approach in which nearby devices jointly encrypt data during recording, utilizing short-range wireless communication to exchange public keys and negotiate an encryption key; a recording can be decrypted only with the permission of everyone who encrypted it. Jana et al. [39], [40] present methods in which third-party applications, such as perceptual and augmented reality applications, have access only to higher-level objects, such as a skeleton or a face, instead of raw sensor feeds. Raval et al. [41] propose a system that lets users mark the secure regions the camera may access, so cameras are prevented from capturing sensitive information. Unlike the above solutions, our work focuses on protecting bystanders' privacy by respecting their privacy preferences when they are captured in photos.

To identify individuals who request privacy protection, advanced techniques have been applied. PriSurv [42] is a video surveillance system that identifies objects using RFID tags to protect their privacy in the surveillance video. Courteous Glass [15] is a wearable camera integrated with a far-infrared imager that turns off recording when new persons or specific gestures are detected. However, these approaches require extra facilities or sensors that are not currently equipped on devices; our approach instead takes advantage of state-of-the-art computer vision techniques that are reliable and effective.

Some other efforts rely on visual markers. Respectful Cameras [10] is a video surveillance system that requires people who wish to remain anonymous to wear colored markers such as hats or vests, so that their faces will be blurred. Bo et al. [8] use QR codes as privacy tags to link an individual with his photo sharing preferences, which empowers photo service providers to enforce privacy protection following users' policy expressions and mitigates the public's privacy concerns. Roesner et al. [9] propose a general framework for controlling access to sensor data on continuous sensing platforms, allowing objects to explicitly specify access policies; for example, a person can wear a QR code to communicate her privacy policy, and a depth camera is used to find the person. Unfortunately, using existing markers like QR codes is technically feasible but lacks usability, as few people would wear QR codes or other conspicuous markers in public places. Moreover, static policies cannot satisfy people's privacy preferences all the time, as these preferences are context-dependent, vary from person to person, and change over time. Therefore, we allow individuals to update their privacy profiles at any time, providing other cues to inform cameras of the necessary protection.

An interesting recent work, I-pic [16], allows people to broadcast their privacy preferences and appearance information to nearby devices using BLE; this work could be incorporated into our framework. Furthermore, we specify context elements that have not been considered before, such as scene and presence of others. Besides, we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera, whereas broadcast data in I-pic may never be received by the people taking images, or may be outdated when received.

VII. CONCLUSION AND FUTURE WORK

In this work, we designed, implemented, and evaluated Cardea, a visual privacy protection system that aims to address individuals' visual privacy issues caused by pervasive cameras. Cardea leverages state-of-the-art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express their context-dependent privacy preferences in terms of location, scenes, and the presence of other people in the image. Besides, people can show "Yes" and "No" gestures to dynamically modify their privacy preferences. We demonstrated the performance of the different vision tasks on pictures taken "in the wild"; overall, Cardea achieves about 86% accuracy on users' requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which demonstrates the feasibility of running Cardea to take pictures while respecting bystanders' privacy preferences.

Cardea can be enhanced in several aspects. Future work includes improving scene classification and hand gesture recognition performance, compressing the CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people's visual privacy in video is another challenging and significant task, which would make Cardea a complete visual privacy protection framework.

REFERENCES

[1] "Microsoft HoloLens," http://www.microsoft.com/microsoft-hololens/en-us.
[2] "Google Glass," http://www.google.com/glass/start/.
[3] "Narrative Clip," http://www.getnarrative.com/.
[4] R. Shaw, "Recognition markets and visual privacy," UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
[5] A. Acquisti, R. Gross, and F. Stutzman, "Face recognition and privacy in the age of augmented reality," Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
[6] "Congress's letter," http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy.
[7] "Data protection authorities' letter," https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp.
[8] C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, "Privacy.tag: Privacy concern expressed and respected," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163–176.
[9] F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, "World-driven access control for continuous sensing," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169–1181.
[10] J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, "Respectful cameras: Detecting visual markers in real-time to address privacy concerns," in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65–89.
[11] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, "Privacy behaviors of lifeloggers using wearable cameras," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571–582.
[12] R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, "Sensitive lifelogs: A privacy analysis of photos from wearable cameras," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645–1648.
[13] T. Denning, Z. Dehlawi, and T. Kohno, "In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies," in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014, pp. 2377–2386.
[14] J. A. Halderman, B. Waters, and E. W. Felten, "Privacy management for portable recording devices," in Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society. ACM, 2004, pp. 16–24.
[15] J. Jung and M. Philipose, "Courteous glass," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307–1312.
[16] P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, "I-pic: A platform for privacy-compliant image capture," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[18] "OpenCV," http://opencv.org.
[19] "Dlib," http://dlib.net.
[20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[21] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, "Labeled faces in the wild: A survey," in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189–248.
[22] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[23] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," arXiv preprint arXiv:1511.02683, 2015.
[24] "CASIA-WebFace dataset," http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[27] sh1r0, "Caffe-android-lib," https://github.com/sh1r0/caffe-android-lib.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places2 dataset project," http://places2.csail.mit.edu/index.html.
[30] BVLC, "Caffe," http://caffe.berkeleyvision.org.
[31] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[32] "VGG hand dataset," http://www.robots.ox.ac.uk/~vgg/data/hands/.
[33] A. Mittal, A. Zisserman, and P. H. Torr, "Hand detection using multiple proposals," in BMVC, 2011, pp. 1–11.
[34] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[35] "Faster R-CNN (Python implementation)," https://github.com/rbgirshick/py-faster-rcnn.
[36] "LFW dataset," http://vis-www.cs.umass.edu/lfw/.
[37] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.
[38] "Monsoon Power Monitor," https://www.msoon.com/LabEquipment/PowerMonitor.
[39] S. Jana, A. Narayanan, and V. Shmatikov, "A Scanner Darkly: Protecting user privacy from perceptual applications," in Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 349–363.
[40] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, "Enabling fine-grained permissions for augmented reality applications with recognizers," in Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415–430.
[41] N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, "What you mark is what apps see," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[42] K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, "PriSurv: Privacy protected video surveillance system using adaptive visual abstraction," in International Conference on Multimedia Modeling. Springer, 2008, pp. 144–154.


Table III shows the final privacy protection accuracy aswell as performance of each vision task The protection accu-racy shows 804 faces that require protection are actuallyprotected and 910 faces that do not ask for protectionremain unchanged Overall 864 faces are processed cor-rectly though the scene classification recall and ldquoNordquo gesturerecall do not reach 80 The reason is that protection deci-sion making process of Cardea sometimes can make up formistakes happening in the early step For example if userrsquosldquoNordquo gesture is not detected his face can still be protectedwhen the predict scene is selected in userrsquos profile

In summary Cardea achieves over 85 accuracy for users inthe real world Improvements of each vision part will directlybenefit Cardearsquos overall performance in the future

E Runtime and Energy Consumption

We validate the client side implementations on SamsungGalaxy Note 44 with 4times27 GHz Krait 450 CPU Qual-comm Snapdragon 805 Chipset 3GB RAM and 16 MP f22Camera The server side is configured with Intel i7ndash5820KCPU 16GB RAM GeForce 980Ti Graphic Card (6GB RAM)The client and server communicate via a TCP over WindashFiconnection

Figure 8 plots the time taken for Cardea to completedifferent vision tasks We take images with 1 face 2 facesand 5 faces in the size of 1920 times 1080 The images will becompressed in JPEG format On average the data sent is about950 KB Note some vision tasks will not be triggered in somesituations according to the decision workflow as explained inSection IV For example if no user is recognized in the imageall other tasks will not start For the purpose of measurementwe still activate all tasks to illustrate the runtime difference ofimages captured with varying number of faces

Among all vision tasks face processing (ie face detectionand feature extraction) and scene classification are performed

4httpwwwgsmarenacomsamsung galaxy note 4-6434php

1 face 2 faces 5 faces0

500

1000

1500

2000

2500

3000

Run

time

(mill

isec

ond)

Face detection amp feature extraction (client)Scene classification (client)Face recognition amp matching (server)Gesture recognition (server)Network

Fig 8 Task level runtime of Cardea

on the smartphone The face processing takes about 900milliseconds for 1 face and increases about 200 millisecondsper additional face Therefore it reaches about 1100 and1700 milliseconds for 2 faces and 5 faces respectively Thisis the most fundamental step that runs on the smartphonelocally due to privacy considerations Owing to the realndashtimeOpenCV face detection implementation and lightened CNNfeature extraction model it takes less then 13 overall runtimeThe scene classification takes around 900 milliseconds perimage as it only performs once independent of number ofpeople in the image The face recognition and matching taskson the server side take less than 30 milliseconds for fivepeople Though the time grows with increasing number ofpeople compared with other tasks they barely affect the overallruntime The gesture recognition also runs on the server andit takes about 330 milliseconds Similar to scene recognitionit performs on the whole image once regardless the numberof faces According to the measurement we find the networktransmission accounts for a majority of the overall runtimedue to the unstable network environment

In general photographers using Cardea to take picturescan get one processed image within 5 seconds in the mostheavy case (ie there is registered user who enables gesturesand scene classification will be triggered on the smartphone)Compared with existing image capture platforms that alsoprovide privacy protection such as I-pic [16] Cardea offersmore efficient image processing functionality

Next we measure the energy consumption of taking pictureswith Cardea on Galaxy Note 4 phone using the MonsoonPower Monitor [38] The images are also taken in size of1920times 1080 square pixels The first two columns of Table II

TABLE IIIENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBER OF FACES

Face recognition Whole process (uAh) of images

1 face 2172 (std 34) 11345 (std 459) sim 28002 faces 3441 (std 131) 12767 (std 1138) sim 25005 faces 6926 (std 365) 16411 (std 660) sim 2000

show the energy consumption for the face processing partonly and for the whole process (ie from taking a picture togetting the processed image) with 1 face 2 faces and 5 facesrespectively The screen stays on during the whole processtherefore a large portion of the energy consumption is dueto the alwaysndashon screen Moreover we can observe that faceprocessing energy is linear to face numbers All the other partsincluding scene classification sending and receiving data areindependent of the number of faces in the image which isconsistent with runtime measurements

Using energy measurements we also show Cardearsquos capac-ity on Galaxy Note 4 in the last column of Table II Thisdevice has a 3220 mAh battery therefore can capture about2000 high quality images with 5 faces using Cardea

VI RELATED WORK

Visual privacy protection has attracted extensive attentionsthese years due to increasing popularity of mobile and wear-able devices with builtndashin cameras As a result some researchworks have proposed solutions to address these privacy issues

Halderman et al [14] propose an approach in which closeddevices can encrypt data together during recording utilizingshort range wireless communication to exchange public keysand negotiate encryption key Only by obtaining all of thepermissions from people who encrypt the recording can onedecrypts it Juna et al [39] [40] present methods that thirdndashparty applications such as perceptual and augmented realityapplications have access to only higher-level objects such asa skeleton or a face instead of raw sensor feeds Raval et al[41] propose a system that gives users control to mark secureregions that the camera have access to therefore cameras areprevented from capturing sensitive information Unlike abovesolutions our work focuses on protecting bystandersrsquo privacyby respecting their privacy preferences when they are capturedin photos

To identify individuals who request privacy protectionadvanced techniques have been applied PriSurv [42] is avideo surveillance system that identifies objects using RFIDndashtags to protect the privacy of objects in the video surveillancesystem Courteous Glass [15] is a wearable camera integratedwith a far-infrared imager that turns off recording when newpersons or specific gestures are detected However theseapproaches require extra facilities or sensors that are notcurrently equipped on devices Our approach takes advantageof statendashofndashthendashart computer vision techniques that are reliableand effective

Some other efforts rely on visual markers Respectful Cam-eras [10] is a video surveillance system that requires people

who wish to remain anonymous wear colored markers suchas hats or vests and their faces will be blurred Bo et al[8] use QR code as privacy tags to link an individual withhis photo sharing preferences which empowers photo serviceproviders to exert privacy protection following usersrsquo policyexpressions to mitigate the publicrsquos privacy concerns Roesneret al [9] propose a general framework for controlling access tosensor data on continuous sensing platforms allowing objectsto explicitly specify access policies For example a personcan wear a QR code to communicate her privacy policy anda depth camera is used to find the person Unfortunatelyusing existing markers like QR code is technically feasiblybut lacks usability as few would wear QR code or otherconspicuous markers in public places Moreover static policiescannot satisfy peoplersquos privacy preferences all the time whichis contextndashdependent varying from each other and changingfrom time to time Therefore we allow individuals to updatetheir privacy profiles at anytime providing other cues toinform cameras of necessary protection

A recently interesting work Indashpic [16] allows people tobroadcast their privacy preferences and appearance informa-tion to nearby devices using BLE This work can be incor-porated into our framework Furthermore we specify contextelements that have not been considered before such as sceneand presence of others Besides we provide a convenientmechanism for people to temporarily change their privacypreferences using hand gestures when facing the camera whilebroadcasted data in I-pic may not be received by people whotake images or the received data is outdated

VII CONCLUSION AND FUTURE WORK

In this work we designed implemented and evaluatedCardea a visual privacy protection system that aims to addressindividualsrsquo visual privacy issues caused by pervasive cam-eras Cardea leverages the statendashofndashthendashart CNN models forfeature extraction visual classification and recognition WithCardea people can express their contextndashdependent privacypreferences in terms of location scenes and presence of otherpeople in the image Besides people can show ldquoYesrdquo andldquoNordquo gestures to dynamically modify their privacy prefer-ences We demonstrated performances of different vision taskswith pictures ldquoin the wildrdquo and overall it can achieve about86 accuracy on usersrsquo requests We also evaluated runtimeand energy consumed with Cardea prototype which provesthe feasibility of running Cardea for taking pictures whilerespecting bystandersrsquo privacy preferences

Cardea can be enhanced in different aspects The futurework includes improving scene classification and hand gesturerecognition performances compressing CNN models to reduceoverall runtime and integrating Cardea with camera subsystemto enforce privacy protection measure Moreover protectingpeoplersquos visual privacy in the video is another challenging andsignificant task which will make Cardea a complete visualprivacy protection framework

REFERENCES

[1] ldquoMicrosoft Hololensrdquo httpwwwmicrosoftcommiscrosoft-hololensen-us

[2] ldquoGoogle Glassrdquo httpwwwgooglecomglassstart[3] ldquoNarrative Cliprdquo httpwwwgetnarrativecom[4] R Shaw ldquoRecognition markets and visual privacyrdquo UnBlinking New

Perspectives on Visual Privacy in the 21st Century 2006[5] A Acquisti R Gross and F Stutzman ldquoFace recognition and privacy

in the age of augmented realityrdquo Journal of Privacy and Confidentialityvol 6 no 2 p 1 2014

[6] ldquoCongressrsquos letterrdquo httpblogswsjcomdigits20130516congress-asks-google-about-glass-privacy

[7] ldquoData protection authoritiesrsquo letterrdquo httpswwwprivgccamedianr-c2013nr-c 130618 easp

[8] C Bo G Shen J Liu X-Y Li Y Zhang and F Zhao ldquoPrivacy tagPrivacy concern expressed and respectedrdquo in Proceedings of the 12thACM Conference on Embedded Network Sensor Systems ACM 2014pp 163ndash176

[9] F Roesner D Molnar A Moshchuk T Kohno and H J Wang ldquoWorld-driven access control for continuous sensingrdquo in Proceedings of the 2014ACM SIGSAC Conference on Computer and Communications SecurityACM 2014 pp 1169ndash1181

[10] J Schiff M Meingast D K Mulligan S Sastry and K Gold-berg ldquoRespectful cameras Detecting visual markers in real-time toaddress privacy concernsrdquo in Protecting Privacy in Video SurveillanceSpringer 2009 pp 65ndash89

[11] R Hoyle R Templeman S Armes D Anthony D Crandall andA Kapadia ldquoPrivacy behaviors of lifeloggers using wearable camerasrdquoin Proceedings of the 2014 ACM International Joint Conference onPervasive and Ubiquitous Computing ACM 2014 pp 571ndash582

[12] R Hoyle R Templeman D Anthony D Crandall and A KapadialdquoSensitive lifelogs A privacy analysis of photos from wearable cam-erasrdquo in Proceedings of the 33rd Annual ACM Conference on HumanFactors in Computing Systems ACM 2015 pp 1645ndash1648

[13] T Denning Z Dehlawi and T Kohno ldquoIn situ with bystanders of aug-mented reality glasses Perspectives on recording and privacy-mediatingtechnologiesrdquo in Proceedings of the 32nd annual ACM conference onHuman factors in computing systems ACM 2014 pp 2377ndash2386

[14] J A Halderman B Waters and E W Felten ldquoPrivacy management forportable recording devicesrdquo in Proceedings of the 2004 ACM workshopon Privacy in the electronic society ACM 2004 pp 16ndash24

[15] J Jung and M Philipose ldquoCourteous glassrdquo in Proceedings of the2014 ACM International Joint Conference on Pervasive and UbiquitousComputing Adjunct Publication ACM 2014 pp 1307ndash1312

[16] P Aditya R Sen P Druschel S J Oh R Benenson M FritzB Schiele B Bhattacharjee and T T Wu ldquoI-pic A platform forprivacy-compliant image capturerdquo in Proceedings of the 14th AnnualInternational Conference on Mobile Systems Applications and ServicesMobiSys vol 16 2016

[17] P Viola and M J Jones ldquoRobust real-time face detectionrdquo Internationaljournal of computer vision vol 57 no 2 pp 137ndash154 2004

[18] ldquoOpenCVrdquo httpopencvorg[19] ldquoDlibrdquo httpdlibnet[20] W Zhao R Chellappa P J Phillips and A Rosenfeld ldquoFace recog-

nition A literature surveyrdquo ACM computing surveys (CSUR) vol 35no 4 pp 399ndash458 2003

[21] E Learned-Miller G B Huang A RoyChowdhury H Li and G HualdquoLabeled faces in the wild A surveyrdquo in Advances in Face Detectionand Facial Image Analysis Springer 2016 pp 189ndash248

[22] T Ahonen A Hadid and M Pietikainen ldquoFace description with localbinary patterns Application to face recognitionrdquo IEEE transactions onpattern analysis and machine intelligence vol 28 no 12 pp 2037ndash2041 2006

[23] X Wu R He and Z Sun ldquoA lightened cnn for deep face representa-tionrdquo arXiv preprint arXiv151102683 2015

[24] ldquoCASIA-WebFace datasetrdquo httpwwwcbsriaascnenglishCASIA-WebFace-Databasehtml

[25] O M Parkhi A Vedaldi and A Zisserman ldquoDeep face recognitionrdquoin British Machine Vision Conference vol 1 no 3 2015 p 6

[26] Y Jia E Shelhamer J Donahue S Karayev J Long R GirshickS Guadarrama and T Darrell ldquoCaffe Convolutional architecture forfast feature embeddingrdquo in Proceedings of the ACM InternationalConference on Multimedia ACM 2014 pp 675ndash678

[27] sh1r0 ldquoCaffe-android-librdquo httpsgithubcomsh1r0caffe-android-lib[28] C-C Chang and C-J Lin ldquoLibsvm a library for support vector

machinesrdquo ACM Transactions on Intelligent Systems and Technology(TIST) vol 2 no 3 p 27 2011

[29] B Zhou A Khosla A Lapedriza A Torralba and A Oliva ldquoPlaces2dataset projectrdquo httpplaces2csailmiteduindexhtml

[30] BVLC ldquoCafferdquo httpcaffeberkeleyvisionorg[31] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-time

object detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[32] ldquoVGG hand datasetrdquo httpwwwrobotsoxacuksimvggdatahands[33] A Mittal A Zisserman and P H Torr ldquoHand detection using multiple

proposalsrdquo in BMVC Citeseer 2011 pp 1ndash11[34] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE International

Conference on Computer Vision 2015 pp 1440ndash1448[35] ldquoFaster R-CNN (Python implementation)rdquo httpsgithubcomrbgirshick

py-faster-rcnn[36] ldquoLFW datasetrdquo httpvis-wwwcsumassedulfw[37] D Yi Z Lei S Liao and S Z Li ldquoLearning face representation from

scratchrdquo arXiv preprint arXiv14117923 2014[38] ldquoMonsoon Power Monitorrdquo httphttpswwwmsooncom

LabEquipmentPowerMonitor[39] S Jana A Narayanan and V Shmatikov ldquoA scanner darkly Protecting

user privacy from perceptual applicationsrdquo in Security and Privacy (SP)2013 IEEE Symposium on IEEE 2013 pp 349ndash363

[40] S Jana D Molnar A Moshchuk A Dunn B Livshits H J Wangand E Ofek ldquoEnabling fine-grained permissions for augmented realityapplications with recognizersrdquo in Presented as part of the 22nd USENIXSecurity Symposium (USENIX Security 13) 2013 pp 415ndash430

[41] N Raval A Srivastava A Razeen K Lebeck A Machanavajjhala andL P Cox ldquoWhat you mark is what apps seerdquo in ACM InternationalConference on Mobile Systems Applications and Services (Mobisys)2016

[42] K Chinomi N Nitta Y Ito and N Babaguchi ldquoPrisurv privacyprotected video surveillance system using adaptive visual abstractionrdquoin International Conference on Multimedia Modeling Springer 2008pp 144ndash154

  • I Introduction
  • II Practical Visual Privacy Protection
    • II-A Contextndashdependent Personal Privacy Preferences
    • II-B Challenges Principles and Limitations
      • III Cardea Overview
        • III-A Key Concepts
        • III-B System Overview
          • IV Design and Implementation
            • IV-A User Recognition
            • IV-B Contextndashaware Computing
            • IV-C Interaction using Hand Gestures
              • V Evaluation
                • V-A Accuracy of Face Recognition and Matching
                • V-B Performance of Scene Classification
                • V-C Performance of Gesture Recognition
                • V-D Overall Performance of Cardea Privacy Protection
                • V-E Runtime and Energy Consumption
                  • VI Related Work
                  • VII Conclusion and Future Work
                  • References
Page 5: Cardea: Context–Aware Visual Privacy Protection from ... · context–aware and interactive visual privacy protection frame-work that enforces privacy protection according to people’s

a) Data Preparing and Preprocessing: The data for training scene classification is from the Places2 dataset [29]. At the time Cardea was built, the Places2 dataset contained 401 categories with more than 8 million training images. Among the 401 scene categories, we choose 98 categories and group them into 9 general scene groups, as listed in Table I, based on their contextual similarity. The grouped scenes are either pervasive or common scenes in daily life that may have a number of bystanders, or places where people are more likely to have privacy concerns. This scene subset is composed of 1.9 million training images and 4,900 validation images (50 validation images per category).

b) Training: The training procedure is a standard fine-tuning process. We first extract features of all training images from the 98 categories. The features are the output of the "fc7" layer, using the pre-trained AlexNet model provided by the Places2 project as a feature extractor [29]. With all features, we then train a Softmax classifier for the 98 scene categories.

The category classifier achieves 55% validation accuracy on the 98 categories. There is no benchmark result on the exact subset we choose, but recent benchmarks give 53% ∼ 56% validation accuracy on the new Places2 dataset with 365 categories [29]. Both feature extraction and classifier training are implemented using the Caffe library [26], [30].

c) Prediction: To get the probability of each scene group, we simply sum up the probabilities of the categories in the same group. The final prediction is the group with the highest probability among the 9 groups, which can be seen as a hard-coded clustering step.
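For concreteness, a minimal sketch of this grouping step, assuming a Softmax output vector over the 98 categories; the category-to-group table below is a placeholder for the real mapping given in Table I:

    import numpy as np

    # Placeholder mapping from the 98 category indices to the 9 scene
    # groups; the real assignment is defined by Table I in the paper.
    CATEGORY_TO_GROUP = [i % 9 for i in range(98)]

    def predict_scene_group(category_probs):
        """Sum the Softmax probabilities of all categories that fall in
        the same group, then pick the group with the highest total."""
        group_probs = np.zeros(9)
        for category, p in enumerate(category_probs):
            group_probs[CATEGORY_TO_GROUP[category]] += p
        return int(np.argmax(group_probs)), group_probs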

The prediction accuracy of our scene group model on the validation set reaches 83%. As category prediction probabilities are usually distributed among similar categories that belong to the same group, the group prediction results are superior to the category prediction results. This is what we desire for the purpose of privacy protection based on a more general description of the scene context.

Presence of Other People: The third context element we take into account is the people in the image. For example, user Alice can upload 10 face features of Bob, with whom Alice does not want to be photographed. The task is then to determine whether Bob also appears in the image when Alice is captured in it.

A simple similarity matching is adopted, as the number of potential matches (i.e., people in the image except Alice herself) is small. We apply cosine similarity as the distance metric between a pair of feature vectors. Assume we have 10 of Bob's face features and 1 face feature extracted from a person P in the image. We compare P's feature with each of Bob's features; if the distance does not exceed the threshold Td, we regard it as a hit. When the hit ratio over Bob's features reaches the ratio threshold Tr, we believe P is Bob, and therefore Alice should be protected. The detailed face matching method is described in Algorithm 1.

Algorithm 1 Face Matching
1:  initialize P's feature f0; Bob's features fi, i ∈ 1...N; distance threshold Td; hit ratio threshold Tr
2:  m ← 0                        (number of hits)
3:  for i = 1 to N do
4:      di ← dis(f0, fi)         (dis(x, y) returns the distance between x and y)
5:      if di ≤ Td then
6:          m ← m + 1
7:      end if
8:  end for
9:  if m/N ≥ Tr then
10:     return true              (P is Bob)
11: else
12:     return false             (P and Bob are two persons)
13: end if
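A direct Python transcription of Algorithm 1, using the cosine distance and, as defaults, the preferred thresholds Td ≈ 0.5 and Tr ≈ 0.5 reported in Section V-A:

    import numpy as np

    def cosine_distance(x, y):
        # 1 minus the cosine similarity of the two feature vectors
        return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def is_same_person(query_feature, stored_features, t_d=0.5, t_r=0.5):
        """Algorithm 1: count stored features within distance Td of the
        query feature; declare a match when the hit ratio reaches Tr."""
        hits = sum(1 for f in stored_features
                   if cosine_distance(query_feature, f) <= t_d)
        return hits / len(stored_features) >= t_r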

C. Interaction using Hand Gestures

Hand gestures are natural body language. Our goal is to detect and recognize "Yes" and "No" gestures in the image. However, hand gesture recognition in images taken by regular cameras against cluttered backgrounds is a difficult task: conventional skin color-based hand detectors fail dramatically in complex environments. In our design, we use the state-of-the-art object detection framework, Faster Region-based CNN (Faster R-CNN) [31], to train a hand gesture detector, as described below.

Gesture Recognition Model Training: According to the gesture recognition task, we categorize hand gestures into 3 classes: 1) "Yes" gesture; 2) "No" gesture; and 3) "Natural" gesture, which is a hand in any other pose. The data used to train the gesture recognition model is composed of 13,050 "Natural" gesture instances from 5,628 images in the VGG hand dataset [32], [33], plus 4,712 "Yes" gesture instances and 3,503 "No" gesture instances from 5,900 images crawled from Google and Flickr. Each annotation consists of an axis-aligned bounding rectangle along with its gesture class.

With this comprehensive gesture dataset and the VGG16 pre-trained model provided by the Faster R-CNN library, we fine-tune the "conv3_1" and upper layers together with the region proposal layers and Fast R-CNN detection layers [34]; the detailed training procedure can be found in [35]. After a gesture is detected, we link it to the nearest face, which requires users to show their hands near their faces when using gestures.
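The gesture-to-face linking step can be sketched as follows; we assume, since the paper does not specify it, that "nearest" means the smallest distance between bounding-box centers:

    import numpy as np

    def box_center(box):
        x1, y1, x2, y2 = box
        return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

    def link_gestures_to_faces(gesture_boxes, face_boxes):
        """Assign each detected gesture to the face whose bounding-box
        center is closest; returns one face index per gesture."""
        centers = [box_center(b) for b in face_boxes]
        return [int(np.argmin([np.linalg.norm(box_center(g) - c)
                               for c in centers]))
                for g in gesture_boxes]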

We show the user interface of the Cardea client application in Figure 3(a), together with an example of image processing results in Figure 3(b).

V. EVALUATION

In this section we present evaluation results along 3 axes: 1) vision micro-benchmarks, which evaluate the performance of the different computer vision tasks, including face recognition and matching, scene classification, and hand gesture recognition; 2) overall system performance, according to final privacy protection decisions and users' privacy preferences; and 3) runtime and energy consumption, which shows the processing time of each part and the energy consumed on one image.

Fig. 3. Cardea user interface and privacy protection results: (a) Cardea client application user interface; (b) privacy protection example. In (a), jiayu registers as a Cardea user by extracting and uploading her 20 face features; she specifies HKUST and four scene groups for privacy protection, and also enables "Yes" and "No" gestures. In (b), a picture is taken in HKUST; 3 registered users, one "Yes" gesture, and one "No" gesture are recognized, and the scene is correctly predicted as Park & street. Therefore, jiayu's face is not blurred, due to her "Yes" gesture. Prediction probabilities are also shown in (b).

A. Accuracy of Face Recognition and Matching

We first select 50 subjects from the LFW dataset [36] who have more than 10 images as registered users. Note that the subjects in the CASIA WebFace database used to train the lightened CNN model do not overlap with those in LFW [37]. For each subject, we extract at least 100 face features from YouTube videos to simulate the process of user registration; in total we get 5,042 feature vectors. These features are then divided into a training set and a validation set. In addition, we collect a user test set and a non-user test set to evaluate face recognition accuracy. The user test set is composed of 511 face feature vectors from online images of all 50 registered users. The non-user test set consists of 166 face feature vectors from 100 subjects in the LFW database whose names start with "A", serving as non-registered bystanders.

Figure 4(a) shows the recognition accuracy on the validation set and user test set with different numbers of features per person used for training. The accuracy refers to the fraction of faces that are correctly recognized. The results show that a model trained with 10 ∼ 20 features per person can achieve near 100% accuracy on the validation set and over 90% accuracy on the user test set, and that little improvement is achieved with more training data. Therefore, we train the face recognition model with 20 features per person. Figure 4(b) shows the overall accuracy as a function of the probability threshold Tp. For the non-user test set, accuracy means the fraction of faces that are not recognized as registered users. As a result, accuracy increases for the non-user test set but decreases for the validation set and user test set as Tp goes up. To make sure that registered users can be correctly recognized and non-registered users will not be mistakenly recognized, we choose Tp to be 0.08 ∼ 0.09, which achieves over 80% recognition accuracy for both users and non-users.
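A sketch of how such a rejection threshold might be applied; the rule that a face is accepted only when the top class probability clears Tp is our assumption, consistent with the threshold's described effect:

    import numpy as np

    def recognize_user(class_probs, t_p=0.085):
        """Return the recognized user's id only when the classifier's
        top Softmax probability reaches Tp (default taken from the
        middle of the chosen range 0.08~0.09); otherwise treat the
        face as a non-registered bystander (None)."""
        best = int(np.argmax(class_probs))
        return best if class_probs[best] >= t_p else None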

To evaluate face matching performance, we reuse the user test set. Subjects who have more than 20 features are regarded as the database group; the rest form the query group.

Fig. 4. Face recognition accuracy: (a) training accuracy vs. number of training features per person (0-20), for the validation set and user test set; (b) testing accuracy vs. probability threshold Tp (0.00-0.20), for the validation set, user test set, and non-user test set.

Fig. 5. Face matching accuracy with different distance thresholds Td and ratio thresholds Tr (Tr = 0.3, 0.5, 0.7; db = database group, qr = query group): (a) cosine distance (Td from 0.0 to 1.0); (b) Euclidean distance (Td from 60 to 140).

Similar to face recognition accuracy, we break face matching accuracy into two parts: for persons belonging to the database group, we need to match features to the correct person only; for persons not in the database group, we should not match features to anyone. Figure 5 shows the matching accuracy with cosine distance and Euclidean distance, respectively. The results show that cosine distance achieves better performance, with near 100% accuracy both for people in the database group and for people in the query group. The preferable parameters are distance threshold Td ≈ 0.5 and ratio threshold Tr ≈ 0.5.

Overall, the face recognition and matching methods we employ, with appropriate thresholds Tp, Td, and Tr, can effectively and efficiently recognize users and match faces in images. More importantly, only a small number of features are needed for training the face recognition model and for the face matching algorithm.

Fig. 6. Scene classification results: (a) recall, shown as the number of correctly classified images (TP) and misclassified images (FN) per group; (b) confusion matrix (rows: target group; columns: predicted group):

Target \ Predicted     Sh   Tr   Pa   Ea   Wo   Sc   Me   Re   En
Shopping              118    0    0    1    0    0    0    0    0
Travelling              2   88   16    1    0    0    0    0    9
Park & street           0    5   83    0    0    1    0    0    0
Eating & drinking      16    4    0   68   10    0    0    0    0
Working & study         7    8    1    7   63    0    0    0    0
Scantily clad           1    0    0    0    1   44    0    0    0
Medical care            0    3    0    1    0    0   29    0    0
Religion                0    0    1    0    0    0    0   24    0
Entertainment           2    0    0   10    1    0    0    0   13


B. Performance of Scene Classification

We recruited 8 volunteers and asked them to take pictures "in the wild" belonging to the 9 general scene groups. After manually annotating these pictures, we had 759 images in total, with 638 images falling in the 9 scene groups we are interested in. We then predict the scene group of each image using the scene classification model we trained.

The number of images and the classification result of each group are shown in Figure 6(a). True positives (TP) refer to images that are correctly classified, and false negatives (FN) are those classified into other scene groups. Overall, we achieve 0.83 recall (TP/(TP+FN)), and 5 scene groups have recall higher than 0.90. It is worth mentioning that the recall of the scene groups Scantily clad (Sc), Medical care (Me), and Religion (Re) exceeds 0.95, which provides strong support for protecting users' privacy in sensitive scenes.
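The overall figure can be reproduced directly from the confusion matrix in Figure 6(b): the diagonal sums to 530 of 638 images, giving the reported 0.83 overall recall. A short sketch:

    import numpy as np

    # Confusion matrix from Fig. 6(b); rows are target groups, columns
    # are predicted groups, in the order Sh, Tr, Pa, Ea, Wo, Sc, Me, Re, En.
    conf = np.array([
        [118,  0,  0,  1,  0,  0,  0,  0,  0],  # Shopping
        [  2, 88, 16,  1,  0,  0,  0,  0,  9],  # Travelling
        [  0,  5, 83,  0,  0,  1,  0,  0,  0],  # Park & street
        [ 16,  4,  0, 68, 10,  0,  0,  0,  0],  # Eating & drinking
        [  7,  8,  1,  7, 63,  0,  0,  0,  0],  # Working & study
        [  1,  0,  0,  0,  1, 44,  0,  0,  0],  # Scantily clad
        [  0,  3,  0,  1,  0,  0, 29,  0,  0],  # Medical care
        [  0,  0,  1,  0,  0,  0,  0, 24,  0],  # Religion
        [  2,  0,  0, 10,  1,  0,  0,  0, 13],  # Entertainment
    ])

    per_group_recall = np.diag(conf) / conf.sum(axis=1)  # TP / (TP + FN)
    overall_recall = np.trace(conf) / conf.sum()         # 530 / 638 ~ 0.83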

We also give the detailed classification confusion matrix in Figure 6(b). It shows that most FN of Eating & drinking (Ea) are classified as Shopping (Sh) or Working & study (Wo), and most FN of Wo are classified as Ea or Sh. The reason is that the boundaries between Sh, Ea, and Wo are not clear; for example, shopping malls have food courts, and people study in coffee shops. The same reason accounts for the confusion between Park & street (Pa) and Travelling (Tr). Moreover, for scene categories such as pub and bar, people may group them into Ea or En. Therefore, a safe strategy is to select more scene groups, for instance both Ea and En when going to a pub at night.


TABLE II
CARDEA'S OVERALL PRIVACY PROTECTION PERFORMANCE

Overall accuracy             86.4%
Protection accuracy          80.4%    No-protection accuracy    91.0%
Face recognition accuracy    98.5%    "Yes" gesture recall      97.9%
Scene classification recall  77.7%    "No" gesture recall       77.3%

In general, the evaluation results on images captured "in the wild" demonstrate that most scenes can be correctly classified; the performance is especially satisfactory for sensitive scenes.

C. Performance of Gesture Recognition

We asked our volunteers to take pictures of each other at different distances, angles, lighting conditions, and backgrounds. We manually annotated all images with hand regions and the scene group each image belongs to. In total we got 338 hand gesture images, containing 208 "Yes" gestures, 211 "No" gestures, and 363 natural hands. The images cover 6 out of the 9 scene groups; images that do not belong to our summarized scene groups are categorized into the group Others.

Figure 7(a) and Figure 7(b) show the recall and precision of "Yes" and "No" gestures under different scene groups. The recall and precision of the "Yes" gesture, and the precision of the "No" gesture, reach 1.0 for most scene groups, while the recall of the "No" gesture is 0.70 on average. For Entertainment (En) the recall is low, resulting from the dim illumination in most of the images we tested. In general, the performance of gesture recognition does not show any marked correlation with the scene groups. We therefore further investigated recall as a function of gesture size, since low recall threatens a user's privacy far more than low precision does.

Figure 7(c) plots the recall of gestures with varying hand region sizes. Each image is resized to about 800 × 600 pixels while keeping its aspect ratio, and the hand regions are classified into 10 size intervals, as plotted along the x-axis. The results show that "Yes" gestures achieve more than 0.9 recall for all hand region sizes, whereas the recall of "No" gestures tends to rise with increasing hand region size. This indicates that "No" gesture recognition can be improved with more training data containing smaller hand regions.
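One way to compute the size bins, under our reading of the resize step (scale each image so its area is roughly 800 × 600 pixels, preserving aspect ratio, then measure the hand box area in thousands of square pixels):

    import math

    def hand_region_size_kpx(box, img_w, img_h, target_area=800 * 600):
        """Area of a hand bounding box, in K square pixels, after the
        image is rescaled (aspect preserved) to ~800x600 pixels."""
        s = math.sqrt(target_area / float(img_w * img_h))  # uniform scale
        x1, y1, x2, y2 = box
        return (x2 - x1) * (y2 - y1) * s * s / 1000.0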

In summary, the performance of gesture recognition demonstrates the feasibility of integrating gesture interaction in Cardea for flexible privacy preference modification. It performs extremely well for "Yes" gesture recognition, and there is room for improving "No" gesture recognition with more training samples.

D. Overall Performance of Cardea Privacy Protection

After evaluating each vision task separately, we now present Cardea's overall privacy protection performance. Faces in an image taken using Cardea end up being protected (e.g., blurred) or remaining unchanged, correctly or incorrectly, depending on protection decisions made from both the user's privacy profile and the results of the vision tasks. Therefore, we asked 5 volunteers to register as Cardea users and set their privacy profiles.

Fig. 7. Hand gesture recognition results ("Yes" vs. "No" gestures): (a) recall for different scene groups (Others, Ea, Sh, Wo, Tr, Pa, En); (b) precision for different scene groups; (c) recall for different hand region sizes (from 2-4 up to >26 K square pixels).

The face recognition model is now trained using 1,100 face feature vectors from 55 people, including the 50 people from the LFW dataset. We took about 300 images and obtained the processed results. As we focus on face recognition rather than face detection, we only keep images in which faces have been successfully detected, leaving 224 images for evaluation.

Table II shows the final privacy protection accuracy as well as the performance of each vision task. The protection accuracy shows that 80.4% of faces that require protection are actually protected, and 91.0% of faces that do not ask for protection remain unchanged. Overall, 86.4% of faces are processed correctly, even though the scene classification recall and "No" gesture recall do not reach 80%. The reason is that Cardea's protection decision making can sometimes compensate for mistakes made in earlier steps. For example, if a user's "No" gesture is not detected, his face can still be protected when the predicted scene is selected in the user's profile.
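A minimal sketch of this per-face decision logic, inferred from the workflow described above and the example in Figure 3; the exact precedence of gestures over the profile is our assumption, not a specification from the paper:

    def protect_face(profile_matched, yes_gesture, no_gesture):
        """Decide whether a recognized user's face should be protected.
        A "Yes" gesture temporarily waives protection, a "No" gesture
        requests it, and otherwise the static privacy profile (matched
        scene/location/presence context) decides."""
        if yes_gesture:
            return False          # user permits this capture
        if no_gesture:
            return True           # user asks to be protected
        return profile_matched    # fall back to the privacy profile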

In summary, Cardea achieves over 85% accuracy for users in the real world. Improvements to each vision component will directly benefit Cardea's overall performance in the future.

E. Runtime and Energy Consumption

We validate the client side implementation on a Samsung Galaxy Note 4⁴ with a 4 × 2.7 GHz Krait 450 CPU, Qualcomm Snapdragon 805 chipset, 3 GB RAM, and a 16 MP f/2.2 camera. The server side is configured with an Intel i7-5820K CPU, 16 GB RAM, and a GeForce 980Ti graphics card (6 GB RAM). The client and server communicate via a TCP connection over Wi-Fi.

Figure 8 plots the time taken for Cardea to complete the different vision tasks. We take images with 1 face, 2 faces, and 5 faces at a size of 1920 × 1080; the images are compressed in JPEG format, and on average the data sent is about 950 KB. Note that some vision tasks are not triggered in certain situations, according to the decision workflow explained in Section IV; for example, if no user is recognized in the image, none of the other tasks start. For the purpose of measurement, we still activate all tasks, to illustrate the runtime difference for images captured with varying numbers of faces.

Among all vision tasks, face processing (i.e., face detection and feature extraction) and scene classification are performed on the smartphone.

⁴ http://www.gsmarena.com/samsung_galaxy_note_4-6434.php

Fig. 8. Task-level runtime of Cardea (milliseconds) for images with 1, 2, and 5 faces, broken down into: face detection & feature extraction (client), scene classification (client), face recognition & matching (server), gesture recognition (server), and network transmission.

The face processing takes about 900 milliseconds for 1 face and increases by about 200 milliseconds per additional face, reaching about 1100 and 1700 milliseconds for 2 faces and 5 faces, respectively. This is the most fundamental step, and it runs on the smartphone locally due to privacy considerations. Owing to the real-time OpenCV face detection implementation and the lightened CNN feature extraction model, it takes less than 1/3 of the overall runtime. The scene classification takes around 900 milliseconds per image, as it is performed only once, independent of the number of people in the image. The face recognition and matching tasks on the server side take less than 30 milliseconds for five people; though this time grows with the number of people, compared with the other tasks it barely affects the overall runtime. The gesture recognition also runs on the server and takes about 330 milliseconds; similar to scene classification, it is performed once on the whole image, regardless of the number of faces. According to our measurements, network transmission accounts for the majority of the overall runtime, due to the unstable network environment.

In general, photographers using Cardea can get one processed image within 5 seconds even in the heaviest case (i.e., a registered user who enables gestures is present, and scene classification is triggered on the smartphone). Compared with existing image capture platforms that also provide privacy protection, such as I-pic [16], Cardea offers more efficient image processing.

Next, we measure the energy consumption of taking pictures with Cardea on the Galaxy Note 4 using the Monsoon Power Monitor [38]. The images are also taken at a size of 1920 × 1080 pixels.

TABLE III
ENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBERS OF FACES

          Face processing (µAh)   Whole process (µAh)   # of images
1 face    217.2 (std 3.4)         1134.5 (std 45.9)     ~2800
2 faces   344.1 (std 13.1)        1276.7 (std 113.8)    ~2500
5 faces   692.6 (std 36.5)        1641.1 (std 66.0)     ~2000

The first two columns of Table III show the energy consumption for the face processing part only and for the whole process (i.e., from taking a picture to getting the processed image), with 1 face, 2 faces, and 5 faces, respectively. The screen stays on during the whole process, so a large portion of the energy consumption is due to the always-on screen. Moreover, we can observe that the face processing energy is linear in the number of faces, while all the other parts, including scene classification and sending and receiving data, are independent of the number of faces in the image, which is consistent with the runtime measurements.

Using the energy measurements, we also show Cardea's capacity on the Galaxy Note 4 in the last column of Table III. The device has a 3220 mAh battery and can therefore capture about 2000 high quality images with 5 faces using Cardea.
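As a rough consistency check (assuming the full battery charge is available for capturing), the last column follows directly from the whole-process energy; e.g., for 5 faces:

    3220 mAh = 3,220,000 µAh;  3,220,000 / 1641.1 ≈ 1962 ≈ 2000 images.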

VI. RELATED WORK

Visual privacy protection has attracted extensive attention in recent years, due to the increasing popularity of mobile and wearable devices with built-in cameras. As a result, a number of research works have proposed solutions to address these privacy issues.

Halderman et al. [14] propose an approach in which nearby devices jointly encrypt data during recording, utilizing short range wireless communication to exchange public keys and negotiate encryption keys; only by obtaining permission from everyone who encrypted the recording can one decrypt it. Jana et al. [39], [40] present methods by which third-party applications, such as perceptual and augmented reality applications, access only higher-level objects, such as a skeleton or a face, instead of raw sensor feeds. Raval et al. [41] propose a system that lets users mark secure regions that the camera can access, thereby preventing cameras from capturing sensitive information. Unlike the above solutions, our work focuses on protecting bystanders' privacy by respecting their privacy preferences when they are captured in photos.

Advanced techniques have been applied to identify individuals who request privacy protection. PriSurv [42] is a video surveillance system that identifies objects using RFID tags in order to protect their privacy. Courteous Glass [15] is a wearable camera integrated with a far-infrared imager that turns off recording when new persons or specific gestures are detected. However, these approaches require extra facilities or sensors that are not currently equipped on commodity devices, whereas our approach takes advantage of state-of-the-art computer vision techniques that are reliable and effective.

Some other efforts rely on visual markers. Respectful Cameras [10] is a video surveillance system that requires people who wish to remain anonymous to wear colored markers, such as hats or vests, so that their faces will be blurred. Bo et al. [8] use a QR code as a privacy tag to link an individual with his photo sharing preferences, which empowers photo service providers to exert privacy protection following users' policy expressions, mitigating the public's privacy concerns. Roesner et al. [9] propose a general framework for controlling access to sensor data on continuous sensing platforms, allowing objects to explicitly specify access policies; for example, a person can wear a QR code to communicate her privacy policy, and a depth camera is used to find the person. Unfortunately, using existing markers like QR codes is technically feasible but lacks usability, as few people would wear QR codes or other conspicuous markers in public places. Moreover, static policies cannot always satisfy people's privacy preferences, which are context-dependent, vary from person to person, and change over time. Therefore, we allow individuals to update their privacy profiles at any time, providing additional cues to inform cameras of necessary protection.

An interesting recent work, I-pic [16], allows people to broadcast their privacy preferences and appearance information to nearby devices using BLE; this work could be incorporated into our framework. Furthermore, we specify context elements that have not been considered before, such as the scene and the presence of others. Besides, we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera, while in I-pic the broadcasted data may not be received by the person taking the image, or the received data may be outdated.

VII. CONCLUSION AND FUTURE WORK

In this work we designed, implemented, and evaluated Cardea, a visual privacy protection system that aims to address individuals' visual privacy issues caused by pervasive cameras. Cardea leverages state-of-the-art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express their context-dependent privacy preferences in terms of location, scenes, and the presence of other people in the image. Besides, people can show "Yes" and "No" gestures to dynamically modify their privacy preferences. We demonstrated the performance of the different vision tasks on pictures taken "in the wild"; overall, Cardea achieves about 86% accuracy on users' requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which demonstrates the feasibility of running Cardea to take pictures while respecting bystanders' privacy preferences.

Cardea can be enhanced in several aspects. Future work includes improving scene classification and hand gesture recognition performance, compressing the CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people's visual privacy in videos is another challenging and significant task, which would make Cardea a complete visual privacy protection framework.

REFERENCES

[1] "Microsoft HoloLens," http://www.microsoft.com/microsoft-hololens/en-us.
[2] "Google Glass," http://www.google.com/glass/start.
[3] "Narrative Clip," http://www.getnarrative.com.
[4] R. Shaw, "Recognition markets and visual privacy," UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
[5] A. Acquisti, R. Gross, and F. Stutzman, "Face recognition and privacy in the age of augmented reality," Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
[6] "Congress's letter," http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy.
[7] "Data protection authorities' letter," https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp.
[8] C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, "Privacy.tag: Privacy concern expressed and respected," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163-176.
[9] F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, "World-driven access control for continuous sensing," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169-1181.
[10] J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, "Respectful cameras: Detecting visual markers in real-time to address privacy concerns," in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65-89.
[11] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, "Privacy behaviors of lifeloggers using wearable cameras," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571-582.
[12] R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, "Sensitive lifelogs: A privacy analysis of photos from wearable cameras," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645-1648.
[13] T. Denning, Z. Dehlawi, and T. Kohno, "In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies," in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014, pp. 2377-2386.
[14] J. A. Halderman, B. Waters, and E. W. Felten, "Privacy management for portable recording devices," in Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society. ACM, 2004, pp. 16-24.
[15] J. Jung and M. Philipose, "Courteous glass," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307-1312.
[16] P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, "I-pic: A platform for privacy-compliant image capture," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[18] "OpenCV," http://opencv.org.
[19] "Dlib," http://dlib.net.
[20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399-458, 2003.
[21] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, "Labeled faces in the wild: A survey," in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189-248.
[22] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, 2006.
[23] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," arXiv preprint arXiv:1511.02683, 2015.
[24] "CASIA-WebFace dataset," http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675-678.
[27] sh1r0, "Caffe-android-lib," https://github.com/sh1r0/caffe-android-lib.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places2 dataset project," http://places2.csail.mit.edu/index.html.
[30] BVLC, "Caffe," http://caffe.berkeleyvision.org.
[31] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[32] "VGG hand dataset," http://www.robots.ox.ac.uk/~vgg/data/hands.
[33] A. Mittal, A. Zisserman, and P. H. Torr, "Hand detection using multiple proposals," in BMVC. Citeseer, 2011, pp. 1-11.
[34] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[35] "Faster R-CNN (Python implementation)," https://github.com/rbgirshick/py-faster-rcnn.
[36] "LFW dataset," http://vis-www.cs.umass.edu/lfw.
[37] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.
[38] "Monsoon Power Monitor," https://www.msoon.com/LabEquipment/PowerMonitor.
[39] S. Jana, A. Narayanan, and V. Shmatikov, "A scanner darkly: Protecting user privacy from perceptual applications," in 2013 IEEE Symposium on Security and Privacy (SP). IEEE, 2013, pp. 349-363.
[40] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, "Enabling fine-grained permissions for augmented reality applications with recognizers," in 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415-430.
[41] N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, "What you mark is what apps see," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[42] K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, "PriSurv: Privacy protected video surveillance system using adaptive visual abstraction," in International Conference on Multimedia Modeling. Springer, 2008, pp. 144-154.

  • I Introduction
  • II Practical Visual Privacy Protection
    • II-A Contextndashdependent Personal Privacy Preferences
    • II-B Challenges Principles and Limitations
      • III Cardea Overview
        • III-A Key Concepts
        • III-B System Overview
          • IV Design and Implementation
            • IV-A User Recognition
            • IV-B Contextndashaware Computing
            • IV-C Interaction using Hand Gestures
              • V Evaluation
                • V-A Accuracy of Face Recognition and Matching
                • V-B Performance of Scene Classification
                • V-C Performance of Gesture Recognition
                • V-D Overall Performance of Cardea Privacy Protection
                • V-E Runtime and Energy Consumption
                  • VI Related Work
                  • VII Conclusion and Future Work
                  • References
Page 6: Cardea: Context–Aware Visual Privacy Protection from ... · context–aware and interactive visual privacy protection frame-work that enforces privacy protection according to people’s

(a) Cardea client application user interface (b) Privacy protection example

Fig 3 Cardea user interface and privacy protection results In (a) jiayu registers as a Cardea user by extracting and uploading her 20 face features Shespecifies HKUST four scene groups for privacy protection She also enables ldquoYesrdquo and ldquoNordquo gestures In (b) a picture is taken in HKUST 3 registered usersone ldquoYesrdquo and one ldquoNordquo gesture are recognized The scene is correctly predicted as Park amp street Therefore jiayursquos face is not blurred due to her ldquoYesrdquogesture Prediction probabilities are also shown in (b)

A Accuracy of Face Recognition and Matching

We first select 50 subjects from LFW dataset [36] who hasmore than 10 images as registered users Note that subjectsin the CASIA WebFace Database used to train the lightenedCNN model do not overlap with those in LFW [37] For eachsubject we extract at least 100 face features form Youtubevideo to simulate the process of user registration In total weget 5042 feature vectors These features are then divided intothe training set and validation set In addition we collect usertest set and nonndashuser test set to evaluate the face recognitionaccuracy The user test set is composed of 511 face featurevectors from online images of all 50 registered users Thenonndashuser test set consists of 166 face feature vectors from100 subjects in LFW database whose names start with ldquoArdquo asnonndashregistered bystanders

Figure 4(a) shows the recognition accuracy as to validationset and user test set with different numbers of features perperson used for training The accuracy refers to the fraction offaces that are correctly recognized The results show the modeltrained with 10 sim 20 features per person can achieve near100 accuracy on the validation set and over 90 accuracyon the user test set Little improvement can be achieved withmore training data Therefore we train the face recognitionmodel with 20 features per person Figure 4(b) shows theoverall accuracy with probability threshold Tp For nonndashusertest set the accuracy means the fraction of faces that arenot recognized as registered users As a result the accuracywill increase for the nonndashuser test set but decrease for thevalidation set and user test set when Tp goes up To makesure that registered users can be correctly recognized and non-registered users will not be mistakenly recognized we chooseTp to be 008 sim 009 which achieves over 80 recognitionaccuracy for both users and nonndashusers

To evaluate face matching performance we again use the user test set. Subjects who have more than 20 features are regarded as the database group; the rest form the query group. Similar to face recognition accuracy, we break face matching accuracy into two parts: for persons belonging to the database group, features must be matched only to the correct person; for persons not in the database group, features must not be matched to anyone.

Fig. 4. Face recognition accuracy: (a) training accuracy vs. number of training features per person (validation set, user test set); (b) testing accuracy vs. probability threshold Tp (validation set, user test set, non-user test set).

Fig. 5. Face matching accuracy under different distance thresholds Td and ratio thresholds Tr (0.3, 0.5, 0.7), for the database (db) and query (qr) groups: (a) cosine distance; (b) Euclidean distance.

Figure 5 shows the matching accuracy with cosine distance and with Euclidean distance. The results show that cosine distance achieves the better performance, with near 100% accuracy both for people in the database group and for people in the query group. The preferable parameters are distance threshold Td ≈ 0.5 and ratio threshold Tr ≈ 0.5.
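A sketch of such a matching rule follows. One loud assumption: this section does not spell out exactly how Tr is applied, so here Tr is taken to be the minimum fraction of a candidate's stored features that must lie within cosine distance Td of the query feature.

```python
# Sketch of distance-based face matching with thresholds Td and Tr.
# Assumption: Tr is interpreted as the minimum fraction of a person's
# database features that must lie within distance Td of the query.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match(query, database, Td=0.5, Tr=0.5):
    """database maps person_id -> (n_i, d) array of stored features.
    Returns the matched person_id, or None if the face matches nobody."""
    best_id, best_ratio = None, 0.0
    for pid, feats in database.items():
        dists = np.array([cosine_distance(query, f) for f in feats])
        ratio = float(np.mean(dists <= Td))   # fraction within Td
        if ratio >= Tr and ratio > best_ratio:
            best_id, best_ratio = pid, ratio
    return best_id
```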

Overall, the face recognition and matching methods we employ, with appropriate thresholds Tp, Td, and Tr, can effectively and efficiently recognize users and match faces in images. More importantly, only a small number of features is needed for training the face recognition model and for the face matching algorithm.

Fig. 6. Scene classification results: (a) number of images per group, split into true positives (TP) and false negatives (FN); (b) confusion matrix of target group vs. predicted group over the nine scene groups (Sh, Tr, Pa, Ea, Wo, Sc, Me, Re, En).

B. Performance of Scene Classification

We recruited 8 volunteers and asked them to take pictures "in the wild" belonging to the 9 general scene groups. After manually annotating these pictures, we had 759 images in total, 638 of which fall into the 9 scene groups of interest. We then predicted the scene group of each image using the scene classification model we trained.

The number of images and the classification result for each group are shown in Figure 6(a). True positives (TP) are images that are correctly classified; false negatives (FN) are those classified into other scene groups. Overall we achieve 0.83 recall (TP/(TP+FN)), and 5 scene groups have recall above 0.90. It is worth mentioning that the recall of the scene groups Scantily clad (Sc), Medical care (Me), and Religion (Re) exceeds 0.95, which provides strong support for protecting users' privacy in sensitive scenes.

We also give the detailed classification confusion matrix in Figure 6(b). It shows that most FN of Eating & drinking (Ea) are classified as Shopping (Sh) or Working & study (Wo), and most FN of Wo are classified as Ea or Sh. The reason is that the boundaries between Sh, Ea, and Wo are not clear: for example, shopping malls have food courts, and people study in coffee shops. The same reason accounts for the confusion between Park & street (Pa) and Travelling (Tr). Moreover, scene categories such as pubs and bars may be grouped into either Ea or En. A safe strategy is therefore to select more scene groups, for instance both Ea and En, when you go to a pub at night.
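The recall figures quoted above can be recomputed directly from the confusion matrix. The short script below does so, with the matrix transcribed from the extracted Figure 6(b); rows are target groups and columns predicted groups, ordered per the figure's axes.

```python
# Per-group recall recomputed from the confusion matrix of Fig. 6(b);
# rows are target groups, columns predicted groups, both in the order
# Sh, Tr, Pa, Ea, Wo, Sc, Me, Re, En.
import numpy as np

groups = ["Sh", "Tr", "Pa", "Ea", "Wo", "Sc", "Me", "Re", "En"]
cm = np.array([
    [118,  0,  0,  1,  0,  0,  0,  0,  0],   # Shopping
    [  2, 88, 16,  1,  0,  0,  0,  0,  9],   # Travelling
    [  0,  5, 83,  0,  0,  1,  0,  0,  0],   # Park & street
    [ 16,  4,  0, 68, 10,  0,  0,  0,  0],   # Eating & drinking
    [  7,  8,  1,  7, 63,  0,  0,  0,  0],   # Working & study
    [  1,  0,  0,  0,  1, 44,  0,  0,  0],   # Scantily clad
    [  0,  3,  0,  1,  0,  0, 29,  0,  0],   # Medical care
    [  0,  0,  1,  0,  0,  0,  0, 24,  0],   # Religion
    [  2,  0,  0, 10,  1,  0,  0,  0, 13],   # Entertainment
])
recall = cm.diagonal() / cm.sum(axis=1)          # TP / (TP + FN) per group
print(dict(zip(groups, recall.round(2))))
print("overall:", round(cm.diagonal().sum() / cm.sum(), 2))  # 530/638 = 0.83
```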

In general, the evaluation results on images captured "in the wild" demonstrate that most scenes can be correctly classified; the performance is especially satisfactory for the sensitive scenes.

TABLE II
CARDEA'S OVERALL PRIVACY PROTECTION PERFORMANCE

Overall accuracy: 86.4%
Protection accuracy: 80.4%          No-protection accuracy: 91.0%
Face recognition accuracy: 98.5%    "Yes" gesture recall: 97.9%
Scene classification recall: 77.7%  "No" gesture recall: 77.3%

C. Performance of Gesture Recognition

We asked our volunteers to take pictures of each other at different distances and angles and under different lighting conditions and backgrounds. We manually annotated all images with hand regions and the scene group each image belongs to. In total we got 338 hand gesture images containing 208 "Yes" gestures, 211 "No" gestures, and 363 natural hands. The images cover 6 of the 9 scene groups; images that do not belong to any of our summarized scene groups are categorized into the group Others.

Figure 7(a) and Figure 7(b) show the recall and precision of the "Yes" and "No" gestures for the different scene groups. The recall and precision of the "Yes" gesture and the precision of the "No" gesture reach 1.0 for most scene groups, while the recall of the "No" gesture is 0.70 on average. For Entertainment (En), recall is low because of the dim illumination in most of the images we tested. Overall, gesture recognition performance shows no marked correlation with scene group. We therefore further investigated recall as a function of gesture size, since low recall threatens a user's privacy far more than low precision does.

Figure 7(c) plots gesture recall for varying hand region sizes. Each image is resized to about 800 × 600 pixels while keeping its aspect ratio, and hand regions are classified into the 10 size intervals plotted along the x-axis. The results show that "Yes" gestures achieve more than 0.9 recall at all hand region sizes, whereas the recall of "No" gestures tends to rise as the hand region grows. This indicates that "No" gesture recognition can be improved with more training data at smaller sizes.
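The size analysis amounts to bucketing annotated hand regions by area and computing recall per bucket, as in this sketch; the bin edges mirror the x-axis of Figure 7(c), and the data structures are illustrative rather than taken from the evaluation code.

```python
# Sketch of the size-bucketed recall analysis behind Fig. 7(c).
# `samples` holds (hand_region_area_in_K_px2, was_detected) pairs for one
# gesture class; bin edges mirror the figure's x-axis, with ">26" last.
from bisect import bisect_right
from collections import defaultdict

BIN_EDGES = [2, 4, 6, 8, 10, 12, 14, 16, 21, 26]
BIN_NAMES = ["2-4", "4-6", "6-8", "8-10", "10-12",
             "12-14", "14-16", "16-21", "21-26", ">26"]

def recall_by_size(samples):
    hits, totals = defaultdict(int), defaultdict(int)
    for area, detected in samples:
        b = min(bisect_right(BIN_EDGES, area) - 1, len(BIN_NAMES) - 1)
        totals[b] += 1
        hits[b] += int(detected)
    return {BIN_NAMES[b]: hits[b] / totals[b] for b in sorted(totals)}

print(recall_by_size([(3.5, True), (3.9, False), (30.0, True)]))
```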

In summary, the performance of gesture recognition demonstrates the feasibility of integrating gesture interaction into Cardea for flexible privacy preference modification. It performs extremely well for "Yes" gesture recognition, and there is room to improve "No" gesture recognition with more training samples.

D. Overall Performance of Cardea Privacy Protection

After evaluating each vision task separately, we now present Cardea's overall privacy protection performance. Faces in an image taken with Cardea end up protected (e.g., blurred) or unchanged, correctly or incorrectly, depending on protection decisions made from both the user's privacy profile and the results of the vision tasks. We therefore asked 5 volunteers to register as Cardea users and set their privacy profiles.

Fig. 7. Hand gesture recognition results for "Yes" and "No" gestures: (a) recall for different scene groups; (b) precision for different scene groups; (c) recall for different hand region sizes (K square pixels).

The face recognition model is now trained using 1100 face feature vectors from 55 people, including the 50 people from the LFW dataset. We took about 300 images and obtained their processed versions. As we focus on face recognition rather than face detection, we keep only images in which faces were successfully detected, leaving 224 images for evaluation.

Table II shows the final privacy protection accuracy as well as the performance of each vision task. The protection accuracy shows that 80.4% of faces requiring protection are actually protected, and 91.0% of faces not requesting protection remain unchanged. Overall, 86.4% of faces are processed correctly, even though scene classification recall and "No" gesture recall do not reach 80%. The reason is that Cardea's protection decision process can compensate for mistakes made in earlier steps: for example, if a user's "No" gesture is not detected, his face can still be protected when the predicted scene is selected in his profile.
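The compensation effect falls out of the decision order: gestures, when detected, override the static profile, and the profile's selections apply otherwise. A toy sketch of that per-face decision follows; it covers only the scene element of the profile (the full profile also includes location and the presence of others), and the names are illustrative, not Cardea's actual data model.

```python
# Toy sketch of the per-face protection decision described above. A detected
# "Yes" gesture waives protection, a detected "No" gesture requests it, and
# otherwise the static profile decides -- so a missed "No" gesture can still
# be compensated for when the predicted scene is selected in the profile.
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Profile:
    protected_scenes: Set[str] = field(default_factory=set)
    gestures_enabled: bool = True

def should_protect(profile: Profile, scene: str, gesture: Optional[str]) -> bool:
    if profile.gestures_enabled and gesture == "yes":
        return False                      # temporary opt-out of protection
    if profile.gestures_enabled and gesture == "no":
        return True                       # temporary request for protection
    return scene in profile.protected_scenes

p = Profile(protected_scenes={"Medical care"})
print(should_protect(p, "Medical care", gesture=None))  # True: scene covers it
```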

In summary, Cardea achieves over 85% accuracy for users in the real world, and improvements to each vision component will directly benefit Cardea's overall performance.

E. Runtime and Energy Consumption

We validate the client-side implementation on a Samsung Galaxy Note 4 (http://www.gsmarena.com/samsung_galaxy_note_4-6434.php) with a 4 × 2.7 GHz Krait 450 CPU, Qualcomm Snapdragon 805 chipset, 3 GB RAM, and a 16 MP f/2.2 camera. The server is configured with an Intel i7-5820K CPU, 16 GB RAM, and a GeForce 980 Ti graphics card (6 GB RAM). The client and server communicate over TCP via a Wi-Fi connection.

Figure 8 plots the time Cardea takes to complete the different vision tasks. We take images containing 1, 2, and 5 faces at a size of 1920 × 1080; the images are compressed in JPEG format, and on average about 950 KB of data is sent. Note that some vision tasks are not triggered in some situations, according to the decision workflow explained in Section IV; for example, if no user is recognized in the image, none of the other tasks start. For measurement purposes we still activate all tasks, to illustrate the runtime differences across images captured with varying numbers of faces.

Among all vision tasks, face processing (i.e., face detection and feature extraction) and scene classification are performed on the smartphone.

Fig. 8. Task-level runtime of Cardea for images with 1, 2, and 5 faces: face detection & feature extraction (client), scene classification (client), face recognition & matching (server), gesture recognition (server), and network transmission.

Face processing takes about 900 milliseconds for 1 face and increases by about 200 milliseconds per additional face, reaching about 1100 and 1700 milliseconds for 2 and 5 faces, respectively. This is the most fundamental step, and it runs locally on the smartphone due to privacy considerations; owing to the real-time OpenCV face detection implementation and the lightened CNN feature extraction model, it takes less than one third of the overall runtime. Scene classification takes around 900 milliseconds per image, as it is performed only once regardless of the number of people in the image. The face recognition and matching tasks on the server side take less than 30 milliseconds for five people; although this time grows with the number of people, it barely affects the overall runtime compared with the other tasks. Gesture recognition also runs on the server and takes about 330 milliseconds; like scene classification, it is performed once on the whole image regardless of the number of faces. According to our measurements, network transmission accounts for the majority of the overall runtime, due to the unstable network environment.
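For reference, task-level numbers like those in Figure 8 can be collected with a simple wall-clock harness around each pipeline stage. The sketch below is generic; the stage function in the usage comment is a placeholder, not Cardea's API.

```python
# Generic wall-clock harness for task-level runtime measurements of the
# kind plotted in Fig. 8. Stage functions are placeholders, not Cardea's API.
import time

def timed(stage, fn, *args, **kwargs):
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{stage}: {(time.perf_counter() - t0) * 1000:.0f} ms")
    return result

# Usage (illustrative): faces = timed("face processing", detect_and_embed, img)
result = timed("demo stage", sum, range(1_000_000))
```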

In general, a photographer using Cardea obtains a processed image within 5 seconds in the heaviest case (i.e., a registered user who enables gestures is present, and scene classification is triggered on the smartphone). Compared with existing image capture platforms that also provide privacy protection, such as I-pic [16], Cardea offers more efficient image processing.

Next we measure the energy consumption of taking pictures with Cardea on the Galaxy Note 4, using the Monsoon Power Monitor [38]. The images are again taken at 1920 × 1080. The first two columns of Table III show the energy consumption of the face processing part alone and of the whole process (i.e., from taking a picture to getting the processed image back) for 1, 2, and 5 faces. The screen stays on during the whole process, so a large portion of the energy consumption is due to the always-on screen. Moreover, the face processing energy is linear in the number of faces, while all other parts, including scene classification and sending and receiving data, are independent of the number of faces in the image, consistent with the runtime measurements.

TABLE III
ENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBERS OF FACES

          Face recognition (µAh)   Whole process (µAh)    # of images
1 face    217.2 (std 3.4)          1134.5 (std 45.9)      ~2800
2 faces   344.1 (std 13.1)         1276.7 (std 113.8)     ~2500
5 faces   692.6 (std 36.5)         1641.1 (std 66.0)      ~2000

Using the energy measurements, we also show Cardea's capacity on the Galaxy Note 4 in the last column of Table III. The device has a 3220 mAh (3.22 × 10^6 µAh) battery; since a full capture with 5 faces consumes about 1641 µAh, it can take roughly 3.22 × 10^6 / 1641 ≈ 2000 high-quality images with 5 faces using Cardea.

VI. RELATED WORK

Visual privacy protection has attracted extensive attention in recent years, owing to the increasing popularity of mobile and wearable devices with built-in cameras. A number of research works have proposed solutions to these privacy issues.

Halderman et al. [14] propose an approach in which nearby devices jointly encrypt data during recording, using short-range wireless communication to exchange public keys and negotiate encryption keys; a recording can be decrypted only by obtaining permission from everyone who encrypted it. Jana et al. [39], [40] present methods in which third-party applications, such as perceptual and augmented reality applications, have access only to higher-level objects such as a skeleton or a face, instead of raw sensor feeds. Raval et al. [41] propose a system that lets users mark secure regions that the camera may access, so that cameras are prevented from capturing sensitive information. Unlike the solutions above, our work focuses on protecting bystanders' privacy by respecting their privacy preferences when they are captured in photos.

To identify individuals who request privacy protection, various advanced techniques have been applied. PriSurv [42] is a video surveillance system that identifies objects using RFID tags to protect their privacy in surveillance video. Courteous Glass [15] is a wearable camera integrated with a far-infrared imager that turns off recording when new persons or specific gestures are detected. However, these approaches require extra facilities or sensors that are not currently equipped on commodity devices, whereas our approach takes advantage of state-of-the-art computer vision techniques that are reliable and effective.

Some other efforts rely on visual markers. Respectful Cameras [10] is a video surveillance system that requires people who wish to remain anonymous to wear colored markers such as hats or vests, and blurs their faces. Bo et al. [8] use QR codes as privacy tags that link an individual with his photo sharing preferences, empowering photo service providers to enforce privacy protection according to users' policy expressions and mitigate the public's privacy concerns. Roesner et al. [9] propose a general framework for controlling access to sensor data on continuous sensing platforms, allowing objects to explicitly specify access policies; for example, a person can wear a QR code to communicate her privacy policy, and a depth camera is used to find the person. Unfortunately, using existing markers like QR codes is technically feasible but lacks usability, as few people would wear QR codes or other conspicuous markers in public places. Moreover, static policies cannot always satisfy people's privacy preferences, which are context-dependent, vary from person to person, and change over time. We therefore allow individuals to update their privacy profiles at any time, providing other cues to inform cameras of the necessary protection.

I-pic [16], an interesting recent work, allows people to broadcast their privacy preferences and appearance information to nearby devices using BLE, and could be incorporated into our framework. Beyond I-pic, we specify context elements that have not been considered before, such as the scene and the presence of others, and we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera; broadcast data in I-pic, by contrast, may not be received by the person taking the image, or may be outdated by then.

VII. CONCLUSION AND FUTURE WORK

In this work we designed, implemented, and evaluated Cardea, a visual privacy protection system that addresses individuals' visual privacy issues caused by pervasive cameras. Cardea leverages state-of-the-art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express context-dependent privacy preferences in terms of location, scene, and the presence of other people in the image; they can also show "Yes" and "No" gestures to dynamically modify those preferences. We demonstrated the performance of the different vision tasks on pictures taken "in the wild"; overall, Cardea achieves about 86% accuracy on users' requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which demonstrates the feasibility of taking pictures with Cardea while respecting bystanders' privacy preferences.

Cardea can be enhanced in several respects. Future work includes improving scene classification and hand gesture recognition performance, compressing the CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people's visual privacy in video is another challenging and significant task, and addressing it would make Cardea a complete visual privacy protection framework.

REFERENCES

[1] "Microsoft HoloLens," http://www.microsoft.com/microsoft-hololens/en-us.
[2] "Google Glass," http://www.google.com/glass/start.
[3] "Narrative Clip," http://www.getnarrative.com.
[4] R. Shaw, "Recognition markets and visual privacy," UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
[5] A. Acquisti, R. Gross, and F. Stutzman, "Face recognition and privacy in the age of augmented reality," Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
[6] "Congress's letter," http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy.
[7] "Data protection authorities' letter," https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp.
[8] C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, "Privacy.tag: Privacy concern expressed and respected," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163–176.
[9] F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, "World-driven access control for continuous sensing," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169–1181.
[10] J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, "Respectful cameras: Detecting visual markers in real-time to address privacy concerns," in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65–89.
[11] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, "Privacy behaviors of lifeloggers using wearable cameras," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571–582.
[12] R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, "Sensitive lifelogs: A privacy analysis of photos from wearable cameras," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645–1648.
[13] T. Denning, Z. Dehlawi, and T. Kohno, "In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies," in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014, pp. 2377–2386.
[14] J. A. Halderman, B. Waters, and E. W. Felten, "Privacy management for portable recording devices," in Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society. ACM, 2004, pp. 16–24.
[15] J. Jung and M. Philipose, "Courteous Glass," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307–1312.
[16] P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, "I-pic: A platform for privacy-compliant image capture," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), vol. 16, 2016.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[18] "OpenCV," http://opencv.org.
[19] "Dlib," http://dlib.net.
[20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[21] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, "Labeled faces in the wild: A survey," in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189–248.
[22] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[23] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," arXiv preprint arXiv:1511.02683, 2015.
[24] "CASIA-WebFace dataset," http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.
[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[27] sh1r0, "Caffe-android-lib," https://github.com/sh1r0/caffe-android-lib.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places2 dataset project," http://places2.csail.mit.edu/index.html.
[30] BVLC, "Caffe," http://caffe.berkeleyvision.org.
[31] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[32] "VGG hand dataset," http://www.robots.ox.ac.uk/~vgg/data/hands.
[33] A. Mittal, A. Zisserman, and P. H. Torr, "Hand detection using multiple proposals," in BMVC. Citeseer, 2011, pp. 1–11.
[34] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[35] "Faster R-CNN (Python implementation)," https://github.com/rbgirshick/py-faster-rcnn.
[36] "LFW dataset," http://vis-www.cs.umass.edu/lfw.
[37] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.
[38] "Monsoon Power Monitor," https://www.msoon.com/LabEquipment/PowerMonitor.
[39] S. Jana, A. Narayanan, and V. Shmatikov, "A Scanner Darkly: Protecting user privacy from perceptual applications," in Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 349–363.
[40] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, "Enabling fine-grained permissions for augmented reality applications with recognizers," in Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415–430.
[41] N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, "What you mark is what apps see," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[42] K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, "PriSurv: Privacy protected video surveillance system using adaptive visual abstraction," in International Conference on Multimedia Modeling. Springer, 2008, pp. 144–154.

  • I Introduction
  • II Practical Visual Privacy Protection
    • II-A Contextndashdependent Personal Privacy Preferences
    • II-B Challenges Principles and Limitations
      • III Cardea Overview
        • III-A Key Concepts
        • III-B System Overview
          • IV Design and Implementation
            • IV-A User Recognition
            • IV-B Contextndashaware Computing
            • IV-C Interaction using Hand Gestures
              • V Evaluation
                • V-A Accuracy of Face Recognition and Matching
                • V-B Performance of Scene Classification
                • V-C Performance of Gesture Recognition
                • V-D Overall Performance of Cardea Privacy Protection
                • V-E Runtime and Energy Consumption
                  • VI Related Work
                  • VII Conclusion and Future Work
                  • References
Page 7: Cardea: Context–Aware Visual Privacy Protection from ... · context–aware and interactive visual privacy protection frame-work that enforces privacy protection according to people’s

Sh Tr Pa Ea Wo Sc Me Re EnPredict group

0

20

40

60

80

100

120

Num

ber o

f im

ages

TPFN

(a) Recall

Sh Tr Pa Ea Wo Sc Me Re EnPredict group

Entertainment

Religion

Medical care

Scantily clad

Working amp study

Eating amp drinking

Park amp street

Travelling

Shopping

Targ

et G

roup

2 0 0 10 1 0 0 0 13

0 0 1 0 0 0 0 24 0

0 3 0 1 0 0 29 0 0

1 0 0 0 1 44 0 0 0

7 8 1 7 63 0 0 0 0

16 4 0 68 10 0 0 0 0

0 5 83 0 0 1 0 0 0

2 88 16 1 0 0 0 0 9

118 0 0 1 0 0 0 0 0

(b) Confusion matrix

Fig 6 Scene classification results

for training the face recognition model and for face matchingalgorithm

B Performance of Scene Classification

We recruited 8 volunteers and asked them to take picturesldquoin the wildrdquo belonging to 9 general scene groups Aftermanually annotating these pictures we had 759 images intotal with 638 images in 9 scene groups we are interested inWe then predict scene group for each image using the sceneclassification model we trained

The number of images and classification result of eachgroup are shown in Figure 6(a) The true positives (TP) refersto images that are correctly classified and false negatives (FN)are those classified as other scene groups Overall we canachieve 083 recall (TP(TP+FN)) and 5 scene groups arewith recall more than 090 It is worth mention that the recallof scene groups Scantily clad (Sc) Medical care (Me) andReligion (Re) exceed 095 which provides strong support toprotect usersrsquo privacy in sensitive scenes

We also give the detailed classification confusion matrix inFigure 6(b) It shows that most FN of Eating amp drinking (Ea)are classified as Shopping (Sh) or Working amp study (Wo) andmost FN of Wo are classified as Ea or Sh The reason is thatboundaries between Sh Ea or Wo are not clear For exampleshopping malls have food courts or people study in coffeeshop The same reason accounts for the confusion betweenPark amp street (Pa) and Travelling (Tr) Moreover for scenecategories such as pub and bar people may group them intoEa or En Therefore a safe way is to select more scene groupsfor instance both Ea and En when you go to a pub at night

In general the evaluation results from images captured ldquoin

TABLE IICARDEArsquoS OVERALL PRIVACY PROTECTION PERFORMANCE

Overall accuracy 864 Protection accuracy 804No protection accuracy 910

Face recognition accuracy 985 ldquoYesrdquo gesture recall 979scene classification recall 777 ldquoNordquo gesture recall 773

the wildrdquo demonstrate that most of scenes can be correctlyclassified the performance is especially satisfactory for thosesensitive scenes

C Performance of Gesture Recognition

We asked our volunteers to take pictures for each otherwith different distances angles lighting conditions and back-grounds We manually annotated all images with hand regionsand the scene group it belongs to In total we got 338 handgesture images with 208 ldquoYesrdquo gestures 211 ldquoNordquo gesturesand 363 natural hands The images covers 6 out of 9 scenegroups For images do not belong to our summarized scenegroups we categorize them into group Others

Figure 7(a) and Figure 7(b) show the recall and precisionfor ldquoYesrdquo and ldquoNordquo gestures under different scene groups Therecall and precision of ldquoYesrdquo gesture the precision of ldquoNordquogesture reach 10 for most of the scene groups while the recallof ldquoNordquo gesture achieves 070 on average For Entertainment(En) the recall is low resulting from dim illumination inmost of the images we tested In general the performanceof gesture recognition does not show any marked correlationwith different scene groups Therefore we further investigatedthe recall in terms of gesture size as low recall will greatlythreatens userrsquos privacy compared with precision

Figure 7(c) plots the recall of gestures with varying handregion sizes Each image will be resized to about 800 times 600square pixels while keeping its aspect ratio We classify theminto 10 size intervals as plotted along the xndashaxis The resultshows ldquoYesrdquo gestures can achieve more than 09 recall for allsizes of hand region On the other hand recall of ldquoNordquo gesturestends to rise with increasing hand region size in general Itindicates that the performance of ldquoNordquo gesture recognitioncan be improved with more training data with smaller sizes

In summary the performance of gesture recognition demon-strates the feasibility of integrating gesture interaction inCardea for flexible privacy preference modification It per-forms extremely well for ldquoYesrdquo gesture recognition and thereis room for improvement of ldquoNordquo gesture recognition withmore training samples

D Overall Performance of Cardea Privacy Protection

After evaluating each vision task separately we now presentCardearsquos overall privacy protection performance Faces inthe image taken using Cardea end up being protected (egblurred) or remain unchanged correctly or incorrectly de-pending on protection decisions made based on both userrsquosprivacy profile and results from vision tasks Therefore weasked 5 volunteers to register as Cardea users and set their

Others Ea Sh Wo Tr Pa EnScene groups

00

02

04

06

08

10

Rec

all

Yes No

(a) Recall for different scenes

Others Ea Sh Wo Tr Pa EnScene groups

00

02

04

06

08

10

Pre

cisi

on

Yes No

(b) precision for different scenes

2-4 4-6 6-8 8-10 10-12 12-14 14-16 16-21 21-26 gt26Hand region size (K square pixel)

00

02

04

06

08

10

Rec

all

Yes No

(c) Recall for different hand region sizes

Fig 7 Hand gesture recognition results

privacy profiles Now the face recognition model is trainedusing 1100 face feature vectors from 55 people including 50people from LFW dataset We take about 300 images andget processed images As we focus more on face recognitionrather than face detection we only keep images that faceshave been successfully detected In total we got 224 imagesfor evaluation

Table III shows the final privacy protection accuracy aswell as performance of each vision task The protection accu-racy shows 804 faces that require protection are actuallyprotected and 910 faces that do not ask for protectionremain unchanged Overall 864 faces are processed cor-rectly though the scene classification recall and ldquoNordquo gesturerecall do not reach 80 The reason is that protection deci-sion making process of Cardea sometimes can make up formistakes happening in the early step For example if userrsquosldquoNordquo gesture is not detected his face can still be protectedwhen the predict scene is selected in userrsquos profile

In summary Cardea achieves over 85 accuracy for users inthe real world Improvements of each vision part will directlybenefit Cardearsquos overall performance in the future

E Runtime and Energy Consumption

We validate the client side implementations on SamsungGalaxy Note 44 with 4times27 GHz Krait 450 CPU Qual-comm Snapdragon 805 Chipset 3GB RAM and 16 MP f22Camera The server side is configured with Intel i7ndash5820KCPU 16GB RAM GeForce 980Ti Graphic Card (6GB RAM)The client and server communicate via a TCP over WindashFiconnection

Figure 8 plots the time taken for Cardea to completedifferent vision tasks We take images with 1 face 2 facesand 5 faces in the size of 1920 times 1080 The images will becompressed in JPEG format On average the data sent is about950 KB Note some vision tasks will not be triggered in somesituations according to the decision workflow as explained inSection IV For example if no user is recognized in the imageall other tasks will not start For the purpose of measurementwe still activate all tasks to illustrate the runtime difference ofimages captured with varying number of faces

Among all vision tasks face processing (ie face detectionand feature extraction) and scene classification are performed

4httpwwwgsmarenacomsamsung galaxy note 4-6434php

1 face 2 faces 5 faces0

500

1000

1500

2000

2500

3000

Run

time

(mill

isec

ond)

Face detection amp feature extraction (client)Scene classification (client)Face recognition amp matching (server)Gesture recognition (server)Network

Fig 8 Task level runtime of Cardea

on the smartphone The face processing takes about 900milliseconds for 1 face and increases about 200 millisecondsper additional face Therefore it reaches about 1100 and1700 milliseconds for 2 faces and 5 faces respectively Thisis the most fundamental step that runs on the smartphonelocally due to privacy considerations Owing to the realndashtimeOpenCV face detection implementation and lightened CNNfeature extraction model it takes less then 13 overall runtimeThe scene classification takes around 900 milliseconds perimage as it only performs once independent of number ofpeople in the image The face recognition and matching taskson the server side take less than 30 milliseconds for fivepeople Though the time grows with increasing number ofpeople compared with other tasks they barely affect the overallruntime The gesture recognition also runs on the server andit takes about 330 milliseconds Similar to scene recognitionit performs on the whole image once regardless the numberof faces According to the measurement we find the networktransmission accounts for a majority of the overall runtimedue to the unstable network environment

In general photographers using Cardea to take picturescan get one processed image within 5 seconds in the mostheavy case (ie there is registered user who enables gesturesand scene classification will be triggered on the smartphone)Compared with existing image capture platforms that alsoprovide privacy protection such as I-pic [16] Cardea offersmore efficient image processing functionality

Next we measure the energy consumption of taking pictureswith Cardea on Galaxy Note 4 phone using the MonsoonPower Monitor [38] The images are also taken in size of1920times 1080 square pixels The first two columns of Table II

TABLE IIIENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBER OF FACES

Face recognition Whole process (uAh) of images

1 face 2172 (std 34) 11345 (std 459) sim 28002 faces 3441 (std 131) 12767 (std 1138) sim 25005 faces 6926 (std 365) 16411 (std 660) sim 2000

show the energy consumption for the face processing partonly and for the whole process (ie from taking a picture togetting the processed image) with 1 face 2 faces and 5 facesrespectively The screen stays on during the whole processtherefore a large portion of the energy consumption is dueto the alwaysndashon screen Moreover we can observe that faceprocessing energy is linear to face numbers All the other partsincluding scene classification sending and receiving data areindependent of the number of faces in the image which isconsistent with runtime measurements

Using energy measurements we also show Cardearsquos capac-ity on Galaxy Note 4 in the last column of Table II Thisdevice has a 3220 mAh battery therefore can capture about2000 high quality images with 5 faces using Cardea

VI RELATED WORK

Visual privacy protection has attracted extensive attentionsthese years due to increasing popularity of mobile and wear-able devices with builtndashin cameras As a result some researchworks have proposed solutions to address these privacy issues

Halderman et al [14] propose an approach in which closeddevices can encrypt data together during recording utilizingshort range wireless communication to exchange public keysand negotiate encryption key Only by obtaining all of thepermissions from people who encrypt the recording can onedecrypts it Juna et al [39] [40] present methods that thirdndashparty applications such as perceptual and augmented realityapplications have access to only higher-level objects such asa skeleton or a face instead of raw sensor feeds Raval et al[41] propose a system that gives users control to mark secureregions that the camera have access to therefore cameras areprevented from capturing sensitive information Unlike abovesolutions our work focuses on protecting bystandersrsquo privacyby respecting their privacy preferences when they are capturedin photos

To identify individuals who request privacy protectionadvanced techniques have been applied PriSurv [42] is avideo surveillance system that identifies objects using RFIDndashtags to protect the privacy of objects in the video surveillancesystem Courteous Glass [15] is a wearable camera integratedwith a far-infrared imager that turns off recording when newpersons or specific gestures are detected However theseapproaches require extra facilities or sensors that are notcurrently equipped on devices Our approach takes advantageof statendashofndashthendashart computer vision techniques that are reliableand effective

Some other efforts rely on visual markers Respectful Cam-eras [10] is a video surveillance system that requires people

who wish to remain anonymous wear colored markers suchas hats or vests and their faces will be blurred Bo et al[8] use QR code as privacy tags to link an individual withhis photo sharing preferences which empowers photo serviceproviders to exert privacy protection following usersrsquo policyexpressions to mitigate the publicrsquos privacy concerns Roesneret al [9] propose a general framework for controlling access tosensor data on continuous sensing platforms allowing objectsto explicitly specify access policies For example a personcan wear a QR code to communicate her privacy policy anda depth camera is used to find the person Unfortunatelyusing existing markers like QR code is technically feasiblybut lacks usability as few would wear QR code or otherconspicuous markers in public places Moreover static policiescannot satisfy peoplersquos privacy preferences all the time whichis contextndashdependent varying from each other and changingfrom time to time Therefore we allow individuals to updatetheir privacy profiles at anytime providing other cues toinform cameras of necessary protection

A recently interesting work Indashpic [16] allows people tobroadcast their privacy preferences and appearance informa-tion to nearby devices using BLE This work can be incor-porated into our framework Furthermore we specify contextelements that have not been considered before such as sceneand presence of others Besides we provide a convenientmechanism for people to temporarily change their privacypreferences using hand gestures when facing the camera whilebroadcasted data in I-pic may not be received by people whotake images or the received data is outdated

VII CONCLUSION AND FUTURE WORK

In this work we designed implemented and evaluatedCardea a visual privacy protection system that aims to addressindividualsrsquo visual privacy issues caused by pervasive cam-eras Cardea leverages the statendashofndashthendashart CNN models forfeature extraction visual classification and recognition WithCardea people can express their contextndashdependent privacypreferences in terms of location scenes and presence of otherpeople in the image Besides people can show ldquoYesrdquo andldquoNordquo gestures to dynamically modify their privacy prefer-ences We demonstrated performances of different vision taskswith pictures ldquoin the wildrdquo and overall it can achieve about86 accuracy on usersrsquo requests We also evaluated runtimeand energy consumed with Cardea prototype which provesthe feasibility of running Cardea for taking pictures whilerespecting bystandersrsquo privacy preferences

Cardea can be enhanced in different aspects The futurework includes improving scene classification and hand gesturerecognition performances compressing CNN models to reduceoverall runtime and integrating Cardea with camera subsystemto enforce privacy protection measure Moreover protectingpeoplersquos visual privacy in the video is another challenging andsignificant task which will make Cardea a complete visualprivacy protection framework

REFERENCES

[1] ldquoMicrosoft Hololensrdquo httpwwwmicrosoftcommiscrosoft-hololensen-us

[2] ldquoGoogle Glassrdquo httpwwwgooglecomglassstart[3] ldquoNarrative Cliprdquo httpwwwgetnarrativecom[4] R Shaw ldquoRecognition markets and visual privacyrdquo UnBlinking New

Perspectives on Visual Privacy in the 21st Century 2006[5] A Acquisti R Gross and F Stutzman ldquoFace recognition and privacy

in the age of augmented realityrdquo Journal of Privacy and Confidentialityvol 6 no 2 p 1 2014

[6] ldquoCongressrsquos letterrdquo httpblogswsjcomdigits20130516congress-asks-google-about-glass-privacy

[7] ldquoData protection authoritiesrsquo letterrdquo httpswwwprivgccamedianr-c2013nr-c 130618 easp

[8] C Bo G Shen J Liu X-Y Li Y Zhang and F Zhao ldquoPrivacy tagPrivacy concern expressed and respectedrdquo in Proceedings of the 12thACM Conference on Embedded Network Sensor Systems ACM 2014pp 163ndash176

[9] F Roesner D Molnar A Moshchuk T Kohno and H J Wang ldquoWorld-driven access control for continuous sensingrdquo in Proceedings of the 2014ACM SIGSAC Conference on Computer and Communications SecurityACM 2014 pp 1169ndash1181

[10] J Schiff M Meingast D K Mulligan S Sastry and K Gold-berg ldquoRespectful cameras Detecting visual markers in real-time toaddress privacy concernsrdquo in Protecting Privacy in Video SurveillanceSpringer 2009 pp 65ndash89

[11] R Hoyle R Templeman S Armes D Anthony D Crandall andA Kapadia ldquoPrivacy behaviors of lifeloggers using wearable camerasrdquoin Proceedings of the 2014 ACM International Joint Conference onPervasive and Ubiquitous Computing ACM 2014 pp 571ndash582

[12] R Hoyle R Templeman D Anthony D Crandall and A KapadialdquoSensitive lifelogs A privacy analysis of photos from wearable cam-erasrdquo in Proceedings of the 33rd Annual ACM Conference on HumanFactors in Computing Systems ACM 2015 pp 1645ndash1648

[13] T Denning Z Dehlawi and T Kohno ldquoIn situ with bystanders of aug-mented reality glasses Perspectives on recording and privacy-mediatingtechnologiesrdquo in Proceedings of the 32nd annual ACM conference onHuman factors in computing systems ACM 2014 pp 2377ndash2386

[14] J A Halderman B Waters and E W Felten ldquoPrivacy management forportable recording devicesrdquo in Proceedings of the 2004 ACM workshopon Privacy in the electronic society ACM 2004 pp 16ndash24

[15] J Jung and M Philipose ldquoCourteous glassrdquo in Proceedings of the2014 ACM International Joint Conference on Pervasive and UbiquitousComputing Adjunct Publication ACM 2014 pp 1307ndash1312

[16] P Aditya R Sen P Druschel S J Oh R Benenson M FritzB Schiele B Bhattacharjee and T T Wu ldquoI-pic A platform forprivacy-compliant image capturerdquo in Proceedings of the 14th AnnualInternational Conference on Mobile Systems Applications and ServicesMobiSys vol 16 2016

[17] P Viola and M J Jones ldquoRobust real-time face detectionrdquo Internationaljournal of computer vision vol 57 no 2 pp 137ndash154 2004

[18] ldquoOpenCVrdquo httpopencvorg[19] ldquoDlibrdquo httpdlibnet[20] W Zhao R Chellappa P J Phillips and A Rosenfeld ldquoFace recog-

nition A literature surveyrdquo ACM computing surveys (CSUR) vol 35no 4 pp 399ndash458 2003

[21] E Learned-Miller G B Huang A RoyChowdhury H Li and G HualdquoLabeled faces in the wild A surveyrdquo in Advances in Face Detectionand Facial Image Analysis Springer 2016 pp 189ndash248

[22] T Ahonen A Hadid and M Pietikainen ldquoFace description with localbinary patterns Application to face recognitionrdquo IEEE transactions onpattern analysis and machine intelligence vol 28 no 12 pp 2037ndash2041 2006

[23] X Wu R He and Z Sun ldquoA lightened cnn for deep face representa-tionrdquo arXiv preprint arXiv151102683 2015

[24] ldquoCASIA-WebFace datasetrdquo httpwwwcbsriaascnenglishCASIA-WebFace-Databasehtml

[25] O M Parkhi A Vedaldi and A Zisserman ldquoDeep face recognitionrdquoin British Machine Vision Conference vol 1 no 3 2015 p 6

[26] Y Jia E Shelhamer J Donahue S Karayev J Long R GirshickS Guadarrama and T Darrell ldquoCaffe Convolutional architecture forfast feature embeddingrdquo in Proceedings of the ACM InternationalConference on Multimedia ACM 2014 pp 675ndash678

[27] sh1r0 ldquoCaffe-android-librdquo httpsgithubcomsh1r0caffe-android-lib[28] C-C Chang and C-J Lin ldquoLibsvm a library for support vector

machinesrdquo ACM Transactions on Intelligent Systems and Technology(TIST) vol 2 no 3 p 27 2011

[29] B Zhou A Khosla A Lapedriza A Torralba and A Oliva ldquoPlaces2dataset projectrdquo httpplaces2csailmiteduindexhtml

[30] BVLC ldquoCafferdquo httpcaffeberkeleyvisionorg[31] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-time

object detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[32] ldquoVGG hand datasetrdquo httpwwwrobotsoxacuksimvggdatahands[33] A Mittal A Zisserman and P H Torr ldquoHand detection using multiple

proposalsrdquo in BMVC Citeseer 2011 pp 1ndash11[34] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE International

Conference on Computer Vision 2015 pp 1440ndash1448[35] ldquoFaster R-CNN (Python implementation)rdquo httpsgithubcomrbgirshick

py-faster-rcnn[36] ldquoLFW datasetrdquo httpvis-wwwcsumassedulfw[37] D Yi Z Lei S Liao and S Z Li ldquoLearning face representation from

scratchrdquo arXiv preprint arXiv14117923 2014[38] ldquoMonsoon Power Monitorrdquo httphttpswwwmsooncom

LabEquipmentPowerMonitor[39] S Jana A Narayanan and V Shmatikov ldquoA scanner darkly Protecting

user privacy from perceptual applicationsrdquo in Security and Privacy (SP)2013 IEEE Symposium on IEEE 2013 pp 349ndash363

[40] S Jana D Molnar A Moshchuk A Dunn B Livshits H J Wangand E Ofek ldquoEnabling fine-grained permissions for augmented realityapplications with recognizersrdquo in Presented as part of the 22nd USENIXSecurity Symposium (USENIX Security 13) 2013 pp 415ndash430

[41] N Raval A Srivastava A Razeen K Lebeck A Machanavajjhala andL P Cox ldquoWhat you mark is what apps seerdquo in ACM InternationalConference on Mobile Systems Applications and Services (Mobisys)2016

[42] K Chinomi N Nitta Y Ito and N Babaguchi ldquoPrisurv privacyprotected video surveillance system using adaptive visual abstractionrdquoin International Conference on Multimedia Modeling Springer 2008pp 144ndash154

  • I Introduction
  • II Practical Visual Privacy Protection
    • II-A Contextndashdependent Personal Privacy Preferences
    • II-B Challenges Principles and Limitations
      • III Cardea Overview
        • III-A Key Concepts
        • III-B System Overview
          • IV Design and Implementation
            • IV-A User Recognition
            • IV-B Contextndashaware Computing
            • IV-C Interaction using Hand Gestures
              • V Evaluation
                • V-A Accuracy of Face Recognition and Matching
                • V-B Performance of Scene Classification
                • V-C Performance of Gesture Recognition
                • V-D Overall Performance of Cardea Privacy Protection
                • V-E Runtime and Energy Consumption
                  • VI Related Work
                  • VII Conclusion and Future Work
                  • References
Page 8: Cardea: Context–Aware Visual Privacy Protection from ... · context–aware and interactive visual privacy protection frame-work that enforces privacy protection according to people’s

Others Ea Sh Wo Tr Pa EnScene groups

00

02

04

06

08

10

Rec

all

Yes No

(a) Recall for different scenes

Others Ea Sh Wo Tr Pa EnScene groups

00

02

04

06

08

10

Pre

cisi

on

Yes No

(b) precision for different scenes

2-4 4-6 6-8 8-10 10-12 12-14 14-16 16-21 21-26 gt26Hand region size (K square pixel)

00

02

04

06

08

10

Rec

all

Yes No

(c) Recall for different hand region sizes

Fig 7 Hand gesture recognition results

privacy profiles Now the face recognition model is trainedusing 1100 face feature vectors from 55 people including 50people from LFW dataset We take about 300 images andget processed images As we focus more on face recognitionrather than face detection we only keep images that faceshave been successfully detected In total we got 224 imagesfor evaluation

Table III shows the final privacy protection accuracy aswell as performance of each vision task The protection accu-racy shows 804 faces that require protection are actuallyprotected and 910 faces that do not ask for protectionremain unchanged Overall 864 faces are processed cor-rectly though the scene classification recall and ldquoNordquo gesturerecall do not reach 80 The reason is that protection deci-sion making process of Cardea sometimes can make up formistakes happening in the early step For example if userrsquosldquoNordquo gesture is not detected his face can still be protectedwhen the predict scene is selected in userrsquos profile

In summary Cardea achieves over 85 accuracy for users inthe real world Improvements of each vision part will directlybenefit Cardearsquos overall performance in the future

E Runtime and Energy Consumption

We validate the client side implementations on SamsungGalaxy Note 44 with 4times27 GHz Krait 450 CPU Qual-comm Snapdragon 805 Chipset 3GB RAM and 16 MP f22Camera The server side is configured with Intel i7ndash5820KCPU 16GB RAM GeForce 980Ti Graphic Card (6GB RAM)The client and server communicate via a TCP over WindashFiconnection

Figure 8 plots the time taken for Cardea to completedifferent vision tasks We take images with 1 face 2 facesand 5 faces in the size of 1920 times 1080 The images will becompressed in JPEG format On average the data sent is about950 KB Note some vision tasks will not be triggered in somesituations according to the decision workflow as explained inSection IV For example if no user is recognized in the imageall other tasks will not start For the purpose of measurementwe still activate all tasks to illustrate the runtime difference ofimages captured with varying number of faces

Among all vision tasks face processing (ie face detectionand feature extraction) and scene classification are performed

4httpwwwgsmarenacomsamsung galaxy note 4-6434php

1 face 2 faces 5 faces0

500

1000

1500

2000

2500

3000

Run

time

(mill

isec

ond)

Face detection amp feature extraction (client)Scene classification (client)Face recognition amp matching (server)Gesture recognition (server)Network

Fig 8 Task level runtime of Cardea

on the smartphone The face processing takes about 900milliseconds for 1 face and increases about 200 millisecondsper additional face Therefore it reaches about 1100 and1700 milliseconds for 2 faces and 5 faces respectively Thisis the most fundamental step that runs on the smartphonelocally due to privacy considerations Owing to the realndashtimeOpenCV face detection implementation and lightened CNNfeature extraction model it takes less then 13 overall runtimeThe scene classification takes around 900 milliseconds perimage as it only performs once independent of number ofpeople in the image The face recognition and matching taskson the server side take less than 30 milliseconds for fivepeople Though the time grows with increasing number ofpeople compared with other tasks they barely affect the overallruntime The gesture recognition also runs on the server andit takes about 330 milliseconds Similar to scene recognitionit performs on the whole image once regardless the numberof faces According to the measurement we find the networktransmission accounts for a majority of the overall runtimedue to the unstable network environment

In general photographers using Cardea to take picturescan get one processed image within 5 seconds in the mostheavy case (ie there is registered user who enables gesturesand scene classification will be triggered on the smartphone)Compared with existing image capture platforms that alsoprovide privacy protection such as I-pic [16] Cardea offersmore efficient image processing functionality

Next we measure the energy consumption of taking pictureswith Cardea on Galaxy Note 4 phone using the MonsoonPower Monitor [38] The images are also taken in size of1920times 1080 square pixels The first two columns of Table II

TABLE IIIENERGY CONSUMPTION OF CARDEA WITH DIFFERENT NUMBER OF FACES

Face recognition Whole process (uAh) of images

1 face 2172 (std 34) 11345 (std 459) sim 28002 faces 3441 (std 131) 12767 (std 1138) sim 25005 faces 6926 (std 365) 16411 (std 660) sim 2000

show the energy consumption for the face processing partonly and for the whole process (ie from taking a picture togetting the processed image) with 1 face 2 faces and 5 facesrespectively The screen stays on during the whole processtherefore a large portion of the energy consumption is dueto the alwaysndashon screen Moreover we can observe that faceprocessing energy is linear to face numbers All the other partsincluding scene classification sending and receiving data areindependent of the number of faces in the image which isconsistent with runtime measurements

Using energy measurements we also show Cardearsquos capac-ity on Galaxy Note 4 in the last column of Table II Thisdevice has a 3220 mAh battery therefore can capture about2000 high quality images with 5 faces using Cardea

VI RELATED WORK

Visual privacy protection has attracted extensive attentionsthese years due to increasing popularity of mobile and wear-able devices with builtndashin cameras As a result some researchworks have proposed solutions to address these privacy issues

Halderman et al [14] propose an approach in which closeddevices can encrypt data together during recording utilizingshort range wireless communication to exchange public keysand negotiate encryption key Only by obtaining all of thepermissions from people who encrypt the recording can onedecrypts it Juna et al [39] [40] present methods that thirdndashparty applications such as perceptual and augmented realityapplications have access to only higher-level objects such asa skeleton or a face instead of raw sensor feeds Raval et al[41] propose a system that gives users control to mark secureregions that the camera have access to therefore cameras areprevented from capturing sensitive information Unlike abovesolutions our work focuses on protecting bystandersrsquo privacyby respecting their privacy preferences when they are capturedin photos

To identify individuals who request privacy protectionadvanced techniques have been applied PriSurv [42] is avideo surveillance system that identifies objects using RFIDndashtags to protect the privacy of objects in the video surveillancesystem Courteous Glass [15] is a wearable camera integratedwith a far-infrared imager that turns off recording when newpersons or specific gestures are detected However theseapproaches require extra facilities or sensors that are notcurrently equipped on devices Our approach takes advantageof statendashofndashthendashart computer vision techniques that are reliableand effective

Some other efforts rely on visual markers Respectful Cam-eras [10] is a video surveillance system that requires people

who wish to remain anonymous wear colored markers suchas hats or vests and their faces will be blurred Bo et al[8] use QR code as privacy tags to link an individual withhis photo sharing preferences which empowers photo serviceproviders to exert privacy protection following usersrsquo policyexpressions to mitigate the publicrsquos privacy concerns Roesneret al [9] propose a general framework for controlling access tosensor data on continuous sensing platforms allowing objectsto explicitly specify access policies For example a personcan wear a QR code to communicate her privacy policy anda depth camera is used to find the person Unfortunatelyusing existing markers like QR code is technically feasiblybut lacks usability as few would wear QR code or otherconspicuous markers in public places Moreover static policiescannot satisfy peoplersquos privacy preferences all the time whichis contextndashdependent varying from each other and changingfrom time to time Therefore we allow individuals to updatetheir privacy profiles at anytime providing other cues toinform cameras of necessary protection

I-pic [16], an interesting recent work, allows people to broadcast their privacy preferences and appearance information to nearby devices using Bluetooth Low Energy (BLE); this work could be incorporated into our framework. Compared with I-pic, we specify context elements that have not been considered before, such as scene and the presence of others. Besides, we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera, whereas the data broadcast in I-pic may not be received by the people taking images, or the received data may be outdated.

VII. CONCLUSION AND FUTURE WORK

In this work, we designed, implemented, and evaluated Cardea, a visual privacy protection system that aims to address individuals’ visual privacy issues caused by pervasive cameras. Cardea leverages state–of–the–art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express their context–dependent privacy preferences in terms of location, scene, and the presence of other people in the image. Besides, people can show “Yes” and “No” gestures to dynamically modify their privacy preferences. We demonstrated the performance of the different vision tasks on pictures “in the wild”; overall, Cardea achieves about 86% accuracy on users’ protection requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which demonstrates the feasibility of running Cardea to take pictures while respecting bystanders’ privacy preferences.

Cardea can be enhanced in several aspects. Future work includes improving scene classification and hand gesture recognition performance, compressing the CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people’s visual privacy in video is another challenging and significant task, which will make Cardea a complete visual privacy protection framework.

REFERENCES

[1] “Microsoft HoloLens,” http://www.microsoft.com/microsoft-hololens/en-us
[2] “Google Glass,” http://www.google.com/glass/start
[3] “Narrative Clip,” http://www.getnarrative.com
[4] R. Shaw, “Recognition markets and visual privacy,” UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
[5] A. Acquisti, R. Gross, and F. Stutzman, “Face recognition and privacy in the age of augmented reality,” Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
[6] “Congress’s letter,” http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy
[7] “Data protection authorities’ letter,” https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp
[8] C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, “Privacy.tag: Privacy concern expressed and respected,” in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163–176.
[9] F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, “World-driven access control for continuous sensing,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169–1181.
[10] J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, “Respectful cameras: Detecting visual markers in real-time to address privacy concerns,” in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65–89.
[11] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, “Privacy behaviors of lifeloggers using wearable cameras,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571–582.
[12] R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, “Sensitive lifelogs: A privacy analysis of photos from wearable cameras,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645–1648.
[13] T. Denning, Z. Dehlawi, and T. Kohno, “In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies,” in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014, pp. 2377–2386.
[14] J. A. Halderman, B. Waters, and E. W. Felten, “Privacy management for portable recording devices,” in Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society. ACM, 2004, pp. 16–24.
[15] J. Jung and M. Philipose, “Courteous glass,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307–1312.
[16] P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, “I-pic: A platform for privacy-compliant image capture,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), vol. 16, 2016.
[17] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[18] “OpenCV,” http://opencv.org
[19] “Dlib,” http://dlib.net
[20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[21] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, “Labeled faces in the wild: A survey,” in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189–248.
[22] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[23] X. Wu, R. He, and Z. Sun, “A lightened CNN for deep face representation,” arXiv preprint arXiv:1511.02683, 2015.
[24] “CASIA-WebFace dataset,” http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[27] sh1r0, “Caffe-android-lib,” https://github.com/sh1r0/caffe-android-lib
[28] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places2 dataset project,” http://places2.csail.mit.edu/index.html
[30] BVLC, “Caffe,” http://caffe.berkeleyvision.org
[31] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[32] “VGG hand dataset,” http://www.robots.ox.ac.uk/~vgg/data/hands
[33] A. Mittal, A. Zisserman, and P. H. Torr, “Hand detection using multiple proposals,” in BMVC. Citeseer, 2011, pp. 1–11.
[34] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[35] “Faster R-CNN (Python implementation),” https://github.com/rbgirshick/py-faster-rcnn
[36] “LFW dataset,” http://vis-www.cs.umass.edu/lfw
[37] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
[38] “Monsoon Power Monitor,” https://www.msoon.com/LabEquipment/PowerMonitor
[39] S. Jana, A. Narayanan, and V. Shmatikov, “A Scanner Darkly: Protecting user privacy from perceptual applications,” in Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 349–363.
[40] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, “Enabling fine-grained permissions for augmented reality applications with recognizers,” in Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415–430.
[41] N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, “What you mark is what apps see,” in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[42] K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, “PriSurv: Privacy protected video surveillance system using adaptive visual abstraction,” in International Conference on Multimedia Modeling. Springer, 2008, pp. 144–154.


