TRANSCRIPT
What’s the Gist? Privacy-Preserving Aggregation of User Profiles
Igor Bilogrevic (Google), Julien Freudiger (PARC), Emiliano De Cristofaro (UCL), Ersin Uzun (PARC)
Scott Kildall – Data Crystals
Data is the Crux of Internet Economy
Corporations seek personal data for better targeting
More data and more sensitive data
Data Brokers
Third Parties
Users
Credit card transactions, interests, political party, app usage, browsing history, mobility patterns, …
Issues with Current Approach
Privacy: What personal data is collected? How much, and how good is it?
Transparency: Who knows what about me?[1] Where does this data come from?
Remuneration: Users value their data, but they don’t get paid for it; data brokers do.
A Call for Transparency and Accountability (FTC, May 2014)
[1] aboutthedata.com
“This question calls for Acxiom to provide information that would reveal business practices that are of a highly competitive nature. Acxiom cannot provide a list of each entity that has provided data from, or about, consumers to us.”
ACXIOM
Julian Oliver - 2013
An Emerging Model
Data Brokers
Third Parties
Users
Participatory Data Brokers
Benefits: Users retain control over who accesses what about them, users decide which data can be monetized, and users get some of the revenue.
“What if Facebook paid you? Several startups envision an era in which we are all the brokers, and beneficiaries, of our own personal data.”
David Zax, “Is Personal Data the New Currency?”, MIT Technology Review
Our Contribution
What’s the Gist? A method for monetizing user personal data with privacy: users choose what to share, and brokers are not required to be trustworthy.
Idea: Rather than selling the data as-is, monetize a model of the data.
[Figure: user data (age) for User1: 22, User2: 56, User3: 43, User4: 33, …, aggregated into a probability density function (pdf) over age]
System Architecture
Third Party, Aggregator, Users

1. Query (third party to aggregator)
2. Select users
3. Queries (aggregator to users)
4. Extract features (at each user)
5. Noisy encrypted answers (users to aggregator)
6. Aggregate, decrypt, sample, and monetize
7. Answer (aggregator to third party)
Interactive mode: the customer queries for certain desired aggregates.
Batch mode: the aggregator prepares certain aggregates.
Users – Profile Computation
Each user i has a profile p_i with K attributes {a_i,j}.
Each element a_i,j is an integer representing a value or a preference.

p_i = (a_i,1, a_i,2, a_i,3, …, a_i,K)

Example: p_i = (28, 223, 5, 6, …, 2, 3) for the attributes (age, # of friends, action movies, drama movies, …, rock music, history books).
Users – Feature Computation
Features depend on the chosen probability model. For the Gaussian model, each user i computes

f_i = {[a_i,1, a_i,1²], …, [a_i,K, a_i,K²]}

Example: for p_i = (28, 223, 5, 6, …, 2, 3) (age, # of friends, action movies, drama movies, …, rock music, history books), the features are [28, 28²], [223, 223²], [5, 5²], [6, 6²], …, [2, 2²], [3, 3²].
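The per-user feature computation above is simple enough to sketch directly; a minimal Python illustration (function name and the example profile values are taken from the slide, everything else is illustrative):

```python
# Sketch of the per-user feature computation for the Gaussian model:
# for each attribute value a_ij, the user submits the pair [a_ij, a_ij^2],
# so the aggregator can later recover the first two moments (mean, variance).

def compute_features(profile):
    """profile: list of K integer attribute values a_i1 .. a_iK."""
    return [(a, a * a) for a in profile]

# Example profile from the slide: age 28, 223 friends, preference scores.
p_i = [28, 223, 5, 6, 2, 3]
f_i = compute_features(p_i)
print(f_i[0])  # (28, 784)
```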
Private Aggregation
Privacy: Differentially private noise r_i prevents the aggregator from deducing individual user data[1].
Security: The aggregator can only decrypt the sum. No shared secret, no pairwise distributed computations.
[Diagram: Users 1, …, i, …, n each compute a noisy encrypted contribution; the aggregator knows only the aggregation key and can decrypt nothing but the sum.]
[1] E Shi et al. Privacy-Preserving Aggregation of Time-Series Data. NDSS, 2011
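A toy sketch of the key-cancellation idea behind this kind of aggregation, in the spirit of the Shi et al. scheme: the real construction works in a DDH group with per-round hashed keys and formally calibrated noise, while this simplified version (all names and the modulus are illustrative assumptions) uses one-time additive masks that sum to zero.

```python
# Toy sketch of private sum aggregation via key cancellation:
# user keys plus the aggregator key sum to 0 mod P, so the aggregator
# learns nothing but the (noisy) sum of the contributions.
import random

P = 2**61 - 1  # public modulus (illustrative choice)

def keygen(n):
    user_keys = [random.randrange(P) for _ in range(n)]
    agg_key = (-sum(user_keys)) % P  # all keys cancel: their sum is 0 mod P
    return user_keys, agg_key

def encrypt(value, noise, key):
    # Each user submits value + differentially-private noise, masked by its key.
    return (value + noise + key) % P

def aggregate(ciphertexts, agg_key):
    # Adding the aggregator key cancels every user mask, revealing only the sum.
    return (sum(ciphertexts) + agg_key) % P

values = [22, 56, 43, 33]  # the slide's example ages
user_keys, agg_key = keygen(len(values))
cts = [encrypt(v, 0, k) for v, k in zip(values, user_keys)]  # noise omitted here
print(aggregate(cts, agg_key))  # 154
```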
Aggregator – Gaussian Approximation
Entities contribute Enc[a_1], Enc[a_1²], …, Enc[a_n], Enc[a_n²].
The broker aggregates them to compute the mean μ and variance σ².
It obtains a Gaussian approximation N(μ, σ²) for each attribute.
[Figure: pdf of the Gaussian approximation N(μ, σ²) over age]
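Recovering the Gaussian parameters from the two decrypted sums is a one-liner; a minimal sketch (function name is illustrative, the values reuse the slide's example ages):

```python
# Sketch: once the aggregator has decrypted S1 = sum(a_i) and S2 = sum(a_i^2)
# for an attribute, the Gaussian approximation N(mu, sigma^2) follows
# directly from the first two moments.

def gaussian_from_sums(s1, s2, n):
    mu = s1 / n
    var = s2 / n - mu * mu  # population variance from aggregated moments
    return mu, var

ages = [22, 56, 43, 33]
s1 = sum(ages)                  # 154
s2 = sum(a * a for a in ages)   # 6558
mu, var = gaussian_from_sums(s1, s2, len(ages))
print(mu)  # 38.5
```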
Aggregator - Attribute Ranking
Assumption: Attributes with a uniform distribution reveal less information about individual entities.

Measure divergence: the distance between two probability distributions, using the Jensen-Shannon (JS) divergence. A small JS distance from the uniform distribution means low value.
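The ranking metric can be sketched in a few lines; this is an illustrative implementation of JS divergence against the uniform baseline (function names are assumptions, not from the paper):

```python
# Sketch of the attribute-ranking metric: Jensen-Shannon divergence between
# an attribute's estimated distribution and the uniform distribution.
# A small JS distance (close to uniform) is interpreted as low value.
import math

def kl(p, q):
    # Kullback-Leibler divergence in bits; 0*log(0/q) is treated as 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(js_divergence(uniform, uniform))  # 0.0
print(round(js_divergence(skewed, uniform), 3))
```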
Performance
Dataset and implementation: 100,000 real users from the U.S. Census [data.gov, July 2013]; 3 types of attributes (income, education, age); Java, measurements on a Core i5 2.53 GHz, 8 GB RAM.

Metrics: accuracy of the Gaussian approximation, information leakage for each attribute, revenue, and overhead.
[Figure: Gaussian approximations for income, education, and age, for 100 users, 1,000 users, and 100,000 users]
Gaussian Approximations
Accuracy improves quickly with the number of users (100 users already give a good fit).
The fit for income and age is 3x better than for education.
Information Leakage vs Uniform
Maximum information leakage is reached at about 1,000 users.
Information leakage does not necessarily increase with the number of users (it stabilizes after a while).
Larger user samples do not necessarily provide better discriminating features.
Revenue Model
Value of user information: from $0.0005[2] to $33[1]
[Revenue formula not captured], where w = 0.1 is the broker's commission.
[1] J. P. Carrascal, C. Riederer, V. Erramilli, M. Cherubini, and R. de Oliveira. Your browsing behavior for a Big Mac: Economics of personal information online. WWW, 2013.
[2] L. Olejnik, T. Minh-Dung, and C. Castelluccia. Selling off privacy at auction. NDSS, 2014.
Revenue per Attribute (three privacy sensitivity distributions)

User revenue is small and does not increase with the number of participants; it is similar to Amazon Mechanical Turk rates.
The broker is incentivized to collect as many users as possible ($0.07 to $2,897).
Third parties are incentivized to select a demographic group of size 100.
Overhead
User: 1 ms total, independent of the number of users.
Aggregator: 1.5 min for 100 users, 27.7 h for 100,000 users; this can and should be parallelized.
Related Work
Privacy-preserving aggregation
A modified version of the Paillier encryption scheme[1,2], but it requires P2P communications between participants.
Homomorphic encryption and differential privacy[3,4], but the differential privacy is applied by a third party, and contributions are linkable to users before aggregation.
[1] Z. Erkin and G. Tsudik. Private computation of spatial and temporal power consumption with smart meters. ACNS, 2012.
[2] E. Shi, R. Zhang, Y. Liu, and Y. Zhang. PriSense: privacy-preserving data aggregation in people-centric urban sensing systems. INFOCOM, 2010.
[3] R. Chen, I. E. Akkus, and P. Francis. SplitX: high-performance private analytics. SIGCOMM, 2013.
[4] R. Chen, A. Reznichenko, P. Francis, and J. Gehrke. Towards statistical queries over distributed private user data. NSDI, 2012.
Related Work
Privacy-preserving monetization
Local user profile generation, categorization, and ad selection[1,2]
Anonymizing proxies to shield users’ behavioral data from third parties[3]
[1] V. Toubiana, A. Narayanan, D. Boneh, H. Nissenbaum, and S. Barocas. Adnostic: Privacy preserving targeted advertising. NDSS, 2010.
[2] S. Guha, B. Cheng, and P. Francis. Privad: practical privacy in online advertising. NSDI, 2011.
[3] C. Riederer, V. Erramilli, A. Chaintreau, B. Krishnamurthy, and P. Rodriguez. For sale: your data: by: you. HotNets, 2011.
Conclusion
Designed a method to monetize sensitive data with privacy.
If data is the new currency, we are creating a marketplace for it.
The evaluation shows practical performance, good accuracy with as few as 100 users, and good incentives for the parties involved.
Future work: enhance security features (range checks to thwart pollution attacks, fault tolerance, efficient key establishment); enable targeting of users after aggregation; enable subsequent collection of more than the model (i.e., black swans).