
Page 1: Differentially Private Recommendation Systems

Differentially Private Recommendation Systems

Jeremiah Blocki, Fall 2009

18739A: Foundations of Security and Privacy

Page 2: Differentially Private Recommendation Systems

Netflix $1,000,000 Prize Competition

Queries: On a scale of 1 to 5 how would John rate “The Notebook” if he watched it?

User/Movie | … | 300     | The Notebook | …
…          | … | …       | …            | …
John       | … | 4       | Unrated      | …
Mary       | … | Unrated | Unrated      | …
Sue        | … | 2       | 5            | …
Joe        | … | 5       | 1            | …

Page 3: Differentially Private Recommendation Systems

Netflix Prize Competition

Note: N x M table is very sparse (M = 17,770 movies, N = 500,000 users)

To Protect Privacy:
• Each user was randomly assigned a globally unique ID
• Only 1/10 of the ratings were published
• The ratings that were published were perturbed a little bit

User/Movie | … | 13,537          | 13,538         | …
…          | … | …               | …              | …
258,964    | … | (4, 10/11/2005) | Unrated        | …
258,965    | … | Unrated         | Unrated        | …
258,966    | … | (2, 6/16/2005)  | (5, 6/18/2005) | …
258,967    | … | (5, 9/15/2005)  | (1, 4/28/2005) | …

Page 4: Differentially Private Recommendation Systems
Page 5: Differentially Private Recommendation Systems

Outline
• Breaking the anonymity of the Netflix prize dataset
• Recap: Differential Privacy, and define Approximate Differential Privacy
• Prediction Algorithms
• Privacy Preserving Prediction Algorithms
• Remaining Issues

Page 6: Differentially Private Recommendation Systems

Netflix FAQ: Privacy

Q: “Is there any customer information in the dataset that should be kept private?”

A: “No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy [. . . ] Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?”

Source: How to Break Anonymity of the Netflix Prize Dataset (Narayanan and Shmatikov)

Page 7: Differentially Private Recommendation Systems

De-Anonymizing the Netflix Dataset

IMDB
• The Internet Movie Database (IMDB) is a huge source of auxiliary information!

Challenge
• Even the same user may not provide the same ratings on IMDB and Netflix. The dates also may not correspond exactly.

Adversary
• Knows approximate ratings (±1)
• Knows approximate dates (14-day error)

Source: How to Break Anonymity of the Netflix Prize Dataset (Narayanan and Shmatikov)

Page 8: Differentially Private Recommendation Systems

De-Anonymizing the Netflix Dataset

Does anyone really care about the privacy of movie ratings?

Video Privacy Protection Act of 1988 exists, so some people do.

In some cases knowing movie ratings allows you to infer an individual’s political affiliation, religion, or even sexual orientation.

Source: How to Break Anonymity of the Netflix Prize Dataset (Narayanan and Shmatikov)

Page 9: Differentially Private Recommendation Systems

Privacy in Recommender Systems

Netflix might base its recommendation to me on both:
• My own rating history
• The rating history of other users

Maybe it isn’t such a big deal if the recommendations Netflix makes to me leak information about my own history, but they should preserve the privacy of other users.

It would be possible to learn about someone else’s rating history if we have auxiliary information. Active attacks are a source of auxiliary information.

Calandrino, et al. Don’t review that book: Privacy risks of collaborative filtering, 2009.

Page 10: Differentially Private Recommendation Systems

Recall Differential Privacy [Dwork et al. 2006]

A randomized sanitization function κ has ε-differential privacy if for all data sets D1 and D2 differing by at most one element and all subsets S of the range of κ,

Pr[κ(D1) ∈ S] ≤ e^ε Pr[κ(D2) ∈ S]

Page 11: Differentially Private Recommendation Systems

Achieving Differential Privacy

How much noise should be added?

Intuition: f(D) can be released accurately when f is insensitive to individual entries x1, …, xn (the more sensitive f is, the more noise must be added).

[Diagram: a user asks the database D = (x1, …, xn) “Tell me f(D)”; the database responds with f(D) + noise.]

Page 12: Differentially Private Recommendation Systems

Sensitivity of a function

The (L1) sensitivity of f is the largest amount f can change when one element of the database changes: Δf = max ‖f(D1) − f(D2)‖₁ over data sets D1, D2 differing by at most one element.

Examples: a count has sensitivity 1; a histogram has sensitivity 1.
Note: Δf depends only on the function f, not on the actual database.

Page 13: Differentially Private Recommendation Systems

Sensitivity with Laplace Noise
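
As a rough illustration (a minimal sketch, not necessarily the formulation used in the lecture): adding Laplace noise with scale Δf/ε to f(D) yields ε-differential privacy. The snippet below assumes numpy; the example query and its sensitivity are illustrative.

```python
import numpy as np

def laplace_mechanism(true_value, l1_sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise calibrated to the L1 sensitivity.

    Noise drawn from Lap(l1_sensitivity / epsilon) gives epsilon-differential
    privacy for the released value.
    """
    rng = rng or np.random.default_rng()
    scale = l1_sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query has L1 sensitivity 1, because adding or removing
# one record changes the count by at most 1.
noisy_count = laplace_mechanism(true_value=1234, l1_sensitivity=1.0, epsilon=0.1)
```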

Page 14: Differentially Private Recommendation Systems

Approximate Differential Privacy

A randomized sanitization function κ has (ε, δ)-differential privacy if for all data sets D1 and D2 differing by at most one element and all subsets S of the range of κ,

Pr[κ(D1) ∈ S] ≤ e^ε Pr[κ(D2) ∈ S] + δ

Source: Dwork et al 2006

Page 15: Differentially Private Recommendation Systems

Approximate Differential Privacy

One interpretation of approximate differential privacy is that the output of κ is unlikely to reveal much more information than ε-differential privacy would allow, but it is possible.

Key Difference: Approximate Differential Privacy does NOT require that:

Range(κ(D1)) = Range(κ(D2))

The privacy guarantees made by (ε,δ)-differential privacy are not as strong as ε-differential privacy, but less noise is required to achieve (ε,δ)-differential privacy.

Source: Dwork et al 2006

Page 16: Differentially Private Recommendation Systems

Achieving Approximate Differential Privacy

Key Differences:
• Use of the L2 norm instead of the L1 norm to define the sensitivity Δf
• Use of Gaussian noise instead of Laplacian noise

Source: Differentially Private Recommender Systems (McSherry and Mironov)
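
A minimal sketch of the Gaussian mechanism, using the standard calibration σ ≥ √(2 ln(1.25/δ)) · Δ₂f / ε (valid for ε < 1); this is an illustration, not necessarily the exact calibration used by McSherry and Mironov. Assumes numpy; the example statistic and its sensitivity are illustrative.

```python
import numpy as np

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta, rng=None):
    """Release true_value with Gaussian noise calibrated to the L2 sensitivity.

    With sigma = sqrt(2 * ln(1.25 / delta)) * l2_sensitivity / epsilon the
    release satisfies (epsilon, delta)-differential privacy (for epsilon < 1).
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(true_value))
    return true_value + noise

# Example: releasing a small vector-valued statistic with L2 sensitivity 1.
stat = np.array([3.2, 4.1, 2.7])
noisy_stat = gaussian_mechanism(stat, l2_sensitivity=1.0, epsilon=0.5, delta=1e-6)
```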

Page 17: Differentially Private Recommendation Systems

Differential Privacy for Netflix Queries

What level of granularity to consider? What does it mean for databases D1 and D2 to differ on at most one element?
• One user (column) is present in D1 but not in D2
• One rating (cell) is present in D1 but not in D2

Issue 1: Given a query “how would user i rate movie j?”, consider K(D − u[i]): how can it possibly be accurate?

Issue 2: If the definition of differing in at most one element is taken over cells, then what privacy guarantees are made for a user with many data points?

Page 18: Differentially Private Recommendation Systems

Netflix Predictions – High Level

Q(i,j): “How would user i rate movie j?”

The predicted rating may typically depend on:
• The global average rating over all movies and all users
• The average movie rating of user i
• The average rating of movie j
• The ratings user i gave to similar movies
• The ratings similar users gave to movie j

Intuition: Δf may be small for many of these queries

Page 19: Differentially Private Recommendation Systems

Personal Rating Scale

For Alice, a rating of 3 might mean the movie was really terrible. For Bob, the same rating might mean that the movie was excellent. How do we tell the difference?

Idea: compare each rating to that user’s own average, i.e., ask whether r_im − r̄_i is above or below 0 (r̄_i is user i’s average rating).

Page 20: Differentially Private Recommendation Systems

How do we tell if two users are similar?

Pearson’s Correlation is one metric for the similarity of users i and j:

S(i, j) = Σ_{m ∈ L_i ∩ L_j} (r_im − r̄_i)(r_jm − r̄_j)

where L_i and L_j are the sets of movies rated by users i and j, and r̄_i, r̄_j are their average ratings.

• Consider all movies rated by both users
• Negative value whenever i likes a movie that j dislikes
• Positive value whenever i and j agree

We can use similar metrics to measure the similarity between two movies.
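
A small sketch of the similarity sum above, assuming each user’s ratings are stored as a Python dict from movie id to rating; the function and argument names are illustrative, not the lecture’s code.

```python
def user_similarity(ratings_i, ratings_j, avg_i, avg_j):
    """Similarity S(i, j) over the movies rated by both users.

    ratings_i, ratings_j: dicts mapping movie id -> rating
    avg_i, avg_j: the two users' average ratings (used to center each rating)
    """
    common_movies = ratings_i.keys() & ratings_j.keys()
    return sum((ratings_i[m] - avg_i) * (ratings_j[m] - avg_j)
               for m in common_movies)
```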

Page 21: Differentially Private Recommendation Systems

Netflix Predictions Example

Collaborative Filtering: find the k nearest neighbors of user i who have rated movie j, by Pearson’s Correlation. Let kN(i, j) ⊆ {1, …, N} denote the k users most similar to i (by S(i, u)) who have rated movie j.

Predicted rating:

p_ij = r̄_i + (1/k) Σ_{u ∈ kN(i,j)} (r_uj − r̄_u)
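
A sketch of this k-nearest-neighbor prediction, reusing the user_similarity sketch from the previous slide; the nested-dict data layout and helper names are assumptions for illustration, not the lecture’s implementation.

```python
def predict_rating(i, j, ratings, averages, k):
    """Predict user i's rating of movie j from the k most similar users
    (by user_similarity) who have rated movie j.

    ratings:  dict user id -> dict movie id -> rating
    averages: dict user id -> that user's average rating
    """
    # Users other than i who have rated movie j.
    candidates = [u for u in ratings if u != i and j in ratings[u]]
    # Keep the k users most similar to user i.
    neighbors = sorted(
        candidates,
        key=lambda u: user_similarity(ratings[i], ratings[u],
                                      averages[i], averages[u]),
        reverse=True,
    )[:k]
    if not neighbors:
        return averages[i]
    # p_ij = avg_i + (1/k) * sum over neighbors of (r_uj - avg_u)
    return averages[i] + sum(ratings[u][j] - averages[u] for u in neighbors) / k
```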

Page 22: Differentially Private Recommendation Systems

Netflix Prediction Sensitivity Example

Pretend the query Q(i, j) included user i’s rating history. Recall the prediction

p_ij = r̄_i + (1/k) Σ_{u ∈ kN(i,j)} (r_uj − r̄_u)

At most one of the neighbors’ ratings changes, and the range of ratings is 4 (ratings run from 1 to 5), so the L1 sensitivity of the prediction is

Δp = 4/k

Page 23: Differentially Private Recommendation Systems

Similarity of Two Movies

Let U be the set of all users who have rated both movies i and j. Then

S(i, j) = Σ_{u ∈ U} (r_ui − r̄_u)(r_uj − r̄_u)

Page 24: Differentially Private Recommendation Systems

K-Nearest Users or K-Nearest Movies?

• Find the k most similar users to i that have also rated movie j?
• Find the k most similar movies to j that user i has rated?

Either way, after some pre-computation, we need to be able to find the k-nearest users/movies quickly!

Page 25: Differentially Private Recommendation Systems

Movie-Movie Covariance Matrix
• (M x M) matrix
• Cov[i][j] measures the similarity between movies i and j
• M ≈ 17,000

User-User Covariance Matrix?
• (N x N) matrix to measure the similarity between users
• N ≈ 500,000, so this is difficult to pre-compute or even store in memory!

Page 26: Differentially Private Recommendation Systems

What do we need to make predictions?

For a large class of prediction algorithms it suffices to have:
• Gavg – the global average rating
• Mavg – the average rating for each movie
• The average movie rating for each user
• The movie-movie covariance matrix (COV)

Page 27: Differentially Private Recommendation Systems

Differentially Private Recommender Systems (High Level)

To respect approximate differential privacy, publish:
• Gavg + NOISE
• Mavg + NOISE
• COV + NOISE

Don’t publish average ratings for users.

The sensitivities of Gavg and Mavg are very small, so they can be published with little noise; COV requires more care.

Source: Differentially Private Recommender Systems (McSherry and Mironov)
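
A rough sketch of publishing noisy Gavg and Mavg with Laplace noise. It treats the rating counts as public and uses the rating range (4) divided by the count as the per-average sensitivity; this is an illustrative simplification, not the exact calibration in McSherry and Mironov.

```python
import numpy as np

RATING_RANGE = 4.0  # ratings run from 1 to 5

def publish_noisy_averages(ratings_by_movie, epsilon, rng=None):
    """Publish a noisy global average (Gavg) and noisy per-movie averages (Mavg).

    ratings_by_movie: dict movie id -> list of ratings for that movie
    Changing one rating moves an average over n ratings by at most
    RATING_RANGE / n, so each average gets Laplace noise with scale
    RATING_RANGE / (epsilon * n).
    """
    rng = rng or np.random.default_rng()
    all_ratings = [r for rs in ratings_by_movie.values() for r in rs]
    g_avg = sum(all_ratings) / len(all_ratings)
    noisy_g_avg = g_avg + rng.laplace(scale=RATING_RANGE / (epsilon * len(all_ratings)))

    noisy_m_avg = {}
    for movie, rs in ratings_by_movie.items():
        m_avg = sum(rs) / len(rs)
        noisy_m_avg[movie] = m_avg + rng.laplace(scale=RATING_RANGE / (epsilon * len(rs)))
    return noisy_g_avg, noisy_m_avg
```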

Page 28: Differentially Private Recommendation Systems

Covariance Matrix Tricks

Trick 1: Center and clamp all ratings around averages. If we use clamped ratings, then we reduce the sensitivity of our function.
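
A minimal sketch of Trick 1, assuming ratings in [1, 5] and a clamping parameter B (the talk later uses B = 1); the names are illustrative.

```python
def center_and_clamp(rating, average, B=1.0):
    """Center a rating around an average and clamp the result to [-B, B].

    Clamping bounds how much any single rating can shift a sum built from
    the ratings, which in turn bounds the sensitivity of that sum.
    """
    centered = rating - average
    return max(-B, min(B, centered))
```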

Page 29: Differentially Private Recommendation Systems

Covariance Matrix Tricks

Trick 2: Carefully weight the contribution of each user to reduce the sensitivity of the function. Users who have rated more movies are assigned lower weight.
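
A sketch of how Trick 2 might look when accumulating the movie-movie covariance matrix: each user’s centered, clamped ratings are added with a weight that shrinks as the user rates more movies. The 1/(number of ratings) weight is an illustrative choice, not necessarily the exact weighting from the paper.

```python
import numpy as np

def add_user_to_covariance(cov, user_ratings, user_average, B=1.0):
    """Add one user's contribution to an (M x M) covariance accumulator.

    cov:          numpy array of shape (M, M), accumulated in place
    user_ratings: dict movie index -> rating for this user
    user_average: this user's average rating
    """
    movies = list(user_ratings)
    clamped = np.array([max(-B, min(B, user_ratings[m] - user_average))
                        for m in movies])
    weight = 1.0 / len(movies)  # users with many ratings contribute less per entry
    # Weighted outer product of this user's clamped, centered ratings.
    cov[np.ix_(movies, movies)] += weight * np.outer(clamped, clamped)
    return cov
```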

Page 30: Differentially Private Recommendation Systems

Publishing the Covariance Matrix

Theorem*: Pick …, then …

* Some details about stabilized averages are being swept under the rug

Page 31: Differentially Private Recommendation Systems

Results

Source: Differentially Private Recommender Systems (McSherry and Mironov)

Page 32: Differentially Private Recommendation Systems

Note About Results

Granularity: one rating present in D1 but not in D2
• Accuracy is much lower when one user is present in D1 but not in D2
• Intuition: given the query Q(i, j), the database D − u[i] gives us no history about user i

Approximate Differential Privacy
• Gaussian noise added according to the L2 sensitivity
• Clamped ratings (B = 1) to further reduce noise

Page 33: Differentially Private Recommendation Systems

Questions

• Could the results be improved if the analysis was tailored to more specific prediction algorithms?
• What if new movies are added? Can the movie-movie covariance matrix be changed without compromising privacy guarantees?
• How strong are the privacy guarantees given that there are potentially many data points clustered around a single user?