SPACE, WEB MINING and PRIVACY: foes or friends? – with special attention to location ( ) privacy. Bettina Berendt, Dept. Computer Science, K.U. Leuven


Post on 12-Jan-2016


Page 1: – with special attention  to location (               ) privacy

SPACE, WEB MINING and PRIVACY: foes or friends?

– with special attention to location ( ) privacy

Bettina Berendt

Dept. Computer Science

K.U. Leuven

Page 2: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 3: – with special attention  to location (               ) privacy

BASICS

Page 4: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 5: – with special attention  to location (               ) privacy

5

What is Web Mining? And who am I?

Knowledge discovery (aka Data mining):

"the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Web Mining: the application of data mining techniques on the content, (hyperlink) structure, and usage of Web resources.

Web mining areas:

• Web content mining

• Web structure mining

• Web usage mining

Navigation, queries, content access & creation

Page 6: – with special attention  to location (               ) privacy

6

Why Web / data mining?

"the database of intentions" (J. Battelle)

Page 7: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 8: – with special attention  to location (               ) privacy

Location-based services and augmented reality

www.poynt.com

Page 9: – with special attention  to location (               ) privacy

Semiotically augmented reality: semapedia and related ideas

Page 10: – with special attention  to location (               ) privacy

Mobile Social Web

Page 11: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 12: – with special attention  to location (               ) privacy

What's special about spatial information? 1. Interpreting

Rich inferences from spatial position to personal properties and/or identity possible:

• Pos(A, 9-17) = P1 → workplace(A, P1)

• Pos(A, 20-6) = P2 → home(A, P2)

An even richer "database of intentions"?!

• Pos(A, now) = P3 & temp(P3, now, hot) → wants(A, ice-cream) (location-based services)

• Pos(A, t in 13-18) = Pos(Demonstration, 13-18) → suspicious(A) (ex. Dresden phone surveillance case 2011)
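The first two inference rules can be sketched in a few lines. This is only an illustration of how little is needed: the position log, the place IDs and the helper `most_frequent_place` are invented for the example, and "most frequent place during the interval" is an assumed reading of Pos(A, 9-17) = P1.

```python
from collections import Counter

def most_frequent_place(observations, hours):
    """Return the place seen most often during the given hours, or None."""
    counts = Counter(place for hour, place in observations if hour in hours)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical position log for person A: (hour of day, place ID).
log_a = [(9, "P1"), (11, "P1"), (14, "P1"), (16, "P1"), (13, "P3"),
         (21, "P2"), (23, "P2"), (2, "P2"), (5, "P2")]

WORK_HOURS = set(range(9, 17))                        # 9-17, as on the slide
NIGHT_HOURS = set(range(20, 24)) | set(range(0, 6))   # 20-6, as on the slide

workplace_a = most_frequent_place(log_a, WORK_HOURS)   # rule 1: workplace(A, P1)
home_a = most_frequent_place(log_a, NIGHT_HOURS)       # rule 2: home(A, P2)
print(workplace_a, home_a)
```

Nothing in the log says "work" or "home"; the labels come purely from combining position with time of day.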

Page 13: – with special attention  to location (               ) privacy

What's special about spatial information? 2. Sending, or: Opt-out impossible?!

• Physically: You cannot be nowhere. Corollary: You cannot be in two places at once → limits on identity-building

• Contractually: Rental car with tracking, ...

• Culturally I: Opt-out may preclude basics of identity construction (no mobile phone/internet communication)

• Culturally II: Opt-out considered suspicious in itself (ex. A. Holm surveillance case 2007)

Page 14: – with special attention  to location (               ) privacy

FOES ?

Page 15: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 16: – with special attention  to location (               ) privacy

16

Behaviour on the Web (and elsewhere)

Data

Page 17: – with special attention  to location (               ) privacy

17

(Web) data analysis and mining

Data

Privacy problems!

Page 18: – with special attention  to location (               ) privacy

18

Technical background of the problem:

• The dataset allows for Web mining (e.g., which search queries lead to which site choices),

• it violates k-anonymity (e.g. "Lilburn" → a likely k = #inhabitants of Lilburn)
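To make the k-anonymity point concrete, here is a minimal sketch that computes k as the size of the smallest group of records sharing the same quasi-identifier values. The records, the attribute names and the function `k_anonymity` are invented for illustration; they are not taken from the dataset the slide refers to.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Invented toy records: each row stands for one user of a search log.
records = [
    {"town": "Lilburn", "age_band": "60-69", "query": "landscapers"},
    {"town": "Lilburn", "age_band": "30-39", "query": "car rental"},
    {"town": "Atlanta", "age_band": "30-39", "query": "weather"},
    {"town": "Atlanta", "age_band": "30-39", "query": "news"},
]

k_town = k_anonymity(records, ["town"])                  # town alone
k_town_age = k_anonymity(records, ["town", "age_band"])  # town plus age band
print(k_town, k_town_age)
```

Each additional attribute that can be read out of the data can only keep k the same or shrink it, which is why a town name is merely the starting point of a re-identification.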

Page 19: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 20: – with special attention  to location (               ) privacy

Inferences

Data mining / machine learning: inductive learning of models ("knowledge") from data

Privacy-relevant:

• (Re-)identification: inferences towards identity

• Profiling: inferences towards properties

• Application of the inferred knowledge

Page 21: – with special attention  to location (               ) privacy

21

What is identity merging? Or: Is this the same person?

Page 22: – with special attention  to location (               ) privacy

22

Data integration: an example

Paper published by the MovieLens team (collaborative-filtering movie ratings), who were considering publishing a ratings dataset, see http://movielens.umn.edu/

• Public dataset: users mention films in forum posts

• Private dataset (may be released e.g. for research purposes): users' ratings

• Film IDs can easily be extracted from the posts

• Observation: Every user will talk about items from a sparse relation space (those – generally few – films s/he has seen)

[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06]

Generalisation with more robust de-anonymization attacks and different data: [Narayanan, A., & Shmatikov, V. (2009). De-anonymizing social networks. In Proc. 30th IEEE Symposium on Security and Privacy 2009]

Page 23: – with special attention  to location (               ) privacy

23

Merging identities – the computational problem

• Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset

• Rank these users u by their likelihood of being t

• Evaluate: If t is in the top k of this list, then t is k-identified

• Count percentage of users who are k-identified

• E.g. measure likelihood by TF.IDF (m: item)
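The ranking and the k-identification test can be sketched as follows. This is a simplified stand-in, not the scoring used in the cited paper: the likelihood of u being t is approximated by the summed IDF weight of the items that t mentioned and u rated, and all user IDs, film names and function names are hypothetical.

```python
import math

def idf_weights(ratings):
    """IDF per item m: log(#users / #users who rated m). Rare items weigh more."""
    n_users = len(ratings)
    counts = {}
    for items in ratings.values():
        for m in items:
            counts[m] = counts.get(m, 0) + 1
    return {m: math.log(n_users / c) for m, c in counts.items()}

def rank_candidates(mentions, ratings, idf):
    """Rank ratings-users by the summed IDF of items shared with the target's mentions."""
    scores = {u: sum(idf.get(m, 0.0) for m in mentions & items)
              for u, items in ratings.items()}
    return sorted(scores, key=lambda u: (-scores[u], u))  # ties broken by user ID

def is_k_identified(target, mentions, ratings, idf, k):
    """True if the target's own ratings account is among the top k candidates."""
    return target in rank_candidates(mentions, ratings, idf)[:k]

# Hypothetical private ratings dataset: user -> set of rated films.
ratings = {
    "u1": {"Film A", "Film B", "Film C"},
    "u2": {"Film A", "Film D"},
    "u3": {"Film A", "Film B", "Film E"},
    "u4": {"Film A", "Film F"},
}
# Hypothetical public forum posts: user -> set of films mentioned.
forum_mentions = {"u1": {"Film A", "Film C"}, "u3": {"Film B"}}

idf = idf_weights(ratings)

def share_k_identified(k):
    """Fraction of forum users who are k-identified."""
    hits = sum(is_k_identified(t, m, ratings, idf, k)
               for t, m in forum_mentions.items())
    return hits / len(forum_mentions)

print(share_k_identified(1), share_k_identified(2))
```

In this toy data a single mention of a rarely rated film is enough to single out u1, while u3, who only mentions a film rated by two users, is 2-identified but not 1-identified.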

Page 24: – with special attention  to location (               ) privacy

24

Results

Page 25: – with special attention  to location (               ) privacy

25

What do you think helps?

Page 26: – with special attention  to location (               ) privacy

26

What is classification (and prediction)?

Page 27: – with special attention  to location (               ) privacy

27

Predicting political affiliation from Facebook profile and link data (1): Most Conservative Traits

Trait Name   Trait Value                          Weight Conservative
Group        george w bush is my homeboy          45.88831329
Group        college republicans                  40.51122488
Group        texas conservatives                  32.23171423
Group        bears for bush                       30.86484689
Group        kerry is a fairy                     28.50250433
Group        aggie republicans                    27.64720818
Group        keep facebook clean                  23.653477
Group        i voted for bush                     23.43173116
Group        protect marriage one man one woman   21.60830487

Lindamood et al. 09 & Heatherly et al. 09

Page 28: – with special attention  to location (               ) privacy

28

Predicting political affiliation from Facebook profile and link data (2): Most Liberal Traits per Trait Name

Trait Name            Trait Value               Weight Liberal
activities            amnesty international     4.659100601
Employer              hot topic                 2.753844959
favorite tv shows     queer as folk             9.762900035
grad school           computer science          1.698146579
hometown              mumbai                    3.566007713
Relationship Status   in an open relationship   1.617950632
religious views       agnostic                  3.15756412
looking for           whatever i can get        1.703651985

Lindamood et al. 09 & Heatherly et al. 09

Page 29: – with special attention  to location (               ) privacy

29

What is collaborative filtering?

"People like what people like them like"

Page 30: – with special attention  to location (               ) privacy

30

User-based Collaborative Filtering

Idea: People who agreed in the past are likely to agree again

To predict a user’s opinion for an item, use the opinion of similar users

Similarity between users is decided by looking at their overlap in opinions for other items

Page 31: – with special attention  to location (               ) privacy

31

Example: User-based Collaborative Filtering

         Item 1   Item 2   Item 3   Item 4   Item 5
User 1   8        1        ?        2        7
User 2   2        ?        5        7        5
User 3   5        4        7        4        7
User 4   7        1        7        3        8
User 5   1        7        4        6        5
User 6   8        3        8        3        7

Page 32: – with special attention  to location (               ) privacy

32

Similarity between users

         Item 1   Item 2   Item 3   Item 4   Item 5
User 1   8        1        ?        2        7
User 2   2        ?        5        7        5
User 4   7        1        7        3        8

• How similar are users 1 and 2?

• How similar are users 1 and 4?

• How do you calculate similarity?

Page 33: – with special attention  to location (               ) privacy

33

Popular similarity measures

• Cosine-based similarity

• Adjusted cosine-based similarity

• Correlation-based similarity
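As a worked example, two of these measures can be computed for the users from the preceding rating table. The sketch assumes that similarity is taken over the items both users have rated, which is one common convention and is not stated on the slide. Adjusted cosine, which subtracts each user's mean rating before taking the cosine and is mostly used for item-item comparisons, is left out to keep the example short.

```python
import math

# Ratings from the example table; None marks the unknown entries ("?").
u1 = [8, 1, None, 2, 7]
u2 = [2, None, 5, 7, 5]
u4 = [7, 1, 7, 3, 8]

def co_rated(a, b):
    """Pairs of ratings for the items that both users have rated."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]

def cosine(a, b):
    """Cosine of the angle between the two rating vectors (co-rated items only)."""
    pairs = co_rated(a, b)
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Correlation-based similarity: cosine of the mean-centred ratings."""
    pairs = co_rated(a, b)
    mean_a = sum(x for x, _ in pairs) / len(pairs)
    mean_b = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    den = math.sqrt(sum((x - mean_a) ** 2 for x, _ in pairs)
                    * sum((y - mean_b) ** 2 for _, y in pairs))
    return num / den

for name, other in (("User 2", u2), ("User 4", u4)):
    print(name, round(cosine(u1, other), 3), round(pearson(u1, other), 3))
```

On this data the correlation separates the two cases more sharply than the plain cosine: it is negative for users 1 and 2 and strongly positive for users 1 and 4, whereas the cosine is positive in both cases because all ratings are positive numbers.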

Page 34: – with special attention  to location (               ) privacy

34

Algorithm 1: using entire matrix

[Figure: rating matrix; extracted values: 5, 4, 7, 7, 8]

Aggregation function: often weighted sum

Weight depends on similarity

Page 35: – with special attention  to location (               ) privacy

35

Algorithm 2: K-Nearest-Neighbour

[Figure: rating matrix; extracted values: 5, 4, 7, 7, 8]

Aggregation function: often weighted sum

Weight depends on similarity

Neighbours are people who have historically had the same taste as our user
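A minimal sketch of the K-nearest-neighbour variant, applied to the missing rating of User 1 for Item 3 in the example matrix. Several choices here are assumptions rather than something the slides specify: Pearson correlation as the similarity, k = 2, only neighbours with positive similarity, and a weighted sum normalised by the total weight.

```python
import math

ratings = {
    "User 1": [8, 1, None, 2, 7],
    "User 2": [2, None, 5, 7, 5],
    "User 3": [5, 4, 7, 4, 7],
    "User 4": [7, 1, 7, 3, 8],
    "User 5": [1, 7, 4, 6, 5],
    "User 6": [8, 3, 8, 3, 7],
}

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    mean_a = sum(x for x, _ in pairs) / len(pairs)
    mean_b = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    den = math.sqrt(sum((x - mean_a) ** 2 for x, _ in pairs)
                    * sum((y - mean_b) ** 2 for _, y in pairs))
    return num / den if den else 0.0

def predict(user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated `item`."""
    candidates = [(pearson(ratings[user], r), name)
                  for name, r in ratings.items()
                  if name != user and r[item] is not None]
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not neighbours:
        return None, []          # no positively correlated neighbour available
    total = sum(sim for sim, _ in neighbours)
    value = sum(sim * ratings[name][item] for sim, name in neighbours) / total
    return value, [name for _, name in neighbours]

# Item 3 is index 2 in the zero-based rating lists.
prediction, neighbours = predict("User 1", item=2)
print(neighbours, round(prediction, 2))
```

Setting k to the number of other users turns this into the "entire matrix" variant, still restricted to positively correlated users under the assumption made here.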

Page 36: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 37: – with special attention  to location (               ) privacy

Summary: Lots of data → lots of privacy threats (and opportunities)

• The Web incites one of the semiotically richest (and often machine-processable) types of interaction

• Space incites data-rich types of interaction

→ two rich sources of "the database of intentions"

Page 38: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 39: – with special attention  to location (               ) privacy

How many people see an ad?

Television: sample viewers, extrapolate to population

Web: count viewers/clickers through clickstream

City streets: count pedestrians / motorists? Too many streets! → Solution intuition: sample streets, predict

Page 40: – with special attention  to location (               ) privacy

Fraunhofer IAIS (2007): predict frequencies based on similar streets

Street segments modelled as vectors:

• Spatial / geometric information

• Type of street, direction, speed class, …

• Demographic, socio-economic data about vicinity

• Nearby points of interest (buffer around segment, count #POI)

KNN algorithm:

• Frequency of a street segment = weighted sum of frequencies from most similar k segments in sample

• Dynamic + selective calculation of distance to counter the huge numbers of segments and measurements
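The KNN step can be sketched as follows. The feature encoding, the sample values and the inverse-distance weighting are assumptions made for illustration; the slide only states that the prediction is a weighted sum over the k most similar sampled segments.

```python
import math

def predict_frequency(query, sample, k=2):
    """Inverse-distance-weighted average of the k nearest sampled segments."""
    nearest = sorted(sample, key=lambda s: math.dist(query, s[0]))[:k]
    # The small constant avoids a division by zero for identical vectors.
    weights = [1.0 / (math.dist(query, vec) + 1e-9) for vec, _ in nearest]
    weighted = sum(w * freq for w, (_, freq) in zip(weights, nearest))
    return weighted / sum(weights)

# Hypothetical sample: (feature vector, measured passers-by per hour).
# Features, all pre-scaled to [0, 1]: street type code, speed class, #POI in buffer.
sample = [
    ((0.0, 0.2, 0.9), 400.0),   # pedestrian zone, many points of interest
    ((0.0, 0.3, 0.8), 360.0),   # similar street nearby
    ((1.0, 0.9, 0.1), 30.0),    # arterial road, few points of interest
]

# A street segment without measurements.
query = (0.0, 0.25, 0.85)
estimate = predict_frequency(query, sample, k=2)
print(round(estimate))
```

The "dynamic + selective calculation of distance" mentioned on the slide is not modelled; this sketch computes every distance, which would not scale to the numbers of segments the slide alludes to.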

Page 41: – with special attention  to location (               ) privacy

SPACE

WEB MINING

PRIVACY

Page 42: – with special attention  to location (               ) privacy

IP filtering: a deterministic classification model

IP → country
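"Deterministic classification" here simply means a lookup: every address falls into at most one known network and receives that network's label. A minimal sketch with Python's standard `ipaddress` module follows; the table is hypothetical (the networks are reserved documentation ranges and the country codes are invented).

```python
import ipaddress

# Hypothetical lookup table: reserved documentation networks with
# invented country codes, for illustration only.
NETWORK_TO_COUNTRY = [
    (ipaddress.ip_network("192.0.2.0/24"), "BE"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
    (ipaddress.ip_network("203.0.113.0/24"), "US"),
]

def country_of(ip_string):
    """Deterministic classifier IP -> country: the first matching network wins."""
    address = ipaddress.ip_address(ip_string)
    for network, country in NETWORK_TO_COUNTRY:
        if address in network:
            return country
    return None  # address not covered by the table

print(country_of("192.0.2.17"), country_of("10.0.0.1"))
```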

Page 43: – with special attention  to location (               ) privacy

43

Where do people live who will buy the Koran soon?

Technical background of the problem:

• A mashup of different data sources

  • Amazon wishlists

  • Yahoo! People (addresses)

  • Google Maps

each with insufficient k-anonymity, allows for attribute matching and thereby inferences

Page 44: – with special attention  to location (               ) privacy

Traffic Prediction: space data + Web data + ...

Multiple views on traffic:

• Incident reports

• Weather

• Major events

Operator ID: Nick
Heading: INCIDENT
Message: INCIDENT INFORMATION
Cleared 1637: I-405 SBJS I-90 ACC BLK RL CCTV1623 – WSP, FIR ON SCENE

• Event store

• Learning

• Reasoning

E.g. LARKC project: I. Celino, D. Dell'Aglio, E. Della Valle, R. Grothmann, F. Steinke and V. Tresp: Integrating Machine Learning in a Semantic Web Platform for Traffic Forecasting and Routing. IRMLeS 2011 Workshop at ESWC 2011.

Page 45: – with special attention  to location (               ) privacy

PEACEFUL COEXISTENCE ?

Page 46: – with special attention  to location (               ) privacy

46

Recall (a simple view): Cryptographic privacy solutions

Data

not all !

Page 47: – with special attention  to location (               ) privacy

47

"Privacy-preserving data mining"

Data

not all !

Page 48: – with special attention  to location (               ) privacy

48

Privacy-preserving data mining (PPDM)

Database inference problem: "The problem that arises when confidential information can be derived from released data by unauthorized users"

Objective of PPDM: "develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process."

Approaches:

• Data distribution

  • Decentralized holding of data

• Data modification

  • Aggregation/merging into coarser categories

  • Perturbation, blocking of attribute values

  • Swapping values of individual records

  • Sampling

• Data or rule hiding

  • Push the support of sensitive patterns below a threshold

Page 49: – with special attention  to location (               ) privacy

49

Example 1: Collaborative filtering

Page 50: – with special attention  to location (               ) privacy

50

Collaborative filtering: idea and architecture

Basic idea of collaborative filtering: "Users who liked this also liked ..."; generalize from "similar profiles"

Standard solution: at the community site / centralized:

• Compute, from all users and their ratings/purchases, etc., a global model

• To derive a recommendation for a given user: find "similar profiles" in this model and derive a prediction

Mathematically: depends on simple vector computations in the user-item space

Page 51: – with special attention  to location (               ) privacy

51

Distributed data mining / secure multi-party computation: The principle explained by secure sum

Given a number of values $x_1, \dots, x_n$ belonging to n entities,

compute $\sum_{i=1}^{n} x_i$

such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data)
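One common textbook construction for secure sum passes a masked running total around a ring of parties. The simulation below is a sketch of that construction and is not necessarily the protocol the slide had in mind. It assumes non-negative integer inputs whose true sum is smaller than the modulus, and honest parties that do not collude; the function name `secure_sum` and the modulus are illustrative.

```python
import secrets

def secure_sum(values, modulus=2**32):
    """Simulate the ring-based secure-sum protocol for non-negative integers."""
    mask = secrets.randbelow(modulus)        # random mask, known only to party 0
    running = (mask + values[0]) % modulus   # party 0 sends a masked value onwards
    for x in values[1:]:                     # every other party sees only `running`,
        running = (running + x) % modulus    # adds its own input and passes it on
    return (running - mask) % modulus        # party 0 removes the mask again

inputs = [5, 7, 11]                          # private inputs x_1, ..., x_n
total = secure_sum(inputs)
print(total)
```

Because the mask is uniformly random, each intermediate value looks random to the party receiving it. The scheme breaks down if the two neighbours of a party collude, since the difference between what one sent and the other received is exactly that party's input.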

Page 52: – with special attention  to location (               ) privacy

52

Canny: Collaborative filtering with privacy

• Each user starts with their own preference data, and knowledge of who their peers are in their community.

• By running the protocol, users exchange various encrypted messages.

• At the end of the protocol, every user has an unencrypted copy of the linear model Λ, ψ of the community's preferences.

• They can then use this to extrapolate their own ratings.

• At no stage does unencrypted information about a user's preferences leave their own machine.

• Users outside the community can request a copy of the model Λ, ψ from any community member, and derive recommendations for themselves.

Canny (2002), Proc. IEEE Symp. Security and Privacy; Proc. SIGIR

Page 53: – with special attention  to location (               ) privacy

53

Ex. 2: Frequent itemset mining

Page 54: – with special attention  to location (               ) privacy

54

Generating large k-itemsets with Apriori

Min. support = 40%

Transaction ID   Attributes (basket items)
1                Spaghetti, tomato sauce
2                Spaghetti, bread
3                Spaghetti, tomato sauce, bread
4                bread, butter
5                bread, tomato sauce

Step 1: candidate 1-itemsets

• Spaghetti: support = 3 (60%)

• tomato sauce: support = 3 (60%)

• bread: support = 4 (80%)

• butter: support = 1 (20%)
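The level-wise search can be sketched on exactly these five baskets. This is a compact illustration of the Apriori idea (join frequent (k-1)-itemsets, prune candidates with an infrequent subset, count support), not an optimised implementation; the function names are illustrative.

```python
from itertools import combinations

transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(min_support):
    """Return all frequent itemsets with their support, found level by level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_support]
    return frequent

result = apriori(min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a minimum support of 40%, butter is dropped at the first level, so no candidate containing butter is ever counted. All three pairs of the remaining items are frequent, but the triple occurs in only one basket and is rejected.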

Page 55: – with special attention  to location (               ) privacy

55

1-itemsets: spaghetti; tomato sauce; bread; butter

2-itemsets: spaghetti, tomato sauce; spaghetti, bread; spaghetti, butter; tomato sauce, bread; tomato sauce, butter; bread, butter

3-itemsets: spaghetti, tomato sauce, bread; spaghetti, tomato sauce, butter; spaghetti, bread, butter; tomato sauce, bread, butter

4-itemset: spaghetti, tomato sauce, bread, butter

Page 56: – with special attention  to location (               ) privacy

56


Page 57: – with special attention  to location (               ) privacy

57


Page 58: – with special attention  to location (               ) privacy

58


Page 59: – with special attention  to location (               ) privacy

How many people see an ad? Next steps ...

• Not only ads, but personalized ads

• Ad sequences? → need to know trajectories

• Single trajectories: highly privacy-sensitive data

• Aggregate (e.g. frequent) trajectories also interesting for other applications – e.g., traffic planning

Page 60: – with special attention  to location (               ) privacy

Privacy-preserving frequent-route mining by data coarsening: intuition

Page 61: – with special attention  to location (               ) privacy

Ex.: Gidófalvi et al. (2007): Privacy-preserving data mining on moving object trajectories

Basic strategy: Aggregation/merging into coarser categories, performed by client

Anonymization rectangles satisfying (areasize, maxLocProb): <R, t_s, t_e>

→ allows inference of location probability of R

Page 62: – with special attention  to location (               ) privacy

Coarsened trajectories

Page 63: – with special attention  to location (               ) privacy

Time interval probabilistically frequent route queries

Split trajectories inside query time interval into m sub-trajectories of equal time length

→ trajectory = set/sequence of spatio-temporal grid cell IDs, each associated with a location probability = transaction of items (X,P)

Transaction p-satisfies itemset Y if Y ⊆ X and for all i in Y ∩ X: i.prob ≥ min_prob

p-support of an item(set), i.count: #TAs that p-satisfy the item(set)

frequent routes := maximal p-frequent itemsets, found with a frequent-itemset miner (can be discontinuous)

Extension to frequent sequence mining?!
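The definitions of p-satisfaction, p-support and maximal p-frequent itemsets can be sketched on a toy example. Everything concrete here is assumed: the cell IDs such as "c1@t0" (grid cell c1 in time slot t0), the probabilities, the thresholds, and the brute-force enumeration, which stands in for a real frequent-itemset miner and is only feasible for very small inputs.

```python
from itertools import combinations

def p_satisfies(transaction, itemset, min_prob):
    """All cells of the itemset occur in the transaction with prob >= min_prob."""
    return all(transaction.get(cell, 0.0) >= min_prob for cell in itemset)

def p_support(transactions, itemset, min_prob):
    """Number of transactions that p-satisfy the itemset."""
    return sum(p_satisfies(t, itemset, min_prob) for t in transactions)

def maximal_p_frequent(transactions, min_prob, min_count):
    """Brute force: p-frequent itemsets that have no p-frequent proper superset."""
    cells = sorted({c for t in transactions for c in t})
    frequent = [frozenset(c)
                for n in range(1, len(cells) + 1)
                for c in combinations(cells, n)
                if p_support(transactions, c, min_prob) >= min_count]
    return [s for s in frequent if not any(s < other for other in frequent)]

# Hypothetical coarsened trajectories: spatio-temporal cell ID -> location probability.
trajectories = [
    {"c1@t0": 0.9, "c2@t1": 0.8, "c3@t2": 0.7},
    {"c1@t0": 0.8, "c2@t1": 0.6, "c4@t2": 0.9},
    {"c1@t0": 0.3, "c2@t1": 0.9, "c3@t2": 0.8},
]

routes = maximal_p_frequent(trajectories, min_prob=0.5, min_count=2)
print([sorted(r) for r in routes])
```

The third trajectory contains cell c1@t0 only with probability 0.3, so it does not count towards any itemset containing that cell, even though the cell ID is present.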

Page 64: – with special attention  to location (               ) privacy

64

Outlook: Privacy-preserving data publishing (PPDP)

In contrast to the general assumptions of PPDM, arbitrary mining methods may be performed after publishing → need adversary models

Objective: "access to published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even with the presence of any attacker's background knowledge obtained from other sources" (this needs to be relaxed by assumptions about the background knowledge)

A comprehensive current survey: Fung et al. ACM Computing Surveys 2010

Page 65: – with special attention  to location (               ) privacy

65

Problem solved?

Page 66: – with special attention  to location (               ) privacy

66

No ...

How do people like/buy books?

• What do our Webserver logs tell us about viewing behaviour?

• How can we combine Webserver and transaction logs?

• Which data noise do we have to remove from our logs?

• Which of these association rules are frequent/confident enough?

• Should we show the recommendations at the top or bottom of the page?

• Only to registered customers?

• What if someone bought a book as a present for their father?

Page 67: – with special attention  to location (               ) privacy

FRIENDS ?

Page 68: – with special attention  to location (               ) privacy

against

From ...

SPACE

WEB MINING

PRIVACY

Page 69: – with special attention  to location (               ) privacy

for

… to

SPACE

WEB MINING

PRIVACY

Page 70: – with special attention  to location (               ) privacy

70

Why Web / data mining?

Who is doing the learning?

Page 71: – with special attention  to location (               ) privacy

71

Privacy as practice: Identity construction

Data

Page 72: – with special attention  to location (               ) privacy

72

Example: Privacy Wizards for Social Networking Sites

[Fang & LeFevre 2010]

• Interface: user specifies what they want to share with whom

  • Not in an abstract way ("group X" or "friends of friends" etc.)

  • Not for every friend separately

  • But for a subset of friends, and the system learns the "implicit rules behind that"

• Data mining: active learning (system asks only about the most informative friend instances)

• Results: good accuracy, better for "friends by communities" (linkage information) than for "friends by profile" (their profile data)

Page 73: – with special attention  to location (               ) privacy

73

Privacy Wizards ... – more feedback: “Expert interface“ shows the learned classifier

Page 74: – with special attention  to location (               ) privacy

encrypted content, unobservable communication

selectivity by access control

identification of information flows

feedback & awareness tools

educational materials and communication design

cognitive biases and nudging interventions

legal aspects

offline communities: social identities, social requirements

profiling

Page 75: – with special attention  to location (               ) privacy

Summary and conclusions

• The Web and space are rich sources of behavioural and other data

• Data mining is learning (inductively) from these data – a process of knowledge discovery (KD)

• "Privacy-preserving data mining" modifies data and/or algorithms to preserve utility & privacy

• Privacy threats arise in all phases of KD

• But KD can also offer privacy opportunities

Page 76: – with special attention  to location (               ) privacy

Outlook: from macro-space to micro-space / social signal processing

Page 77: – with special attention  to location (               ) privacy

QUESTIONS

PLEASE

THANK YOU