SPACE, WEB MINING and PRIVACY: foes or friends?
With special attention to location privacy
Bettina Berendt, Dept. Computer Science, K.U. Leuven
BASICS
What is Web Mining? And who am I?
Knowledge discovery (aka Data mining):
"the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
Web Mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
Web mining areas:
Web content mining
Web structure mining
Web usage mining
Navigation, queries, content access & creation
Why Web / data mining?
"the database of intentions" (J. Battelle)
Location-based services and augmented reality
www.poynt.com
Semiotically augmented reality: semapedia and related ideas
Mobile Social Web
What's special about spatial information? 1. Interpreting
Rich inferences from spatial position to personal properties and/or identity are possible:
• Pos(A,9-17) = P1 → workplace(A,P1)
• Pos(A,20-6) = P2 → home(A,P2)
An even richer "database of intentions"?!
• Pos(A,now) = P3 & temp(P3,now,hot) → wants(A,ice-cream) (location-based services)
• Pos(A, t in 13-18) = Pos(Demonstration,13-18) → suspicious(A) (ex. Dresden phone surveillance case 2011)
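The first two rules can be sketched in a few lines of Python. This is only an illustration: the position log, the place labels and the "most frequent place in the time window" heuristic are assumptions, not something the slides specify.

```python
from collections import Counter

# Hypothetical position log for person A: (hour of day, place label).
# Both the labels and the observations are made up for illustration.
LOG_A = [
    (2, "P2"), (5, "P2"), (9, "P1"), (11, "P1"), (14, "P1"),
    (16, "P1"), (18, "P3"), (21, "P2"), (23, "P2"),
]

def most_frequent_place(log, hours):
    """Return the place seen most often during the given hours, or None."""
    counts = Counter(place for hour, place in log if hour in hours)
    return counts.most_common(1)[0][0] if counts else None

def infer_roles(log):
    """Apply the two rules: position during 9-17 -> workplace, during 20-6 -> home."""
    work_hours = set(range(9, 17))                        # 9:00 up to (not including) 17:00
    night_hours = set(range(20, 24)) | set(range(0, 6))   # 20:00 up to (not including) 6:00
    return {
        "workplace": most_frequent_place(log, work_hours),
        "home": most_frequent_place(log, night_hours),
    }

roles = infer_roles(LOG_A)
print(roles)
```

Even this crude heuristic needs nothing but timestamps and coarse positions, which is the point of the slide.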
What's special about spatial information? 2. Sending, or: Opt-out impossible?!
• Physically: You cannot be nowhere
  Corollary: You cannot be in two places at once → limits on identity-building
• Contractually: Rental car with tracking, ...
• Culturally I: Opt-out may preclude basics of identity construction
  No mobile phone/internet communication
• Culturally II: Opt-out considered suspicious in itself (ex. A. Holm surveillance case 2007)
FOES ?
Behaviour on the Web (and elsewhere)
Data
(Web) data analysis and mining
Data
Privacy problems!
Technical background of the problem:
• The dataset allows for Web mining (e.g., which search queries lead to which site choices),
• it violates k-anonymity (e.g. "Lilburn": a likely k = #inhabitants of Lilburn)
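A minimal sketch of what a k-anonymity check looks like, assuming records are dictionaries and that we know which attributes act as quasi-identifiers. The records, attribute names and town names below are made up.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values on all quasi-identifier attributes."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Made-up search-log records; "town" and "age_band" are assumed quasi-identifiers.
RECORDS = [
    {"town": "TownA", "age_band": "60-69", "query": "landscapers"},
    {"town": "TownA", "age_band": "60-69", "query": "dog care"},
    {"town": "TownA", "age_band": "20-29", "query": "used cars"},
    {"town": "TownB", "age_band": "20-29", "query": "concerts"},
    {"town": "TownB", "age_band": "20-29", "query": "flights"},
]

k_town = k_anonymity(RECORDS, ["town"])
k_town_age = k_anonymity(RECORDS, ["town", "age_band"])
print(k_town, k_town_age)
```

Each additional quasi-identifier can only shrink the groups, so k drops quickly once several attributes are combined.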
Inferences
Data mining / machine learning: inductive learning of models ("knowledge") from data
Privacy-relevant:
• (Re-)identification: inferences towards identity
• Profiling: inferences towards properties
• Application of the inferred knowledge
What is identity merging? Or: Is this the same person?
Data integration: an example
• Paper published by the MovieLens team (collaborative-filtering movie ratings), who were considering publishing a ratings dataset, see http://movielens.umn.edu/
• Public dataset: users mention films in forum posts
• Private dataset (may be released e.g. for research purposes): users' ratings
• Film IDs can easily be extracted from the posts
• Observation: Every user will talk about items from a sparse relation space (those films, generally few, that s/he has seen)
[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06]
Generalisation with more robust de-anonymization attacks and different data: [Narayanan A, Shmatikov V (2009) De-anonymizing social networks. In: Proc. 30th IEEE Symposium on Security and Privacy 2009]
Merging identities: the computational problem
• Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset
• Rank these users u by their likelihood of being t
• Evaluate: If t is in the top k of this list, then t is k-identified
• Count percentage of users who are k-identified
• E.g. measure likelihood by TF.IDF (m: item)
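The ranking procedure can be sketched as follows. This is a simplified variant, not the scoring from the cited paper: items are weighted by an inverse document frequency over the ratings dataset, and candidates are ranked by cosine similarity of the weighted binary item vectors. All user and item names are made up.

```python
import math

def idf_weights(ratings):
    """idf(m) = log(N / number of users who rated item m)."""
    n_users = len(ratings)
    df = {}
    for items in ratings.values():
        for m in items:
            df[m] = df.get(m, 0) + 1
    return {m: math.log(n_users / count) for m, count in df.items()}

def score(mentioned, rated, idf):
    """Cosine similarity between idf-weighted binary item vectors."""
    dot = sum(idf.get(m, 0.0) ** 2 for m in mentioned & rated)
    norm_a = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in mentioned))
    norm_b = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in rated))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_k_identified(target, mentions, ratings, k):
    """True if the target's own ratings profile is among the top k candidates."""
    idf = idf_weights(ratings)
    ranked = sorted(ratings,
                    key=lambda u: score(mentions[target], ratings[u], idf),
                    reverse=True)
    return target in ranked[:k]

def share_k_identified(mentions, ratings, k):
    """Fraction of forum users who are k-identified."""
    hits = sum(is_k_identified(t, mentions, ratings, k) for t in mentions)
    return hits / len(mentions)

# Made-up data: items each user rated (private) and mentioned (public).
RATINGS = {"u1": {"m1", "m2", "m3"}, "u2": {"m1", "m4"}, "u3": {"m1", "m5", "m6"}}
MENTIONS = {"u1": {"m2"}, "u2": {"m4"}, "u3": {"m1"}}

share_top1 = share_k_identified(MENTIONS, RATINGS, k=1)
share_top3 = share_k_identified(MENTIONS, RATINGS, k=3)
print(share_top1, share_top3)
```

In this toy data, mentioning a single rare item (m2 or m4) is enough to be 1-identified, while mentioning only the item everybody rated (m1) carries no identifying weight.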
Results
What do you think helps?
What is classification (and prediction)?
Predicting political affiliation from Facebook profile and link data (1): Most Conservative Traits

Trait Name   Trait Value                            Weight Conservative
Group        george w bush is my homeboy            45.88831329
Group        college republicans                    40.51122488
Group        texas conservatives                    32.23171423
Group        bears for bush                         30.86484689
Group        kerry is a fairy                       28.50250433
Group        aggie republicans                      27.64720818
Group        keep facebook clean                    23.653477
Group        i voted for bush                       23.43173116
Group        protect marriage one man one woman     21.60830487

Lindamood et al. 09 & Heatherly et al. 09
Predicting political affiliation from Facebook profile and link data (2): Most Liberal Traits per Trait Name

Trait Name            Trait Value                Weight Liberal
activities            amnesty international      4.659100601
Employer              hot topic                  2.753844959
favorite tv shows     queer as folk              9.762900035
grad school           computer science           1.698146579
hometown              mumbai                     3.566007713
Relationship Status   in an open relationship    1.617950632
religious views       agnostic                   3.15756412
looking for           whatever i can get         1.703651985

Lindamood et al. 09 & Heatherly et al. 09
What is collaborative filtering?
"People like what people like them like"
User-based Collaborative Filtering
Idea: People who agreed in the past are likely to agree again
To predict a user’s opinion for an item, use the opinion of similar users
Similarity between users is decided by looking at their overlap in opinions for other items
Example: User-based Collaborative Filtering
         Item 1   Item 2   Item 3   Item 4   Item 5
User 1   8        1        ?        2        7
User 2   2        ?        5        7        5
User 3   5        4        7        4        7
User 4   7        1        7        3        8
User 5   1        7        4        6        5
User 6   8        3        8        3        7
Similarity between users
         Item 1   Item 2   Item 3   Item 4   Item 5
User 1   8        1        ?        2        7
User 2   2        ?        5        7        5
User 4   7        1        7        3        8
• How similar are users 1 and 2?
• How similar are users 1 and 4?
• How do you calculate similarity?
Popular similarity measures
• Cosine-based similarity
• Adjusted cosine-based similarity
• Correlation-based similarity
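Two of these measures, applied to the three users from the similarity slide, can be sketched as below. The choice to compare only co-rated items (skipping "?") is an assumption; the slides do not say how missing ratings are handled. Adjusted cosine is not implemented here.

```python
import math

# Ratings from the slide; None stands for the unknown "?" entries.
USER_1 = [8, 1, None, 2, 7]
USER_2 = [2, None, 5, 7, 5]
USER_4 = [7, 1, 7, 3, 8]

def co_rated(a, b):
    """Pairs of ratings for the items both users have rated."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]

def cosine(a, b):
    """Cosine of the angle between the two rating vectors (co-rated items only)."""
    pairs = co_rated(a, b)
    dot = sum(x * y for x, y in pairs)
    norm = (math.sqrt(sum(x * x for x, _ in pairs))
            * math.sqrt(sum(y * y for _, y in pairs)))
    return dot / norm if norm else 0.0

def pearson(a, b):
    """Pearson correlation of the two users' ratings (co-rated items only)."""
    pairs = co_rated(a, b)
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

print(cosine(USER_1, USER_2), cosine(USER_1, USER_4))
print(pearson(USER_1, USER_2), pearson(USER_1, USER_4))
```

On these numbers the correlation separates the users more sharply than the plain cosine: users 1 and 4 correlate strongly (about 0.96), while users 1 and 2 are negatively correlated (about -0.89).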
Algorithm 1: using entire matrix
[Figure: ratings matrix; values shown: 5, 4, 7, 7, 8]
Aggregation function: often weighted sum
Weight depends on similarity
Algorithm 2: K-Nearest-Neighbour
[Figure: ratings matrix; values shown: 5, 4, 7, 7, 8]
Aggregation function: often weighted sum
Weight depends on similarity
Neighbours are people who have historically had the same taste as our user
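A sketch of the K-nearest-neighbour variant on the example matrix, predicting User 1's missing rating for Item 3. Assumptions not taken from the slides: Pearson correlation on co-rated items as the similarity, only positively correlated users as neighbours, and a similarity-weighted average as the aggregation function.

```python
import math

# Example matrix from the slides; None stands for "?".
RATINGS = {
    "User 1": [8, 1, None, 2, 7],
    "User 2": [2, None, 5, 7, 5],
    "User 3": [5, 4, 7, 4, 7],
    "User 4": [7, 1, 7, 3, 8],
    "User 5": [1, 7, 4, 6, 5],
    "User 6": [8, 3, 8, 3, 7],
}

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def predict(user, item, k):
    """Similarity-weighted average of the item ratings of the k most similar users."""
    candidates = [
        (pearson(RATINGS[user], row), name, row[item])
        for name, row in RATINGS.items()
        if name != user and row[item] is not None
    ]
    # Keep only positively correlated users, most similar first.
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    total = sum(sim for sim, _, _ in neighbours)
    if total == 0:
        return None, []
    value = sum(sim * rating for sim, _, rating in neighbours) / total
    return value, [name for _, name, _ in neighbours]

prediction, neighbours = predict("User 1", item=2, k=2)  # item index 2 = Item 3
print(prediction, neighbours)
```

With k = 2 the neighbours are User 6 and User 4, and the prediction lands at roughly 7.5. Setting k to the number of users gives a version that uses every positively correlated row of the matrix.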
Summary: Lots of data → lots of privacy threats (and opportunities)
The Web incites one of the semiotically richest (and often machine-processable) types of interaction
Space incites data-rich types of interaction
→ two rich sources of "the database of intentions"
How many people see an ad?
• Television: sample viewers, extrapolate to population
• Web: count viewers/clickers through clickstream
• City streets: count pedestrians / motorists? Too many streets!
→ Solution intuition: sample streets, predict
Fraunhofer IAIS (2007): predict frequencies based on similar streets
• Street segments modelled as vectors
  - Spatial / geometric information
  - Type of street, direction, speed class, …
  - Demographic, socio-economic data about vicinity
  - Nearby points of interest (buffer around segment, count #POI)
• KNN algorithm
  - Frequency of a street segment = weighted sum of frequencies from most similar k segments in sample
  - Dynamic + selective calculation of distance to counter the huge numbers of segments and measurements
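The KNN idea can be sketched as follows. This is not the Fraunhofer IAIS implementation: the three features, the sample values, the Euclidean distance on unscaled features and the inverse-distance weights are all assumptions chosen to keep the example short.

```python
import math

# Sampled street segments: (feature vector, measured frequency per hour).
# Assumed features: (speed class, number of nearby POI, population density index).
# All numbers are hypothetical.
SAMPLE = [
    ((1.0, 12.0, 0.8), 900.0),
    ((1.0, 10.0, 0.7), 800.0),
    ((3.0, 1.0, 0.1), 50.0),
    ((2.0, 5.0, 0.4), 300.0),
]

def predict_frequency(segment, sample, k):
    """Inverse-distance-weighted average of the frequencies of the
    k sampled segments closest to `segment` in feature space."""
    nearest = sorted((math.dist(segment, feats), freq) for feats, freq in sample)[:k]
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]  # small constant avoids division by zero
    return sum(w * freq for w, (_, freq) in zip(weights, nearest)) / sum(weights)

estimate = predict_frequency((1.0, 11.0, 0.75), SAMPLE, k=2)
print(estimate)
```

The unmeasured segment sits midway between the two busiest sampled segments, so the estimate falls between their frequencies. A real system would scale the features first and would avoid computing all distances, as the last bullet notes.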
IP filtering: a deterministic classification model
IP → country
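"Deterministic" here means a plain lookup with no learning involved. A minimal sketch using Python's standard `ipaddress` module is shown below. The prefix table is hypothetical: the two networks are reserved documentation ranges and the country codes are made up, whereas a real filter would use a GeoIP database.

```python
import ipaddress

# Hypothetical prefix-to-country table (documentation address ranges, made-up countries).
PREFIX_TO_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): "BE",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}

def country_of(ip):
    """Return the country code of the first matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    for network, country in PREFIX_TO_COUNTRY.items():
        if addr in network:
            return country
    return None

print(country_of("192.0.2.17"), country_of("203.0.113.5"))
```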
Where do people live who will buy the Koran soon?
Technical background of the problem:
• A mashup of different data sources
  • Amazon wishlists
  • Yahoo! People (addresses)
  • Google Maps
each with insufficient k-anonymity, allows for attribute matching and thereby inferences
• Event store
• Learning
• Reasoning
Multiple views on traffic
Operator ID: Nick
Heading: INCIDENT
Message: INCIDENT INFORMATION
Cleared 1637: I-405 SBJS I-90 ACC BLK RL CCTV
1623 – WSP, FIR ON SCENE
Incident reports
Weather
Major events
Traffic Prediction: space data + Web data + ...
E.g. LARKC project: I. Celino, D. Dell'Aglio, E. Della Valle, R. Grothmann, F. Steinke and V. Tresp: Integrating Machine Learning in a Semantic Web Platform for Traffic Forecasting and Routing. IRMLeS 2011 Workshop at ESWC 2011.
PEACEFUL COEXISTENCE ?
Recall (a simple view): Cryptographic privacy solutions
Data
not all !
"Privacy-preserving data mining"
Data
not all !
Privacy-preserving data mining (PPDM)
Database inference problem: "The problem that arises when confidential information can be derived from released data by unauthorized users"
Objective of PPDM: "develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process."
Approaches:
• Data distribution
  - Decentralized holding of data
• Data modification
  - Aggregation/merging into coarser categories
  - Perturbation, blocking of attribute values
  - Swapping values of individual records
  - Sampling
• Data or rule hiding
  - Push the support of sensitive patterns below a threshold
Example 1: Collaborative filtering
Collaborative filtering: idea and architecture
• Basic idea of collaborative filtering: "Users who liked this also liked ..." → generalize from "similar profiles"
• Standard solution: at the community site / centralized:
  - Compute, from all users and their ratings/purchases, etc., a global model
  - To derive a recommendation for a given user: find "similar profiles" in this model and derive a prediction
• Mathematically: depends on simple vector computations in the user-item space
Distributed data mining / secure multi-party computation: The principle explained by secure sum
Given a number of values x1,...,xn belonging to n entities,
compute Σ xi (the sum over i = 1,...,n)
such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data)
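One common way to realise secure sum is a ring protocol with a random mask, simulated here in a single process. Assumptions: the parties are honest-but-curious and do not collude, and the true sum is smaller than the chosen modulus. The slides do not say which secure-sum protocol is meant, so this is one plausible instance.

```python
import secrets

MODULUS = 2 ** 32  # assumed bound: the true sum must be smaller than this

def secure_sum(values, modulus=MODULUS):
    """Simulate the ring protocol: party 1 masks its input with a random
    number, every other party adds its own input to the running total,
    and party 1 finally removes the mask."""
    mask = secrets.randbelow(modulus)        # known only to party 1
    running = (mask + values[0]) % modulus   # what party 1 sends to party 2
    for x in values[1:]:
        # Each party sees only a masked partial sum, never an individual input.
        running = (running + x) % modulus
    return (running - mask) % modulus        # party 1 unmasks the result

total = secure_sum([12, 7, 30, 1])
print(total)
```

Because the mask is uniformly random, every intermediate value passed along the ring looks random to the party receiving it. Two colluding neighbours of a party could still subtract their values to recover that party's input, which is why the no-collusion assumption matters.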
Canny: Collaborative filtering with privacy
• Each user starts with their own preference data, and knowledge of who their peers are in their community.
• By running the protocol, users exchange various encrypted messages.
• At the end of the protocol, every user has an unencrypted copy of the linear model Λ, ψ of the community's preferences.
• They can then use this to extrapolate their own ratings.
• At no stage does unencrypted information about a user's preferences leave their own machine.
• Users outside the community can request a copy of the model Λ, ψ from any community member, and derive recommendations for themselves.
Canny (2002), Proc. IEEE Symp. Security and Privacy; Proc. SIGIR
Ex. 2: Frequent itemset mining
Generating large k-itemsets with Apriori
Transaction ID   Attributes (basket items)
1                Spaghetti, tomato sauce
2                Spaghetti, bread
3                Spaghetti, tomato sauce, bread
4                bread, butter
5                bread, tomato sauce

Min. support = 40%
step 1: candidate 1-itemsets
• Spaghetti: support = 3 (60%)
• tomato sauce: support = 3 (60%)
• bread: support = 4 (80%)
• butter: support = 1 (20%)
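The level-wise search can be sketched on the five baskets above. This is a compact, unoptimised version of Apriori written for readability; the function and variable names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frequent itemset: support}, built level by level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    level = [c for c in level if support(c, transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for c in level:
            frequent[c] = support(c, transactions)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        previous = set(level)
        candidates = [c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k - 1))]
        level = [c for c in candidates if support(c, transactions) >= min_support]
    return frequent

frequent = apriori(TRANSACTIONS, min_support=0.4)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a minimum support of 40%, butter drops out at step 1, all three pairs of the remaining items are frequent with support 40%, and the triple (spaghetti, tomato sauce, bread) occurs in only one basket and is rejected.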
Candidate itemsets (lattice over the four items):
1-itemsets: spaghetti | tomato sauce | bread | butter
2-itemsets: spaghetti, tomato sauce | spaghetti, bread | spaghetti, butter | tomato sauce, bread | tomato sauce, butter | bread, butter
3-itemsets: spaghetti, tomato sauce, bread | spaghetti, tomato sauce, butter | spaghetti, bread, butter | tomato sauce, bread, butter
4-itemset: spaghetti, tomato sauce, bread, butter
How many people see an ad? Next steps ...
• Not only ads, but personalized ads
• Ad sequences? → need to know trajectories
• Single trajectories: highly privacy-sensitive data
• Aggregate (e.g. frequent) trajectories also interesting for other applications, e.g., traffic planning
Privacy-preserving frequent-route mining by data coarsening: intuition
Ex.: Gidófalvi et al. (2007): Privacy-preserving data mining on moving object trajectories
Basic strategy: Aggregation/merging into coarser categories, performed by client
Anonymization rectangles satisfying (areasize, maxLocProb): <R, t_s, t_e>
→ allows inference of location probability of R
Coarsened trajectories
Time interval probabilistically frequent route queries
Split trajectories inside query time interval into m sub-trajectories of equal time length
→ trajectory = set/sequence of spatio-temporal grid cell IDs, each associated with a location probability = transaction of items (X,P)
A transaction (X,P) p-satisfies an itemset Y if Y is contained in X and every item i in Y has i.prob >= min_prob
p-support of an item(set), i.count: the number of transactions that p-satisfy the item(set)
frequent routes := maximal p-frequent itemsets, found with a frequent-itemset miner (routes can be discontinuous)
Extension to frequent sequence mining?!
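The two definitions translate directly into code. In this sketch a coarsened trajectory is represented as a dictionary from grid-cell ID to location probability; that representation, the cell IDs and the probabilities are assumptions made for illustration.

```python
def p_satisfies(transaction, itemset, min_prob):
    """True if every cell of the itemset occurs in the transaction
    with a location probability of at least min_prob."""
    return all(cell in transaction and transaction[cell] >= min_prob
               for cell in itemset)

def p_support(transactions, itemset, min_prob):
    """Number of transactions that p-satisfy the itemset."""
    return sum(p_satisfies(t, itemset, min_prob) for t in transactions)

# Hypothetical coarsened trajectories: grid-cell ID -> location probability.
TRAJECTORIES = [
    {"c1": 0.9, "c2": 0.6, "c3": 0.2},
    {"c1": 0.7, "c2": 0.3},
    {"c1": 0.8, "c2": 0.8, "c4": 0.9},
]

support_c1_c2 = p_support(TRAJECTORIES, {"c1", "c2"}, min_prob=0.5)
support_c1 = p_support(TRAJECTORIES, {"c1"}, min_prob=0.5)
support_c3 = p_support(TRAJECTORIES, {"c3"}, min_prob=0.5)
print(support_c1_c2, support_c1, support_c3)
```

Plugging `p_support` into a standard frequent-itemset miner in place of the usual support count yields the p-frequent itemsets; keeping only the maximal ones gives the frequent routes.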
Outlook: Privacy-preserving data publishing (PPDP)
In contrast to the general assumptions of PPDM, arbitrary mining methods may be performed after publishing
→ need adversary models
Objective: "access to published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even with the presence of any attacker's background knowledge obtained from other sources"
(this needs to be relaxed by assumptions about the background knowledge)
A comprehensive current survey: Fung et al. ACM Computing Surveys 2010
Problem solved?
No ...
How do people like/buy books?
What do our Webserver logs tell us about viewing behaviour?
How can we combine Webserver and transaction logs?
Which data noise do we have to remove from our logs?
Which of these association rules are frequent/confident enough?
Should we show the recommendations at the top or bottom of the page? Only to registered customers?
What if someone bought a book as a present for their father?
FRIENDS ?
From ... SPACE, WEB MINING against PRIVACY
… to SPACE, WEB MINING for PRIVACY
Why Web / data mining?
Who is doing the learning?
Privacy as practice: Identity construction
Data
Example: Privacy Wizards for Social Networking Sites [Fang & LeFevre 2010]
• Interface: user specifies what they want to share with whom
  - Not in an abstract way ("group X" or "friends of friends" etc.)
  - Not for every friend separately
  - But for a subset of friends, and the system learns the "implicit rules behind that"
• Data mining: active learning (system asks only about the most informative friend instances)
• Results: good accuracy, better for "friends by communities" (linkage information) than for "friends by profile" (their profile data)
Privacy Wizards ...: more feedback. "Expert interface" shows the learned classifier
• encrypted content, unobservable communication
• selectivity by access control
• identification of information flows
• feedback & awareness tools
• educational materials and communication design
• cognitive biases and nudging interventions
• legal aspects
• offline communities: social identities, social requirements
• profiling
Summary and conclusions
The Web and space are rich sources of behavioural and other data
Data mining is learning (inductively) from these data – a process of knowledge discovery (KD)
"Privacy-preserving data mining" modifies data and/or algorithms to preserve utility & privacy
Privacy threats arise in all phases of KD
But KD can also offer privacy opportunities
Outlook: from macro-space to micro-space / social signal processing
QUESTIONS
PLEASE
THANK YOU