A Study on Privacy Level in Publishing Data of Smart Tap Network
DESCRIPTION
Using entropy to quantify the privacy level when publishing smart grid data.
TRANSCRIPT
A Study on Privacy Level in Publishing Data of Smart Tap Network
The University of Tokyo Esaki Laboratory
Tran Quoc Hoan 2014.03.18@Niigata
1
Outline
1. Background & Purpose
2. Related works
3. Proposal
4. Methodology
5. Result & Discussion
6. Conclusion
2
Background & Purpose
• Background
1. Smart tap & Big data
2. Privacy Preserving Data Publishing (PPDP)
3. Difficulty in anonymising time series data
• Research purpose
  • Using entropy to quantify the risk of publishing smart tap data
[Figure: PPDP flow. Labels: Alice, Bob, Peter; Data Collection; Original Dataset; Data Processor; Data anonymise; Data Publishing; Data Recipient]
3
Related works
1. Smart Metering & Privacy (Quinn, 2009)
2. Time series chaos analysis in physiology
• Approximate Entropy (Pincus, 1992)
  • Bias effect (e.g. random noise)
• Sample Entropy (Richman, 2001)
  • Avoids the bias effect
  • Differs from the original entropy definition
4
Proposal (1): Privacy Level
• “Privacy level” = quantity of human activity information in power consumption data (%)
• Evaluation of regularity (entropy)
[Figure: power value vs. time points for three example traces, placed on a privacy level scale (%): Refrigerator (regularity), White-noise (irregularity), Laptop (???)]
5
Proposal (2): Entropy rate
• Entropy Rate = Entropy(data)/Entropy(white-noise)
• Privacy Level = EnRate
[Figure: entropy rate axis from 0 to 1 marked with LRate, HRate and two "Publish Safe" regions; example traces (power value vs. time points): Refrigerator (regularity), White-noise (irregularity), Laptop (???)]
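The ratio defined on this slide can be sketched in a few lines of Python. The function name `entropy_rate` is hypothetical. The sample inputs are the SaEn values listed on the backup slide "Setting time lag" (refrigerator 0.944, laptop 1.457, noise 3.247), reused here purely as illustrative numbers; that the deck computed its rates from exactly these values is an assumption.

```python
def entropy_rate(data_entropy: float, noise_entropy: float) -> float:
    """Entropy(data) / Entropy(white-noise), as defined on the slide."""
    if noise_entropy <= 0:
        raise ValueError("white-noise entropy must be positive")
    return data_entropy / noise_entropy


# Illustrative inputs only (SaEn values from the "Setting time lag" table).
noise_saen = 3.247
rates = {
    "refrigerator": entropy_rate(0.944, noise_saen),
    "laptop": entropy_rate(1.457, noise_saen),
}
for device, rate in rates.items():
    print(f"{device}: entropy rate = {rate:.3f}")
```

A rate near 0 indicates a very regular trace and a rate near 1 a noise-like trace, which matches the 0 to 1 axis of the figure.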
6
Proposed Methodology
1. Decide parameters for entropy calculation
• Time lag, m, r
2. Calculate entropy value, entropy rate
3. Decide LRate, HRate and privacy level
• Using Approximate Entropy (ApEn) & Sample Entropy (SaEn)
7
Parameters for entropy calculation
• Time series x[1], x[2], …, x[N]
• pattern i: (x[i], x[i+lag], …, x[i+(m-1)lag])
• m: number of data points in a pattern
• lag: sampling interval within a pattern
• dis(i,j) = max(|x[i+(p-1)lag] - x[j+(p-1)lag]|, p = 1, …, m)
• r: tolerance; dis(i,j) ≤ r → pattern i ~ pattern j
[Figure: example time series over time points with patterns i, j, k marked; Ex. lag = 1, m = 3]
8
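As a minimal sketch of the definitions on this slide, the Python below builds the patterns for a given m and lag and compares two of them with the maximum-absolute-difference distance. The function names and the sample series are hypothetical, and the code uses 0-based indices where the slide uses 1-based ones.

```python
import numpy as np


def patterns(x, m, lag):
    """Return an array whose row i is (x[i], x[i+lag], ..., x[i+(m-1)*lag])."""
    x = np.asarray(x, dtype=float)
    count = len(x) - (m - 1) * lag  # number of complete patterns
    if count <= 0:
        raise ValueError("series too short for this m and lag")
    return np.stack([x[p * lag : p * lag + count] for p in range(m)], axis=1)


def dis(pattern_i, pattern_j):
    """Largest absolute difference between corresponding points."""
    return float(np.max(np.abs(pattern_i - pattern_j)))


def similar(pattern_i, pattern_j, r):
    """Pattern i ~ pattern j when dis(i, j) <= r."""
    return dis(pattern_i, pattern_j) <= r


# Hypothetical series, lag = 1 and m = 3 as in the slide's example.
x = [1.0, 2.0, 4.0, 1.5, 2.5, 4.5]
pats = patterns(x, m=3, lag=1)
print(pats)
print(dis(pats[0], pats[3]), similar(pats[0], pats[3], r=0.5))
```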
Entropy Calculation
• A(i): number of patterns k similar to pattern i (k != i)
• B(i): number of patterns (k+lag) similar to pattern (i+lag)
• Bias when A(i)=B(i)=0 (random noise)
[Figure: example time series over time points with patterns i, j, k and their successors i+lag, j+lag, k+lag marked; Ex. lag = 1, m = 3]
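The counting idea above can be illustrated with the commonly cited textbook definitions of Approximate Entropy and Sample Entropy. This is a sketch under stated assumptions, not necessarily the exact formulas used in the deck: the tolerance is taken as r times the standard deviation of the series, ApEn counts self-matches (the source of its bias), SaEn excludes them, and all function names are hypothetical.

```python
import numpy as np


def _patterns(x, m, lag, count):
    """First `count` patterns (x[i], x[i+lag], ..., x[i+(m-1)*lag])."""
    return np.stack([x[p * lag : p * lag + count] for p in range(m)], axis=1)


def _match_counts(pats, tol):
    """For each pattern, how many patterns lie within tol (self included)."""
    d = np.max(np.abs(pats[:, None, :] - pats[None, :, :]), axis=2)
    return np.sum(d <= tol, axis=1)


def approximate_entropy(x, m=2, r=0.2, lag=1):
    """ApEn = phi(m) - phi(m+1); self-matches keep the logarithm finite."""
    x = np.asarray(x, dtype=float)
    tol = r * np.std(x)

    def phi(length):
        count = len(x) - (length - 1) * lag
        counts = _match_counts(_patterns(x, length, lag, count), tol)
        return np.mean(np.log(counts / count))

    return float(phi(m) - phi(m + 1))


def sample_entropy(x, m=2, r=0.2, lag=1):
    """SaEn = -ln(matches of length m+1 / matches of length m), no self-matches."""
    x = np.asarray(x, dtype=float)
    tol = r * np.std(x)
    count = len(x) - m * lag  # same number of templates for both lengths
    matches_m = np.sum(_match_counts(_patterns(x, m, lag, count), tol) - 1)
    matches_m1 = np.sum(_match_counts(_patterns(x, m + 1, lag, count), tol) - 1)
    if matches_m == 0 or matches_m1 == 0:
        return float("inf")  # no matches at all: the estimate is undefined
    return float(-np.log(matches_m1 / matches_m))


# Hypothetical inputs: a perfectly regular trace and Gaussian white noise.
periodic = np.tile([0.0, 1.0], 100)
noise = np.random.default_rng(0).standard_normal(300)
for name, series in [("periodic", periodic), ("noise", noise)]:
    print(name, approximate_entropy(series), sample_entropy(series))
```

The pairwise distance matrix makes this quadratic in memory, which is acceptable for a short illustration but would need a streaming variant for long traces.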
9
Setting time lag
Refrigerator: first ACF zero-crossing lag = 7; ApEn = 1.223; SaEn = 0.944
Laptop: first ACF zero-crossing lag = 198; ApEn = 1.299; SaEn = 1.457
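The lag values above come from the first zero-crossing of the autocorrelation function (ACF). A minimal Python sketch of that rule follows; the function name, the biased ACF estimator and the synthetic cosine input are assumptions made for illustration.

```python
import numpy as np


def first_acf_zero_crossing(x, max_lag=None):
    """Smallest lag >= 1 at which the sample ACF is <= 0, or None if none."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    energy = np.dot(x, x)
    if energy == 0:
        raise ValueError("constant series has no defined ACF")
    n = len(x)
    max_lag = n - 1 if max_lag is None else min(max_lag, n - 1)
    for lag in range(1, max_lag + 1):
        acf = np.dot(x[:-lag], x[lag:]) / energy  # biased estimator
        if acf <= 0:
            return lag
    return None


# Hypothetical input: a cosine with a period of 40 samples, whose ACF
# should first reach zero near a quarter period.
t = np.arange(400)
signal = np.cos(2 * np.pi * t / 40)
print(first_acf_zero_crossing(signal))
```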
10
Setting m, r
• SaEn: choose m, r such that the 95% confidence interval of the estimate ≤ 10% of SaEn
• ApEn: choose m, r that maximise ApEn
• m=2,3; r=0.1→0.4 std (std: standard deviation)
[Figure: white-noise entropy, with the bias region marked]
11
Evaluation
1. Learning data set (for setting m, r)
• Tracebase (tracebase.org) (138 devices)
• m=2,3; r=0.1→0.4
2. Evaluation data set
• IREF Building 2F-5F (136 devices, 5 weeks)
12
Result (1)
IREF 136 devs EnRate (5 weeks)
[Figure: SaEn rate (vertical axis) plotted against ApEn rate (horizontal axis), both from 0 to 1]
m=2, r=0.2*standard deviation
13
Result (2)
IREF Laptop EnRate (11 devs, 5 weeks)
[Figure: SaEn rate (vertical axis) plotted against ApEn rate (horizontal axis), both from 0 to 0.5, with LRate and HRate marked on each axis and a "Warning" region]
• LRate = Mean - Standard Deviation
• HRate = Mean + Standard Deviation
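The two thresholds can be derived from a set of per-device entropy rates as sketched below. The rates in the example are made up, the use of the population standard deviation is an assumption, and so is the reading that rates between LRate and HRate fall in the "Warning" region while rates outside it are "Publish Safe". The function names are hypothetical.

```python
import numpy as np


def rate_thresholds(rates):
    """Return (LRate, HRate) = (mean - std, mean + std) of the given rates."""
    rates = np.asarray(rates, dtype=float)
    mean = rates.mean()
    std = rates.std()  # population standard deviation (ddof=0): an assumption
    return float(mean - std), float(mean + std)


def publish_decision(rate, lrate, hrate):
    """Assumed reading of the figure: warn inside [LRate, HRate], else safe."""
    return "warning" if lrate <= rate <= hrate else "publish safe"


# Hypothetical entropy rates for a handful of laptops.
laptop_rates = [0.10, 0.20, 0.30, 0.40, 0.50]
lrate, hrate = rate_thresholds(laptop_rates)
for rate in (0.05, 0.30):
    print(rate, publish_decision(rate, lrate, hrate))
```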
14
Discussion
1. Entropy is sensitive to data sets that include outliers
2. Relation between entropy and privacy of data
3. Future work
• Calculate entropy with meaningful patterns
• Using entropy for other knowledge (device classification, abnormal pattern detection,…)
• Privacy Preserving Protocol
15
Conclusion
1. Quantified the human activity information included in smart tap data
2. Applied entropy measures from physiology (ApEn, SaEn) to power consumption data
3. Defined entropy rate to determine privacy level of published power consumption data
16
Thank you for listening!
17
Backup slides
18
Demand and Supply
1. Demand Oriented Approach of Power Grid
• Supply matches volatile demand
• Supply side is volatile as well
2. Bi-directional communication (Internet of Things)
• Anticipate future supply/demand
• Shape demand, supply-oriented
• Personal data is needed for effective demand side management
19
Risk of Privacy Abuse
20
Inference forward channel
Inference backward channel
By consumption patterns
• Appliance detection
• Use mode detection
• Behavior deduction
By demand response data
• Incentive sensitivity
• Customer preference
[Figure labels: Household, Managements, Data collectors]
Ex. Behavior Patterns:
• Washing (10h-12h)
• TV (19h-23h)
• Out (12h-18h)
The Concept of EU for Privacy
21
[Figure: pseudonymization scheme. Labels: Consumption Data, Template Data, Discriminator (Machine learning), identifying information, non-identifying information, Pseudonymization, Pseudonym x]
Source: “Privacy in the Smart Energy Grid”, Lecture at NII 2014-03-13, Prof. Gunter Muller
Service Feedback Loop
22
[Figure: service feedback loop among Household, Data collectors and Service Provider (Billing, Aggregation, Compliance Verification). Labels: Consumption trace; Query; Privacy Preserving Protocol; Bill; Consumption Target; $$$; Future work; Encryption. Annotation: (My research) Privacy level = (??)%]
Privacy Preserving Query Scenario
23
Q1. How many people have above-average energy consumption between 19h and 20h?
Q2. How many people, excluding Tanaka, have above-average energy consumption between 19h and 20h?
• Non-private: Q1: 125, Q2: 124
• Privacy preserving: Q1: 125, Q2: 127
[Figure labels: Service Provider, Data Collectors, Attacker Detection]
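The scenario above is a differencing attack: subtracting the two exact counts reveals whether Tanaka is above average. The Python sketch below replays the slide's numbers. The perturbation step uses Laplace noise purely as an assumed example of a privacy preserving answer, since the deck does not name a mechanism, and all function names are hypothetical.

```python
import numpy as np


def differencing_attack(count_all, count_without_target):
    """What an attacker learns about the target from the two counts.

    Returns True (target is counted), False (target is not counted) or
    None when the difference is inconsistent and reveals nothing.
    """
    diff = count_all - count_without_target
    if diff == 1:
        return True
    if diff == 0:
        return False
    return None


def noisy_count(true_count, scale, rng):
    """Assumed mechanism: add Laplace noise to the count, then round."""
    return int(round(true_count + rng.laplace(0.0, scale)))


# Numbers from the slide.
print(differencing_attack(125, 124))  # non-private answers
print(differencing_attack(125, 127))  # privacy preserving answers

# Hypothetical perturbed answers to the same two queries.
rng = np.random.default_rng(0)
print(noisy_count(125, 2.0, rng), noisy_count(124, 2.0, rng))
```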
Evaluation System
• Time series segmentation
• Real Event Mapping
• Quantify Privacy Level
24
Linkage Attack
Published Data (Categorization)
• in: 9h-10h, 13h-13h30; out: 10h-13h, 13h30-
• in: 12h-14h, 16h-18h; out: 18h-
• peak: 16h-18h
Third party information
• 3 people in the room: Alice, Bob, Peter
• Peter has printer, Alice has monitor, Bob has PC
Identify: Alice, Bob, Peter
25
Regularity in Time Series
• Linear methods can’t solve the problem => Nonlinear Analysis
[Figure: refrigerator data and its surrogate over time points, with their ACF and periodogram]
26
Entropy (1)
• Display time series data in phase-space
y(m,t) = [x(t), x(t+lag), …, x(t+(m-1)lag)]
• Approximate Entropy (ApEn) and Sample Entropy (SaEn): evaluate the conditional probability of trajectory matching
[Figure: phase-space plots of x(t) vs. x(t+7) (m=2, lag=7) and of x(t), x(t+7), x(t+14) (m=3, lag=7)]
27
Setting time lag
• Time lag = First zero-crossing of ACF

Dev           Lag  ApEn   SaEn
Unknown       500  0.348  0.473
Refrigerator  7    1.223  0.944
Laptop        198  1.299  1.457
Noise         2    3.025  3.247
28
Setting m, r (1)
29
• K_A, K_B: number of overlapping template-matching patterns (pattern length m, m+1)
• 95% confidence interval of SaEn
Setting m, r (2)
m=2, r=(0.1~0.4) std
Training for parameters: Tracebase dataset
30
Experiment

Data set           Tracebase             IREF
Smart tap type     Plugwise              Plugwise
Number of devices  138                   136
Time range         Variation             5 weeks
Sampling interval  1 s                   2 mins
Usage              Training for m, r     Evaluation
Result             m=2,3 r=0.1~0.4 std   ***
31
Result (1)
IREF: 136 devices, 5 weeks
32
Other knowledge from entropy rate (?)
Device classification & abnormal pattern detection
Tracebase data set
33