Page 1: An experimental comparison of globally-optimal data de-identification algorithms

Technische Universität München

Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn

Chair for Biomedical Informatics

Institute for Medical Statistics and Epidemiology

Klinikum rechts der Isar der TU München

An Experimental Comparison of Globally-Optimal Data De-Identification Algorithms

Page 2: An experimental comparison of globally-optimal data de-identification algorithms

Optimal de-identification algorithms

• Generalization hierarchies
• Generalization lattice
• Pruning: predictive tagging
• Optimization: roll-up
• Privacy models, e.g.: k-anonymity, l-diversity, t-closeness, δ-presence

Example (k = 2), original data:

Age    Gender   Zipcode
34     male     81667
45     female   81667
66     male     81925
70     female   81925
70     male     81925

Example (k = 2), generalized data:

Age    Gender   Zipcode
20-60  *        81667
20-60  *        81667
≥ 61   *        81925
≥ 61   *        81925
≥ 61   *        81925

F. Prasser, F. Kohlmayer et al.: A Benchmark of Globally-Optimal Anonymization Methods for Biomedical Data, CBMS 2014 (12/19/16)
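The example on this slide can be checked in a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the age intervals and the suppression of gender follow the slide, and the zipcode hierarchy (masking trailing digits) is an assumption that the example does not use. A transformation is written as one generalization level per attribute, which is also how nodes of the generalization lattice are usually addressed.

```python
from collections import Counter

# Rows from the slide: (age, gender, zipcode).
ROWS = [
    (34, "male", "81667"),
    (45, "female", "81667"),
    (66, "male", "81925"),
    (70, "female", "81925"),
    (70, "male", "81925"),
]

def generalize_age(age, level):
    # Level 1 uses the two intervals shown on the slide (assumes ages >= 20).
    if level == 0:
        return str(age)
    if level == 1:
        return "20-60" if age <= 60 else "≥ 61"
    return "*"

def generalize_gender(gender, level):
    return gender if level == 0 else "*"

def generalize_zip(zipcode, level):
    # Assumed hierarchy (not from the slide): mask `level` trailing digits.
    return zipcode[: len(zipcode) - level] + "*" * level

def transform(rows, levels):
    """Apply one generalization level per attribute to every row."""
    age_level, gender_level, zip_level = levels
    return [
        (generalize_age(a, age_level),
         generalize_gender(g, gender_level),
         generalize_zip(z, zip_level))
        for a, g, z in rows
    ]

def is_k_anonymous(rows, k):
    """Every combination of quasi-identifier values must occur at least k times."""
    return min(Counter(rows).values()) >= k

print(is_k_anonymous(transform(ROWS, (0, 0, 0)), 2))  # False: every row is unique
print(is_k_anonymous(transform(ROWS, (1, 1, 0)), 2))  # True: groups of size 2 and 3
```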

Page 3: An experimental comparison of globally-optimal data de-identification algorithms

Algorithms – Incognito

• LeFevre et al.
– SIGMOD 2005
• Dynamic programming
– Breadth-first search on lattices for powerset of quasi-identifiers

Page 4: An experimental comparison of globally-optimal data de-identification algorithms

Algorithms – OLA & Flash

OLA:
• El Emam et al.
– JAMIA 2009
• Divide & conquer
– Optimal Lattice Anonymization
– Binary search on sublattices

Flash:
• Kohlmayer & Prasser et al.
– PASSAT 2012
• Greedy search
– Binary depth-first search
– Total order & priority queue
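Both methods rely on some form of binary search. The sketch below is neither OLA nor Flash; it only illustrates the shared building block under the assumption that the privacy criterion is monotonic, i.e. once a transformation on a path is anonymous, every more generalized transformation on that path is anonymous too. The function name, the path, and the toy criterion are hypothetical.

```python
def first_anonymous_on_path(path, is_anonymous):
    """Binary search on a path ordered from least to most generalized.

    Returns (index of the first anonymous node or None, number of checks).
    Only valid if is_anonymous is monotonic along the path.
    """
    checks = 0
    low, high = 0, len(path) - 1
    result = None
    while low <= high:
        mid = (low + high) // 2
        checks += 1
        if is_anonymous(path[mid]):
            result = mid        # remember it, then look for a less generalized one
            high = mid - 1
        else:
            low = mid + 1       # everything at or below mid is not anonymous
    return result, checks

# Hypothetical path through a lattice; each node holds one level per attribute.
PATH = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 1, 1), (2, 1, 2)]

# Toy stand-in for a real anonymity check: anonymous once the levels sum to 3.
index, checks = first_anonymous_on_path(PATH, lambda node: sum(node) >= 3)
print(PATH[index], checks)  # (1, 1, 1) 3
```

A linear scan of the same path would need four checks before reaching the first anonymous node; the saving grows with the length of the path.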

Page 5: An experimental comparison of globally-optimal data de-identification algorithms

Algorithms – BFS, DFS & Questions

• Generic search methods

– Breadth-first search (BFS)

– Depth-first search (DFS)

→ Extended to use predictive tagging

• Research questions

– How do the algorithms compare in terms of performance?

– Are there further differences between them?

– Are the algorithms' properties influenced by the privacy models used?

– How do problem-specific methods compare to generic search algorithms?
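As a rough illustration of generic search with predictive tagging, the sketch below traverses a small lattice breadth-first or depth-first and only runs the anonymity check on transformations whose state has not already been predicted. It assumes a monotonic criterion: generalizations of an anonymous transformation are anonymous, and specializations of a non-anonymous one are not. All names, the hierarchy heights, and the toy criterion are hypothetical; this is not the benchmarked implementation.

```python
from collections import deque

def successors(node, max_levels):
    """Direct generalizations: raise exactly one attribute by one level."""
    for i, (level, top) in enumerate(zip(node, max_levels)):
        if level < top:
            yield node[:i] + (level + 1,) + node[i + 1:]

def predecessors(node):
    """Direct specializations: lower exactly one attribute by one level."""
    for i, level in enumerate(node):
        if level > 0:
            yield node[:i] + (level - 1,) + node[i + 1:]

def tag(node, neighbours, value, tags):
    """Predictive tagging: propagate a check result without further checks."""
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in tags:
            tags[current] = value
            stack.extend(neighbours(current))

def search(max_levels, is_anonymous, strategy):
    """Traverse the whole lattice; return (tags, number of checks)."""
    bottom = tuple(0 for _ in max_levels)
    tags, checks = {}, 0
    frontier, seen = deque([bottom]), {bottom}
    while frontier:
        # BFS takes the oldest node, DFS the most recently added one.
        node = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if node not in tags:
            checks += 1
            if is_anonymous(node):
                tag(node, lambda n: successors(n, max_levels), True, tags)
            else:
                tag(node, predecessors, False, tags)
        for nxt in successors(node, max_levels):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return tags, checks

MAX_LEVELS = (2, 1, 3)  # hypothetical hierarchy heights for three attributes

def toy_criterion(node):
    # Monotonic stand-in for a real anonymity check.
    return sum(node) >= 3

for strategy in ("bfs", "dfs"):
    tags, checks = search(MAX_LEVELS, toy_criterion, strategy)
    print(strategy, checks, "checks for", len(tags), "transformations")
```

Both strategies classify every transformation; they differ only in the order of visits and therefore in how many checks the tagging saves.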

Page 6: An experimental comparison of globally-optimal data de-identification algorithms

Benchmark – Method

• Use all reasonable combinations of common privacy models with typical parameters
– (k)-anonymity, (l)-diversity, (t)-closeness, (δ)-presence

• Properties of the search space are influenced by combining privacy models:
– (k), (l), (t), (δ)
– (k, l), (k, t), (k, δ), (l, δ), (t, δ)
– (k, l, δ), (k, t, δ)

• Report three basic performance measures
– Pruning power: number of anonymity checks
– Optimizability: number of roll-ups
– Execution times in a highly efficient runtime environment (ARX)

• Five well-known benchmark datasets
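The eleven configurations listed on this slide can be enumerated programmatically. The rule used below, that l-diversity and t-closeness are never combined, is inferred from the list itself and is not stated on the slide; the function name is hypothetical.

```python
from itertools import combinations

MODELS = ("k", "l", "t", "δ")

def reasonable_combinations(models=MODELS):
    """All non-empty subsets that do not contain both l and t (inferred rule)."""
    result = []
    for size in range(1, len(models) + 1):
        for combo in combinations(models, size):
            if not {"l", "t"} <= set(combo):
                result.append(combo)
    return result

for combo in reasonable_combinations():
    print("(" + ", ".join(combo) + ")")

print(len(reasonable_combinations()))  # 11
```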

Page 7: An experimental comparison of globally-optimal data de-identification algorithms

Results – Averaged over datasets

[Figure, three panels: # Checks (lower is better), # Roll-ups (higher is better), Exec. time [s] (lower is better)]

● Allows analyzing variations in results for different sets of privacy models

Page 8: An experimental comparison of globally-optimal data de-identification algorithms

Results – Averaged over datasets

● Repeating patterns

→ Consistent results for different configurations

→ Differences between algorithms not influenced by privacy models used

[Figure, three panels: # Checks (lower is better), # Roll-ups (higher is better), Exec. time [s] (lower is better)]

Page 9: An experimental comparison of globally-optimal data de-identification algorithms

Results – Averaged over datasets

● Breadth-first search is a worst-case strategy

→ No pruning power, no optimizability

→ Incognito suffers from similar performance problems

[Figure, three panels: # Checks (lower is better), # Roll-ups (higher is better), Exec. time [s] (lower is better)]

Page 10: An experimental comparison of globally-optimal data de-identification algorithms

Results – Averaged over datasets

● Depth-first search is pretty efficient

→ Can outperform domain-specific methods (OLA)

→ Because of its optimizability (best method in terms of #roll-ups)

[Figure, three panels: # Checks (lower is better), # Roll-ups (higher is better), Exec. time [s] (lower is better)]

Page 11: An experimental comparison of globally-optimal data de-identification algorithms

Results – Averaged over datasets

● Number of checks: OLA < Flash < DFS < Incognito < BFS

● Number of roll-ups: DFS > Flash > Incognito > OLA > BFS

● Execution times: Flash < OLA < DFS < Incognito < BFS

[Figure, three panels: # Checks (lower is better), # Roll-ups (higher is better), Exec. time [s] (lower is better)]

Page 12: An experimental comparison of globally-optimal data de-identification algorithms

Results – Averaged over privacy models

[Figure, three panels: # Checks (lower is better), # Roll-ups (higher is better), Exec. time [s] (lower is better)]

● Shows variations in results for different datasets

● Algorithms exhibit similar properties

● Flash provides the best overall performance

● Differences are mostly independent of datasets

● But:

– OLA provides performance comparable to Flash for smaller datasets

– DFS provides performance comparable to Flash for larger datasets

Page 13: An experimental comparison of globally-optimal data de-identification algorithms

Lessons learned

• In general, domain-specific algorithms outperform generic methods

→ Up to several orders of magnitude (BFS)

→ OLA and Flash only check between 0.2% and 1.1% of all transformations in the solution space

→ Not necessarily true for large datasets (DFS)

• Flash effectively balances optimizability with pruning power

→ Should be used if optimized runtime environments are available

• OLA provides best pruning power

→ Should be used in general-purpose environments

• DFS outperforms OLA for large datasets

→ In these cases, optimizability is more important than pruning power

→ Optimized runtime environments required
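Optimizability is measured in roll-ups. The sketch below shows one reading of that term, stated as an assumption: the equivalence classes of a more generalized transformation are derived by merging the classes already computed for a more specialized one, so the work depends on the number of classes rather than the number of rows. Function names and the zipcode masking are hypothetical, and ARX's actual implementation is not shown here.

```python
from collections import Counter

def group_rows(rows, generalize):
    """Full scan: generalize every row and count the equivalence classes."""
    return Counter(generalize(row) for row in rows)

def roll_up(classes, generalize_class):
    """Roll-up: merge existing classes by generalizing only their keys."""
    merged = Counter()
    for key, size in classes.items():
        merged[generalize_class(key)] += size
    return merged

# (age, zipcode) pairs taken from the example table earlier in the talk.
ROWS = [(34, "81667"), (45, "81667"), (66, "81925"), (70, "81925"), (70, "81925")]

# Specialized transformation: age intervals, full zipcode (5 rows scanned).
by_interval = group_rows(
    ROWS, lambda row: ("20-60" if row[0] <= 60 else "≥ 61", row[1])
)

# More generalized transformation, derived from only 2 classes.
rolled = roll_up(by_interval, lambda key: ("*", key[1][:3] + "**"))

# Same result as scanning all rows again.
direct = group_rows(ROWS, lambda row: ("*", row[1][:3] + "**"))
print(rolled == direct)  # True
```

A search order that moves from a transformation to one of its direct generalizations can reuse the previous result in this way at every step.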

Page 14: An experimental comparison of globally-optimal data de-identification algorithms

Thank you for your attention!

• ARX is free software

– Download – Use – Contribute

– Repository: https://github.com/arx-deidentifier/arx

• Further information

– Website: http://arx.deidentifier.org

– Contact

● Fabian Prasser ([email protected])

● Florian Kohlmayer ([email protected])
