protecting privacy with fuzzy-feeling test data

fuzzy-feeling test data MATT BOWEN | SENIOR SOFTWARE ENGINEER | APRIL 5, 2016 PROTECTING USER PRIVACY

Upload: matt-bowen

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Protecting privacy with fuzzy-feeling test data

fuzzy-feeling test data

MATT BOWEN | SENIOR SOFTWARE ENGINEER | APRIL 5, 2016

PROTECTING USER PRIVACY

Page 2: Protecting privacy with fuzzy-feeling test data

oh man, where are we going

AGENDA

1. Background and motivations
2. Design goals
3. Big concepts in our design
4. How the actual system works
5. A side-trip into some metaprogramming!
6. Another side-trip, this time into sadness town
7. Where we are today, having left sadness town
8. Q&A

Page 3: Protecting privacy with fuzzy-feeling test data

background and motivations

Page 4: Protecting privacy with fuzzy-feeling test data

what we do

•We help people quit smoking, WITH SCIENCE!
• We build novel tools that help people quit
• We study how effective the tools are

•We’ve made an SMS service, a smartphone app, a couple of online communities, and a system to study them! Some of it even works, clinically!

Page 5: Protecting privacy with fuzzy-feeling test data

BecomeAnEx.org

•Online quit plan, information, and community for smokers

Page 6: Protecting privacy with fuzzy-feeling test data

become an ex study system

•Our web-based clinical trials management system

Page 7: Protecting privacy with fuzzy-feeling test data

automated research studies

•It can screen people based on their registration info for BecomeAnEx.org

Page 8: Protecting privacy with fuzzy-feeling test data

survey capabilities

IT SLICES, IT DICES…WE ALREADY HAD A TOOL CALLED GINZU THOUGH.

•Surveys via LimeSurvey, email follow-ups via Mailgun

Page 9: Protecting privacy with fuzzy-feeling test data

the study system

•Requires a ton of sensitive personally identifying information for running the studies.

Page 10: Protecting privacy with fuzzy-feeling test data

testing data, way back when

I JUST WANTED DUMPS FOR PRODUCTION TO MY LAPTOP…

•Test/QA data totally unrelated to production data
• Worked mostly for testing, but performance testing was tricky
• Testing platform upgrades was also no fun

•Production data CANNOT be on dev boxes
• Can’t risk doing something dumb with PII

REALLY?!

Page 12: Protecting privacy with fuzzy-feeling test data

eventually, we needed to upgrade

THE CAT IS EMBARRASSED BY THE OLD VERSION OF DJANGO

•Our version of Django had gone out of support…

•Our other eggs weren’t exactly fresh either…

•We use MySQL and had no idea what the migrations would do

¯\_(ツ)_/¯

Page 13: Protecting privacy with fuzzy-feeling test data

design goals

Page 14: Protecting privacy with fuzzy-feeling test data

goals

•Protect the anonymity of our users
• Specifically, avoid having a user’s real name, contact information, or any other data point identified as PII by Truth Initiative on a test system.

Page 15: Protecting privacy with fuzzy-feeling test data

goals

•Be able to test performance problems
• Specifically ones related to having data of a certain size and shape

Page 16: Protecting privacy with fuzzy-feeling test data

goals

•Be able to test code related to deciding whether someone should be in a study
• This often depends on who is already enrolled in the study: for example, be able to confirm our system for statistically balancing

Page 17: Protecting privacy with fuzzy-feeling test data

goals

•Be able to reproduce production problems related to data size on test servers
• Sometimes performance problems become critical bugs.

Page 18: Protecting privacy with fuzzy-feeling test data

personal goal

•Have as little code as possible
• Code is terrible.

Page 19: Protecting privacy with fuzzy-feeling test data

moar goals

• Execute completely in under 48 hours — ideally in under 16 for overnight runs.

• To the extent possible, the existing data constraints should be maintained, which meant not just turning off unique constraints and bulk-setting fields to blank values

• The data should look as much as possible like real production data, to minimize surprises when we move from test to production.

Page 20: Protecting privacy with fuzzy-feeling test data

big concepts in our design

Page 22: Protecting privacy with fuzzy-feeling test data

let’s use our backups!

BUT WITH STYLE!

•In my dreams, we could just use our backups directly, but they’re too sensitive…

•… but what if we could just mask the sensitive columns!?

•ENTER FAKER

Page 23: Protecting privacy with fuzzy-feeling test data

faker

GENERATE FAKE DATA!

•Faker lets you generate fake PII, and it’s pretty amusing too
• Fake names
  • Dr. Ammon Raynor DVM
  • Mr. Damarion Adams I
• Fake IPs
• Pretty much everything I needed to fuzz

Page 24: Protecting privacy with fuzzy-feeling test data

i was lucky

•For our systems, the PII is crucial for the researchers but inconsequential for the stuff we test
• We do need some potential PII, but it’s generic: some demographic data like gender and ethnicity, smoking status, some common medical conditions, and country
• I live in fear of the study that actually uses ZIP codes for something, because then I have to worry about combinations of variables being personally identifiable

Page 25: Protecting privacy with fuzzy-feeling test data

how the actual system works

Page 26: Protecting privacy with fuzzy-feeling test data

simple in concept…

•To do what I wanted, I needed to load data from one database into another while replacing some of the values with fake data.

•Enter…

Page 27: Protecting privacy with fuzzy-feeling test data

sqlalchemy is awesome

•Reflects tables into Python classes, automagically
• Lets you set your own custom base classes to be mixed in to the dynamic ones it generates
• Lets you override columns with your own properties
• Lets you subclass the dynamic classes too!

Let’s write some code!
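The code from the following slides is not in this transcript, so here is a hedged sketch of the reflection features listed above, using SQLAlchemy's automap extension (SQLAlchemy 1.4 or newer assumed). The table, the `Describable` mixin, and the `display_name` property are all invented for illustration, and an in-memory SQLite database stands in for MySQL.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# In-memory SQLite stands in for the real database.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE participant (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    ))
    conn.execute(text(
        "INSERT INTO participant (name, country) VALUES ('Real Person', 'US')"
    ))


class Describable:
    """Hypothetical custom base, mixed in to every class automap generates."""

    def as_dict(self):
        return {c.name: getattr(self, c.name) for c in self.__table__.columns}


Base = automap_base(cls=Describable)


class Participant(Base):
    """Explicit class for one table; its columns are filled in by reflection."""

    __tablename__ = "participant"

    @property
    def display_name(self):
        # An extra property layered on top of the reflected columns.
        return self.name.upper()


# Older SQLAlchemy releases spelled this prepare(engine, reflect=True).
Base.prepare(autoload_with=engine)

with Session(engine) as session:
    row = session.query(Participant).one()
    print(row.as_dict(), row.display_name)
```

Tables without an explicit class still get a generated one, reachable through `Base.classes`.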

Page 28: Protecting privacy with fuzzy-feeling test data

how’s it look?

MAKING THE DB CONNECTIONS

Page 29: Protecting privacy with fuzzy-feeling test data

how’s it look? MIGRATING THE DATA

Page 30: Protecting privacy with fuzzy-feeling test data

how’s it look?

GETTING OUR CUSTOM SUBCLASSES

Page 31: Protecting privacy with fuzzy-feeling test data

how’s it look? MIGRATING THE DATA

Page 32: Protecting privacy with fuzzy-feeling test data

loading a sql dump really slowly

•With reflection and a little shell scripting to get the database schema in my target database, I had a really slow version of

•mysql db1 < dump.sql

•But we still aren’t actually doing any fuzzing of the data…
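A rough sketch of that "really slow dump loader" stage: reflect both databases and copy every row across unchanged. This is an assumption about the shape of the code, not the author's version. Two in-memory SQLite engines stand in for the MySQL source and target, and the table is invented. SQLAlchemy 1.4 or newer is assumed.

```python
from sqlalchemy import MetaData, create_engine, insert, select, text

source = create_engine("sqlite://")  # stand-in for the restored backup
target = create_engine("sqlite://")  # stand-in for the test database

ddl = "CREATE TABLE participant (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
with source.begin() as conn:
    conn.execute(text(ddl))
    conn.execute(text(
        "INSERT INTO participant (name, email) "
        "VALUES ('Real Person', 'real@example.org')"
    ))
with target.begin() as conn:
    conn.execute(text(ddl))  # the talk loaded the schema with shell scripting

source_meta = MetaData()
source_meta.reflect(bind=source)
target_meta = MetaData()
target_meta.reflect(bind=target)

copied = 0
with source.connect() as src, target.begin() as dst:
    # sorted_tables lists parent tables before the tables that reference them.
    for table in source_meta.sorted_tables:
        target_table = target_meta.tables[table.name]
        for row in src.execute(select(table)):
            dst.execute(insert(target_table), dict(row._mapping))
            copied += 1

print(f"copied {copied} rows")
```

Nothing is masked yet; every value lands in the target exactly as it was in the source.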

Page 33: Protecting privacy with fuzzy-feeling test data

it’s time for metaprogramming

no, YOU’RE hard to debug

Page 34: Protecting privacy with fuzzy-feeling test data

hybrid properties

•SQLAlchemy has a built-in way of creating dynamic fields on models, called hybrid attributes
• these properties act just like regular model attributes
• you can query on them at the class level
• you access them at the instance level
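A small, self-contained illustration of a hybrid attribute, using SQLAlchemy's `hybrid_property` decorator. The model and its columns are made up for the example (SQLAlchemy 1.4 or newer assumed).

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Participant(Base):
    __tablename__ = "participant"

    id = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)

    @hybrid_property
    def full_name(self):
        # On an instance this is plain string concatenation; on the class
        # the same expression becomes SQL string concatenation.
        return self.first_name + " " + self.last_name


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Participant(first_name="Ada", last_name="Lovelace"))
    session.commit()

    # Class level: the hybrid is used inside a query filter.
    match = session.query(Participant).filter(
        Participant.full_name == "Ada Lovelace"
    ).one()
    # Instance level: the hybrid is read like any other attribute.
    found_name = match.full_name

print(found_name)
```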

Page 35: Protecting privacy with fuzzy-feeling test data

safe property mixin

JUST OVERRIDE __GETATTRIBUTE__, IT’LL BE FINE
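The mixin's code is not in the transcript. One way such a mixin could work, shown here in plain Python with no database, is to redirect any read of `x` to `safe_x` when the class defines one. The class and attribute names are hypothetical.

```python
class SafePropertyMixin:
    """Redirect reads of `x` to `safe_x` whenever the class defines `safe_x`."""

    def __getattribute__(self, name):
        # Private names and the safe_ attributes themselves are never
        # redirected, which keeps the lookup from recursing.
        if not name.startswith(("_", "safe_")):
            safe_name = "safe_" + name
            if hasattr(type(self), safe_name):
                return super().__getattribute__(safe_name)
        return super().__getattribute__(name)


class Participant(SafePropertyMixin):
    def __init__(self, name, country):
        self.name = name
        self.country = country

    @property
    def safe_name(self):
        # A real safe property would return Faker output. It must not read
        # self.name, because that read would be redirected back here.
        return "Masked Person"


p = Participant("Real Person", "US")
print(p.name, p.country)
```

Code that reads `p.name` gets the masked value without knowing masking exists, while unmasked fields such as `country` pass straight through.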

Page 36: Protecting privacy with fuzzy-feeling test data

finally, the whole architecture

LET’S PULL IT TOGETHER HERE, PEOPLE

•Create base classes for every table that has PII in it
• Write a hybrid property named safe_whatever for any fields that need to be masked

•Loop through all the tables
• looking up our custom base classes for the source tables
• converting the data from the row to a dictionary, relying on the safe properties to mask PII
• and writing to the target database using SQLAlchemy’s auto-generated classes for the targets
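The steps above can be sketched end to end as follows. This is an approximation under stated assumptions, not the tool itself: the schema, the `CUSTOM_CLASSES` registry, and the `masked_dict` helper are invented, SQLite stands in for MySQL, and the masking here looks up `safe_` attributes explicitly rather than through a `__getattribute__` mixin (SQLAlchemy 1.4 or newer assumed).

```python
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import Session

SCHEMA = [
    "CREATE TABLE participant (id INTEGER PRIMARY KEY, name TEXT, country TEXT)",
    "CREATE TABLE study (id INTEGER PRIMARY KEY, title TEXT)",
]
source = create_engine("sqlite://")  # stand-in for the restored backup
target = create_engine("sqlite://")  # stand-in for the test database
for engine in (source, target):
    with engine.begin() as conn:
        for ddl in SCHEMA:
            conn.execute(text(ddl))
with source.begin() as conn:
    conn.execute(text(
        "INSERT INTO participant (name, country) VALUES ('Real Person', 'US')"
    ))
    conn.execute(text("INSERT INTO study (title) VALUES ('Quit study')"))

SourceBase = automap_base()


class Participant(SourceBase):
    """Custom source class for a table that holds PII."""

    __tablename__ = "participant"

    @hybrid_property
    def safe_name(self):
        return "Masked Person"  # Faker output would go here


SourceBase.prepare(autoload_with=source)
TargetBase = automap_base()
TargetBase.prepare(autoload_with=target)

# Hypothetical registry: table name -> custom source class.
CUSTOM_CLASSES = {"participant": Participant}


def masked_dict(row):
    """Row -> dict, preferring safe_<column> whenever the class defines it."""
    return {
        col.name: getattr(row, "safe_" + col.name, getattr(row, col.name))
        for col in row.__table__.columns
    }


with Session(source) as src, Session(target) as dst:
    for table in SourceBase.metadata.sorted_tables:
        source_cls = CUSTOM_CLASSES.get(table.name) or getattr(
            SourceBase.classes, table.name
        )
        target_cls = getattr(TargetBase.classes, table.name)
        for row in src.query(source_cls):
            dst.add(target_cls(**masked_dict(row)))
    dst.commit()
```

Tables with no custom class, like `study` here, are copied verbatim, so only tables known to hold PII need any hand-written code.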

Page 37: Protecting privacy with fuzzy-feeling test data

one ironic hitch

this is the sadness town part

Page 38: Protecting privacy with fuzzy-feeling test data

was testing with test data

•Before I was reasonably sure the tool worked, I had to use test data to test it

• This is the test data I was talking about before…• small data• not really representative of our actual data…

Page 40: Protecting privacy with fuzzy-feeling test data

it crashed on production data

THANK GOODNESS FOR SQLALCHEMY AND OURSQL

•It turned out, I was holding too many rows in memory at a time and running out of RAM…

•There were two optimizations that really helped a lot
• first, SQLAlchemy’s yield_per lets you grab a limited number of rows from a query at a time without handling the LIMIT pagination yourself
• second, oursql lets you do streaming instead of buffering on the client, which greatly reduces how much memory you use
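A minimal sketch of the first optimization, `Query.yield_per`. The model, row count, and batch size are illustrative, and SQLite stands in for MySQL. Driver-level streaming (the oursql part) is configured separately and is not shown here.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Participant(Base):
    __tablename__ = "participant"

    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

BATCH_SIZE = 100  # illustrative; the talk mentions a hard-coded batch size

with Session(engine) as session:
    session.add_all([Participant(name="person %d" % i) for i in range(250)])
    session.commit()

    seen = 0
    # yield_per asks the ORM to build objects BATCH_SIZE rows at a time
    # instead of materialising the whole result as one list first.
    for row in session.query(Participant).yield_per(BATCH_SIZE):
        seen += 1  # the real loop would mask the row and write it out

print(seen)
```

The loop body looks the same as an unbatched query; only the memory profile changes.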

Page 41: Protecting privacy with fuzzy-feeling test data

let’s see it again

LET’S LOOK AT YIELD-PER THIS TIME

Page 42: Protecting privacy with fuzzy-feeling test data

this made things very slow

THE CAT IS CRYING BECAUSE IT IS SAD THE LOADS TOOK SO LONG

Thankfully, I had done no optimization, so there were easy speedups available

•If you’re doing bulk data loading like this, there are a couple things you should do as a matter of course.

•Note that this is MySQL/InnoDB specific

Page 43: Protecting privacy with fuzzy-feeling test data

drop and re-add indexes at the end

Before you start…

After the data’s loaded
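The statements from this slide did not survive in the transcript. In MySQL the usual shape is `ALTER TABLE ... DROP INDEX ...` before the load and `ALTER TABLE ... ADD INDEX ...` afterwards. The sketch below shows the same idea through SQLAlchemy's reflected `Index` objects, run against SQLite so it is self-contained. The table and index names are invented.

```python
from sqlalchemy import MetaData, create_engine, inspect, text

engine = create_engine("sqlite://")  # stand-in for the MySQL target
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE participant (id INTEGER PRIMARY KEY, email TEXT)"))
    conn.execute(text("CREATE INDEX ix_participant_email ON participant (email)"))

meta = MetaData()
meta.reflect(bind=engine)
table = meta.tables["participant"]
saved_indexes = list(table.indexes)  # keep the definitions for later

# Before you start: drop the secondary indexes.
for index in saved_indexes:
    index.drop(bind=engine)
during_load = inspect(engine).get_indexes("participant")

# Bulk load happens here, with no index maintenance per inserted row.
with engine.begin() as conn:
    conn.execute(text("INSERT INTO participant (email) VALUES ('a@example.org')"))

# After the data is loaded: rebuild each index in one pass.
for index in saved_indexes:
    index.create(bind=engine)
after_load = inspect(engine).get_indexes("participant")

print(during_load, [ix["name"] for ix in after_load])
```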

Page 44: Protecting privacy with fuzzy-feeling test data

destroy all constraints
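This slide's content is also missing, so it is unclear whether constraints were dropped or only disabled. A common approach for InnoDB bulk loads is to switch off the `foreign_key_checks` and `unique_checks` session variables during the load and restore them afterwards. The sketch below assumes that approach; the `relaxed_checks` helper and the recording cursor are hypothetical, and the cursor exists only so the example runs without a MySQL server.

```python
from contextlib import contextmanager


@contextmanager
def relaxed_checks(cursor):
    """Disable MySQL foreign-key and unique checks for this session only."""
    cursor.execute("SET foreign_key_checks = 0")
    cursor.execute("SET unique_checks = 0")
    try:
        yield cursor
    finally:
        # Always restore the checks, even if the load fails part way.
        cursor.execute("SET unique_checks = 1")
        cursor.execute("SET foreign_key_checks = 1")


class RecordingCursor:
    """Stand-in for a DB-API cursor; it just records the SQL it is given."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


cursor = RecordingCursor()
with relaxed_checks(cursor):
    cursor.execute("INSERT INTO participant (id, name) VALUES (1, 'Masked Person')")

print(cursor.statements)
```

MySQL does not re-validate existing rows when the checks are turned back on, so this is only safe when the data being loaded is already consistent, as a copy of a production backup should be.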

Page 45: Protecting privacy with fuzzy-feeling test data

where we are today

spoiler: we have left sadness town

Page 46: Protecting privacy with fuzzy-feeling test data

it basically works!

WE USE THIS TOOL

•I got the kinks out and was able to get correct test data for the upgrade.

•It’s still “slow”: it takes 8-10 hours to load a 2.5 GB dump. But it’s fast enough.

•We extended the tool to move data from LimeSurvey (which we didn’t write) and fuzz it automatically too

Page 47: Protecting privacy with fuzzy-feeling test data

how limesurvey works

THIS IS A LITTLE HACKY

Page 48: Protecting privacy with fuzzy-feeling test data

there are some limitations

•We’re currently pretty deeply tied to the apps the tool was written for, and to MySQL

•Again, it’s not the kind of thing you can just kick off, grab coffee, and then have your data all loaded. This is an overnight process at best.

•Because I hard-coded the batch size of rows, sometimes it’ll still be killed due to resource starvation if someone kicks off another memory-intensive job on the same box.

Page 49: Protecting privacy with fuzzy-feeling test data

where I’d like to go

•Make it into a framework for writing fuzzy migrations. There’s not that much fancy stuff in here, but making it configurable and writing a tutorial might make it generally useful.

•Dynamically tune batch size based on available RAM/swap, flushing when you get close to a limit.
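One possible shape for that batch-size tuning, as a pure-Python sketch. The function name, the 25% memory budget, and the bounds are all assumptions. Measuring available memory is platform-specific and is left to the caller.

```python
def pick_batch_size(available_bytes, bytes_per_row, fraction=0.25,
                    minimum=100, maximum=50_000):
    """Choose how many rows to hold at once from a memory budget.

    Only `fraction` of the available memory is budgeted for rows, leaving
    headroom for other jobs on the same box.
    """
    budget = int(available_bytes * fraction)
    return max(minimum, min(maximum, budget // bytes_per_row))


# With roughly 1 MB free and 2 kB rows, a quarter of memory holds 125 rows.
small_box = pick_batch_size(1_000_000, 2_000)
# With roughly 1 GB free the upper bound applies instead.
big_box = pick_batch_size(1_000_000_000, 2_000)
print(small_box, big_box)
```

Calling this between batches, rather than once at startup, is what would let a run shrink its batches when another memory-intensive job starts.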

•For the Study System in particular, there’s an obvious place to add a cache that I suspect would make it MUCH faster.

Page 50: Protecting privacy with fuzzy-feeling test data

questions?

http://bit.ly/fuzzy-data-blog

also we are hiring python programmers

Page 51: Protecting privacy with fuzzy-feeling test data

thank you

[email protected]