how private is synthetic data - replica analytics

45
How Private is Synthetic Data ? December 4 th 2019 1

Upload: others

Post on 18-Feb-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: How Private is Synthetic Data - Replica Analytics

How Private is Synthetic Data ?December 4th 2019

1

Page 2: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Agenda

2

Page 3: How Private is Synthetic Data - Replica Analytics

[email protected]

For more information:

Page 4: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Appendices• Presenter biographies

4

Page 5: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Richard Hoptroff

5

Richard Hoptroff is a long term technology inventor, investor and entrepreneur. Awarded a Ph.D. in Physics by King’s College London for his work in optical computing and artificial intelligence. In 1992, he co-founded Right Information Systems, a neural net forecasting software company that was later sold to Cognos Inc (part of IBM).

He then worked as a postdoc at the Research Laboratory for Archaeology and the History of Art at Oxford University and in 2001, created Flexipanel Ltd, a company supplying Bluetooth modules to the electronics industry.

In 2010, he founded Hoptroff London, with the aim to develop smart, hyper-accurate watch movements and create a new watch brand. In 2013 Richard Hoptroff established a new commercial category when he brought to market the first commercial atomic timepiece and an atomic wristwatch.

Richard Hoptroff then leveraged his expertise in timing technology and software to develop a hyper-accurate synchronized timestamping solution for the financial services sector, based on a unique combination of grandmaster atomic clock engineering and proprietary software.

Page 6: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Lucy Mosquera

6

Lucy Mosquera has a background in biology and mathematics, having done her studies at Queen's University in Kingston and the University of British Columbia. In the past she has provided data management support to clinical trials and observational studies at Kingston General Hospital. She also worked on clinical trial data sharing methods based on homomorphic encryption and secret sharing protocols with various companies.

At Replica Analytics, Lucy is responsible for integrating her subject area expertise in health data into innovative methods for synthetic data generation and the assessment of that data, as well as overseeing our analytics program.

Page 7: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Mike Hintze

7

Mike Hintze is a partner at Hintze Law PLLC. As a recognised leader in the field, he advises companies, industry associations and other organisations on global privacy and data protection law, policy and strategy. He was previously Chief Privacy Counsel at Microsoft, where for over 18 years he counselled on data protection compliance globally and helped lead the company’s strategic initiatives on privacy differentiation and public policy. Mike also teaches privacy law at the University of Washington School of Law, serves as an adviser to the American Law Institute’s project on Information Privacy Principles and has served on multiple advisory boards for the International Association of Privacy Professionals and other organisations. Mike has testified before Congress, state legislatures and European regulators; and he is a sought-after speaker and regular writer on data protection issues. Prior to joining Microsoft, Mike was an associate with Steptoe & Johnson LLP, which he joined following a judicial clerkship with the Washington State Supreme Court. Mike is a graduate of the University of Washington and the Columbia University School of Law.

Page 8: How Private is Synthetic Data - Replica Analytics

Privacy Risks in Synthetic DataLucy Mosquera

December 4, 2019

1

Page 9: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Generation Methods - A

2

Page 10: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Privacy Risks• It is important to be precise about the types of privacy

risks that are being managed by data synthesis

• This also allows us to be precise regarding the legal

standing of data synthesis and the synthesized data

itself

• We will start off with some definitions

3

Page 11: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Identity DisclosureWhat the adversary knows

• Hiroshi is in the data

• Hiroshi is Japanese

What the adversary learns

• Hiroshi is the first record

• Hiroshi’s income is $120k

4

Origin IncomeJapanese $120k

North African $100k

European $110k

Page 12: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Attribute Disclosure

What the adversary knows

• Hiroshi is in the data

• Hiroshi was born in the

decade 1950-595

Decade of Birth Gender Diagnosis1950-1959 Male Prostate Cancer1950-1959 Male Prostate Cancer1950-1959 Male Prostate Cancer1980-1989 Male Lung Cancer1980-1989 Female Breast Cancer

What the adversary learns

• Hiroshi has prostate

cancer with 100%

certainty

Page 13: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Inferential Disclosure

What the adversary knows

• Hiroshi is in the data

• Hiroshi was born in the

decade 1950-596

Decade of Birth Gender Diagnosis1950-1959 Male Prostate Cancer1950-1959 Male Prostate Cancer1950-1959 Male Lung Cancer1950-1959 Male Lung Cancer1980-1989 Female Breast Cancer

What the adversary learns

• Hiroshi has prostate

cancer with 50%

certainty

Page 14: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Learning Something New

What the adversary knows

• Hiroshi is in the data

• Hiroshi was born in the year

1959

• Hiroshi is male

• Hiroshi has an income of $120k7

Year of Birth Gender Income1959 Male $120k1959 Male $100k1959 Female $120k1959 Male $110k

What the adversary learns

• Hiroshi is the first record

Page 15: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

What are the Risks ?• The focus should be on learning something new

about an individual, not about learning something

new about a group, even if we know that a

particular person is in the group

• Learning something new about a group is the

essence of data analysis / statistics – applying

data protection methods to eliminate that would

be very detrimental to data utility

8

Page 16: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Group Learning• Let’s say we build a model that predicts the likelihood

of prostate cancer based on age and gender• We can learn that Hiroshi has a high likelihood of being diagnosed

with prostate cancer because he is in the data and he is in the group

• We can learn that Satoshi has a high likelihood of being diagnosed

with prostate cancer even if Satoshi is not in the data (but he is in the

group – same age and gender)

• It becomes a real problem if we restrict learning even

if someone is not in the data – but that is the essence

of learning something new about a group

9

Page 17: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Ethics Risks• We should worry whether what we learn can cause

harm – that is a subjective decision and can be

addressed through an ethics review process

• Harm will be a function of the certainty by which we

can draw the conclusion / inference

• Decisions made from what we learn can be

discriminatory and can also cause harm – that is also

subjective and can be addressed through an ethics

review10

Page 18: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Privacy Risks• Therefore, it is necessary to learn something new

about a specific individual – assigning a record to a

real person is not very meaningful if nothing new is

learned (this may be theoretically a disclosure but has

no impact)

• We also consider learning something new that is

unusual (i.e., part of a large group then the

information gain is reduced)

11

Page 19: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Something Common

12

X

Page 20: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Something Unusual

13

X

Page 21: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Meaningful Identity Disclosure• We are able to assign an identity to the specific

record

• We learn something new and unusual about that

specific record (information gain is high)

14

Decade of Birth Gender Income1950-1959 Male $250k1960-1969 Male $30k1950-1959 Female $20k1940-1949 Male $45k1980-1989 Female $40k

Page 22: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Not Meaningful Identity Disclosure

• Not being able to assign an identity to a record

• Not learning something new

• Learning something new about a group of individuals

without being able to know which record belongs to

my target (i.e., drawing an inference, irrespective of

accuracy, about a group)

15

Page 23: How Private is Synthetic Data - Replica Analytics

Legal AnalysisMike Hintze

16

Page 24: How Private is Synthetic Data - Replica Analytics

Privacy Law and Synthetic DataMIKE HINTZE

PARTNER, HINTZE LAW PLLC

AFFILIATE INSTRUCTOR, UNIV. OF WASHINGTON SCHOOL OF LAW

Page 25: How Private is Synthetic Data - Replica Analytics

Global Trends in Privacy LawNumber of privacy laws around the world has grown dramatically in recent years

Increasingly onerous requirements that make using and sharing data more difficult, costly, and risky

Many privacy laws provide some flexibility for research uses of data, but provisions are often too narrow, vague, or restrictive to be useful

Page 26: How Private is Synthetic Data - Replica Analytics

3 Key Questions for Synthetic Data1. Is the use of the original (real) data set to generate and/or evaluate a

synthetic data set restricted or regulated under the law?

2. Is sharing the original data set with a third-party service provider to generate the synthetic data set restricted or regulated under the law?

3. Does the law regulate or otherwise affect (if at all) the resulting synthetic data set?

Page 27: How Private is Synthetic Data - Replica Analytics

Synthetic Data and Europe’s GDPR (1)Is the use of the original (real) data set to generate and/or evaluate a synthetic data set restricted or regulated under the law?◦ Any processing of personal data (including the processing to

create or test a synthetic data set) requires a “legal basis”◦ Where obtaining consent is impracticable, the most suitable

legal basis is often “legitimate interests”◦ Legitimate interests requires a balancing test but given the

benefits of synthetic data to both the data controller and the data subject, processing personal data for this purpose will almost always be justified

◦ Other GDPR obligations on the processing of personal data must also be observed (security, transparency, etc.)

Page 28: How Private is Synthetic Data - Replica Analytics

Synthetic Data and Europe’s GDPR (2)Is sharing the original data set with a third-party service provider to generate the synthetic data set restricted or regulated under the law?◦ Under the GDPR, a “data controller” can share personal data with a “data processor” as necessary to

enable the processor to perform a service at the direction of and on behalf of the controller◦ The controller has a duty of care to select a processor that can provide sufficient guarantees regarding

the protection of the data◦ There must be a contract in place between the controller and the processor that obligates the processor

to:◦ process the personal data “only on documented instructions from the controller” ◦ have reasonable security procedures and practices to protect the personal data; and ensure that each person processing

the personal data is subject to a duty of confidentiality with respect to the data ◦ engage a subcontractor only pursuant to a written contract that passes through the data protection requirements and only

with the general or specific prior consent of the controller ◦ assist the controller as needed to meet the controller’s obligations with respect to data subjects’ rights, data security,

notification of data breaches, risk assessments, and consultation with regulators ◦ delete or return all personal data to the controller at the completion of the service(s) – unless retention is required by law ◦ provide to the controller, upon request, information necessary to demonstrate the processor’s compliance with its

obligations; and permit audits by the controller (or an auditor selected by the controller)

Page 29: How Private is Synthetic Data - Replica Analytics

Synthetic Data and Europe’s GDPR (3)Does the law regulate or otherwise affect (if at all) the resulting synthetic data set?◦ GDPR defines personal data broadly as:

any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

◦ Sets a high bar for de-identification. De-identified data is still real data related to real individuals. Pseudonymized data is still personal data. Very challenging to achieve fully anonymous data through traditional de-identification.

◦ By contrast, synthetic data is not real data related to real individuals. There is no link between records in a synthetic data set and records in the original (real) data set.◦ The records do not “relat[e] to an identified or identifiable natural person”◦ It does not “reference the physical, physiological, genetic, mental, economic, cultural, or social identity” of an

actual natural person◦ Thus, synthetic data should not be considered “personal data” subject to the GDPR

Page 30: How Private is Synthetic Data - Replica Analytics

Synthetic Data and California’s CCPA (1)Is the use of the original (real) data set to generate and/or evaluate a synthetic data set restricted or regulated under the law?◦ Unlike the GDPR, the CCPA does not require a lawful basis to process

personal information◦ Does not significantly regulate the collection and internal use of data◦ Thus, using a real data set to generate a synthetic data set is not

significantly regulated by the CCPA◦ Companies still must meet certain obligations with respect to any

processing of personal data, such as providing sufficient notice.

Page 31: How Private is Synthetic Data - Replica Analytics

Synthetic Data and California’s CCPA (2)Is sharing the original data set with a third-party service provider to generate the synthetic data set restricted or regulated under the law?◦ A significant focus of the CCPA is on regulating the “sale” of personal information◦ “Sale” is defined very broadly, and in a way that could cover almost any transfer of personal information

for a commercial purpose. ◦ However, transfers of data to service providers are exempt from the restrictions on “sales” if:◦ The business has provided notice that personal information will be shared with service providers. ◦ The service provider does not collect, use, sell, or disclose the personal data for any purpose other than as

necessary to provide the service(s) on behalf of the business. ◦ There is a written contract between the business and the service provider that specifies the service provider is

prohibited from retaining, using, selling, or disclosing the personal information for any purpose other than for the specific purpose of performing the services specified in the contract on behalf of business.

Page 32: How Private is Synthetic Data - Replica Analytics

Synthetic Data and California’s CCPA (3)Does the law regulate or otherwise affect (if at all) the resulting synthetic data set?◦ CCPA defines “personal information” broadly as any information that “that identifies, relates to,

describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.”

◦ This definition should not capture synthetic data. Records in a synthetic set should not be seen as being associated, linked, or related to a particular real consumer or household.

◦ Further, the CCPA definition or “personal information” specifies that it does not include “aggregate consumer information” ◦ “Aggregate consumer information” is defined as “information that relates to a group or category of consumers,

from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device.”

◦ Although a synthetic data set could be seen as applying to a group or category of consumers, the exclusion for aggregate data gives additional weight to the conclusion that a synthetic data set is not “personal information.”

◦ Thus, because synthetic data is not “personal information” under the CCPA, it is not subject to the requirements of CCPA. It can therefore be freely used and distributed – even “sold” – without restriction.

Page 33: How Private is Synthetic Data - Replica Analytics

Synthetic Data and U.S. Federal HIPAA (1)Is the use of the original (real) data set to generate and/or evaluate a synthetic data set restricted or regulated under the law?◦ Under HIPAA, many uses of “protected health information” requires

individual authorization, but there are some exemptions◦ One exempted use is for “uses and disclosures to create de-identified

information” which means “to create information that is not individually identifiable health information.”

◦ Another is for “health care operations” which includes “general administrative activities of the entity, including, but not limited to ... creating de-identified health information or a limited data set.”

◦ Although synthetic data is distinct from de-identified data, the use of PHI for creating either should be thought of similarly under HIPPA, and the above-quoted exemptions are broad enough to include the use of PHI to create synthetic data sets

Page 34: How Private is Synthetic Data - Replica Analytics

Synthetic Data and U.S. Federal HIPAA (2)Is sharing the original data set with a third-party service provider to generate the synthetic data set restricted or regulated under the law?◦ Under HIPAA, a covered entity is permitted to share PHI with a “business associate” providing a service

on behalf of that covered entity ◦ A contract or similar arrangement must be in place between a covered entity and the business

associate. The contract must:◦ specify the nature of the service for which the PHI is shared◦ describe the permitted and required uses of protected health information by the business associate ◦ provide assurances that the business associate will appropriately protect the privacy and security of the PHI

◦ In addition to the contractual terms, business associates are directly subject to the HIPAA Security Rule and certain aspects of the HIPAA Privacy Rule

◦ Thus, as long as there is an appropriate contract in place that meets the requirements of a business associate agreement under the HIPAA Privacy Rule, and the service provider meets its other obligations under the Rule, the sharing of PHI with a service provider to create synthetic data is allowed

Page 35: How Private is Synthetic Data - Replica Analytics

Synthetic Data and U.S. Federal HIPAA (3)Does the law regulate or otherwise affect (if at all) the resulting synthetic data set?◦ HIPAA regulates “individually identifiable health information” and “protected health information.”◦ “Individually identifiable health information” is information created or received by a covered entity, relating to the

physical or mental health or condition of an individual, the provision of health care to an individual, or the payment for the provision of health care to an individual, where that information either identifies the individual or for which there is a reasonable basis to believe the information can be used to identify the individual.

◦ “Protected health information” is roughly the same; it is defined as “individually identifiable health information,” subject to a few minor exclusions for certain educational records and employment records.

◦ Because synthetic data is not “real” data related to actual individuals, synthetic data does not identify any individual, nor can it reasonably be used to identify an individual.

◦ Synthetic data is therefore outside the scope of HIPAA and not subject to the requirements of the HIPAA rules. It can therefore be freely used in clinical trials, shared for other research purposes, or even made publicly available without restriction.

Page 36: How Private is Synthetic Data - Replica Analytics

Key TakeawaysBecause the creation of synthetic data starts with processing a set of personal data (i.e. data relating to real people), laws that affect the processing of personal data will impact the initial creation and testing of synthetic data

Privacy laws generally permit such processing

The processing must adhere to other requirements of the law that apply in any case – such as transparency and data security

Sharing the original (real) data set with service providers that create synthetic data is also permitted, subject to certain obligations such as no secondary uses by the service providers, ensuring adequate protection of the personal data, and other contractual assurances

The resulting synthetic data set(s) should not be considered personal data, even under privacy laws with very broad definitions

Thus, synthetic data can be freely used and shared – even publicly released – for any purposes

Page 37: How Private is Synthetic Data - Replica Analytics

Measuring Privacy Risks inSynthetic Data

Lucy MosqueraDecember 4, 2019

1

Page 38: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Privacy Assurance• We have developed a privacy assurance framework to

quantitatively assess the risk of meaningful identity disclosure:1. The likelihood that an individual in the synthetic data can be matched with a

real person2. If such a match is possible, will the adversary learn something new from such

a match (because the data is fully synthetic, even if there is a match the information may be sufficiently different that nothing meaningful would be learned)

• Both tests must pass for a meaningful identity disclosure to have occurred

2

Page 39: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Matching Experiments

3

Page 40: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Similar and Unusual

4

Page 41: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Meaningful Identity Disclosure• We can compute the likelihood that one of these two

directions of attack is successful:• Match a synthetic record with a real person

• Learn something new about the real person from the synthetic record

• Various experiments on clinical trial and hospital

discharge data show that this probability is very small

and lower than generally accepted thresholds for

identity disclosure

5

Page 42: How Private is Synthetic Data - Replica Analytics

QUESTIONScc: an untrained eye - https://www.flickr.com/photos/26312642@N00

Page 43: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

You will receive• The materials from this webinar

• Over the next few of weeks we will send you

additional reports on this topic:• Accelerating AI Using Synthetic Data report

• Technical report / paper on privacy assurance

• Notice about the book on synthetic data generation when it comes out in

2020

• An invitation to our next webinar in 2020

2

Page 44: How Private is Synthetic Data - Replica Analytics

(c) Copyright 2019 Replica Analytics Ltd. All Rights Reserved.

Next Steps• If you want to learn more about synthetic data please

contact us: [email protected]

• You will be asked to opt-in to receive more information from us (technical reports and whitepapers, future webinars on AI and synthetic data, live events that we organize, and newsletters) that may be of interest to you. It is important that you opt-in to continue to receive these communications.

3

Page 45: How Private is Synthetic Data - Replica Analytics

Lucy Mosquera: [email protected]

Mike Hintze: [email protected]

4

[email protected]