TRANSCRIPT · 4/23/2019
Privacy-preserving federated machine learning
Andreas Hellander, Salman Toor
Department of Information Technology, Division of Scientific Computing, Uppsala University
Scaleout: www.scaleoutsystems.com
Federated Machine Learning
● Federated Conformal Prediction
● Algorithms for FedML
● FedML security, Blockchain
Cloud Computing
https://www.it.uu.se/research/group/dca
Data Engineering Sciences
● Hierarchical analysis of spatial
and temporal image data
(HASTE)
● Parallel, peer-to-peer streaming
● Intelligent storage backends
● Continuous analytics
Distributed Computing Applications @ IT/UU
Background
● Founded out of three research
teams at Uppsala University.
● Applied focus on large scale
production infrastructure in
computational science and
biotech research.
Expertise
● Cloud architecture
● Machine learning pipelines
● Continuous analytics
● Scientific data management
Cases
● SNIC cloud
● SciLifeLab
● IIS
● Rymdstyrelsen
● Safespring
Bridging the gap between research and production grade systems in machine learning
Scaleout
The centralized ML paradigm
Data Store 1
Machine learning model
Data Store 2
Data Store 3
Central Data Store
Queries
Predictions
1. Centralize data from different
sources (data lake, cloud).
2. Create ML model using centralised
data (cluster computing)
But in many cases we cannot move data
Private/Proprietary Data
Regulated Data
Big data
Central Data Store -> Machine learning model
How can parties construct joint ML models without sharing/pooling data?
Federated machine learning
1. Train local machine
learning model on
local/private data.
2. Combine local model
updates into a global,
federated model.
Smart software on top of decentralized infrastructure/instruments
● Lets a supplier of physical infrastructure/instruments build smart software to support all clients.
● Calibration, predictive maintenance etc.
● Customer A’s data is never shared with Customer B, or with the supplier.
● High-value, unique software offering for those using the FedML services.
Federated Model
Software services
Federated learning system | Infrastructure vendor
Integrity-preserving smart homes
● Digital tools/video surveillance in home care.
Train and deploy models based on homeowners’ private interactions without collecting central data.
Integrity preserving fleet management
● Model driver/staff behavior without compromising their integrity.
● Big data, poor connectivity
By Éric Chassaing - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8876959
Key benefit of federated learning
Promises to let parties collaborate to build stronger models than could be attained by the parties in isolation.
● This example uses incremental learning of linear models to do FedML.
● Stochastic Gradient Descent.
● One of many possible approaches to
decentralized model construction.
N. Gauraha, O. Spjuth, A. Hellander (2019), manuscript in preparation.
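One simple incremental scheme can be sketched as follows (an illustration of the idea, not the exact method of the cited manuscript): the linear model travels from party to party, each of which updates it with SGD on its own private data, so only weights ever leave a site. All data and parameter names here are made up.

```python
import numpy as np

def sgd_partial_fit(w, X, y, lr=0.05, rng=None):
    """One incremental SGD pass for a linear model on one party's private data."""
    rng = rng or np.random.default_rng(0)
    for i in rng.permutation(len(y)):
        w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w

# The model visits each party in turn; only the weight vector travels,
# never the raw data.
rng = np.random.default_rng(1)
true_w = np.array([1.5, -0.5])
parties = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    parties.append((X, X @ true_w))          # synthetic, noiseless local data

w = np.zeros(2)
for _ in range(5):                           # a few laps around the federation
    for X, y in parties:
        w = sgd_partial_fit(w, X, y, rng=rng)
```

After a few laps the travelling model fits all parties' data jointly, even though no party ever saw another's examples.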
Where does FedML fit? A decision flowchart:
● Ok to share data? Yes -> standard ML on pooled data.
● Ok to share model/parameters? Yes -> create one joint model.
● Ok to share features? Yes -> combine predictions of separate models.
● Otherwise -> privacy-preserving/data-protecting ML: our focus area.
Example: FedML on Gboard
● Local model for search suggestions, with context and whether the suggestion was clicked.
● On device the history is processed, and then only a model update is sent to the server.
● Based on Federated Averaging.
https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
Federated Averaging
From McMahan et al. https://arxiv.org/abs/1602.05629
1. Out of K alliance members/clients, pick a fraction C to do a global model update.
2. Perform E epochs of SGD on local minibatch of size B.
3. Average locally updated weights.
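The three steps above can be sketched in a few lines of NumPy. The linear model and squared loss here are illustrative stand-ins (Federated Averaging itself is model-agnostic), and all names and hyperparameter values are assumptions for the sketch.

```python
import numpy as np

def federated_averaging(global_w, clients, C=0.3, E=1, B=32, lr=0.1, rng=None):
    """One round of Federated Averaging (after McMahan et al., 2016).

    global_w : current global weight vector
    clients  : list of (X, y) tuples, one per alliance member
    C        : fraction of the K clients selected for this round
    E, B     : local epochs and local minibatch size
    """
    rng = rng or np.random.default_rng(0)
    K = len(clients)
    m = max(1, int(C * K))
    selected = rng.choice(K, size=m, replace=False)   # step 1: pick C*K clients

    updates, sizes = [], []
    for k in selected:
        X, y = clients[k]
        w = global_w.copy()
        for _ in range(E):                            # step 2: E local epochs
            idx = rng.permutation(len(y))
            for start in range(0, len(y), B):         # minibatches of size B
                b = idx[start:start + B]
                # gradient of squared loss for a linear model
                grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
                w -= lr * grad
        updates.append(w)
        sizes.append(len(y))

    # step 3: average local weights, weighted by local dataset size
    return np.average(updates, axis=0, weights=np.asarray(sizes, dtype=float))
```

Run over many rounds, the global model converges even though each client only ever touches its own data.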
Different ways to do FedML:
● Federated averaging with stochastic gradient descent
● Using incremental learners
● Ensemble methods
● Hybrids between the above
FedML taxonomy: https://docs.google.com/spreadsheets/d/1SCwwkS_tUw-yAVMJZltJSt3NmhA_w6JG7xNJYkE8ORs/edit#gid=0
Privacy-preserving conformal prediction
Conformal prediction is a class of ML methods that give valid measures of model performance.
● Valid prediction intervals/sets (based on a rigorous mathematical framework).
● Can be used with any standard machine learning method.
● No need for priors (unlike Bayesian learning).
● Removes the need to talk about "domain of applicability".
● Very interesting in the context of FedML, since this class of methods gives a reliable way to measure global model performance/improvement.
Ola Spjuth, Assoc. Prof. at UU. Lead AI scientist at Scaleout.
Gauraha, N. and Spjuth, O. Synergy Conformal Prediction for Regression DiVA preprint. 1288708 (2019). URL: www.diva-portal.org/smash/get/diva2:1288708/FULLTEXT01.pdf
Gauraha, N. and Spjuth, O. Synergy Conformal Prediction DiVA preprint. 360504 (2018). URL: urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-360504
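A minimal single-party inductive conformal regressor illustrates the validity guarantee the slides refer to (this is the standard inductive construction, not the synergy method of the cited papers; the OLS model and all names are illustrative assumptions):

```python
import numpy as np

def icp_regression(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Inductive conformal prediction intervals with >= 1 - alpha coverage."""
    # Underlying model: ordinary least squares (any regressor would do).
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    # Nonconformity scores on a held-out calibration set.
    scores = np.abs(y_cal - X_cal @ w)
    n = len(scores)
    # Conformal quantile: the ceil((n+1)(1-alpha))/n empirical quantile.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = X_test @ w
    return pred - q, pred + q                 # valid prediction interval
```

The coverage guarantee holds regardless of the model used, which is what makes conformal methods attractive for monitoring a federated global model.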
UN Handbook for Privacy-Preserving Techniques: https://docs.google.com/document/d/1GYu6UJI81jR8LgooXVDsYk1s6FlM-SbOvo3oLHglFhY/mobilebasic
Privacy-preservation properties of FedML?
● Input privacy simplified since data stays locally (handled according to local policies)
● Output privacy - depends on the algorithm, how easy it is to invert the model etc.
● What can be learned from the coordination of computation?
○ Different for federated averaging and ensemble methods.
● Differential privacy (add noise to data)
● Homomorphic encryption (compute directly on encrypted data)
● Secure multiparty computation (emulate a trusted third party)
● Secure enclaves (a hardware solution to private computations)
Privacy & security
Apart from "standard security" (data at rest and in transit), a number of techniques can be used to enhance privacy:
Homomorphic encryption
● SEAL (Microsoft): https://github.com/Microsoft/SEAL
● HElib (IBM): https://github.com/shaih/HElib
● PALISADE: https://git.njit.edu/palisade/PALISADE
Computations directly on encrypted data producing encrypted results.
● Outsourced secure computations.
● "Secure pooling of data"
● Still not feasible for real-world ML tasks.
● In FedML we do not need to outsource computations, except for parts such as secure aggregation of model weights / scores etc. For those parts of the algorithm, HE can be a viable option.
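Secure aggregation with an additively homomorphic scheme can be illustrated with a toy Paillier cryptosystem: multiplying ciphertexts adds the underlying plaintexts, so an aggregator can sum encrypted model weights it cannot read. The primes here are tiny and the fixed-point scale is an assumption; this sketch is NOT secure and only shows the mechanics.

```python
import math, random

# --- toy Paillier keypair (tiny primes: illustration only, NOT secure) ---
p, q = 10007, 10009
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                          # valid because we use g = n + 1

def encrypt(m, rng=random.Random(0)):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:                # r must be invertible mod n
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Additive homomorphism: the aggregator multiplies ciphertexts and only
# the decrypted result reveals the SUM of the clients' scaled weights.
scale = 1000                                  # fixed-point encoding of weights
local_weights = [0.123, 0.456, 0.789]         # one (made-up) weight per client
ciphertexts = [encrypt(int(w * scale)) for w in local_weights]

agg = 1
for c in ciphertexts:
    agg = (agg * c) % n2                      # homomorphic addition

total = decrypt(agg) / scale                  # 0.123 + 0.456 + 0.789
```

Real deployments would use a vetted library such as SEAL or PALISADE rather than a hand-rolled scheme.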
Secure multiparty computation
(Secure computation, MPC, privacy-preserving computation)
Parties P_1, …, P_N, each with private data x_1, …, x_N, want to compute y = f(x_1, …, x_N).
● No trust amongst the parties P.
● Do not want to trust a third party to compute f.
● MPC deals with protocols to emulate a trusted third party.
● Highly active area of research; a hard problem for large N and a large fraction of dishonest members.
In FedML, see e.g. the PySyft project, MPC in PyTorch: https://github.com/OpenMined/PySyft
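The simplest MPC building block for FedML is additive secret sharing: each client splits its (fixed-point) model update into random shares, and each aggregator sees only one meaningless share per client, yet the recombined totals equal the true aggregate. The modulus and all numbers below are illustrative assumptions.

```python
import random

M = 2**31 - 1                 # all arithmetic is done modulo a public prime

def share(secret, n_parties, rng):
    """Split an integer into n additive shares that sum to it mod M."""
    shares = [rng.randrange(M) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % M)
    return shares

# Each client secret-shares its scaled model update; no single aggregator
# ever sees a raw update.
rng = random.Random(42)
updates = [17, 25, 31]                          # one made-up update per client
n_agg = 3
all_shares = [share(u, n_agg, rng) for u in updates]

# Aggregator j sums the j-th share from every client...
partial = [sum(s[j] for s in all_shares) % M for j in range(n_agg)]
# ...and only combining all partial sums reveals the aggregate update.
total = sum(partial) % M                        # 17 + 25 + 31
```

This emulates a trusted aggregation server as long as the aggregators do not collude, which is exactly the "emulate a trusted third party" idea above.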
Differential Privacy can protect against inference attacks
● Rigorous statistical technique for measuring and minimizing the privacy leakage from a statistical database.
● Add controlled noise to function we want to compute (e.g. Laplace mechanism).
● An interesting tradeoff between accuracy and the number of allowed queries to the model given epsilon.
● Related to the sensitivity of the function
Explored for FedML by e.g. Papernot et al., https://arxiv.org/abs/1802.08908
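The Laplace mechanism mentioned above can be sketched directly: add noise with scale sensitivity/epsilon to the function's true value. The dataset, epsilon value, and sensitivity bound below are illustrative assumptions.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release f(D) + Laplace(sensitivity / epsilon) noise: epsilon-DP."""
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Example: privately release the mean of a column bounded in [0, 1].
rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=10_000)
sensitivity = 1.0 / len(data)     # changing one record moves the mean <= 1/n
private_mean = laplace_mechanism(data.mean(), sensitivity, epsilon=0.5, rng=rng)
```

The accuracy/privacy tradeoff is visible here: smaller epsilon (stronger privacy) means a larger noise scale, and every extra query consumes more of the privacy budget.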
Private aggregation of teacher ensembles
Papernot et al., https://arxiv.org/pdf/1802.08908.pdf
Differential privacy
● add noise to data (protects
against inference attacks)
Differential privacy & Homomorphic encryption in FedML
Homomorphic encryption
● Methods work on
encrypted data
Secure multiparty computation
● Aggregate/compute without a
third party trust provider/server.
Backdooring federated learning
Bagdasaryan et al. How to backdoor federated learning (2019) https://arxiv.org/pdf/1807.00459.pdf
● A big threat to FedML comes from within the alliance, i.e. from compromised members.
● Large alliances can be expected to be relatively robust to data poisoning attacks.
● Bagdasaryan et al. show how their proposed model-replacement approach can efficiently introduce backdoors into a global model.
● Secure aggregation/MPC makes it impossible to detect a malicious model update, and who submitted it!
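The scaling idea behind model replacement can be shown in a toy NumPy example (all numbers are made up): the attacker multiplies its malicious update so that, after the server's weighted averaging, the global model lands almost exactly on the attacker's target.

```python
import numpy as np

n, eta = 10, 1.0                     # alliance size and server learning rate
G = np.array([0.0, 0.0])             # current global model
X_backdoor = np.array([5.0, -3.0])   # model the attacker wants to install

# Nine honest clients submit small, benign updates near the global model.
honest = [G + np.random.default_rng(i).normal(0, 0.01, 2) for i in range(n - 1)]

# Model replacement (after Bagdasaryan et al.): scale the malicious update
# so it survives averaging: L_adv ~= (n / eta) * (X - G) + G
L_adv = (n / eta) * (X_backdoor - G) + G

updates = honest + [L_adv]
G_next = G + (eta / n) * sum(u - G for u in updates)
# G_next lands almost exactly on the attacker's target model.
```

This is why a single compromised member can matter far more than naive data poisoning, and why secure aggregation, which hides individual updates, cuts both ways.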
What does it take to build a production federated learning system ?
● Decentralized computing / fog computing
● Information security / systems security expertise
● Trust provider (third-party or decentralized protocol)
● Machine learning algorithms adapted to the decentralized case
● Protection against adversarial ML
○ Data poisoning
○ Inference attacks
○ …
A considerable increase in system and developer complexity compared to the standard paradigm!
Research challenges
FedML is a research area that spans many different areas of computer science and mathematics.
Scalability and ML performance
How do we (re)design algorithms and
frameworks to scale out to the fog and edge?
Decentralized computation
How can we do FedML without a third-party
trust provider?
Adversarial ML
How can we make the system robust to
dishonest members and external threats?
Selection of ongoing research projects
● Privacy-preserving conformal prediction.
● Federated online learning.
● Consensus protocols for decentralized model training.
● Performance and scalability of FedML using Blockchain.
Distributed Computing Applications (DCA) research arena at UU.
Security and Trust in FedML
Data privacy
● In a federated machine learning environment, data never leaves the premises. Only the model parameters (or weights) are shared between federated members.
● Data owners have complete control over the datasets
● The training of incoming models can be offline or online within the data owner’s secure environment
Security
● Different levels of security
○ Communication level
○ Service level
○ Host level
Example of communication security
Host Identity Protocol (HIP) Architecture
Trust building mechanisms for FedML
● FedML is inherently a distributed system with full control over the local environment
● Less or zero control over distributed datasets
● Contributions from different federated members can make or break the global model
● A transparent and efficient FedML framework allows different parties to work together
Blockchain Technology
What is Blockchain?
● Blockchain technology enables distributed public ledgers that hold immutable data in a secure and encrypted way, ensuring that records can never be altered
● The trust in the system does not arise from the relationship between parties or through an intermediary but from the technology and the process of comparison in the network
● Three major types:
○ Public chain (decentralized)
○ Private chain (centralized)
○ Consortium chain (partially decentralized)
How does Blockchain work?
● Block ○ contains important information
● Chain○ ensures that the content of the block remains trustworthy at all times
● Consensus algorithms
○ Proof of Work (PoW) -> resource hungry
○ Proof of Stake (PoS) -> energy efficient
○ Ripple -> energy efficient
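The block/chain/consensus trio can be sketched with a toy proof-of-work chain (difficulty, payloads, and field names are illustrative assumptions, not any real platform's format):

```python
import hashlib, json

def pow_block(prev_hash, payload, difficulty=3):
    """Mine a block: find a nonce whose SHA-256 starts with `difficulty` zeros."""
    nonce = 0
    while True:
        header = json.dumps({"prev": prev_hash, "data": payload, "nonce": nonce})
        h = hashlib.sha256(header.encode()).hexdigest()
        if h.startswith("0" * difficulty):
            return {"prev": prev_hash, "data": payload, "nonce": nonce, "hash": h}
        nonce += 1

# A chain of model-update records: tampering with any block changes its
# hash and breaks every later "prev" link, which is what makes the
# content trustworthy at all times.
chain = [pow_block("0" * 64, "genesis")]
chain.append(pow_block(chain[-1]["hash"], "round-1 global model digest"))
chain.append(pow_block(chain[-1]["hash"], "round-2 global model digest"))
```

In a FedML setting the payload would be a digest of a global model update, giving the auditability and checkpointing discussed below.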
Current implementations of Blockchain
● There are a number of software platforms based on blockchain technology that enable developers to build and deploy decentralized applications
○ Ethereum, popular because of its smart-contract functionality
○ EOS, combines the security of Bitcoin with the smart-contract functionality of Ethereum
○ LISK, a flexible implementation based on JavaScript
○ ….
Challenges and opportunities
● In general, Blockchain technology has limited scalability
○ Limited number of transactions
○ With large data blocks, the system slowly drifts toward centralization
○ The current limit on the block size is 1 MB
● Most of the limitations are related to public chains
● In the case of FedML
★ Each federation has a limited scope based on a specific model training
★ Consortium chains are a realistic approach for model training
★ Blocks only hold important changes, which reduces the size of the complete chain
Blockchain and FedML
● We are working to design a new platform for FedML that will incorporate features of Blockchain technology
● The aim will be to provide security, auditability and checkpointing for global model training
● The platform will allow different stakeholders to jointly train models in a more transparent and secure manner
Future perspective
● Blockchain technology will allow different organizations to work together more efficiently
○ Small organizations often lack valuable datasets but are early adopters of new technologies
○ Large organizations often have huge datasets but are slow adopters of new technologies
● The use of smart contracts will add more functionality to the system, which can be used to build incentive-based model training
● In the future, a marketplace can be created based on the usefulness and transparency of machine learning models
Federated learning in production?
[Architecture diagram: federated components behind an API; global model serving; per-site ML pipelines, each exposing an API; secure model communication, anomaly detection, etc.]
Scaleout Studio | Developing · Scaleout Store | Packaging & Deploying · Scaleout Serve | Serving
Scaleout Federated Platform
ML studio
- Ingestion
- Prepare & analyse data
- Modeling & testing
- Training
ML workflow automation
- Automated ML Studio pipelines
Model management (API)
- Versioning
- Annotation
- Storage
- Distribution
Model serving (API)
- Scaling
- LB
- SLA/OLA
Monitoring & visualizations (API)
Endpoint registry (API)
Scaleout Federated Platform
Graphical User Interface Incl Pipeline Visualization
Authentication and Authorization
Model Sharing
Joint Training
Federation Orchestration
Federation Identity & Security
Federation Cross Validation & Holdout Set
To learn more about our Scaleout work on production FedML, see the MVP presented at TestaCenter:
https://www.youtube.com/watch?v=K-JUNkAYs-4
Contact
Andreas Hellander, [email protected]
Salman Toor, [email protected]