TRANSCRIPT · 4/23/2019
Privacy-preserving federated machine learning
Andreas Hellander, Salman Toor
Department of Information Technology, Division of Scientific Computing, Uppsala University
Scaleout: www.scaleoutsystems.com
Federated Machine Learning
● Federated Conformal Prediction
● Algorithms for FedML
● FedML security, Blockchain
Cloud Computing
https://www.it.uu.se/research/group/dca
Data Engineering Sciences
● Hierarchical analysis of spatial
and temporal image data
(HASTE)
● Parallel, peer-to-peer streaming
● Intelligent storage backends
● Continuous analytics
Distributed Computing Applications @ IT/UU
Background
● Founded out of three research
teams at Uppsala University.
● Applied focus on large scale
production infrastructure in
computational science and
biotech research.
Expertise
● Cloud architecture
● Machine learning pipelines
● Continuous analytics
● Scientific data management
Cases
● SNIC cloud
● SciLifeLab
● IIS
● Rymdstyrelsen
● Safespring
Bridging the gap between research and production grade systems in machine learning
Scaleout
The centralized ML paradigm
Data Store 1
Machine learning model
Data Store 2
Data Store 3
Central Data Store
Queries
Predictions
1. Centralize data from different
sources (data lake, cloud).
2. Create ML model using centralised
data (cluster computing)
But in many cases we cannot move data
Private/Proprietary Data
Regulated Data
Big data
Central Data Store -> Machine learning model
How can parties construct joint ML models without sharing/pooling data?
Federated machine learning
1. Train local machine
learning model on
local/private data.
2. Combine local model
updates into a global,
federated model.
Smart software on top of decentralized infrastructure/instruments
● Lets a supplier of physical infrastructure/instruments build smart software to support all clients.
● Calibration, predictive maintenance etc.
● Customer A’s data is never shared with Customer B, or with the supplier.
● High-value, unique software offering for those using the FedML services.
Federated Model
Software services
Federated learning system | Infrastructure vendor
Integrity-preserving smart homes
● Digital tools/video surveillance in home care.
Train and deploy models based on homeowners’ private interactions without collecting central data.
Integrity preserving fleet management
● Model driver/staff behavior without compromising their integrity.
● Big data, poor connectivity
By Éric Chassaing - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8876959
Key benefit of federated learning
Promises to let parties collaborate to build stronger models than could be attained by the parties in isolation.
● This example uses incremental learning of linear models to do FedML.
● Stochastic Gradient Descent.
● One of many possible approaches to
decentralized model construction.
N. Gauraha, O. Spjuth, A. Hellander (2019), manuscript in preparation.
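One simple incremental scheme can be sketched as follows (an illustration of the idea, not the exact method of the cited manuscript): the linear model travels from party to party, each of which updates it with SGD on its own private data, so only weights ever leave a site. All data and parameter names here are made up.

```python
import numpy as np

def sgd_partial_fit(w, X, y, lr=0.05, rng=None):
    """One incremental SGD pass for a linear model on one party's private data."""
    rng = rng or np.random.default_rng(0)
    for i in rng.permutation(len(y)):
        w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w

# The model visits each party in turn; only the weight vector travels,
# never the raw data.
rng = np.random.default_rng(1)
true_w = np.array([1.5, -0.5])
parties = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    parties.append((X, X @ true_w))          # synthetic, noiseless local data

w = np.zeros(2)
for _ in range(5):                           # a few laps around the federation
    for X, y in parties:
        w = sgd_partial_fit(w, X, y, rng=rng)
```

After a few laps the travelling model fits all parties' data jointly, even though no party ever saw another's examples.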
Where does FedML fit? A decision flowchart:
● Ok to share data? Yes -> standard ML on pooled data.
● Ok to share model/parameters? Yes -> create one joint model.
● Ok to share features? Yes -> combine predictions of separate models.
● Otherwise -> privacy-preserving/data-protecting ML: our focus area.
Example: FedML on Gboard
● Local model for search suggestions, with context and whether the suggestion was clicked.
● On device the history is processed, and then only a model update is sent to the server.
● Based on Federated Averaging.
https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
Federated Averaging
From McMahan et al. https://arxiv.org/abs/1602.05629
1. Out of K alliance members/clients, pick a fraction C to do a global model update.
2. Perform E epochs of SGD on local minibatch of size B.
3. Average locally updated weights.
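The three steps above can be sketched in a few lines of NumPy. The linear model and squared loss here are illustrative stand-ins (Federated Averaging itself is model-agnostic), and all names and hyperparameter values are assumptions for the sketch.

```python
import numpy as np

def federated_averaging(global_w, clients, C=0.3, E=1, B=32, lr=0.1, rng=None):
    """One round of Federated Averaging (after McMahan et al., 2016).

    global_w : current global weight vector
    clients  : list of (X, y) tuples, one per alliance member
    C        : fraction of the K clients selected for this round
    E, B     : local epochs and local minibatch size
    """
    rng = rng or np.random.default_rng(0)
    K = len(clients)
    m = max(1, int(C * K))
    selected = rng.choice(K, size=m, replace=False)   # step 1: pick C*K clients

    updates, sizes = [], []
    for k in selected:
        X, y = clients[k]
        w = global_w.copy()
        for _ in range(E):                            # step 2: E local epochs
            idx = rng.permutation(len(y))
            for start in range(0, len(y), B):         # minibatches of size B
                b = idx[start:start + B]
                # gradient of squared loss for a linear model
                grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
                w -= lr * grad
        updates.append(w)
        sizes.append(len(y))

    # step 3: average local weights, weighted by local dataset size
    return np.average(updates, axis=0, weights=np.asarray(sizes, dtype=float))
```

Run over many rounds, the global model converges even though each client only ever touches its own data.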
Different ways to do FedML:
● Federated averaging with stochastic gradient descent
● Using incremental learners
● Ensemble methods
● Hybrids between the above
FedML taxonomy: https://docs.google.com/spreadsheets/d/1SCwwkS_tUw-yAVMJZltJSt3NmhA_w6JG7xNJYkE8ORs/edit#gid=0
Privacy-preserving conformal prediction
Conformal prediction is a class of ML methods that give valid measures of model performance.
● Valid prediction intervals/sets (based on a rigorous mathematical framework).
● Can be used with any standard machine learning method.
● No need for priors (unlike Bayesian learning).
● Removes the need to talk about "domain of applicability".
● Very interesting in the context of FedML, since this class of methods gives a reliable way to measure global model performance/improvement.
Ola Spjuth, Assoc. Prof. at UU. Lead AI scientist at Scaleout.
Gauraha, N. and Spjuth, O. Synergy Conformal Prediction for Regression DiVA preprint. 1288708 (2019). URL: www.diva-portal.org/smash/get/diva2:1288708/FULLTEXT01.pdf
Gauraha, N. and Spjuth, O. Synergy Conformal Prediction DiVA preprint. 360504 (2018). URL: urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-360504
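A minimal single-party inductive conformal regressor illustrates the validity guarantee the slides refer to (this is the standard inductive construction, not the synergy method of the cited papers; the OLS model and all names are illustrative assumptions):

```python
import numpy as np

def icp_regression(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Inductive conformal prediction intervals with >= 1 - alpha coverage."""
    # Underlying model: ordinary least squares (any regressor would do).
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    # Nonconformity scores on a held-out calibration set.
    scores = np.abs(y_cal - X_cal @ w)
    n = len(scores)
    # Conformal quantile: the ceil((n+1)(1-alpha))/n empirical quantile.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = X_test @ w
    return pred - q, pred + q                 # valid prediction interval
```

The coverage guarantee holds regardless of the model used, which is what makes conformal methods attractive for monitoring a federated global model.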
UN Handbook for Privacy-Preserving Techniques: https://docs.google.com/document/d/1GYu6UJI81jR8LgooXVDsYk1s6FlM-SbOvo3oLHglFhY/mobilebasic
Privacy-preservation properties of FedML?
● Input privacy simplified since data stays locally (handled according to local policies)
● Output privacy - depends on the algorithm, how easy it is to invert the model etc.
● What can be learned from the coordination of computation?
○ Different for federated averaging and ensemble methods.
● Differential privacy (add noise to data)
● Homomorphic encryption (compute directly on encrypted data)
● Secure multiparty computation (emulate a trusted third party)
● Secure enclaves (a hardware solution to private computations)
Privacy & security
Apart from "standard security" (data at rest and in transit), a number of techniques can be used to enhance privacy:
Homomorphic encryption
● SEAL (Microsoft): https://github.com/Microsoft/SEAL
● HElib (IBM): https://github.com/shaih/HElib
● PALISADE: https://git.njit.edu/palisade/PALISADE
Computations directly on encrypted data producing encrypted results.
● Outsourced secure computations.
● "Secure pooling of data"
● Still not feasible for real-world ML tasks.
● In FedML we do not need to outsource computations, except for parts such as secure aggregation of model weights / scores etc. For those parts of the algorithm, HE can be a viable option.
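Secure aggregation with an additively homomorphic scheme can be illustrated with a toy Paillier cryptosystem: multiplying ciphertexts adds the underlying plaintexts, so an aggregator can sum encrypted model weights it cannot read. The primes here are tiny and the fixed-point scale is an assumption; this sketch is NOT secure and only shows the mechanics.

```python
import math, random

# --- toy Paillier keypair (tiny primes: illustration only, NOT secure) ---
p, q = 10007, 10009
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                          # valid because we use g = n + 1

def encrypt(m, rng=random.Random(0)):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:                # r must be invertible mod n
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Additive homomorphism: the aggregator multiplies ciphertexts and only
# the decrypted result reveals the SUM of the clients' scaled weights.
scale = 1000                                  # fixed-point encoding of weights
local_weights = [0.123, 0.456, 0.789]         # one (made-up) weight per client
ciphertexts = [encrypt(int(w * scale)) for w in local_weights]

agg = 1
for c in ciphertexts:
    agg = (agg * c) % n2                      # homomorphic addition

total = decrypt(agg) / scale                  # 0.123 + 0.456 + 0.789
```

Real deployments would use a vetted library such as SEAL or PALISADE rather than a hand-rolled scheme.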
Secure multiparty computation
(Secure computation, MPC, privacy-preserving computation)
Parties P_1, …, P_N, each with private data x_1, …, x_N, want to compute y = f(x_1, …, x_N).
● No trust amongst the parties P.
● Do not want to trust a third party to compute f.
● MPC deals with protocols to emulate a trusted third party.
● Highly active area of research; a hard problem for large N and a large fraction of dishonest members.
In FedML, see e.g. the PySyft project, MPC in PyTorch: https://github.com/OpenMined/PySyft
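The simplest MPC building block for FedML is additive secret sharing: each client splits its (fixed-point) model update into random shares, and each aggregator sees only one meaningless share per client, yet the recombined totals equal the true aggregate. The modulus and all numbers below are illustrative assumptions.

```python
import random

M = 2**31 - 1                 # all arithmetic is done modulo a public prime

def share(secret, n_parties, rng):
    """Split an integer into n additive shares that sum to it mod M."""
    shares = [rng.randrange(M) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % M)
    return shares

# Each client secret-shares its scaled model update; no single aggregator
# ever sees a raw update.
rng = random.Random(42)
updates = [17, 25, 31]                          # one made-up update per client
n_agg = 3
all_shares = [share(u, n_agg, rng) for u in updates]

# Aggregator j sums the j-th share from every client...
partial = [sum(s[j] for s in all_shares) % M for j in range(n_agg)]
# ...and only combining all partial sums reveals the aggregate update.
total = sum(partial) % M                        # 17 + 25 + 31
```

This emulates a trusted aggregation server as long as the aggregators do not collude, which is exactly the "emulate a trusted third party" idea above.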
Differential Privacy can protect against inference attacks
● Rigorous statistical technique for measuring and minimizing the privacy leakage from a statistical database.
● Add controlled noise to function we want to compute (e.g. Laplace mechanism).
● An interesting tradeoff between accuracy and the number of allowed queries to the model given epsilon.
● Related to the sensitivity of the function
Explored for FedML by e.g. Papernot et al., https://arxiv.org/abs/1802.08908
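The Laplace mechanism mentioned above can be sketched directly: add noise with scale sensitivity/epsilon to the function's true value. The dataset, epsilon value, and sensitivity bound below are illustrative assumptions.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release f(D) + Laplace(sensitivity / epsilon) noise: epsilon-DP."""
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Example: privately release the mean of a column bounded in [0, 1].
rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=10_000)
sensitivity = 1.0 / len(data)     # changing one record moves the mean <= 1/n
private_mean = laplace_mechanism(data.mean(), sensitivity, epsilon=0.5, rng=rng)
```

The accuracy/privacy tradeoff is visible here: smaller epsilon (stronger privacy) means a larger noise scale, and every extra query consumes more of the privacy budget.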
Private aggregation of teacher ensembles
Papernot et al., https://arxiv.org/pdf/1802.08908.pdf
Differential privacy
● add noise to data (protects
against inference attacks)
Differential privacy & Homomorphic encryption in FedML
Homomorphic encryption
● Methods work on
encrypted data
Secure multiparty computation
● Aggregate/compute without a
third party trust provider/server.
Backdooring federated learning
Bagdasaryan et al. How to backdoor federated learning (2019) https://arxiv.org/pdf/1807.00459.pdf
● A big threat to FedML comes from within the alliance, i.e. from compromised members.
● Large alliances can be expected to be relatively robust to data poisoning attacks.
● Bagdasaryan et al. show how their proposed model-replacement approach can efficiently introduce backdoors into a global model.
● Secure aggregation/MPC makes it impossible to detect a malicious model update, and who submitted it!
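The scaling idea behind model replacement can be shown in a toy NumPy example (all numbers are made up): the attacker multiplies its malicious update so that, after the server's weighted averaging, the global model lands almost exactly on the attacker's target.

```python
import numpy as np

n, eta = 10, 1.0                     # alliance size and server learning rate
G = np.array([0.0, 0.0])             # current global model
X_backdoor = np.array([5.0, -3.0])   # model the attacker wants to install

# Nine honest clients submit small, benign updates near the global model.
honest = [G + np.random.default_rng(i).normal(0, 0.01, 2) for i in range(n - 1)]

# Model replacement (after Bagdasaryan et al.): scale the malicious update
# so it survives averaging: L_adv ~= (n / eta) * (X - G) + G
L_adv = (n / eta) * (X_backdoor - G) + G

updates = honest + [L_adv]
G_next = G + (eta / n) * sum(u - G for u in updates)
# G_next lands almost exactly on the attacker's target model.
```

This is why a single compromised member can matter far more than naive data poisoning, and why secure aggregation, which hides individual updates, cuts both ways.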
What does it take to build a production federated learning system ?
● Decentralized computing / fog computing
● Information security / systems security expertise
● Trust provider (third-party or decentralized protocol)
● Machine learning algorithms adapted to the decentralized case
● Protection against adversarial ML
○ Data poisoning
○ Inference attacks
○ …
A considerable increase in system and developer complexity compared to the standard paradigm!
Research challenges
FedML is a research area that spans many different areas of computer science and mathematics.
Scalability and ML performance
How do we (re)design algorithms and
frameworks to scale out to the fog and edge?
Decentralized computation
How can we do FedML without a third-party
trust provider?
Adversarial ML
How can we make the system robust to
dishonest members and external threats?
Selection of ongoing research projects
● Privacy-preserving conformal prediction.
● Federated online learning.
● Consensus protocols for decentralized model training.
● Performance and scalability of FedML using Blockchain.
Distributed Computing Applications (DCA) research arena at UU.
Security and Trust in FedML
Data privacy
● In a federated machine learning environment, data never leaves the premises. Only the model parameters (or weights) are shared between federated members.
● Data owners have complete control over the datasets
● The training of incoming models can be offline or online within the data owner’s secure environment
Security
● Different levels of security
○ Communication level
○ Service level
○ Host level
Example of communication security
Host Identity Protocol (HIP) Architecture
Trust building mechanisms for FedML
● FedML is inherently a distributed system with full control over the local environment
● Less or zero control over distributed datasets
● Contributions from different federated members can make or break the global model
● A transparent and efficient FedML framework allows different parties to work together
Blockchain Technology
What is Blockchain?
● Blockchain technology enables distributed public ledgers that hold immutable data in a secure and encrypted way, ensuring that records can never be altered
● The trust in the system does not arise from the relationship between parties or through an intermediary but from the technology and the process of comparison in the network
● Three major types:
○ Public chain (decentralized)
○ Private chain (centralized)
○ Consortium chain (partially decentralized)
How does Blockchain work?
● Block ○ contains important information
● Chain○ ensures that the content of the block remains trustworthy at all times
● Consensus algorithms
○ Proof of Work (PoW) -> resource hungry
○ Proof of Stake (PoS) -> energy efficient
○ Ripple -> energy efficient
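The block/chain/consensus trio can be sketched with a toy proof-of-work chain (difficulty, payloads, and field names are illustrative assumptions, not any real platform's format):

```python
import hashlib, json

def pow_block(prev_hash, payload, difficulty=3):
    """Mine a block: find a nonce whose SHA-256 starts with `difficulty` zeros."""
    nonce = 0
    while True:
        header = json.dumps({"prev": prev_hash, "data": payload, "nonce": nonce})
        h = hashlib.sha256(header.encode()).hexdigest()
        if h.startswith("0" * difficulty):
            return {"prev": prev_hash, "data": payload, "nonce": nonce, "hash": h}
        nonce += 1

# A chain of model-update records: tampering with any block changes its
# hash and breaks every later "prev" link, which is what makes the
# content trustworthy at all times.
chain = [pow_block("0" * 64, "genesis")]
chain.append(pow_block(chain[-1]["hash"], "round-1 global model digest"))
chain.append(pow_block(chain[-1]["hash"], "round-2 global model digest"))
```

In a FedML setting the payload would be a digest of a global model update, giving the auditability and checkpointing discussed below.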
Current implementations of Blockchain
● There are a number of software platforms based on blockchain technology that enable developers to build and deploy decentralized applications
○ Ethereum, popular because of its smart-contract functionality
○ EOS, combines the security of Bitcoin with the smart-contract functionality of Ethereum
○ LISK, a flexible implementation based on JavaScript
○ ….
Challenges and opportunities
● In general, Blockchain technology has limited scalability
○ Limited number of transactions
○ With large data blocks, the system slowly drifts toward centralization
○ The current limit on the block size is 1 MB
● Most of the limitations are related to public chains
● In the case of FedML
★ Each federation has a limited scope based on a specific model training
★ Consortium chains are a realistic approach for model training
★ Blocks only hold important changes, which reduces the size of the complete chain
Blockchain and FedML
● We are working to design a new platform for FedML that will incorporate features of Blockchain technology
● The aim will be to provide security, auditability and checkpointing for global model training
● The platform will allow different stakeholders to jointly train models in a more transparent and secure manner
Future perspective
● Blockchain technology will allow different organizations to work together more efficiently
○ Small organizations often lack valuable datasets but are early adopters of new technologies
○ Large organizations often have huge datasets but are slow adopters of new technologies
● The use of smart contracts will add more functionality to the system, which can be used to build incentive-based model training
● In the future, a marketplace can be created based on the usefulness and transparency of machine learning models
Federated learning in production?
[Architecture diagram: federated components behind an API; global model serving; per-site ML pipelines, each exposing an API; secure model communication, anomaly detection, etc.]
Scaleout Studio | Developing · Scaleout Store | Packaging & Deploying · Scaleout Serve | Serving
Scaleout Federated Platform
ML studio
- Ingestion
- Prepare & analyse data
- Modeling & testing
- Training
ML workflow automation
- Automated ML Studio pipelines
Model management (API)
- Versioning
- Annotation
- Storage
- Distribution
Model serving (API)
- Scaling
- LB
- SLA/OLA
Monitoring & visualizations (API)
Endpoint registry (API)
Scaleout Federated Platform
Graphical User Interface Incl Pipeline Visualization
Authentication and Authorization
Model Sharing
Joint Training
Federation Orchestration
Federation Identity & Security
Federation Cross Validation & Holdout Set
To learn more about our Scaleout work on production FedML, see the MVP presented at TestaCenter:
https://www.youtube.com/watch?v=K-JUNkAYs-4
Contact
Andreas Hellander, [email protected]
Salman Toor, [email protected]