
1551-3203 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2019.2942190, IEEE Transactions on Industrial Informatics


Blockchain and Federated Learning for Privacy-preserved Data Sharing in Industrial IoT

Yunlong Lu, Student Member, IEEE, Xiaohong Huang, Member, IEEE, Yueyue Dai, Student Member, IEEE, Sabita Maharjan, Member, IEEE, and Yan Zhang, Senior Member, IEEE

Abstract: The rapid increase in the volume of data generated from connected devices in the Industrial Internet of Things (IIoT) paradigm opens up new possibilities for enhancing the quality of service of emerging applications through data sharing. However, security and privacy concerns (e.g., data leakage) are major obstacles for data providers to share their data in wireless networks. The leakage of private data can lead to serious issues beyond financial loss for the providers. In this paper, we first design a blockchain-empowered secure data sharing architecture for distributed multiple parties. Then, we formulate the data sharing problem as a machine learning problem by incorporating privacy-preserved federated learning. The privacy of data is well maintained by sharing the data model instead of revealing the actual data. Finally, we integrate federated learning into the consensus process of permissioned blockchain, so that the computing work for consensus can also be used for federated training. Numerical results derived from real-world datasets show that the proposed data sharing scheme achieves good accuracy, high efficiency, and enhanced security.

Index Terms: Data Sharing, Permissioned Blockchain, Federated Learning, Privacy-preserved, Industrial IoT

I. INTRODUCTION

The amount of data generated by connected devices in the IIoT paradigm has grown massively in Industry 4.0. Along with the value the data brings come serious concerns about data privacy [1]. Data leakage may take place during data storage, data transmission, and data sharing, which may lead to serious issues for both owners and providers. In this regard, existing work mainly focuses on utilizing aggregate information about the data without breaking the privacy of the participants. These approaches address the problem by modifying the key contributions of the original data, as in k-anonymity [2] and l-diversity [3]. However, most of these methods assume that attackers have only limited background knowledge, so the data remains vulnerable to algorithm-based attacks and background knowledge attacks. Differential privacy [4] provides the most reliable privacy guarantee, which is generally considered strong enough to protect data from privacy attacks. Machine Learning Differentially Private (MLDP) [5] was proposed to publish data structures instead of publishing queries and answers directly, under the constraint of differential privacy.

This work was supported by Joint Funds of National Natural Science Foundation of China and Xinjiang under Project U1603261, and National Natural Science Foundation of China under Project 61602055.

Y. Lu and X. Huang are with the Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing, China (e-mail: [email protected]; [email protected]).

Y. Dai is with the University of Electronic Science and Technology of China, Chengdu, China (e-mail: [email protected]).

S. Maharjan is with Simula Metropolitan Center for Digital Engineering and the University of Oslo, Norway (e-mail: [email protected]).

Y. Zhang is with the University of Oslo, Norway (e-mail: [email protected]).

Data from IIoT applications may include sensitive information, so protecting data privacy is a key issue. In [6], the authors proposed a method that satisfies differential privacy to protect location data privacy without greatly reducing the utility of data in Industrial IoT. Some works also explore the use of blockchain to enhance data security in IIoT. In [7], the authors integrated blockchain into edge intelligence for resource allocation in IIoT. Though the combination is promising, the machine learning methods can be further improved. Therefore, some works exploited Markov models, which can illustrate activity transactions without knowledge of the problem at hand [8], for resource allocation. For example, in [9], the authors leveraged deep reinforcement learning (DRL) for task offloading and transmission scheduling. The use of new machine learning technology also brings new security threats such as cyber stealth attacks [10], which impose new security requirements [11] for protecting data privacy in the sharing process. In [12], the authors implemented a protocol that turns a blockchain into an automated access-control manager to ensure that users can own and control their data. In [13], the authors proposed a blockchain-enabled efficient data collection and secure sharing scheme combining Ethereum blockchain and DRL to create a reliable and safe environment. In these works, the consensus protocol is a core technical component for reaching agreement among all participating nodes. In Proof-of-Work (PoW) [14], the miner that first solves a mathematical puzzle wins the right to produce a block. However, the heavy resource requirements of solving those puzzles limit the applicability of PoW-based consensus mechanisms.

The concept of private multi-party data sharing has drawn much attention recently as a promising approach to address the issue of computing and storage resource constraints. Several works on data sharing applications with multiple distributed data owners have been published recently, including the sharing of horizontally partitioned data [15] and monitoring over distributed data streams [16]. There is a wide range of applications for the collaborative use of data in distributed scenarios. For instance, the authors in [17] proposed LDPGen, a multi-phase technique, to ensure local differential privacy in decentralized social graphs. Mobile Edge Computing (MEC) is a powerful technique for such resource-constrained applications on distributed data. In [18], to allocate resources efficiently and fairly in industrial IoT-based applications, the authors proposed a forward central dynamic and available approach (FCDAA) that adapts the running time of sensing and transmission processes in IoT devices. In [19], a method for conserving the position confidentiality of roaming position-based services (PBSs) users was proposed based on MEC techniques in real-time industrial informatics. Recently, federated learning [20] has emerged as a way for multiple data owners to train a global model collaboratively without sharing their raw data, which respects their privacy concerns. In [21], the authors proposed an algorithm for client-sided differential privacy-preserving federated optimization to hide clients' contributions during the training process. Based on a hierarchical architecture in which the server aggregates the users' training updates, the authors in [22] proposed a federated learning based proactive content caching scheme.

Yet, the presence of a centralized curator in most of the existing data sharing schemes increases the risk of data leakage, especially in applications with distributed multiple parties. There are two main obstacles. First, a high volume of aggregated data from different parties, including some unknown fresh data, must be processed by the curator. Second, none of these parties fully trusts the others (including the curator), so each fears data leakage.

As a result, collaborative data sharing over distributed multiple parties in IIoT faces several challenges. New collaborative mechanisms for distributed data sharing among multiple untrusted parties are therefore needed for IIoT applications. In this paper, we propose a differentially private multi-party data model sharing method based on permissioned blockchain. Instead of sharing raw data directly, we incorporate federated learning algorithms to map raw data into corresponding data models, which addresses privacy concerns in the learning phase through distributed training by local users. We also design a distributed architecture for data sharing between multiple parties based on blockchain, in which blockchain enables secure data retrieval and ensures accurate model training. Our main contributions in this paper are as follows.

• We transform the data sharing problem into a machine learning problem by leveraging federated learning to build data models and sharing the data models instead of raw data.

• We propose a new blockchain-empowered collaborative architecture to share data over distributed multiple parties and reduce the risk of data leakage, through which data owners can further control the access to shared data.

• We integrate differential privacy into federated learning to further protect data privacy. We also evaluate the effectiveness of our proposed model with benchmark, open real-world datasets for data categorization.

The remainder of this paper is organized as follows. In Section II, we present our system model. In Section III, we describe permissioned blockchain and federated learning for data sharing in detail. In Section IV, we present security analysis for our proposed scheme, and provide illustrative numerical results. Finally, Section V concludes the paper and discusses the future work.

II. SYSTEM MODEL

In this paper, we consider a common distributed data sharing scenario with multiple parties involved. Each participant owns its own data and is willing to share it. Contributors can make better use of their data by combining them to implement a collaborative task. For example, traffic prediction can make greater progress by utilizing data from multiple sensors. An illustration of data sharing among various devices is shown in Fig. 1. Our goal is to design a secure data sharing mechanism that shares data among distributed multiple users intelligently while maintaining data privacy effectively. We consider $N$ parties (data holders) and a union dataset $D$. Each party $P_i$ holds a local dataset $D_i \subseteq D$. Each of the $N$ parties agrees to share its data without revealing any private information. Let $R = \{r_1, r_2, ..., r_m\}$ be the data sharing requests, with queries $r_i$ submitted by a requester. Instead of returning the raw data, we provide the computed results for these queries. All the participants related to the request work together with the corresponding learning algorithm to train a global model $M$ without leaking any private data. Finally, the trained global data model $M$ is returned to the data recipient. Leveraging the received model, data recipients can obtain answers $R(M)$ to their data sharing requests locally.

Fig. 1: An example of distributed data sharing

A. Threat Model

We focus on collaborative data sharing, where $K$ data providers (owners) and one data requester work together to accomplish a data sharing task. The data providers and the data requester are considered dishonest. The proposed mechanism is exposed to three types of threats. The first is the quality of the provided data. Dishonest providers may provide biased and inaccurate results to the requester, reducing the usability of the entire shared data. The second is data privacy. Providers and receivers may try to infer the private data of others from shared data, which may lead to unwanted leakage of sensitive data from data providers. The threat of collusion also exists if a group of participants tries to infer the data of other participants. The third is data authority management. Once the raw data is shared, the data owner loses control over it, and a dishonest participant may pass it on to unauthorized entities.


B. Our Proposed Architecture

Our proposed data sharing architecture is shown in Fig. 2. The proposed system consists of two modules: a permissioned blockchain module and a federated learning module. The permissioned blockchain, which is maintained by entities equipped with computing and storage resources, named super nodes, such as Base Stations (BSs) and Road Side Units (RSUs), establishes secure connections among all the end IoT devices through its encrypted records. There are two types of transactions in our permissioned blockchain: retrieval transactions and data sharing transactions. Because of privacy concerns and storage limitations, we use the permissioned blockchain only to retrieve the related data and manage the accessibility of data, not to record the raw data. Moreover, the permissioned blockchain records all data sharing events, so that the use of data can be traced for further audit.

Fig. 2: Architecture of secure data sharing scheme

[Figure content: a chain of blocks (Block i-1 to Block i+2), each with a header and participant records ($P_i$, $P_j$, $P_k$); local models $M_{K-1}$ and $M_K$.]

Suppose all the parties that agreed to share data have registered in the permissioned blockchain by uploading retrieval records to the blocks. A data requester launches a data sharing request $Req$ containing a set of queries $F_x = \{f_1, f_2, ..., f_x\}$ to its nearby super node $SN_{req}$. Fig. 3 shows the working mechanism of our scheme. The nearby super node $SN_{req}$ first searches the permissioned blockchain to check whether the request has been processed before. If there is a hit, the request is forwarded to the node that has cached the results for request $Req$, and the cached results are sent to the requester as a reply. Otherwise, for a new data sharing request, the multi-party data retrieval process is executed to find the related parties according to the registration records. We regard these parties as committee nodes, which are responsible for driving the consensus in the permissioned blockchain. The committee nodes then jointly train a global data model $M$ by federated learning. Once the model is trained, the data requester $r$ uses $Req = \{f_1, f_2, ..., f_x\}$ as the input of $M$ and obtains the corresponding sharing results $M(Req)$. Data model $M$ can accept any query $f_x$ in the query set $F_x$ and provide a result $M(f_x)$ for the query. In addition, as a machine learning model, $M$ can also make predictions on a fresh query $f_y \notin F_x$.
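The request flow described above (cache hit versus new request) can be sketched in a few lines of Python. This is only an illustrative sketch: the class `SuperNode`, the callbacks `retrieve_committee` and `train_federated`, and the use of an in-memory dictionary in place of a blockchain lookup are all assumptions made for the example, not part of the proposed system.

```python
from typing import Callable, Dict, FrozenSet, List

Query = str
Model = Callable[[Query], str]  # a trained model maps a query to an answer


class SuperNode:
    """Hypothetical super node SN_req; a dict stands in for the blockchain index."""

    def __init__(self, retrieve_committee, train_federated):
        self.cache: Dict[FrozenSet[Query], Model] = {}
        self.retrieve_committee = retrieve_committee  # multi-party retrieval
        self.train_federated = train_federated        # federated learning
        self.trainings = 0

    def handle(self, req: List[Query]) -> Model:
        key = frozenset(req)
        if key in self.cache:                      # hit: reply with cached model
            return self.cache[key]
        committee = self.retrieve_committee(req)   # miss: find related parties
        model = self.train_federated(committee, req)
        self.trainings += 1
        self.cache[key] = model                    # cache for future requests
        return model


def demo_retrieve(req):
    # Placeholder retrieval: always returns the same two parties.
    return ["P1", "P2"]


def demo_train(committee, req):
    # Placeholder "training": memorizes answers, falls back for fresh queries.
    answers = {q: f"answer({q}) from {len(committee)} parties" for q in req}
    return lambda q: answers.get(q, "prediction for fresh query")


node = SuperNode(demo_retrieve, demo_train)
m1 = node.handle(["f1", "f2"])
m2 = node.handle(["f2", "f1"])  # same query set, served from the cache
```

In this toy version the second call returns the cached model without retraining, which mirrors the lookup-hit branch of the mechanism.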

III. BLOCKCHAIN AND FEDERATED LEARNING FOR SECURE DATA SHARING

In this paper, we consider the problem of privacy-preserving data sharing among decentralized multiple parties. Due to the constrained resources of edge users and their privacy concerns, we share the federated data model learned over decentralized multiple parties instead of the original data. The data model contains valid information for the requests and a minimal amount of the participants' private data.

Fig. 3: The working mechanism of proposed method

[Figure content: a data requester sends a sharing request; cached requests are answered by a caching node with the cached model, while new requests trigger multi-party retrieval and federated learning, and the learned model is returned as the reply. Request records are written to the blockchain.]

A. Normalized Weighted Graph

It is challenging for IIoT end devices to output and maintain structured data due to limited computing and storage resources. Instead, they are more likely to generate unstructured data, e.g., in the form of text files. Studies of such data are limited. To fill this gap, we focus on unstructured textual data in our data sharing scenarios. We define a two-step distance metric learning scheme for retrieval of the textual data, which quantifies the similarity of specified data.

To improve computing and storage efficiency under constrained resources, we use graphs, as defined in Definition 1, to represent the original data for further processing, since graphs retain more structure and context information.

Definition 1 (Weighted Context Graphs): A weighted context graph $G = \{V, E\}$ comprises a set of nodes (key terms) $V$ and a set of edges $E \subseteq V \times V$. Each node $n_i$ contains a text term $t_{n_i}$ and its weight $w_{n_i}$, written $(n_i, w_{n_i})$. Each edge $e_{ij}$ connects nodes $n_i$ and $n_j$ with a weight $w_{e_{ij}}$ denoting the correlation degree.

We use a weight matrix $A = [a_{ij}]$ to represent the graph, where $a_{ij} = w_{n_i}$ if $i = j$ and $a_{ij} = w_{e_{ij}}$ if $i \neq j$. We use term frequency-inverse document frequency (TF-IDF) to construct our graphs. Thus, all the textual files are transformed into weighted context graphs $\{G_1, G_2, ..., G_n\}$.
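A minimal Python sketch of how such a weight matrix might be built is given below. The node weights follow the standard TF-IDF formula. The paper does not specify how the edge weight (correlation degree) is computed, so the sketch assumes it is the number of times two key terms appear next to each other; the function names and the sample documents are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return out


def context_graph(doc, weights, k=3):
    """Weight matrix A: a_ii = node weight, a_ij = edge weight (i != j).

    Assumption: the edge weight counts how often two key terms are adjacent.
    """
    # Keep the k highest-weighted terms; ties are broken alphabetically.
    terms = sorted(weights, key=lambda t: (-weights[t], t))[:k]
    idx = {t: i for i, t in enumerate(terms)}
    a = [[0.0] * len(terms) for _ in terms]
    for t, i in idx.items():
        a[i][i] = weights[t]
    for u, v in zip(doc, doc[1:]):
        if u in idx and v in idx and u != v:
            a[idx[u]][idx[v]] += 1.0
            a[idx[v]][idx[u]] += 1.0
    return terms, a


docs = [
    "pump pressure high pump valve".split(),
    "valve closed motor idle".split(),
    "motor temperature high".split(),
]
w = tfidf(docs)
terms, A = context_graph(docs[0], w[0], k=3)
```

The resulting matrix is symmetric, with TF-IDF weights on the diagonal and adjacency counts off the diagonal.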

In the second step, we serialize the graphs. Although graphs keep much context information, they are difficult for machine learning algorithms to process as input. We therefore map the graphs into linear vectors by serializing them. The graphs are first merged into a global graph $G = G_1 \cup G_2 \cup ... \cup G_n$. For the global graph $G = \{V, E\}$, let $k$ be the number of representative vertices. Then, the size of the normalized attributes for nodes is $k$ and the size of the normalized attributes for edges is $k \times (k-1)/2$. Thus the normalized vector is $Seq = V \cup E = \{V_1, ..., V_k\} \cup \{E_1, E_2, ..., E_{k(k-1)/2}\}$. We use Jaccard similarity [23] as the distance function to cluster documents with the k-means algorithm. With the normalized weighted graph and the defined distance metric, we cluster the dataset $\{D_1, ..., D_n\}$ into categories according to textual similarity. We also divide the participating users into groups according to their data.
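The serialization step can be illustrated with a short Python sketch. It flattens a symmetric $k \times k$ weight matrix into $k$ node weights followed by $k(k-1)/2$ edge weights and compares two such vectors. The text names only Jaccard similarity; the weighted generalization used here (one minus the ratio of element-wise minima to maxima) is an assumption, as is the premise that both matrices are indexed by the same global term order.

```python
def serialize(a):
    """Flatten a symmetric k x k weight matrix into k node weights
    followed by the k(k-1)/2 upper-triangle edge weights."""
    k = len(a)
    nodes = [a[i][i] for i in range(k)]
    edges = [a[i][j] for i in range(k) for j in range(i + 1, k)]
    return nodes + edges


def jaccard_distance(u, v):
    """Weighted Jaccard distance: 1 - sum(min) / sum(max).

    Assumption: the weighted generalization of Jaccard similarity is used.
    """
    den = sum(max(x, y) for x, y in zip(u, v))
    if den == 0:
        return 0.0
    return 1.0 - sum(min(x, y) for x, y in zip(u, v)) / den


# Two hypothetical weight matrices over the same three global key terms.
A1 = [[0.5, 1.0, 0.0], [1.0, 0.2, 2.0], [0.0, 2.0, 0.1]]
A2 = [[0.5, 0.0, 0.0], [0.0, 0.2, 2.0], [0.0, 2.0, 0.3]]
s1, s2 = serialize(A1), serialize(A2)
distance = jaccard_distance(s1, s2)
```

A distance function of this kind could then be passed to a clustering routine; the clustering itself is omitted here.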

B. Multi-party Data Retrieval

Since most of the data is sensitive and the amount of data is large, putting data on the blockchain, with its limited storage space, is resource intensive and risky. Thus, we use the blockchain to retrieve data, while the real data is stored locally by its owners. When a new data provider joins, its unique identity (ID) is recorded as a transaction in the blockchain, together with the profiles of its data, including data categories, data types, and data size. All the data profiles from multiple participants are recorded in the form of transactions and verified by the blockchain nodes using a Merkle tree [24]. Each data sharing event is also stored in the blockchain as a transaction. The detailed forms of the two transactions, namely the retrieval transactions recording data profiles of permissioned participants and the data sharing transactions recording all the data sharing events, are shown in Fig. 4.

Fig. 4: The records of blockchain

[Figure content: a chain of blocks (Block i-1 to Block i+2), each with a header and participant records. Retrieval record fields: hash pointer, provider ID, data categories, data type, data size, timestamp. Data sharing record fields: request type, requester ID, data categories, data type, data size, timestamp.]
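The Merkle-tree verification of data profiles mentioned above can be sketched as follows. This is a generic Merkle root computation with SHA-256, not the paper's implementation; duplicating the last hash on odd levels follows the Bitcoin convention and is an assumption, and the serialized profile strings are invented for the example.

```python
import hashlib


def sha256(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()


def merkle_root(records):
    """Root hash over serialized records.

    Assumption: the last hash is duplicated when a level has an odd
    number of entries (Bitcoin-style).
    """
    if not records:
        return sha256(b"")
    level = [sha256(r) for r in records]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical serialized retrieval records (provider | category | type | size).
profiles = [
    b"P1|sensor-logs|text|12MB",
    b"P2|maintenance|text|3MB",
    b"P3|alarms|text|1MB",
]
root = merkle_root(profiles)
tampered = merkle_root([profiles[0], b"P2|maintenance|text|9MB", profiles[2]])
```

Any change to a single profile changes the root, which is what lets verifying nodes detect a modified record by comparing one hash.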

Retrieving the participants related to a data sharing request from the blockchain is a fundamental problem in the proposed model. Since there are many participants, those who possess data related to the request should take part in data sharing to increase the accuracy of the response. Nonetheless, the retrieval process should not break the privacy of any participant. A distributed retrieval scheme is needed to quickly locate the requested data distributed among participants, who can then collaboratively respond to the request.

Inspired by Kademlia [25], we design a multi-party retrieval mechanism in the blockchain. All participants $P$ are partitioned into communities according to their data categories; that is, members of a community hold similar categories of data. Each community maintains a local retrieval table of $\log(n)$ records for $\log(n)$ different communities. Each node in the community stores the IDs of all its community members, together with $\log(\log(n))$ nodes for each of its $\log(n)$ closest (most related in data categories) communities. In this way, the most related participants are kept in the local retrieval table of $P_i$, as shown in Fig. 5.

Fig. 5: The local retrieval table

[Figure content: buckets 0 to n, each holding a maximum of K nodes, with hash pointers to participants $P_i$, $P_j$, $P_k$.]

We extract a list of key terms from the data of every participant as the representative features in the form of hash values. Furthermore, due to the limited communication resources of IIoT devices, the physical distance between two nodes $d_p(P_i, P_j)$ should also be considered in the retrieval process. Then, based on Jaccard distance, the logic distance between their key terms is

$$d_i(P_i, P_j) = \frac{\sum_{m,n \in \{P_i \cup P_j - P_i \cap P_j\}} \left(a^{P_i}_{mn} + a^{P_j}_{mn}\right)}{\sum_{m,n \in P_i \cup P_j} \left(a^{P_i}_{mn} + a^{P_j}_{mn}\right)} \cdot \log\left(d_p(P_i, P_j)\right), \tag{1}$$

where $a^{P_i}_{mn}$ and $a^{P_j}_{mn}$ are the elements of the weighted matrices of node $P_i$ and node $P_j$, respectively. The ID of each participant (device) is generated according to the logic distance. That is, the more closely related two nodes are, the longer their common ID prefix will be.
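Eq. (1) can be evaluated directly once a representation of the weighted matrices is fixed. The sketch below assumes each node's matrix is stored as a dictionary keyed by term pairs, that a pair "belongs" to a node when it appears in that node's dictionary, that missing entries count as zero, and that the logarithm is natural. None of these choices is stated in the text, so they should be read as illustrative assumptions.

```python
import math


def logic_distance(a_i, a_j, physical_distance):
    """Eq. (1): weighted Jaccard-style ratio scaled by log of physical distance.

    a_i, a_j: {(term_m, term_n): weight} for nodes P_i and P_j (assumed layout).
    """
    keys_i, keys_j = set(a_i), set(a_j)
    union = keys_i | keys_j
    sym_diff = keys_i ^ keys_j  # (P_i union P_j) minus (P_i intersect P_j)

    def total(keys):
        return sum(a_i.get(k, 0.0) + a_j.get(k, 0.0) for k in keys)

    den = total(union)
    if den == 0:
        return 0.0
    return total(sym_diff) / den * math.log(physical_distance)


# Hypothetical weighted matrices of two nodes.
a_i = {("pump", "pump"): 0.4, ("pump", "valve"): 1.0}
a_j = {("pump", "pump"): 0.2, ("motor", "motor"): 0.4}
d = logic_distance(a_i, a_j, physical_distance=math.e)  # log term equals 1
```

Note that, with the formula as written, a physical distance of 1 gives a logic distance of zero regardless of the data, so the unit chosen for $d_p$ matters in practice.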

Given two nodes $P_i$ and $P_j$ with IDs $P_i(id)$ and $P_j(id)$, respectively, the relevance distance between them is defined as

$$d(P_i, P_j) = P_i(id) \oplus P_j(id). \tag{2}$$

When a user submits a data sharing request to its nearby node $P_i$, all nodes in the same community as $P_i$ send the request to the nodes in their local routing tables within a certain distance to start the retrieval process. This process is repeated recursively until all nodes within the relevant distance have been traversed. At the end of retrieval, we obtain the subset of nodes related to the request, $P_s \subseteq P$, which also serve as the committee nodes that run the consensus process to approve the data sharing results.
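The XOR relevance distance of Eq. (2) and its link to common ID prefixes can be shown with a few lines of Python. The 8-bit IDs and the helper names are assumptions for illustration; real identifiers would be much longer hash values.

```python
def xor_distance(id_a: int, id_b: int) -> int:
    """Eq. (2): relevance distance as the bitwise XOR of two IDs."""
    return id_a ^ id_b


def common_prefix_len(id_a: int, id_b: int, bits: int = 8) -> int:
    """Number of leading bits shared by two fixed-width IDs."""
    return bits - (id_a ^ id_b).bit_length()


def closest(target: int, ids, count: int):
    """The `count` IDs with the smallest XOR distance to `target`."""
    return sorted(ids, key=lambda i: xor_distance(target, i))[:count]


ids = [0b10110010, 0b10110111, 0b10010000, 0b01100001]
target = 0b10110000
near = closest(target, ids, 2)
```

IDs that share a longer prefix with the target have a smaller XOR distance, which is why generating IDs so that similar datasets share prefixes makes the nearest nodes the most related ones.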

C. Data Sharing Process

Existing methods use encryption for data security. However, in data sharing scenarios, it is still risky for data holders to share the original data because of various attacks on the encryption. A more secure method is to share the answers to the requests, which provides the requesters with valid information and protects the privacy of data holders. The data providers therefore share learned data models with the requesters instead of the original data.

When the data requester $r$ initiates a sharing request $Req$, it submits the request to its nearby super node $SN_{req}$ of the permissioned blockchain. $SN_{req}$ first searches the blockchain to find out whether the request has been processed before. If there is a lookup hit, the previously computed and cached data model $M$ is returned to the requester directly. Otherwise, node $SN_{req}$ looks up the blockchain for the nodes related to the sharing request $Req$, i.e., the committee nodes, through the aforementioned multi-party data retrieval process. The committee nodes are responsible for executing the consensus process and for collaboratively learning the federated data model $M$. A committee node $P_i$ learns a local data model $m_i$ for the request and then sends $m_i$ to other related participants according to the local retrieval table of $P_i$. This process is repeated across the related parties until all of them have been traversed. The trained data model $M$ is returned to requester $r$ as the answer to its data sharing request.

The detailed steps of our data sharing scheme are as follows.

1) Initialization: Before a data provider $P_i$ joins, a local clustering based on Jaccard similarity is executed to cluster its textual data into categories. We serialize the key terms as vectors in a fixed order to represent a category. The more similar two datasets are, the smaller their distance. The nearby super node to which $P_i$ belongs then searches the blockchain for the records that are logically close to it (according to XOR distance). For each participant, we generate its ID based on the hashed vectors to ensure that participants holding similar datasets have similar IDs. In addition, to enhance computing efficiency, we divide the participants in advance by running a community partition process, in which all nodes are partitioned into communities according to their distances to each other.

2) Registering retrieval records: Once $P_i$ joins, it first sends its public key $PK_r$ and its data profiles to the nearby super node for registration. A data retrieval record for $P_i$ is then generated and broadcast by the super node to the other nodes in the permissioned blockchain for verification. The other nodes collect all received records and verify them before writing them into the permissioned blockchain.

3) Launching data sharing requests: Data requester $r$ posts a data sharing request $Req = \{f_1, f_2, ..., f_x\}$ to its nearby super node $SN_{req}$. Request $Req$ contains the ID of $r$, the requested data category, and the timestamp, and is signed by $r$ with its private key $SK_r$.

4) Data retrieval: Once a nearby node receives the data sharing request, it verifies the identity of requester $r$. It then searches the permissioned blockchain to check whether the request has been processed before. If there is a record, the cached model is returned as the reply. Otherwise, it runs the multi-party retrieval process to find the related parties.

5) Data model learning: The related parties work collaboratively to respond to the sharing request. They run federated learning to train a global data model $M$ for the request $Req$. The training set is generated from the local data $D$ and the corresponding query results $f_x(D)$, i.e., $D_T = \langle f_x, f_x(D) \rangle$. The learned global model $M$ is then returned to the requester as a reply and is cached locally by a node for future requests.

6) Generating data sharing records: The data sharing events between data requesters and data providers are recorded as transactions and broadcast in the permissioned blockchain. All the records are collected into blocks, which are encrypted and signed by the collecting node.

7) Carrying out consensus: The consensus process is executed by the related nodes selected for data retrieval. Each node competes for the opportunity to write blocks to the blockchain through the proof-of-work (PoW) protocol. The node that wins the competition broadcasts its block to the other nodes for verification. Once verification passes, the block is added to the permissioned blockchain, which is tamper-proof.

By combining federated learning with permissioned blockchain, the requested data can be retrieved and shared securely in industrial IoT scenarios with distributed multiple data providers, which improves the scale and quality of shared data. However, the PoW consensus protocol incurs both high energy consumption and computation overhead, making it less practical for IoT devices. To address this issue, we further propose a new consensus protocol that improves the utility and efficiency of the computing work done for consensus.

D. Consensus: Proof of training Quality (PoQ)

Transforming the data sharing problem into a model sharing problem brings many benefits. Sharing only the data model instead of the original data helps protect the privacy of data owners. In addition, machine learning data models are more effective at providing the required information for new sharing requests.

Directly using an existing consensus protocol such as PoW for data sharing either incurs a high cost in computing and communication resources or contributes little to data sharing. To address this problem, we propose a federated learning empowered consensus protocol, Proof of training Quality (PoQ). PoQ combines data model training with the consensus process, which makes better use of the nodes' computing resources.

For a specific data sharing request, we select members of the consensus committee by retrieving the nodes related to the request in the blockchain. The committee is responsible for driving the consensus process, as well as for learning the data models for the requested data. The objective of federated learning is to train a global data model $M$ that can provide a valid response $M(Req)$ for data sharing requests $Req$. Model $M$ can be trained with a range of machine learning algorithms, e.g., random tree, random forest, and gradient boosting decision tree (GBDT). Once constructed, model $M$ can answer data queries, even if the queries are fresh.

1) Differentially private federated learning: To protect data privacy during multi-party decentralized learning, we incorporate a differential privacy mechanism into federated learning. For two neighboring datasets $D$ and $D'$ that differ in at most one record, and a set of outcomes $S$, a randomized algorithm $A$ achieves $\varepsilon$-differential privacy if

$$\Pr[A(D) \in S] \le \exp(\varepsilon) \cdot \Pr[A(D') \in S], \tag{3}$$




where $\varepsilon$ is the privacy budget. The training process over distributed data owners is shown in Fig. 6. The related parties $\{P_1, P_2, ..., P_n\}$ are selected through multi-party retrieval in the blockchain. The textual data they hold is transformed into normalized graph vectors $Vec_g = \langle v_1, v_2, ..., v_k, e_{11}, e_{12}, ..., e_{kk} \rangle$. When there is a data sharing request $R$ (including a series of queries), $P_i$ first trains a local data model $m_i$ based on $Vec_{g_i}$ for the request. $Vec_{g_i}$, composed of sensitive terms and weights, is used to train a local model in the federated learning process. Since the local model will be shared with other participants, we incorporate differential privacy in the learning phase and train $m_i$ from noised data to protect the privacy of $Vec_{g_i}$. Then $P_i$ sends model $m_i$ to other participants. Once $m_i$ is received, $P_{i+1}$ trains a new local data model $\hat{m}_{i+1}$ based on the received $m_i$ and its local data, and then broadcasts $\hat{m}_{i+1}$ to other participants. The data models are trained iteratively among participants. Finally, the global data model $M$ is generated, where $M = \{m_1 \cup m_2 \cup ... \cup m_n\}$.

Fig. 6: The overview of training in distributed scenario

[Figure content: data providers train local models that are combined into the data model $M$; the data requester submits requests $f_1, f_2, ..., f_x$ and receives the results $f_1(M), f_2(M), ..., f_x(M)$.]

For any participant $P_i$, there are three steps to learn the local model $m_i$:

• Selecting training samples: The data owner selects the data $D$ related to the request set $R$ and transforms them into normalized graph vectors $Vec$.

• Differentially private local model training: Noise calibrated by the sensitivity $s$ is added to the local data $Vec_i$. The local data model $m_i$ is trained locally at $P_i$ by running a machine learning algorithm on the selected noisy data $Vec_i$.

• Collaborative multi-party learning: The Laplace mechanism is applied to the local data model $m_i$ to achieve differential privacy,

$$m_i = m_i + \mathrm{Laplace}(s/\varepsilon), \tag{4}$$

where $s$ is the sensitivity, as defined in Eq. (5),

$$s = \max_{D, D'} \|f(D) - f(D')\|_1. \tag{5}$$

Then the noise-added model $m_i$ is broadcast as a blockchain transaction to other participants for federated learning. This process is repeated iteratively until the performance of the federated model reaches the threshold or the training time runs out.
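The Laplace mechanism of Eq. (4) can be sketched with the Python standard library alone. The sketch represents a model as a plain list of parameters, which is an assumption; the helper `l1_sensitivity` evaluates the norm in Eq. (5) for a single pair of neighbouring datasets only, whereas the true sensitivity is the maximum over all such pairs. Laplace samples are drawn as the difference of two exponential draws, a standard identity.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def l1_sensitivity(f, d, d_prime):
    """||f(D) - f(D')||_1 for ONE pair of neighbouring datasets.

    Eq. (5) takes the maximum over all neighbouring pairs; that
    maximization is left to the caller here.
    """
    return sum(abs(x - y) for x, y in zip(f(d), f(d_prime)))


def privatize(params, sensitivity, epsilon, rng):
    """Eq. (4): add Laplace(s / epsilon) noise to every model parameter."""
    scale = sensitivity / epsilon
    return [p + laplace_noise(scale, rng) for p in params]


# Hypothetical query: the mean of a small dataset with values in [0, 1].
def mean_query(d):
    return [sum(d) / len(d)]


s = l1_sensitivity(mean_query, [0.2, 0.4, 0.9], [0.2, 0.4, 0.0])
rng = random.Random(0)
noisy_model = privatize([0.5, -1.2], sensitivity=s, epsilon=1.0, rng=rng)
```

A smaller privacy budget $\varepsilon$ yields a larger noise scale $s/\varepsilon$, trading model accuracy for stronger privacy.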

Algorithm 1 illustrates the overall process of our proposed scheme.

Algorithm 1 Differential private federated learning

Input: data request $Req$, related participants $P$, iteration times $iter = 0$
Output: data model $M$

for each participant $p_i \in P$ do
    while accuracy ≤ Threshold do
        if $iter = 0$ then
            Construct a new differential data model $m_i$ based on its noise-added vector data $Vec_i$
            Broadcast $m_i$ to other related participants according to local retrieval tables
        else
            Construct differential private data model $m_i$ with previously received models
            Broadcast $m_i$ to the other participants who are engaged in the data sharing process
        end if
        $iter = iter + 1$
        $M = \frac{1}{k} \sum_k m_i$
    end while
end for
return $M$ to the requester
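The control flow of Algorithm 1 can be illustrated with a deliberately small Python sketch. It is not the authors' implementation: the local model is a one-parameter least-squares slope, the way a participant combines its own fit with previously received models (an equal-weight blend with the last global model) is an assumption, and the stopping rule uses MAE falling below a threshold rather than accuracy exceeding one.

```python
import random


def fit_slope(xs, ys):
    """Least-squares slope of y = w * x (the toy 'local model')."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mae(w, xs, ys):
    return sum(abs(y - w * x) for x, y in zip(xs, ys)) / len(xs)


def federated_train(parties, noise_scale, max_mae, max_iter, rng):
    """parties: list of (xs, ys). Returns (global slope M, rounds used)."""

    def noisy(ys):
        # Laplace noise as the difference of two exponential draws.
        return [y + rng.expovariate(1 / noise_scale)
                - rng.expovariate(1 / noise_scale) for y in ys]

    global_model, rounds = None, 0
    while rounds < max_iter:
        local = []
        for xs, ys in parties:
            m_i = fit_slope(xs, noisy(ys))          # train on noise-added data
            if global_model is not None:            # later rounds: use received models
                m_i = 0.5 * m_i + 0.5 * global_model  # assumed blending rule
            local.append(m_i)                       # "broadcast" m_i
        global_model = sum(local) / len(local)      # M = (1/k) * sum of m_i
        rounds += 1
        if max(mae(global_model, xs, ys) for xs, ys in parties) <= max_mae:
            break
    return global_model, rounds


rng = random.Random(42)
xs = list(range(1, 21))
parties = [(xs, [2.0 * x for x in xs]) for _ in range(3)]
M, rounds = federated_train(parties, noise_scale=0.1, max_mae=0.5,
                            max_iter=10, rng=rng)
```

Each party only ever exposes a model trained on noised data, and the global model is the plain average of the local ones, matching the averaging step of the algorithm.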

2) Training quality based consensus: The consensus process is executed by the selected committee based on the work of collaborative training. The committee nodes are a subset of all the participants. Communication overhead is reduced by sending consensus messages only to the committee nodes instead of to all nodes. However, the reduction in the number of nodes also makes it more challenging to achieve consensus. To balance overhead and security, we use the proof of training work for consensus in data sharing. The committee leader is selected based on the quality of the trained model. Since each committee node trains a local data model, the quality of the model should be verified and measured during the consensus process. We use prediction accuracy to quantify the performance of the trained local model. More specifically, for classification, the accuracy is the fraction of correctly classified records, while for regression, the accuracy is measured by the mean absolute error (MAE):

$$\mathrm{MAE}(m_i) = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - f(x_i) \rvert, \tag{6}$$

where $f(x_i)$ is the value predicted by model $m_i$ and $y_i$ is the real value of the record. The lower the MAE of model $m_i$ is, the higher the accuracy of $m_i$ will be.

After the differentially private collaborative training, we get the trained global data model $M$ and the local model $m_i$ for each committee node. The consensus process is executed by the committee. While responding to a data sharing request, a committee node $P_i$ transmits its trained model $m_i$ to the next committee node. The transmissions are recorded as model transactions $t_{m_i}$, together with the corresponding $\mathrm{MAE}(m_i)$. The record tuple is shown in Table I. The illustration of model transactions in the training process is shown in Fig. 7.

TABLE I: The record tuple of a model transaction

hash pointer | owner id | receiver id | model data | MAE | timestamp

To encrypt and sign the messages, $P_i$ has a pair of public and private keys $(PK_i, SK_i)$. Then $P_i$ broadcasts the encrypted transmission $E(SK_i(t_m), PK_i)$ to other committee nodes. A committee node $P_j$ collects all model transactions and stores them locally as candidate blocks. As a proof of training work, $P_j$ verifies all transactions it receives by calculating the MAE defined in Eq. (6). The MAE for $P_j$, $\mathrm{MAE}_u(P_j)$, is calculated as

$$\mathrm{MAE}_u(P_j) = \gamma \cdot \mathrm{MAE}(m_j) + \frac{1}{n} \sum_{i} \mathrm{MAE}(m_i), \tag{7}$$

where $\mathrm{MAE}(m_j)$ is the MAE of the locally trained model $m_j$, and $\gamma$ is the weight parameter denoting the contribution of $P_j$ to the global model, decided by the training data size of $P_j$ and of the other participants: $\gamma = 1 + |d_j| / \sum_i |d_i|$.
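Eqs. (6) and (7) can be made concrete with a short Python sketch. It assumes that the sum in Eq. (7) runs over the MAE values of all committee models known to $P_j$ (including its own) and that $n$ in Eq. (7) is the number of those models; the text does not state either point explicitly. The function names are hypothetical.

```python
def mae(y_true, y_pred):
    # Eq. (6): mean absolute error over n records.
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def gamma(j, data_sizes):
    # gamma = 1 + |d_j| / sum_i |d_i|, the weight of participant j.
    return 1.0 + data_sizes[j] / sum(data_sizes)


def mae_u(j, model_maes, data_sizes):
    # Eq. (7): weighted own MAE plus the average MAE of the committee
    # models (assumed here to include model j itself).
    n = len(model_maes)
    return gamma(j, data_sizes) * model_maes[j] + sum(model_maes) / n
```

For example, with two nodes holding 100 and 300 records and local MAEs of 0.2 and 0.4, node 0 has $\gamma = 1.25$ and $\mathrm{MAE}_u = 1.25 \cdot 0.2 + 0.3 = 0.55$.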

When the consensus process starts, the committee node with the lowest $\mathrm{MAE}_u$ at that time will be elected as the committee leader through MAE-based voting. The leader is responsible for driving the consensus process among the participating nodes. As mentioned earlier, the leader gathers all the transactions it received at the beginning, including the final data model $M$, to form a block $B_k = (H_k, \langle t_{m_i} \rangle, M)$, where $H_k$ is the header of $B_k$. Then the leader broadcasts $B_k$ to all members of the committee for approval. In addition to the regular verification of a block (e.g., the header format, block size, timestamp), the committee nodes also audit the block by verifying the model transaction track, much as verification nodes verify a transaction's amount in Bitcoin. Each verifying node calculates $\mathrm{MAE}(m_i)$ for each model transaction and $\mathrm{MAE}(M)$. If the calculated MAE is within a certain range, an approval will be sent to the leader. If the block containing all transactions is approved by every committee node, the leader will send the block data signed with its signature to all nodes. Then the records will be stored in the blockchain, where they are tamper-proof.
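The election and approval steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the "certain range" for an acceptable MAE is modelled as an absolute tolerance around the value recorded in the block, and the names `elect_leader`, `approves` and `block_committed` are hypothetical.

```python
def elect_leader(mae_u_by_node):
    # The committee node with the lowest MAE_u becomes the leader.
    if not mae_u_by_node:
        raise ValueError("committee must not be empty")
    return min(mae_u_by_node, key=mae_u_by_node.get)


def approves(claimed, recomputed, tolerance):
    # A verifier approves when every MAE it recomputes lies within
    # `tolerance` of the value recorded in the block (assumed criterion).
    return all(abs(claimed[tx] - recomputed[tx]) <= tolerance for tx in claimed)


def block_committed(claimed, recomputed_by_verifier, tolerance):
    # The block is committed only if every committee node approves.
    return all(
        approves(claimed, recomputed, tolerance)
        for recomputed in recomputed_by_verifier.values()
    )
```

A single verifier whose recomputed MAE falls outside the tolerance is enough to block the commit, which reflects the requirement that the block be approved by every committee node.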

[Figure: transactions TX i-1 and TX i with their inputs and outputs; the payloads are unions of models such as … $m_{j-1} \cup m_j \cup$ …]

Fig. 7: Illustration of model transactions

The process for training work based consensus is illustrated in Fig. 8.
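The record tuple of Table I and the chaining of model transactions in Fig. 7 can be sketched as a small Python data structure. The field names follow Table I; everything else is an assumption made for illustration: the hash pointer is taken to be the digest of the previous transaction, SHA-256 over a JSON serialization is an arbitrary choice, and `ModelTransaction` and `chain_is_intact` are hypothetical names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelTransaction:
    hash_pointer: str  # digest of the previous transaction ("" for the first)
    owner_id: str
    receiver_id: str
    model_data: str    # serialized model m_i
    mae: float         # MAE(m_i) recorded with the model
    timestamp: int

    def digest(self):
        # Deterministic serialization so equal records hash equally.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def chain_is_intact(transactions):
    # Each transaction must point at the digest of its predecessor.
    return all(
        current.hash_pointer == previous.digest()
        for previous, current in zip(transactions, transactions[1:])
    )
```

Changing any field of an earlier transaction, such as its recorded MAE, changes its digest and breaks the pointer held by its successor.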

IV. SECURITY ANALYSIS AND NUMERICAL RESULTS

A. Security Analysis

[Figure: message flow among the leader, nodes $P_i$ and $P_j$, and the retrieval blockchain across the phases Transactions, Broadcast, Verify and Store: 1. local model $m_i$; 2. transaction $t_{m_i}$; 3. broadcast block $b_i$; 4. approval reply; 5. store $b_i$.]

Fig. 8: The consensus process of PoQ

The use of permissioned blockchain establishes a secure mechanism for multiple parties without mutual trust. We integrate federated learning into the consensus process of permissioned blockchain to address the aforementioned security threats.

1) Achieving differential privacy: According to the definition of differential privacy, if each step of data processing follows the requirement of differential privacy, i.e., Eq. (3), the final results will satisfy differential privacy [26]. Algorithm 1 shows that the privacy budget is only consumed in step 4, where noise is added to data vectors. The other steps, which use the training methods (e.g., trained tree) for calculating residuals, are only mapping operations that are data-independent and will not disclose any private information. Therefore, Algorithm 1 satisfies ε-differential privacy.

2) Removing centralized trust: The permissioned blockchain takes the place of a trusted curator to connect each participant through multi-party data retrieval. The centralized trust, which incurs a high risk of data leakage, is no longer required in the proposed blockchain empowered data sharing scheme.

3) Guaranteeing the quality of shared data: To prevent a dishonest provider from sharing invalid data, the PoQ consensus process validates the quality of learned data models by other data providers, and only the qualified models are preserved.

4) Secure data management: Only the data retrieval is uploaded to the permissioned blockchain, while real data is stored locally by each data provider. Data owners can control the authority of their own data. Moreover, permissioned blockchain uses a series of cryptographic algorithms, such as the elliptic curve digital signature algorithm and asymmetric cryptography, to guarantee the security of data.
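The differential privacy argument in item 1) can be checked numerically for a single noisy value. The sketch below assumes the standard Laplace mechanism with scale equal to sensitivity divided by ε, which the section does not spell out, and uses hypothetical function names. It measures the log-ratio of the output densities for two neighbouring inputs, which should never exceed ε when the inputs differ by at most the sensitivity.

```python
import math


def laplace_log_pdf(x, mu, scale):
    # Log-density of a Laplace(mu, scale) distribution at x.
    return -abs(x - mu) / scale - math.log(2.0 * scale)


def privacy_loss(x, q_d, q_d_prime, scale):
    # Absolute log-ratio of the output densities for two neighbouring
    # query answers q_d and q_d_prime, evaluated at output x.
    return abs(laplace_log_pdf(x, q_d, scale) - laplace_log_pdf(x, q_d_prime, scale))


def max_privacy_loss(q_d, q_d_prime, epsilon, sensitivity, outputs):
    # Worst-case loss over a grid of candidate outputs.
    scale = sensitivity / epsilon
    return max(privacy_loss(x, q_d, q_d_prime, scale) for x in outputs)
```

Because a data-independent mapping of the noisy value cannot increase this ratio (the post-processing property of differential privacy), the later training steps do not consume additional budget.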

B. Evaluation Setup

We conduct evaluations of the proposed secure data sharing scheme on two real-world data sets, which are widely used for evaluating text-related machine learning algorithms. The first one is the Reuters dataset [27], a benchmark dataset for classification tasks. The dataset consists of a series of short files on various topics that appeared on the Reuters newswire. It contains a total of 15732 files in 116 categories. The second one is the 20 newsgroups data set [28], which is a collection of approximately 20000 newsgroup files. The data is partitioned into 20 different groups, where each group is related to one topic. The data in the two datasets is unstructured short text, which is quite different from the structured data in databases. We use the two datasets to simulate the large amount of unstructured short data pieces in IIoT, such as the configuration files generated from various vehicular applications and the status log files.

We divide the sorted data set into shards and re-combine the shards into subsets to simulate the distributed multiple participants in our data sharing scheme. Classification analysis is used to simulate the data sharing tasks. We implement our improved distributed gradient boosting decision tree on textual data to execute the federated learning process over distributed data sets.
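One way to realize the shard-and-recombine partitioning is sketched below in Python. The sort key, the number of shards, the random assignment of shards to participants and the function name `make_participants` are all assumptions for illustration; the text does not specify them.

```python
import random


def make_participants(records, key, num_shards, num_participants, seed=0):
    # Sort the records (e.g., by label), cut them into contiguous shards,
    # then deal the shuffled shards out to the simulated participants.
    if not records or num_shards < 1 or num_participants < 1:
        raise ValueError("need records and positive shard/participant counts")
    ordered = sorted(records, key=key)
    shard_size = -(-len(ordered) // num_shards)  # ceiling division
    shards = [ordered[i:i + shard_size] for i in range(0, len(ordered), shard_size)]
    rng = random.Random(seed)
    rng.shuffle(shards)
    subsets = [[] for _ in range(num_participants)]
    for index, shard in enumerate(shards):
        subsets[index % num_participants].extend(shard)
    return subsets
```

Sorting before sharding means each shard covers a narrow slice of the sort key, so participants end up with differing local data distributions.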

C. Numerical Results

The receiver operating characteristic (ROC) curve is widely used to illustrate the diagnostic ability of a classifier as its discrimination threshold is varied. We use the area under the ROC curve (AUC) to evaluate the accuracy of the model. The performance of our proposed mechanism on various distributed datasets, compared with a benchmark method, Text Graph Convolutional Networks (Text GCN) [29], is shown in Fig. 9 and Fig. 10. From Fig. 9 we can conclude that, compared with the benchmark method, most of our testing groups obtain high accuracy with an average AUC value of 0.918, which indicates that our proposed federated learning mechanism achieves high diagnostic ability. We can also observe that the accuracy does not change smoothly. The reason is that the number of files in each subset is a fixed discrete value, and the accuracy also depends on the characteristics of the files in each subset, so it does not vary continuously as the size of the data increases. Fig. 10 shows the AUC results with various numbers of data providers (3, 6 and 9). It can be seen that the AUC results change little as the number of data providers increases, which indicates that our proposed model is scalable. Since the global model in federated learning is aggregated from all local models, and each local model is trained on local data, the number of data providers has little effect on the performance of the aggregated model.

Fig. 11 shows that the running time of our proposed mechanism varies from milliseconds to seconds, with an average of 880 ms across different sub-datasets. From Fig. 12, we can see that, for the same dataset, the running time increases with the number of data providers involved. The reason is that the more data providers there are, the more time it takes to carry out the collaborative working process. Moreover, the running time results show that the performance of the proposed federated learning based data sharing scheme is near real-time.
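For reference, the AUC of a binary scorer can be computed without tracing the ROC curve, because it equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses this pairwise form. It covers only the binary case; how the per-class results were combined for the multi-class datasets is not stated in the text, and the function name is hypothetical.

```python
def auc(labels, scores):
    # Pairwise (rank-based) form of the area under the ROC curve.
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative record")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.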

Through the above evaluation, we can observe that the increase in data providers has little effect on the accuracy of our proposed scheme, while the running time increases evidently. The accuracy is stable because our scheme executes local training in parallel and the federated mode keeps the learning accuracy stable. However, the increase in data providers brings more local models to be updated and computed, which increases the time for training and update transmission. Thus, the running time increases with the number of data providers. Despite the slightly increased running time, the participation of multiple data providers enlarges the scale of data for computing results, which enables the sharing of data to improve the quality of vehicular applications.

[Plot: AUC (y-axis, 0.6 to 1) over various data sets (x-axis, 0 to 350); series: Our method, Our mean AUC, Text GCN, TGCN mean AUC.]

Fig. 9: AUC in various data sets (20News)

[Plot: AUC (y-axis, 0.84 to 0.98) versus size of data set (x-axis, 579 to 2903); series: 3 data providers, 6 data providers, 9 data providers.]

Fig. 10: AUC in various data sets (Reuters)

[Plot: running time in ms (y-axis, 400 to 1800) versus size of data set (x-axis, 880 to 1027); series: Running time, Mean running time.]

Fig. 11: Running time in various data sets (20News)

[Plot: running time in ms (y-axis, 0 to 4000) versus size of data set (x-axis, 579 to 2904); series: 3 data providers, 6 data providers, 9 data providers.]

Fig. 12: Running time in various data sets (Reuters)

V. CONCLUSION

In this paper, we proposed a privacy-preserving data sharing mechanism for distributed multiple parties in industrial IoT applications, which incorporates federated learning into permissioned blockchain. The illustrative numerical results showed that our blockchain empowered data sharing scheme enhances the security of the sharing process without requiring centralized trust. Moreover, by integrating federated learning into the consensus process of permissioned blockchain, we not only improved the utilization of computing resources but also increased the efficiency of the data sharing scheme. Numerical results on two benchmark real-world datasets corroborated that our proposed mechanism can enable secure data sharing with high efficiency and utility.

The combination of blockchain and federated learning is a promising way to enable secure and intelligent data sharing in IIoT. However, how to efficiently guarantee data privacy by applying blockchain techniques is still an open issue, which needs to be further explored by analyzing more security threats and developing more effective solutions. Moreover, how to improve the utility of data models mapped from raw data, regardless of the specific computing tasks and machine learning algorithms, is a critical problem to be addressed in data sharing. New intelligent mechanisms are required to improve data utility. In addition, the limited resources of devices impose new challenges for improving the efficiency of data sharing in IIoT.

REFERENCES

[1] E. Sisinni, A. Saifullah, S. Han, U. Jennehag, and M. Gidlund, “Industrial internet of things: Challenges, opportunities, and directions,” IEEE Transactions on Industrial Informatics, vol. 14, no. 11, pp. 4724–4734, Nov 2018.

[2] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.

[3] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” in 22nd International Conference on Data Engineering (ICDE’06). IEEE, 2006, pp. 24–24.

[4] C. Dwork, “Differential privacy in new settings,” in Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms. SIAM, 2010, pp. 174–183.

[5] T. Zhu, G. Li, W. Zhou, and S. Y. Philip, “Differentially private data publishing and analysis: a survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1619–1638, 2017.

[6] C. Yin, J. Xi, R. Sun, and J. Wang, “Location privacy protection based on differential privacy strategy for big data in industrial internet of things,” IEEE Transactions on Industrial Informatics, vol. 14, no. 8, pp. 3628–3636, Aug 2018.

[7] K. Zhang, Y. Zhu, S. Maharjan, and Y. Zhang, “Edge intelligence and blockchain empowered 5g beyond for industrial internet of things,” IEEE Network Magazine, to be published.

[8] C. Alcaraz, L. Cazorla, and G. Fernandez, “Context-awareness using anomaly-based detectors for smart grid domains,” in International Conference on Risks and Security of Internet and Systems. Springer, 2014, pp. 17–34.

[9] K. Zhang, S. Leng, X. Peng, P. Li, S. Maharjan, and Y. Zhang, “Artificial intelligence inspired transmission scheduling in cognitive vehicular communications and networks,” IEEE Internet of Things, to be published.

[10] L. Cazorla, C. Alcaraz, and J. Lopez, “Cyber stealth attacks in critical information infrastructures,” IEEE Systems Journal, vol. 12, no. 2, pp. 1778–1792, 2016.

[11] C. Alcaraz and J. Lopez, “Analysis of requirements for critical control systems,” International journal of critical infrastructure protection, vol. 5, no. 3-4, pp. 137–145, 2012.

[12] G. Zyskind, O. Nathan, and A. Pentland, “Decentralizing privacy: Using blockchain to protect personal data,” in 2015 IEEE Security and Privacy Workshops, May 2015, pp. 180–184.

[13] C. H. Liu, Q. Lin, and S. Wen, “Blockchain-enabled data collection and sharing for industrial iot with deep reinforcement learning,” IEEE Transactions on Industrial Informatics, vol. 15, no. 6, pp. 3516–3526, June 2019.

[14] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” 2008.

[15] S. Goryczka, L. Xiong, and B. C. Fung, “m-privacy for collaborative data publishing,” IEEE Transactions On Knowledge And Data Engineering, vol. 26, no. 10, pp. 2520–2533, 2014.

[16] A. Friedman, I. Sharfman, D. Keren, and A. Schuster, “Privacy-preserving distributed stream monitoring,” in NDSS, 2014.

[17] Z. Qin, T. Yu, Y. Yang, I. Khalil, X. Xiao, and K. Ren, “Generating synthetic decentralized social graphs with local differential privacy,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 425–438.

[18] A. H. Sodhro, S. Pirbhulal, and V. H. C. de Albuquerque, “Artificial intelligence-driven mechanism for edge computing-based industrial applications,” IEEE Transactions on Industrial Informatics, vol. 15, no. 7, pp. 4235–4243, July 2019.

[19] A. K. Sangaiah, D. V. Medhane, T. Han, M. S. Hossain, and G. Muhammad, “Enforcing position-based confidentiality with machine learning paradigm through mobile edge computing in real-time industrial informatics,” IEEE Transactions on Industrial Informatics, vol. 15, no. 7, pp. 4189–4196, July 2019.

[20] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.

[21] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,” arXiv preprint arXiv:1712.07557, 2017.

[22] Z. Yu, J. Hu, G. Min, H. Lu, Z. Zhao, H. Wang, and N. Georgalas, “Federated learning based proactive content caching in edge computing,” in 2018 IEEE Global Communications Conference (GLOBECOM). IEEE, 2018, pp. 1–6.

[23] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, “Using of jaccard coefficient for keywords similarity,” in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, no. 6, 2013.

[24] A. Kosba, A. Miller, E. Shi, Z. Wen, and C. Papamanthou, “Hawk: The blockchain model of cryptography and privacy-preserving smart contracts,” in 2016 IEEE Symposium on Security and Privacy (SP), vol. 00, 2016, pp. 839–858.

[25] P. Maymounkov and D. Mazieres, “Kademlia: A peer-to-peer information system based on the xor metric,” in Peer-to-Peer Systems, P. Druschel, F. Kaashoek, and A. Rowstron, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 53–65.

[26] C. Dwork, “A firm foundation for private data analysis,” Communications of the ACM, vol. 54, no. 1, pp. 86–95, 2011.

[27] D. D. Lewis, “Reuters dataset,” http://www.daviddlewis.com/resources/testcollections/, 2019.

[28] “20newsgroups,” http://qwone.com/~jason/20Newsgroups/, 2019.

[29] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 7370–7377.




Yunlong Lu received the B.S. degree in electronic information science and technology from Beijing Forestry University, Beijing, China, in 2012 and the M.S. degree from the School of Computer Science, Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2015. He is currently working towards the Ph.D. degree in Computer Science and Technology with the Institute of Network Technology, BUPT, and is a visiting Ph.D. student with the University of Oslo, Norway. His current research interests include blockchain, wireless networks, and privacy-preserving machine learning.

Xiaohong Huang received her B.E. degree from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2000 and her Ph.D. degree from the School of Electrical and Electronic Engineering (EEE), Nanyang Technological University, Singapore, in 2005. Dr. Huang joined BUPT in 2005 and is now an associate professor and director of the Network and Information Center in the Institute of Network Technology of BUPT. Dr. Huang has published more than 50 academic papers in the area of WDM optical networks, IP networks and other related fields. Her current interests are Internet architecture, software defined networking, and network function virtualization.

Yueyue Dai received the B.Sc. degree in communication and information engineering from the University of Science and Technology of China (UESTC), in 2014, where she is currently pursuing the Ph.D. degree. She is now a visiting Ph.D. student with the University of Oslo, Norway. Her current research interests include wireless network, mobile edge computing, Internet of Vehicles, blockchain, and deep reinforcement learning.

Sabita Maharjan (M’09) received the Ph.D. degree in networks and distributed systems from the Simula Research Laboratory and the University of Oslo, Norway, in 2013. She is currently a Senior Research Scientist at the Simula Metropolitan Center for Digital Engineering, Norway, and an Associate Professor at the University of Oslo. Her current research interests include wireless networks, network security and resilience, smart grid communications, Internet of Things, machine-to-machine communication, software-defined wireless networking, and the Internet of Vehicles.

Yan Zhang is a Full Professor at the Department of Informatics, University of Oslo, Norway. His current research interests include next-generation wireless networks leading to 5G Beyond, and green and secure cyber-physical systems (e.g., smart grid and transport). He received a Ph.D. degree from the School of Electrical & Electronics Engineering, Nanyang Technological University, Singapore. He is an Editor of several IEEE publications, including IEEE Communications Magazine, IEEE Network, IEEE Transactions on Vehicular Technology, IEEE Transactions on Industrial Informatics, IEEE Transactions on Green Communications and Networking, IEEE Communications Surveys & Tutorials, IEEE Internet of Things, IEEE Systems Journal and IEEE Vehicular Technology Magazine. He has served in chair positions at a number of conferences, including IEEE GLOBECOM 2017, IEEE PIMRC 2016, and IEEE SmartGridComm 2015. He is an IEEE VTS (Vehicular Technology Society) Distinguished Lecturer. He is a Fellow of IET. He serves as the Chair of IEEE ComSoc TCGCC (Technical Committee on Green Communications & Computing). He received the 2018 “Highly Cited Researcher” award according to Clarivate Analytics.
