

Big Data Cleaning Based on Mobile Edge Computing in Industrial Sensor-Cloud

Article in IEEE Transactions on Industrial Informatics · September 2019

DOI: 10.1109/TII.2019.2938861


All content following this page was uploaded by Xi Zheng on 13 September 2019.



1551-3203 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2019.2938861, IEEE Transactions on Industrial Informatics


Big Data Cleaning Based on Mobile Edge Computing in Industrial Sensor-Cloud

Tian Wang, Haoxiong Ke, Xi Zheng, Kun Wang, Arun Kumar Sangaiah, Anfeng Liu

Abstract—With the advent of 5G, the Industrial Internet of Things (IIoT) has developed rapidly, and the industrial Sensor-Cloud System (SCS) has received widespread attention. In the future, a large number of integrated sensors that simultaneously collect multi-feature data will be added to the industrial SCS. However, the collected big data is not trustworthy because of the harsh environments in which the sensors operate. If the data collected at the bottom networks is uploaded directly to the cloud for processing, query and data mining results will be inaccurate, which will seriously affect the judgment and feedback of the cloud. The traditional method of relying on sensor nodes for data cleaning is insufficient to deal with big data, while edge computing provides a good solution. In this paper, a new data cleaning method is proposed based on the mobile edge node during data collection. An angle-based outlier detection method is applied at the edge node to obtain the training data of the cleaning model, which is then established through a support vector machine (SVM). In addition, online learning is adopted for model optimization. Experimental results show that multi-dimensional data cleaning based on mobile edge nodes improves the efficiency of data cleaning while maintaining data reliability and integrity, and greatly reduces the bandwidth and energy consumption of the industrial SCS.

Index Terms—Industrial internet of things, Industrial sensor-cloud, Data cleaning, Edge computing, Online machine learning.

I. INTRODUCTION

THE Industrial Internet of Things (IIoT) has gained considerable attention from both academia and industry. IIoT contains not only Wireless Sensor Networks (WSNs) but also other constituent parts, including Wireless Local Area Networks (WLANs), Radio Frequency Identification (RFID), and mobile agents [1]. However, WSNs are key components of IIoT.

This work was supported in part by grants from the Open Fund of the Key Laboratory of Data Mining and Intelligent Recommendation, Fujian Province University, under Grant No. DM201902; the General Projects of Social Sciences in Fujian Province under Grant No. FJ2018B038; the National Natural Science Foundation of China (NSFC) under Grant No. 61872154, No. 61772148 and No. 61672441; the Natural Science Foundation of Fujian Province of China (No. 2018J01092); and the Fujian Provincial Outstanding Youth Scientific Research Personnel Training Program. (Corresponding author: Arun Kumar Sangaiah.)

Tian Wang and Haoxiong Ke are with the College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China (e-mail: cs [email protected], haoxiong [email protected]).

Xi Zheng is with the Department of Computing, Macquarie University, NSW 2109, Australia (e-mail: [email protected]).

Kun Wang is with the Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095, USA (e-mail: [email protected]).

Arun Kumar Sangaiah is with the School of Computing Science and Engineering, VIT University, Tamil Nadu, India (e-mail: [email protected]; [email protected]).

Anfeng Liu is with the School of Computer Science and Engineering, Central South University, Changsha 410006, Hunan, China (e-mail: [email protected]).

These sensor networks consist of a large number of sensors deployed in all aspects of the industrial environment to provide low-cost monitoring services [2]. These smart sensors have been successfully deployed in industrial control, environmental monitoring, smart grids, digital oil fields, and intelligent industry [3] [4]. Data collected by the sensors is periodically analyzed by convergence sensor nodes for decision making. The Sensor-Cloud System (SCS) is a combination of WSNs and cloud computing, which has greatly enhanced the data processing speed, storage capability, adaptability, and other aspects of WSN performance.

Undoubtedly, data collected by the underlying sensor network is the foundation of the SCS and the basis of all applications [5]. To improve the reliability of data analysis, it is necessary to collect data from a large number of monitoring areas. Most sensor nodes are deployed in harsh natural environments, where they are vulnerable to malicious attacks and can be artificially manipulated to produce invalid or even misleading data [6]. According to the statistics, less than 49% of the information is valid and reliable. Abnormal data transmission not only increases node energy consumption and requires more upload bandwidth, but also affects the analysis results. To reduce transmission bandwidth and energy consumption, data retention and data eradication become very important, and data cleaning has therefore been studied frequently. Data cleaning removes "dirty" data; it is the last procedure for finding and correcting identifiable errors in data files [7].

However, in traditional WSNs, data cleaning relies only on the sensor nodes themselves. If sensors only need to collect data for a single feature, their own computing power can meet the requirements. With the advent of 5G and the integration of sensors, the sensors themselves will generate a large amount of complex data [8]. If the existing computing power of the nodes continues to be used to clean large amounts of multi-feature data, the following problems will occur: first, the reliability of data will decrease; second, the transmission bandwidth will not meet the increasing demand due to unreliable factors; third, more complex calculations will increase the energy consumption of nodes [9] [10].

To avoid these problems, a mobile edge node data cleaning technology combined with machine learning is introduced [11] [12]. The computing power of the edge node can make up for the shortcomings of the WSNs' computing capacity. Tasks that are difficult to accomplish with ordinary sensor nodes can be implemented at the edge level. In addition, the edge node is closer to the underlying network, making it easier to obtain data collected by sensor nodes. Theoretical analysis and experiments validate that the proposed mechanism outperforms traditional data cleaning algorithms.

The main contributions of this work can be summarized as follows:

• A high-dimensional data cleaning method is proposed for mobile edge nodes in WSNs, in which sensors are integrated to collect data of multiple features at the same time. It can prolong the service life of WSNs and accelerate the cleaning of abnormal data.

• A method for establishing a data cleaning model is presented, which combines machine learning with the edge computing platform to optimize the cleaning model in real time based on the anomalous data in the WSNs. Both theoretical and experimental results verify the effectiveness of the method.

The rest of this paper is organized as follows. Section II gives an overview of existing data cleaning techniques for WSNs. Section III describes the traditional data cleaning model and the architecture of data cleaning used in the network. Section IV introduces solutions based on edge computing. Section V evaluates the proposed method with experiments, and Section VI gives a conclusion.

II. RELATED WORK

Spatial and temporal correlation of sensory data are important features of WSNs and are widely used by researchers for data cleaning. Temporal correlation means that the measured value of a sensor node should remain relatively stable over a period (several or more measurement cycles), without drastic changes. Spatial correlation means that sensor nodes within a certain area of the WSN should have similar measurements at the same time, and that the measured values follow similar trends over a period of time (several or more measurement cycles). Based on the temporal and spatial correlation of data in WSNs, data cleaning is mainly divided into centralized and distributed cleaning methods.

A. Centralized data cleaning method

In a centralized approach, all the data received at individual nodes is transmitted to central nodes, which are responsible for processing the entire data received from the network and determining outliers or events. In [13], the authors established a sensor data cleaning model by using the centralized data cleaning method, recovered the missing data, and removed the outliers by using the spatio-temporal correlation of the sensory data. In [14], the authors used a histogram-based method for outlier detection and proposed collecting hints (in the form of a histogram) about the data distribution, then using the hints to filter out unnecessary data and identify potential outliers.

The centralized cleaning method cannot meet the real-time requirements of WSN applications well because it needs to transmit a large amount of sensory data to the sink node for centralized processing. Meanwhile, the energy consumption caused by the transmission of a large amount of data also makes the centralized cleaning method inefficient and difficult to popularize.

B. Distributed cleaning method

In a distributed approach, as the name implies, the outlier identification process is divided between the nodes of the network. Among the proposed techniques, sensor nodes transmit data to the cluster head or aggregator of the network, which is responsible for processing the data received from all nodes. In [5], the authors proposed an energy-efficient filtering technique dedicated to periodic sensor applications. The first filter was applied at the sensor node to reduce the raw data based on the Pearson coefficient metric. The second filter was applied at the cluster head. It used a K-nearest neighbor clustering algorithm to eliminate redundancy in data collected by neighboring nodes. In [15], the authors proposed a two-stage data aggregation model dedicated to underwater WSNs. After processing data at each sensor, the authors used Euclidean and cosine distances at the aggregator layer to reduce packet size and minimize data redundancy.

At the cluster head node, data cleaning can also use data prediction. The authors of [16] addressed how to determine the variance of measurement noise, a major input parameter of Kalman filtering, by analyzing past sensor data and using the estimated noise to improve the filtering accuracy. More specifically, to estimate the measurement noise variance, two analytical methods were proposed: one was a transform-based method using a wavelet transform, and the other was a learning-based method using a denoising autoencoder. The authors of [17] developed an iterative distributed Unbiased Finite Impulse Response (dUFIR) filtering algorithm for object tracking via WSNs with consensus on estimates and showed that it has higher robustness than the distributed Kalman Filter (dKF). The tracking problem was viewed as real-time position estimation of an Unmanned Ground Vehicle (UGV). Extensive simulations were provided using real sensor parameters and measurements of the UGV position with missing data.

However, detecting outliers in large-scale IoT sensor data is a challenging task. Recent research [8] proposed a One-class Support Tucker Machine (OCSTuM) and GA-OCSTuM, an OCSTuM based on tensor Tucker factorization and a genetic algorithm. These methods extended the one-class support vector machine to tensor space. OCSTuM and GA-OCSTuM are unsupervised anomaly detection approaches for big sensor data. They retain the structural information of data while improving the accuracy and efficiency of anomaly detection.

Although most of the proposed techniques allow efficient data reduction, they present several disadvantages. First, they are very complex and require massive processing. Second, they consume a lot of energy in WSNs. In this paper, an innovative data cleaning method is presented that is suitable for resource-limited sensor nodes.

III. ARCHITECTURE OF MOBILE DATA CLEANING WITH EDGE COMPUTING

In this section, the traditional architecture and the network architecture used in this paper are introduced.




Fig. 1: The architecture of mobile cleaning (left: integrated sensors N1 to N5 and the movement of the mobile edge node between collection ranges Q1 and Q2; right: the cloud, edge, and sensor network layers)

A. Our Architecture

B. Traditional Data Cleaning Model

In the traditional data cleaning model, each set of sensor nodes sends its collected data to an intermediate node in the local area, called an aggregator. Each aggregator filters the data it receives and then sends the filtered data to the sink node. In this model, the aggregator consumes a lot of energy filtering data and uploading data to the sink node. The main purpose of WSNs is to forward data packets from event regions to the sink. Unfortunately, sensor nodes are energy-constrained. When processing complex data, the computing capacity of nodes in WSNs is clearly insufficient. The service time of the sensors will be shortened if the traditional cleaning model continues to be used. To prolong the service time of sensors and save sensor energy, mobile edge computing is introduced.

C. Mobile Edge Computing

Generally speaking, edge computing is a hierarchical architecture with three layers: the topmost layer is for cloud computing, which has powerful storage capacity and computation capability; the middle layer is for edge computing; and the bottom is the WSN layer, whose main function is to collect data and upload it to the edge server. In a previous study, a mobile edge node was proposed and a moving route was designed for the edge node in WSNs, with the aim of maximizing throughput and minimizing transmission latency [18]. Energy for the mobile edge node is relatively sufficient [19]. Compared to traditional data collection in WSNs, mobile edge nodes can greatly reduce energy consumption and increase efficiency in data collection. In our scheme, the edge computing model is used and the three-layer structure is adopted [20] [21].

In this paper, a data cleaning method based on mobile edge nodes in WSNs is proposed. The introduction of edge computing establishes a hierarchical architecture. The right half of Fig. 1 shows three typical elements: cloud servers, edge servers, and WSNs, from top to bottom. Different WSNs have different mobile edge nodes to collect data, and each edge node executes data cleaning when collecting data. In each sensor network, mobile edge nodes establish a corresponding cleaning model, namely a filter, based on the collected correct data. Filters are able to quickly remove anomalous data during the collection process. After the data has been collected, all data should be uploaded to the cloud. The left half of Fig. 1 shows how mobile edge nodes collect and filter data. The red dotted circle represents the collection range, each yellow-shaded circle represents a sensor node, and the five-pointed star represents the mobile edge node. Within the collection range, the edge node conducts angle-based outlier detection based on temporal and spatial similarity. The cleaning model is built from the collected reliable data and is continuously iterated.

Unlike the traditional data cleaning model, individual calculation at each node is no longer needed; computation takes place only at the mobile node. On the one hand, partially offloading the work of the sensor nodes, aggregation nodes, and cloud services to edge services can greatly improve the performance of the cloud. On the other hand, if the aggregation node is attacked while uploading data to the sink node, none of the uploaded data can be trusted. This situation can be improved in our architecture because edge nodes have the ability to defend against malicious attacks. Collecting, cleaning, and uploading data with edge nodes reduces intermediate processing and prevents attacks from malicious nodes, which improves security. These mechanisms are supported by our algorithms, which are elaborated in the next section. In brief, our architecture includes three parts and has two major advantages over traditional data cleaning: it can extend the lifetime of WSNs and improve cleaning efficiency.

IV. ALGORITHMS

In this section, the core algorithms of this paper are introduced. We explain how to obtain the training data to build the cleaning model, how to train the model, and how to iteratively optimize it.

A. Anomaly Detection Algorithm Based on High-dimensional Data Stream

1) Anomaly Detection: Before introducing the first algorithm, let's take a look at anomaly detection. Anomaly detection aims to quickly and effectively identify abnormal points in data sets. Based on WSNs and General Packet Radio Service (GPRS) technology, the physical world is another important research area in the field of data mining anomaly detection. Traditional anomaly detection algorithms are mainly divided into three types: statistical-based anomaly detection, improved distance-based anomaly detection, and density-based local anomaly detection. They are mainly based on statistical theories, with Euclidean distance as the abnormality evaluation criterion.

Data mining based on sensor technology can extract potential and valuable information from a large amount of fuzzy and complex data. Sensor network data exists in a high-dimensional form that is massive, heterogeneous, and noisy, and it demands real-time responses. In this regard, traditional data analysis algorithms are not effective in detecting abnormal points, and their time complexity is too high, which limits their use. Anomaly detection of high-dimensional data streams faces more challenges [22]. In the environment of this study, each sensor collects multiple kinds of data. To detect outliers in a high-dimensional space, the Angle-based Outlier Detection (ABOD) algorithm is introduced. This algorithm can effectively sift out abnormal data in high-dimensional data sets. It is briefly described below.

2) Angle-based Outlier Detection (ABOD): The comparison of distances becomes more and more meaningless with increasing data dimensionality. Fig. 2 shows the principle of ABOD: a set of points forms a cluster, with the exception of point c, which is an outlier. For each point o, we examine the angle ∠xoy for each pair of points x, y such that x ≠ o, y ≠ o. Note that if a point is in the center of a cluster (e.g., a), the angles formed as such differ widely. If a point is at the border of a cluster (e.g., b), its angle variation is smaller. As for an outlier point (e.g., c), the angle variation is substantially smaller. This observation suggests that the variance of angles at a certain point can be used to determine whether the point is an outlier.

Fig. 2: Angle-based outlier (points a, b, c, d, e with angles α, β, γ)

Angles and distances are combined to model outliers. Mathematically speaking, each point o has the distance-weighted angle variance as its outlier score. That is, given a set of points D, for a point o ∈ D, the Angle-based Outlier Factor (ABOF) is defined as

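$$\mathrm{ABOF}(o) = \operatorname{VAR}_{x, y \in D,\; x \neq o,\; y \neq o} \frac{\langle \overrightarrow{ox}, \overrightarrow{oy} \rangle}{\mathrm{dist}(o, x)^2 \, \mathrm{dist}(o, y)^2} \qquad (1)$$

where $\langle \cdot , \cdot \rangle$ is the scalar product operator, and $\mathrm{dist}(\cdot , \cdot)$ is a norm distance.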

Clearly, the farther a point is from the clusters and the smaller the variance of its angles, the smaller the ABOF. ABOD computes the ABOF for each point and outputs a list of the points in the data set in ascending order of ABOF.
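To make Eq. (1) concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It uses the unweighted variance exactly as Eq. (1) is written and enumerates every pair of other points, so it is cubic in the number of points and only meant for small illustrations. The function names `abof` and `rank_by_abof` and the toy point set are assumptions introduced here.

```python
import itertools


def abof(o, points):
    """Angle-based outlier factor of point o, following Eq. (1).

    Returns the variance, over all pairs (x, y) of other points, of
    <ox, oy> / (dist(o, x)^2 * dist(o, y)^2).
    """
    # Drop o itself (and exact duplicates of o) to avoid zero distances.
    others = [p for p in points if p != o]
    values = []
    for x, y in itertools.combinations(others, 2):
        ox = [xi - oi for xi, oi in zip(x, o)]
        oy = [yi - oi for yi, oi in zip(y, o)]
        dot = sum(a * b for a, b in zip(ox, oy))
        sq_dist_x = sum(a * a for a in ox)
        sq_dist_y = sum(b * b for b in oy)
        values.append(dot / (sq_dist_x * sq_dist_y))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rank_by_abof(points):
    """Sort points in ascending order of ABOF (most outlier-like first)."""
    return sorted(points, key=lambda p: abof(p, points))


# Toy data (hypothetical): a small square cluster plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
ranked = rank_by_abof(points)
```

In this toy set the distant point sees the whole cluster within a narrow range of directions and at large distances, so its score is the smallest and it is ranked first, which matches the ascending-order output described above.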

In our environment, sensors collect multi-feature data at the same time, so the data has temporal and spatial similarities. During the collection process, the mobile edge node can detect abnormal data by using the ABOD algorithm. The input $D_k$ of ALGORITHM 1 represents the data collected by different sensor nodes within the collection range of the edge node. Object $D_1$ means the data collected by the first sensor node, and $o_n$ means data with n-dimensional features. Since there is little research on angle-based outlier detection, there are few references for setting the threshold f. With reference to [23], we set the threshold f to 0.3 from the interval [0, 0.5]. When a value in the ABOF data set is less than f, the corresponding data is regarded as abnormal. Abnormal data is labeled y = −1 and non-abnormal data is labeled y = 1. All data with labels y is added to the set D as the output.

Algorithm 1 Angle-based Outlier Detection
Input:
    sets of objects $D_1, D_2, \ldots, D_k$ in collection range, $D_i = \{o_1, o_2, \ldots, o_n\}$ $(i = 1, 2, \ldots, k)$;
    threshold f (f > 0);
Output:
    D: data in $D_1, D_2, \ldots, D_k$ with y;
Method:
    D ← {};
    for i = 1 to k do
        for j = 1 to n do
            if ABOF($o_j$) < f then
                $y_j$ = −1;
            else
                $y_j$ = 1;
            end if
            D ← D + ($o_j$, $y_j$);
        end for
    end for
    return D;
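The labeling loop of ALGORITHM 1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the name `label_readings`, the `score_fn` parameter (any ABOF implementation could be passed in), and the choice to score each reading against the pooled readings of all nodes in range are assumptions. The toy score function exists only so the example runs on its own.

```python
def label_readings(node_sets, score_fn, f=0.3):
    """Label every reading from every node set as in ALGORITHM 1.

    node_sets: list of lists of readings (D_1, ..., D_k).
    score_fn:  callable(o, pool) -> outlier score, e.g. an ABOF implementation.
    f:         threshold; a score below f is labeled abnormal (y = -1).
    """
    # Assumption: each reading is scored against all readings in the collection range.
    pool = [o for d in node_sets for o in d]
    labelled = []
    for d in node_sets:
        for o in d:
            y = -1 if score_fn(o, pool) < f else 1
            labelled.append((o, y))
    return labelled


def toy_score(o, pool):
    """Hypothetical stand-in for ABOF: low score for implausibly high readings."""
    return 0.1 if o[0] > 50.0 else 0.9


node_sets = [[(20.0,), (21.0,)], [(99.0,)]]
result = label_readings(node_sets, toy_score)
```

The returned list of (reading, label) pairs plays the role of the training set D that is passed to the classifier in the next part.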

In this section, the edge node obtains the training data using the ABOD algorithm, which is based on the temporal and spatial correlation between sensor nodes. The output data set D of ALGORITHM 1 is used as training data for building the model in the next part.

B. Nonlinear Classification

1) Nonlinear SVM Classification: The Support Vector Machine (SVM) was first proposed by Cortes and Vapnik in 1995. It shows many unique advantages in small-sample, nonlinear, and high-dimensional pattern recognition, and it can also be applied to function fitting and other machine learning problems.

Because sample information is limited, the best compromise should be sought between the complexity of the model (the accuracy of learning specific training samples) and its learning ability (the ability to identify any sample without error) in order to obtain the best generalization ability. SVM is already widely used as a classifier and is used directly in this paper. The SVM is based on linear partitioning, but not all data can be divided linearly. For example, points of two categories in a two-dimensional space may require a curve as their boundary. Linear classifiers use a "hyperplane" to separate positive and negative samples, whereas a nonlinear classifier separates them by a "hypersurface" or a combination of multiple curved surfaces. The principle of SVM is to map points in the low-dimensional space into a high-dimensional one, making them linearly separable. The principle of linear partitioning is then used to determine the classification boundary. As shown in Fig. 3, the boundary is a linear partition in the high-dimensional space and a nonlinear partition in the original data space. In this section, the classification model is built using the high-dimensional dataset D produced in Part A of Section IV.

Fig. 3: Division in high dimensional space ((a) Low dimension, axes x(1) and x(2); (b) High dimension, axes z(1) and z(2))

2) Kernel Function: The kernel trick is applied to the SVM. The basic idea is to map the input space (Euclidean space $\mathbb{R}^n$ or a discrete set) to a feature space (Hilbert space H) through a nonlinear transformation, so that the hypersurface model in the input space $\mathbb{R}^n$ corresponds to the hyperplane model (SVM) in the feature space H. The classification learning problem can then be solved by a linear support vector machine in the feature space.

Let X be the input space (a subset of the Euclidean space $\mathbb{R}^n$ or a discrete set) and let H be the feature space (a Hilbert space). If there is a mapping from X to H

$$\phi(x): X \to H \qquad (2)$$

such that for all x, z ∈ X the function K(x, z) satisfies

$$K(x, z) = \phi(x) \cdot \phi(z) \qquad (3)$$

then K(x, z) is a kernel function and φ(x) is a mapping function, where φ(x) · φ(z) is the inner product of φ(x) and φ(z).
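A small numerical check may help make Eq. (3) tangible. The sketch below uses the degree-2 homogeneous polynomial kernel $K(x, z) = (x \cdot z)^2$ on $\mathbb{R}^2$, which is an illustrative choice made here and not the kernel used in this paper, because its feature map φ can be written down explicitly. The names `phi`, `dot`, and `kernel` are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel K(x, z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(x, z):
    """The same quantity computed directly in the input space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
in_input_space = kernel(x, z)            # (1*3 + 2*(-1))^2
in_feature_space = dot(phi(x), phi(z))   # inner product after mapping
```

Both quantities agree up to floating-point rounding, which is the point of the kernel trick: the inner product in H can be evaluated without ever constructing φ(x).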

The Gaussian radial basis function is a kernel function with strong locality, which can map a sample into a higher-dimensional space. This kernel function is the most widely used because it is robust irrespective of the sample size. The performance of the kernel function is less dependent on its polynomial parameters. For the target environment addressed in this paper (e.g., environmental monitoring), outliers and non-outliers are nonlinearly separable, and the numbers of data samples and features are large. Therefore, the Gaussian kernel function is the best choice.

The expression of the Gaussian kernel function is:

$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) \qquad (4)$$
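Eq. (4) translates directly into code. The following is a minimal sketch in plain Python; the function names and the default value of `sigma` are assumptions, and `gram_matrix` is included only to show how the pairwise kernel values used by an SVM solver could be tabulated.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel of Eq. (4): exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(samples, sigma=1.0):
    """Matrix of pairwise kernel values K(x_i, x_j) for a list of samples."""
    return [[gaussian_kernel(a, b, sigma) for b in samples] for a in samples]


samples = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
K = gram_matrix(samples, sigma=5.0)
```

The kernel value is 1 for identical points and decays toward 0 as the distance grows; σ controls how quickly that decay happens.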

Based on the training dataset D obtained in Part A of Section IV, the model is built using an SVM with the Gaussian kernel function, and the decision function is then constructed. ALGORITHM 2 describes in detail how to construct the decision function. The parameter C is a regularization parameter in the cleaning model that prevents the model from overfitting. Many works have provided empirical guidance for selecting regularization parameters. Based on their work, we use the grid search method, an exhaustive search for optimal parameter values. The output decision function f(x) of the algorithm can be used in the next process of data collection and cleaning. When f(x) = −1, the data is abnormal and should be discarded directly. When f(x) = 1, the data is collected and delivered to the cloud.
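Grid search itself is simple to sketch: every combination of candidate values is scored and the best one is kept. The code below is a generic illustration, not the authors' procedure. The candidate values, the inclusion of σ alongside C, and the `toy_validation_accuracy` function (a stand-in for training the SVM and measuring its validation accuracy) are all assumptions made so that the example is self-contained.

```python
import itertools
import math


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_accuracy(C, sigma):
    """Hypothetical score surface peaking at C = 10, sigma = 0.5.

    In practice this would train the SVM with (C, sigma) and return its
    accuracy on held-out data.
    """
    return 1.0 - 0.01 * abs(math.log10(C) - 1.0) - 0.1 * abs(sigma - 0.5)


grid = {"C": [0.1, 1.0, 10.0, 100.0], "sigma": [0.1, 0.5, 1.0]}
best_params, best_score = grid_search(grid, toy_validation_accuracy)
```

The cost grows with the product of the grid sizes, which is why the candidate lists are usually kept short and spaced logarithmically.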

Algorithm 2 Nonlinear SVM Learning Algorithm
Input:
    Training data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in X = \mathbb{R}^n$, $y_i \in Y = \{-1, 1\}$, $i = 1, 2, \ldots, N$;
Output:
    Classification decision function f(x);
Method:
1: Select the Gaussian kernel function K(x, z) and an appropriate regularization parameter C, and solve the optimization problem

$$\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N} \alpha_i$$

$$\text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, N$$

to find the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$;
2: Select a component $\alpha_j^*$ of $\alpha^*$ with $0 < \alpha_j^* < C$ and calculate

$$b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x_j);$$

3: Construct the decision function

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^* \right),$$

where

$$\operatorname{sign}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases};$$

4: return f(x);
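Steps 2 and 3 of ALGORITHM 2 can be illustrated once the multipliers are known. The sketch below assumes a toy problem with two one-dimensional training points, for which the dual can be solved by hand: with equal multipliers α for both points, the minimizer is α = 1 / (1 − K(x_1, x_2)). The function names `bias` and `decide`, the toy data, and σ = 1 are assumptions; a real system would obtain α* from a quadratic programming solver.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel of Eq. (4)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def bias(alphas, xs, ys, j, sigma=1.0):
    """Step 2: b* computed from a support vector j with 0 < alpha_j < C."""
    total = sum(a * y * gaussian_kernel(x, xs[j], sigma)
                for a, x, y in zip(alphas, xs, ys))
    return ys[j] - total


def decide(x, alphas, xs, ys, b, sigma=1.0):
    """Step 3: f(x) = sign(sum_i alpha_i y_i K(x, x_i) + b), with sign(0) = 1."""
    s = sum(a * y * gaussian_kernel(x, xi, sigma)
            for a, xi, y in zip(alphas, xs, ys)) + b
    return 1 if s >= 0 else -1


# Toy training set (hypothetical): one normal and one abnormal reading.
xs = [(0.0,), (2.0,)]
ys = [1, -1]
k12 = gaussian_kernel(xs[0], xs[1])
alpha = 1.0 / (1.0 - k12)          # hand-solved dual optimum; requires C > alpha
alphas = [alpha, alpha]            # satisfies sum_i alpha_i y_i = 0
b = bias(alphas, xs, ys, j=0)

label_near_normal = decide((0.5,), alphas, xs, ys, b)
label_near_abnormal = decide((1.8,), alphas, xs, ys, b)
```

By symmetry the bias of this toy problem is zero, and a new reading takes the label of whichever training point it is closer to.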

However, because the training sample is small, the model cannot be applied well to the whole WSN. If this cleaning model is applied directly to the next process, it will lead to imprecise results. The next section therefore introduces how to optimize the model to make it more robust.




Fig. 4: Online machine learning model (first, second, and third rounds; in each round the results of ABOD and f(x) on the collected data D are compared, with f(x) updated to f'(x) when t < T and kept when t ≥ T)

C. Model Optimization

In this section, a method that can iteratively optimize the model is introduced. Traditional machine learning works in batch mode: it assumes that all training data is given in advance, and the classifier is obtained by minimizing the empirical error defined on the training data. This learning method has achieved great success on small samples, but when the data size becomes large, its computational complexity is high and its response is slow, so it cannot meet strict real-time requirements. In contrast to batch learning, online machine learning assumes that training data continues to arrive. Usually, a single training sample is used to update the current model, which can greatly reduce the space and time complexity of the learning algorithm and gives strong real-time performance. The online machine learning process includes presenting the predicted results of the model, collecting feedback data, and training the model, forming a closed-loop system.

An anomaly classification model is established for a large amount of high-dimensional data. Because of the huge amount of data in WSNs, using all of it to establish the cleaning model would require too much computation and would not suit the real-time scenario, so online machine learning is adopted. As edge nodes move and collect data, training data continues to arrive.

In WSNs, during mobile data collection, ALGORITHM 1 and the decision function f(x) from ALGORITHM 2 are both used to judge which data generated by sensor nodes are abnormal points. By comparing the results obtained by the two methods, the accuracy of the model f(x) can be calculated. Data $(x_i, y_i)$ that f(x) predicts wrongly is added to the training set D to retrain the cleaning model. This operation is repeated until the accuracy of f(x) in predicting data anomalies reaches or exceeds the threshold T (e.g., 98%), which is determined by the environment in the WSNs. In the subsequent collection and cleaning process, the ABOD algorithm is no longer needed, and the cleaning model f(x) is used directly for data cleaning.

Fig. 4 shows part of the mobile cleaning process. In the first collection range, the prediction accuracy t of the model f(x) is calculated by comparing the results of ALGORITHM 1 and f(x). If t < T, the wrongly predicted data is put into D and f(x) is optimized. In the second collection range, the two methods are used again for the calculation. If t ≥ T, there is no need to optimize the model further, and f(x) is used directly to clean the abnormal data in the next collection range.
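The retraining loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names optimize_online, reference_label, train and train_cutoff are hypothetical, and a toy one-dimensional labeler and cutoff learner stand in for ALGORITHM 1 (ABOD) and ALGORITHM 2 (SVM) so that the example runs with the standard library only.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Sample = Tuple[Point, int]  # (x_i, y_i): y_i = 1 for abnormal, 0 for valid


def optimize_online(
    batches: List[List[Point]],
    reference_label: Callable[[List[Point]], List[int]],       # stand-in for ALGORITHM 1
    train: Callable[[List[Sample]], Callable[[Point], int]],   # stand-in for ALGORITHM 2
    training_set: List[Sample],
    threshold: float = 0.98,                                    # T in the text
):
    """Retrain f(x) on its wrongly predicted samples until accuracy t >= T."""
    model = train(training_set)
    history = []
    for batch in batches:                      # one batch per collection range
        truth = reference_label(batch)
        predicted = [model(x) for x in batch]
        wrong = [(x, y) for x, y, p in zip(batch, truth, predicted) if p != y]
        t = (len(batch) - len(wrong)) / len(batch)
        history.append(t)
        if t >= threshold:
            break                              # f(x) is now used on its own
        training_set.extend(wrong)             # add wrongly predicted data to D
        model = train(training_set)
    return model, history


# Toy stand-ins (assumptions for illustration only).
def label_by_rule(batch):
    return [1 if x[0] > 10 else 0 for x in batch]


def train_cutoff(samples):
    valid = [x[0] for x, y in samples if y == 0]
    abnormal = [x[0] for x, y in samples if y == 1]
    cutoff = (max(valid) + min(abnormal)) / 2
    return lambda x: 1 if x[0] > cutoff else 0


batches = [
    [(2.0,), (8.0,), (12.0,), (14.0,), (25.0,)],
    [(3.0,), (7.0,), (9.0,), (11.0,), (40.0,)],
    [(1.0,), (5.0,), (9.5,), (13.0,), (50.0,)],
]
initial_set = [((0.0,), 0), ((30.0,), 1)]
model, history = optimize_online(batches, label_by_rule, train_cutoff, initial_set)
print(history)
```

In this toy run the accuracy stays below T for the first two collection ranges and reaches it in the third, which mirrors the three rounds of Fig. 4.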

D. Mobile Data Cleaning Based on Edge Computing

In this section, the method of cleaning anomalous data at a mobile edge node is summarized. First, the edge node processes the collected data set D according to ALGORITHM 1, in order to classify the data and obtain the training set for the model. Second, based on the result of ALGORITHM 1, the cleaning model f(x) is established (ALGORITHM 2) so that the cleaning work can be performed better. Finally, the iterative training method for f(x) proposed in subsection C (Model Optimization) adapts the model to the WSNs environment.
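To make the first step more concrete, the sketch below shows one common simplified form of angle-based outlier detection. It is an assumption-laden illustration and not a reproduction of ALGORITHM 1: the function names abof and abod_labels are hypothetical, the variance is unweighted, and the number of points to flag (n_outliers) is supplied by hand. A point far from the rest sees all other points within a narrow range of angles, so the variance of its distance-weighted angle terms is small.

```python
from itertools import combinations
from statistics import pvariance


def abof(p, others):
    """Angle-based outlier factor of p: variance of <pa, pb> / (|pa|^2 |pb|^2)."""
    values = []
    for a, b in combinations(others, 2):
        pa = [ai - pi for ai, pi in zip(a, p)]
        pb = [bi - pi for bi, pi in zip(b, p)]
        na = sum(v * v for v in pa)
        nb = sum(v * v for v in pb)
        if na == 0 or nb == 0:
            continue  # skip exact duplicates of p
        dot = sum(u * v for u, v in zip(pa, pb))
        values.append(dot / (na * nb))
    return pvariance(values)


def abod_labels(points, n_outliers):
    """Label the n_outliers points with the smallest factor as abnormal (1)."""
    scores = [abof(p, points[:i] + points[i + 1:]) for i, p in enumerate(points)]
    order = sorted(range(len(points)), key=scores.__getitem__)
    flagged = set(order[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(points))]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
labels = abod_labels(points, n_outliers=1)
print(labels)
```

The labels produced this way play the role of the training targets for the cleaning model f(x). This brute-force version examines every pair of points for every candidate, so it is only suitable for small collection ranges.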

V. EXPERIMENTAL RESULTS

In this section, we describe the experimental settings and analyze the experimental results.

A. Experiment Settings

In order to validate our proposed mechanism, we evaluate it under different parameters and analyze the results respectively. This paper deals with a stable data environment and, for consistency, always uses the same environmental monitoring dataset from Kaggle. The experimental environment parameters are shown in TABLE I.
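As a rough illustration of how the TABLE I parameters combine, the sketch below assumes that energy is charged linearly at the listed cost of 0.001 J per millisecond. That linear model and the function name transfer_energy_j are assumptions for illustration; the energy model used in the experiments may include other terms.

```python
# Values taken from TABLE I.
COST_PER_MS_J = 0.001          # consumption cost per millisecond
SENSOR_TO_EDGE_MS = 6          # transmission time, sensor to edge node
AGGREGATOR_TO_SINK_MS = 500    # transmission time, aggregator to sink node


def transfer_energy_j(duration_ms):
    """Energy in joules for a transfer of the given duration (assumed linear)."""
    return duration_ms * COST_PER_MS_J


edge_hop_j = transfer_energy_j(SENSOR_TO_EDGE_MS)
sink_hop_j = transfer_energy_j(AGGREGATOR_TO_SINK_MS)
hop_ratio = AGGREGATOR_TO_SINK_MS / SENSOR_TO_EDGE_MS

print(f"sensor to edge node:     {edge_hop_j:.3f} J")
print(f"aggregator to sink node: {sink_hop_j:.3f} J")
print(f"ratio:                   {hop_ratio:.1f}")
```

Under this assumption a single aggregator-to-sink transfer costs roughly 83 times as much as a sensor-to-edge transfer.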

B. Model Establishment

TABLE I: Experiment Environment

Parameter                                         Value
Operating system                                  Windows 10
Memory                                            8 GB
CPU                                               Intel Core i7
Programming language                              Python
Compiler                                          PyCharm 2018
Dataset size                                      16 MB
Amount of data                                    262920
Number of sensors                                 100
Number of edge nodes                              1
Initial power of each sensor                      100 J
Initial power of the edge node                    1 kJ
Transmission time from sensor to edge node        6 ms
Transmission time from aggregator to sink node    500 ms
Consumption cost per millisecond                  0.001 J

Fig. 5: Model establishment. (a) ABOD classification. (b) Model training. (c) New data detection.

Fig. 6: Performance of three methods. (a) Comparison of the dimension in delay. (b) Comparison of the dimension in energy consumption. (c) Comparison of the number of data in delay. (d) Comparison of the number of data in energy consumption. (e) Comparison of the abnormal data ratio in delay. (f) Comparison of the abnormal data ratio in energy consumption.

Before evaluating the performance of the algorithm, let's look at the effect of the classification model. Fig. 5(a) shows the classification of the collected data by the ABOD algorithm. Red dots represent valid data and green dots represent abnormal data. Fig. 5(b) shows the SVM with a Gaussian kernel function trained on the result generated by the ABOD algorithm. The orange curve represents the trained model f(x), and points outside the curve boundary are abnormal data. It can be seen that the trained model f(x) has a certain error on the training data. This is in line with the nature of machine learning: the training result cannot fit the training set 100%. Fig. 5(c) shows the classification of newly collected data using this boundary. There is clearly a certain error in judging the newly collected data with the model (for example, 9% in the figure), and the accuracy of the model can be improved continuously through online machine learning.
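The curved boundary described for Fig. 5(b) comes from a decision function built on a Gaussian kernel. The sketch below illustrates that idea with a kernel perceptron, which is a simpler technique swapped in for the SVM so the example needs only the standard library. The function names, the value gamma = 0.5 and the sample points are all assumptions for illustration. Labels follow the usual convention of +1 for abnormal and -1 for valid.

```python
import math


def gaussian_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def fit_kernel_perceptron(samples, gamma=0.5, epochs=20):
    """Return f(x) in {+1, -1}; samples is a list of (x, y) with y in {+1, -1}."""
    alphas = [0] * len(samples)  # how often each sample triggered an update

    def score(x):
        return sum(
            a * y * gaussian_kernel(xi, x, gamma)
            for a, (xi, y) in zip(alphas, samples)
        )

    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(samples):
            if y * score(x) <= 0:   # wrongly classified: strengthen this sample
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:           # every training sample is classified correctly
            break
    return lambda x: 1 if score(x) > 0 else -1


valid = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
abnormal = [(3.0, 0.0), (0.0, 3.0), (-3.0, 0.0), (0.0, -3.0)]
samples = [(x, -1) for x in valid] + [(x, 1) for x in abnormal]

f = fit_kernel_perceptron(samples)
print(f((0.2, 0.1)), f((2.8, 0.3)))
```

Because the Gaussian kernel decays with distance, a point far from every training sample receives a score close to zero, so its predicted class is unreliable. This is one reason the model benefits from further samples collected in later ranges.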

C. Comparison Algorithms

To evaluate the proposed algorithm, three algorithms are implemented and compared under the same running environment. The first is the Traditional Cleaning Model (TCM), which processes the data at the aggregator and sends it to the sink node. The second is the Angle-based Outlier Detection Model (ABODM), which uses the ABOD algorithm to clean data during the movement without establishing a model. The third is the Mobile Data Cleaning Model (MDCM), which is proposed in this paper. The time delay and energy consumption of the algorithms are compared along three aspects: data dimension, data volume, and anomaly ratio.

Fig. 6(a) and Fig. 6(b) show the time delay and energy consumption of the three algorithms in data cleaning at different data dimensions. In this experiment, we fix the number of data items generated by each sensor at 300 and the ratio of abnormal data at 45%. The experimental results show that our algorithm significantly reduces the energy consumption and data cleaning delay of the entire WSNs when the dimension is high, while MDCM has a higher latency than TCM and ABODM when the dimension is low. The reason is that, at low dimensions, the sensor node itself can meet the calculation requirements under TCM. However, as the dimension increases, the requirement for computing power increases and the computing capacity of the sensor node can no longer meet it, so the delay of TCM increases greatly. ABODM and MDCM perform their calculations at the edge node, which has strong computing power. When the computational task becomes more difficult, the transmission delay of MDCM is significantly reduced. In terms of energy consumption, TCM requires the entire WSNs to take part in the calculation, which consumes too much energy. ABODM and MDCM only calculate at the edge node, thereby reducing the energy consumption of the WSNs.

Fig. 6(c) and Fig. 6(d) display the impact of the number of data items collected by each sensor on the three algorithms. In this experiment, we set the dimension of the data collected by each sensor to 3 and the ratio of abnormal data to 45%. In the experimental environment, 100 sensor nodes are deployed in the WSNs. The experimental results show that when the quantity of data is small, MDCM has little effect on time delay and energy consumption. When the amount of data is large, the effect of MDCM on reducing time delay and energy consumption is obvious. The reason is that the larger the amount of data, the better the model fits in MDCM.

Fig. 6(e) and Fig. 6(f) show the effects of the abnormal data ratio on the time delay and energy consumption of the different algorithms. In this experiment, we set the dimension of the data collected by each sensor to 3 and the number of data items generated by each sensor to 300. The experimental results show that when the proportion of abnormal data is less than 35%, MDCM has a large time delay in data cleaning. However, as the ratio of abnormal data increases, MDCM shows its superiority. The reason lies in the proportion of abnormal data: when it is larger, the training samples become more balanced, which is conducive to establishing the model, and the time delay and energy consumption are consequently reduced. In addition, the larger the proportion of abnormal data, the smaller the amount of data delivered by the edge nodes to the cloud after data cleaning, so the time delay of all algorithms is reduced overall. In terms of energy consumption, however, the abnormal data ratio has no effect on TCM and ABODM, because both algorithms always perform a large number of calculations during data collection and cleaning, whereas MDCM can achieve the cleaning effect with a very small amount of calculation.

Across the experimental results, we can clearly see that MDCM performs well on high-dimensional big data with a balanced abnormal ratio.

VI. CONCLUSION

WSNs will play an important role in the future development of IIoT by collecting surrounding conditions and environment information. Thus, designing new data cleaning techniques becomes essential to eliminate meaningless or abnormal data and to enable the related networks to operate as long as possible. To handle big data and the need to clean multi-dimensional feature data, a new method of data cleaning using mobile edge nodes is presented. The ABOD algorithm is applied to obtain the training data for establishing the cleaning model. A Gaussian kernel function is used in our proposed SVM to fit the cleaning model. Finally, the model is iteratively optimized with online machine learning. This model is designed to speed up data cleaning and reduce the upload bandwidth and energy consumption in SCS while maintaining data reliability and integrity. Compared to the state of the art, the experimental results show that our model can efficiently clean high-dimensional anomalous data in industrial SCS and has obvious advantages in terms of delay, energy consumption, and network lifetime.

REFERENCES

[1] X. Zhang and Z. Ge, “Local parameter optimization of lssvm forindustrial soft sensing with big data and cloud implementation,”IEEE Transactions on Industrial Informatics, 2019. DOI: 10.1109/TI-I.2019.2900479.

[2] J. Huang, L. Kong, G. Chen, M. Wu, X. Liu, and P. Zeng, “Towardssecure industrial iot: Blockchain system with credit-based consensusmechanism,” IEEE Transactions on Industrial Informatics, vol. 15, no. 6,pp. 3680–3689, 2019.

[3] T. Wang, H. Luo, J. Zheng, and M. Xie, “Crowdsourcing mechanism fortrust evaluation in cpcs based on intelligent mobile edge computing,” inACM Trans. Intell. Syst. Technol., 2019. Doi:10.1145/3324926.

[4] K. Wang, Y. Wang, Y. Sun, S. Guo, and J. Wu, “Green industrialinternet of things architecture: An energy-efficient perspective,” IEEECommunications Magazine, vol. 54, no. 12, pp. 48–54, 2016.

[5] H. Harb, A. Makhoul, and C. A. Jaoude, “En-route data filteringtechnique for maximizing wireless sensor network lifetime,” in 201814th International Wireless Communications & Mobile Computing Con-ference (IWCMC), pp. 298–303, IEEE, 2018.

[6] H.-N. Dai, Z. Zheng, and Y. Zhang, “Blockchain for internet of things:A survey,” IEEE Internet of Things Journal, pp. 1–19, 2019.

[7] B. Cao, J. Zhao, P. Yang, P. Yang, X. Liu, and Y. Zhang, “3d deploymentoptimization for heterogeneous wireless directional sensor networks onsmart city,” IEEE Transactions on Industrial Informatics, vol. 15, no. 3,pp. 1798–1808, 2018.

[8] X. Deng, P. Jiang, X. Peng, and C. Mi, “An intelligent outlier detectionmethod with one class support tucker machine and genetic algorithmtoward big sensor data in internet of things,” IEEE Transactions onIndustrial Electronics, vol. 66, no. 6, pp. 4672–4683, 2019.

[9] L. Qi, X. Zhang, W. Dou, and Q. Ni, “A distributed locality-sensitivehashing-based approach for cloud service recommendation from multi-source data,” IEEE Journal on Selected Areas in Communications,vol. 35, no. 11, pp. 2616–2624, 2017.

[10] Y. Chen, S. Tang, N. Bouguila, W. Cheng, J. Du, and H. L. Li,“A fast clustering algorithm based on pruning unnecessary distancecomputations in dbscan for high-dimensional data,” Pattern Recognition,vol. 83, pp. 375–387, 2018.

[11] L. Liu, H. Li, and M. Gruteser, “Edge assisted real-time object detectionfor mobile augmented reality,” in MobiCom, ACM, 2019.

[12] C. Xu, K. Wang, P. Li, S. Guo, J. Luo, B. Ye, and M. Guo, “Making bigdata open in edges: A resource-efficient blockchain-based approach,”IEEE Transactions on Parallel and Distributed Systems, 2019. DOI:10.1109/TPDS.2018.2871449.

[13] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom,“Declarative support for sensor data cleaning,” in International Con-ference on Pervasive Computing, pp. 83–100, Springer, 2006.

[14] B. Sheng, Q. Li, W. Mao, and W. Jin, “Outlier detection in sensornetworks,” in Proceedings of the 8th ACM international symposium onMobile ad hoc networking and computing, pp. 219–228, ACM, 2007.

[15] K. T.-M. Tran, S.-H. Oh, and J.-Y. Byun, “Well-suited similarity func-tions for data aggregation in cluster-based underwater wireless sensornetworks,” International Journal of Distributed Sensor Networks, vol. 9,no. 8, p. 645243, 2013. DOI:10.1155/2013/645243.

[16] S. Park, M.-S. Gil, H. Im, and Y.-S. Moon, “Measurement noiserecommendation for efficient kalman filtering over a large amount ofsensor data,” Sensors, vol. 19, no. 5, p. 1168, 2019.

[17] M. Vazquez-Olguin, Y. S. Shmaliy, O. Ibarra-Manzano, J. Munoz-Minjares, and C. Lastre-Dominguez, “Object tracking over distributedwsns with consensus on estimates and missing data,” IEEE Access, 2019.DOI: 10.1109/ACCESS.2019.2905514.

[18] T. Wang, L. Qiu, G. Xu, A. K. Sangaiah, and A. Liu, “Energy-efficientand trustworthy data collection protocol based on mobile fog computingin internet of things,” IEEE Transactions on Industrial Informatics, 2019.10.1109/TII.2019.2920277.

[19] B. Cao, J. Zhao, P. Yang, Z. Lv, X. Liu, and G. Min, “3d multiobjectivedeployment of an industrial wireless sensor network for maritime appli-cations utilizing a distributed parallel algorithm,” IEEE Transactions onIndustrial Informatics, vol. 14, no. 12, pp. 5487–5495, 2018.

[20] D. Liu, Z. Yan, W. Ding, and M. Atiquzzaman, “A survey on secure dataanalytics in edge computing,” IEEE Internet of Things Journal, 2019.DOI: 10.1109/JIOT.2019.2897619.




[21] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision andchallenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646,2016.

[22] Y. Chen, X. Hu, and W. Fan, “Fast density peak clustering for large scaledata based on knn,” 2019. https://doi.org/10.1016/j.knosys.2019.06.032.

[23] N. Pham and R. Pagh, “A near-linear time approximation algorithm forangle-based outlier detection in high-dimensional data,” in Proceedingsof the 18th ACM SIGKDD international conference on Knowledgediscovery and data mining, pp. 877–885, ACM, 2012.

Tian Wang received his BSc and MSc degrees in Computer Science from Central South University in 2004 and 2007, respectively. He received his PhD degree from City University of Hong Kong in 2011. Currently, he is a professor at Huaqiao University, China. His research interests include wireless sensor networks, fog computing and mobile computing.

Haoxiong Ke received his B.S. degree from Xi'an Shiyou University, China, in 2018. Currently, he is a master's candidate at National Huaqiao University, China. His research interests include wireless sensor networks, mobile computing, edge computing and machine learning.

Xi Zheng received his PhD in Software Engineering from UT Austin. He specializes in Service Computing, IoT Security and Reliability Analysis. He has published more than 40 high-quality papers in top journals and conferences (PerCOM, ICSE, IEEE IoT Journal, ICCPS, IEEE Systems Journal, ACM Transactions on Embedded Computing Systems, IEEE Transactions on Vehicular Technology). He was awarded the best paper in Australian distributed computing and doctoral conference in 2017 and the Deakin Research outstanding award in 2016. He is an active reviewer for top journals and conferences.

Kun Wang received two Ph.D. degrees in computer science, from Nanjing University of Posts and Telecommunications, Nanjing, China, in 2009, and from the University of Aizu, Aizuwakamatsu, Japan, in 2018. He was a Post-Doctoral Fellow at UCLA, USA, from 2013 to 2015, where he is a Senior Research Professor. He was a Research Fellow at the Hong Kong Polytechnic University, Hong Kong, from 2017 to 2018, and a Professor at Nanjing University of Posts and Telecommunications. His current research interests are mainly in the areas of big data, wireless communications and networking, energy Internet, and information security technologies.

Arun Kumar Sangaiah (M'09) received the Master of Engineering degree from Anna University, Chennai, India, in 2007, and the Ph.D. from the Vellore Institute of Technology, Vellore, India, in 2014.

He is currently an Associate Professor with the School of Computing Science and Engineering, Vellore Institute of Technology. He has authored or coauthored more than 250 scientific papers in high-standard Science Citation Index (SCI) journals. In addition, he has authored/edited more than eight books (Elsevier, Springer, Wiley, Taylor and Francis) and 50 journal special issues in the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, the IEEE COMMUNICATION MAGAZINE, the IEEE INTERNET OF THINGS, the IEEE CONSUMER ELECTRONIC MAGAZINE, etc. He holds one Indian patent in the area of computational intelligence. He is an Editorial Board Member/Associate Editor for various international SCI journals.

His research interests include software engineering, Internet of Things, computational intelligence, and wireless networks.

Anfeng Liu is a Professor at the School of Computer Science and Engineering, Central South University, China. He is also a Member (E200012141M) of the China Computer Federation (CCF). He received the M.Sc. and Ph.D. degrees from Central South University, China, in 2002 and 2005, respectively, both in computer science. His major research interests are cyber-physical systems, service networks, and wireless sensor networks.
