

Research Article
Modeling PM2.5 Urban Pollution Using Machine Learning and Selected Meteorological Parameters

Jan Kleine Deters,1 Rasa Zalakeviciute,2 Mario Gonzalez,2 and Yves Rybarczyk2,3

1 University of Twente, Enschede, Netherlands
2 Intelligent & Interactive Systems Lab (SI2 Lab), FICA, Universidad de Las Américas, Quito, Ecuador
3 DEE, Nova University of Lisbon and CTS, UNINOVA, Monte de Caparica, Portugal

Correspondence should be addressed to Yves Rybarczyk; yrybarczykfctunlpt

Received 24 February 2017; Revised 23 April 2017; Accepted 11 May 2017; Published 18 June 2017

Academic Editor: Lei Zhang

Copyright © 2017 Jan Kleine Deters et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Outdoor air pollution causes millions of premature deaths annually, mostly due to anthropogenic fine particulate matter (PM2.5). Quito, the capital city of Ecuador, is no exception in exceeding the healthy levels of pollution. In addition to the impact of urbanization, motorization, and rapid population growth, particulate pollution is modulated by meteorological factors and geophysical characteristics, which complicate the implementation of the most advanced weather forecast models. Thus, this paper proposes a machine learning approach based on six years of meteorological and pollution data to predict PM2.5 concentrations from wind (speed and direction) and precipitation levels. The results of the classification model show high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) concentrations of PM2.5. A regression analysis suggests a better prediction of PM2.5 when the climatic conditions become more extreme (strong winds or high levels of precipitation). The high correlation between estimated and real data for a time series analysis during the wet season confirms this finding. The study demonstrates that statistical models based on machine learning are relevant for predicting PM2.5 concentrations from meteorological data.

1. Introduction

The effects of the rapid growth of the world's population are reflected in the overuse and scarcity of natural resources, deforestation, climate change, and especially environmental pollution. Currently, more than half of the global population lives in urban areas, and this number is expected to grow to about 66% by 2050, mostly due to the urbanization trends in developing countries [1]. According to the latest urban air quality database, 98% of cities in low and middle income countries with more than 100,000 inhabitants do not meet the World Health Organization (WHO) air quality guidelines [2].

A recent study using a global atmospheric chemistry model estimated that 3.3 million annual premature deaths worldwide are linked to outdoor air pollution, a number that is expected to double by 2050, mostly due to anthropogenic fine particulate matter (aerodynamic diameter < 2.5 μm; PM2.5) [3]. Over the last decade, evidence has been growing that exposure to fine particulate air pollution has adverse effects on cardiopulmonary health [4].

A recent air quality study in Quito, the capital of Ecuador, concurs that long-term levels of fine particulate pollution not only exceed the WHO's recommended level of 10 μg/m³ but are also higher than the national standard of 15 μg/m³ [5]. Even though the overall levels of fine particulate pollution have been decreasing over the last decade due to active efforts of the local and national governments, in some locations of the city the air quality has continued to deteriorate. The latter reflects the global trends of urbanization and motorization.

Hindawi, Journal of Electrical and Computer Engineering, Volume 2017, Article ID 5106045, 14 pages, https://doi.org/10.1155/2017/5106045

In addition to the impact of urbanization and rapid population growth, the pollution levels in cities are modulated by meteorological factors [6]. Most importantly, the depth of the mixing layer (the lower layer of the troposphere, in which surface emissions are mixed) often depends on solar radiation and thus on the temperature in the area. The shallower the mixing depth is, the less diluted the daily emissions get. Therefore, temperature has a reducing impact on fine particulate matter levels through convection [7]. In addition, the formation and evolution of photochemical smog depend on solar radiation and temperature; meanwhile, wind speed tends to help ventilate air pollutants and/or transport them to other areas, even if the emission sources are not present in that region [8, 9]. This can result in increased levels of air pollution downwind from the original source, which directly depends on the wind direction [8]. Increased relative humidity has been shown to make even fine particles heavier, helping the dry deposition process of removal, while precipitation has a direct scavenging effect through wet deposition [7, 8]. In addition, some studies differentiate between the seasons, as different parameters have different effects during the year due to the combination of conditions [8, 9]. Thus, it is clearly impossible to rely on a single parameter to fully understand urban pollution, especially if the study area is in a nonhomogeneous and complex terrain. This fact justifies the elaboration of models that take into account heterogeneous data to predict air quality.

Currently, three major approaches are used to forecast PM2.5 concentrations: statistical models, chemical transport models, and machine learning. Statistical models, which are mainly based on single-variable linear regression, have shown a negative correlation between different meteorological parameters (wind, precipitation, and temperature) and PM concentrations (PM1.0, PM2.5, and PM10) [7]. Chemical transport and atmospheric dispersion models are numerical methods, the most advanced of which are WRF-Chem and CMAQ. These models can be used to predict atmospheric pollution, but their accuracy relies on an up-to-date source list that is very difficult to produce [10]. In addition, the geophysical characteristics of locations with complex terrain complicate the implementation of these weather and pollution forecast models, mostly due to the complexity of the air flows (wind speed and direction) around the topographic features [11, 12]. Unlike a purely statistical method, a machine learning approach can consider several parameters in a single model. The most popular classifiers for forecasting pollution from meteorological data are artificial Neural Networks [13–15]. Other successful studies use hybrid or mixed models that combine several artificial intelligence algorithms, such as fuzzy logic and Neural Network [16], Principal Component Analysis and Support Vector Machine [17], or numerical methods and machine learning [10].

Recent studies show that the machine learning approach seems to outperform the other two methods for forecasting pollution [9, 10]. This is the reason why it is increasingly used to predict air quality [13, 17–21]. However, data mining differs from one study to another not only in terms of classification algorithms but also regarding the features used. Some studies consider a quite exhaustive list of meteorological factors [15, 16], whereas others proceed with a careful selection [13, 14, 17, 22] or do not use climatic parameters at all [18]. Since machine learning is a very promising method for forecasting pollution, we propose applying this approach to predict PM2.5 concentrations in Quito. This prediction is based on a selection of meteorological features for two main reasons: first, because a model using only meteorological data, which can be easily obtained in any urban area, is cheaper than an air quality monitoring system and, second, because a general model that works for any city is not realistic [10], which implies that a selection of meteorological parameters must be performed in order to find the best model for the capital city of Ecuador. Quito is located in the Andes cordillera, in the tropical climate zone, which is characterized by two seasons with different accumulations of precipitation. However, the temperature, the pressure, and even the amount of solar radiation do not vary much during the year. Moreover, the wind direction and speed highly depend on the topographic features of the complex terrain in which a city is positioned and usually present one of the biggest challenges in forecasting weather and air quality. Therefore, this research aims to study the connectivity between three selected meteorological factors (wind speed, wind direction, and precipitation) and PM2.5 pollution in two districts located in northwestern Quito.

In this work, we first present a spatial visualization of the distribution of fine particulate matter trends according to wind (speed and direction) and precipitation parameters at two locations in Quito. This part includes a description of the preparation of the data for classification. Then, various machine learning models, namely, Boosted Trees and Linear Support Vector Machines, are used to classify different levels of PM2.5. Finally, a Neural Network regression and a time series analysis are applied to provide insight into the parametric boundaries within which the classification models perform adequately. In the final section, we draw the main conclusions and give suggestions for future work.

2. Data Collection

2.1. Site Description. Unlike most of South America, the most urbanized continent on the planet (81%), Ecuador is one of the few countries in the region with only 64% of the total population living in urban areas [23]. However, the rate of urbanization has increased over the past decade. Quito sprawls north to south on a long plateau lying on the east side of the Pichincha volcano (alt. 4784 m.a.s.l., meters above sea level) in the Andes cordillera at an altitude of 2850 m.a.s.l. (see Figure 1). According to the 2010 census, Quito's metro area is currently 4217.95 km² with a population of over 2,239,191, which is expected to increase to almost 2.8 million by 2020, making it the most populous city in the country, outgrowing Guayaquil [24]. The city is contained within a number of valleys at 2300–2450 m.a.s.l. and terraces varying from 2700 to 3000 m.a.s.l. in altitude. Due to Quito's location on the Equator, the city receives direct sunlight almost all year round, and, due to its altitude, Quito's climate is mild and spring-like all year round. The region has two seasons: dry (June–August, average precipitation 14 mm/month) and wet (September–May, average precipitation 59 mm/month), with most of the rainfall in the afternoons. Quito's temperature is almost constant, around 14.5°C, with the prevailing winds from the east. However, due to the complex terrain, the winds in the city are highly variable most of the year (the dry season is windier), which challenges weather prediction in the region.

Figure 1: Topographic map (b) of Quito's urban area (green areas) and Google Maps images (a) of the air quality measurement sites (red dots), Cotocollao and Belisario.

For the purpose of this study, the two northwestern air quality monitoring points are presented: Cotocollao and Belisario (see red dots in Figure 1). These districts were chosen to show the variation and complexity of the prediction of fine particulate matter trends, even within a relatively small area of Quito with similar topographical characteristics (approximately the same altitude and directly east of the Pichincha volcano).

2.2. Air Quality Measurements: Monitoring Network and Instrumentation. The municipal office of environmental quality, Secretaria de Ambiente, has been collecting air quality and meteorological data since May 1, 2007, at several sites around the city. The measurement sites run by the Secretaria de Ambiente are located in representative areas throughout the city, varying in altitude depending on the municipal district. We used the real meteorological and PM2.5 concentration data from the two most northwestern automatic data collection stations: Belisario (alt. 2835 m.a.s.l., coord. 78°29′24″W, 0°10′48″S) and Cotocollao (alt. 2739 m.a.s.l., coord. 78°29′50″W, 0°6′28″S) (see Figure 1). These two sites are approximately 9 km apart. The Belisario measurement site is less than 100 m west of a busy road (Avenida America), 200 m northwest of a busy roundabout, and less than 1000 m to the east of a major outer highway (Ave. Antonio Jose de Sucre), which runs along the west side of the city and is intended to reduce the traffic inside the city (Figure 1). The Cotocollao monitoring site is located in a residential area with only a few busier streets and the same outer highway (Ave. Antonio Jose de Sucre) 250 m to the north. Both monitoring sites are inside the "Pico y Placa" zone, implemented in 2010, which, based on the last digit of the car license plate, limits rush hour traffic, reducing the number of personal vehicles by approximately 20% during the weekdays.

The monitoring stations are positioned on the roofs of relatively tall buildings. Fine particulate matter (PM2.5) measurements are conducted using instrumentation validated by the Environmental Protection Agency (EPA) of the United States. For PM2.5, a Thermo Scientific FH62C14-DHS Continuous 5014i (EPA Number EQPM-0609-183) was used. The detection limit for this instrument is 5 μg/m³ for one-hour averaging. The aerosol data are collected at 10 s intervals, from which 10 min, 1-hour, and 24-hour averages are calculated. The 24-hour averages are presented in this work. Wind velocity is measured using MetOne 010C and wind direction using MetOne 020C instrumentation. The starting threshold of the wind speed and wind direction sensors is 0.22 m/s, and the accuracies are 0.07 m/s and 3°, respectively. Precipitation is measured using MetOne 382 and Thies Clima 54032007 equipment. All meteorological parameters have been validated using a Vaisala MAWS100 weather station.

3. Data Preparation

This section presents the method used to prepare the data for classification. It includes refining steps to discard unusable data, transformations to visually examine and understand the data, and the creation of an averaged intensity map of the PM2.5 concentrations with respect to the selected meteorological parameters (wind and precipitation).

3.1. Data Refinement. For this study, we analyzed six years of data, starting in June 2007 and ending in July 2013. The two datasets (one for each monitoring point) are composed of 2223 instances. Each data point consists of 4 parameters indicating daily values of precipitation accumulation (mm), wind direction (0–360°), wind speed (m/s), and observed fine particle concentration (μg/m³).

Figure 2: Data distribution for (a) Cotocollao and (b) Belisario in terms of wind direction, wind speed, precipitation, and PM2.5 concentrations (color scale). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

The datasets are cleaned by discarding data points that include any missing values. These data points represent 2.8% and 2.4% of the total data for Belisario and Cotocollao, respectively. It has been demonstrated that missing data of these magnitudes do not influence the classification performance [25]. In addition, considering the very low number of missing values, it is preferable to remove them instead of performing an interpolation, for two reasons: (i) we proceed with an analysis on discrete variables (day-by-day) and not a time series forecasting and (ii) the PM2.5 concentrations vary strongly from one day to another. Weekend days are also removed from the dataset because the distribution of PM2.5 concentrations during weekdays and weekends is very different for Quito. Keeping them could introduce an additional level of complexity in data classification: during weekdays there are clear rush hour peaks (morning and evening), while on Saturdays PM2.5 levels increase between late morning and late afternoon hours, and Sundays can be identified by a drop in PM2.5 concentration. These patterns are dictated by changes in human activity during the week and therefore clearly show the dependence of PM2.5 on traffic. After cleaning, the final datasets are composed of 1527 instances for Belisario and 1536 instances for Cotocollao.
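The two cleaning rules of this subsection (drop incomplete data points, drop weekend days) can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the use of `None` for a missing measurement, and the sample values are hypothetical and are not taken from the study's datasets or code.

```python
from datetime import date

def refine(records):
    """Keep complete weekday records.

    Each record is (day, precipitation_mm, wind_dir_deg, wind_speed_ms, pm25);
    a missing measurement is represented by None (an assumption of this sketch).
    """
    kept = []
    for rec in records:
        day, *values = rec
        if any(v is None for v in values):
            continue                      # discard incomplete data points
        if day.weekday() >= 5:
            continue                      # 5 = Saturday, 6 = Sunday
        kept.append(rec)
    return kept

# Hypothetical records, not taken from the study's datasets.
records = [
    (date(2010, 3, 1), 0.0, 120.0, 1.8, 17.2),   # Monday, complete
    (date(2010, 3, 2), 2.4, None, 2.1, 14.9),    # Tuesday, missing wind direction
    (date(2010, 3, 6), 0.0, 95.0, 1.2, 16.0),    # Saturday
    (date(2010, 3, 8), 5.1, 10.0, 3.0, 9.7),     # Monday, complete
]
clean = refine(records)
```

Only the two complete weekday records remain in `clean`; no interpolation is attempted, in line with the reasoning given above.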

3.2. Data Transformation. To represent the data according to a wind rose plot, the linear scale of wind direction (0–360°) is transformed from polar to Cartesian coordinates, where angles increase clockwise and both 0° and 360° are north (N) (see Figure 2). This mathematical transformation (see (1)) permits a more accurate feature representation of the data for wind directions around the north axis. Otherwise, wind direction angles slightly higher than 0° and slightly lower than 360° would be treated as two opposing directions. This is useful for the classification models implemented in the next stage, since machine learning models perform better when the relationships between parameters are continuous (optimization, smoother clustering task) [26]. This transformation ensures a representation of the original data that is both valid and more informative. In addition, this representation can be completed by the precipitation levels, which are plotted on the z-axis (Figure 2). The color range is mapped from concentrations of 0 μg/m³ to >25 μg/m³. The threshold of 25 μg/m³ indicates the value above which the 24-hour concentrations of PM2.5 are harmful according to international health standards.

$$
\begin{aligned}
x &= \sin\left(\frac{\text{Wind Direction}}{360^{\circ}} \cdot 2\pi\right) \cdot \text{Wind Speed}, \\
y &= \cos\left(\frac{\text{Wind Direction}}{360^{\circ}} \cdot 2\pi\right) \cdot \text{Wind Speed}.
\end{aligned}
\tag{1}
$$
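As a minimal sketch of transformation (1), the following Python function maps a wind direction and speed to Cartesian coordinates and checks that two directions on either side of north end up close together. The function name and the sample values are illustrative assumptions; the study itself was carried out in Matlab.

```python
import math

def wind_to_cartesian(direction_deg, speed):
    """Map wind direction (degrees, clockwise from north) and speed to (x, y), as in (1)."""
    angle = direction_deg / 360.0 * 2.0 * math.pi
    return math.sin(angle) * speed, math.cos(angle) * speed

# Directions just either side of north are neighbors after the transformation,
# although 1 and 359 are far apart on the linear 0-360 scale.
x1, y1 = wind_to_cartesian(1.0, 2.0)
x2, y2 = wind_to_cartesian(359.0, 2.0)
gap = math.hypot(x1 - x2, y1 - y2)

# An easterly direction (90 degrees) lies on the positive x-axis.
east_x, east_y = wind_to_cartesian(90.0, 3.0)
```

With this convention north points along the positive y-axis and east along the positive x-axis, which matches the clockwise wind rose layout described above.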

A visual inspection of the transformed data shows that the wind directions corresponding to precipitation are north (N) for Cotocollao (Figure 2(a)) and east (E) for Belisario (Figure 2(b)). The stronger winds tend to blow from between south (S) and southeast (SE) for Cotocollao and between southwest (SW) and SE for Belisario. As expected, in both cases these stronger winds seem to account for relatively low levels of PM2.5.

3.3. Trend Analyses. In order to obtain general trends in the distribution of the PM2.5 concentrations as a function of wind speed and wind direction, the data are used to generate convolution-based spatial representations. Convolution-based models for spatial data have increased in popularity as a result of their flexibility in modeling spatial dependence and their ability to accommodate large datasets [27]. The generated Convolutional Generalization Model (CGM) [28] is an averaged value of the PM2.5 pollution level (PL) in which the regional quantity of influence per data point is modeled as a 2D Gaussian matrix (see (2)). A Gaussian convolution is applied (i) to spatially interpolate the data in order to get a 2D representation from the points' coordinates calculated in (1) and (ii) to smooth the PL concentration values of this representation. A Gaussian kernel is used because it exhibits the quality of monotonic smoothing and, as there is no prior knowledge about the distribution, a kernel density function with high entropy minimizes the information transfer of the convolution step to the processed data [29]. This 2D Gaussian matrix is multiplied by the PL of the given data point and added to the CGM at the coordinates corresponding to the wind speed and direction of this point. The quantity of influence itself is then accumulated for the same cells. The final step is to divide the total amount of each cell by its quantity of influence, which results in a generalized average value.

$$
\mathrm{CGM}(\text{rows}, \text{columns}) = \mathrm{PL} \cdot \frac{1}{36}
\begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix}
\tag{2}
$$
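The accumulation and normalization steps described in this subsection can be sketched in plain Python as follows. The kernel is the one in (2); the grid size (41 × 41 cells), the ±4 m/s extent, the orientation with north at the top, and the sample points are assumptions made for this illustration and do not come from the paper.

```python
KERNEL_1D = [1, 4, 6, 4, 1]
# 5x5 kernel from (2): outer product of the binomial vector, scaled by 1/36.
KERNEL = [[a * b / 36.0 for b in KERNEL_1D] for a in KERNEL_1D]

def build_cgm(points, size=41, max_speed=4.0):
    """points: iterable of (x, y, pl), with x and y computed as in (1), in m/s.

    Returns a size x size grid of influence-weighted average PL values
    (None where no data point has any influence).
    """
    total = [[0.0] * size for _ in range(size)]   # sum of PL * influence
    weight = [[0.0] * size for _ in range(size)]  # sum of influence
    scale = (size - 1) / (2.0 * max_speed)        # cells per m/s (assumed)
    for x, y, pl in points:
        col = round((x + max_speed) * scale)
        row = round((max_speed - y) * scale)      # north at the top
        for i in range(5):
            for j in range(5):
                r, c = row + i - 2, col + j - 2
                if 0 <= r < size and 0 <= c < size:
                    total[r][c] += pl * KERNEL[i][j]
                    weight[r][c] += KERNEL[i][j]
    return [[t / w if w > 0 else None for t, w in zip(t_row, w_row)]
            for t_row, w_row in zip(total, weight)]

# Two hypothetical observations falling in the same cell: the result is their mean.
grid = build_cgm([(1.0, 1.0, 10.0), (1.0, 1.0, 20.0)])
center = grid[15][25]
```

Dividing the accumulated PL by the accumulated influence is what turns the sum into a local average, so cells that no data point reaches stay empty rather than reading as zero pollution.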

The general tendencies are as follows: (i) strong winds result in low PM2.5 concentrations and (ii) the strongest winds generally come from a similar direction (SE for Cotocollao and S for Belisario). The results of the CGMs for both sites are shown in Figure 3 as an overlay on top of the geographic location of their respective monitoring stations. Main highways are indicated in green. The highest concentrations of PM2.5 (from yellow to red) tend to be brought by the winds coming from these main highways. Note that higher wind speeds for Cotocollao tend to be on the axis of Quito's former airport (grey-green area at the center of the map, see Figure 3), currently transformed into a city park. This traffic- and structure-free corridor seems to accelerate wind speeds, which may explain the reduction of PM2.5 concentrations due to better ventilation of this part of the city.

During the study, average PM2.5 concentrations in Cotocollao and Belisario are 15.6 μg/m³ and 17.9 μg/m³, respectively, both exceeding the national standards. During the studied six years, the area of Belisario was more polluted, with more variation in PM2.5 concentrations (higher deviation, see Figure 4), and more turbulent (Figure 3) than Cotocollao. These factors could be the result of Belisario being more urbanized.

4. Classification Models

Machine learning models are used to separate the data into different classes of PM2.5 concentrations. Supervised learning techniques are applied to create models for this classification task. Here, we introduce Boosted Trees (BTs) and Linear Support Vector Machines (L-SVM). A BT combines weak learners (simple rules) to create a classification algorithm in which each data point misclassified by a learner gains weight. The following learner then optimizes the classification of the highest-weighted region.

Figure 3: CGM visualization positioned on top of the geographic location of the respective monitoring stations (northwestern part of Quito). The northern CGM visualization is Cotocollao and the southern one is Belisario. Main highways are represented in green. Color scale: 0 to >25 μg/m³; scale bar: 1 km.

Figure 4: Distribution (density versus real value in μg/m³) of PM2.5 concentrations (June 2007 to July 2013) for Cotocollao and Belisario. The dashed black line represents the national standard and the class separation boundary (15 μg/m³).

Table 1: Binary classification accuracy (%) with class separation at 15 μg/m³.

| Model | Belisario | Cotocollao |
|---|---|---|
| BT | 83.2 | 67.6 |
| L-SVM | 79.8 | 66.3 |

Boosted Trees are known for their insensitivity to overfitting and for the fact that nonlinear relationships between the parameters do not influence the performance. An L-SVM separates classes with an optimal distance (margin). Because its optimization problem is convex, the algorithm does not get trapped in local minima. As these two models are well established and exhibit different qualities, they are used in this section. All computations and visualizations are executed in MathWorks Matlab 2015. Matlab toolboxes for classification, statistics, and machine learning are used in all stages. Furthermore, Matlab's integrated tools for distribution fitting and curve fitting are applied for the different analyses. The initial parameters provided by the Matlab toolbox software are used in this work. The default parameters for the BT are the AdaBoost learning method with a total of 30 learners, a maximum number of splits of 20, and a learning rate of 0.1. The SVM is initialized with a linear kernel of scale 1.0, a box constraint level of 1.0, and an equal learning rate of 0.1.
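To make the boosting idea concrete, the following is a small, self-contained sketch of discrete AdaBoost in Python. It is not the Matlab toolbox implementation used in the study: the weak learners here are one-split decision stumps rather than trees with up to 20 splits, the function names are hypothetical, and the toy data are invented. Only the number of learners (30) and the learning rate (0.1) are taken from the text.

```python
import math

def fit_stump(X, y, w):
    """Return the stump (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] > thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def adaboost(X, y, n_learners=30, learning_rate=0.1):
    """Discrete AdaBoost on labels in {-1, +1}; returns a list of weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_learners):
        err, f, thr, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)          # avoid log(0)
        alpha = learning_rate * 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, sign))
        # Misclassified points gain weight, correctly classified ones lose it.
        for i, row in enumerate(X):
            p = sign if row[f] > thr else -sign
            w[i] *= math.exp(-alpha * y[i] * p)
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    """Sign of the weighted vote of all stumps."""
    score = sum(a * (s if row[f] > thr else -s) for a, f, thr, s in model)
    return 1 if score > 0 else -1

# Hypothetical toy data: [wind speed (m/s), precipitation (mm)];
# label +1 = above 15 ug/m3, -1 = below.
X = [[0.5, 0.0], [1.0, 2.0], [1.2, 0.0], [2.5, 1.0], [3.0, 8.0], [3.5, 0.0]]
y = [1, 1, 1, -1, -1, -1]
model = adaboost(X, y)
predictions = [predict(model, row) for row in X]
```

On this toy set a single wind speed threshold already separates the classes, so every round selects the same stump; on real data the reweighting step shifts later learners toward the points earlier ones got wrong.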

Fluctuations in yearly PM2.5 concentrations are not taken into account in this classification process, as a previous analysis showed a small variation in fine particulate matter pollution levels during the studied period [5]. A binary classification is performed to set a baseline comparison between the different sites. Then, a three-class classification is carried out to assess the separability between three ranges of PM2.5 concentrations (based on WHO guidelines) and provide insight into general classification rules.

4.1. Binary Classification. In this first classification, two classes are used, which represent values above and below 15 μg/m³. This value is selected as it is the National Air Quality Standard of Ecuador for annual PM2.5 concentrations (equivalent to WHO's Interim Target-3) [30]. Given the approximately normal distribution of the datasets shown in Figure 4, a higher accuracy is expected for Belisario than for Cotocollao, partially because of the a priori imbalanced class distribution. A previous study using the same classification shows an accuracy of only 65% for Cotocollao by applying the trees.J48 algorithm, which is a decision tree implementation integrated in the WEKA machine learning workbench [5].
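The class definition and the notion of a priori class imbalance can be illustrated with a short sketch. The function names and the sample values are hypothetical, and assigning a value of exactly 15 μg/m³ to the lower class is an assumption, since the text only speaks of values above and below the boundary.

```python
NATIONAL_STANDARD = 15.0  # ug/m3, class boundary used in the binary classification

def binary_label(pm25):
    """Return 'above' for daily PM2.5 above the 15 ug/m3 boundary, else 'below'."""
    return "above" if pm25 > NATIONAL_STANDARD else "below"

def class_shares(values):
    """Fraction of observations in each class (reveals a priori class imbalance)."""
    labels = [binary_label(v) for v in values]
    return {c: labels.count(c) / len(labels) for c in ("below", "above")}

# Hypothetical daily averages, not taken from the study's datasets.
sample = [9.8, 14.2, 16.0, 18.5, 22.1, 12.7, 19.3, 25.4]
shares = class_shares(sample)
```

A classifier that always predicts the larger class already reaches an accuracy equal to that class's share, which is why a more imbalanced dataset tends to show a higher raw accuracy.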

Classification with both BT and L-SVM shows similar results. Table 1 presents the results of this first classification. The classification for Belisario outperforms that for Cotocollao. This also suggests that the extreme levels (low and high) of PM2.5 could be more straightforward to classify with the current parameters, implying a higher class separability for the Belisario dataset (wider distribution). Tables 2 and 3 show that, for both sites, the concentrations above 15 μg/m³ are better classified than those below the 15 μg/m³ boundary.

Table 2: Confusion matrix (%) of the binary classification for Cotocollao using a BT. Rows represent the true class and columns represent the predicted class.

| Class | <15 | >15 | TPR/FNR |
|---|---|---|---|
| <15 | 51.1 | 48.9 | 51.1/48.9 |
| >15 | 20.3 | 79.7 | 79.7/20.3 |

Table 3: Confusion matrix (%) of the binary classification for Belisario using a BT. Rows represent the true class and columns represent the predicted class.

| Class | <15 | >15 | TPR/FNR |
|---|---|---|---|
| <15 | 49.0 | 51.0 | 49.0/51.0 |
| >15 | 5.1 | 94.9 | 94.9/5.1 |

This is less surprising for Belisario due to the class imbalance mentioned earlier. For Cotocollao, however, the poor performance for this class can indicate that the class is less distinctive; thus, the model optimizes the class above 15 μg/m³. Note that it is crucial to correctly classify nonattainment instances (PM2.5 > 15 μg/m³), whereas wrongly flagging levels that do not violate the national standard (PM2.5 < 15 μg/m³) is a less costly error.

In Figure 5, a comparison of Receiver Operating Characteristic (ROC) curves is shown for the binary classifiers presented in Table 1, namely, the BT and L-SVM classifiers. Figure 5(a) depicts the ROC curves for the Cotocollao dataset and Figure 5(b) the ROC curves for the Belisario dataset. Once the classifier models are built for each dataset, a validation set is presented to the model in order to predict the class labels. It is also of interest to have the classification scores of the model, which indicate the likelihood that the predicted label comes from a particular class. The ROC curves are constructed from these classification scores and the true labels in the validation dataset (Figure 5).

ROC curves are useful to evaluate binary classifiers and tocompare their performances in a two-dimensional graph thatplots the specificity versus sensitivity The specificity mea-sures the true negative rate that is the proportion of negativesthat have been correctly classified true negativesnegatives =true negatives(true negatives + false positives) Likewise thesensitivity measures the true positive rate that is the propor-tion of positives correctly identified true positivespositives= true positives(true positives + false negatives) The areaunder the ROC curve (AUC) can be used as a measure ofthe expected performance of the classifier and the AUC of aclassifier is equal to the probability that the classifier willrank a randomly chosen positive instance higher than arandomly chosen negative instance [31] Figure 5(b) showsthe performance of the BT and L-SVM classifiers for theBelisario dataset The BT outperforms the L-SVM classifierin all regions of the ROC space with [AUC(BT) = 072] gt[AUC(L-SVM) = 066] which means a better performance

Journal of Electrical and Computer Engineering 7

[Figure 5 plot: sensitivity (%) versus specificity (%). Panel (a): L-SVM AUC = 56.2%, BT AUC = 59.1%. Panel (b): L-SVM AUC = 65.9%, BT AUC = 71.8%.]

Figure 5: ROC curves for Cotocollao (a) and Belisario (b).

for the BT classifier. The BT classifier has a fair performance separating the two classes in the Belisario dataset.

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, BT performs better than the L-SVM classifier, with AUC(BT) = 0.59 > AUC(L-SVM) = 0.56. This time, the classifiers for the Cotocollao dataset perform poorly in separating the two classes, only slightly better than a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus, a three-class classification should identify whether, for both sites, the extreme concentrations can be better classified than the moderate ones, and clarify the low performance for Cotocollao.
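The ranking interpretation of the AUC and the definitions of sensitivity and specificity can be made concrete with a short sketch. It is only an illustration in plain Python: the function names, the toy scores, and the 0.65 threshold are assumptions made for this example and are not taken from the Matlab analysis used in the study.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold means 'positive'."""
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical validation scores (1 = nonattainment, 0 = attainment).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold=0.65)
```

Sweeping the threshold over all observed scores and plotting the resulting (specificity, sensitivity) pairs would trace a curve of the kind shown in Figure 5.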

4.2. Three-Class Classification. To further analyze the differences of multiple categories of concentration levels, a three-class classification is performed using WHO's guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 μg/m³ (long-term, annual WHO recommended level), moderate if 10 μg/m³ < PM2.5 < 25 μg/m³, and high if PM2.5 > 25 μg/m³ (short-term, 24-hour WHO recommended level). The objective is to identify whether these main pollution thresholds are indeed well separable and thus whether the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 μg/m³ and >25 μg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 μg/m³. Due to this fact, an alternative BT algorithm is used to take into account these imbalanced classes. This RUSBoosted Tree (RBT) approach

Table 4: Confusion matrix of three-class classification for Cotocollao using an RBT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class     <10     10–25   >25     TPR/FNR
<10       76.3    16.3    7.4     76.3/23.7
10–25     28.3    28.8    42.9    28.8/71.2
>25       6.3     20.3    73.4    73.4/26.6

endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive rate versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).
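The class labeling and the idea of compensating for imbalanced classes can be sketched as follows. This is a simplified illustration, not the RBT implementation used in the study: it shows only a single random undersampling step (RUSBoost repeats such a step inside each boosting round [32]), the handling of values exactly equal to 10 or 25 μg/m³ is an assumption, and all names are hypothetical.

```python
import random
from collections import Counter


def who_class(pm25):
    """Map a daily PM2.5 value (ug/m3) to a WHO-based class label.

    Assumption: values exactly equal to 10 or 25 fall into 'moderate'.
    """
    if pm25 < 10:
        return "low"
    if pm25 <= 25:
        return "moderate"
    return "high"


def random_undersample(samples, labels, seed=0):
    """Keep as many samples per class as the smallest class contains."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Draw n_min samples of this class without replacement.
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return balanced


# Hypothetical PM2.5 readings dominated by the moderate class.
readings = [5, 8, 12, 15, 18, 20, 22, 24, 30, 40]
reading_labels = [who_class(v) for v in readings]
balanced = random_undersample(readings, reading_labels)
class_counts = Counter(y for _, y in balanced)
```

After undersampling, each class contributes the same number of samples, so a learner trained on the balanced set is not rewarded for favoring the dominant 10–25 μg/m³ class.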

Tables 4 and 5 show that concentrations <10 μg/m³ are classified adequately. Also, the correct classification of concentrations >25 μg/m³ in Cotocollao is fair. However, the false positive rate of this classification is extremely high, because 42.9% of the 10–25 μg/m³ class gets classified as class >25 μg/m³. For Belisario, the separation of classes 10–25 μg/m³ and >25 μg/m³ is deficient. In both cases, only the extreme low values can be classified well. Thus, the hypothesis that the extreme concentrations of PM2.5 are more straightforward to classify (see Section 4.1) is only partially verified.

Analyzing the wrongly classified samples of class 10–25 μg/m³ shows that, for samples classified as <10 μg/m³, the


[Figure 6 plot: density versus real value (μg/m³) for samples of class 10–25 μg/m³ classified as <10 μg/m³ and for samples of class 10–25 μg/m³ classified as >25 μg/m³; panels (a) and (b).]

Figure 6: Wrongly classified samples of class 10–25 μg/m³ with their real value distributions for Cotocollao (a) and Belisario (b).

Table 5: Confusion matrix of three-class classification for Belisario using an RBT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class     <10     10–25   >25     TPR/FNR
<10       84.8    9.5     5.7     84.8/15.2
10–25     12.3    53.5    34.2    53.5/46.5
>25       6.5     45.1    48.4    48.4/51.6

real values tend to be relatively close to 10 μg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 μg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 μg/m³. Even though the mean is shifted for Belisario, it is not evident that samples of class 10–25 μg/m³ wrongly classified into class >25 μg/m³ tend to be closer to 25 μg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the ranges 10–25 μg/m³ and >25 μg/m³, which are poorly separable according to the three-class classification.

These results show that values of 10–25 μg/m³ and >25 μg/m³ are not well separable and thus not largely influenced by the meteorological parameters used. On the contrary, lower values seem to be largely predictable by wind and precipitation conditions. This statement gains confidence by looking at the wrongly classified data points discussed previously (see Figure 6).
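The per-class TPR/FNR figures discussed in this section follow directly from a row-normalized confusion matrix. The sketch below recomputes them for the Cotocollao matrix, assuming the Table 4 entries are percentages of each true class; the function name and data layout are hypothetical.

```python
def tpr_fnr(confusion):
    """Per-class (TPR, FNR) from a confusion matrix.

    Rows are true classes and columns are predicted classes, in the same order.
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        tpr = row[i] / total  # correctly classified share of the true class
        rates.append((tpr, 1.0 - tpr))
    return rates


# Cotocollao, classes ordered as <10, 10-25, >25 (values read from Table 4).
cotocollao = [
    [76.3, 16.3, 7.4],
    [28.3, 28.8, 42.9],
    [6.3, 20.3, 73.4],
]
cotocollao_rates = tpr_fnr(cotocollao)

# Share of true 10-25 samples that end up predicted as >25.
moderate_to_high = cotocollao[1][2] / sum(cotocollao[1])
```

The low TPR of the middle row, together with the large share of moderate samples predicted as >25 μg/m³, is what the text describes as poor separability between the two upper classes.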

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well-performing rules for classifying PM2.5 concentrations <10 μg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that rules separating classes <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 μg/m³ and >25 μg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 μg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 μg/m³.

Note that, for Cotocollao, the performance increases drastically between the binary classifications of <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario hardly differs between these two classifications (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 μg/m³ class than for Belisario.

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 μg/m³; PM2.5 > 15 μg/m³) showed a high difference in performance


Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

<10 μg/m³ versus 10–25 μg/m³
  Classification rules
    Cotocollao: wind speed > 2.5 m/s; wind direction = S-SE; wind direction = NW-NE; precipitation > 1.5 mm
    Belisario: wind speed > 2.2 m/s; wind direction = SE-SW
  Classification performance
    Cotocollao: 73.2% (Figure 7(a))
    Belisario: 86.7%

<10 μg/m³ versus >25 μg/m³
  Classification rules
    Cotocollao: wind speed > 2 m/s; wind direction = S-SE; wind direction = NW-NE; precipitation > 1 mm
    Belisario: wind speed > 2 m/s; wind direction = SE-SW
  Classification performance
    Cotocollao: 88.9% (Figure 7(b))
    Belisario: 88.8%

10–25 μg/m³ versus >25 μg/m³
  Classification performance
    Cotocollao: 60.0%
    Belisario: 64.1%

[Figure 7 plot: polar plots of wind direction and wind speed with a precipitation (mm) scale. Panel (a): classes <10 μg/m³ and 10–25 μg/m³. Panel (b): classes <10 μg/m³ and >25 μg/m³.]

Figure 7: Data split for three different classes (see Table 6): (a) <10 μg/m³ versus 10–25 μg/m³ and (b) <10 μg/m³ versus >25 μg/m³. Both (a) and (b) are results for Cotocollao, mapped in terms of wind direction, wind speed, and precipitation. The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks: low (PM2.5 < 10 μg/m³), moderate (PM2.5 = 10–25 μg/m³), and high (PM2.5 > 25 μg/m³). This classification showed high performance in categorizing low concentrations, in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that are not described by the classification analysis.


[Figure 8 plot: average error (μg/m³) versus (a) precipitation (mm) and (b) wind speed (m/s), for Cotocollao and Belisario.]

Figure 8: Decrease in average prediction error with increasing parameter values (precipitation and wind speed) for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for determined weather conditions. Also, the analysis of the data trend over time will inform on the applicability of time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different models. Bin sizes of 0.5 μg/m³ (0–35 μg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the prediction error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
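The binning that lets a discrete-output model act as a regressor can be illustrated with a small sketch. It assumes that each 0.5 μg/m³ bin is represented by its center and that values outside the 0–35 μg/m³ range are clamped to the nearest bin; both choices, and all names, are assumptions made for illustration rather than details given in the text.

```python
import math

BIN_SIZE = 0.5                      # ug/m3, as stated in the text
RANGE_MIN, RANGE_MAX = 0.0, 35.0    # ug/m3, as stated in the text
N_BINS = int((RANGE_MAX - RANGE_MIN) / BIN_SIZE)


def to_bin(value):
    """Index of the bin containing `value`, clamped to the valid range."""
    idx = math.floor((value - RANGE_MIN) / BIN_SIZE)
    return min(max(idx, 0), N_BINS - 1)


def from_bin(idx):
    """Representative concentration of a bin (assumption: the bin center)."""
    return RANGE_MIN + (idx + 0.5) * BIN_SIZE


# A classifier trained on to_bin(...) labels would output a bin index;
# from_bin(...) turns that index back into a concentration estimate.
round_trip = from_bin(to_bin(12.3))
```

With bin centers as representatives, the discretization alone adds at most half a bin width (0.25 μg/m³) of error inside the covered range, which is consistent with the expectation that this effect is marginal.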

$$\mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \quad (3)$$

The mean squared error (MSE) is used to measure the prediction performance (see (3)). The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error in terms of percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance.

$$\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left|\left(y_i - \hat{y}_i\right) / y_i\right|}{n} \quad (4)$$
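As a worked illustration of (3) and (4), the two error measures can be computed as below, taking the real values as y and the predictions as ŷ. The function names and the three toy data points are hypothetical; the MAPE is returned as a fraction, so multiplying by 100 gives a percentage.

```python
def mse(real, predicted):
    """Mean squared error, as in (3)."""
    n = len(real)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(real, predicted)) / n


def mape(real, predicted):
    """Mean absolute percentage error, as in (4), returned as a fraction."""
    if any(y == 0 for y in real):
        raise ValueError("MAPE is undefined when a real value is zero")
    n = len(real)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(real, predicted)) / n


# Hypothetical real and predicted PM2.5 concentrations (ug/m3).
real_values = [10.0, 20.0, 30.0]
predicted_values = [12.0, 18.0, 33.0]
toy_mse = mse(real_values, predicted_values)    # (4 + 4 + 9) / 3
toy_mape = mape(real_values, predicted_values)  # (0.2 + 0.1 + 0.1) / 3
```

Because the MAPE divides each error by the real value, the same absolute error weighs more heavily at low concentrations than at high ones, which is worth keeping in mind when comparing sites with different mean pollution levels.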

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is defined as the average prediction error (absolute difference between the real and the predicted values, root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing


[Figure 9 plot: predicted and real PM2.5 concentration (μg/m³) versus day counter, with precipitation (mm) and wind speed (m/s) on secondary axes.]

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from the 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations.

values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than moderate climatic conditions.

Figure 9 shows an example of the comparison between the model-predicted PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Besides a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.000. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
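The two operations behind this comparison, box smoothing and the Pearson correlation between real and predicted series, can be sketched as follows. The treatment of the series edges (a shortened window) is an assumption, since the text does not state how the 5-point box filter handles them, and the function names are hypothetical.

```python
import math


def box_smooth(values, window=5):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy series to show the behavior of both helpers.
smoothed_demo = box_smooth([1, 2, 3, 4, 5])
r_demo = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

In the usual reporting convention, the number in parentheses after r is the degrees of freedom (n − 2), so r(130) would correspond to 132 paired daily values.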

The results of the MSE for the regression show that, in both city sites, the NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 μg/m³ for Cotocollao and 19 μg/m³ for Belisario.

To sum up, the present input parameters do not describe well an increase in PM2.5 concentrations once these levels exceed 20 μg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels

Table 7: MSE and MAPE (in parentheses, %) of the NN, L-SVM, and BT on regression.

Model     Belisario    Cotocollao
NN        22.1 (26)    40.7 (40)
L-SVM     26.8 (28)    41.8 (41)
BT        28.5 (30)    44.4 (42)

Table 8: MSE and MAPE (in parentheses, %) of CGM and NN regression.

Model     Belisario    Cotocollao
CGM       15.6 (22)    15.0 (25)
NN        22.1 (26)    40.7 (40)

beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). Note that this reduction is particularly large in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development


[Figure 10 plot: prediction (μg/m³) versus real value (μg/m³), both axes from 0 to 30, for Cotocollao and Belisario.]

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue).

should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
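To indicate what such an LQE step might look like, the sketch below implements a scalar Kalman filter with a random-walk state model, in which each day's estimate blends the previous estimate with the new reading. It is a generic textbook form, not a method evaluated in this study; the variances, the initial state, and the function name are hypothetical.

```python
def kalman_1d(measurements, process_var, meas_var, x0, p0):
    """Scalar Kalman filter with a random-walk state model.

    process_var: assumed day-to-day variance of the true concentration.
    meas_var: assumed variance of the measurement (or model prediction) noise.
    x0, p0: initial state estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is carried over, its uncertainty grows.
        p = p + process_var
        # Update: blend the prediction with the new value z.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical constant signal of 10 ug/m3 observed for 50 days.
filtered = kalman_1d([10.0] * 50, process_var=0.01, meas_var=1.0, x0=0.0, p0=1.0)
```

A small process variance relative to the measurement variance makes the filter rely more on previous days, which corresponds to the day-by-day dependency mentioned above.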

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression improves performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions in forecasting not only air quality but also relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed during different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wild fires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³.


Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: The 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, "The contribution of outdoor air pollution sources to premature mortality on a global scale," Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, "Health effects of fine particulate air pollution: lines that connect," Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, "Machine learning approach to forecasting urban pollution: a case study of Quito," in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., "The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area," Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, "Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors," Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, "Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan," International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., "Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events," Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., "A comprehensive evaluation of air pollution prediction improvement by a machine learning method," in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, "Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model," Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, "Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador)," in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, "Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data," Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, "Forecasting smog-related health hazard based on social media and physical sensor," Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, "Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong," International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, "A novel hybrid strategy for PM2.5 concentration analysis and prediction," Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, "Identifying pollution sources and predicting urban air quality using ensemble learning methods," Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, "Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches," Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, "Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations," Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, "Air quality prediction using optimal neural networks with stochastic variables," Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, "Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model," Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, "Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm," Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.


[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "Yale: rapid prototyping for complex data mining tasks," in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, "Some topics in convolution-based spatial modeling," in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, "A generalized convolution model and estimation for non-stationary random functions," Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, "Uniqueness of the Gaussian kernel for scale-space filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, "Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015."

[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, "On the ability of the WRF model to reproduce the surface wind direction over complex terrain," Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., "The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations," Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., "Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model," Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


the less diluted the daily emissions get. Therefore, temperature shows a reducing impact on fine particulate matter levels through convection [7]. In addition, the formation and evolution of photochemical smog are dependent on solar radiation and temperature; meanwhile, wind speed tends to help ventilate air pollutants and/or transport them to other areas, even if the emission sources are not present in that region [8, 9]. This can result in increased levels of air pollution downwind from the original source, which directly depends on the wind direction [8]. Increased relative humidity has been shown to make even fine particles heavier, helping the dry deposition process of removal, while precipitation has a direct effect of scavenging by wet deposition [7, 8]. In addition, some studies differentiate between the seasons, as different parameters have different effects during the year due to the combination of conditions [8, 9]. Thus, it is clearly impossible to rely on a single parameter to fully understand the urban pollution, especially if the study area is in a nonhomogeneous and complex terrain. This fact justifies the elaboration of models that take into account heterogeneous data to predict air quality.

Currently, three major approaches are used to forecast PM2.5 concentrations: statistical models, chemical transport, and machine learning. Statistical models, which are mainly based on single variable linear regression, have shown a negative correlation between different meteorological parameters (wind, precipitation, and temperature) and PM concentrations (PM10, PM2.5, and PM1.0) [7]. Chemical transport and Atmospheric Dispersion Modeling are numerical methods, and the most advanced ones are WRF-Chem and CMAQ. These models can be used to predict atmospheric pollution, but their accuracy relies on an updated source list that is very difficult to produce [10]. In addition, complex geophysical characteristics of locations with complex terrain complicate the implementation of these models of weather and pollution forecast, mostly due to the complexity of the air flows (wind speed and direction) around the topographic features [11, 12]. Unlike a pure statistical method, a machine learning approach can consider several parameters in a single model. The most popular classifiers to forecast pollution from meteorological data are artificial Neural Networks [13–15]. Other successful studies use hybrid or mixed models that combine several artificial intelligence algorithms, such as fuzzy logic and Neural Network [16] or Principal Component Analysis and Support Vector Machine [17], or numerical methods and machine learning [10].

Recent studies show that the machine learning approach seems to outperform the other two methods for forecasting pollution [9, 10]. This is the reason why it is increasingly used to predict air quality [13, 17–21]. However, the data mining differs from one study to another not only in terms of classification algorithms but also regarding the features used. Some of them consider a quite exhaustive list of meteorological factors [15, 16], whereas others proceed with a careful selection [13, 14, 17, 22] or do not even use climatic parameters at all [18]. Since machine learning is a very promising method to forecast pollution, we propose applying this approach to predict PM2.5 concentration in Quito. This prediction is based on a selection of meteorological features for two main reasons: first, because a model using only meteorological data, which can be easily obtained in any urban area, is cheaper than an air quality monitoring system, and, second, because a general model that may work for any city is not realistic [10], which implies that a selection of meteorological parameters must be performed in order to find the best model for the capital city of Ecuador. Quito is located in the Andes cordillera, in the tropical climate zone characterized by two seasons with different accumulation of precipitation. However, the temperature, the pressure, and even the amount of solar radiation do not vary much during the year. Moreover, the wind direction and speed highly depend on the topographic features of the complex terrain in which a city is positioned and usually present one of the biggest challenges in forecasting weather and air quality. Therefore, this research aims to study the connectivity between three selected meteorological factors (wind speed, wind direction, and precipitation) and PM2.5 pollution in two districts located in northwestern Quito.

In this work, we first present a spatial visualization of the distribution of fine particulate matter trends according to wind (speed and direction) and precipitation parameters in two locations in Quito. This part includes a description of the preparation of the data for classification. Then, various machine learning models are exploited to classify different levels of PM2.5, namely, Boosted Trees and Linear Support Vector Machines. Finally, a Neural Network regression and a time series analysis are applied to provide insight about the parametric boundaries in which the classification models perform adequately. In the final section, we draw up the main conclusions and suggestions for future work.

2. Data Collection

2.1. Site Description. Unlike most of South America, the most urbanized continent on the planet (81%), Ecuador is one of the few countries in the region with only 64% of the total population living in urban areas [23]. However, the rate of urbanization has increased over the past decade. Quito sprawls north to south on a long plateau lying on the east side of the Pichincha volcano (alt. 4,784 m.a.s.l., meters above sea level) in the Andes cordillera at an altitude of 2,850 m.a.s.l. (see Figure 1). According to the 2010 census, Quito's metro area is currently 4,217.95 km² with a population over 2,239,191 and is expected to increase to almost 2.8 million by 2020, making the city the most populous city in the country, overgrowing Guayaquil [24]. The city is contained within a number of valleys at 2,300–2,450 m.a.s.l. and terraces varying from 2,700 to 3,000 m.a.s.l. altitude. Due to Quito's location on the Equator, the city receives direct sunlight almost all year round, and due to its altitude Quito's climate is mild, spring-like all year round. The region has two seasons: dry (June–August, average precipitation 14 mm/month) and wet (September–May, average precipitation 59 mm/month), with most of the rainfall in the afternoons. Quito's temperature is almost constant, around 14.5°C, with the prevailing winds from the east. However, due to a complex terrain, the winds in the city are highly variable most of the year (dry season is windier), challenging weather prediction in the region.

Figure 1: Topographic map (b) of Quito's urban area (green areas) and Google Maps images (a) of the air quality measurement sites (red dots), Cotocollao and Belisario.

For the purpose of this study, the two northwestern air quality monitoring points are presented: Cotocollao and Belisario (see red dots in Figure 1). These districts were chosen to show the variation and complexity of the prediction of fine particulate matter trends even within a relatively small area of Quito with similar topographical characteristics (approximately the same altitude and directly east of the Pichincha volcano).

2.2. Air Quality Measurements: Monitoring Network and Instrumentation. The municipal office of environmental quality, Secretaria de Ambiente, has been collecting air quality and meteorological data since May 1, 2007, in several sites around the city. The measurement sites run by the Secretaria de Ambiente are located in representative areas throughout the city, varying by altitudes depending on municipal districts. We used the real meteorological and PM2.5 concentration data from the two most northwestern automatic data collection stations: Belisario (alt. 2,835 m.a.s.l., coord. 78°29′24″W, 0°10′48″S) and Cotocollao (alt. 2,739 m.a.s.l., coord. 78°29′50″W, 0°6′28″S) (see Figure 1). These two sites are approximately 9 km apart from each other. The Belisario measurement site is less than 100 m west of a busy road (Avenida America), 200 m northwest of a busy roundabout, and less than 1,000 m to the east of a major outer highway (Ave. Antonio Jose de Sucre), which runs along the west side of the city and is intended to reduce the traffic inside the city (Figure 1). The Cotocollao monitoring site is located in a residential area with only a few busier streets and the same outer highway (Ave. Antonio Jose de Sucre) 250 m to the north. Both monitoring sites are inside of the "Pico y Placa" zone, implemented in 2010, which, based on the last number of car license plates, limits rush hour traffic, reducing the number of personal vehicles by approximately 20% during the weekdays.

The monitoring stations are positioned on the roofs of relatively tall buildings. Fine particulate matter (PM2.5) measurements are conducted using instrumentation validated by the Environmental Protection Agency (EPA) of the United States. For PM2.5, Thermo Scientific FH62C14-DHS Continuous 5014i (EPA Number EQPM-0609-183) was used. The detection limit for this instrument is 5 μg/m³ for one-hour averaging. The aerosol data is collected at 10 s intervals, and from this 10 min, 1-hour, and 24-hour averages are calculated. The latter averaging data is presented in this work. Wind velocity is measured using MetOne 010C and wind direction using MetOne 020C instrumentation. The starting threshold of the wind speed and wind direction sensors is 0.22 m/s, and the accuracies are 0.07 m/s and 3°, respectively. The precipitation is measured using MetOne 382 and Thies Clima 54032007 equipment. All meteorological parameters have been validated using a Vaisala MAWS100 weather station.

3. Data Preparation

In this section, the method for the preparation of the data is presented in order to proceed with the classification. It includes refining steps to discard useless data, transformations to visually examine and understand the data, and creation of an averaged intensity map of the PM2.5 concentrations with respect to the selected meteorological parameters (wind and precipitation).

3.1. Data Refinement. For this study, we analyzed the data of six years, starting June 2007 and ending July 2013. The two datasets (one for each monitoring point) are composed of 2,223 instances. Each data point consists of 4 parameters indicating daily values of precipitation accumulation (mm), wind direction (0–360°), wind speed (m/s), and observed fine particle concentrations (μg/m³).

Figure 2: Data distribution for (a) Cotocollao and (b) Belisario in terms of wind direction, wind speed, precipitation (mm), and PM2.5 concentrations (color scale, 0 to >25 μg/m³). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

The datasets are cleaned by discarding data points that include any missing values. These data points represent 2.8% and 2.4% of the total data for Belisario and Cotocollao, respectively. It has been demonstrated that missing data of these magnitudes do not influence the classification performance [25]. In addition, considering the very low number of missing values, it is preferable to remove them instead of performing an interpolation, taking into account the following: (i) we proceed with an analysis on discrete variables (day-by-day) and not a time series forecasting and (ii) the PM2.5 concentrations vary considerably from one day to another. Weekend days are also removed from the dataset because the distribution of PM2.5 concentrations during the weekdays and weekends is very different for Quito. This could introduce an additional level of complexity in data classification, as during the weekdays there are clear rush hour peaks (morning and evening), while on Saturdays PM2.5 levels increase between late morning and late afternoon hours. In addition, Sundays can be identified by a drop of PM2.5 concentration. These patterns are dictated by human activity changes during the week, therefore clearly showing the dependence of PM2.5 on traffic. After cleaning, the final datasets are composed of 1,527 instances for Belisario and 1,536 instances for Cotocollao.
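The two cleaning steps described above (dropping incomplete records rather than interpolating, and dropping weekend days) can be sketched as follows. This is a minimal illustration in Python, not the Matlab procedure used in the study; the record layout and the field names (`day`, `precip_mm`, `wind_dir_deg`, `wind_speed_ms`, `pm25`) are assumptions introduced for the example.

```python
from datetime import date

# Hypothetical field names for the four daily parameters of one data point.
FIELDS = ("precip_mm", "wind_dir_deg", "wind_speed_ms", "pm25")


def clean_daily_records(records):
    """Keep only complete weekday records.

    Each record is a dict with a 'day' (datetime.date) and the four
    measurement fields listed in FIELDS; a missing value is None.
    """
    cleaned = []
    for rec in records:
        # Discard (rather than interpolate) any record with a missing value.
        if any(rec.get(field) is None for field in FIELDS):
            continue
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        if rec["day"].weekday() >= 5:
            continue
        cleaned.append(rec)
    return cleaned
```

Removing records instead of interpolating keeps every remaining instance an observed day, which matches the day-by-day (non-time-series) analysis described above.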

3.2. Data Transformation. To represent the data according to a wind rose plot, the linear scale of wind direction (0–360°) is transformed from polar to Cartesian coordinates, where angles increase clockwise and both 0° and 360° are north (N) (see Figure 2). This mathematical transformation (see (1)) permits a more accurate feature representation of the data for wind direction around the north axis. Otherwise, wind direction angles slightly higher than 0° and slightly lower than 360° would be considered as two opposing directions. This is useful for the classification models that are implemented in the next stage, since machine learning models perform better when there are continuous relationships between parameters (optimization, smoother clustering task) [26]. This transformation ensures both a valid and a more informative representation of the original data. In addition, this representation can be completed by the precipitation levels, which are plotted on the z-axis (Figure 2). The color range is mapped from concentrations of 0 μg/m³ to >25 μg/m³. The threshold of 25 μg/m³ indicates the values from which the 24-hour concentrations of PM2.5 are harmful according to international health standards.

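$$
\begin{aligned}
x &= \sin\left(\frac{\text{Wind Direction}}{360^{\circ}} \cdot 2\pi\right) \cdot \text{Wind Speed}, \\
y &= \cos\left(\frac{\text{Wind Direction}}{360^{\circ}} \cdot 2\pi\right) \cdot \text{Wind Speed}.
\end{aligned}
\tag{1}
$$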
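As an illustration of (1), the following short Python sketch maps a wind direction in degrees (clockwise from north) and a wind speed to Cartesian coordinates. The function name is hypothetical; the study itself was carried out in Matlab.

```python
import math


def wind_to_cartesian(direction_deg, speed_ms):
    """Apply transformation (1): north maps to +y and east to +x."""
    theta = direction_deg / 360.0 * 2.0 * math.pi
    x = math.sin(theta) * speed_ms
    y = math.cos(theta) * speed_ms
    return x, y
```

With this representation, directions of 1° and 359° at the same speed land next to each other on either side of the north axis, whereas on the linear 0–360° scale they would sit at opposite ends of the range.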

A visual inspection of the transformed data shows that the wind directions corresponding to precipitation are north (N) for Cotocollao (Figure 2(a)) and east (E) for Belisario (Figure 2(b)). The stronger winds tend to take place between south (S) and southeast (SE) for Cotocollao and between southwest (SW) and SE in Belisario. As expected, in both cases these stronger winds seem to account for relatively low levels of PM2.5.

3.3. Trend Analyses. In order to obtain general trends in the distribution of the PM2.5 concentrations as a function of wind speed and wind direction, the data are used to generate convolution-based spatial representations. Convolution-based models for spatial data have increased in popularity as a result of their flexibility in modeling spatial dependence and their ability to accommodate large datasets [27]. This generated Convolutional Generalization Model (CGM) [28] is an averaged value of the PM2.5 pollution level (PL) in which the regional quantity of influence per data point is modeled as a 2D Gaussian matrix (see (2)). A Gaussian convolution is applied (i) to spatially interpolate data in order to get a 2D representation from the points' coordinates calculated in (1) and (ii) to smooth the PL concentration values of this representation. A Gaussian kernel is used because it exhibits the quality of monotonic smoothing and, as there is no prior knowledge about the distribution, a kernel density function with high entropy minimizes the information transfer of the convolution step to the processed data [29]. This 2D Gaussian matrix is multiplied by the PL of the given data point and added to the CGM at the coordinates corresponding to the wind speed and direction of this point. Then, the quantity of influence is added to the point. The final step is to divide the total amount of each cell by the quantity of influence, which results in a generalized average value.

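$$
\mathrm{CGM}(\text{rows}, \text{columns}) = \mathrm{PL} \cdot \frac{1}{36}
\begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix}
\tag{2}
$$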
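The accumulation and normalization steps of the CGM can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' Matlab code: the grid size, the 4 m/s extent, the mapping of coordinates to cells, and the function name are all hypothetical, and the kernel normalization follows (2) as reconstructed here.

```python
KERNEL_1D = [1, 4, 6, 4, 1]
# 5 x 5 kernel from (2): outer product of the binomial row, scaled by 1/36
# so that the central weight equals 1.
KERNEL_2D = [[a * b / 36.0 for b in KERNEL_1D] for a in KERNEL_1D]


def build_cgm(points, grid_size=41, max_speed=4.0):
    """Average pollution level per grid cell, weighted by the kernel.

    points: iterable of (x, y, pl), where x and y come from (1) and pl is
    the PM2.5 pollution level. grid_size and max_speed are assumptions.
    Cells that received no influence are returned as None.
    """
    total = [[0.0] * grid_size for _ in range(grid_size)]
    influence = [[0.0] * grid_size for _ in range(grid_size)]
    scale = (grid_size - 1) / (2.0 * max_speed)
    for x, y, pl in points:
        col = round((x + max_speed) * scale)
        row = round((max_speed - y) * scale)  # north at the top of the grid
        for i, kernel_row in enumerate(KERNEL_2D):
            for j, weight in enumerate(kernel_row):
                r, c = row + i - 2, col + j - 2
                if 0 <= r < grid_size and 0 <= c < grid_size:
                    total[r][c] += pl * weight   # kernel multiplied by PL
                    influence[r][c] += weight    # quantity of influence
    # Final step: divide the accumulated total by the quantity of influence.
    return [
        [t / q if q > 0 else None for t, q in zip(total_row, infl_row)]
        for total_row, infl_row in zip(total, influence)
    ]
```

Dividing by the accumulated influence turns each cell into a weighted mean, so densely sampled regions of the wind rose are not reported as more polluted merely because they contain more data points.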

The general tendencies are as follows: (i) strong winds result in low PM2.5 concentrations and (ii) the strongest winds generally come from a similar direction (SE for Cotocollao and S for Belisario). The results of the CGMs for both sites are shown in Figure 3 as an overlay on top of the geographic location of their respective monitoring stations. Main highways are indicated in green. The highest concentrations of PM2.5 (from yellow to red) tend to be brought by the winds coming from these main highways. It should be noted that higher wind speeds for Cotocollao tend to be on the axis of Quito's former airport (grey-green area, center of the map; see Figure 3), currently transformed into a city park. This traffic- and structure-free corridor seems to accelerate wind speeds, which may explain the reduction of PM2.5 concentrations due to better ventilation of this part of the city.

During the study, average PM2.5 concentrations in Cotocollao and Belisario are 15.6 μg/m³ and 17.9 μg/m³, respectively, both exceeding the national standards. During the studied six years, the area of Belisario was more polluted, with more variation in PM2.5 concentrations (higher deviation; see Figure 4), and more turbulent (Figure 3) than Cotocollao. These factors could be the result of Belisario being more urbanized.

Figure 3: CGM visualization positioned on top of the geographic location of the respective monitoring stations (northwestern part of Quito). The northern CGM visualization is Cotocollao and the southern one is Belisario. Main highways are represented in green. The color scale ranges from 0 to >25 μg/m³.

Figure 4: Distribution of PM2.5 concentrations (June 2007 to July 2013) for Cotocollao and Belisario (density versus real value in μg/m³). The dashed black line represents the national standards and the class separation boundary (15 μg/m³).

4. Classification Models

Machine learning models are used to separate the data into different classes of PM2.5 concentrations. Supervised learning techniques are applied to create models for this classification task. Here, we introduce Boosted Trees (BTs) and Linear Support Vector Machines (L-SVM). A BT combines weak learners (simple rules) to create a classification algorithm, where each misclassified data point per learner gains weight. A following learner optimizes the classification of the highest weighted region. Boosted Trees are known for their insensitivity to overfitting and for the fact that nonlinear relationships between the parameters do not influence the performance. An L-SVM separates classes with optimal distance. Convex optimization leads the algorithm to not focus on local minima. As these two models are well established and exhibit different qualities, they are used in this section. All computations and visualizations are executed in MathWorks Matlab 2015. Toolboxes for the classifications, the statistics, and machine learning processes are used in all the stages. Furthermore, Matlab's integrated tools for distribution fitting and curve fitting are applied for the different analyses. The initial parameters provided by the Matlab toolbox software are used in this work. The AdaBoost learning method with a total of 30 learners, a maximum number of splits of 20, and a learning rate of 0.1 are the default parameters for the BT. The SVM is initialized with a linear kernel of scale 1.0, a box constraint level of 1.0, and an equal learning rate of 0.1.

Table 1: Binary classification with class separation at 15 μg/m³ (values in %).

Model      Belisario    Cotocollao
BT         83.2         67.6
L-SVM      79.8         66.3
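The reweighting idea behind Boosted Trees (each misclassified data point gains weight, so that the following learner concentrates on it) can be illustrated with one round of the textbook binary AdaBoost update. This sketch is not Matlab's internal implementation; in particular, applying the learning rate as a shrinkage factor on the learner weight is an assumption made for the example, and the function name is hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred, learning_rate=1.0):
    """One round of textbook AdaBoost reweighting for a binary problem.

    weights: current sample weights; y_true / y_pred: labels and the
    predictions of the current weak learner. Returns the normalized new
    weights and the learner weight alpha.
    """
    weight_sum = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / weight_sum
    err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
    # Assumption: the learning rate shrinks the learner weight.
    alpha = learning_rate * 0.5 * math.log((1.0 - err) / err)
    updated = [
        w * math.exp(alpha if t != p else -alpha)  # misclassified points gain weight
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(updated)
    return [w / norm for w in updated], alpha
```

For example, with four equally weighted samples of which one is misclassified, the misclassified sample ends up carrying half of the total weight in the next round when the learning rate is 1.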

Fluctuations in yearly PM2.5 concentrations are not taken into account in this classification process, as a previous analysis showed a small variation in fine particulate matter pollution levels during the studied period [5]. A binary classification is performed to set a baseline comparison between the different sites. Then, a three-class classification is carried out to assess the separability between three ranges of concentrations of PM2.5 (based on WHO guidelines) and provide insight into general classification rules.

4.1. Binary Classification. In this first classification, two classes are used, which represent values above and below 15 μg/m³. The latter value is selected as it is the National Air Quality Standard of Ecuador for annual PM2.5 concentrations (equivalent to WHO's Interim Target-3) [30]. Due to the normal distribution of the datasets, as shown in Figure 4, a higher accuracy for Belisario than Cotocollao is expected, partially because of a priori imbalanced class distribution. A previous study using the same classification shows an accuracy of only 65% for Cotocollao by applying the trees.J48 algorithm, which is a decision tree implementation integrated in the WEKA machine learning workbench [5].

Classification with both BT and L-SVM shows similar results. Table 1 presents the results of this first classification. The implementation of the classification for Belisario outperforms that of Cotocollao. It also suggests that the extreme levels (low and high) of PM2.5 could be more straightforward to classify with the current parameters, implying a higher class separability for the Belisario dataset (wider distribution). Tables 2 and 3 show that the concentrations above 15 μg/m³ for both sites are better classified than those below the 15 μg/m³ boundary. This is less surprising for Belisario due to the earlier mentioned class imbalance. For Cotocollao, however, the poor performance for this class can indicate that this class is less distinctive; thus the model optimizes the class above 15 μg/m³. Note that it is crucial to correctly classify nonattainment instances (PM2.5 > 15 μg/m³), whereas misclassifying days that comply with the national standard (PM2.5 < 15 μg/m³) is a less costly error.

Table 2: Confusion matrix of binary classification for Cotocollao using a BT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class    <15     >15     TPR/FNR
<15      51.1    48.9    51.1/48.9
>15      20.3    79.7    79.7/20.3

Table 3: Confusion matrix of binary classification for Belisario using a BT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class    <15     >15     TPR/FNR
<15      49.0    51.0    49.0/51.0
>15      5.1     94.9    94.9/5.1

In Figure 5, a comparison of Receiver Operating Characteristic (ROC) curves is shown for the binary classifiers presented in Table 1, namely, the BT and L-SVM classifiers. Figure 5(a) depicts the ROC curves for the Cotocollao dataset and Figure 5(b) the ROC curves for the Belisario dataset. Once the classifier models are built for every dataset, a validation set is presented to the model in order to predict the class label. It is also of interest to have the classification scores of the model, which indicate the likelihood that the predicted label comes from a particular class. The ROC curves are constructed with this scored classification and the true labels in the validation dataset (Figure 5).

Figure 5: ROC curves (sensitivity (%) versus specificity (%)) for Cotocollao (a), with L-SVM AUC = 56.2% and BT AUC = 59.1%, and Belisario (b), with L-SVM AUC = 65.9% and BT AUC = 71.8%.

ROC curves are useful to evaluate binary classifiers and to compare their performances in a two-dimensional graph that plots the specificity versus sensitivity. The specificity measures the true negative rate, that is, the proportion of negatives that have been correctly classified: true negatives/negatives = true negatives/(true negatives + false positives). Likewise, the sensitivity measures the true positive rate, that is, the proportion of positives correctly identified: true positives/positives = true positives/(true positives + false negatives). The area under the ROC curve (AUC) can be used as a measure of the expected performance of the classifier, and the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [31]. Figure 5(b) shows the performance of the BT and L-SVM classifiers for the Belisario dataset. The BT outperforms the L-SVM classifier in all regions of the ROC space, with [AUC(BT) = 0.72] > [AUC(L-SVM) = 0.66], which means a better performance for the BT classifier. The BT classifier has a fair performance separating the two classes in the Belisario dataset.
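The definitions of sensitivity, specificity, and AUC given above can be checked numerically with a small Python sketch. The AUC is computed directly from its ranking interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). Function names and the 0/1 label encoding are assumptions for the example.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels encoded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A classifier that scores instances at random yields an AUC close to 0.5 under this definition, which is the baseline against which the reported AUC values are compared.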

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, BT performs better than the L-SVM classifier, with [AUC(BT) = 0.59] > [AUC(L-SVM) = 0.56]. This time, the classifiers for the Cotocollao dataset have a poor performance separating the two classes, with a performance just slightly better than a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus, a three-class classification should identify whether, for both sites, the extreme concentrations could be better classified than the moderate ones and clarify the low performance for Cotocollao.

4.2. Three-Class Classification. To further analyze the differences of multiple categories of concentration levels, a three-class classification is performed using WHO's guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 μg/m³ (long term, annual WHO's recommended level), moderate if 10 μg/m³ < PM2.5 < 25 μg/m³, and high if PM2.5 > 25 μg/m³ (short term, 24-hour WHO's recommended level). The objective is to identify if these main pollution thresholds are indeed well separable and thus the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 μg/m³ and >25 μg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 μg/m³. Due to this fact, an alternative BT algorithm is used to take into account these imbalanced classes. This RusBoosted Tree (RBT) approach endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).

Table 4: Confusion matrix of three-class classification for Cotocollao using an RBT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class    <10     10–25   >25     TPR/FNR
<10      76.3    16.3    7.4     76.3/23.7
10–25    28.3    28.8    42.9    28.8/71.2
>25      6.3     20.3    73.4    73.4/26.6
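To make the class imbalance handling more concrete, the sketch below labels daily PM2.5 values with the three WHO-based classes and then applies random undersampling so that every class is represented by the same number of samples. This shows only the undersampling step that gives RUSBoost its name, not the full boosting procedure and not Matlab's implementation; the function names, the class names, and the handling of values exactly equal to 10 or 25 μg/m³ are assumptions.

```python
import random


def who_class(pm25):
    """Assign the three classes used above (boundary handling is an assumption)."""
    if pm25 < 10:
        return "low"
    if pm25 <= 25:
        return "moderate"
    return "high"


def random_undersample(samples, label_of, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample in samples:
        by_class.setdefault(label_of(sample), []).append(sample)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))  # sampling without replacement
    return balanced
```

In a boosting setting, a fresh balanced subset of this kind would typically be drawn before each weak learner is trained, so that the dominant 10–25 μg/m³ class does not drive every learner.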

Tables 4 and 5 show that the classification of concentrations <10 μg/m³ performs adequately. Also, the correct classification for concentrations >25 μg/m³ in Cotocollao is fair. However, the false positive rate of this classification is extremely high because 42.9% of the 10–25 μg/m³ class gets classified as class >25 μg/m³. For Belisario, the separation of classes 10–25 μg/m³ and >25 μg/m³ is deficient. In both cases, only the extreme low values can be classified well. Thus, the hypothesis of the extreme concentrations in PM2.5 being more straightforward to classify (see Section 4.1) is only partially verified.

Table 5: Confusion matrix of three-class classification for Belisario using an RBT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class    <10     10–25   >25     TPR/FNR
<10      84.8    9.5     5.7     84.8/15.2
10–25    12.3    53.5    34.2    53.5/46.5
>25      6.5     45.1    48.4    48.4/51.6

Analyzing the wrongly classified samples of class 10–25 μg/m³ shows that for samples classified as <10 μg/m³ the real values tend to be relatively close to 10 μg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 μg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 μg/m³. Even though for Belisario the mean is shifted, it is not evident that wrongly classified samples of class 10–25 μg/m³ into class >25 μg/m³ tend to be closer to values of 25 μg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the range of 10–25 μg/m³ and >25 μg/m³, which are poorly separable according to the three-class classification.

Figure 6: Wrongly classified samples of class 10–25 μg/m³ (classified as <10 μg/m³ or as >25 μg/m³) with their real value distributions (density versus real value in μg/m³) for Cotocollao (a) and Belisario (b).

These results show that values of 10–25 μg/m³ and >25 μg/m³ are not well separable and thus not largely influenced by the meteorological parameters used. On the contrary, lower values seem to be largely predictable from wind and precipitation conditions. This statement gains confidence by looking at the wrongly classified data points discussed previously (see Figure 6).

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well performing rules in classifying PM2.5 concentrations <10 μg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that rules separating classes <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 μg/m³ and >25 μg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 μg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 μg/m³.

It should be noted that for Cotocollao the performance increases drastically when comparing the binary classifications of <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario for these two classifications barely differs (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 μg/m³ class than for Belisario.
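To illustrate how rules of the kind listed in Table 6 could be applied to a single day of measurements, the sketch below encodes the two Cotocollao clusters for the <10 μg/m³ versus 10–25 μg/m³ case. Several points are assumptions rather than statements from the study: the thresholds are kept as parameters because the decimal placement in the extracted table is uncertain (especially for precipitation), the compass sectors are translated to degree ranges (S-SE as 135–180°, NW-NE as 315–45°), and the conditions within a cluster are taken to be combined with a logical AND.

```python
def in_sector(direction_deg, start_deg, end_deg):
    """True if the direction lies on the clockwise arc from start_deg to end_deg."""
    d = direction_deg % 360.0
    if start_deg <= end_deg:
        return start_deg <= d <= end_deg
    return d >= start_deg or d <= end_deg  # arc that wraps through north


def cotocollao_predicts_low(wind_speed_ms, wind_dir_deg, precip_mm,
                            speed_threshold=2.5, precip_threshold=15.0):
    """Hypothetical encoding of the two <10 ug/m3 clusters for Cotocollao."""
    # Cluster 1: strong winds from the S-SE sector (assumed 135-180 degrees).
    strong_south_easterly = (
        wind_speed_ms > speed_threshold and in_sector(wind_dir_deg, 135.0, 180.0)
    )
    # Cluster 2: rain with winds from the NW-NE sector (assumed 315-45 degrees).
    rainy_northerly = (
        precip_mm > precip_threshold and in_sector(wind_dir_deg, 315.0, 45.0)
    )
    return strong_south_easterly or rainy_northerly
```

The two clusters correspond to the two mechanisms discussed earlier: ventilation by strong winds and scavenging by precipitation.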

To sum up the outcomes of the classification models thebinary classification utilizing the National and InternationalAir Quality Standards as class labels (PM

25lt 15 120583gm3

PM25gt 15120583gm3) showed a high difference in performance

Journal of Electrical and Computer Engineering 9

Table 6 Classification rules and pairwise comparisons between the different classes and their respective performance

Classification LocationCotocollao Belisario

lt10 120583gm3 versus10ndash25120583gm3

Classification rulesWind speed gt 25msWind direction = S-SE Wind speed gt 22ms

Wind direction = SE-SWWind direction = NW-NEPrecipitation gt 15mm

Classification performance732 (Figure 7(a)) 867

lt10 120583gm3 versusgt25 120583gm3

Classification rulesWind speed gt 2ms

Wind direction = S-SE Wind speed gt 2msWind direction = SE-SWWind direction = NW-NE

Precipitation gt 1mmClassification performance

889 (Figure 7(b)) 88810ndash25120583gm3 versusgt25 120583gm3 600 641

NE

SW

35

20

10

5

550

Prec

ipita

tion

(mm

)

E

S

lt10 휇gm3

10ndash25휇gm3

(a)

5

55

NE

SW

35

20

10

0

Prec

ipita

tion

(mm

)

E

lt10 휇gm3

gt25 휇gm3

(b)

Figure 7 Data split for three different classes (see Table 6) (a) lt10 120583gm3 versus 10ndash25 120583gm3 and (b) lt10 120583gm3 versus gt25 120583gm3 Both (a)and (b) are results for Cotocollao mapped in terms of wind direction wind speed and precipitation The inner circle represents wind speedsup to 2ms and the outer circle represents wind speeds up to 4ms

between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks, as low (PM2.5 < 10 μg/m³), moderate (PM2.5 = 10–25 μg/m³), and high (PM2.5 > 25 μg/m³). This classification showed high performance in categorizing low concentrations in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that is not described by the classification analysis.


Figure 8: Decrease in average prediction error (μg/m³) with increasing parameter values, (a) precipitation (mm) and (b) wind speed (m/s), for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consist of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for given weather conditions. The analysis of the data trend over time also indicates whether time series forecasting is applicable. Finally, the CGM is used to comment on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different classifiers. Bin sizes of 0.5 μg/m³ (0–35 μg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the NN continuous output values, the discrete output values of the other models can have an effect on the classification error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
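The binning described above can be illustrated with a minimal Python sketch. Only the bin size (0.5 μg/m³) and the range (0–35 μg/m³) come from the text; the function names, the clipping of out-of-range values, and the decoding of a class back to its bin midpoint are assumptions made for illustration and are not taken from the Matlab setup used in the study.

```python
BIN_SIZE = 0.5    # ug/m3, bin size stated in the text
MAX_VALUE = 35.0  # ug/m3, upper end of the 0-35 range stated in the text


def to_bin(concentration, bin_size=BIN_SIZE, max_value=MAX_VALUE):
    """Map a continuous PM2.5 value to a discrete class index.

    Values outside the range are clipped (an assumption, not from the paper).
    """
    clipped = min(max(concentration, 0.0), max_value)
    n_bins = int(round(max_value / bin_size))
    return min(int(clipped // bin_size), n_bins - 1)


def bin_center(index, bin_size=BIN_SIZE):
    """Decode a class index to a concentration (bin midpoint, an assumption)."""
    return (index + 0.5) * bin_size


if __name__ == "__main__":
    value = 12.3
    label = to_bin(value)
    print(label, bin_center(label))
```

With this encoding, the rounding error introduced by the discrete output is at most half a bin (0.25 μg/m³), which is small compared with the prediction errors reported in this section.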

\[ \mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \tag{3} \]

The mean squared error (MSE) is used to measure the prediction performance (see (3)), where y_i is the real value, ŷ_i is the predicted value, and n is the number of predictions. The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error as a percentage of a data point's real value (see (4)). The MAPE provides a more intuitive understanding of the performance.

\[ \mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| (\hat{y}_i - y_i) / y_i \right|}{n} \tag{4} \]
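As a small worked example of (3) and (4), the following Python sketch computes both measures for a handful of values. The sample numbers are hypothetical and serve only to make the arithmetic concrete; the MAPE is returned as a fraction and can be multiplied by 100 to express it as a percentage.

```python
def mse(real, predicted):
    """Mean squared error: average of the squared prediction errors."""
    n = len(real)
    return sum((p - r) ** 2 for r, p in zip(real, predicted)) / n


def mape(real, predicted):
    """Mean absolute percentage error, returned as a fraction of the real value."""
    n = len(real)
    return sum(abs((p - r) / r) for r, p in zip(real, predicted)) / n


# Hypothetical daily PM2.5 values in ug/m3 (not data from the study).
real = [10.0, 20.0, 25.0]
predicted = [12.0, 18.0, 25.0]

if __name__ == "__main__":
    print(mse(real, predicted))         # (4 + 4 + 0) / 3
    print(100 * mape(real, predicted))  # (20% + 10% + 0%) / 3
```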

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is explained as the average prediction error (absolute difference between the real and the predicted values; root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing


Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed (thresholds: >1 mm and >2.5 m/s, respectively; see Table 6, thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations.

values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9 shows an example of the comparison of the predictive models of PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.001. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.001.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
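The two operations mentioned above, a 5-point box smoothing and a Pearson correlation between real and predicted values, can be sketched in plain Python as follows. The shrinking window at the edges of the series is an assumption, since the text does not state how the boundaries were handled, and the function names are illustrative only.

```python
import math


def box_smooth(values, window=5):
    """Centered moving average; the window shrinks at the series edges (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    print(box_smooth([1, 2, 3, 4, 5]))
    print(pearson_r([1, 2, 3], [2, 4, 6]))
```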

The results of the MSE for the regression show that in both city sites a NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 μg/m³ for Cotocollao and 19 μg/m³ for Belisario.

To sum up, the present input parameters do not describe well an increase in PM2.5 concentrations if these levels exceed 20 μg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels

Table 7: MSE and MAPE (in parentheses, %) of the NN, L-SVM, and BT on regression.

Model    Belisario    Cotocollao
NN       22.1 (26)    40.7 (40)
L-SVM    26.8 (28)    41.8 (41)
BT       28.5 (30)    44.4 (42)

Table 8: MSE and MAPE (in parentheses, %) of CGM and NN regression.

Model    Belisario    Cotocollao
CGM      15.6 (22)    15.0 (25)
NN       22.1 (26)    40.7 (40)

beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). Note that this reduction is particularly high in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development


Figure 10: Fitted lines representing the correlation between predicted values (μg/m³) and real values (μg/m³) through a NN algorithm for Cotocollao (orange) and Belisario (blue).

should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression improves performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions not only in forecasting air quality but also in the relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed at different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that mixes a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wildfires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Networks [13–15], regression [18], decision trees, and Support Vector Machines [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, "The contribution of outdoor air pollution sources to premature mortality on a global scale," Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, "Health effects of fine particulate air pollution: lines that connect," Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, "Machine learning approach to forecasting urban pollution: a case study of Quito," in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., "The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area," Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, "Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan basin and their relation to meteorological factors," Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, "Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan," International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., "Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events," Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., "A comprehensive evaluation of air pollution prediction improvement by a machine learning method," in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, "Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model," Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, "Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador)," in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, "Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data," Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, "Forecasting smog-related health hazard based on social media and physical sensor," Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, "Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong," International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, "A novel hybrid strategy for PM2.5 concentration analysis and prediction," Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, "Identifying pollution sources and predicting urban air quality using ensemble learning methods," Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, "Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches," Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, "Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations," Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, "Air quality prediction using optimal neural networks with stochastic variables," Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, "Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model," Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, "Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm," Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.


[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "Yale: rapid prototyping for complex data mining tasks," in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, "Some topics in convolution-based spatial modeling," in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, "A generalized convolution model and estimation for non-stationary random functions," Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, "Uniqueness of the Gaussian kernel for scale-space filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, "Ministerio Del Ambiente, Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015."

[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, "On the ability of the WRF model to reproduce the surface wind direction over complex terrain," Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., "The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations," Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., "Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model," Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


Figure 1: Topographic map (b) of Quito's urban area (green areas) and Google Maps images (a) of the air quality measurement sites (red dots): Cotocollao and Belisario.

For the purpose of this study, the two northwestern air quality monitoring points are presented: Cotocollao and Belisario (see red dots in Figure 1). These districts were chosen to show the variation and complexity of the prediction of fine particulate matter trends even within a relatively small area of Quito with similar topographical characteristics (approximately the same altitude and directly east of the Pichincha volcano).

2.2. Air Quality Measurements: Monitoring Network and Instrumentation. The municipal office of environmental quality, Secretaria de Ambiente, has been collecting air quality and meteorological data since May 1, 2007, in several sites around the city. The measurement sites run by the Secretaria de Ambiente are located in representative areas throughout the city, varying by altitudes depending on municipal districts. We used the real meteorological and PM2.5 concentration data from the two most northwestern automatic data collection stations: Belisario (alt. 2835 m.a.s.l., coord. 78°29'24''W, 0°10'48''S) and Cotocollao (alt. 2739 m.a.s.l., coord. 78°29'50''W, 0°6'28''S) (see Figure 1). These two sites are approximately 9 km apart from each other. The Belisario measurement site is less than 100 m west of a busy road (Avenida America), 200 m northwest of a busy roundabout, and less than 1000 m to the east of a major outer highway (Ave. Antonio Jose de Sucre), which runs along the west side of the city and is intended to reduce the traffic inside the city (Figure 1). The Cotocollao monitoring site is located in a residential area with only a few busier streets and the same outer highway (Ave. Antonio Jose de Sucre) 250 m to the north. Both monitoring sites are inside of the "Pico y Placa" zone, implemented in 2010, which, based on the last number of car license plates, limits rush hour traffic, reducing the number of personal vehicles by approximately 20% during the weekdays.

The monitoring stations are positioned on the roofs of relatively tall buildings. Fine particulate matter (PM2.5) measurements are conducted using instrumentation validated by the Environmental Protection Agency (EPA) of the United States. For PM2.5, Thermo Scientific FH62C14-DHS Continuous 5014i (EPA Number EQPM-0609-183) was used. The detection limit for this instrument is 5 μg/m³ for one-hour averaging. The aerosol data is collected at 10 s intervals, and from this 10 min, 1-hour, and 24-hour averages are calculated. The latter averaging data is presented in this work. Wind velocity is measured using MetOne 010C and wind direction using MetOne 020C instrumentation. The wind speed sensor and wind direction starting threshold is 0.22 m/s, and the accuracies are 0.07 m/s and 3°, respectively. The precipitation is measured using MetOne 382 and Thies Clima 54032007 equipment. All meteorological parameters have been validated using a Vaisala MAWS100 weather station.

3. Data Preparation

In this section, the method for the preparation of the data is presented, in order to proceed with the classification. It includes refining steps to discard useless data, transformations to visually examine and understand the data, and creation of an averaged intensity map of the PM2.5 concentrations with respect to the selected meteorological parameters (wind and precipitation).

3.1. Data Refinement. For this study, we analyzed the data of six years, starting June 2007 and ending July 2013. The two


Figure 2: Data distribution for (a) Cotocollao and (b) Belisario in terms of wind direction, wind speed, precipitation (mm), and PM2.5 concentrations (color scale, 0 to >25 μg/m³). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

datasets (one for each monitoring point) are composed of 2223 instances. Each data point consists of 4 parameters, indicating daily values of precipitation accumulation (mm), wind direction (0–360°), wind speed (m/s), and observed fine particle concentrations (μg/m³).

The datasets are cleaned by discarding data points that include any missing values. These data points represent 2.8% and 2.4% of the total data for Belisario and Cotocollao, respectively. It has been demonstrated that missing data of these magnitudes do not influence the classification performance [25]. In addition, considering the very low number of missing values, it is preferable to remove them instead of performing an interpolation, taking into account the following: (i) we proceed with an analysis on discrete variables (day-by-day) and not a time series forecasting and (ii) the PM2.5 concentrations are very inconstant from one day to another. Weekend days are also removed from the dataset, because the distribution of PM2.5 concentrations during the weekdays and weekends is very different for Quito. This could introduce an additional level of complexity in data classification, as during the weekdays there are clear rush hour peaks (morning and evening), while on Saturdays PM2.5 levels increase between late morning and late afternoon hours. In addition, Sundays can be identified by a drop of PM2.5 concentration. These patterns are dictated by human activity changes during the week, therefore clearly showing the dependence of PM2.5 on traffic. After cleaning, the final datasets are composed of 1527 instances for Belisario and 1536 instances for Cotocollao.
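The two cleaning rules described above (discard any record with a missing value, discard weekend days) can be sketched in Python as follows. The record layout, the use of None for a missing value, and the sample rows are hypothetical and chosen only for illustration; they are not the format of the monitoring network's files.

```python
from datetime import date


def clean(records):
    """Keep only weekday records in which no value is missing.

    Each record is a tuple (day, precipitation_mm, wind_direction_deg,
    wind_speed_ms, pm25); None marks a missing value (assumed layout).
    """
    kept = []
    for record in records:
        day = record[0]
        if any(value is None for value in record[1:]):
            continue  # discard data points with any missing value
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            continue  # discard weekend days
        kept.append(record)
    return kept


# Hypothetical sample rows, not data from the study.
sample = [
    (date(2008, 1, 7), 0.0, 140.0, 2.6, 9.5),    # Monday, complete: kept
    (date(2008, 1, 8), None, 150.0, 2.1, 14.2),  # missing precipitation: dropped
    (date(2008, 1, 12), 1.2, 10.0, 1.4, 18.0),   # Saturday: dropped
]

if __name__ == "__main__":
    print(clean(sample))
```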

3.2. Data Transformation. To represent the data according to a wind rose plot, the linear scale of wind direction (0–360°) is transformed from polar to Cartesian coordinates, where angles increase clockwise and both 0° and 360° are north (N) (see Figure 2). This mathematical transformation (see (1)) permits a more accurate feature representation of the data for wind direction around the north axis. Otherwise, wind direction angles slightly higher than 0° and slightly lower than 360° would be considered as two opposing directions. This is useful for the classification models that are implemented in the next stage, since machine learning models improve performance if there are continuous relationships between parameters (optimization, smoother clustering task) [26]. This transformation ensures both a valid and a more informative representation of the original data. In addition, this representation can be completed by the precipitation levels, which are plotted on the z-axis (Figure 2). The color range is mapped from concentrations of 0 μg/m³ to >25 μg/m³. The threshold of 25 μg/m³ indicates the values from which the 24-hour concentrations of PM2.5 are harmful, according to international health standards.

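\[
\begin{aligned}
x &= \sin\left(\frac{\text{Wind Direction}}{360^\circ} \cdot 2\pi\right) \cdot \text{Wind Speed}, \\
y &= \cos\left(\frac{\text{Wind Direction}}{360^\circ} \cdot 2\pi\right) \cdot \text{Wind Speed}.
\end{aligned}
\tag{1}
\]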
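The transformation in (1) can be written as a short Python function. The function name is illustrative; the formula itself follows (1), with north at 0° and angles increasing clockwise. The example also shows why the transformation helps: directions of 1° and 359°, which are far apart on the linear scale, end up as neighboring points.

```python
import math


def wind_to_xy(direction_deg, speed):
    """Convert wind direction (degrees, clockwise from north) and speed to (x, y)."""
    angle = direction_deg / 360.0 * 2.0 * math.pi
    return math.sin(angle) * speed, math.cos(angle) * speed


if __name__ == "__main__":
    print(wind_to_xy(0.0, 2.0))    # northerly direction lies on the positive y-axis
    print(wind_to_xy(90.0, 2.0))   # easterly direction lies on the positive x-axis
    x1, y1 = wind_to_xy(1.0, 2.0)
    x2, y2 = wind_to_xy(359.0, 2.0)
    print(math.hypot(x1 - x2, y1 - y2))  # small distance across the north axis
```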

A visual inspection of the transformed data shows that the wind directions corresponding to precipitation are north (N) for Cotocollao (Figure 2(a)) and east (E) for Belisario (Figure 2(b)). The stronger winds tend to take place between south (S) and southeast (SE) for Cotocollao and between southwest (SW) and SE in Belisario. As expected, in both cases these stronger winds seem to account for relatively low levels of PM2.5.

3.3. Trend Analyses. In order to obtain general trends in the distribution of the PM2.5 concentrations as a function of wind speed and wind direction, the data are used to generate convolution-based spatial representations. Convolution-based models for spatial data have increased in popularity as a result of their flexibility in modeling spatial dependence and their ability to accommodate large datasets [27]. This generated Convolutional Generalization Model (CGM) [28] is an averaged value of the PM2.5 pollution level (PL), in which the regional quantity of influence per data point is modeled as a 2D Gaussian matrix (see (2)). A Gaussian convolution is applied (i) to spatially interpolate data in order to get a 2D representation from the points' coordinates calculated in (1) and (ii) to smooth the PL concentration values of this representation. A Gaussian kernel is used because it exhibits the quality of monotonic smoothing and, as there is no prior knowledge about the distribution, a kernel density function with high entropy minimizes the information transfer of the convolution step to the processed data [29]. This 2D Gaussian matrix is multiplied by the PL of the given data point and added to the CGM at the coordinates corresponding to the wind speed and direction of this point. Then, the quantity of influence is added to the point. The final step is to divide the total amount of each cell by the quantity of influence, which results in a generalized average value.

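\[
\mathrm{CGM}(\text{rows}, \text{columns}) = \mathrm{PL} \cdot \frac{1}{36}
\begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix}
\tag{2}
\]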
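The accumulate-then-divide procedure described above can be sketched in plain Python as follows. This is only an illustrative reading of the text, not the authors' Matlab implementation: the grid size, the cell width `cell`, the rounding of coordinates to the nearest cell, and the function name `build_cgm` are all assumptions. Note that any constant factor in front of the kernel cancels in the final division.

```python
# Binomial row used to build the 5x5 Gaussian-like kernel of (2).
KERNEL_1D = [1, 4, 6, 4, 1]
# Outer product of the row with itself, scaled so that the centre weight is 1.
KERNEL = [[a * b / 36.0 for b in KERNEL_1D] for a in KERNEL_1D]


def build_cgm(points, size=41, cell=0.25):
    """Average pollution level per grid cell.

    points: iterable of (x, y, pl) with x, y from (1) and pl the PM2.5 level.
    size:   number of cells per side (assumed, odd so the origin is centred).
    cell:   width of one cell in m/s (assumed).
    """
    total = [[0.0] * size for _ in range(size)]      # kernel-weighted PL sums
    influence = [[0.0] * size for _ in range(size)]  # summed kernel weights
    half = size // 2
    for x, y, pl in points:
        col = half + int(round(x / cell))
        row = half - int(round(y / cell))  # north (positive y) is up
        for i in range(5):
            for j in range(5):
                r, c = row + i - 2, col + j - 2
                if 0 <= r < size and 0 <= c < size:
                    w = KERNEL[i][j]
                    total[r][c] += w * pl
                    influence[r][c] += w
    # Cells that received no influence stay undefined (None).
    return [
        [total[r][c] / influence[r][c] if influence[r][c] > 0 else None
         for c in range(size)]
        for r in range(size)
    ]
```

With a single data point, every cell under the kernel simply takes that point's PL; with several overlapping points, each cell becomes a weighted average in which closer points count more.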

The general tendencies are as follows: (i) strong winds result in low PM2.5 concentrations and (ii) the strongest winds generally come from a similar direction (SE for Cotocollao and S for Belisario). The results of CGMs for both sites are shown in Figure 3 as an overlay on top of the geographic location of their respective monitoring stations. Main highways are indicated in green. The highest concentrations of PM2.5 (from yellow to red) tend to be brought by the winds coming from these main highways. It is worth noting that higher wind speeds for Cotocollao tend to be on the axis of Quito's former airport (grey-green area, center of the map; see Figure 3), currently transformed into a city park. This traffic- and structure-free corridor seems to accelerate wind speeds, which may explain the reduction of PM2.5 concentrations due to better ventilation of this part of the city.

During the study, average PM2.5 concentrations in Cotocollao and Belisario are 15.6 µg/m³ and 17.9 µg/m³, respectively, both exceeding the national standards. During the studied six years, the area of Belisario was more polluted, with more variation in PM2.5 concentrations (higher deviation; see Figure 4), and more turbulent (Figure 3) than Cotocollao. These factors could be the result of Belisario being more urbanized.

Figure 3: CGM visualization positioned on top of the geographic location of the respective monitoring stations (northwestern part of Quito). The northern CGM visualization is Cotocollao and the southern one is Belisario. Main highways are represented in green. (Map scale bar: 1 km; color scale: PM2.5 from 0 to >25 µg/m³.)

Figure 4: Distribution of PM2.5 concentrations (June 2007 to July 2013) for Cotocollao and Belisario. Dashed black line represents the national standards and the class separation boundary (15 µg/m³). (Axes: real value (µg/m³) versus density.)

4. Classification Models

Machine learning models are used to separate the data into different classes of PM2.5 concentrations. Supervised learning techniques are applied to create models for this classification task. Here we introduce Boosted Trees (BTs) and Linear Support Vector Machines (L-SVM). A BT combines weak learners (simple rules) to create a classification algorithm, where each data point misclassified by a learner gains weight. A following learner optimizes the classification of the highest weighted region. Boosted Trees are known for their insensitivity to overfitting and for the fact that nonlinear relationships between the parameters do not influence the performance. An L-SVM separates classes with optimal distance. Convex optimization leads the algorithm to not focus on local minima. As these two models are well established and exhibit different qualities, they are used in this section.

Table 1: Binary classification accuracy (%) with class separation at 15 µg/m³.

Model    Belisario    Cotocollao
BT       83.2         67.6
L-SVM    79.8         66.3

All computations and visualizations are executed in MathWorks Matlab 2015. The toolboxes for classification, statistics, and machine learning are used in all the stages. Furthermore, Matlab's integrated tools for distribution fitting and curve fitting are applied for the different analyses. The initial parameters provided by the Matlab toolbox software are used in this work. The default parameters for the BT are the AdaBoost learning method with a total of 30 learners, a maximum number of splits of 20, and a learning rate of 0.1. The SVM is initialized with a linear kernel of scale 1.0, a box constraint level of 1.0, and an equal learning rate of 0.1.
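To make the re-weighting idea behind Boosted Trees concrete, the following sketch shows one round of the textbook discrete AdaBoost weight update. It is an assumption-laden illustration: it omits the learning rate (shrinkage) and the tree learners themselves, and it is not claimed to match the exact update performed by the Matlab toolbox. The function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost re-weighting (labels are -1 or +1).

    Returns the vote (alpha) of the current weak learner and the new,
    normalised sample weights.
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified points (t * p == -1) are scaled up, correct ones down.
    new = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    z = sum(new)
    return alpha, [w / z for w in new]


# Four equally weighted samples; the weak learner gets the second one wrong.
alpha, weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, -1, -1, -1])
print(alpha, weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.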

Fluctuations in yearly PM2.5 concentrations are not taken into account in this classification process, as a previous analysis showed a small variation in fine particulate matter pollution levels during the studied period [5]. A binary classification is performed to set a baseline comparison between the different sites. Then a three-class classification is carried out to assess the separability between three ranges of concentrations of PM2.5 (based on WHO guidelines) and provide insight into general classification rules.

4.1. Binary Classification. In this first classification, two classes are used, which represent values above and below 15 µg/m³. This value is selected as it is the National Air Quality Standard of Ecuador for annual PM2.5 concentrations (equivalent to WHO's Interim Target-3) [30]. Due to the normal distribution of the datasets, as shown in Figure 4, a higher accuracy for Belisario than Cotocollao is expected, partially because of a priori imbalanced class distribution. A previous study using the same classification shows an accuracy of only 65% for Cotocollao by applying the trees.J48 algorithm, which is a decision tree implementation integrated in the WEKA machine learning workbench [5].

Classification with both BT and L-SVM shows similar results. Table 1 presents the results of this first classification. The classification for Belisario outperforms that of Cotocollao. It also suggests that the extreme levels (low and high) of PM2.5 could be more straightforward to classify with the current parameters, implying a higher class separability for the Belisario dataset (wider distribution). Tables 2 and 3 show that the concentrations above 15 µg/m³ for both sites are better classified than those below the 15 µg/m³ boundary. This is less surprising for Belisario due to the earlier mentioned class imbalance. For Cotocollao, however, the poor performance for this class can indicate that this class is less distinctive; thus the model optimizes the class above 15 µg/m³. Note that it is crucial to be able to classify nonattainment (PM2.5 > 15 µg/m³) instances, as wrongly identifying levels that do not violate national standards (PM2.5 < 15 µg/m³) would be a less costly error.

Table 2: Confusion matrix of binary classification for Cotocollao using a BT. Rows represent the true class and columns represent the predicted class (values in %).

Class (µg/m³)    <15     >15     TPR/FNR
<15              51.1    48.9    51.1/48.9
>15              20.3    79.7    79.7/20.3

Table 3: Confusion matrix of binary classification for Belisario using a BT. Rows represent the true class and columns represent the predicted class (values in %).

Class (µg/m³)    <15     >15     TPR/FNR
<15              49.0    51.0    49.0/51.0
>15              5.1     94.9    94.9/5.1

Figure 5 compares the Receiver Operating Characteristic (ROC) curves of the binary classifiers presented in Table 1, namely, the BT and L-SVM classifiers. Figure 5(a) depicts the ROC curves for the Cotocollao dataset and Figure 5(b) the ROC curves for the Belisario dataset. Once the classifier models are built for every dataset, a validation set is presented to the model in order to predict the class label. It is also of interest to have the classification scores of the model, which indicate the likelihood that the predicted label comes from a particular class. The ROC curves are constructed with this scored classification and the true labels in the validation dataset (Figure 5).

ROC curves are useful to evaluate binary classifiers and to compare their performances in a two-dimensional graph that plots the specificity versus sensitivity. The specificity measures the true negative rate, that is, the proportion of negatives that have been correctly classified: true negatives/negatives = true negatives/(true negatives + false positives). Likewise, the sensitivity measures the true positive rate, that is, the proportion of positives correctly identified: true positives/positives = true positives/(true positives + false negatives). The area under the ROC curve (AUC) can be used as a measure of the expected performance of the classifier, and the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [31]. Figure 5(b) shows the performance of the BT and L-SVM classifiers for the Belisario dataset. The BT outperforms the L-SVM classifier in all regions of the ROC space, with [AUC(BT) = 0.72] > [AUC(L-SVM) = 0.66], which means a better performance for the BT classifier. The BT classifier has a fair performance separating the two classes in the Belisario dataset.

Figure 5: ROC curves for Cotocollao (a) and Belisario (b). (Axes: specificity (%) versus sensitivity (%). Panel (a): L-SVM AUC = 56.2%, BT AUC = 59.1%. Panel (b): L-SVM AUC = 65.9%, BT AUC = 71.8%.)

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, BT performs better than the L-SVM classifier, with [AUC(BT) = 0.59] > [AUC(L-SVM) = 0.56]. This time the classifiers for the Cotocollao dataset have a poor performance separating the two classes, only slightly better than a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus a three-class classification should identify if, for both sites, the extreme concentrations could be better classified than the moderate ones and clarify the low performance for Cotocollao.
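The ranking interpretation of the AUC mentioned above can be checked directly on a small set of scores. The sketch below counts, over all positive/negative pairs, how often the positive instance receives the higher score (ties count as one half). The function name `auc_by_ranking` and the toy scores are assumptions for illustration; they are not taken from the study data.

```python
def auc_by_ranking(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    scores: classifier scores (higher means more likely positive).
    labels: 1 for positive instances, 0 for negative instances.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: one positive (score 0.2) is ranked below one negative (0.8).
print(auc_by_ranking([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

A classifier that assigns the same score to every instance obtains 0.5 under this definition, which corresponds to the random baseline used in the comparison above.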

4.2. Three-Class Classification. To further analyze the differences of multiple categories of concentration levels, a three-class classification is performed using WHO's guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 µg/m³ (long-term, annual WHO recommended level), moderate if 10 µg/m³ < PM2.5 < 25 µg/m³, and high if PM2.5 > 25 µg/m³ (short-term, 24-hour WHO recommended level). The objective is to identify if these main pollution thresholds are indeed well separable and thus whether the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 µg/m³ and >25 µg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 µg/m³. Due to this fact, an alternative BT algorithm is used to take into account these imbalanced classes. This RUSBoosted Tree (RBT) approach endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).

Table 4: Confusion matrix of three-class classification for Cotocollao using an RBT. Rows represent the true class and columns represent the predicted class (values in %).

Class (µg/m³)    <10     10–25    >25     TPR/FNR
<10              76.3    16.3     7.4     76.3/23.7
10–25            28.3    28.8     42.9    28.8/71.2
>25              6.3     20.3     73.4    73.4/26.6
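The row-wise rates reported in the confusion matrices can be obtained from true and predicted labels as sketched below. This is a generic illustration, assuming that each row is normalised by the number of samples of its true class; the function name `confusion_rates` and the toy labels are hypothetical and unrelated to the study data.

```python
def confusion_rates(y_true, y_pred, classes):
    """Row-normalised confusion matrix (in %) with TPR and FNR per true class."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    table = {}
    for t in classes:
        n = sum(counts[t].values())
        if n == 0:
            continue  # no sample of this true class; rates are undefined
        row = {p: 100.0 * counts[t][p] / n for p in classes}
        tpr = row[t]              # share of the class that is correctly predicted
        table[t] = (row, tpr, 100.0 - tpr)
    return table


# Toy labels for the three concentration classes.
classes = ["<10", "10-25", ">25"]
y_true = ["<10", "<10", "10-25", "10-25", "10-25", "10-25", ">25", ">25"]
y_pred = ["<10", "10-25", "<10", "10-25", ">25", ">25", ">25", "10-25"]
for cls, (row, tpr, fnr) in confusion_rates(y_true, y_pred, classes).items():
    print(cls, row, tpr, fnr)
```

Reading the matrix row by row in this way makes the effect of class imbalance visible, since a small class can show a low TPR even when the overall accuracy is high.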

Tables 4 and 5 show that the classification of concentrations <10 µg/m³ seems to perform adequately. Also, the correct classification for concentrations >25 µg/m³ in Cotocollao is fair. However, the false positive rate of this classification is extremely high, because 42.9% of the 10–25 µg/m³ class gets classified as class >25 µg/m³. For Belisario, the separation of classes 10–25 µg/m³ and >25 µg/m³ is deficient. In both cases, only the extreme low values can be classified well. Thus the hypothesis of the extreme concentrations in PM2.5 being more straightforward to classify (see Section 4.1) is only partially verified.

Analyzing the wrongly classified samples of class 10–25 µg/m³ shows that, for samples classified as <10 µg/m³, the real values tend to be relatively close to 10 µg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 µg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 µg/m³. Even though for Belisario the mean is shifted, it is not evident that samples of class 10–25 µg/m³ wrongly classified into class >25 µg/m³ tend to be closer to values of 25 µg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the range of 10–25 µg/m³ and >25 µg/m³, which are poorly separable according to the three-class classification.

These results show that values of 10–25 µg/m³ and >25 µg/m³ are not well separable and thus not largely influenced by the meteorological parameters used. On the contrary, lower values seem to be largely predictable by wind and precipitation conditions. This statement gains confidence by looking at the wrongly classified data points discussed previously (see Figure 6).

Figure 6: Wrongly classified samples of class 10–25 µg/m³ with their real value distributions for Cotocollao (a) and Belisario (b). (Axes: real value (µg/m³) versus density; curves: 10–25 µg/m³ classified as <10 µg/m³ and 10–25 µg/m³ classified as >25 µg/m³.)

Table 5: Confusion matrix of three-class classification for Belisario using an RBT. Rows represent the true class and columns represent the predicted class (values in %).

Class (µg/m³)    <10     10–25    >25     TPR/FNR
<10              84.8    9.5      5.7     84.8/15.2
10–25            12.3    53.5     34.2    53.5/46.5
>25              6.5     45.1     48.4    48.4/51.6

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here the well-performing rules in classifying PM2.5 concentrations <10 µg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that rules separating classes <10 µg/m³ versus 10–25 µg/m³ and <10 µg/m³ versus >25 µg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 µg/m³ and >25 µg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 µg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 µg/m³.

It is worth noting that for Cotocollao the performance increases drastically when comparing the binary classifications of <10 µg/m³ versus 10–25 µg/m³ and <10 µg/m³ versus >25 µg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario hardly differs between these two classifications (86.7% and 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 µg/m³ class than for Belisario.

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 µg/m³; PM2.5 > 15 µg/m³) showed a large difference in performance between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks, as low (PM2.5 < 10 µg/m³), moderate (PM2.5 = 10–25 µg/m³), and high (PM2.5 > 25 µg/m³). This classification showed high performance in categorizing low concentrations, in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that are not described by the classification analysis.

Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

<10 µg/m³ versus 10–25 µg/m³
  Classification rules, Cotocollao: wind speed > 2.5 m/s, wind direction = S-SE; wind direction = NW-NE, precipitation > 1.5 mm
  Classification rules, Belisario: wind speed > 2.2 m/s, wind direction = SE-SW
  Classification performance: Cotocollao 73.2% (Figure 7(a)); Belisario 86.7%

<10 µg/m³ versus >25 µg/m³
  Classification rules, Cotocollao: wind speed > 2 m/s, wind direction = S-SE; wind direction = NW-NE, precipitation > 1 mm
  Classification rules, Belisario: wind speed > 2 m/s, wind direction = SE-SW
  Classification performance: Cotocollao 88.9% (Figure 7(b)); Belisario 88.8%

10–25 µg/m³ versus >25 µg/m³
  Classification performance: Cotocollao 60.0%; Belisario 64.1%

Figure 7: Data split for three different classes (see Table 6): (a) <10 µg/m³ versus 10–25 µg/m³ and (b) <10 µg/m³ versus >25 µg/m³. Both (a) and (b) are results for Cotocollao mapped in terms of wind direction, wind speed, and precipitation. The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

Figure 8: Decrease in average prediction error with increasing parameter values (precipitation and wind speed) for Cotocollao (orange) and Belisario (blue). (Panel (a): average error (µg/m³) versus precipitation (mm); panel (b): average error (µg/m³) versus wind speed (m/s).)

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for determined weather conditions. Also, the analysis of the data trend over time will inform on the applicability of a time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different models. Bin sizes of 0.5 µg/m³ (0–35 µg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the classification error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
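The binning step can be illustrated with the short sketch below, which converts a continuous concentration into a class index and back. It assumes that a predicted bin is decoded by its centre and that values outside the 0–35 µg/m³ range are clipped; neither detail is stated in the text, and the function names `to_bin` and `bin_center` are hypothetical.

```python
def to_bin(value, width=0.5, low=0.0, high=35.0):
    """Index of the bin of the given width that contains the value."""
    # Clip to the modelled range so that every value maps to a valid bin.
    clipped = min(max(value, low), high - 1e-9)
    return int((clipped - low) // width)


def bin_center(index, width=0.5, low=0.0):
    """Continuous value represented by a bin (assumed: its centre)."""
    return low + (index + 0.5) * width


# A concentration of 17.3 ug/m3 falls into bin 34, decoded as 17.25 ug/m3.
index = to_bin(17.3)
print(index, bin_center(index))
```

Under these assumptions the discretisation alone introduces an error of at most half a bin width (0.25 µg/m³) for values inside the range, which is consistent with the expectation that this source of error is marginal.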

\[
\mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
\tag{3}
\]

The mean squared error (MSE) is used to measure the classification performance (see (3)). The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error in terms of percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance. Here $y_i$ denotes the real value, $\hat{y}_i$ the predicted value, and $n$ the number of predictions.

\[
\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left(y_i - \hat{y}_i\right) / y_i \right|}{n}
\tag{4}
\]
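As a small worked example of (3) and (4), the sketch below computes both error measures for two hypothetical predictions. The function names and the numbers are assumptions for illustration only; the MAPE is returned as a fraction and can be multiplied by 100 to express it as a percentage.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, relative to the real values."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical real and predicted PM2.5 concentrations (ug/m3).
real = [10.0, 20.0]
predicted = [12.0, 15.0]
print(mse(real, predicted))           # (2^2 + 5^2) / 2
print(100.0 * mape(real, predicted))  # (20% + 25%) / 2
```

The example shows why the two measures complement each other: the error of 5 µg/m³ dominates the MSE, whereas both errors contribute similarly to the MAPE because each is scaled by its real value.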

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is explained as the average prediction error (absolute difference between the real and the predicted values; root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9 shows an example of the comparison of the predictive models of PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.001. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.001.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. (Axes: day counter versus PM2.5 concentration (µg/m³), precipitation (mm), and wind speed (m/s).)

The results of the MSE for the regression show that in both city sites a NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 µg/m³ for Cotocollao and 19 µg/m³ for Belisario.

To sum up, the present input parameters do not describe well an increase in PM2.5 concentrations once these levels exceed 20 µg/m³, as errors increase at this point and prediction values stagnate. Thus additional parameters must be considered for the prediction of PM2.5 levels beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

Table 7: MSE and MAPE (%, in parentheses) of the NN, L-SVM, and BT on regression.

Model    Belisario    Cotocollao
NN       22.1 (26)    40.7 (40)
L-SVM    26.8 (28)    41.8 (41)
BT       28.5 (30)    44.4 (42)

Table 8: MSE and MAPE (%, in parentheses) of CGM and NN regression.

Model    Belisario    Cotocollao
CGM      15.6 (22)    15.0 (25)
NN       22.1 (26)    40.7 (40)

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).
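For readers unfamiliar with the procedure, a k-fold cross-validation split can be sketched as follows. This is a generic illustration of the technique, not the Matlab routine used in the study; the shuffling seed, the way the folds are formed, and the function name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


# Each sample is used for validation exactly once across the 10 folds.
for train, validation in k_fold_indices(25, k=10):
    print(len(train), len(validation))
```

Averaging an error measure such as the MSE over the k validation folds gives an estimate that does not depend on a single, possibly lucky, train/validation split.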

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). It is worth noting that this reduction is particularly high in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue). (Axes: real value (µg/m³) versus prediction (µg/m³).)

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 µg/m³) versus high (>25 µg/m³) and low (<10 µg/m³) versus moderate (10–25 µg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 µg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 µg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression shows improved performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make a long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions in forecasting not only air quality but also relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on the three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, by considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 µg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed during different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wild fires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 µg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world’s poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising.

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM ’16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen, et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang, et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang, et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL ’15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User’s Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human Development Report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.

14 Journal of Electrical and Computer Engineering

[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier, et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak, et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


Figure 2: Data distribution for (a) Cotocollao and (b) Belisario in terms of wind direction, wind speed, precipitation (mm, vertical axis), and PM2.5 concentrations (color scale, 0 to >25 μg/m³). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

datasets (one for each monitoring point) are composed of 2223 instances. Each data point consists of 4 parameters indicating daily values of precipitation accumulation (mm), wind direction (0–360°), wind speed (m/s), and observed fine particle concentrations (μg/m³).

The datasets are cleaned by discarding data points that include any missing values. These data points represent 2.8% and 2.4% of the total data for Belisario and Cotocollao, respectively. It has been demonstrated that missing data of these magnitudes do not influence the classification performance [25]. In addition, considering the very low number of missing values, it is preferable to remove them instead of performing an interpolation, taking into account the following: (i) we proceed with an analysis on discrete variables (day-by-day) and not a time series forecasting and (ii) the PM2.5 concentrations vary strongly from one day to another. Weekend days are also removed from the dataset because the distribution of PM2.5 concentrations during weekdays and weekends is very different for Quito. Keeping them could introduce an additional level of complexity in data classification, as during the weekdays there are clear rush hour peaks (morning and evening), while on Saturdays PM2.5 levels increase between late morning and late afternoon hours. In addition, Sundays can be identified by a drop in PM2.5 concentration. These patterns are dictated by changes in human activity during the week, and therefore clearly show the dependence of PM2.5 on traffic. After cleaning, the final datasets are composed of 1527 instances for Belisario and 1536 instances for Cotocollao.
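The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Matlab workflow: the record layout (a date followed by four measurements, with `None` marking a missing value), the function name `clean_records`, and the sample values are all assumptions made for the example.

```python
from datetime import date

def clean_records(records):
    """Drop records with any missing value and records that fall on a weekend.

    Each record is a tuple (day, precipitation_mm, wind_direction_deg,
    wind_speed_ms, pm25_ugm3); None marks a missing measurement.
    This layout is hypothetical and only serves the illustration.
    """
    cleaned = []
    for record in records:
        day = record[0]
        if any(value is None for value in record[1:]):
            continue  # discard incomplete data points instead of interpolating
        if day.weekday() >= 5:
            continue  # weekday() returns 5 for Saturday and 6 for Sunday
        cleaned.append(record)
    return cleaned

# Made-up sample records, not data from the study.
sample = [
    (date(2010, 3, 1), 0.0, 140.0, 2.6, 9.5),    # Monday, complete
    (date(2010, 3, 2), 1.2, None, 1.8, 17.0),    # Tuesday, missing wind direction
    (date(2010, 3, 6), 0.0, 200.0, 2.0, 16.0),   # Saturday
    (date(2010, 3, 8), 3.4, 350.0, 1.5, 14.2),   # Monday, complete
]
print(len(clean_records(sample)))
```

Only the two complete weekday records survive, which mirrors the order of operations in the text: incomplete days are discarded rather than interpolated, and weekend days are excluded.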

3.2. Data Transformation. To represent the data according to a wind rose plot, the linear scale of wind direction (0–360°) is transformed from polar to Cartesian coordinates, where angles increase clockwise and both 0° and 360° are north (N) (see Figure 2). This mathematical transformation (see (1)) permits a more accurate feature representation of the data for wind direction around the north axis. Otherwise, wind direction angles slightly higher than 0° and slightly lower than 360° would be considered as two opposing directions. This is useful for the classification models that are implemented in the next stage, since machine learning models tend to perform better when the relationships between parameters are continuous (optimization, smoother clustering task) [26]. This transformation ensures both a valid and a more informative representation of the original data. In addition, this representation can be completed by the precipitation levels, which are plotted on the z-axis (Figure 2). The color range is mapped from concentrations of 0 μg/m³ to >25 μg/m³. The threshold of 25 μg/m³ indicates the value above which the 24-hour concentrations of PM2.5 are harmful according to international health standards.

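\begin{align}
x &= \sin\left(\frac{\text{Wind Direction}}{360^\circ} \cdot 2\pi\right) \cdot \text{Wind Speed}, \\
y &= \cos\left(\frac{\text{Wind Direction}}{360^\circ} \cdot 2\pi\right) \cdot \text{Wind Speed}. \tag{1}
\end{align}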
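As an illustration of equation (1), the short Python sketch below converts a wind direction and speed into Cartesian coordinates and shows why the transformation helps near the north axis. The function name `wind_to_cartesian` and the sample values are assumptions for the example only.

```python
import math

def wind_to_cartesian(direction_deg, speed):
    """Map wind direction (degrees, clockwise from north) and speed to (x, y) as in equation (1)."""
    angle = direction_deg / 360.0 * 2.0 * math.pi
    return math.sin(angle) * speed, math.cos(angle) * speed

# Two winds of equal speed blowing from just east and just west of north.
x1, y1 = wind_to_cartesian(1.0, 2.0)
x2, y2 = wind_to_cartesian(359.0, 2.0)

# On the linear 0-360 scale these directions differ by 358 degrees,
# but in the transformed space the two points are close neighbours.
print(math.hypot(x1 - x2, y1 - y2))
```

With this convention, north maps to the positive y-axis and east to the positive x-axis, which matches a wind rose in which angles increase clockwise.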

A visual inspection of the transformed data shows that the wind directions corresponding to precipitation are north (N) for Cotocollao (Figure 2(a)) and east (E) for Belisario (Figure 2(b)). The stronger winds tend to take place between south (S) and southeast (SE) for Cotocollao and between southwest (SW) and SE in Belisario. As expected, in both cases, these stronger winds seem to account for relatively low levels of PM2.5.

3.3. Trend Analyses. In order to obtain general trends in the distribution of the PM2.5 concentrations as a function of wind speed and wind direction, the data are used to generate convolution-based spatial representations. Convolution-based models for spatial data have increased in popularity as a result of their flexibility in modeling spatial dependence and their ability to accommodate large datasets [27]. The generated Convolutional Generalization Model (CGM) [28] is an averaged value of the PM2.5 pollution level (PL), in which the regional quantity of influence per data point is modeled as a 2D Gaussian matrix (see (2)). A Gaussian convolution is applied (i) to spatially interpolate data in order to get a 2D representation from the points' coordinates calculated in (1) and (ii) to smooth the PL concentration values of this representation. A Gaussian kernel is used because it exhibits the property of monotonic smoothing and because, as there is no prior knowledge about the distribution, a kernel density function with high entropy minimizes the information transfer of the convolution step to the processed data [29]. This 2D Gaussian matrix is multiplied by the PL of the given data point and added to the CGM at the coordinates corresponding to the wind speed and direction of this point. Then, the quantity of influence is added at the same location. The final step is to divide the total amount of each cell by the quantity of influence, which results in a generalized average value.

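$$\mathrm{CGM}(\text{rows}, \text{columns}) = \mathrm{PL} \cdot \frac{1}{36} \begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix} \tag{2}$$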
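The accumulation procedure around equation (2) can be sketched in a few lines of Python. This is an illustrative reading of the description, not the authors' implementation: the grid resolution (`grid_size`), the speed range mapped onto the grid (`max_speed`), the rounding of coordinates to cells, and the use of `None` for cells without data are all assumptions.

```python
WEIGHTS_1D = [1, 4, 6, 4, 1]
# 5x5 kernel of equation (2): outer product of the weights, scaled by 1/36.
KERNEL = [[a * b / 36.0 for b in WEIGHTS_1D] for a in WEIGHTS_1D]

def build_cgm(points, grid_size=41, max_speed=4.0):
    """Return a grid of kernel-weighted average pollution levels.

    points: iterable of (x, y, pl), with x and y taken from equation (1).
    grid_size and max_speed are illustrative choices, not values from the paper.
    Cells that no data point influences are left as None.
    """
    weighted_sum = [[0.0] * grid_size for _ in range(grid_size)]
    influence = [[0.0] * grid_size for _ in range(grid_size)]
    half = grid_size // 2
    for x, y, pl in points:
        col = half + round(x / max_speed * half)
        row = half - round(y / max_speed * half)  # north (positive y) is up
        for i, kernel_row in enumerate(KERNEL):
            for j, weight in enumerate(kernel_row):
                r, c = row + i - 2, col + j - 2
                if 0 <= r < grid_size and 0 <= c < grid_size:
                    weighted_sum[r][c] += pl * weight  # kernel times PL
                    influence[r][c] += weight          # quantity of influence
    return [
        [weighted_sum[r][c] / influence[r][c] if influence[r][c] > 0 else None
         for c in range(grid_size)]
        for r in range(grid_size)
    ]

# Made-up points: two calm days with different PL and one south-westerly... 
# (values are illustrative only).
cgm = build_cgm([(0.0, 0.0, 10.0), (0.0, 0.0, 20.0), (2.0, 2.0, 8.0)])
print(cgm[20][20])
```

Because every cell is divided by its accumulated quantity of influence, the constant scaling factor of the kernel cancels out in the final average; only the relative weights of the kernel affect the result.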

The general tendencies are as follows: (i) strong winds result in low PM2.5 concentrations and (ii) the strongest winds generally come from a similar direction (SE for Cotocollao and S for Belisario). The results of the CGMs for both sites are shown in Figure 3 as an overlay on top of the geographic location of their respective monitoring stations. Main highways are indicated in green. The highest concentrations of PM2.5 (from yellow to red) tend to be brought by the winds coming from these main highways. Note that higher wind speeds for Cotocollao tend to be on the axis of Quito's former airport (grey-green area, center of the map, see Figure 3), currently transformed into a city park. This traffic- and structure-free corridor seems to accelerate wind speeds, which may explain the reduction of PM2.5 concentrations due to better ventilation of this part of the city.

During the study, average PM2.5 concentrations in Cotocollao and Belisario are 15.6 μg/m³ and 17.9 μg/m³, respectively, both exceeding the national standards. During the studied six years, the area of Belisario was more polluted, with more variation in PM2.5 concentrations (higher deviation, see Figure 4), and more turbulent (Figure 3) than Cotocollao. These factors could be the result of Belisario being more urbanized.

4. Classification Models

Machine learning models are used to separate the data into different classes of PM2.5 concentrations. Supervised learning techniques are applied to create models for this classification task. Here, we introduce Boosted Trees (BTs) and Linear Support Vector Machines (L-SVM). A BT combines weak learners (simple rules) to create a classification algorithm in which each data point misclassified by a learner gains weight. The following learner then optimizes the classification of the highest weighted region. Boosted Trees are known for their insensitivity to overfitting and for the fact that nonlinear relationships between the parameters do not influence the performance. An L-SVM separates classes with optimal distance. Because the optimization is convex, the algorithm does not get trapped in local minima. As these two models are well established and exhibit different qualities, they are used in this section. All computations and visualizations are executed in MathWorks Matlab 2015. Toolboxes for classification, statistics, and machine learning are used in all the stages. Furthermore, Matlab's integrated tools for distribution fitting and curve fitting are applied for the different analyses. The initial parameters provided by the Matlab toolbox software are used in this work. The default parameters for the BT are the AdaBoost learning method with a total of 30 learners, a maximum number of splits of 20, and a learning rate of 0.1. The SVM is initialized with a linear kernel of scale 1.0, a box constraint level of 1.0, and an equal learning rate of 0.1.

Figure 3: CGM visualization positioned on top of the geographic location of the respective monitoring stations (northwestern part of Quito). The northern CGM visualization is Cotocollao and the southern one is Belisario. Main highways are represented in green. The color scale runs from 0 to >25 μg/m³.

Figure 4: Distribution (density versus real value in μg/m³) of PM2.5 concentrations (June 2007 to July 2013) for Cotocollao and Belisario. The dashed black line represents the national standards and the class separation boundary (15 μg/m³).

Table 1: Binary classification with class separation at 15 μg/m³.

Model     Belisario    Cotocollao
BT        83.2%        67.6%
L-SVM     79.8%        66.3%
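The reweighting idea behind boosting, in which misclassified points gain weight for the next learner, can be illustrated with one round of the textbook discrete AdaBoost update. This is a sketch under stated assumptions: labels are coded as -1 and +1, the function name `adaboost_reweight` is hypothetical, and the learning-rate shrinkage and tree learners of the Matlab implementation used in the paper are not reproduced here.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """One round of discrete AdaBoost reweighting for labels in {-1, +1}.

    Returns the learner's vote (alpha) and the normalized new weights.
    Assumes the weak learner's weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Correct predictions (y * p = +1) shrink, wrong ones (y * p = -1) grow.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # the second point is misclassified
alpha, new_weights = adaboost_reweight(weights, labels, predictions)
print(new_weights)
```

After the update, the single misclassified point carries as much weight as all correctly classified points together, so the next weak learner is pushed to focus on it.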

Fluctuations in yearly PM2.5 concentrations are not taken into account in this classification process, as a previous analysis showed a small variation in fine particulate matter pollution levels during the studied period [5]. A binary classification is performed to set a baseline comparison between the different sites. Then, a three-class classification is carried out to assess the separability between three ranges of concentrations of PM2.5 (based on WHO guidelines) and provide insight into general classification rules.

4.1. Binary Classification. In this first classification, two classes are used, which represent values above and below 15 μg/m³. This value is selected as it is the National Air Quality Standard of Ecuador for annual PM2.5 concentrations (equivalent to WHO's Interim Target-3) [30]. Due to the normal distribution of the datasets, as shown in Figure 4, a higher accuracy for Belisario than for Cotocollao is expected, partially because of an a priori imbalanced class distribution. A previous study using the same classification shows an accuracy of only 65% for Cotocollao by applying the trees.J48 algorithm, which is a decision tree implementation integrated in the WEKA machine learning workbench [5].

Classification with both BT and L-SVM shows similar results. Table 1 presents the results of this first classification. The classification for Belisario outperforms that for Cotocollao. It also suggests that the extreme levels (low and high) of PM2.5 could be more straightforward to classify with the current parameters, implying a higher class separability for the Belisario dataset (wider distribution). Tables 2 and 3 show that, for both sites, the concentrations above 15 μg/m³ are better classified than those below the 15 μg/m³ boundary. This is less surprising for Belisario due to the earlier mentioned class imbalance. For Cotocollao, however, the poor performance for this class can indicate that this class is less distinctive; thus, the model optimizes the class above 15 μg/m³. Note that it is crucial to correctly classify nonattainment (PM2.5 > 15 μg/m³) instances, as misclassifying days that comply with the national standards (PM2.5 < 15 μg/m³) would be a less costly error.

Table 2: Confusion matrix of binary classification for Cotocollao using a BT. Rows represent the true class and columns represent the predicted class.

Class    <15      >15      TPR/FNR
<15      51.1%    48.9%    51.1%/48.9%
>15      20.3%    79.7%    79.7%/20.3%

Table 3: Confusion matrix of binary classification for Belisario using a BT. Rows represent the true class and columns represent the predicted class.

Class    <15      >15      TPR/FNR
<15      49.0%    51.0%    49.0%/51.0%
>15      5.1%     94.9%    94.9%/5.1%

In Figure 5, a comparison of Receiver Operating Characteristic (ROC) curves is shown for the binary classifiers presented in Table 1, namely, the BT and L-SVM classifiers. Figure 5(a) depicts the ROC curves for the Cotocollao dataset and Figure 5(b) the ROC curves for the Belisario dataset. Once the classifier models are built for every dataset, a validation set is presented to the model in order to predict the class label. It is also of interest to have the classification scores of the model, which indicate the likelihood that the predicted label comes from a particular class. The ROC curves are constructed with this scored classification and the true labels in the validation dataset (Figure 5).

ROC curves are useful to evaluate binary classifiers and to compare their performances in a two-dimensional graph that plots the specificity versus the sensitivity. The specificity measures the true negative rate, that is, the proportion of negatives that have been correctly classified: true negatives/negatives = true negatives/(true negatives + false positives). Likewise, the sensitivity measures the true positive rate, that is, the proportion of positives correctly identified: true positives/positives = true positives/(true positives + false negatives). The area under the ROC curve (AUC) can be used as a measure of the expected performance of the classifier, and the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [31]. Figure 5(b) shows the performance of the BT and L-SVM classifiers for the Belisario dataset. The BT outperforms the L-SVM classifier in all regions of the ROC space, with [AUC(BT) = 0.72] > [AUC(L-SVM) = 0.66], which means a better performance for the BT classifier. The BT classifier has a fair performance separating the two classes in the Belisario dataset.

Figure 5: ROC curves (sensitivity (%) versus specificity (%)) for Cotocollao (a) and Belisario (b). Legend values: (a) L-SVM AUC = 56.2%, BT AUC = 59.1%; (b) L-SVM AUC = 65.9%, BT AUC = 71.8%.
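The definitions of sensitivity, specificity, and the ranking interpretation of the AUC given above can be written down directly. The Python sketch below is illustrative only: the function names, the 0/1 label coding, and the scores are assumptions, and the pairwise loop is chosen for clarity rather than efficiency.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: share of positives that are correctly identified."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """True negative rate: share of negatives that are correctly classified."""
    return true_negatives / (true_negatives + false_positives)

def auc_by_ranking(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Labels are 1 (positive) or 0 (negative); tied scores count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up classification scores for six validation samples.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc_by_ranking(scores, labels))
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. A classifier that ranks pairs at random would score 0.5 on average.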

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, BT performs better than the L-SVM classifier, with [AUC(BT) = 0.59] > [AUC(L-SVM) = 0.56]. This time, the classifiers for the Cotocollao dataset perform poorly in separating the two classes, only slightly better than a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus, a three-class classification should identify whether, for both sites, the extreme concentrations could be better classified than the moderate ones, and clarify the low performance for Cotocollao.

4.2. Three-Class Classification. To further analyze the differences between multiple categories of concentration levels, a three-class classification is performed using WHO's guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 μg/m³ (long-term, annual WHO recommended level), moderate if 10 μg/m³ < PM2.5 < 25 μg/m³, and high if PM2.5 > 25 μg/m³ (short-term, 24-hour WHO recommended level). The objective is to identify whether these main pollution thresholds are indeed well separable and, thus, whether the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 μg/m³ and >25 μg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 μg/m³. Due to this fact, an alternative BT algorithm is used to take into account these imbalanced classes. This RusBoosted Tree (RBT) approach endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).

Table 4: Confusion matrix of three-class classification for Cotocollao using an RBT. Rows represent the true class and columns represent the predicted class.

Class    <10      10–25    >25      TPR/FNR
<10      76.3%    16.3%    7.4%     76.3%/23.7%
10–25    28.3%    28.8%    42.9%    28.8%/71.2%
>25      6.3%     20.3%    73.4%    73.4%/26.6%
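To make the class imbalance and its treatment concrete, the sketch below labels daily values with the three classes and then applies random undersampling, the sampling step that RUSBoost [32] combines with boosting. It is a simplified illustration: the function names, the handling of values exactly equal to 10 or 25 μg/m³, the fixed seed, and the sample values are assumptions, and the boosting rounds themselves are not shown.

```python
import random

def who_class(pm25):
    """Label a daily PM2.5 value (ug/m3) with the three classes used in the text.

    Values exactly at 10 or 25 are assigned to the moderate class here,
    which is an assumption made for the example.
    """
    if pm25 < 10:
        return "low"
    if pm25 <= 25:
        return "moderate"
    return "high"

def random_undersample(samples, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))
    return balanced

# Made-up concentrations in which the moderate class dominates.
values = [4.0, 8.5, 12.0, 14.0, 16.5, 18.0, 19.5, 22.0, 27.0, 31.0]
labels = [who_class(v) for v in values]
balanced = random_undersample(values, labels)
print(len(balanced))
```

After undersampling, each class contributes the same number of samples, so a learner trained on the balanced set cannot achieve a low error simply by favouring the dominant 10–25 μg/m³ class.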

Tables 4 and 5 show that concentrations <10 μg/m³ are classified adequately. Also, the correct classification for concentrations >25 μg/m³ in Cotocollao is fair. However, the false positive rate of this class is extremely high, because 42.9% of the 10–25 μg/m³ class gets classified as class >25 μg/m³. For Belisario, the separation of classes 10–25 μg/m³ and >25 μg/m³ is deficient. In both cases, only the extreme low values can be classified well. Thus, the hypothesis of the extreme concentrations in PM2.5 being more straightforward to classify (see Section 4.1) is only partially verified.

Analyzing the wrongly classified samples of class 10–25 μg/m³ shows that, for samples classified as <10 μg/m³, the real values tend to be relatively close to 10 μg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 μg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 μg/m³. Even though the mean is shifted for Belisario, it is not evident that samples of class 10–25 μg/m³ wrongly classified into class >25 μg/m³ tend to be closer to 25 μg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the ranges 10–25 μg/m³ and >25 μg/m³, which are poorly separable according to the three-class classification.

Figure 6: Wrongly classified samples of class 10–25 μg/m³ with their real value distributions (density versus real value in μg/m³) for Cotocollao (a) and Belisario (b). Each panel shows two curves: 10–25 μg/m³ classified as <10 μg/m³ and 10–25 μg/m³ classified as >25 μg/m³.

Table 5: Confusion matrix of three-class classification for Belisario using an RBT. Rows represent the true class and columns represent the predicted class.

Class    <10      10–25    >25      TPR/FNR
<10      84.8%    9.5%     5.7%     84.8%/15.2%
10–25    12.3%    53.5%    34.2%    53.5%/46.5%
>25      6.5%     45.1%    48.4%    48.4%/51.6%

These results show that values of 10–25 μg/m³ and >25 μg/m³ are not well separable and thus not largely influenced by the used meteorological parameters. On the contrary, lower values seem to be largely predictable by wind and precipitation conditions. This statement gains confidence by looking at the wrongly classified data points discussed previously (see Figure 6).

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well performing rules in classifying PM2.5 concentrations <10 μg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that rules separating classes <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 μg/m³ and >25 μg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 μg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 μg/m³.

Note that, for Cotocollao, the performance increases drastically when comparing the binary classifications of <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ (from 73.2% up to 88.9%, see Table 6). In contrast, the performance for Belisario barely differs between these two classifications (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 μg/m³ class than for Belisario.

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 μg/m³, PM2.5 > 15 μg/m³) showed a large difference in performance between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks as low (PM2.5 < 10 μg/m³), moderate (PM2.5 = 10–25 μg/m³), and high (PM2.5 > 25 μg/m³). This classification showed high performance in categorizing low concentrations, in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that is not described by the classification analysis.

Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

Classification                  Cotocollao                     Belisario
<10 μg/m³ versus 10–25 μg/m³
  Classification rules          Wind speed > 2.5 m/s,          Wind speed > 2.2 m/s,
                                wind direction = S-SE;         wind direction = SE-SW
                                wind direction = NW-NE,
                                precipitation > 15 mm
  Classification performance    73.2% (Figure 7(a))            86.7%
<10 μg/m³ versus >25 μg/m³
  Classification rules          Wind speed > 2 m/s,            Wind speed > 2 m/s,
                                wind direction = S-SE;         wind direction = SE-SW
                                wind direction = NW-NE,
                                precipitation > 1 mm
  Classification performance    88.9% (Figure 7(b))            88.8%
10–25 μg/m³ versus >25 μg/m³
  Classification performance    60.0%                          64.1%

Figure 7: Data split for three different classes (see Table 6): (a) <10 μg/m³ versus 10–25 μg/m³ and (b) <10 μg/m³ versus >25 μg/m³. Both (a) and (b) are results for Cotocollao mapped in terms of wind direction, wind speed, and precipitation (mm, vertical axis). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.
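The rules of Table 6 can be read as a simple decision function. The sketch below encodes the Cotocollao rules for the <10 μg/m³ versus >25 μg/m³ comparison. Several points are assumptions rather than statements from the paper: the translation of compass labels into degrees (SE = 135°, S = 180°, NW = 315°, NE = 45°), the reading of the table as two clusters whose conditions are joined by a logical "and", and the function names.

```python
def in_sector(direction_deg, start_deg, end_deg):
    """True if the direction lies in the clockwise sector from start to end.

    Sectors that pass through north (for example 315 to 45) are handled
    by the wrap-around branch.
    """
    d = direction_deg % 360.0
    if start_deg <= end_deg:
        return start_deg <= d <= end_deg
    return d >= start_deg or d <= end_deg

def predicted_low_cotocollao(wind_speed, wind_direction, precipitation):
    """Rules read from Table 6 for <10 versus >25 ug/m3 at Cotocollao.

    The sector limits in degrees are an assumed translation of the compass
    labels, and the grouping of conditions is an interpretation of the table.
    """
    windy_cluster = wind_speed > 2.0 and in_sector(wind_direction, 135.0, 180.0)
    rainy_cluster = precipitation > 1.0 and in_sector(wind_direction, 315.0, 45.0)
    return windy_cluster or rainy_cluster

print(predicted_low_cotocollao(2.8, 150.0, 0.0))  # windy south-easterly day
print(predicted_low_cotocollao(1.0, 350.0, 6.0))  # rainy northerly day
print(predicted_low_cotocollao(1.0, 250.0, 0.0))  # calm, dry westerly day
```

The two clusters correspond to the two physical mechanisms discussed in the text: ventilation by strong winds and removal of particles by precipitation.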

Figure 8: Decrease in average prediction error (μg/m³) with increasing parameter values ((a) precipitation in mm and (b) wind speed in m/s) for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for given weather conditions. Also, the analysis of the data trend over time will inform on the applicability of a time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different models. Bin sizes of 0.5 μg/m³ (0–35 μg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the prediction error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
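The binning step can be illustrated with a minimal sketch in Python (not the Matlab implementation used in this study). It assumes that each 0.5 μg/m³ bin is represented by its midpoint when a class label is mapped back to a concentration; the function names `to_bin_label` and `bin_center` are hypothetical.

```python
BIN_SIZE = 0.5     # bin width in ug/m^3, as stated in the text
RANGE_MAX = 35.0   # upper end of the 0-35 ug/m^3 range
N_BINS = int(RANGE_MAX / BIN_SIZE)


def to_bin_label(concentration):
    """Map a continuous concentration to the index of its 0.5 ug/m^3 bin."""
    clipped = min(max(concentration, 0.0), RANGE_MAX)
    # The top edge (35.0) is folded into the last bin.
    return min(int(clipped // BIN_SIZE), N_BINS - 1)


def bin_center(label):
    """Map a bin index back to a concentration (bin midpoint, an assumption)."""
    return (label + 0.5) * BIN_SIZE


if __name__ == "__main__":
    for value in (3.2, 17.3, 34.9):
        label = to_bin_label(value)
        print(value, "->", label, "->", bin_center(label))
```

Under this assumption, the discretization alone shifts a value inside the range by at most half a bin (0.25 μg/m³), which is in line with the expectation that the related errors are marginal.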

\[ \mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 \tag{3} \]

The mean squared error (MSE) is used to measure the prediction performance (see (3)). The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error in terms of percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance.

\[ \mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left( \hat{y}_i - y_i \right) / y_i \right|}{n} \tag{4} \]
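As an illustration of (3) and (4), the two measures can be sketched in a few lines of Python. The sketch assumes that the real values are nonzero and that the MAPE is returned as a fraction (multiply by 100 for a percentage); the function and variable names are illustrative only.

```python
def mse(real, predicted):
    """Mean squared error, as in (3)."""
    n = len(real)
    return sum((p - r) ** 2 for r, p in zip(real, predicted)) / n


def mape(real, predicted):
    """Mean absolute percentage error, as in (4), returned as a fraction."""
    n = len(real)
    return sum(abs((p - r) / r) for r, p in zip(real, predicted)) / n


if __name__ == "__main__":
    real = [10.0, 20.0]        # toy values, not data from the study
    predicted = [12.0, 18.0]
    print(mse(real, predicted), mape(real, predicted))
```

For the toy values above, both squared errors equal 4, so the MSE is 4, while the relative errors are 20% and 10%, giving a MAPE of 15%.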

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is expressed as the average prediction error (absolute difference between the real and the predicted values, root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The errors decrease as the values of these input parameters increase. This suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed (thresholds >1 mm and >2.5 m/s, respectively; see Table 6, thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. Axes: day counter, PM2.5 concentration (μg/m³), precipitation (mm), and wind speed (m/s).

Figure 9 shows an example of the comparison of the predicted PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.000. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
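The 5-point box smoothing and the correlation between real and predicted series can be sketched as follows. This is a minimal Python illustration, not the Matlab code used in the study. It assumes a centered moving average whose window shrinks at the series edges and the standard Pearson correlation coefficient; both function names are hypothetical.

```python
import math


def box_smooth(values, window=5):
    """Centered moving average; the window shrinks at the edges (an assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    real = [12.0, 18.0, 15.0, 22.0, 19.0, 14.0, 16.0]       # toy values
    predicted = [14.0, 16.0, 15.5, 19.0, 18.0, 15.0, 15.5]  # toy values
    print(box_smooth(real))
    print(pearson_r(real, predicted))
```

Smoothing both series before plotting emphasizes the tendency rather than the day-to-day scatter, which is how Figure 9 is presented.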

The results of the MSE for the regression show that, in both city sites, a NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 μg/m³ for Cotocollao and 19 μg/m³ for Belisario.

To sum up, the present input parameters do not describe an increase in PM2.5 concentrations well once these levels exceed 20 μg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

Table 7: MSE and MAPE (in parentheses, %) of the NN, L-SVM, and BT on regression.

         Location
Model    Belisario    Cotocollao
NN       22.1 (26)    40.7 (40)
L-SVM    26.8 (28)    41.8 (41)
BT       28.5 (30)    44.4 (42)

Table 8: MSE and MAPE (in parentheses, %) of CGM and NN regression.

         Location
Model    Belisario    Cotocollao
CGM      15.6 (22)    15.0 (25)
NN       22.1 (26)    40.7 (40)

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).

Figure 10: Fitted lines representing the correlation between predicted values (μg/m³) and real values (μg/m³) through a NN algorithm for Cotocollao (orange) and Belisario (blue).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). Note that this reduction is particularly large in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites suggests that this model has the potential to be applied in various situations with similar expected error rates. Further development should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
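To make the Kalman filter suggestion concrete, a scalar filter can be sketched in Python as follows. This is only an illustration of the technique under stated assumptions, not part of the present model: the state is assumed to follow a random walk from one day to the next, and the noise variances (`process_var`, `measurement_var`) and initial values are hypothetical parameters that would have to be estimated from data.

```python
def kalman_1d(measurements, process_var, measurement_var, x0, p0):
    """Scalar Kalman filter for a random-walk state (assumed model).

    x0 and p0 are the initial state estimate and its variance.
    Returns the filtered estimate after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + process_var             # predict: uncertainty grows
        k = p / (p + measurement_var)   # Kalman gain
        x = x + k * (z - x)             # update with the new reading
        p = (1.0 - k) * p               # uncertainty shrinks after update
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    readings = [16.0, 18.5, 15.0, 21.0, 19.5, 17.0]  # toy daily values
    print(kalman_1d(readings, process_var=0.5, measurement_var=4.0,
                    x0=readings[0], p0=1.0))
```

A larger `measurement_var` makes the estimate follow the previous days more closely, whereas a larger `process_var` lets it react faster to day-by-day shifts.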

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations < 10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression shows improved performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions in not only forecasting air quality but also relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations > 20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed during different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wild fires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs, (2015). World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre, (2016). Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, "The contribution of outdoor air pollution sources to premature mortality on a global scale," Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, "Health effects of fine particulate air pollution: lines that connect," Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, "Machine learning approach to forecasting urban pollution: a case study of Quito," in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., "The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area," Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, "Variations in pm10, pm2.5 and pm1.0 in an urban area of the sichuan basin and their relation to meteorological factors," Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, "Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan," International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., "Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events," Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., "A comprehensive evaluation of air pollution prediction improvement by a machine learning method," in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, "Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model," Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, "Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador)," in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, "Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data," Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, "Forecasting smog-related health hazard based on social media and physical sensor," Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, "Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong," International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, "A novel hybrid strategy for PM2.5 concentration analysis and prediction," Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, "Identifying pollution sources and predicting urban air quality using ensemble learning methods," Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, "Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches," Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, "Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations," Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, "Air quality prediction using optimal neural networks with stochastic variables," Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, "Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model," Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, "Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm," Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.

[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "Yale: rapid prototyping for complex data mining tasks," in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, "Some topics in convolution-based spatial modeling," in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, "A generalized convolution model and estimation for non-stationary random functions," Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, "Uniqueness of the Gaussian kernel for scale-space filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, "Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015."

[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, "On the ability of the WRF model to reproduce the surface wind direction over complex terrain," Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., "The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations," Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., "Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model," Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


wind speed and wind direction, the data are used to generate convolution-based spatial representations. Convolution-based models for spatial data have increased in popularity as a result of their flexibility in modeling spatial dependence and their ability to accommodate large datasets [27]. This generated Convolutional Generalization Model (CGM) [28] is an averaged value of the PM2.5 pollution level (PL) in which the regional quantity of influence per data point is modeled as a 2D Gaussian matrix (see (2)). A Gaussian convolution is applied (i) to spatially interpolate data in order to get a 2D representation from the points' coordinates calculated in (1) and (ii) to smooth the PL concentration values of this representation. A Gaussian kernel is used because it exhibits the quality of monotonic smoothing and, as there is no prior knowledge about the distribution, a kernel density function with high entropy minimizes the information transfer of the convolution step to the processed data [29]. This 2D Gaussian matrix is multiplied by the PL of the given data point and added to the CGM at the coordinates corresponding to the wind speed and direction of this point. Then, the quantity of influence is added to the point. The final step is to divide the total amount of each cell by the quantity of influence, which results in a generalized average value.

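\[ \mathrm{CGM}(\text{rows}, \text{columns}) = \mathrm{PL} \cdot \frac{1}{36} \begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix} \tag{2} \]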
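The accumulation and averaging steps described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' Matlab code. It assumes that every data point has already been mapped to integer grid coordinates (the result of (1), which is not reproduced here), that the kernel of (2) is scaled by 1/36 so that its center weight is 1, and that kernel cells falling outside the grid are simply dropped. The name `build_cgm` is hypothetical.

```python
KERNEL_1D = [1, 4, 6, 4, 1]
# 5x5 kernel of (2): outer product scaled by 1/36, so the center weight is 1.
KERNEL_2D = [[a * b / 36.0 for b in KERNEL_1D] for a in KERNEL_1D]


def build_cgm(points, n_rows, n_cols):
    """Build the averaged map from (row, col, pollution_level) triples.

    Cells that receive no influence at all are returned as None.
    """
    total = [[0.0] * n_cols for _ in range(n_rows)]
    influence = [[0.0] * n_cols for _ in range(n_rows)]
    for row, col, pl in points:
        for i, kernel_row in enumerate(KERNEL_2D):
            for j, weight in enumerate(kernel_row):
                r, c = row + i - 2, col + j - 2   # kernel centered on the point
                if 0 <= r < n_rows and 0 <= c < n_cols:
                    total[r][c] += pl * weight    # PL times Gaussian weight
                    influence[r][c] += weight     # quantity of influence
    return [[total[r][c] / influence[r][c] if influence[r][c] > 0 else None
             for c in range(n_cols)]
            for r in range(n_rows)]


if __name__ == "__main__":
    grid = build_cgm([(2, 2, 10.0), (2, 3, 30.0)], n_rows=5, n_cols=6)
    print(grid[2][2])
```

Dividing by the accumulated influence turns each cell into a weighted average of the nearby pollution levels, so a cell reached by a single data point simply takes that point's value.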

The general tendencies are as follows: (i) strong winds result in low PM2.5 concentrations and (ii) the strongest winds generally come from a similar direction (SE for Cotocollao and S for Belisario). The results of CGMs for both sites are shown in Figure 3 as an overlay on top of the geographic location of their respective monitoring stations. Main highways are indicated in green. The highest concentrations of PM2.5 (from yellow to red) tend to be brought by the winds coming from these main highways. Note that higher wind speeds for Cotocollao tend to be on the axis of Quito's former airport (grey-green area, center of the map; see Figure 3), currently transformed into a city park. This traffic- and structure-free corridor seems to accelerate wind speeds, which may explain the reduction of PM2.5 concentrations due to better ventilation of this part of the city.

During the study, average PM2.5 concentrations in Cotocollao and Belisario are 15.6 μg/m³ and 17.9 μg/m³, respectively, both exceeding the national standards. During the studied six years, the area of Belisario was more polluted, with more variation in PM2.5 concentrations (higher deviation; see Figure 4), and more turbulent (Figure 3) than Cotocollao. These factors could be the result of Belisario being more urbanized.

4. Classification Models

Figure 3: CGM visualization positioned on top of the geographic location of the respective monitoring stations (northwestern part of Quito). The northern CGM visualization is Cotocollao and the southern one is Belisario. Main highways are represented in green. Color scale: 0 to >25 μg/m³; scale bar: 1 km.

Figure 4: Distribution of PM2.5 concentrations (June 2007 to July 2013) for Cotocollao and Belisario (density versus real value in μg/m³). Dashed black line represents the national standards and the class separation boundary (15 μg/m³).

Table 1: Binary classification with class separation at 15 μg/m³.

         Location
Model    Belisario    Cotocollao
BT       83.2         67.6
L-SVM    79.8         66.3

Machine learning models are used to separate the data into different classes of PM2.5 concentrations. Supervised learning techniques are applied to create models for this classification task. Here, we introduce Boosted Trees (BTs) and Linear Support Vector Machines (L-SVM). A BT combines weak learners (simple rules) to create a classification algorithm, where each misclassified data point per learner gains weight. A following learner optimizes the classification of the highest weighted region. Boosted Trees are known for their insensitivity to overfitting and for the fact that nonlinear relationships between the parameters do not influence the performance. A L-SVM separates classes with optimal distance. Convex optimization keeps the algorithm from getting trapped in local minima. As these two models are well established and exhibit different qualities, they are used in this section. All computations and visualizations are executed in MathWorks Matlab 2015. Toolboxes for the classifications, the statistics, and machine learning processes are used in all the stages. Furthermore, Matlab's integrated tools for distribution fitting and curve fitting are applied for the different analyses. The initial parameters provided by the Matlab toolbox software are used in this work. The default parameters for the BT are the AdaBoost learning method with a total of 30 learners, a maximum number of splits of 20, and a learning rate of 0.1. The SVM is initialized with a linear kernel of scale 1.0, a box constraint level of 1.0, and an equal learning rate of 0.1.
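The reweighting idea behind boosting can be illustrated with a compact Python sketch of discrete AdaBoost. It is a deliberately simplified stand-in for the Matlab toolbox model used in this work: the weak learners here are single-threshold rules (stumps) instead of trees with up to 20 splits, no learning rate is applied, and labels are assumed to be coded as +1 and -1. All function names are hypothetical.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """A simple rule: predict `polarity` above the threshold, else the opposite."""
    return polarity if row[feature] > threshold else -polarity


def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best rule."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(X, y, n_learners=30):
    """Train an ensemble of weighted stumps on labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_learners):
        err, feature, threshold, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified points gain weight; correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        norm = sum(w)
        w = [wi / norm for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all rules in the ensemble."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score > 0 else -1


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]   # toy feature values
    y = [1, 1, -1, -1, 1, 1]                         # not separable by one rule
    model = adaboost(X, y, n_learners=3)
    print([predict(model, row) for row in X])
```

In the toy example no single threshold separates the two classes, but the weighted vote of a few rules does, because each new rule concentrates on the points that the previous ones misclassified.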

Fluctuations in yearly PM2.5 concentrations are not taken into account in this classification process, as a previous analysis showed a small variation in fine particulate matter pollution levels during the studied period [5]. A binary classification is performed to set a baseline comparison between the different sites. Then, a three-class classification is carried out to assess the separability between three ranges of concentrations of PM2.5 (based on WHO guidelines) and provide insight into general classification rules.

4.1. Binary Classification. In this first classification, two classes are used, which represent values above and below 15 μg/m³. The latter value is selected as it is the National Air Quality Standard of Ecuador for annual PM2.5 concentrations (equivalent to WHO's Interim Target-3) [30]. Due to the normal distribution of the datasets, as shown in Figure 4, a higher accuracy for Belisario than Cotocollao is expected, partially because of a priori imbalanced class distribution. A previous study using the same classification shows an accuracy of only 65% for Cotocollao by applying the trees.J48 algorithm, which is a decision tree implementation integrated in the WEKA machine learning workbench [5].

Classification with both BT and L-SVM shows similar results. Table 1 presents the results of this first classification. The classification for Belisario outperforms that of Cotocollao. It also suggests that the extreme levels (low and high) of PM2.5 could be more straightforward to classify with the current parameters, implying a higher class separability for the Belisario dataset (wider distribution). Tables 2 and 3 show that the concentrations above 15 μg/m³ for both sites are better classified than those below the 15 μg/m³ boundary. This is less surprising for Belisario due to the earlier mentioned class imbalance. For Cotocollao, however, the poor performance for this class can indicate that this class is less distinctive; thus the model optimizes the class above 15 μg/m³. Note that it is crucial to be able to classify nonattainment (PM2.5 > 15 μg/m³) instances, as wrongly classifying instances that comply with the national standards (PM2.5 < 15 μg/m³) would be a less costly error.

Table 2: Confusion matrix (%) of binary classification for Cotocollao using a BT. Rows represent the true class and columns represent the predicted class.

Class    <15     >15     TPR/FNR
<15      51.1    48.9    51.1/48.9
>15      20.3    79.7    79.7/20.3

Table 3: Confusion matrix (%) of binary classification for Belisario using a BT. Rows represent the true class and columns represent the predicted class.

Class    <15     >15     TPR/FNR
<15      49.0    51.0    49.0/51.0
>15      5.1     94.9    94.9/5.1

In Figure 5, a comparison of Receiver Operating Characteristic (ROC) curves is shown for the binary classifiers presented in Table 1, namely, the BT and L-SVM classifiers. Figure 5(a) depicts the ROC curves for the Cotocollao dataset and Figure 5(b) the ROC curves for the Belisario dataset. Once the classifier models are built for every dataset, a validation set is presented to the model in order to predict the class label. It is also of interest to have the classification scores of the model, which indicate the likelihood that the predicted label comes from a particular class. The ROC curves are constructed with these classification scores and the true labels in the validation dataset (Figure 5).

Figure 5: ROC curves (sensitivity (%) versus specificity (%)) for Cotocollao (a) and Belisario (b). Panel (a): L-SVM AUC = 56.2, BT AUC = 59.1. Panel (b): L-SVM AUC = 65.9, BT AUC = 71.8.

ROC curves are useful to evaluate binary classifiers and to compare their performances in a two-dimensional graph that plots the specificity versus sensitivity. The specificity measures the true negative rate, that is, the proportion of negatives that have been correctly classified: true negatives/negatives = true negatives/(true negatives + false positives). Likewise, the sensitivity measures the true positive rate, that is, the proportion of positives correctly identified: true positives/positives = true positives/(true positives + false negatives). The area under the ROC curve (AUC) can be used as a measure of the expected performance of the classifier, and the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [31]. Figure 5(b) shows the performance of the BT and L-SVM classifiers for the Belisario dataset. The BT outperforms the L-SVM classifier in all regions of the ROC space, with [AUC(BT) = 0.72] > [AUC(L-SVM) = 0.66], which means a better performance for the BT classifier. The BT classifier has a fair performance separating the two classes in the Belisario dataset.
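The definitions of sensitivity, specificity, and AUC given above can be sketched in Python as follows. The sketch is illustrative and is not the Matlab routine used to produce Figure 5. It assumes labels coded as 1 (positive) and 0 (negative), scores where a higher value means "more likely positive", and ties between a positive and a negative score counted as one half.

```python
def sensitivity_specificity(labels, scores, threshold):
    """True positive rate and true negative rate at a given score threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < threshold)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a random positive is ranked above a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]            # toy validation labels
    scores = [0.9, 0.4, 0.6, 0.1]    # toy classification scores
    print(sensitivity_specificity(labels, scores, threshold=0.5))
    print(auc(labels, scores))
```

In the toy example, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75; a classifier that ranks at random would obtain 0.5 on average.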

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, BT performs better than the L-SVM classifier, with [AUC(BT) = 0.59] > [AUC(L-SVM) = 0.56]. This time, the classifiers for the Cotocollao dataset have a poor performance separating the two classes, with a performance just slightly better than that of a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus, a three-class classification should identify if, for both sites, the extreme concentrations could be better classified than the moderate ones and clarify the low performance for Cotocollao.

4.2. Three-Class Classification. To further analyze the differences between multiple categories of concentration levels, a three-class classification is performed using WHO's guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 μg/m³ (long-term, annual WHO recommended level), moderate if 10 μg/m³ < PM2.5 < 25 μg/m³, and high if PM2.5 > 25 μg/m³ (short-term, 24-hour WHO recommended level). The objective is to identify whether these main pollution thresholds are indeed well separable and, thus, whether the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 μg/m³ and >25 μg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 μg/m³. Due to this fact, an alternative BT algorithm is used to take into account these imbalanced classes. This RUSBoosted Tree (RBT) approach

Table 4: Confusion matrix of three-class classification for Cotocollao using an RBT. Rows represent the true class and columns represent the predicted class (values in %).

Class (μg/m³)   <10     10–25   >25     TPR/FNR
<10             76.3    16.3    7.4     76.3/23.7
10–25           28.3    28.8    42.9    28.8/71.2
>25             6.3     20.3    73.4    73.4/26.6

endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).
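As an illustration of the class labeling and of the imbalance that motivates the RBT, the following Python sketch assigns the three WHO-based classes and then applies plain random undersampling. It covers only the undersampling idea behind RUSBoost [32], not the boosting stage and not the Matlab implementation used in the study. The sample values, the function names, and the handling of values exactly equal to 10 or 25 μg/m³ (assigned here to the moderate class) are assumptions.

```python
import random
from collections import Counter


def who_class(pm25):
    """Map a PM2.5 value (ug/m3) to one of the three classes used in the text.

    Boundary values (exactly 10 or 25) go to the moderate class; this is an
    assumption, since the text does not state how they are handled.
    """
    if pm25 < 10:
        return "<10"
    if pm25 <= 25:
        return "10-25"
    return ">25"


def random_undersample(samples, labels, seed=0):
    """Keep as many samples per class as the smallest class contains."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):
            balanced.append((sample, label))
    return balanced


# Hypothetical daily PM2.5 values (ug/m3) with an imbalanced class distribution.
pm25_values = [8.0, 12.0, 14.5, 16.0, 17.2, 18.9, 20.1, 22.3, 24.0, 27.5]
labels = [who_class(v) for v in pm25_values]
balanced = random_undersample(pm25_values, labels)
balanced_counts = Counter(label for _, label in balanced)
```

After undersampling, every class contributes the same number of samples, so a learner trained on the balanced set cannot reach a good score simply by favoring the dominant 10–25 μg/m³ class.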

Tables 4 and 5 show that the classification of concentrations <10 μg/m³ seems to perform adequately. The correct classification of concentrations >25 μg/m³ in Cotocollao is also fair. However, the false positive rate of this classification is extremely high, because 42.9% of the 10–25 μg/m³ class gets classified as class >25 μg/m³. For Belisario, the separation of classes 10–25 μg/m³ and >25 μg/m³ is deficient. In both cases, only the extreme low values can be classified well. Thus, the hypothesis that the extreme concentrations of PM2.5 are more straightforward to classify (see Section 4.1) is only partially verified.

Analyzing the wrongly classified samples of class 10–25 μg/m³ shows that, for samples classified as <10 μg/m³, the


Figure 6: Wrongly classified samples of class 10–25 μg/m³ with their real value distributions for Cotocollao (a) and Belisario (b). Axes: real value (μg/m³) versus density. Series: 10–25 μg/m³ classified as <10 μg/m³; 10–25 μg/m³ classified as >25 μg/m³.

Table 5: Confusion matrix of three-class classification for Belisario using an RBT. Rows represent the true class and columns represent the predicted class (values in %).

Class (μg/m³)   <10     10–25   >25     TPR/FNR
<10             84.8    9.5     5.7     84.8/15.2
10–25           12.3    53.5    34.2    53.5/46.5
>25             6.5     45.1    48.4    48.4/51.6

real values tend to be relatively close to 10 μg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 μg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 μg/m³. Even though the mean is shifted for Belisario, it is not evident that samples of class 10–25 μg/m³ wrongly classified into class >25 μg/m³ tend to be closer to 25 μg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the ranges 10–25 μg/m³ and >25 μg/m³, which are poorly separable according to the three-class classification.

These results show that values of 10–25 μg/m³ and >25 μg/m³ are not well separable and thus not largely influenced by the meteorological parameters used. On the contrary, lower values seem to be largely predictable by wind and precipitation conditions. This statement gains confidence by looking at the wrongly classified data points discussed previously (see Figure 6).
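The effect of the misclassified 10–25 μg/m³ samples on the >25 μg/m³ class can be made concrete with a short Python sketch. It uses the Cotocollao confusion matrix (Table 4), read as percentages with one decimal, and assumes class shares of roughly 10%, 80%, and 10%; these shares are our reading of "approximately 10% of the data" for each small class and are not exact figures reported in the study.

```python
# Row-normalized confusion matrix for Cotocollao (Table 4), in percent.
# Rows are true classes, columns are predicted classes.
classes = ["<10", "10-25", ">25"]
confusion = [
    [76.3, 16.3, 7.4],
    [28.3, 28.8, 42.9],
    [6.3, 20.3, 73.4],
]

# TPR is the diagonal entry of each row; FNR is the remainder of that row.
tpr = {c: confusion[i][i] for i, c in enumerate(classes)}
fnr = {c: round(100.0 - confusion[i][i], 1) for i, c in enumerate(classes)}

# Assumed class shares (about 10% / 80% / 10%); not reported exactly in the study.
priors = [0.10, 0.80, 0.10]


def precision(col):
    """Share of samples predicted as class `col` that truly belong to it."""
    predicted = sum(priors[row] * confusion[row][col] for row in range(3))
    return priors[col] * confusion[col][col] / predicted


precision_high = precision(2)  # class >25
```

Under these assumed shares, fewer than one in five samples predicted as >25 μg/m³ would actually belong to that class, because the much larger moderate class contributes most of the predictions. This is consistent with the statement that the false positive rate for the high class is extremely high.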

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well-performing rules for classifying PM2.5 concentrations <10 μg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that the rules separating classes <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 μg/m³ and >25 μg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 μg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 μg/m³.

It is worth noting that, for Cotocollao, the performance increases drastically when comparing the binary classifications of <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario barely differs between these two classifications (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 μg/m³ class than for Belisario.

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 μg/m³, PM2.5 > 15 μg/m³) showed a high difference in performance


Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

Classification              Item          Cotocollao                Belisario
<10 versus 10–25 μg/m³      Rules         Wind speed > 2.5 m/s      Wind speed > 2.2 m/s
                                          Wind direction = S-SE     Wind direction = SE-SW
                                          Wind direction = NW-NE
                                          Precipitation > 15 mm
                            Performance   73.2% (Figure 7(a))       86.7%
<10 versus >25 μg/m³        Rules         Wind speed > 2 m/s        Wind speed > 2 m/s
                                          Wind direction = S-SE     Wind direction = SE-SW
                                          Wind direction = NW-NE
                                          Precipitation > 1 mm
                            Performance   88.9% (Figure 7(b))       88.8%
10–25 versus >25 μg/m³      Performance   60.0%                     64.1%

Figure 7: Data split for three different classes (see Table 6): (a) <10 μg/m³ versus 10–25 μg/m³ and (b) <10 μg/m³ versus >25 μg/m³. Both (a) and (b) are results for Cotocollao, mapped in terms of wind direction, wind speed, and precipitation (mm). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks: low (PM2.5 < 10 μg/m³), moderate (PM2.5 = 10–25 μg/m³), and high (PM2.5 > 25 μg/m³). This classification showed high performance in categorizing low concentrations, in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that is not described by the classification analysis.


Figure 8: Decrease in average prediction error (μg/m³) with increasing parameter values, (a) precipitation (mm) and (b) wind speed (m/s), for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for given weather conditions. Also, the analysis of the data trend over time will inform on the applicability of time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different classifiers. Bin sizes of 0.5 μg/m³ (0–35 μg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the classification error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
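The binning that lets discrete-output models approximate a regression can be sketched as follows in Python. The bin size and range are those stated above; the helper names, the clipping of out-of-range values, and the use of the bin center as the reported value are assumptions made for this illustration.

```python
BIN_SIZE = 0.5             # ug/m3, as stated in the text
LOW, HIGH = 0.0, 35.0      # ug/m3 range, as stated in the text
N_BINS = int((HIGH - LOW) / BIN_SIZE)


def to_bin(pm25):
    """Index of the bin containing a value; out-of-range values are clipped (assumption)."""
    clipped = min(max(pm25, LOW), HIGH)
    return min(int((clipped - LOW) / BIN_SIZE), N_BINS - 1)


def from_bin(index):
    """Representative value of a bin; using the bin center is an assumption."""
    return LOW + (index + 0.5) * BIN_SIZE


value = 17.3
index = to_bin(value)        # class label a discrete model would be trained on
recovered = from_bin(index)  # value reported when that class is predicted
quantization_error = abs(value - recovered)
```

With bin centers as outputs, the error introduced by discretization alone is at most half a bin (0.25 μg/m³), which supports the expectation that this source of error is marginal compared with the prediction errors reported below.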

$$\mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad (3)$$

The mean squared error (MSE) is used to measure the classification performance (see (3)). The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error in terms of percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance.

$$\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left( y_i - \hat{y}_i \right) / y_i \right|}{n} \qquad (4)$$
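As a worked example of (3) and (4), the following Python sketch computes both error measures on a small set of hypothetical values, taking y as the real value and ŷ as the predicted value (our reading of the symbols). The numbers are illustrative only and are not data from the study.

```python
def mse(y_true, y_pred):
    """Mean squared error, following (3)."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction, following (4).

    Real values must be nonzero, since each error is divided by the real value.
    """
    n = len(y_true)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical real and predicted PM2.5 values (ug/m3), not data from the study.
real = [10.0, 20.0, 25.0, 16.0]
predicted = [12.0, 18.0, 20.0, 16.0]

error_mse = mse(real, predicted)          # (4 + 4 + 25 + 0) / 4
error_mape = 100 * mape(real, predicted)  # expressed in percent
```

The example shows why the two measures complement each other: the single 5 μg/m³ miss dominates the MSE because errors are squared, whereas the MAPE weighs each miss relative to the real concentration.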

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is explained as the average prediction error (absolute difference between the real and the predicted values, root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing


Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. Axes: day counter versus PM2.5 concentration (μg/m³), precipitation (mm), and wind speed (m/s).

values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than moderate climatic conditions.

Figure 9 shows an example of the comparison between the PM2.5 concentration given by the predictive model and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Besides a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.000. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
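The two operations behind this comparison, a 5-point box smoothing and a Pearson correlation, can be sketched in Python as follows. The series are hypothetical, and the treatment of the series edges (averaging over the part of the window that fits) is an assumption, since the text does not specify it. The value in parentheses in r(130) is read here as the degrees of freedom, n − 2, following the usual reporting convention.

```python
import math


def box_smooth(values, window=5):
    """Centered moving average; edges use the part of the window that fits."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily series (ug/m3), not data from the study.
real = [18.0, 22.0, 15.0, 12.0, 20.0, 24.0, 17.0, 14.0]
predicted = [17.0, 19.0, 16.0, 15.0, 18.0, 20.0, 17.0, 16.0]

r = pearson_r(box_smooth(real), box_smooth(predicted))
degrees_of_freedom = len(real) - 2  # the value reported in parentheses, r(df)
```

Smoothing before comparing emphasizes agreement in the overall tendency rather than in day-to-day fluctuations, which matches the stated purpose of the figure.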

The results of the MSE for the regression show that, in both city sites, a NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 μg/m³ for Cotocollao and 19 μg/m³ for Belisario.

To sum up, the present input parameters do not describe well an increase in PM2.5 concentrations if these levels exceed 20 μg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels

Table 7: MSE and MAPE of the NN, L-SVM, and BT on regression. Entries are MSE, with MAPE (%) in parentheses.

Model    Belisario    Cotocollao
NN       22.1 (26)    40.7 (40)
L-SVM    26.8 (28)    41.8 (41)
BT       28.5 (30)    44.4 (42)

Table 8: MSE and MAPE of CGM and NN regression. Entries are MSE, with MAPE (%) in parentheses.

Model    Belisario    Cotocollao
CGM      15.6 (22)    15.0 (25)
NN       22.1 (26)    40.7 (40)

beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). It is worth noting that this reduction is particularly high in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development


Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue). Axes: real value (μg/m³) versus prediction (μg/m³).

should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
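To give an idea of how a Kalman filter could use the previous day's reading, the following Python sketch implements a generic scalar filter with a random-walk state model. It is not part of the study and not the proposed extension itself; the noise variances, the function name, and the input series are hypothetical.

```python
def kalman_smooth(measurements, process_var=1.0, measurement_var=4.0):
    """Scalar Kalman filter with a random-walk state model.

    process_var and measurement_var are hypothetical noise variances.
    """
    estimate, variance = measurements[0], measurement_var
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed to persist from the previous day,
        # with growing uncertainty.
        variance += process_var
        # Update: blend the prediction and the new value using the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        estimates.append(estimate)
    return estimates


# Hypothetical daily PM2.5 predictions (ug/m3), not data from the study.
daily = [16.0, 21.0, 14.0, 19.0, 17.0]
filtered = kalman_smooth(daily)
```

Each filtered value is a weighted compromise between the previous estimate and the new value, so the day-by-day shift is bounded by the assumed noise levels rather than followed blindly.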

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression shows improved performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions in forecasting not only air quality but also relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable compared to air quality sensors, which can cost over 100 times more. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, by considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed during different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wild fires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³.


Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Networks [13–15], regression [18], decision trees, and Support Vector Machines [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world’s poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising.

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM ’16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen, et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in pm10, pm2.5 and pm1.0 in an urban area of the sichuan basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang, et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang, et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL ’15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User’s Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.


[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier, et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak, et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


Table 1: Binary classification with class separation at 15 μg/m³ (values in %).

Model    Belisario    Cotocollao
BT       83.2         67.6
L-SVM    79.8         66.3

insensibility to overfitting and for the fact that nonlinear relationships between the parameters do not influence the performance. An L-SVM separates classes with optimal distance. Convex optimization leads the algorithm to not focus on local minima. As these two models are well established and exhibit different qualities, they are used in this section. All computations and visualizations are executed in MathWorks' Matlab 2015. Toolboxes for the classifications, the statistics, and machine learning processes are used in all the stages. Furthermore, Matlab's integrated tools for distribution fitting and curve fitting are applied for the different analyses. The initial parameters provided by the Matlab toolbox software are used in this work. The default parameters for the BT are the AdaBoost learning method with a total of 30 learners, a maximum number of splits of 20, and a learning rate of 0.1. The SVM is initialized with a linear kernel of scale 1.0, a box constraint level of 1.0, and an equal learning rate of 0.1.

Fluctuations in yearly PM2.5 concentrations are not taken into account in this classification process, as a previous analysis showed a small variation in fine particulate matter pollution levels during the studied period [5]. A binary classification is performed to set a baseline comparison between the different sites. Then, a three-class classification is carried out to assess the separability between three ranges of concentrations of PM2.5 (based on WHO guidelines) and provide insight into general classification rules.

4.1. Binary Classification. In this first classification, two classes are used, which represent values above and below 15 µg/m³. This value is selected as it is the National Air Quality Standard of Ecuador for annual PM2.5 concentrations (equivalent to WHO's Interim Target-3) [30]. Due to the normal distribution of the datasets, as shown in Figure 4, a higher accuracy for Belisario than for Cotocollao is expected, partially because of an a priori imbalanced class distribution. A previous study using the same classification shows an accuracy of only 65% for Cotocollao by applying the trees.J48 algorithm, which is a decision tree implementation integrated in the WEKA machine learning workbench [5].

Classification with both BT and L-SVM shows similar results. Table 1 presents the results of this first classification. The classification for Belisario outperforms that for Cotocollao. It also suggests that the extreme levels (low and high) of PM2.5 could be more straightforward to classify with the current parameters, implying a higher class separability for the Belisario dataset (wider distribution). Tables 2 and 3 show that, for both sites, the concentrations above 15 µg/m³ are better classified than those below the 15 µg/m³ boundary. This is less surprising for Belisario due to the earlier mentioned class imbalance. For Cotocollao, however, the poor performance for this class can indicate that this class is less distinctive; thus the model optimizes the class above 15 µg/m³. Note that it is crucial to be able to classify nonattainment (PM2.5 > 15 µg/m³) instances, as wrongly identifying levels that do not violate national standards (PM2.5 < 15 µg/m³) would be a less costly error.

Table 2: Confusion matrix (%) of binary classification for Cotocollao using a BT. Rows represent the true class and columns represent the predicted class.

Class    <15     >15     TPR/FNR
<15      51.1    48.9    51.1/48.9
>15      20.3    79.7    79.7/20.3

Table 3: Confusion matrix (%) of binary classification for Belisario using a BT. Rows represent the true class and columns represent the predicted class.

Class    <15     >15     TPR/FNR
<15      49.0    51.0    49.0/51.0
>15      5.1     94.9    94.9/5.1

In Figure 5, a comparison of Receiver Operating Characteristic (ROC) curves is shown for the binary classifiers presented in Table 1, namely, the BT and L-SVM classifiers. Figure 5(a) depicts the ROC curves for the Cotocollao dataset and Figure 5(b) the ROC curves for the Belisario dataset. Once the classifier models are built for every dataset, a validation set is presented to the model in order to predict the class label. It is also of interest to have the classification scores of the model, which indicate the likelihood that the predicted label comes from a particular class. The ROC curves are constructed with these classification scores and the true labels in the validation dataset (Figure 5).
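The TPR/FNR column of Tables 2 and 3 follows directly from each row of a confusion matrix. The following is a minimal Python sketch of that calculation. The helper name `per_class_rates` is hypothetical, and the counts assume 1000 validation samples per true class, which is an illustrative choice because only percentages are reported:

```python
def per_class_rates(confusion, labels):
    """Return {label: (TPR, FNR)} in percent for a row-wise confusion matrix.

    Rows are the true class and columns the predicted class.
    """
    rates = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        tpr = 100.0 * confusion[i][i] / row_total
        rates[label] = (tpr, 100.0 - tpr)
    return rates


# ASSUMPTION: counts per 1000 samples of each true class, chosen only to
# mirror the Cotocollao percentages of Table 2.
cotocollao_bt = [[511, 489],
                 [203, 797]]
rates = per_class_rates(cotocollao_bt, ["<15", ">15"])
print(rates)
```

Since each row is normalized by its own total, the rates are insensitive to how many samples each class contributes, which is why they remain informative for imbalanced classes.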

Figure 5: ROC curves for Cotocollao (a) and Belisario (b). Axes: specificity (%) versus sensitivity (%). Legend: (a) L-SVM AUC = 56.2%, BT AUC = 59.1%; (b) L-SVM AUC = 65.9%, BT AUC = 71.8%.

ROC curves are useful to evaluate binary classifiers and to compare their performances in a two-dimensional graph that plots the specificity versus the sensitivity. The specificity measures the true negative rate, that is, the proportion of negatives that have been correctly classified: true negatives/negatives = true negatives/(true negatives + false positives). Likewise, the sensitivity measures the true positive rate, that is, the proportion of positives correctly identified: true positives/positives = true positives/(true positives + false negatives). The area under the ROC curve (AUC) can be used as a measure of the expected performance of the classifier, and the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [31]. Figure 5(b) shows the performance of the BT and L-SVM classifiers for the Belisario dataset. The BT outperforms the L-SVM classifier in all regions of the ROC space with [AUC(BT) = 0.72] > [AUC(L-SVM) = 0.66], which means a better performance for the BT classifier. The BT classifier has a fair performance separating the two classes in the Belisario dataset.
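The ranking interpretation of the AUC given above can be checked on a small example. The sketch below is illustrative only: the function names are hypothetical, the scores and labels are made up, and counting a tie as half a correct pair is a common convention rather than something stated in the paper:

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity (TPR) and specificity (TNR) at one score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy scores (made up): label 1 = PM2.5 above 15 ug/m3, label 0 = below.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auc_by_ranking(toy_scores, toy_labels))
print(sensitivity_specificity(toy_scores, toy_labels, threshold=0.5))
```

Sweeping `threshold` over all observed scores and plotting the resulting pairs traces out the ROC curve; a classifier that ranks at random gives an AUC close to 0.5.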

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, the BT performs better than the L-SVM classifier with [AUC(BT) = 0.59] > [AUC(L-SVM) = 0.56]. This time, the classifiers for the Cotocollao dataset have a poor performance separating the two classes, with a performance just slightly better than that of a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus, a three-class classification should identify whether, for both sites, the extreme concentrations could be better classified than the moderate ones and clarify the low performance for Cotocollao.

4.2. Three-Class Classification. To further analyze the differences between multiple categories of concentration levels, a three-class classification is performed using WHO's guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 µg/m³ (long-term annual WHO recommended level), moderate if 10 µg/m³ < PM2.5 < 25 µg/m³, and high if PM2.5 > 25 µg/m³ (short-term 24-hour WHO recommended level). The objective is to identify whether these main pollution thresholds are indeed well separable and thus whether the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 µg/m³ and >25 µg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 µg/m³. Due to this fact, an alternative BT algorithm is used to take into account these imbalanced classes. This RUSBoosted Tree (RBT) approach endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).

Table 4: Confusion matrix (%) of three-class classification for Cotocollao using a RBT. Rows represent the true class and columns represent the predicted class.

Class    <10     10–25   >25     TPR/FNR
<10      76.3    16.3    7.4     76.3/23.7
10–25    28.3    28.8    42.9    28.8/71.2
>25      6.3     20.3    73.4    73.4/26.6
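The class definition and the rebalancing idea behind the RBT can be sketched in a few lines of Python. This is not the RUSBoost algorithm itself [32], which combines random undersampling with boosting; it only illustrates the undersampling step. The function names, the assignment of the boundary values 10 and 25 to the moderate class, and the sample values are assumptions made for illustration:

```python
import random
from collections import defaultdict


def who_class(pm25):
    """Map a daily PM2.5 value (ug/m3) to one of the three ranges.

    ASSUMPTION: values of exactly 10 or 25 are placed in the moderate class.
    """
    if pm25 < 10:
        return "<10"
    if pm25 <= 25:
        return "10-25"
    return ">25"


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Made-up PM2.5 values, used only to exercise the functions.
values = [6.0, 8.5, 12.0, 14.2, 17.9, 19.3, 21.0, 24.8, 26.1, 31.4]
labels = [who_class(v) for v in values]
balanced = random_undersample(values, labels)
print(balanced)
```

In this toy set the moderate class holds six of ten samples; after undersampling each class contributes two, so a learner trained on `balanced` is no longer rewarded for favoring the majority class.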

Table 5: Confusion matrix (%) of three-class classification for Belisario using a RBT. Rows represent the true class and columns represent the predicted class.

Class    <10     10–25   >25     TPR/FNR
<10      84.8    9.5     5.7     84.8/15.2
10–25    12.3    53.5    34.2    53.5/46.5
>25      6.5     45.1    48.4    48.4/51.6

Tables 4 and 5 show that concentrations <10 µg/m³ are classified adequately. Also, the correct classification for concentrations >25 µg/m³ in Cotocollao is fair. However, the false positive rate of this classification is extremely high, because 42.9% of the 10–25 µg/m³ class gets classified as class >25 µg/m³. For Belisario, the separation of classes 10–25 µg/m³ and >25 µg/m³ is deficient. In both cases, only the extreme low values can be classified well. Thus, the hypothesis of the extreme concentrations of PM2.5 being more straightforward to classify (see Section 4.1) is only partially verified.

Analyzing the wrongly classified samples of class 10–25 µg/m³ shows that, for samples classified as <10 µg/m³, the real values tend to be relatively close to 10 µg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 µg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 µg/m³. Even though the mean is shifted for Belisario, it is not evident that samples of class 10–25 µg/m³ wrongly classified into class >25 µg/m³ tend to be closer to values of 25 µg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the ranges of 10–25 µg/m³ and >25 µg/m³, which are poorly separable according to the three-class classification.

Figure 6: Wrongly classified samples of class 10–25 µg/m³ with their real value distributions for Cotocollao (a) and Belisario (b). Axes: real value (µg/m³) versus density. Curves: 10–25 µg/m³ classified as <10 µg/m³ and 10–25 µg/m³ classified as >25 µg/m³.

These results show that values of 10–25 µg/m³ and >25 µg/m³ are not well separable and thus not largely influenced by the meteorological parameters used. On the contrary, lower values seem to be largely predictable from wind and precipitation conditions. This statement gains confidence from the wrongly classified data points discussed previously (see Figure 6).

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well-performing rules for classifying PM2.5 concentrations <10 µg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that the rules separating classes <10 µg/m³ versus 10–25 µg/m³ and <10 µg/m³ versus >25 µg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 µg/m³ and >25 µg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 µg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 µg/m³.

It should be noted that, for Cotocollao, the performance increases drastically between the binary classifications of <10 µg/m³ versus 10–25 µg/m³ and <10 µg/m³ versus >25 µg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario hardly differs between these two classifications (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 µg/m³ class than those for Belisario.
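Rules of this kind combine a wind direction sector with a wind speed or precipitation threshold. The Python sketch below shows how such a rule could be evaluated, including sectors that wrap through north. It is only an illustration: the function names are hypothetical, the default thresholds echo the 2.5 m/s and 1 mm values quoted for Cotocollao, and the way the conditions are combined (an OR of two AND rules) is an assumption, since the paper does not spell out the logic:

```python
def in_sector(direction_deg, start_deg, end_deg):
    """True if a wind direction lies in the clockwise sector start -> end.

    Handles sectors that wrap through north (for example NW -> NE).
    """
    d = direction_deg % 360.0
    start = start_deg % 360.0
    end = end_deg % 360.0
    if start <= end:
        return start <= d <= end
    return d >= start or d <= end


# Compass points in degrees (0 = north, clockwise).
NE, SE, S, NW = 45.0, 135.0, 180.0, 315.0


def looks_clean(wind_speed, wind_dir, precipitation,
                min_speed=2.5, min_rain=1.0):
    """Hypothetical rule flagging a day as likely below 10 ug/m3.

    ASSUMPTION: the OR/AND structure is illustrative, not taken from the paper.
    """
    windy_from_south = wind_speed > min_speed and in_sector(wind_dir, SE, S)
    rainy_from_north = precipitation > min_rain and in_sector(wind_dir, NW, NE)
    return windy_from_south or rainy_from_north


print(looks_clean(wind_speed=3.0, wind_dir=150.0, precipitation=0.0))
print(looks_clean(wind_speed=1.0, wind_dir=150.0, precipitation=0.0))
```

Treating the direction as a circular quantity matters here: a NW to NE sector spans 315° through 0° to 45°, so a plain range comparison would miss northerly winds.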

Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

Classification                        Cotocollao                  Belisario
<10 µg/m³ versus 10–25 µg/m³
  Classification rules                Wind speed > 2.5 m/s        Wind speed > 2.2 m/s
                                      Wind direction = S-SE       Wind direction = SE-SW
                                      Wind direction = NW-NE
                                      Precipitation > 1.5 mm
  Classification performance          73.2% (Figure 7(a))         86.7%
<10 µg/m³ versus >25 µg/m³
  Classification rules                Wind speed > 2 m/s          Wind speed > 2 m/s
                                      Wind direction = S-SE       Wind direction = SE-SW
                                      Wind direction = NW-NE
                                      Precipitation > 1 mm
  Classification performance          88.9% (Figure 7(b))         88.8%
10–25 µg/m³ versus >25 µg/m³
  Classification performance          60.0%                       64.1%

Figure 7: Data split for three different classes (see Table 6): (a) <10 µg/m³ versus 10–25 µg/m³ and (b) <10 µg/m³ versus >25 µg/m³. Both (a) and (b) are results for Cotocollao, mapped in terms of wind direction, wind speed, and precipitation (mm). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 µg/m³; PM2.5 > 15 µg/m³) showed a large difference in performance between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks: low (PM2.5 < 10 µg/m³), moderate (PM2.5 = 10–25 µg/m³), and high (PM2.5 > 25 µg/m³). This classification showed high performance in categorizing low concentrations, in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that is not described by the classification analysis.

Figure 8: Decrease in average prediction error (µg/m³) with increasing parameter values, precipitation in mm (a) and wind speed in m/s (b), for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for given weather conditions. Also, the analysis of the data trend over time will inform on the applicability of time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different models. Bin sizes of 0.5 µg/m³ (0–35 µg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained for the binary and three-class classifications (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the prediction error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
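The effect of the 0.5 µg/m³ bins on the output can be quantified with a short Python sketch. The function names are hypothetical, and representing each bin by its midpoint is an assumption, since the paper does not state which value a predicted bin is mapped back to:

```python
BIN_WIDTH = 0.5      # ug/m3, as stated in the text
RANGE_MAX = 35.0     # upper end of the 0-35 ug/m3 range


def to_bin(pm25):
    """Index of the 0.5 ug/m3 bin containing a PM2.5 value (clamped to range)."""
    n_bins = int(RANGE_MAX / BIN_WIDTH)          # 70 bins
    clamped = min(max(pm25, 0.0), RANGE_MAX)
    return min(int(clamped / BIN_WIDTH), n_bins - 1)


def bin_center(index):
    """Representative PM2.5 value of a bin (ASSUMPTION: the bin midpoint)."""
    return (index + 0.5) * BIN_WIDTH


def quantization_error(pm25):
    """Absolute error introduced by replacing a value with its bin midpoint."""
    return abs(pm25 - bin_center(to_bin(pm25)))


print(to_bin(17.3), bin_center(to_bin(17.3)), quantization_error(17.3))
```

Under the midpoint assumption, the discretization alone can shift a value by at most 0.25 µg/m³, which supports the expectation that this source of error is marginal compared with the prediction errors reported below.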

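$$\mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{3}$$

where $n$ is the number of predictions, $y_i$ is the real value, and $\hat{y}_i$ is the predicted value.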

The mean squared error (MSE) is used to measure the performance of the models (see (3)). The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error as a percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance.

$$\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left(y_i - \hat{y}_i\right) / y_i \right|}{n} \tag{4}$$
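Equations (3) and (4) translate directly into code. The sketch below is a plain Python rendering with hypothetical function names and made-up numbers; it takes the first argument to be the real values and the second the predictions, and returns the MAPE as a fraction (multiply by 100 for a percentage):

```python
def mse(actual, predicted):
    """Mean squared error, equation (3)."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error, equation (4), as a fraction."""
    n = len(actual)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted)) / n


# Made-up values in ug/m3, not data from the study.
actual = [10.0, 20.0, 25.0]
predicted = [12.0, 18.0, 20.0]
print(mse(actual, predicted))
print(100.0 * mape(actual, predicted))
```

The two measures weight errors differently: the MSE is dominated by the largest absolute deviations, whereas the MAPE penalizes a given deviation more heavily when the real concentration is low.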

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is expressed as the average prediction error (absolute difference between the real and the predicted values; root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. Axes: day counter versus PM2.5 concentration (µg/m³), precipitation (mm), and wind speed (m/s).

Figure 9 shows an example of the comparison of the predicted PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.000. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
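The two operations used for Figure 9, a 5-point box smoothing and a Pearson correlation between real and predicted series, can be sketched as follows. This is an illustrative Python version, not the Matlab code used in the study; the function names are hypothetical, and shrinking the window at the ends of the series is an assumption about edge handling:

```python
import math


def box_smooth(series, window=5):
    """Centered moving average; edges use the part of the window that exists."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


print(box_smooth([1.0, 2.0, 3.0, 4.0, 5.0]))
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Note that smoothing both series before correlating them generally raises the coefficient, so a correlation computed on smoothed curves describes agreement in tendency rather than day-to-day accuracy.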

The results of the MSE for the regression show that, in both city sites, the NN performs best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 µg/m³ for Cotocollao and 19 µg/m³ for Belisario.
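A logarithmic relationship of this kind can be fitted by ordinary least squares after taking the logarithm of the real values. The Python sketch below assumes the form prediction = a · ln(real) + b, which is one plausible reading of "logarithmic relationship" and not a formula given in the paper; the function name and the synthetic points are likewise illustrative:

```python
import math


def fit_log_curve(real, predicted):
    """Least-squares fit of predicted = a * ln(real) + b; returns (a, b)."""
    logs = [math.log(v) for v in real]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(logs, predicted))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic points generated from a = 6, b = 1 (not data from the study).
real = [5.0, 10.0, 20.0, 30.0]
predicted = [6.0 * math.log(v) + 1.0 for v in real]
a, b = fit_log_curve(real, predicted)
print(a, b)
```

With these synthetic coefficients the fitted curve lies above the identity line at 5 µg/m³ and below it at 30 µg/m³, which reproduces the pattern of overprediction for low values and underprediction for high values.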

To sum up, the present input parameters do not describe well an increase in PM2.5 concentrations when these levels exceed 20 µg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

Table 7: MSE and MAPE (in parentheses, %) of the NN, L-SVM, and BT on regression.

Model    Belisario    Cotocollao
NN       22.1 (26)    40.7 (40)
L-SVM    26.8 (28)    41.8 (41)
BT       28.5 (30)    44.4 (42)

Table 8: MSE and MAPE (in parentheses, %) of CGM and NN regression.

Model    Belisario    Cotocollao
CGM      15.6 (22)    15.0 (25)
NN       22.1 (26)    40.7 (40)

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).
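The 10-fold cross-validation procedure itself is independent of the model being evaluated. A minimal Python sketch, with hypothetical function names and a trivial mean-predicting baseline standing in for the actual CGM or NN, could look like this:

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validated_mse(xs, ys, fit, k=10):
    """Average validation MSE over k folds for any fit(xs, ys) -> predict(x)."""
    fold_errors = []
    for train, validation in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def fit_mean(xs, ys):
    """Hypothetical baseline model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


splits = list(k_fold_indices(25, k=10))
print(len(splits), [len(val) for _, val in splits])
```

Every sample is used for validation exactly once, so the averaged error is less dependent on one particular train/test split, which makes comparisons such as CGM versus NN more reliable.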

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue). Axes: real value (µg/m³) versus prediction (µg/m³).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). It should be noted that this reduction is particularly large in the case of Cotocollao. It seems that the model is able to handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao better than the NN. The similar performance at both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
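To illustrate what such an LQE step could look like, the following Python sketch implements a scalar Kalman filter. It is not part of the present study: the function name is hypothetical, and the random-walk state model (today's level equals yesterday's level plus noise) as well as the variance parameters are assumptions chosen for simplicity:

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate, initial_var):
    """Scalar Kalman filter with a random-walk state model.

    ASSUMPTION: the state (e.g. a daily PM2.5 level) persists from one day to
    the next apart from noise of variance process_var.
    """
    estimate, variance = initial_estimate, initial_var
    filtered = []
    for z in measurements:
        # Predict: the state carries over and its uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


# Made-up noisy readings in ug/m3.
print(kalman_1d([14.0, 18.0, 15.0, 17.0], process_var=1.0,
                measurement_var=4.0, initial_estimate=16.0, initial_var=4.0))
```

The gain determines how much each new reading is trusted relative to the value carried over from the previous day, which is how information from previous days would enter the estimate.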

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 µg/m³) versus high (>25 µg/m³) and low (<10 µg/m³) versus moderate (10–25 µg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 µg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 µg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression shows improved performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models to complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions not only in forecasting air quality but also in the relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 µg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed at different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wildfires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 µg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Networks [13–15], regression [18], decision trees, and Support Vector Machines [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: The 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air, and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014, Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.

[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente, Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


Figure 5: ROC curves (sensitivity (%) versus specificity (%)) for Cotocollao (a) and Belisario (b). Panel (a): L-SVM AUC = 0.562, BT AUC = 0.591. Panel (b): L-SVM AUC = 0.659, BT AUC = 0.718.

for the BT classifier. The BT classifier has a fair performance in separating the two classes in the Belisario dataset.

In Figure 5(a), the ROC curves and AUC are presented for the Cotocollao dataset. Again, BT performs better than the L-SVM classifier, with AUC(BT) = 0.59 > AUC(L-SVM) = 0.56. This time, the classifiers for the Cotocollao dataset separate the two classes poorly, performing only slightly better than a random classifier with AUC = 0.5. The classification result is clearly better for Belisario than for Cotocollao. Thus, a three-class classification should identify whether, for both sites, the extreme concentrations can be classified better than the moderate ones, and clarify the low performance for Cotocollao.
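To make the AUC comparison concrete, the following minimal Python sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. It is an illustration only and not the Matlab procedure used in the study; the function name and the toy scores are assumptions.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    labels: iterable of 0/1 values (1 = positive class)
    scores: classifier scores, higher means "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs that are ranked correctly;
    # tied scores count as half a correct pair.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): 3 of the 4 pairs are ordered correctly.
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A classifier whose scores carry no information about the class yields a value near 0.5, which is the random baseline that the Cotocollao results are compared against.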

4.2. Three-Class Classification. To further analyze the differences between multiple categories of concentration levels, a three-class classification is performed using the WHO guidelines for pollution concentrations as class boundaries. According to these guidelines, health risks are considered low if PM2.5 < 10 μg/m³ (the WHO recommended long-term, annual level), moderate if 10 μg/m³ < PM2.5 < 25 μg/m³, and high if PM2.5 > 25 μg/m³ (the WHO recommended short-term, 24-hour level). The objective is to determine whether these main pollution thresholds are indeed well separable, and thus whether the weather parameters can account for PM2.5 pollution in these three ranges of air quality.

In both studied districts, the classes <10 μg/m³ and >25 μg/m³ are relatively small, with approximately 10% of the data, compared to the class 10–25 μg/m³. For this reason, an alternative BT algorithm is used to take these imbalanced classes into account. This RusBoosted Tree (RBT) approach endeavors to find an even distribution of performance for all classes instead of finding a global optimum [32]. This leads to a better representation of the separability. The true positive versus false negative rate (TPR/FNR) is shown for each class in the confusion matrices of Cotocollao (Table 4) and Belisario (Table 5).

Table 4: Confusion matrix of the three-class classification for Cotocollao using an RBT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class     <10     10–25    >25     TPR/FNR
<10       76.3    16.3     7.4     76.3/23.7
10–25     28.3    28.8     42.9    28.8/71.2
>25       6.3     20.3     73.4    73.4/26.6
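The class boundaries and the TPR/FNR column can be illustrated with a short Python sketch. It is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and assigning the boundary values 10 and 25 μg/m³ to the moderate class is an assumption, since the text only gives strict inequalities.

```python
def who_class(pm25):
    """Map a PM2.5 concentration (ug/m3) to one of the three classes.

    Assumption: exactly 10 and exactly 25 ug/m3 fall into the moderate class.
    """
    if pm25 < 10:
        return "<10"
    if pm25 <= 25:
        return "10-25"
    return ">25"


def tpr_fnr(confusion, classes):
    """Per-class true positive and false negative rates, in percent.

    confusion[i][j] is the count (or share) of samples whose true class is
    classes[i] and whose predicted class is classes[j].
    """
    rates = {}
    for i, name in enumerate(classes):
        row_total = sum(confusion[i])
        tpr = 100.0 * confusion[i][i] / row_total
        rates[name] = (round(tpr, 1), round(100.0 - tpr, 1))
    return rates


classes = ["<10", "10-25", ">25"]
# Row percentages as read from Table 4 (Cotocollao).
table4 = [[76.3, 16.3, 7.4],
          [28.3, 28.8, 42.9],
          [6.3, 20.3, 73.4]]
print(who_class(8.2), who_class(17.0), who_class(31.5))
print(tpr_fnr(table4, classes))
```

Because each row is normalized by its own total, the rates are not affected by the class imbalance, which is why they give a fairer picture of separability than overall accuracy.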

Tables 4 and 5 show that concentrations <10 μg/m³ are classified adequately. The correct classification of concentrations >25 μg/m³ in Cotocollao is also fair. However, the false positive rate of this class is extremely high, because 42.9% of the 10–25 μg/m³ class is classified as >25 μg/m³. For Belisario, the separation of the classes 10–25 μg/m³ and >25 μg/m³ is deficient. In both cases, only the extremely low values can be classified well. Thus, the hypothesis that the extreme PM2.5 concentrations are more straightforward to classify (see Section 4.1) is only partially verified.

Analyzing the wrongly classified samples of class 10–25 μg/m³ shows that, for samples classified as <10 μg/m³, the real values tend to be relatively close to 10 μg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 μg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 μg/m³. Even though the mean is shifted for Belisario, there is no evidence that samples of class 10–25 μg/m³ wrongly classified into class >25 μg/m³ tend to be closer to 25 μg/m³, as this shift is mainly caused by the higher mean value of the initial Belisario data (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the ranges 10–25 μg/m³ and >25 μg/m³, which are poorly separable according to the three-class classification.

Figure 6: Wrongly classified samples of class 10–25 μg/m³ with their real value distributions for Cotocollao (a) and Belisario (b). Axes: density versus real value (μg/m³). Curves: 10–25 μg/m³ classified as <10 μg/m³, and 10–25 μg/m³ classified as >25 μg/m³.

Table 5: Confusion matrix of the three-class classification for Belisario using an RBT. Rows represent the true class and columns represent the predicted class (classes in μg/m³, values in %).

Class     <10     10–25    >25     TPR/FNR
<10       84.8    9.5      5.7     84.8/15.2
10–25     12.3    53.5     34.2    53.5/46.5
>25       6.5     45.1     48.4    48.4/51.6

These results show that values of 10–25 μg/m³ and >25 μg/m³ are not well separable and are thus not strongly influenced by the meteorological parameters used. In contrast, lower values seem to be largely predictable from wind and precipitation conditions. This statement is supported by the wrongly classified data points discussed previously (see Figure 6).

4.3. Classification Rules. Binary classification between all the different classes with RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well-performing rules for classifying PM2.5 concentrations <10 μg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that the rules separating the classes <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ have a high accuracy. In contrast, the separation between 10–25 μg/m³ and >25 μg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 μg/m³. In the case of Belisario, the RBT classifications identify only one cluster for class <10 μg/m³.

It should be noted that, for Cotocollao, the performance increases drastically between the binary classifications of <10 μg/m³ versus 10–25 μg/m³ and <10 μg/m³ versus >25 μg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario hardly differs between these two classifications (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 μg/m³ class than those for Belisario.

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 μg/m³; PM2.5 > 15 μg/m³) showed a large difference in performance between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on the WHO guidelines regarding the consequences of PM2.5 concentrations on health risks: low (PM2.5 < 10 μg/m³), moderate (PM2.5 = 10–25 μg/m³), and high (PM2.5 > 25 μg/m³). This classification showed high performance in categorizing low concentrations, in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that is not described by the classification analysis.

Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

Classification                  Cotocollao                  Belisario

<10 μg/m³ versus 10–25 μg/m³
  Classification rules          Wind speed > 2.5 m/s        Wind speed > 2.2 m/s
                                Wind direction = S-SE       Wind direction = SE-SW
                                Wind direction = NW-NE
                                Precipitation > 15 mm
  Classification performance    73.2% (Figure 7(a))         86.7%

<10 μg/m³ versus >25 μg/m³
  Classification rules          Wind speed > 2 m/s          Wind speed > 2 m/s
                                Wind direction = S-SE       Wind direction = SE-SW
                                Wind direction = NW-NE
                                Precipitation > 1 mm
  Classification performance    88.9% (Figure 7(b))         88.8%

10–25 μg/m³ versus >25 μg/m³
  Classification performance    60.0%                       64.1%

Figure 7: Data split for three different classes (see Table 6): (a) <10 μg/m³ versus 10–25 μg/m³ and (b) <10 μg/m³ versus >25 μg/m³. Both (a) and (b) are results for Cotocollao, mapped in terms of wind direction, wind speed, and precipitation (mm). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

Figure 8: Decrease in average prediction error (μg/m³) with increasing parameter values, (a) precipitation (mm) and (b) wind speed (m/s), for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and for cases in which no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight into the prediction confidence for given weather conditions. The analysis of the data trend over time also indicates the applicability of time series forecasting. Finally, the CGM is used to comment on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different classifiers. Bin sizes of 0.5 μg/m³ (0–35 μg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained for the binary and three-class classifications (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the classification error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
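The binning idea can be sketched in a few lines of Python. This is only an illustration of how a classifier with discrete outputs can approximate regression; the helper names are hypothetical, and representing each bin by its midpoint is an assumption that the text does not state.

```python
BIN_SIZE = 0.5          # ug/m3, bin size stated in the text
LOW, HIGH = 0.0, 35.0   # ug/m3, range stated in the text


def to_bin(value):
    """Index of the 0.5 ug/m3 bin containing value (clipped to 0-35 ug/m3)."""
    clipped = min(max(value, LOW), HIGH)
    n_bins = int(round((HIGH - LOW) / BIN_SIZE))  # 70 bins
    # The upper edge (35.0) is assigned to the last bin.
    return min(int((clipped - LOW) / BIN_SIZE), n_bins - 1)


def bin_center(index):
    """Concentration used to represent a bin (assumption: its midpoint)."""
    return LOW + (index + 0.5) * BIN_SIZE


# A real value of 17.3 ug/m3 becomes class 34, which maps back to 17.25 ug/m3.
label = to_bin(17.3)
print(label, bin_center(label))
```

Under the midpoint assumption, the discretization alone contributes at most 0.25 μg/m³ of error per prediction within the 0–35 μg/m³ range, which is consistent with the expectation that this effect is marginal.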

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{3}$$

The mean squared error (MSE) is used to measure the classification performance (see (3)), where $y_i$ is the real value, $\hat{y}_i$ is the predicted value, and $n$ is the number of predictions. The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error as a percentage of a data point's real value (see (4)). The MAPE provides a more intuitive understanding of the performance.

$$\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left( y_i - \hat{y}_i \right) / y_i \right|}{n} \tag{4}$$
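As a worked illustration of (3) and (4), the two error measures can be written directly in Python. The function names and the sample values are hypothetical; the MAPE is returned as a fraction, so it must be multiplied by 100 to obtain a percentage.

```python
def mse(real, predicted):
    """Mean squared error, equation (3)."""
    n = len(real)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(real, predicted)) / n


def mape(real, predicted):
    """Mean absolute percentage error, equation (4), as a fraction.

    Real values must be nonzero, since each error is divided by the real value.
    """
    n = len(real)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(real, predicted)) / n


real = [10.0, 20.0]        # hypothetical real concentrations (ug/m3)
predicted = [12.0, 16.0]   # hypothetical predictions (ug/m3)
print(mse(real, predicted))           # (2^2 + 4^2) / 2 = 10.0
print(100 * mape(real, predicted))    # (20% + 20%) / 2, i.e. about 20
```

The example shows why the two measures complement each other: the absolute errors differ (2 versus 4 μg/m³), yet both predictions are off by the same share of the real value.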

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is defined as the average prediction error (absolute difference between the real and the predicted values; root of MSE) over a certain interval of an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The errors decrease with increasing values of these input parameters. This suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9 shows an example of the comparison between the predicted PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real and the predicted concentrations, r(130) = 0.5, p < 0.000. The model performance is also relatively good throughout the study period: the correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed, with thresholds >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. Axes: PM2.5 concentration (μg/m³), precipitation (mm), and wind speed (m/s) versus day counter.
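The two operations behind the Figure 9 analysis, 5-point box smoothing and the correlation between real and predicted series, can be sketched as follows. This is a plain Python illustration under stated assumptions: the helper names are hypothetical, and letting the window shrink at the ends of the series is one possible edge treatment that the text does not specify.

```python
import math


def box_smooth(series, width=5):
    """Centered moving average; the window shrinks at the series edges."""
    half = width // 2
    smoothed = []
    for i in range(len(series)):
        window = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


real = [12.0, 18.0, 15.0, 22.0, 19.0, 14.0, 16.0]       # hypothetical daily values
predicted = [14.0, 16.0, 16.0, 18.0, 18.0, 15.0, 16.0]  # hypothetical predictions
print(box_smooth(real))
print(pearson_r(real, predicted))
```

Smoothing before plotting emphasizes the tendency of the series, so a visual match in the smoothed curves can coexist with a moderate correlation on the raw daily values.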

The MSE results for the regression show that, at both city sites, the NN performs best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). This means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems best for values around 17 μg/m³ for Cotocollao and 19 μg/m³ for Belisario.

To sum up, the present input parameters do not describe an increase in PM2.5 concentrations well if these levels exceed 20 μg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to reducing the overprediction and underprediction observed in our model.

Table 7: MSE and MAPE (in parentheses) of the NN, L-SVM, and BT on regression.

Model     Belisario     Cotocollao
NN        22.1 (26)     40.7 (40)
L-SVM     26.8 (28)     41.8 (41)
BT        28.5 (30)     44.4 (42)

Table 8: MSE and MAPE (in parentheses) of CGM and NN regression.

Model     Belisario     Cotocollao
CGM       15.6 (22)     15.0 (25)
NN        22.1 (26)     40.7 (40)

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). Note that this reduction is particularly large in the case of Cotocollao. It seems that the model is able to handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao better than the NN. The similar performance at both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.

Figure 10: Fitted lines representing the correlation between predicted values and real values through an NN algorithm for Cotocollao (orange) and Belisario (blue). Axes: prediction (μg/m³) versus real value (μg/m³), each from 0 to 30.
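The text names Kalman filters only as a possible future direction, so the following Python sketch is not part of the study. It shows a scalar Kalman filter under a random-walk assumption, in which the underlying daily PM2.5 level drifts from one day to the next; the function name and both variance values are hypothetical and would have to be tuned on real data.

```python
def kalman_1d(measurements, process_var=1.0, measurement_var=4.0):
    """Scalar Kalman filter with a random-walk state model.

    measurements: sequence of daily values (e.g., model predictions in ug/m3)
    process_var: assumed day-to-day variance of the underlying level
    measurement_var: assumed variance of the noise in each daily value
    """
    estimate, variance = measurements[0], measurement_var
    filtered = [estimate]
    for z in measurements[1:]:
        # Predict: a random walk keeps the estimate, but uncertainty grows.
        variance += process_var
        # Update: blend the prediction and the new value via the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


daily = [15.0, 22.0, 14.0, 18.0, 30.0, 17.0]  # hypothetical daily values
print(kalman_1d(daily))
```

A larger `measurement_var` relative to `process_var` makes the filter trust the previous day's estimate more, which is the sense in which readings of the previous day(s) would constrain the current prediction.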

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation, mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years of records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although both sites are in Quito's urbanized area, they exhibit differences in spread and dominance regarding the wind features (speed and direction) that account for high PM2.5 concentrations, and in the distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions improves in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real and the predicted concentrations over the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression improves performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and on the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models to complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions not only in forecasting air quality but also in the relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative for predicting PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

This work also provides an insight into the main limitations of PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed at different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that mixes a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wildfires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Networks [13–15], regression [18], decision trees, and Support Vector Machines [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different data processing and modeling algorithms [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen, et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air, and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang, et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang, et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human Development Report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadística y Censos (INEC), Quito, el cantón más poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.


[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmisión, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier, et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak, et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


8 Journal of Electrical and Computer Engineering

[Figure 6, panels (a) and (b): density versus real value (µg/m³) of the misclassified 10–25 µg/m³ samples, shown separately for samples classified as <10 µg/m³ and samples classified as >25 µg/m³.]

Figure 6: Wrongly classified samples of class 10–25 µg/m³ with their real value distributions for Cotocollao (a) and Belisario (b).

Table 5: Confusion matrix of three-class classification for Belisario using a RBT. Rows represent the true class and columns represent the predicted class. All values in %.

Class     <10     10–25    >25     TPR/FNR
<10       84.8    9.5      5.7     84.8/15.2
10–25     12.3    53.5     34.2    53.5/46.5
>25       6.5     45.1     48.4    48.4/51.6

real values tend to be relatively close to 10 µg/m³. This evidence is even stronger for Belisario (Figure 6(b)) than for Cotocollao (Figure 6(a)). This indicates a changeover in values around the decision boundary. The same does not apply to the wrongly classified samples that are grouped as >25 µg/m³. As shown in Figure 6, these values are mostly normally distributed around the mean of class 10–25 µg/m³. Even though the mean is shifted for Belisario, it is not evident that samples of class 10–25 µg/m³ wrongly classified into class >25 µg/m³ tend to be closer to 25 µg/m³, as this shift is mainly caused by the fact that the mean value of the Belisario initial data is higher (see Figure 4). We can conclude that the low performance for Cotocollao in the previous section (Section 4.1) is mainly caused by the fact that the classifier tries to separate values in the range of 10–25 µg/m³ and >25 µg/m³, which are poorly separable according to the three-class classification.

These results show that values of 10–25 µg/m³ and >25 µg/m³ are not well separable and thus not largely influenced by the meteorological parameters used. On the contrary, lower values seem to be largely predictable by wind and precipitation conditions. This statement gains confidence by looking at the wrongly classified data points discussed previously (see Figure 6).
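The TPR/FNR column of Table 5 follows directly from the rows of the confusion matrix: the true positive rate of a class is its diagonal entry divided by the row total, and the false negative rate is the remainder. The short Python sketch below recomputes that column from the reported percentages. The function and variable names are illustrative assumptions and are not taken from the study, which used Matlab.

```python
# Confusion matrix from Table 5 (Belisario, values in %).
# Rows are true classes, columns are predicted classes.
CLASSES = ["<10", "10-25", ">25"]
CONFUSION = [
    [84.8, 9.5, 5.7],
    [12.3, 53.5, 34.2],
    [6.5, 45.1, 48.4],
]


def tpr_fnr(matrix):
    """Return one (TPR, FNR) pair in percent per true class (row)."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        tpr = 100.0 * row[i] / total  # share of the row on the diagonal
        rates.append((round(tpr, 1), round(100.0 - tpr, 1)))
    return rates


if __name__ == "__main__":
    for name, (tpr, fnr) in zip(CLASSES, tpr_fnr(CONFUSION)):
        print(f"{name:>6}: TPR = {tpr:.1f}%, FNR = {fnr:.1f}%")
```

The low TPR of the two upper classes, compared with the <10 µg/m³ class, is the numerical counterpart of the poor separability discussed above.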

4.3. Classification Rules. Binary classification between all different classes with the use of RBTs provides general rules for classifying the different levels of PM2.5 in terms of the parameter space. Here, the well-performing rules in classifying PM2.5 concentrations <10 µg/m³ are discussed. The rules and their performance can be seen in Table 6. This table shows that rules separating classes <10 µg/m³ versus 10–25 µg/m³ and <10 µg/m³ versus >25 µg/m³ have a high percentage of accuracy. On the contrary, the separation between 10–25 µg/m³ and >25 µg/m³ is less accurate.

Figure 7 provides a visualization of the data according to the class separation in Table 6 for the example of Cotocollao. The RBT classification of the data, as seen in Figures 7(a) and 7(b), creates two clusters for class <10 µg/m³. In the case of Belisario, the RBT classifications result in identifying only one cluster for class <10 µg/m³.

Note that for Cotocollao the performance increases drastically between the binary classifications of <10 µg/m³ versus 10–25 µg/m³ and <10 µg/m³ versus >25 µg/m³ (from 73.2% up to 88.9%; see Table 6). In contrast, the performance for Belisario barely differs between these two classifications (from 86.7% to 88.8%). This indicates that the data for Cotocollao are less separable at the 10–25 µg/m³ class than for Belisario.
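To make the form of such rules concrete, the sketch below encodes one possible reading of the Cotocollao rules for <10 µg/m³ versus 10–25 µg/m³ reported in Table 6. Several points are assumptions rather than statements from the study: that the two clusters are combined with a logical OR, that the conditions inside a cluster are combined with AND, that wind speed pairs with the S-SE sector and precipitation with the NW-NE sector, and that the compass sectors map to 135–180° (S-SE) and 315–45° (NW-NE). The function names are hypothetical.

```python
def in_sector(direction_deg, start_deg, end_deg):
    """True if a bearing lies in the sector running clockwise from start to end."""
    d = direction_deg % 360.0
    start = start_deg % 360.0
    end = end_deg % 360.0
    if start <= end:
        return start <= d <= end
    return d >= start or d <= end  # sector wraps through north (0 degrees)


def cotocollao_low_vs_moderate(wind_speed_ms, wind_dir_deg, precip_mm):
    """Hypothetical reading of the Table 6 rules (<10 versus 10-25) for Cotocollao."""
    # Assumed cluster 1: strong wind from the S-SE sector.
    windy_cluster = wind_speed_ms > 2.5 and in_sector(wind_dir_deg, 135, 180)
    # Assumed cluster 2: rain with wind from the NW-NE sector.
    rainy_cluster = precip_mm > 1.5 and in_sector(wind_dir_deg, 315, 45)
    return "<10" if (windy_cluster or rainy_cluster) else "10-25"


if __name__ == "__main__":
    print(cotocollao_low_vs_moderate(3.0, 160.0, 0.0))
    print(cotocollao_low_vs_moderate(1.0, 90.0, 0.0))
```

Under this reading, a day is assigned to the low class when it falls into either cluster, which mirrors the two clusters visible for Cotocollao in Figure 7.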

To sum up the outcomes of the classification models, the binary classification utilizing the National and International Air Quality Standards as class labels (PM2.5 < 15 µg/m³; PM2.5 > 15 µg/m³) showed a high difference in performance


Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

Classification                 Item                          Cotocollao                 Belisario
<10 µg/m³ versus               Classification rules          Wind speed > 2.5 m/s       Wind speed > 2.2 m/s
10–25 µg/m³                                                  Wind direction = S-SE      Wind direction = SE-SW
                                                             Wind direction = NW-NE
                                                             Precipitation > 1.5 mm
                               Classification performance    73.2% (Figure 7(a))        86.7%
<10 µg/m³ versus               Classification rules          Wind speed > 2 m/s         Wind speed > 2 m/s
>25 µg/m³                                                    Wind direction = S-SE      Wind direction = SE-SW
                                                             Wind direction = NW-NE
                                                             Precipitation > 1 mm
                               Classification performance    88.9% (Figure 7(b))        88.8%
10–25 µg/m³ versus             Classification performance    60.0%                      64.1%
>25 µg/m³

[Figure 7, panels (a) and (b): polar plots of wind direction (compass sectors) and wind speed, with precipitation (mm) as a third axis; (a) shows classes <10 µg/m³ and 10–25 µg/m³, (b) shows classes <10 µg/m³ and >25 µg/m³.]

Figure 7: Data split for three different classes (see Table 6): (a) <10 µg/m³ versus 10–25 µg/m³ and (b) <10 µg/m³ versus >25 µg/m³. Both (a) and (b) are results for Cotocollao mapped in terms of wind direction, wind speed, and precipitation. The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks, as low (PM2.5 < 10 µg/m³), moderate (PM2.5 = 10–25 µg/m³), and high (PM2.5 > 25 µg/m³). This classification showed high performance in categorizing low concentrations in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that are not described by the classification analysis.

[Figure 8, panels (a) and (b): average error (µg/m³) versus precipitation (mm) in (a) and versus wind speed (m/s) in (b), with fitted lines for Cotocollao and Belisario.]

Figure 8: Decrease in average prediction error with increasing parameter values (precipitation and wind speed) for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consist of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for determined weather conditions. Also, the analysis of the data trend over time will inform on the applicability of a time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different classifiers. Bin sizes of 0.5 µg/m³ (0–35 µg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the NN continuous output values, the discrete output values of the other models can have an effect on the classification error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
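The binning step described above can be illustrated with a minimal Python sketch. The bin width and range come from the text; the clipping of out-of-range values, the use of bin centres as the decoded value, and the function names are assumptions made for illustration only.

```python
BIN_WIDTH = 0.5                   # µg/m³, as stated in the text
RANGE_MIN, RANGE_MAX = 0.0, 35.0  # µg/m³, as stated in the text
N_BINS = int((RANGE_MAX - RANGE_MIN) / BIN_WIDTH)  # 70 discrete classes


def to_bin(concentration):
    """Map a PM2.5 value to a class index (assumption: values are clipped to the range)."""
    clipped = min(max(concentration, RANGE_MIN), RANGE_MAX)
    return min(int((clipped - RANGE_MIN) / BIN_WIDTH), N_BINS - 1)


def bin_center(index):
    """Map a class index back to a concentration (assumption: centre of the bin)."""
    return RANGE_MIN + (index + 0.5) * BIN_WIDTH


if __name__ == "__main__":
    for value in (0.0, 12.3, 24.99, 35.0):
        idx = to_bin(value)
        print(value, "-> class", idx, "-> decoded", bin_center(idx))
```

With this encoding, the quantization alone shifts an in-range value by at most half a bin (0.25 µg/m³), which is consistent with the expectation that the effect of discrete outputs on the error is marginal.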

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{3}$$

The mean squared error (MSE) is used to measure the classification performance (see (3)). The MSE is the averaged squared error per prediction, with $y_i$ the real value, $\hat{y}_i$ the predicted value, and $n$ the number of predictions. The mean absolute percentage error (MAPE) is used to express the average prediction error in terms of percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance.

$$\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left(y_i - \hat{y}_i\right) / y_i \right|}{n} \tag{4}$$
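As a small worked illustration of (3) and (4), the following Python sketch computes both measures on made-up values. The function names and the sample data are assumptions, not results from the study. Note that (4) as written yields a fraction, so it is multiplied by 100 when reported as a percentage.

```python
def mse(real, predicted):
    """Mean squared error, equation (3)."""
    n = len(real)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(real, predicted)) / n


def mape(real, predicted):
    """Mean absolute percentage error as a fraction, equation (4).

    Real values must be nonzero, since each error is divided by the real value.
    """
    n = len(real)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(real, predicted)) / n


if __name__ == "__main__":
    real = [10.0, 20.0, 25.0]        # illustrative concentrations in µg/m³
    predicted = [12.0, 18.0, 25.0]   # illustrative model outputs
    print("MSE :", mse(real, predicted))
    print("MAPE:", 100.0 * mape(real, predicted), "%")
```

In this toy case the squared errors are 4, 4, and 0, and the relative errors are 0.2, 0.1, and 0, so the same absolute error of 2 µg/m³ weighs more heavily in the MAPE when the real concentration is low.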

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is explained as the average prediction error (absolute difference between the real and the predicted values; root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing

[Figure 9: time series over the day counter (about day 160 to 280) of predicted and real PM2.5 concentration (µg/m³), plotted together with precipitation (mm) and wind speed (m/s).]

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations.

values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9 shows an example of the comparison of the predicted PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.001. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.001.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
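The two operations behind Figure 9 and the reported correlations, a 5-point box smoothing and a Pearson correlation between real and predicted series, can be sketched as follows. This is a minimal illustration in plain Python: the handling of the series edges (a shrinking window) and the function names are assumptions, and the toy data are not from the study.

```python
import math


def box_smooth(values, window=5):
    """Centred moving average; assumption: the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    real = [14.0, 18.0, 11.0, 9.0, 16.0, 21.0, 13.0]       # toy values
    predicted = [15.0, 16.0, 13.0, 12.0, 15.0, 17.0, 14.0]  # toy values
    print(pearson_r(box_smooth(real), box_smooth(predicted)))
```

Smoothing both series before comparing them emphasizes the day-to-day tendency rather than individual daily deviations, which is the aspect the figure is meant to demonstrate.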

The results of the MSE for the regression show that in both city sites a NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values, an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 µg/m³ for Cotocollao and 19 µg/m³ for Belisario.

To sum up, the present input parameters do not describe an increase in PM2.5 concentrations well once these levels exceed 20 µg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels

Table 7: MSE and MAPE of the NN, L-SVM, and BT on regression (MSE, with MAPE in parentheses).

Model     Belisario      Cotocollao
NN        22.1 (26%)     40.7 (40%)
L-SVM     26.8 (28%)     41.8 (41%)
BT        28.5 (30%)     44.4 (42%)

Table 8: MSE and MAPE of CGM and NN regression (MSE, with MAPE in parentheses).

Model     Belisario      Cotocollao
CGM       15.6 (22%)     15.0 (25%)
NN        22.1 (26%)     40.7 (40%)

beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should help reduce the overprediction and underprediction observed in our model.

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).
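For readers unfamiliar with the procedure, the splitting logic of a 10-fold cross-validation can be sketched as below. The shuffling, the fixed seed, and the helper name `k_fold_indices` are assumptions for illustration; the study itself relied on the Matlab toolbox, whose exact splitting is not described here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumption: random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k folds of nearly equal size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


if __name__ == "__main__":
    for fold_number, (train, valid) in enumerate(k_fold_indices(25, k=10), start=1):
        print(fold_number, len(train), len(valid))
```

Each sample is used for validation exactly once and for training in the other nine rounds, so the error averaged over the folds uses all of the available data.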

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). Note that this reduction is particularly high in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development

[Figure 10: prediction (µg/m³) versus real value (µg/m³), both axes from 0 to 30, with fitted lines for Cotocollao and Belisario.]

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue).

should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
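As a purely illustrative sketch of the LQE idea mentioned above, and not something implemented in this study, a scalar Kalman filter with a random-walk state model could combine each new daily value with the estimate carried over from the previous day. All names and variance values below are hypothetical.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate, initial_var):
    """Scalar Kalman filter assuming a random-walk state (hypothetical setup)."""
    estimate, var = initial_estimate, initial_var
    filtered = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but its uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement according to their variances.
        gain = var / (var + measurement_var)
        estimate += gain * (z - estimate)
        var *= 1.0 - gain
        filtered.append(estimate)
    return filtered


if __name__ == "__main__":
    daily_values = [18.0, 21.0, 16.0, 19.0, 25.0, 17.0]  # toy values in µg/m³
    print(kalman_1d(daily_values, process_var=1.0, measurement_var=4.0,
                    initial_estimate=18.0, initial_var=1.0))
```

The ratio between the two variances controls how strongly the previous days' readings damp the influence of a single noisy value.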

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 µg/m³) versus high (>25 µg/m³) and low (<10 µg/m³) versus moderate (10–25 µg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 µg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 µg/m³ and that the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression improves performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and on the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions not only in forecasting air quality but also in the relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, which can cost over 100 times more. Finally, this model is based on the three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 µg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed during different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wild fires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 µg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015). World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016). Air pollution levels rising in many of the world’s poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM ’16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in pm10, pm2.5 and pm1.0 in an urban area of the sichuan basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL ’15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User’s Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.

[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


Page 9: Modeling PM Urban Pollution Using Machine Learning and ... · ResearchArticle Modeling PM 2.5 Urban Pollution Using Machine Learning and Selected Meteorological Parameters JanKleineDeters,1

Journal of Electrical and Computer Engineering 9

Table 6: Classification rules and pairwise comparisons between the different classes and their respective performance.

Classification                  Item                         Cotocollao                Belisario
<10 μg/m³ versus 10–25 μg/m³    Classification rules         Wind speed > 2.5 m/s      Wind speed > 2.2 m/s
                                                             Wind direction = S-SE     Wind direction = SE-SW
                                                             Precipitation > 1.5 mm    Wind direction = NW-NE
                                Classification performance   73.2% (Figure 7(a))       86.7%
<10 μg/m³ versus >25 μg/m³      Classification rules         Wind speed > 2 m/s        Wind speed > 2 m/s
                                                             Wind direction = S-SE     Wind direction = SE-SW
                                                             Precipitation > 1 mm      Wind direction = NW-NE
                                Classification performance   88.9% (Figure 7(b))       88.8%
10–25 μg/m³ versus >25 μg/m³    Classification performance   60.0%                     64.1%

Figure 7: Data split for three different classes (see Table 6): (a) <10 μg/m³ versus 10–25 μg/m³ and (b) <10 μg/m³ versus >25 μg/m³. Both (a) and (b) are results for Cotocollao, mapped in terms of wind direction, wind speed, and precipitation (mm). The inner circle represents wind speeds up to 2 m/s and the outer circle represents wind speeds up to 4 m/s.

between the two sites. In order to explain this difference and the misclassifications, the analysis was refined to a three-class classification based on WHO's guidelines regarding the consequences of PM2.5 concentrations on health risks, as low (PM2.5 < 10 μg/m³), moderate (PM2.5 = 10–25 μg/m³), and high (PM2.5 > 25 μg/m³). This classification showed high performance in categorizing low concentrations in contrast to high concentrations. Next, we propose a regression analysis to pinpoint the upper boundary of PM2.5 values for which the weather parameters are still able to explain variation in pollution levels that are not described by the classification analysis.
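The three-class labeling described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `pm25_class` is hypothetical, and assigning the boundary values 10 and 25 μg/m³ to the moderate class is an assumption, since the text does not state how exact boundary values are handled.

```python
def pm25_class(concentration):
    """Label a daily PM2.5 concentration (in ug/m3) as low, moderate, or high.

    Thresholds follow the classes quoted in the text (<10, 10-25, >25).
    Putting exactly 10 and exactly 25 into "moderate" is an assumption.
    """
    if concentration < 0:
        raise ValueError("a concentration cannot be negative")
    if concentration < 10:
        return "low"
    if concentration <= 25:
        return "moderate"
    return "high"


if __name__ == "__main__":
    for value in (4.2, 10.0, 18.5, 25.0, 31.7):
        print(value, pm25_class(value))
```

Labels produced this way could then be fed to any pairwise (low versus moderate, low versus high, moderate versus high) classifier.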


Figure 8: Decrease in average prediction error (μg/m³) with increasing parameter values, (a) precipitation (mm) and (b) wind speed (m/s), for Cotocollao (orange) and Belisario (blue).

5. Regression Analyses

In this section, an additional machine learning analysis based on BT, L-SVM, and Neural Networks (NN) is used to perform a regression for both sites. Default parameters provided by the Matlab toolbox software are used to set up the models. NN are appropriate models for highly nonlinear modeling and when no prior knowledge about the relationship between the parameters is assumed. The NN consists of 10 nodes in 1 hidden layer, trained with a Levenberg-Marquardt procedure in combination with a random data division. Identifying the correlation between the real and predicted values gives us the topological coherence between the input and output parameter values. In addition, the error related to the parameter values provides insight regarding the prediction confidence for given weather conditions. Also, the analysis of the data trend over time will inform on the applicability of time series forecasting. Finally, the CGM is used to remark on the possibility of optimizing the regression.

5.1. Regression Models. A regression is performed with three different classifiers. Bin sizes of 0.5 μg/m³ (0–35 μg/m³ range) are used for the models that output discrete class values (BT and SVM). This relatively small bin size permits these models to perform regression, as their output values closely approach continuous values. The additional parameters of the models are set up as explained in the binary and three-class classification (Sections 4.1 and 4.2). The models are trained with 10-fold cross-validation. The test set is 20% of the original data. Unlike the continuous output values of the NN, the discrete output values of the other models can have an effect on the classification error. However, as the bin size is relatively small, we expect the errors related to these types of output to be marginal.
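To illustrate why a 0.5 μg/m³ bin size keeps the discretization error marginal, the following minimal Python sketch maps a concentration to a bin and back to a representative value. The names `to_bin` and `bin_center`, the clipping to the 0–35 μg/m³ range, and the use of the bin centre as the decoded value are assumptions made for illustration; the paper does not specify these details.

```python
BIN_SIZE = 0.5      # ug/m3, bin width quoted in the text
RANGE_LOW = 0.0     # ug/m3, lower end of the binned range
RANGE_HIGH = 35.0   # ug/m3, upper end of the binned range
N_BINS = int(round((RANGE_HIGH - RANGE_LOW) / BIN_SIZE))  # 70 classes


def to_bin(concentration):
    """Return the class index of a concentration (values are clipped to the range)."""
    clipped = min(max(concentration, RANGE_LOW), RANGE_HIGH)
    index = int((clipped - RANGE_LOW) / BIN_SIZE)
    return min(index, N_BINS - 1)  # the upper edge falls into the last bin


def bin_center(index):
    """Return the concentration represented by a class index (assumed: bin centre)."""
    return RANGE_LOW + (index + 0.5) * BIN_SIZE


if __name__ == "__main__":
    value = 12.3
    decoded = bin_center(to_bin(value))
    print(value, "->", to_bin(value), "->", decoded)
```

Under these assumptions, any value inside the range is recovered to within half a bin width (0.25 μg/m³), which is small compared with the prediction errors reported for the models.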

$$\mathrm{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad (3)$$

The mean squared error (MSE) is used to measure the classification performance (see (3)). The MSE is the averaged squared error per prediction. The mean absolute percentage error (MAPE) is used to express the average prediction error in terms of percentage of a data point's real value (see (4)). The MAPE function provides a more intuitive understanding of the performance. Here $y_i$ denotes the real value of data point $i$, $\hat{y}_i$ its predicted value, and $n$ the number of predictions.

$$\mathrm{MAPE} = \frac{\sum_{i=1}^{n} \left| \left( y_i - \hat{y}_i \right) / y_i \right|}{n} \qquad (4)$$
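The two error measures can be written directly from their definitions. The short Python sketch below is illustrative only; the function names `mse` and `mape` and the sample numbers are hypothetical and do not come from the study's data.

```python
def mse(real, predicted):
    """Mean squared error: average of the squared differences."""
    n = len(real)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(real, predicted)) / n


def mape(real, predicted):
    """Mean absolute percentage error, returned as a fraction of the real value.

    Real values must be non-zero, since each error is divided by the real value.
    """
    n = len(real)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(real, predicted)) / n


if __name__ == "__main__":
    real = [10.0, 20.0]        # made-up concentrations in ug/m3
    predicted = [12.0, 15.0]   # made-up predictions in ug/m3
    print("MSE:", mse(real, predicted))
    print("MAPE:", mape(real, predicted))
```

Multiplying the MAPE by 100 gives the percentage form in which it is usually reported.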

An analysis of the confidence levels in relation to the precipitation and wind speed parameters is shown in Figure 8. The prediction confidence rises when the parameter values increase. A level of confidence is defined as the average prediction error (absolute difference between the real and the predicted values; root of MSE) at a certain interval with respect to an input parameter. In Figure 8, fitted lines represent the predicted data in terms of their absolute error with respect to precipitation and wind speed for both sites. The decrease in errors can be seen with respect to increasing values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than moderate climatic conditions.

Figure 9: Neural Network's regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. Axes: day counter (160–280), PM2.5 concentration (μg/m³), precipitation (mm), and wind speed (m/s).

Figure 9 shows an example of the comparison of the predictive model of PM2.5 concentration and the real PM2.5 concentration for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Despite a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real concentrations and the predicted concentrations, r(130) = 0.5, p < 0.000. Also, the model performance is relatively good throughout the study period. The correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
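The 5-point box smoothing and the correlation between real and predicted series mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `box_smooth` and `pearson_r` are hypothetical, the window is assumed to be centred and to shrink at the ends of the series (the text does not say how the edges were treated), and the sample series are made up.

```python
import math


def box_smooth(values, width=5):
    """Centred moving average; the window shrinks near both ends (an assumption)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    real = [14.0, 18.0, 11.0, 22.0, 16.0, 9.0, 13.0]       # made-up daily values
    predicted = [15.0, 16.0, 13.0, 18.0, 17.0, 12.0, 14.0]  # made-up predictions
    print(box_smooth(real))
    print(pearson_r(real, predicted))
```

Whether the reported correlations were computed on the raw or on the smoothed series is not stated in the text, so the sketch leaves that choice to the caller.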

The results of the MSE for the regression show that, in both city sites, a NN performs the best (see Table 7). The correlation analysis shows that there is a logarithmic relationship between the real particle concentration values and the prediction (Figure 10). It means that there is an overprediction for low values and an underprediction for high values, and an overall decrease in correlation as values get higher. The correlation seems the best for values around 17 μg/m³ for Cotocollao and 19 μg/m³ for Belisario.

To sum up, the present input parameters do not describe well an increase in PM2.5 concentrations if these levels exceed 20 μg/m³, as errors increase at this point and prediction values stagnate. Thus, additional parameters must be considered for the prediction of PM2.5 levels beyond this concentration threshold, since meteorological factors alone are not able to account for the whole particulate matter concentrations. For instance, considering human activity (e.g., car traffic), which is the main source of pollution, should contribute to the reduction of the overprediction and underprediction observed in our model.

Table 7: MSE and MAPE (in parentheses) of the NN, L-SVM, and BT on regression.

Model    Belisario     Cotocollao
NN       22.1 (26%)    40.7 (40%)
L-SVM    26.8 (28%)    41.8 (41%)
BT       28.5 (30%)    44.4 (42%)

Table 8: MSE and MAPE (in parentheses) of CGM and NN regression.

Model    Belisario     Cotocollao
CGM      15.6 (22%)    15.0 (25%)
NN       22.1 (26%)    40.7 (40%)

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue). Axes: real value (μg/m³) versus prediction (μg/m³), both from 0 to 30.

5.2. Optimization. The CGM, as applied in Section 3.3, could be used in classification tasks. In this section, a 10-fold cross-validation on regression with this model is applied to compare it with the best performing model (NN).

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). It should be noted that this reduction is particularly high in the case of Cotocollao. It seems that the model is able to better handle the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao than the NN. The similar performance in both sites means that this model has the potential to be applied in various situations with similar expected error rates. Further development should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
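As a rough illustration of the kind of linear quadratic estimation mentioned above, the following Python sketch implements a scalar Kalman filter over a daily series. It is not part of the study: the random-walk state model, the function name `kalman_filter`, and the two variance values are assumptions chosen only to make the example concrete.

```python
def kalman_filter(measurements, process_var=1.0, measurement_var=4.0):
    """Scalar Kalman filter with a random-walk state model (an assumed model).

    process_var: assumed day-to-day variance of the true concentration.
    measurement_var: assumed variance of each daily reading or prediction.
    Returns the filtered estimate for every time step.
    """
    estimate = measurements[0]   # start from the first reading
    variance = measurement_var   # and its assumed uncertainty
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: the state is carried over, its uncertainty grows.
        variance += process_var
        # Update: blend prediction and new reading with the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        estimates.append(estimate)
    return estimates


if __name__ == "__main__":
    noisy = [15.0, 19.0, 14.0, 21.0, 17.0, 16.0]  # made-up daily values in ug/m3
    print(kalman_filter(noisy))
```

In such a scheme, the estimate for each day depends on the readings of the previous days through the carried-over state, which is one way the day-by-day dependency could be exploited.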

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and the accuracy of the predictions is improved in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the whole study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression shows improved performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with the use of time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models for complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions in not only forecasting air quality but also relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable compared to air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

Also, this work provides an insight into the main limitations regarding PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed during different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that would mix a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wild fires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs (2015), World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre (2016), Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising.

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, "The contribution of outdoor air pollution sources to premature mortality on a global scale," Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, "Health effects of fine particulate air pollution: lines that connect," Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, "Machine learning approach to forecasting urban pollution: a case study of Quito," in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., "The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area," Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, "Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors," Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, "Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan," International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., "Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events," Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., "A comprehensive evaluation of air pollution prediction improvement by a machine learning method," in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, "Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model," Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, "Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador)," in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, "Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data," Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, "Forecasting smog-related health hazard based on social media and physical sensor," Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, "Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong," International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, "A novel hybrid strategy for PM2.5 concentration analysis and prediction," Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, "Identifying pollution sources and predicting urban air quality using ensemble learning methods," Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, "Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches," Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, "Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations," Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, "Air quality prediction using optimal neural networks with stochastic variables," Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, "Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model," Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, "Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm," Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.


[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "Yale: rapid prototyping for complex data mining tasks," in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, "Some topics in convolution-based spatial modeling," in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, "A generalized convolution model and estimation for non-stationary random functions," Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, "Uniqueness of the Gaussian kernel for scale-space filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, "Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015."

[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, "On the ability of the WRF model to reproduce the surface wind direction over complex terrain," Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., "The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations," Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., "Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model," Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.


10 Journal of Electrical and Computer Engineering

Precipitation (mm)

5

250

0

CotocollaoBelisario

Aver

age e

rror

(휇gm

3)

(a)

Wind speed (ms)5

CotocollaoBelisario

5

00

Aver

age e

rror

(휇gm

3)

(b)

Figure 8 Decrease in average prediction error with increasing parameter values (precipitation and wind speed) for Cotocollao (orange) andBelisario (blue)

5 Regression Analyses

In this section an additional machine learning analysis basedon BT L-SVM and Neural Networks (NN) is used to per-form a regression for both sites Default parameters providedby the Matlab toolbox software are used to set up the modelsNN are appropriate models for highly nonlinear model-ing and when no prior knowledge about the relationshipbetween the parameters is assumed The NN consist of 10nodes in 1 hidden layer trained with a Levenberg-Marquardtprocedure in combination with a random data divisionIdentifying the correlation between the real and predictedvalues gives us the topological coherence between the inputand output parameter values In addition the error related tothe parameter values provides insight regarding the predic-tion confidence for determined weather conditions Also theanalysis of the data trend over time will inform on the appli-cability of a time series forecasting Finally the CGM is usedto remark on the possibility of optimizing the regression

51 Regression Models A regression is performed with threedifferent classifiers Bin sizes of 05 120583gm3 (0ndash35 120583gm3 range)are used for the models that output discrete class values (BTand SVM) This relatively small bin size permits thesemodels to perform regression as their output values closelyapproach continuous valuesThe additional parameters of themodels are set up as explained in the binary and three-classclassification (Sections 41 and 42) The models are trainedwith 10-fold cross-validation The test set is 20 of the

original data Unlike the NN continuous output values thediscrete output values of the other models can have an effecton the classification errorHowever as the bin size is relativelysmall we expect the errors related to these types of output tobe marginal

MSE = 1119899 sdot119899sum119894=1

(119910119894minus 119910119894)2 (3)

The mean squared error (MSE) is used to measure theclassification performance (see (3)) TheMSE is the averagedsquared error per prediction The mean absolute percentageerror (MAPE) is used to express the average prediction errorin terms of percentage of a data pointrsquos real value (see (4))TheMAPE function provides a more intuitive understandingof the performance

MAPE = sum119899119894=1 1003816100381610038161003816(119910119894 minus 119910119894) 1199101198941003816100381610038161003816119899 (4)

An analysis of the confidence levels in relation to the pre-cipitation and wind speed parameters is shown in Figure 8The prediction confidence rises when the parameter valuesincrease A level of confidence is explained as the averageprediction error (absolute difference between the real and thepredicted values root of MSE) at a certain interval withrespect to an input parameter In Figure 8 fitted lines repre-sent the predicted data in terms of their absolute error withrespect to precipitation and wind speed for both sites Thedecrease in errors can be seen with respect to increasing

Journal of Electrical and Computer Engineering 11

180 200 220 240 260 280160Day counter

Predicted PM25 concentrationReal PM25 concentration

0

10

20

30

40

PM25

conc

entr

atio

n(휇

gm

3)

Precipitation Wind speedWave 1

10

20

30

40

Prec

ipita

tion

(mm

)

25

30

35

40

45

50

55

60

Win

d sp

eed

(ms

)

Figure 9 Neural Networkrsquos regressive prediction of Cotocollao PM25

concentration (light grey) compared to the real data (dark grey) duringthe wet season plotted against daily rain accumulation and wind speed thresholds gt1mm and gt25ms respectively (see Table 6 thresholdsobtained from 3-class classification) The dashed black line represents the national standards for PM

25annual concentrations

values of these specified input parameters It suggests that theprediction of PM

25concentration ismore reliable for extreme

than moderate climatic conditionsFigure 9 shows an example of the comparison of the

predictive models of PM25

concentration and the real PM25

concentration for Cotocollao during six months of a wetseason (first half of 2008) The graph shows the 5-point box-smoothed data to demonstrate the good prediction of thetendency of the PM

25concentrations Besides a certain gap

the estimated values seem to fairly correlate with the real dataThe correlation analysis shows a significant positive corre-lation between the real concentrations and the predictedconcentrations 119903(130) = 05 119901 lt 0000 Also the modelperformance is relatively good throughout the study periodThe correlation analysis for all of the data shows a significantpositive correlation between the real and predicted PM

25

concentrations 119903(1534) = 034 119901 lt 0000This visualization shows that the error of predicted

concentration seems to increase when PM25

concentrationincreases The reduction in both real and estimated PM

25

concentrations coincides with rain events and wind speedsabove the thresholds defined in Table 6 (gt1mm and gt25msresp)

The results of the MSE for the regression show that inboth city sites a NN performs the best (see Table 7) Thecorrelation analysis shows that there is a logarithmic relation-ship between the real particle concentration values and theprediction (Figure 10) It means that there is an overpredic-tion for low values and an underprediction for high valuesand an overall decrease in correlation as values get higherThecorrelation seems the best for values around 17120583gm3 forCot-ocollao and 19 120583gm3 for Belisario

To sum up the present input parameters do not welldescribe an increase in PM

25concentrations if these levels are

transcending values over 20120583gm3 as errors increase at thispoint and prediction values stagnateThus additional param-eters must be considered for the prediction of PM

25levels

Table 7 MSE andMAPE of the NN L-SVM and BT on regression

Model LocationBelisario Cotocollao

NN 221 (26) 407 (40)L-SVM 268 (28) 418 (41)BT 285 (30) 444 (42)

Table 8 MSE and MAPE of CGM and NN regression

Model LocationBelisario Cotocollao

CGM 156 (22) 150 (25)NN 221 (26) 407 (40)

beyond this concentration threshold since meteorologicalfactors alone are not able to account for the whole particulatematter concentrations For instance considering humanactivity (eg car traffic) which is the main source of pollu-tion should contribute to the reduction of the overpredictionand underprediction observed in our model

52 Optimization TheCGM as applied in Section 33 couldbe used in classification tasks In this section a 10-foldcross-validation on regression with this model is applied tocompare it with the best performing model (NN)

The results show a substantial reduction in MSE with the CGM regression compared to the NN regression for the two city sites (see Table 8). Notably, this reduction is particularly large in the case of Cotocollao. It seems that the model handles the dense (see Figure 4) and noisy (as stated in Section 4.3) data of Cotocollao better than the NN. The similar performance at both sites suggests that this model has the potential to be applied in various situations with similar expected error rates. Further development should help to qualify the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue). Axes: real value (μg/m³) versus prediction (μg/m³).
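To make the reference to linear quadratic estimation concrete, here is a minimal one-dimensional Kalman filter applied to a daily series. It is a generic textbook sketch, not a method used in this study: the random-walk state model and the variance values are assumptions chosen for illustration.

```python
def kalman_1d(measurements, process_var=1.0, measurement_var=4.0,
              initial_estimate=0.0, initial_var=1000.0):
    """Scalar Kalman filter with a random-walk state model.

    Returns the filtered estimate after each measurement.
    """
    estimate, var = initial_estimate, initial_var
    filtered = []
    for z in measurements:
        # Predict: the state is assumed unchanged from the previous day,
        # but its uncertainty grows by the process variance.
        var += process_var
        # Update: blend the prediction and the new reading by the Kalman gain.
        gain = var / (var + measurement_var)
        estimate += gain * (z - estimate)
        var *= 1.0 - gain
        filtered.append(estimate)
    return filtered


# Hypothetical noisy daily readings in ug/m3, for illustration only.
daily_readings = [18.0, 21.0, 17.5, 19.0, 24.0, 20.5, 19.5]
print([round(v, 2) for v in kalman_1d(daily_readings)])
```

A larger measurement variance makes the filter trust the previous days more, which smooths the series; a larger process variance makes it follow day-by-day shifts more closely.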

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation, mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified using different machine learning models. This classification is performed on six years of records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although both sites are in Quito’s urbanized area, they differ in the spread and dominance of the wind features (speed and direction) that account for high PM2.5 concentrations and in the distribution of pollution levels over the years. This could be because Belisario is more urbanized than Cotocollao and, more importantly, because of the extremely complex terrain of the city.
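The standard levels that the classifiers target are low (<10 μg/m³), moderate (10–25 μg/m³), and high (>25 μg/m³). A small labeling helper makes these class boundaries explicit. The function name is hypothetical, and assigning the boundary values 10 and 25 to the moderate class is an assumption based on how the ranges are written.

```python
def pm25_class(concentration):
    """Map a daily PM2.5 concentration (ug/m3) to its class label."""
    if concentration < 10.0:
        return "low"
    if concentration <= 25.0:
        return "moderate"
    return "high"


# Hypothetical daily concentrations, for illustration only.
for value in (6.5, 10.0, 18.2, 25.0, 31.4):
    print(value, pm25_class(value))
```

Labels produced this way are what a classifier would be trained to predict from wind speed, wind direction, and precipitation.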

For these two districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions improves in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real and the predicted concentrations over the whole study period. The slightly higher correlation during the rainy season confirms that the model predicts PM2.5 concentrations better under more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression improves performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and on the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models to complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecision not only in forecasting air quality but also in the relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative for predicting PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, whose price can be over 100 times higher. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city with such a complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

This work also provides insight into the main limitations of PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed at different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that mixes a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wildfires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms for data processing and modeling [16, 17, 22].
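One simple way to combine several regressors into an ensemble is a weighted average in which each model's weight is inversely proportional to its validation MSE. The sketch below illustrates that generic idea only. It is not the ensemble strategy of the cited works, and the model names, MSE values, and predictions are hypothetical.

```python
def inverse_mse_weights(validation_mse):
    """Normalized weights proportional to 1 / MSE for each model."""
    inverse = {name: 1.0 / value for name, value in validation_mse.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_predict(predictions, weights):
    """Weighted average of the per-model predictions for one day."""
    return sum(weights[name] * predictions[name] for name in predictions)


# Hypothetical validation errors and single-day predictions (ug/m3).
validation_mse = {"model_a": 20.0, "model_b": 30.0, "model_c": 60.0}
day_predictions = {"model_a": 18.0, "model_b": 21.0, "model_c": 15.0}

weights = inverse_mse_weights(validation_mse)
print(weights)
print(ensemble_predict(day_predictions, weights))
```

Models with a lower validation error contribute more to the combined prediction, while no single model is discarded outright.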

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs, (2015) World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre, (2016) Air pollution levels rising in many of the world’s poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM ’16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL ’15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User’s Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014, Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.

[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente: Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.



Figure 9: Neural Network’s regressive prediction of Cotocollao PM2.5 concentration (light grey) compared to the real data (dark grey) during the wet season, plotted against daily rain accumulation and wind speed thresholds, >1 mm and >2.5 m/s, respectively (see Table 6; thresholds obtained from 3-class classification). The dashed black line represents the national standards for PM2.5 annual concentrations. Axes: day counter (horizontal); PM2.5 concentration (μg/m³), precipitation (mm), and wind speed (m/s) (vertical).

values of these specified input parameters. It suggests that the prediction of PM2.5 concentration is more reliable for extreme than for moderate climatic conditions.

Figure 9 shows an example of the comparison between the predicted and the real PM2.5 concentrations for Cotocollao during six months of a wet season (first half of 2008). The graph shows the 5-point box-smoothed data to demonstrate the good prediction of the tendency of the PM2.5 concentrations. Apart from a certain gap, the estimated values seem to correlate fairly well with the real data. The correlation analysis shows a significant positive correlation between the real and the predicted concentrations, r(130) = 0.5, p < 0.000. The model performance is also relatively good throughout the study period: the correlation analysis for all of the data shows a significant positive correlation between the real and predicted PM2.5 concentrations, r(1534) = 0.34, p < 0.000.

This visualization shows that the error of the predicted concentration seems to increase when the PM2.5 concentration increases. The reduction in both real and estimated PM2.5 concentrations coincides with rain events and wind speeds above the thresholds defined in Table 6 (>1 mm and >2.5 m/s, resp.).
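The two operations behind this comparison, 5-point box smoothing and the Pearson correlation between real and predicted series, can be sketched as follows. The edge handling of the smoothing (a window that shrinks at the ends of the series) is an assumption, since the text does not specify it, and the sample series is hypothetical. In the r(df) notation used above, the degrees of freedom equal the number of paired days minus two.

```python
import math


def box_smooth(values, window=5):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical daily series in ug/m3, for illustration only.
real_series = [14.0, 19.0, 23.0, 17.0, 12.0, 16.0, 21.0, 18.0]
predicted_series = [16.0, 18.0, 19.0, 17.5, 15.0, 16.5, 18.5, 17.0]

r = pearson_r(box_smooth(real_series), box_smooth(predicted_series))
print(f"r({len(real_series) - 2}) = {r:.2f}")
```

Smoothing before comparing emphasizes the agreement in tendency rather than in day-to-day values, which matches how the figure is described.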

The results of the MSE for the regression show that inboth city sites a NN performs the best (see Table 7) Thecorrelation analysis shows that there is a logarithmic relation-ship between the real particle concentration values and theprediction (Figure 10) It means that there is an overpredic-tion for low values and an underprediction for high valuesand an overall decrease in correlation as values get higherThecorrelation seems the best for values around 17120583gm3 forCot-ocollao and 19 120583gm3 for Belisario

To sum up the present input parameters do not welldescribe an increase in PM

25concentrations if these levels are

transcending values over 20120583gm3 as errors increase at thispoint and prediction values stagnateThus additional param-eters must be considered for the prediction of PM

25levels

Table 7 MSE andMAPE of the NN L-SVM and BT on regression

Model LocationBelisario Cotocollao

NN 221 (26) 407 (40)L-SVM 268 (28) 418 (41)BT 285 (30) 444 (42)

Table 8 MSE and MAPE of CGM and NN regression

Model LocationBelisario Cotocollao

CGM 156 (22) 150 (25)NN 221 (26) 407 (40)

beyond this concentration threshold since meteorologicalfactors alone are not able to account for the whole particulatematter concentrations For instance considering humanactivity (eg car traffic) which is the main source of pollu-tion should contribute to the reduction of the overpredictionand underprediction observed in our model

52 Optimization TheCGM as applied in Section 33 couldbe used in classification tasks In this section a 10-foldcross-validation on regression with this model is applied tocompare it with the best performing model (NN)

The results show a substantial reduction in MSE withthe CGM regression compared to the NN regression for thetwo city sites (see Table 8) It is to note that this diminution isparticularly high in the case of Cotocollao It seems that themodel is able to better handle the dense (see Figure 4) andnoisy (as stated in Section 43) data of Cotocollao than theNN The similar performance in both sites means that thismodel has the potential to be applied in various situa-tions with similar expected error rates Further development

12 Journal of Electrical and Computer Engineering

15 300

Real value (휇gm3)

0

15

30

Pred

ictio

n (휇

gm

3)

CotocollaoBelisario

Figure 10 Fitted lines representing the correlation between pre-dicted values and real values through aNN algorithm for Cotocollao(orange) and Belisario (blue)

should aid in qualifying the true robustness of this approachby exploiting the possibility of modeling with other spatialdependencies such as density of measurements and day-by-day shifts which represent the degree of freedom ofparameters related to readings of the previous day(s) Thelatter dependency could be combined with linear quadraticestimation (LQE) techniques such as Kalman filters to im-prove the precision

6 Conclusions and Perspectives

This study proposes a machine learning approach to predictPM25

concentrations from meteorological data in a high-elevation mid-sized city (Quito Ecuador) Standard levels offine particulate matter are classified by using differentmachine learning models This classification is performed onsix yearsrsquo records of dailymeteorological values of wind speed(ms) wind direction (0ndash360∘) and precipitation accumu-lation (mm) for two air quality monitoring sites located inQuito (Cotocollao and Belisario) Although these sites areboth in Quitorsquos urbanized area they exhibit differences inspread and dominance regarding wind features (speed anddirection) that account for high PM

25concentrations and

distribution of pollution levels over the years This could becaused by the fact that Belisario ismore urbanized thanCoto-collao and more importantly due to the extremely complexterrain of the city

For these two different districts the results show a highreliability in the classification of low (lt10 120583gm3) versushigh (gt25 120583gm3) and low (lt10 120583gm3) versus moderate

(10ndash25 120583gm3) PM25

concentrations We found well definedclusters within the parameter space for PM

25concentrationslt 10 120583gm3 The regression analysis shows that the used

parameters can predict PM25

concentrations up to 20120583gm3and the accuracy of the predictions is improved in condi-tions of strong winds and high precipitation for both Coto-collao and BelisarioThere is a significant positive correlationbetween the real concentrations and the predicted concen-trations for all the study period The slightly higher corre-lation during the rainy season confirms that the model canpredict PM

25concentrations better for more extreme weath-

er conditionsUsing a convolutional based spatial representation (CGM)

to perform regression shows improving performance com-pared to various used machine learning algorithms (NN L-SVM and BT) In addition to this model finding trends overperiods of time with the use of time series algorithms couldfurther improve the prediction and would make a long-termforecasting of PM

25concentrations possible [13]

Themain contribution of this study is to propose an alter-native approach to chemical transport numerical modelingsuch as WRF-Chem or CMAQ the performance of whichdepends on several input parameters (emission inventoryorography etc) and the accuracy of built-in meteorologicalmodels (WRF MM5) The application of numerical modelsfor complex terrain regions is challenging since importanttopographic features are not well represented [11 33] Thisproduces imprecisions in not only forecasting air quality butalso relevant meteorology [10 12 34 35] Here the proposedmodel provides a more reliable and more economical alter-native to predict PM

25levels as it only requires meteoro-

logical data acquisition In addition accurate meteorologicaltechnology is far more affordable compared to air qualitysensors that can exceed the price over 100 times Finally thismodel is based on the three basic meteorological parameters(wind speed wind direction and precipitation) which have astraightforward effect on pollutionThus by considering thatour model has a good prediction efficiency for a city of sucha complex topography we argue that it could be success-fully applied in other tropical locations (regions of reducedchanges in solar angle temperature and relative humidity)

Also this work provides an insight into the main limi-tations regarding PM

25prediction from meteorological data

andmachine learningThe classification and regression showthat concentrations gt 20120583gm3 seem to be influenced moreby additional parameters than the meteorological factorsused in this study For example although daily temperaturesolar radiation and pressure do not vary much during theyear theymightmake a difference if analyzed during differenttimes of the day causing different pollution levels in the cityAn interesting approach to tackle this limitation would be toconsider a hybrid model that would mix a numerical method(WRF-Chem or CMAQ) with machine learning algorithms[10]

Other climatic conditions and unusual impactful eventscausing higher pollution levels (festivities wild fires acci-dents seasonal variability or natural calamities) could alsoexplain changes in PM

25concentrations exceeding 20120583gm3

Journal of Electrical and Computer Engineering 13

Future work will consist of identifying the parameters orevents causing values above this threshold Furthermore weintend to improve our CGM and use it to classify outliers andfind their cause Considering the diverse machine learningmodels used in air quality prediction such asNeuralNetwork[13ndash15] regression [18] decision trees and Support VectorMachine [17] we applied and testedmost of these classifiers inthis study Alternative approaches to improve the accuracy ofourmodel would consist of performing a prediction based onan ensemble of different algorithms of data processing andmodeling [16 17 22]

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

The authors would like to thank David R Sannino for editingthe text

References

[1] United Nations Department of Economic and Social Affairs(2015) World Population Prospects the 2015 Revision inPopulation Division edited UN

[2] World Health OrganizationMedia Centre (2016) Air pollutionlevels rising in many of the worldrsquos poorest cities httpwwwwhointmediacentrenewsreleases2016air-pollution-rising

[3] J Lelieveld J S Evans M Fnais D Giannadaki and A PozzerldquoThe contribution of outdoor air pollution sources to prematuremortality on a global scalerdquo Nature vol 525 no 7569 pp 367ndash371 2015

[4] C A Pope andDWDockery ldquoHealth effects of fine particulateair pollution lines that connectrdquo Journal of the Air and WasteManagement Association vol 56 no 6 pp 709ndash742 2006

[5] Y Rybarczyk and R Zalakeviciute ldquoMachine learning approachto forecasting urban pollution a case study of Quitordquo inProceedings of the IEEE Ecuador Technical Chapters Meeting(ETCM rsquo16) Guayaquil Ecuador 2016

[6] M A Pohjola A Kousa J Kukkonen et al ldquoThe spatial andtemporal variation of measured urban PM

10and PM

25in the

Helsinkimetropolitan areardquoWater Air and Soil Pollution Focusvol 2 no 5 pp 189ndash201 2002

[7] Y Li Q Chen H Zhao L Wang and R Tao ldquoVariations inpm10 pm25 and pm10 in an urban area of the sichuan basinand their relation to meteorological factorsrdquoAtmosphere vol 6no 1 pp 150ndash163 2015

[8] J Wang and S Ogawa ldquoEffects of meteorological conditions onPM25 concentrations inNagasaki Japanrdquo International Journalof Environmental Research and Public Health vol 12 no 8 pp9089ndash9101 2015

[9] F Zhang H Cheng Z Wang et al ldquoFine particles (PM25) ata CAWNET background site in central China chemical com-positions seasonal variations and regional pollution eventsrdquoAtmospheric Environment vol 86 pp 193ndash202 2014

[10] X Xi Z Wei R Xiaoguang et al ldquoA comprehensive evalu-ation of air pollution prediction improvement by a machinelearning methodrdquo in Proceedings of the 10th IEEE International

Conference on Service Operations and Logistics and InformaticsSOLI 2015 - In conjunction with ICT4ALL rsquo15 pp 176ndash181Hammamet Tunisia November 2015

[11] P A Jimenez and J Dudhia ldquoImproving the representationof resolved and unresolved topographic effects on surfacewind in the WRF modelrdquo Journal of Applied Meteorology andClimatology vol 51 no 2 pp 300ndash316 2012

[12] R Parra and V Dıaz ldquoPreliminary comparison of ozone con-centrations provided by the emission inventoryWRF-Chemmodel and the air quality monitoring network from the DistritoMetropolitano de Quito (Ecuador)rdquo in Proceedings of the 8thannual WRF Userrsquos Workshop NCAR Boulder Colo USA

[13] X Ni H Huang and W Du ldquoRelevance analysis and short-term prediction of PM25 concentrations in Beijing based onmulti-source datardquo Atmospheric Environment vol 150 pp 146ndash161 2017

[14] J Chen H Chen Z Wu D Hu and J Z Pan ldquoForecastingsmog-related health hazard based on social media and physicalsensorrdquo Information Systems vol 64 pp 281ndash291 2017

[15] J Zhang and W Ding ldquoPrediction of air pollutants concen-tration based on an extreme learning machine the case ofHong Kongrdquo International Journal of Environmental Researchand Public Health vol 14 no 2 p 114 2017

[16] P Jiang Q Dong and P Li ldquoA novel hybrid strategy for PM25concentration analysis and predictionrdquo Journal of Environmen-tal Management vol 196 pp 443ndash457 2017

[17] K P Singh S Gupta and P Rai ldquoIdentifying pollution sourcesand predicting urban air quality using ensemble learningmethodsrdquo Atmospheric Environment vol 80 pp 426ndash437 2013

[18] C Brokamp R Jandarov M B Rao G LeMasters and PRyan ldquoExposure assessment models for elemental componentsof particulate matter in an urban environment a comparison ofregression and random forest approachesrdquo Atmospheric Envi-ronment vol 151 pp 1ndash11 2017

[19] M Arhami N Kamali and M M Rajabi ldquoPredicting hourlyair pollutant levels using artificial neural networks coupled withuncertainty analysis by Monte Carlo simulationsrdquo Environmen-tal Science and Pollution Research vol 20 no 7 pp 4777ndash47892013

[20] A Russo F Raischel and P G Lind ldquoAir quality predictionusing optimal neural networks with stochastic variablesrdquoAtmo-spheric Environment vol 79 pp 822ndash830 2013

[21] M Fu W Wang Z Le and M S Khorram ldquoPrediction ofparticular matter concentrations by developed feed-forwardneural network with rolling mechanism and gray modelrdquoNeural Computing andApplications vol 26 no 8 pp 1789ndash17972015

[22] W Sun and J Sun ldquoDaily PM25

concentration prediction basedon principal component analysis and LSSVM optimized bycuckoo search algorithmrdquo Journal of Environmental Manage-ment vol 188 pp 144ndash152 2017



12 Journal of Electrical and Computer Engineering

Figure 10: Fitted lines representing the correlation between predicted values and real values through a NN algorithm for Cotocollao (orange) and Belisario (blue). Horizontal axis: real value (μg/m³); vertical axis: prediction (μg/m³); both axes range from 0 to 30.

should aid in qualifying the true robustness of this approach by exploiting the possibility of modeling with other spatial dependencies, such as density of measurements and day-by-day shifts, which represent the degree of freedom of parameters related to readings of the previous day(s). The latter dependency could be combined with linear quadratic estimation (LQE) techniques, such as Kalman filters, to improve the precision.
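As a rough illustration of the linear quadratic estimation idea mentioned above, the following minimal sketch applies a scalar Kalman filter to a sequence of daily PM2.5 values. It is not part of the study: the random-walk state model, the noise variances `q` and `r`, and the function name `kalman_1d` are assumptions chosen only for illustration.

```python
def kalman_1d(measurements, q=1.0, r=4.0, x0=None, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    measurements: daily PM2.5 values (ug/m3), e.g. raw model predictions.
    q: assumed process noise variance (day-to-day drift of the true level).
    r: assumed measurement noise variance.
    x0, p0: initial state estimate and its variance (assumed values).
    """
    if not measurements:
        return []
    # Start from the first measurement unless an initial state is given.
    x = measurements[0] if x0 is None else x0
    p = p0
    estimates = []
    for z in measurements:
        # Predict step: the state is carried over, its uncertainty grows.
        p = p + q
        # Update step: blend the prediction with the new measurement.
        k = p / (p + r)          # Kalman gain, between 0 and 1
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    # Hypothetical daily values, not data from the study.
    daily = [12.0, 14.5, 9.0, 22.0, 18.0, 11.5]
    for raw, smooth in zip(daily, kalman_1d(daily)):
        print(f"raw={raw:5.1f}  filtered={smooth:5.2f}")
```

A larger `r` relative to `q` makes the filter trust the previous day's estimate more, which is one simple way to encode the day-by-day dependency described in the text.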

6. Conclusions and Perspectives

This study proposes a machine learning approach to predict PM2.5 concentrations from meteorological data in a high-elevation, mid-sized city (Quito, Ecuador). Standard levels of fine particulate matter are classified by using different machine learning models. This classification is performed on six years' records of daily meteorological values of wind speed (m/s), wind direction (0–360°), and precipitation accumulation (mm) for two air quality monitoring sites located in Quito (Cotocollao and Belisario). Although these sites are both in Quito's urbanized area, they exhibit differences in spread and dominance regarding wind features (speed and direction) that account for high PM2.5 concentrations and distribution of pollution levels over the years. This could be caused by the fact that Belisario is more urbanized than Cotocollao and, more importantly, by the extremely complex terrain of the city.

For these two different districts, the results show a high reliability in the classification of low (<10 μg/m³) versus high (>25 μg/m³) and low (<10 μg/m³) versus moderate (10–25 μg/m³) PM2.5 concentrations. We found well-defined clusters within the parameter space for PM2.5 concentrations <10 μg/m³. The regression analysis shows that the parameters used can predict PM2.5 concentrations up to 20 μg/m³ and that the accuracy of the predictions improves in conditions of strong winds and high precipitation for both Cotocollao and Belisario. There is a significant positive correlation between the real concentrations and the predicted concentrations for the entire study period. The slightly higher correlation during the rainy season confirms that the model can predict PM2.5 concentrations better for more extreme weather conditions.

Using a convolution-based spatial representation (CGM) to perform regression improves performance compared to the other machine learning algorithms used (NN, L-SVM, and BT). In addition to this model, finding trends over periods of time with time series algorithms could further improve the prediction and would make long-term forecasting of PM2.5 concentrations possible [13].
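The three concentration classes used in the classification results above (low, moderate, high) can be sketched as a small labeling step that also builds the two binary problems, low versus high and low versus moderate. This is only an illustrative sketch: the function names, the record layout, and the choice to assign the boundary values 10 and 25 μg/m³ to the moderate class are assumptions, and the sample records are hypothetical.

```python
def pm25_class(concentration):
    """Label a daily PM2.5 concentration (ug/m3) as low, moderate, or high.

    Thresholds follow the text: low < 10, moderate 10-25, high > 25.
    Placing exactly 10 and exactly 25 in "moderate" is an assumption.
    """
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    if concentration < 10:
        return "low"
    if concentration <= 25:
        return "moderate"
    return "high"


def binary_subset(records, positive):
    """Keep only 'low' days and days of the `positive` class.

    records: iterable of (features, concentration) pairs, where features is
    (wind_speed_m_s, wind_direction_deg, precipitation_mm).
    positive: "moderate" or "high".
    """
    subset = []
    for features, concentration in records:
        label = pm25_class(concentration)
        if label in ("low", positive):
            subset.append((features, label))
    return subset


if __name__ == "__main__":
    # Hypothetical records, not data from the study.
    records = [
        ((2.0, 180.0, 0.0), 8.0),
        ((1.0, 90.0, 0.0), 30.0),
        ((1.5, 45.0, 1.0), 15.0),
    ]
    print(binary_subset(records, "high"))
    print(binary_subset(records, "moderate"))
```

Each subset could then be passed to any binary classifier; the sketch deliberately stops before model fitting, since the exact training setup is not restated in this section.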

The main contribution of this study is to propose an alternative approach to chemical transport numerical modeling, such as WRF-Chem or CMAQ, the performance of which depends on several input parameters (emission inventory, orography, etc.) and on the accuracy of built-in meteorological models (WRF, MM5). The application of numerical models to complex terrain regions is challenging, since important topographic features are not well represented [11, 33]. This produces imprecisions not only in forecasting air quality but also in the relevant meteorology [10, 12, 34, 35]. Here, the proposed model provides a more reliable and more economical alternative to predict PM2.5 levels, as it only requires meteorological data acquisition. In addition, accurate meteorological technology is far more affordable than air quality sensors, which can cost over 100 times as much. Finally, this model is based on three basic meteorological parameters (wind speed, wind direction, and precipitation), which have a straightforward effect on pollution. Thus, considering that our model has a good prediction efficiency for a city of such complex topography, we argue that it could be successfully applied in other tropical locations (regions of reduced changes in solar angle, temperature, and relative humidity).

This work also provides an insight into the main limitations of PM2.5 prediction from meteorological data and machine learning. The classification and regression show that concentrations >20 μg/m³ seem to be influenced more by additional parameters than by the meteorological factors used in this study. For example, although daily temperature, solar radiation, and pressure do not vary much during the year, they might make a difference if analyzed at different times of the day, causing different pollution levels in the city. An interesting approach to tackle this limitation would be to consider a hybrid model that mixes a numerical method (WRF-Chem or CMAQ) with machine learning algorithms [10].

Other climatic conditions and unusual impactful events causing higher pollution levels (festivities, wildfires, accidents, seasonal variability, or natural calamities) could also explain changes in PM2.5 concentrations exceeding 20 μg/m³. Future work will consist of identifying the parameters or events causing values above this threshold. Furthermore, we intend to improve our CGM and use it to classify outliers and find their cause. Considering the diverse machine learning models used in air quality prediction, such as Neural Network [13–15], regression [18], decision trees, and Support Vector Machine [17], we applied and tested most of these classifiers in this study. Alternative approaches to improve the accuracy of our model would consist of performing a prediction based on an ensemble of different algorithms of data processing and modeling [16, 17, 22].
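One simple form of the ensemble idea mentioned above is to average the outputs of several regressors for the same day. The sketch below shows only that averaging step and makes no claim about the ensemble strategies used in the cited works: `ensemble_predict` and the two toy models are hypothetical placeholders with made-up coefficients, not fitted models from this study.

```python
from statistics import mean


def ensemble_predict(models, features):
    """Average the predictions of several regressors for one day.

    models: callables mapping a feature tuple to a PM2.5 estimate (ug/m3).
    features: (wind_speed_m_s, wind_direction_deg, precipitation_mm).
    """
    if not models:
        raise ValueError("at least one model is required")
    return mean(model(features) for model in models)


def toy_model_a(features):
    """Hypothetical stand-in for a fitted regressor (coefficients made up)."""
    wind_speed, _wind_direction, precipitation = features
    return max(0.0, 20.0 - 2.0 * wind_speed - 0.5 * precipitation)


def toy_model_b(features):
    """Second hypothetical stand-in with different made-up coefficients."""
    wind_speed, _wind_direction, precipitation = features
    return max(0.0, 18.0 - 1.5 * wind_speed - 1.0 * precipitation)


if __name__ == "__main__":
    day = (2.0, 180.0, 4.0)  # hypothetical daily readings
    print(ensemble_predict([toy_model_a, toy_model_b], day))
```

In practice the placeholders would be replaced by trained models (for example the NN, L-SVM, and BT regressors named in the text), and a weighted average or a stacked model could be substituted for the plain mean.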

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank David R. Sannino for editing the text.

References

[1] United Nations, Department of Economic and Social Affairs, (2015). World Population Prospects: the 2015 Revision, in Population Division, edited, UN.

[2] World Health Organization, Media Centre, (2016). Air pollution levels rising in many of the world's poorest cities, http://www.who.int/mediacentre/news/releases/2016/air-pollution-rising

[3] J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.

[4] C. A. Pope and D. W. Dockery, “Health effects of fine particulate air pollution: lines that connect,” Journal of the Air and Waste Management Association, vol. 56, no. 6, pp. 709–742, 2006.

[5] Y. Rybarczyk and R. Zalakeviciute, “Machine learning approach to forecasting urban pollution: a case study of Quito,” in Proceedings of the IEEE Ecuador Technical Chapters Meeting (ETCM '16), Guayaquil, Ecuador, 2016.

[6] M. A. Pohjola, A. Kousa, J. Kukkonen et al., “The spatial and temporal variation of measured urban PM10 and PM2.5 in the Helsinki metropolitan area,” Water, Air and Soil Pollution: Focus, vol. 2, no. 5, pp. 189–201, 2002.

[7] Y. Li, Q. Chen, H. Zhao, L. Wang, and R. Tao, “Variations in PM10, PM2.5 and PM1.0 in an urban area of the Sichuan Basin and their relation to meteorological factors,” Atmosphere, vol. 6, no. 1, pp. 150–163, 2015.

[8] J. Wang and S. Ogawa, “Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,” International Journal of Environmental Research and Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[9] F. Zhang, H. Cheng, Z. Wang et al., “Fine particles (PM2.5) at a CAWNET background site in central China: chemical compositions, seasonal variations and regional pollution events,” Atmospheric Environment, vol. 86, pp. 193–202, 2014.

[10] X. Xi, Z. Wei, R. Xiaoguang et al., “A comprehensive evaluation of air pollution prediction improvement by a machine learning method,” in Proceedings of the 10th IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2015 - In conjunction with ICT4ALL '15, pp. 176–181, Hammamet, Tunisia, November 2015.

[11] P. A. Jimenez and J. Dudhia, “Improving the representation of resolved and unresolved topographic effects on surface wind in the WRF model,” Journal of Applied Meteorology and Climatology, vol. 51, no. 2, pp. 300–316, 2012.

[12] R. Parra and V. Díaz, “Preliminary comparison of ozone concentrations provided by the emission inventory/WRF-Chem model and the air quality monitoring network from the Distrito Metropolitano de Quito (Ecuador),” in Proceedings of the 8th annual WRF User's Workshop, NCAR, Boulder, Colo, USA.

[13] X. Ni, H. Huang, and W. Du, “Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data,” Atmospheric Environment, vol. 150, pp. 146–161, 2017.

[14] J. Chen, H. Chen, Z. Wu, D. Hu, and J. Z. Pan, “Forecasting smog-related health hazard based on social media and physical sensor,” Information Systems, vol. 64, pp. 281–291, 2017.

[15] J. Zhang and W. Ding, “Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong,” International Journal of Environmental Research and Public Health, vol. 14, no. 2, p. 114, 2017.

[16] P. Jiang, Q. Dong, and P. Li, “A novel hybrid strategy for PM2.5 concentration analysis and prediction,” Journal of Environmental Management, vol. 196, pp. 443–457, 2017.

[17] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[18] C. Brokamp, R. Jandarov, M. B. Rao, G. LeMasters, and P. Ryan, “Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches,” Atmospheric Environment, vol. 151, pp. 1–11, 2017.

[19] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations,” Environmental Science and Pollution Research, vol. 20, no. 7, pp. 4777–4789, 2013.

[20] A. Russo, F. Raischel, and P. G. Lind, “Air quality prediction using optimal neural networks with stochastic variables,” Atmospheric Environment, vol. 79, pp. 822–830, 2013.

[21] M. Fu, W. Wang, Z. Le, and M. S. Khorram, “Prediction of particular matter concentrations by developed feed-forward neural network with rolling mechanism and gray model,” Neural Computing and Applications, vol. 26, no. 8, pp. 1789–1797, 2015.

[22] W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.

[23] United Nations Development Programme (UNDP), Human development report 2014: Sustaining Human Progress: Reducing Vulnerabilities and Building Resilience.

[24] Instituto Nacional de Estadistica y Censos (INEC), Quito, el canton mas poblado del Ecuador en el 2020, 2013.

[25] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect on classifier accuracy,” in Classification, Clustering, and Data Mining Applications, D. Banks, F. R. McMorris, P. Arabie, and W. Gaul, Eds., pp. 639–647, Springer, Berlin, Heidelberg, 2004.

[26] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940, Philadelphia, PA, USA, 2006.

[27] C. A. Calder and N. Cressie, “Some topics in convolution-based spatial modeling,” in Proceedings of the 56th Session of the International Statistics Institute, International Statistics Institute, Netherlands, 2007.

[28] F. Fouedjio, N. Desassis, and J. Rivoirard, “A generalized convolution model and estimation for non-stationary random functions,” Spatial Statistics, vol. 16, pp. 35–52, 2016.

[29] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, “Uniqueness of the Gaussian kernel for scale-space filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 1, pp. 26–33, 1986.

[30] MA, “Ministerio Del Ambiente, Norma de Calidad del Aire Ambiente o Nivel de Inmision, Libro VI, Anexo 4, 2015.”

[31] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.

[33] P. A. Jimenez and J. Dudhia, “On the ability of the WRF model to reproduce the surface wind direction over complex terrain,” Journal of Applied Meteorology and Climatology, vol. 52, no. 7, pp. 1610–1617, 2013.

[34] A. Meij, A. De Gzella, C. Cuvelier et al., “The impact of MM5 and WRF meteorology over complex terrain on CHIMERE model calculations,” Atmospheric Chemistry and Physics, vol. 9, no. 17, pp. 6611–6632, 2009.

[35] P. Saide, G. Carmichael, S. Spak et al., “Forecasting urban PM10 and PM2.5 pollution episodes in very stable nocturnal conditions and complex terrain using WRF-Chem CO tracer model,” Atmospheric Environment, vol. 45, no. 16, pp. 2769–2780, 2011.
