A Tool for the Spatio-Temporal Screening of AirBase Datasets for Abnormal Values
Example of PM10 datasets of AT, CZ, DE, ES, FR, IT, UK and NL (2006-2007)

Oliver Kracht, Hannes I. Reuter and Michel Gerboles

2013

Report EUR 25787 EN
European Commission
Joint Research Centre
Institute for Environment and Sustainability

Contact information
Oliver Kracht, Michel Gerboles
Address: Joint Research Centre, Via Enrico Fermi 2749, TP 442, 21027 Ispra (VA), Italy
E-mail: [email protected]
Tel.: +39 0332 785652
Fax: +39 0332 789931
http://ies.jrc.ec.europa.eu/
http://www.jrc.ec.europa.eu/

Legal Notice
Neither the European Commission nor any person acting on behalf of the Commission is responsible for the use which might be made of this publication.

Europe Direct is a service to help you find answers to your questions about the European Union
Freephone number (*): 00 800 6 7 8 9 10 11
(*) Certain mobile telephone operators do not allow access to 00 800 numbers or these calls may be billed.

A great deal of additional information on the European Union is available on the Internet. It can be accessed through the Europa server http://europa.eu/.

JRC78437
EUR 25787 EN
ISBN 978-92-79-28286-7 (PDF)
ISSN 1831-9424 (online)
doi:10.2788/81552
Luxembourg: Publications Office of the European Union, 2013
© European Union, 2013
Reproduction is authorised provided the source is acknowledged.
Printed in 2013
Summary

In order to provide scientifically sound information for regulatory purposes and environmental impact assessment, long term meso- to large-scale datasets of ambient air quality provide an important means for air pollution monitoring, evaluation and validation. However, the collection of high quality datasets with suitable spatial coverage for air pollution management and decision support poses many challenges. It is thus critical to establish expedient tools for the efficient assessment and data quality control of air pollution measurements in large scale national and international monitoring networks.

The European Environment Agency collects, in the Air Quality Database named AirBase, measurements of ambient air pollution at more than 6000 monitoring stations from over 30 countries. The quality of these data depends on the chosen method of measurement and the QA/QC procedures applied by each country. We present a novel methodology to automatically screen the AirBase records for internal consistency and to detect spatio-temporal outliers nested in the data.

We implemented a spatio-temporal toolset for screening abnormal values which considers both attribute values and spatial relationships. The algorithms are based on an adaptation of the "Smooth Spatial Attribute method" that was first developed for the identification of outliers in traffic sensors. The method relies on the definition of a neighbourhood for each air pollutant measurement, corresponding to a spatio-temporal domain limited in time (e.g., +/- 2 days) and distance (e.g., +/- 1 degree) around location x. It is assumed that within a given spatio-temporal domain in which the attribute values of neighbours have a relationship due to the emission, transport and reaction of air pollutants, abnormal values can be detected by extreme values of their attributes compared to the attribute values of their neighbours.

The application of this method is demonstrated by a comprehensive simulation and data analysis study based on the 2006 and 2007 AirBase background station records of daily PM10 values for a selection of 8 countries (AT, CZ, DE, ES, FR, GB, IT and NL). These datasets covered a range of different country sizes and comprised between 35561 and 166436 records each. From these, the content of abnormal data points identified ranged between 2% and 4.1% of the individual country datasets.

However, not all records fulfilled the selection criteria for being included in the computations. Furthermore, the setting up of the abnormal values test can also lead to some mathematical dead ends restricting the verifiability of individual records. In consequence a certain percentage of the data records (between 9% and 40% of the records per individual country) had to be flagged as non-verifiable. Those data points had to be excluded from the investigation and from the screening for irregularities for safety of the conclusions.

The implemented method can be of interest as the basis of a data quality screening system when countries report their measurements to the European Environment Agency. Beyond this, it can also provide a simple solution to investigate the accuracy of station classification in AirBase. Seen from another viewpoint, it can as well be used as a tool to detect irregular air pollution emission events (e.g. the influence of fires, wind erosion events, or other accidental situations).
Contents
1 Introduction
2 Airbase
3 Methodology
4 Robustness, sensitivity and optimisation of the screening tool
4.1 Normality of datasets and log transformation
4.2 Optimisation of the parameters used in the abnormal value screening
4.2.1 Spatio-temporal limits of the neighbourhood
4.2.2 Test threshold for z-test
4.2.3 Limit value for including zi in the computation of θ
4.2.4 Window width for the computation of θ
4.3 Manual calculations
5 Results
Annex: Z(Sx) 2006/2007 time series and abnormal data points identification summaries

1 Introduction
The European Commission has worked intensively on the implementation of a harmonized programme for the monitoring of air pollutants. The harmonization programme relies on the adopted European Directives 2008/50/EC and 2004/107/EC [1, 2]. These directives define limit and target values for air pollution that should not be exceeded. Exceedances of these limits may have legal consequences that trigger mitigation plans. To avoid measurement artefacts triggering such measures, the Directives endeavour to improve the quality of the measurements by defining data quality objectives (DQOs) that represent the highest allowed relative expanded uncertainty of measurements. The reference methods have been standardized by the European Committee for Standardization (CEN). These standards describe the methodology to be applied for the estimation of the measurement uncertainty. This estimation of the uncertainty of measurements is a long and tedious procedure that may require considerable experimental work.

From another perspective, it is possible to derive the uncertainty of spatially referenced measurements from the nugget effect of variogram analysis. The nugget effect represents fluctuations of the measurements on a very small scale (tending towards 0). Gerboles et al. [3] have shown the possibility to automatically derive the uncertainty of measurements of ambient air pollutants using an innovative method based on geostatistical analysis. During this study, it became clear that abnormal values influence the geostatistical calculation. Therefore, a detection module was developed in order to exclude abnormal value stations responsible for high discrepancies from the geostatistical evaluations. When the method was presented at the meeting of the AQUILA Network of National Air Quality Reference Laboratories (Ispra, June 2010), Member States representatives and the European Environment Agency officer considered the abnormal value module a valuable tool able to supply important information.

This report gives details about a consolidated screening method for the detection of abnormal values, and an example of warnings on abnormal values for the 2006-2007 time series of PM10 datasets in AirBase.

This report is intended for the following stakeholders:
• Local authorities that may use the indicator to check the consistency of their stations' measurement system or classification
• The European Environment Agency (EEA), to take into account the robustness of station outcomes when estimating trends and statistics about air pollution in Europe
• Researchers and scientists using data of AirBase, in particular modellers in charge of the validation of models compared to field measurements. They could use the quality indicators provided by our method to better understand differences between air pollution estimation and field measurements.
1 Directive 2004/107/EC of the European Parliament and of the Council of 15 December 2004 relating to arsenic, cadmium, mercury, nickel and polycyclic aromatic hydrocarbons in ambient air. Official Journal L 23, 26/01/2005.
2 Directive 2008/50/EC of the European Parliament and the Council of 21 May 2008 on Ambient Air Quality and Cleaner Air for Europe, Official Journal of the European Union L 152/1 of 11.6.2008.
3 M. Gerboles and H. I. Reuter, Estimation of the measurement uncertainty of ambient air pollution datasets using geostatistical analysis, EUR 24475 EN, ISBN 978-92-79-16358-6, ISSN 1018-5593, DOI 10.2788/44902, 2010.

Due to the envisioned group of final users, a free and extensible simulation platform was considered an important point to start from. All computer codes were created in the R environment, which is freely available under the GNU General Public License [4].
2 Airbase
The European Environment Agency (EEA) maintains a database on behalf of the participating countries throughout Europe, the EIONET network. Member States (MS) are due to report on the basis of Council Decision 97/101/EC [5], with amendments 2001/752/EC [6]. Between 2006 and 2007, over 6738 stations are in this database, each providing different components of multi-annual time series of air quality measurements starting in 1981. Geographically, the stations are spread all over Europe with data collected in 36 different countries, including 27 European Union Member States.

The locations of the measuring stations of the EIONET network are generally clustered, due to the nature of the measuring network. About 155 parameters are reported in AirBase, ranging from the concentrations of inorganic/organic gases and particulate matter concentrations to wet and dry deposition with their speciation. In 2008, about 66% of all values in AirBase came from four different parameters: O3 (21.2%), NO2 (17.2%)/NO (8.2%), SO2 (18.8%), carbon monoxide (9.4%) and particulate matter (PM10 9.0%, PM2.5 0.5%, black smoke 1.1%, Total Suspended Particulate 2.9% and Pb/Cd/As/Ni 1.5%).

The quality of the data depends on the chosen measurement method and the QA/QC procedures applied by each country. The data in AirBase have undergone additional quality control performed during the upload of the data from the MS to the EEA's database using a specifically designed software called DEM (Data Exchange Module). The European Topic Centre on Air and Climate Change (ETC/ACC) is also involved in data quality checking.
3 Methodology
The abnormal value procedure was implemented based on existing literature. Chang-Tien Lu et al. [7] outlined and classified several algorithms [8, 9, 10, 11, 12, 13, 14, 15, 16], as summarised in Figure 1. Two families of abnormal value detection methods can be distinguished. The first comprises the methods which calculate statistics of the distribution of a pollutant in one dimension and ignore geographical location [9, 11]. The second family, the spatial-set abnormal value detection methods, considers both attribute values and spatial relationships. From within this family we used the "Smooth Spatial Attribute method" [7], which was first developed for the identification of abnormal values in traffic sensors. This method is thought to be fit for the identification of abnormal values in a given homogeneous dataset of air quality data that represents in a similar way a quantity measured in time and space.

4 R Development Core Team (2011): R: A language and environment for statistical computing. http://www.R-project.org/
5 Council Decision 97/101/EC of 27 January 1997 establishing a reciprocal exchange of information and data from networks and individual stations measuring ambient air pollution within the Member States, Official Journal L 035, 05/02/1997, P. 0014 - 0022.
6 Commission Decision 2001/752/EC of 17 October 2001 amending the Annexes to Council Decision 97/101/EC establishing a reciprocal exchange of information and data from networks and individual stations measuring ambient air pollution within the Member States.
7 Chang-Tien Lu, Dechang Chen, Yufeng Kou, "Detecting Spatial Outliers with Multiple Attributes," ictai, pp. 122, 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03), 2003.
8 M. Ankerst, M. Breunig, H. Kriegel and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, Pennsylvania, USA, pages 49-60, 1999.
9 V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley, New York, 3rd Ed., 1994.
10 M. Breunig, H. Kriegel, R. T. Ng and J. Sander. OPTICS-OF: Identifying Local Outliers. In Proc. of PKDD '99, Prague, Czech Republic, Lecture Notes in Computer Science (LNAI 1704), pp. 262-270, Springer Verlag, 1999.
11 R. Johnson. Applied Multivariate Statistical Analysis, Prentice Hall, 1992.
12 E. Knorr and R. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. In Proc. 24th VLDB Conference, 1998.
13 M. Kraak and F. Ormeling. Cartography: Visualization of Spatial Data. Longman, 1996.
14 F. Preparata and M. Shamos. Computational Geometry: An Introduction. Springer Verlag, 1998.
15 I. Ruts and P. Rousseeuw. Computing Depth Contours of Bivariate Point Clouds. In Computational Statistics and Data Analysis, 23:153-168, 1996.
16 D. Yu, G. Sheikholeslami and A. Zhang. Findout: Finding Outliers in Very Large Datasets. Department of Computer Science and Engineering, State University of New York at Buffalo, Technical report 99-03, http://www.cse.buffalo.edu/tech-reports/, 1999.
Figure 1: Several methods to detect abnormal values in multi-dimensional datasets ([7])

The Smooth Spatial Attribute method relies on the definition of a neighbourhood for each air pollutant measurement. It corresponds to a spatio-temporal domain limited in time (+/- 2 days) and distance (+/- 1 spherical degree) around a location x. The neighbourhood is better understood by observing the diagram in Figure 2. We hypothesise that within a given spatio-temporal domain the non-spatial attribute values (air pollutants) of neighbours have a relationship due to the distribution, transport, emission and reaction of air pollution. The premise of the method is that abnormal values will be detected by extreme values of their attribute value compared to the attribute values of their neighbours. The computational cost of the method is dominated by the large number of calculations of statistical properties per neighbourhood. An important constraint of the method is the normality of the distribution of the attribute values of neighbours.

Taking into account the high spatial variability of PM10 concentrations around industrial and traffic stations, it was decided to apply the screening method for the detection of possible abnormal values solely to stations of background type, but for all area types (urban, suburban and rural).

In the following text, we use x to denote a spatial object whose attributes are (i) the concentration of a pollutant, and (ii) its location. Within each neighbourhood of x, several measurements $x_{n,i}$ of the same compounds, performed at different locations and different times, are available. Equation 1 allows the computation of a weighted average of all available
measurements (non-spatial attributes of $x_{n,i}$) within each neighbourhood. The weights $w_i$ correspond to the inverse distance in space and time between $x_{n,i}$ and x. Note that our algorithm allows for a dynamic expansion of maximum five times of the base neighbourhood extent in time, in case of insufficient data being retrieved with the base settings.

Figure 2: Definition of a spatial-temporal neighborhood of sampling site x with an interval of selection of ± 1 spherical degree and ± 1 day

The weighting factors $w_i$ are calculated using two different methods: (A) the inverse squared normalized Euclidean distance, and (B) the inverse squared Mahalanobis distance. Both parameters characterize the distance in space and time between the central station (x) and each neighbourhood station ($x_{n,i}$). The spatial attributes of x and $x_{n,i}$ are defined as 3-dimensional multivariate vectors:
• The spatio-temporal position of x (the central station) is defined by ($x_1$, $x_2$, $x_3$), where $x_1$ is the longitude in decimal degrees, $x_2$ is the latitude in decimal degrees, and $x_3$ is the time in days.
• The spatio-temporal positions of the $x_{n,i}$ (the neighbourhood stations) are defined by ($x_{n,i,1}$, $x_{n,i,2}$, $x_{n,i,3}$), where $x_{n,i,1}$ is the longitude in decimal degrees, $x_{n,i,2}$ is the latitude in decimal degrees, and $x_{n,i,3}$ is the time in days.

The normalized Euclidean distance is computed using Equation 2, where $s_j$ are the standard deviations of the attribute values $x_{n,i,1}$, $x_{n,i,2}$ and $x_{n,i,3}$ over the neighbourhood set of n stations (excluding the central station x to avoid division by zero for the weight of spatio-temporal distance zero). $s_j^2$ is an unbiased estimator of the population variance of the attribute variable $z_j$, estimated using the sample variance of independent and identically distributed
observations (hence using the denominator n - 1) (Equation 3). The Mahalanobis distance is computed using Equation 4, where S is the covariance matrix of the $x_{n,i}$ of the whole neighbourhood set (excluding the central station x).

As a simple control step, the normalized Euclidean weighting factors shall be symmetrical around the point in time of observation. This cannot necessarily be expected for the Mahalanobis distance based weighting factors, which makes them more difficult to check. One may notice that if the covariance matrix S is diagonal, the Mahalanobis distance reduces to the normalized Euclidean distance. For control purposes, we set up the code for the computation of the Mahalanobis distance in such a way that it can be modified by artificially setting all non-diagonal elements of S to zero.

The weighting factors $w_i$ can finally be calculated by computing the inverse of the square of the normalized Euclidean distance or of the square of the Mahalanobis distance.
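The control step described above can be illustrated with a small sketch. The following Python code is not the authors' R implementation; the function names, the sample coordinates and the hand-written linear solver are assumptions made for illustration only. It computes both distances for one neighbour and shows that the Mahalanobis distance coincides with the normalized Euclidean distance once the off-diagonal elements of S are set to zero.

```python
import math

def covariance_matrix(points):
    """Sample covariance (denominator n - 1) of the neighbourhood positions."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    return [[sum((p[a] - means[a]) * (p[b] - means[b]) for p in points) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, size))
        y[r] = (aug[r][size] - tail) / aug[r][r]
    return y

def mahalanobis(center, neighbour, cov):
    """sqrt(d^T S^-1 d), with d the difference between neighbour and center."""
    d = [neighbour[j] - center[j] for j in range(len(center))]
    return math.sqrt(sum(dj * yj for dj, yj in zip(d, solve(cov, d))))

def normalized_euclidean(center, neighbour, cov):
    """sqrt(sum_j d_j^2 / s_j^2), using the variances on the diagonal of S."""
    return math.sqrt(sum((neighbour[j] - center[j]) ** 2 / cov[j][j]
                         for j in range(len(center))))

def zero_off_diagonal(cov):
    """Control variant: keep only the diagonal elements of S."""
    size = len(cov)
    return [[cov[a][b] if a == b else 0.0 for b in range(size)] for a in range(size)]

# Hypothetical positions: (longitude [deg], latitude [deg], time [days]).
center = (9.5, 45.5, 31.0)
neighbours = [(9.2, 45.1, 30.0), (9.8, 45.6, 31.0), (10.4, 44.9, 32.0),
              (8.9, 46.0, 29.0), (9.5, 45.4, 33.0)]

S = covariance_matrix(neighbours)
d_euclid = normalized_euclidean(center, neighbours[0], S)
d_mahal_diag = mahalanobis(center, neighbours[0], zero_off_diagonal(S))
d_mahal_full = mahalanobis(center, neighbours[0], S)
weight_euclid = 1.0 / d_euclid ** 2   # inverse squared distance
```

The `ValueError` branch corresponds to the situation in which S cannot be inverted, in which case only the normalized Euclidean variant remains usable.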
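$$\bar{x}_n = \frac{\sum_{i=1}^{n} w_i \, x_{n,i}}{\sum_{i=1}^{n} w_i} \qquad \text{(Equation 1)}$$

$$D_{\text{normalized Euclidean}}(x, x_{n,i}) = \sqrt{\sum_{j=1}^{3} \frac{(x_{n,i,j} - x_j)^2}{s_j^2}} \qquad \text{(Equation 2)}$$

$$s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{n,i,j} - \bar{x}_{n,j})^2 \qquad \text{(Equation 3)}$$

$$D_{\text{Mahalanobis}}(x, x_{n,i}) = \sqrt{(x_{n,i} - x)^{T} \, S^{-1} \, (x_{n,i} - x)} \qquad \text{(Equation 4)}$$

$$Sx = f(x) - f(\bar{x}_n) \qquad \text{(Equation 5)}$$

$$z = \frac{Sx - \overline{Sx}_n}{s_{Sx_n}} \qquad \text{(Equation 6)}$$

$$\overline{Sx}_n = \frac{\sum_{i=1}^{n} w_i \, Sx_{n,i}}{\sum_{i=1}^{n} w_i} \qquad \text{(Equation 7)}$$

$$s_{Sx_n} = \sqrt{\frac{n' \sum_{i=1}^{n} w_i \, (Sx_{n,i} - \overline{Sx}_n)^2}{(n' - 1) \sum_{i=1}^{n} w_i}} \qquad \text{(Equation 8)}$$

$$\theta - 1.96 \le z_i \le \theta + 1.96 \qquad \text{(Equation 9)}$$

After a log-transformation of non-Gaussian data, we compute the weighted average $\bar{x}_n$ according to Equation 1 and the differences $Sx$ between the non-spatial attribute value f(x) (pollutant concentration) at location x and the average attribute value of its neighbours according to Equation 5.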
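As a minimal sketch of Equation 1 and Equation 5, the Python snippet below computes the weighted neighbourhood average and the difference Sx for one central station. The concentrations and distances are invented, the function names are hypothetical, and the distances are simply assumed to be already-computed normalized Euclidean distances; it is not the authors' R code.

```python
import math

def weighted_mean(values, weights):
    """Weighted average of the neighbour attribute values (Equation 1)."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def spatial_difference(center_value, neighbour_values, weights):
    """Difference between the central value and the weighted neighbour mean (Equation 5)."""
    return center_value - weighted_mean(neighbour_values, weights)

# Hypothetical daily PM10 concentrations in µg/m³.
pm10_center = 80.0
pm10_neighbours = [20.0, 25.0, 22.0, 30.0]

# Hypothetical spatio-temporal distances between x and each neighbour.
distances = [0.5, 1.0, 1.5, 2.0]
weights = [1.0 / d ** 2 for d in distances]   # inverse squared distance

# Log-transform first, then compare the centre with its neighbourhood.
log_center = math.log(pm10_center)
log_neighbours = [math.log(v) for v in pm10_neighbours]
neighbour_mean = weighted_mean(log_neighbours, weights)
sx = spatial_difference(log_center, log_neighbours, weights)
```

A positive `sx` indicates that the central station reads higher than its weighted neighbourhood, a negative one that it reads lower.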
Within each neighbourhood, the $Sx$ values are normalised to centre the data at 0 with a standard deviation of 1 using Equation 6. In this equation, $\overline{Sx}$ and $s_{Sx}$ are the weighted average and the weighted standard deviation of all $Sx_i$ attribute values calculated over all stations within the neighbourhood of x [17]. $\overline{Sx}$ and $s_{Sx}$ are calculated using Equation 7 and Equation 8, where n' is the number of non-zero weights within the $w_i$ vector of length n. Note that, by calculating the weights from the inverse of the squared spatio-temporal distances, the $w_i$ are always non-zero and therefore n = n' in our application.

Another approach could have been to estimate $\bar{x}$ and s over the whole dataset [18]. However, since air pollution time series exhibit a strong seasonality effect, applying such a method would have led to an overestimation of $\overline{Sx}$ and $s_{Sx}$, resulting in a number of undetected possible abnormal values (false negatives) when applying the abnormal value test (Equation 9).

Finally, the test for detecting an abnormal value, given in Equation 9, searches for $z_i$ values exceeding a limit value θ consisting of the moving average of five consecutive $z_i$ values plus a predefined threshold of 1.96, corresponding to a confidence interval in which 95% of $z_i$ values should lie. Some limitations were applied:
• In case of $|z_i|$ exceeding a value of 1.96, $z_i$ was not taken into account for the calculation of the moving average.
• In case of θ estimated based on less than three $z_i$ values, a moving average was not calculated. Thus the abnormal value test was not performed at this position.

As a further restriction, outliers were only flagged when the reference point neighbourhood contained a minimum number of data points (threshold set to 20 data points).

In contrast to the paper by Lu [7], we deliberately did not use an absolute value of the z-transformation. Indeed the sign of the abnormal value is of interest to us, as we want to understand whether a station is measuring too low or too high quantities compared to its neighbourhood stations with the same classification (background stations). By comparing the $z_i$ against the moving average of the $z_i$ plus/minus the threshold value, abnormal values can be identified.
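The test logic described above can be sketched as follows. This Python snippet is an illustration under stated assumptions, not the authors' R code: the window is assumed to be centred on the tested day (two days either side), the tested value itself may enter the moving average if its magnitude does not exceed the threshold, the function name `screen` and the flag labels are hypothetical, and the additional requirement of at least 20 data points per neighbourhood is left out.

```python
def screen(z_values, threshold=1.96, half_width=2, min_values=3):
    """Flag each z value as normal, abnormal (high/low) or non-verified."""
    flags = []
    for i, z in enumerate(z_values):
        lo, hi = i - half_width, i + half_width
        # Days at the start and end of the series have no complete window.
        if z is None or lo < 0 or hi >= len(z_values):
            flags.append("non-verified")
            continue
        # Values beyond the threshold are excluded from the moving average.
        window = [v for v in z_values[lo:hi + 1]
                  if v is not None and abs(v) <= threshold]
        if len(window) < min_values:
            flags.append("non-verified")
            continue
        theta = sum(window) / len(window)
        if z > theta + threshold:
            flags.append("abnormal high")
        elif z < theta - threshold:
            flags.append("abnormal low")
        else:
            flags.append("normal")
    return flags

# Hypothetical daily z values for one station.
z_series = [0.1, -0.3, 0.2, 4.5, 0.0, -0.2, 0.4]
flags = screen(z_series)
```

Keeping the sign of the deviation in the flag mirrors the choice not to take the absolute value of the z-transformation.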
4 Robustness, sensitivity and optimisation of the screening tool
Among AT, CZ, DE, ES, FR, GB, IT and NL, a few negative values were observed in the AIRBASE_2007 PM10 datasets (70 values for France out of 168153 records and 9 for GB out of 48872 records). These values were discarded because they disturb the process of transformation of datasets for normalisation.

The design of the outlier test implies some limitations and can lead to mathematical dead ends:
• Lack of a minimum of 20 data points in the neighbourhood.
• The Mahalanobis distance calculation requires an inversion of the S-matrix. The S-matrix, however, proved to be non-invertible for some data cases. For this reason, the use of the normalized Euclidean distance was introduced as a first alternative solution.
• Other statistical parameters might as well not be retrievable in case of collinearities in the spatial structure of the neighbourhood of a data point.
• The first and last days of a time series cannot be tested because θ values cannot be computed.
• More generally, when less than 3 $z_i$ values are available to calculate θ, computation stops and abnormal data point thresholding cannot be performed for this data point. We however observed a considerable amount of $|z_i|$ suspected to be higher than 1.96 which are accepted for safety of the conclusions.

All these shortcoming cases are summarized under the data category "non-verified data". However, it is possible that a considerable part of these unverified values corresponds to abnormal values. This might especially be the case when calculations stop because several $z_i$ values exceeding the threshold values are discarded, which in consequence can prevent the continuous computation of the 5 days moving average of θ.

In AirBase a few stations report PM10 values for more than one method of measurement. For example, a few stations may use one manual method integrated over 24 hours and an automatic method producing hourly values. In some cases, stations report values from two automatic methods. However, it was checked that within the table of daily values only one unique method was used per station and per day, making it unnecessary to check the robustness of the screening tool at stations with multiple measuring methods.

17 Ref: Shekhar et al., "A Unified approach to detecting spatial outliers", page 141, Example 1
18 Dissertation of Yufeng Kou, "Abnormal Pattern Recognition in Spatial Data", page 19, lines 4 to 8

4.1 Normality of datasets and log transformation

Our test for abnormal values loosely assumes that the PM10 datasets are normally distributed. A significant violation of the assumption of normality could increase the chances of unreliable detections, consisting of either a Type I (false positive) or a Type II (false negative) error, depending on the non-normality. The non-normality of PM10 datasets is a real feature due to the nature of the air pollutant that is easily observed (see Figure 3), rather than being caused by data entry errors, missing values or the presence of outlier values. Misclassification of stations might also be a source of skewness, e.g. traffic or industrial stations wrongly classified as background stations. Visual inspection of Figure 3 shows right-skewed distributions (mean value higher than the mode value) with skewness coefficients of 2.51 (DE), 2.38 (FR), 2.25 (GB) and 1.87 (IT).
Figure 3: Density of PM10 datasets in Airbase for DE, FR, GB and IT in 2006-2007

A common transformation for normalising data is the so-called square root transformation. The square root of every value was taken after discarding the few negative PM10 values nested in some datasets (see annex 1). A constant equal to (1 - minimum PM10 per country) was added to move the minimum value of the distribution above 1 in order to avoid having numbers between 0 and 1 becoming larger while numbers above 1 would become smaller. Figure 4 shows that the distributions of square-root transformed PM10 datasets still show some skewness (0.95 for DE, 0.97 for FR, 0.86 for GB and 0.78 for IT).
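The role of the added constant can be shown with a short sketch. The Python code below uses invented values and hypothetical function names (it is not taken from the authors' R code); it shifts a dataset so that its minimum equals 1 before taking the square root, which keeps the transformation from enlarging values below 1 while shrinking values above 1.

```python
import math

def shift_min_to_one(values):
    """Add the constant (1 - minimum) so that the smallest value becomes 1."""
    offset = 1.0 - min(values)
    return [v + offset for v in values]

def sqrt_transform(values):
    """Square root transformation applied after shifting the minimum to 1."""
    return [math.sqrt(v) for v in shift_min_to_one(values)]

# Hypothetical non-negative PM10 values; 0.25 lies between 0 and 1.
pm10 = [0.25, 4.0, 9.0]

raw_root_of_small_value = math.sqrt(pm10[0])   # larger than the value itself
transformed = sqrt_transform(pm10)
```

Without the shift, the square root of 0.25 is larger than 0.25 while the square root of 9 is smaller than 9; after the shift all values are at least 1 and the transformation acts in the same direction on all of them.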
Figure 4: Density of square-root transformed PM10 datasets for DE, FR, GB and IT in 2006-2007

Since a simple square root transformation was ineffective to completely remove the skewness of the initial distributions of PM10 values, more sophisticated techniques could be applied like, for example, logarithmic or inverse transformation. A natural logarithm of PM10 data plus 0.5 was applied (see Figure 5). As the logarithm of any null or negative number is undefined, such PM10 values were discarded prior to transformation. Additionally, and for the same reason as for the square root transformation, we move the minimum value of the distribution to 1 by adding a constant equal to (1 - minimum PM10 per country).

Moreover, we have investigated the use of another class of transformations called power transformations or Box-Cox transformation [19]. Power transformations are transformations that raise numbers to an exponent (see Equation 10 where λ ≠ 0). The Box-Cox transformation is a generalisation of a group of other transformations which includes the square root, logarithmic and inverse transformation. For example, a square root transformation can be characterized as $x^{1/2}$, inverse transformations can be characterized as $x^{-1}$ and so forth. The values of λ have been set by an optimization algorithm able to minimize the skewness of each distribution. The following λ values were obtained: 0.093, 0.10, 0.13 and 0.14 for DE, FR, GB and IT respectively. Consequently the skewness of DE, FR, GB and IT decreased to 0.01, 0.17, 0.03 and 0.01, respectively. These values can be compared to the skewness figures of the log transformation of -0.03, 0.141, -0.38 and -0.16 for DE, FR, GB and IT, respectively.

$$PM_{10}' = \frac{PM_{10}^{\lambda} - 1}{\lambda} \qquad \text{(Equation 10)}$$

Figure 5: Density of Box-Cox transformed PM10 datasets in Airbase for Germany, France, Great Britain and Italy in 2006-2007

Comparing the skewness of log transformed and Box-Cox transformed values suggests that both transformations successfully reach the goal of producing symmetrical distributions of the PM10 datasets.

However, as shown in Figure 6, a symmetrical distribution for the whole 2006-2007 dataset does not ensure that each individual neighbourhood dataset will present a Gaussian distribution, too. A log-transformation within each neighbourhood is impossible because Equation 7 and Equation 8 require that the Sx values between neighbourhoods are consistent. Anyhow, it is likely that breaching the normality assumption does not jeopardize the implementation of the z-test, provided that the threshold value 1.96 set in Equation 9 remains effective in detecting abnormal values.

19 Osborne, J. W. "Improving Your Data Transformations: Applying the Box-Cox Transformation." Practical Assessment, Research & Evaluation 15, no. 12 (2010): 1-9.
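The selection of λ by minimising skewness can be sketched as follows. This Python code is only an illustration: the report does not specify the optimization algorithm, so a plain grid search over candidate λ values is assumed here, the sample data are invented, and the function names are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness (third central moment / sd^3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def box_cox(values, lam):
    """Box-Cox power transformation for lambda != 0 and positive values."""
    return [(v ** lam - 1.0) / lam for v in values]

def best_lambda(values, candidates):
    """Candidate lambda whose transformed data have the smallest |skewness|."""
    return min(candidates, key=lambda lam: abs(skewness(box_cox(values, lam))))

# Hypothetical right-skewed PM10 sample in µg/m³ (all values positive).
pm10 = [8.0, 10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 22.0,
        25.0, 28.0, 32.0, 38.0, 47.0, 60.0, 85.0, 130.0]

# Assumed search grid: -1.00 to 1.00 in steps of 0.01, excluding zero.
candidates = [k / 100 for k in range(-100, 101) if k != 0]

lam = best_lambda(pm10, candidates)
skew_before = skewness(pm10)
skew_after = skewness(box_cox(pm10, lam))
```

A grid search is the simplest choice and is easy to verify; a continuous optimiser would give a finer λ but the same qualitative result.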
Figure 6: Histogram of PM10 values and of their logarithmic transformation of selected stations on 01-02-2006 (DE and FR) and 01/02/2007 (GB and IT) in their neighbourhood

4.2 OptimisationoftheparametersusedintheabnormalvaluescreeningThechoiceofdifferent functionalparametersthataffect theoutcomeoftheabnormalvaluescreening has been investigated. This includes the temporal/spatial limits of theneighbourhood(initially±2days,±1ºlongitudeand±1ºlatitude),thethresholdvalue1.96setinEquation9,thetestvalueforacceptingvaluesinthemovingaverageofθandthewidthofwindowusedtocalculatethecriteriaforthemovingaverageofθ(5consecutiveziallowingfor2missingvalues).Thesensitivityofthescreeningresultstothesevalueswasinvestigatedby simulations usingPM10datasets. The findings from this sensitivity analysis allow for anoptimizedselectionofparametervalues,andforavalidationofparameterselection.4.2.1 Spatio‐temporallimitsoftheneighbourhood
For these simulations, the neighbourhood domainwas systematically adjusted in time andspace.We testedall combinationsofneighbourhoodsizes from±1 to±4days in timeandfrom±1to±4degrees in longitudeandlatitude.Byextendingthelimitsofneighbourhoodoutside the given station conditions, these simulation increased the probability of falsedetectionofabnormalvalues.Validationoftheneighbourhoodlimitswasperformedforallselectedcountries(AT,CZ,DE,ES,FR,GB,ITandNL)forallbackgroundstationofallareatypes(rural,urbanandsuburban)usingthePM10datasetsof2006to2007.TheresultsofthesesimulationsaregiveninTable1andFigure7.Note that for thesesimulationsnodynamicexpansionof the timeandspatiallimitsofneighbourhoodhavebeenallowed.On the contrary to initial anticipation, the selection of the time and spatial limits of theneighbourhood,doesnothaveastrongeffectonthenumberofdetectedabnormalvalues.Infact,therelativestandarddeviations,whichappeartobeindependentofthetotalnumberofabnormalvalues,withintheresponsesurfacevaluesare10%(AT),11%(CZ),14%(ES),4%(FR),6%(GB),10%(IT),and15%(NL),respectively.Table1showsthatbetweenthesmallestandlargestneighbourhood,thetotalnumberofabnormalvaluesisonlytwiceasbigforNL.Itcan be concluded that the weighting algorithms presented in chapter 3 make the methodreasonably independent of the preselected extent of the neighbourhood. The effect of theweighting factors ismuch stronger than the preselected limitations of the spatio‐temporalneighbourhoodboundaries.An absolute definition of abnormal values is not feasible. Consequently, we do not havereferencedatafortheoptimumnumberofabnormalvaluestobecomparedtotheoutputofthescreeningtool.Onlyexpertjudgementorrationalindicators(i.elackofcontinuityofthetotalnumberofabnormalvalues)canbeusedtoselectthebestcombinationofspatiallimitsand time limits. Since the screening tool could be used as a warning system for doubtfulvalues by various stakeholders, a combination of limits producing reasonably high figuresshould be selected. 
At the same time, the extent of the neighbourhood should be asparsimoniousaspossibletosaveonCPUtimeofthecomputationsandinordertoproducez’indicatorsthatarecharacteristicofmeasurementsinthevicinityoftestedstations.Asmentionedabove,forthescatteringofthenumberofabnormalvaluesallcombinationsoftime and space limits produce comparable numbers of abnormal values. However, thevariationsalongthetimeandspatialdimensionsaredifferent.Amultipleanalysisofvarianceshowed20that country is themain influenceaffecting thenumberofabnormalvalueswhiletime window had double an effect compared to the space window. Moreover, one may20 Note that FR was discarded from this analysis because it gave a high number of abnormal values,
-
17
observeseveralsteepdecreasesofthenumberofabnormalvaluesoccuringatatimelimitof1dayand1 sphericaldegree.Consequently, itwasdecided to select a timewindowof twodays(withadditionalpossibilityofexpansion)toavoidclosenesstothesteepgradient.ForATandIT,onecanalsoobservethatthevariationofthetotalnumberofabnormalvaluesfluctuatemorealongthespacedimension.Itislikelythatorography,characterisedbyarapidchangebetweenmountainsandvalleysforthesetwocountries,producesthesefluctuations.Followingthisobservationandinordertolimitpossiblefalsepositivesandfalsenegatives,itwasdecided to set the spatial limitsof theneighbourhood to the smallest spacedimensionwithout the possibility of expansion. These figures represent, in our view, the bestequilibrium between avoiding unverified data, high number of detected abnormal values,avoidingtheextremefigurescharacterisedbyalackofcontinuityofthenumberofabnormalvaluesandlimittheCPUtimeneededtoperformthesecalculations.Table 1: Effect of changing the spatial and temporal limits on the detection of abnormal values for Germany for the background - urban - 2007 - PM10 out of 236797 total records -constant threshold for the z value and constant value for the rolling mean value
Time window [days]   Spatial window [°]    AT     CZ     DE     ES     FR     GB     IT     NL
±1                   ±1                    611   1240   2899    506   4959    471    837    146
±1                   ±2                    594   1227   2693    844   5508    570    926    248
±1                   ±3                    579   1190   2444    892   5473    569    939    238
±1                   ±4                    582   1170   2321    885   5388    584   1141    236
±2                   ±1                    714   1214   3058    688   5750    553    825    227
±2                   ±2                    566   1127   2586    803   5821    523    917    304
±2                   ±3                    546   1054   2227    773   5704    515    937    280
±2                   ±4                    564   1020   2082    771   5769    522   1071    278
±3                   ±1                    661   1100   2939    726   5883    511    809    316
±3                   ±2                    544   1030   2396    756   5854    535    933    294
±3                   ±3                    503    961   2100    720   5616    530    913    266
±3                   ±4                    543    919   1930    714   5755    507   1022    266
±4                   ±1                    611   1014   2707    713   5552    492    788    293
±4                   ±2                    509    937   2240    694   5568    543    913    277
±4                   ±3                    482    911   1999    665   5375    511    871    257
±4                   ±4                    528    897   1864    658   5465    491    967    255
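The relative standard deviations quoted in the text can be checked against the counts in Table 1. The following is a minimal sketch, assuming that the relative standard deviation is the sample standard deviation (n - 1 denominator) of the 16 counts of one country divided by their mean; the report does not state which estimator was used, and the names `TABLE_1` and `relative_std` are illustrative only.

```python
from statistics import mean, stdev

# Counts of detected abnormal values copied from Table 1.
# Row order follows the table: time window ±1..±4 days and, within each
# time window, spatial window ±1..±4 degrees.
COUNTRIES = ["AT", "CZ", "DE", "ES", "FR", "GB", "IT", "NL"]
TABLE_1 = [
    [611, 1240, 2899, 506, 4959, 471, 837, 146],
    [594, 1227, 2693, 844, 5508, 570, 926, 248],
    [579, 1190, 2444, 892, 5473, 569, 939, 238],
    [582, 1170, 2321, 885, 5388, 584, 1141, 236],
    [714, 1214, 3058, 688, 5750, 553, 825, 227],
    [566, 1127, 2586, 803, 5821, 523, 917, 304],
    [546, 1054, 2227, 773, 5704, 515, 937, 280],
    [564, 1020, 2082, 771, 5769, 522, 1071, 278],
    [661, 1100, 2939, 726, 5883, 511, 809, 316],
    [544, 1030, 2396, 756, 5854, 535, 933, 294],
    [503, 961, 2100, 720, 5616, 530, 913, 266],
    [543, 919, 1930, 714, 5755, 507, 1022, 266],
    [611, 1014, 2707, 713, 5552, 492, 788, 293],
    [509, 937, 2240, 694, 5568, 543, 913, 277],
    [482, 911, 1999, 665, 5375, 511, 871, 257],
    [528, 897, 1864, 658, 5465, 491, 967, 255],
]


def relative_std(values):
    """Sample standard deviation divided by the mean, in percent (assumed estimator)."""
    return 100.0 * stdev(values) / mean(values)


# One relative standard deviation per country, taken over all 16 combinations.
rsd = {
    country: relative_std([row[i] for row in TABLE_1])
    for i, country in enumerate(COUNTRIES)
}

for country, value in rsd.items():
    print(f"{country}: {value:.1f} %")
```

Under this assumption the values for AT, FR and NL fall close to the percentages given in the text, which suggests that the quoted figures were derived from the same 16 combinations.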
Figure 7: Influence of time and spatial extent in the determination of abnormal values for PM10 datasets for AT, CZ, DE, FR, GB, IT and NL in 2006-2007. Note the different axis orientation per graph. Note also that the spatial extent and temporal extent are given in extension around a centerpoint. For example, a spatial extent of 1 describes a length of the edge of ± 1, thus 2° in longitude and 2° in latitude.
4.2.2 Test threshold for z-test

From a statistical point of view, the test threshold to detect abnormal values should be around 1.96 for a simple z-test. However, the experiments have shown that this value might be too conservative. We ran a series of experiments using the results of screenings for AT, CZ, DE, ES, FR, GB, IT and NL to further investigate this parameter. We observed that for thresholds higher than 3 the number of identified abnormal points rapidly converges towards zero. Over the whole range of threshold values, the number of unverified values remains constant.

Figure 8 shows that the test threshold strongly affects the output of the screening tool regarding the number of abnormal values. However, as for the optimisation of the limits of the neighbourhood, without reference values for the number of abnormal values we cannot easily decide which threshold to use. Further investigations are needed to find rules and mechanisms to set this parameter. Furthermore, the selection of this parameter will strongly depend on the specific objectives of the intended application.
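As a point of reference for the threshold values discussed here, the share of values that a simple two-sided z-test would flag can be computed for an idealised case. The sketch below assumes exactly standard normal z values, which is a simplification and not a property claimed for the screening tool's output; the function name `two_sided_tail` is illustrative.

```python
import math


def two_sided_tail(threshold: float) -> float:
    """P(|Z| > threshold) for a standard normal variable Z (idealised assumption)."""
    return math.erfc(threshold / math.sqrt(2.0))


# Expected share of flagged values for a few candidate thresholds.
for t in (1.0, 1.96, 2.0, 3.0, 4.0):
    share = 100.0 * two_sided_tail(t)
    print(f"threshold {t:4.2f}: expected share flagged = {share:.3f} %")
```

In this idealised case a threshold of 1.96 corresponds to roughly 5 % of values being flagged, and the share drops below 1 % for thresholds above 3, which is qualitatively in line with the rapid decrease described above.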
Figure 8: Percentage of abnormal values with respect to different choices for the z-test threshold
[Figure: one panel per country (Austria, Czech Republic, Germany, Spain, France, United Kingdom, Italy, Netherlands; 2006 - 2007). x-axis: threshold value for z test (0 to 6); y-axes: % of abnormals in verified records and % of unverified records.]
4.2.3 Limit value for including zi in the computation of θ

The z-test for detecting abnormal values (Equation 9) is based on the computation of θ, a moving average of 5 consecutive zi values. zi values are included in the moving average provided that their values do not exceed a predefined threshold, which is currently set to 1.96. All |zi| exceeding a value of 1.96 are discarded from the computation of the moving average. This produces unverified records when several consecutive zi are rejected, hence preventing a continuous calculation of θ.

Figure 9 shows the influence of the threshold for accepting zi values. Tuning this parameter in the direction of “strict” values (low threshold) causes a large number of unverified records in the evaluation.

The influence on the number of identified abnormal points is complex and indicates the superimposition of two or more effects. First, the reduction of the number of unverified records (by using less strict threshold values) seems to be directly connected to an increase of identified abnormal records (examples of ES, FR, and IT). This indicates that a large proportion of abnormal records have been hidden within the non-verifiables. Second, however, towards higher threshold values the effect can also be the opposite (decrease of identified abnormal records in the examples of DE, GB, and NL).

Another important observation is that it is not feasible to set this limit so that it yields both the highest number of abnormal values and the lowest number of unverified records.
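The interplay between the acceptance limit and the number of unverified records can be illustrated with a small sketch. It is not the report's Equation 9: it assumes a trailing window of 5 values, a minimum of 3 accepted values per window (60 % of 5), and made-up z values. The function name `moving_theta` and all parameter names are hypothetical.

```python
from typing import List, Optional


def moving_theta(z: List[float], width: int = 5, limit: float = 1.96,
                 min_valid: int = 3) -> List[Optional[float]]:
    """Moving average of accepted z values; None marks an unverifiable position.

    Assumptions of this sketch: the window covers the current value and the
    (width - 1) preceding ones, and values with |z| > limit are discarded.
    """
    theta: List[Optional[float]] = []
    for i in range(len(z)):
        window = z[max(0, i - width + 1): i + 1]
        accepted = [v for v in window if abs(v) <= limit]
        if len(accepted) >= min_valid:
            theta.append(sum(accepted) / len(accepted))
        else:
            # Too few accepted values: theta cannot be computed here.
            theta.append(None)
    return theta


# Hypothetical z series containing a run of large values.
z_values = [0.2, -0.5, 1.0, 2.4, 3.1, 2.8, 0.3, -0.1]

strict = moving_theta(z_values, limit=1.96)  # current limit
loose = moving_theta(z_values, limit=4.0)    # less strict limit

print("unverifiable with limit 1.96:", strict.count(None))
print("unverifiable with limit 4.00:", loose.count(None))
```

With the stricter limit, the run of three large values leaves too few accepted values in the following windows, so more positions become unverifiable than with the looser limit. This is the mechanism described in the text, reproduced here on artificial data only.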
Figure 9: Effect of the upper limit value (currently 1.96) for including zi-values into the moving average computation of θ
[Figure: one panel per country (Austria, Czech Republic, Germany, Spain, France, United Kingdom, Italy, Netherlands; 2006 - 2007). x-axis: limit value for accepting zi values (0 to 6); y-axes: % of abnormals in verified records and % of unverified records.]
4.2.4 Window width for the computation of θ

The effect of the width of the time window of the moving average (θ) on the number of detected abnormal values was studied for the results of the screening tool for AT, CZ, DE, ES, FR, GB, IT and NL. In these calculations, we assumed that for any window width the required minimum percentage of valid zi for partial calculations of the moving average was set to 60 %.

Figure 10 indicates that the time window of the moving average should not be set to values lower than 4 days, to avoid a strong decrease of the percentage of detected abnormal values. Conversely, for time window widths over 5 days, only a slight increase of the percentage of abnormal values takes place. This latter effect might be due to instability of weather conditions over longer time spans; the time window in days should therefore be rather short. We therefore chose a value of 5, as this seems to be a good compromise between stable thresholding and not indicating too many abnormal values due to false positives. This parameter does not seem to influence the percentage of unverified records, although some noise can be observed for time windows of less than 10 days.
Figure 10: Influence of the moving window width used for the moving average computation of θ
[Figure: one panel per country (Austria, Czech Republic, Germany, Spain, France, United Kingdom, Italy, Netherlands; 2006 - 2007). x-axis: moving-window width [days] (0 to 45); y-axes: % of abnormals in verified records and % of unverified records.]
4.3 Manual calculations

Checks by manual calculation were performed for a set of stations in different countries, including stations FR34032, FR34052 on day 2006-02-01, DETH025 on day 2006-02-01, GB0643A on day 2007-02-01 and IT1186A on day 2007-02-01. The checks were carried out using two versions of AirBase, one with datasets ending in 2007 and another with datasets ending in 2010.

The check consisted of confirming the list of extracted stations within neighbourhoods for the selected dates, within the spatial limits of the neighbourhood and for the correct combination of station type (background) and area type (all area types: urban, suburban and rural). Equation 1 to Equation 9 were computed for the manual calculations and their results agreed with the results of the screening tool.

A few differences were observed between AirBase 2007 and 2010, mainly consisting of a few stations present in AirBase_2010 that were missing in AirBase_2007. Moreover, the values of stations in the neighbourhood of DETH025 were slightly different in AirBase 2010 (about half of the values differed by less than 0.2 µg/m³, without changing the output of the screening tools). Station GB0788A had PM10 values of 36, 24, 31 and 35 µg/m³ in AirBase_2007 and 34, 22, 20 and 25 µg/m³ in AirBase_2010. Moreover, when extracting the neighbourhood from AirBase_2007, the following stations were missing:
• DEHE055 and DETH042 for the neighbourhood of DETH025
• IT0940A and IT1672A for the neighbourhood of IT1186A
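The first step of such a manual check, listing the records that fall inside the neighbourhood of a base station, can be sketched as follows. This is an illustration under assumptions, not the screening tool's code: it uses the fixed limits of ±1 degree and ±2 days, excludes the base station itself, and works on invented records. The `Record` type, the station labels and all values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Record:
    station: str   # hypothetical station label, not a real AirBase code
    day: date
    lon: float     # decimal degrees east
    lat: float     # decimal degrees north
    pm10: float    # daily PM10 value in µg/m³


def neighbourhood(records: List[Record], base: Record,
                  max_deg: float = 1.0, max_days: int = 2) -> List[Record]:
    """Records of other stations within the spatial and temporal limits of base."""
    span = timedelta(days=max_days)
    return [
        r for r in records
        if r.station != base.station           # base station is not its own neighbour
        and abs(r.lon - base.lon) <= max_deg
        and abs(r.lat - base.lat) <= max_deg
        and abs(r.day - base.day) <= span
    ]


base = Record("BASE", date(2006, 2, 1), 10.0, 50.0, 30.0)
records = [
    base,
    Record("NEAR", date(2006, 2, 2), 10.5, 50.4, 28.0),   # inside both limits
    Record("FAR", date(2006, 2, 1), 12.0, 50.0, 25.0),    # 2 degrees away in longitude
    Record("LATE", date(2006, 2, 5), 10.2, 50.1, 27.0),   # 4 days later
    Record("NEAR", date(2006, 1, 30), 10.5, 50.4, 31.0),  # inside, 2 days earlier
]

selected = neighbourhood(records, base)
for r in selected:
    print(r.station, r.day, r.pm10)
```

A filter on station type and area type would be applied in the same way before the selection. Comparing the resulting station list between two database versions then reveals stations that are present in one version but missing in the other.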
5 Results
A complete set of time series plots of daily PM10 abnormal values for the background stations of AT, CZ, DE, ES, FR, GB, IT and NL is given in Annex 1. The graphs in Annex 1 are considered to be useful for local authorities in order to question the consistency of the detected abnormal values of their stations. Modellers can use this information when estimating the performance of models compared to field measurements.

Table 2 summarizes the outcome of the screening tool applied per country. From a theoretical perspective, a screening procedure that looks at extreme values within normalized distributions implies that a certain percentage of abnormal value detections should be expected. However, because of the different data transformations employed, we cannot anticipate a detection of 5 % of abnormal values corresponding to the selected level of confidence. In fact, taking all countries into consideration, the percentages of abnormal value identifications range between 1.5 and 4.1 %. However, once the matter of unverified data has been settled, the number of abnormal values per station may increase when a larger number of extreme zi values are accepted in the estimation of θ.

We have looked at the correlation between the percentages of abnormal values per country and different variables in Table 2. To our surprise, the highest percentages of unverified data were not correlated with the density of monitoring stations of each country, nor with the homogeneity of the PM10 measurement method (gravimetry, TEOM or β-ray) per country, nor with the homogeneity of area types of stations per country (urban, suburban or rural). At first glance, one may observe that the percentage of abnormal values is generally higher for the countries reporting the highest number of records. Finally, by visual inspection of the graphs of the annex, rural sites appear to produce more abnormal values than urban or suburban areas, indicating that the presence of rural stations in the “All background” category should be studied further.

The above conclusions are somewhat premature. We would like to emphasize that the reported figures are somewhat dependent on the parameter values chosen in the tools, and that these are still going to be fine-tuned further. For the next developments of the method, we want to give a definitive evaluation of what can be achieved with the screening tools.

Our short term objectives consist of:
• Investigate if unverified records partly represent abnormal values; decrease the percentage of unverified records by modification of the calculation of the θ moving average (e.g. by applying a Kolmogorov-Zurbenko type of filter).
• Compare the current screening tool using normalised Euclidean distance with the findings using the Mahalanobis distance. Investigate which power of the inverse distance (currently 2) is best suited to estimate the weighting factors. In fact, stations may have one very close neighbour. The resulting problem is that the weighting factors for this one close neighbour become very large, and the neighbourhood mean depends too much on this one single attribute value.
• Validate currently optimised parameter values (neighbourhood limits, averaging windows for θ, threshold value for the z-test and for accepting zi values) by spiking PM10 datasets to artificially produce outliers. Study the possibility to improve the tool by setting its parameters per individual day.
• Currently, spatial distances are in decimal degrees, but should rather be evaluated in kilometres. Therefore we will implement a geodetic projection procedure for coordinate transformations.
• Currently the base station is not part of the selection for the calculation of neighbourhood statistics. This limitation is a consequence of inverse distances for weighting factor calculations becoming undefined otherwise. We will try to circumvent or improve this calculation limitation.
• Study if including rural stations in the “All background” category of tested stations is appropriate, as this type of area produces too many abnormal values in that category. Evaluate the possibility to run the screening separately for the urban, suburban and rural area types and for the traffic and industrial types of stations.
• Evaluate the feasibility of an iterative procedure, where once an abnormal value is detected, immediate corrections are made, such as replacing the attribute value of this abnormal data point by the average attribute value of its neighbours and updating the subsequent computation. The effect of these corrections is to avoid normal points close to the true abnormal points being claimed as possible abnormal points, too.
• Determine abnormal values for all PM10 datasets and for the latest version of AirBase over the last 10 available years for all countries having sufficient PM10 records.
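The concern about one very close neighbour dominating the neighbourhood mean can be made concrete with a short sketch. It assumes weights proportional to the inverse distance raised to a power and normalised to sum to one; the normalisation, the function name `inverse_distance_weights` and the distances are assumptions made for illustration, not taken from the report's equations.

```python
from typing import List


def inverse_distance_weights(distances: List[float], power: float = 2.0) -> List[float]:
    """Normalised inverse-distance weights (sketch; normalisation is an assumption)."""
    raw = [1.0 / d ** power for d in distances]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical distances to four neighbours, in decimal degrees.
# The first neighbour is almost co-located with the base station.
distances = [0.01, 0.5, 0.8, 1.0]

weights_by_power = {p: inverse_distance_weights(distances, p) for p in (1.0, 2.0)}

for p, weights in weights_by_power.items():
    shares = ", ".join(f"{100 * w:.2f} %" for w in weights)
    print(f"power {p}: {shares}")
```

In this example the closest neighbour receives nearly all of the weight for either power, and more so for power 2 than for power 1. A distance of exactly zero raises a division error in this sketch, which mirrors the reason given above for excluding the base station from its own neighbourhood.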
Our middle term objectives are:
• List and map stations continually producing z indicators higher or lower than the other stations in their neighbourhood in order to check station classifications.
• Apply the screening tools to NO2 and O3 datasets, if found feasible.
• Investigate transboundary effects on PM10 records; the cluster effect will be evaluated by including stations belonging to more than one country in the neighbourhood of stations near borders.
• Re-evaluate the measurement uncertainty for PM10, according to the method developed in Gerboles and Reuter, 2010 [3] and taking advantage of the consolidated screening tool.

Finally, our long term objective is linked with investigations like:
• Application of the screening tool for checking of data quality in the framework of near to real time data reporting.
• Evaluate the perspectives and feasibilities to develop the screening tool into an online application for operational use and accessibility by individual station managers.
Table 2: Summary of the output of the screening tool per country including numbers and density of background stations, total number of records, percentages of unverified records and detected abnormal values, types of measuring methods and area type of stations

Country   Backgr. stations   Density [stations / 10³ km²]   Records   Unverified records   Abnormal data*   Affected stations
AT         63                0.75                             40471      5697 (14%)          722 (2.1%)       57 (90%)
CZ         96                1.22                             64996      6545 (10%)         1214 (2.1%)       87 (91%)
DE        240                0.67                            160083     16575 (10%)         3070 (2.1%)      224 (93%)
ES        134                0.26                             59668     24980 (42%)          729 (2.1%)       81 (60%)
FR        286                0.52                            165443     49385 (30%)         6306 (5.4%)      259 (91%)
GB         56                0.24                             35561     12342 (35%)          600 (2.6%)       41 (73%)
IT        108                0.36                             49656     18527 (37%)          871 (2.8%)       82 (76%)
NL         24                0.58                             16135      3004 (19%)          227 (1.7%)       22 (92%)

Country   Measuring methods (Gravimetry, TEOM, Beta ray, Unknown and others)   Urban area   Suburban area   Rural area
AT        20 %, 56 %, 22 %, 1 %, Reflect. 1 %                                   31%          35%             34%
CZ        30 %, 70 %, 0 %                                                       32%          24%             45%
DE        29 %, 8 %, 40 %, 22 %, Chrom. 1 %                                     42%          31%             27%
ES        39 %, 3 %, 30 %, 0 %, DOAS 9 %, AAS 20 %                              33%          33%             34%
FR        85 %, 15 %                                                            59%          35%             6%
GB        5 %, 94 %, 1 %, 0 %                                                   84%          6%              10%
IT        20 %, 8 %, 59 %, 4 %, Cond. 1 %, Neph. 6 %                            59%          26%             14%
NL        100 %, 0 %                                                            37%          32%             32%

Method percentages are listed in the order in which they appear in the table row. For CZ, FR and NL fewer than four values are given, so their assignment to the individual method columns is left open here.

*Percentages of the verified records
TEOM: tapered element oscillating microbalance
Cond.: conductimetry
Neph.: nephelometry
Chrom.: chromatography
DOAS: differential optical absorption spectrometry
AAS: atomic absorption spectrometry
Reflect.: reflectometry
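The percentages in Table 2 can be recomputed from the record counts. The sketch below assumes, following the table footnote, that the share of abnormal data refers to the verified records (total records minus unverified records), while the share of unverified records refers to all records. The names `TABLE_2` and `shares` are illustrative.

```python
# (total records, unverified records, abnormal data) per country, from Table 2.
TABLE_2 = {
    "AT": (40471, 5697, 722),
    "CZ": (64996, 6545, 1214),
    "DE": (160083, 16575, 3070),
    "ES": (59668, 24980, 729),
    "FR": (165443, 49385, 6306),
    "GB": (35561, 12342, 600),
    "IT": (49656, 18527, 871),
    "NL": (16135, 3004, 227),
}


def shares(records: int, unverified: int, abnormal: int):
    """Return (% unverified of all records, % abnormal of verified records)."""
    verified = records - unverified
    return 100.0 * unverified / records, 100.0 * abnormal / verified


results = {country: shares(*counts) for country, counts in TABLE_2.items()}

for country, (pct_unverified, pct_abnormal) in results.items():
    print(f"{country}: unverified {pct_unverified:.0f} %, abnormal {pct_abnormal:.1f} %")
```

Computed this way, the rounded values agree with the percentages printed in the table for all eight countries.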
ANNEX: Z(Sx) 2006 / 2007 time series and abnormal datapoint identification summaries

Austria
AT0002R (background, rural), long = 16.766 deg E, lat = 47.77 deg N
[Time series plot of z(sx), Jan 2006 to Jan 2008. Legend: normal point, threshold limits, abnormal point, non-verifiable.]
number of datapoints investigated for AT0002R: 728
identified abnormal datapoints: 17
abnormal datapoints content: 2.34 %
abnormal datapoints station ranking = 17 within a total of 63 stations investigated for AT
non verifiable datapoints: 0
AT0003A (background, urban), long = 14.678 deg E, lat = 47.179 deg N
[Time series plot of z(sx), Jan 2006 to Jan 2008. Legend: normal point, threshold limits, abnormal point, non-verifiable.]
number of datapoints investigated for AT0003A: 725
identified abnormal datapoints: 3
abnormal datapoints content: 0.41 %
abnormal datapoints station ranking = 49 within a total of 63 stations investigated for AT
non verifiable datapoints: 1
AT0005R (background, rural), long = 12.972 deg E, lat = 46.68 deg N
[Time series plot of z(sx), Jan 2006 to Jan 2008. Legend: normal point, threshold limits, abnormal point, non-verifiable.]
number of datapoints investigated for AT0005R: 682
identified abnormal datapoints: 12
abnormal datapoints content: 1.76 %
abnormal datapoints station ranking = 25 within a total of 63 stations investigated for AT
non verifiable datapoints: 526
AT0012A (background, urban), long = 14.036 deg E, lat = 48.165 deg N
[Time series plot of z(sx), Jan 2006 to Jan 2008. Legend: normal point, threshold limits, abnormal point, non-verifiable.]
number of datapoints investigated for AT0012A: 726
identified abnormal datapoints: 1
abnormal datapoints content: 0.14 %
abnormal datapoints station ranking = 56 within a total of 63 stations investigated for AT
non verifiable datapoints: 1
AT0016A (background, suburban), long = 14.239 deg E, lat = 48.225 deg N
[Time series plot of z(sx), Jan 2006 to Jan 2008. Legend: normal point, threshold limits, abnormal point, non-verifiable.]
number of datapoints investigated for AT0016A: 724
identified abnormal datapoints: 1
abnormal datapoints content: 0.14 %
abnormal datapoints station ranking = 55 within a total of 63 stations investigated for AT
non verifiable datapoints: 2
AT0020A (background, suburban), long = 16.303 deg E, lat = 48.236 deg N
[Time series plot of z(sx), Jan 2006 to Jan 2008.]