benchmark database inhomogeneous data, surrogate data and synthetic data

34
Benchmark database inhomogeneous data, surrogate data and synthetic data Victor Venema M eteo ro lo g ica l In stitu te Bonn

Upload: nira

Post on 12-Jan-2016

42 views

Category:

Documents


0 download

DESCRIPTION

Benchmark database inhomogeneous data, surrogate data and synthetic data. Victor Venema. Content. Introduction to benchmark dataset Some results Some questions about exercise Question about future work Analyse and publish the results. Benchmark dataset. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Benchmark database

inhomogeneous data, surrogate data and

synthetic data

Victor Venema

M e te o ro lo g ic a l

I n stitu te

B o n n

Page 2: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Content

Introduction to benchmark dataset

Some results Some questions about exercise Question about future work Analyse and publish the results

Page 3: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Benchmark dataset1) Real (inhomogeneous) climate records

Most realistic case Investigate if various HA find the same breaks

2) Synthetic data For example, Gaussian white noise Insert know inhomogeneities Test performance

3) Surrogate data Empirical distribution and correlations Insert know inhomogeneities Compare to synthetic data: test of assumptions

Page 4: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Creation benchmark – Outline talk

1) Start with homogeneous data

2) Multiple surrogate and synthetic realisations

3) Mask surrogate records

4) Add global trend

5) Insert inhomogeneities in station time series

6) Published on the web

7) Homogenize by COST participants and third parties

8) Analyse the results and publish

Page 5: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

1) Start with homogeneous data

Monthly mean temperature and precipitation Later also daily data (WG4), maybe other

variables (pressure, wind)

Homogeneous, no missing data Longer surrogates are based on multiple copies Generated networks are 100 a

Page 6: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

2) Multiple surrogate realisations

Multiple surrogate realisations– Temporal correlations– Station cross-correlations– Empirical distribution function

Annual cycle removed before, added at the end Number of stations, 5, 9 or 15 Cross correlation varies as much as possible

Page 7: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

5) Insert inhomogeneities in stations

Independent breaks Determined at random for every station and time 5 Breaks per 100 a Monthly slightly different perturbations Temperature

– Additive– Size: Gaussian distribution, σ=0.8°C

Rain– Multiplicative– Size: Gaussian distribution, <x>=1, σ=10%

Page 8: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Example break perturbations station

1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000-0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Page 9: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Example break perturbations network

Year

Temperature perturbations

1920 1940 1960 1980 2000

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

5.5

-1

-0.5

0

0.5

1

1.5

Page 10: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

5) Insert inhomogeneities in stations

Correlated break in network One break in 10 % of networks In 30 % of the station simultaneously Position random

– At least 10 % of data points on either side

Page 11: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Example correlated break

Year

Correlated break

1920 1940 1960 1980 2000

2

4

6

8

10

12

14

16

18

20-0.3

-0.25

-0.2

-0.15

-0.1

-0.05

0

Page 12: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

5) Insert inhomogeneities in stations

Outliers Size

– Temperature: < 1 or > 99 percentile– Rain: < 0.1 or > 99.9 percentile

Frequency– 50 % of networks: 1 %– 50 % of networks: 3 %

Page 13: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Example outlier perturbations station

1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000-10

-5

0

5Outliers

Page 14: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Example outliers network

Year

Outliers

1900 1920 1940 1960 1980 2000

2

4

6

8

10

12

14

16

18

20

-10

-5

0

5

10

Page 15: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

5) Insert inhomogeneities in stations

Local trends (only temperature) Linear increase or decrease in one station Duration: between 30 and 60a Maximum size: Gaussian distribution, σ=0.8°C Frequency: once in 10 % of the stations

Page 16: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Example local trends

Year

Local trends

1900 1920 1940 1960 1980

2

4

6

8

10

12

14

16

18

20-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Page 17: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

6) Published on the web

Inhomogeneous data are published on the COST-HOME homepage

Everyone is welcome to download and homogenize the data

http://www.meteo.uni-bonn.de/ venema/themes/homogenisation

Page 18: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

7) Homogenize by participants

Return homogenised data Should be in COST-HOME file format (next slide)

– For real data including quality flags

Return break detection file– BREAK– OUTLI– BEGTR– ENDTR

Multiple breaks at one data possible

Page 19: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Typical errors

The file format needs to be perfect! Forgetting the station-file that describes which

stations belong to the homogenised network Changing the file names in this station file to

homogeneous data files ► (Forgetting to return the files with the quality flags) The sizes of the breaks are not in the break file Please, keep directory structure of the benchmark

like it is, also for partial contributions – The only difference is the main directory

All files are tab-delimited ASCII files

Page 20: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

COST-HOME file format – network file

Page 21: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Typical errors

The file format needs to be perfect! Forgetting the station-file that describes which

stations belong to the homogenised network Changing the file names in this station file to

homogeneous data files (Forgetting to return the files with the quality flags) The sizes of the breaks are not in the break file ► Please, keep directory structure of the benchmark

like it is, also for partial contributions – The only difference is the main directory

All files are tab-delimited ASCII files

Page 22: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Detected breaks file

Page 23: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Typical errors – see discussion

The file format needs to be perfect! Forgetting the station-file that describes which

stations belong to the homogenised network Changing the file names in this station file to

homogeneous data files (Forgetting to return the files with the quality flags) The sizes of the breaks are not in the break file Please, keep directory structure of the benchmark

like it is, also for partial contributions – The only difference is the main directory

All files are tab-delimited ASCII files ►

Page 24: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

COST-HOME file format – monthly data

Page 25: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

ContributionsParticipant Algorithm Remarks

1. José Guijarro Climatol 6 Versions with different settings

2. Péter Domonkos CM-D, MASH-D, NSHT-D

3 Versions / detection algorithms

3. Michele Brunetti Brunetti Detection Craddock based; 2 surrogate temp. networks

4. Dubravka Rasol & Olivier Mestre

PRODIGE All surrogate temp.; 13 surrogate precip. Networks

5. Matthew Menne & Claude Williams

Automated pairwise hom.

2 Versions; “all” temp. Networks (part of real #3 is missing)

6. Christine Gruber & Ingeborg Auer

HOCLIS 1 Surrogate temp. & 1 surrogate precip.

7. Gregor Vertacnik MASH All surrogate temp.

8. Petr Stepanek AnClim 1 Surrogate temp. & 1 surrogate precip.

9. Lucie Vincent Vincent 1 Surrogate temp.

10. Enric Aguilar NSHT Not in the right format yet

Page 26: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

No. homogenised networks - algorithmTable 1. Number of homogenised networks per algorithm

Homogenisation alg. All networks Real netw. Surrogate netw. Synthetic netw.

PRODIGE 32 0 32 0

Brunetti 2 0 2 0

MASH 25 0 25 0

Vincent 1 0 1 0

HOCLIS 1 0 1 0

AnClim 2 0 2 0

Climatol A 92 12 40 40

Climatol C 92 12 40 40

Climatol D 92 12 40 40

Climatol E 92 12 40 40

Climatol F 92 12 40 40

ClimatolG01 2 0 2 0

APHa2 42 5 18 19

APHa1 42 5 18 19

CM-D 5 0 5 0

MASH-D 5 0 5 0

SNHT-D 5 0 5 0

Page 27: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

No. homogenised networks – input data

Table 3. Summary data: Number of homogenised networks per network

Network No. networks Temp. netw. Precip. netw.

All 624 371 253

Real 70 40 30

Surrogate 316 193 123

Surrogate #1 29 17 12

Surrogate ~#1 287 176 111

Synthetic 238 138 100

Synthetic #1 12 7 5

Synthetic ~#1 226 131 95

Page 28: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Mean no. outliers per stationTable 21. Mean number of outliers per station for every algorithm

Homogenisation alg. All networks Real netw. Surrogate netw. Synthetic netw.

PRODIGE 0.0 NaN 0.0 NaN

Brunetti 3.4 NaN 3.4 NaN

MASH 16.1 NaN 16.1 NaN

Vincent 0.0 NaN 0.0 NaN

HOCLIS 6.0 NaN 6.0 NaN

AnClim 5.5 NaN 5.5 NaN

Climatol A 3.6 0.2 4.0 4.2

Climatol C 88.5 56.4 94.8 91.9

Climatol D 54.4 34.7 58.0 56.7

Climatol E 54.2 31.8 60.7 54.4

Climatol F 43.8 33.2 47.8 42.9

ClimatolG01 4.4 NaN 4.4 NaN

APHa2 1.9 0.0 2.1 2.2

APHa1 1.9 0.0 2.1 2.2

CM-D 15.7 NaN 15.7 NaN

MASH-D 15.5 NaN 15.5 NaN

SNHT-D 15.3 NaN 15.3 NaN

Page 29: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Mean no. breaks per stationTable 22. Mean number of breaks per station for every algorithm

Homogenisation alg. All networks Real netw. Surrogate netw. Synthetic netw.

PRODIGE 2.7 NaN 2.7 NaN

Brunetti 5.0 NaN 5.0 NaN

MASH 4.6 NaN 4.6 NaN

Vincent 0.0 NaN 0.0 NaN

HOCLIS 2.8 NaN 2.8 NaN

AnClim 1.2 NaN 1.2 NaN

Climatol A 1.3 0.8 1.5 1.3

Climatol C 1.5 0.8 1.6 1.6

Climatol D 1.4 1.0 1.5 1.4

Climatol E 1.6 0.9 1.8 1.6

Climatol F 1.5 1.2 1.7 1.4

ClimatolG01 1.2 NaN 1.2 NaN

APHa2 1.8 1.2 2.1 1.7

APHa1 1.7 1.0 1.9 1.6

CM-D 4.6 NaN 4.6 NaN

MASH-D 3.9 NaN 3.9 NaN

SNHT-D 3.2 NaN 3.2 NaN

Page 30: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Homogenising the exercise Tab-delimited files: also space-delimited?

– Mixture of strings and numbers Data quality files only for real data section Do we want to use the Diurnal Temperature

Range (DTR)?– Not useful for surrogate and synthetic data!– If we do, everyone should do it

End or begin uncorrected?– Compute statistics independent of absolute level?

Filling missing values part exercise? Human quality control or raw algorithm output? Homogenise all or homogenisable networks, times

Page 31: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Contributions – who is missing?Participant Algorithm Remarks

1. José Guijarro Climatol 6 Versions with different settings

2. Péter Domonkos CM-D, MASH-D, NSHT-D

3 Versions / detection algorithms

3. Michele Brunetti Brunetti Detection Craddock based; 2 surrogate temp. networks

4. Dubravka Rasol & Olivier Mestre

PRODIGE All surrogate temp.; 13 surrogate precip. Networks

5. Matthew Menne & Claude Williams

Automated pairwise hom.

2 Versions; “all” temp. Networks (part of real #3 is missing)

6. Christine Gruber & Ingeborg Auer

HOCLIS 1 Surrogate temp. & 1 surrogate precip.

7. Gregor Vertacnik MASH All surrogate temp.

8. Petr Stepanek AnClim 1 Surrogate temp. & 1 surrogate precip.

9. Lucie Vincent Vincent 1 Surrogate temp.

10. Enric Aguilar NSHT Not in the right format yet

Page 32: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Analysing the results What measures define a well homogenised

dataset?– Real data vs. data with known truth

Ensemble mean for real data?

– Breaks Position, hit rate size distribution detection probability as function of size

– Data itself Root mean square error (RMSE) RMSE (without outliers) RMSE (bias corrected) Uncertainty in the network mean trend

How to study which components are best?

Page 33: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Deadline(s)

Agreed on 09/2009, September this year Multiple deadlines

– For example: synthetic data, real data, surrogate data– After deadline the truth can be revealed– After deadline the other contributions can be

revealed(?)– Start earlier analysing the results– For example: May, July, September

Bologna, 25 – 26 May, EGU, 19 – 24 April

Page 34: Benchmark database inhomogeneous data, surrogate data and  synthetic data

Victor Venema, [email protected], COST HOME, March 2009, Tarragona, Spain

Articles Articles

– Overview COST Action & benchmark with very basic analysis results Performance difference between synthetic (Gaussian, white

noise) and surrogate data How to deal multiple contributions per algorithm? Do we have references to all algorithms?

– What should the others be about Analysing results, which components are best

Who will organise, coordinate it?– Not everyone should do the same analysis– How to subdivide the work?

After deadline: sensitivity analysis