Evaluation of 2014 SPICE Stream 1.5 Candidate
30 May 2014
TCMT Stream 1.5 Analysis Team: Louisa Nance, Mrinal Biswas, Barbara Brown, Tressa
Fowler, Paul Kucera, Kathryn Newman, and Christopher Williams
Data Manager: Kathryn Newman
Overview
The Statistical Prediction of Intensity from a Consensus Ensemble, referred to as SPICE, is a six-
member statistical intensity consensus based on a weighted combination of the LGEM and
SHIPS statistical-dynamical intensity models run using input from the GFS, GFDL and HWRF
dynamical models. The forecasts are generated using 6-h old output from the dynamical models;
therefore, SPICE can be considered early model guidance. Given that this Stream 1.5 candidate is
intended only to improve intensity guidance, track performance was not considered in this report.
The evaluation of SPICE focused on two basins (Atlantic and eastern North Pacific), each with
two primary analyses: (1) a direct comparison between SPICE and each of last year’s top-flight
models for intensity and the operational fixed consensus for intensity, and (2) an assessment of
how SPICE performed relative to last year’s top-flight intensity models and the fixed operational
consensus as a group. Note that rather than adding SPICE to the operational consensus, a direct
comparison between SPICE and the operational consensus was performed because SPICE is a
consensus ensemble that already includes members of the operational consensus for intensity.
All aspects of the evaluation were based on homogeneous samples for each component of the
evaluation, so the number of cases varied depending on the availability of the specific
operational baseline. Table 1 lists the baseline models used in the SPICE evaluation.
Definitions of the operational baselines with their corresponding ATCF IDs can be found in the
“2014 Stream 1.5 Methodology” write-up. Note that only the early versions of all model guidance
were considered in this analysis. Cases were aggregated either over ‘land and water’ or over
‘water only’. Except where noted, results are for the aggregation of cases over both land and
water.
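The homogeneous-sample requirement described above amounts to a simple filter: for each comparison, only cases for which every model being compared produced a verifiable forecast are retained, which is why the case counts vary by baseline. A minimal sketch of that filter (the case-record structure and case IDs below are hypothetical illustrations, not the evaluation's actual data format):

```python
def homogeneous_sample(cases, models):
    """Keep only the cases for which every listed model has a forecast.

    `cases` maps a case ID to the set of models with a verifiable
    forecast for that case (hypothetical structure for illustration).
    """
    return [cid for cid, avail in cases.items()
            if all(m in avail for m in models)]

# Hypothetical availability records for three cases:
cases = {
    "AL092011_2011082712": {"SPC3", "HWFI", "LGEM", "DSHP", "ICON"},
    "AL092011_2011082718": {"SPC3", "HWFI", "LGEM"},           # ICON missing
    "EP052012_2012052900": {"SPC3", "HWFI", "DSHP", "ICON"},
}

# The SPC3 vs. HWFI comparison keeps all three cases...
print(len(homogeneous_sample(cases, ["SPC3", "HWFI"])))  # 3
# ...but the SPC3 vs. ICON comparison drops the case without ICON.
print(len(homogeneous_sample(cases, ["SPC3", "ICON"])))  # 2
```

This is why, as noted above, the number of cases in each pairwise comparison depends on the availability of the specific operational baseline.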
Inventory
The Cooperative Institute for Research in the Atmosphere (CIRA) team delivered 1,762
retrospective SPICE forecasts for 50 Atlantic basin (AL) storms (937 cases) and 48 eastern North
Pacific basin (EP) storms (825 cases) for the 2011-2013 hurricane seasons. Storm information
from the National Hurricane Center (NHC) Best Track was not available for 15 cases (7 AL, 8
EP). In addition, the storm was not classified as tropical or subtropical at the initial time for 59
AL cases and 45 EP cases. Given that the NHC verification methodology requires a tropical or
subtropical classification at the initial time for a case to be verified, the total sample used in this
analysis consisted of 871 cases for the AL basin and 772 cases for the EP basin. For the detailed
discussion of the evaluation results, SPICE will be referred to by its ATCF ID, SPC3.
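The sample bookkeeping above reduces to simple arithmetic, reproduced here as a sanity check (all counts are taken from the inventory described in this section):

```python
# Delivered retrospective cases per basin (2011-2013 seasons):
al_cases, ep_cases = 937, 825

# Cases removed before verification:
al_no_besttrack, ep_no_besttrack = 7, 8    # NHC Best Track unavailable
al_nontropical, ep_nontropical = 59, 45    # not tropical/subtropical at t=0

al_verified = al_cases - al_no_besttrack - al_nontropical
ep_verified = ep_cases - ep_no_besttrack - ep_nontropical
print(al_cases + ep_cases, al_verified, ep_verified)  # 1762 871 772
```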
Atlantic Basin
The direct comparisons between the SPC3 absolute intensity errors and those for the top-flight
intensity models and the operational fixed consensus for intensity produced mixed results in the
AL basin (Fig. 1). The pairwise difference tests showed three statistically significant (SS)
improvements of 22 to 34% for the comparison with HWFI and four SS improvements of 6 to
10% for the comparison with LGEM, but these tests also resulted in one SS degradation of 3%
for the DSHP comparison and four SS degradations of 9 to 12% for the ICON comparison
(Table 2). The degradation signature
corresponded to shorter lead times (DSHP 12 h, ICON 12-48 h), whereas the timing of the
improvements varied from intermediate lead times (36-72 h) for the LGEM comparison to the
longest lead times (96-120 h) for the HWFI comparison. Although the SPC3 comparisons with
the top-flight models for intensity did lead to one SS degradation, the results for the remaining
lead times, for the most part, either corresponded to SS improvements or positive non-SS
differences. The opposite was true for the comparison with ICON, for which even the non-SS
differences were predominantly negative. Limiting the sample to cases over water only produced one
additional SS improvement for the LGEM comparison and reduced the SS degradations by one
for the DSHP and ICON comparisons. Hence, the performance of SPC3 relative to all baselines
tended to improve when the sample was limited to over water cases only.
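The pairwise statistics reported throughout this section follow a standard paired design: for each case in the homogeneous sample, the baseline's absolute intensity error minus SPC3's, averaged by lead time, with percent improvement expressed relative to the baseline's mean error. A hedged sketch of that arithmetic only (the error values are invented, and the report's actual SS determination, which attaches 95% confidence intervals to the mean difference, is not reproduced here):

```python
def mean_pairwise_difference(err_baseline, err_candidate):
    """Mean per-case difference; positive => candidate improves on baseline."""
    diffs = [b - c for b, c in zip(err_baseline, err_candidate)]
    return sum(diffs) / len(diffs)

def percent_improvement(err_baseline, err_candidate):
    """Mean pairwise difference expressed relative to the baseline's mean error."""
    mean_b = sum(err_baseline) / len(err_baseline)
    return 100.0 * mean_pairwise_difference(err_baseline, err_candidate) / mean_b

# Hypothetical absolute intensity errors (kt) for one lead time:
hwfi = [20.0, 15.0, 25.0, 10.0]
spc3 = [14.0, 12.0, 18.0, 8.0]
print(round(percent_improvement(hwfi, spc3), 1))  # 25.7
```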
The frequency of superior performance (FSP) technique, which does not take into consideration
the magnitude of the error differences, yielded results that were consistent with those for the
pairwise difference tests, except for a few minor differences for the HWFI and DSHP
comparisons. For the HWFI comparison, SPC3 outperformed HWFI for lead times starting at 60
h, whereas the mean absolute intensity errors were not statistically distinguishable until 96 h
(Fig. 2). In terms of FSP, SPC3 and DSHP were evenly matched for all lead times (Fig. 2).
Limiting the sample to cases over water only did not change the overall character of the results.
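The FSP calculation can be sketched directly from its definition: the fraction of paired cases in which each model's absolute error is smaller, with pairs whose errors differ by less than 1 kt counted as ties (the 1-kt tie threshold follows the Fig. 2 caption; the error values below are hypothetical):

```python
def fsp(err_a, err_b, tie_threshold=1.0):
    """Frequency of superior performance for models A and B.

    Pairs whose absolute errors differ by less than `tie_threshold`
    (1 kt, per this evaluation's convention) count as ties and are
    excluded from both win counts. Returns (FSP_A, FSP_B, tie fraction).
    """
    wins_a = wins_b = ties = 0
    for a, b in zip(err_a, err_b):
        if abs(a - b) < tie_threshold:
            ties += 1
        elif a < b:
            wins_a += 1
        else:
            wins_b += 1
    n = len(err_a)
    return wins_a / n, wins_b / n, ties / n

# Hypothetical absolute intensity errors (kt) at one lead time:
spc3 = [10.0, 12.5, 8.0, 20.0, 15.0]
hwfi = [14.0, 12.0, 11.0, 19.5, 22.0]
print(fsp(spc3, hwfi))  # (0.6, 0.0, 0.4)
```

Because only the sign of each difference matters, FSP disregards the magnitude of the error differences, which is why it can flag SS results at lead times where the mean pairwise differences do not, and vice versa.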
Of the four models and the fixed consensus considered in this evaluation, only HWFI had an SS
bias in the AL basin, which corresponded to over-predicting the storm’s intensity at lead times
60 to 84 h (not shown). The intensity error distributions (Fig. 3) revealed the largest intensity
errors for all five types of guidance considered in this evaluation tended to be associated with
under-predicting the intensity.
A comparison of SPC3’s intensity performance to that of the three top-flight models and the
operational fixed consensus in the AL basin (Fig. 4) indicated SPC3 was significantly more
likely to have the smallest errors, i.e., rank 1st, than would be expected based on random
forecasts for 60 to 72 h and 96 to 120 h. When SPC3 is awarded the best rank for all tied cases
(i.e., cases with the same intensity error for SPC3 and at least one other model), the proportion of
best rankings increased substantially for all lead times (shown as solid black numbers). Note that
SPC3 was also significantly less likely to have the largest errors, i.e., rank 5th, for 36 to 60 h.
The overall signature of the rankings for the water only sample was consistent with that for the
land and water sample except that SPC3 was more likely than random to rank 1st for all lead
times starting at 60 h and less likely to rank 5th for 24 to 60 h (Fig. 4).
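The rank-frequency analysis ranks SPC3's absolute error against the four baselines case by case; under random forecasts each of the five ranks would occur 20% of the time (the grey line in Fig. 4). A minimal sketch of the per-case ranking, including the convention in which SPC3 is awarded the better rank for all ties (the error values are hypothetical):

```python
def spc3_rank(errors, ties_to_spc3=True):
    """Rank of SPC3's absolute error among all five guidances (1 = best).

    `errors` maps guidance name to absolute intensity error for one case.
    With `ties_to_spc3`, SPC3 receives the better rank for any tie; this
    mirrors the alternative tallies shown as black numbers in Fig. 4.
    """
    spc3_err = errors["SPC3"]
    others = [e for name, e in errors.items() if name != "SPC3"]
    if ties_to_spc3:
        return 1 + sum(1 for e in others if e < spc3_err)
    return 1 + sum(1 for e in others if e <= spc3_err)

# One hypothetical case in which SPC3 ties HWFI:
case = {"SPC3": 12.0, "HWFI": 12.0, "LGEM": 15.0, "DSHP": 10.0, "ICON": 18.0}
print(spc3_rank(case))                      # 2 (tie with HWFI awarded to SPC3)
print(spc3_rank(case, ties_to_spc3=False))  # 3
```

Aggregating these per-case ranks by lead time and comparing the rank-1 and rank-5 frequencies against the 20% random-forecast expectation yields the significance statements quoted above.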
Eastern North Pacific Basin
The direct comparisons between the SPC3 intensity errors and those for the top-flight intensity
models and the fixed consensus for intensity also produced mixed results in the EP basin (Fig. 5).
The pairwise difference tests once again showed SS improvements for the HWFI and LGEM
comparisons, with the SS improvements occurring at longer lead times (72 to 120 h) for HWFI
and at shorter lead times (36 to 60 h) for LGEM (Table 3). Percent improvements
over HWFI ranged from 16 to 34%, whereas those for the LGEM comparison ranged from 6 to
9%. The pairwise difference tests for the DSHP and ICON comparisons produced one SS
degradation at 12 h, whereas the DSHP comparison also produced two SS improvements of 10 to
13% at 84 to 96 h. Although the SPC3 comparisons with the top-flight models for intensity did
lead to one SS degradation, the results for the remaining lead times either corresponded to SS
improvements or positive non-SS differences. Conversely, most of the non-SS differences for
the ICON comparison were negative. Limiting the sample to cases over water only did not yield
any notable differences in the results for cases over land and water in the EP basin.
Applying the FSP approach in the EP basin also produced small differences from the pairwise
difference tests with respect to the number of lead times with statistically distinguishable results.
In the context of FSP, the time periods for which SPC3 was associated with SS improvements
over HWFI and LGEM were 36 to 120 h and 24 to 72 h, respectively (Fig. 6). For FSP, SPC3
outperformed DSHP at 48 h and 72 to 120 h, with no SS degradations. SPC3 was characterized
by poorer performance with respect to ICON at 12 and 24 h. Hence, the FSP perspective
indicated a signature of improvement associated with SPC3 when compared with the top-flight
models and a signature of degradation when compared with the operational consensus. Limiting
the sample to cases over water only did not change the overall character of the FSP results (not
shown), but the signature of SS improvements over HWFI was more intermittent (36 to 48 h, 72
to 84 h, and 108 to 120 h), and the improvements over DSHP were no longer SS at the longest
lead times (108 to 120 h).
The intensity guidance considered in this evaluation did not exhibit any SS bias in the EP basin
(not shown). On the other hand, the medians of the LGEM and ICON intensity error
distributions were negative for 24 to 36 h and 24 h, respectively, and positive for DSHP and
ICON for 48 to 120 h and 84 to 120 h, respectively (Fig. 7). Regardless of whether the medians
of the intensity error distributions were distinguishable from zero or not, the largest intensity
errors for all types of intensity guidance considered in this evaluation tended to be associated
with under-predicting the intensity, which is similar to the results found in the AL basin.
A comparison of SPC3’s intensity performance in the EP basin to that of all three top-flight
models and the operational consensus for intensity (Fig. 8) indicated SPC3 was less likely to
have the largest errors, i.e., rank 5th, than would be expected based on random forecasts at all
lead times and rank 5 was statistically distinct from all other ranks at 36 to 72 h. Note that SPC3
was also less likely to have the smallest errors, i.e., rank 1st, at 12 to 24 h, but a notable number
of ties (4 to 10%) between SPC3 and the best performing baseline model(s) also occurred in the
EP basin at all lead times. The number of ties was particularly large at these shorter lead times.
Because the ties were randomly assigned between ranks 1 and 2, awarding all rank-1 ties to
SPC3 would reduce the frequency of rank 2 by approximately 2 to 5% (~50% of the ties).
Keeping this caveat in mind, SPC3 was more likely to rank either 2nd or 3rd at
most lead times and more likely to rank 4th at shorter lead times. The overall signature of the
rankings for the water only sample in the EP basin was consistent with that for the land and
water sample.
Overall Evaluation
The pairwise difference tests for the direct comparisons between SPC3 and the top-flight models
for intensity produced SS improvements of 6 to 34% for the comparisons with HWFI and
LGEM in both basins, whereas the DSHP comparison produced SS degradations of 3% at 12 h in
both basins and SS improvements of 10 to 13% at 84 to 96 h in the EP basin. The pairwise
difference tests for the direct comparison with ICON also produced SS degradations at the
shorter lead times in both basins, with the degradation signature being stronger in the AL basin.
While the SPC3 errors were statistically indistinguishable from those of the baseline models for a
number of lead times, these non-SS differences tended to be positive (corresponding to
improvements) for the top-flight model comparisons and negative for the comparison with the
operational fixed consensus. While the basic trends were the same in both basins, the number of
lead times with SS improvements tended to be larger in the EP basin and the number of lead
times with SS degradations tended to be larger in the AL basin. The FSP approach, which
differs from the pairwise difference test in that it does not take into account the magnitude of the
error differences, indicated a slightly stronger signature of improvement over the top-flight
baselines in both basins, but a slightly stronger signature of poorer performance when comparing
with ICON. The two different approaches resulted in similar conclusions, indicating an
improvement over at least two top-flight models and slightly poorer performance at shorter lead
times when compared to the ICON in both basins. Limiting the sample to cases over water only
had little impact on the overall results.
The rank frequency analysis indicated SPC3 errors were less likely than random to be larger than
those of all the operational baselines at all lead times in the EP basin and for a few lead times in
the AL basin.
In addition, SPC3 errors were more likely to rank 1st in the AL basin at lead times longer than 48
h and rank 2nd and/or 3rd at all lead times in the EP basin. Hence, the results were favorable for
selecting SPC3 for explicit intensity guidance in the AL basin and somewhat favorable (i.e.,
improves upon two or three of the operational baselines) in the EP basin.
Table 1: Summary of baselines used for evaluation of SPICE for the specified metrics.

Baselines   Intensity (land and water)   Intensity (water only)
HWFI                   ●                           ●
LGEM                   ●                           ●
DSHP                   ●                           ●
ICON                   ●                           ●
Table 2: Inventory of statistically significant (SS) pairwise differences for intensity stemming from the comparison of SPC3 and each
individual top-flight model and the operational fixed consensus for intensity in the AL basin. See 2014 Stream 1.5 methodology write-
up for description of entries.
Table 3: Inventory of statistically significant (SS) pairwise differences for intensity stemming from the comparison of SPC3 and each
individual top-flight model and the operational fixed consensus for intensity in the EP basin. See 2014 Stream 1.5 methodology write-
up for description of entries.
Figure 1: Mean absolute intensity errors (SPC3-red, baselines-black) and mean pairwise differences (blue) with 95% confidence
intervals with respect to lead time for HWFI and SPC3 (top left panel), DSHP and SPC3 (top right panel), LGEM and SPC3 (bottom
left panel), and ICON and SPC3 (bottom right panel) in the Atlantic basin.
Figure 2: Frequency of superior performance (FSP) with 95% confidence intervals for
intensity error differences stemming from the comparison of HWFI and SPC3 (top panel)
and DSHP and SPC3 (bottom panel) with respect to lead time for cases in the Atlantic
basin. Ties are defined as cases for which the difference was less than 1 kt.
Figure 3: Intensity error distributions with respect to lead time for HWFI and SPC3 (top left panel), DSHP and SPC3 (top right panel),
LGEM and SPC3 (bottom left panel) and ICON and SPC3 (bottom right panel) in the Atlantic basin.
Figure 4: Rankings with 95% confidence intervals for SPC3 compared to the three top-
flight models and fixed operational consensus for intensity guidance with respect to lead
time. Aggregations are for land and water (top panel) and over water only (bottom panel)
for Atlantic basin. The grey horizontal line highlights the 20% frequency for reference.
Black numbers indicate the frequencies of the first and fifth rankings where the candidate
model was assigned the better (lower) ranking for all ties.
Figure 5: Same as Fig. 1 except in the eastern North Pacific basin.
Figure 6: Same as Fig. 2 except from the comparison of HWFI and SPC3 (top left panel), DSHP and SPC3 (top right panel), LGEM
and SPC3 (bottom left panel), and ICON and SPC3 (bottom right panel) in the eastern North Pacific basin.
Figure 7: Same as Fig. 3 except in the eastern North Pacific basin.
Figure 8: Same as Fig. 4 except for aggregations over land and water in the eastern North
Pacific Basin.