Evaluation of 2014 SPICE Stream 1.5 Candidate
30 May 2014
TCMT Stream 1.5 Analysis Team: Louisa Nance, Mrinal Biswas, Barbara Brown, Tressa
Fowler, Paul Kucera, Kathryn Newman, and Christopher Williams
Data Manager: Kathryn Newman
Overview
The Statistical Prediction of Intensity from a Consensus Ensemble, referred to as SPICE, is a six-
member statistical intensity consensus based on a weighted combination of the LGEM and
SHIPS statistical-dynamical intensity models run using input from the GFS, GFDL and HWRF
dynamical models. The forecasts are generated using 6-h old output from the dynamical models;
therefore, SPICE can be considered early model guidance. Given that this Stream 1.5 candidate is
intended only to improve intensity guidance, track performance was not considered in this report.
The evaluation of SPICE focused on two basins (Atlantic and eastern North Pacific), each with
two primary analyses: (1) a direct comparison between SPICE and each of last year’s top-flight
models for intensity and the operational fixed consensus for intensity, and (2) an assessment of
how SPICE performed relative to last year’s top-flight intensity models and the fixed operational
consensus as a group. Note that rather than adding SPICE to the operational consensus, a direct
comparison between SPICE and the operational consensus was performed because SPICE is a
consensus ensemble that already includes members of the operational consensus for intensity.
All aspects of the evaluation were based on homogeneous samples for each component of the
evaluation, so the number of cases varied depending on the availability of the specific
operational baseline. Table 1 lists the baseline models used in the SPICE evaluation.
Definitions of the operational baselines with their corresponding ATCF IDs can be found in the
“2014 Stream 1.5 Methodology” write-up. Note that only the early versions of all model guidance
were considered in this analysis. Cases were aggregated either over ‘land and water’ or over
‘water only’. Except where noted, results are for the aggregation of cases over both land and
water.
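The homogeneous-sample requirement described above amounts to a simple filter: for each comparison, only cases for which every model being compared produced a verifiable forecast are retained, which is why the case counts vary by baseline. A minimal sketch of that filter (the case-record structure and case IDs below are hypothetical illustrations, not the evaluation's actual data format):

```python
def homogeneous_sample(cases, models):
    """Keep only the cases for which every listed model has a forecast.

    `cases` maps a case ID to the set of models with a verifiable
    forecast for that case (hypothetical structure for illustration).
    """
    return [cid for cid, avail in cases.items()
            if all(m in avail for m in models)]

# Hypothetical availability records for three cases:
cases = {
    "AL092011_2011082712": {"SPC3", "HWFI", "LGEM", "DSHP", "ICON"},
    "AL092011_2011082718": {"SPC3", "HWFI", "LGEM"},           # ICON missing
    "EP052012_2012052900": {"SPC3", "HWFI", "DSHP", "ICON"},
}

# The SPC3 vs. HWFI comparison keeps all three cases...
print(len(homogeneous_sample(cases, ["SPC3", "HWFI"])))  # 3
# ...but the SPC3 vs. ICON comparison drops the case without ICON.
print(len(homogeneous_sample(cases, ["SPC3", "ICON"])))  # 2
```

This is why, as noted above, the number of cases in each pairwise comparison depends on the availability of the specific operational baseline.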
Inventory
The Cooperative Institute for Research in the Atmosphere (CIRA) team delivered 1,762
retrospective SPICE forecasts for 50 Atlantic basin (AL) storms (937 cases) and 48 eastern North
Pacific basin (EP) storms (825 cases) for the 2011-2013 hurricane seasons. Storm information
from the National Hurricane Center (NHC) Best Track was not available for 15 cases (7 AL, 8
EP). In addition, the storm was not classified as tropical or subtropical at the initial time for 59
AL cases and 45 EP cases. Given that the NHC verification methodology requires a tropical or
subtropical classification at the initial time for a case to be verified, the total sample used in this
analysis consisted of 871 cases for the AL basin and 772 cases for the EP basin. For the detailed
discussion of the evaluation results, SPICE will be referred to by its ATCF ID, SPC3.
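The sample bookkeeping above reduces to simple arithmetic, reproduced here as a sanity check (all counts are taken from the inventory described in this section):

```python
# Delivered retrospective cases per basin (2011-2013 seasons):
al_cases, ep_cases = 937, 825

# Cases removed before verification:
al_no_besttrack, ep_no_besttrack = 7, 8    # NHC Best Track unavailable
al_nontropical, ep_nontropical = 59, 45    # not tropical/subtropical at t=0

al_verified = al_cases - al_no_besttrack - al_nontropical
ep_verified = ep_cases - ep_no_besttrack - ep_nontropical
print(al_cases + ep_cases, al_verified, ep_verified)  # 1762 871 772
```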
Atlantic Basin
The direct comparisons between the SPC3 absolute intensity errors and those for the top-flight
intensity models and the operational fixed consensus for intensity produced mixed results in the
AL basin (Fig. 1). The pairwise difference tests showed three statistically significant (SS)
improvements of 22 to 34% for the comparison with HWFI and four SS improvements of 6 to
10% for the comparison with LGEM, but these tests also resulted in one SS degradation of 3%
for the DSHP comparison and four SS degradations of 9 to 12% for the ICON comparison
(Table 2). The degradation signature
corresponded to shorter lead times (DSHP 12 h, ICON 12-48 h), whereas the timing of the
improvements varied from intermediate lead times (36-72 h) for the LGEM comparison to the
longest lead times (96-120 h) for the HWFI comparison. Although the SPC3 comparisons with
the top-flight models for intensity did lead to one SS degradation, the results for the remaining
lead times, for the most part, either corresponded to SS improvements or positive non-SS
differences. The opposite was true for the comparison with ICON, for which even the non-SS
differences were predominantly negative. Limiting the sample to cases over water only produced one
additional SS improvement for the LGEM comparison and reduced the SS degradations by one
for the DSHP and ICON comparisons. Hence, the performance of SPC3 relative to all baselines
tended to improve when the sample was limited to over water cases only.
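The pairwise statistics reported throughout this section follow a standard paired design: for each case in the homogeneous sample, the baseline's absolute intensity error minus SPC3's, averaged by lead time, with percent improvement expressed relative to the baseline's mean error. A hedged sketch of that arithmetic only (the error values are invented, and the report's actual SS determination, which attaches 95% confidence intervals to the mean difference, is not reproduced here):

```python
def mean_pairwise_difference(err_baseline, err_candidate):
    """Mean per-case difference; positive => candidate improves on baseline."""
    diffs = [b - c for b, c in zip(err_baseline, err_candidate)]
    return sum(diffs) / len(diffs)

def percent_improvement(err_baseline, err_candidate):
    """Mean pairwise difference expressed relative to the baseline's mean error."""
    mean_b = sum(err_baseline) / len(err_baseline)
    return 100.0 * mean_pairwise_difference(err_baseline, err_candidate) / mean_b

# Hypothetical absolute intensity errors (kt) for one lead time:
hwfi = [20.0, 15.0, 25.0, 10.0]
spc3 = [14.0, 12.0, 18.0, 8.0]
print(round(percent_improvement(hwfi, spc3), 1))  # 25.7
```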
The frequency of superior performance (FSP) technique, which does not take into consideration
the magnitude of the error differences, yielded results that were consistent with those for the
pairwise difference tests, except for a few minor differences for the HWFI and DSHP
comparisons. For the HWFI comparison, SPC3 outperformed HWFI for lead times starting at 60
h, whereas the mean absolute intensity errors were not statistically distinguishable until 96 h
(Fig. 2). In terms of FSP, SPC3 and DSHP were evenly matched for all lead times (Fig. 2).
Limiting the sample to cases over water only did not change the overall character of the results.
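The FSP calculation can be sketched directly from its definition: the fraction of paired cases in which each model's absolute error is smaller, with pairs whose errors differ by less than 1 kt counted as ties (the 1-kt tie threshold follows the Fig. 2 caption; the error values below are hypothetical):

```python
def fsp(err_a, err_b, tie_threshold=1.0):
    """Frequency of superior performance for models A and B.

    Pairs whose absolute errors differ by less than `tie_threshold`
    (1 kt, per this evaluation's convention) count as ties and are
    excluded from both win counts. Returns (FSP_A, FSP_B, tie fraction).
    """
    wins_a = wins_b = ties = 0
    for a, b in zip(err_a, err_b):
        if abs(a - b) < tie_threshold:
            ties += 1
        elif a < b:
            wins_a += 1
        else:
            wins_b += 1
    n = len(err_a)
    return wins_a / n, wins_b / n, ties / n

# Hypothetical absolute intensity errors (kt) at one lead time:
spc3 = [10.0, 12.5, 8.0, 20.0, 15.0]
hwfi = [14.0, 12.0, 11.0, 19.5, 22.0]
print(fsp(spc3, hwfi))  # (0.6, 0.0, 0.4)
```

Because only the sign of each difference matters, FSP disregards the magnitude of the error differences, which is why it can flag SS results at lead times where the mean pairwise differences do not, and vice versa.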
Of the four models and the fixed consensus considered in this evaluation, only HWFI had an SS
bias in the AL basin, which corresponded to over-predicting the storm’s intensity at lead times
60 to 84 h (not shown). The intensity error distributions (Fig. 3) revealed the largest intensity
errors for all five types of guidance considered in this evaluation tended to be associated with
under-predicting the intensity.
A comparison of SPC3’s intensity performance to that of the three top-flight models and the
operational fixed consensus in the AL basin (Fig. 4) indicated SPC3 was significantly more
likely to have the smallest errors, i.e., rank 1st, than would be expected based on random
forecasts for 60 to 72 h and 96 to 120 h. When SPC3 is awarded the best rank for all tied cases
(i.e., cases with the same intensity error for SPC3 and at least one other model), the proportion of
best rankings increased substantially for all lead times (shown as solid black numbers). Note that
SPC3 was also significantly less likely to have the largest errors, i.e., rank 5th, for 36 to 60 h.
The overall signature of the rankings for the water only sample was consistent with that for the
land and water sample except that SPC3 was more likely than random to rank 1st for all lead
times starting at 60 h and less likely to rank 5th for 24 to 60 h (Fig. 4).
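The rank-frequency analysis ranks SPC3's absolute error against the four baselines case by case; under random forecasts each of the five ranks would occur 20% of the time (the grey line in Fig. 4). A minimal sketch of the per-case ranking, including the convention in which SPC3 is awarded the better rank for all ties (the error values are hypothetical):

```python
def spc3_rank(errors, ties_to_spc3=True):
    """Rank of SPC3's absolute error among all five guidances (1 = best).

    `errors` maps guidance name to absolute intensity error for one case.
    With `ties_to_spc3`, SPC3 receives the better rank for any tie; this
    mirrors the alternative tallies shown as black numbers in Fig. 4.
    """
    spc3_err = errors["SPC3"]
    others = [e for name, e in errors.items() if name != "SPC3"]
    if ties_to_spc3:
        return 1 + sum(1 for e in others if e < spc3_err)
    return 1 + sum(1 for e in others if e <= spc3_err)

# One hypothetical case in which SPC3 ties HWFI:
case = {"SPC3": 12.0, "HWFI": 12.0, "LGEM": 15.0, "DSHP": 10.0, "ICON": 18.0}
print(spc3_rank(case))                      # 2 (tie with HWFI awarded to SPC3)
print(spc3_rank(case, ties_to_spc3=False))  # 3
```

Aggregating these per-case ranks by lead time and comparing the rank-1 and rank-5 frequencies against the 20% random-forecast expectation yields the significance statements quoted above.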
Eastern North Pacific Basin
The direct comparisons between the SPC3 intensity errors and those for the top-flight intensity
models and the fixed consensus for intensity also produced mixed results in the EP basin (Fig. 5).
The pairwise difference tests once again showed SS improvements for the HWFI and LGEM
comparisons, with the SS improvements occurring at longer lead times (72 to 120 h) for HWFI
and at shorter lead times (36 to 60 h) for LGEM (Table 3). Percent improvements
over HWFI ranged from 16 to 34%, whereas those for the LGEM comparison ranged from 6 to
9%. The pairwise difference tests for the DSHP and ICON comparisons produced one SS
degradation at 12 h, whereas the DSHP comparison also produced two SS improvements of 10 to
13% at 84 to 96 h. Although the SPC3 comparisons with the top-flight models for intensity did
lead to one SS degradation, the results for the remaining lead times either corresponded to SS
improvements or positive non-SS differences. Conversely, most of the non-SS differences for
the ICON comparison were negative. Limiting the sample to cases over water only did not yield
any notable differences in the results for cases over land and water in the EP basin.
Applying the FSP approach in the EP basin also produced small differences from the pairwise
difference tests with respect to the number of lead times with statistically distinguishable results.
In the context of FSP, the time periods for which SPC3 was associated with SS improvements
over HWFI and LGEM were 36 to 120 h and 24 to 72 h, respectively (Fig. 6). For FSP, SPC3
outperformed DSHP at 48 h and 72 to 120 h, with no SS degradations. SPC3 was characterized
by poorer performance with respect to ICON at 12 and 24 h. Hence, the FSP perspective
indicated a signature of improvement associated with SPC3 when compared with the top-flight
models and a signature of degradation when compared with the operational consensus. Limiting
the sample to cases over water only did not change the overall character of the FSP results (not
shown), but the signature of SS improvements over HWFI was more intermittent (36 to 48 h, 72
to 84 h, and 108 to 120 h), and the improvements over DSHP were no longer SS at the longest
lead times (108 to 120 h).
The intensity guidance considered in this evaluation did not exhibit any SS bias in the EP basin
(not shown). On the other hand, the medians of the LGEM and ICON intensity error
distributions were negative for 24 to 36 h and 24 h, respectively, and positive for DSHP and
ICON for 48 to 120 h and 84 to 120 h, respectively (Fig. 7). Regardless of whether the medians
of the intensity error distributions were distinguishable from zero or not, the largest intensity
errors for all types of intensity guidance considered in this evaluation tended to be associated
with under-predicting the intensity, which is similar to the results found in the AL basin.
A comparison of SPC3’s intensity performance in the EP basin to that of all three top-flight
models and the operational consensus for intensity (Fig. 8) indicated SPC3 was less likely to
have the largest errors, i.e., rank 5th, than would be expected based on random forecasts at all
lead times and rank 5 was statistically distinct from all other ranks at 36 to 72 h. Note that SPC3
was also less likely to have the smallest errors, i.e., rank 1st, at 12 to 24 h, but a notable number
of ties (4 to 10%) between SPC3 and the best performing baseline model(s) also occurred in the
EP basin at all lead times. The number of ties was particularly large at these shorter lead times.
Because the ties were randomly assigned between ranks 1 and 2, awarding all rank-1 ties to
SPC3 would reduce the frequency of rank 2 by approximately 2 to 5% (~50% of the ties).
Keeping this caveat in mind, SPC3 was more likely to rank either 2nd or 3rd at
most lead times and more likely to rank 4th at shorter lead times. The overall signature of the
rankings for the water only sample in the EP basin was consistent with that for the land and
water sample.
Overall Evaluation
The pairwise difference tests for the direct comparisons between SPC3 and the top-flight models
for intensity produced SS improvements of 6 to 34% for the comparisons with HWFI and
LGEM in both basins, whereas the DSHP comparison produced SS degradations of 3% at 12 h in
both basins and SS improvements of 10 to 13% at 84 to 96 h in the EP basin. The pairwise
difference tests for the direct comparison with ICON also produced SS degradations at the
shorter lead times in both basins, with the degradation signature being stronger in the AL basin.
While the SPC3 errors were statistically indistinguishable from those of the baseline models for a
number of lead times, these non-SS differences tended to be positive (corresponding to
improvements) for the top-flight model comparisons and negative for the comparison with the
operational fixed consensus. While the basic trends were the same in both basins, the number of
lead times with SS improvements tended to be larger in the EP basin and the number of lead
times with SS degradations tended to be larger in the AL basin. The FSP approach, which
differs from the pairwise difference test in that it does not take into account the magnitude of the
error differences, indicated a slightly stronger signature of improvement over the top-flight
baselines in both basins, but a slightly stronger signature of poorer performance when comparing
with ICON. The two different approaches resulted in similar conclusions, indicating an
improvement over at least two top-flight models and slightly poorer performance at shorter lead
times when compared to the ICON in both basins. Limiting the sample to cases over water only
had little impact on the overall results.
The rank frequency analysis indicated SPC3 errors were less likely than random to be larger than
those of all the operational baselines at all lead times in the EP basin and for a few lead times in
the AL basin.
In addition, SPC3 errors were more likely to rank 1st in the AL basin at lead times longer than 48
h and rank 2nd and/or 3rd at all lead times in the EP basin. Hence, the results were favorable for
selecting SPC3 for explicit intensity guidance in the AL basin and somewhat favorable (i.e.,
improves upon two or three of the operational baselines) in the EP basin.
Table 1: Summary of baselines used for evaluation of SPICE for the specified metrics.

Baselines   Intensity (land and water)   Intensity (water only)
HWFI                   ●                           ●
LGEM                   ●                           ●
DSHP                   ●                           ●
ICON                   ●                           ●
Table 2: Inventory of statistically significant (SS) pairwise differences for intensity stemming from the comparison of SPC3 and each
individual top-flight model and the operational fixed consensus for intensity in the AL basin. See 2014 Stream 1.5 methodology write-
up for description of entries.
Table 3: Inventory of statistically significant (SS) pairwise differences for intensity stemming from the comparison of SPC3 and each
individual top-flight model and the operational fixed consensus for intensity in the EP basin. See 2014 Stream 1.5 methodology write-
up for description of entries.
Figure 1: Mean absolute intensity errors (SPC3-red, baselines-black) and mean pairwise differences (blue) with 95% confidence
intervals with respect to lead time for HWFI and SPC3 (top left panel), DSHP and SPC3 (top right panel), LGEM and SPC3 (bottom
left panel), and ICON and SPC3 (bottom right panel) in the Atlantic basin.
Figure 2: Frequency of superior performance (FSP) with 95% confidence intervals for
intensity error differences stemming from the comparison of HWFI and SPC3 (top panel)
and DSHP and SPC3 (bottom panel) with respect to lead time for cases in the Atlantic
basin. Ties are defined as cases for which the difference was less than 1 kt.
Figure 3: Intensity error distributions with respect to lead time for HWFI and SPC3 (top left panel), DSHP and SPC3 (top right panel),
LGEM and SPC3 (bottom left panel) and ICON and SPC3 (bottom right panel) in the Atlantic basin.
Figure 4: Rankings with 95% confidence intervals for SPC3 compared to the three top-
flight models and fixed operational consensus for intensity guidance with respect to lead
time. Aggregations are for land and water (top panel) and over water only (bottom panel)
for Atlantic basin. The grey horizontal line highlights the 20% frequency for reference.
Black numbers indicate the frequencies of the first and fifth rankings where the candidate
model was assigned the better (lower) ranking for all ties.
Figure 5: Same as Fig. 1 except in the eastern North Pacific basin.
Figure 6: Same as Fig. 2 except from the comparison of HWFI and SPC3 (top left panel), DSHP and SPC3 (top right panel), LGEM
and SPC3 (bottom left panel), and ICON and SPC3 (bottom right panel) in the eastern North Pacific basin.
Figure 7: Same as Fig. 3 except in the eastern North Pacific basin.
Figure 8: Same as Fig. 4 except for aggregations over land and water in the eastern North
Pacific Basin.