Examining the Influence of Link Function Misspecification in Conventional
Regression Models for Developing Crash Modification Factors
By
Lingtao Wu*, Ph.D.
Texas A&M Transportation Institute
Texas A&M University System, 3135 TAMU
College Station, Texas 77843-3135
Phone: 979-845-7214, fax: 979-845-6481
Email: [email protected]
and
Dominique Lord, Ph.D.
Professor, Zachry Department of Civil Engineering
Texas A&M University, 3136 TAMU
College Station, Texas 77843-3136
Phone: 979-458-3949, fax: 979-845-6481
Email: [email protected]
February 7, 2017
*Corresponding author
Wu and Lord 1
ABSTRACT
This study further examined the use of regression models for developing crash modification
factors (CMFs), specifically focusing on the misspecification in the link function. The primary
objectives were to validate the accuracy of CMFs derived from the commonly used regression
models (i.e., generalized linear models or GLMs with additive linear link functions) when some
of the variables have nonlinear relationships and to quantify the amount of bias as a function of
the nonlinearity. Using the concept of artificial realistic data, various linear and nonlinear crash
modification functions (CM-Functions) were assumed for three variables. Crash counts were
randomly generated based on these CM-Functions. CMFs were then derived from regression
models for three different scenarios. The results were compared with the assumed true values.
The main findings are summarized as follows: (1) when some variables have nonlinear
relationships with crash risk, the CMFs for these variables derived from the commonly used
GLMs are all biased, especially around areas away from the baseline conditions (e.g., boundary
areas); (2) with the increase in nonlinearity (i.e., nonlinear relationship becomes stronger), the
bias becomes more significant; (3) the quality of CMFs for other variables having linear
relationships can be influenced when mixed with those having nonlinear relationships, but the
accuracy may still be acceptable; and (4) the misuse of the link function for one or more
variables can also lead to biased estimates for other parameters. This study highlights the
importance of the link function when using regression models for developing CMFs.
Keywords: Crash modification factor, generalized linear model, functional form,
misspecification, highway safety
1. INTRODUCTION
Before-after and cross-sectional studies are the two main approaches used to estimate crash
modification factors (CMFs) (Bonneson and Pratt, 2008). It is believed that before-after studies
can generate CMFs more accurately and are always preferred whenever available. However,
several limitations have restricted the use of before-after studies for developing high quality
CMFs, such as small sample size, regression-to-the-mean (RTM), multiple treatment effect, site
selection bias, etc. Due to limited investments and time constraints, there might be too few sites
available for conducting a sound before-after analysis. Sites with unusually high numbers of
crashes are often selected for treatments in practice, which will be affected by the RTM and site
selection biases (they are separate, but related biases) if not properly accounted for (Lord and
Kuo, 2012). Meanwhile, it is common that multiple treatments are implemented simultaneously
for reducing the number and severity of crashes, which makes it difficult to estimate the effects
of single treatments with a before-after analysis. Given these limitations, the use of
cross-sectional studies, particularly using regression models (also known as safety performance
functions or SPFs), for estimating CMFs has become popular in recent decades. Cross-sectional
analyses are nonetheless not perfect and drawbacks have been identified for estimating CMFs
(Gross et al., 2010; Lord and Mannering, 2010). Some researchers have criticized the use of
SPFs for developing CMFs because they may not properly capture the cause and effect
relationship between crashes and variables (Hauer, 2005a; Hauer, 2005b; Hauer, 2010; Lord and
Mannering, 2010; Hauer, 2014; Hauer, 2015). Considering the fact that many CMFs are being
developed using SPFs, it is necessary to investigate the accuracy of these CMFs further.
A recent attempt to evaluate the use of SPFs for developing CMFs was performed by Wu
et al. (2015). The main conclusion was that SPFs can generate reliable CMFs when the premise
of cross-sectional studies was matched or when variables having significant influences on crash
risk were not omitted in the models. An important assumption within the study was that the
variables were considered to have linear relationships (in logarithmic form) with the predicted
crash number. This was consistent with the commonly used generalized linear models (GLMs),
as the authors pointed out. This linear relationship is monotonic, and the same amount of
increase in variable(s) will always bring a fixed change in safety (either benefits or losses
depending on the CMFs). In other words, the pre-assumed relationship between variables and
crash risk is consistent with the link function specified in the SPF. Both are linear additive
functions, or so-called canonical log link functions. However, some variables may have nonlinear
and/or non-monotonic relationships rather than linear (Hauer, 2004; Gross et al., 2009). Before
the true relationships are known to safety analysts, linear functions have been commonly adopted
in the GLMs for their simplicity (Hauer, 2015). [Note that generalized nonlinear models (GNMs)
have been developed recently, and a number of studies have utilized nonlinear functions in the
GNMs, which showed better fitting results. Nevertheless, the true functions are still unknown.]
Under these conditions, the CMFs derived from SPFs may not be able to accurately capture the
true safety effects of variables, especially in boundary areas.
This study is, in essence, a continuation of recent work (i.e., Wu et al., 2015) on the
validation of CMFs derived from cross-sectional studies using SPFs, particularly focusing on the
conditions when the relationships between variables and crash risk are nonlinear, that is, when
the link function is misspecified in the regression models. It should be noted that there are a
number of potential problems with regression models in addition to the link function, such as
unobserved heterogeneity, the omitted variable problem, overfitting of models, and the
independence assumption, among others (Gross and Jovanis, 2007a; Gross and Jovanis, 2007b;
Lord and Mannering, 2010; Mannering et al., 2016; Park et al., 2016; Park and Abdel-Aty, 2016;
Wu and Lord, 2016). Though multiple limitations are likely to exist when a regression model is
estimated, it is very difficult to examine the influence of a single problem on the modeling
output and CMF development when these limitations are combined. The limitations related to the
omitted-variable bias and variable interactions have been documented separately and the results
can be found in Wu et al. (2015) and Wu and Lord (2016). This paper focuses on the link
functional form only. Thus, the primary objectives of this paper are to examine the accuracy of
CMFs derived from the commonly used GLMs with linear functional forms considering variables
having nonlinear effects on crash risk and to quantify the bias. Furthermore, the secondary
objective seeks to identify how the nonlinear relationships affect the coefficients for other
variables that have a linear relationship, which has never been evaluated before. It is worth
mentioning that there has been a debate about whether or not cross-sectional studies are able to
derive reliable cause-effect results, not only in traffic safety studies, but also in other fields
where this kind of statistical method is used, such as psychology, epidemiology, etc. (Elvik,
2011; Hauer, 2013; Hauer, 2015). Hauer (2013) pointed out that even perfect SPFs may produce
untrustworthy CMFs mainly due to the violation of the “ceteris paribus” assumption¹. The
objective of this study is not to prove whether or not the cross-sectional analyses are able to
reveal the cause and effect of traffic collisions. On the contrary, it is to document the
potential problems associated with the commonly and “blindly” used GLMs for developing CMFs.
Specifically, this study focused on the misuse of the link functional form, assuming that other
conditions have been satisfied.
¹ In the circumstance of developing a CMF for a variable with an SPF, this assumption means all
other variables in the SPF and those not included in the SPF are kept constant. For example, if a
CMF for lane width is to be developed from an SPF, it is assumed that all the characteristics
associated with safety (e.g., AADT, roadway alignment, parking, etc.) are constant except the
width of lane.
2. BACKGROUND
Considerable effort has been devoted in the past decades to the development of crash prediction
models. A considerable portion of this work focused on the statistical distribution models, e.g.,
Poisson, Poisson-gamma (negative binomial or NB), Zero-Inflated Poisson,
Conway-Maxwell-Poisson, etc. (Lord and Mannering, 2010). Very few studies have examined the
influence of the link function on the predicted number of crashes. Hauer (2015) pointed out that
the functional forms used in regression models played vital roles in the analyses. Without the
correct functions, the cause and effect result derived from the modeling result cannot be
trusted. However, so far, there is no theory for identifying the appropriate function. In
practice, most safety analysts adopted the linear additive form, probably for its simplicity. In
these models, there is only one coefficient for one single variable. Thus, the CMF derived for
that variable is in log-linear form. One unit change in the variable always brings the same
proportional change in the expected crashes. However, this kind of CMF may not properly account
for the safety effect of the treatment because some variables may have nonlinear influences on
crashes (Hauer, 2004; Hauer et al., 2004; Lee et al., 2015). El-Basyouny and Sayed (2010)
compared the generalized link function against the commonly used linear additive function in
modeling crash data. The predicted number of crashes showed significant differences under
particular situations. Meanwhile, attempts have been made to explore the nonlinear effects of
some variables on safety.
Hauer et al. (2004) developed a statistical model to predict non-intersection crash
frequency on urban four-lane undivided roadways. Several variables were considered in the
analysis. Based on the estimated parameters, some variables were found to have nonlinear effects
on safety. For example, the degree of curve was found to have a “U-shape” effect on
on-the-road crashes. If true, this indicates that some flat curves may be safer than a tangent,
whereas sharp curves are associated with higher crash risk.
Xie and Zhang (2008) applied generalized additive models (GAMs) in traffic crash
modeling. Compared to GLMs, GAMs use nonparametric smooth functions instead of the
parametric terms in GLMs. The analysis result indicated that GAMs performed better than
GLMs in terms of goodness-of-fit (GOF) and prediction performance. This method was later
utilized to develop CMFs for rural frontage segments in Texas (Li et al., 2011). Analyses showed
that nonlinear relationships existed between crash risk and changes in lane and shoulder widths
on frontage roads. The result was slightly different from that of a previous GLM-based study
(Lord and Bonneson, 2007).
In order to capture the nonlinear relationships between variables and crashes, some neural
network models have also been introduced into safety analyses. Xie et al. (2007) proposed a
Bayesian neural network (BNN) model for predicting motor vehicle crashes. BNN models were
estimated using the Texas frontage road data, the same data used in several previous studies
(Lord and Bonneson, 2007; Li et al., 2011). The authors conducted sensitivity analyses of the
trained BNN model for two sites. It was found that right shoulder width showed quadratic
functions with predicted crash counts at two sites, and lane width showed an “inverse U-shape”
relation with crash counts at one site. Li et al. (2008) later conducted a continuation of this
work. The researchers applied support vector machine (SVM) models to predict crashes on frontage
roads in Texas, aiming to capture nonlinear relationships between explanatory and dependent
variables. The results were quite similar to those obtained using the BNN method.
Recently, Lao et al. (2014) proposed a generalized nonlinear model (GNM)-based
approach to better describe non-monotonic relationships between variables and crash rates.
Compared to GLMs, the major improvement of GNMs was the use of piecewise functions to capture
the pattern between dependent and independent variables. This made it more flexible to extract
complex relationships between the two. Rear-end crashes were modeled using GNM and GLM
methods. Comparison showed GNMs outperformed GLMs. In addition, some factors were found
to be significant in GNMs, but not in GLMs. Lee et al. (2015) later assessed the safety effects of
changing lane width using GNMs. Various nonlinear link functions were used for the effects on
crash rates of lane widths, and nonlinear CM-Functions were estimated for changing lane width.
It was noted that the CM-Function for lane width showed an “inverse U-shape” curve. It
combined two quadratic functions, and the 12-ft lane was found to be associated with the
highest crash rates. This result contradicted some past studies, which concluded widening lanes
could consistently reduce crash frequency (AASHTO, 2010). Moreover, Park and Abdel-Aty
(2015a) assessed the safety effects of multiple roadside treatments (i.e., poles, trees, etc.) using
GLM, GNM, and multivariate adaptive regression splines (MARS) models. The MARS model
could capture both nonlinear relationships and interaction impacts between variables. Results
generally showed that GNMs provided slightly better fits than the GLMs, and the MARS model
outperformed the other two. This indicated the roadside treatments had nonlinear effects on crash
risk. Park and Abdel-Aty (2016) further estimated the safety effectiveness of multiple
cross-sectional features on urban arterials using both GLM and GNM. Analysis results revealed
that the GNM outperformed the GLM. Studies on the nonlinear relationships between variables and
crash risk have started to emerge recently. For example, Zeng et al. (2016a) modeled crash
frequency using an optimized neural network, and the researchers found that AADT, speed limit,
the presence of a bus stop and rainfall have nonlinear effects on crash risk. Zeng et al. (2016b)
developed a GNM-based mixed multinomial logit approach for crash data analysis, and
nonlinear relationships between factors and crash density as well as severity were examined.
Park et al. (2016) developed CMFs using a finite mixture modeling approach. Median width was
found to have a “U-shape” relationship with crash frequency, and widening a narrower median
is generally more effective than widening a wider median. Banihashemi (2015; 2016) and Wu
et al. (2017) developed CM-Functions for horizontal curves on different types of roadways using
cross-sectional methods. All three CM-Functions were in a nonlinear form. More
specifically, the study by Wu et al. (2017) indicated that the nonlinear CM-Function was more
accurate than a linear CM-Function in terms of predicting crash numbers at horizontal
curves.
In almost all of these studies, analyses using nonlinear methods showed better results
than the commonly used GLM approach. This suggests that some variables indeed have nonlinear
and/or non-monotonic effects on crash frequency, and the CMFs derived using conventional GLMs
may not be able to adequately capture this kind of relationship.
3. METHODOLOGY
3.1 Quantifying Nonlinearity
This section describes how the accuracy of CMFs derived from SPFs was investigated when
some variables had nonlinear relationships. Intuitively, if the nonlinear relationship is weak (the
CM-Function curve is quite flat), the accuracy of CMFs derived from SPFs should be similar to
those in the previous study (Wu et al., 2015). On the contrary, if the nonlinear relationship is
strong (the curve is sharp), the accuracy of CMFs may be affected. A measurement is necessary
to describe how flat or sharp the curve is. This section introduces the concept of quantifying
nonlinearity.
First, the definition of the closest line to a curve. For a given integrable curve $y = f(x)$
over $[m, n]$, the closest line to this curve is defined as a straight line $y = kx + c$ that
minimizes the area between the two. This definition is illustrated in Figure 1. The dashed curve
represents the given function $y = f(x)$, and the solid line represents the closest line to this
curve, $y = kx + c$. This line minimizes the area between the two (the shaded area in Figure 1).
Given the range, in general, the larger the area is, the stronger the nonlinearity the curve tends
to have. In particular, if the given function is linear, the closest line is the function itself,
and the area is equal to zero.
Second, the definition of average vertical distance between a curve and its closest line.
Although the area between a curve and its closest line can reflect the nonlinearity of the curve,
the area still depends on the range. A wider range is more likely to yield a larger area, and
variables affecting traffic crashes usually differ in their possible values in practice. A
standardized measurement is therefore necessary to quantify the nonlinearity. The average
vertical distance (AVD) between a curve and its closest line is defined as the area between the
two divided by the range. So, in Figure 1, the AVD is calculated by dividing the shaded area by
$n - m$. This way, the AVD itself can be used to quantify the nonlinearity of a curve regardless
of its range. The larger this distance is, the stronger the nonlinearity that curve has. If the
given function is linear, the AVD is zero.
Figure 1 Example illustrating the closest line to a curve.
The details for calculating the coefficients of the line (i.e., $k$ and $c$) and the AVD are
shown below. The objective is to minimize the area, shown in Equation 1a.
$$Area = \int_m^n \left| f(x) - (kx + c) \right| \, dx \qquad (1a)$$
In practice, the line is obtained from the least-squares counterpart of this objective,
$$Z = \int_m^n \left[ f(x) - (kx + c) \right]^2 \, dx \qquad (1b)$$
$k$ and $c$ can then be derived in closed form, as shown below.
$$k = \frac{(n - m)\int_m^n x f(x)\,dx - \int_m^n f(x)\,dx \int_m^n x\,dx}{(n - m)\int_m^n x^2\,dx - \left[\int_m^n x\,dx\right]^2} \qquad (2a)$$
$$c = \frac{\int_m^n f(x)\,dx \int_m^n x^2\,dx - \int_m^n x f(x)\,dx \int_m^n x\,dx}{(n - m)\int_m^n x^2\,dx - \left[\int_m^n x\,dx\right]^2} \qquad (2b)$$
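For readers who want the intermediate step, the following is a sketch of how Equations 2a and 2b follow from the least-squares objective in Equation 1b, assuming $f$ is integrable on $[m, n]$ and $m < n$. Setting the two partial derivatives of $Z$ to zero gives a pair of normal equations:

```latex
\begin{aligned}
\frac{\partial Z}{\partial k} = 0 \;&\Rightarrow\;
  k \int_m^n x^2\,dx + c \int_m^n x\,dx = \int_m^n x f(x)\,dx, \\
\frac{\partial Z}{\partial c} = 0 \;&\Rightarrow\;
  k \int_m^n x\,dx + c\,(n - m) = \int_m^n f(x)\,dx .
\end{aligned}
```

Solving this 2-by-2 linear system for $k$ and $c$ (for example, by Cramer's rule) yields Equations 2a and 2b. Their common denominator, $(n - m)\int_m^n x^2\,dx - [\int_m^n x\,dx]^2$, simplifies to $(n - m)^4 / 12$, so it is positive whenever $m < n$.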
The area can be calculated by substituting $k$ and $c$ into Equation 1a, and the AVD is
then calculated by dividing the area by $n - m$. The AVD will be used to measure the nonlinearity
of CM-Functions.
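The calculation described above can be sketched numerically. The Python snippet below is only an illustration, not the authors' implementation: it approximates the integrals in Equations 2a and 2b with a midpoint rule, then evaluates the area in Equation 1a and the AVD. The function names, the `steps` parameter, and the example curve are assumptions made for this sketch. The example uses the quadratic listed for sub-scenario NL I-1 in Table 2, with coefficient signs chosen so that the CMF is close to 1 at the 12-ft baseline.

```python
def _midpoints(m, n, steps):
    """Midpoints of `steps` equal sub-intervals of [m, n], plus the sub-interval width."""
    h = (n - m) / steps
    return [m + (i + 0.5) * h for i in range(steps)], h


def closest_line(f, m, n, steps=20000):
    """Slope k and intercept c of the least-squares line (Equations 2a and 2b)."""
    xs, h = _midpoints(m, n, steps)
    int_f = sum(f(x) for x in xs) * h        # integral of f(x), midpoint rule
    int_xf = sum(x * f(x) for x in xs) * h   # integral of x * f(x), midpoint rule
    int_x = (n ** 2 - m ** 2) / 2.0          # integral of x, exact
    int_x2 = (n ** 3 - m ** 3) / 3.0         # integral of x^2, exact
    denom = (n - m) * int_x2 - int_x ** 2
    k = ((n - m) * int_xf - int_f * int_x) / denom
    c = (int_f * int_x2 - int_xf * int_x) / denom
    return k, c


def average_vertical_distance(f, m, n, steps=20000):
    """Return (k, c, area, AVD) for the curve f over [m, n]."""
    k, c = closest_line(f, m, n, steps)
    xs, h = _midpoints(m, n, steps)
    area = sum(abs(f(x) - (k * x + c)) for x in xs) * h  # Equation 1a
    return k, c, area, area / (n - m)


def ln_cmf_nl_i1(lw):
    """Assumed ln(CMF) for NL I-1; the signs of the coefficients are an assumption."""
    return 0.1 * lw ** 2 - 2.22 * lw + 12.28


k, c, area, avd = average_vertical_distance(ln_cmf_nl_i1, 8.0, 13.0)
print(f"k={k:.3f}, c={c:.3f}, area={area:.3f}, AVD={avd:.3f}")
```

Under these assumptions the area and AVD should land close to the 0.802 and 0.160 reported for NL I-1 in Table 2. The slope comes out slightly smaller in magnitude than the tabulated value, presumably because the published coefficients are rounded.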
3.2 Simulation Protocol
To assess the influence of link function misspecification on the accuracy of CMFs, this study
used artificial realistic data proposed by Hauer (2014), since the true relationship between
variables and crash risk is unknown with real crash data. The simulation experiment was
basically the same as that used in the first attempt at this topic (Wu et al., 2015), but necessary
changes were made to accommodate the particular requirements of this study. The simulation
procedure is briefly described below.
Step 1: Assign Initial CM-Functions
Assume CM-Functions for highway geometric features of interest. This study assumed
multiple kinds of relationships (i.e., linear and nonlinear) between variables and crash risk. For
example, we can assume the CM-Function for lane width as Equation 3.
$$CMF = f(LW) = \exp(a \times LW^2 + b \times LW + c) \qquad (3)$$
Where,
CMF = the specific CMF for a segment with LW-ft lane;
LW = the lane width of the segment, in ft; and,
a, b, c = coefficients of assumed CM-Functions for lane width.
Note that there is a large (even infinite) number of nonlinear forms that may represent the
relationships between variables and crash frequency. This study utilized the quadratic form for
two reasons: (1) this form has been developed in a previous study based on real crash data (Lee
et al., 2015); (2) the quadratic form is very flexible. The nonlinearity can be easily adjusted by
changing the coefficients of the quadratic function. In addition, it is able to capture both
monotonic and non-monotonic relationships (over some particular range). Other researchers may
use other forms to reflect the underlying relationships between crash risk and variables of
interest.
Step 2: Calculate Mean Values
Calculate the true crash means for each segment using basic SPFs and assumed CMFs.
The equations used to calculate the true crash mean are called theoretical functions, and they
are given in Section 5.
Step 3: Generate Discrete Counts
Generate random counts given that the mean for a segment was gamma distributed with
dispersion parameter $\alpha$ (the inverse dispersion parameter, $\phi = 1/\alpha$) and mean equal to 1.
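One common way to realize Steps 2 and 3 is a Poisson-gamma mixture: each site's true mean is multiplied by a gamma-distributed error term with mean 1, and a Poisson count is drawn from the product. The sketch below assumes that reading of the step and assumes NumPy is available; the function name `generate_counts` and the example values are hypothetical.

```python
import numpy as np


def generate_counts(true_means, phi, rng):
    """Draw Poisson-gamma (negative binomial) counts.

    Each true mean is multiplied by a gamma error term with shape phi and
    scale 1/phi, i.e. mean 1 and variance 1/phi, before the Poisson draw.
    """
    true_means = np.asarray(true_means, dtype=float)
    error = rng.gamma(shape=phi, scale=1.0 / phi, size=true_means.shape)
    return rng.poisson(true_means * error)


rng = np.random.default_rng(2017)
# Hypothetical example: 200,000 sites that all share a true mean of 1.5 crashes.
counts = generate_counts(np.full(200_000, 1.5), phi=2.0, rng=rng)
print(counts.mean(), counts.var())
```

With this parameterization the counts have mean equal to the true mean and variance equal to the mean plus the squared mean divided by the inverse dispersion parameter, which is the usual NB variance function.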
Step 4: Estimate CMFs from the Simulated Crash Data Using NB Regression Models
For the purpose of this study, the most commonly used GLM and functional form (i.e.,
linear form) were used, as shown in Equation 4 (Lord and Bonneson, 2007). This equation is
called the considered function. Note that a different parameter for describing the mean of the
site, $\mu_i$, was used for estimating the models (compared to the one used for the simulation).
$$E(\mu_i) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp\left(\sum_{j=2}^{n} \beta_j x_j\right) \qquad (4)$$
Where,
$E(\mu_i)$ = the estimated crash mean during a period for segment i;
$x_j$ = a series of variables, such as the lane width of segment i; and,
$\beta_0, \beta_1, \ldots, \beta_n$ = coefficients to be estimated.
For the GOF of the models, the following three measures were used: (1) Akaike
information criterion (AIC), (2) Mean absolute deviance (MAD), and (3) Mean-squared
predictive error (MSPE). More information about MAD and MSPE is documented in Lord et
al. (2008).
Once the model was fitted and coefficients were estimated using the simulated crash data,
the CM-Function for variable j was then derived as (Lord and Bonneson, 2007; Gross et al.,
2010):
$$CMF_{x,j} = \exp[\beta_j \times (x - x_{0,j})] \qquad (5)$$
Where,
$\beta_j$ = estimated coefficient for variable j;
$x$ = value of variable j, such as lane width, curve density;
$x_{0,j}$ = base condition defined for variable j, usually 12 ft for lane width; and,
$CMF_{x,j}$ = CMF specific to variable j with value of x.
This also indicated the CMF derived from the SPF for variable j was $CMF_j = \exp(\beta_j)$,
meaning the expected crash frequency would be multiplied or divided by $CMF_j$ if the variable j
increased or decreased by one unit.
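As a small worked illustration of Equation 5, the sketch below converts a fitted coefficient into a CMF relative to the base condition. The function name and the coefficient value are hypothetical and chosen only for illustration; they are not results from this study.

```python
import math


def cmf_from_coefficient(beta_j, x, x_base):
    """CMF for variable j at value x relative to the base condition (Equation 5)."""
    return math.exp(beta_j * (x - x_base))


beta_lw = -0.128  # hypothetical lane-width coefficient from a fitted SPF
cmf_10ft = cmf_from_coefficient(beta_lw, x=10.0, x_base=12.0)  # 10-ft lane vs. 12-ft base
cmf_unit = cmf_from_coefficient(beta_lw, x=13.0, x_base=12.0)  # one-unit increase, exp(beta)
print(round(cmf_10ft, 3), round(cmf_unit, 3))
```

Because the exponent is linear in the variable, the log of the CMF changes by the same amount for every unit change, which is exactly the log-linear restriction examined in this paper.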
Steps 2 to 4 were repeated 100 times, and the mean and the standard deviation of the
estimated CMF values were calculated for each variable.
Step 5: Evaluate the CMF Derived from the NB Model
Two indexes, estimation bias and error percentage, were used to evaluate the CMF
derived from SPFs. They are shown in Equations 6 and 7. The smaller the error percentage, the
more accurate the CMF derived from SPFs.
$$\Delta_j = CMF_{j\_Assumed} - CMF_{j\_SPF} \qquad (6)$$
$$e_j = \frac{\Delta_j}{CMF_{j\_Assumed}} \times 100 \qquad (7)$$
Where,
$\Delta_j$ = estimation bias of CMF for variable j with a specific value;
$e_j$ = error percentage of CMF for variable j with a specific value, (%);
$CMF_{j\_Assumed}$ = assumed specific CMF value for variable j; and
$CMF_{j\_SPF}$ = specific CMF derived from the SPF for variable j.
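The two indexes can be computed with a few lines of code. The sketch below assumes the bias is the signed difference between the assumed CMF and the SPF-derived CMF (the sign convention is an assumption here), and the input values are hypothetical.

```python
def estimation_bias(cmf_assumed, cmf_spf):
    """Estimation bias (Equation 6); signed difference, assumed minus SPF-derived."""
    return cmf_assumed - cmf_spf


def error_percentage(cmf_assumed, cmf_spf):
    """Error percentage (Equation 7), in percent of the assumed CMF."""
    return estimation_bias(cmf_assumed, cmf_spf) / cmf_assumed * 100.0


# Hypothetical values for illustration only.
bias = estimation_bias(1.20, 1.08)
error = error_percentage(1.20, 1.08)
print(round(bias, 2), round(error, 1))
```

A positive error under this convention means the SPF-derived CMF understates the assumed CMF at that value of the variable.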
Please note the meaning of the term “bias” used above to quantify the quality of
CMFs. In mathematics and statistics, bias is defined as a systematic (built-in) error which makes
all values or estimates wrong in the same direction and by a certain amount (Pierce, 2014).
Specifically, bias in this paper means the difference between the true CMF for a variable and that
estimated from regression models. It can also be defined as misspecification error (as some
CMFs are misestimated in the models). However, to simplify the description, the issue of
misspecification is referred to as “bias” in the rest of this paper.
3.3 Scenarios
Three scenarios were analyzed in this study, as described below. To distinguish them from
previous work, the scenarios were named “Nonlinear number”.
Scenario Nonlinear I: Consider one variable only, nonlinear form. Only lane width was
considered and assumed to have nonlinear effects on crashes. The main objective was
to examine the bias of the CMF for a variable with different levels of nonlinearity.
Scenario Nonlinear II: Consider three variables, only one in nonlinear form. Three variables,
lane width, curve density and pavement friction, were considered in this scenario.
Curve density and pavement friction were assumed to have linear effects on safety.
The CM-Functions for lane width were assumed to be in nonlinear forms. The main
objective was to examine the influence of nonlinear variables on the accuracy of
CMFs for linear variables.
Scenario Nonlinear III: Consider three variables, two in nonlinear form. This scenario was
similar to Scenario Nonlinear II, but both lane width and curve density were
assumed to have nonlinear relationships (and are considered independent). The CMF
for pavement friction was in linear form. The main objective was to examine the
influence of nonlinear variables on the accuracy of CMFs for both linear and
nonlinear variables.
For all three scenarios, the assumed nonlinear relationships varied from weak to
strong. Thus, each scenario contained a number of sub-scenarios. In addition, the inverse
dispersion parameter, ϕ, was set to 0.5, 1.0 and 2.0 in each sub-scenario to reflect different
traffic characteristics.
As discussed above, a number of different issues can affect the development of CMFs
from SPFs. However, it will be difficult to properly understand and quantify the bias if multiple
limitations are analyzed simultaneously. For the purpose of this study, the main assumptions
used were: (1) the variables were considered to be independent of each other (i.e., there is no
correlation); and (2) there are no interactions on safety between variables.
4. DATA DESCRIPTION
This study utilized the same roadway segments (i.e., rural highways in Texas) as in Wu et al.
(2015). The dataset contained segment length, Annual Average Daily Traffic (AADT), lane width,
curve density (i.e., curves/mile) and pavement friction. The segment length and AADT were based
on actual values from the Texas data, while the other three are hypothetical variables created
for this study. They were generated using the same method as those in Wu et al. (2015), i.e.,
independent discrete or continuous uniform distributions. For lane width, the parameters were 8
and 13. For the curve density, they were 0 and 16. And for pavement friction, they were 16 and 48.
It is worth mentioning, however, that the variables may not be independent and uniformly
distributed in practice. On the contrary, some of them are usually correlated. For example, when
constructing two highways, one with higher demand (i.e., AADT) and the other with lower, it is
common that the former will be designed to a higher standard, e.g., wider lanes and shoulders,
etc. Thus, the variables AADT and lane width are correlated. In addition, in highway design
manuals (AASHTO, 2004), lane width is recommended to be 12 ft for most highways. So 12 ft may be
prevalent among lanes, and lane width is not discrete uniformly distributed in practice. This
might affect the regression result and hence the CMFs for variables. For comparative purposes,
however, this study utilized the same dataset as the previous one. As a result, the issue of
variable correlation does not have to be considered in this study (Lee et al., 2015).
In total, there were 1,492 segments. The summary statistics of these variables are shown
in Table 1. In the simulated data, the number of sites within each category of lane width (8 to 13
ft) is approximately 249.
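The generation of the three hypothetical variables can be sketched as follows. This is a minimal illustration assuming NumPy, assuming lane width is the discrete uniform variable (integer widths from 8 to 13 ft), and assuming curve density and pavement friction are continuous uniform; the function name is hypothetical, and the segment lengths and AADT values from the Texas data are not reproduced here.

```python
import numpy as np


def generate_hypothetical_variables(n_sites, rng):
    """Independent uniform draws for lane width, curve density and pavement friction."""
    # Discrete uniform on the integers 8, 9, ..., 13 (endpoint=True includes 13).
    lane_width = rng.integers(8, 13, size=n_sites, endpoint=True).astype(float)
    # Continuous uniform ranges; treating these two as continuous is an assumption.
    curve_density = rng.uniform(0.0, 16.0, size=n_sites)
    pavement_friction = rng.uniform(16.0, 48.0, size=n_sites)
    return lane_width, curve_density, pavement_friction


rng = np.random.default_rng(0)
lane_width, curve_density, pavement_friction = generate_hypothetical_variables(1492, rng)
widths, sites_per_width = np.unique(lane_width, return_counts=True)
print(dict(zip(widths.tolist(), sites_per_width.tolist())))
```

With 1,492 sites spread over six lane-width categories, each category should hold roughly 249 sites on average, in line with the figure quoted above, although any single random draw will deviate from that.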
Table 1 Summary Statistics of Highway Segments
Variable           Sample Size   Min.    Max.     Mean (SD c)
Length (mile)      1,492         0.1     6.3      0.55 (0.67)
AADT               1,492         502     24,800   6,643.9 (3,996.4)
Lane Width (ft)    1,492         8.0     13.0     10.47 (1.74)
CD a (per mile)    1,492         0.02    16.0     8.1 (4.66)
PF b               1,492         16.0    47.9     31.9 (9.08)
a: CD = Curve Density; b: PF = Pavement Friction; c: SD = Standard Deviation.
Similar to the previous study (i.e., Wu et al., 2015), it is important to point out that this
study selected three geometric features and that the CMFs were mainly assumed based on their
practical values (i.e., from the HSM, CMF Clearinghouse, etc.) to reflect as closely as possible
the characteristics related to variables that can influence crash risk. However, it does not have
to be so. With the simulation protocol, it would be possible for other researchers to use
variables and ranges based on the characteristics of roadway entities for which they have
detailed information.
5. RESULTS
5.1 Scenario Nonlinear I: Consider lane width only
In this scenario, three nonlinear CM-Functions were assumed for lane width, giving three
sub-scenarios: Nonlinear I-1, I-2 and I-3. The first two CM-Functions for lane width
were quadratic functions (in logarithmic form), shown in Equations 8 and 9.
$$\ln(CMF) = 0.1 \times LW^2 - 2.22 \times LW + 12.28 \qquad (8)$$
$$\ln(CMF) = 0.2 \times LW^2 - 4.22 \times LW + 21.88 \qquad (9)$$
The third one was a combination of two piecewise quadratic functions. This nonlinear
function, shown in Equation 10, was developed by Lee et al. (2015) based on real crash data.
Note that, in the study by Lee et al. (2015), lanes narrower than 9 ft were considered to have
the same CMF as a 9-ft lane. To keep the analyses consistent and simple, this study assumed that
an 8-ft lane had a different CMF from a 9-ft lane, and it was directly calculated using Equation
10.
2
2
0.11 ( 12) 0.30 12( )
0.08 ( 12) 0.30 12
LW LWln CMF
LW LW
(10) 357
The assumed CM-Functions and their characteristics (closest line, area and AVD) for the
three sub-scenarios are summarized in Table 2. It can be seen that the AVD of I-2 was higher
than those of I-1 and I-3, and that the latter two were close to each other. This made the assumed
CM-Function in I-2 relatively strong in nonlinearity, and the other two relatively weak.
Table 2 Assumed CM-Functions for Lane Width (Scenario Nonlinear I)
# *      ln(CMF) a                            Line b                 Area c   AVD     Level d
NL I-1   $0.1\,LW^2 - 2.22\,LW + 12.28$       $-0.123\,LW + 1.46$    0.802    0.160   Weak
NL I-2   $0.2\,LW^2 - 4.22\,LW + 21.88$       $-0.023\,LW + 0.24$    1.603    0.321   Strong
NL I-3   piecewise quadratic (Equation 10)    $0.339\,LW - 4.05$     0.886    0.177   Weak
Note: * # - sub-scenario number, NL - Nonlinear; a: LW - lane width, in ft; b: Line - the closest
line to the curve; c: Area - the area between the curve and its closest line; d: Level - the relative
nonlinear level.
The theoretical function of the generated crash counts in these three sub-scenarios is shown in
Equation 11. The specific CMF for lane width, $CMF_{LW,i}$, was calculated using Equations 8 to 10.
$$N_{true,i} = N_{spf,i} \times CMF_{LW,i} = 2.67 \times 10^{-4} \times L_i \times AADT_i \times CMF_{LW,i} \tag{11}$$
The functional form considered in the SPFs is shown in Equation 12. Because this form is log-linear
in lane width while the assumed CM-Functions are quadratic or piecewise quadratic, the link function
is misspecified in the considered regression model.
$$E(\Lambda) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp(\beta_2 \times LW_i) \tag{12}$$
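The generate-then-fit procedure behind Equations 11 and 12 can be sketched as follows. This is not the authors' implementation: the covariate ranges, the seed and the function names are hypothetical, and the model is fitted by Poisson iteratively reweighted least squares rather than by an NB GLM (both target the same log-linear mean function, so the sketch only illustrates how a single CMF emerges from a quadratic truth).

```python
import numpy as np

def simulate_counts(n_sites=1492, phi=2.0, seed=1):
    """Generate NB crash counts from Equation 11 with the I-1 CM-Function."""
    rng = np.random.default_rng(seed)
    length = rng.uniform(0.1, 2.0, n_sites)           # segment length in mi (hypothetical range)
    aadt = rng.uniform(500.0, 15000.0, n_sites)       # vehicles per day (hypothetical range)
    lw = rng.integers(8, 14, n_sites).astype(float)   # lane width, 8 to 13 ft
    ln_cmf = 0.1 * lw**2 - 2.22 * lw + 12.28          # Equation 8
    mu = 2.67e-4 * length * aadt * np.exp(ln_cmf)     # Equation 11
    # Gamma-Poisson mixture: NB with mean mu and inverse dispersion parameter phi
    counts = rng.poisson(mu * rng.gamma(phi, 1.0 / phi, n_sites))
    return length, aadt, lw, counts

def fit_loglinear(length, aadt, lw, counts, n_iter=50):
    """Fit Equation 12 by Poisson IRLS: ln E = ln L + b0 + b1 ln AADT + b2 LW."""
    X = np.column_stack([np.ones_like(lw), np.log(aadt), lw])
    offset = np.log(length)
    # Starting values from ordinary least squares on the log scale
    beta = np.linalg.lstsq(X, np.log(counts + 0.5) - offset, rcond=None)[0]
    for _ in range(n_iter):
        eta = offset + X @ beta
        mu = np.exp(eta)
        z = (eta - offset) + (counts - mu) / mu        # working response
        XtW = X.T * mu                                 # Poisson working weights equal mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

length, aadt, lw, counts = simulate_counts()
beta = fit_loglinear(length, aadt, lw, counts)
cmf_lw = float(np.exp(beta[2]))   # one CMF per additional foot of lane width
print(round(cmf_lw, 3))
```

Whatever the exact value, the fitted model returns one constant CMF per foot, so it cannot reproduce a true CMF that first decreases and then increases with lane width.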
Table 3 presents the CMFs derived from SPFs as well as other results (i.e., ϕ and GOF measures).
First, the MAD and MSPE of the nonlinear forms were significantly higher than those of the linear
ones [i.e., Scenario I in the previous study (Wu et al., 2015)]. This indicated that the CMFs in this
scenario might have higher bias. Second, the MAD and MSPE increased with the nonlinear level. In
other words, when the relationship between the variable and crash risk became strongly nonlinear,
the conventional GLMs were likely to produce biased CMFs. Finally, under nonlinear relationships,
the inverse dispersion parameters estimated from SPFs were biased (see the column "ϕ" in Table 3).
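For readers who want to compute the two GOF measures themselves, a minimal sketch follows. It assumes the definitions commonly used in the crash-modeling literature (mean absolute deviation and mean squared prediction error between predicted and observed counts); the paper's own definitions are given elsewhere, and the sample values below are made up.

```python
import numpy as np

def mad(observed, predicted):
    """Mean absolute deviation between predicted and observed crash counts."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - observed)))

def mspe(observed, predicted):
    """Mean squared prediction error between predicted and observed crash counts."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((predicted - observed) ** 2))

obs = [0, 2, 1, 5]            # hypothetical observed counts
pred = [0.5, 1.5, 1.0, 3.0]   # hypothetical model predictions
print(mad(obs, pred), mspe(obs, pred))
```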
Table 3 Results of Scenario Nonlinear I
| # * | CMF (SD) a | ϕ b | AIC d | MAD e | MSPE f |
|---|---|---|---|---|---|
| ϕ c = 0.5 | | | | | |
| NL I-1 | 0.88 (0.01) | 0.57 | 11413.39 | 0.31 | 0.39 |
| NL I-2 | 0.98 (0.02) | 0.81 | 11522.30 | 0.64 | 1.57 |
| NL I-3 | 1.33 (0.02) | 0.56 | 7770.30 | 0.15 | 0.11 |
| ϕ c = 1.0 | | | | | |
| NL I-1 | 0.87 (0.02) | 1.08 | 11492.77 | 0.31 | 0.39 |
| NL I-2 | 0.98 (0.02) | 1.34 | 11509.57 | 0.64 | 1.58 |
| NL I-3 | 1.34 (0.03) | 1.08 | 7803.04 | 0.15 | 0.13 |
| ϕ c = 2.0 | | | | | |
| NL I-1 | 0.88 (0.03) | 2.11 | 11233.07 | 0.32 | 0.40 |
| NL I-2 | 0.98 (0.03) | 2.41 | 11138.83 | 0.64 | 1.60 |
| NL I-3 | 1.35 (0.03) | 2.11 | 7690.76 | 0.15 | 0.15 |
Note: * # - sub-scenario number, NL - Nonlinear; a - mean of CMFs from 100 experiments, SD is the
standard deviation of the 100 CMFs; b - the inverse dispersion parameter derived from SPFs; c - the
theoretical inverse dispersion parameter in each sub-scenario; d, e, f - each is the mean value of
the corresponding GOF measure over the 100 results.
To verify this, the curves of the assumed CM-Functions and the CMFs derived from SPFs are
illustrated in Figure 2. Due to space limitations, Figure 2 only presents the curves for an inverse
dispersion parameter equal to 0.5. In addition, the specific CMFs for lane widths of 8, 9, 10, 11
and 13 ft are presented in Table 4 for explicit comparison. The CMFs for a 12-ft lane are excluded
from Table 4 because 12 ft is the base condition for lane width and the CMFs are equal to 1.0 in
both the assumed and the derived CM-Functions.
Figure 2(a) shows the CM-Functions in sub-scenario I-1 (weak nonlinearity). The assumed true CMF
for lane width first decreased and then increased. However, the CMF derived from the SPF was 0.88,
meaning that the expected number of crashes would be reduced by a constant 12 percent for each
additional foot of lane width. When the lane width was less than about 9 ft or more than 12 ft, the
CMF was underestimated; between about 9 and 12 ft, it was overestimated. The bias was more obvious
around the boundary areas.
Figure 2(b) presents the CM-Functions in sub-scenario I-2 (strong nonlinearity). The overall
result was similar to that of I-1, but the bias around the boundary areas in this sub-scenario was
much higher than that in I-1. More interestingly, the CMF derived from SPFs in this sub-scenario was
0.98, very close to 1.0, indicating that lane width had a minor influence on crash risk. Safety
analysts might therefore mistakenly conclude that widening lanes has little effect on reducing
collisions. However, the assumed true safety effect of lane width was far from this.
The results of I-3 (piecewise nonlinear functions) are shown in Figure 2(c). The CMF for lane
width derived from SPFs was 1.33. When the lane width was between 9 and 12 ft, the two curves were
close to each other. However, the bias was high when the lane became relatively wide or narrow. The
error reached nearly 90 percent for an 8-ft lane. On the side of wider lanes, the true CMF decreased
as lane width increased, but the CMF derived from SPFs increased continuously.
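The comparison described for the three sub-scenarios can be reproduced approximately with a short calculation. In this sketch the bias is taken as the assumed true CMF minus the SPF-derived CMF and the error as the absolute bias divided by the true CMF, which is how the columns of Table 4 appear to be defined; the normalization of the true CMF to 1.0 at 12 ft and all function names are assumptions.

```python
import math

BASE_LW = 12.0  # base lane width in ft

def ln_cmf_true(sub, lw):
    """Assumed true ln(CMF) for sub-scenarios I-1, I-2 and I-3 (Equations 8 to 10)."""
    if sub == "I-1":
        return 0.1 * lw**2 - 2.22 * lw + 12.28
    if sub == "I-2":
        return 0.2 * lw**2 - 4.22 * lw + 21.88
    coef = -0.11 if lw <= BASE_LW else -0.08   # I-3, piecewise quadratic
    return coef * (lw - BASE_LW) ** 2 + 0.30

def compare(sub, cmf_per_ft, lw):
    """Return (true CMF, SPF-derived CMF, bias, error in percent) at lane width lw."""
    true_cmf = math.exp(ln_cmf_true(sub, lw) - ln_cmf_true(sub, BASE_LW))
    spf_cmf = cmf_per_ft ** (lw - BASE_LW)     # log-linear CM-Function from the SPF
    bias = true_cmf - spf_cmf
    return true_cmf, spf_cmf, bias, 100.0 * abs(bias) / true_cmf

DERIVED = {"I-1": 0.88, "I-2": 0.98, "I-3": 1.33}   # Table 3, theoretical phi = 0.5
for sub, cmf in DERIVED.items():
    for lw in (8, 9, 10, 11, 13):
        t, s, b, e = compare(sub, cmf, lw)
        print(f"{sub} LW={lw}: true={t:.2f} spf={s:.2f} bias={b:+.2f} error={e:.1f}%")
```

With these inputs the sign pattern for I-1 matches the text (underestimation at 8 and 13 ft, overestimation at 10 and 11 ft), and the error for I-3 at 8 ft lands in the high 80s, in line with the "nearly 90 percent" quoted above.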
Similar results were found for the other inverse dispersion parameters (not documented here due
to space limitations). So, it can be concluded that none of the CMFs derived from SPFs reflected the
true safety effects accurately. They were all biased, especially in boundary areas. Regression
analysis with the commonly used linear link functions could produce biased CMFs when the variable
had a nonlinear relationship with crash risk, and the bias grew as the nonlinearity increased. In
addition, the misuse of a linear link function also led to biased estimates of other parameters,
which might play important roles in safety analyses. For example, the inverse dispersion parameter
is important in calculating the weights in empirical Bayes (EB) analyses (Hauer et al., 2002; Wu et
al., 2014; Zou et al., 2015). As a result, biased dispersion parameters lead to biased EB estimates
of crashes.
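The effect of a biased inverse dispersion parameter on an EB estimate can be illustrated with a small example. The sketch assumes the NB variance form Var = μ + μ²/ϕ, for which the EB weight is 1 / (1 + μ/ϕ), the form commonly used following Hauer et al. (2002). The site values (predicted mean of 2.0 and 5 observed crashes) are hypothetical; ϕ = 0.5 and ϕ = 0.81 are the theoretical and estimated values for sub-scenario I-2 in Table 3.

```python
def eb_estimate(mu, observed, phi):
    """Return (EB estimate, weight) for one site under Var = mu + mu**2 / phi."""
    weight = 1.0 / (1.0 + mu / phi)
    return weight * mu + (1.0 - weight) * observed, weight

mu, observed = 2.0, 5   # hypothetical SPF prediction and observed count
eb_true, w_true = eb_estimate(mu, observed, phi=0.5)       # theoretical phi
eb_biased, w_biased = eb_estimate(mu, observed, phi=0.81)  # phi estimated from the SPF

print(round(w_true, 3), round(eb_true, 3))
print(round(w_biased, 3), round(eb_biased, 3))
```

An overestimated ϕ puts more weight on the SPF prediction and less on the observed count, so the EB estimate is pulled toward the model even before any bias in the prediction itself is considered.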
In addition to the quadratic form, several other types of nonlinear functional forms were
assumed. CMFs were generated following the same procedure described above, and similar results were
found (the detailed analyses are not documented here).
Figure 2 CM-Functions for lane width in scenario Nonlinear I (ϕ = 0.5): (a) I-1; (b) I-2; (c) I-3.
Table 4 Bias and Error of CMFs for Lane Width in Scenario Nonlinear I
| Scenario | 8 ft: Th. a | SPF b | Bias | E c | 9 ft: Th. | SPF | Bias | E | 10 ft: Th. | SPF | Bias | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ϕ d = 0.5 | | | | | | | | | | | | |
| IV-1 | 2.44 | 0.76 | 1.68 | 68.9 | 1.45 | 0.81 | 0.63 | 43.8 | 1.05 | 0.87 | 0.18 | 16.9 |
| IV-2 | 2.44 | 0.20 | 2.24 | 91.8 | 1.07 | 0.30 | 0.77 | 72.0 | 0.70 | 0.45 | 0.25 | 36.1 |
| IV-3 | 0.17 | 0.63 | -0.46 | 278.3 | 0.36 | 0.70 | -0.34 | 93.7 | 0.64 | 0.79 | -0.15 | 24.1 |
| ϕ d = 1.0 | | | | | | | | | | | | |
| IV-1 | 2.44 | 0.76 | 1.68 | 68.8 | 1.45 | 0.81 | 0.63 | 43.7 | 1.05 | 0.87 | 0.17 | 16.7 |
| IV-2 | 2.44 | 0.21 | 2.23 | 91.5 | 1.07 | 0.31 | 0.76 | 71.4 | 0.70 | 0.46 | 0.25 | 35.2 |
| IV-3 | 0.17 | 0.63 | -0.47 | 282.1 | 0.36 | 0.71 | -0.35 | 95.1 | 0.64 | 0.80 | -0.16 | 24.8 |
| ϕ d = 2.0 | | | | | | | | | | | | |
| IV-1 | 2.44 | 0.75 | 1.69 | 69.4 | 1.45 | 0.80 | 0.64 | 44.4 | 1.05 | 0.86 | 0.18 | 17.4 |
| IV-2 | 2.44 | 0.19 | 2.25 | 92.3 | 1.07 | 0.28 | 0.79 | 73.4 | 0.70 | 0.43 | 0.27 | 38.3 |
| IV-3 | 0.17 | 0.57 | -0.40 | 243.2 | 0.36 | 0.66 | -0.29 | 80.0 | 0.64 | 0.75 | -0.12 | 18.2 |

Table 4 Continued
| Scenario | 11 ft: Th. | SPF | Bias | E | 13 ft: Th. | SPF | Bias | E |
|---|---|---|---|---|---|---|---|---|
| ϕ d = 0.5 | | | | | | | | |
| IV-1 | 0.93 | 0.93 | -0.01 | 0.8 | 1.32 | 1.07 | 0.25 | 18.8 |
| IV-2 | 0.69 | 0.67 | 0.02 | 2.4 | 2.17 | 1.49 | 0.68 | 31.3 |
| IV-3 | 0.89 | 0.89 | 0.00 | 0.4 | 0.93 | 1.12 | -0.20 | 21.2 |
| ϕ d = 1.0 | | | | | | | | |
| IV-1 | 0.93 | 0.93 | -0.01 | 0.9 | 1.32 | 1.07 | 0.25 | 18.8 |
| IV-2 | 0.69 | 0.67 | 0.01 | 1.7 | 2.17 | 1.48 | 0.69 | 31.8 |
| IV-3 | 0.89 | 0.89 | 0.00 | 0.2 | 0.93 | 1.12 | -0.19 | 20.9 |
| ϕ d = 2.0 | | | | | | | | |
| IV-1 | 0.93 | 0.93 | 0.00 | 0.4 | 1.32 | 1.08 | 0.24 | 18.5 |
| IV-2 | 0.69 | 0.66 | 0.03 | 4.1 | 2.17 | 1.52 | 0.65 | 30.1 |
| IV-3 | 0.89 | 0.87 | 0.03 | 2.8 | 0.93 | 1.15 | -0.22 | 24.2 |
Note: a - theoretical CMF (assumed true specific CMFs for lane widths of 8, 9, 10, 11 and 13 ft);
b - CMFs derived from SPFs (i.e., specific CMFs derived from regression models for corresponding
lane widths); c - error percentage, %; d - the theoretical inverse dispersion parameter (ϕ) in each
sub-scenario.
5.2 Scenario Nonlinear II: Consider three variables, only one in nonlinear form
In this scenario, the assumed CMFs for curve density and pavement friction were 1.072 and 0.973,
respectively. Both were in linear form. The same three nonlinear CM-Functions for lane width were
used. Three sub-scenarios, II-1, II-2 and II-3, were analyzed, as shown in Table 5.
Table 5 Assumed CM-Functions (Scenario Nonlinear II)
| # * | Lane Width: ln(CMF) (Nonlinear Level) | Curve Density: CMF | Pavement Friction: CMF |
|---|---|---|---|
| NL II-1 | $0.1\,LW^{2} - 2.22\,LW + 12.28$ (W) | $1.072^{CD}$ | $0.973^{(PF-32)}$ |
| NL II-2 | $0.2\,LW^{2} - 4.22\,LW + 21.88$ (S) | $1.072^{CD}$ | $0.973^{(PF-32)}$ |
| NL II-3 | $-0.11\,(LW-12)^{2} + 0.30$ for $LW \le 12$; $-0.08\,(LW-12)^{2} + 0.30$ for $LW > 12$ (W) | $1.072^{CD}$ | $0.973^{(PF-32)}$ |
Note: * # - sub-scenario number, NL - Nonlinear; LW - lane width, in ft; CD - curve density, number
of curves per mi; PF - pavement friction; (W) - weak, (S) - strong.
The nonlinear level of the assumed CM-Function for lane width in each sub-scenario was the same
as in the corresponding sub-scenario of Nonlinear I. It was relatively strong in II-2, and weak in
II-1 and II-3.
The theoretical function of the generated crash counts and the considered functional form in this
scenario are shown in Equations 13 and 14, respectively.
$$N_{true,i} = N_{spf,i} \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} = 2.67 \times 10^{-4} \times L_i \times AADT_i \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} \tag{13}$$
$$E(\Lambda) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp(\beta_2 \times LW_i + \beta_3 \times CD_i + \beta_4 \times PF_i) \tag{14}$$
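As a worked illustration of Equation 13, the expected crash count for a single segment in sub-scenario II-1 can be computed as below. This is a sketch under stated assumptions: the lane-width CMF is taken directly from Equation 8, the curve density and pavement friction CMFs use the forms listed in Table 5, and the site characteristics are hypothetical.

```python
import math

def n_true(length_mi, aadt, lw, cd, pf):
    """Expected crash count for one segment following Equation 13 (sub-scenario II-1)."""
    cmf_lw = math.exp(0.1 * lw**2 - 2.22 * lw + 12.28)  # Equation 8, nonlinear in LW
    cmf_cd = 1.072 ** cd                                # curve density, log-linear
    cmf_pf = 0.973 ** (pf - 32)                         # pavement friction, base value 32
    n_spf = 2.67e-4 * length_mi * aadt                  # base SPF prediction
    return n_spf * cmf_lw * cmf_cd * cmf_pf

# Hypothetical 1-mi segment with AADT of 5,000 vehicles per day
base = n_true(1.0, 5000, lw=12, cd=0, pf=32)
curvy = n_true(1.0, 5000, lw=12, cd=4, pf=32)
print(round(base, 3), round(curvy, 3), round(curvy / base, 3))
```

The ratio of the two predictions equals 1.072 raised to the fourth power, which is what a log-linear CM-Function for curve density implies for four additional curves per mile.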
The CMFs for the three variables and other modeling results of each sub-scenario are documented
in Table 6. The overall result was quite similar to that in Nonlinear I. The MAD and MSPE were
higher than those of linear relationships (i.e., Scenario II in the previous study). II-2
consistently had the highest MAD and MSPE. Meanwhile, the inverse dispersion parameters estimated
from SPFs were again biased. The CMFs for lane width derived in this scenario were slightly
different from those of Nonlinear I. The CMFs for curve density and pavement friction were very
close to their true values. However, the MAD and MSPE of this scenario were higher than those of
Nonlinear I under the same assumed CM-Function for lane width.
Figures 3(a) to (c) illustrate the curves of the CM-Functions for lane width (ϕ = 0.5). The
results were very close to those of the corresponding sub-scenarios in Nonlinear I. The CMFs were
all biased, especially in boundary areas. The bias of II-2 was always higher than those of II-1 and
II-3 (except over a very small range around 9 ft).
The CMFs for curve density derived from SPFs in the three sub-scenarios (ϕ = 0.5) were 1.071,
1.069 and 1.073, respectively. They were quite close to the assumed true value, 1.072. The
CM-Function curves are shown in Figure 3(d). The CMFs were generally acceptable. However, when
comparing the results between the three sub-scenarios, it can be observed that the bias and error
percentage in II-2 (strong) were always higher than those in II-1 and II-3 (weak). So, as the
nonlinearity between lane width and crash risk increased, the bias of the CMF for curve density
became more significant. That is to say, even if the link function for one variable is correct, the
accuracy of the CMF for this variable can still be influenced when incorrect or improper link
functions are used for other variables in the model.
The CMFs for pavement friction produced from the three sub-scenarios (ϕ = 0.5) were 0.972, 0.971
and 0.973, respectively. The CM-Function curves are shown in Figure 3(e). The results were similar
to those of curve density. Overall, the bias and error percentage were relatively small. II-2 was
consistently the highest in terms of bias and error percentage.
Table 6 Results of Scenario Nonlinear II
| # * | LW: CMF (SD) a | CD: CMF (SD) a | PF: CMF (SD) a | ϕ b | AIC d | MAD e | MSPE f |
|---|---|---|---|---|---|---|---|
| ϕ c = 0.5 | | | | | | | |
| NL II-1 | 0.88 (0.014) | 1.071 (0.007) | 0.972 (0.003) | 0.57 | 14677.2 | 0.62 | 1.60 |
| NL II-2 | 0.98 (0.015) | 1.069 (0.006) | 0.971 (0.003) | 0.79 | 14988.9 | 1.27 | 6.63 |
| NL II-3 | 1.33 (0.021) | 1.073 (0.006) | 0.973 (0.003) | 0.57 | 10441.8 | 0.29 | 0.52 |
| ϕ c = 1.0 | | | | | | | |
| NL II-1 | 0.88 (0.015) | 1.072 (0.007) | 0.972 (0.003) | 1.08 | 14842.6 | 0.62 | 1.65 |
| NL II-2 | 0.98 (0.021) | 1.069 (0.008) | 0.971 (0.004) | 1.31 | 14908.6 | 1.26 | 6.72 |
| NL II-3 | 1.35 (0.024) | 1.073 (0.008) | 0.974 (0.004) | 1.07 | 10520.1 | 0.30 | 0.60 |
| ϕ c = 2.0 | | | | | | | |
| NL II-1 | 0.88 (0.022) | 1.071 (0.008) | 0.972 (0.004) | 2.08 | 14439.9 | 0.63 | 1.76 |
| NL II-2 | 0.98 (0.026) | 1.069 (0.010) | 0.970 (0.005) | 2.36 | 14370.7 | 1.26 | 6.80 |
| NL II-3 | 1.35 (0.034) | 1.072 (0.010) | 0.974 (0.005) | 2.08 | 10277.6 | 0.31 | 0.66 |
Note: the same notes as those in Table 3.
Figure 3 CM-Functions of variables in scenario Nonlinear II (ϕ = 0.5): (a) lane width in II-1;
(b) lane width in II-2; (c) lane width in II-3; (d) curve density; (e) pavement friction.
5.3 Scenario Nonlinear III: Consider three variables, two in nonlinear form
To simplify the analyses, only the first two nonlinear CM-Functions for lane width in Scenarios
Nonlinear I and II were used in this scenario; the piecewise function was dropped. The assumed CMF
for pavement friction was 0.973, the same as that in Scenario Nonlinear II.
Two quadratic CM-Functions for curve density were assumed, as shown in Equations 15 and 16,
respectively.
$$\ln(CMF_i) = 8.7 \times 10^{-4}\,CD_i^{2} + 5.56 \times 10^{-2}\,CD_i \tag{15}$$
$$\ln(CMF_i) = 3.5 \times 10^{-3}\,CD_i^{2} + 1.39 \times 10^{-2}\,CD_i \tag{16}$$
The closest line, area and AVD for Equation 15 are $6.95 \times 10^{-2}\,CD - 0.037$, 0.229 and
0.014, respectively. Those for Equation 16 are $6.99 \times 10^{-2}\,CD - 0.149$, 0.920 and 0.057,
respectively. Both the area and the AVD of the second function are much higher than those of the
first one. So, the nonlinear level of the second one is stronger than the first.
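The closest line, area and AVD quoted above can be approximated numerically. The sketch below makes three assumptions that are not stated in this section: the closest line is the least-squares line over the variable's range, the AVD is the area divided by the width of that range, and curve density ranges from 0 to 16 curves per mile. The function name is hypothetical.

```python
import numpy as np

def closest_line_area_avd(ln_cmf, lo, hi, n=20001):
    """Least-squares line to ln_cmf on [lo, hi], area between them, and area / range."""
    x = np.linspace(lo, hi, n)
    y = ln_cmf(x)
    slope, intercept = np.polyfit(x, y, 1)            # least-squares line on a dense grid
    gap = np.abs(y - (slope * x + intercept))         # vertical distance curve to line
    area = float(np.sum((gap[1:] + gap[:-1]) / 2.0 * np.diff(x)))  # trapezoid rule
    return float(slope), float(intercept), area, area / (hi - lo)

eq15 = lambda cd: 8.7e-4 * cd**2 + 5.56e-2 * cd      # Equation 15 (weak)
eq16 = lambda cd: 3.5e-3 * cd**2 + 1.39e-2 * cd      # Equation 16 (strong)

slope15, icpt15, area15, avd15 = closest_line_area_avd(eq15, 0.0, 16.0)
slope16, icpt16, area16, avd16 = closest_line_area_avd(eq16, 0.0, 16.0)
print(round(slope15, 4), round(icpt15, 3), round(area15, 3), round(avd15, 3))
print(round(slope16, 4), round(icpt16, 3), round(area16, 3), round(avd16, 3))
```

Under these assumptions the results agree closely with the values reported for Equations 15 and 16, which suggests the assumed definitions are at least compatible with the ones used in the paper.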
In total, there were four sub-scenarios in this scenario, as shown in Table 7. The nonlinear
level of III-1 was weak in both lane width and curve density, and that of III-4 was strong in both.
III-2 and III-3 each combined one weak and one strong relationship.
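Table 7 Assumed CM-Functions (Scenario Nonlinear III)
| # * | Lane Width: ln(CMF) (Nonlinear Level) | Curve Density: ln(CMF) (Nonlinear Level) | Pavement Friction: CMF |
|---|---|---|---|
| NL III-1 | $0.1\,LW^{2} - 2.22\,LW + 12.28$ (W) | $8.7 \times 10^{-4}\,CD^{2} + 5.56 \times 10^{-2}\,CD$ (W) | $0.973^{(PF-32)}$ |
| NL III-2 | $0.2\,LW^{2} - 4.22\,LW + 21.88$ (S) | $8.7 \times 10^{-4}\,CD^{2} + 5.56 \times 10^{-2}\,CD$ (W) | $0.973^{(PF-32)}$ |
| NL III-3 | $0.1\,LW^{2} - 2.22\,LW + 12.28$ (W) | $3.5 \times 10^{-3}\,CD^{2} + 1.39 \times 10^{-2}\,CD$ (S) | $0.973^{(PF-32)}$ |
| NL III-4 | $0.2\,LW^{2} - 4.22\,LW + 21.88$ (S) | $3.5 \times 10^{-3}\,CD^{2} + 1.39 \times 10^{-2}\,CD$ (S) | $0.973^{(PF-32)}$ |
Note: the same notes as those in Table 5.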
The theoretical function of the generated crash counts and the considered functional form used in
this scenario were identical to those in Nonlinear II (i.e., Equations 13 and 14). They are
reproduced below as Equations 17 and 18, respectively.
$$N_{true,i} = N_{spf,i} \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} = 2.67 \times 10^{-4} \times L_i \times AADT_i \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} \tag{17}$$
$$E(\Lambda) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp(\beta_2 \times LW_i + \beta_3 \times CD_i + \beta_4 \times PF_i) \tag{18}$$
The CMFs for the three variables as well as other modeling results are presented in Table 8. As
in Nonlinear II, the MAD and MSPE were higher than those of linear relationships. Surprisingly,
however, they were always the highest in III-2 (combination of strong and weak) rather than in
III-4 (strong in both). The inverse dispersion parameters estimated from SPFs were again biased in
this scenario. The CMF for lane width derived in this scenario was nearly the same as that of
Nonlinear II with the corresponding assumed CM-Function. The CMFs for curve density were slightly
different from those in Nonlinear II, and the CMFs for pavement friction were very close to the
true value.
Figures 4(a) to (d) illustrate the CM-Function curves for lane width in each sub-scenario
(ϕ = 0.5). The results were nearly identical to those of the corresponding sub-scenarios in
Nonlinear II. The CMFs were all biased, especially around boundary areas. The calculation indicated
that the bias in III-1 and III-3 (weak in lane width) was much lower than that in III-2 and III-4
(strong in lane width). It seems that the changes in nonlinearity of curve density had no
significant influence on the CMF for lane width.
The CMFs for curve density derived from SPFs in the four sub-scenarios (ϕ = 0.5) were 1.073,
1.072, 1.075 and 1.074, respectively. The curves are shown in Figures 4(e) to (h). The CMFs derived
from SPFs were overestimated in all four sub-scenarios. When comparing the results between the two
assumed CM-Functions for curve density, the bias and error percentage of III-3 and III-4 (strong in
curve density) were always much higher than those of III-1 and III-2 (weak in curve density), except
over a small range around 16. In short, the CMFs for curve density derived from SPFs were all biased
when the relationship was nonlinear, and the bias increased as the nonlinear level became stronger.
Another interesting finding was that the highest bias of the CMF for curve density did not appear
around the boundary areas, but near the middle. As can be seen in Figures 4(e) to (h), as curve
density increased from the baseline point (i.e., zero), the bias first increased and then decreased,
peaking around 11. This was probably because the baseline for curve density was at the far left of
its range. If the baseline were at some point in the middle (e.g., 8 or 10), the result might be
similar to that of lane width: intuitively, the bias should be small around the baseline and large
in boundary areas. Nevertheless, the CMFs were still biased.
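The location of the largest bias for curve density can be checked with a short calculation for sub-scenario III-3. This sketch assumes a curve density range of 0 to 16 curves per mile (not stated in this section), uses the derived CMF of 1.075 reported above for ϕ = 0.5, and takes the bias as the assumed true CMF minus the SPF-derived CMF, so a negative value means the SPF overestimates.

```python
import numpy as np

cd = np.linspace(0.0, 16.0, 161)                    # curve density grid, step 0.1
true_cmf = np.exp(3.5e-3 * cd**2 + 1.39e-2 * cd)    # Equation 16 (strong nonlinearity)
spf_cmf = 1.075 ** cd                               # log-linear CM-Function from the SPF
bias = true_cmf - spf_cmf

peak_cd = float(cd[np.argmax(np.abs(bias))])        # curve density with the largest |bias|
print(round(peak_cd, 1), round(float(bias.min()), 3), round(float(bias[-1]), 3))
```

With the baseline at zero, the two curves coincide at the left edge and nearly meet again near the right edge, so the largest gap falls in the interior of the range rather than at a boundary.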
The CMFs for pavement friction produced from the four sub-scenarios (ϕ = 0.5) were 0.973, 0.972,
0.973 and 0.972, respectively. The CM-Function curves are shown in Figure 4(i). The overall results
were nearly the same as those of Nonlinear II. Both the bias and the error percentage were
relatively small.
Table 8 Results of Scenario Nonlinear III
| # | LW: CMF (SD) a | CD: CMF (SD) a | PF: CMF (SD) a | ϕ b | AIC d | MAD e | MSPE f |
|---|---|---|---|---|---|---|---|
| ϕ c = 0.5 | | | | | | | |
| NL III-1 | 0.88 (0.014) | 1.073 (0.006) | 0.973 (0.002) | 0.57 | 14455.8 | 0.60 | 2.05 |
| NL III-2 | 0.97 (0.017) | 1.072 (0.005) | 0.972 (0.002) | 0.79 | 14670.4 | 1.21 | 6.80 |
| NL III-3 | 0.88 (0.015) | 1.075 (0.005) | 0.973 (0.003) | 0.57 | 13835.8 | 0.55 | 1.61 |
| NL III-4 | 0.97 (0.016) | 1.074 (0.006) | 0.972 (0.003) | 0.79 | 13980.7 | 1.08 | 5.47 |
| ϕ c = 1.0 | | | | | | | |
| NL III-1 | 0.88 (0.015) | 1.072 (0.008) | 0.972 (0.003) | 1.07 | 14648.4 | 0.61 | 2.15 |
| NL III-2 | 0.97 (0.020) | 1.074 (0.008) | 0.972 (0.003) | 1.31 | 14621.6 | 1.21 | 6.89 |
| NL III-3 | 0.88 (0.018) | 1.073 (0.008) | 0.973 (0.003) | 1.08 | 13965.8 | 0.55 | 1.71 |
| NL III-4 | 0.97 (0.021) | 1.074 (0.008) | 0.972 (0.004) | 1.32 | 13978.6 | 1.08 | 5.55 |
| ϕ c = 2.0 | | | | | | | |
| NL III-1 | 0.88 (0.021) | 1.073 (0.010) | 0.972 (0.005) | 2.09 | 14243.5 | 0.62 | 2.26 |
| NL III-2 | 0.98 (0.027) | 1.073 (0.010) | 0.971 (0.005) | 2.36 | 14118.3 | 1.22 | 7.16 |
| NL III-3 | 0.88 (0.022) | 1.073 (0.010) | 0.973 (0.004) | 2.09 | 13592.0 | 0.56 | 1.81 |
| NL III-4 | 0.98 (0.027) | 1.074 (0.011) | 0.971 (0.004) | 2.36 | 13420.4 | 1.08 | 5.70 |
Note: the same notes as those in Table 3.
Figure 4 CM-Functions for variables in scenario Nonlinear III (ϕ = 0.5): (a) to (d) lane width in
III-1 to III-4; (e) to (h) curve density in III-1 to III-4; (i) pavement friction.
6. DISCUSSIONS AND CONCLUSIONS
This paper has documented an extensive study on the validity of using regression models to
estimate CMFs, with emphasis on misspecification of the link functions connecting variables and
crash risk. Several nonlinear CM-Functions, mixed with linear ones, were assumed for the variables
and used to generate crash counts. CMFs were then estimated using GLMs and their accuracy was
analyzed. The main conclusions can be summarized as follows: (1) the CMFs for variables having
nonlinear effects on safety derived from SPFs are biased when using the common GLMs; (2) with the
increase in nonlinearity, the bias tends to become more significant (as expected); (3) the quality
of CMFs for other variables having linear relationships can be influenced when mixed with those
having nonlinear relationships, but the accuracy may still be acceptable; and (4) the misuse of
linear link functions for one or more variables also leads to biased estimates of other parameters.
This study showed the importance of link functions when modeling crashes, which is consistent with
several previous studies (Miaou and Lord, 2003; El-Basyouny and Sayed, 2010; Lord and Mannering,
2010). It also explains why nonlinear models outperformed GLMs in previous studies (Lee et al.,
2015; Stefanova et al., 2015; Park and Abdel-Aty, 2016). Improper functions may lead to misleading
conclusions and biased CMFs. The conventional GLM should not be used for either modeling crashes or
developing CMFs when the variables have strong nonlinear effects on safety, as Lee et al. (2015)
and Zeng et al. (2016a) suggested.
This study is not without limitations. First, a solid model is the basis for predicting crashes
and developing reliable CMFs (Wu et al., 2015). This study only considered the most frequently used
one (i.e., the NB distribution), which was used in both data generation and modeling. Second, some
variables were independently and uniformly generated in this study, which might not be the case in
practice (Park and Abdel-Aty, 2015b; Wu and Lord, 2016). Preliminary analyses indicated that
variable distribution and correlation could also affect the modeling results. Lee et al. (2015)
pointed out that variable correlation should be considered when modeling crashes. To simplify the
analyses and properly quantify the biased estimates, these factors were not considered in this
study. Finally, sample size influences the modeling significantly (Lord, 2006; Lord and
Miranda-Moreno, 2008). The dataset used in this study contained 1,492 observations, which should be
large enough. In reality, many studies have used only hundreds or even dozens of observations due to
the limited availability of crash and highway data. These issues affect the quality of CMFs and need
further consideration. Nevertheless, the simulation protocol proposed in this paper can still be
applied to evaluate CMFs under different conditions.
ACKNOWLEDGEMENTS
The authors wish to thank Dr. Ezra Hauer for his comments on a previous version of the paper.

REFERENCES
AASHTO, 2004. A policy on geometric design of highways and streets. American Association of State Highway and Transportation Officials, Washington, D.C.
AASHTO, 2010. Highway safety manual, 1st ed. American Association of State Highway and Transportation Officials, Washington, D.C.
Banihashemi, M., 2015. Is horizontal curvature a significant factor of safety in rural multilane highways? Transportation Research Record: Journal of the Transportation Research Board 2515, 50-56.
Banihashemi, M., 2016. Effect of horizontal curves on urban arterial crashes. Accident Analysis & Prevention 95, 20-26.
Bonneson, J., Pratt, M., 2008. Procedure for developing accident modification factors from cross-sectional data. Transportation Research Record: Journal of the Transportation Research Board 2083, 40-48.
El-Basyouny, K., Sayed, T., 2010. Application of generalized link functions in developing accident prediction models. Safety Science 48 (3), 410-416.
Elvik, R., 2011. Assessing causality in multivariate accident models. Accident Analysis & Prevention 43 (1), 253-264.
Gross, F., Jovanis, P.P., 2007a. Estimation of safety effectiveness of changes in shoulder width with case control and cohort methods. Transportation Research Record 2019, 237-245.
Gross, F., Jovanis, P.P., 2007b. Estimation of the safety effectiveness of lane and shoulder width: Case-control approach. Journal of Transportation Engineering 133 (6), 362-369.
Gross, F., Jovanis, P.P., Eccles, K., Chen, K.-Y., 2009. Safety evaluation of lane and shoulder width combinations on rural, two lane, undivided roads. FHWA, U.S. Department of Transportation, Washington, D.C.
Gross, F., Persaud, B., Lyon, C., 2010. A guide to developing quality crash modification factors. FHWA, U.S. Department of Transportation, Washington, D.C.
Hauer, E., 2004. Statistical road safety modeling. Transportation Research Record: Journal of the Transportation Research Board 1897, 81-87.
Hauer, E., 2005a. Cause and effect in observational cross-section studies on road safety. In: Proceedings of the 84th Annual Meeting of the Transportation Research Board (TRB), Washington, D.C.
Hauer, E., 2005b. Fishing for safety information in murky waters. Journal of Transportation Engineering 131 (5), 340-344.
Hauer, E., 2010. Cause, effect and regression in road safety: A case study. Accident Analysis & Prevention 42 (4), 1128-1135.
Hauer, E., 2013. Even perfect regressions may not tell the effect of interventions. In: Proceedings of the Transportation Research Board (TRB) 92nd Annual Meeting, Washington, D.C.
Hauer, E., 2014. Trustworthiness of safety performance functions. In: Proceedings of the 93rd Annual Meeting of the Transportation Research Board (TRB), Washington, D.C.
Hauer, E., 2015. The art of regression modeling in road safety. Springer, USA.
Hauer, E., Council, F.M., Mohammedshah, Y., 2004. Safety models for urban four-lane undivided road segments. Transportation Research Record: Journal of the Transportation Research Board 1897, 96-105.
Hauer, E., Harwood, D.W., Council, F.M., Griffith, M.S., 2002. Estimating safety by the empirical Bayes method - a tutorial. Transportation Research Record: Journal of the Transportation Research Board 1784, 126-131.
Lao, Y., Zhang, G., Wang, Y., Milton, J., 2014. Generalized nonlinear models for rear-end crash risk analysis. Accident Analysis & Prevention 62, 9-16.
Lee, C., Abdel-Aty, M., Park, J., Wang, J.-H., 2015. Development of crash modification factors for changing lane width on roadway segments using generalized nonlinear models. Accident Analysis & Prevention 76, 83-91.
Li, X., Lord, D., Zhang, Y., 2011. Development of accident modification factors for rural frontage road segments in Texas using generalized additive models. Journal of Transportation Engineering 137 (1), 74-83.
Li, X., Lord, D., Zhang, Y., Xie, Y., 2008. Predicting motor vehicle crashes using support vector machine models. Accident Analysis & Prevention 40 (4), 1611-1618.
Lord, D., 2006. Modeling motor vehicle crashes using Poisson-gamma models: Examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis & Prevention 38 (4), 751-766.
Lord, D., Bonneson, J.A., 2007. Development of accident modification factors for rural frontage road segments in Texas. Transportation Research Record: Journal of the Transportation Research Board 2023, 20-27.
Lord, D., Guikema, S.D., Geedipally, S.R., 2008. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes. Accident Analysis & Prevention 40 (3), 1123-1134.
Lord, D., Kuo, P.-F., 2012. Examining the effects of site selection criteria for evaluating the effectiveness of traffic safety countermeasures. Accident Analysis & Prevention 47, 52-63.
Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives. Transportation Research Part A 44 (5), 291-305.
Lord, D., Miranda-Moreno, L.F., 2008. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: A Bayesian perspective. Safety Science 46 (5), 751-770.
Mannering, F.L., Shankar, V., Bhat, C.R., 2016. Unobserved heterogeneity and the statistical analysis of highway accident data. Analytic Methods in Accident Research 11, 1-16.
Miaou, S.-P., Lord, D., 2003. Modeling traffic crash flow relationships for intersections - dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transportation Research Record: Journal of the Transportation Research Board 1840, 31-40.
Park, B.-J., Lord, D., Wu, L., 2016. Finite mixture modeling approach for developing crash modification factors in highway safety analysis. Accident Analysis & Prevention 97, 274-287.
Park, J., Abdel-Aty, M., 2015a. Assessing the safety effects of multiple roadside treatments using parametric and nonparametric approaches. Accident Analysis & Prevention 83, 203-213.
Park, J., Abdel-Aty, M., 2015b. Development of adjustment functions to assess combined safety effects of multiple treatments on rural two-lane roadways. Accident Analysis & Prevention 75, 310-319.
Park, J., Abdel-Aty, M., 2016. Evaluation of safety effectiveness of multiple cross sectional features on urban arterials. Accident Analysis & Prevention 92, 245-255.
Pierce, R., 2014. Definition of bias. http://www.mathsisfun.com/definitions/bias.html. Accessed October 2, 2016.
Stefanova, T., Burkhardt, J.M., Filtness, A., Wullems, C., Rakotonirainy, A., Delhomme, P., 2015. Systems-based approach to investigate unsafe pedestrian behaviour at level crossings. Accident Analysis & Prevention 81, 167-186.
Wu, L., Lord, D., 2016. Investigating the influence of dependence between variables on crash modification factors developed using regression models. In: Proceedings of the 95th Annual Meeting of the Transportation Research Board (TRB), Washington, D.C.
Wu, L., Lord, D., Zou, Y., 2015. Validation of crash modification factors derived from cross-sectional studies with regression models. Transportation Research Record: Journal of the Transportation Research Board 2514, 88-96.
Wu, L., Zou, Y., Lord, D., 2014. Comparison of Sichel and negative binomial models in hot spot identification. Transportation Research Record: Journal of the Transportation Research Board 2460, 107-116.
Xie, Y., Lord, D., Zhang, Y., 2007. Predicting motor vehicle collisions using Bayesian neural network models: An empirical analysis. Accident Analysis & Prevention 39 (5), 922-933.
Xie, Y., Zhang, Y., 2008. Crash frequency analysis with generalized additive models. Transportation Research Record: Journal of the Transportation Research Board 2061, 39-45.
Zeng, Q., Huang, H., Pei, X., Wong, S.C., Gao, M., 2016a. Rule extraction from an optimized neural network for traffic crash frequency modeling. Accident Analysis & Prevention 97, 87-95.
Zeng, Z., Zhu, W., Ke, R., Ash, J., Wang, Y., Xu, J., Xu, X., 2016b. A generalized nonlinear model-based mixed multinomial logit approach for crash data analysis. Accident Analysis & Prevention 99 (Pt A), 51-65.
Zou, Y., Wu, L., Lord, D., 2015. Modeling over-dispersed crash data with a long tail: Examining the accuracy of the dispersion parameter in negative binomial models. Analytic Methods in Accident Research 5-6, 1-16.