analysis of injury severity in pedestrian ......classification and regression tree (cart) analysis...
TRANSCRIPT
ANALYSIS OF INJURY SEVERITY IN PEDESTRIAN CRASHES USING
CLASSIFICATION REGRESSION TREES
By
Vichika Iragavarapu, P.E.
(Corresponding author)
Assistant Research Engineer
Texas A&M Transportation Institute, 3135 TAMU
College Station, TX 77843-3135
Phone: 979/845-5686, fax: 979/845-6006
Email: [email protected]
Dominique Lord, Ph.D, P.Eng.
Associate Professor and Zachry Development Professor I
Zachry Department of Civil Engineering, 3136 TAMU
Texas A&M University
College Station, TX 77843-3136
Phone: 979/458-3949, fax: 979/845-6481
Email: [email protected]
And
Kay Fitzpatrick, Ph.D., P.E.
Senior Research Engineer
Texas A&M Transportation Institute, 3135 TAMU
College Station, TX 77843-3135
Phone: 979/845-7321, fax: 979/845-6006
Email: [email protected]
TOTAL WORDS: 6,164 [3664 Words, 9 Tables, 1 Figure (2500)]
Submitted to the Transportation Research Board 94th Annual Meeting
January 11-15, 2015, Washington D.C.
Iragavarapu, Lord, Fitzpatrick
ABSTRACT
Texas is considered to be an “opportunity” state by the Federal Highway Administration
(FHWA), due to the high number of pedestrian crashes. Data from the Fatality Analysis
Reporting System (FARS) show that the number of pedestrian fatal crashes in Texas is the third
highest in the U.S and is significantly higher than the national average. The research team
explored the Texas Department of Transportation (TxDOT) Crash Record Information System
(CRIS) database to identify characteristics of crashes involving pedestrians in Texas. A
Classification and Regression Tree (CART) analysis of all pedestrian crashes was conducted to
find the significant factors influencing the severity of crashes involving pedestrians in Texas.
The classification tree identified that light condition, road class, traffic control, right shoulder
width, involvement of a commercial vehicle, pedestrian age, and the collision manner, have the
most influence on the severity of pedestrian crashes.
Iragavarapu, Lord, Fitzpatrick 1
INTRODUCTION 1 2
Pedestrian crashes are a major safety issue; on average, a pedestrian is killed every two hours 3
and injured every eight minutes in traffic crashes in the United States (1). For the years 2007 4
through 2011, Texas had the third highest number of pedestrian fatal crashes in the U.S., with 5
about 400 pedestrian fatalities per year, which is equal to 15 percent of all traffic crash fatalities 6
in the US (2, 3). To understand the characteristics and factors influencing pedestrian crashes, the 7
Texas Department of Transportation (TxDOT) requested an in-depth analysis of these crashes. 8
Understanding of these relations will provide insight for the development of effective 9
countermeasures focused on pedestrians. The objective of this analysis was, therefore, to find the 10
significant factors and their interactions that influence the severity of crashes involving a 11
pedestrian in Texas. 12
13
14
STUDY METHODOLOGY 15 16
A variety of methodologies have been used to understand factors influencing crash severities. A 17
comprehensive review of these methodologies is provided in Savolainen et. al, 2011 (4). Data 18
mining techniques are widely applied in the areas of business, medicine, industry and 19
engineering and are gaining attention in the transportation safety area (5, 6, 7, 8, 9). The 20
classification-regression tree (CRT) methodology is a popular data mining technique that does 21
not need a specific functional form and is effective with large data sets containing a large number 22
of explanatory variables (10). 23
24
The CRT framework is based on the algorithm first proposed in Breiman et al. (11). The “root 25
node”, which is the node with all the data, is divided into two child nodes on the basis of an 26
independent variable (splitter) that creates the best homogeneity. This process is repeated for 27
each child node until all data in each node has the greatest possible homogeneity. This node is 28
called a “leaf node”. The most famous index for splitting of nominal data is the Gini index. For 29
each tree created, the “goodness of fit” index is calculated using the “misclassification error rate” 30
or “misclassification cost.” “Pruning” is performed according to the cost-complexity algorithm to 31
avoid over-fitting of the training data and to create an optimal tree. An optimal tree is the one 32
that has the least misclassification cost for the test data. Misclassification cost allows inclusion of 33
information about the relative penalty associated with incorrect classification and is inversely 34
proportional to the accuracy of prediction. Correct classifications always have a misclassification 35
cost of 0 (discussed further below). Importance of each variable is calculated using the variable 36
importance index and is scaled such that its summation is one. (12) 37
38
The equation used for the Gini index is (12): 39
40
(Gini Index) Gini(m) = ∑ ( | ) 1 41
42
( | ) ( )
( ) , ( )
( ) ( )
, ( ) ∑ ( )
43
44
Iragavarapu, Lord, Fitzpatrick 2
Where, j is the number of target variables or classes, π(j) is the prior probability for class j, p(j|m) 1
is the conditional probability of a record being in class j, provided that it is in the node m, Nj(m) 2
is the number of records in class j of node m, Nj is the number of records of class j in the root 3
node, and Gini (m) or the Gini index is the indication of impurity in node m. The prior 4
probability shows the proportion of observations in each class in the population. (12) 5
6
Misclassification Error Rate = ∑ ( )[ ( )] 2 7
8
Where, p(m) is the proportion of existing observations in the terminal node or leaf m (from all 9
observations) and M is the number of terminal nodes. (12) 10
11
(Variable Importance Index) VIM(xj) = ∑
( ( ))
3 12
13
Where, Gini(S(xj,t)) is the reduction in the Gini index at node t that is achieved by splitting 14
variable xj,
is the proportion of the observations in the dataset that belong to node t, T is the 15
total number of nodes and N is the total number of observations.(12) 16
17
DATA 18 19
The TxDOT Crash Record Information System (CRIS) database has three subsets: crash, person, 20
and unit. The crash dataset contains information on the characteristics of the crash (e.g., date, 21
time, weather) and its location (e.g., intersection relation, surface condition, traffic control 22
devices). The person dataset has information on the characteristics of the people involved (e.g., 23
age, gender, blood alcohol content, etc.) and injuries sustained (e.g., fatal, no injury, etc.). The 24
unit dataset describes the characteristics of the units (e.g., type of vehicle, contributing factor for 25
unit) involved, along with the contributing factors for each unit involved. Each of these subsets 26
has different codes that distinguish crashes involving pedestrians from other reported crashes. 27
34,620 pedestrian crashes were available between 2007 and 2011 for use in this analysis. Table 1 28
lists the variables used in this analysis. 29
30
31
Iragavarapu, Lord, Fitzpatrick 3
Table 1. Variables used in analysis. 1
Variable Description
Rpt_Road_Part_ID Reported roadway part on which the crash occurred
Wthr_Cond_Rev Weather condition at the time of crash (Revised to merge some categories)
Light_Cond_ID The type and level of light at the time of the crash
Road_Algn_ID Roadway Alignment at the crash site
Surf_Cond_ID The surface condition at the time and place of the crash
Traffic_Cntl__Rev Type of traffic control at the scene of the crash (Revised to merge some categories)
FHE_Collsn_Rev The manner in which the vehicle(s) were moving prior to the first harmful event (Revised
to merge some categories)
Othr_Factr_Rev Additional detail of events/circumstances concerning the crash (Revised to merge some
categories)
Road_Cls_ID The functional classification group of the priority road the motor vehicle(s) was traveling
on before the crash
Road_Relat_ID Where the crash occurred in relation to the roadway
Month Month of year when the crash occurred
Hour Time of day when the crash occurred
DriverAge Age of one of the drivers involved in the crash
DriverGender Gender of one of the drivers involved in the crash
PedestrianAge Age of one of the pedestrians involved in the crash
PedestrianGender Gender of one of the pedestrians involved in the crash
Hwy_Dsgn_Lane_ID Lane design on the applicable section of highway for crashes located on the state
highway system
Hwy_Dsgn_Hrt_ID Part of the Highway Design Code indicating HOV (High Occupancy Vehicle), railroads,
and toll roads (HRT) for crashes located on the state highway system
Hp_Shldr_Left
Width of inside shoulder on divided sections, or width of shoulder traveling in
descending marker direction, measured in feet, for crashes located on the state highway
system
Hp_Shldr_Right
Width of outside shoulder on divided sections, or width of shoulder traveling in
ascending marker direction, measured in feet, for crashes located on the state highway
system
Hp_Median_Width Median width plus both inside shoulders, measured in feet, for crashes located on the
state highway system
Nbr_Of_Lane Number of lanes, not including turning and climbing lanes, for crashes located on the
state highway system
Shldr_Type_Left_ID Type of shoulder on the left side of the road, for crashes located on the state highway
system
Shldr_Type_Right_ID Type of shoulder on the right side of the road, for crashes located on the state highway
system
Median_Type_ID Median type description, for crashes located on the state highway system
Adt_Curnt_Amt Average daily traffic amount for a given road segment and year for crashes located on the
state highway system
Trk_Aadt_Pct Adjusted average daily traffic percent for trucks for crashes located on the state highway
system
2
Iragavarapu, Lord, Fitzpatrick 4
ANALYSIS 1 2
The analysis was performed with IBM SPSS Statistics 21 Decision Tree tool, using the tree 3
growing method of CRT (Classification-regression Tree). The response variable assessed in this 4
analysis was crash severity, which is defined as the level of injury sustained by the most severely 5
injured person involved in the crash. The crash severity variable used for developing the 6
classification tree for this analysis was categorized as either fatal or non-fatal. The tree depth was 7
restricted to five levels and impurity was measured with the Gini index. Minimum change in 8
impurity improvement was set at 0.0001. Seventy percent of the data was randomly assigned to 9
train the model and the remaining thirty percent was allocated to the test. The tree was pruned to 10
avoid over-fitting. 11
12
In the dataset used for this analysis, the number of non-fatal crashes was almost fifteen times 13
than that of fatal crashes (32,388 vs. 2,232) and the overall prediction accuracy for test sample 14
was 93.5%, whereas the prediction accuracy for fatal crashes was only 10.9%. To ensure the 15
same prediction accuracy for both severity levels, the prior probabilities are set equal so that the 16
target variable level that has a lower proportion is also taken into consideration in predictions. 17
Although this decreases the overall accuracy of the model, the prediction accuracy of the data 18
with the least proportion increases (12). Table 2 shows that with using equal prior probabilities 19
across all categories, the overall prediction accuracy of the model (for the training and the test 20
data) decreased slightly, but the prediction accuracy for fatal crashes improved tremendously 21
(74.3% vs. 10.9%). 22
23
Table 3 shows the misclassification costs and Table 4 shows the risk estimate for the model, 24
which is the proportion of cases incorrectly classified after adjustment for prior probabilities and 25
misclassification costs. As discussed above, correct classifications represented on the diagonal in 26
Table 3 have a misclassification cost of 0 27
28
29
Iragavarapu, Lord, Fitzpatrick 5
1
Table 2. Prediction Performance for the CRT Model. 2
Sample Observed Crash Severity
Predicted Crash Severity
Non-Fatal Fatal Percent
Correct
Training
Non-Fatal 17826 4912 78.4%
Fatal 359 1224 77.3%
Overall Percentage 74.8% 25.2% 78.3%
Test
Non-Fatal 7575 2075 78.5%
Fatal 167 482 74.3%
Overall Percentage 75.2% 24.8% 78.2%
3
Table 3. Misclassification Costs for the CRT Model. 4
Observed Predicted
N Y
N 0.000 1.000
Y 1.000 0.000
5
Table 4. Risk Estimate of the Model. 6
Sample Estimate Standard Error
Training 0.221 0.005
Test 0.236 0.009
7
The classification tree generated (Figure 1) shows that the initial split at node 0 is based on the 8
variable of light condition, which implies that light condition is the best variable to classify and 9
predict pedestrian crash severity (fatal versus non-fatal). More (64%) pedestrian crashes are 10
predicted to occur in daylight, whereas a higher proportion of fatal crashes are predicted to occur 11
in dark conditions (13% vs. 3%, Node 1 vs. Node 2). Traffic control, road class, and pedestrian 12 age are selected to be the splitters more than once, implying that these variables have multiple 13
effects on the crash severity outcome. 14
15
Iragavarapu, Lord, Fitzpatrick 6
Figure 1. Classification Tree.
Iragavarapu, Lord, Fitzpatrick 7
DISCUSSION 1 2
The classification tree developed in this analysis (shown in Figure 1) indicates that the following 3
variables are critical in classifying the injury severity of pedestrian crashes: 4
Light condition 5
Road class 6
Traffic control 7
Right shoulder width 8
Involvement of a commercial vehicle 9
Pedestrian age 10
Manner in which the vehicle(s) were moving prior to the first harmful event 11 12
Daylight conditions are associated with more pedestrian; however, the severity of the crash is 13
higher in dark conditions. When a pedestrian is struck at night, he or she is four times more 14
likely to be killed when compared to daylight conditions (13% vs.3%, Node 2 vs. Node 1). This 15
result is in agreement with a study on pedestrian crashes in North Carolina that found that dark 16
conditions (with and without streetlights) significantly increase the probability of fatal injury for 17
pedestrians (13). This could be a reflection of higher speeds at night, along with greater difficulty 18
in detecting pedestrians in dark conditions; hence, not being able to reduce the speed in time. 19
20
Under all light conditions, the probability of pedestrian crashes is higher on city streets, county 21
roads, and other lower speed roads that have segment-related traffic control device (e.g. warning 22
sign); whereas the severity of the crash is more on higher speed roads, i.e. Interstates, US & State 23
Highways, FM roads, and Tollways (25% vs. 6%, Node 4 vs. Node 3, and 13% vs. 3%, Node 14 24
vs. Node 13). This result was further investigated and is discussed in greater detail in 25
Iragavarapu et al. (14). Kim et al. (9) also found that freeway, U.S. route, and state route 26
increased the probability of fatal injury in pedestrian crashes, compared with local city streets 27
(13). 28
29
Younger (≤ 60 years) pedestrians are predicted to be involved in more crashes, whereas older (> 30
60 years) pedestrians are more likely to be killed when stuck by a vehicle (5% vs. 4%, Node 16 31
vs. Node 15, and 12% vs. 2%, Node 20 vs. Node 19). This finding is expected because in general 32
older pedestrians have lesser physical strength to cope with the injuries and are less agile to 33
escape to a safer location just before the crash. Kim et al. also found that older pedestrians are 34
more likely to sustain greater injury than younger pedestrians (13,15). Holubowycz found the 35
greatest fatality rates in pedestrians 75 years or older (16).This results supports the concept of a 36
“pedestrian airbag” technology which would improve the chances of surviving for pedestrians hit 37
by vehicles, more so for elderly pedestrians. Crandall et. al discuss this approach for pedestrian 38
safety in their 2002 paper (17). 39
40
Commercial vehicle involvement in a pedestrian crash is associated with a greater probability of 41
pedestrian fatality (10% vs. 2%, Node 12 vs. Node 11). This is obviously attributed to their 42
larger weight, longer stopping distances, higher bumper height, and blunt geometry, which has 43
also been documented in previous studies (13, 15, 18, 19). 44
45
Iragavarapu, Lord, Fitzpatrick 8
Locations with no traffic control device or intersection-related traffic control devices (e.g., 1
signal) are found to be associated with more number of pedestrian crashes; however, locations 2
with segment traffic control devices (e.g. warning sign or flagger) or both traffic control devices 3
(i.e., officer, flagman) are associated with higher proportion of fatal pedestrian crashes (15% vs. 4
4%, Node 8 vs. Node 7, and 6% vs. 2%, Node 6 vs. Node 5). This result is intuitive because 5
more pedestrians are expected to be present at intersections and the vehicle speeds are relatively 6
lower, when compared to mid-segment. 7
8
The results also show that on high speed roads, more crashes are expected at locations with right 9
shoulder width less than 8.5 feet (Node 9), when one of the vehicles involved is either going 10
straight or backing (Node 17). However, a higher proportion of fatal crashes on high speed roads 11
are at locations with right shoulder width more than 8.5 feet. 12
13
SUMMARY AND CONCLUSIONS 14 15
This study has documented a regression tree analysis for examining factors that influence crash 16
severity in crashes involving a pedestrian. A total of 34,620 pedestrian crashes that occurred 17
between 2007 and 2011 were analyzed. The regression tree provided satisfactory results. The 18
results (summarized in Table 5) were intuitive and consistent with the results of previous studies 19
on pedestrian crashes that used other analytical techniques, such as probabilistic models of crash 20
injury severity. The information provided in this study should help TxDOT and other 21
transportation agencies to better target their efforts for reducing the number and severity of 22
pedestrian collisions. 23
24
Iragavarapu, Lord, Fitzpatrick 9
Table 5. Summary of Classification Tree Result. 1
Significant Factor Classification Tree Relationship
Light condition When a pedestrian is struck at night, they are four times more likely to
be killed when compared to daylight conditions.
Road class When a pedestrian is struck on higher speed roads, they are four times
more likely to be killed when compared to lower speed roads.
Pedestrian age
Older (60+) pedestrians are six times more likely to be killed when
compared younger pedestrians when struck in daylight; at nighttime
there is not much difference.
Traffic control
When a pedestrian is struck at a location with segment traffic control
devices (e.g., warning sign or flagger) or generic traffic control devices
(e.g., officer), they are four times more likely in dark conditions and
three times more likely in daylight to be killed when compared to
location with no traffic control or intersection traffic control (e.g., stop
sign).
Right shoulder width
When a pedestrian is struck on high speed roads in dark conditions,
they are twice as likely to be killed if the right shoulder width is more
than 8.5 feet.
Involvement of a
commercial vehicle
When a pedestrian in daylight, they are five times as likely to be killed
if the crash involves a commercial vehicle.
Collision manner
On high speed roads with right shoulder width less than 8.5 feet,
pedestrian crashes in dark conditions are six times more likely when
one of the vehicles involved is either going straight or backing when
compared to vehicles turning left or right.
2
3
Iragavarapu, Lord, Fitzpatrick 10
ACKNOWLEDGMENTS
This paper is based on research sponsored by the Texas Department of Transportation (TxDOT)
and the U.S. Department of Transportation, Federal Highway Administration (FHWA). The
project was under the direction of Cary Choate of TxDOT. The research was performed at the
Texas A&M Transportation Institute. The contents of this report reflect the views of the authors,
who are responsible for the facts and the accuracy of the data presented herein. The contents do
not necessarily reflect the official views or polices of TxDOT.
REFERENCES
1 NHTSA Pedestrian Crash Fact Sheet. http://www-nrd.nhtsa.dot.gov/Pubs/811748.pdf Accessed
June 2014.
2 NHTSA Fatality Analysis Reporting System. http://www-fars.nhtsa.dot.gov. Accessed April
2013.
3 Fitzpatrick, K., V. Iragavarapu, M.A. Brewer, D. Lord, J. Hudson, R. Avelar, and J. Robertson.
Characteristics of Texas Pedestrian Crashes and Evaluation of Driver Yielding at Pedestrian
Treatments. TxDOT Report FHWA/TX-13/0-6702-1. May 2014.
4 Savolainen, P.T., Mannering, F.L., Lord, D., Quddus, M.A., 2011. The statistical analysis of
highway crash-injury severities: a review and assessment of methodological alternatives.
Accident Analysis and Prevention 43 (5), 1666 – 1676.
5 Kuhnert, P.M., K.-A. Do, and R. McClure. Combining non-parametric models with logistic
regression: an application to motor vehicle injury data. Computational Statistics and Data
Analysis, Vol. 34, No. 3, 2000, pp. 371-386.
6 Karlaftis, M.G., and I. Golias. Effects of road geometry and traffic volumes on rural roadway
accident rates. Accident Anaysis & Prevention, Vol 34, No. 3, 2002, pp.357-365.
7 Montella, A., M. Aria, A. D’Ambrosio, and F. Mauriello. Data-Mining Techniques for
Exploratory Analysis of Pedestrian Crashes. Transportation Research Record: Journal of the
Transportation Research Board, No. 2237, Transportation Research Board of the National
Academies, Washington, D.C., 2011, pp.107-116.
8 Chang, L., and W. Chen. Data Mining of Tree-Based Models to Analyze Freeway Accident
Frequency. Journal of Safety Research, Vol. 36, No. 4, 2005, pp. 365-375.
9 Chang, L., and H. Wang. Analysis of Traffic Injury Severity: An Application of Non-
Parametric Classification Tree Techniques. Accident Analysis & Prevention, Vol 38, No. 5,
2006, pp.1019-1027.
10 Chang, L-Y, and J-T Chien. Analysis of Driver Injury Severity in Truck-Involved Accidents
Using a Non-Parametric Classification Tree Model. Safety Science, Issue 51, 2013, pp. 17-
22.
11 Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression
Trees. Wadsworth International Group, Belmont, California, 1984.
12 Kashani, A.T., and A.S. Mohaymany (2011). Analysis of Traffic Injury Severity on Two-
Lane, Two-Way Rural Road Based on Classification Tree Models. Safety Science, Issue 49,
pp. 1314−1320.
Iragavarapu, Lord, Fitzpatrick 11
13 Kim, J., G. F. Ulfarsson, V. N. Shankar, and S. Kim. Age and Pedestrian Injury Severity in
Motor-Vehicle Crashes: A Heteroskedastic Logit Analysis. Accident Analysis and
Prevention, Vol. 40, No. 5, 2008, pp. 1695–1702.
14 Iragavarapu, V., S. H. Khazraee, D. Lord, and K. Fitzpatrick. Pedestrian Fatal Crashes on
Freeways in Texas. Manuscript submitted for 94th
Transportation Research Board Meeting in
Washington, D.C.
15 Kim, J., G. F. Ulfarsson, V. N. Shankar, and F. Mannering. A Note on Modeling Pedestrian
Injury Severity in Motor-Vehicle Crashes with the Mixed Logit Model. Accident Analysis
and Prevention, Vol. 42, No. 6, 2010, pp. 1751–1758.
16 Holubowycz, O. T. Age, Sex, and Blood Alcohol Concentration of Killed and Injured
Pedestrians. Accident Analysis and Prevention, Vol. 27, No. 3, 1995, pp. 417–422.
17 Crandall, J. R., K. S. Bhalla, and N. J. Madeley. Designing Road Vehicles for Pedestrian
Protection. British Medical Journal Vol. 324, No.7346, 2002, pp. 1145-1148.
18 Fitzpatrick, K., and E. S. Park. Safety Effectiveness of HAWK Pedestrian Treatment. In
Transportation Research Record: Journal of the Transportation Research Board, No. 2140,
Transportation Research Board of the National Academies, Washington, D.C., 2009, pp.
214–223.
19 Lefler, D. E., and H. C. Gabler. The Fatality and Injury Risk of Light Truck Impacts with
Pedestrians in the United States. Accident Analysis and Prevention, Vol. 36, No. 2, 2004, pp.
295–304.