analysis of injury severity in pedestrian ......classification and regression tree (cart) analysis...

ANALYSIS OF INJURY SEVERITY IN PEDESTRIAN CRASHES USING

CLASSIFICATION REGRESSION TREES

By

Vichika Iragavarapu, P.E.

(Corresponding author)

Assistant Research Engineer

Texas A&M Transportation Institute, 3135 TAMU

College Station, TX 77843-3135

Phone: 979/845-5686, fax: 979/845-6006

Email: [email protected]

Dominique Lord, Ph.D, P.Eng.

Associate Professor and Zachry Development Professor I

Zachry Department of Civil Engineering, 3136 TAMU

Texas A&M University


Phone: 979/458-3949, fax: 979/845-6481


And

Kay Fitzpatrick, Ph.D., P.E.

Senior Research Engineer

Texas A&M Transportation Institute, 3135 TAMU


Phone: 979/845-7321, fax: 979/845-6006


TOTAL WORDS: 6,164 [3664 Words, 9 Tables, 1 Figure (2500)]

Submitted to the Transportation Research Board 94th Annual Meeting

January 11-15, 2015, Washington D.C.

Iragavarapu, Lord, Fitzpatrick

ABSTRACT

Texas is considered to be an “opportunity” state by the Federal Highway Administration

(FHWA), due to the high number of pedestrian crashes. Data from the Fatality Analysis

Reporting System (FARS) show that the number of pedestrian fatal crashes in Texas is the third

highest in the U.S and is significantly higher than the national average. The research team

explored the Texas Department of Transportation (TxDOT) Crash Record Information System

(CRIS) database to identify characteristics of crashes involving pedestrians in Texas. A

Classification and Regression Tree (CART) analysis of all pedestrian crashes was conducted to

find the significant factors influencing the severity of crashes involving pedestrians in Texas.

The classification tree identified that light condition, road class, traffic control, right shoulder

width, involvement of a commercial vehicle, pedestrian age, and the collision manner, have the

most influence on the severity of pedestrian crashes.

Iragavarapu, Lord, Fitzpatrick 1

INTRODUCTION 1 2

Pedestrian crashes are a major safety issue; on average, a pedestrian is killed every two hours 3

and injured every eight minutes in traffic crashes in the United States (1). For the years 2007 4

through 2011, Texas had the third highest number of pedestrian fatal crashes in the U.S., with 5

about 400 pedestrian fatalities per year, which is equal to 15 percent of all traffic crash fatalities 6

in the US (2, 3). To understand the characteristics and factors influencing pedestrian crashes, the 7

Texas Department of Transportation (TxDOT) requested an in-depth analysis of these crashes. 8

Understanding of these relations will provide insight for the development of effective 9

countermeasures focused on pedestrians. The objective of this analysis was, therefore, to find the 10

significant factors and their interactions that influence the severity of crashes involving a 11

pedestrian in Texas. 12

13

14

STUDY METHODOLOGY 15 16

A variety of methodologies have been used to understand factors influencing crash severities. A 17

comprehensive review of these methodologies is provided in Savolainen et. al, 2011 (4). Data 18

mining techniques are widely applied in the areas of business, medicine, industry and 19

engineering and are gaining attention in the transportation safety area (5, 6, 7, 8, 9). The 20

classification-regression tree (CRT) methodology is a popular data mining technique that does 21

not need a specific functional form and is effective with large data sets containing a large number 22

of explanatory variables (10). 23

24

The CRT framework is based on the algorithm first proposed in Breiman et al. (11). The “root 25

node”, which is the node with all the data, is divided into two child nodes on the basis of an 26

independent variable (splitter) that creates the best homogeneity. This process is repeated for 27

each child node until all data in each node has the greatest possible homogeneity. This node is 28

called a “leaf node”. The most famous index for splitting of nominal data is the Gini index. For 29

each tree created, the “goodness of fit” index is calculated using the “misclassification error rate” 30

or “misclassification cost.” “Pruning” is performed according to the cost-complexity algorithm to 31

avoid over-fitting of the training data and to create an optimal tree. An optimal tree is the one 32

that has the least misclassification cost for the test data. Misclassification cost allows inclusion of 33

information about the relative penalty associated with incorrect classification and is inversely 34

proportional to the accuracy of prediction. Correct classifications always have a misclassification 35

cost of 0 (discussed further below). Importance of each variable is calculated using the variable 36

importance index and is scaled such that its summation is one. (12) 37

38

The equation used for the Gini index is (12): 39

40

(Gini Index) Gini(m) = ∑ ( | ) 1 41

42

( | ) ( )

( ) , ( )

( ) ( )

, ( ) ∑ ( )

43

44


Where, j is the number of target variables or classes, π(j) is the prior probability for class j, p(j|m) 1

is the conditional probability of a record being in class j, provided that it is in the node m, Nj(m) 2

is the number of records in class j of node m, Nj is the number of records of class j in the root 3

node, and Gini (m) or the Gini index is the indication of impurity in node m. The prior 4

probability shows the proportion of observations in each class in the population. (12) 5

6

Misclassification Error Rate = ∑ ( )[ ( )] 2 7

8

Where, p(m) is the proportion of existing observations in the terminal node or leaf m (from all 9

observations) and M is the number of terminal nodes. (12) 10

11

(Variable Importance Index) VIM(xj) = ∑

( ( ))

3 12

13

Where, Gini(S(xj,t)) is the reduction in the Gini index at node t that is achieved by splitting 14

variable xj,

is the proportion of the observations in the dataset that belong to node t, T is the 15

total number of nodes and N is the total number of observations.(12) 16

17

DATA 18 19

The TxDOT Crash Record Information System (CRIS) database has three subsets: crash, person, 20

and unit. The crash dataset contains information on the characteristics of the crash (e.g., date, 21

time, weather) and its location (e.g., intersection relation, surface condition, traffic control 22

devices). The person dataset has information on the characteristics of the people involved (e.g., 23

age, gender, blood alcohol content, etc.) and injuries sustained (e.g., fatal, no injury, etc.). The 24

unit dataset describes the characteristics of the units (e.g., type of vehicle, contributing factor for 25

unit) involved, along with the contributing factors for each unit involved. Each of these subsets 26

has different codes that distinguish crashes involving pedestrians from other reported crashes. 27

34,620 pedestrian crashes were available between 2007 and 2011 for use in this analysis. Table 1 28

lists the variables used in this analysis. 29

30

31


Table 1. Variables used in analysis. 1

Variable Description

Rpt_Road_Part_ID Reported roadway part on which the crash occurred

Wthr_Cond_Rev Weather condition at the time of crash (Revised to merge some categories)

Light_Cond_ID The type and level of light at the time of the crash

Road_Algn_ID Roadway Alignment at the crash site

Surf_Cond_ID The surface condition at the time and place of the crash

Traffic_Cntl__Rev Type of traffic control at the scene of the crash (Revised to merge some categories)

FHE_Collsn_Rev The manner in which the vehicle(s) were moving prior to the first harmful event (Revised

to merge some categories)

Othr_Factr_Rev Additional detail of events/circumstances concerning the crash (Revised to merge some

categories)

Road_Cls_ID The functional classification group of the priority road the motor vehicle(s) was traveling

on before the crash

Road_Relat_ID Where the crash occurred in relation to the roadway

Month Month of year when the crash occurred

Hour Time of day when the crash occurred

DriverAge Age of one of the drivers involved in the crash

DriverGender Gender of one of the drivers involved in the crash

PedestrianAge Age of one of the pedestrians involved in the crash

PedestrianGender Gender of one of the pedestrians involved in the crash

Hwy_Dsgn_Lane_ID Lane design on the applicable section of highway for crashes located on the state

highway system

Hwy_Dsgn_Hrt_ID Part of the Highway Design Code indicating HOV (High Occupancy Vehicle), railroads,

and toll roads (HRT) for crashes located on the state highway system

Hp_Shldr_Left

Width of inside shoulder on divided sections, or width of shoulder traveling in

descending marker direction, measured in feet, for crashes located on the state highway

system

Hp_Shldr_Right

Width of outside shoulder on divided sections, or width of shoulder traveling in

ascending marker direction, measured in feet, for crashes located on the state highway

system

Hp_Median_Width Median width plus both inside shoulders, measured in feet, for crashes located on the

state highway system

Nbr_Of_Lane Number of lanes, not including turning and climbing lanes, for crashes located on the


Shldr_Type_Left_ID Type of shoulder on the left side of the road, for crashes located on the state highway

system

Shldr_Type_Right_ID Type of shoulder on the right side of the road, for crashes located on the state highway

system

Median_Type_ID Median type description, for crashes located on the state highway system

Adt_Curnt_Amt Average daily traffic amount for a given road segment and year for crashes located on the


Trk_Aadt_Pct Adjusted average daily traffic percent for trucks for crashes located on the state highway

system

2


ANALYSIS 1 2

The analysis was performed with IBM SPSS Statistics 21 Decision Tree tool, using the tree 3

growing method of CRT (Classification-regression Tree). The response variable assessed in this 4

analysis was crash severity, which is defined as the level of injury sustained by the most severely 5

injured person involved in the crash. The crash severity variable used for developing the 6

classification tree for this analysis was categorized as either fatal or non-fatal. The tree depth was 7

restricted to five levels and impurity was measured with the Gini index. Minimum change in 8

impurity improvement was set at 0.0001. Seventy percent of the data was randomly assigned to 9

train the model and the remaining thirty percent was allocated to the test. The tree was pruned to 10

avoid over-fitting. 11

12

In the dataset used for this analysis, the number of non-fatal crashes was almost fifteen times 13

than that of fatal crashes (32,388 vs. 2,232) and the overall prediction accuracy for test sample 14

was 93.5%, whereas the prediction accuracy for fatal crashes was only 10.9%. To ensure the 15

same prediction accuracy for both severity levels, the prior probabilities are set equal so that the 16

target variable level that has a lower proportion is also taken into consideration in predictions. 17

Although this decreases the overall accuracy of the model, the prediction accuracy of the data 18

with the least proportion increases (12). Table 2 shows that with using equal prior probabilities 19

across all categories, the overall prediction accuracy of the model (for the training and the test 20

data) decreased slightly, but the prediction accuracy for fatal crashes improved tremendously 21

(74.3% vs. 10.9%). 22

23

Table 3 shows the misclassification costs and Table 4 shows the risk estimate for the model, 24

which is the proportion of cases incorrectly classified after adjustment for prior probabilities and 25

misclassification costs. As discussed above, correct classifications represented on the diagonal in 26

Table 3 have a misclassification cost of 0 27

28

29


1

Table 2. Prediction Performance for the CRT Model. 2

Sample Observed Crash Severity

Predicted Crash Severity

Non-Fatal Fatal Percent

Correct

Training

Non-Fatal 17826 4912 78.4%

Fatal 359 1224 77.3%

Overall Percentage 74.8% 25.2% 78.3%

Test

Non-Fatal 7575 2075 78.5%

Fatal 167 482 74.3%

Overall Percentage 75.2% 24.8% 78.2%

3

Table 3. Misclassification Costs for the CRT Model. 4

Observed Predicted

N Y

N 0.000 1.000

Y 1.000 0.000

5

Table 4. Risk Estimate of the Model. 6

Sample Estimate Standard Error

Training 0.221 0.005

Test 0.236 0.009

7

The classification tree generated (Figure 1) shows that the initial split at node 0 is based on the 8

variable of light condition, which implies that light condition is the best variable to classify and 9

predict pedestrian crash severity (fatal versus non-fatal). More (64%) pedestrian crashes are 10

predicted to occur in daylight, whereas a higher proportion of fatal crashes are predicted to occur 11

in dark conditions (13% vs. 3%, Node 1 vs. Node 2). Traffic control, road class, and pedestrian 12 age are selected to be the splitters more than once, implying that these variables have multiple 13

effects on the crash severity outcome. 14

15


Figure 1. Classification Tree.


DISCUSSION 1 2

The classification tree developed in this analysis (shown in Figure 1) indicates that the following 3

variables are critical in classifying the injury severity of pedestrian crashes: 4

Light condition 5

Road class 6

Traffic control 7

Right shoulder width 8

Involvement of a commercial vehicle 9

Pedestrian age 10

Manner in which the vehicle(s) were moving prior to the first harmful event 11 12

Daylight conditions are associated with more pedestrian; however, the severity of the crash is 13

higher in dark conditions. When a pedestrian is struck at night, he or she is four times more 14

likely to be killed when compared to daylight conditions (13% vs.3%, Node 2 vs. Node 1). This 15

result is in agreement with a study on pedestrian crashes in North Carolina that found that dark 16

conditions (with and without streetlights) significantly increase the probability of fatal injury for 17

pedestrians (13). This could be a reflection of higher speeds at night, along with greater difficulty 18

in detecting pedestrians in dark conditions; hence, not being able to reduce the speed in time. 19

20

Under all light conditions, the probability of pedestrian crashes is higher on city streets, county 21

roads, and other lower speed roads that have segment-related traffic control device (e.g. warning 22

sign); whereas the severity of the crash is more on higher speed roads, i.e. Interstates, US & State 23

Highways, FM roads, and Tollways (25% vs. 6%, Node 4 vs. Node 3, and 13% vs. 3%, Node 14 24

vs. Node 13). This result was further investigated and is discussed in greater detail in 25

Iragavarapu et al. (14). Kim et al. (9) also found that freeway, U.S. route, and state route 26

increased the probability of fatal injury in pedestrian crashes, compared with local city streets 27

(13). 28

29

Younger (≤ 60 years) pedestrians are predicted to be involved in more crashes, whereas older (> 30

60 years) pedestrians are more likely to be killed when stuck by a vehicle (5% vs. 4%, Node 16 31

vs. Node 15, and 12% vs. 2%, Node 20 vs. Node 19). This finding is expected because in general 32

older pedestrians have lesser physical strength to cope with the injuries and are less agile to 33

escape to a safer location just before the crash. Kim et al. also found that older pedestrians are 34

more likely to sustain greater injury than younger pedestrians (13,15). Holubowycz found the 35

greatest fatality rates in pedestrians 75 years or older (16).This results supports the concept of a 36

“pedestrian airbag” technology which would improve the chances of surviving for pedestrians hit 37

by vehicles, more so for elderly pedestrians. Crandall et. al discuss this approach for pedestrian 38

safety in their 2002 paper (17). 39

40

Commercial vehicle involvement in a pedestrian crash is associated with a greater probability of 41

pedestrian fatality (10% vs. 2%, Node 12 vs. Node 11). This is obviously attributed to their 42

larger weight, longer stopping distances, higher bumper height, and blunt geometry, which has 43

also been documented in previous studies (13, 15, 18, 19). 44

45


Locations with no traffic control device or intersection-related traffic control devices (e.g., 1

signal) are found to be associated with more number of pedestrian crashes; however, locations 2

with segment traffic control devices (e.g. warning sign or flagger) or both traffic control devices 3

(i.e., officer, flagman) are associated with higher proportion of fatal pedestrian crashes (15% vs. 4

4%, Node 8 vs. Node 7, and 6% vs. 2%, Node 6 vs. Node 5). This result is intuitive because 5

more pedestrians are expected to be present at intersections and the vehicle speeds are relatively 6

lower, when compared to mid-segment. 7

8

The results also show that on high speed roads, more crashes are expected at locations with right 9

shoulder width less than 8.5 feet (Node 9), when one of the vehicles involved is either going 10

straight or backing (Node 17). However, a higher proportion of fatal crashes on high speed roads 11

are at locations with right shoulder width more than 8.5 feet. 12

13

SUMMARY AND CONCLUSIONS 14 15

This study has documented a regression tree analysis for examining factors that influence crash 16

severity in crashes involving a pedestrian. A total of 34,620 pedestrian crashes that occurred 17

between 2007 and 2011 were analyzed. The regression tree provided satisfactory results. The 18

results (summarized in Table 5) were intuitive and consistent with the results of previous studies 19

on pedestrian crashes that used other analytical techniques, such as probabilistic models of crash 20

injury severity. The information provided in this study should help TxDOT and other 21

transportation agencies to better target their efforts for reducing the number and severity of 22

pedestrian collisions. 23

24


Table 5. Summary of Classification Tree Result. 1

Significant Factor Classification Tree Relationship

Light condition When a pedestrian is struck at night, they are four times more likely to

be killed when compared to daylight conditions.

Road class When a pedestrian is struck on higher speed roads, they are four times

more likely to be killed when compared to lower speed roads.

Pedestrian age

Older (60+) pedestrians are six times more likely to be killed when

compared younger pedestrians when struck in daylight; at nighttime

there is not much difference.

Traffic control

When a pedestrian is struck at a location with segment traffic control

devices (e.g., warning sign or flagger) or generic traffic control devices

(e.g., officer), they are four times more likely in dark conditions and

three times more likely in daylight to be killed when compared to

location with no traffic control or intersection traffic control (e.g., stop

sign).

Right shoulder width

When a pedestrian is struck on high speed roads in dark conditions,

they are twice as likely to be killed if the right shoulder width is more

than 8.5 feet.

Involvement of a

commercial vehicle

When a pedestrian in daylight, they are five times as likely to be killed

if the crash involves a commercial vehicle.

Collision manner

On high speed roads with right shoulder width less than 8.5 feet,

pedestrian crashes in dark conditions are six times more likely when

one of the vehicles involved is either going straight or backing when

compared to vehicles turning left or right.

2

3


ACKNOWLEDGMENTS

This paper is based on research sponsored by the Texas Department of Transportation (TxDOT)

and the U.S. Department of Transportation, Federal Highway Administration (FHWA). The

project was under the direction of Cary Choate of TxDOT. The research was performed at the

Texas A&M Transportation Institute. The contents of this report reflect the views of the authors,

who are responsible for the facts and the accuracy of the data presented herein. The contents do

not necessarily reflect the official views or polices of TxDOT.

REFERENCES

1 NHTSA Pedestrian Crash Fact Sheet. http://www-nrd.nhtsa.dot.gov/Pubs/811748.pdf Accessed

June 2014.

2 NHTSA Fatality Analysis Reporting System. http://www-fars.nhtsa.dot.gov. Accessed April

2013.

3 Fitzpatrick, K., V. Iragavarapu, M.A. Brewer, D. Lord, J. Hudson, R. Avelar, and J. Robertson.

Characteristics of Texas Pedestrian Crashes and Evaluation of Driver Yielding at Pedestrian

Treatments. TxDOT Report FHWA/TX-13/0-6702-1. May 2014.

4 Savolainen, P.T., Mannering, F.L., Lord, D., Quddus, M.A., 2011. The statistical analysis of

highway crash-injury severities: a review and assessment of methodological alternatives.

Accident Analysis and Prevention 43 (5), 1666 – 1676.

5 Kuhnert, P.M., K.-A. Do, and R. McClure. Combining non-parametric models with logistic

regression: an application to motor vehicle injury data. Computational Statistics and Data

Analysis, Vol. 34, No. 3, 2000, pp. 371-386.

6 Karlaftis, M.G., and I. Golias. Effects of road geometry and traffic volumes on rural roadway

accident rates. Accident Anaysis & Prevention, Vol 34, No. 3, 2002, pp.357-365.

7 Montella, A., M. Aria, A. D’Ambrosio, and F. Mauriello. Data-Mining Techniques for

Exploratory Analysis of Pedestrian Crashes. Transportation Research Record: Journal of the

Transportation Research Board, No. 2237, Transportation Research Board of the National

Academies, Washington, D.C., 2011, pp.107-116.

8 Chang, L., and W. Chen. Data Mining of Tree-Based Models to Analyze Freeway Accident

Frequency. Journal of Safety Research, Vol. 36, No. 4, 2005, pp. 365-375.

9 Chang, L., and H. Wang. Analysis of Traffic Injury Severity: An Application of Non-

Parametric Classification Tree Techniques. Accident Analysis & Prevention, Vol 38, No. 5,

2006, pp.1019-1027.

10 Chang, L-Y, and J-T Chien. Analysis of Driver Injury Severity in Truck-Involved Accidents

Using a Non-Parametric Classification Tree Model. Safety Science, Issue 51, 2013, pp. 17-

22.

11 Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression

Trees. Wadsworth International Group, Belmont, California, 1984.

12 Kashani, A.T., and A.S. Mohaymany (2011). Analysis of Traffic Injury Severity on Two-

Lane, Two-Way Rural Road Based on Classification Tree Models. Safety Science, Issue 49,

pp. 1314−1320.

http://www-nrd.nhtsa.dot.gov/Pubs/811748.pdf

http://www-fars.nhtsa.dot.gov/


13 Kim, J., G. F. Ulfarsson, V. N. Shankar, and S. Kim. Age and Pedestrian Injury Severity in

Motor-Vehicle Crashes: A Heteroskedastic Logit Analysis. Accident Analysis and

Prevention, Vol. 40, No. 5, 2008, pp. 1695–1702.

14 Iragavarapu, V., S. H. Khazraee, D. Lord, and K. Fitzpatrick. Pedestrian Fatal Crashes on

Freeways in Texas. Manuscript submitted for 94th

Transportation Research Board Meeting in

Washington, D.C.

15 Kim, J., G. F. Ulfarsson, V. N. Shankar, and F. Mannering. A Note on Modeling Pedestrian

Injury Severity in Motor-Vehicle Crashes with the Mixed Logit Model. Accident Analysis

and Prevention, Vol. 42, No. 6, 2010, pp. 1751–1758.

16 Holubowycz, O. T. Age, Sex, and Blood Alcohol Concentration of Killed and Injured

Pedestrians. Accident Analysis and Prevention, Vol. 27, No. 3, 1995, pp. 417–422.

17 Crandall, J. R., K. S. Bhalla, and N. J. Madeley. Designing Road Vehicles for Pedestrian

Protection. British Medical Journal Vol. 324, No.7346, 2002, pp. 1145-1148.

18 Fitzpatrick, K., and E. S. Park. Safety Effectiveness of HAWK Pedestrian Treatment. In

Transportation Research Record: Journal of the Transportation Research Board, No. 2140,

Transportation Research Board of the National Academies, Washington, D.C., 2009, pp.

214–223.

19 Lefler, D. E., and H. C. Gabler. The Fatality and Injury Risk of Light Truck Impacts with

Pedestrians in the United States. Accident Analysis and Prevention, Vol. 36, No. 2, 2004, pp.

295–304.

analysis of injury severity in pedestrian ......classification and regression tree (cart) analysis...

Documents