
2226 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015

Prediction of Railcar Remaining Useful Life by Multiple Data Source Fusion

Zhiguo Li and Qing He, Member, IEEE

Abstract—Nowadays, railway networks are instrumented with various wayside detectors. Such detectors, automatically identifying potential railcar component failures, are able to reduce rolling stock inspection and maintenance costs and improve railway safety. In this paper, we present a methodology to predict remaining useful life (RUL) of both wheels and trucks (bogies) by fusing data from three types of detectors: wheel impact load detectors, machine vision systems, and optical geometry detectors. A variety of new features is created from feature normalization, signal characteristics, and historical summary statistics. Missing data are handled by missForest, a Random Forests-based nonparametric missing value imputation algorithm. Several data mining techniques are implemented and compared to predict the RUL of wheels and trucks in a U.S. Class I railroad network. Numerical tests show that the proposed methodology can accurately predict the RUL of railcar components, particularly in a middle-term range.

Index Terms—Missing value imputation, predictive model, rail wayside detectors, random forests, remaining useful life, rolling stock failure.

I. INTRODUCTION AND PROBLEM STATEMENT

SYSTEM failures of rolling stock components are crucial to railway safety and efficient operations. Analysis of Federal Railroads Administration (FRA) accident data for the period 1999 to 2008 identified that the four largest US Class I railroads spent an average of about $38 million per year on equipment-caused mainline derailments [1]. Toward effective equipment inspection, the railroad industry has developed wayside detectors to monitor the conditions of railcar components. Such systems employ a variety of sensing technologies to measure force, heat, sound, geometry, and so on. The technological maturity of these systems has also improved from the 1930s to the 2000s [2], while new technologies are still under development. The North American railroad industry is increasingly moving to wayside detection to reduce rolling stock inspection and maintenance costs [3]. Detailed surveys of these and other wayside inspection technologies have been conducted in previous studies

Manuscript received April 9, 2014; revised August 29, 2014 and November 23, 2014; accepted February 1, 2015. Date of publication February 20, 2015; date of current version July 31, 2015. The Associate Editor for this paper was H. Dong. (Corresponding author: Qing He.)

Z. Li is with the Business Solutions and Mathematical Sciences Department, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA (e-mail: [email protected]).

Q. He is with the Department of Civil, Structural, and Environmental Engineering and the Department of Industrial and Systems Engineering, University at Buffalo, The State University of New York, Buffalo, NY 14260 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITS.2015.2400424

TABLE I
TWO LEVELS OF WHEEL WEAR LIMITS IN THE US

Fig. 1. Illustrations of some MV measurements for wheels.

[4]–[9]. Through automated wayside detection, the railroad industry is expected to: i) significantly reduce inspection efforts; ii) perform more efficient fleet maintenance focused on specific systems or components directly related to measured poor performance; iii) decrease service disruptions and delays, with appropriate labor and materials provided for necessary repairs; and iv) provide insights into the root causes of poor or degraded performance [3].

Unplanned failures, such as derailments, are significantly costly to railway operations. Prediction of railcar defects gives operators sufficient time to inspect potential failures before they actually occur. In the United States, existing practice of failure prediction and prevention is managed by various alarms. Such alarms are created based on two levels of existing regulations, defined by the FRA and the Association of American Railroads (AAR), respectively. For example, there are differences in allowable wheel profile tolerances between FRA and AAR rules. Table I summarizes the limits of some wheel profile parameters (shown in Fig. 1), defined by both the FRA and the AAR. Comparing the machine vision (MV) readings to existing FRA and AAR regulations will produce two levels of violations: FRA-level and AAR-level violations.

1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


AAR-level violations are handled at the train termination point, whereas FRA-level violations need to be handled immediately, as soon as the inspection is carried out. Such rule violations will lead to a severe interruption of existing rail operations. Therefore, railroads have developed their own rules to generate different levels of alarms before any violation occurs. Most of such rules are composed of threshold-based combined conditions and generated from railcar repair experience. They are effective when the component is about to fail, because railroads strive to maintain a very low false alarm rate to reduce inspection costs. In addition, some forecasting models for investigating the behavior of railway components have been developed based on intelligent monitoring systems (e.g., [10]). However, such rules and models do not provide the remaining useful life (RUL) of a particular component whose measurements are close to, but still under, the thresholds. If the exact RUL of each component were known, a railroad could make better use of its resources to maintain the rolling stock in a state of good repair, with a reduced number of maintenance “set-outs” and increased rail network average velocity.

In recent years, RUL prediction has drawn intensive attention, and different methods and techniques have been developed in academia and industry [11]–[14]. A recent review can be found in [14], where the authors classified RUL prediction approaches into four major categories: experimental methods, data-driven methods, physical model-based methods, and hybrid methods. Due to the rapid development of cyberinfrastructure and sensing technology, an abundance of data from the service and maintenance of engineering systems is now readily available; thus, data-driven methods have been developed and widely used for RUL prediction, especially in cases where physical models of the failure mechanisms are difficult, if not impossible, to derive. Recent examples of data-driven approaches include [15] and [16], where nonlinear Support Vector Regression (SVR) models were trained to estimate and predict the wear level of cutting tools and the RUL of rolling element bearings, respectively.

To avoid failures/faults, accurate prediction of RUL on the basis of the current equipment conditions and operation history is essential for making decisions on preventive maintenance or condition-based maintenance. RUL can be defined as a conditional random variable

T − t | T > t, X(t) (1)

where T represents the time to failure, t is the current working age, and X(t) denotes the historical operation profile up to the current time (condition monitoring data). Two major data types used in the prediction of RUL are event data and condition monitoring data [17]. Usually the event data record information on failure/fault events, and thus failure times (i.e., time-to-failure) can be extracted from the data. The usage conditions and environmental stresses (working load, pressure, temperature, vibration, etc.) are recorded in the condition monitoring data. Being able to assess the conditions of the equipment is the basis for predicting potential failures and RUL.

Among all defect types, wheel failures cause about one half of all train derailments. Given the massive amount of data collected from electronic wayside detectors, railcar failure prediction has recently attracted great attention. In the literature, there are two general methods to predict railcar component failures: (1) multi-body simulation (MBS) of railway vehicle dynamics, and (2) statistical learning and data mining based methods. In the first category of “vehicle dynamics,” Braghin et al. [18] proposed a wheel wear prediction model composed of a simulation of vehicle dynamics, a local wheel-rail contact model, and a local wear model. The proposed model has been validated through comparison with full-scale experimental tests carried out on a single mounted wheelset under laboratory conditions. Pombo et al. [19] developed an estimation tool consisting of a sequence of pre- and post-processing packages, in which the presented methodologies are implemented and interfaced with commercial multi-body software. The disadvantage of such MBS-based models is that they require complex computation to test even one scenario, so it is difficult to apply them to a variety of railcar equipment types and rail conditions in real time. In the second category of “data mining,” Yang and Létourneau [20] first applied data mining based methods for prognostics of train wheel failures using measurements from WILD detectors. They proposed a Multiple Classifier System (MCS) capable of predicting 97% of wheel failures while maintaining a reasonable false alert rate (8%). However, that study only predicts the occurrence of wheel failures without calculating the occurrence times, and only a single detector type was considered in the modeling. Hajibabai et al. [21] developed a logistic regression model to classify wheel failures based on WILD and Wheel Profile Detector (WPD) data. They claimed a classification accuracy of 90% with a 10% false alarm rate. In that study, only wheels within 30 days of a train stop were classified as “bad wheels,” so their proposed model cannot predict potential bad wheels with time-to-failure values greater than 30 days. Also, neither truck measurements nor truck failures are taken into account in that work.

As one can see, previous studies have shown that wayside detectors can successfully predict the occurrence of railcar component failures. However, some significant challenges remain unaddressed:

1. Underlying sophisticated correlations between different component failures are still not disclosed. According to the measured components, traditional wayside detection data can be classified into a number of different categories: trucks, axles, wheels, and bearings. Among all railcar components, wheels, journal bearings, and truck components represent the greatest derailment risk and account for approximately 78% of all equipment-caused derailment costs [1]. However, these components are not isolated from each other. It is well known in the field that a defective wheel on one side can easily damage a good wheel on the other side, so wheels are usually replaced in sets of two in the repair workshop, even if only one defective wheel is found. Furthermore, defective wheels not only result in “hot” journal bearings but also eventually lead to truck hunting. Also, a misaligned truck can cause asymmetric wheel wear, which dramatically shortens the life of a new set of wheels. Therefore, some different types of failures are strongly correlated with each other, and it is more reasonable to combine them for predictive analytics than to treat them separately.

2. Large measurement variations are found in the datasets. Measurement variations refer to discrepancies among readings of the same component due to both internal factors (e.g., axle load and brake status) and external factors (e.g., temperature, weather, and humidity). Under these varying circumstances, simple threshold-based rules can generate many false alarms. For example, rim thickness measurements from MV are very precise and hence very volatile. During winter, snow on the wheel will affect the accuracy of the rim thickness readings over time without any underlying cause or wheel dimension change.

3. The entire wayside detection system generates sparse, heterogeneous, and irregular data due to a limited number of detectors in the network. Each installation of a wayside detection system can cost up to one million dollars [2]; therefore, wayside detectors are typically sparsely located (except for hot bearing detectors). Moreover, since the locations of wayside detectors are not evenly distributed in the network, the number of readings varies widely across trains. Some trains that commute on a detector-monitored line segment generate a large amount of data, while others have very few readings. If we track one railcar traveling through the network, its detection data are irregular and heterogeneous: the interval between two readings of the same component could be less than one day or more than six months. These irregular measurements may lead to a large number of missing values in the measurement data, so handling missing data becomes essential for the analysis.

To address these problems, this paper aims to predict the RUL of both railcar wheels and trucks from three types of wayside detectors, namely the Wheel Impact Load Detector (WILD), Machine Vision (MV) systems, and the Optical Geometry Detector (OGD). In this study, we utilize data-driven approaches for RUL prediction. In summary, the contributions of this paper are as follows:

• fusing measurements from multiple types of wayside detectors, including WILD, MV, and OGD;

• handling irregular and heterogeneous measurements with large variations;

• imputing missing data due to a limited number of readings; and

• predicting the exact RUL of both trucks and wheels.

The remainder of the paper is organized as follows. Section II briefly reviews the wayside detectors and the measurement data. In Section III, the methodology developed for RUL prediction is illustrated, and the Random Forests-based missing value imputation method and predictive models are introduced. After that, we discuss the data preprocessing and missing value imputation in Section IV. The experimental results for RUL prediction using state-of-the-art regression models are presented and compared in Section V, and the results are discussed in Section VI. Finally, we summarize the study and discuss future work directions in Section VII.

TABLE II
KEY MEASUREMENT FEATURES OF WILD, MV, AND OGD

II. DETECTORS AND DATA OVERVIEW

In this section, we introduce the three types of wayside detectors used for RUL prediction as well as the detector measurement data. The key measurement features from MV, OGD, and WILD are presented in Table II.

WILD uses strain-gauge-based technologies to measure the performance of a railcar's wheels and trucks in a dynamic mode. WILD is built into the track to detect defective wheels, which can be distinguished from normal wheels by their high dynamic weights. WILD weighs each wheel on the train several times as the wheel passes the detector over a certain distance. The system reports features describing the dynamic impact load at the wheel level, such as vertical KIPS (a kip is a thousand pounds of force) and lateral KIPS. In particular, whenever a wheel's impact reaches 140 kips or more, the train is immediately stopped and the wheel must be replaced at the nearest siding [20]. MV technology employs real-time video imaging to capture parameter measurements of rolling stock components, such as wheels, brakes, springs, draft gears, and so on. The key wheel features captured by MV include flange height, flange thickness, rim thickness, wheel diameter, etc. Some of the limits have been summarized in Table I. OGD is a laser-based monitoring system that measures the performance of a railcar's axle and wheel suspension assembly, commonly known as the "truck." As a train passes the OGD on tangent track, the system identifies trucks that have become warped or misaligned and reports tracking error, truck rotation, and inter-axle misalignment and shift, as shown in Fig. 2. Also, OGD measures the amplitude and frequency of the waveforms developed by the axles passing the detector, which can be treated as useful information for identifying truck hunting.

Fig. 2. Illustrations of some OGD measurements for trucks.
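The 140-kip condemnation rule cited above amounts to a simple threshold check on WILD impact readings. A minimal sketch follows; the record layout, wheel IDs, and reading values are illustrative assumptions, not the railroad's actual schema:

```python
# Sketch of the 140-kip WILD stop rule; all values below are hypothetical.
readings = [
    {"wheel_id": "W1", "vertical_kips": 92.0},
    {"wheel_id": "W2", "vertical_kips": 143.5},
    {"wheel_id": "W3", "vertical_kips": 61.2},
]

CONDEMN_KIPS = 140.0  # at or above this impact, the train is stopped

# Wheels meeting the condemnation threshold must be replaced at the
# nearest siding before the train proceeds.
flagged = [r["wheel_id"] for r in readings
           if r["vertical_kips"] >= CONDEMN_KIPS]
```

In practice such threshold rules are combined with other conditions, as noted in Section I, to keep the false alarm rate low.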

About 500 GB of raw wayside detection data were collected on a US Class I rail network over a period of 27 months, from January 2010 to March 2012. After initial data exploration, we found that features from wheels and trucks are correlated with each other. For example, flange thickness is negatively correlated (the correlation coefficient is −0.52) with the tracking error of the corresponding truck readings. According to the measured components, the datasets can be divided into five levels: i) train level, ii) equipment (railcar) level, iii) truck level, iv) axle level, and v) wheel and bearing level. On average, one train-level record generates 79 equipment records, 175 truck records, 413 axle records, and 826 wheel and bearing records, respectively. Therefore, one may predict potential failures for each wheel, axle, truck, railcar, or even the whole train. It is critical to conduct failure prediction at a particular component level in order to maximize the benefit and performance of the approach. The best trade-off maximizes the usefulness of the information for the target organization and the performance of the approach (false errors, rate of detection, etc.) [20]. This study selects the truck level for prediction; therefore, the model will predict whether a given truck is likely to suffer a truck/wheel failure. Individual wheel-level predictions are not required, because defective wheels can be easily identified given an exact truck ID. Therefore, wheel and axle data are eliminated from the analysis, and thus the size of the entire dataset is reduced dramatically.
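The wheel-truck correlation mentioned above is a plain Pearson correlation on merged readings. A sketch on made-up numbers (column names are illustrative; the −0.52 figure comes from the authors' data, not this toy set):

```python
import pandas as pd

# Hypothetical merged truck-level readings; column names are illustrative,
# not the railroad's actual schema.
df = pd.DataFrame({
    "flange_thickness": [1.30, 1.25, 1.18, 1.10, 1.02, 0.95],
    "tracking_error":   [0.10, 0.14, 0.13, 0.19, 0.22, 0.25],
})

# Pearson correlation between a wheel feature (MV) and a truck feature (OGD)
r = df["flange_thickness"].corr(df["tracking_error"])
```

A strongly negative `r` on such merged data is what motivates fusing wheel and truck features in one model rather than treating them separately.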

In addition to detection data, bad order data and teardown data are also collected for failure validation. When potential major defects are identified, bad order data consisting of order dates and types are generated, either by wayside detectors or by visual inspections at a yard or terminal. Once a bad order is issued, the car is scheduled to be separated from the train and further checked or repaired in the workshop, where repair actions and reasons are recorded in the teardown data. Such data can be used to validate whether the bad order was generated by a false alarm and which subsystems/components were actually repaired or replaced.

Fig. 3. Flow chart of the proposed methodology for RUL prediction.

III. METHODOLOGY

In this section, we discuss the methodology used for predicting the RUL of railcars. Fig. 3 illustrates the flow chart of the proposed method.

The first step in the method is integrating the data sources and then extracting the data subset of bad trucks/wheels. The existence of a large number of missing values in the data requires an appropriate missing value imputation approach, because statistical and data mining algorithms often depend on a complete dataset. Here we choose missForest [24], a Random Forests-based nonparametric method that can deal with missing values in different variable types (i.e., continuous and categorical variables). The details and advantages of the missForest and Random Forests methods are briefly described in the following subsections.

Next, we create three types of new features from the original ones through feature normalization, detector signal characteristics extraction, and summary statistics generation. We need to normalize some measurement features because these measurements vary over a large range with different train speeds and truck weights, especially for WILD. Once the normalized features are generated, the corresponding original ones are removed from the feature set. It is common for collected detector signals to contain a large amount of noise, so the slopes and smoothed trends of the measurement signals may represent the signal characteristics better and thus can serve as important predictors of RUL. Moreover, to investigate the effects of the detector signals in a preceding time period (e.g., 90 days), we also calculate summary statistics for the original signals. These summary statistics include, for example, the average, median, maximum, minimum, and 90th percentile of a measurement feature.
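The three feature types described above can be sketched with pandas on hypothetical WILD readings for a single wheel. The column names, the weekly reading cadence, and the speed-based normalization are illustrative assumptions, not the paper's exact transformations:

```python
import numpy as np
import pandas as pd

# Synthetic WILD readings for one wheel: a noisy impact signal with a
# slow upward drift, plus the train speed at each reading.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=n, freq="7D"),
    "vertical_kips": 40 + rng.normal(0, 5, n) + np.linspace(0, 20, n),
    "speed_mph": rng.uniform(20, 60, n),
}).set_index("date")

# 1) Normalization: remove the effect of train speed on the impact reading
df["kips_per_mph"] = df["vertical_kips"] / df["speed_mph"]

# 2) Signal characteristics: smoothed trend and local slope of the signal
df["kips_smooth"] = df["vertical_kips"].rolling(5, min_periods=1).mean()
df["kips_slope"] = df["kips_smooth"].diff() / 7.0  # kips per day

# 3) Summary statistics over a preceding 90-day window
win = df["vertical_kips"].rolling("90D")
df["kips_mean_90d"] = win.mean()
df["kips_max_90d"] = win.max()
df["kips_p90_90d"] = win.quantile(0.9)
```

The time-offset window (`"90D"`) handles the irregular reading intervals discussed in Section I, since it is defined in days rather than in a fixed number of rows.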

With the whole feature set consisting of original signals, normalized signals, extracted signal characteristics, and summary statistics, the predictive model is built for RUL prediction. In this study, we fit several popular statistical and machine learning regression models to the training data and compare their performance on the test data. First, regression models are built using the full feature set in the training data. To avoid model overfitting, which leads to poor predictive performance (i.e., large prediction errors) on the test data, statistically significant features are then determined using feature selection approaches. With the significant features, the final models are generated by re-fitting the regression models to the training data and are then used to predict the RUL values on the test data. It turns out that Random Forests and Quantile Regression Forests have similar prediction accuracy and outperform the other models used in this study. The details of the experiments are discussed in Section V.

Fig. 4. Diagram for random forests.

A. Random Forests

Random Forests (RF) is a combination of tree predictors (i.e., a collection of CART-like trees [22]) for data exploration, data analysis, and predictive modeling. Compared with decision trees, RF is an ensemble learning method that maintains the advantages of decision trees while increasing prediction accuracy. A detailed description of RF can be found in [23]. Given p features (predictor variables), the Random Forests algorithm is illustrated in Fig. 4:

S1. Draw ntree bootstrap samples from the training data.
S2. Grow an unpruned decision tree for each sample: at each node, randomly sample mtry features (mtry ≤ p) and choose the best split from the sampled features.
S3. Predict new test data using each of the ntree trees and then aggregate the results (i.e., take the average for regression).
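Steps S1-S3 map directly onto scikit-learn's `RandomForestRegressor`, with ntree as `n_estimators` and mtry as `max_features`. A toy-data sketch (the data here are synthetic, not the railcar measurements):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data with p = 10 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.1, 200)

rf = RandomForestRegressor(
    n_estimators=100,   # ntree bootstrap samples / trees (S1)
    max_features=3,     # mtry features tried at each split (S2)
    random_state=0,
).fit(X, y)

# S3: each tree predicts, and the forest averages the tree outputs.
preds = rf.predict(X[:5])
tree_avg = np.mean([t.predict(X[:5]) for t in rf.estimators_], axis=0)
```

The last two lines make S3 explicit: the forest's prediction equals the average of the individual tree predictions.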

From the algorithm, we can see that two types of randomness are contained in RF: each tree is grown on a different random subsample of the training data, and each node is split using a randomly chosen subset of features at that node. These result in the following advantages of RF:

• Random Forests is less prone to overfitting. The testing performance of Random Forests does not decrease (due to overfitting) as the number of trees increases [23]; hence, after a certain number of trees, the performance tends to settle at a certain value. In contrast, decision trees can overfit the training data and thus generate poor results when applied to the test data.

• RF is not very sensitive to outliers in the training data.

• Feature importance can be estimated during training, and the importance rank can be used for feature selection.

• Cross-validation is unnecessary, as the method generates an internal unbiased estimate of the generalization error (the Out-Of-Bag estimate of error rates).

• RF usually has about the same accuracy as SVMs and neural networks but can handle larger numbers of predictors. Further, RF has fewer parameters and is faster to train.

The Out-Of-Bag (OOB) estimate of error rates can be obtained from the training data as follows:

1. Each bootstrap sample is split into InBag data and OOB data;
2. For each bootstrap sample, predict the OOB data using the tree grown with the InBag data;
3. Finally, aggregate the predictions for the OOB data from the ntree trees. Calculate error rates using the observed and predicted values of the OOB data; this is the OOB estimate of error rates.
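In scikit-learn, the same OOB estimate is available by setting `oob_score=True`; a toy sketch (note the library reports R² on the OOB predictions rather than a raw error rate):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic linear data; each sample's OOB prediction comes only from
# trees whose bootstrap sample excluded it, so no separate validation
# set is needed.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           random_state=0).fit(X, y)
oob_r2 = rf.oob_score_          # R^2 computed on the OOB predictions
oob_pred = rf.oob_prediction_   # per-sample OOB predictions
```

Comparing `oob_r2` with the in-sample score gives a quick overfitting check without cross-validation, which is the advantage cited in the bullet list above.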

B. MissForest

MissForest is a Random Forests-based nonparametric imputation method that can effectively handle mixed-type data and model interactive and nonlinear effects of the features. MissForest is an iterative imputation scheme: the method first trains an RF model on observed values, then predicts the missing values, and then proceeds iteratively. Readers are referred to [24] for more details of the approach and the experimental results. Here we briefly review the algorithm.

Given an n × p matrix X = [X1, X2, ..., Xp], where n is the number of cases (records) and p is the number of predictor variables, we first make an initial guess for the missing values in X using mean imputation or another imputation method. Then the variables Xj, j = 1, 2, ..., p, are sorted in ascending order of their missing value proportions. Each variable Xj can be divided into Xj^mis and Xj^obs, where mis stands for missing values and obs stands for observed values, and the row indices of the missing values in Xj are Ij^mis ⊆ {1, 2, ..., n}. The other features X(−j), where (−j) means the jth variable is excluded from X, can be separated correspondingly into X(−j)^mis and X(−j)^obs, with X(−j)^mis = X(−j)[Ij^mis, :]. For each feature Xj, a random forest is fit to the data with response Xj^obs and predictors X(−j)^obs; the missing values are then imputed by the predictions of the trained random forest applied to X(−j)^mis. This repeated imputation procedure stops when the stopping criterion is met.
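For continuous features, this loop can be approximated with scikit-learn's `IterativeImputer` using a random forest as the per-variable regressor. The reference implementation is the R missForest package; the sketch below is an approximation on synthetic data, not the paper's pipeline:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with one strongly correlated column, then 10% of all
# entries deleted at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] * 2 + rng.normal(0, 0.1, 200)

X_miss = X.copy()
mask = rng.random(X.shape) < 0.1
X_miss[mask] = np.nan

# Iteratively regress each variable on the others with a random forest
# and refill the missing cells, starting from a mean-imputation guess.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    initial_strategy="mean",
    max_iter=5, random_state=0,
)
X_filled = imputer.fit_transform(X_miss)
```

Unlike missForest proper, `IterativeImputer` does not handle categorical variables, which is one reason the authors use missForest for the mixed-type detector data.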

Compared to other prevalent imputation methods, such as k-nearest neighbors (KNNimpute) [25], the saturated multinomial model [26], and multivariate imputation by chained equations (MICE) [27], missForest does not need to make distributional assumptions about the data or feature subsets. Inherited from RF, missForest can successfully handle mixed-type missing data imputation, and it outperforms other methods especially on high-dimensional data with complex interactions and nonlinear relations. Furthermore, the computational efficiency of missForest is high.

C. Quantile Regression Forests

Quantile Regression Forests (QRF) is a Random Forests-based regression model that can be used to estimate conditional quantiles for high-dimensional predictor variables. While RF provides an accurate estimate of the conditional mean of the target variable in regression, Meinshausen [28] shows that the full conditional distribution of the target can be obtained through RF, and thus the quantiles of this distribution can be inferred. The quantiles can be used to detect outliers, obtain prediction intervals for the predicted conditional mean from RF, etc.

QRF can be viewed as a marriage between quantile regression [29] and Random Forests. In general, the α-quantile is defined by

Qα(x) = inf{y : F(y|X = x) ≥ α}    (2)

where X is an n × p matrix containing all the predictor variables, Y is the response variable, F(·) is the distribution function, and 0 ≤ α < 1. Estimates of the conditional quantiles Qα(x) are obtained by plugging the estimate of F(y|X = x) into Eq. (2). In QRF, all observations in every leaf of every tree are used to infer the distribution, whereas RF only takes their average.

In the context of condition-based maintenance, we may use QRF to estimate quantiles of RUL on the lower side of the distribution, e.g., the 10th or the 25th percentile. Such quantiles may help railroads better adjust the conservativeness of their maintenance strategy.
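To illustrate the idea, the Python sketch below pools the training targets that share a leaf with a query point across all trees and reads off empirical quantiles. Meinshausen's estimator uses per-observation weights, and the study works in R, so this is a simplified approximation on invented data, not the paper's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = X_train[:, 0] * 20 + rng.normal(0, 10, size=500)  # noisy toy "RUL"

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                           random_state=0).fit(X_train, y_train)

def rf_quantile(rf, X_train, y_train, x_query, alpha):
    """Empirical alpha-quantile of the training targets that fall in the
    same leaf as x_query, pooled over all trees (a simplified QRF)."""
    train_leaves = rf.apply(X_train)      # (n_train, n_trees) leaf indices
    query_leaves = rf.apply(x_query)[0]   # (n_trees,) leaf of the query
    pooled = np.concatenate([
        y_train[train_leaves[:, t] == query_leaves[t]]
        for t in range(len(rf.estimators_))
    ])
    return np.quantile(pooled, alpha)

x0 = np.array([[5.0]])
q10 = rf_quantile(rf, X_train, y_train, x0, 0.10)  # conservative estimate
q50 = rf_quantile(rf, X_train, y_train, x0, 0.50)  # median, as used later
assert q10 < q50
```

A railroad preferring a conservative policy would act on the lower quantile (q10) rather than the median.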

IV. DATA PREPROCESSING AND MISSING VALUE IMPUTATION

In this study, we work on a raw multi-detector dataset including MV, OGD, and WILD data, bad order data, and teardown data collected over a rail network from January 2010 to March 2012. As the first step, we merge these three data sources to extract the bad truck/wheel data for RUL prediction. Because we aim to calculate RUL values, only those trucks with truck/wheel repair or replacement records were identified and stored in the integrated dataset. For a given date, as long as there is one record from the detectors (MV, OGD, or WILD) or from the bad order data validated by the teardown data, a row of observations is generated in the dataset. The blank cells in the MV, OGD, or WILD measurements in this row are then labeled as missing values.

TABLE III
NRMSE OF MISSFOREST AND MICE FOR THE DETECTOR DATA

TABLE IV
NEW FEATURES GENERATED FOR RUL MODELING

Next, since the date when a bad order (validated by the teardown data) is generated for a wheel/truck is considered its end-of-life or failure time, the RUL value for each record of this truck/wheel is calculated as the difference (in days) between the date of the record and the next bad order date. The regression models fitted in this study use RUL values as the response variable. Note that a single truck ID in a car can have multiple bad truck/wheel failures. For a truck ID, the dataset between two adjacent failure dates is called a data segment. The period from the beginning of the records to the first failure date is also marked as a data segment.
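A minimal pandas sketch of this labeling step, on hypothetical records for one truck ID (the column names are invented), backward-fills the next validated failure date to obtain the RUL in days and counts data segments:

```python
import pandas as pd

# Hypothetical integrated dataset: one row per record date for a truck,
# with a flag marking validated bad-order (failure) dates.
df = pd.DataFrame({
    "truck_id": ["T1"] * 5,
    "date": pd.to_datetime(["2010-01-01", "2010-02-15", "2010-04-01",
                            "2010-05-10", "2010-07-01"]),
    "is_failure": [False, False, True, False, True],
})

def label_rul(g):
    g = g.sort_values("date").copy()
    # Next failure date at or after each record: backward-fill failures.
    fail_dates = g["date"].where(g["is_failure"]).bfill()
    g["rul_days"] = (fail_dates - g["date"]).dt.days
    # Segment index: 0 up to the first failure, 1 between failures, ...
    g["segment"] = g["is_failure"].shift(fill_value=False).cumsum()
    return g

labeled = df.groupby("truck_id", group_keys=False).apply(label_rul)
print(labeled[["date", "rul_days", "segment"]])
```

Failure rows get an RUL of 0; records with no later failure would get a missing RUL and fall outside any complete segment.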

As stated above, a large number of missing values exist in the detector measurement data for the aforementioned reasons. We impute the missing detector measurements by truck ID using the missForest method. To assess the performance of missForest on the railcar detector data, we randomly introduce missing values at a specified proportion. That is, we first obtain a complete subset of the measurement data and then randomly replace entries in this subset up to a specified amount. For missing value proportions of 0.3 to 0.7, Table III compares the performance of missForest and MICE [27] in terms of the normalized root mean squared error (NRMSE) [30], which is widely used for comparing missing value imputation methods [24].

The definition of NRMSE for continuous variables is

NRMSE = √( mean((M − M̃)²) / var(M) )    (3)

where M is the complete data set and M̃ is the imputed data set. Note that mean(·) and var(·) are the empirical mean and variance computed over the missing values only. An NRMSE close to 0 indicates good imputation performance, while poor performance yields an NRMSE around 1. From Table III, we can see that missForest outperforms MICE for all missing value proportions and shows an acceptable NRMSE level (much less than 1) even when the missing value proportion is as high as 70%. In the experiments, we remove truck IDs with a total missing value proportion greater than 60% to improve the prediction power.
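Eq. (3) is straightforward to compute; a small NumPy check on synthetic data, with the mean and variance restricted to the artificially masked entries, behaves as described (0 for a perfect imputation, about 1 for naive mean imputation):

```python
import numpy as np

def nrmse(true_vals, imputed_vals, miss_mask):
    """NRMSE of Eq. (3), computed over the masked (missing) entries only."""
    diff = true_vals[miss_mask] - imputed_vals[miss_mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(true_vals[miss_mask]))

rng = np.random.default_rng(0)
M = rng.normal(size=(100, 5))
mask = rng.random(M.shape) < 0.3       # 30% artificially missing

M_perfect = M.copy()                   # perfect imputation -> NRMSE 0
M_mean = M.copy()
M_mean[mask] = M[~mask].mean()         # naive mean imputation -> NRMSE near 1

assert nrmse(M, M_perfect, mask) == 0.0
assert 0.8 < nrmse(M, M_mean, mask) < 1.2
```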

Fig. 5. Example of a fitted linear regression line and cubic smoothing spline. The smoothing spline has 3 degrees of freedom.

With the complete data set, we can now generate the new features listed in Table IV. First, we normalize the WILD features by dividing the measurements by car speed, by truck weight, and by both, respectively. The original WILD measurements will be excluded from predictive model building. By fitting linear regression lines and cubic smoothing splines [31] to the original MV and OGD measurements and the normalized WILD features, we obtain the signal characteristics, including the signal slope and the smoothing trend. The slopes and smoothing splines are calculated, for each truck ID, over the date range from the immediately preceding failure event to the end of a data segment.

In this study, for a given feature Xj, j = 1, 2, . . . , p, the cubic smoothing spline with knots at the unique values of ti, where ti is the date of the ith record for the wheel, i = 1, 2, . . . , s, and s is the number of records in a data segment for a given truck ID, minimizes the penalized residual sum of squares

Σ_{i=1}^{s} {Xij − f(ti)}² + λ ∫_{t1}^{ts} {f″(t)}² dt    (4)

where f(t) ranges over all functions with two continuous derivatives, Xij = Xj(ti), and λ is a fixed penalty constant. As λ increases, the resulting curve becomes smoother. The automatic selection of the smoothing parameter λ involves constructing a cross-validation sum of squares CV(λ); the optimal λ̂ is selected over a suitable range to minimize CV(λ). Readers are referred to [31] for details. An example plot of a fitted linear regression line and cubic smoothing spline is shown in Fig. 5. From this figure we can see that the smoother correctly describes the relationship between the target variable (normalized wheel average kips from WILD) and the record date, even in the presence of an outlier.
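Assuming the two signal-characteristic features per measurement are the OLS slope and a smoothing-spline trend, a SciPy sketch on synthetic segment data looks like the following. Note that make_smoothing_spline with lam=None selects λ by generalized cross-validation (GCV), a close relative of the CV(λ) criterion described above; the data and the name "normalized WILD reading" are invented for illustration.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(2)

# Days since the previous failure event for one data segment, plus a
# drifting hypothetical normalized WILD reading with one outlier.
t = np.sort(rng.uniform(0, 180, size=40))
x = 0.02 * t + rng.normal(0, 0.1, size=40)
x[20] += 2.0                              # inject an outlier

# Signal-slope feature: ordinary least-squares linear fit.
slope, intercept = np.polyfit(t, x, deg=1)

# Smoothing-trend feature: cubic smoothing spline, penalty chosen by GCV.
spline = make_smoothing_spline(t, x, lam=None)
trend_at_end = spline(t[-1])              # trend value at segment end

assert 0.0 < slope < 0.1
```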

Next, summary statistics are calculated for each measurement in a preceding 90-day window from the date of each row in the integrated dataset, as shown in Fig. 6. We use a 90-day window, based on discussions with the engineers, because this window size provides enough lead time to perform predictive maintenance while avoiding premature removal of wheels/trucks.
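With a date-indexed series, these preceding-90-day summary statistics can be computed with a pandas time-based rolling window; the feature and column names here are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=200, freq="D"),
    "wild_avg_kips": rng.normal(60, 5, size=200),   # toy detector reading
}).set_index("date")

# Summary statistics over the preceding 90 days of each row (inclusive),
# mirroring the leading-window features of Fig. 6.
roll = df["wild_avg_kips"].rolling("90D")
feats = pd.DataFrame({
    "kips_mean_90d": roll.mean(),
    "kips_std_90d": roll.std(),
    "kips_max_90d": roll.max(),
})
print(feats.tail())
```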

Finally, we have the data ready for predictive model building. Due to the large number of features in this data set (i.e., all features listed in Tables II and IV), we apply feature selection on the augmented data representation to automatically identify significant features and remove irrelevant ones.

Fig. 6. Calculating the summary statistics in a preceding 90-day window for each detector measurement. The length of the arrows represents different types of detectors as well as bad order codes.

V. EXPERIMENTS

During the period of January 2010 to March 2012, 745 bad trucks were identified and labeled by bad order codes relevant to wheels (79%) and trucks (21%). This dataset contains 21,855 records and 19 original measurement features from the MV, OGD, and WILD detectors, as summarized in Table II. After data preprocessing, there are 248 features in the final dataset, including the original MV and OGD measurements, normalized WILD measurements, signal characteristics, and summary statistics. In the following analysis, this final dataset with imputed missing values is used to compare the regression models.

For model building and validation, the data are split into training and test sets, with the training data containing about 75% of the truck IDs. Because measurements for the same truck ID are not independent of each other, all data originating from a single truck ID are assigned entirely to either the training set or the test set. Random sampling over the whole failure dataset is not used here, since model evaluation results with random sampling would be overly optimistic.
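This grouped split can be reproduced with scikit-learn's GroupShuffleSplit, holding out whole truck IDs so that no truck contributes rows to both sets (synthetic data below):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(4)
n = 1000
truck_ids = rng.integers(0, 100, size=n)   # 100 hypothetical truck IDs
X = rng.normal(size=(n, 5))
y = rng.uniform(30, 400, size=n)           # toy RUL values in days

# ~75% of the truck IDs go to training; the split is by group, not by row.
gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=truck_ids))

assert set(truck_ids[train_idx]).isdisjoint(truck_ids[test_idx])
```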

To select the best predictive models for RUL, we fit the training data with six popular statistical and machine learning regression models: RF, QRF, Decision Trees (DT), K-Nearest Neighbors Regression (KNN), Support Vector Regression (SVR), and Principal Component Regression (PCR).

In the experiments, the R language and its packages are used for data processing and model building, and the parameters of the algorithms are set as follows:

• RF: nodesize = 20, ntree = 500, mtry = p/2, where p is the number of predictor variables;
• QRF: nodesize = 20, ntree = 500;
• SVR: cross = 10, which means that a 10-fold cross validation is performed on the training data;
• KNN: k = 100, where k is the number of neighbors considered;
• PCR: segments = 10, which are used in the cross validation;
• DT: the default control parameters are used; please refer to the R package ‘rpart’ for details.
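For readers working in Python rather than R, the RF settings above translate roughly to scikit-learn as follows (nodesize → min_samples_leaf, ntree → n_estimators, mtry → max_features). This is an approximate analog on invented data, not the authors' code; the two libraries differ in defaults and tree-growing details.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=300)

rf = RandomForestRegressor(
    n_estimators=500,       # ntree = 500
    min_samples_leaf=20,    # nodesize = 20
    max_features=0.5,       # mtry = p/2, expressed as a fraction of p
    random_state=0,
).fit(X, y)

pred = rf.predict(X[:5])
assert pred.shape == (5,)
```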

We use the absolute percentage error (APE) of the test datato compare the model performance. If there are nt rows in the


TABLE V
SUMMARY OF ABSOLUTE PERCENTAGE ERRORS FOR SIX REGRESSION MODELS

Fig. 7. Plot of predicted RUL values vs. observed RUL values (in days) in the test data using RF. The red diagonal line represents the cases where the predicted RULs equal the observed values.

test data, the APE for the kth test case, k = 1, 2, . . . , nt, is defined as

APE_k = |RUL_k − RUL_k^pred| / RUL_k · 100%    (5)

where | · | is the absolute value operator, RUL_k is the actual RUL value, and RUL_k^pred is the predicted RUL from a regression model. Based on (5), small RUL values close to 0 can create very large APE values and thus generate a very skewed distribution; in practice, there is no justification for requiring high precision for small RUL values. Therefore, records with RUL values not greater than 30 days are removed from the data. To eliminate the effects of extreme predicted values, the 2.5% trimmed APE (both the lowest 2.5% and the highest 2.5% of extreme values are excluded from the APE results) is calculated and summarized in Table V. Note that we use the median value as the predicted RUL when fitting the QRF model. The results in Table V show that RF and QRF outperform the other models, owing to the advantages of RF models and of missForest for missing value imputation.
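The filtering and trimming described above can be sketched as follows; the RUL values are synthetic, while the 30-day cutoff and 2.5% two-sided trim follow the text:

```python
import numpy as np

def trimmed_ape_summary(rul_true, rul_pred, trim=0.025, min_rul=30):
    """APE of Eq. (5): drop records with RUL <= min_rul days, then trim
    the given fraction of extreme APE values at each tail."""
    keep = rul_true > min_rul
    ape = np.abs(rul_true[keep] - rul_pred[keep]) / rul_true[keep] * 100.0
    ape = np.sort(ape)
    k = int(len(ape) * trim)
    trimmed = ape[k:len(ape) - k] if k > 0 else ape
    return {"median": np.median(trimmed), "mean": np.mean(trimmed)}

rng = np.random.default_rng(6)
rul_true = rng.uniform(10, 300, size=1000)
rul_pred = rul_true * rng.normal(1.0, 0.2, size=1000)  # ~20% relative error

summary = trimmed_ape_summary(rul_true, rul_pred)
assert 0 < summary["median"] < 100
```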

TABLE VI
MOST IMPORTANT FEATURES, RANKED IN DESCENDING ORDER

An example plot of the predicted vs. observed RUL values in the test data using RF is shown in Fig. 7, which reveals that the dispersion of the predicted RUL values tends to increase as the actual values increase. Every circle in Fig. 7 corresponds to a row in the test data and is associated with a multi-detector measurement vector of a truck/wheel. We can see that RF, as a general disadvantage (e.g., [32]), tends to overestimate small RUL values. As a solution, a classification model (e.g., RF) can be fitted to the data first to differentiate the groups with small and large RUL values, and then a QRF model can be fitted to the small-RUL group to derive the α-quantile as the predicted RUL value, where α is less than 0.5. In practice, the impact of heterogeneity can be reduced since railroads usually limit the maintenance planning horizon to a certain range of time (e.g., 180 days). An alternative way to deal with the heteroscedasticity is Robust Random Forest Regression (RRFR) [33], [34], which is more robust than RF on heteroscedastic datasets, because this model uses the median instead of the mean for prediction and the mean absolute deviation (MAD) to derive the error statistics. Note that the median is also used for prediction in our QRF model.

One important advantage of RF and QRF models is that feature importance can be calculated and used for feature selection. There are two types of feature importance measures [35]. The first, the mean increase in mean squared error (MSE), or mean decrease in prediction accuracy, is computed for each feature (i.e., predictor variable) as the difference in out-of-bag prediction error before and after the values of this feature are permuted among the training data, averaged over all trees and then normalized by the standard deviation of the differences. The second, the mean increase in residual sum of squares (RSS), is the total increase in RSS from splitting on the variable, averaged over all trees. In this study, we use the mean increase in MSE to determine the statistically significant features. We also use feature selection approaches for the other models, e.g., LASSO [36].
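The first measure corresponds to permutation importance. Scikit-learn's version, sketched below on synthetic data, permutes each feature and averages the increase in MSE over repeats; unlike R's randomForest it is computed on the supplied data rather than out-of-bag samples, so treat it as an analogous illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
# Feature 0 is strongly informative, feature 1 weakly, the rest are noise.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=400)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importance of each feature = mean increase in MSE after shuffling it.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0,
                                scoring="neg_mean_squared_error")
ranking = np.argsort(result.importances_mean)[::-1]
assert ranking[0] == 0   # the strongest simulated feature ranks first
```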

In the final model, there are 60 important features, most of which are new features created from the original ones. The ranking of the corresponding original features, using the mean decrease in prediction accuracy, is shown in Table VI. Here “Normalization” means the corresponding feature is normalized during data preprocessing. In summary, the measurements from WILD are the most important for predicting RUL, while those from MV are the least important.

VI. DISCUSSION

This study employs three types of wayside detectors, including a truck condition detector (OGD) and wheel condition detectors (WILD and MV). We found that all three types of detectors are useful for predicting a railcar's RUL. However, WILD detectors provide more insightful information than the other two. One reason is that WILD measurements are contact-based, and previous studies showed that such measurements provide good reliability in the system readings [7]. The other reason is that WILD is more widely installed in North America than the other two types of detectors, so a truck can have more WILD readings than other readings. Therefore, compared with OGD and MV, WILD has less missing data and better smoothers can be estimated. In addition to WILD, OGD provides good complementary readings at the vehicle performance level. It can detect the angle of attack, rotation, lateral positions, and hunting amplitudes of the wheelsets in a truck. The results of this study also indicate that almost all OGD readings can significantly contribute to truck failure prediction. MV can also provide useful failure information. However, it is well known that some MV measurements are correlated: for example, flange thickness and flange height are negatively correlated, as are rim thickness and flange height. So it is sufficient to use only a subset of MV readings for modeling wheel wear.

This paper combines wheel failure and truck failure for prediction, since either failure requires setting out the entire vehicle in the workshop. By aggregating measurements to the truck level, we are able to reduce the total amount of raw data significantly without harming prediction performance and accuracy. We do not consider aggregating to the equipment (railcar) level, since one railcar could contain more than ten trucks. Prediction at the equipment level is expected to involve much more visual inspection time than at the truck level.

If only one failure type, either wheel failure or truck failure, is considered, the proposed methodology is still valid. We implemented the RUL prediction method to predict the RUL of wheels and trucks separately. The significant factors found in both prediction models are composed of mixed WILD, OGD, and MV readings, similar to what we have found in this paper. Therefore, this study confirms that wheel and truck failures are interconnected: a new set of wheels can wear faster if its corresponding truck is poorly aligned, and vice versa.

To better plan maintenance activities, the main objective of this paper is to predict midterm-range RUL, between 60 and 180 days. When the actual RUL value is above 180 days, it does not influence the existing maintenance schedule much. Further, the plot shows that prediction errors increase when RUL is above 180 days, while the experimental results show that the proposed RUL prediction methods can accurately predict the RUL of a truck/wheel in the middle-term range. When the railcar component RUL is close to or shorter than 60 days, one may use classification methods to classify the components by different levels of RUL values. Further, QRF provides conditional quantiles of RUL instead of just a mean estimate. Depending on the conservativeness of the maintenance policy, railroads are able to flexibly output the needed quantiles for different rolling stock.

Accurate middle-term prediction of RUL allows railroads to plan condition-based maintenance with sufficient lead time. Condition-based maintenance is carried out along with the monitoring of certain parameters related to component health or deterioration [7]. Once an imminent failure alarm is received, corrective maintenance is taken prior to the component failure. However, the current state of practice mostly relies on corrective maintenance, which occurs only after an out-of-threshold defect is detected. This approach maximizes component service life but disrupts regular railway operations, leading to higher expenses and reduced average network velocity.

VII. CONCLUDING REMARKS

Railway wayside detectors can automatically identify potential railcar component failures, improving railway safety and reducing rolling stock inspection and maintenance costs. In this paper, a Random Forests-based methodology was developed to assess the current health and predict the RUL of both trucks (bogies) and wheels of a railcar by fusing measurements from three types of wayside detectors: WILD, OGD, and MV. In the final predictive model, the statistically significant features come from all three detector types. However, according to the number of most important features, we can rank the significance of the detectors as follows: WILD > OGD > MV. Further, we find that missForest, a Random Forests-based nonparametric imputation method, can effectively handle missing data in wayside detector readings. The median absolute percentage error (APE) of RUL prediction is 22.5%, which is adequate for middle-term (60–180 days) proactive maintenance planning. After obtaining RUL estimates, maintenance actions can be planned within the lead time, and identified defective railcars can be set out at an appropriate time for the necessary maintenance activities.

In this paper, we examined only three types of wayside detectors due to data availability. Nowadays, more condition monitoring technologies for railcars are being developed rapidly. Railroads may apply the data fusion methods proposed in this paper to other detectors to prognosticate railcar failures.

Further data analysis is still needed to fully disclose the underlying relationships between different types of failures. Such relationships can facilitate maintenance procedures and strategies that reduce operation costs in current railway practice.

REFERENCES

[1] B. Schlake, “Impact of automated condition monitoring technologies on railroad safety and efficiency,” M.S. thesis, Dept. Civil Environ. Eng., Univ. Illinois at Urbana–Champaign, Urbana, IL, USA, 2010.

[2] Y. Ouyang, X. Li, C. P. Barkan, A. Kawprasert, and Y.-C. Lai, “Optimal locations of railroad wayside defect detection installations,” Comput.-Aided Civil Infrastruct. Eng., vol. 24, no. 5, pp. 309–319, Jul. 2009.

[3] H. M. Tournay, “The development of algorithms to detect poorly performing vehicles at wayside detectors,” in Proc. 8th World Congr. Railway Res., Seoul, Korea, 2008, pp. 1–11.

[4] P. G. Steets and Y. H. Tse, “Conrail’s integrated automated wayside inspection,” in Proc. IEEE/ASME Joint Railroad Conf., Piscataway, NJ, USA, 1998, pp. 113–125.

[5] K. Bladon, D. Rennison, G. Izbinsky, R. Tracy, and T. Bladon, “Predictive condition monitoring of railway rolling stock,” in Proc. Conf. Railway Eng., Darwin, Australia, 2004, pp. 1–12.

[6] D. Barke and W. K. Chiu, “Structural health monitoring in the railway industry: A review,” Struct. Health Monit., vol. 4, no. 1, pp. 81–93, Mar. 2005.

[7] R. Lagneback, “Evaluation of wayside condition monitoring technologies for condition-based maintenance of railway vehicles,” Licentiate thesis, Dept. Civil, Mining Environ. Eng., Lulea Univ. Technol., Lulea, Sweden, 2007.

[8] B. Brickle, R. Morgan, E. Smith, J. Brosseau, and C. Pinney, “Identification of existing and new technologies for wheelset condition monitoring,” U.K. Rail Safety Std. Board, London, U.K., T607, 2008.

[9] J. Robeda and S. Kalay, “Technology drives US train inspections,” Int. Railway J., vol. 48, no. 5, pp. 47–50, 2008.

[10] G. M. Shafiullah, A. M. S. Ali, A. Thompson, and P. J. Wolfs, “Predicting vertical acceleration of railway wagons using regression algorithms,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 290–299, Jun. 2010.

[11] S. Uckun, K. Goebel, and P. J. F. Lucas, “Standardizing research methods for prognostics,” in Proc. Int. Conf. Prognostics Health Manage., Denver, CO, USA, Oct. 6–9, 2008, pp. 1–10.

[12] R. Kothamasu, S. H. Huang, and W. H. Verduin, “System health monitoring and prognostics—A review of current paradigms and practices,” Int. J. Adv. Manuf. Technol., vol. 28, no. 9/10, pp. 1012–1024, Jul. 2006.

[13] X.-S. Si, W. Wang, C.-H. Hu, and D.-H. Zhou, “Remaining useful life estimation—A review on the statistical data driven approaches,” Eur. J. Oper. Res., vol. 213, no. 1, pp. 1–14, Aug. 2011.

[14] F. Ahmadzadeh and J. Lundberg, “Remaining useful life estimation: Review,” Int. J. Syst. Assur. Eng. Manage., vol. 5, no. 4, pp. 461–474, Dec. 2014.

[15] T. Benkedjouh, K. Medjaher, N. Zerhouni, and S. Rechak, “Health assessment and life prediction of cutting tools based on support vector regression,” J. Intell. Manuf., vol. 26, no. 7, pp. 1751–1760, Apr. 2013.

[16] T. H. Loutas, D. Roulias, and G. Georgoulas, “Remaining useful life estimation in rolling bearings utilizing data-driven probabilistic E-support vectors regression,” IEEE Trans. Rel., vol. 62, no. 4, pp. 821–832, Dec. 2013.

[17] A. K. S. Jardine, D. Lin, and D. Banjevic, “A review on machinery diagnostics and prognostics implementing condition-based maintenance,” Mech. Syst. Signal Process., vol. 20, pp. 1483–1510, Oct. 2006.

[18] F. Braghin, R. Lewis, R. Dwyer-Joyce, and S. Bruni, “A mathematical model to predict railway wheel profile evolution due to wear,” Wear, vol. 261, no. 11/12, pp. 1253–1264, Dec. 2006.

[19] J. Pombo et al., “A railway wheel wear prediction tool based on a multibody software,” J. Theor. Appl. Mech., vol. 48, no. 3, pp. 751–770, 2010.

[20] C. Yang and S. Létourneau, “Learning to predict train wheel failures,” in Proc. 11th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, Chicago, IL, USA, 2005, pp. 516–525.

[21] L. Hajibabai et al., “Wayside defect detector data mining to predict potential WILD train stops,” in Proc. Amer. Railway Eng. Maintenance Way Assoc. Annu. Meet., Chicago, IL, USA, 2012, pp. 1–39.

[22] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA, USA: Wadsworth, 1984.

[23] L. Breiman and A. Cutler, “Random Forests,” Dept. Stat., Univ. California Berkeley, Berkeley, CA, USA. [Online]. Available: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

[24] D. J. Stekhoven and P. Bühlmann, “MissForest—Non-parametric missing value imputation for mixed-type data,” Bioinformatics, vol. 28, no. 1, pp. 112–118, 2012.

[25] O. Troyanskaya et al., “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520–525, Jun. 2001.

[26] J. Schafer, Analysis of Incomplete Multivariate Data. London, U.K.: Chapman & Hall, 1997.

[27] S. Van Buuren and K. Oudshoorn, Flexible Multivariate Imputation by MICE. Leiden, The Netherlands: TNO Prevention Center, 1999.

[28] N. Meinshausen, “Quantile regression forests,” J. Mach. Learn. Res., vol. 7, pp. 983–999, 2006.

[29] R. Koenker, Quantile Regression. Cambridge, U.K.: Cambridge Univ. Press, 2005.

[30] S. Oba et al., “A Bayesian missing value estimation method for gene expression profile data,” Bioinformatics, vol. 19, no. 16, pp. 2088–2096, Nov. 2003.

[31] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, ser. CRC Monographs on Statistics & Applied Probability. New York, NY, USA: Chapman & Hall, 1990.

[32] N. Horning, “Random forests: An algorithm for image classification and generation of continuous fields data sets,” in Proc. Int. Conf. Geoinformat. Spatial Infrastruct. Develop. Earth Allied Sci., Hanoi, Vietnam, Dec. 2010, pp. 1–6.

[33] J. R. Brence and D. E. Brown, “Analysis of robust measures in random forest regression,” Dept. Syst. Inf. Eng., Univ. Virginia, Charlottesville, VA, USA, Tech. Rep. sie06_0002, 2006.

[34] J. R. Brence and D. E. Brown, “Comparative analysis of two forest-based regression algorithms,” Dept. Syst. Inf. Eng., Univ. Virginia, Charlottesville, VA, USA, Tech. Rep. sie06_0003, 2006.

[35] L. Breiman, A. Cutler, A. Liaw, and M. Wiener, Package ‘randomForest’, Reference Manual, 2014. [Online]. Available: http://cran.r-project.org/web/packages/randomForest/randomForest.pdf

[36] M. Schmidt, “Least squares optimization with L1-norm regularization,” School Comput. Sci., Simon Fraser Univ., Burnaby, BC, Canada, Tech. Rep., 2005. [Online]. Available: http://www.di.ens.fr/~mschmidt/Software/lasso.pdf

Zhiguo Li received the B.E. and M.E. degrees in automotive engineering from Tsinghua University, Beijing, China, in 1999 and 2002, respectively, and the M.S. degree in statistics and the Ph.D. degree in industrial engineering from the University of Wisconsin–Madison, Madison, WI, USA, in 2007.

From 2008 to 2011, he was a Research Staff Member with the Xerox Research Center. Since 2011, he has been a Research Staff Member with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. His research interests focus on business analytics, industrial statistics, data mining, and reliability modeling for complex engineering systems and processes.

Qing He (S’10–M’13) received the B.S. and M.S. degrees in electrical engineering from Southwest Jiaotong University, Chengdu, China, and the Ph.D. degree in systems and industrial engineering from the University of Arizona, Tucson, AZ, USA, in 2010. From 2010 to 2012, he was a Postdoctoral Researcher at the IBM T. J. Watson Research Center.

He is currently the Stephen Still Assistant Professor in Transportation Engineering and Logistics, affiliated with both Civil Engineering and Industrial Engineering, at the University at Buffalo, The State University of New York, Buffalo, NY, USA. His research vision is to develop multimodal, multidimensional transportation and logistics systems with new information technology, using a broad range of techniques including optimization, control, simulation, and statistics.