
Source: urbcomp.ist.psu.edu/2017/papers/Forecasting.pdf

Forecasting short-term taxi demand using boosting-GCRF

Xinwu Qian
Purdue University
550 Stadium Mall Dr
West Lafayette, Indiana 47906
[email protected]

Satish V. Ukkusuri
Purdue University
550 Stadium Mall Dr
West Lafayette, Indiana 47906
and Tongji University
4800 Cao An Rd
Shanghai, China 201804
[email protected]

Chao Yang
Tongji University
4800 Cao An Rd
Shanghai, China 201804
[email protected]

Fenfan Yan
Tongji University
4800 Cao An Rd
Shanghai, China 201804
[email protected]

ABSTRACT
Operation strategies are most efficient when framed before actual taxi demand is revealed. This is challenging, however, because knowledge of the taxi demand distribution in the immediate future is limited, making such strategies prone to prediction errors. In this study, we develop the boosting Gaussian conditional random field (boosting-GCRF) model to accurately forecast the short-term taxi demand distribution using historical time-series demand over the study area. Comprehensive numerical experiments are conducted to compare the performance of boosting-GCRF with five benchmark algorithms. The results suggest that boosting-GCRF is superior, with a best modified mean absolute percentage error of 10.4%. The approach also proves robust when predicting anomalous taxi demand. In addition, the density functions generated by the boosting-GCRF model are found to capture the actual distribution of short-term taxi demand well.

KEYWORDS
Short-term taxi demand forecasting, spatiotemporal correlations, Gaussian conditional random field, boosting

1 INTRODUCTION
Taxis are a popular mode of urban transportation due to their point-to-point service and around-the-clock availability. By the end of 2014, there was approximately one yellow cab per hundred residents in Manhattan, and a total of 13,587 yellow cabs served over 450,000 daily trips [1]. Yet this huge industry has been notorious for its inefficient operations, with long passenger waiting times and excessive vacant trips. In particular, Zhan et al. [2] suggested that up to 60% of the market inefficiency could be eliminated if demand and supply were perfectly matched, and many efforts have been made to address the issue by designing recommendation systems [3–5], combining similar passengers [6–9], and framing dynamic pricing policies [10, 11]. An essential input for these efforts is the passenger

Copyright is held by the author/owner(s). UrbComp'17, August 14, 2017, Halifax, Nova Scotia, Canada

demand in the immediate future, which is either assumed to be known from historical observations or obtained through trivial predictions. Without a precise understanding of the demand level, however, operational performance can barely be improved, if not degraded. Moreover, demand over a short period of time is likely to span a wide range of possible values, which highlights the need to model demand uncertainty in addition to producing point estimates. This motivates us to study the problem of predicting the distributions of short-term taxi demand, which will contribute to robust operations and policy making in the taxi industry.

In this study, the short-term taxi demand forecasting problem focuses on inferring future demand distributions from historical observations. It is well understood that long-term taxi ridership is closely associated with socio-economic, demographic, and built-environment factors [12]. For short-term taxi demand, however, urban activities and mobility differ over time and space, so it is not viable to forecast future demand from aggregated explanatory variables. Moreover, to support real-world operations such as pre-dispatching vacant taxis and framing dynamic pricing strategies, the prediction needs to be conducted at a fine spatial and temporal scale. This introduces several challenges in modeling short-term taxi demand. First, taxi demand in a small urban area is likely to be noisy and non-stationary over a short time horizon, and prediction errors may be largely determined by whether the spatiotemporal correlations are properly modeled [13]. Second, a fine level of spatial and temporal aggregation may significantly increase the dimensionality of the problem, while useful data are limited due to rapidly changing mobility patterns. Simple models are unlikely to capture the high-dimensional nature of the problem, yet it is also difficult to make accurate predictions with a complex model trained on limited data. Third, the taxi demand pattern may be entirely different when special events or sudden weather changes take place, so the proposed model needs to be robust against possible anomalies in addition to normal daily operations. The objective of this study is to develop a model that accurately forecasts short-term taxi demand by addressing these challenges.

Despite the significance of the problem, little attention has been paid to understanding the nature of short-term taxi demand. To the best of our knowledge, only Moreira et al. [14] have studied the problem; they proposed an ensemble model to forecast taxi demand at taxi stands every 30 minutes, with a best reported MAPE of 23.12%. However, demand at taxi stands may account for only a small portion of total ridership, and the best MAPE achieved may not be accurate enough to meet the needs of real-time market operations.

In light of the results from previous works, and considering the difficulties in modeling short-term taxi demand, this study investigates predicting short-term taxi demand using the Gaussian Conditional Random Field (GCRF) model. We use historical time series of taxi demand as input and model the distributions of taxi ridership one or multiple steps ahead for each individual location in the study area. The problem can be categorized as multi-target probabilistic regression, and we develop a boosting approach specifically tailored to the needs of the problem. The advantages of the proposed model can be summarized in four aspects. First, the model captures both historical and future spatiotemporal correlations of short-term taxi demand. Second, the model generates multi-step-ahead forecasts as a structured output, thus avoiding the construction of separate predictors for each individual output and achieving higher accuracy. Third, the model conducts probabilistic rather than point estimation, which helps quantify the uncertainty of future demand and is more suitable for modeling noisy short-term demand. Finally, the model uses regularization to overcome data availability issues, and improves both point and probabilistic estimation performance through the proposed boosting approach. We compare against the base GCRF, bagging-GCRF, ANN, AdaBoost regression tree, and gradient boosting tree methods using real-world taxi data to evaluate the performance of the boosting-GCRF model. We find that boosting-GCRF achieves the best mMAPE of 10.4% and improves on the base GCRF model by 5.63% on average. The model is also found to be robust under anomalies, and the resulting probabilistic distribution captures the actual distribution of short-term taxi ridership well.

The rest of the paper is organized as follows. Section 2 introduces the data and the study area, discusses data processing, and analyzes the spatiotemporal characteristics of short-term taxi demand. Section 3 discusses the GCRF model in detail, presents the boosting approach, and introduces the bagging method and benchmark algorithms. Section 4 presents comprehensive numerical experiments assessing model performance. Finally, Section 5 concludes the study and outlines future work.

2 DATA
2.1 Data processing
In this study, we use 2015 New York City (NYC) taxi trip data to explore forecasting short-term taxi demand. The data were collected by the New York City Taxi and Limousine Commission (NYCTLC) for the year of 2015. It was reported that there were 13,000 medallion cabs by the end of 2015 [1], which produced over 20 gigabytes of trip data. Each trip record contains geo-coordinates and timestamps for the trip origin and destination. It also includes trip distance, trip duration, number of passengers, and payment information. Complete trip trajectories are not available due to privacy concerns.

We processed the data by first removing erroneous records, e.g. trips lying outside the study area, trips whose duration is either too short (< 30 s) or too long (> 2 h), and trips with excessive mean travel speed (> 70 mph). We then grouped the trip records by origin location and timestamp. We choose the Manhattan borough of NYC as our study area, since this borough alone covers more than 90% of total taxi trips in NYC [15]. The trip records are aggregated spatially into ZIP Code Tabulation Areas (ZCTAs) and temporally into 15-minute intervals. There are initially 106 ZCTAs in Manhattan, and some of them represent only a single building where no taxi trips originate. We merge these tracts with peripheral areas, which results in a final study area of 48 ZCTAs, as shown in Figure 1.

Figure 1: Study area and NYC ZCTA map. (The dots in the study area denote selected tracts in lower, middle, and upper Manhattan for spatiotemporal characteristic analyses.)

Figure 2: Sample demand distribution, temporal autocorrelation plot, and spatial autocorrelation plot. (a) 15-minute taxi demand distributions (lower, middle, and upper Manhattan); (b) corresponding autocorrelation coefficients; (c) semivariogram plots γ(h) for spatial autocorrelation at 8 am, 12 pm, 7 pm, and 12 am (empirical vs. theoretical; lag at the scale of 10^5 meters).
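The paper's experiments were implemented in MATLAB; the cleaning and aggregation steps above can be sketched in Python as follows. The record layout and values are illustrative assumptions, not the actual NYCTLC schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trip records; field names are assumptions for illustration only.
trips = [
    {"zcta": "10001", "pickup": datetime(2015, 1, 5, 8, 7), "duration_s": 600, "distance_mi": 2.0},
    {"zcta": "10001", "pickup": datetime(2015, 1, 5, 8, 20), "duration_s": 10, "distance_mi": 0.1},   # too short
    {"zcta": "10002", "pickup": datetime(2015, 1, 5, 8, 40), "duration_s": 900, "distance_mi": 30.0}, # too fast
]

def is_valid(trip):
    """Apply the paper's filters: 30 s < duration < 2 h and mean speed <= 70 mph."""
    dur = trip["duration_s"]
    if not (30 < dur < 2 * 3600):
        return False
    speed_mph = trip["distance_mi"] / (dur / 3600.0)
    return speed_mph <= 70

def bin_15min(ts):
    """Map a timestamp to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)

# Count valid trips per (ZCTA, 15-minute interval): the short-term demand signal.
demand = Counter((t["zcta"], bin_15min(t["pickup"])) for t in trips if is_valid(t))
print(demand[("10001", datetime(2015, 1, 5, 8, 0))])  # -> 1
```

Only the first record survives the filters, so the 8:00–8:15 bin for ZCTA 10001 holds one trip.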


2.2 Spatiotemporal correlations of short-term taxi demand

To model future demand accurately, the first priority is to analyze and understand the underlying characteristics of the data. Figure 2 presents the sample demand distributions of three selected tracts, as well as evaluations of the spatial and temporal correlations. In this section, we discuss the data characteristics of three subareas selected from lower, middle, and upper Manhattan (see Figure 1).

It can be observed from the demand distributions at the 15-minute scale that the short-term taxi demand data are noisy and highly non-stationary. While there are signs of seasonality for the lower- and mid-Manhattan locations, revealed by periodic peaks and valleys, no obvious trend can be observed for the upper-Manhattan location. Moreover, since the periodic patterns are only observable at a longer (daily) time scale, it is difficult to justify short-term correlations from the figure alone. To improve our understanding, we plot the autocorrelation coefficients for temporal correlations and the semivariogram for spatial correlations at these locations. As shown in Figure 2(b), all three plots suggest that the time-series demand is not random, and the autocorrelation coefficients remain significant even at large lags. There is clearly a strong positive association between consecutive short-term demand values at all three locations, where an increase in short-term demand will likely lead to higher demand in the immediate future and vice versa. But the strength of the association is heterogeneous over space.
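As an illustration of how coefficients like those in Figure 2(b) are obtained, the following NumPy sketch computes sample autocorrelation coefficients for a toy periodic demand series; the series values and period are made up for illustration.

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelation coefficients r_1 .. r_max_lag of a demand series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# A toy series with a period of 4 intervals mimics the seasonality seen in Figure 2(b).
series = np.tile([5.0, 20.0, 35.0, 15.0], 25)  # 100 observations
r = autocorr(series, max_lag=8)
print(round(r[3], 3))  # lag equal to the period -> strong positive correlation
```

For this exactly periodic series, the lag-4 coefficient is (100-4)/100 = 0.96, while off-period lags are much weaker.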

The semivariogram, which is often used in spatial statistics to describe the spatial autocorrelation between pairs of sample points [16], is plotted for the whole study area at different times of day in Figure 2(c). A smaller γ value indicates a stronger spatial correlation between two locations. In the plots, the red dots are the observed empirical γ values with a bin size of 20, and the curve is the fitted theoretical semivariogram. For the fitted curve, the range of the x-axis under the blue curve indicates the effective range, meaning that tracts within this lag range are spatially correlated. The results verify that a strong spatial correlation exists across different times of day when the centroids of two tracts are within a lag distance of 0.05, or 5 kilometers. In other words, short-term taxi demand is strongly correlated for neighboring areas, and the closer two tracts are, the higher the correlation. This correlation also varies with the time of day: the range can be up to 0.05 lag for the 8 AM and 12 PM cases, while 12 AM has the smallest range of around 0.03. As a consequence, the spatial correlation is also heterogeneous with respect to time.

In conclusion, we observe that short-term taxi demand is non-stationary, but that non-trivial spatial and temporal dependencies exist at various times and locations. It is therefore important to model the spatiotemporal correlations properly in order to obtain accurate prediction results.
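The empirical semivariogram underlying plots like Figure 2(c) can be sketched as follows, using the standard estimator γ(h) = (1/2|N(h)|) Σ_{(i,j)∈N(h)} (z_i − z_j)². The tract centroids and demand values below are toy numbers chosen only to show the nearby-vs-distant contrast.

```python
import numpy as np

def empirical_semivariogram(coords, values, bins):
    """gamma(h) = (1/2|N(h)|) * sum over pairs (i,j) in distance bin of (z_i - z_j)^2."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    dists, sqdiffs = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(coords[i] - coords[j]))
            sqdiffs.append((values[i] - values[j]) ** 2)
    dists, sqdiffs = np.array(dists), np.array(sqdiffs)
    gamma = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (dists >= lo) & (dists < hi)
        gamma.append(0.5 * sqdiffs[mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

# Toy tract centroids (km) and 15-minute demand counts.
coords = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
demand = [100, 110, 105, 20, 25]
gamma = empirical_semivariogram(coords, demand, bins=[0, 2, 8])
print(gamma)  # nearby pairs -> small gamma (strong correlation); distant pairs -> large
```

In this toy case the near bin yields γ ≈ 21.9 while the far bin yields γ in the thousands, reproducing the "closer means more correlated" pattern observed in the data.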

3 METHODOLOGY
3.1 Gaussian Conditional Random Field

3.1.1 The basic model. We are interested in modeling the distribution of short-term taxi demand at a given location and time interval. This offers stakeholders greater flexibility in framing operation strategies and helps evaluate the level of confidence of a given estimate. Moreover, we have shown that short-term taxi demand exhibits significant spatial and temporal correlations, which need to be modeled accordingly to obtain accurate estimates. Consequently, considering a study area consisting of n subareas, our focus is to model the probability:

p(Y_i | X, Y_{-i})    (1)

where Y_i = [y_i^{t+1}, y_i^{t+2}, ..., y_i^{t+k}]^T is a column vector of length k, representing the estimated taxi demand at location i up to k time steps ahead. X = [X_1^T, X_2^T, ..., X_n^T]^T is a column vector of length tn, which refers to the collection of observed taxi demand at the n locations in the study area across t time steps into the history, where each component of X takes the form X_i = [x_i^1, x_i^2, ..., x_i^t]^T. Similarly, Y = [Y_1, Y_2, ..., Y_n]^T is the column vector of length kn for the estimated taxi demand of the n locations over k future time steps, and Y_{-i} stands for all estimated demand except at location i.

The conditional probability suggests two layers of dependency: (1) future demand at location i may depend on historical demand at all other places in the past t time steps, and (2) future demand at all locations may also be inter-correlated. In this study, we assume the short-term taxi demand at each location during each time interval follows a normal distribution, which we observe to be a reasonable approximation for the majority of places in the data. Note that this assumption is not restrictive, as one may convert any empirical marginal distribution into a Gaussian distribution via a copula transformation and vice versa [17]. We model the structured demand prediction as a multivariate Gaussian distribution:

p(Y | X; Θ, Λ) = (1 / Z(X)) exp{ -(1/2) Y^T Λ Y - X^T Θ Y }    (2)

where Θ is a tn × kn matrix which maps X to Y, Λ is a kn × kn matrix which captures the inter-correlations among the outputs, and Z(X) is the normalization constant which ensures the distribution integrates to 1 and can be calculated as:

Z(X) = c |Λ|^{-1/2} exp{ (1/2) X^T Θ Λ^{-1} Θ^T X }    (3)

This conditional probability distribution of Y|X is known as the GCRF model, which was originally developed from the CRF model [18] for structured regression, and has been applied to various problems including energy load forecasting [19], image denoising [20], and modeling patients' behavior [21]. In particular, the distribution has mean -Λ^{-1} Θ^T X and covariance Λ^{-1}, and the two parameter matrices Λ and Θ for the short-term demand forecasting problem can be understood from the graph representation shown in Figure 3. To model the conditional distribution p(Y|X), we need to infer the most likely Θ and Λ from training data, which is equivalent to solving the following maximum likelihood estimation (MLE) problem:

maximize_{Θ,Λ}   log|Λ| - (1/t) tr( Y^T Y Λ + 2 Y^T X Θ + Λ^{-1} Θ^T X^T X Θ )    (4)

There are two main benefits of GCRF: 1) the model is the discriminative form of the Gaussian Markov random field model, and is thus computationally more efficient and yields lower error compared with generative models [22]; 2) while the covariance matrix may be dense due to complex spatial and temporal correlations, the model infers the inverse of the covariance matrix, which is likely to be sparse since it models conditional independence.

Figure 3: An example of the graph model of the GCRF which models the demand in 48 locations and uses data d steps in the history to predict demand 2 steps ahead.

3.1.2 Regularization. The basic GCRF is not readily applicable for estimation due to overfitting. Consider a scenario where the study area has 30 tracts and we would like to estimate demand 8 steps into the future using 16 steps of historical observations, where each time step corresponds to a 15-minute interval. The resulting input X has 30 (tracts) × 16 (time steps) = 480 features, and the length of Y is 30 × 8 = 240. In the worst case, we have to estimate 480 × 240 + 240 × 241 / 2 parameters to infer Θ and Λ, which indicates the high complexity of the GCRF model. To avoid overfitting, the number of training samples should be significantly larger than the number of features. As one day of trip data may contribute only one training sample, years of trip data might be required to sufficiently train the GCRF model. Obtaining the data is not an issue; however, due to rapidly changing urban land use, trip data from long ago are likely to act as outliers when estimating present demand. Adding regularizers when estimating model parameters is a viable solution to these issues. We consider estimating the model parameters with an l1 regularizer, which gives rise to the following regularized maximum likelihood estimation problem:
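The parameter count in the example above works out as follows (a dense Θ plus the upper triangle of the symmetric Λ):

```python
n_tracts, hist_steps, fut_steps = 30, 16, 8

n_features = n_tracts * hist_steps  # length of X: 480
n_outputs = n_tracts * fut_steps    # length of Y: 240

theta_params = n_features * n_outputs             # dense Theta: 480 * 240 = 115,200
lambda_params = n_outputs * (n_outputs + 1) // 2  # symmetric Lambda: 240 * 241 / 2 = 28,920

total = theta_params + lambda_params
print(total)  # -> 144120
```

So roughly 144 thousand parameters would be fit from, at best, a few hundred daily samples, which is why regularization is essential.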

maximize_{Θ,Λ}   log|Λ| - (1/t) tr( Y^T Y Λ + 2 Y^T X Θ + Λ^{-1} Θ^T X^T X Θ ) - λ ( ||Λ||_1 + ||Θ||_1 )    (5)

In Equation 5, λ is the l1 regularization coefficient. A large λ increases the number of zero elements in Λ and Θ, while a small λ results in dense estimates of the model parameters.

With the l1 term, Equation 5 is no longer differentiable, and numerical approaches such as Newton's method may suffer from slow convergence due to the high-dimensional nature of the problem. We adopt the second-order active-set approach in [23] for parameter inference. The algorithm proceeds by constructing a second-order approximation of Equation 5 without the l1 regularization term, and solving for the descent directions ΔΛ, ΔΘ using the coordinate descent algorithm. The solution algorithm has a super-linear convergence rate and works especially well for high-dimensional data. Readers may refer to [23] for analytical and theoretical details.

Finally, the mean value of future demand can be calculated as

Y = -Λ^{-1} Θ^T X    (6)

and the covariance matrix of this multivariate Gaussian distribution is simply Λ^{-1}. This means future demand can be predicted almost instantaneously, thanks to efficient sparse matrix inversion and matrix multiplication with existing techniques.
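Once Θ and Λ are estimated, prediction reduces to linear algebra. A minimal NumPy sketch with random stand-in parameters (the paper's experiments are in MATLAB, and the dimensions here are tiny placeholders for tn and kn):

```python
import numpy as np

rng = np.random.default_rng(0)
tn, kn = 6, 4  # tiny stand-ins for the tn-dimensional input and kn-dimensional output

# Toy parameters: Theta maps input to output; Lambda must be a valid
# (symmetric positive definite) precision matrix for the output.
Theta = rng.normal(scale=0.1, size=(tn, kn))
A = rng.normal(size=(kn, kn))
Lambda = A @ A.T + kn * np.eye(kn)

X = rng.normal(size=tn)

# Predictive distribution (Eq. 6): mean = -Lambda^{-1} Theta^T X, covariance = Lambda^{-1}
Sigma = np.linalg.inv(Lambda)
mean = -Sigma @ Theta.T @ X

print(mean.shape, bool(np.all(np.linalg.eigvalsh(Sigma) > 0)))
```

For sparse Λ one would solve the linear system Λ·mean = −ΘᵀX directly rather than forming the dense inverse.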

3.2 Boosting-GCRF
The basic GCRF is a strong learner which tries to model the correlations between every pair of elements in the input and output data. But the regularized GCRF model may be considerably weaker than the basic model, due to the reduced number of parameters used to avoid overfitting. A relatively weak model is unlikely to fit the whole data well, especially considering that each training sample may differ significantly from the others. This motivates us to boost the GCRF model for more robust and accurate demand prediction.

In particular, we are interested in the adaptive boosting (AdaBoost) approach [24] for building a strong learner, boosting-GCRF. AdaBoost was originally proposed for classification problems, where the weak model is trained using different distributions of the same training data and the different outputs are combined to derive the strong model. Despite its success in classification, less effort has been devoted to regression problems, and there are few studies combining boosting with probabilistic multi-target regression models. We follow the AdaBoost.RT algorithm [25] to develop our adaptive boosting algorithm for the GCRF model.

Similar to other AdaBoost algorithms, boosting-GCRF starts with the input training and test data X, Y, the sample weights W, the weak learning model M, and the number of machines T. For each machine t, we prepare the training data D_t from X and train the machine with that data. After each machine is constructed, we evaluate the fitness of the model on the entire training data X and update the weight of each training sample based on the corresponding error rate. For the GCRF model, the error rate of a training sample x_i is measured as:

Err_t(i) = ||y_i^t - y_i||_2 / sqrt(kn)    (7)

where y_i^t is the predicted output from machine t; the error rate is thus the root mean squared deviation.

Unlike point-estimation regression, the output of boosting-GCRF cannot be generated by simply taking the weighted average of the results from the T machines. The reason is that each machine t models a Gaussian distribution, and the combination of the machines is no longer a Gaussian distribution but a Gaussian mixture model. For Gaussian mixture models, the weighted mean and covariance are calculated as:

E(Y|X) = Σ_{t=1}^{T} p_t E(Y|X, M_t)    (8)

Var(Y|X) = Σ_{t=1}^{T} p_t Var(Y|X, M_t) + Σ_{t=1}^{T} p_t ( E(Y|X, M_t) - E(Y|X) )^2    (9)


where p_t is the associated weight of machine t, and the variance of the mixture model is therefore the mixture of the variances plus a dispersion term over the weighted means. As a consequence, this variance will be greater than the variance of any individual model. Even if the mean prediction is improved by boosting, the resulting variance is likely to be over-conservative, providing little value for understanding the true distribution of taxi demand.

To overcome this issue, we introduce a combination rule in which the weights are adjusted based on prediction uncertainty [26]. If a machine is uncertain about its prediction, i.e. it has a larger variance, the importance of the machine should be discounted. The combination of the T machines is then generated by the product rule rather than the weighted averaging rules of Equations 8 and 9:

p(Y|X) ∝ Π_{t=1}^{T} p(Y|X, M_t)^{p_t}    (10)

The expectation and variance of the combined model can then be calculated as:

Var(Y|X) = ( Σ_{t=1}^{T} p_t Var(Y|X, M_t)^{-1} )^{-1}    (11)

E(Y|X) = Var(Y|X) Σ_{t=1}^{T} p_t Var(Y|X, M_t)^{-1} E(Y|X, M_t)    (12)

In this way, we avoid overdispersion of the Gaussian mixture variance, and the resulting variance is no worse than that of the machine with the least prediction uncertainty.
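For diagonal (per-output) variances, Equations 11 and 12 can be sketched as below; the full-matrix version replaces the element-wise divisions with matrix inverses. The numbers are purely illustrative.

```python
import numpy as np

def combine_product(means, variances, weights):
    """Precision-weighted product rule (Equations 11 and 12), per output dimension."""
    means, variances, weights = map(np.asarray, (means, variances, weights))
    precision = np.sum(weights / variances, axis=0)           # Eq. 11, before inverting
    var = 1.0 / precision
    mean = var * np.sum(weights * means / variances, axis=0)  # Eq. 12
    return mean, var

# Two machines predict the same output; the confident one (variance 1) dominates.
means = np.array([[10.0], [20.0]])
variances = np.array([[1.0], [100.0]])
weights = np.array([[0.5], [0.5]])

mean, var = combine_product(means, variances, weights)
print(mean, var)  # mean stays near 10; variance stays near the confident machine's
```

A plain weighted average (Equations 8 and 9) would instead pull the mean toward 15 and inflate the variance with the dispersion term, which is exactly the overdispersion the product rule avoids.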

Finally, we present the Boosting-GCRF algorithm as follows.

Table 1: Boosting-GCRF algorithm

1  Input: training and test data X, Y; the weak learner GCRF with regularizer λ; the number of machines T; initial weights W.
2  for t = 1:T do
3      Generate training and test data X_t, Y_t based on the weights W;
4      Train the machine M_t: f_t(X_t) → Y'_t;
5      Calculate the error Err_t(i) for each sample x_i in X following Equation 7;
6      Measure the maximum error rate of machine t as ε_t ← max_i Err_t(i);
7      Calculate the exponential loss for each sample: L_t(i) ← 1 - exp(-Err_t(i) / ε_t);
8      Calculate the average loss of the machine: \bar{L}_t ← Σ_i W(i) L_t(i);
9      Set β_t ← \bar{L}_t / (1 - \bar{L}_t);
10     Update the weights: W(i) ← W(i) β_t^{1 - L_t(i)} / Z for all i, where Z is the normalization constant;
11 end
12 Calculate the model importance p_t ← log(1/β_t) / Σ_t log(1/β_t);
13 Calculate E(Y|X) and Var(Y|X) following Equations 12 and 11.
14 Output: E(Y|X), Var(Y|X)
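The weighting loop of Table 1 can be sketched generically in Python. Here a simple ridge regressor stands in for the regularized GCRF weak learner, resampling by weight stands in for step 3, and the final combination uses importance-weighted means as a point-estimate analogue rather than the full product rule of Equations 10–12; all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_weak_learner(X, y):
    """Stand-in weak learner (ridge regression); the paper uses a regularized GCRF."""
    w = np.linalg.solve(X.T @ X + 0.1 * np.eye(X.shape[1]), X.T @ y)
    return lambda Xq: Xq @ w

X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

T = 5
n = len(X)
W = np.full(n, 1.0 / n)
machines, betas = [], []
for t in range(T):
    idx = rng.choice(n, size=n, p=W)        # resample training data by weight (step 3)
    f = fit_weak_learner(X[idx], y[idx])    # train machine M_t (step 4)
    err = np.abs(f(X) - y)                  # per-sample error (analogue of Eq. 7, step 5)
    eps = err.max()                         # maximum error rate (step 6)
    L = 1.0 - np.exp(-err / eps)            # exponential loss (step 7)
    Lbar = np.dot(W, L)                     # average loss (step 8)
    beta = Lbar / (1.0 - Lbar)              # step 9
    W = W * beta ** (1.0 - L)               # upweight hard samples (step 10)
    W /= W.sum()
    machines.append(f)
    betas.append(beta)

p = np.log(1.0 / np.array(betas))
p /= p.sum()                                # model importance (step 12)
pred = sum(pi * f(X) for pi, f in zip(p, machines))
print(round(float(np.mean((pred - y) ** 2)), 4))
```

Because β_t < 1 for a machine with low average loss, log(1/β_t) assigns higher importance to better machines, mirroring step 12.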

3.3 Features
The input and output data of the model are the numbers of taxi trips per location per time interval. The input features consist of a vector of trip observations across all locations during the observed time periods, and the prediction is the forecasted number of trips in the same set of locations over a pre-specified time horizon. Instead of using the actual trip counts, we first scale the counts into the range [-1, 1]. The scaling is applied to both x_i^t and y_i^t across all observations, which helps reduce the variation in trip counts across different times and locations.

In addition to feature scaling, we construct additional features from the time-series observations using a radial function. In particular, we adopt the Gaussian radial function:

h(x_i^t) = exp( -(x_i^t - c_i^t)^2 )    (13)

where c_i^t denotes the average number of trips for location i at time t across all observations. Equation 13 measures the deviation of a particular feature from the normal scenario. A large deviation usually implies the impact of unusual external events, such as bad weather or traffic conditions, which are likely to affect subsequent time intervals and neighboring locations. As a result, the radial function doubles the size of the input features, but contributes to extracting additional information from the time-series observations.
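A sketch of the scaling and radial feature construction on synthetic counts. For simplicity, c here is the per-location mean over all observations, a simplification of the per-location, per-time-of-day average c_i^t used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(lam=30.0, size=(60, 4)).astype(float)  # 60 intervals x 4 locations

# Scale trip counts into [-1, 1] per location.
lo, hi = counts.min(axis=0), counts.max(axis=0)
scaled = 2.0 * (counts - lo) / (hi - lo) - 1.0

# Gaussian radial features (Equation 13): deviation from the average scaled count.
c = scaled.mean(axis=0)
radial = np.exp(-(scaled - c) ** 2)

features = np.hstack([scaled, radial])  # the radial features double the input size
print(features.shape)  # -> (60, 8)
```

A radial value near 1 marks a typical interval, while a value well below 1 flags a large deviation such as an unusual-event spike.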

3.4 Benchmark Algorithms
We introduce four benchmark algorithms which are considered state-of-the-art approaches in the literature for short-term demand forecasting.

3.4.1 Bagging-GCRF. Bootstrap aggregating, or bagging, is another popular approach for constructing strong models from an ensemble of weak learners [27]. We replace the weighted-average aggregation of weak learners in conventional bagging with the same product weighting rule used in boosting-GCRF, and develop the bagging-GCRF approach as the first benchmark algorithm.

3.4.2 Artificial Neural Network. The artificial neural network (ANN) is known to be capable of modeling nonlinearity in the data [28], and we use a feed-forward network as the second benchmark algorithm. Theoretical work suggests that a single hidden layer is sufficient to model any complex nonlinear relationship [29], and we choose a single-layer structure with 10 neurons based on our experiments. We also add a regularization term in the form of the sum of squares of the network weights to avoid overfitting.
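The paper trains its ANN in MATLAB; an equivalent sketch using scikit-learn's MLPRegressor (our choice of library, with synthetic stand-in data) might look like:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 8))          # stand-in for scaled demand features
y = X.sum(axis=1) + 0.1 * rng.normal(size=200)  # synthetic target

# Single hidden layer with 10 neurons; alpha is the L2 penalty on the
# network weights, playing the role of the sum-of-squared-weights regularizer.
ann = MLPRegressor(hidden_layer_sizes=(10,), alpha=1e-3,
                   max_iter=2000, random_state=0).fit(X, y)
r2 = ann.score(X, y)
```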

3.4.3 Boosting regression tree. We follow the AdaBoost.R2 algorithm [30] to construct an ensemble of regression trees for short-term demand prediction. For each decision-tree regressor, the maximum depth is set to 3, and 200 weak trees are trained using the square loss function.

3.4.4 Gradient boosting tree. The gradient boosting method can be viewed as a combination of the boosting regression tree and gradient descent. The main idea of the gradient boosting tree algorithm is to construct new learners that minimize the loss function and to add them additively to the existing model [31]. In this study, we use the least-squares loss function and train 200 weak regression trees with a maximum depth of 3. We set the learning rate to 0.1 to avoid overfitting.

4 NUMERICAL RESULTS

4.1 Experiment Setting

All experiments are conducted using MATLAB. The built-in neural network library is used to train and test the ANN models. We use 2015 NYC taxi trip data, which contain 365 observations. We partition the data into two sets, with 75% forming the training set and the remaining 25% serving as the stand-alone test set. Model performances are compared by first conducting 5-fold cross-validation over the training set, and then training each model on the complete training data and evaluating its performance on the test data. Each sample contains the taxi data of 96 time steps across the study area, where each prediction time step corresponds to a 15-minute interval. We create 14 experimental scenarios by taking different segments of each sample with different start and end time steps.
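The data partitioning described above can be sketched as follows (the scikit-learn utilities and the placeholder location count are our assumptions; the paper uses MATLAB):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_days = 365                     # daily samples of 96 time steps each
samples = np.arange(n_days)

# 75% training / 25% stand-alone test split
train_idx, test_idx = train_test_split(samples, test_size=0.25,
                                       random_state=0)

# 5-fold cross-validation over the training set only; the test set
# is held out until the final evaluation.
folds = [(tr, va) for tr, va in
         KFold(n_splits=5, shuffle=True, random_state=0).split(train_idx)]
```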

4.2 Performance Measures

Two metrics are used for evaluating algorithm performance in this study. The first metric is the root mean square error (RMSE), which measures the deviation of predictions from observations and is calculated as

RMSE = (1/D) Σ_{d=1}^{D} sqrt( (1/(kn)) Σ_{i=1}^{n} Σ_{t=1}^{k} (ŷ_{d,i}^t − y_{d,i}^t)^2 )    (14)

where D is the number of observations in the validation data, ŷ_{d,i}^t is the predicted demand at location i during time interval t for observation d, and y_{d,i}^t is the corresponding observed demand.

The second metric is the modified mean absolute percentage error (mMAPE). It differs from the conventional MAPE by handling observations whose value is zero. Instead of averaging the errors of individual observations, the mMAPE measures the sum of errors over all observations divided by the sum of the observations:

mMAPE = (1/D) Σ_{d=1}^{D} ( Σ_{i=1}^{n} Σ_{t=1}^{k} |ŷ_{d,i}^t − y_{d,i}^t| ) / ( Σ_{i=1}^{n} Σ_{t=1}^{k} y_{d,i}^t )    (15)
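Both metrics can be implemented directly from Equations 14 and 15; a minimal NumPy sketch with a toy example (the array shapes and names are ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Equation (14): average over samples d of the per-sample RMSE
    across all locations i and prediction steps t."""
    per_sample = np.sqrt(((y_pred - y_true) ** 2).mean(axis=(1, 2)))
    return per_sample.mean()

def mmape(y_true, y_pred):
    """Equation (15): per-sample ratio of total absolute error to total
    observed demand, averaged over samples; well defined even when some
    individual observations are zero."""
    num = np.abs(y_pred - y_true).sum(axis=(1, 2))
    den = y_true.sum(axis=(1, 2))
    return (num / den).mean()

# y[d, i, t]: observation d, location i, prediction step t
y_true = np.array([[[10., 0.], [20., 30.]]])
y_pred = np.array([[[12., 1.], [18., 33.]]])
```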

4.3 Discussion

We first discuss the choice of the regularization coefficient for GCRF, as well as the number of learners for the boosting and bagging methods. We evaluate the RMSE for different λ values by averaging the RMSE from the 5-fold cross-validation; the result is presented in Figure 4(a). It corresponds to the scenario of using data from 8AM to 1PM to predict taxi demand from 1PM to 2PM (4 steps ahead). When λ > 1, the base GCRF model is observed to be almost completely sparse, with high errors. On the other hand, as we keep decreasing λ, the model becomes too complex and overfits the data. Based on the results of all scenarios, we choose λ = 0.05, and this value is also used for the boosting and bagging methods. While we do not have a stopping criterion for the boosting and bagging methods, the number of learners for each method is

Figure 4: Cross-validation results for choosing the regularizer and the number of weak learners. (a) RMSE versus λ; (b) RMSE versus the number of weak learners for the bagging and boosting methods (training and test curves).

Table 2: Comparison of model performance based on 5-fold cross-validation (mean RMSE ± standard deviation)

| Case | Start | End | Predict | GCRF | Boosting-GCRF | Bagging-GCRF | ANN | Gradient Boosting | AdaBoost Decision Tree |
|------|-------|-----|---------|------------|------------|------------|------------|------------|------------|
| 1  | 8  | 28 | 4 | 12.95±0.42 | 13.32±0.38 | 12.91±0.43 | 14.58±0.74 | 26.84±2.29 | 25.12±2.22 |
| 2  | 16 | 36 | 4 | 13.49±0.56 | 12.79±0.32 | 13.51±0.56 | 15.92±1.19 | 27.59±1.69 | 25.19±2.03 |
| 3  | 24 | 44 | 4 | 12.61±1.40 | 11.76±0.85 | 12.65±1.52 | 14.24±0.93 | 24.50±1.99 | 22.75±2.21 |
| 4  | 32 | 52 | 4 | 12.67±0.57 | 11.73±0.37 | 12.74±0.51 | 15.58±3.53 | 24.03±1.64 | 22.46±1.75 |
| 5  | 40 | 60 | 4 | 12.80±0.84 | 11.99±0.56 | 12.87±0.84 | 13.69±0.86 | 25.23±2.09 | 24.24±1.83 |
| 6  | 48 | 68 | 4 | 16.32±1.72 | 15.80±1.69 | 16.28±1.75 | 16.99±0.66 | 30.50±2.38 | 28.27±2.38 |
| 7  | 56 | 76 | 4 | 18.19±0.87 | 17.43±0.95 | 18.20±0.92 | 22.24±2.73 | 37.25±2.32 | 34.00±2.20 |
| 8  | 8  | 28 | 8 | 14.19±1.38 | 14.17±1.17 | 14.26±1.41 | 16.59±1.50 | 28.97±2.06 | 27.06±2.23 |
| 9  | 16 | 36 | 8 | 13.87±0.77 | 13.31±0.48 | 13.90±0.79 | 15.11±0.88 | 25.86±1.87 | 23.97±2.01 |
| 10 | 24 | 44 | 8 | 13.13±1.12 | 12.38±0.85 | 13.18±1.15 | 14.46±1.37 | 25.53±2.37 | 23.72±2.20 |
| 11 | 32 | 52 | 8 | 13.46±1.02 | 12.59±0.69 | 13.44±0.96 | 14.39±0.58 | 25.01±2.01 | 22.99±1.92 |
| 12 | 40 | 60 | 8 | 13.34±0.25 | 12.61±0.26 | 13.34±0.25 | 13.79±0.35 | 26.69±2.45 | 25.64±2.43 |
| 13 | 48 | 68 | 8 | 19.51±1.27 | 18.77±1.03 | 19.57±1.36 | 19.73±1.34 | 38.36±3.04 | 35.34±3.03 |
| 14 | 56 | 76 | 8 | 19.32±0.74 | 18.68±0.96 | 19.39±0.74 | 21.72±1.75 | 39.32±2.33 | 35.09±2.18 |
| Average RMSE | | | | 14.70 | 14.09 | 14.73 | 16.35 | 28.98 | 26.85 |
| Diff. vs GCRF (%) | | | | / | -4.13 | 0.20 | 11.22 | 97.10 | 82.60 |

determined by choosing the value that corresponds to the smallest test RMSE in the cross-validation results. As can be seen from Figure 4(b), the performance stabilizes once the number of learners exceeds 60, for both the boosting and bagging methods. As a consequence, the number of learners for bagging and boosting is set to 60.

Table 2 presents the cross-validation performance of the 6 algorithms under 14 different scenarios. The 14 scenarios correspond to prediction tasks at 7 different times of the day with two prediction lengths: 4 and 8 steps into the future. The start and end columns in the table define the segments of time used for training, and the predict column refers to the number of time steps over which to forecast taxi demand. Cases 1 and 8 can be viewed as predictions during morning peak hours (MP), cases 2-5 and 9-12 correspond to off-peak times (OP), and the remaining cases refer to evening peak times (EP). The


table presents both the mean RMSE and the standard deviation based on 5-fold cross-validation. The first conclusion we can draw is that the base GCRF models short-term taxi demand very well: its performance is consistently better than the ANN model, and significantly better than the boosting regression tree and gradient boosting tree methods. Beyond the mean RMSE, the lower standard deviations in the majority of scenarios also suggest that its prediction performance is more stable. Second, the boosting-GCRF proposed in this study is found to further improve on the base GCRF model. On average, it decreases the RMSE by 4.13% compared to the base GCRF model, and the standard deviation is improved in 11 of the 14 cases, with the largest improvement being 42.8% in case 2. However, the boosting-GCRF performs worse in case 1. The reason is that there are far fewer taxi trips between 2AM-7AM than in other periods of the day, so the distributions of the trips are unstable. This creates many outliers in the training data, and one well-known drawback of the boosting method is its high sensitivity to outliers. The issue is alleviated when we predict 8 steps ahead instead of 4 (case 8 versus case 1), since GCRF also exploits correlations among future predictions through Λ, and 8-step prediction provides more information to calibrate the matrix than 4-step prediction. Finally, the bagging-GCRF is found to perform similarly to the base GCRF model in all scenarios, which suggests the ineffectiveness of the bagging method in improving the well-regularized GCRF model. Note that performance is always associated with the bias-variance trade-off, and regularization is introduced to reduce the variance of a complex model with low bias. While the bagging approach is known to improve model performance by reducing variance, its ineffectiveness for the base GCRF model implies that the base model is already well regularized with the appropriate choice of regularization term. Since the boosting regression tree and gradient boosting tree methods perform far worse than the other methods, we focus only on the comparison of the other four approaches in the rest of the study.

The previous results are obtained from cross-validation on the training set; we next present the results on the stand-alone test set in Table 3. Since the total number of trips varies from time to time, the RMSE only provides a comparison among methods within the same scenario, while the mMAPE helps to compare performance between scenarios. It can be seen from the table that the average RMSE on the test set is in general consistent with that of the training set. Boosting-GCRF is still the best model, and its average performance is found to be even better than on the training data, with improvements of 4.96% in RMSE and 5.63% in mMAPE compared to the base GCRF model. Moreover, the best mMAPE achieved by the boosting-GCRF is 10.4%, and the biggest difference over the base GCRF is 11.8% (case 4). MP cases 1 and 8 are again the two cases that are comparatively harder to predict for all models, which can be explained by the same reasons discussed above. It can be concluded from these results that the boosting-GCRF model is a superior method for modeling short-term taxi demand, and its level of accuracy (12.4% mMAPE on average) may be well suited for real-world applications.

Table 3: Prediction performance comparison

| Case | GCRF RMSE | GCRF mMAPE | Boosting-GCRF RMSE | Boosting-GCRF mMAPE | Bagging-GCRF RMSE | Bagging-GCRF mMAPE | ANN RMSE | ANN mMAPE |
|------|--------|-------|--------|-------|--------|-------|--------|-------|
| 1  | 13.209 | 0.171 | 13.597 | 0.183 | 13.131 | 0.170 | 15.299 | 0.201 |
| 2  | 13.505 | 0.138 | 13.005 | 0.127 | 13.571 | 0.139 | 16.377 | 0.156 |
| 3  | 12.232 | 0.118 | 11.313 | 0.107 | 12.268 | 0.118 | 12.550 | 0.118 |
| 4  | 12.686 | 0.118 | 11.069 | 0.104 | 13.105 | 0.122 | 13.744 | 0.125 |
| 5  | 12.443 | 0.118 | 11.395 | 0.108 | 12.530 | 0.119 | 14.503 | 0.138 |
| 6  | 14.390 | 0.120 | 14.252 | 0.116 | 14.493 | 0.121 | 18.039 | 0.146 |
| 7  | 15.320 | 0.109 | 14.829 | 0.104 | 15.314 | 0.109 | 20.102 | 0.135 |
| 8  | 14.520 | 0.166 | 14.654 | 0.168 | 14.497 | 0.167 | 16.385 | 0.184 |
| 9  | 14.206 | 0.144 | 13.346 | 0.130 | 14.321 | 0.146 | 15.413 | 0.149 |
| 10 | 13.340 | 0.125 | 12.280 | 0.113 | 13.435 | 0.126 | 14.272 | 0.130 |
| 11 | 13.585 | 0.122 | 11.852 | 0.107 | 13.512 | 0.122 | 13.156 | 0.116 |
| 12 | 12.914 | 0.128 | 11.696 | 0.116 | 12.871 | 0.127 | 14.978 | 0.148 |
| 13 | 18.134 | 0.134 | 17.906 | 0.130 | 18.266 | 0.135 | 21.187 | 0.153 |
| 14 | 16.921 | 0.122 | 16.417 | 0.118 | 16.889 | 0.122 | 22.371 | 0.158 |
| Average | 14.100 | 0.131 | 13.401 | 0.124 | 14.157 | 0.132 | 16.313 | 0.147 |
| Diff. vs GCRF (%) | / | / | -4.962 | -5.627 | 0.405 | 0.492 | 15.690 | 12.182 |

Figure 5: Prediction performance at various locations. Each panel plots the observed and predicted numbers of trips (GCRF and boosting-GCRF) together with the per-step MAPE of both models over 12 prediction time steps: (a) the location with the highest average demand; (b)-(c) two low-demand locations; (d) the average over all locations in the study area.

Since the average mMAPE and RMSE suggest that the boosting-GCRF model is better than the base GCRF model, we next assess what the differences look like when we zoom into locations with different demand levels. Figure 5 presents snapshots of the prediction performance corresponding to case 11, but with the prediction interval extended from 8 to 12 time steps ahead. The results are averaged over all test samples. We visualize the location with the highest average demand and two locations with relatively low ridership. We also plot the prediction performance averaged over all locations in the study area to understand the mean performance at each time step. Figure 5(a) indicates that both GCRF and boosting-GCRF perform well for the high-demand location: the worst mMAPE is lower than 8%, and the predicted values resemble the shape of the observed number of trips. The boosting-GCRF reaches consistently better accuracy in this case, with the mMAPE close to or lower than 1% in six time steps. But when we turn to locations with low numbers of trips, a different story emerges. When there are more spikes due to low ridership, the boosting-GCRF may perform similarly (Figure 5(b)) or even worse (Figure 5(c)) than the base GCRF model.


Figure 6: Prediction performances under anomaly. Each panel is a scatter plot of predicted versus observed demand: (a) GCRF, mMAPE=0.182; (b) Bagging-GCRF, mMAPE=0.182; (c) Boosting-GCRF, mMAPE=0.124; (d) ANN, mMAPE=0.179.

Both approaches are found to perform poorly for certain low-demand locations, with the mMAPE being as high as 45%. Though the absolute difference is not significant considering the small magnitude of the trip counts, caution should still be exercised when applying the predictions to these places. If we average over all places, as shown in Figure 5(d), the boosting-GCRF outperforms the base GCRF in almost all time steps.

In addition to average performance, we are also interested in the robustness of the prediction results. There exist many anomalous events, such as bad weather and large sports events, which may significantly affect the distribution of taxi ridership. We present the prediction performances under anomalies in Figure 6. The results in Figure 6 are obtained from scenario 11, where we train the model on the whole training set and evaluate performance on the 15% of samples in the test set with the greatest deviations from the mean trip value. The closer the scatter points are to the red line, the better the prediction performance. The mMAPEs indicate that the boosting-GCRF model is robust to anomalies, with an mMAPE 31.9% better than the base and bagging GCRFs. The ANN is also found to perform better than the base GCRF under the anomaly scenario. The most noticeable difference is for demand between 100-200, where GCRF and bagging-GCRF overestimate the number of trips by a considerable margin, while the boosting-GCRF performs well across all trip values. Meanwhile, the ANN model predicts the majority of the points well, but it also suffers from outliers with excessive errors.

Finally, as the previous results concern point estimation, we discuss how well the models represent the actual distribution of short-term taxi trips. One way to assess how well a model captures the distribution of the data is to construct confidence intervals: a well-fitted distribution should cover a% of the data at the a% confidence level. While it is

Table 4: Coverage level

| Period | Confidence | GCRF | Boosting-GCRF | Bagging-GCRF |
|--------|------------|-------|---------------|--------------|
| MP | 68% | 0.978 | 0.853 | 0.975 |
| MP | 90% | 0.994 | 0.940 | 0.993 |
| MP | 95% | 0.999 | 0.979 | 0.999 |
| OP | 68% | 0.958 | 0.744 | 0.954 |
| OP | 90% | 0.992 | 0.887 | 0.991 |
| OP | 95% | 0.998 | 0.959 | 0.998 |
| EP | 68% | 0.970 | 0.783 | 0.967 |
| EP | 90% | 0.996 | 0.912 | 0.995 |
| EP | 95% | 0.999 | 0.970 | 0.999 |

easy to do so for univariate distributions, it is non-trivial to build confidence regions for multivariate distributions, especially in the high-dimensional cases of our study. One approach is to take the Cartesian product of the a% confidence interval of each dimension, but the chance that the resulting confidence region misses the mean in at least one dimension would then be far higher than 1 − a. To address this issue, we apply the Bonferroni correction: to reach an overall confidence level of a, we take the joint confidence intervals of each dimension with an individual confidence level of 1 − (1 − a)/T, where T is the total number of steps to predict. We then evaluate the percentage of samples that are contained within the confidence region constructed at each location across the prediction time steps. The results are presented in Table 4, averaged over the MP, OP, and EP cases following the discussion of Table 2. It can be seen from the results that both the base GCRF and the bagging-GCRF are under-confident in their predictions, giving excessively wide confidence regions. Such probabilistic estimates are less informative and have little value for deriving stochastic operation strategies in the real world. On the other hand, the boosting-GCRF performs well in all three cases, and its coverage levels are very close to all of the confidence levels in the OP and EP cases. The only exception is the 68% confidence level during the morning peak, where the boosting-GCRF was also found to perform worse in the previous discussions. A possible amendment is to train a dedicated model for this period with a different regularization term and a different number of weak learners. In general, we may conclude that the boosting-GCRF is very effective in terms of both its point estimates and its probabilistic prediction performance.
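The Bonferroni-corrected coverage check can be sketched as follows, assuming Gaussian marginals with known mean and standard deviation (the synthetic data and function name are our assumptions):

```python
import numpy as np
from scipy.stats import norm

def coverage(y_obs, mean, std, level, T):
    """Fraction of samples whose observed demand falls inside the
    Bonferroni-corrected confidence region across all T prediction
    steps; each per-step interval uses level 1 - (1 - level) / T."""
    per_dim = 1.0 - (1.0 - level) / T
    z = norm.ppf(0.5 + per_dim / 2.0)          # two-sided Gaussian quantile
    inside = np.abs(y_obs - mean) <= z * std   # shape: (samples, T)
    return inside.all(axis=1).mean()           # all T steps must be covered

rng = np.random.default_rng(3)
mean, std, T = 0.0, 1.0, 8
y = rng.normal(mean, std, size=(5000, T))      # synthetic forecasts vs. truth
cov68 = coverage(y, mean, std, 0.68, T)
```

For independent dimensions the Bonferroni region is conservative, so the empirical coverage lands somewhat above the nominal 68%.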

5 CONCLUSION

In this study, we analyze the characteristics of short-term taxi demand based on real-world trip data and develop the boosting-GCRF model for forecasting short-term taxi demand. Numerical results suggest that the boosting-GCRF reaches a best mMAPE of 10.4%, is robust against demand anomalies, and models well the actual distributions of the observations across space and time. For future work, we plan to fine-tune the model to improve performance in low-demand areas, introduce data fusion to incorporate other time-varying factors into the current modeling framework, and apply the model to other prediction tasks, such as short-term traffic flow and travel speed prediction.

ACKNOWLEDGMENT

The algorithmic development in this work is partially funded by NSF grant 1520338. The authors take full responsibility for the findings, which do not represent the views of the NSF.


REFERENCES

[1] NYCTLC. 2015 New York City taxicab factbook, 2015.
[2] Xianyuan Zhan, Xinwu Qian, and Satish V Ukkusuri. A graph-based approach to measuring the efficiency of an urban taxi service system. IEEE Transactions on Intelligent Transportation Systems, 17(9):2479–2489, 2016.
[3] Nicholas Jing Yuan, Yu Zheng, Liuhang Zhang, and Xing Xie. T-finder: A recommender system for finding passengers and vacant taxis. IEEE Transactions on Knowledge and Data Engineering, 25(10):2390–2403, 2013.
[4] Meng Qu, Hengshu Zhu, Junming Liu, Guannan Liu, and Hui Xiong. A cost-effective recommender system for taxi drivers. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 45–54. ACM, 2014.
[5] Ye Ding, Siyuan Liu, Jiansu Pu, and Lionel M Ni. HUNTS: A trajectory recommendation system for effective and efficient hunting of taxi passengers. In Mobile Data Management (MDM), 2013 IEEE 14th International Conference on, volume 1, pages 107–116. IEEE, 2013.
[6] Shuo Ma, Yu Zheng, and Ouri Wolfson. T-share: A large-scale dynamic taxi ridesharing service. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 410–421. IEEE, 2013.
[7] Shuo Ma, Yu Zheng, and Ouri Wolfson. Real-time city-scale taxi ridesharing. IEEE Transactions on Knowledge and Data Engineering, 27(7):1782–1795, 2015.
[8] Chi-Chung Tao. Dynamic taxi-sharing service using intelligent transportation system technologies. In Wireless Communications, Networking and Mobile Computing, 2007. WiCom 2007. International Conference on, pages 3209–3212. IEEE, 2007.
[9] Xinwu Qian, Wenbo Zhang, Satish V Ukkusuri, and Chao Yang. Optimal assignment and incentive design in the taxi group ride problem. Transportation Research Part B: Methodological, 2017.
[10] Xinwu Qian and Satish V Ukkusuri. Time-of-day pricing in taxi markets. IEEE Transactions on Intelligent Transportation Systems, (99):1–13, 2017.
[11] Jiarui Gan, Bo An, Haizhong Wang, Xiaoming Sun, and Zhongzhi Shi. Optimal pricing for improving efficiency of taxi systems. In IJCAI, 2013.
[12] Xinwu Qian and Satish V Ukkusuri. Spatial variation of the urban taxi ridership using GPS data. Applied Geography, 59:31–42, 2015.
[13] Kai Zhao, Denis Khryashchev, Juliana Freire, Cláudio Silva, and Huy Vo. Predicting taxi demand at high spatial resolution: Approaching the limit of predictability. In Big Data (Big Data), 2016 IEEE International Conference on, pages 833–842. IEEE, 2016.
[14] Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. Predicting taxi-passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393–1402, 2013.
[15] Xinwu Qian, Xianyuan Zhan, and Satish V Ukkusuri. Characterizing urban dynamics using large scale taxicab data. In Engineering and Applied Sciences Optimization, pages 17–32. Springer International Publishing, 2015.
[16] Paul J Curran. The semivariogram in remote sensing: an introduction. Remote Sensing of Environment, 24(3):493–507, 1988.
[17] Ivan Žežula. On multivariate Gaussian copulas. Journal of Statistical Planning and Inference, 139(11):3942–3946, 2009.
[18] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML, volume 1, pages 282–289, 2001.
[19] Matt Wytock and J Zico Kolter. Large-scale probabilistic forecasting in energy systems using sparse Gaussian conditional random fields. In 52nd IEEE Conference on Decision and Control, pages 1019–1024. IEEE, 2013.
[20] Marshall F Tappen, Ce Liu, Edward H Adelson, and William T Freeman. Learning Gaussian conditional random fields for low-level vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[21] Vladan Radosavljevic, Kosta Ristovski, and Zoran Obradovic. Gaussian conditional random fields for modeling patients' response to acute inflammation treatment. In Int. Conf. Machine Learning Workshop on Machine Learning for System Identification (ICML Workshop). Citeseer, 2013.
[22] A Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.
[23] Matt Wytock and J Zico Kolter. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In ICML (3), pages 1265–1273, 2013.
[24] Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.
[25] Dimitri P Solomatine and Durga L Shrestha. AdaBoost.RT: a boosting algorithm for regression problems. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pages 1163–1168. IEEE, 2004.
[26] Tao Chen and Jianghong Ren. Bagging for Gaussian process regression. Neurocomputing, 72(7):1605–1610, 2009.
[27] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[28] G Peter Zhang. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50:159–175, 2003.
[29] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[30] Harris Drucker. Improving regressors using boosting techniques. In ICML, volume 97, pages 107–115, 1997.
[31] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
