
Precise estimation of connections of metro passengers from Smart Card data

Sung-Pil Hong1 • Yun-Hong Min1,3 • Myoung-Ju Park1,4 • Kyung Min Kim1,2 • Suk Mun Oh2

Published online: 1 April 2015

© Springer Science+Business Media New York 2015

Abstract The aim of this study is to estimate both the physical and schedule-based connections of metro passengers from the times and stations of their gate entries and exits, a data set available from Smart Card transactions in a majority of train networks. By examining the Smart Card data, we observe a set of transit behaviors of metro passengers, manifested in the time intervals that identify the boarding, transferring, or alighting train at a station. The authenticity of the time intervals is ensured by separating out a set of passengers whose trip has a unique connection that is better in all respects than any alternative connection. The connections of such passengers, known as reference passengers, can be readily determined, and hence their gate times and stations can be used to derive reliable time intervals. To detect the unknown path of a passenger, the proposed method checks, for each alternative connection, whether it admits a sequence of boarding, middle, and alighting trains whose time intervals are all consistent with the gate times and stations of the passenger, a necessary condition for a true connection. Tested on 32 million trips over one week, the proposed method detected unique connections satisfying the necessary condition, which are therefore most likely the true physical and schedule-based connections, in 92.6 and 83.4 % of the cases, respectively.

Keywords Physical and schedule-based connection estimation · Smart Card data · Metro network · Passenger’s behaviors

Corresponding author: Kyung Min Kim

1 Department of Industrial Engineering, Seoul National University, San 56-1 Shilim-dong, Kwanahk-gu, Seoul 151-742, South Korea

2 Present Address: Policy-Technology Convergence Research Division, Korea Railroad Research Institute, 360-1 Woulam-dong, Uiwang-city, Geonggi-do 437-757, South Korea

3 Present Address: Intelligence Computing Laboratory, Samsung Electronics Co. Ltd., Suwon City, South Korea

4 Present Address: Department of Industrial Engineering and Management Systems Engineering, Kyung Hee University, Yongin City, South Korea


Transportation (2016) 43:749–769

DOI 10.1007/s11116-015-9617-y


Introduction

Since the advent of the Smart Card Automated Fare Collection System, or Smart Card for

short, a major issue for transit planners has been how to fully realize the potential of the

massive transaction data (Pelletier et al. 2011). One possibility is that a Smart Card data

analysis culminates in providing an exact estimation of the physical and schedule-based

connections of each and every card holder’s trip. Archived information on the complete

train choices of passengers has numerous applications in planning and operation of a public

transit network (Seaborn et al. 2009; Trepanier et al. 2007). Typical examples include:

1. Empirical evaluation of a transit assignment model (Lam and Lo 2004; Kato et al.

2010; Raveau et al. 2011). Most transit assignment models in the literature have

verified their validity based on the train choice data from the surveyed passengers. It

normally requires a costly procedure to secure a large enough sample to guarantee a

level of accuracy.

2. Clarification of revenue collected by different train operating companies (Rinks 1986;

Tsamboulas and Antoniou 2006). A public transit network may evolve into an

integration of sub-networks of different operating companies. The lines are then intertwined so that a card holder may carry out a trip with minimal card transactions with the system. This makes it difficult to allocate the fare revenue collected from passengers among the operating companies. There appears to be agreement that real ridership data (e.g. in person-kilometers) for each sub-network are essential to any clarification method.

3. Estimation of connection cost (Nour et al. 2010; Guo and Wilson 2009). In principle, a

transit assignment contemplates an equilibrium attained by a user-optimal behavior

pursuing a minimum cost connection for a trip. Extensive real path-choice data from Smart Card records are expected to provide a firm empirical ground for studies of user-optimal behavior.

Among the data fields of a Smart Card, the quadruple, (Departure station O, Entry time at

gate, Arrival station D and Exit time at gate) is available in a majority of train networks. It

is evident that this is also a minimal set of data required for a precise estimation of

passengers’ connections. Conversely, the quadruple appears to be the maximal data set that can reasonably be expected from Smart Card data. A modern metro is expected to accommodate a passenger’s trip with minimal transactions with the fare collection system. Privacy concerns also make it difficult to implement a system that monitors more traces of individual trips. The quadruple, therefore, seems to be the most reasonable data available from a Smart Card system.

To estimate a schedule-based connection of a passenger, we simply need to identify the

sequence of trains chosen for his/her O–D trip. Yet, it appears to be a nontrivial task to

trace the sequence even if the quadruples are known. A natural way to use the quadruples might be to compute the inter-gate time of a passenger, namely the interval between the entry time at the origin O and the exit time at the destination D, and to assign him/her to the connection with the closest mean inter-gate time.

The inter-gate time of a passenger, however, has a large variance because its components include the off-train movement times, namely the gate-to-platform, transfer, and platform-to-gate times. In the metro network in Seoul, for instance, it is not rare for an O–D pair to have alternative connections with similar inter-gate times. The method may thus err seriously. For instance, the Shillim–Garak Market station pair has two alternative physical connections, one via SeoulNat’lUnivEducation station and the other via Jamsil. The mean


and the standard deviation of the inter-gate time on the former connection are, respectively, 2936.7 and 189.1 s, while on the latter they are 2923.5 and 239.2 s. If we assume, for simplicity, that the inter-gate times have a Gaussian distribution, 43.7 % of the passengers on the first physical connection have an inter-gate time closer to the mean inter-gate time of the second physical connection. However, as will be shown in this paper, the quadruples, taken as a whole, provide crucial information on the connection choice of passengers.

This paper is organized as follows. This section continues with a summary of the

previous estimation methods relying on the Smart Card data. In ‘‘Principles of proposed

method’’ section, we identify a set of transit behaviors and a special class of metro

passengers that enable us to develop a consistency condition that a true connection nec-

essarily satisfies. In ‘‘The algorithm’’ section, we develop an algorithm to detect a con-

nection of a passenger based on the consistency condition and illustrate some possible

cases. ‘‘Connection estimation results’’ section reports on an empirical evaluation of the

method applied to 32 million weekly O–D trips in the Seoul metropolitan area, from

Sunday, November 20 to Saturday, November 26, 2011. Finally, some concluding remarks

are provided in ‘‘Conclusion’’ section.

Literature review

The potential of a substantial and detailed collection of Smart Card data in the public

transit system has attracted the attention of researchers since the turn of the millennium (see, for example, Lehtonen et al. 2002; Bagchi and White 2005). Consequently, various behavioral analyses of Smart Card data have emerged, mostly in countries that adopted the system early (Asakura et al. 2008; Guo and Wilson 2009; Jang 2010; Kusakabe et al. 2010; Morency et al. 2007; Park et al. 2008; Seaborn 2008; Seaborn et al. 2009; Trepanier et al. 2007; Utsunomiya et al. 2006). For a more comprehensive literature review on Smart Card data used in public transit systems, readers are referred to Pelletier et al. (2011).

Most of these studies, with a few exceptions, are based on statistical analyses of Smart

Card transaction data. However, there has been much emphasis on the importance of deterministic and detailed information on travel behavior, such as the length of individual trips (Bagchi and White 2005).

In reality, exact and complete estimates of the connections of individual passengers

would have led to a more beneficial analysis in many studies (Utsunomiya et al. 2006;

Morency et al. 2007; Park et al. 2008; Seaborn 2008; Asakura et al. 2008; Guo and Wilson

2009; Jang 2010; Raveau et al. 2011). However, such estimates could not be achieved

solely from a statistical behavioral study. In particular, recent studies (Trepanier et al.

2007; Seaborn et al. 2009; Kusakabe et al. 2010) for the purpose of marketing or transit

planning have aimed at more specific estimates of passengers’ connections.

Trepanier et al. (2007) proposed a method of estimating, from Smart Card data and bus

routes, the alighting stops of individual passengers in Gatineau, Quebec, Canada. For

buses, the locations were recorded in Smart Card data at boarding, but not necessarily on

alighting. Suppose a passenger travelled on two buses A and B consecutively in a day. They

reasoned that the alighting stop from bus A was the closest to the boarding location of bus

B. When a passenger rides only a single bus A in a day, they identify the bus, say A′, that the passenger will ride in the near future and apply the same reasoning to A and A′ to estimate the passenger’s alighting stop from A.

In a multi-modal public transit network, it is nontrivial to decide if two consecutive legs

of a single passenger journey are actually the connections of a single trip. In Seaborn et al.


(2009), they considered the transfers between buses and between bus and train (but not

between trains, because a record on connecting trains was not available from Smart Card

data). They proposed to consider them as the connections of a single trip if the time interval

between the alighting from the first vehicle and the boarding of the second vehicle was less

than a specific threshold value. To determine the threshold value, they applied the sta-

tistical method proposed in Seaborn (2008) to the time intervals whose alighting and

boarding locations were adequately close. The same method was also used in Jang (2010)

to determine a threshold value for the transit network in Seoul, Korea.

The work of Kusakabe et al. (2010) is closely related to ours in that it aimed at estimating passengers’ choices among express, rapid and local trains for the same O–D trip. Their method uses Smart Card data, the railway topology, and the train timetable, on the premise that trains operate exactly to the timetable. A time-expanded network is then derived from the timetable. A key assumption is that passengers always choose the shortest connection in transit time and number of transfers. Given the entry and exit times at gates for a passenger from Smart Card data, their algorithm determines the most probable boarding and alighting trains. It then returns a shortest connection consistent with the two trains as the passenger’s connection. A tie is broken, if possible, by choosing the connection with the minimum number of transfers.

Zhou and Xu (2012) performed a case study to estimate the schedule-based connections of

passengers based on their entry and exit times and the train log, namely the real data on arrival

and departure times of trains. It is based on the assumption that the passengers minimize a

surplus time, the time interval between the earliest arrival of a passenger at the platform and the

departure of the first train since then. Table 1 summarizes these studies and ours.

Table 1 Literature summary

| Study | Mode | Object of estimation | Location | Required information other than Smart Card data | Assumptions |
|---|---|---|---|---|---|
| Trepanier et al. (2007) | Bus | Alighting locations | Gatineau, Quebec, Canada | Bus route | People ride buses at stops closest to preceding alighting locations |
| Kusakabe et al. (2010) | Railway | Train choices | Osaka, Japan | Train log, Railway network topology | People choose shortest connections |
| Zhou and Xu (2012) | Metro | Train choices | Beijing, China | | People choose train to minimize a surplus time |
| Fu et al. (2014) | Metro | Train choices | London, UK | Railway network topology | Transit time distribution of each route follows Gaussian or lognormal |
| Seaborn et al. (2009) | Bus and train | Linking trips | London, UK | Bus route, GPS for stations and stops | Threshold value on transfer time |
| Ours | Metro | Train choices | Seoul, Korea | | People do not route longer by 30 min. or more |

Smart Card data

Since its introduction in 2000, the Smart Card has quickly become the predominant method of payment for the metro network in the Seoul metropolitan area. By 2005, 72 % of metro passengers used a Smart Card, with approximately 20 million transactions per day (Park et al. 2008). In 2011, Smart Card became the only payment method.

Our connection estimating method in ‘‘The algorithm’’ section assumes the quadruple

(Departure station O, Entry time at gate, Arrival station D, Exit time at gate) for each O–D

trip of a metro passenger. For trains, this appears to be the case in most Smart Card systems

as shown in Table 2, although in some systems, for instance in Chicago, the pair (Arrival

station D, Exit time at gate) was not retained in the data.

Principles of proposed method

The schedule-based connection estimating problem can be posed formally as follows.

Definition 2.1 (Section, physical and schedule-based connections) By a section, we mean the physical metro line between two adjacent stations inclusively. By a physical connection, in turn, we mean the concatenation of sections that a passenger passes during her/his O–D trip. Then a schedule-based connection is defined to be a sequence of trains that a passenger can take on a physical connection.

For example, the arc between Isu and Dongjak stations of the metro network illustrated

in Fig. 1, is a section. We can see that the physical connection from Bongcheon to Dongjak

consists of five sections. A train from Bongcheon to Sadang and its connecting train at

Sadang to Dongjak constitute a schedule-based connection for the Bongcheon–Dongjak pair.

Note that once the schedule-based connection has been estimated, the physical connection is immediate. Also, by a train log, we mean the set of records indicating the arrival and departure time of each train at each station.

Table 2 Comparison of data fields for the Smart Card in cities

| Fields | Seoul, Korea | London, UK, Seaborn et al. (2009) | Quebec*, Canada, Trepanier et al. (2007) | Osaka**, Japan, Kusakabe et al. (2010) | Chicago, US, Utsunomiya et al. (2006) | San Francisco, US, Utsunomiya et al. (2006) | Washington D.C., US, Utsunomiya et al. (2006) |
|---|---|---|---|---|---|---|---|
| Route ID | Buses only | Buses only | Buses | – | Buses and trains | Buses only | Buses only |
| Vehicle ID | Buses only | Buses only | Buses | – | – | Buses only | Buses only |
| Boarding time | Buses and trains | Buses and trains | Buses | Trains | Trains only | Buses and trains | Buses and trains |
| Boarding location | Buses and trains | Buses and trains | Buses | Trains | Trains only | Buses and trains | Buses and trains |
| Alighting time | Buses and trains | Trains only | – | Trains | – | Buses and trains | Buses and trains |
| Alighting location | Buses and trains | Trains only | – | Trains | – | Buses and trains | Buses and trains |

* No metro in Quebec, ** Smart Card in Osaka is for trains only

Problem 2.2 (Schedule-Based Connection Estimation Problem)

Input: The topology of the physical metro network, the set of passengers over a predetermined time horizon along with their quadruples, q = (Departure station O, Entry time at gate, Arrival station D, Exit time at gate), and the real arrival and departure times of trains from a train log.

Output: The schedule-based connection of each passenger for his/her O–D trip, namely, the complete sequence of boarding, transfer, and alighting trains that was chosen for his/her O–D trip in the metro network.
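As an illustration only, the input of Problem 2.2 could be represented along the following lines. The class and field names (Quadruple, TrainLogRecord, inter_gate_seconds) are assumptions made for this sketch and are not part of the original method; the sample record is passenger a from the illustration later in the paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Quadruple:
    """One Smart Card O-D trip: (O, entry time, D, exit time)."""
    origin: str
    entry_time: datetime
    destination: str
    exit_time: datetime

    def inter_gate_seconds(self) -> float:
        """Interval between gate entry at O and gate exit at D."""
        return (self.exit_time - self.entry_time).total_seconds()


@dataclass(frozen=True)
class TrainLogRecord:
    """Actual arrival and departure of one train at one station."""
    train_id: str
    station: str
    arrival: datetime
    departure: datetime


# Passenger a: (Shillim, 07:33:47, Garak Market, 08:16:53) on November 21, 2011.
a = Quadruple("Shillim", datetime(2011, 11, 21, 7, 33, 47),
              "Garak Market", datetime(2011, 11, 21, 8, 16, 53))
print(a.inter_gate_seconds())  # 2586.0
```

Keeping the records immutable (frozen) is a design choice of the sketch: quadruples and train log entries are observations and are never modified by the estimation.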

Generating tentative physical connections for O–D trips

The proposed method constructs a prior set of physical connections for each O–D trip by

excluding irrational connections. A passenger travelling from Station O to Station D

chooses a connection of the least cost. If other conditions are equivalent, the cost is an

increasing function of each of the number of transfers, travel times and the level of

congestion [e.g. Bureau of Public Roads (1964), De Cea and Fernandez (1993), Nielsen

(2000)]. According to Shin et al. (2007), trips involving three or more transfers between lines account for 1.5 % of total trips in the Seoul metropolitan area. We first exclude such trips from consideration.

For each O–D pair, we enumerate all possible physical connections and group them by the number of transfers, n = 0, 1, and 2. From each group, we discard the connections that are longer than a shortest one by k sections or more. Thus we assume that people do not take a route that is longer by some threshold value. We set k = 10, which is equivalent to 31 min of in-transit time. This is motivated by the fact that the mean in-transit time of a passenger is 30 min in the Seoul metropolitan area. Indeed, a case analysis shows that the number of trips with an inter-gate time longer than the minimum by more than 30 min is insignificantly small.

Fig. 1 O–D pairs with a unique tentative physical connection

Then we perform an inter-group comparison; every physical connection is removed if it

has 10 more sections than an alternative physical connection with fewer transfers. The

connections left are defined to be the tentative physical connections of the O–D pair. In the

case of the Seoul metropolitan area, the process generated 3.9 tentative connections on

average for an O–D pair.
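The two filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Connection tuple and the function name tentative_connections are hypothetical, the enumeration of candidate connections is taken as given, and the inter-group rule is read as "k or more sections longer than an alternative with fewer transfers". The sample data are the two Bongcheon-Isu connections discussed in the text (17 sections with two transfers, 3 sections with one transfer).

```python
from typing import NamedTuple


class Connection(NamedTuple):
    name: str
    transfers: int
    sections: int


def tentative_connections(candidates, k=10, max_transfers=2):
    """Filter candidate physical connections of one O-D pair."""
    # Exclude connections with three or more transfers.
    kept = [c for c in candidates if c.transfers <= max_transfers]

    # Shortest length (in sections) within each transfer group.
    shortest = {}
    for c in kept:
        shortest[c.transfers] = min(shortest.get(c.transfers, c.sections), c.sections)

    # Intra-group: drop connections longer than the group's shortest by k or more.
    kept = [c for c in kept if c.sections - shortest[c.transfers] < k]

    # Inter-group: drop a connection that is k or more sections longer than
    # some alternative with fewer transfers (assumed reading of the rule).
    result = []
    for c in kept:
        dominated = any(n < c.transfers and c.sections - s >= k
                        for n, s in shortest.items())
        if not dominated:
            result.append(c)
    return result


candidates = [
    Connection("via Sadang", transfers=1, sections=3),
    Connection("via Shindorim and Seoul Station", transfers=2, sections=17),
]
print([c.name for c in tentative_connections(candidates)])  # ['via Sadang']
```

Comparing against the shortest connection of each lower-transfer group is sufficient for the inter-group step, because that shortest connection always survives the intra-group step.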

Reference passengers

In the above process, we found that passengers with a unique tentative physical connection are prevalent. Consider, for instance, a Bongcheon-Isu trip on the metro network in Fig. 1. It requires at least one transfer since the stations are on different lines. Two physical connections are possible: one via Shindorim and Seoul Station, counter-clockwise, the other via Sadang, clockwise. The former requires two transfers and 17 sections, while the latter requires one transfer and 3 sections. The process therefore returns the latter as the unique tentative physical connection for a Bongcheon-Isu trip. A similar argument applies to the pairs Bongcheon-Dongjak, SeoulNat’lUniv-Isu, SeoulNat’lUniv-Dongjak, and so on.

About 47 % of the daily passengers on the metro-network in the Seoul metropolitan

area turned out to have a unique tentative physical connection.

Definition 2.3 (Reference passengers) By a reference passenger, we mean a passenger

whose O–D trip has a unique tentative physical connection.

Reference passengers, as their connections are guaranteed, play a crucial role in the

proposed connection estimation method.

Alighting and boarding time intervals

The idea is best illustrated by an example. There were 203 trips of passengers from Shillim

to Gangnam station, initiated between 7 and 9 A.M. on November 21, 2011. The first

plot in Fig. 2 shows the entry times of the passengers at the origin, Shillim. They

appear uniformly distributed as expected. However, the exit times at the destination of the

same passengers, Gangnam, show a spiky pattern distributed over a brief period of time.

Fig. 2 Entry and exit times for the same set of Shillim–Gangnam passengers

It is typical behavior for an alighting passenger to rush to a gate and exit as soon as possible. The platform-to-gate time of each passenger thus typically reflects the passenger’s maximal speed and hence has the characteristic of an extreme value (Einmahl and Smeets 2011). In fact, according to Ko et al., the platform-to-gate time of an alighting passenger is best fitted by the Frechet distribution, which is mostly used for fitting extreme values. Figure 3 shows the relative frequency of the platform-to-gate times of the alighting passengers at Gangnam station from 5:30 to 11:00 A.M. on November 21, 2011, fitted with Gamma, Inverse Gaussian and Frechet distributions. The Frechet distribution is the best fit.

Definition 2.4 (Alighting groups and time intervals) By an alighting group, AG(X, N), we mean the set of passengers that alight from the train X at their common destination N (regardless of their origins). An alighting time interval is the time interval between the first and last exit times in an alighting group.

The extreme value characteristic of the platform-to-gate times renders an alighting time interval substantially shorter than the interarrival time of trains at a station, and hence the alighting time intervals of consecutive trains are disjoint. In the Seoul metropolitan area, the smallest headway in peak hours was 3.5 min, while the mean and standard deviation of the platform-to-gate times are 1.9 and 1.0, respectively.

Definition 2.5 (Boarding groups) By a boarding group, BG(X, N), we mean the set of passengers that board the same train X at their common origin N (regardless of their destinations).

Unlike the alighting case, the boarding behaviors of metro passengers do not present disjoint time intervals. However, the first-come-first-served queue discipline is well observed in boarding, and the order of entry is maintained. In the metro network of the Seoul metropolitan area, it turns out that at most two consecutive time intervals of boarding groups may overlap. To see this, consider Fig. 4, a 2-dimensional plot of the entry and exit times at the gates of Shillim–Gangnam passenger groups. This plot, called an entry-exit map, originated with Kusakabe et al. (2010); the x-axis represents the entry time and the y-axis the exit time of a passenger. In the figure, the passenger group of each Shillim–Gangnam train is identified by the rectangle of boarding and alighting time intervals in the entry-exit map. Furthermore, the disjoint alighting time intervals make the rectangles also disjoint, a source of the preciseness of the proposed method.

Fig. 3 Platform-to-gate-time distribution at Gangnam station

Suppose the alighting and boarding time intervals of AG(X, N) and BG(X, N), respectively, are known to us for each train X and station N. Then, we can determine if X can be the alighting or boarding train of a passenger at station N by checking if the exit or entry time at N falls in the time intervals of AG(X, N) or BG(X, N). To develop the consistency check into a connection estimation method, we need first to estimate the time intervals of trains at each station as in Fig. 4.

It is not, however, a trivial task to derive the time intervals solely by plotting the quadruples of passengers. The passengers from in- and out-bound trains, for instance, may happen to exit at the same gate at nearly the same time. Or, in a transfer station, the alighting passengers from different lines may merge at a gate. The second key idea is to use the reference passengers to derive the time intervals.

Estimation of alighting and boarding time intervals

Choose only the reference passengers whose physical connections involve no transfer. Suppose the quadruple of such a passenger from Smart Card is q = (O, Entry time = t1, D, Exit time = t2). We then consider the set P of trains that departed from O after t1 and the set Q of trains that arrived at D prior to t2. If the two sets have only one common train, say X, it should be the alighting train of the reference passenger at D. In other words, the passenger belongs to the alighting group AG(X, D).

Also notice that if either the gate-to-platform or platform-to-gate time is less than the inter-arrival time of the trains, as in most of the real cases, P ∩ Q should be a singleton (whose element is, of course, the train choice of q). Thus, we can identify the alighting train of the reference passengers (if their trips are not delayed abnormally from gate to platform or from platform to gate).

Thus, by repeating the procedure for each of the chosen reference passengers, we can capture a large subset $\widetilde{AG}(X, N)$ of AG(X, N) for each train X and station N. Hence the time interval of $\widetilde{AG}(X, N)$ offers a good estimate of the alighting time interval of AG(X, N).

Fig. 4 Entry-exit map of the Shillim–Gangnam passengers


Once $\widetilde{AG}(X, N)$ has been constructed for each X and N, we derive an estimate $\widetilde{BG}(X, N)$ of the boarding group BG(X, N) in the following manner. Check every reference passenger who departed at N via X. Put the passenger into $\widetilde{BG}(X, N)$ if his/her exit time at D, the destination, falls in the alighting time interval of $\widetilde{AG}(X, D)$.

Similarly, the boarding time interval of BG(X, N) is then estimated using the time interval of $\widetilde{BG}(X, N)$. As discussed earlier, unlike the alighting case, the time intervals may overlap.
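The identification of the alighting train through P ∩ Q, and the resulting estimate of the alighting time intervals, might be sketched as follows. This is a simplified illustration, assuming a single O–D pair on one line with no transfer and times given in seconds; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def alighting_train(t1, t2, departures_at_o, arrivals_at_d):
    """Return the single train in P ∩ Q, or None if P ∩ Q is not a singleton.

    departures_at_o maps train id -> departure time at O,
    arrivals_at_d maps train id -> arrival time at D.
    """
    p = {x for x, dep in departures_at_o.items() if dep > t1}  # departed O after t1
    q = {x for x, arr in arrivals_at_d.items() if arr < t2}    # arrived at D before t2
    common = p & q
    return next(iter(common)) if len(common) == 1 else None


def alighting_intervals(passengers, departures_at_o, arrivals_at_d):
    """Estimate, per train, the interval between first and last exit times."""
    exits = defaultdict(list)
    for t1, t2 in passengers:
        x = alighting_train(t1, t2, departures_at_o, arrivals_at_d)
        if x is not None:  # skip passengers whose train is ambiguous
            exits[x].append(t2)
    return {x: (min(ts), max(ts)) for x, ts in exits.items()}


# Hypothetical toy data: three trains, four reference passengers (t1, t2).
departures_at_o = {"T1": 100, "T2": 300, "T3": 500}
arrivals_at_d = {"T1": 1000, "T2": 1200, "T3": 1400}
passengers = [(120, 1260), (250, 1230), (310, 1450), (50, 1450)]
print(alighting_intervals(passengers, departures_at_o, arrivals_at_d))
# {'T2': (1230, 1260), 'T3': (1450, 1450)}
```

The last passenger is skipped because all three trains are compatible with his/her gate times, which corresponds to the abnormal-delay case excluded in the text.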

Another simple but important observation is that a reference passenger who made a

transfer in his/her trip is a verifier that a transfer has actually been made between the two

connecting trains he/she rode. For instance, in Fig. 1 a reference passenger from Bong-

cheon to Isu station certifies that there has been a transfer between the connecting trains he/

she used at the transfer station Sadang. This is very useful when estimating the connection of a passenger whose trip involves a transfer. From a list of possible connecting trains at a transfer station, we can remove those that have no verifier of an actual transfer.

Definition 2.6 (Transfer reference passengers) By the transfer reference passengers RP(X, Y, A), we mean the set of reference passengers who transferred from Train X to Y at Station A.

We now discuss how to find the transfer reference passengers RP(X, Y, A). Suppose X and Y are, respectively, from Lines 1 and 2. We look up the list of the reference passengers who transferred from Line 1 to Line 2 at A. Suppose the quadruple of an O–D passenger is consistent with X and Y. Namely, his/her entry time at O on Line 1 falls into the time interval of $\widetilde{BG}(X, O)$ and exit time at D on Line 2 falls into the time interval of $\widetilde{AG}(Y, D)$. He/she is a proof that a transfer has been made from X to Y at A and, thus, is added to $\widetilde{RP}(X, Y, A)$.
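The collection of transfer verifiers can be sketched as below, assuming the estimated boarding and alighting time intervals are already available as dictionaries keyed by (train, station). The function names, the dictionary layout, and the numeric values are assumptions of this sketch.

```python
def in_interval(t, interval):
    """True if time t lies in the closed interval (lo, hi)."""
    lo, hi = interval
    return lo <= t <= hi


def transfer_verifiers(ref_passengers, x, y, boarding, alighting):
    """Reference passengers whose gate times are consistent with x then y.

    ref_passengers: quadruples (O, t1, D, t2) of reference passengers whose
        unique physical connection transfers at the station of interest.
    boarding / alighting: estimated time intervals keyed by (train, station).
    """
    verified = []
    for o, t1, d, t2 in ref_passengers:
        b = boarding.get((x, o))   # boarding interval of x at the origin
        a = alighting.get((y, d))  # alighting interval of y at the destination
        if b is not None and a is not None and in_interval(t1, b) and in_interval(t2, a):
            verified.append((o, t1, d, t2))
    return verified


# Hypothetical intervals (seconds) for a Bongcheon-Isu transfer at Sadang.
boarding = {("X", "Bongcheon"): (100, 220)}
alighting = {("Y", "Isu"): (900, 960)}
refs = [("Bongcheon", 150, "Isu", 930), ("Bongcheon", 150, "Isu", 1200)]
print(transfer_verifiers(refs, "X", "Y", boarding, alighting))
# [('Bongcheon', 150, 'Isu', 930)]
```

A nonempty result certifies that a transfer from X to Y actually took place, which is all the later consistency check requires.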

Estimation of time intervals from insufficient passengers

Obviously, the accuracy of time intervals depends on the size of $\widetilde{AG}$ or $\widetilde{BG}$. At stations in suburban areas in non-peak hours, the reference passengers may not be sufficient to provide reliable time intervals. The issue can be resolved by aggregating the alighting passengers of all trains at each station. Under the assumption that the alighting behavior of passengers is independent of the time of day, this provides a sufficient collection of platform-to-gate times for a reliable alighting time interval at each station.

The Garak Market station, a transfer station located at the intersection of Lines 3 and 8, has little passenger traffic. The number of reference passengers per train at the station varies from 1 to 30. We aggregate the reference passengers of the 142 inbound trains at the station and fit their platform-to-gate times to a Frechet distribution. We then discard the lowest 2.5 % and the highest 2.5 % as outliers with an excessive length of boarding or alighting time. In our case this accounts for 1.59 passengers per train on average.

The range [s, s + L] of platform-to-gate times of the remaining passengers is then defined as the standard alighting time interval at each station. The alighting time interval of each train can be obtained simply by shifting [s, s + L] by the arrival time of the train.

Figure 5 shows the resulting standard alighting time interval at Garak Market station. From the 142 inbound trains, there were initially 672 reference passengers, from which 19 passengers (0.23 trips per train) were excluded. The range is [s, s + L] = [28, 90] s with the length L = 62 s. The figure also shows the translation of the standard alighting time interval to the arrival time, 08:19:04, of Train X. The resulting standard alighting time interval [08:19:32, 08:20:34] of X is significantly larger than the time interval [08:19:35, 08:20:01] estimated from the reference passengers of X alone.

The standard boarding time intervals can be constructed analogously.
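The construction of a standard alighting time interval and its translation to a train's arrival time can be sketched as follows. Note the simplification: the text fits a Frechet distribution before discarding the tails, whereas this sketch trims the pooled sample empirically; the function names are hypothetical. The translation reproduces the Garak Market figures quoted above.

```python
from datetime import datetime, timedelta


def standard_interval(platform_to_gate, trim=0.025):
    """Range [s, s + L] of pooled platform-to-gate times (seconds).

    Drops the lowest and highest `trim` fraction of the sample; this is an
    empirical stand-in for the Frechet-based trimming described in the text.
    """
    times = sorted(platform_to_gate)
    cut = int(len(times) * trim)          # observations dropped from each tail
    kept = times[cut:len(times) - cut]
    return kept[0], kept[-1]


def translate(interval, arrival):
    """Shift a standard interval (in seconds) by a train's arrival time."""
    s, end = interval
    return arrival + timedelta(seconds=s), arrival + timedelta(seconds=end)


# Standard interval [28, 90] s shifted by the arrival time 08:19:04 of Train X.
arrival = datetime(2011, 11, 21, 8, 19, 4)
start, end = translate((28, 90), arrival)
print(start.time(), end.time())  # 08:19:32 08:20:34
```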

The algorithm

Given the quadruple q = (O, Entry time = t1, D, Exit time = t2) of a passenger, we carry out the following steps for every tentative physical connection for the O–D trip.

Suppose the physical connection, say P, requires no transfer. Then, we look up a train X

on P whose boarding and alighting time intervals contain t1 and t2, respectively. If none,

we reject P. Otherwise, we put P in the list of consistent physical connections of q along

with the train X, a single-train schedule-based connection on P.

Suppose P entails two transfers at stations, say, M and N. (We discuss this case only since, then, the single-transfer case becomes obvious.) We first construct the list of tentative schedule-based connections for q on P, that is, the list of sequences of trains S = X1–X2–X3 on P such that

1. The boarding interval of X1 at O contains t1, the alighting interval of X3 at D contains

t2, and

2. The arrival times of X1 and X2 are no later than the departure times of the following

trains, X2 and X3, respectively, at the transfer stations M and N.

Fig. 5 Standard alighting time interval at the Garak Market station on Line 8 inbound


Note that this is a necessary condition for X1–X2–X3 to be a schedule-based connection of the trip q on P. Then we look up the transfer reference passengers RP(X1, X2, M) and RP(X2, X3, N). If both sets are nonempty, we return S as a consistent schedule-based connection on P. Otherwise, we reject S.

The algorithm returns P as the physical connection of q only if P is the only physical connection that admits a consistent schedule-based connection. Otherwise, namely, if there is no such physical connection or more than one, the algorithm declares a failure for the input quadruple q.
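The consistency check and the uniqueness rule above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: candidate train sequences per physical connection are taken as given, all lookup tables are plain dictionaries with hypothetical keys, and times are in seconds.

```python
def contains(interval, t):
    """True if an interval is known and t lies inside it."""
    return interval is not None and interval[0] <= t <= interval[1]


def consistent_sequences(q, sequences, boarding, alighting, arrive, depart, verifiers):
    """Train sequences on one physical connection that are consistent with q.

    q = (O, t1, D, t2). Each candidate is (trains, transfer_stations),
    e.g. (("X1", "X2"), ("M",)); a no-transfer candidate is (("X",), ()).
    """
    o, t1, d, t2 = q
    found = []
    for trains, stations in sequences:
        # Condition 1: boarding interval contains t1, alighting interval contains t2.
        ok = (contains(boarding.get((trains[0], o)), t1)
              and contains(alighting.get((trains[-1], d)), t2))
        for x, y, s in zip(trains, trains[1:], stations):
            # Condition 2: x arrives at s no later than y departs from s,
            # and at least one transfer reference passenger verifies x -> y at s.
            ok = ok and arrive[(x, s)] <= depart[(y, s)] and bool(verifiers.get((x, y, s)))
        if ok:
            found.append(trains)
    return found


def estimate(q, tentative, **tables):
    """Return (physical connection, consistent sequences), or None on failure."""
    hits = {}
    for p, sequences in tentative.items():
        found = consistent_sequences(q, sequences, **tables)
        if found:
            hits[p] = found
    # Success only if exactly one physical connection admits a consistent sequence.
    return next(iter(hits.items())) if len(hits) == 1 else None


# Hypothetical toy instance with two tentative physical connections.
q = ("O", 100, "D", 1000)
tentative = {
    "via M": [(("X1", "Y1"), ("M",)), (("X1", "Y2"), ("M",))],
    "via N": [(("X1", "Z1"), ("N",))],
}
tables = dict(
    boarding={("X1", "O"): (60, 130)},
    alighting={("Y1", "D"): (980, 1030), ("Y2", "D"): (1200, 1250), ("Z1", "D"): (1100, 1150)},
    arrive={("X1", "M"): 500, ("X1", "N"): 700},
    depart={("Y1", "M"): 560, ("Y2", "M"): 760, ("Z1", "N"): 820},
    verifiers={("X1", "Y1", "M"): ["r1"], ("X1", "Y2", "M"): ["r2"], ("X1", "Z1", "N"): ["r3"]},
)
print(estimate(q, tentative, **tables))  # ('via M', [('X1', 'Y1')])
```

Because the function returns the full list of consistent sequences, a unique physical connection may still come with several schedule-based connections.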

Initially, we apply the algorithm based on the standard time intervals in ‘‘Estimation of

time intervals from insufficient passengers’’ section. The passengers successfully returned

with a unique physical connection are added to the reference passenger set. Once we have

acquired sufficient reference passengers, we replace the standard time intervals with the

time intervals derived from the reference passengers of individual trains and repeat. The

algorithm can be summarized as in Fig. 6.

Note that there may be multiple consistent schedule-based connections even when a

unique physical connection is returned.

In our case, around 9 % of the trips were returned with more than one schedule-based connection due to, e.g., overlapping boarding time intervals and/or multiple connecting trains. However, we can estimate the probability that each of the schedule-based connections is the choice of the passenger. The details are given in the Appendix.

Fig. 6 The flow of algorithm

Illustration of actual estimation

The performance of the method is probably best understood through some actual cases of estimation.

Unique physical and schedule-based connections

Figure 7 shows the trips of two passengers, say a and b, who departed from Shillim station at 07:33:47 and 07:34:55 and arrived at Garak Market station at 08:16:53 and 08:19:51, respectively, on November 21, 2011: a = (Shillim, 07:33:47, Garak Market, 08:16:53) and b = (Shillim, 07:34:55, Garak Market, 08:19:51).

There are two alternative physical connections: beginning at the origin, Shillim station, both follow Line 2 outer-circle. However, one transfers at SeoulNat’lUnivEducation station to Line 3, the other at Jamsil station to Line 8. The algorithm checks, for each passenger, which physical connection has a logical connection, that is, a sequence of trains all consistent with his/her quadruple.

Consider a. On the physical connection Shillim-SeoulNat’lUnivEducation-GarakMarket, there is a unique train X1 whose boarding time interval contains the entry time of a. Of the two trains, Y1 and Y2, that have been verified by transfer reference passengers to connect X1 to Line 3 at SeoulNat’lUnivEducation, Y1 has an alighting time interval containing a’s exit time at Garak Market station. Thus, Shillim-SeoulNat’lUnivEducation-GarakMarket is added to the list of consistent physical connections of a along with the consistent schedule-based connection X1–Y1.

On the alternative physical connection, Shillim-Jamsil-GarakMarket, a should be assigned to the same tentative boarding train X1. However, neither of the two trains Z1 and Z2 that connect X1 at Jamsil to Line 8 has an alighting time interval consistent with a’s exit time at Garak Market station. Thus, Shillim-Jamsil-GarakMarket is rejected. Therefore, the algorithm returns Shillim-SeoulNat’lUnivEducation-GarakMarket as the unique physical connection of a, along with the unique schedule-based connection X1–Y1.

Consider b. On the physical connection Shillim-SeoulNat’lUnivEducation-GarakMarket, X2 is the only train whose boarding time interval is consistent with his/her entry time. However, the only verified connecting train Y2 from X2 to Line 3 has an alighting time interval inconsistent with b’s exit time at the destination, Garak Market. Thus, the physical connection is rejected for b.

Fig. 7 Schedule-based connection estimation of 2 Shillim–Garak Market trips


On the physical connection Shillim-Jamsil-GarakMarket, on the other hand, of the two connecting trains Z2 and Z3 at Jamsil station, Z2 has an alighting time interval consistent with b’s exit time, as indicated in the figure. Thus, Shillim-Jamsil-GarakMarket is returned as the physical connection for b, and X2–Z2 is confirmed as the schedule-based connection.

Analysis of failed cases

The algorithm fails when no physical connection, or more than one, is consistent with the quadruple of an input trip. Figure 8 illustrates the latter case. Consider a trip a = (Janghanpyeong, 08:24:23, Sangsu, 09:01:28). There are two alternative physical connections, I and II, that comprise the same line combination, Lines 5 and 6, but different transfer stations, Cheonggu and Gongdeok, respectively. The entry and exit times match a unique schedule-based connection X1–Y.

However, the transfer from X1 to Y is verified by a transfer reference passenger at both transfer stations, Cheonggu and Gongdeok. Both physical connections I and II are therefore consistent with the quadruple a, and the method fails. Failures due to multiple consistent physical connections occurred more often when two or more physical connections differ only in the transfer station.

Fig. 8 A case of failure: Indeterminate physical connection

Connection estimation results

The metro network

The Seoul metropolitan area has 15 metro lines, 412 stations and 33,548 trains in operation as of November 20, 2011. Over the seven days from November 20 to 26, 2011, there were 47,618,710 metro O–D trips. Of the possible O–D pairs, 904,897 pairs have nonzero traffic and each carried 50 trips on average. In our study, we first excluded the trips involving 3 private lines, Metro 9, AREX (airport line), and DXLine, and one public line, the Incheon City Line, because their train logs were entirely unavailable.

When the time between entry and exit at the gates deviated from the mean by two or more standard deviations, the trip was most likely voluntarily delayed. The number of trips with such excessive inter-gate times was 1,571,417, which is 3.3 % of the total trips. In addition, we found that the actual record can be delayed after card tagging at a gate because of a disruption in the communication network. Those abnormal trips, delayed voluntarily or in tagging, were excluded from our data set. Finally, for accuracy, we ruled out the senior and handicapped citizens, whose inter-gate times are 6.7 % longer than those of others. Overall, the estimation algorithm was applied to 32,419,106 O–D trips, as summarized in Table 3.
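The two-standard-deviation screening of inter-gate times can be sketched in Python as below. This is an illustrative sketch only: it assumes that the mean and standard deviation are taken per O–D pair and that only excessively long trips are flagged, neither of which is spelled out in the text; the function name `split_abnormal` and the sample trips are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def split_abnormal(trips, k=2.0):
    """Split trips into (normal, abnormal) by inter-gate time.

    trips -- iterable of quadruples (origin, entry time, destination, exit time),
             with times in seconds
    k     -- number of standard deviations above the O-D mean that marks a
             trip as excessively long (assumed to be 2, one-sided)
    """
    by_od = defaultdict(list)
    for trip in trips:
        origin, _, destination, _ = trip
        by_od[(origin, destination)].append(trip)

    normal, abnormal = [], []
    for group in by_od.values():
        durations = [t2 - t1 for _, t1, _, t2 in group]
        mu = mean(durations)
        sigma = pstdev(durations)
        for trip, duration in zip(group, durations):
            if sigma > 0 and duration - mu >= k * sigma:
                abnormal.append(trip)
            else:
                normal.append(trip)
    return normal, abnormal

# Hypothetical sample: nine 30-minute trips and one 60-minute trip on the same O-D pair.
sample = [("A", 60 * i, "B", 60 * i + 1800) for i in range(9)] + [("A", 600, "B", 4200)]
normal, abnormal = split_abnormal(sample)
print(len(normal), abnormal)
```

In the sample the mean inter-gate time is 1980 s with a standard deviation of 540 s, so only the 3600 s trip lies two or more standard deviations above the mean.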

The success rates

Table 4 summarizes the rate at which the method returns a unique physical connection, broken down by the possible numbers of transfers required by the tentative physical connections of an O–D trip. As indicated in the first column of Table 4, 51.3 % of the trips have only tentative physical connections with no transfer, 26.7 % only tentative physical connections requiring one transfer, etc.

From the table, the success rate gets lower when there is an alternative physical connection requiring two transfers. Overall, the success rates were 92.6 and 83.4 %, respectively, for the physical and schedule-based connections.

Table 3 Summary of trips selected for our experiment

| | Num. of trips | Ratio (%) |
|---|---|---|
| Estimated trips | 32,419,106 | 68.1 |
| Excluded trips from 4 metro lines | 5,176,440 | 10.9 |
| Abnormal trips | 5,625,075 | 11.8 |
| Senior and handicapped citizens | 4,398,089 | 9.2 |
| Total | 47,618,710 | – |

Table 4 Success rate for each combination of the numbers of transfers in physical connections

| Num. transfers (share of trips) | Unique physical connection (%) | Unique schedule-based connection (%) | 0 Transfer (%) | 1 Transfer (%) | 2 Transfers (%) |
|---|---|---|---|---|---|
| 0 (51.3 %) | 99.9 | 95.0 | 100 | – | – |
| 1 (26.7 %) | 94.5 | 80.6 | – | 100 | – |
| 2 (1.4 %) | 72.6 | 51.9 | – | – | 100 |
| 1, 2 (14.4 %) | 67.6 | 54.0 | – | 56.6 | 43.4 |
| 0, 2 (4.2 %) | 82.7 | 75.4 | 87.9 | – | 12.1 |
| 0, 1 (1.6 %) | 84.1 | 78.0 | 70.2 | 29.8 | – |
| 0, 1, 2 (0.5 %) | 75.3 | 68.0 | 69.4 | 20.1 | 10.5 |
| Overall | 92.6 | 83.4 | 60.8 | 32.6 | 6.6 |
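As a rough sanity check, the overall success rates can be compared with the share-weighted average of the per-category rates in Table 4. The short Python sketch below only re-uses the rounded figures printed in the table (whose shares sum to 100.1 %), so it can reproduce the reported 92.6 % and 83.4 % only approximately; the variable names are illustrative.

```python
# (share of trips %, unique physical %, unique schedule-based %) per row of Table 4
rows = [
    (51.3, 99.9, 95.0),  # 0 transfers
    (26.7, 94.5, 80.6),  # 1 transfer
    (1.4, 72.6, 51.9),   # 2 transfers
    (14.4, 67.6, 54.0),  # 1 or 2 transfers
    (4.2, 82.7, 75.4),   # 0 or 2 transfers
    (1.6, 84.1, 78.0),   # 0 or 1 transfer
    (0.5, 75.3, 68.0),   # 0, 1 or 2 transfers
]

total_share = sum(share for share, _, _ in rows)
physical = sum(share * rate for share, rate, _ in rows) / total_share
schedule = sum(share * rate for share, _, rate in rows) / total_share

print(f"physical: {physical:.1f} %, schedule-based: {schedule:.1f} %")
```

With the rounded inputs the weighted averages come out at about 92.3 % and 83.4 %, close to the reported overall rates.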


Consistency of train choice of metro passengers

We first probe the central assumption of transit behavior studies: do the metro passengers

make a rational train choice? To do so, we rely on the analysis of Cronbach (1951) to check

the consistency of the metro choice of passengers of an O–D.

We performed connection estimation for an additional day, Monday, March 19, 2012, to be compared with Monday, November 21, 2011. We selected the 1513 O–D pairs whose daily traffic is no less than 100 trips on both days and which have more than one alternative physical connection. The horizontal axis in Fig. 9 indicates the 3897 physical connections, while the vertical axis indicates the proportion of the O–D passengers who chose each connection on November 21, 2011. The same plot is done for March 19, 2012, maintaining the order of physical connections, but exchanging the axes about the diagonal.

If the train choice of passengers is consistent, the plot should exhibit a concentration of dots around the 45° diagonal, which is the case in the figure. In fact, Pearson’s correlation coefficient was very high, namely, 0.94. A paired-comparison t-test did not reject the null hypothesis that the train choices of passengers for their O–D trips are not different on the 2 days. Cronbach’s α was also high, at 0.974. By all of these measures, passengers made consistent choices over the two Mondays. We extended the test over the 5 days of the week, November 20 to 26, 2011, and obtained a similar result.
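For readers who wish to reproduce this kind of consistency check, the sketch below computes Pearson's correlation coefficient and Cronbach's alpha for the choice proportions of the same physical connections on two days, treating each day as one "item". It uses the textbook formulas and only the Python standard library; the proportions are made-up illustrative values, not data from the study.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(*items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Hypothetical share of O-D passengers choosing each physical connection on two days.
day1 = [0.70, 0.30, 0.55, 0.45, 0.90, 0.10]
day2 = [0.72, 0.28, 0.50, 0.50, 0.88, 0.12]

r = pearson(day1, day2)
alpha = cronbach_alpha(day1, day2)
print(round(r, 3), round(alpha, 3))
```

Values of both statistics close to 1 indicate that the proportions on the two days move together, which is the sense in which the choices are called consistent here.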

Fig. 9 Consistency of physical connection estimation

Passenger flow on the time-expanded network

As the algorithm returns the schedule-based connection for each and every trip, we can derive the complete passenger flow on the time-expanded network. Figure 10 shows, for example, the passenger flow on the logical network time-expanded around Daerim station, a transfer station of Lines 2 and 7, from 07:45 to 07:55 A.M., November 20, 2011. In the time interval, there were 4 trains arriving from Line 2 inner-circle, denoted by X1, X2, X3, and X4, and 3 trains from Line 7 inbound, Y1, Y2, and Y3. The train logs and the passenger traffic are summarized in Table 5.

In the figure, the passenger flows associated with each train are indicated. Of the 421 passengers of Train X4 arriving at Daerim station on Line 2 from Shindorim station, 3 exited and 2 transferred to Line 7 outbound. The remaining 416 passengers were joined by 75 entering passengers. In addition, 37 transfer passengers from Line 7 outbound, and 3, 38, and 6 transfer passengers from Trains Y1, Y2 and Y3 on Line 7 inbound, in that order, were added. The resulting 575 passengers departed for GuroDigitalComplex station along Line 2 inner circle.
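The bookkeeping in this example is a simple flow balance at the station: departing load = arriving load − exiting − transfer-to + entering + transfer-from. The Python sketch below checks this balance for every train in Table 5 and also derives one transfer time from the logged arrival and departure times; the helper names are illustrative.

```python
# Rows of Table 5: arriving load, exiting, transfer-to, entering, transfer-from, departing load.
TABLE_5 = {
    "X1": (1399, 54, 91, 63, 55, 1372),
    "X2": (470, 1, 0, 88, 61, 618),
    "X3": (1277, 55, 55, 91, 91, 1349),
    "X4": (421, 3, 2, 75, 84, 575),
    "Y1": (529, 13, 115, 51, 21, 473),
    "Y2": (765, 20, 132, 52, 30, 695),
    "Y3": (620, 17, 125, 45, 24, 547),
}

def departing_load(arriving, exiting, transfer_to, entering, transfer_from):
    """Passengers on board when the train leaves the station."""
    return arriving - exiting - transfer_to + entering + transfer_from

def seconds(hms):
    """Convert an 'HH:MM:SS' clock time to seconds after midnight."""
    hours, minutes, secs = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + secs

# Every row of the table should satisfy the flow balance.
balanced = {train: departing_load(*row[:5]) == row[5] for train, row in TABLE_5.items()}

# Transfer time from Y1 (arrival 07:45:16) to X4 (departure 07:52:42).
transfer_time_y1_x4 = seconds("07:52:42") - seconds("07:45:16")

print(balanced, transfer_time_y1_x4)
```

For Train X4, the 84 transfer-from passengers are the 37 from Line 7 outbound plus the 3, 38 and 6 from Y1, Y2 and Y3.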

We can also derive the transfer times between connecting trains. For instance, the

transfer time for 3 passengers from Y1 to X4 was 446 seconds, the difference between the

departure time of X4 and the arrival time of Y1. Crowdedness in public transport is an

Fig. 10 Passenger flow on the time-expanded network at the Daerim intersection of Line 2 and 7 from 07:45 to 07:55 A.M. on November 21, 2011

Table 5 The trains and their associated passenger flows at the Daerim intersection from 07:45 to 07:55 A.M. on November 21, 2011

| Line | Train | Arrival time | Departure time | Arriv. passen. | Alighting: Exiting passen. | Alighting: Transfer-to | Boarding: Entry passen. | Boarding: Transfer-from | Depart. passen. |
|---|---|---|---|---|---|---|---|---|---|
| Line 2 (Inner circle) | X1 | 07:44:12 | 07:44:50 | 1399 | 54 | 91 | 63 | 55 | 1372 |
| | X2 | 07:45:59 | 07:47:13 | 470 | 1 | 0 | 88 | 61 | 618 |
| | X3 | 07:49:35 | 07:50:27 | 1277 | 55 | 55 | 91 | 91 | 1349 |
| | X4 | 07:51:33 | 07:52:42 | 421 | 3 | 2 | 75 | 84 | 575 |
| Line 7 (Inbound) | Y1 | 07:45:16 | 07:45:48 | 529 | 13 | 115 | 51 | 21 | 473 |
| | Y2 | 07:46:48 | 07:48:44 | 765 | 20 | 132 | 52 | 30 | 695 |
| | Y3 | 07:50:39 | 07:51:17 | 620 | 17 | 125 | 45 | 24 | 547 |


important factor for the level of service (Weidmann et al. 2012; Cox et al. 2006). The

passenger flows on the time-expanded network provide us with the exact load on each train

which is, we believe, the most important data in a study on how crowdedness affects the

train choice of passengers.

Conclusion

First, we studied a set of behaviors of metro passengers by examining the gate times from the Smart Card data, which produce time intervals precise enough to identify a passenger’s boarding, transferring, and alighting trains from his/her entry and exit times and stations.

1. The platform-to-gate time of an alighting passenger has the spiky characteristic of an extremal value; the exit times at a gate of the passengers from the same train are distributed over a very brief period of time. The time intervals of trains are disjoint.

2. The boarding behavior of metro passengers, however, is devoid of such disjoint time intervals. Nevertheless, the first-come-first-served queue discipline is observed well enough to allow us to derive useful time intervals of boarding groups.

Second, we recognized and separated the class of passengers who have a unique predominant connection for a trip. Such passengers, more prevalent than expected, not only provide us with reliable estimates of the time intervals but also bear witness to an actual transfer between trains from lines intersecting at a transfer station.

Third, we proposed a connection estimation algorithm that checks the consistency of the time intervals of trains in a tentative connection with the gate times and stations of a passenger, a condition that necessarily holds when the connection is the actual choice for a trip.

The proposed algorithm was applied to 32 million trips from Smart Card data collected in the Seoul metropolitan area during the week from Sunday, November 20 to Saturday, November 26, 2011. As a result, our method could determine a unique physical connection in 92.6 % of the trips. The result shows a consistent physical connection choice over the 5 weekdays.

Acknowledgments This research was supported in part by the Basic Science Research Program (2014R1A2A1A11049663) through the National Research Foundation of Korea (NRF), and by the BK21 Plus Program (Center for Sustainable and Innovative Industrial Systems) funded by the Ministry of Education, Korea.

Appendix

Probability estimation of schedule-based connections

Suppose the current physical connection requires a single transfer, say, at Station A. The schedule-based connections on a physical connection can be represented by a time-expanded network as in Fig. 11.

The consistency check is initiated by finding consistent trains at both O and D. By this assumption, there can be at most two trains, say X1 and X2, at O, whose time intervals contain the entry time, while at most one train, say Y, can be consistent with the exit time at D. If there are no such trains at either O or D, the passenger did not use the physical connection.

If neither X1 nor X2 can be connected to Y, in the sense that there is no relevant transfer reference passenger, we conclude that the passenger did not use the physical connection.

If there is only one such train, say X1, whose connection to Y can be verified by transfer reference passengers, then the schedule-based connection X1–Y is confirmed as the unique connection of the passenger.

Finally, if there are two trains, say X1 and X2, from both of which we can find transfer reference passengers to Y as in Fig. 11, we need to return both X1–Y and X2–Y. This is the worst case, in that the maximum number of schedule-based connections are confirmed as consistent connections.

The estimation, however, can be refined by a probability distribution over the two connections. In Fig. 11, we introduce the following notation:

• p: The fraction of the boarding reference passengers from the overlap of the two time intervals that boarded train X1

• 1 − p: The fraction of the boarding reference passengers from the overlap of the two time intervals that boarded not train X1 but X2

• 1 − q1: The fraction of the transfer reference passengers from X1 to Y

• q2: The fraction of the transfer reference passengers from X2 to Y

It is then not difficult to show that

\[
\Pr\{\text{Passenger chose } X_1\text{–}Y\} = \frac{p(1-q_1)}{p(1-q_1) + (1-p)\,q_2},
\qquad
\Pr\{\text{Passenger chose } X_2\text{–}Y\} = \frac{(1-p)\,q_2}{p(1-q_1) + (1-p)\,q_2}.
\tag{1}
\]
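Equation (1) simply normalizes the two weights p(1 − q1) and (1 − p)q2. A minimal Python sketch is given below; the function name and the numerical inputs are illustrative assumptions, and returning `None` when both weights vanish is one way to represent the case in which no schedule-based connection is consistent.

```python
def connection_probabilities(p, q1, q2):
    """Probabilities of the connections X1-Y and X2-Y according to Eq. (1).

    p      -- fraction of boarding reference passengers in the overlap who boarded X1
    1 - q1 -- fraction of transfer reference passengers from X1 to Y
    q2     -- fraction of transfer reference passengers from X2 to Y
    """
    w1 = p * (1 - q1)   # weight of X1-Y
    w2 = (1 - p) * q2   # weight of X2-Y
    total = w1 + w2
    if total == 0:
        return None     # neither connection is consistent
    return w1 / total, w2 / total

# Hypothetical fractions, for illustration only.
probs = connection_probabilities(p=0.6, q1=0.5, q2=0.4)
print(probs)
```

When p = 1 and q1 < 1 the function returns probability 1 for X1–Y, and when p = 0 and q2 > 0 it returns probability 1 for X2–Y, matching the single-connection cases.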

Fig. 11 Two schedule-based connections can be consistent

Table 6 summarizes the numbers and lists of consistent schedule-based connection(s), the corresponding conditions, and the probability distributions. If none of the conditions from Table 6 is satisfied, no schedule-based connection can be consistent with the quadruple of our passenger and hence the physical connection is rejected.

For a physical connection that requires two transfers, there may be up to 3 schedule-based connections consistent with a quadruple if the trip is not abnormally delayed. The previous arguments can be easily extended to such a case.

Table 6 Numbers and lists of consistent schedule-based connection(s), the corresponding conditions, and the probability distributions for a single-transfer physical connection

| No. | Consistent connection(s) | Condition on p | Condition on q1 | Condition on q2 | Probability |
|---|---|---|---|---|---|
| 1 | X1–Y | 0 < p < 1 | q1 < 1 | q2 = 0 | 1 |
| | | p = 1 | q1 < 1 | – | 1 |
| 1 | X2–Y | p = 0 | – | q2 > 0 | 1 |
| | | 0 < p < 1 | q1 = 1 | q2 > 0 | 1 |
| 2 | X1–Y and X2–Y | 0 < p < 1 | 0 < q1 < 1 | q2 > 0 | p(1 − q1) / [p(1 − q1) + (1 − p)q2] and (1 − p)q2 / [p(1 − q1) + (1 − p)q2] |

References

Asakura, Y., Iryo, T., Nakajima, Y., Kusakabe, T., Takagi, Y., Kashiwadani, M.: Behavioural analysis of railway passengers using smart card data. In: Proceedings of the Urban Transport, pp. 599–608. Malta (2008)

Bagchi, M., White, P.R.: The potential of public transport smart card data. Transp. Policy 12(5), 464–474 (2005)

Bureau of Public Roads: Traffic Assignment Manual. U.S. Department of Commerce (1964)

Cox, T., Houdmont, J., Griffiths, A.: Rail passenger crowding, stress, health and safety in Britain. Transp. Res. Part A 40, 244–258 (2006)

Cronbach, L.J.: Coefficient alpha and the internal structure of tests. Psychometrika 16(3), 297–334 (1951)

De Cea, J., Fernandez, J.E.: Transit assignment for congested public transport system: an equilibrium model. Transp. Sci. 27(2), 133–147 (1993)

Einmahl, J.H.J., Smeets, S.G.W.R.: Ultimate 100 m world records through extreme-value theory. Stat. Neerl. 65(1), 32–42 (2011)

Fu, Q., Liu, R., Hess, S.: A Bayesian modelling framework for individual passenger’s probabilistic route choices: a case study on the London underground. In: 93rd Transportation Research Board (TRB) Annual Meeting (2014)

Guo, Z., Wilson, N.: Transfer behavior and transfer planning in public transport systems: a case of the London underground. In: Proceedings of the 11th International Conference on Advanced Systems for Public Transport, Hong Kong (2009)

Jang, W.: Travel time and transfer analysis using transit smart card data. Transp. Res. Rec. 2144, 142–149 (2010)

Kato, H., Kaneko, Y., Inoue, M.: Comparative analysis of transit assignment: evidence from urban railway system in the Tokyo metropolitan area. Transportation 37, 775–799 (2010)

Ko, S.-J., Kim, K.M., Hong, S.-P.: Estimation of transfer times and alighting times of the metro passengers in Seoul metropolitan area. Working paper

Kusakabe, T., Iryo, T., Asakura, Y.: Estimation method for railway passengers’ train choice behaviour with smart card transaction data. Transportation 37, 731–749 (2010)

Lam, W.H.K., Lo, H.K.: Traffic assignment methods. In: Hensher, D.A., Button, K.J., Haynes, K.E., Stopher, P.R. (eds.) Handbook of Transport Geography and Spatial Systems, pp. 609–625 (2004)

Lehtonen, M., Rosenberg, M., Rasanen, J., Sirkia, A.: Utilization of the smart card payment system (scps) data in public transport planning and statistics. In: Proceedings of the 9th World Congress on Intelligent Transport Systems, Chicago, Illinois, 14–17 October 2002

Morency, C., Trepanier, M., Agard, B.: Measuring transit use variability with smart-card data. Transp. Policy 14(3), 193–203 (2007)

Nielsen, O.A.: A stochastic transit assignment model considering differences in passengers utility functions. Transp. Res. Part B 34(5), 377–402 (2000)

Nour, A., Casello, J.M., Hellinga, B.: Anxiety-based formulation to estimate generalized cost of transit travel time. Transp. Res. Rec. 2143, 108–116 (2010)

Park, J.Y., Kim, D.-J., Lim, Y.: Use of smart card data to define public transit use in Seoul, South Korea. Transp. Res. Rec. 2063, 3–9 (2008)

Pelletier, M.-P., Trepanier, M., Morency, C.: Smart card data use in public transit: a literature review. Transp. Res. Part C 19, 557–568 (2011)

Raveau, S., Munoz, J.C., de Grange, L.: A topological route choice model for metro. Transp. Res. Part A 45, 138–147 (2011)

Rinks, D.B.: Revenue allocation methods for integrated transit systems. Transp. Res. Part A 20(1), 39–50 (1986)

Seaborn, C.: Application of smart card fare payment data to bus network planning in London, UK. MS thesis, Massachusetts Institute of Technology, Cambridge (2008)

Seaborn, C., Attanucci, J., Wilson, N.: Analyzing multimodal public transport journeys in London with smart card fare payment data. Transp. Res. Rec. 2121, 55–62 (2009)

Shin, S.G., Cho, Y., Lee, C.: Integrated transit service evaluation methodologies using transportation card data (In Korean). Technical Report 2007-R-09, Seoul Development Institute (2007)

Trepanier, M., Tranchant, N., Chapleau, R.: Individual trip destination estimation in a transit smart card automated fare collection system. J. Intell. Transp. Syst. 11(1), 1–14 (2007)

Tsamboulas, D.A., Antoniou, C.: Allocating revenues to public transit operators under an integrated fare system. Transp. Res. Rec. 1986, 29–37 (2006)

Utsunomiya, M., Attanucci, J., Wilson, N.H.: Potential uses of transit smart card registration and transaction data to improve transit planning. Transp. Res. Rec. 1971, 119–126 (2006)

Weidmann, U., Orth, H., Dorbritz, R.: Development of measurement system for public transport performance. Transp. Res. Rec. 2274, 135–143 (2012)

Zhou, F., Xu, R.-H.: Model of passenger flow assignment for urban rail transit based on entry and exit time constraints. J. Transp. Res. Board 2284, 57–61 (2012)

Sung-Pil Hong is a professor at the Department of Industrial Engineering, Seoul National University. One of his research interests is computing discrete choice equilibria via optimization. Since 2010 he has performed various studies of modeling and analyzing metro transits.

Yun-Hong Min is a research staff member at Samsung Advanced Institute for Technology (SAIT). He received his Ph.D. degree in Industrial Engineering from Seoul National University in 2012, and he has been working at SAIT since 2012. His main research concerns are equilibrium analysis, convex optimization, and machine learning.

Myoung-Ju Park is an assistant professor at the Department of Industrial and Management Systems Engineering, Kyung Hee University. He received his Ph.D. in Industrial Engineering from Seoul National University in 2012. His current research is about combinatorial optimization, approximation algorithms, and scheduling.

Kyung Min Kim is a Ph.D. candidate at the Department of Industrial Engineering, Seoul National University and a senior researcher at the Korea Railroad Research Institute. His main research concerns are railway planning, transit assignment, and travel behavioral analysis.

Suk Mun Oh is a principal researcher at the Korea Railroad Research Institute. He received his Ph.D. in Industrial Engineering from Korea University in 2010. He has been involved in a number of studies on railway operation and policy since 1995.
