Automated multi-camera planar tracking correspondence modeling

Chris Stauffer and Kinh Tieu
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

This paper introduces a method for robustly estimating a planar tracking correspondence model (TCM) for a large camera network directly from tracking data and for employing said model to reliably track objects through multiple cameras. By exploiting the unique characteristics of tracking data, our method can reliably estimate a planar TCM in large environments covered by many cameras. It is robust to scenes with multiple simultaneously moving objects and limited visual overlap between the cameras. Our method introduces the capability of automatic calibration of large camera networks in which the topology of camera overlap is unknown and in which all cameras do not necessarily overlap. Quantitative results are shown for a five camera network in which the topology is not specified.

1. Introduction

If one were given a bank of one hundred video screens showing video that redundantly covered different sections of multiple concourses at a large airport, one would require experience or prior knowledge of the cameras and site to track all objects through all the cameras effectively. For instance, Figure 1 shows a five camera network in which four cameras overlap and one is independent. This model was automatically determined by the method presented in this paper. By understanding how the cameras are related, one could effectively track objects through the entire scene. Figure 2 shows four different types of overlapping relationships a network of cameras could have.

The general problem of automated tracking correspondence is to estimate which tracking data resulted from observation of the same objects. An ideal tracking correspondence model (TCM) would result in only as many tracking sequences as there were independently moving objects in an environment, regardless of the number of sensors or their overlap. For a network of cameras, it would model frame-to-frame correspondence, correspondence in overlapping camera views, correspondence across non-overlapping camera views, and even establish identity when an object returns to the environment after a long period of time has elapsed.

This paper assumes a reliable frame-to-frame tracker exists for each camera and centers on the subproblem of automated correspondence modeling for overlapping camera networks where objects in each region of overlap lie on or near a ground plane. Planar TCMs have proven their effectiveness in enabling tracking in multi-camera visual surveillance networks, but they still require manual supervision to calibrate.

Figure 1: Five cameras and an automatically generated scene model. The goal of automated planar tracking correspondence modeling is to build the correspondence model for the entire set of cameras and use that model to track objects throughout the scene.

Figure 2: Four different cases of overlap in five cameras: (a) no region of overlap between any cameras, (b) every camera overlapping with every other camera, (c) a set of cameras each overlapping with one or two cameras, and (d) the most general case with all previous types of overlap.

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'03) 1063-6919/03 $17.00 © 2003 IEEE

The goal of this paper is to introduce the first fully automatic method for planar TCM estimation in large camera networks. Specifically, it doesn't require the scene to be vacated by all but a single moving object, it doesn't require manual specification of invariant properties of objects in a particular environment to bootstrap the correspondence, it doesn't require specification of which pairs of cameras have valid correspondence models, and it doesn't require assumptions about visibility of objects in the region of camera overlap. Our system can automatically determine the topology of camera overlap, determine pair-wise camera correspondence models, estimate visibility constraints for pairs of cameras, and track objects through extended sites. It estimates the TCM directly from the tracking data and is then capable of determining object correspondences. Further, this method could enable continuous validation of the model in the event that the camera positions or the scene are altered.

Section 2 discusses previous work. Section 3 formulates the general tracking correspondence problem. Section 4 describes the planar TCM used in this work; its subsections describe how to determine the likelihood that two sequences correspond, outline our method for estimating the TCM, and outline our method for validating the correspondence model. Section 5 describes calibrating multi-camera networks and tracking in those networks. It shows analysis of our results in a five camera network (twenty camera pairs) over hundreds of tracking sequences. The final section discusses how this work could be integrated into multi-camera tracking systems and draws conclusions about this work.

2. Previous Work

Many early approaches to tracking correspondence have relied on manually calibrated camera networks and complex site models. For instance, Collins et al. [3] coordinated multiple PTZ and static cameras to track many objects over a wide area. This system relied on a 3-d model of the environment and fully calibrated cameras.

With the decreased barrier to setting up cheap multi-camera tracking systems, automated approaches to estimating a correspondence model to enable tracking in extended environments have become increasingly interesting to the vision community. Khan et al. [5] automatically estimated the line where the feet of pedestrians should appear in a secondary camera when they exit the reference camera. This assumes that cameras are vertical and pedestrians will leave a scene at a linear occlusion boundary. It also does not exploit tracking data while the object is visible in both cameras to establish correspondence.

Stein et al. [6] did early work in estimating a perspective plane correspondence model of tracking data as well as the temporal alignment. Their system used a robust RANSAC [4] variant that estimates the best homography between two cameras that were known to overlap significantly. Their automated approach has been used and cited by many systems that employ correspondence modeling to do tracking in extended scenes, e.g., Black and Ellis [1]. They extended their approach to calibrate cameras when a triplet of overlapping cameras is available [6]. Recently, Caspi et al. [2] extended Stein's approach to include a non-planar correspondence model and to take advantage of the tracking sequences rather than just co-occurring object detections, resulting in more reliable estimation.

This paper outlines an improved method for estimating the planar correspondence model given time-aligned tracking data that takes full advantage of the unique characteristics of tracking correspondences in video streams. This paper also introduces a mechanism for validating the model to determine if the best correspondence model is valid. This is required to enable automated multi-camera TCM estimation without manually specifying the topological arrangement of the cameras (as seen in Figure 2). Finally, the primary goal of this paper is establishing tracking correspondence rather than building pleasing site models or reconstructing camera geometry. As a result, this is the first paper to include quantitative analysis of the effectiveness of estimating and employing the tracking correspondence model over hundreds of tracking sequences.

3. General Tracking Correspondence

This section outlines the general tracking correspondence problem. It defines a tracking correspondence model (TCM) and explains how it is employed to estimate correspondence. The next section describes the multi-camera planar TCM.

3.1. Formulation

The ultimate goal of tracking correspondence modeling is to estimate which tracking observations resulted from the same objects. The corresponding tracking observations can then be merged across multiple sensors, giving a more complete description of object activities in an environment.

An ideal tracking correspondence model (TCM) would result in only as many tracking sequences as there are independently moving objects in an environment, regardless of the number of sensors or their overlap. For a network of cameras, it would model frame-to-frame correspondence, correspondence in overlapping cameras, correspondence between non-overlapping cameras, and even correspondence for an object returning to the environment after a long period of time has elapsed.

The result of successful frame-to-frame tracking is a set of tracking sequences in each individual camera. Each sequence Si comprises Ni discrete tracking observations, {si(t0), ..., si(tNi)}, indexed by the absolute times at which they occurred. Each observation includes a description of the object from a particular sensor at a particular time:

    si(t) = (xi(t), yi(t), dxi(t), dyi(t), sizei(t), imagei(t), ...).   (1)

The goal of tracking correspondence is to estimate the ideal correspondence matrix Γ* for all sequences. Each element of this matrix is

    γ*ij = { 1 if Si and Sj correspond to the same object
           { 0 otherwise.                                   (2)

3.2. Direct Correspondence

Unfortunately, it is unclear how to estimate the ideal correspondence matrix directly. An alternate form of the ideal correspondence matrix is the direct correspondence matrix, denoted Γ in this work. In the direct correspondence matrix, γij = 1 iff Si and Sj correspond and Sj is the first corresponding observation to follow Si. In the case of overlapping cameras, we can effectively model the likelihood of direct correspondence. This enables us to estimate Γ. The estimate of Γ can then be used to determine Γ*.

A tracking correspondence model τ refers to the model and parameter values used to estimate correspondence between two tracking sequences. Our TCM defines the probability that two sequences are in direct correspondence, or

    p(γij = 1 | S, τ),   (3)

where S is the set of all sequences.

To determine whether two sequences are the same object in the scene (γij = 1), we need to maximize the likelihood

    p(Γ | S, τ) = ∏_{Si,Sj ∈ S} p(γij | Si, Sj, τ)^γij,   (4)

where ∑_i γij and ∑_j γij are zero or one. This type of assignment problem has been covered in other work (as in [7]) and will not be discussed further here. In fact, for the multi-camera direct correspondence modeling discussed in this paper, a simple likelihood threshold is often sufficient because an object in a mutually observable region is usually in good correspondence to itself and in poor correspondence to any other object (because two objects cannot occupy the same space). Also, objects that are coincidentally co-located often do not tend to stay that way.
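The simple-threshold assignment argued for above can be sketched in a few lines. This is our own minimal illustration, not code from the paper; `direct_correspondence`, the greedy highest-likelihood-first pass, and the example likelihood values are all hypothetical.

```python
def direct_correspondence(likelihoods, threshold=0.5):
    """likelihoods[i][j] ~ p(gamma_ij = 1 | S_i, S_j, tau).
    Returns a 0/1 matrix with at most one match per row and column."""
    n, m = len(likelihoods), len(likelihoods[0])
    gamma = [[0] * m for _ in range(n)]
    used_rows, used_cols = set(), set()
    # Greedy pass in decreasing likelihood order; sufficient when
    # matches are unambiguous, as the text argues for mutually
    # observable regions.
    pairs = sorted(((likelihoods[i][j], i, j)
                    for i in range(n) for j in range(m)), reverse=True)
    for lik, i, j in pairs:
        if lik > threshold and i not in used_rows and j not in used_cols:
            gamma[i][j] = 1
            used_rows.add(i)
            used_cols.add(j)
    return gamma

L = [[0.9, 0.1], [0.2, 0.8]]
print(direct_correspondence(L))  # [[1, 0], [0, 1]]
```

A full optimal-assignment solver (as in [7]) would replace the greedy pass; the paper notes this is rarely needed here.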

This paper investigates a particular subproblem of tracking correspondence: automated planar correspondence modeling for overlapping camera networks.

We begin by considering a pair of cameras a and b. In a given interval, the cameras track ka and kb objects respectively, resulting in two sets of tracking sequences, {Sa1, ..., Saka} and {Sb1, ..., Sbkb}.

Since this work concerns only overlapping camera correspondence, we are only considering direct correspondence. If two sequences in two cameras, Sai and Sbj, are observations of the same object in the scene over the same interval of time, they are in direct correspondence. The goal of this work is to estimate the ka × kb direct correspondence matrix, Γ. The estimate of this matrix could trivially be combined with matrices from other correspondence measures to approximate Γ*.

In the case of this paper, our site TCM (τ) comprises a separate TCM for each camera pair,

    τab = {Vab, Hab, Rab},   (5)

where Vab is a binary variable denoting whether a valid model exists between camera a and camera b, Hab is the homography relating observations in camera a to camera b, and Rab is the region in which valid correspondences are expected in camera a.

3.3. The structure of tracking data

One of the most prolific areas of computer vision literature has historically been camera calibration and 3D reconstruction. The vast majority of this literature uses static correspondence points in static scenes to estimate the model, e.g., image mosaics from corners or stereo. The primary goal is usually calibration or reconstruction. Without constraining this problem, there is little hope of inferring the correct feature correspondence from two images with a very wide baseline and no other information available, even for a simple parametric model (e.g., planar homography or mosaic). It would be even more difficult to validate whether the resulting model captured the same scene or pictures of a similar scene in two completely different locations.

With tracking correspondences this is not as difficult. In contrast to static correspondences, tracking correspondences

• are very descriptive. In addition to their position, tracked objects can be described by their size, velocity, direction of travel, appearance, color, articulated motion, etc. These characteristics can aid in establishing correspondence.


• are spatially sparse. Usually tracked objects do not densely cover the scene at any point in time. This means that the number of possible corresponding pairs is limited.

• increase linearly with sampling time. By sampling a scene for twice as long, one could collect roughly twice the number of corresponding tracking sequences. These correspondence points will often be much denser than pixels at some points in the scene. This also allows one to neglect potentially bad correspondences in favor of good ones while still finding a fully descriptive model.

• are temporally continuous. The position of a tracked object can be interpolated from adjacent frames, so the tracking doesn't have to be synchronized on a frame-by-frame basis.

All these properties make the solution of this problem significantly more reliable than trying to mosaic two arbitrary images together in scenes where tracking data is available. And, obviously, a TCM is only required when tracking data is available.
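The temporal-continuity property above, that a track's position can be interpolated from adjacent frames, is what frees the cameras from frame-level synchronization. A minimal sketch (our own hypothetical `interpolate` helper and data layout, not from the paper):

```python
def interpolate(track, t):
    """Linearly interpolate a track's position at an arbitrary time.
    track: time-sorted list of (time, x, y) observations."""
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)  # fraction of the way from t0 to t1
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("t outside the observed interval")

# Position halfway between two observed frames:
print(interpolate([(0, 0.0, 0.0), (2, 4.0, 2.0)], 1))  # (2.0, 1.0)
```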

4. Planar TCM with visibility constraints

Often in far-field tracking, most objects lie on or near a single ground plane. This is due to two regularities: in the region of overlap of two cameras the surface on which objects travel is usually fairly flat; and most objects of interest to surveillance systems are constrained by gravity. Thus, this paper assumes objects' centroids are roughly planar in the regions of overlap, although other models could also be used (see [2]). Under this assumption, the position of an object that is visible in both camera a and camera b is related by a 3x3 homography, Hab.

    Hab pa(t) ≅ p̂b(t)   (6)

where pa(t) = (x, y, 1) is the location of the object in camera a in homogeneous coordinates at time t, p̂b(t) is the interpolated position of the tracked object in camera b at the same time, Hab is a homography, and ≅ denotes equality up to a scale factor.
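Equation (6) is the standard projective mapping: lift (x, y) to homogeneous coordinates, multiply by the 3x3 matrix, and divide out the scale. A small sketch (assuming numpy; the translation-only Hab below is a made-up example):

```python
import numpy as np

def apply_homography(H, p):
    """Map a point (x, y) through a 3x3 homography H, dividing out
    the scale factor implied by the ~= in Eq. (6)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return (float(q[0] / q[2]), float(q[1] / q[2]))

# A pure translation by (2, 3) as a toy homography:
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
print(apply_homography(H, (1.0, 1.0)))  # (3.0, 4.0)
```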

As shown in Equation 5, if a pair of cameras has a valid model, the TCM will include valid estimates of a planar correspondence model and a visual occlusion model. A visual occlusion model, Rab, is an estimate of the region of mutual observability for a pair of cameras. If two cameras do not overlap, this region is empty. Otherwise, this region denotes the region in camera a in which objects should also be visible in camera b.

We currently consider only object position for matches, but any property in one camera that has mutual information with a property in another camera could be used. In principle, a TCM could model any property that can be predicted in camera a based on a measurement in camera b, including appearance, color, size, velocity, or dynamic properties.

Because one does not generally entertain many competing matching hypotheses, optimal assignment is not generally necessary. Thus, given a valid τab and a sequence from each camera, Sa and Sb, two sequences correspond if the mean squared error of the predicted sequence is below a threshold¹. If a normalization model were available, a per-observation variance estimate could be used rather than a single variance estimate.

4.1. Estimating Hab

Given the correct values of γij and no outlier observations, it is trivial to estimate

    Hab = argmin_H  ∑_{{i,j} : γij=1}  ∑_{t ∈ ms(i,j)}  |si(t) − H sj(t)|²,   (7)

where ms(i, j) is the time interval of "mutual support" in which sequence Si and sequence Sj were both visible in their respective cameras. Unfortunately, even a few false correspondences or outlier observations will likely result in a poor estimate of Hab.
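The inner fit in Eq. (7) is typically solved with the standard direct linear transform (DLT): each correspondence contributes two linear constraints on the nine entries of H, and the solution is the smallest singular vector. The paper does not specify its solver, so this is a sketch under that assumption, using numpy:

```python
import numpy as np

def fit_homography(pa, pb):
    """Least-squares homography mapping points pa -> pb via DLT.
    Needs at least 4 correspondences, no 3 of them collinear."""
    rows = []
    for (x, y), (u, v) in zip(pa, pb):
        # Two rows per correspondence, from u = h1.p / h3.p etc.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows, dtype=float)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)  # null vector of A, up to scale
    return H / H[2, 2]

# Recover a pure translation by (2, 3) from four point pairs:
H = fit_homography([(0, 0), (1, 0), (0, 1), (2, 1)],
                   [(2, 3), (3, 3), (2, 4), (4, 4)])
print(np.round(H, 6))  # ~ [[1, 0, 2], [0, 1, 3], [0, 0, 1]]
```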

Stein [6] used a sampling procedure that took k random samples of four co-occurring observation pairs in two cameras. For sufficiently large values of k, at least one of the sets would result in a reasonable estimate of the true homography. The Hab that minimized the maximum error of the n% best matches (e.g., n = 50% would correspond to the median error) was used to filter the best matches and to refit the model. Unfortunately, even hundreds of thousands of iterations are not likely to produce a good homography in many cases.

Caspi et al. [2] used pairs of tracking sequences sampled from an initial correspondence function estimate, Γ̂. The procedure for computing the initial estimate of γij is unclear, as is the importance of having a good estimate. In the four cases shown in their paper, there were very limited numbers of objects travelling reasonably articulated paths, and one case involved manually specifying that projected size should be used for correspondence.

Unfortunately, in most surveillance applications, all the objects will primarily move in straight lines within the limited regions of overlap. Thus a single pair of tracks (one from each camera) will result in degenerate homography estimation. By finding any two pairs of non-collinear corresponding tracks, one is guaranteed to get a non-degenerate solution. Thus, by always using two pairs of tracks and sampling sufficiently, one is guaranteed to get a non-degenerate homography estimate except in the case that ALL objects that pass through the area of intersection are on the SAME line. In this case, a degenerate solution (one that only aligns the data on that line) is still sufficient for tracking correspondence in that scene.

¹This corresponds to a hypothesis test on the likelihood ratio of the two observations occurring given that they correspond vs. them occurring given that they don't correspond.

We define the set of all temporally co-occurring sequences as

    Cab = { {Sai, Sbj} ∀ {i, j} s.t. ms(i, j) > 0 }.   (8)
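Building the set in Eq. (8) only requires checking that two tracks' time intervals overlap. A minimal sketch (the interval-per-track data layout and the name `co_occurring_pairs` are our own assumptions):

```python
def co_occurring_pairs(tracks_a, tracks_b):
    """Eq. (8): all {i, j} pairs with a positive interval of mutual
    support. tracks_*: dict mapping track id -> (t_start, t_end)."""
    return {(i, j)
            for i, (s0, s1) in tracks_a.items()
            for j, (t0, t1) in tracks_b.items()
            if min(s1, t1) - max(s0, t0) > 0}  # overlap length > 0

print(co_occurring_pairs({0: (0, 5)}, {0: (3, 8), 1: (6, 9)}))  # {(0, 0)}
```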

Unfortunately, for scenes with small regions of overlap, large numbers of objects, or significant occlusions, this set is very large. Thus we implement a method for sampling from Cab non-uniformly. To estimate a homography, we sample two {i, j}-pairs from Cab with a likelihood Lij and use them as correspondence point sets. Lij is a heuristic on the likelihood that this pair of tracks corresponds to the same object. Our heuristic likelihood included the probability of this particular pair corresponding given: the number of other objects that occurred at the same time²; the probability of a corresponding track lasting a particular time interval; and the probability of matching the observed locations of Si to Sj directly.

Thus, one is more likely to sample pairs of sequences that occurred when few other objects were present, that were sufficiently long, and that could match each other reasonably well. Other terms that might increase the likelihood of sampling an informative (and valid) pair include: the probability that the pair isn't a trivial match (two straight lines); the probability of a color match; the probability of a velocity match; and the probability of a size match.

The Hab estimate with the maximum number of corresponding tracking sequences is chosen as the model for this camera pair. Figure 3 shows images from two cameras and the resulting homography.
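The model-selection criterion above, counting how many co-occurring track pairs a candidate Hab brings into correspondence, can be sketched as follows. This is our own illustration (the per-time-step dict layout, the `count_corresponding` name, and the tolerance are hypothetical; the paper's mean-squared-error test is what the inner loop approximates):

```python
import numpy as np

def count_corresponding(H, tracks_a, tracks_b, tol=2.0):
    """Score candidate homography H by the number of co-occurring
    track pairs whose mean reprojection error falls under tol.
    tracks_*: dict mapping track id -> {time: (x, y)}."""
    n = 0
    for Sa in tracks_a.values():
        for Sb in tracks_b.values():
            shared = sorted(set(Sa) & set(Sb))  # mutual-support times
            if not shared:
                continue
            errs = []
            for t in shared:
                x, y = Sa[t]
                q = H @ np.array([x, y, 1.0])
                u, v = q[0] / q[2], q[1] / q[2]
                errs.append((u - Sb[t][0]) ** 2 + (v - Sb[t][1]) ** 2)
            if np.mean(errs) < tol ** 2:
                n += 1
    return n

tracks_a = {0: {0: (1.0, 1.0), 1: (2.0, 2.0)}}
tracks_b = {0: {0: (1.0, 1.0), 1: (2.0, 2.0)}, 1: {5: (9.0, 9.0)}}
print(count_corresponding(np.eye(3), tracks_a, tracks_b))  # 1
```

In the full procedure this score would be computed for each Hab hypothesis fit from a sampled pair of track pairs, keeping the best.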

4.2. Estimating Rab

Once Hab is estimated and a reasonable set of likely corresponding tracks is available, it is possible to model the region of mutual observability of objects in camera b within camera a, Rab. Rba can be computed by projecting that region into camera b. If the TCM is valid, Rab must not be empty. Also, any sequence Sai that passes through Rab should have a corresponding sequence in camera b, and any sequence Sbj that passes through Rba should have a corresponding sequence in camera a³.

In unobstructed views, Rab is simply the intersection of the bounding rectangle of camera a with the bounding rectangle of camera b transformed into the coordinates of camera a. Some examples of visual occlusions that result in Rs that are smaller or have "holes" are: buildings; fences; trucks; trees; window/lens dirt; or window frames.

²This is approximated as the ratio of the number of possible valid pairs to the total number of pairs during that time.

³Using a symmetric measure is an important consideration, because very dense tracking data can match any scene.

Figure 3: The merged view of cameras (a) and (b) generated from the estimated TCMab is shown in (c). Note that the tracking correspondence model corresponds to the plane of the centroids of the tracked objects and does NOT correspond to the ground plane. Thus, perfect image alignment is not expected (or desirable). The region of mutual observability of objects in (a) and (b), Rab, is highlighted in (d).

We estimate the probability that a point (x, y) is in the region of mutual observability as

    p(Rab(x, y) = 1) = ( ∑_{si(t) ∈ Qx,y} δi(t) ) / ( ∑_{si(t) ∈ Qx,y} 1 ),   (9)

where Qx,y is the set of tracking data in either camera projected to camera a coordinates within a distance dt of (x, y), and δi(t) = 1 if there exists a corresponding sequence in the other camera and t is within the period of mutual observability. p(Rab(x, y) = 1) is defined as zero if no data is present in that region of the image. p(Rab(x, y) = 1) is simply an estimate of the probability of getting a positive match at position (x, y). Figure 3(d) shows an example for two cameras.
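Equation (9) is a local match-rate estimate: among observations projected near (x, y), count the fraction that had a corresponding sequence in the other camera. A sketch with a hypothetical flat data layout (each observation pre-tagged with its δ value, which the paper computes from the correspondence estimate):

```python
def estimate_R(observations, x, y, dt=10.0):
    """Eq. (9) at a single point. observations: list of
    (ox, oy, matched) already in camera a's coordinates, where
    matched is 1 if a corresponding sequence existed at that time."""
    near = [m for ox, oy, m in observations
            if (ox - x) ** 2 + (oy - y) ** 2 <= dt ** 2]
    # Defined as zero where no data is present:
    return sum(near) / len(near) if near else 0.0

obs = [(0, 0, 1), (1, 1, 1), (2, 2, 0), (50, 50, 1)]
print(estimate_R(obs, 0.0, 0.0))  # 2/3: two of three nearby obs matched
```

Evaluating this on a grid of (x, y) positions yields the highlighted Rab regions of Figure 3(d).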

4.3. Validating τab

Assuming a particular frame-to-frame tracking success rate, each track in R of either camera should have a corresponding track a fraction ρ of the time. In practice, a rather conservative estimate of ρ may be used.

Thus, the probability of having a valid TCM can be estimated as the probability that drawing a sample from the region of correspondence (in either camera) will result in a match. Thus the condition for a valid TCM is

    Vab = { 1 if E[δi] > ρ
          { 0 otherwise,                  (10)

where the expectation is over positions drawn from Rab. If no model for Rab were available, tracks that were missed because they were behind visual occlusions (e.g., buildings) would greatly reduce this value. Since any two random scenes will result in a correspondence model and corresponding tracks, it is important to use R to factor out those types of missed correspondences.

            cam 1    cam 2    cam 3    cam 4    cam 5
    cam 1     -      15.7%     7.6%     1.2%    62.4%
    cam 2    2.2%      -      62.8%     8.5%    74.4%
    cam 3   10.8%    82.8%      -       3.7%     1.1%
    cam 4    5.6%    13.8%     0.0%      -       3.7%
    cam 5   63.1%    72.8%     0.8%     0.6%      -

Table 1: The expectation of a match within Rab for our five camera experiment. The bold entries refer to cameras with actual overlapping fields of view.
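The validation test of Eq. (10) reduces to comparing an empirical match rate against ρ. A minimal sketch (our own `validate_tcm` name and example δ samples; ρ = 0.6 is an arbitrary conservative choice, not a value from the paper):

```python
def validate_tcm(match_flags, rho=0.6):
    """Eq. (10): the pair model is valid (V_ab = 1) iff the empirical
    match rate E[delta] over samples drawn from the region of
    correspondence exceeds rho."""
    if not match_flags:
        return 0  # empty R_ab cannot validate a model
    return 1 if sum(match_flags) / len(match_flags) > rho else 0

print(validate_tcm([1, 1, 1, 0, 1]))  # 1 (80% match rate > 60%)
print(validate_tcm([1, 0, 0, 0]))     # 0
```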

5. Calibrating multi-camera networks

Thus far, we have only talked about pairs of cameras. This section explains automatic TCM estimation and tracking correspondence in a multi-camera network. The estimation procedure is:

• Estimate the best τij for all pairs of cameras {ci, cj}.

  – Estimate Hij.

  – Estimate Rij.

  – Estimate Vij (validate camera pairs).

• Build a camera graph with the ci's as nodes and the Vij's as edges.

• (optional) Build a visualization for each clique of cameras.
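The graph-building step of the procedure above can be sketched as grouping cameras connected by validated pairs. This is our own illustration (the term "clique" in the paper corresponds, in this sketch, to a connected component of the overlap graph; the 0-indexed example edges mirror the paper's topology of four overlapping cameras plus one independent camera):

```python
def camera_groups(n, valid_pairs):
    """n cameras; valid_pairs: set of (i, j) edges with V_ij = 1.
    Returns the connected components of the camera graph."""
    adj = {i: set() for i in range(n)}
    for i, j in valid_pairs:
        adj[i].add(j)
        adj[j].add(i)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search to collect one component.
        stack, comp = [start], []
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            comp.append(c)
            stack.extend(adj[c] - seen)
        groups.append(sorted(comp))
    return groups

# Hypothetical validated pairs: cameras 0, 1, 2, 4 connect; 3 is alone.
print(camera_groups(5, {(0, 4), (4, 1), (1, 2)}))  # [[0, 1, 2, 4], [3]]
```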

At any point in time, any tracking sequence in camera i which passes into Rij for a valid τij can be tested for correspondence. If the sequences correspond, their tracking IDs can be set to the same value. Because this system is automated, it could be used to continuously validate the correspondence model and detect if the scene or camera positions have changed.

5.1. A five camera example

Given one hour of data from the five cameras shown in the top row of Figure 4, our system learned a site model. Figure 4 shows the second scene warped into the reference scene with the estimated Hab for all camera pairs, and Figure 5 shows the resulting Rab estimate highlighted for all camera pairs. Table 1 shows the expectation of a match within the region of correspondence for all camera pairs. Note that models with no overlap tended to result in somewhat random homographies, very small Rs, and low validation scores.

The topology is evident. One clique of cameras is {c1−c5−c2−c3} and the other clique is the remaining camera, c4. The cliques can be used to build a mosaic of the environment automatically, as shown in Figure 1. Again, we caution the reader not to evaluate the results based on how well the ground planes are aligned, because the goal is to align tracking centroids. If the ground planes were aligned in this figure, it would be a poor TCM estimate.

Our results would be more robust if values were normalized prior to estimating our model: objects at different depths exhibit different noise characteristics. Our estimation would also be more robust if we augmented our matching criterion to include appearance, color, and dynamic properties. By using the most unusual objects in an extremely busy environment, the search for valid correspondence pairs could be greatly reduced.

6. Conclusions

We have described a method for automatically estimating a planar tracking correspondence model (TCM) for large camera networks with overlapping fields of view given a large set of tracking data from the observed area. In addition to recovering the homography between all pairs of cameras, the TCM also includes the corresponding regions of mutual observability. Furthermore, we use these regions to validate true overlapping fields of view as opposed to cameras with disjoint fields of view. The TCM can be used to better track objects between all the camera views.

References

[1] James Black and Tim Ellis. Multi camera image tracking. In Performance Evaluation in Tracking Systems (PETS2000), December 2001.

[2] Y. Caspi, D. Simakov, and M. Irani. Feature-based sequence-to-sequence matching. In Vision and Modelling of Dynamic Scenes Workshop, June 2002.

[3] Robert Collins, Alan Lipton, and T. Kanade. A system for video surveillance and monitoring. In Proc. American Nuclear Society (ANS) Eighth International Topical Meeting on Robotic and Remote Systems, pages 25–29, Pittsburgh, PA, April 1999.

[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[5] Omar Javed, Sohaib Khan, Zeeshan Rasheed, and Mubarak Shah. Camera handoff: Tracking in multiple uncalibrated stationary cameras. In IEEE Workshop on Human Motion (HUMO-2000), Austin, TX, December 2000.

[6] Lily Lee, Raquel Romano, and G. Stein. Monitoring activities from multiple video streams: Establishing a common coordinate frame. PAMI, 1999.

[7] Hanna Pasula, Stuart J. Russell, Michael Ostland, and Yaacov Ritov. Tracking many objects with many sensors. In IJCAI-99, pages 1160–1171, 1999.


[Figure 4: 5×5 grid of warped image pairs; rows and columns indexed cam 1 through cam 5]

Figure 4: Estimated homographies for all pairs of cameras. Note that, in general, some homography will be found even for non-overlapping fields of view. The region of mutual observability measure is used to validate true homographies, indicated by the boxed images.


[Figure 5: 5×5 grid of camera-pair images; rows and columns indexed cam 1 through cam 5]

Figure 5: The estimated regions of correspondence, Rab, for all pairs of cameras are highlighted. These regions signify areas of mutual observability between pairs of camera fields of view. They are used to validate true homographies, shown here with large regions of correspondence.
