



MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Multi-Camera Calibration, Object Tracking and Query Generation

Porikli, F.; Divakaran, A.

TR2003-100 August 2003

Abstract

An automatic object tracking and video summarization method for multi-camera systems with a large number of non-overlapping field-of-view cameras is explained. In this framework, video sequences are stored for each object as opposed to storing a sequence for each camera. Object-based representation enables annotation of video segments, and extraction of content semantics for further analysis. We also present a novel solution to the inter-camera color calibration problem. The transitive model function enables effective compensation for lighting changes and radiometric distortions for large-scale systems. After initial calibration, objects are tracked at each camera by background subtraction and mean-shift analysis. The correspondence of objects between different cameras is established by using a Bayesian Belief Network. This framework empowers the user to get a concise response to queries such as "which locations did an object visit on Monday and what did it do there?"

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2003
201 Broadway, Cambridge, Massachusetts 02139


Publication History:

1. First printing, TR-2003-100, August 2003


MULTI-CAMERA CALIBRATION, OBJECT TRACKING AND QUERY GENERATION

Fatih Porikli, Ajay Divakaran

Mitsubishi Electric Research Labs

ABSTRACT

An automatic object tracking and video summarization method for multi-camera systems with a large number of non-overlapping field-of-view cameras is explained. In this framework, video sequences are stored for each object as opposed to storing a sequence for each camera. Object-based representation enables annotation of video segments, and extraction of content semantics for further analysis. We also present a novel solution to the inter-camera color calibration problem. The transitive model function enables effective compensation for lighting changes and radiometric distortions for large-scale systems. After initial calibration, objects are tracked at each camera by background subtraction and mean-shift analysis. The correspondence of objects between different cameras is established by using a Bayesian Belief Network. This framework empowers the user to get a concise response to queries such as "which locations did an object visit on Monday and what did it do there?"

1. INTRODUCTION

The nature of single-camera single-room architecture multi-camera surveillance applications demands automatic and accurate calibration, detection of objects of interest, tracking, fusion of multiple modalities to solve the inter-camera correspondence problem, easy access to and retrieval of video data, the capability to make semantic queries, and effective abstraction of video content. Although several multi-camera setups have been adapted for 3D vision problems, non-overlapping camera systems have not been investigated thoroughly. In [2], a multi-camera system that uses a Bayesian network to combine multiple modalities is proposed. Among these modalities, the epipolar, homography, and landmark information assume that any pair of cameras in the system has an overlapping field-of-view. Due to this assumption, it is not applicable to the single-camera single-room architecture.

A major problem of multi-camera systems is the color calibration of cameras. In the past few years, many algorithms were developed to compensate for radiometric mismatches. Most approaches use registered images of a uniformly illuminated color chart of a known reflectance, taken under different exposure settings, as a reference [1], and estimate the parameters of a brightness transfer function. Often, they assume the function is smooth and polynomial. However, uniform illumination conditions may not be possible outside of a controlled environment.

Fig. 1. A multi-camera setup can contain several cameras working under different lighting conditions.

In this paper, we design a framework to extract object-wise semantics from a non-overlapping field-of-view multi-camera system. This framework has four main components: camera calibration, automatic tracking, inter-camera data fusion, and query generation. We developed an object-based video content labeling method that restructures camera-oriented videos into object-oriented results. We also propose a summarization technique that uses the motion activity characteristics of the encoded video segments to provide a solution to the storage and presentation of the immense video data. To solve the calibration problem, we developed a method based on a correlation matrix and dynamic programming. We use color histograms to determine the inter-camera radiometric mismatch. Then, a minimum cost path within the correlation matrix is found using dynamic programming. This path is projected onto the diagonal axis to obtain a model function that can transfer one histogram to the other. In addition, we use a novel distance metric to determine the object correspondences.

2. RADIOMETRIC CALIBRATION

A typical indoor surveillance system consists of several cameras with non-overlapping views, as illustrated in Fig. 1. Usually, such systems consist of identical cameras that are operating under various lighting conditions, or of different cameras that have dissimilar radiometric characteristics. Even identical cameras, which have the same optical properties and are working under the same lighting conditions, may not match in their color responses. Images of the same object acquired under these variants show color dissimilarities. As a result, the correspondence, recognition, and other related computer vision tasks become more challenging.

Fig. 2. After computing the correlation matrix, a minimum cost path is found and transformed to a model function. Using the model function obtained in the calibration stage, the output of one camera is compensated.

We compute pair-wise inter-camera color model functions that transfer the color histogram response of one camera to the other, as illustrated in Fig. 2, in the initial calibration stage. First, we record images of the identical objects for each camera. For the images of an object for the current camera pair 1-2, $I^1_k, I^2_k$, we find color histograms $h^1_k, h^2_k$. A histogram $h$ is a vector $[h[0], \ldots, h[N]]$ in which each bin $h[n]$ contains the number of pixels corresponding to that color range. Using the histograms $h^1_k, h^2_k$, we compute a correlation matrix $M^{1,2}_k$ between the two histograms as the set of positive real numbers that represent the bin-wise histogram distances

$$M^{1,2} = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1N} \\ & \cdots & \\ c_{N1} & \cdots & c_{NN} \end{bmatrix} \qquad (1)$$

where each element $c_{mn}$ is a positive real number such that $c_{mn} = d(h^1[m], h^2[n])$ and $d(\cdot) \geq 0$ is a distance norm. Note that the sum of the diagonal elements represents the bin-by-bin distance under the given norm $d(\cdot)$. For example, by choosing the distance norm as $L_2$, the sum of the diagonals becomes the Euclidean distance of the histograms.

An aggregated correlation matrix $M$ is calculated by averaging the corresponding matrices of the $K$ image pairs as $M^{1,2} = \frac{1}{K} \sum_{k=1}^{K} M^{1,2}_k$. Given two histograms and their correlation matrix, the question is: what is the best alignment of their shapes, and how can the alignment be determined? We reduce the comparison of two histograms to finding the minimum cost path in the matrix $M^{1,2}$. We used dynamic programming and a modified Dijkstra's algorithm to find the path $p: \{(m_0, n_0), \ldots, (m_I, n_I)\}$ that has the minimum cost from $c_{11}$ to $c_{NN}$, i.e., the sum of the matrix elements on the path $p$ gives the minimum score among all possible routes.
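For concreteness, the listing below is a minimal numpy sketch of this stage, not the authors' implementation: the L1 bin distance, the restriction to monotonic right/down/diagonal moves (rather than the modified Dijkstra search used in the paper), and all names are illustrative assumptions.

import numpy as np

def correlation_matrix(h1, h2, d=lambda a, b: abs(a - b)):
    # Bin-wise distance matrix of Eq. (1): M[m, n] = d(h1[m], h2[n]).
    # The L1 bin distance is an assumed choice of the norm d.
    N = len(h1)
    M = np.empty((N, N))
    for m in range(N):
        for n in range(N):
            M[m, n] = d(h1[m], h2[n])
    return M

def min_cost_path(M):
    # Minimum-cost path from c_11 to c_NN by dynamic programming.
    # Moves are restricted to right/down/diagonal steps, a common
    # monotonicity assumption; the paper uses a modified Dijkstra search.
    N = M.shape[0]
    cost = np.full((N, N), np.inf)
    cost[0, 0] = M[0, 0]
    back = {}
    for m in range(N):
        for n in range(N):
            for pm, pn in ((m - 1, n), (m, n - 1), (m - 1, n - 1)):
                if pm >= 0 and pn >= 0 and cost[pm, pn] + M[m, n] < cost[m, n]:
                    cost[m, n] = cost[pm, pn] + M[m, n]
                    back[(m, n)] = (pm, pn)
    path, node = [], (N - 1, N - 1)
    while node != (0, 0):          # trace back from c_NN to c_11
        path.append(node)
        node = back[node]
    path.append((0, 0))
    return path[::-1], cost[N - 1, N - 1]

# The aggregated matrix is the average over the K calibration pairs:
# M_avg = sum(correlation_matrix(h1[k], h2[k]) for k in range(K)) / K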

Fig. 3. Relation between the minimum cost path and f(j).

Fig. 4. Panels: (camera-1), (camera-2), (compensated). The camera-2 image is compensated using the model function. (Histograms: black: camera-1, blue: camera-2, red: compensated.)

We define a cost function for the path as $g(p_i) = c_{m_i,n_i}$, where $p_i$ denotes the path element $(m_i, n_i)$. We define a mapping $i \rightarrow j$ from the path indices to the projection onto the diagonal of the matrix $M$, and an associated transfer function $f(j)$ that gives the distance from the diagonal with respect to the projection $j$. The transfer function is defined as a mapping from the matrix indices to real numbers, $(m_i, n_i) \rightarrow f(j)$, such that $f(j) = |p_i| \sin\theta$, as illustrated in Fig. 3. The correlation distance is the total cost along the transfer function: $d_{CD}(h^1, h^2) = \sum_{i=0}^{I} |g(m_i, n_i)|$.
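A sketch of how the path can be turned into a model function and used for compensation follows, under one consistent reading of the geometry: for a square matrix the diagonal runs at 45 degrees, so a path point $(m, n)$ projects to $j = (m + n)/2$ and lies $(n - m)/\sqrt{2}$ away from the diagonal. The look-up-table compensation step is likewise an assumed realization, not the paper's stated one.

import numpy as np

def model_function(path, N):
    # f(j): signed distance of the path from the diagonal at each
    # projection j (cf. Fig. 3). Coarse: later path points overwrite
    # earlier ones that project to the same j.
    f = np.zeros(N)
    for m, n in path:
        f[int(round((m + n) / 2))] = (n - m) / np.sqrt(2.0)
    return f

def compensate(image, path, N=256):
    # Read the path as a bin-to-bin mapping (camera-2 bin n aligns with
    # camera-1 bin m) and remap camera-2 pixel values through a LUT.
    # image: integer-valued single-channel array.
    lut = np.arange(N, dtype=np.float64)
    for m, n in path:
        lut[n] = m
    return lut[image].astype(image.dtype)

def correlation_distance(M, path):
    # d_CD(h1, h2): total cost |g(m_i, n_i)| accumulated along the path.
    return sum(abs(M[m, n]) for m, n in path)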

3. SINGLE-CAMERA TRACKING

After the initial calibration, objects are detected and tracked at each camera (Fig. 5). A common approach for detecting a moving object with a stationary camera setup is background subtraction. The main idea is to subtract the current image from a reference image that is constructed from the static image pixels over a period of time. We previously developed an object detection and tracking method that integrates model-based background subtraction with a mean-shift based forward tracking mechanism [3]. Our method constructs a reference image using pixel-wise mixture models, finds the changed part of the image by background subtraction, removes shadows by analyzing color and spatial properties of pixels, determines objects, and tracks them in the consecutive frames. In background subtraction, we model the history of each color channel of each pixel by a mixture of Gaussian distributions. The reference image is updated by comparing the current pixel with the existing distributions. If the current pixel's color value is within a certain distance of the mean value of a distribution, that mean value is assigned as the background. After background subtraction, we detect and remove shadow pixels from the set of foreground pixels. The likelihood of being a shadow pixel is evaluated iteratively by observing the color space attributes and local spatial properties of each pixel. The next task is to find the separate objects. To accomplish this, we first remove speckle noise, then determine connected regions, and group the regions into separate objects. We track objects by computing the highest gradient direction of the color histogram, which is implemented as a maximization process. This process is iterated given the current object histogram extracted from the previous frame.

Fig. 5. Single-camera tracking.
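The sketch below condenses the per-pixel Gaussian-mixture idea for a single grayscale channel; the learning rate, matching threshold, and mode-replacement rule are illustrative assumptions rather than the exact procedure of [3], and shadow removal and region grouping are omitted.

import numpy as np

class PixelMixture:
    # Per-pixel background model: each pixel's history is a small
    # mixture of Gaussians (one channel shown for brevity).

    def __init__(self, shape, n_modes=3, lr=0.05, match_sigma=2.5):
        self.mu = np.zeros(shape + (n_modes,))        # mode means
        self.var = np.full(shape + (n_modes,), 30.0)  # mode variances
        self.w = np.zeros(shape + (n_modes,))         # mode weights
        self.w[..., 0] = 1.0
        self.lr, self.match_sigma = lr, match_sigma

    def apply(self, frame):
        # Update the model with one grayscale frame and return the
        # foreground mask (True where no background mode matches).
        x = frame[..., None].astype(np.float64)
        dist2 = (x - self.mu) ** 2
        match = dist2 < (self.match_sigma ** 2) * self.var
        # Pull matched modes toward the current pixel value.
        rho = self.lr * match
        self.mu += rho * (x - self.mu)
        self.var += rho * (dist2 - self.var)
        self.w = (1 - self.lr) * self.w + self.lr * match
        # Pixels matching no mode are foreground; reseed weakest mode.
        fg = ~match.any(axis=-1)
        weakest = self.w.argmin(axis=-1)
        rows, cols = np.indices(frame.shape)
        self.mu[rows[fg], cols[fg], weakest[fg]] = frame[fg]
        self.var[rows[fg], cols[fg], weakest[fg]] = 30.0
        return fg

# Usage: bg = PixelMixture(gray.shape); mask = bg.apply(gray)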

The tracking results at each camera are then merged to determine the global behaviour of objects (Fig. 6).

4. INTER-CAMERA CORRESPONDENCE

Another main issue is the integration of the tracking results of each camera to make inter-camera tracking possible. To find the corresponding objects across different cameras and in a central database of previous appearances, we evaluate the likelihood of possible object matches by fusing object features such as color, shape, texture, movement, camera layout, ground plane, etc. Color is the most common feature and is widely accepted by object recognition systems, since it is relatively robust to size and orientation changes. Since an adequate level of detail is usually not available in surveillance video, texture and face features do not improve recognition accuracy. Another concern is face orientation: most face-based methods work only for frontal images. The biggest drawback of shape features is their sensitivity to boundary inaccuracies. A height descriptor helps only if the ground plane is available.

Fig. 6. Multi-camera surveillance system.

Fig. 7. Each camera corresponds to a node in the directed graph. The links show physical routes between cameras. The probability of object movement is represented using the transition table.

There is a strong correlation between the camera layout and the likelihood of objects appearing in a certain camera after they exit from another one. As in Fig. 7, we formulate the camera system as a Bayesian Belief Network (BBN), which is a graphical representation of a joint probability distribution over a set of random variables. A BBN is a directed graph in which each random variable is represented by a node, and directed edges between nodes represent conditional dependencies. In the multi-camera system, each camera corresponds to a node in the directed graph. The links show the physical routes between the cameras. The probabilities of object movement are the elements of the transition table. The transition probabilities, that is, the likelihood of a person moving from the current camera to a linked camera, are learned by observing the system. Note that each direction on a link may have a different probability, and the total incoming and outgoing probability values are each equal to one. To satisfy the second constraint, slack nodes that correspond to the unmonitored entrance/exit regions are added to the graph. Initially, no object is assigned to any node of the BBN; the numbers of objects in the cameras and in the database are zero. The database keeps track of the individual objects. Suppose an object $O_i$ is detected at camera $C_i$. For each detected new object, a database entry is made using its color histogram features. If the object $O_1$ exits from the camera $C_i$, then the conditional probability $P_{O_1}(C_j|C_i)$ that the same object will be seen on another camera $C_j$ is found by $P_{O_1}(C_j|C_i) = P(C_j|C_{s_1}) \cdots P(C_{s_k}|C_i)$, where $\{s_1, \ldots, s_k\}$ is the highest-probability path from $C_i$ to $C_j$ on the graph. Because of the dynamic nature of the surveillance system, the conditional probabilities change with time: $P_{O_1}(C_j, t|C_i) = P(C_j, t|C_{s_1}) \cdots P(C_{s_k}, t|C_i)$. The conditional probabilities are set to erode over time using a parameter $\alpha < 1$ as $P(O_1, t|C_i) = \alpha P(O_1, t-1|C_i)$, since the object may exit from the system completely (we do not assume a multi-camera system is a closed graph). However, the probabilities do not fall below a threshold.
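As an illustration of the learned transition table and the time decay; the camera count, the probability values, alpha, and the floor threshold are all hypothetical:

import numpy as np

# Hypothetical 4-node layout: three cameras plus one slack "exit" node
# for unmonitored regions, so every row of learned transition
# probabilities T[i, j] = P(C_j | C_i) sums to one.
CAMERAS = ["C1", "C2", "C3", "exit"]
T = np.array([
    [0.0, 0.6, 0.2, 0.2],
    [0.5, 0.0, 0.4, 0.1],
    [0.3, 0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
])

def path_probability(T, nodes):
    # Chain the per-link probabilities along a camera path {s1, .., sk}:
    # P_O(C_j | C_i) = P(C_j | C_s1) .. P(C_sk | C_i). The
    # highest-probability path itself could be found by running
    # Dijkstra on the negative log-probabilities.
    return float(np.prod([T[a, b] for a, b in zip(nodes[:-1], nodes[1:])]))

def erode(p, alpha=0.9, floor=0.05):
    # One time step of decay, P(O, t | C_i) = alpha * P(O, t-1 | C_i),
    # clamped so the probability never drops below a threshold.
    return max(alpha * p, floor)

# e.g. likelihood of reappearing at C3 after leaving C1 via C2:
# path_probability(T, [0, 1, 2])  ->  0.6 * 0.4 = 0.24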

As a new object is detected, it is compared with the objects in the database and with the objects that disappeared previously. The comparison is based on the color histogram similarity and the proposed distance metric. When more than one object correspondence is possible, we select the match as the pair $(O_m, O_n)$ that maximizes $P(O_m, t|C_i)\,P(O_n, t|C_j)$. To match objects between two cameras $C_i$ and $C_j$, we evaluate the matching for all objects simultaneously instead of matching them independently.
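A brute-force sketch of this simultaneous matching step follows. The paper states the presence-probability criterion and the histogram-based comparison; combining them into a single multiplicative score, and the function and parameter names, are assumptions of this sketch.

from itertools import permutations

def best_assignment(P_i, P_j, sim):
    # Jointly assign objects leaving camera C_i to objects appearing at
    # camera C_j: every pair (O_m, O_n) is scored by its histogram
    # similarity sim(m, n) weighted by P(O_m, t|C_i) * P(O_n, t|C_j),
    # and the whole assignment is chosen at once instead of greedily
    # per object. Requires len(P_i) <= len(P_j); brute force is fine
    # for the few objects a camera pair sees simultaneously.
    ms, ns = list(P_i), list(P_j)
    best, best_score = None, -1.0
    for perm in permutations(ns, len(ms)):
        score = 1.0
        for m, n in zip(ms, perm):
            score *= sim(m, n) * P_i[m] * P_j[n]
        if score > best_score:
            best, best_score = list(zip(ms, perm)), score
    return best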

5. OBJECT BASED QUERIES

After matching the objects between the cameras, we label each video frame according to the object appearances. This enables us to include content semantics in the subsequent processes. A semantic scene is defined as a collection of shots that are consistent with respect to a certain semantic, in our case the identities of objects. Since we know the camera locations and we have extracted which objects appeared in which video at what time, we can, for instance, query for what an object did in a certain time period at a certain location. To obtain a concise representation of the query results, we generate an abstract of each query result. The key-frame-based summary is a collection of frames that aims to capture all of the visual essence of the video except, of course, the motion. In [4], we showed that the frame at which the cumulative motion activity is half the maximum value is also the halfway point for the cumulative increment in information. Thus, we select the key frames according to the cumulative function.
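This key-frame rule admits a very direct sketch; the per-frame activity values are assumed to come from a motion activity descriptor of the encoded segment, as in [4].

import numpy as np

def key_frame_index(motion_activity):
    # Select the frame at which the cumulative motion activity first
    # reaches half of its final value -- per [4], this is also the
    # halfway point of the cumulative increment in information.
    cum = np.cumsum(np.asarray(motion_activity, dtype=np.float64))
    return int(np.searchsorted(cum, 0.5 * cum[-1]))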

6. DISCUSSION

A base station controls the fusion of the multi-camera information, and its computational load is negligible. We present sample results of the single-camera tracking module in Fig. 8-a, and the object-based inquiry interface in Fig. 8-b. The tracking method can follow several objects at the same time, even when some of the objects are totally occluded for a long time. Furthermore, it provides accurate object boundaries. We initialized the Bayesian Network with identical conditional probabilities. These probabilities may also be adapted by observing the long-term motion behaviour of the objects. Sample query results are shown in Fig. 8-c. We are able to extract all appearances of the same person in a specified time period accurately. We can also count the number of different people.

Fig. 8. (a) Inquiry system, (b) extracted instances of an object in all three cameras.

We presented a novel color calibration method. Unlike the existing approaches, our method does not require uniformly illuminated color charts or controlled exposure image sets. Furthermore, our method can model non-linear, non-parametric distortions, and the transitive property makes the calibration of large-scale systems much simpler. The object-based representation enables us to associate content semantics, so we can generate query-based summaries. This is an important functionality for retrieving target video segments from a large database of surveillance video.

7. REFERENCES

[1] M. Grossberg and S. K. Nayar, "What can be known about the radiometric response function using images", Proc. of ECCV, 2002.

[2] T. Chang and S. Gong, "Bayesian modality fusion for tracking multiple people with a multi-camera system", Proc. of AVSS, 2001.

[3] F. Porikli and O. Tuzel, "Human body tracking and movement analysis", Proc. of ICVS-PETS, Austria, 2003.

[4] A. Divakaran, "Video summarization using descriptors of motion activity", Electronic Imaging, 2001.