Tracking a Person with 3D Motion by Integrating Optical Flow and Depth

Ryuzo Okada,^1 Yoshiaki Shirai,^2 Jun Miura,^2 and Yoshinori Kuno^2

^1 Multimedia Laboratory, Corporate Research & Development Center, Toshiba Corp., Kawasaki, 212-8582 Japan
^2 Department of Computer Controlled Mechanical Systems, Osaka University, Suita, 565-0871 Japan

SUMMARY

In this paper, a method of tracking a person in three-dimensional motion by integrating optical flow and depth is described. Using a human model, the objective is to estimate the state (position, posture, and motion) of the person in a scene at each frame. Based on the distribution of the estimated state, the optical flow and the depth information are used to calculate the probability of each pixel belonging to the target object, from which the object region is extracted. The optimum state is estimated by integrating the shape, the position, the flow vectors, and the depth information of the extracted target region with a Kalman filter. Because of the nonlinear relation between the state and the shape of the target object on the image, there are cases where the state cannot be determined uniquely. In such situations, all possible candidates are generated and each of them is tracked, and erroneous candidates are eliminated based on the difference between the tracking results and the observations. The effectiveness of the method is shown by its application to real image sequences. © 2001 Scripta Technica, Syst Comp Jpn, 32(7): 29–38, 2001

Key words: Three-dimensional tracking; sensor fusion; segmentation; optical flow; disparity.

1. Introduction

Visual object tracking is necessary in various applications such as autonomous vehicles, human interfaces, and motion analysis. In these applications, tracking methods that are robust in general scenes are desired.

Several object tracking methods have been proposed. Typical methods make use of the difference between frames [1], the difference from the background, the correlation between brightness distributions over a small region of the image [2], depth information from stereo vision [3], velocity information [4, 5], and the like. Since these methods are based on only one type of information, tracking fails once the target becomes indistinguishable from other objects with respect to that information.

Tracking methods using multiple types of information have also been proposed, including approaches based on velocity and color [6], velocity and edges [7], velocity and depth [8], and so on. But in these methods, if the target becomes indistinguishable from another object with respect to either type of information, it becomes impossible to correctly extract the region of the target object and continue tracking.

The authors therefore previously proposed a method based on integration of optical flow and depth, in which the object region can be extracted even if the target becomes indistinguishable with respect to one of the two types of information, so that tracking can be continued [9].

© 2001 Scripta Technica. Systems and Computers in Japan, Vol. 32, No. 7, 2001. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J82-D-II, No. 8, August 1999, pp. 1252–1261.

However, this method is based on the assumption that the optical flow is uniform over the object region; that is, the target is assumed to move parallel to the image plane. Similarly, in tracking by integration of the optical flow and regions of constant brightness [10], and in methods using the similarity of position (in the image), color, and spatiotemporal brightness gradients [11], the optical flow is assumed to remain constant. Further, in tracking of multiple objects by means of a split-and-merge active contour model, only motions parallel to the image plane are treated [12]. In these methods, tracking fails if the target moves along the optical axis or rotates.

To overcome such problems, it is necessary to consider the 3D motion of the object. Ishii and colleagues [13] extracted the 3D position of a human head and face from a stereo image sequence, but because the region extraction is based on the difference from the background, the method cannot be applied to scenes containing multiple persons. Huttenlocher and colleagues [14] achieved tracking in the presence of 3D motion by determining the shift and deformation of an edge-based model of the target object through minimization of the distance between the edges in the current image and the model edges. Since this method is based only on edge information, tracking becomes difficult when the background contains complicated texture.

There are also methods that estimate 3D motion from optical flow [15–17]. Accurate estimation, however, becomes difficult when the target object moves along the optical axis, because such motion changes the image only slightly compared with motion parallel to the image plane. It is also difficult to separate a rotation about an axis perpendicular to the optical axis from a translation parallel to the image plane and perpendicular to the rotation axis, because these motions generate similar flow vectors inside the target object region and the flow vectors cannot be determined reliably at the boundary of the target region [18] (see Fig. 1). Depth information is very useful for estimating translations parallel to the optical axis. Further, when the target object rotates about an axis perpendicular to the optical axis, the shape of its projection on the image plane varies greatly (except for round target objects), and therefore, if the 3D shape of the object is known, it becomes possible to estimate such a rotation. We estimate the state of a 3D object (position, posture, and velocity) by integrating the optical flow, the depth, and the target object region in the image. Because the shape of the target object on the image plane is used, the object region must be extracted correctly; therefore, the optical flow and depth information are integrated to obtain the proper object region.

The tracking method is described in Section 2, the extraction of the optical flow and depth information in Section 3, and the experimental results in Section 4.

2. Tracking Method

The coordinate systems used in a scene are shown in Fig. 2, together with the person model used in this method. The head is represented by a rectangular box, and the body is represented by the combination of a rectangular box and two half cylinders.

The transformation from coordinate system i to system j is expressed by Eq. (1).

Fig. 1. Optical flow generated from rotation around an axis perpendicular to the optical axis and translation perpendicular to the optical axis.

Fig. 2. Coordinate systems. [O_c-X_cY_cZ_c, O_w-X_wY_wZ_w, and O_b-X_bY_bZ_b are the camera (fixed to the camera), world (fixed to the scene), and body (fixed to the torso) coordinate systems, respectively.]


In Eq. (1), P_i and P_j denote a point expressed in the i and j coordinate systems, respectively, Q_ij denotes the matrix for conversion from coordinate system i to system j, and R_ij and T_ij represent the corresponding rotation matrix and translation vector. The matrix Q_cw for transforming camera coordinates to world coordinates and the matrix C for projecting camera coordinates onto image coordinates are known, and the relation between the body and world coordinates is to be estimated. The state vector x takes the form of Eq. (2), where r and t represent the rotation about O_b and the translation from O_b to O_w, the subscripts x, y, z stand for the components in the X, Y, Z directions, and the time derivative of q is denoted by q̇. The state vector x is to be determined at each frame. A walking person is considered in this paper, and the restriction that the person remains perpendicular to the floor is imposed. Therefore, the X-Z plane of the world coordinate system is taken to be parallel to the floor, with r_x = r_z = ṙ_x = ṙ_z = 0. Further, the camera is assumed to be almost parallel to the floor.
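From these definitions (and the homogeneous origin P_B = (0, 0, 0, 1)^T used later in the Appendix), Eqs. (1) and (2) presumably have the following standard forms; this is a sketch under those assumptions rather than a verbatim reproduction of the original equations:

\[
P_j = Q_{ij} P_i, \qquad
Q_{ij} = \begin{bmatrix} R_{ij} & T_{ij} \\ \mathbf{0}^\top & 1 \end{bmatrix}
\]
\[
x = \begin{bmatrix} q \\ \dot{q} \end{bmatrix}, \qquad
q = (r_x,\ r_y,\ r_z,\ t_x,\ t_y,\ t_z)^\top
\]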

Under these conditions, a rotation about an axis perpendicular to the optical axis becomes a rotation about the vertical axis. For proper estimation of such rotations, the body width is used as shape information of the target object on the image. In order to avoid the effect of arm motions, the width at the shoulders (shoulder width) is used as the body width. As the position of the target object, the projection of the origin O_b of the body coordinate system onto the image is used. The addition of this information distinguishes our approach from methods in which only the motion parameters of the target object are computed during tracking [17], and estimation errors do not accumulate.

2.1. Outline of the tracking method

In the initial frame, the optical flow and the disparity, which represents the depth information, are first calculated from the initial images. These results are used to extract the object region, and the initial state of the target person is estimated.

The flow of the processing in the subsequent frames is shown in Fig. 3. The optical flow and the depth are computed, and the current state is predicted from the state in the previous frame. Using the predicted state, the flow vector, and the disparity at each pixel, the probability that the pixel belongs to the target object is calculated. The pixels with high probability are determined to belong to the target object, and the shoulder width and the position of the object region are computed. The current state is estimated from these observations and the predicted state. If the angle of rotation about the vertical axis cannot be uniquely determined from the shoulder width, all possible candidates are generated. Erroneous candidates are detected and eliminated based on the difference between the predicted observations computed from the predicted state and the actual observations, and the whole procedure described above is repeated. If there are multiple candidates, the tracking procedure is applied to each of them.

2.2. Object shape in the image

2.2.1. Extraction of the object region

When the flow vector (u_i, v_i) and the disparity d_i are observed at pixel p_i, the probability P(p_i ∈ T | o_i) that p_i belongs to the target object T is given by Eq. (3), where o_i = (u_i, v_i, d_i)^T, P(p_i ∈ T) is the a priori probability, P(o_i | p_i ∈ T) is the conditional probability of observing o_i when p_i belongs to the target object, and P(o_i) denotes the probability of observing o_i. The method of calculating each term on the right-hand side of Eq. (3) is described below, followed by the extraction of the object region.

Fig. 3. Flow of tracking procedure.

P(o_i | p_i ∈ T) can be calculated from the probability distribution of the predicted flow vector and the predicted disparity at pixel p_i. In our method, this probability distribution is approximated by a normal distribution. The relation between the object state x_t at time t and o_i is given by Eq. (4), where h_i is a vector function expressing the relation between the object state x_t and the observation o_i (see the Appendix), and w_it represents the random observation-error vector, which is white noise with zero mean and covariance matrix W_it. Since the reliability of the optical flow and the disparity depends on the contrast, W_it depends on the thresholds used in their calculation (see Section 3). The expectation õ_i and the covariance matrix O_i of the probability distribution of the predicted flow vector and disparity are obtained by propagating the predicted state distribution through h_i linearized with the Jacobian H_i = ∂h_i/∂x evaluated at x̃_t, where x̃_t and X̃_t respectively denote the mean and the covariance matrix of the predicted state at time t (see Section 2.3).
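Under the Gaussian approximation just described, Eqs. (3) and (4) and the propagation of the predicted state presumably take the following standard forms (a reconstruction from the definitions above, not the original equations):

\[
P(p_i \in T \mid o_i) = \frac{P(o_i \mid p_i \in T)\, P(p_i \in T)}{P(o_i)}
\]
\[
o_i = h_i(x_t) + w_{it}, \qquad w_{it} \sim \mathcal{N}(0, W_{it})
\]
\[
\tilde{o}_i = h_i(\tilde{x}_t), \qquad
O_i = H_i \tilde{X}_t H_i^\top + W_{it}, \qquad
H_i = \left.\frac{\partial h_i}{\partial x}\right|_{\tilde{x}_t}
\]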

The a priori probability P(p_i ∈ T) is obtained by integrating the probability density p(x) of the predicted state over all states x in D for which p_i lies inside R(x), where D represents the set of all predicted states, R(x) represents the region obtained by projecting the target model with state x onto the image (Fig. 2), and p(x) represents the probability density that the predicted state is x [this distribution is approximated by a normal distribution with mean x̃_t and covariance matrix X̃_t (see Section 2.3)].

P(o_i) in the denominator of Eq. (3) can be rewritten as the sum of a target term and a background term [Eq. (7)], where T̄ represents the set of all pixels which do not belong to the target object. The corresponding region is the background or belongs to other objects, and it is assumed that, within the range where observation is possible, all flow vectors and all disparities are equally probable. In this case, the second term on the right-hand side of Eq. (7) becomes U(o_i) P(p_i ∈ T̄), where U(o_i) is a uniform distribution. The range of the uniform distribution is determined according to the scene. It is assumed that there is no drastic motion between frames, because the frame interval is sufficiently small relative to the object motions in the scene; the range of each component of the flow vector is therefore taken to be from −10 to 10 pixels/frame (image size: 320 pixels horizontally and 240 pixels vertically). Further, it is assumed that no object is extremely close to the stereo cameras, and the disparity range is set to 0 to 130 pixels. The uniform distribution is defined over these flow vector and disparity ranges.
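As a minimal numerical sketch of this per-pixel decision (assuming the Gaussian target likelihood above and a single uniform background density; the function name and example values are illustrative, not the authors' code), in Python with NumPy:

    import numpy as np

    # Uniform background density: flow in [-10, 10] x [-10, 10] pixels/frame, disparity in [0, 130].
    U_BG = 1.0 / (20.0 * 20.0 * 130.0)

    def target_probability(o, o_pred, O_cov, p_prior):
        """P(p_i in T | o_i) for one pixel: Gaussian target likelihood (Eq. (3))
        combined with a uniform background likelihood (second term of Eq. (7))."""
        d = o - o_pred                                   # innovation in (u, v, disparity)
        Oinv = np.linalg.inv(O_cov)
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** 3) * np.linalg.det(O_cov))
        lik_target = norm * np.exp(-0.5 * d @ Oinv @ d)  # P(o_i | p_i in T)
        evidence = lik_target * p_prior + U_BG * (1.0 - p_prior)   # P(o_i)
        return lik_target * p_prior / evidence

    # Example: a pixel whose flow and disparity match the prediction well.
    o_obs  = np.array([1.2, -0.4, 42.0])
    o_pred = np.array([1.0, -0.5, 41.5])
    O_cov  = np.diag([0.5, 0.5, 2.0])
    print(target_probability(o_obs, o_pred, O_cov, p_prior=0.6))   # close to 1 -> target pixel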

The set of pixels whose probability P(p_i ∈ T | o_i) exceeds a threshold t_T is taken as the target object region S_t. Figure 4 shows the relation between the error rates and the threshold t_T (E_n is the fraction of background pixels classified as target, and E_p is the fraction of target pixels classified as background); the correct region of the target object was specified manually. As can be seen, when the threshold is between 0.2 and 0.8, E_n and E_p are both small and almost insensitive to the threshold value. It is therefore desirable to choose the threshold in this range; in the present method, the threshold is set to 0.5.

2.2.2. Shoulder width and target position on the image

The boundary pixels of the target object region S_t on the horizontal line L_s passing through the midpoint of the shoulder are denoted by p_sL and p_sR (see Fig. 5). The shoulder width W_s is the distance between p_sL and p_sR, and the shoulder center p_s is the midpoint between them. Since a measurement along the single line L_s is unstable, the shoulder widths and midpoints on N_s horizontal lines are also determined, and their averages are taken as the shoulder width W_s and the shoulder center p_s. Since the shoulder-width variance V_Ws² is the variance of the average of the widths W_si determined on the individual lines, it is given by the standard formula for the variance of a mean [19].

Fig. 4. Relation between threshold of probability and error rate.

Fig. 5. Feature extraction from target region.

The variance V_x² of the mean of the horizontal components of p_s is determined similarly.

The line l_t is the uppermost horizontal line that contains more than N_t pixels belonging to the target object region. The vertical position of the vertex (the top of the target region) is taken to be the average of the vertical components of the boundary pixels above l_t, and the horizontal line at this position is denoted by L_t. Similarly to the variance V_Ws² of the shoulder width, the variance V_y² of the vertical position of the vertex can also be determined.

On the vertical line L_sc passing through p_s, the object position p_B is taken to be the point lower than L_t by a distance d_tc (the distance from the vertex to the origin of the body coordinate system); d_tc can be calculated from the human model (Fig. 2) and the predicted state x̃_t. The variances of the horizontal and vertical components of p_B are the previously determined V_x² and V_y².
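A sketch of the shoulder-width measurement, assuming the target region is given as a boolean mask and that the variance follows the usual variance-of-a-mean rule (the exact form of the equation cited from [19] is an assumption here):

    import numpy as np

    def shoulder_width(mask, rows):
        """Width and center on each of the N_s rows through the shoulder, averaged;
        the variances are those of the averages."""
        widths, centers = [], []
        for r in rows:
            cols = np.flatnonzero(mask[r])          # target-region pixels on this row
            if cols.size:
                widths.append(cols[-1] - cols[0])   # p_sR - p_sL
                centers.append(0.5 * (cols[0] + cols[-1]))
        widths, centers = np.array(widths, float), np.array(centers, float)
        n = len(widths)
        W_s, p_s_x = widths.mean(), centers.mean()
        V_Ws = widths.var(ddof=1) / n               # assumed variance-of-the-mean form
        V_x = centers.var(ddof=1) / n
        return W_s, p_s_x, V_Ws, V_x

    mask = np.zeros((240, 320), bool); mask[100:200, 140:190] = True
    print(shoulder_width(mask, rows=range(110, 115)))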

2.3. State prediction and estimation

In this method, a Kalman filter is used for the prediction and estimation of the state x_t of the target object at time t [20]. If it is assumed that the object moves at constant velocity between successive frames, the system equation of the Kalman filter is Eq. (9), where I_6 is the 6 × 6 identity matrix and u is the system noise, which is white noise with zero mean and covariance matrix U, determined based on the acceleration range of the target object.

If the mean and the covariance matrix of the estimated target state at time t − 1 are denoted by x̂_{t−1} and X̂_{t−1}, the mean x̃_t and the covariance matrix X̃_t of the predicted target state are given by Eq. (10).
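Under the constant-velocity assumption, Eqs. (9) and (10) presumably take the usual form, with the frame interval as the time unit (a reconstruction from the definitions above):

\[
x_t = F x_{t-1} + u, \qquad
F = \begin{bmatrix} I_6 & I_6 \\ 0 & I_6 \end{bmatrix}, \qquad
u \sim \mathcal{N}(0, U)
\]
\[
\tilde{x}_t = F \hat{x}_{t-1}, \qquad
\tilde{X}_t = F \hat{X}_{t-1} F^\top + U
\]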

If the observation vector is denoted by y_t = [W_s, p_B^T, o_1^T, ..., o_i^T, ...]^T, the observation equation becomes Eq. (11), where W_s is the shoulder width, p_B is the position of the target object on the image (the projection of the origin of the body coordinate system onto the image), o_i = (u_i, v_i, d_i)^T is the flow vector and the disparity at pixel p_i belonging to the target object region, h is a vector function relating the observations to the state (see the Appendix), and w_t is the observation error, which is white noise with zero mean and covariance matrix W_t. The components of w_t are uncorrelated, and their variances are determined from the reliability of the observations [the variances of the shoulder width W_s and of the position p_B are V_Ws², V_x², and V_y² (see Section 2.2.2), and the variances of the optical flow and the disparity are V_u², V_v², and V_d² (see Section 3)]. Since h is not linear, Eq. (11) is linearized around the predicted state and an extended Kalman filter is formed. The mean x̂_t and the covariance matrix X̂_t of the estimated state are given by Eqs. (12) and (13), where K_t = X̃_t H_t^T (H_t X̃_t H_t^T + W_t)^{-1} is the Kalman gain, I_12 is the 12 × 12 identity matrix, and H_t = ∂h(x)/∂x evaluated at x̃_t. When the target object is occluded, its region cannot be extracted and shape information cannot be used. In such cases, the occlusion of the target object is detected from the variations of the optical flow and the disparity [9], and tracking is performed without using shape information.
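A compact sketch of the linearized update of Eqs. (12) and (13), with a numerical Jacobian standing in for the analytic one of the Appendix (illustrative, not the authors' implementation):

    import numpy as np

    def ekf_update(x_pred, X_pred, y, h, W):
        """Extended Kalman filter measurement update.
        x_pred, X_pred : predicted state mean and covariance
        y              : observation vector [W_s, p_B, o_1, ..., o_n]
        h              : function mapping a state to the predicted observation
        W              : observation-noise covariance (diagonal in the paper)"""
        eps = 1e-5
        H = np.empty((y.size, x_pred.size))
        for j in range(x_pred.size):             # numerical Jacobian H_t = dh/dx at x_pred
            dx = np.zeros_like(x_pred); dx[j] = eps
            H[:, j] = (h(x_pred + dx) - h(x_pred - dx)) / (2 * eps)
        S = H @ X_pred @ H.T + W                 # innovation covariance
        K = X_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        x_est = x_pred + K @ (y - h(x_pred))
        X_est = (np.eye(x_pred.size) - K @ H) @ X_pred
        return x_est, X_est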

If there are multiple persons with similar flow vectors close to each other, their regions are connected and the combined region is extracted as the target object region. However, if the observed shoulder width is sufficiently larger than the predicted one, it can be detected that multiple persons are being tracked together. In such a case, tracking is also performed without using shape information.

2.4. Generation and elimination of candidates

In the region where the observation of the shoulder width [i.e., Eq. (A.4)] is highly nonlinear, multiple angles of rotation about the vertical axis are consistent with a single observed shoulder-width value. For example, in Fig. 6(a), there are two solutions r_yt^1 and r_yt^2 for the rotation about the vertical axis given the observed value W_s, but because of the linearization, only one solution can be determined. This happens only in the vicinity of a local maximum or minimum of the shoulder-width function; therefore, when the estimated angle leaves such a vicinity, Eq. (14) is used to generate another state candidate symmetrical to the estimated value r̂_yt with respect to the local extremum M_t [see Fig. 6(b)], and tracking is performed for each candidate. In Eq. (14), r̂_yt^(N+1) is the angle of rotation of the new state candidate about the vertical axis, N denotes the number of existing candidates, and M_t represents the nearest local maximum or minimum at time t. The covariance matrix of the original state candidate is also used for the new state candidate.

Fig. 6. State candidates.

Fig. 7. Criterion of candidates.

The two candidates generated at the 13th frame of the image sequence in Fig. 9 are shown in Figs. 6(c) and 6(d). A candidate whose predicted state contradicts the current observations is erroneous; therefore, based on the difference between the predicted observation ỹ_t^k = h(x̃_t^k) and the actual observation y_t^k, the criterion value C_t^k of Eq. (15) is calculated for each candidate k and the candidates are compared. Here, W_t^k is the covariance matrix of the observation error of candidate k and n_t^k is the dimension of y_t^k; since the dimension of the observation vector differs from candidate to candidate, the criterion value is normalized by n_t^k. The criterion values C_t^k for frames 1 to 15 of the image sequence in Fig. 9 are shown in Fig. 7. As can be seen, the criterion values of the erroneous candidates are large. However, the criterion values may differ only slightly, as in the seventh frame, so that an immediate decision could be wrong. Therefore, a candidate is eliminated only when its criterion value remains large compared with those of the other candidates over some time interval.
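The mirrored-candidate rule of Eq. (14) and the criterion of Eq. (15) might be coded as below; the criterion is assumed here to be a Mahalanobis distance divided by the observation dimension, which is what the normalization described above suggests:

    import numpy as np

    def mirrored_candidate(r_y_est, M_t):
        """New rotation-angle candidate symmetric to the estimate about the local extremum M_t."""
        return 2.0 * M_t - r_y_est

    def candidate_criterion(y_obs, y_pred, W_k):
        """Normalized inconsistency C_t^k between observation and prediction for candidate k."""
        d = y_obs - y_pred
        return float(d @ np.linalg.inv(W_k) @ d) / y_obs.size

    # A candidate whose criterion stays much larger than the others' over several frames is dropped.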

2.5. State initialization

In the initial frame, the camera system is static. If the target is the only moving object in the initial frame, the set of pixels whose flow vectors are sufficiently large is taken as the candidate region S_v of the target object. Inside S_v, the average Z_m of the disparities is calculated, and the set of pixels whose disparities are similar to Z_m is extracted from S_v and taken as the target object region S_0. If other moving objects exist in the initial frame, a rectangle circumscribing the target object is given, and the above process is carried out inside it.

As explained in Section 2.2, the target position p_s and the shoulder width W_s are determined using S_0 and Z_m. Then the parameters for the conversion from the body to the world coordinates [q of Eq. (2)] are calculated. Since the angle of rotation about the vertical axis is not uniquely determined from the shoulder width, all possible candidates are generated. The initial rotational and translational velocities are set to 0.
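The initialization can be sketched as two thresholding steps, first on flow magnitude and then on disparity around its mean inside the candidate region (the constants are illustrative, not the authors' values):

    import numpy as np

    def initial_region(flow_u, flow_v, disparity, min_flow=2.0, disp_tol=5.0):
        """S_v: pixels with sufficiently large flow; S_0: pixels of S_v whose disparity
        is close to the mean disparity Z_m inside S_v (Section 2.5)."""
        S_v = np.hypot(flow_u, flow_v) > min_flow
        Z_m = disparity[S_v].mean()
        S_0 = S_v & (np.abs(disparity - Z_m) < disp_tol)
        return S_0, Z_m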

3. Extraction of Information on Optical Flow and Disparity

The optical flow and the disparity are determined from point correspondences between successive images and between the left and right stereo images, respectively. In this method, the point correspondence is determined by SAD matching [2]. However, no reliable correspondence can be established for points with poor contrast. For the optical flow, no correspondence is assigned to points satisfying the condition of Eq. (16), where W is the length of one side of the image subregion used in the SAD calculation, f(x, y) denotes the brightness at point (x, y), the subscripts x and y denote the partial derivatives in the horizontal and vertical directions, and c_v is a positive constant. For the disparity, no correspondence is assigned to points satisfying the condition of Eq. (17), where c_d is a positive constant. Since the reliability of the optical flow and the disparity depends on the contrast, the variances V_u² and V_v² of the flow vector and the variance V_d² of the disparity are determined by Eq. (18), where k_v and k_d are positive constants. Further, even for an established point correspondence, the reliability is low if the SAD value is large; the flow vectors and disparities of such points are therefore not used. Finally, if no corner edge exists inside the subregion used for the SAD calculation, the point correspondence cannot be established uniquely, and the flow vector is not calculated for such points [9]. Figure 8 shows the extracted optical flow and disparity.
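A small sketch of SAD stereo matching with contrast gating in the spirit of Eqs. (16)-(18); since those equations are not reproduced above, a summed squared horizontal gradient over the matching window stands in for the exact contrast measure (an assumption), and the constants are illustrative:

    import numpy as np

    def sad(block_a, block_b):
        return np.abs(block_a.astype(float) - block_b.astype(float)).sum()

    def match_along_row(left, right, x, y, half=4, max_disp=130, c_d=500.0, k_d=1e4):
        """Disparity of pixel (x, y) by SAD along the epipolar line, skipped when the
        horizontal contrast in the window is too low; the returned variance grows
        as the contrast falls."""
        win = left[y - half:y + half + 1, x - half:x + half + 1]
        fx = np.diff(win.astype(float), axis=1)
        contrast = (fx ** 2).sum()
        if contrast < c_d:                       # poor contrast: no reliable correspondence
            return None, None
        scores = [sad(win, right[y - half:y + half + 1, x - d - half:x - d + half + 1])
                  for d in range(0, min(max_disp, x - half) + 1)]
        d_best = int(np.argmin(scores))
        return d_best, k_d / contrast            # disparity and its variance V_d^2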

4. Experiments

Figure 9 shows the result of tracking a person whose motion includes rotation about an axis perpendicular to the optical axis. For rotations about the vertical axis, whose estimation is difficult even when the optical flow and depth information are used, proper estimation becomes possible by using the shoulder-width information W_s. For comparison, when tracking is performed without the shoulder width W_s and the torso position p_B [eliminated from Eq. (11)], the velocity estimation error accumulates over the frames and the tracking fails, as shown in Fig. 10.

In the example shown in Fig. 11, from frames 7 to 16 there are other persons close to the target moving in the opposite direction. In such a case, it is difficult to distinguish the target from the other persons based on the disparity alone; they can, however, be distinguished by using the optical flow, so that the object region is properly extracted and tracking succeeds. From frames 165 to 195, the target overlaps another person moving with the same velocity. It is then difficult to distinguish these persons


Fig. 8. Extracted optical flow and disparity at the 13th frame in Fig. 9. (Optical flow or disparity cannot be calculated in the white regions.)

Fig. 9. Motion including rotation around an axis perpendicular to the optical axis. [Upper rows show the tracking result. Lower rows show the extracted target region (black), vertex (gray line), shoulder width (gray lines), and origin of the torso (gray point).]

based only on the optical flow. However, the target region is properly extracted based on the disparity, and the tracking is carried out successfully. Further, although this example includes large motions along the optical axis, proper estimation is possible by using the depth information. Figure 12 shows the tracking result of this example as a top view of the object position and the torso direction. In Fig. 12, the estimated target direction changes suddenly at A, which means that a new state candidate was generated in that frame and eliminated in the following frame; the situation at A′ is similar. At B, B′, and B″ in this figure, there are fluctuations in the estimation results due to errors in the optical flow and disparity, which cause erroneous extraction of the target region.

5. Conclusions

The state of a person walking in a scene was estimated by using optical flow and depth, and the person was tracked. For rotations about an axis perpendicular to the optical axis, whose correct estimation is difficult even when optical flow and depth information are used, proper estimation became possible by additionally using shape information about the target. Because shape information is used, correct extraction of the target region is indispensable; therefore, the predicted target state, the optical flow, and the depth information were integrated to extract the target region properly. Experiments showed that tracking is possible even when the target undergoes 3D motion including rotation about an axis perpendicular to the optical axis and other persons are present, provided that the target can be distinguished from them by at least one of the two kinds of information, confirming the effectiveness of the method.

Fig. 10. Tracking without shoulder width and position.

Fig. 11. Tracking when another object has depth or velocity similar to the target.

Fig. 12. Top view of the target trajectory. (Each line stands for the direction of the shoulders; the target position is the midpoint of the line.)

Acknowledgment. We acknowledge the assistance of the members of the Shirai Laboratory, Osaka University, during this research.

REFERENCES

1. Yachida M, Asada M, Tsuji S. Automatic analysis of moving image. IEEE Trans PAMI 1981;3:12–20.
2. Inoue H, Tachikawa T, Inaba M. Robot vision system with a correlation chip for real-time tracking, optical flow and depth map generation. Proc IEEE Int Conf R&A, p 1621–1626, 1992.
3. Coombs D, Brown C. Real-time smooth pursuit tracking for a moving binocular robot. Proc CVPR '92, p 23–28.
4. Yamamoto S, Mae Y, Shirai Y, Miura J. Realtime multiple object tracking based on optical flows. Proc IEEE Int Conf R&A 1995;3:2328–2333.
5. Chen HJ, Shirai Y, Asada M. Detecting multiple rigid image motions from an optical flow field obtained with multi-scale, multi-orientation filters. IEICE Trans Inf Syst 1993;E76-D:1253–1262.
6. Kyo H, Harajima H. Region segmentation and tracking of moving pictures based on clustering of position, color and motion. Tech Rep IEICE 1995;IE94-152.
7. Koller D, Weber J, Malik J. Robust multiple car tracking with occlusion reasoning. Proc ECCV '94, p 189–196.
8. Uhlin T, Nordlund P, Maki A, Eklundh JO. Towards an active visual observer. Proc 5th ICCV, p 679–686, 1995.
9. Okada R, Shirai Y, Miura J, Kuno Y. Object tracking based on optical flow and depth information. Trans IEICE 1997;J80-D-II:1530–1583.
10. Yamane T, Shirai Y, Miura J. Person tracking by integrating optical flow and uniform brightness regions. Proc IEEE Int Conf R&A, p 3267–3272, 1998.
11. Etoh M, Shirai Y. 2-D motion estimation by region segmentation based on color, position and intensity gradients. Trans IEICE 1993;J76-D-II:2324–2332.
12. Araki S, Matsuoka T, Takemura H, Yokoya N. Real-time tracking of multiple moving objects in moving camera image sequence using robust statistics. Proc 14th ICPR 1998;2:1433–1435.
13. Ishii H, Mochizuki K, Kishino F. A method of motion recognition from stereo images for human image synthesis. Trans IEICE 1993;J76-D-II:1805–1812.
14. Huttenlocher DP, Noh JJ, Rucklidge WJ. Tracking non-rigid objects in complex scenes. Proc 4th ICCV, p 93–101, 1993.
15. Ju SX, Black MJ, Yacoob Y. Cardboard people: A parameterized model of articulated image motion. Proc Int Conf on Face and Gesture Recognition, p 461–465, 1996.
16. Mae Y, Shirai Y. Tracking moving object in 3-D space based on optical flow and edges. Proc 14th ICPR 1998;2:1439–1441.
17. Yamamoto M, Kawada S, Kondo T, Koshikawa K. Image tracking of persons with 3-D motions. Trans IEICE 1996;J79-D-II:71–83.
18. Adiv G. Inherent ambiguities in recovering 3-D motion and structure from a noisy flow field. IEEE Trans PAMI 1989;11:477–489.
19. Honma H, Kasugaya N. Dimensional analysis, least squares method, and experimental equations. Corona Publishers; 1957. Chapter 2.
20. Arimoto T. Kalman filters. Industrial Library; 1977. Chapter 3.

APPENDIX

Relation between Target State and Observed Values

The flow vector (u_i, v_i) observed on the image at a point P_i^b of the target expressed in body coordinates is given by Eq. (A.1), where f is the focal length and Z_i^c and Ż_i^c are the Z components of P_i^c and Ṗ_i^c, the corresponding point and its velocity in camera coordinates. Using the expectation x̃_t^k of the predicted target state and the human model, the predicted depth Z_i^c can be obtained at each point of the object. Further, the transformation matrix Q̃_bw can be obtained from the expectation of the predicted state x̃_t^k. For a point (x_i, y_i) in the image, P_i^b is determined by Eq. (A.2), and the disparity d_i is given by Eq. (A.3), where b is the baseline distance between the stereo cameras. The shoulder width W_s can be expressed by Eq. (A.4),

where P_l^b and P_r^b are the centers of the half cylinders at the shoulder position of the human model in the body coordinate system (see Fig. 2), r is the radius of the half cylinders, and Z_l^c and Z_r^c are the Z components of Q_cw^{-1} Q_bw P_l^b and Q_cw^{-1} Q_bw P_r^b. The position of the target on the image, p_B^h = (x_B, y_B, 1)^T, is expressed by Eq. (A.5), where P_B = (0, 0, 0, 1)^T is the origin of the body coordinate system and Z_b^c is the Z component of Q_cw^{-1} Q_bw P_B. The observations u_i, v_i, d_i, W_s, x_B, and y_B are thus determined by the matrices Q_bw and Q̇_bw. Since the target state appears as parameters in these matrices, the vector function h relating the observations to the target state is obtained.
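Assuming a standard pinhole and stereo model with P_i^c = Q_cw^{-1} Q_bw P_i^b, the flow, disparity, and projection relations of Eqs. (A.1), (A.3), and (A.5) presumably read as follows; the shoulder-width expression (A.4) additionally involves the cylinder radius r and is not reconstructed here:

\[
u_i = \frac{f\,\dot{X}_i^c}{Z_i^c} - \frac{f\,X_i^c\,\dot{Z}_i^c}{(Z_i^c)^2}, \qquad
v_i = \frac{f\,\dot{Y}_i^c}{Z_i^c} - \frac{f\,Y_i^c\,\dot{Z}_i^c}{(Z_i^c)^2}
\]
\[
d_i = \frac{f\,b}{Z_i^c}
\]
\[
p_B^h = \frac{1}{Z_b^c}\, C\, Q_{cw}^{-1} Q_{bw} P_B
\]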

AUTHORS (from left to right)

Ryuzo Okada (member) received his B.S. degree in engineering from Osaka University in 1995 and finished his graduate studies in the field of computer-controlled mechanical systems in 1999, receiving a D.Eng. degree. He joined Toshiba Corp. in 1999. He is engaged in research on computer vision.

Yoshiaki Shirai (member) received his B.E. degree from Nagoya University in 1964 in the field of mechanical engineering and his M.E. and D.Eng. degrees from the University of Tokyo in 1969, and then joined the Electrotechnical Laboratory. He spent 1971–72 at the MIT AI Laboratory as a visiting scholar. Since 1988 he has been a professor in the Department of Computer-Controlled Mechanical Systems at Osaka University. He has been involved in research on computer vision, robotics, and artificial intelligence. He is a member of the Japanese Society for Artificial Intelligence, the Robotics Society of Japan, the Information Processing Society of Japan, the Japan Society of Mechanical Engineers, the IEEE Computer Society, and AAAI.

Jun Miura (member) received his B.E. degree in mechanical engineering from the University of Tokyo in 1984 and finished his graduate studies in the field of information engineering in 1989, receiving a D.Eng. degree. He joined Osaka University in 1989 as a research associate in the Department of Computer-Controlled Mechanical Systems and became an associate professor in 1999. He has been involved in research on intelligent robots. He was a visiting scholar at CMU during 1994–95. He is a member of IEEE, AAAI, the Japanese Society for Artificial Intelligence, the Robotics Society of Japan, the Information Processing Society of Japan, and the Japan Society of Mechanical Engineers.

Yoshinori Kuno (member) received his B.S. and D.Eng. degrees in electrical and electronics engineering from the University of Tokyo in 1977 and 1982. He joined Toshiba Corp. in 1982 and was a visiting research scientist at CMU in the United States during 1987–88. He joined Osaka University in 1993 as an associate professor in the Department of Computer-Controlled Mechanical Systems. Since 2000, he has been a professor in the Department of Information and Computer Science at Saitama University. He is engaged in research on computer vision, intelligent robots, and human interfaces. He is a member of the Information Processing Society of Japan, the Japan Society of Mechanical Engineers, the Robotics Society of Japan, the Japanese Society for Artificial Intelligence, the Society of Instrument and Control Engineers, ACM, and IEEE.
