Tracking a Person with 3D Motion by Integrating Optical Flow and Depth
Ryuzo Okada,1 Yoshiaki Shirai,2 Jun Miura,2 and Yoshinori Kuno2
1Multimedia Laboratory, Corporate Research & Development Center, Toshiba Corp., Kawasaki, 212-8582 Japan
2Department of Computer Controlled Mechanical Systems, Osaka University, Suita, 565-0871 Japan
SUMMARY
In this paper, a method of tracking a person in three-dimensional motion by integration of optical flow and depth is described. Using a human model, the objective is to estimate the state (position, posture, and motion) of the person in a scene at each frame. Based on the distribution of the estimated state, the optical flow and the depth information are used to calculate the probability of each pixel belonging to the target object, from which the object region is estimated. The optimum state is estimated by integrating the shape, the position, the flow vectors, and the depth information of the extracted target region with a Kalman filter. Because of the nonlinear relation between the state and the shape of the target object on the image, there are cases where the state cannot be determined uniquely. In such situations, all possible candidates are generated and each of them is tracked; erroneous candidates are eliminated based on the difference between the tracking results and the observations. The effectiveness of the method is shown by application to real image sequences. © 2001 Scripta Technica, Syst Comp Jpn, 32(7): 29–38, 2001
Key words: Three-dimensional tracking; sensor fusion; segmentation; optical flow; disparity.

[Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J82-D-II, No. 8, August 1999, pp. 1252–1261.]
1. Introduction
Visual object tracking is required in a variety of applications, such as self-guided vehicles, human interfaces, and motion analysis, and robust methods for general scenes have long been desired.
Several object tracking methods have been proposed. Typical methods make use of the difference between frames [1], the difference from the background, the correlation of brightness distributions over a small image region [2], depth information from stereo vision [3], velocity information [4, 5], and the like. Since these methods are based on only one type of information, tracking fails once the target becomes indistinguishable from other objects in that information.
Tracking methods using multiple types of information have also been proposed, including approaches based on velocity and color [6], velocity and edges [7], and velocity and depth [8]. But in these methods, if the target becomes indistinguishable from another object in either type of information, it becomes impossible to correctly extract the target region and continue the tracking.
The authors therefore proposed a method based on the integration of optical flow and depth, in which the object region can be extracted, and tracking successfully continued, even if the target becomes indistinguishable from its surroundings in one of the two types of information [9]. But this method is based on the assumption that the optical flow is uniform over the object region; that is, the target is assumed to move parallel to the image plane. Similarly, in tracking by integration of the optical flow and regions of uniform brightness [10], and in methods using similarity of position (in the image), color, and spatiotemporal brightness gradients [11], the optical flow is assumed to remain constant. Further, the tracking of multiple objects by means of the split-and-merge active contour model treats only movements parallel to the image plane [12]. In these methods, if there is movement parallel to the optical axis, or rotation, the tracking fails.
To overcome such problems, it is necessary to consider the 3D movement of the object. Ishii and colleagues [13] extracted the 3D position of a human head and face from a stereo image sequence, but because the region extraction is based on the difference from the background, the method cannot be applied to scenes with multiple persons. Huttenlocher and colleagues [14] achieved tracking in the presence of 3D motion by determining the shift and deformation of an edge-based model of the target object through minimization of the distance between the edges in the current image and the model edges. Since this method is based only on edge information, tracking becomes difficult if the background has a complicated texture.
There are also methods that estimate 3D motion from the optical flow [15–17]. But accurate estimation becomes difficult when the target object moves along the optical axis, because such movement changes the image far less than movement parallel to the image plane. It is also difficult to separate rotation about an axis perpendicular to the optical axis from translation parallel to the image plane and perpendicular to the rotation axis, because these motions generate similar flow vectors inside the target region, while the flow vectors at the boundary of the region are undetermined [18] (see Fig. 1). Depth information is very useful for estimating translation parallel to the optical axis. Further, when the target object rotates about an axis perpendicular to the optical axis, its shape on the image plane varies greatly (except for round objects); therefore, if the 3D shape of the object is known, the rotation about an axis perpendicular to the optical axis can be estimated. We estimate the state of a 3D object (position, posture, and velocity) by integrating the optical flow, the depth, and the target object region in the image. To use the shape of the target object on the image plane, the object region must be extracted correctly; therefore, the optical flow and the depth information are integrated to obtain the proper object region.
The tracking method is described in Section 2, the extraction of the optical flow and depth information in Section 3, and experimental results in Section 4.
2. Tracking Method
Fig. 1. Optical flow generated from rotation around an axis perpendicular to the optical axis and translation perpendicular to the optical axis.

Fig. 2. Coordinate systems. [Oc–XcYcZc, Ow–XwYwZw, and Ob–XbYbZb are the camera (fixed to the camera), world (fixed to the scene), and body (fixed to the torso) coordinate systems, respectively.]

The coordinate systems in a scene and the person model used in this method are shown in Fig. 2. The head is represented by a rectangular box, and the body by the combination of a rectangular box and two semicylinders. The transformation from coordinate system i to system j is expressed by
$$P_j = Q_{ij} P_i, \qquad Q_{ij} = \begin{bmatrix} R_{ij} & T_{ij} \\ \mathbf{0}^T & 1 \end{bmatrix} \tag{1}$$
where Pi and Pj correspond to the points in the i and j coordinate systems, respectively, Qij denotes the matrix for conversion from coordinate system i to system j, and Rij and Tij represent the rotation matrix and the translation vector, respectively. The matrix Qcw for transforming camera coordinates to world coordinates and the matrix C for transforming camera coordinates to image coordinates are known; the relation between the body and world coordinates is to be estimated. The state vector x takes the form

$$\mathbf{x} = \begin{bmatrix} \mathbf{q} \\ \dot{\mathbf{q}} \end{bmatrix}, \qquad \mathbf{q} = [\, r_x \;\; r_y \;\; r_z \;\; t_x \;\; t_y \;\; t_z \,]^T \tag{2}$$

where r and t represent the rotation about Ob and the translation from Ob to Ow, the subscripts x, y, z stand for the components in the X, Y, Z directions, and $\dot{\mathbf{q}}$ is the time derivative of q. The state vector x is to be determined at each frame. A walking person is considered in this paper, and the restriction that the person remains perpendicular to the floor is imposed; therefore, the XZ plane of the world coordinate system is taken to be parallel to the floor, with $r_x = r_z = \dot{r}_x = \dot{r}_z = 0$. Further, the camera system is assumed to be almost parallel to the floor.
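As a concrete illustration of Eqs. (1) and (2), the following sketch (in Python with NumPy; the function names are ours, not from the paper) builds the homogeneous transformation Q and the single rotational degree of freedom left by the walking constraint.

```python
import numpy as np

def make_transform(R, T):
    """Homogeneous 4x4 matrix Q of Eq. (1): P_j = Q @ P_i."""
    Q = np.eye(4)
    Q[:3, :3] = R
    Q[:3, 3] = T
    return Q

def rot_y(r_y):
    """Rotation about the vertical axis; with r_x = r_z = 0 this is the
    only rotational degree of freedom left in the state of Eq. (2)."""
    c, s = np.cos(r_y), np.sin(r_y)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# Body-to-world transform for a state with rotation r_y and translation t:
Q_bw = make_transform(rot_y(0.3), np.array([1.0, 0.0, 2.5]))
P_b = np.array([0.0, 0.0, 0.0, 1.0])   # origin O_b in body coordinates
P_w = Q_bw @ P_b                        # the same point in world coordinates
```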
Under these circumstances, rotations about an axis perpendicular to the optical axis become rotations about the vertical axis. For proper estimation of such rotations, the body width is used as shape information on the image of the target object; to avoid the effects of arm motion, the width at the shoulders (the shoulder width) is used as the body width. As the position of the target object, the projection of the origin Ob of the body coordinate system onto the image is used. The addition of this information distinguishes our approach from methods in which only the motion parameters of the target object are computed during tracking [17], and the error does not accumulate.
2.1. Outline of the tracking method
In the initial frame, the optical flow and the disparity, which represents depth information, are first calculated from the initial images. These results are used to extract the object region, and the initial state of the target person is estimated.

The flow of processing in the subsequent frames is shown in Fig. 3. The optical flow and the depth are computed, and the current state is predicted from that of the previous frame. Using the predicted state, the flow vector, and the disparity at each pixel, the probability of the pixel belonging to the target object is calculated. The pixels with high probability are determined to belong to the target object, and the shoulder width and the position of the object region are computed. The current state is estimated from these observations and the predicted target state. If the angle of rotation about the vertical axis cannot be uniquely determined from the shoulder width, the possible candidates are generated. Erroneous candidates are detected and eliminated based on the difference between the predicted observations, computed from the predicted state, and the actual observations; the whole procedure described above is then repeated. If there are multiple candidates, the tracking procedure is applied to all of them.

Fig. 3. Flow of tracking procedure.
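The per-frame loop of Fig. 3 can be summarized in code. The sketch below is a skeleton only: the step functions are hypothetical stand-ins for Sections 2.2 to 2.4, passed in as callables, and the pruning rule is a one-frame placeholder for the interval-based elimination of Section 2.4.

```python
def track_frame(candidates, flow, disparity, steps):
    """One pass of the Fig. 3 loop over all state candidates.
    `steps` bundles hypothetical callables: predict (Sec. 2.3),
    extract_region (Sec. 2.2.1), measure (Sec. 2.2.2), and update,
    which returns one or two candidates (Sec. 2.4)."""
    survivors = []
    for cand in candidates:
        pred = steps.predict(cand)
        region = steps.extract_region(pred, flow, disparity)
        obs = steps.measure(region)
        survivors.extend(steps.update(pred, obs))
    # Candidates whose criterion (Eq. 15) stays comparatively large
    # over some interval are eliminated; a one-frame placeholder:
    best = min(c.criterion for c in survivors)
    return [c for c in survivors if c.criterion < 10.0 * best]
```

A caller would supply `steps` as, for example, a `types.SimpleNamespace` holding the four functions, and each candidate as an object carrying its state, covariance, and latest criterion value.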
2.2. Object shape in the image
2.2.1. Extraction of the object region
When the flow vector (ui, vi) and the disparity di are observed at pixel pi, the probability P(pi ∈ T | oi) that pi belongs to the target object T is given by

$$P(p_i \in T \mid o_i) = \frac{P(o_i \mid p_i \in T)\, P(p_i \in T)}{P(o_i)} \tag{3}$$

where oi = (ui, vi, di)^T, P(pi ∈ T) is the a priori probability, P(oi | pi ∈ T) is the conditional probability of observing oi when pi belongs to the target object, and P(oi) is the probability of observing oi. The calculation of each term on the right-hand side of Eq. (3) is described below, followed by the extraction of the object region.
P(oi | pi ∈ T) can be calculated from the probability distribution of the predicted flow vector and the predicted disparity at pixel pi. In our method, this probability distribution is approximated by a normal distribution. The relation between the object state xt at time t and oi is given by

$$o_i = h_i(\mathbf{x}_t) + \mathbf{w}_{it} \tag{4}$$

where hi is a vector function expressing the relation between the object state xt and the observation oi (see the Appendix), and wit is the random vector of observation error, white noise with zero mean and covariance matrix Wit. Since the reliability of the optical flow and the disparity depends on the contrast, Wit depends on the thresholds used in their calculation (see Section 3). The expectation $\tilde{o}_i$ and the covariance matrix Oi of the probability distribution of the predicted flow vector and disparity are given by

$$\tilde{o}_i = h_i(\tilde{\mathbf{x}}_t), \qquad O_i = H_i \tilde{X}_t H_i^T + W_{it} \tag{5}$$

where $H_i \equiv \partial h_i / \partial \mathbf{x}\,|_{\tilde{\mathbf{x}}_t}$, and $\tilde{\mathbf{x}}_t$ and $\tilde{X}_t$ denote the mean and the covariance matrix of the predicted state at time t (see Section 2.3).
The a priori probability P(pi ∈ T) is given by the following integral:

$$P(p_i \in T) = \int_D g_i(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \tag{6}$$

with

$$g_i(\mathbf{x}) = \begin{cases} 1, & p_i \in R(\mathbf{x}) \\ 0, & \text{otherwise} \end{cases}$$

where D represents the set of all predicted states, R(x) represents the region that is the projection of the target model with state x onto the image (Fig. 2), and p(x) represents the probability density that the predicted state is x [the probability distribution is approximated by the normal distribution with mean $\tilde{\mathbf{x}}_t$ and covariance matrix $\tilde{X}_t$ (see Section 2.3)].
P(oi) in the denominator of Eq. (3) can be rewritten as

$$P(o_i) = P(o_i \mid p_i \in T)\, P(p_i \in T) + P(o_i \mid p_i \in \bar{T})\, P(p_i \in \bar{T}) \tag{7}$$

where $\bar{T}$ represents the set of all pixels that do not belong to the target object. Such pixels belong to the background or to other objects, and it is assumed that, over the range where observation is possible, all flow vectors and all disparities are equally probable. In this case, the second term on the right-hand side of Eq. (7) becomes

$$P(o_i \mid p_i \in \bar{T})\, P(p_i \in \bar{T}) = U(o_i)\, \{\, 1 - P(p_i \in T) \,\}$$

where U(oi) is a uniform distribution whose range is determined according to the scene. Because the frame interval is sufficiently short compared with the object motions in the scene, it is assumed that there is no drastic movement between frames, and the range of each component of the flow vector is taken to be −10 to 10 pixels/frame (picture size: 320 × 240 pixels). Further, it is assumed that no object is extremely close to the stereo cameras, and the disparity range is taken to be 0 to 130 pixels. The uniform distribution extends over these flow-vector and disparity ranges.
The set of pixels whose probability P(pi ∈ T | oi) exceeds a threshold tT is determined to be the target object region St. Figure 4 shows the relation between the rates of erroneous pixels and the threshold tT (En, the rate of background pixels taken as target, and Ep, the rate of target pixels taken as background), with the correct target region specified manually. As can be seen, when the threshold is between 0.2 and 0.8, En and Ep are both small and almost insensitive to the threshold value; it is therefore desirable to choose the threshold in this range. In the present method, the threshold value is set to 0.5.

Fig. 4. Relation between threshold of probability and error rate.
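As a sketch of this computation, the following function evaluates Eq. (3) for one pixel, assuming a normal likelihood with the predicted mean and covariance of Eq. (5) and the uniform background model with the ranges given above; the names and default ranges are ours.

```python
import numpy as np

def target_probability(o, o_pred, O_cov, p_prior,
                       u_range=20.0, v_range=20.0, d_range=130.0):
    """P(p in T | o) of Eq. (3) for the observation o = (u, v, d).
    The target likelihood is a 3D normal density with predicted mean
    o_pred and covariance O_cov; the non-target likelihood is uniform
    over the flow range (-10..10 px/frame per axis) and the disparity
    range (0..130 px)."""
    diff = o - o_pred
    det = np.linalg.det(O_cov)
    likelihood = np.exp(-0.5 * diff @ np.linalg.solve(O_cov, diff)) \
                 / np.sqrt((2.0 * np.pi) ** 3 * det)
    uniform = 1.0 / (u_range * v_range * d_range)   # U(o)
    p_o = likelihood * p_prior + uniform * (1.0 - p_prior)  # Eq. (7)
    return likelihood * p_prior / p_o

# Pixels with target_probability(...) > 0.5 form the region S_t.
```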
2.2.2. Shoulder width and target position on image
The boundary pixels of the target object region St on the horizontal line Ls passing through the midpoint of the shoulder are denoted psL and psR (see Fig. 5). The shoulder width Ws is the distance between psL and psR, and the shoulder center ps is their midpoint. However, since measurement along Ls alone is unstable, the shoulder widths and midpoints on Ns horizontal lines are determined, and their averages are taken as the shoulder width Ws and the shoulder center ps. Since the shoulder-width variance $V_{W_s}^2$ is the variance of the average of the widths Wsi determined on the individual lines, it is given [19] by

$$V_{W_s}^2 = \frac{1}{N_s (N_s - 1)} \sum_{i=1}^{N_s} \left( W_{si} - W_s \right)^2 \tag{8}$$
The variance $V_x^2$ of the mean of the horizontal components of ps is determined similarly.

The vertex of the target is found as follows. Let lt be the uppermost horizontal line that contains more than Nt pixels belonging to the target object region. The vertical position of the vertex is taken to be the average of the vertical components of the boundary pixels above lt; the horizontal line at this position is denoted Lt. Similarly to the shoulder-width variance $V_{W_s}^2$, the variance $V_y^2$ of the vertical position of the vertex can be determined.

On the vertical line Lsc passing through ps, the object position pB is taken to be below Lt by the distance dtc (the distance from the vertex to the origin of the body coordinate system); dtc can be calculated from the human model (Fig. 2) and the predicted state $\tilde{\mathbf{x}}_t$. The variances of the horizontal and vertical components of pB are the previously determined $V_x^2$ and $V_y^2$.
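A sketch of the shoulder-width measurement under the stated assumptions (a boolean target mask and given row indices; the helper names are ours):

```python
import numpy as np

def shoulder_width(region_mask, lines):
    """Average the per-line widths over the N_s lines and return the
    mean width W_s, the variance of that mean (Eq. 8), and the mean
    horizontal position of the shoulder center."""
    widths, centers = [], []
    for y in lines:
        cols = np.flatnonzero(region_mask[y])
        if cols.size:                          # boundary pixels p_sL, p_sR
            widths.append(float(cols[-1] - cols[0]))
            centers.append(0.5 * float(cols[0] + cols[-1]))
    if len(widths) < 2:
        raise ValueError("too few lines intersect the target region")
    w = np.asarray(widths)
    var_of_mean = w.var(ddof=1) / len(w)       # Eq. (8)
    return w.mean(), var_of_mean, float(np.mean(centers))
```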
2.3. State prediction and estimation
In this method, a Kalman filter is used for the prediction and estimation of the state xt of the target object at time t [20]. If it is assumed that the object moves at the same velocity in successive frames, the system equation of the Kalman filter is

$$\mathbf{x}_t = F \mathbf{x}_{t-1} + \mathbf{u}, \qquad F = \begin{bmatrix} I_6 & \Delta t\, I_6 \\ 0 & I_6 \end{bmatrix} \tag{9}$$

where I6 is a 6 × 6 unit matrix, Δt is the frame interval, and u is the system error, white noise with zero mean and covariance matrix U determined from the acceleration range of the target object.

If the mean and the covariance matrix of the estimated target state at time t − 1 are denoted $\hat{\mathbf{x}}_{t-1}$, $\hat{X}_{t-1}$, the mean $\tilde{\mathbf{x}}_t$ and the covariance matrix $\tilde{X}_t$ of the predicted target state are given by

$$\tilde{\mathbf{x}}_t = F \hat{\mathbf{x}}_{t-1}, \qquad \tilde{X}_t = F \hat{X}_{t-1} F^T + U \tag{10}$$
If the observation vector is denoted by $\mathbf{y}_t = [\,W_s \;\; \mathbf{p}_B^T \;\; o_1^T \cdots o_i^T \cdots\,]_t^T$, the observation equation becomes

$$\mathbf{y}_t = h(\mathbf{x}_t) + \mathbf{w}_t \tag{11}$$

where Ws is the shoulder width, pB is the position of the target object on the image (the projection of the origin of the body coordinate system onto the image), oi = (ui, vi, di)^T is the flow vector and disparity at pixel pi belonging to the target object region, h is a vector function relating the observed values to the state (see the Appendix), and wt is the observation error, white noise with zero mean and covariance matrix Wt. The components of wt are uncorrelated, and their variances are determined from the reliability of the observations [the variances of the shoulder width Ws and of the target position pB are $V_{W_s}^2$, $V_x^2$, $V_y^2$ (see Section 2.2.2), and the variances of the optical flow and the disparity are $V_u^2$, $V_v^2$, $V_d^2$ (see Section 3)]. Since h is not linear, Eq. (11) is linearized around the predicted value to form an extended Kalman filter. The mean $\hat{\mathbf{x}}_t$ and the covariance matrix $\hat{X}_t$ of the estimated state are given by

$$\hat{\mathbf{x}}_t = \tilde{\mathbf{x}}_t + K_t \{\, \mathbf{y}_t - h(\tilde{\mathbf{x}}_t) \,\} \tag{12}$$

$$\hat{X}_t = (I_{12} - K_t H_t)\, \tilde{X}_t \tag{13}$$

where $K_t \equiv \tilde{X}_t H_t^T (H_t \tilde{X}_t H_t^T + W_t)^{-1}$ is the Kalman gain, I12 is a 12 × 12 unit matrix, and $H_t \equiv \partial h(\mathbf{x})/\partial \mathbf{x}\,|_{\tilde{\mathbf{x}}_t}$. When the target
object is occluded, its region cannot be extracted, and shape information cannot be used. In such cases, the occlusion of the target object can be detected from the variations of the optical flow and the disparity [9], and tracking is performed without the use of shape information.

If there are multiple persons close to each other with similar flow vectors, their regions are connected, and the overall region is treated as the target object region. However, if the observed shoulder width is sufficiently larger than the predicted one, it can be detected that multiple persons are being tracked; in such a case, the tracking is again performed without the use of shape information.
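One prediction-update cycle of Eqs. (9)-(13) can be sketched as follows; h and its Jacobian would come from the Appendix relations, and all names here are our own.

```python
import numpy as np

def ekf_step(x_hat, X_hat, y, h, H_jac, W, U, dt=1.0):
    """Constant-velocity EKF cycle. x_hat, X_hat: previous estimate;
    y: observation vector of Eq. (11); h(x): observation function;
    H_jac(x): its Jacobian; W, U: observation and system covariances."""
    F = np.eye(12)
    F[:6, 6:] = dt * np.eye(6)                # system matrix of Eq. (9)
    x_pred = F @ x_hat                        # prediction, Eq. (10)
    X_pred = F @ X_hat @ F.T + U
    H = H_jac(x_pred)                         # linearize around prediction
    K = X_pred @ H.T @ np.linalg.inv(H @ X_pred @ H.T + W)  # Kalman gain
    x_new = x_pred + K @ (y - h(x_pred))      # update, Eq. (12)
    X_new = (np.eye(12) - K @ H) @ X_pred     # update, Eq. (13)
    return x_new, X_new
```

When shape information cannot be used (occlusion, merged regions), the caller would simply drop the Ws and pB rows of y, h, and W before calling.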
2.4. Generation and elimination of candidates
Fig. 5. Feature extraction from target region.

In the region where the observation of the shoulder width [i.e., Eq. (A.4)] is highly nonlinear, multiple angles of rotation about the vertical axis are consistent with a single observed shoulder-width value. For example, in Fig. 6(a) there are two solutions $r_{yt}^1$, $r_{yt}^2$ for the observed value Ws. But because of the linearization, only one solution can be determined. This happens only in the vicinity of a local maximum or minimum; therefore, when the estimated angle leaves the vicinity of the extremum, the following equation is used to generate another state candidate, symmetric to the estimated value $\hat{r}_{yt}$ about the local extremum Mt [see Fig. 6(b)], and tracking is performed for each candidate:

$$\hat{r}_{yt}^{\,N+1} = 2 M_t - \hat{r}_{yt} \tag{14}$$

where $\hat{r}_{yt}^{\,N+1}$ is the rotation angle of the new state candidate about the vertical axis, N is the number of candidates, and Mt is the nearest local maximum or minimum at time t. The covariance matrix of the original state candidate is used as that of the new state candidate.
The two candidates at the 13th frame of the image sequence of Fig. 9 are shown in Figs. 6(c) and 6(d). A candidate whose predicted state contradicts the current observations is erroneous; based on the difference between the predicted observation $\tilde{\mathbf{y}}_t^k = h(\tilde{\mathbf{x}}_t^k)$ and the current observation $\mathbf{y}_t^k$, the criterion value

$$C_t^k = \frac{1}{n_t^k} \left( \mathbf{y}_t^k - h(\tilde{\mathbf{x}}_t^k) \right)^T (W_t^k)^{-1} \left( \mathbf{y}_t^k - h(\tilde{\mathbf{x}}_t^k) \right) \tag{15}$$

is calculated for each candidate and the values are compared. Here $W_t^k$ is the covariance matrix of the observation error of candidate k, and $n_t^k$ is the dimension of $\mathbf{y}_t^k$; since the observation dimensions of the candidates differ, the criterion value is normalized by $n_t^k$. The criterion values $C_t^k$ for frames 1 to 15 of the image sequence of Fig. 9 are shown in Fig. 7. As seen, the criterion values of the erroneous candidates are large. But there are cases where the criterion values differ only slightly, as in the seventh frame, with a risk of eliminating a correct candidate. Therefore, only candidates whose criterion values remain large compared with those of the other candidates over some time interval are eliminated.

Fig. 6. State candidates.

Fig. 7. Criterion of candidates.
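The candidate mechanics reduce to two small functions, sketched below under the stated reconstructions of Eqs. (14) and (15); the names are ours.

```python
import numpy as np

def mirror_candidate(r_y_est, m_t):
    """New candidate angle symmetric to the estimate about the nearest
    local extremum m_t of the shoulder-width curve (Eq. 14)."""
    return 2.0 * m_t - r_y_est

def criterion(y, y_pred, W):
    """Criterion C of Eq. (15): squared innovation weighted by the
    inverse observation covariance, normalized by the observation
    dimension so that candidates with different numbers of observed
    pixels can be compared."""
    diff = y - y_pred
    return float(diff @ np.linalg.solve(W, diff)) / len(y)
```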
2.5. State initialization
In the initial frame, the camera system remains static. If the target is the only moving object in the initial frame, the set of pixels with sufficiently large flow vectors is determined to be the candidate region Sv of the target object. Inside Sv, the average Zm of the disparities is calculated, and the set of pixels with disparities similar to Zm is extracted from Sv and determined to be the target object region S0. If other moving objects exist in the initial frame, a rectangle circumscribing the target object is given, and the above process is carried out inside it.

As explained in Section 2.2, the target position ps and the shoulder width Ws are determined using S0 and Zm. Then the parameters of the transformation from the body to the world coordinates [q of Eq. (2)] are calculated. Since the angle of rotation about the vertical axis is not uniquely determined from the shoulder width, all possible candidates are generated. The initial speeds of rotation and translation are taken to be 0.
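The initialization amounts to two thresholding steps; a sketch with assumed threshold values (the paper does not give them):

```python
import numpy as np

def initial_region(flow_mag, disparity, flow_min=2.0, d_tol=5.0):
    """S_v: pixels with large flow; Z_m: mean disparity inside S_v;
    S_0: pixels of S_v whose disparity is close to Z_m. The two
    thresholds are assumptions, not values from the paper."""
    s_v = flow_mag > flow_min
    z_m = float(disparity[s_v].mean())
    s_0 = s_v & (np.abs(disparity - z_m) < d_tol)
    return s_0, z_m
```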
3. Extraction of Information on Optical Flow and Disparity
The optical flow and the disparity are determined from point correspondences between successive images and between the left and right stereo images, respectively. In this method, the point correspondences are determined by SAD [2]. However, no reliable point correspondence can be achieved for points with poor contrast. For the optical flow, no correspondence is assigned to points satisfying

$$\min\!\left( \sum_{W \times W} f_x^2(x, y), \; \sum_{W \times W} f_y^2(x, y) \right) < c_v \tag{16}$$
where W is the length of one side of the image subregion used in the SAD calculation, f(x, y) denotes the brightness at point (x, y), the subscripts x and y denote the partial derivatives in the horizontal and vertical directions, and cv is a positive constant. For the disparity, no correspondence is assigned to points satisfying

$$\sum_{W \times W} f_x^2(x, y) < c_d \tag{17}$$

where cd is a positive constant. Since the reliability of the optical flow and the disparity depends on the contrast, the variances $V_u^2$, $V_v^2$ of the flow vector and the variance $V_d^2$ of the disparity are determined to be

$$V_u^2 = V_v^2 = \frac{k_v}{\min\left( \sum f_x^2, \sum f_y^2 \right)}, \qquad V_d^2 = \frac{k_d}{\sum f_x^2} \tag{18}$$

where kv and kd are positive constants. Further, even for an established point correspondence, the reliability is low if the SAD value is large; the flow vectors and disparities of such points are therefore not used. Finally, if no corner edge exists inside the subregion of the SAD calculation, the point correspondence cannot be established uniquely, and flow-vector calculation for these points is omitted [9]. Figure 8 shows the results of the optical flow and disparity extraction.
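A minimal sketch of SAD matching with the contrast gate, assuming grayscale arrays and our own helper names; a real implementation would restrict the search range and use the correlation hardware of [2].

```python
import numpy as np

def sad_match(template, search):
    """Brute-force SAD correspondence [2]: slide the W x W template over
    the search window and return the offset with the minimum sum of
    absolute differences, together with that SAD value."""
    h, w = template.shape
    best, best_xy = np.inf, (0, 0)
    for y in range(search.shape[0] - h + 1):
        for x in range(search.shape[1] - w + 1):
            s = np.abs(search[y:y + h, x:x + w] - template).sum()
            if s < best:
                best, best_xy = s, (x, y)
    return best_xy, best

def contrast_ok(patch, c):
    """Contrast gate in the spirit of Eqs. (16)/(17): reject points whose
    summed squared brightness gradients over the window are below c."""
    fx = np.diff(patch.astype(float), axis=1)
    fy = np.diff(patch.astype(float), axis=0)
    return min(float((fx ** 2).sum()), float((fy ** 2).sum())) >= c
```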
4. Experiments
Figure 9 shows the results of tracking a person whose movement includes rotation about an axis perpendicular to the optical axis. For rotations about a vertical axis, where estimation is difficult even if the optical flow and depth information are used, proper estimation becomes possible by using the shoulder-width information Ws. For comparison, when tracking is performed without the shoulder width Ws and the torso position pB [eliminated from Eq. (11)], the error of the speed estimation accumulates over the frames, as shown in Fig. 10, and the tracking fails.
In the example shown in Fig. 11, in frames 7 to 16 there are other persons close to the target moving in the opposite direction. In such cases it is difficult to distinguish the target from the other persons based only on the disparity; they can be distinguished by using the optical flow to properly extract the object region, leading to successful tracking. In frames 165 to 195, the target overlaps another person moving with the same velocity. It is then difficult to distinguish these persons based only on the optical flow; however, the target region is properly extracted based on the disparity, and the tracking is carried out successfully. Further, in this example, despite large motions along the optical axis, proper estimation is possible by using the depth information.

Fig. 8. Extracted optical flow and disparity at the 13th frame in Fig. 9. (Optical flow or disparity cannot be calculated in the white regions.)

Fig. 9. Motion including rotation around an axis perpendicular to the optical axis. [Upper rows show the tracking result. Lower rows show the extracted target region (black), vertex (gray line), shoulder width (gray lines), and origin of torso (gray point).]

In the tracking results of this example, shown in Fig. 12, the object position and the direction of the torso are depicted. In Fig. 12 the target direction changes suddenly at A, which means that a new state candidate was generated in that frame and eliminated in the following frame; the situation at A′ is similar. At B, B′, and B″ there are fluctuations in the estimation results due to errors in the optical flow and disparities, which result in erroneous extraction of the target region.
5. Conclusions
The states of a person walking in a scene were estimated by using the optical flow and the depth, and tracking was carried out. For rotations about an axis perpendicular to the optical axis, where correct estimation was difficult even if the optical flow and depth information were used, proper estimation became possible when information on the target shape was added. Because of the use of shape information, correct extraction of the target region became indispensable; therefore, the predicted target state, the optical flow, and the depth information were used together for proper extraction of the target region. Tracking was possible even for 3D movements including rotations about an axis perpendicular to the optical axis, and even when another person was present who could be distinguished from the target by only a single kind of information; the effectiveness of the method was thus confirmed.
Fig. 10. Tracking without shoulder width and position.

Fig. 11. There is an object which has a similar depth or velocity to the target.

Fig. 12. Top view of the target trajectory. (Each line stands for the direction of the shoulders; the target position is the midpoint of the line.)
Acknowledgment. We acknowledge the assistance of the members of the Shirai Laboratory, Osaka University, during this research.
REFERENCES
1. Yachida M, Asada M, Tsuji S. Automatic analysis of moving images. IEEE Trans PAMI 1981;3:12–20.
2. Inoue H, Tachikawa T, Inaba M. Robot vision system with a correlation chip for real-time tracking, optical flow and depth map generation. Proc IEEE Int Conf R&A, p 1621–1626, 1992.
3. Coombs D, Brown C. Real-time smooth pursuit tracking for a moving binocular robot. Proc CVPR '92, p 23–28.
4. Yamamoto S, Mae Y, Shirai Y, Miura J. Realtime multiple object tracking based on optical flows. Proc IEEE Int Conf R&A 1995;3:2328–2333.
5. Chen HJ, Shirai Y, Asada M. Detecting multiple rigid image motions from an optical flow field obtained with multi-scale, multi-orientation filters. IEICE Trans Inf Syst 1993;E76-D:1253–1262.
6. Kyo H, Harashima H. Region segmentation and tracking of moving pictures based on clustering of position, color and motion. Tech Rep IEICE 1995;IE94-152.
7. Koller D, Weber J, Malik J. Robust multiple car tracking with occlusion reasoning. Proc ECCV '94, p 189–196.
8. Uhlin T, Nordlund P, Maki A, Eklundh JO. Towards an active visual observer. Proc 5th ICCV, p 679–686, 1995.
9. Okada R, Shirai Y, Miura J, Kuno Y. Object tracking based on optical flow and depth information. Trans IEICE 1997;J80-D-II:1530–1583.
10. Yamane T, Shirai Y, Miura J. Person tracking by integrating optical flow and uniform brightness regions. Proc IEEE Int Conf R&A, p 3267–3272, 1998.
11. Etoh M, Shirai Y. 2-D motion estimation by region segmentation based on color, position and intensity gradients. Trans IEICE 1993;J76-D-II:2324–2332.
12. Araki S, Matsuoka T, Takemura H, Yokoya N. Real-time tracking of multiple moving objects in moving camera image sequences using robust statistics. Proc 14th ICPR 1998;2:1433–1435.
13. Ishii H, Mochizuki K, Kishino F. A method of motion recognition from stereo images for human image synthesis. Trans IEICE 1993;J76-D-II:1805–1812.
14. Huttenlocher DP, Noh JJ, Rucklidge WJ. Tracking non-rigid objects in complex scenes. Proc 4th ICCV, p 93–101, 1993.
15. Ju SX, Black MJ, Yacoob Y. Cardboard people: A parameterized model of articulated image motion. Proc Int Conf on Face and Gesture Recognition, p 461–465, 1996.
16. Mae Y, Shirai Y. Tracking moving object in 3-D space based on optical flow and edges. Proc 14th ICPR 1998;2:1439–1441.
17. Yamamoto M, Kawada S, Kondo T, Koshikawa K. Image tracking of persons with 3-D motions. Trans IEICE 1996;J79-D-II:71–83.
18. Adiv G. Inherent ambiguities in recovering 3-D motion and structure from a noisy flow field. IEEE Trans PAMI 1989;11:477–489.
19. Honma H, Kasugaya N. Dimensional analysis, least squares method and experimental equations. Corona Publishers; 1957. Chapter 2.
20. Arimoto T. Kalman filters. Industrial Library; 1977. Chapter 3.
APPENDIX
Relation between Target State and Observed Values
The flow vector (ui, vi) observed on the image at the projection of a target point Pib, expressed in body coordinates, is given by

$$u_i = \frac{f \dot{X}_{ic} - x_i \dot{Z}_{ic}}{Z_{ic}}, \qquad v_i = \frac{f \dot{Y}_{ic} - y_i \dot{Z}_{ic}}{Z_{ic}} \tag{A.1}$$

where f is the focal length and $Z_{ic}$, $\dot{Z}_{ic}$ are the Z components of $P_{ic}$ and $\dot{P}_{ic}$, the point and its velocity in camera coordinates. Using the mean $\tilde{\mathbf{x}}_t^k$ of the predicted target state and the human model, the predicted depth Zic can be obtained at each point of the object. Further, the transformation matrix $\tilde{Q}_{bw}$ can be obtained from the mean of the predicted state $\tilde{\mathbf{x}}_t^k$. For a point (xi, yi) in the image, Pib is determined by

$$P_{ib} = \tilde{Q}_{bw}^{-1}\, Q_{cw} \begin{bmatrix} x_i Z_{ic} / f \\ y_i Z_{ic} / f \\ Z_{ic} \\ 1 \end{bmatrix} \tag{A.2}$$

with the disparity di given by

$$d_i = \frac{f\,b}{Z_{ic}} \tag{A.3}$$

where b is the distance between the stereo cameras. The shoulder width Ws can be expressed by

$$W_s = f \left( \frac{X_{lc}}{Z_{lc}} - \frac{X_{rc}}{Z_{rc}} \right) + \frac{f\,r}{Z_{lc}} + \frac{f\,r}{Z_{rc}} \tag{A.4}$$

where Plb and Prb are the centers of the half cylinders at the shoulder position of the human model in the body coordinate system (see Fig. 2), r is the radius of the half cylinders, and Xlc, Zlc and Xrc, Zrc are the X and Z components of $Q_{cw}^{-1} Q_{bw} P_{lb}$ and $Q_{cw}^{-1} Q_{bw} P_{rb}$.
The position of the target on the image, $\mathbf{p}_{Bh} = (x_B, y_B, 1)^T$, is expressed as

$$\mathbf{p}_{Bh} = \frac{1}{Z_{bc}}\, C\, Q_{cw}^{-1} Q_{bw} P_B \tag{A.5}$$

where $P_B = (0\;\, 0\;\, 0\;\, 1)^T$ is the origin of the body coordinate system and $Z_{bc}$ is the Z component of $Q_{cw}^{-1} Q_{bw} P_B$. The observed values ui, vi, di, Ws, xB, and yB are thus determined by the matrices $Q_{bw}$ and $\dot{Q}_{bw}$. Since the target state enters these matrices as parameters, the vector function h relating the observed values to the target state can be obtained.
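Putting the appendix relations together, the model-based part of h can be sketched as follows (shoulder width, torso position, disparity), with all matrices supplied by the caller and the image-coordinate matrix C omitted for brevity; the function name is ours.

```python
import numpy as np

def predict_observation(Q_bw, Q_cw_inv, P_lb, P_rb, f, b, r):
    """Predicted shoulder width (Eq. A.4), torso position (Eq. A.5,
    without the camera-to-image matrix C), and disparity (Eq. A.3)."""
    P_lc = Q_cw_inv @ Q_bw @ P_lb          # shoulder-cylinder centers
    P_rc = Q_cw_inv @ Q_bw @ P_rb          # in camera coordinates
    W_s = abs(f * P_lc[0] / P_lc[2] - f * P_rc[0] / P_rc[2]) \
        + f * r / P_lc[2] + f * r / P_rc[2]
    P_bc = Q_cw_inv @ Q_bw @ np.array([0.0, 0.0, 0.0, 1.0])  # origin O_b
    p_B = (f * P_bc[0] / P_bc[2], f * P_bc[1] / P_bc[2])
    d_B = f * b / P_bc[2]                  # disparity of Eq. (A.3)
    return W_s, p_B, d_B
```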
AUTHORS (from left to right)
Ryuzo Okada (member) received his B.S. degree in engineering from Osaka University in 1995 and completed his graduate studies in the field of computer-controlled mechanical systems in 1999, receiving a D.Eng. degree. He joined Toshiba Corp. in 1999. He is engaged in research on computer vision.

Yoshiaki Shirai (member) received his B.E. degree from Nagoya University in 1964 in the field of mechanical engineering and his M.E. and D.Eng. degrees from the University of Tokyo in 1969, and then joined the Electrotechnical Laboratory. He spent 1971–72 at the MIT AI Laboratory as a visiting scholar. Since 1988 he has been a professor in the Department of Computer-Controlled Mechanical Systems at Osaka University. He has been involved in research on computer vision, robotics, and artificial intelligence. He is a member of the Japanese Society for Artificial Intelligence, the Robotics Society of Japan, the Information Processing Society of Japan, the Japan Society of Mechanical Engineers, the IEEE Computer Society, and AAAI.

Jun Miura (member) received his B.E. degree in mechanical engineering from the University of Tokyo in 1984 and completed his graduate studies in the field of information engineering in 1989, receiving a D.Eng. degree. He joined Osaka University in 1989 as a research associate in the Department of Computer-Controlled Mechanical Systems and became an associate professor in 1999. He has been involved in research on intelligent robots. He was a visiting scholar at CMU during 1994–95. He is a member of IEEE, AAAI, the Japanese Society for Artificial Intelligence, the Robotics Society of Japan, the Information Processing Society of Japan, and the Japan Society of Mechanical Engineers.

Yoshinori Kuno (member) received his B.S. and D.Eng. degrees in electrical and electronics engineering from the University of Tokyo in 1977 and 1982. He joined Toshiba Corp. in 1982 and was a visiting research scientist at CMU in the United States during 1987–88. He joined Osaka University in 1993 as an associate professor in the Department of Computer-Controlled Mechanical Systems. Since 2000 he has been a professor in the Department of Information and Computer Science at Saitama University. He is engaged in research on computer vision, intelligent robots, and human interfaces. He is a member of the Information Processing Society of Japan, the Japan Society of Mechanical Engineers, the Robotics Society of Japan, the Japanese Society for Artificial Intelligence, the Society of Instrument and Control Engineers, ACM, and IEEE.