M. Harville1, A. Rahimi2, T. Darrell2,
G. Gordon3, J. Woodfill3
3D Head Pose Tracking with Linear Depth and Brightness
Constraints
1: Hewlett-Packard Labs; 2: MIT AI Lab; 3: Tyzx Inc.
Part of this work was done while all authors were employed by Interval Research.
The Basic Problem to be Solved
We want to know the rotation (3 DOF) and translation (3 DOF) that a rigid object undergoes from one frame in a video to the next.
In this case, the inter-frame motion can be expressed as rotation about a vertical axis, followed by rightward translation.
[Figure: object at frame t and at frame t+1]
The Basic Problem to be Solved (cont.)
• Add up these incremental motions to get the cumulative motion since the start of the video (see the sketch below)
• Motion estimation is equivalent to the tracking of object “pose”: position and orientation in some reference coordinate system.
• One way to visualize pose estimate: render axes in image as if they were rigidly affixed to object.
[Figure: pose axes rendered as if rigidly affixed to the object, at frames t and t+1]
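As a minimal sketch (assuming NumPy; the names and the toy motion list are illustrative, not the original system), incremental per-frame motions can be composed into a cumulative pose like this:

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ p == np.cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rotation_from_omega(omega):
    """Rodrigues' formula: rotation matrix from a (small) rotation vector."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    k = skew(omega / theta)
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

# Hypothetical per-frame estimates: (rotation vector, translation) pairs.
incremental_motions = [(np.array([0.0, 0.02, 0.0]),
                        np.array([0.001, 0.0, 0.0]))] * 10

# Cumulative pose maps a start-frame point P to the current frame.
R_cum, t_cum = np.eye(3), np.zeros(3)
for omega, t in incremental_motions:
    R_inc = rotation_from_omega(omega)
    R_cum = R_inc @ R_cum            # newest motion is applied last
    t_cum = R_inc @ t_cum + t
```

Rendering the axes of (R_cum, t_cum) into each frame gives the pose visualization described above.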
Applications - Lots!
• Perceptual user interface: understanding of head gaze, gestures
• Virtual reality: avatars; prosthetic input devices
• Camera ego-motion: robot or mobile vehicle self-localization; panoramic scene reconstruction from video
• Augmented reality: make a rendered object in a scene move with the scene even as the camera turns
• Object tracking: pick-and-place assembly machines; surveillance; automobile collision avoidance
Example: Head pose estimation
• Approximate the head as a rigid body.
• Want to know which way the head is turned, and where it is in space.
The Inspiration
In most situations, all you have is color or grayscale video from a single camera, and most prior methods have focused on how to solve the problem under these conditions => very difficult!
Suppose you had a little more information: a registered, companion video of dense (per-pixel) depth.
Now what would be the best thing to do, and how good is it?
Registered Intensity and Depth
The Sales Pitch for Our Solution
Under the assumption that, in addition to intensity and/or color information, you have dense depth from some source (e.g. stereo, laser, structured light), here is a method that...
• Is designed for speed (single linear system of equations) => good for real-time applications
• Does not require an approximate shape model or prior knowledge of object shape
• Provides accuracy superior or comparable to that of other methods
Prior Work: Feature-Based Methods
• Common approaches
  • General feature-tracking + Structure-from-Motion
  • Eye / Nose / Mouth tracking + rigid head model
  • State-of-the-art: Zelinsky et al. (Australia)
• Common problems
  • Features disappear
  • Rotation appears as translation
  • Depth change must be inferred from scale change
  • Data are noisy: need to integrate information optimally over the entire observation
An Alternative: Direct Motion Estimation
• Use measurements based on change in image values rather than tracked features
  -> More robust: doesn't discard uncertainty information
• Express constraints directly on image values
• Pool information with a least-squares estimate over all pixels
  -> Not dependent on a small set of key features
• Lots of prior work: Horn and Weldon '88, Bergen et al. '92, Black and Yacoob '95, Bregler and Malik '98, Stein and Shashua '98, ...
Some Variable Definitions
[Figure: points in space and points in image; 3D coordinate system and motion parameters; camera center of projection at the origin]

- P = [X, Y, Z]^T : a point in 3D space, in a coordinate system whose origin is the camera center of projection
- p = [x, y]^T : the projection of P onto the image plane
- T = [T_x, T_y, T_z]^T : inter-frame translation
- Ω = [Ω_x, Ω_y, Ω_z]^T : inter-frame rotation

System Input: I(x,y) and Z(x,y) at times t, t+1
System Output: inter-frame motion T and Ω
Direct Motion Estimation Using BCCE
• Brightness Change Constraint Equation (BCCE):

  I(x, y, t) = I(x + v_x, y + v_y, t + 1)

• First-order Taylor series expansion:

  \frac{dI}{dx} v_x + \frac{dI}{dy} v_y + \frac{dI}{dt} = 0

• Matrix formulation:

  \begin{bmatrix} \frac{dI}{dx} & \frac{dI}{dy} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} = -\frac{dI}{dt}
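A small illustrative sketch (NumPy, with hypothetical helper names) of the linearized BCCE terms, checked on a synthetic translating image:

```python
import numpy as np

def bcce_terms(I0, I1):
    """Per-pixel spatial gradients (averaged over frames) and temporal difference."""
    Ix = 0.5 * (np.gradient(I0, axis=1) + np.gradient(I1, axis=1))
    Iy = 0.5 * (np.gradient(I0, axis=0) + np.gradient(I1, axis=0))
    It = I1 - I0                       # forward difference in time
    return Ix, Iy, It

# Example: an intensity ramp shifted right by one pixel (vx = 1, vy = 0)
# should satisfy Ix*vx + Iy*vy + It = 0 at every pixel.
x = np.arange(64, dtype=float)
I0 = np.tile(x, (64, 1))               # I(x, y) = x
I1 = np.tile(x - 1.0, (64, 1))         # scene moved right by one pixel
Ix, Iy, It = bcce_terms(I0, I1)
residual = Ix * 1.0 + Iy * 0.0 + It    # approximately zero everywhere
```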
Direct Motion Estimation Using BCCE
Relate 2D velocities to 3D velocities via a camera projection model:

Orthographic (x = X, y = Y):

  \begin{bmatrix} v_x \\ v_y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}, \quad \text{i.e. } v_x = V_x, \; v_y = V_y

Perspective (x = \frac{fX}{Z}, y = \frac{fY}{Z}):

  v_x = \frac{f V_x}{Z} - \frac{x V_z}{Z}, \quad v_y = \frac{f V_y}{Z} - \frac{y V_z}{Z}

  \begin{bmatrix} v_x \\ v_y \end{bmatrix} = \begin{bmatrix} f/Z & 0 & -x/Z \\ 0 & f/Z & -y/Z \end{bmatrix} \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}
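To make the perspective mapping concrete, here is a brief hypothetical sketch (illustrative names) of image velocity from 3D velocity:

```python
import numpy as np

def image_velocity_perspective(V, x, y, Z, f):
    """v = [[f/Z, 0, -x/Z], [0, f/Z, -y/Z]] @ V for a point at depth Z."""
    A = np.array([[f / Z, 0.0, -x / Z],
                  [0.0, f / Z, -y / Z]])
    return A @ np.asarray(V, dtype=float)

# Pure Z-translation of an off-axis point produces image motion toward the
# image center, unlike the orthographic model where vx = Vx, vy = Vy.
vx, vy = image_velocity_perspective([0.0, 0.0, 0.1],
                                    x=10.0, y=5.0, Z=100.0, f=500.0)
```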
Direct Motion Estimation Using BCCE
Constrain 3D velocities to be consistent with rotation and translation of a single rigid body. For small-angle rotations, P' = (I + \hat{\Omega}) P + T, so

  V = P' - P = \hat{\Omega} P + T, \quad \text{where } \hat{\Omega} = \begin{bmatrix} 0 & -\Omega_z & \Omega_y \\ \Omega_z & 0 & -\Omega_x \\ -\Omega_y & \Omega_x & 0 \end{bmatrix}
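A sketch of the small-angle rigid-motion model, with an assumed skew-matrix helper:

```python
import numpy as np

def omega_hat(om):
    """Skew-symmetric matrix of the small-angle rotation vector (Ωx, Ωy, Ωz)."""
    ox, oy, oz = om
    return np.array([[0.0, -oz, oy],
                     [oz, 0.0, -ox],
                     [-oy, ox, 0.0]])

P = np.array([0.0, 0.0, 1.0])        # point one unit in front of the camera
Omega = np.array([0.0, 0.01, 0.0])   # small rotation about the Y axis
T = np.zeros(3)
V = omega_hat(Omega) @ P + T         # ≈ [0.01, 0, 0]: the point sweeps in X
```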
Direct Motion Estimation Using BCCE
Chain these relations together to get one constraint equation per pixel, writing V = M [T; \Omega] with

  M = \begin{bmatrix} 1 & 0 & 0 & 0 & Z & -Y \\ 0 & 1 & 0 & -Z & 0 & X \\ 0 & 0 & 1 & Y & -X & 0 \end{bmatrix}

• Orthographic:

  -\frac{dI}{dt} = \begin{bmatrix} \frac{dI}{dx} & \frac{dI}{dy} & 0 \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}

• Perspective:

  -\frac{dI}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dI}{dx} & f\frac{dI}{dy} & -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}

Combine across pixels into one linear system and solve for [T, Ω] via QR or SVD (a sketch follows).
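A hypothetical sketch of one perspective BCCE row per pixel; stacking such rows gives the overdetermined linear system (solved here via NumPy's SVD-based least squares):

```python
import numpy as np

def rigid_matrix(X, Y, Z):
    """3x6 matrix M with V = M @ phi for phi = [Tx, Ty, Tz, Ox, Oy, Oz]."""
    return np.array([[1.0, 0.0, 0.0, 0.0, Z, -Y],
                     [0.0, 1.0, 0.0, -Z, 0.0, X],
                     [0.0, 0.0, 1.0, Y, -X, 0.0]])

def bcce_row(Ix, Iy, It, x, y, X, Y, Z, f):
    """One perspective BCCE constraint: returns (h, b) with h @ phi = b."""
    g = np.array([f * Ix, f * Iy, -(x * Ix + y * Iy)]) / Z
    return g @ rigid_matrix(X, Y, Z), -It

# Usage (loop shown for clarity; vectorize in practice): collect (h, b) for
# every valid pixel into lists `rows` and `targets`, then
#   phi, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
```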
Direct Motion Estimation Using BCCE
• Z unknown!
• Past solutions:
  • Assume approximate shape: planar (Black and Yacoob), ellipsoidal (Basu and Pentland; Bregler and Malik), polygonal (Essa et al.), hyperquadrics, etc.
  • Laser-scanned 3D model of the object to be tracked
  • Estimate depth and motion successively via linear or non-linear methods, or together via non-linear optimization => "open loop" issues

The perspective constraint needs Z at every pixel:

  -\frac{dI}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dI}{dx} & f\frac{dI}{dy} & -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}
“Direct Depth”: two new ideas
1. Use (independently measured) Z directly in the BCCE.
  • Believe it or not, this appears to be novel.
  • Frees us from a shape model that is either approximate (e.g. planar, ellipsoidal, etc.) or that must be known a priori.
  • The shape model can change (slowly) over time: allows for 360-degree rotations, better handles non-rigidity.
  • Related to the Direct Motion Stereo of [Shieh et al.] and [Stein and Shashua], but their methods assume infinitesimal camera baselines and require a coarse-to-fine solution if disparities > 1 pixel are generated. Also, they compute motion before depth; we use depth directly.
“Direct Depth”: two new ideas
2. Express a direct constraint on the depth gradient.
  • It operates on the depth image very similarly to how the classic Brightness Change Constraint Equation (BCCE) applies to the intensity image.
  • We call this the "Depth Change Constraint Equation", or "DCCE".

  BCCE: I(x, y, t) = I(x + v_x, y + v_y, t + 1)
  DCCE: Z(x, y, t) = Z(x + v_x, y + v_y, t + 1) - V_z

Linearized DCCE:

  \frac{dZ}{dx} v_x + \frac{dZ}{dy} v_y + \frac{dZ}{dt} - V_z = 0
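A tiny sketch of the linearized DCCE as a per-pixel residual (illustrative names); unlike the BCCE, it includes V_z because a point's depth itself changes as it moves:

```python
def dcce_residual(Zx, Zy, Zt, vx, vy, Vz):
    """dZ/dx * vx + dZ/dy * vy + dZ/dt - Vz; approximately 0 when consistent."""
    return Zx * vx + Zy * vy + Zt - Vz

# Example: a fronto-parallel plane receding at Vz = 0.05 with no image motion
# has Zt = 0.05 and zero spatial depth gradients, so the residual vanishes.
r = dcce_residual(Zx=0.0, Zy=0.0, Zt=0.05, vx=0.0, vy=0.0, Vz=0.05)
```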
The DCCE
Add in perspective projection and constrain to a single rigid motion:

  -\frac{dZ}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dZ}{dx} & f\frac{dZ}{dy} & -\left(x\frac{dZ}{dx} + y\frac{dZ}{dy} + Z\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}

Very similar to our result for the BCCE:

  -\frac{dI}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dI}{dx} & f\frac{dI}{dy} & -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}
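The matching hypothetical DCCE row builder, mirroring the BCCE sketch above; the only structural difference is the extra Z in the third gradient entry, contributed by the V_z term:

```python
import numpy as np

def rigid_matrix(X, Y, Z):                 # as in the BCCE sketch above
    return np.array([[1.0, 0.0, 0.0, 0.0, Z, -Y],
                     [0.0, 1.0, 0.0, -Z, 0.0, X],
                     [0.0, 0.0, 1.0, Y, -X, 0.0]])

def dcce_row(Zx, Zy, Zt, x, y, X, Y, Z, f):
    """One perspective DCCE constraint: returns (h, b) with h @ phi = b."""
    g = np.array([f * Zx, f * Zy, -(x * Zx + y * Zy + Z)]) / Z
    return g @ rigid_matrix(X, Y, Z), -Zt
```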
DCCE vs. BCCE
• Advantages of DCCE over BCCE
  • Depth information is more robust to lighting changes in space and time.
  • The BCCE is an assumption that holds only for perfectly uniform illumination and Lambertian surfaces, whereas the DCCE is just a linearization of a generic description of motion in 3D.
• But... real-time depth data tends to be very noisy and full of holes!
  • Smoothing seems to help (one hole-aware option is sketched below).
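One common hole-aware way to smooth depth is normalized convolution; this is a sketch under assumed names (SciPy), not necessarily the preprocessing used here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_depth(Z, valid, sigma=2.0):
    """Gaussian-smooth a depth map while ignoring invalid (hole) pixels."""
    w = valid.astype(float)
    num = gaussian_filter(Z * w, sigma)    # weighted sum of valid depths
    den = gaussian_filter(w, sigma)        # sum of weights
    out = np.where(den > 1e-6, num / np.maximum(den, 1e-6), 0.0)
    return out, den > 1e-6                 # smoothed depth, validity mask
```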
Joint Constraint on Rigid Motion
Our proposal: combine the BCCE and DCCE constraint equations into a single linear system:
  \begin{bmatrix} \vdots \\ \frac{1}{Z} \left[\, f\frac{dI}{dx} \;\; f\frac{dI}{dy} \;\; -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \right] M \\ \frac{1}{Z} \left[\, f\frac{dZ}{dx} \;\; f\frac{dZ}{dy} \;\; -\left(x\frac{dZ}{dx} + y\frac{dZ}{dy} + Z\right) \right] M \\ \vdots \end{bmatrix} \begin{bmatrix} T \\ \Omega \end{bmatrix} = \begin{bmatrix} \vdots \\ -\frac{dI}{dt} \\ -\frac{dZ}{dt} \\ \vdots \end{bmatrix}

i.e. H\phi = b with \phi = [T, \Omega]^T and solution \hat{\phi} = (H^T H)^{-1} H^T b.
Least-squares problem: solve for the six-parameter vector via QR or SVD (a sketch follows).
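An end-to-end sketch of the stacked solve, using rows produced by the hypothetical bcce_row/dcce_row builders sketched earlier:

```python
import numpy as np

def solve_motion(rows, targets):
    """rows: list of 6-vectors h; targets: list of scalars b. Returns phi."""
    H = np.asarray(rows)                   # (N, 6), N = valid BCCE + DCCE rows
    b = np.asarray(targets)                # (N,)
    phi, residuals, rank, svals = np.linalg.lstsq(H, b, rcond=None)
    return phi                             # [Tx, Ty, Tz, Ox, Oy, Oz]
```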
Some Important Practical Details
• Support maps
  • Only use constraint equations where depth and all depth derivatives are valid.
  • Ignore locations of very high depth gradient (due to self-occlusion/disocclusion).
• Coordinate shift (sketched below)
  • If the center of the coordinate system is far from the object, it is easy to confuse translation with rotation about a distant axis, and vice versa -> numerical instability.
  • Solution: at each time step, find the object centroid, compute motion in a coordinate system centered there, then transform the motion parameters back to the world coordinate system.
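A sketch of the coordinate-shift bookkeeping under the small-angle model (illustrative names):

```python
import numpy as np

def shift_motion_to_world(T_c, Omega, c):
    """Motion estimated about centroid c: P' = R (P - c) + c + T_c, with
    R ≈ I + Omega_hat. In world coordinates P' = R P + T_w, so the rotation
    is unchanged and T_w = T_c + (I - R) c."""
    ox, oy, oz = Omega
    R = np.eye(3) + np.array([[0.0, -oz, oy],
                              [oz, 0.0, -ox],
                              [-oy, ox, 0.0]])
    return T_c + (np.eye(3) - R) @ c       # world-frame translation
```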
Experiments
• Synthetic and real sequences of moving heads
  • Synthetic sequences provide us with ground truth for quantitative analysis
  • Real sequences show it's not just theory.
• Hard cases: translation in Z, rotation out-of-plane
• Compare four motion estimation methods
  • BCCE only with planar depth -> representative of standard methods
  • BCCE only with measured depth
  • DCCE only
  • BCCE + DCCE
Synthetic Image Sequences
Generated color and depth image sequences by rendering a laser-scanned model of a human face with a standard graphics package.
[Figures: rotation sequence; Z-translation sequence]
Synthetic Results - Rotation Sequence
Synthetic Results - Z-Trans Sequence
Real Data Sequence
Real Results: Still-Frame Comparison
[Figure: select frames 68, 111, and 162, from BCCE + planar depth and from BCCE + DCCE]
Real Results: Still-Frame Comparison
[Figure: select frames 211 and 293, from BCCE + planar depth and from BCCE + DCCE]
Real Results: BCCE with Planar Depth
Real Results: BCCE + DCCE
Extensions and Future Work
• Complement it with a slower, non-differential approach that helps detect and remove gross errors
• Real-time implementation!
• Experiment with some mathematical tweaks:
  • Constrained or weighted least squares
  • Use a second iteration per frame
  • Add coarse-to-fine to handle large motions, if needed
• More ambitious tests: 360-degree rotation, slow non-rigidity, etc. => things few or no other methods can do
Extensions & Future Work
• Apply direct depth and brightness constraint without rigid model: 3-D direct optic flow.
• Ego-motion: use joint depth and brightness constraint to recover camera motion.
• Articulated bodies: extend to use exponential twist formalism, a la Bregler and Malik.
M. Covell, A. Rahimi, M. Harville, T. Darrell. "Articulated-pose estimation using brightness- and depth-constancy constraints." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, S.C., June 2000.
The End