M. Harville1, A. Rahimi2, T. Darrell2,
G. Gordon3, J. Woodfill3
3D Head Pose Tracking with Linear Depth and Brightness
Constraints
1: Hewlett-Packard Labs; 2: MIT AI Lab; 3: Tyzx Inc.
Part of this work was done while all authors were employed by Interval Research.
The Basic Problem to be Solved
We want to know the rotation (3 DOF) and translation (3 DOF) that a rigid object undergoes from one frame in a video to the next.
In this case, the inter-frame motion can be expressed as rotation about a vertical axis, followed by rightward translation.
[Figure: object at frame t and at frame t+1]
The Basic Problem to be Solved (cont.)
• Add up these incremental motions to get the cumulative motion since the start of the video (see the sketch below)
• Motion estimation is equivalent to the tracking of object “pose”: position and orientation in some reference coordinate system.
• One way to visualize pose estimate: render axes in image as if they were rigidly affixed to object.
[Figure: pose axes rendered as if rigidly affixed to the object, at frames t and t+1]
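As a minimal sketch (assuming NumPy; the names and the toy motion list are illustrative, not the original system), incremental per-frame motions can be composed into a cumulative pose like this:

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ p == np.cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rotation_from_omega(omega):
    """Rodrigues' formula: rotation matrix from a (small) rotation vector."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    k = skew(omega / theta)
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

# Hypothetical per-frame estimates: (rotation vector, translation) pairs.
incremental_motions = [(np.array([0.0, 0.02, 0.0]),
                        np.array([0.001, 0.0, 0.0]))] * 10

# Cumulative pose maps a start-frame point P to the current frame.
R_cum, t_cum = np.eye(3), np.zeros(3)
for omega, t in incremental_motions:
    R_inc = rotation_from_omega(omega)
    R_cum = R_inc @ R_cum            # newest motion is applied last
    t_cum = R_inc @ t_cum + t
```

Rendering the axes of (R_cum, t_cum) into each frame gives the pose visualization described above.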
Applications - Lots!
• Perceptual user interface: understanding of head gaze, gestures
• Virtual reality: avatars; prosthetic input devices
• Camera ego-motion: robot or mobile vehicle self-localization; panoramic scene reconstruction from video
• Augmented reality: make a rendered object in a scene move with the scene even as the camera turns
• Object tracking: pick-and-place assembly machines; surveillance; automobile collision avoidance
Example: Head pose estimation
• Approximate the head as a rigid body.
• Want to know which way the head is turned, and where it is in space.
The Inspiration
In most situations, all you have is color or grayscale video from a single camera, and most prior methods have focused on how to solve the problem under these conditions => very difficult!
Suppose you had a little more information: a registered, companion video of dense (per-pixel) depth.
Now what would be the best thing to do, and how good is it?
Registered Intensity and Depth
The Sales Pitch for Our Solution
Under the assumption that, in addition to intensity and/or color information, you have dense depth from some source (e.g. stereo, laser, structured light), here is a method that...
• Is designed for speed (single linear system of equations) => good for real-time applications
• Does not require an approximate shape model or prior knowledge of object shape
• Provides accuracy superior or comparable to that of other methods
Prior Work: Feature-Based Methods
• Common approaches
  • General feature-tracking + Structure-from-Motion
  • Eye / Nose / Mouth tracking + rigid head model
  • State-of-the-art: Zelinsky et al. (Australia)
• Common problems
  • Features disappear
  • Rotation appears as translation
  • Depth change must be inferred from scale change
  • Data are noisy: need to integrate information optimally over the entire observation
An Alternative: Direct Motion Estimation
• Use measurements based on change in image values rather than tracked features
  -> More robust: doesn't discard uncertainty information
• Express constraints directly on image values
• Pool information with a least-squares estimate over all pixels
  -> Not dependent on a small set of key features
• Lots of prior work: Horn and Weldon '88, Bergen et al. '92, Black and Yacoob '95, Bregler and Malik '98, Stein and Shashua '98, ...
Some Variable Definitions
[Figure: points in space and points in image; 3D coordinate system and motion parameters; camera center of projection at the origin]

- P = [X, Y, Z]^T : a point in 3D space, in a coordinate system whose origin is the camera center of projection
- p = [x, y]^T : the projection of P onto the image plane
- T = [T_x, T_y, T_z]^T : inter-frame translation
- Ω = [Ω_x, Ω_y, Ω_z]^T : inter-frame rotation

System Input: I(x,y) and Z(x,y) at times t, t+1
System Output: inter-frame motion T and Ω
Direct Motion Estimation Using BCCE
• Brightness Change Constraint Equation (BCCE):

  I(x, y, t) = I(x + v_x, y + v_y, t + 1)

• First-order Taylor series expansion:

  \frac{dI}{dx} v_x + \frac{dI}{dy} v_y + \frac{dI}{dt} = 0

• Matrix formulation:

  \begin{bmatrix} \frac{dI}{dx} & \frac{dI}{dy} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} = -\frac{dI}{dt}
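A small illustrative sketch (NumPy, with hypothetical helper names) of the linearized BCCE terms, checked on a synthetic translating image:

```python
import numpy as np

def bcce_terms(I0, I1):
    """Per-pixel spatial gradients (averaged over frames) and temporal difference."""
    Ix = 0.5 * (np.gradient(I0, axis=1) + np.gradient(I1, axis=1))
    Iy = 0.5 * (np.gradient(I0, axis=0) + np.gradient(I1, axis=0))
    It = I1 - I0                       # forward difference in time
    return Ix, Iy, It

# Example: an intensity ramp shifted right by one pixel (vx = 1, vy = 0)
# should satisfy Ix*vx + Iy*vy + It = 0 at every pixel.
x = np.arange(64, dtype=float)
I0 = np.tile(x, (64, 1))               # I(x, y) = x
I1 = np.tile(x - 1.0, (64, 1))         # scene moved right by one pixel
Ix, Iy, It = bcce_terms(I0, I1)
residual = Ix * 1.0 + Iy * 0.0 + It    # approximately zero everywhere
```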
Direct Motion Estimation Using BCCE
Relate 2D velocities to 3D velocities via a camera projection model:

Orthographic (x = X, y = Y):

  \begin{bmatrix} v_x \\ v_y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}, \quad \text{i.e. } v_x = V_x, \; v_y = V_y

Perspective (x = \frac{fX}{Z}, y = \frac{fY}{Z}):

  v_x = \frac{f V_x}{Z} - \frac{x V_z}{Z}, \quad v_y = \frac{f V_y}{Z} - \frac{y V_z}{Z}

  \begin{bmatrix} v_x \\ v_y \end{bmatrix} = \begin{bmatrix} f/Z & 0 & -x/Z \\ 0 & f/Z & -y/Z \end{bmatrix} \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}
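To make the perspective mapping concrete, here is a brief hypothetical sketch (illustrative names) of image velocity from 3D velocity:

```python
import numpy as np

def image_velocity_perspective(V, x, y, Z, f):
    """v = [[f/Z, 0, -x/Z], [0, f/Z, -y/Z]] @ V for a point at depth Z."""
    A = np.array([[f / Z, 0.0, -x / Z],
                  [0.0, f / Z, -y / Z]])
    return A @ np.asarray(V, dtype=float)

# Pure Z-translation of an off-axis point produces image motion toward the
# image center, unlike the orthographic model where vx = Vx, vy = Vy.
vx, vy = image_velocity_perspective([0.0, 0.0, 0.1],
                                    x=10.0, y=5.0, Z=100.0, f=500.0)
```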
Direct Motion Estimation Using BCCE
Constrain 3D velocities to be consistent with rotation and translation of a single rigid body. For small-angle rotations, P' = (I + \hat{\Omega}) P + T, so

  V = P' - P = \hat{\Omega} P + T, \quad \text{where } \hat{\Omega} = \begin{bmatrix} 0 & -\Omega_z & \Omega_y \\ \Omega_z & 0 & -\Omega_x \\ -\Omega_y & \Omega_x & 0 \end{bmatrix}
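A sketch of the small-angle rigid-motion model, with an assumed skew-matrix helper:

```python
import numpy as np

def omega_hat(om):
    """Skew-symmetric matrix of the small-angle rotation vector (Ωx, Ωy, Ωz)."""
    ox, oy, oz = om
    return np.array([[0.0, -oz, oy],
                     [oz, 0.0, -ox],
                     [-oy, ox, 0.0]])

P = np.array([0.0, 0.0, 1.0])        # point one unit in front of the camera
Omega = np.array([0.0, 0.01, 0.0])   # small rotation about the Y axis
T = np.zeros(3)
V = omega_hat(Omega) @ P + T         # ≈ [0.01, 0, 0]: the point sweeps in X
```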
Direct Motion Estimation Using BCCE
Chain these relations together to get one constraint equation per pixel, writing V = M [T; \Omega] with

  M = \begin{bmatrix} 1 & 0 & 0 & 0 & Z & -Y \\ 0 & 1 & 0 & -Z & 0 & X \\ 0 & 0 & 1 & Y & -X & 0 \end{bmatrix}

• Orthographic:

  -\frac{dI}{dt} = \begin{bmatrix} \frac{dI}{dx} & \frac{dI}{dy} & 0 \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}

• Perspective:

  -\frac{dI}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dI}{dx} & f\frac{dI}{dy} & -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}

Combine across pixels into one linear system and solve for [T, Ω] via QR or SVD (a sketch follows).
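A hypothetical sketch of one perspective BCCE row per pixel; stacking such rows gives the overdetermined linear system (solved here via NumPy's SVD-based least squares):

```python
import numpy as np

def rigid_matrix(X, Y, Z):
    """3x6 matrix M with V = M @ phi for phi = [Tx, Ty, Tz, Ox, Oy, Oz]."""
    return np.array([[1.0, 0.0, 0.0, 0.0, Z, -Y],
                     [0.0, 1.0, 0.0, -Z, 0.0, X],
                     [0.0, 0.0, 1.0, Y, -X, 0.0]])

def bcce_row(Ix, Iy, It, x, y, X, Y, Z, f):
    """One perspective BCCE constraint: returns (h, b) with h @ phi = b."""
    g = np.array([f * Ix, f * Iy, -(x * Ix + y * Iy)]) / Z
    return g @ rigid_matrix(X, Y, Z), -It

# Usage (loop shown for clarity; vectorize in practice): collect (h, b) for
# every valid pixel into lists `rows` and `targets`, then
#   phi, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
```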
Direct Motion Estimation Using BCCE
• Z unknown!
• Past solutions:
  • Assume approximate shape: planar (Black and Yacoob), ellipsoidal (Basu and Pentland; Bregler and Malik), polygonal (Essa et al.), hyperquadrics, etc.
  • Laser-scanned 3D model of the object to be tracked
  • Estimate depth and motion successively via linear or non-linear methods, or together via non-linear optimization => "open loop" issues

The perspective constraint needs Z at every pixel:

  -\frac{dI}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dI}{dx} & f\frac{dI}{dy} & -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}
“Direct Depth”: two new ideas
1. Use (independently measured) Z directly in the BCCE.
  • Believe it or not, this appears to be novel.
  • Frees us from a shape model that is either approximate (e.g. planar, ellipsoidal, etc.) or that must be known a priori.
  • The shape model can change (slowly) over time: allows for 360-degree rotations, better handles non-rigidity.
  • Related to the Direct Motion Stereo of [Shieh et al.] and [Stein and Shashua], but their methods assume infinitesimal camera baselines and require a coarse-to-fine solution if disparities > 1 pixel are generated. Also, they compute motion before depth; we use depth directly.
“Direct Depth”: two new ideas
2. Express a direct constraint on the depth gradient.
  • It operates on the depth image very similarly to how the classic Brightness Change Constraint Equation (BCCE) applies to the intensity image.
  • We call this the "Depth Change Constraint Equation", or "DCCE".

  BCCE: I(x, y, t) = I(x + v_x, y + v_y, t + 1)
  DCCE: Z(x, y, t) = Z(x + v_x, y + v_y, t + 1) - V_z

Linearized DCCE:

  \frac{dZ}{dx} v_x + \frac{dZ}{dy} v_y + \frac{dZ}{dt} - V_z = 0
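A tiny sketch of the linearized DCCE as a per-pixel residual (illustrative names); unlike the BCCE, it includes V_z because a point's depth itself changes as it moves:

```python
def dcce_residual(Zx, Zy, Zt, vx, vy, Vz):
    """dZ/dx * vx + dZ/dy * vy + dZ/dt - Vz; approximately 0 when consistent."""
    return Zx * vx + Zy * vy + Zt - Vz

# Example: a fronto-parallel plane receding at Vz = 0.05 with no image motion
# has Zt = 0.05 and zero spatial depth gradients, so the residual vanishes.
r = dcce_residual(Zx=0.0, Zy=0.0, Zt=0.05, vx=0.0, vy=0.0, Vz=0.05)
```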
The DCCE
Add in perspective projection and constrain to a single rigid motion:

  -\frac{dZ}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dZ}{dx} & f\frac{dZ}{dy} & -\left(x\frac{dZ}{dx} + y\frac{dZ}{dy} + Z\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}

Very similar to our result for the BCCE:

  -\frac{dI}{dt} = \frac{1}{Z} \begin{bmatrix} f\frac{dI}{dx} & f\frac{dI}{dy} & -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \end{bmatrix} M \begin{bmatrix} T \\ \Omega \end{bmatrix}
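The matching hypothetical DCCE row builder, mirroring the BCCE sketch above; the only structural difference is the extra Z in the third gradient entry, contributed by the V_z term:

```python
import numpy as np

def rigid_matrix(X, Y, Z):                 # as in the BCCE sketch above
    return np.array([[1.0, 0.0, 0.0, 0.0, Z, -Y],
                     [0.0, 1.0, 0.0, -Z, 0.0, X],
                     [0.0, 0.0, 1.0, Y, -X, 0.0]])

def dcce_row(Zx, Zy, Zt, x, y, X, Y, Z, f):
    """One perspective DCCE constraint: returns (h, b) with h @ phi = b."""
    g = np.array([f * Zx, f * Zy, -(x * Zx + y * Zy + Z)]) / Z
    return g @ rigid_matrix(X, Y, Z), -Zt
```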
DCCE vs. BCCE
• Advantages of DCCE over BCCE
  • Depth information is more robust to lighting changes in space and time.
  • The BCCE is an assumption that holds only for perfectly uniform illumination and Lambertian surfaces, whereas the DCCE is just a linearization of a generic description of motion in 3D.
• But... real-time depth data tends to be very noisy and full of holes!
  • Smoothing seems to help (one hole-aware option is sketched below).
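One common hole-aware way to smooth depth is normalized convolution; this is a sketch under assumed names (SciPy), not necessarily the preprocessing used here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_depth(Z, valid, sigma=2.0):
    """Gaussian-smooth a depth map while ignoring invalid (hole) pixels."""
    w = valid.astype(float)
    num = gaussian_filter(Z * w, sigma)    # weighted sum of valid depths
    den = gaussian_filter(w, sigma)        # sum of weights
    out = np.where(den > 1e-6, num / np.maximum(den, 1e-6), 0.0)
    return out, den > 1e-6                 # smoothed depth, validity mask
```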
Joint Constraint on Rigid Motion
Our proposal: combine the BCCE and DCCE constraint equations into a single linear system:
  \begin{bmatrix} \vdots \\ \frac{1}{Z} \left[\, f\frac{dI}{dx} \;\; f\frac{dI}{dy} \;\; -\left(x\frac{dI}{dx} + y\frac{dI}{dy}\right) \right] M \\ \frac{1}{Z} \left[\, f\frac{dZ}{dx} \;\; f\frac{dZ}{dy} \;\; -\left(x\frac{dZ}{dx} + y\frac{dZ}{dy} + Z\right) \right] M \\ \vdots \end{bmatrix} \begin{bmatrix} T \\ \Omega \end{bmatrix} = \begin{bmatrix} \vdots \\ -\frac{dI}{dt} \\ -\frac{dZ}{dt} \\ \vdots \end{bmatrix}

i.e. H\phi = b with \phi = [T, \Omega]^T and solution \hat{\phi} = (H^T H)^{-1} H^T b.
Least-squares problem: solve for the six-parameter vector via QR or SVD (a sketch follows).
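An end-to-end sketch of the stacked solve, using rows produced by the hypothetical bcce_row/dcce_row builders sketched earlier:

```python
import numpy as np

def solve_motion(rows, targets):
    """rows: list of 6-vectors h; targets: list of scalars b. Returns phi."""
    H = np.asarray(rows)                   # (N, 6), N = valid BCCE + DCCE rows
    b = np.asarray(targets)                # (N,)
    phi, residuals, rank, svals = np.linalg.lstsq(H, b, rcond=None)
    return phi                             # [Tx, Ty, Tz, Ox, Oy, Oz]
```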
Some Important Practical Details
• Support maps
  • Only use constraint equations where depth and all depth derivatives are valid.
  • Ignore locations of very high depth gradient (due to self-occlusion/disocclusion).
• Coordinate shift (sketched below)
  • If the center of the coordinate system is far from the object, it is easy to confuse translation with rotation about a distant axis, and vice versa -> numerical instability.
  • Solution: at each time step, find the object centroid, compute motion in a coordinate system centered there, then transform the motion parameters back to the world coordinate system.
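A sketch of the coordinate-shift bookkeeping under the small-angle model (illustrative names):

```python
import numpy as np

def shift_motion_to_world(T_c, Omega, c):
    """Motion estimated about centroid c: P' = R (P - c) + c + T_c, with
    R ≈ I + Omega_hat. In world coordinates P' = R P + T_w, so the rotation
    is unchanged and T_w = T_c + (I - R) c."""
    ox, oy, oz = Omega
    R = np.eye(3) + np.array([[0.0, -oz, oy],
                              [oz, 0.0, -ox],
                              [-oy, ox, 0.0]])
    return T_c + (np.eye(3) - R) @ c       # world-frame translation
```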
Experiments
• Synthetic and real sequences of moving heads
  • Synthetic sequences provide us with ground truth for quantitative analysis
  • Real sequences show it's not just theory.
• Hard cases: translation in Z, rotation out-of-plane
• Compare four motion estimation methods
  • BCCE only with planar depth -> representative of standard methods
  • BCCE only with measured depth
  • DCCE only
  • BCCE + DCCE
Synthetic Image Sequences
Generated color and depth image sequences by rendering a laser-scanned model of a human face with a standard graphics package.
[Figures: rotation sequence; Z-translation sequence]
Synthetic Results - Rotation Sequence
Synthetic Results - Z-Trans Sequence
Real Data Sequence
Real Results: Still-Frame Comparison
[Figure: select frames 68, 111, and 162, from BCCE + planar depth and from BCCE + DCCE]
Real Results: Still-Frame Comparison
[Figure: select frames 211 and 293, from BCCE + planar depth and from BCCE + DCCE]
Real Results: BCCE with Planar Depth
Real Results: BCCE + DCCE
Extensions and Future Work
• Complement it with a slower, non-differential approach that helps detect and remove gross errors
• Real-time implementation!
• Experiment with some mathematical tweaks:
  • Constrained or weighted least squares
  • Use a second iteration per frame
  • Add coarse-to-fine to handle large motions, if needed
• More ambitious tests: 360-degree rotation, slow non-rigidity, etc. => things few or no other methods can do
Extensions & Future Work
• Apply direct depth and brightness constraint without rigid model: 3-D direct optic flow.
• Ego-motion: use joint depth and brightness constraint to recover camera motion.
• Articulated bodies: extend to use exponential twist formalism, a la Bregler and Malik.
M. Covell, A. Rahimi, M. Harville, T. Darrell. "Articulated-pose estimation using brightness- and depth-constancy constraints." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, S.C., June 2000.
The End