

High-resolution 3D surface strain magnitude using 2D camera and low-resolution depth sensor


Matthew Shreve ⇑, Mauricio Pamplona, Timur Luguev, Dmitry Goldgof, Sudeep Sarkar
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., Tampa, FL, USA

This paper has been recommended for acceptance by Xiaoyi Jiang.
⇑ Corresponding author. Tel.: +1 (813) 974 3652; fax: +1 (813) 974 5456. E-mail address: [email protected] (M. Shreve).

Article history: Available online xxxx

Keywords: 3-D; Optical strain; Optical flow

Abstract

Generating 2-D strain maps of the face during facial expressions provides a useful feature that captures the bio-mechanics of facial skin tissue, and has had wide application in several research areas. However, most applications have been restricted to collecting data on a single pose. Moreover, methods that strictly use 2-D images for motion estimation can potentially suppress large strains because of projective distortions caused by the curvature of the face. This paper proposes a method that allows estimation of 3-D surface strain using a low-resolution depth sensor. The algorithm consists of automatically aligning a rough approximation of a 3-D surface with an external high resolution camera image. We provide experimental results that demonstrate the robustness of the method on a dataset collected using the Microsoft Kinect synchronized with two external high resolution cameras, as well as 101 subjects from a publicly available 3-D facial expression video database (BU4DFE).


1. Introduction

Capturing and quantizing the strain observed during facial expressions has found useful application in areas such as expression detection, face identification, and medical evaluations [1–3]. However, one limitation in each of these applications (especially the latter two) is that in order to estimate the consistency of strain over several acquisition sessions, each subject was required to provide a pose that aligned with prior data. We describe a method that is more invariant to view than when strictly using 2-D, thus allowing greater variability in data acquisition.

There are several advantages to supplementing 2-D motion estimations with aligned 3-D depth values. Due to the foreshortening that occurs when the curvature of the face is projected onto the 2-D image plane, the horizontal motion estimated along these points is often suppressed. By projecting these vectors back onto the 3-D surface, we are able to adjust these vectors to better approximate the true displacements. Similarly, motion that is orthogonal to the image plane is lost in 2-D-only approximations, but recovered when using 3-D data. This method is possible because the surface of the face is a 2-D surface embedded in 3-D. Moreover, we assume that any local region of the face (3 × 3 neighborhood) is flat and can be approximated using a 2-D plane.

Hence, we can use standard optical flow techniques in order to describe the 2-D motion, adjust them using 3-D depth approximations, and then determine the 3-D strain tensors.

The proposed algorithm consists of the following steps: (a) the Kinect sensor is automatically calibrated to an external HD camera using an algorithm that is based on SIFT features; (b) optical flow is calculated between each consecutive pair of frames in the video collected from the external camera; (c) we then stitch together pairs of motion vectors in order to obtain the motion from the first frame to every other frame; (d) each of the stitched motion fields is projected onto the aligned 3-D surface of the Kinect; (e) surface strain is calculated for each pixel of the face; (f) a masking technique is used that separates reliable strain estimations from those regions where optical flow fails (such as the eyes, mouth, and the boundary of the head); (g) the strain map is back projected to the original orientation of the Kinect sensor, which generates an aligned 3-D surface strain map. We have reported preliminary results in [4]. However, we now use a new method for automatic calibration and facial tracking and extend our experiments from 40 subjects and 2 expressions to over 100 subjects and 6 expressions from the BU4DFE database.

While we have not found work in the literature that estimates the 3-D surface strain on the face, we found in [4] that there are many approaches for estimating 3-D volumetric motion. Perhaps one of the biggest applications is in the entertainment industry, where modeling the surface of the face during expressions can be used to animate humanoid avatars in interactive video games


Fig. 1. Overall algorithm design.

Fig. 2. Matched SIFT features from the right camera view and the Kinect RGB image are given in (a). In (b), the aligned depth images are shown.


or for realistic special effects in movies. Many of these methods (especially those that densely track key points on the face) do not work strictly on passive imaging of the face, but require the manual application of facial landmarks and makeup that are used for tracking and 3-D reconstruction, especially when lighting conditions are not ideal [5–7]. The main drawback of these methods is that they do not provide dense enough motion estimation to allow for the capture of the elastic properties of the face. More similar to our method, a dense approach is given by Bradley et al. [8] that uses optical flow to track a mesh generated from seven pairs of cameras and nine light grids. However, because of the complexity of their acquisition device it may not be feasible in many applications. Other approaches that attempt to find 3-D disparity from multiple view triangulation can be found in [8–10], many of which have been used to find non-rigid disparities [11–13]. Methods that use a single camera while exploiting knowledge of the object being studied (such as updating a generic 3-D face mesh on a 2-D image) are discussed in [14], while some commercial and open-source projects can also be found. One such product is [15], which is compatible with a range of commodity 3-D sensors and can update a live avatar mesh by automatically tracking several key points on the face. Alternatively, the open source project [16] has similar capabilities, but works with 2-D images. However, none of these methods provide per-pixel tracking, thus using them to estimate surface strain is not feasible. Finally, dynamic depth map approaches [17,18] assume a single 3-D acquisition device with a calibrated color image, such as the Microsoft Kinect. In the first approach, 3-D motion is estimated for rigid gestures as opposed to the non-rigid motion of the face; the latter uses pre-recorded animation priors to drive real-time avatar animations. Our approach also uses dynamic depth maps in order to estimate the 3-D surface strain on the face by tracking the skin pixels over a video sequence captured from an external high-resolution camera that has been automatically calibrated/aligned with an RGB-D sensor.

The paper is organized in the following way: in Section 2 we give a brief overview of optical flow and how we can use these motion vectors to calculate surface strain. Section 3 gives an approach for solving the 3-D transform of the low resolution depth map to match that of an external HD camera. Section 4 provides two experiments: one that demonstrates the robustness of the method with respect to depth resolution, and a second that compares the 3-D surface strain maps calculated from two HD cameras angled approximately 45 degrees apart. We provide our conclusions in Section 5.

2. Optical strain

Methods for calculating 2-D optical strain can be found in our earlier work [1,3]. In this paper, we propose a new method that uses depth approximation in order to calculate a more view-invariant 2-D strain map. The method works by first computing the 2-D motion along the surface using standard optical flow techniques. Then, the motion vectors are projected to the 3-D surface of the face and adjusted based on the curvature of the face.

Fig. 3. A mask is generated by defining the boundaries of the face, eyes, and mouth using the 66 automatically detected points on the face shown in (a). Part (b) shows the original mask (strain values in the white region pass through; black regions are set to 0). The transformed masks for the two external high resolution cameras are shown in (c) and (d) using the described calibration method. Part (e) shows the points that are visible to each view. The light gray areas are points only visible from the right camera, the dark gray areas are those found in the left camera, and the white regions are points common to both views.

2.1. Optical flow

Optical flow is based on the brightness conservation principle [19]. In general, it assumes (i) constant intensity at each point over a pair of frames, and (ii) smooth pixel displacement within a small image region. The first constraint is encapsulated by the following equation:

$$(\nabla I)^T \mathbf{p} + I_t = 0, \qquad (1)$$

where I(x, y, t) represents the image intensity at point (x, y) at time t, ∇I represents its spatial gradient, and I_t its temporal derivative. The horizontal and vertical motion vectors are represented by $\mathbf{p} = [\,p = dx/dt,\; q = dy/dt\,]^T$.

Fig. 4. Depth reconstruction and surface strain maps for two example expressions (happy and anger) with the depth data at resolutions of 200 × 200, 50 × 50, and 40 × 40 (each column after the regular image). Depth maps are quantized to 20 levels to highlight consistency.


We tested several optical flow implementations, which are summarized in [2]. Based on experiments, we chose the [20] implementation of the Horn–Schunck method due to the speed, accuracy and flexibility that the implementation provided. First, the Horn–Schunck method formulates the flow using the following global energy functional:

$$E = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx\,dy, \qquad (2)$$

where I_x, I_y, and I_t are the derivatives of the image intensities in the x, y, and t dimensions. The optical flow is denoted by $\vec{V} = [\,u(x, y),\; v(x, y)\,]^T$ and α is a regularization constant. The choice of α specifies the degree of smoothness in the flow fields.

This global energy functional can be minimized using the Euler–Lagrange equations. An iterative solution of these equations is derived in [21]:



Fig. 5. Box plots of the correlation coefficients of the 3-D surface strain when using full 1:1 depth information and depth downsampled at ratios of 1:2, 1:3, 1:4, and 1:5, with one panel per expression (Happy, Sad, Angry, Disgust, Fear, Surprise); the x-axis is the depth resolution (100 × 100 down to 40 × 40) and the y-axis is the correlation coefficient. Results are for all 101 subjects performing 6 expressions. The mean is denoted by the larger solid circle, while outliers are denoted by gray dots (approximately ±2.7σ).


$$u^{k+1} = \bar{u}^k - \frac{I_x \left( I_x \bar{u}^k + I_y \bar{v}^k + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}, \qquad
v^{k+1} = \bar{v}^k - \frac{I_y \left( I_x \bar{u}^k + I_y \bar{v}^k + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}. \qquad (3)$$

The weighted average of u in a fixed neighborhood at a pixel location (x, y) is denoted by $\bar{u}(x, y)$. With respect to α, we empirically found that α = 0.05 and k = 200 iterations offered a good trade-off between capturing the non-rigid motion of the face and suppressing noisy motion outliers.
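For concreteness, the iterative update in Eq. (3) can be sketched in a few lines of NumPy. This is a generic Horn–Schunck sketch, not the MATLAB routine [20] used in our experiments; the derivative and averaging kernels are one common choice, and the default parameters mirror the α and iteration count reported above.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, alpha=0.05, n_iter=200):
    """Dense optical flow (u, v) between two grayscale frames via Horn-Schunck iterations."""
    I1 = I1.astype(np.float64)
    I2 = I2.astype(np.float64)

    # Spatial and temporal derivatives from simple finite-difference/averaging kernels.
    kx = np.array([[-1.0, 1.0], [-1.0, 1.0]]) * 0.25
    ky = np.array([[-1.0, -1.0], [1.0, 1.0]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(I1, kx) + convolve(I2, kx)
    Iy = convolve(I1, ky) + convolve(I2, ky)
    It = convolve(I2, kt) - convolve(I1, kt)

    # Kernel that produces the local weighted averages u_bar, v_bar of Eq. (3).
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6],
                    [1/12, 1/6, 1/12]])

    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / denom  # shared term of Eq. (3)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v
```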

Fig. 6. Acquisition setup for the USF-Kinect dataset, used in Experiment 2 to demonstrate the method's robustness under view changes.


Our method is based on tracking the motion occurring on the face during facial expressions, where expressions can last several seconds and consist of motion that is too large to be accurately and consistently tracked by optical flow. To solve this, we use a vector stitching technique that combines pairs of vectors from consecutive frames in order to recover the overall larger displacements occurring over several frames [22]. In our previous work [1], we have shown that a significant amount of noise accumulates after several hundred frames (over 800) in a video; however, in this work we only need to stitch roughly 30–50 frames.
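A minimal sketch of the stitching step, assuming the per-pair flow fields are already available: the accumulated flow from frame 0 to frame t-1 tells us where each pixel has moved, the incremental flow for frame t is sampled at those sub-pixel positions, and the two are summed. Bilinear sampling via OpenCV's remap is used here for illustration; the interpolation scheme is an implementation detail not fixed by the paper.

```python
import numpy as np
import cv2

def stitch_flow(flow_0_to_prev, flow_prev_to_cur):
    """Compose flow(0 -> t-1) with flow(t-1 -> t) to obtain flow(0 -> t).

    Both inputs are H x W x 2 arrays of (dx, dy) pixel displacements.
    """
    h, w = flow_0_to_prev.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))

    # Positions that each pixel of frame 0 has reached in frame t-1.
    map_x = (xs + flow_0_to_prev[..., 0]).astype(np.float32)
    map_y = (ys + flow_0_to_prev[..., 1]).astype(np.float32)

    # Sample the incremental flow at those (sub-pixel) positions.
    inc_x = cv2.remap(flow_prev_to_cur[..., 0].astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)
    inc_y = cv2.remap(flow_prev_to_cur[..., 1].astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)

    return flow_0_to_prev + np.dstack([inc_x, inc_y])
```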

Although we chose to use vector stitching, it is worth noting that there are alternative methods that could be used for tracking the motion during facial expressions, including different approaches for calculating optical flow (a summary evaluation of several optical flow techniques can be found in [23]). A coarse-to-fine approach that is capable of tracking larger motions is discussed in [19]; however, the time complexity of this approach for large HD images was significantly greater than our current approach of stitching together smaller flow estimations (3–5 min per frame compared with 2–4 s per frame). Another approach is to warp the 3-D model to a 2-D plane using a diffeomorphic mapping, such as the harmonic mapping given in [24]. Optical flow, and hence optical strain, would be calculated in the warped domain and then inversely mapped back onto the 3-D model.

2.2. Strain

The 3-D displacement of any deformable object can be expressed by a vector $\mathbf{u} = [u, v, w]^T$. If the motion is small enough, then the corresponding finite strain tensor is defined as:


$$\varepsilon = \tfrac{1}{2}\left[ \nabla \mathbf{u} + (\nabla \mathbf{u})^T \right], \qquad (4)$$

which can be expanded to the form:

$$\varepsilon = \begin{bmatrix}
\varepsilon_{xx} = \dfrac{\partial u}{\partial x} &
\varepsilon_{yx} = \dfrac{1}{2}\left(\dfrac{\partial u}{\partial y} + \dfrac{\partial v}{\partial x}\right) &
\varepsilon_{zx} = \dfrac{1}{2}\left(\dfrac{\partial w}{\partial x} + \dfrac{\partial u}{\partial z}\right) \\[2ex]
\varepsilon_{xy} = \dfrac{1}{2}\left(\dfrac{\partial v}{\partial x} + \dfrac{\partial u}{\partial y}\right) &
\varepsilon_{yy} = \dfrac{\partial v}{\partial y} &
\varepsilon_{zy} = \dfrac{1}{2}\left(\dfrac{\partial w}{\partial y} + \dfrac{\partial v}{\partial z}\right) \\[2ex]
\varepsilon_{xz} = \dfrac{1}{2}\left(\dfrac{\partial u}{\partial z} + \dfrac{\partial w}{\partial x}\right) &
\varepsilon_{yz} = \dfrac{1}{2}\left(\dfrac{\partial v}{\partial z} + \dfrac{\partial w}{\partial y}\right) &
\varepsilon_{zz} = \dfrac{\partial w}{\partial z}
\end{bmatrix} \qquad (5)$$

where (ε_xx, ε_yy, ε_zz) are normal strain components and (ε_xy, ε_xz, ε_yz) are shear strain components.

Since each of these strain components is a function of the displacement vectors (u, v, w) over a continuous space, each strain component is approximated using the discrete optical flow data (p, q, r):

$$p = \frac{\Delta x}{\Delta t} = \frac{u}{\Delta t}, \quad u = p\,\Delta t, \qquad (6)$$
$$q = \frac{\Delta y}{\Delta t} = \frac{v}{\Delta t}, \quad v = q\,\Delta t, \qquad (7)$$
$$r = \frac{\Delta z}{\Delta t} = \frac{w}{\Delta t}, \quad w = r\,\Delta t, \qquad (8)$$

where Δt is the change in time between two image frames. Setting Δt to a fixed interval length, we approximate the first-order partial derivatives:

$$\frac{\partial u}{\partial x} = \frac{\partial p}{\partial x}\Delta t, \quad \frac{\partial u}{\partial y} = \frac{\partial p}{\partial y}\Delta t, \quad \frac{\partial u}{\partial z} = \frac{\partial p}{\partial z}\Delta t, \qquad (9)$$
$$\frac{\partial v}{\partial x} = \frac{\partial q}{\partial x}\Delta t, \quad \frac{\partial v}{\partial y} = \frac{\partial q}{\partial y}\Delta t, \quad \frac{\partial v}{\partial z} = \frac{\partial q}{\partial z}\Delta t, \qquad (10)$$
$$\frac{\partial w}{\partial x} = \frac{\partial r}{\partial x}\Delta t, \quad \frac{\partial w}{\partial y} = \frac{\partial r}{\partial y}\Delta t, \quad \frac{\partial w}{\partial z} = \frac{\partial r}{\partial z}\Delta t. \qquad (11)$$

Finally, the above computation scheme can be implemented using a variety of methods that take the spatial derivative over a finite number of points. We chose the central difference method due to its computational efficiency, thus calculating the normal strain tensors as:

$$\frac{\partial u}{\partial x} = \frac{u(x+\Delta x, y, z) - u(x-\Delta x, y, z)}{2\Delta x} = \frac{p(x+\Delta x, y, z) - p(x-\Delta x, y, z)}{2\Delta x}, \qquad (12)$$
$$\frac{\partial v}{\partial y} = \frac{v(x, y+\Delta y, z) - v(x, y-\Delta y, z)}{2\Delta y} = \frac{q(x, y+\Delta y, z) - q(x, y-\Delta y, z)}{2\Delta y}, \qquad (13)$$
$$\frac{\partial w}{\partial z} = \frac{w(x, y, z+\Delta z) - w(x, y, z-\Delta z)}{2\Delta z} = \frac{r(x, y, z+\Delta z) - r(x, y, z-\Delta z)}{2\Delta z}, \qquad (14)$$

where Δx = Δy = Δz = 1 pixel.
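On a discrete flow field these central differences reduce to simple array operations; the short sketch below uses np.gradient, which applies the same central-difference scheme in the image interior with unit pixel spacing (one-sided differences are used at the borders).

```python
import numpy as np

def normal_strains(p, q):
    """Normal strain components from the flow fields p (horizontal) and q (vertical).

    np.gradient uses central differences in the interior, matching Eqs. (12)-(13)
    with delta_x = delta_y = 1 pixel and delta_t absorbed into the flow magnitudes.
    """
    eps_xx = np.gradient(p, axis=1)  # d p / d x  (x varies along columns)
    eps_yy = np.gradient(q, axis=0)  # d q / d y  (y varies along rows)
    return eps_xx, eps_yy
```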

2.3. Surface strain

The equations in the above section describe 3-D volumetric strain. However, in our approach we adjust the 2-D surface motion along a 3-D volume by supplementing the 2-D optical flow vectors with a depth component. Due to the low resolution of the Kinect data, the depth value for each point on the face is determined by the change in z along a locally fit plane. However, because of the assumption that the face is locally flat, all components that differentiate with respect to z are then 0. Hence, we are left with two normal strains (ε′_xx, ε′_yy), one shear strain (ε′_xy), and the two partial shears (ε′_xz, ε′_yz). We can then rewrite Eq. (5) as follows:

$$\varepsilon' = \begin{bmatrix}
\varepsilon'_{xx} = \dfrac{\partial u}{\partial x} &
\varepsilon'_{yx} = \dfrac{1}{2}\left(\dfrac{\partial u}{\partial y} + \dfrac{\partial v}{\partial x}\right) &
\varepsilon'_{zx} = \dfrac{\partial w}{2\,\partial x} \\[2ex]
\varepsilon'_{xy} = \dfrac{1}{2}\left(\dfrac{\partial v}{\partial x} + \dfrac{\partial u}{\partial y}\right) &
\varepsilon'_{yy} = \dfrac{\partial v}{\partial y} &
\varepsilon'_{zy} = \dfrac{\partial w}{2\,\partial y} \\[2ex]
\varepsilon'_{xz} = \dfrac{\partial w}{2\,\partial x} &
\varepsilon'_{yz} = \dfrac{\partial w}{2\,\partial y} &
\varepsilon'_{zz} = 0
\end{bmatrix}. \qquad (15)$$

Finally, we calculate the strain magnitude using the following:

$$\varepsilon'_{mag} = \sqrt{ (\varepsilon'_{xx})^2 + (\varepsilon'_{yy})^2 + (\varepsilon'_{xy})^2 + (\varepsilon'_{xz})^2 + (\varepsilon'_{yz})^2 }. \qquad (16)$$

The values in ε′_mag can then be normalized to [0, 255] to generate a strain map image where low and high intensity values correspond to small and large strain deformations, respectively (the un-normalized values are used for comparison in our experiments). Note that in contrast to our previous work, which was limited to the projected 2-D motion [1], this final surface strain equation also includes the motion lost due to foreshortening.
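Putting Eqs. (15) and (16) together, a per-pixel surface strain magnitude can be sketched as follows. Here p and q are the (stitched) horizontal and vertical flow fields and w is the out-of-plane displacement recovered from the aligned depth; variable names and the normalization helper are illustrative, and Δt is absorbed into the flow magnitudes.

```python
import numpy as np

def surface_strain_magnitude(p, q, w):
    """Surface strain magnitude (Eq. 16) from flow fields p, q and out-of-plane motion w.

    All inputs are H x W arrays on the image grid; derivatives use central differences
    (np.gradient) with unit pixel spacing.
    """
    dp_dx, dp_dy = np.gradient(p, axis=1), np.gradient(p, axis=0)
    dq_dx, dq_dy = np.gradient(q, axis=1), np.gradient(q, axis=0)
    dw_dx, dw_dy = np.gradient(w, axis=1), np.gradient(w, axis=0)

    eps_xx = dp_dx                   # normal strains
    eps_yy = dq_dy
    eps_xy = 0.5 * (dp_dy + dq_dx)   # in-plane shear
    eps_xz = 0.5 * dw_dx             # partial shears from the depth-adjusted motion
    eps_yz = 0.5 * dw_dy

    return np.sqrt(eps_xx**2 + eps_yy**2 + eps_xy**2 + eps_xz**2 + eps_yz**2)

def to_strain_map_image(mag):
    """Normalize the magnitude to [0, 255] for visualization as a strain map image."""
    mag = mag - mag.min()
    if mag.max() > 0:
        mag = mag / mag.max()
    return (255 * mag).astype(np.uint8)
```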

3. Algorithm

The algorithm is based on calculating optical flow on a 2-D image, with the depth approximated by a low resolution sensor. Since cheaper commodity RGB-D sensors such as the Microsoft Kinect do not have sufficient optics and resolution for accurate flow tracking, we align a separate HD webcam with the Kinect depth model. Hence optical flow is performed on the HD image, and then projected onto a depth map which has been aligned to the same image. The overall design of the algorithm can be found in Fig. 1.

3.1. Automatic Kinect-to-webcam calibration

We automatically calibrate Kinect RGB-D images to images from an HD webcam using local invariant features, as proposed by Pamplona Segundo et al. [25]. First, the scale-invariant feature transform (SIFT) [26] is employed to extract keypoints from both Kinect and webcam color images, represented by points in Fig. 2(a). Then, non-face keypoints are discarded and the remaining keypoints are used to establish correspondences between Kinect and webcam images, illustrated as lines. Robustness against 3D noise and correspondence errors is achieved by applying the RANSAC algorithm of Fischler and Bolles [27] to find the subset of correspondences whose transformation maximizes the number of Kinect keypoints projected onto their corresponding webcam keypoints. At least four correct correspondences are required to estimate the transformation that projects the Kinect depth data into the webcam coordinate system (see Fig. 2(b)), considering a camera model with seven parameters [28]: translation and rotation in the X, Y and Z axes, and focal length. To retrieve these parameters for a given subset of four correspondences we use the Levenberg–Marquardt minimization approach described by Marquardt [29].
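The structure of this calibration can be sketched as below: SIFT matching with OpenCV, followed by a RANSAC loop over 4-correspondence subsets, each refined with Levenberg–Marquardt (SciPy's least_squares) under a 7-parameter model (rotation, translation, focal length). This is an illustrative sketch only: the face-region filtering of keypoints, the lookup of the Kinect 3-D coordinates at the matched pixels, and details such as the principal point and coordinate conventions are omitted or assumed.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def match_face_keypoints(kinect_rgb, webcam_img):
    """SIFT keypoint matching between the Kinect color image and the HD webcam image."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(kinect_rgb, None)
    kp2, des2 = sift.detectAndCompute(webcam_img, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts_kinect = np.float32([kp1[m.queryIdx].pt for m in matches])  # Kinect pixel coords
    pts_webcam = np.float32([kp2[m.trainIdx].pt for m in matches])  # webcam pixel coords
    return pts_kinect, pts_webcam

def project(params, pts3d):
    """Project Kinect 3-D points with a 7-parameter model: rotation, translation, focal length."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    cam = pts3d @ R.T + params[3:6]
    return params[6] * cam[:, :2] / cam[:, 2:3]

def fit_camera(pts3d, pts2d, x0):
    """Levenberg-Marquardt refinement of the 7 parameters on a set of correspondences."""
    res = lambda p: (project(p, pts3d) - pts2d).ravel()
    return least_squares(res, x0, method="lm").x

def ransac_calibration(pts3d, pts2d, n_iter=1000, thresh=3.0):
    """RANSAC over 4-point subsets; keeps the model with the most reprojected inliers."""
    rng = np.random.default_rng(0)
    x0 = np.array([0, 0, 0, 0, 0, 1.0, 1000.0])  # crude initial guess (illustrative only)
    best_inliers = np.zeros(len(pts3d), dtype=bool)
    best_params = x0
    for _ in range(n_iter):
        idx = rng.choice(len(pts3d), size=4, replace=False)
        params = fit_camera(pts3d[idx], pts2d[idx], x0)
        err = np.linalg.norm(project(params, pts3d) - pts2d, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_params = inliers, params
    # Final refinement on all inliers.
    return fit_camera(pts3d[best_inliers], pts2d[best_inliers], best_params), best_inliers
```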

3.2. Fitting planes using least squares

Fig. 7. Example of the automatic calibration process: (a) and (d) show correspondences between the Kinect and the right webcam in blue and between the Kinect and the left webcam in red; (b) and (c) show the resulting projection of the Kinect depth data (center) to the coordinate system of the right webcam (left) and to the coordinate system of the left webcam (right). Figure best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

When working with a depth map whose resolution is lower than its calibrated RGB image, there will be many points within the image that do not have depth values. In our experiments using the Kinect sensor, it is common to find that the face size is approximately 500 × 500 pixels in the aligned high resolution camera, with noisy depth values for only a portion (approximately 1/2) of these coordinates. In order to smooth out and interpolate the missing depth values, we fit a plane to every 5 × 5 neighborhood using least squares fitting. Each pixel is then updated to its new planar value based on its fitted plane.
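A sketch of this interpolation step, assuming missing depth values are encoded as NaN: a plane z = ax + by + c is fit to the valid samples of each 5 × 5 window, and the center pixel is replaced by the plane's value there.

```python
import numpy as np

def fit_plane_depth(depth, win=5):
    """Fill/smooth a sparse depth map by fitting z = a*x + b*y + c to each win x win patch."""
    h, w = depth.shape
    r = win // 2
    out = depth.copy()
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]   # local patch coordinates, centered at (0, 0)
    for i in range(r, h - r):
        for j in range(r, w - r):
            patch = depth[i - r:i + r + 1, j - r:j + r + 1]
            valid = ~np.isnan(patch)
            if valid.sum() < 3:              # need at least 3 points to define a plane
                continue
            A = np.column_stack([xs[valid], ys[valid], np.ones(valid.sum())])
            coeff, *_ = np.linalg.lstsq(A, patch[valid], rcond=None)
            out[i, j] = coeff[2]             # plane value at the patch center (x = y = 0)
    return out
```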

3.3. Facial landmark detection and masking

Optical strain maps are derived from optical flow estimations that are susceptible to tracking failures, which in turn corrupt the strain estimates. In general, the regions most susceptible to this kind of error are located around the eyes and mouth, due to the self-occlusions that occur when subjects open and close them. Moreover, since failures in flow estimation occur at the boundary of the head (due to the head moving in front of the background), areas outside of the head are masked.

We segment these regions of the face using the facial landmarks automatically detected by Saragih et al. [30]. Fig. 3 shows the detected facial landmarks as well as the original mask. The two transformed views, obtained using the transformation matrix identified by the method described in the previous section, can also be found in this figure. Please note that only one camera is needed to run the algorithm; however, to demonstrate some robustness of the method to view, we aligned two cameras for an experiment in Section 4.2.

It is worth noting that for the second experiment in the next section, we performed additional masking by only comparing values that are visible in both views (see part (e) in Fig. 3). These points are found by back-projecting every pixel from each calibrated view back onto the 3-D model and incrementing a value at each location.
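The common-visibility test can be sketched as a simple counting scheme. Here each view is assumed to provide a per-pixel mapping to a 3-D model vertex index (a by-product of the calibration), which is a simplification of the back-projection described above.

```python
import numpy as np

def common_visibility(pixel_to_vertex_left, pixel_to_vertex_right, n_vertices):
    """Count, per 3-D model vertex, how many camera views observe it.

    Each input is a 1-D array of vertex indices (one entry per face pixel of that view,
    -1 where no vertex is hit). Vertices with count == 2 are visible from both views.
    """
    counts = np.zeros(n_vertices, dtype=np.int32)
    for p2v in (pixel_to_vertex_left, pixel_to_vertex_right):
        hit = np.unique(p2v[p2v >= 0])   # each view contributes at most 1 per vertex
        counts[hit] += 1
    return counts == 2
```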

4. Experiments

4.1. Experiment 1: strain map consistencies at varying depth resolutions

To demonstrate that a sparse depth resolution of the face is sufficient for reliable 3-D strain estimation, we compared strain maps calculated at the full resolution of the BU dataset's depth data with those calculated at several downsampled resolutions.


Fig. 8. Example strain maps for the happy expression from two different views. The first two columns are the RGB images for the left and right camera. The next two images are the corresponding 2-D strain maps. The last two images are the corresponding 3-D surface strain maps. Note that the 3-D surface strain images only show the strain values for pixels that are common to both views (see Fig. 3, part e).

Fig. 9. This figure shows the relationship between the normalized correlation coefficients of the 2-D and 3-D surface strain maps calculated from two different views, along with the normalized summed strain magnitude, plotted against frame number for one sequence. When the summed strain magnitude is high, the 3-D surface strain appears correlated between the two different camera views, especially when compared to the correlation between 2-D strain maps after a feature-based alignment.


This is a publicly available 3-D dataset [31] that contains a single view and over 600 sequences of high-resolution (1280 × 720) videos with depth data. We use 101 subjects and 6 expressions from this dataset to measure the consistency of strain map calculation at low depth resolutions. The depth resolution of the BU data is roughly 200 × 200 for the face image. The depth data is then downsampled to 100 × 100, 66 × 66, 50 × 50, and 40 × 40 (see Fig. 4). For each of these resolutions, pixels that are missing depth data are filled by fitting a plane to their local neighborhood using linear least squares, as described in Section 3.2.

We then estimate the 3-D surface strain using the 2-D optical flow and the downsampled depth map, and compare it against the estimate obtained with the original depth map by finding the correlation coefficient. The correlation coefficient provides a per-pixel similarity using the normalized covariance function. Overall, Fig. 5 shows that the strain estimation is stable across these resolutions, even when only 40 × 40 depth coordinates are used from the original 200 × 200 depth map.
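The similarity measure itself is the standard normalized (Pearson) correlation over the masked face pixels; a minimal sketch:

```python
import numpy as np

def strain_map_correlation(strain_a, strain_b, mask):
    """Pearson correlation coefficient between two strain maps over the valid (masked) pixels."""
    a = strain_a[mask].ravel()
    b = strain_b[mask].ravel()
    return np.corrcoef(a, b)[0, 1]
```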

train magnitude using 2D camera and low-resolution depth sensor, Pattern

Page 8: High-resolution 3D surface strain magnitude using 2D camera and low-resolution depth sensor

Fig. 10. Boxplot showing the normalized correlation coefficients for 3-D surface strain maps and 2-D aligned strain maps, per subject (subjects 1–8 on the x-axis). The means for each boxplot are denoted by 'o', the mark within the box denotes the median, while the range of outliers is denoted by the dotted lines (approximately ±2.7σ).


A slight decrease in the correlation coefficient is observed for each decrease in depth resolution. The distribution is tightly centered around the median, and at least 85% correlation is maintained at all four depth resolutions.

4.2. Experiment 2: surface strain maps from two views

While the first experiment demonstrated that the surface strain estimations are consistent when using low-resolution depth information, the next experiment (see Fig. 6) demonstrates that the method has an increased robustness to view by calculating two strain maps independently (one for each view) and then providing a similarity measure between the two.

We developed a dataset containing two webcam views (roughly 45 degrees apart) and one Kinect sensor. These two cameras allow us to calculate two separate strain maps from two different views, which we can then compare for similarities. We recorded 8 subjects performing the smile and surprise facial expressions. Some example images from this dataset can be found in Fig. 7. Due to simultaneously acquiring large amounts of data over USB 2.0 bandwidth, the framerate of each video fluctuated between 5 and 6 frames per second. Each sequence lasted approximately 50 frames, or 10 s.

The first step is to align the Kinect's 3-D data to each external webcam (see Fig. 7). We then calculate the surface strain for each of these views and transform each of them back to the Kinect coordinate system, where we can determine the normalized correlation coefficient between the two registered strain maps. Note that this is only done for points in each image that are common to both views (see Fig. 3), since there will be large occluded regions due to parallax between the two views. For comparison, we also analyze the correlation of the two 2-D strain maps calculated from each camera after aligning both to the center view using an affine transformation based on the location of the eyes and mouth.

Fig. 8 shows a comparison of the 2-D strain maps and the 3-D surface strain maps generated using the method, where qualitatively we can see that the 3-D surface strain maps are similar between the views. Results for multiple frames in an example sequence are given in Fig. 9. This figure plots the summed strain magnitude along with the normalized correlation coefficients for both 2-D and 3-D and quantitatively demonstrates that the facial surface strain estimation is consistent between both cameras. We also observe that the normalized correlation coefficient increases as the strain magnitude increases. This is likely due to the noise in flow estimation when there is little or no motion occurring on the face. Hence, when we calculated the correlation coefficients for all the subjects in Fig. 10, we only used values where the strain magnitude is over 80%. Overall, the results are evidence of a significant improvement when comparing strain maps computed with and without 3-D information. We observe a 40–80% correlation in strain maps calculated from two different views when using the 3-D information to calculate surface strain and align them to a central view (note that the means, denoted by 'o', of these values range from 60–70%). However, when using a 2-D based alignment, there is a significant decrease in correlation.

5. Conclusions

Optical strain maps generated from facial expressions capture the elastic deformation of facial skin tissue. This feature has been shown (in 2-D) to have wide application in several research areas, including biometrics and medical imaging. However, previous work on strain imaging has been limited to a single view, thus forcing training data to be manually aligned with future queries. In this paper, we propose a novel method that removes this restriction and allows for a range of views from multiple recording sessions. Furthermore, we have proposed a method that is able to recover motion that is lost due to distortions from a 3-D model being projected onto a 2-D image plane.

The method works by first calibrating an external high resolution camera to a low resolution depth sensor using matched SIFT features. Next, optical flow is calculated in the external camera and then projected onto the calibrated depth map in order to approximate the 3-D displacement. We have experimentally demonstrated that the method is robust at several depth resolutions by giving quantitative and qualitative results showing that the surface strain maps for 101 subjects and over 600 expression sequences at 4 sub-sampled depth resolutions are highly correlated with the surface strain maps computed at full depth resolution. Moreover, when evenly down-sampling a depth resolution of 200 × 200 to as little as 40 × 40, we lose less than 10% correlation on average compared to using the full depth resolution.

We have also reported results that demonstrate the increased view-invariance of the method by recording subjects from two views approximately 45 degrees apart. In this experiment, we show that the independent 3-D surface strain maps generated from each of these two views are highly correlated. This is especially true when the amount of strain is relatively high, as opposed to when there is little to no strain present, because of noise in the optical flow estimation.

In our future work, we will analyze several applications of this method and its potential to boost performance over methods that are based on 2-D strain.

References

[1] M. Shreve, S. Godavarthy, D. Goldgof, S. Sarkar, Macro- and micro-expression spotting in long videos using spatio-temporal strain, in: International Conference on Automatic Face and Gesture Recognition, 2011, pp. 51–56.

[2] M. Shreve, N. Jain, D. Goldgof, S. Sarkar, W. Kropatsch, C.-H.J. Tzou, M. Frey, Evaluation of facial reconstructive surgery on patients with facial palsy using optical strain, in: Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns, 2011, pp. 512–519.

[3] M. Shreve, V. Manohar, D. Goldgof, S. Sarkar, Face recognition under camouflage and adverse illumination, in: First IEEE International Conference on Biometrics: Theory, Applications, and Systems, 2010, pp. 1–6.

[4] M. Shreve, S. Fefilatyev, N. Bonilla, D. Goldgof, S. Sarkar, Method for calculating view-invariant 3D optical strain, in: International Workshop on Depth Image Analysis, 2012.

[5] B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, Multi-scale capture of facial geometry and motion, in: ACM Transactions on Graphics, vol. 29, 2007, p. 33.

[6] V. Blanz, C. Basso, T. Poggio, T. Vetter, Reanimating faces in images and video, Comput. Graphics Forum 22 (2003) 641–650.

[7] I. Lin, M. Ouhyoung, Mirror MoCap: automatic and efficient capture of dense 3D facial motion parameters, Visual Comput. 21 (2005) 355–372.

[8] D. Bradley, W. Heidrich, T. Popa, A. Sheffer, High resolution passive facial performance capture, in: ACM Transactions on Graphics, vol. 29, 2010, p. 41.

[9] Y. Furukawa, J. Ponce, Dense 3D motion capture from synchronized video streams, in: Image and Geometry Processing for 3-D Cinematography, 2010, pp. 193–211.

[10] J. Pons, R. Keriven, O. Faugeras, Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score, Int. J. Comput. Vision 72 (2007) 179–193.

[11] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, C. Theobalt, Lightweight binocular facial performance capture under uncontrolled lighting, ACM Trans. Graphics 31 (6) (2012) 1–11.

[12] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R.W. Sumner, M. Gross, High-quality passive facial performance capture using anchor frames, in: ACM SIGGRAPH 2011 Papers, SIGGRAPH '11, 2011, pp. 75:1–75:10.

[13] M. Klaudiny, A. Hilton, High-detail 3D capture and non-sequential alignment of facial performance, in: 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012, pp. 17–24.

[14] M. Penna, The incremental approximation of nonrigid motion, Comput. Vision 60 (1994) 141–156.

[15] Faceshift, Markerless motion capture at every desk, 2013. http://www.faceshift.com/.

[16] Livedriver, Face tracking, 2013. http://www.image-metrics.com/livedriver/.

[17] S. Hadfield, R. Bowden, Kinecting the dots: particle based scene flow from depth sensors, in: Proceedings of International Conference on Computer Vision, 2011, pp. 2290–2295.

[18] T. Weise, S. Bouaziz, H. Li, M. Pauly, Realtime performance-based facial animation, in: ACM Transactions on Graphics, vol. 30, 2011, p. 77.

[19] M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise-smooth flow fields, Computer Vision and Image Understanding 63 (1996) 75–104. http://dx.doi.org/10.1006/cviu.1996.0006.

[20] MATLAB, version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts, 2010.

[21] B.K.P. Horn, B.G. Schunck, Determining optical flow, Artif. Intell. 17 (1981) 185–203.

[22] S. Baker, S. Roth, D. Scharstein, M. Black, J.P. Lewis, R. Szeliski, A database and evaluation methodology for optical flow, in: IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.

[23] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, R. Szeliski, A database and evaluation methodology for optical flow, Int. J. Comput. Vision 92 (2011) 1–31.

[24] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, P. Huang, High resolution tracking of non-rigid motion of densely sampled 3D data using harmonic maps, Int. J. Comput. Vision 76 (3) (2008) 283–300. http://dx.doi.org/10.1007/s11263-007-0063-y.

[25] M. Pamplona Segundo, L. Gomes, O.R.P. Bellon, L. Silva, Automating 3D reconstruction pipeline by SURF-based alignment, in: International Conference on Image Processing, 2012, pp. 1761–1764.

[26] D.G. Lowe, Object recognition from local scale-invariant features, in: International Conference on Computer Vision, 1999, pp. 1150–1157.

[27] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395.

[28] M. Corsini, M. Dellepiane, F. Ponchio, R. Scopigno, Image-to-geometry registration: a mutual information method exploiting illumination-related geometric properties, Comput. Graphics Forum 28 (7) (2009) 1755–1764.

[29] D.W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. 11 (2) (1963) 431–441.

[30] J.M. Saragih, S. Lucey, J.F. Cohn, Face alignment through subspace constrained mean-shifts, in: International Conference on Computer Vision, 2009.

[31] L. Yin, X. Chen, Y. Sun, T. Worm, M. Reale, A high-resolution 3D dynamic facial expression database, in: The 8th International Conference on Automatic Face and Gesture Recognition, 2008, pp. 1–6.