
Tracking of facial deformations in multi-image sequences with elimination of rigid motion of the head

Olli Jokinen
Department of Real Estate, Planning and Geoinformatics, Aalto University, P.O. Box 15800, FI-00076 Aalto, Finland
Tel.: +358 50 5936780. E-mail address: olli.jokinen@aalto.fi. URL: http://www.foto.hut.fi/~ojokinen

ISPRS Journal of Photogrammetry and Remote Sensing 84 (2013) 52–68. doi: 10.1016/j.isprsjprs.2013.06.007

Article history: Received 9 November 2012; Received in revised form 18 June 2013; Accepted 22 June 2013; Available online 2 August 2013.

Keywords: Multi-image sequence; Point tracking; Combined spatio-temporal matching; Rigid motion; Deformation; Human face

Abstract

The paper deals with measurement of human facial deformations from synchronized image sequences taken with multiple calibrated cameras from different viewpoints. SIFT (Scale Invariant Feature Transform) keypoints are utilized as image feature points in the first place to determine spatial and temporal correspondences between images. If no temporal match is found for an image point by keypoint matching, then the tracking of the point is switched to least squares matching provided the point has one or more spatial corresponding points in the other views of the previous frame. For this purpose, a new method based on affine multi-image least squares matching is proposed where multiple spatial and temporal template images are simultaneously matched against each search image and part of the spatial template images also change during adjustment. A new method based on analyzing temporal changes in the image coordinates of the tracked points in multiple views is then presented for detecting the 3-D points which move only rigidly between consecutive frames. These points are used to eliminate the effect of rigid motion of the head and to obtain the changes in the 3-D points and in the corresponding image points due to pure deformation of the face. The methods are thoroughly tested with three multi-image sequences of four cameras including also quite large changes of facial deformations. The test results prove that the proposed affine multi-image least squares matching yields better results than another method using only fixed templates of the previous frame. The elimination of the effect of rigid motion works well and the points where the face is deforming can be correctly detected and the true deformation estimated. A method based on a novel adaptive threshold is also proposed for automated extraction and tracking of circular targets on a moving calibration object.

© 2013 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS) Published by Elsevier B.V. All rights reserved.

1. Introduction

Measurement of human facial deformations from synchronized image sequences taken with multiple cameras from different viewpoints has attracted quite a lot of research during the last decades (Fasel and Luettin, 2003). Usually, the positions of either automatically generated or manually given feature points are tracked through the sequences and interpreted as different facial expressions. A general problem complicating the estimation of deformation is that the head usually moves at the same time as the face deforms. When the deformation is required to be measured accurately, the effect of the rigid motion of the head should be eliminated and the pure deformation of the face estimated. The research questions investigated in this paper thus include how to eliminate the rigid motion and also how the tracking of image points should be carried out in multi-image sequences in order to obtain results of highest accuracy.

D’Apuzzo (2003) has proposed a method for tracking corresponding points in multi-image sequences based on the adaptive least squares matching of Gruen (1985). Spatial correspondences are first established between different views using least squares matching and epipolar constraints. Temporal matches between consecutive frames are then searched independently for each camera. Finally, a spatial correspondence check is carried out to verify the matches found at the next frame. The spatial matching and temporal matching are thus performed in a sequential fashion so that in each matching there is one template image and one search image at a time. Gruen and Baltsavias (1988) have developed a method where one template image is matched simultaneously against several search images using collinearity equations to constrain the movement of corresponding points along epipolar lines. The method has been applied to establish spatial correspondences between multiple images. Denker and Umlauf (2011) use a similar method to move the matching windows simultaneously along epipolar lines in multiple images with weighted normalized cross-correlation as a matching criterion. The matching windows are allowed to deform to fit the perspective deformation of the facial surface.

Simultaneous tracking of multiple targets in multi-image sequences is addressed in (Berclaz et al., 2011; Leal-Taixé et al., 2012). The targets moving independently in front of the cameras are detected separately for each frame, while the main focus is on connecting the detections to consistent 3-D trajectories. The same problem is considered for monocular multi-target 2-D tracking in (Andriyenko et al., 2012; Yang and Nevatia, 2012). Although these methods have been mainly applied to pedestrian tracking, the same ideas such as temporal smoothness of trajectories or coherency between reconstructions from different camera pairs (Leal-Taixé et al., 2012) might be considered for tracking points on deforming faces, too. Noteworthy is also the superpixel-based method (Wang et al., 2011), which is able to track objects undergoing large shape deformations.

Previous research related to rigid motion elimination includes methods based on a deformable head model (Gokturk et al., 2001; Oka and Sato, 2005). The deformation is approximated as a linear combination of rigid shapes given by a few basis shape vectors estimated from 3-D trajectory data using principal component analysis (Gokturk et al., 2001). In (Oka and Sato, 2005), the motion of the head is assumed to be uniform and straight between consecutive frames. The head pose and face deformation are estimated simultaneously and the pure 3-D deformation can be calculated from the face deformation parameters. Baltrusaitis et al. (2012) train the deformable head model using both intensity and depth images of different facial expressions. Zhang et al. (2008) combine semantic features, silhouette features, and interest points to cope with occlusions caused by large variations in viewing directions. The pose and shape of a deformable head model are estimated iteratively, and drift is prevented by anchoring to special key-frames. Wang and Ji (2008) combine measurements from an offline-trained generic face model with measurements from an online-learned specific face model to track the pose of a face when there appear large changes in the viewpoint and scale of the face.

A current state-of-the-art method for tracking 3-D facial deformations with an accurate facial surface model, reconstructed using pairwise stereo matching techniques in a setup of seven cameras, is presented by Beeler et al. (2011). An initial mesh is first reconstructed for each frame based on normalized cross-correlation with smoothness and other constraints. One frame is selected as a reference frame and other frames with similar facial expression and head orientation are automatically detected and labeled as anchor frames. Image pixels are tracked from the reference frame to the anchor ones and then sequentially to the other frames between the anchor ones. The surface mesh of the reference frame is then propagated to the other frames using the tracked image pixels which provide temporal correspondences of meshes along the sequence. The propagated meshes are finally refined to produce spatially and temporally coherent capture of the facial expressions. Furukawa and Ponce (2009) use 8–10 cameras and a pattern projected onto the face to reconstruct an initial mesh at the first frame. Complicated facial expressions including stretch and shrink of the skin are then captured by estimating the nonrigid deformation of the mesh through the image sequence. The key contribution is a regularization term introduced to estimate complex tangential deformation at each vertex.

Our solution to rigid motion elimination is based on analyzing changes in the image coordinates for detecting points which move only rigidly, with the assumption that less than half of the points experience a deformation between consecutive frames. Neither markers nor projected texture are used, but SIFT keypoints (Lowe, 2004) are utilized as image feature points. The descriptor vectors of the keypoints are used to determine spatial and temporal correspondences between images together with epipolar constraints for the spatial case. If no temporal correspondence is found by keypoint matching, the tracking of the point is switched to least squares matching. This occurs typically in areas where the face starts deforming. The least squares matching performs as a true multi-image matching where multiple template images are matched against each search image and part of the template images also change during adjustment. Our approach is motivated by a larger study (Jokinen and Haggrén, 2011; Jokinen and Haggrén, 2012), where we develop image-based methods for monitoring small changes in the object points with emphasis on the accuracy. Consequently, we prefer analyzing changes in the image coordinates to using an approximate deformable head model.

The paper includes three contributions to knowledge. Firstly, a method based on a novel adaptive threshold is proposed for automated extraction and tracking of circular targets on a moving calibration object. Secondly, a new method is proposed for simultaneous tracking of corresponding points in multi-view image sequences by matching both temporal and spatial image patches at the same time. Thirdly, a new method based on analyzing changes in the image coordinates is presented for detecting the 3-D points which move only rigidly and which can thus be used to eliminate the effect of rigid motion and to obtain the changes in the 3-D points due to pure deformation of the face.

The paper is organized as follows. Section 2.1 starts with an outline of the proposed processing chain, Section 2.2 continues with calibration and exterior orientation of the cameras, Section 2.3 describes the capturing of face image sequences, Sections 2.4 and 2.5 present the methods for finding spatial and temporal correspondences using SIFT keypoints and least squares matching, and Section 2.6 proposes the method for eliminating the effect of rigid motion. Test results verifying the performance of the methods are shown in Section 3 and conclusions are presented in Section 4.

2. Methods

2.1. Outline of the processing chain

The different steps from camera calibration and image capturing to point tracking and rigid motion elimination are outlined in Fig. 1. These steps are described in detail in the following subsections.

Fig. 1. Processing diagram.

2.2. Camera calibration and exterior orientation

The image data are acquired with four 1.3 Mpix Mightex industrial cameras accompanied with software for triggering and storing images at a maximum speed of 20 fps. Each camera is calibrated using an object with 28 circular targets attached on it as shown in Fig. 2. The positions of the targets in an object coordinate system have been previously measured by another setup consisting of four 6–10 Mpix Nikon digital cameras (Jokinen and Haggrén, 2012). The calibration involves capturing an image sequence of about 370 images at 2 fps during which the object is shown in different positions and orientations to the camera. In our method, the centers of the circular targets are automatically extracted from the first image and tracked through the image sequence. The extraction of a target is based on first extracting the black square surrounding the circular area by thresholding and region growing, and then finding the pixels in the circular area as non-classified pixels inside the black area. The gray-level weighted mean of the pixels in the circular area provides the position of the target.
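As an illustration of this extraction step, the sketch below is a rough Python approximation under my own assumptions about array layout and helper names, not the author's code: it thresholds a search window, keeps the largest black component as the square frame, and returns the gray-level weighted centroid of the enclosed circular area.

```python
import numpy as np
from scipy import ndimage

def extract_target_center(window, threshold):
    """Hypothetical sketch: locate a circular target inside its black square.

    window    -- 2-D array of gray levels scaled to [0, 1] (assumption)
    threshold -- gray level below which a pixel is classified as black
    """
    black = window < threshold
    labels, n = ndimage.label(black)               # connected black regions
    if n == 0:
        return None
    sizes = ndimage.sum(black, labels, index=range(1, n + 1))
    square = labels == (np.argmax(sizes) + 1)      # largest region = square frame
    # Non-black pixels enclosed by the square belong to the circular target.
    circle = ndimage.binary_fill_holes(square) & ~square
    if not circle.any():
        return None
    rows, cols = np.nonzero(circle)
    w = window[rows, cols]                         # gray-level weights
    return np.array([np.sum(w * rows), np.sum(w * cols)]) / np.sum(w)
```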

Fig. 2. Calibration object with automatically extracted target centers shown as red plus signs in one image of the sequence.

The tracking of the target is based on searching in a local window centered at the position of the target in the previous image and having a size proportional to the size of the target in the previous image. The novelty of our algorithm lies in the adaptive threshold for black pixels developed for extracting the black square in the current image. The lighting conditions change as the object is turned into different orientations and consequently, a constant threshold value for black is not adequate for the whole sequence. Moreover, the lighting conditions are different in different parts of the object, and thus each target is considered to have its own threshold which adaptively changes from image to image based on the histogram of gray levels in the search window. The method for updating the adaptive threshold is presented in Algorithm 1 and described as follows. First, the gray level where the histogram obtains its maximum in the area below the gray level threshold of the target in the previous image of the sequence is sought. If all the histogram values are zeros in this area, the area is extended to the minimum of twice the gray level threshold in the previous image and a fixed value of 0.8 at a scale where zero is black and one white. The maximum of the histogram in this area indicates the peak position of black pixels. Then, the gray levels where the histogram has decreased below 20%, 10%, and 5% of the black peak count are determined in the area of gray levels larger than the position of the black peak. It is checked if there exist other maxima between the positions of 20%, 10%, and 5% having counts greater than half of the black peak one. If no such maximum exists, then the new threshold is selected as the position of 5% decrease. If a maximum appears between the positions of 10% and 5% but not between 20% and 10%, then the new threshold is put to the position of 10% decrease. Finally, if a maximum is found between the positions of 20% and 10%, then the new threshold is at 20%. However, if the new threshold set in this way is greater than 0.8, then the threshold is not updated but the threshold in the previous image is used. The previous threshold is also used if the histogram does not decrease below 5% of the black peak count at all. Fig. 2 illustrates the target centers found using the adaptive threshold.

Algorithm 1. Updating the adaptive threshold T_k → T_{k+1} (see text for explanation).

1: Compute the histogram H(g) of gray levels g in the search image.
2: Determine the peak position of black pixels as g_p = arg max{H(g) | g ≤ T'_k}, where T'_k = T_k if there exists g ≤ T_k such that H(g) > 0; otherwise T'_k = min{2T_k, 0.8}.
3: Determine the gray levels g_20 = min{g > g_p | H(g) < 0.2 H(g_p)}, g_10 = min{g > g_p | H(g) < 0.1 H(g_p)}, g_5 = min{g > g_p | H(g) < 0.05 H(g_p)}.
4: If g_5 exists, do steps 5 and 6; else set T_{k+1} = T_k.
5: Check if there exist other maxima:
   If there exists g ∈ (g_10, g_5) such that H(g) > 0.5 H(g_p), then P_1 = 1; else P_1 = 0.
   If there exists g ∈ (g_20, g_10) such that H(g) > 0.5 H(g_p), then P_2 = 1; else P_2 = 0.
6: Set the new threshold:
   If P_1 = 0 and P_2 = 0, then T_{k+1} = g_5;
   else if P_1 = 1 and P_2 = 0, then T_{k+1} = g_10;
   else if P_2 = 1, then T_{k+1} = g_20.
   If T_{k+1} > 0.8, then T_{k+1} = T_k.
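A compact Python rendering of Algorithm 1 could look as follows; the histogram binning and variable names are assumptions made for illustration rather than the author's implementation.

```python
import numpy as np

def update_threshold(window, t_k, bins=256):
    """Sketch of Algorithm 1: update the black-pixel threshold t_k -> t_{k+1}
    from the gray-level histogram of the search window (levels in [0, 1])."""
    hist, edges = np.histogram(window, bins=bins, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Step 2: peak of black pixels at gray levels <= t_k (or an extended range).
    mask = centers <= t_k
    if not hist[mask].any():
        mask = centers <= min(2.0 * t_k, 0.8)
    if not hist[mask].any():
        return t_k
    gp = np.argmax(np.where(mask, hist, -1))
    # Step 3: first gray levels where the histogram drops below 20%, 10%, 5% of the peak.
    def first_below(frac):
        idx = np.nonzero((centers > centers[gp]) & (hist < frac * hist[gp]))[0]
        return idx[0] if idx.size else None
    g20, g10, g5 = first_below(0.2), first_below(0.1), first_below(0.05)
    if g5 is None:
        return t_k                        # step 4: no 5% drop -> keep old threshold
    # Step 5: secondary maxima between the drop positions.
    p1 = np.any(hist[g10 + 1:g5] > 0.5 * hist[gp]) if g10 is not None else False
    p2 = np.any(hist[g20 + 1:g10] > 0.5 * hist[gp]) if (g20 is not None and g10 is not None) else False
    # Step 6: choose the new threshold, keeping the old one if it would exceed 0.8.
    if p2:
        t_new = centers[g20]
    elif p1:
        t_new = centers[g10]
    else:
        t_new = centers[g5]
    return t_k if t_new > 0.8 else float(t_new)
```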

The calibration itself starts with estimating first, for each image, the camera projection matrix P of 11 parameters, which maps the known 3-D coordinates of the target points to the measured 2-D image coordinates of them, using the Direct Linear Transformation algorithm with homogeneous coordinates (Hartley and Zisserman, 2000, pp. 92 and 167). The left 3 × 3 submatrix E of the camera matrix is then decomposed into the product of an upper-triangular calibration matrix K and an orthogonal rotation matrix R using Givens rotations (Hartley and Zisserman, 2000, p. 552). We thus have P = [KR | p4], where p4 is the last column of P. A correct scale is obtained by requiring that the element K(3,3) = 1. The initial values for the focal length and principal point of the camera are given by averaging over the calibration matrices estimated for each image while the initial values for the radial and tangential lens distortion parameters are set to zero. The projection center of each camera position is given by C = −E⁻¹p4. Finally, a bundle adjustment with fixed object coordinates is performed to refine the values of the eight interior orientation parameters of the camera (focal length, principal point, three radial and two tangential lens distortion parameters) and six exterior orientation parameters of each image with respect to the moving calibration object. For a sequence of 370 images, there are thus 2228 parameters that are adjusted simultaneously.
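For reference, the decomposition P = [KR | p4] described above can be sketched as follows; this is my own numpy-based RQ factorization rather than the Givens-rotation routine cited from Hartley and Zisserman, and a library routine such as scipy.linalg.rq could be used instead.

```python
import numpy as np

def decompose_projection(P):
    """Sketch (not the author's code): split a 3x4 camera matrix P = [K R | p4]
    into an upper-triangular K, a rotation R, and the projection centre C."""
    E, p4 = P[:, :3], P[:, 3]
    # RQ decomposition of E via QR of the row-reversed, transposed matrix.
    q, r = np.linalg.qr(np.flipud(E).T)
    K = np.flipud(np.fliplr(r.T))
    R = np.flipud(q.T)
    # Enforce a positive diagonal of K, absorbing the signs into R.
    signs = np.sign(np.diag(K))
    K, R = K * signs, (R.T * signs).T
    K = K / K[2, 2]                  # fix the scale so that K(3,3) = 1 (P is homogeneous)
    C = -np.linalg.inv(E) @ p4       # projection centre of the camera
    return K, R, C
```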

Once all the four cameras have been calibrated, the calibration object is placed into a fixed position in front of the cameras so that the circular targets are visible in all the cameras. The coordinate system centered at the calibration object defines an object coordinate system. The exterior orientations of the cameras with respect to this coordinate system are solved as above by extracting the image coordinates of the circular targets followed by resection using the Direct Linear Transformation algorithm and a refinement by the Levenberg–Marquardt algorithm.

2.3. Capturing of face image sequences

The input data for the point tracking and rigid motion elimination algorithms consist of a multi-image sequence captured with the four synchronized, convergent, and calibrated cameras. The setup is illustrated in Fig. 3. The cameras are located on tripods side by side but at varying heights to obtain better imaging geometry. The scene includes a human standing still and making different kinds of facial expressions. It is assumed that the expressions vary quite slowly with respect to the frame rate. In addition to facial deformations, the object sways rigidly as it is difficult for the human to stand totally still. The rigid movement is of about the same order of magnitude as the deformations. It is assumed that most deformations in the image sequences appear at some parts of the face at a time while the rigid movements cover the whole head.

Fig. 3. Camera setup.

2.4. Keypoint matching between different views and between consecutive frames

SIFT keypoints are utilized as image feature points to find corresponding points between different views and between consecutive frames. The keypoints are determined within an area of interest described in the following.

2.4.1. Area of interest

The area of interest is defined for each camera as a region where there appear changes in the pixel gray levels within the image sequence. More specifically, a difference image sequence is computed where each image is given by subtracting the previous frame from the current frame. A binary image is then constructed where the pixel value equals one if there appears a difference value, the absolute value of which is above a user-given threshold T (0.03 or 0.05 when the gray levels of the images vary between zero and one), in any image of the difference image sequence at this pixel. Finally, a rough area of interest in the binary image is given manually and within this rough area, the region of ones is filled up by changing the pixel values to ones between the first and last pixels having the value one in each column and row. The resulting pixels with ones define the area of interest as described in Algorithm 2.

Algorithm 2. Area of interest.

1: Compute a difference image sequence dg_k = g_k − g_{k−1}, k = 2, ..., L.
2: Compute a binary image B, where B(i, j) = 1 if there exists k such that |dg_k(i, j)| > T; else B(i, j) = 0.
3: Set B(i, j) = 0 outside a rough area of interest given manually (a bounding box).
4: Determine for each row i of B the first column j_1(i) and the last column j_2(i) having the value one.
5: Determine for each column j of B the first row i_1(j) and the last row i_2(j) having the value one.
6: For each row i, set B(i, j) = 1 for j_1(i) < j < j_2(i).
7: For each column j, set B(i, j) = 1 for i_1(j) < i < i_2(j).
8: The area of interest is given by the pixels where B(i, j) = 1.
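A minimal Python sketch of Algorithm 2, assuming the frames are stacked into a single array and the rough area of interest is an axis-aligned bounding box (both assumptions of mine):

```python
import numpy as np

def area_of_interest(frames, T, bbox):
    """Sketch of Algorithm 2: binary area-of-interest mask from temporal changes.

    frames -- array of shape (L, H, W) with gray levels in [0, 1]
    T      -- difference threshold (e.g. 0.03 or 0.05)
    bbox   -- rough area of interest as (i_min, i_max, j_min, j_max)
    """
    diff = np.abs(np.diff(frames, axis=0))        # step 1: |g_k - g_{k-1}|
    B = (diff > T).any(axis=0)                    # step 2: binary change image
    mask = np.zeros_like(B)
    i0, i1, j0, j1 = bbox
    mask[i0:i1, j0:j1] = B[i0:i1, j0:j1]          # step 3: crop to the rough area
    filled = mask.copy()
    for i in range(filled.shape[0]):              # steps 4 and 6: fill each row
        cols = np.nonzero(mask[i])[0]
        if cols.size:
            filled[i, cols[0]:cols[-1] + 1] = True
    for j in range(filled.shape[1]):              # steps 5 and 7: fill each column
        rows = np.nonzero(mask[:, j])[0]
        if rows.size:
            filled[rows[0]:rows[-1] + 1, j] = True
    return filled                                 # step 8: area of interest
```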

2.4.2. Spatial keypoint matching

Spatial keypoint matching between images captured at the same time from different viewpoints is based on the standard approach (Lowe, 2004) where, for a keypoint in one image, the corresponding keypoint in another image is the one having a descriptor vector nearest to the descriptor vector of the keypoint in the first image, provided the ratio of vector angles from the nearest to the second nearest neighbor is less than a given threshold (0.6 in the experiments). In our approach, instead of testing all the keypoints in the other image, the search for the corresponding keypoint is limited to the neighborhood of the epipolar line (distance from the epipolar line less than 20 pixels). In the case where there is only one candidate point close to the epipolar line, the point is accepted if the vector angle to the descriptor vector of the keypoint in the first image is less than a threshold (0.8 radians in the experiments). For four images, the epipolar lines for a keypoint in the first image are computed to all the other images and the corresponding keypoints are determined in each of the other images first separately. The epipolar lines corresponding to the matched keypoint in the second image are then computed to the third and fourth images. If the keypoints in these latter images which were matched with the keypoint in the first image also match with the keypoint in the second image, then corresponding points in three or four images are established. The order of the images is varied so that keypoints in all the images are processed. Finally, the point labels set for the correspondences are automatically analyzed to merge non-conflicting groups of correspondences into larger ones and to remove conflicting ones. The 3-D object coordinates are computed using least squares adjustment and the mean precision of each object point is evaluated as the square root of one third of the trace of the covariance matrix of the 3-D point. The correspondences are kept for which the mean precision of the 3-D point is higher than the median plus half of the standard deviation of the mean precisions over all the points. The correspondences resulting in 3-D points of low mean precision are thus removed.
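The following sketch illustrates the constrained pairwise matching for one image pair. It assumes the epipolar geometry is available as a fundamental matrix F derived from the calibration and that points are given as (x, y) image coordinates; the helper name and descriptor handling are mine, while the thresholds (0.6, 20 pixels, 0.8 radians) are taken from the text.

```python
import numpy as np

def match_keypoints_epipolar(desc1, pts1, desc2, pts2, F,
                             ratio=0.6, max_dist=20.0, single_angle=0.8):
    """Sketch: ratio test on descriptor-vector angles, restricted to candidates
    lying near the epipolar line of each keypoint from image 1."""
    d1 = desc1 / np.linalg.norm(desc1, axis=1, keepdims=True)
    d2 = desc2 / np.linalg.norm(desc2, axis=1, keepdims=True)
    matches = []
    for i, (d, p) in enumerate(zip(d1, pts1)):
        # Epipolar line l = F x1 in image 2 and point-to-line distances.
        l = F @ np.array([p[0], p[1], 1.0])
        dist = np.abs(pts2[:, 0] * l[0] + pts2[:, 1] * l[1] + l[2]) / np.hypot(l[0], l[1])
        cand = np.nonzero(dist < max_dist)[0]
        if cand.size == 0:
            continue
        angles = np.arccos(np.clip(d2[cand] @ d, -1.0, 1.0))
        order = np.argsort(angles)
        if cand.size == 1:
            if angles[order[0]] < single_angle:          # lone candidate: absolute test
                matches.append((i, int(cand[order[0]])))
        elif angles[order[0]] < ratio * angles[order[1]]:
            matches.append((i, int(cand[order[0]])))     # nearest/second-nearest angle ratio
    return matches
```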


2.4.3. Temporal keypoint matching

Temporal keypoint matching between consecutive frames of a camera is performed using the standard approach (Lowe, 2004) as in the spatial keypoint matching described above. If a match is found in the subsequent frame, the tracking of the point continues using keypoint matching regardless of whether there are spatial correspondences for the point or not. If no temporal match is found, the tracking is switched to the least squares matching described in the next subsection for points which also have spatial correspondences. For other points, the tracking is terminated. In the subsequent frame, there may also appear new keypoints which are not matched with any keypoint of the current frame. These new keypoints start to be tracked with keypoint matching at the next iteration. The image sequences of the four cameras are processed at the same time. After tracking of keypoints to the subsequent frame has been performed separately for each camera, the spatial keypoint matching is carried out to find new correspondences between different views using all temporally matched and new keypoints of the subsequent frames. After the whole image sequences have been processed, the tracked paths of the keypoints are analyzed. If there appears a jump larger than a threshold (20 or 50 pixels in the experiments) in either image coordinate of the keypoint, then the path is split into two paths with a new keypoint assigned for the path after the jump.
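The final path-splitting check can be expressed compactly; the sketch below (helper name and array layout assumed) splits a tracked path wherever either image coordinate jumps by more than the threshold.

```python
import numpy as np

def split_track_on_jumps(track, max_jump=20.0):
    """Sketch: split a tracked path, given as an Nx2 array of (row, column)
    image coordinates over consecutive frames, at jumps larger than max_jump."""
    steps = np.abs(np.diff(track, axis=0))
    cuts = np.nonzero((steps > max_jump).any(axis=1))[0] + 1
    return np.split(track, cuts)      # each piece is treated as a new keypoint path
```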

2.5. Point tracking with affine multi-image least squares matching

Temporal tracking of corresponding points in multiple images is based on least squares matching for images where no temporal correspondence is found by keypoint matching. Once the matching has been changed to the least squares one, it remains so to the end of the sequence or as long as a solution is found by least squares matching. Consider a single object point for which the corresponding image points in M cameras at frame k are denoted by (I_k^n, J_k^n), n = 1, ..., M (without loss of generality, the cameras where the point appears are numbered by 1, ..., M). Let M = M_1 + M_2, where the points in cameras Q_1 = {1, ..., M_1} are tracked by keypoint matching and the points in cameras Q_2 = {M_1 + 1, ..., M} by least squares matching. It is assumed that 2 ≤ M ≤ 4, 1 ≤ M_2 ≤ M, and M_1 = M − M_2. For each camera n ∈ Q_2, an equally spaced grid X_k^n of points centered at (I_k^n, J_k^n) is constructed with points of the grid denoted by lower case letters (i_k^n, j_k^n) ∈ X_k^n. The gray levels g_k^n of the image are also bilinearly interpolated to the intermediate pixels of the grid and denoted by g̃_k^n. An affine transformation relating the current frame k and the next frame l = k + 1 is defined as

$$
\begin{pmatrix} i_{kl}^n \\ j_{kl}^n \end{pmatrix} =
\begin{pmatrix} a_1^n & a_2^n \\ a_3^n & a_4^n \end{pmatrix}
\begin{pmatrix} i_k^n - I_k^n \\ j_k^n - J_k^n \end{pmatrix} +
\begin{pmatrix} a_5^n \\ a_6^n \end{pmatrix}, \qquad
\tilde{g}_{kl}^n = \tilde{g}_k^n + a_7^n
\qquad (1)
$$

where a^n = (a_1^n ... a_7^n)^T are parameters to be estimated. For each pair of cameras m ∈ Q_1 and n ∈ Q_2, another affine transformation relating the different views is introduced as

$$
\begin{pmatrix} i_l^{mn} \\ j_l^{mn} \end{pmatrix} =
\begin{pmatrix} b_1^{mn} & b_2^{mn} \\ b_3^{mn} & b_4^{mn} \end{pmatrix}
\begin{pmatrix} i_l^m - I_l^m \\ j_l^m - J_l^m \end{pmatrix} +
\begin{pmatrix} a_5^n \\ a_6^n \end{pmatrix}, \qquad
\tilde{g}_l^{mn} = \tilde{g}_l^m + b_5^{mn}
\qquad (2)
$$

where b^{mn} = (b_1^{mn} ... b_5^{mn})^T are parameters to be estimated, (i_l^m, j_l^m) ∈ X_l^m is a grid of points centered at (I_l^m, J_l^m), and g̃_l^m are the bilinearly interpolated gray levels at the points of X_l^m. For m ∈ Q_2, n ∈ Q_2, m < n, the point is first transformed from view m to the first view and then to view n so that

$$
\begin{pmatrix} i_l^{mn} \\ j_l^{mn} \end{pmatrix} =
\begin{pmatrix} b_1^{1n} & b_2^{1n} \\ b_3^{1n} & b_4^{1n} \end{pmatrix}
\begin{pmatrix} b_1^{1m} & b_2^{1m} \\ b_3^{1m} & b_4^{1m} \end{pmatrix}^{-1}
\begin{pmatrix} i_l^m - a_5^m \\ j_l^m - a_6^m \end{pmatrix} +
\begin{pmatrix} a_5^n \\ a_6^n \end{pmatrix}, \qquad
\tilde{g}_l^{mn} = \tilde{g}_l^m - b_5^{1m} + b_5^{1n}
\qquad (3)
$$

where the grid points (i_l^m, j_l^m) are centered at (a_5^m, a_6^m). The positions of the grid points thus change during estimation of the parameters and the same holds also for the gray levels g̃_l^m interpolated to the changing grid points. If m = 1, then parameters b_1^{1m}, b_4^{1m} are considered to equal one and parameters b_2^{1m}, b_3^{1m}, b_5^{1m} to equal zero in Eq. (3). It should be noted that the parameters a_5^n, a_6^n appear conveniently in all Eqs. (1)–(3). This is essential for matching temporal and spatial image patches at the same time. The gray levels g_l^n are bilinearly interpolated to the transformed pixels (i_{kl}^n, j_{kl}^n) of Eq. (1) and (i_l^{mn}, j_l^{mn}) of Eqs. (2) and (3). The values of all the parameters a = {a^n | n ∈ Q_2} and b = {b^{mn} | m ∈ Q_1 ∪ Q_2, m < n ∈ Q_2} are estimated simultaneously by least squares minimization. The merit function to be minimized is given by

$$
\begin{aligned}
f(a,b) = {} & \sum_{n \in Q_2} \sum_{X_k^n} w_{kl}^n(i_k^n, j_k^n)
\left[ \tilde{g}_{kl}^n(i_k^n, j_k^n, a) - \tilde{g}_l^n(i_{kl}^n, j_{kl}^n, a) \right]^2 / W \\
& + \sum_{m \in Q_1 \cup Q_2,\, m<n} \; \sum_{n \in Q_2} \; \sum_{X_l^m} w_l^{mn}(i_l^m, j_l^m)
\left[ \tilde{g}_l^{mn}(i_l^m, j_l^m, a, b) - \tilde{g}_l^n(i_l^{mn}, j_l^{mn}, a, b) \right]^2 / W
\end{aligned}
\qquad (4)
$$

where the weights w_{kl}^n and w_l^{mn} equal one at points where g̃_{kl}^n − g̃_l^n and g̃_l^{mn} − g̃_l^n, respectively, are less than adaptive thresholds and zero elsewhere, and W equals the sum of all the weights. The first double sum in Eq. (4) takes care of temporal matching between consecutive frames k and l and the second triple sum is related to spatial matching between different views of frame l. The adaptive thresholds are updated at each iteration according to the statistical distribution of gray level differences between the matched image patches following the same updating scheme as proposed by Zhang (1994) for matching of curves or surfaces. Each pair of image patches has its own adaptive threshold which gets tighter as the iteration converges. Initial values for all the parameters a, b are set to zero except for a_5^n and a_6^n, which are given the values I_k^n and J_k^n, respectively, and b_1^{mn} and b_4^{mn}, which are set to one for m ∈ Q_1 ∪ Q_2, m < n ∈ Q_2. The minimization problem is solved using the Levenberg–Marquardt algorithm. After the iteration has converged, the tracked image points in the next frame are given by (I_l^n, J_l^n) = (a_5^n, a_6^n), n ∈ Q_2. However, if the iteration does not converge within 500 iterations or a transformed template image drifts outside a search image, then the tracking of the corresponding points is terminated. For the converged points, the covariance matrix of the estimated parameters is obtained from twice the inverse of the Hessian matrix of the merit function multiplied by the reference variance given by the value of the merit function per redundancy. The mean precision of each tracked image point is evaluated as the square root of the average of the variances of the image coordinates of the point.
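To make the structure of the merit function concrete, the sketch below evaluates the temporal residuals of Eqs. (1) and (4) for a single camera n ∈ Q_2. The function name, array conventions, and the use of scipy routines are my own assumptions, and the spatial terms of Eqs. (2)–(3) as well as the simultaneous adjustment over all cameras are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def temporal_residuals(a, grid_i, grid_j, I_k, J_k, tmpl_gray, search_img, thresh):
    """Sketch of the first (temporal) term of Eq. (4) for one camera in Q2.

    a         -- affine parameters (a1, ..., a7) of Eq. (1)
    grid_i/j  -- template grid coordinates (i_k^n, j_k^n) around (I_k, J_k)
    tmpl_gray -- interpolated template gray levels at the grid points
    search_img-- image of frame l, indexed as (row, column)
    thresh    -- current adaptive threshold for this patch pair
    """
    a1, a2, a3, a4, a5, a6, a7 = a
    di, dj = grid_i - I_k, grid_j - J_k
    # Eq. (1): affine mapping of the template grid into the next frame.
    i_kl = a1 * di + a2 * dj + a5
    j_kl = a3 * di + a4 * dj + a6
    # Radiometric shift of the template and bilinear resampling of the search image.
    g_kl = tmpl_gray + a7
    g_l = map_coordinates(search_img, np.vstack([i_kl, j_kl]), order=1, mode='nearest')
    diff = g_kl - g_l
    # Adaptive-threshold weighting of Eq. (4): keep only compatible pixels.
    w = (np.abs(diff) < thresh).astype(float)
    W = max(w.sum(), 1.0)
    return np.sqrt(w / W) * diff      # sum of squares equals the first term of Eq. (4)
```

Residuals of this form, stacked together with the corresponding spatial residuals for all patch pairs, could be passed to a Levenberg–Marquardt solver such as scipy.optimize.least_squares, with the adaptive thresholds re-evaluated between iterations as described above.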

2.6. Rigid motion estimation and elimination

Estimation of the rigid motion caused by the swaying of the human is based on detecting 3-D points which move only because of the swaying and not due to deformation of the face. The detection is based on analyzing changes in the image coordinates of the 3-D points between consecutive frames. The swaying leads to uniform movement of the image points in each view while additional deformation shows as a movement which deviates from the main trend. Let the changes in the image coordinates be ΔI^n = I_l^n − I_k^n and ΔJ^n = J_l^n − J_k^n, where (I_k^n, J_k^n) and (I_l^n, J_l^n) are tracked image points at consecutive frames k and l = k + 1, respectively, of camera n, and for both of which there exist 3-D points estimated using corresponding points in one or several other views. Assuming that less than half of the 3-D points visible in a view undergo a deformation, the main trends of movements in the image row and column directions are estimated as the medians of the changes in the image coordinates, Med_I(ΔI^n) and Med_J(ΔJ^n), where the medians are taken over all the image points of the view having a 3-D point in both frames k and l. The distances in the row and column directions from the medians are evaluated as d_I^n = |ΔI^n − Med_I(ΔI^n)| and d_J^n = |ΔJ^n − Med_J(ΔJ^n)|. The 3-D points for which the distances d_I^n, d_J^n are less than twice the standard deviations of these distances in all views n where the point has corresponding image points are regarded as points subject to rigid motion only. Note that the standard deviations are computed separately for each view and a point is an inlier if and only if the changes in the image coordinates are close to the medians in the row and column directions in every view.

The inlying 3-D points are used to estimate the rigid motion between the position of the object in the first frame of the sequence and the next frame l. This is done by transforming the 3-D points from the current frame k to the first frame according to the rigid motion estimated at the previous iteration and applying the method of singular value decomposition for rotation fitting and the positions of the centers of the 3-D data sets for translation estimation as in (Kanatani, 1994). Once the rigid motion has been estimated, the 3-D points of frame l are transformed to the coordinate system of the first frame. The remaining differences in 3-D coordinates with respect to the 3-D points estimated from the images of the frames where the points appear for the first time and transformed to the first frame show the 3-D deformation of the object. For analyzing the results, it is also adequate to compute the image observations where the effect of rigid motion has been eliminated. This is done by transforming the 3-D points to the camera coordinate systems of each camera and projecting the transformed points to the images according to the camera calibration data. When the rigid motion eliminated image coordinates are subtracted from the original image observations, the effect of rigid motion on the changes in the image coordinates is obtained. Subtracting the rigid motion eliminated image coordinates at the first occurrence of the point in the sequence from the rigid motion eliminated coordinates yields the influence of object deformation on the changes in the image coordinates.

Fig. 4. Tracked image points at segmented and cropped frames 35, 50, and 70 of camera 1 of face 3 multi-image sequence.
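The inlier test and the rigid fit can be sketched as follows in Python; the array layout and function names are assumptions for illustration, and the rotation fit is the standard centroid-plus-SVD procedure referred to above (cf. Kanatani, 1994), not necessarily the author's exact implementation.

```python
import numpy as np

def rigid_inliers(dI, dJ, visible):
    """Sketch of the inlier test above.

    dI, dJ  -- arrays of shape (num_points, num_views): changes Delta I, Delta J
    visible -- boolean mask of the same shape (point tracked and reconstructed in that view)
    Returns a boolean mask of points regarded as moving only rigidly.
    Entries of dI/dJ where visible is False may hold arbitrary values.
    """
    inlier = np.ones(dI.shape[0], dtype=bool)
    for n in range(dI.shape[1]):
        v = visible[:, n]
        if not v.any():
            continue
        for d in (dI[:, n], dJ[:, n]):
            res = np.abs(d - np.median(d[v]))     # distance from the per-view main trend
            ok = res < 2.0 * res[v].std()         # within twice the standard deviation
            inlier &= np.where(v, ok, True)       # views without the point do not vote
    return inlier

def fit_rigid(X_l, X_first):
    """Rotation via SVD and translation from the centroids of the two 3-D point
    sets: returns R, t such that X_first is approximately R @ X_l + t."""
    c1, c0 = X_l.mean(axis=0), X_first.mean(axis=0)
    H = (X_l - c1).T @ (X_first - c0)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R = Vt.T @ D @ U.T
    t = c0 - R @ c1
    return R, t
```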

Fig. 5. Tracked image points at segmented and cropped frames 70, 160, and 235 of camera 2 of face 1 multi-image sequence.

3. Results and discussion

3.1. Tracking

The proposed method for tracking of image points was tested with multi-image face sequences of three different persons denoted by faces 1–3. The multi-image sequences contained 440 (face 1), 600 (face 2), and 250 (face 3) frames captured at 20 fps with the four synchronized, convergent, and calibrated cameras. The multi-image sequences fulfilled the assumptions described in Section 2.3 except that there appeared also fast changes of facial deformations in some parts of the sequence of face 3. The facial expressions were also much stronger in face 3 while the sequences of faces 1 and 2 were characterized by slowly varying minor deformations. All the cameras were located at the same height in the first sequence while in the second sequence, cameras 1 and 3 were at a lower height looking slightly upwards and cameras 2 and 4 were at a higher height looking straight ahead. In the third sequence, cameras 1 and 3 were at a lower height looking straight ahead and cameras 2 and 4 were at a higher height looking slightly downwards. An attempt was made to obtain good imaging geometry in all setups, taking into consideration the restriction that the cameras were on tripods side by side.

The upper row in Fig. 4 shows three segmented and cropped frames of camera 1 of face 3 multi-image sequence with different facial expressions. The segmentation is given by the area of interest so that gray levels outside the area of interest have been set to zero. In the lower row of Fig. 4, the white plus signs illustrate the image points for which there exist spatial corresponding points in at least one of the other cameras. The red lines show the temporal correspondences tracked all the way from frame 35 to 50 and from frame 50 to 70 using sequential keypoint matching and affine multi-image least squares matching. There appear also points for which there are no temporal correspondences in the other frames, because the tracking of these points has terminated between the frames or they may be new keypoints for which the tracking has begun between the frames. Closer inspection of Fig. 4 mainly shows that the tracked points are qualitatively in correct places. Similar results are obtained also for faces 1 and 2 as illustrated in Figs. 5 and 6, respectively.

Fig. 6. Tracked image points at segmented and cropped frames 180, 225, and 442 of camera 3 of face 2 multi-image sequence.

The affine multi-image least squares matching is illustrated in more detail in Fig. 7. At frame k = 385 of face 1 multi-image sequence, we have an object point with corresponding image points in all the four cameras. The image point in camera 1 is traced to the next frame k + 1 by keypoint matching and the image points in the other three cameras by affine multi-image least squares matching. The first row in Fig. 7 shows the fixed templates of size 31 pixels square at frame k, each of which is matched against the search window of size 91 pixels square of the same camera at the next frame shown on the second row. The third row shows the changing templates of cameras 2 and 3 at frame k + 1 which are matched against the search windows of cameras 3 and 4, and 4, respectively. The fixed template of camera 1 at frame k + 1 is matched against all the search windows. The blue plus signs in the search windows are the initial locations of the image points given by their locations at the previous frame k. The red plus signs in the search windows show the refined locations obtained by simultaneous matching of all the templates against the search windows. The algorithm performs very well in this example. It can also be noticed that the cameras are not perfectly synchronized as the eye is closed in cameras 3 and 4 but slightly open in camera 2 at frame k + 1.

In order to demonstrate the advantages of the proposed multi-image matching method, it was compared to a simpler method, where the changing templates at frame k + 1 were replaced by the fixed templates at frame k when matching between the different views. In other words, the template of camera 1 at frame k was matched against all the search windows at frame k + 1, the template of camera 2 at frame k against search windows 2–4 at frame k + 1 (assuming there are 4 search windows), etc. Comparison between the results obtained using the multi-image sequence of face 1 shows that for all the corresponding points tracked, the distances between the resulting image points given by the two methods are 1.7 pixels on the average with a standard deviation of 4.3 pixels. The average distances vary for different cameras, being 0.4, 2.5, 1.9, and 3.1 pixels for cameras 1–4, respectively. The standard deviations of the distances are 1.5, 5.2, 5.1, and 5.3 pixels, respectively. The smallest differences for camera 1 are evidently due to the order in which the cameras are processed in the matching algorithms. If the order of the cameras is reversed, then the smallest differences are obtained for camera 4. For most of the points the differences are rather small as the median of the distances is 0.05 pixels and 75% and 90% of the distances are less than 1.4 and 4.4 pixels, respectively. There appear, however, some points where the two methods yield clearly different results. To see which method performs better, the points where the maximum distance within the sequence exceeded 10 pixels were analyzed visually. It turned out that the proposed method gave a better solution in four cases while in one case, the point drifted to a wrong solution by both of the methods. Fig. 8 shows one of the tests where the point has corresponding image points in cameras 1 and 2. At frame 211, the image points given by the two methods are very close to each other in both cameras. When the eye is closed and the muscles are stretched, the proposed method keeps the correct location while the other method starts to move the point out of the eye as can be observed at frame 250. This movement continues after the eye is opened and the muscles are relaxed, and the point ends up clearly out of the eye at frame 421. The proposed method, in turn, yields a slightly displaced location within the eye at frame 258 but the location returns to the original place by frame 421. The power of the proposed method in this test is most likely due to the changing templates at frame k + 1 which take into consideration the rapidly changing scene missed by the fixed templates at frame k. Another test in Fig. 9 illustrates a point on the upper lip with corresponding image points in cameras 1 and 4. In this case, the locations given by the SIFT keypoint matching are not at the same places on the lip in the cameras, which means that the spatial correspondence is originally false. However, the proposed method is able to correct the location in camera 4 by moving the image point to the right during frames 279–291. The other method keeps the point in the false location. In addition to systematic differences, the mean precisions of the tracked image points were compared. The mean precisions of points tracked by the proposed method were 0.03 pixels on the average and those by the other method were 0.02 pixels on the average. The fixed templates thus give slightly more precise solutions than the changing templates of the proposed method.

The performance of the proposed tracking method was analyzed visually in areas where large changes in deformation occurred between consecutive frames, using the multi-image sequence of face 3. In order to improve the performance of the matching algorithm, larger template images of 51 pixels square and search images of 151 pixels square were used than for the other multi-image sequences of faces 1 and 2. Although some minor misplacements were found at a couple of points mainly on the forehead, the tracking results were otherwise very good. Fig. 10 shows an example where the mouth area is deforming considerably from frame 145 to 153. Still, the matching algorithm is able to track correctly the point round about the corner of the mouth. Fig. 11 further illustrates the matched gray level surfaces of the template and search images within the overlapping areas when tracking from frame 146 to frame 147. The red surfaces are the template images and the blue ones are the search images. The matching works even if the surfaces differ from each other due to facial deformation occurring between the frames or due to the different viewpoints of the cameras. The tracking results for face 2 multi-image sequence were also found very good. Fig. 12 shows a successful tracking of a point on the upper lip when the mouth is closing and the lips are deforming. The performance of the weighting technique used in the matching algorithm is demonstrated in Fig. 13. The weighting based on the adaptive thresholds restricts the matching to compatible areas while pixels where the gray levels of the matched images differ much are discarded. Consequently, in temporal matching in Fig. 13a–b, the gray level surfaces are perfectly aligned except in the middle where deformation has occurred and the edge in the surface profile has moved as the mouth closes. The pixels in this area are thus not used in the matching. Simultaneous spatial matching between the different cameras of the same frame results in perfectly aligned surfaces shown in Fig. 13c.

Fig. 7. The initial estimates shown as blue plus signs in the search windows are refined to red plus signs as a result of tracking in face 1 by affine multi-image least squares matching.

Fig. 8. Comparison of affine multi-image least squares matching (red plus signs) to another method with fixed templates only (blue x-marks) using face 1 multi-image sequence.

Fig. 9. The affine multi-image least squares matching (red plus signs) is able to correct the location of the point in camera 4 of face 1 multi-image sequence unlike the other method with fixed templates only (white x-marks).

3.2. Rigid motion elimination

The results for elimination of the effect of rigid motion from the image row and column coordinates are illustrated for camera 2 of face 1 multi-image sequence in Figs. 14 (row coordinates) and 16 (column coordinates). Figs. 14 and 16a show the paths of all the measured image points obtained by tracking through the whole image sequence. It can be seen that there are clear trends in the paths of the image points. The effect of rigid motion has then been eliminated in Figs. 14 and 16b. The paths have now straightened and the overall trends disappeared but local variations remained, such as between rows 410–510 round frame 75 in Fig. 14b, a zoomed view of which can be seen in Fig. 15a. The number of image points has also reduced as Figs. 14 and 16b–d include only image points having corresponding points in one or several other cameras. Subtracting the image coordinates in Figs. 14 and 16b from the corresponding ones in Figs. 14 and 16a, respectively, yields the effect of rigid motion on the changes in the image coordinates shown in Figs. 14 and 16c. The image paths are clearly consistent in these figures. The influence of facial deformation on the changes in the image coordinates is illustrated in Figs. 14 and 16d, obtained from Figs. 14 and 16b by subtracting from the image row and column coordinates of each path the row and column coordinates, respectively, of the same path at the frame where the point appears for the first time. Figs. 14 and 16d point out the accumulation of errors in the sequential estimation of rigid motion as there is a dispersing trend in the paths of the image points in addition to the effect of deformation. Similar results as in Figs. 14 and 16 are also obtained for the other cameras of face 1 multi-image sequence and also for the other faces as illustrated for the column coordinates of camera 3 of face 2 multi-image sequence in Fig. 17 and for the row coordinates of camera 4 of face 3 multi-image sequence in Fig. 18. The effect of rigid motion shown in Fig. 17c points out not only the overall trend but also the smaller scale curly variations in the paths of the image points. In face 3 multi-image sequence, there appear also changes in deformation which cover most of the facial area, clearly visible, e.g., between frames 67 and 74 in Fig. 18b.

Fig. 10. Perfect tracking of a point (red plus sign) on face 3 within an area of large change in deformation by affine multi-image least squares matching.

Fig. 11. The gray level surfaces after affine multi-image least squares matching when tracking the point in Fig. 10 from frame 146 to frame 147 of face 3 multi-image sequence. (a) Temporal matching in camera 1. (b) Temporal matching in camera 3. (c) Spatial matching between cameras 1 and 3.

It is next visually illustrated that the effect of rigid motion has been successfully eliminated and that the changes in facial deformations can be correctly detected, at least as regards points where large deformations are occurring and which are thus easier to verify visually. The points where large changes in the image coordinates appear between consecutive frames due to changes in object deformation are thus detected by thresholding with different thresholds for the row and column directions of each camera. Each threshold is given by the median of the absolute changes in the rigid motion eliminated image row or column coordinates plus 1.5 times the standard deviation of the changes in the rigid motion eliminated image row or column coordinates over all the points and all the frames of the camera in question. This resembles the determination of points moving only due to object rigid motion between consecutive frames described in Section 2.6, but differs from it in that the medians and standard deviations are now computed over larger sets and the effect of rigid motion has already been eliminated from the image coordinates. The factor 1.5 has been determined experimentally. Fig. 19 shows an example where one or both of the image coordinates of the points marked with red plus signs undergo a large change due to object deformation while the points marked with white plus signs stay still or change only due to object rigid motion. The red lines visualize the directions and magnitudes of the changes in the rigid motion eliminated coordinates from the previous frame to the current frame so that the changes are towards the plus signs. For a better visual presentation, the lengths of the lines are 20 times the actual changes in the image coordinates. Most of the points on the face are correctly white at frames just before and after the change in deformation. Although the face is in a deformed state, the points are white as there are no changes in deformation occurring at these frames. The opening of the mouth and lowering of the chin in Fig. 19 corresponds to the jumps of about 10 pixels round frame 40 in the deformation curves in Fig. 15b and also in the motion eliminated paths round image rows 450–520 and frame 40 in Fig. 15a. Good results are also obtained for the other cameras of face 1 multi-image sequence.

Fig. 12. Tracking of a point (red plus sign) on face 2 by affine multi-image least squares matching.

Fig. 13. The gray level surfaces after affine multi-image least squares matching when tracking the point in Fig. 12 from frame 445 to frame 446 of face 2 multi-image sequence. (a)–(b) Temporal matching. (c) Spatial matching between cameras 1 and 3.

Fig. 14. Image row coordinates of tracked points in camera 2 of face 1 multi-image sequence as a function of the frame number. (a) Original measurements. (b) Rigid motion eliminated. (c) Effect of rigid motion. (d) Effect of deformation.

Fig. 15. (a) Zoomed view of Fig. 14b (rigid motion eliminated image row coordinates). (b) Zoomed view of Fig. 14d (effect of deformation).

Fig. 20 further illustrates elimination of the effect of rigid motion at frame 245 of camera 3 of face 2 multi-image sequence. The left image is without rigid motion elimination and the right image with rigid motion elimination. For the left image in Fig. 20, the thresholds (median plus 1.5 times the standard deviation) used for detection of large changes in the image row and column coordinates between consecutive frames are computed using changes in the original image observations instead of the rigid motion eliminated coordinates. The movements of the image points are mainly caused by the rigid motion of the head and one cannot easily see the actual change in facial deformation in the left image, while after elimination of the effect of rigid motion in the right image, the raising of the eyebrows is clearly pointed out together with the magnitude and direction of the change in the rigid motion eliminated image coordinates. The results for the previous frames 242–244 together with frame 245 within the facial area around the eyebrows are shown in Fig. 21. It can be seen that the effect of rigid motion is clearly removed at frame 242, while at frames 243 and 244 the head stays mainly at the same place and the movements of the image points are mostly similar with and without rigid motion elimination. As in Fig. 20, elimination of the effect of rigid motion is illustrated at frame 177 of camera 4 of face 3 multi-image sequence in Fig. 22.
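Overlays such as those in Figs. 19–22 can be produced, for example, with OpenCV along the following lines (an illustrative sketch; the frame image, the point coordinates, and the deformation flags are assumed to be available, and the 20-fold scaling of the lines follows the description above):

import cv2

def draw_deformation_overlay(image, prev_pts, curr_pts, deforming, scale=20):
    # image: one camera frame; prev_pts and curr_pts: (n, 2) arrays of (i, j) =
    # (row, column) coordinates at the previous and current frame; deforming:
    # boolean array of length n obtained from the adaptive thresholding.
    out = image.copy()
    for (i0, j0), (i1, j1), flag in zip(prev_pts, curr_pts, deforming):
        color = (0, 0, 255) if flag else (255, 255, 255)   # red or white in BGR
        # Plus sign at the current position; OpenCV expects (x, y) = (column, row).
        cv2.drawMarker(out, (int(round(j1)), int(round(i1))), color,
                       markerType=cv2.MARKER_CROSS, markerSize=9, thickness=1)
        if flag:
            # Line ending at the plus sign, its length being 20 times the
            # actual change from the previous to the current position.
            start = (int(round(j1 - scale * (j1 - j0))),
                     int(round(i1 - scale * (i1 - i0))))
            cv2.line(out, start, (int(round(j1)), int(round(i1))), color, 1)
    return out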

In the left image, most of the points are moving to the right due to the rigid motion of the head, while in the right image the deformation around the mouth is clearly pointed out. The points above the mouth are labeled stable without rigid motion elimination, while after elimination of the effect of rigid motion they are found to be actually moving towards the southwest.

The 3-D facial deformation at each object point, taken with respect to the 3-D position of the point at the frame where the point appears for the first time and transformed to the first frame, can be visualized as a sequence of sets of vectors in 3-D space. However, since we do not have a dense surface model for the face, such pictures are less informative and we prefer showing the results in the image space as in Figs. 19–22.

3.3. Different number of cameras

It was studied how the tracking results depend on the number of cameras. For the elimination of the effect of rigid motion, we need at least two cameras and, on the other hand, we have only four cameras available, so this study is limited to two to four cameras. The mean precisions of the tracked image points averaged over all the points are shown for the different faces and different combinations of the cameras in Table 1. The cases consisting of three or two cameras have been computed using those correspondences established in the case of four cameras which appear only in the three or two cameras in question. This means that some correspondences, which might have appeared if the keypoint extraction and the tracking had been performed again using the cameras in question, are missing. Nevertheless, there are plenty of points over which the mean precisions have been averaged, as given in Table 2. These numbers indicate the total number of 3-D points estimated from the whole image sequences and averaged over the different camera combinations. In the tables, M denotes the number of cameras where the point has corresponding points (cf. Section 2.5). In addition to the number of cameras, the imaging geometries are different for the different camera combinations and for the different faces. The results in Table 1 show that there are no large differences in the mean precisions of the image points tracked by affine multi-image least squares matching between the different numbers or imaging geometries of the cameras. For faces 2 and 3, precise results are obtained when the point has corresponding image points in all the four cameras, but for face 1, somewhat higher mean precisions are obtained by many of the other camera combinations when the point has corresponding image points in only two cameras.

Fig. 16. Image column coordinates of tracked points in camera 2 of face 1 multi-image sequence as a function of the frame number. (a) Original measurements. (b) Rigid motion eliminated. (c) Effect of rigid motion. (d) Effect of deformation.

Fig. 17. Image column coordinates of tracked points in camera 3 of face 2 multi-image sequence as a function of the frame number. (a) Original measurements. (b) Rigid motion eliminated. (c) Effect of rigid motion. (d) Effect of deformation.

Fig. 18. Image row coordinates of tracked points in camera 4 of face 3 multi-image sequence as a function of the frame number. (a) Original measurements. (b) Rigid motion eliminated. (c) Effect of rigid motion. (d) Effect of deformation.

Fig. 19. The mouth is opening and the chin is lowering at cropped frame 40 of camera 2 of face 1 multi-image sequence. The points marked with red plus signs are deforming, with the red lines showing the direction and 20 times the magnitude of the change, while the points marked with white plus signs move only due to rigid motion.

Fig. 20. Large changes in the image coordinates without (left image) and with (right image) elimination of the effect of rigid motion at cropped frame 245 of camera 3 of face 2 multi-image sequence. The raising of the eyebrows is clearly pointed out in the right image.

Fig. 21. The raising of the eyebrows is visible at all cropped frames 242–245 after elimination of the effect of rigid motion (lower row of images) while it is confused with the rigid motion if not eliminated (upper row of images) at cropped frames 242 and 245 of camera 3 of face 2 multi-image sequence.

Fig. 22. Large changes in the image coordinates without (left image) and with (right image) elimination of the effect of rigid motion at cropped frame 177 of camera 4 of face 3 multi-image sequence. The facial deformation around the mouth is correctly pointed out in the right image.

Table 1. Mean precisions of the image points tracked by affine multi-image least squares matching on the average in pixels.

Cameras       Face 1                   Face 2                   Face 3
              M=4     M=3     M=2      M=4     M=3     M=2      M=4     M=3     M=2
1, 2, 3, 4    0.042   0.038   0.033    0.025   0.026   0.027    0.014   0.018   0.017
1, 2, 3       –       0.079   0.030    –       0.027   0.026    –       0.018   0.019
1, 2, 4       –       0.036   0.031    –       0.028   0.027    –       0.017   0.020
1, 3, 4       –       0.048   0.042    –       0.026   0.027    –       0.020   0.016
2, 3, 4       –       –       0.046    –       0.025   0.027    –       0.018   0.015
1, 2          –       –       0.029    –       –       0.026    –       –       0.020
1, 3          –       –       0.038    –       –       0.027    –       –       0.020
1, 4          –       –       0.041    –       –       0.030    –       –       0.021
2, 3          –       –       0.030    –       –       0.026    –       –       0.017
2, 4          –       –       0.054    –       –       0.028    –       –       0.018
3, 4          –       –       0.044    –       –       0.026    –       –       0.014

Table 2. Number of 3-D points used to estimate the mean precisions on the average.

Face    M=4    M=3     M=2
1       942    1468    34337
2       314    8463    141089
3       267    2096    78839

Table 3. RMS and maximum absolute difference (in pixels) of the image points tracked by affine multi-image least squares matching using a subset of cameras when compared to using all the four cameras.

Cameras    Face 1           Face 2           Face 3
           Max     RMS      Max     RMS      Max     RMS
1, 2, 3    2.09    0.31     0.80    0.08     7.50    0.56
1, 2, 4    3.51    0.49     0.77    0.11     8.03    0.74
1, 3, 4    2.52    0.32     0.48    0.09     6.35    0.65
2, 3, 4    4.37    0.52     0.84    0.19     9.42    1.35
1, 2       1.87    0.25     0.68    0.10     5.09    0.52
1, 3       4.21    0.47     0.34    0.08     7.77    0.68
1, 4       1.88    0.49     0.93    0.14     11.40   0.90
2, 3       4.16    0.56     0.84    0.19     8.06    1.28
2, 4       2.34    0.45     0.84    0.20     11.63   1.32
3, 4       11.01   1.48     0.80    0.21     9.25    1.54

Possible systematic tracking errors were investigated by comparing the results given by the different camera combinations to the results obtained by using all the four cameras. The points which had corresponding image points tracked by affine multi-image least squares matching in all the four cameras were selected as test points. The analysis was done frame by frame so that the position of each test point was initialized according to the case of four cameras for each frame. In other words, the points at different frames were considered as different test points which were tracked to the next frames and compared to the case of four cameras. The comparison results are shown in Table 3. The differences are small for face 2, somewhat larger for face 1, and largest for face 3. The cases where large differences appeared were analyzed visually. It was found that in five cases, the setup of four cameras gave a correct solution while several other camera combinations yielded false tracking. In one case, it was the other way round, namely, the four-camera setup failed and several other camera combinations succeeded. There were also two cases which none of the camera combinations could handle. These included points on the forehead with large motion between successive frames. Figs. 23 and 24 show an example where a point on the forehead of face 3 is successfully tracked over all the frames when all the four cameras are used, while the tracking results are misplaced at frames 40 and 42 if only cameras 2, 3, and 4 are used. At these frames, the image points are about 5 pixels below those of the four-camera setup. Note that the solution in Fig. 24 is not a tracked one from frame 39 to 43, but the initial positions of the image points at each frame are the positions at the previous frame given by the four-camera solution. Based on these experiments, we may conclude that overall, the best tracking results are obtained when all the four cameras are used.
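The frame-by-frame comparison summarized in Table 3 amounts to pooling the coordinate differences of the test points; a minimal NumPy sketch of that bookkeeping (array shapes and names are assumptions) is:

import numpy as np

def subset_vs_full_differences(full, subset):
    # full, subset: arrays of shape (n_frames, n_points, 2) with the image
    # coordinates of the test points tracked using all four cameras and using
    # a camera subset; for the subset, each frame is initialized from the
    # four-camera positions of the previous frame.
    diff = subset - full
    valid = ~np.isnan(diff).any(axis=-1)
    d = diff[valid]
    rms = np.sqrt(np.mean(d ** 2))       # pooled over both coordinate directions
    max_abs = np.max(np.abs(d))
    return rms, max_abs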

Fig. 23. Correct tracking of a point on the forehead of face 3 with all the four cameras (cameras 1–4, frames 39–43). The small image in the upper left corner of each image provides a close-up around the point. The head moves upwards between the frames.

Fig. 24. Misplaced tracking at frames 40 and 42 when only cameras 2, 3, and 4 are used (frames 39–43). The initial estimates for each frame are the positions at the previous frame given in Fig. 23.

Table 4. Computing times and numbers of points per frame of four cameras on the average.

                                            CPU time/frame (s)    Number of points
Keypoint detection and matching             1.6                   481
Multi-image least squares matching          110.7                 438
Rigid motion estimation and elimination     0.1                   223

The precision of a 3-D point estimated from image measurements usually increases when the number of images where the point appears increases and when the imaging geometry of the cameras improves (Fraser, 1984; Remondino and El-Hakim, 2006). The improvement propagates to the estimated rigid motion parameters and improves the precision of the final 3-D deformation estimated at each point. When multiple cameras with adequate viewing directions are used, all the details of the face can also be measured better than with only, say, two cameras. It is thus advantageous to use all the four cameras instead of a subset of them.
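To make the role of the number of rays and of the imaging geometry concrete, the following sketch triangulates a single point from several calibrated views and propagates an assumed image measurement precision to the 3-D point (standard linear triangulation with first-order covariance propagation; an illustration only, not the estimator used in this work):

import numpy as np

def triangulate_with_covariance(P_list, uv_list, sigma_px=0.05):
    # P_list: list of 3 x 4 projection matrices of the calibrated cameras in
    # which the point is visible; uv_list: observed (u, v) image coordinates,
    # one pair per view; sigma_px: assumed precision of the image observations.
    A = []
    for P, (u, v) in zip(P_list, uv_list):   # linear (DLT) triangulation
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    A = np.asarray(A)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1][:3] / Vt[-1][3]

    J = []                                   # Jacobian of the projections w.r.t. X
    for P in P_list:
        x = P @ np.append(X, 1.0)
        J.append((P[0, :3] * x[2] - x[0] * P[2, :3]) / x[2] ** 2)
        J.append((P[1, :3] * x[2] - x[1] * P[2, :3]) / x[2] ** 2)
    J = np.asarray(J)

    # More rays and a stronger geometry make J.T @ J better conditioned and
    # the propagated covariance of the 3-D point smaller.
    cov = sigma_px ** 2 * np.linalg.inv(J.T @ J)
    return X, cov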

3.4. Computing times

The computing times for tracking the image points are long in our current implementation using Matlab software on an HP ProLiant DL580 G7 server of CSC – IT Center for Science Ltd. The SIFT keypoint detection was implemented using demo software with compiled binaries available at David Lowe's web page. As shown in Table 4, the CPU time for keypoint detection and spatial and temporal keypoint matching is moderate, while the tracking by affine multi-image least squares matching without precision estimation is much slower. However, the trackings of different object points by multi-image matching are independent and could thus be parallelized, which would shorten the computing time considerably. The estimation and elimination of rigid motion is already fast.
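Because the per-point matchings are independent, a parallel implementation could be as simple as distributing them over worker processes, for example (a sketch; match_one_point stands for the actual multi-image matching routine and is not defined here):

from multiprocessing import Pool

def track_frame(match_one_point, point_tasks, n_workers=8):
    # match_one_point: function performing the affine multi-image least squares
    # matching of one object point (a placeholder for the actual routine);
    # point_tasks: list of its argument tuples, one entry per tracked point.
    # The matchings of different points are independent, so they can be
    # distributed over worker processes.
    with Pool(processes=n_workers) as pool:
        return pool.starmap(match_one_point, point_tasks)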

3.5. General comments and discussion

When the face experiences a deformation, the image of the face undergoes a transformation which may be rather complex, including both geometric and radiometric changes. The experiments showed that the proposed affine multi-image least squares matching is able to track points on the deforming facial surface, but the points are not allowed to move too much between consecutive frames, as assumed in Section 2.3. This is because the initial locations need to be close enough to the true ones for the merit function in Eq. (4) to converge to a correct solution, as is usual in nonlinear least squares adjustment. The convergence is also facilitated by using template images of appropriate size (31 or 51 pixels square) so that they contain enough gray level variation within the image patches and can handle large changes in deformation, leaving enough similar areas in the patches matched. In the experiments, there are only a few points per frame, if any, where the iteration does not converge.
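The role of the initial values and of the gray level variation within the templates can be illustrated with a strongly simplified, single-template version of least squares matching (affine geometric model only, no radiometric parameters, and none of the multiple spatial and temporal templates of Section 2.3; the image names and the parameterization are choices made for this sketch):

import numpy as np
from scipy.ndimage import map_coordinates

def affine_lsm(search, template, i0, j0, n_iter=30, tol=1e-3):
    # search, template: 2-D gray level arrays, the template being, e.g., 31 or
    # 51 pixels square; (i0, j0): initial position of the template centre in
    # the search image.
    h, w = template.shape
    di, dj = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    di, dj = di.ravel().astype(float), dj.ravel().astype(float)
    t = template.ravel().astype(float)

    p = np.zeros(6)                          # shifts and affine corrections
    search_f = search.astype(float)
    gi_img, gj_img = np.gradient(search_f)

    for _ in range(n_iter):
        # Affine mapping from template offsets to search image coordinates.
        i = i0 + p[0] + (1.0 + p[2]) * di + p[3] * dj
        j = j0 + p[1] + p[4] * di + (1.0 + p[5]) * dj
        s = map_coordinates(search_f, [i, j], order=1)
        gi = map_coordinates(gi_img, [i, j], order=1)
        gj = map_coordinates(gj_img, [i, j], order=1)

        r = s - t                            # gray level residuals
        A = np.column_stack([gi, gj, gi * di, gi * dj, gj * di, gj * dj])
        dp, *_ = np.linalg.lstsq(A, -r, rcond=None)   # Gauss-Newton update
        p += dp
        if np.linalg.norm(dp[:2]) < tol:     # convergence of the shift
            break
    return i0 + p[0], j0 + p[1], p

If the initial position is too far from the true one, or the template lacks gray level variation, the adjustment becomes ill-conditioned and the iteration may diverge, which is the behaviour described above.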

In the first stage of tracking, we used SIFT keypoints mainly because we wanted spatial correspondences between different views without an additional search based on least squares adjustment, and, on the other hand, SIFT-based descriptors have been found to outperform other descriptors (Mikolajczyk and Schmid, 2005). Instead of SIFT keypoints, other interest point detectors and descriptors might be used as well. See Gauglitz et al. (2011), Mikolajczyk et al. (2005), and Mikolajczyk and Schmid (2005) for comparisons of various detectors and descriptors. As our multi-image matching is formulated using affine transformations, we also tested Harris-Affine and Hessian-Affine interest points with SIFT descriptors for the face 1 multi-image sequence. These combinations resulted, however, in short tracking paths of only 2.2 and 2.4 frames on the average for the Harris-Affine and Hessian-Affine detector, respectively, after which no temporal correspondence was found for the point. Using SIFT keypoints, the average length of the paths successfully tracked was 10.8 frames. Our affine multi-image least squares matching could then track the points for 116 frames on the average.
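For reference, keypoint correspondences of the kind used in the first tracking stage can be obtained with common libraries; the following OpenCV sketch applies Lowe's ratio test, which is an assumption here since the present work relied on Lowe's demo software and its own handling of spatial and temporal correspondences:

import cv2

def sift_matches(img_a, img_b, ratio=0.8):
    # img_a, img_b: gray level images, either two views of the same frame
    # (spatial matching) or consecutive frames of one camera (temporal matching).
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = []
    for m, n in matcher.knnMatch(des_a, des_b, k=2):
        if m.distance < ratio * n.distance:   # Lowe's ratio test
            pairs.append((kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt))
    return pairs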

The rigid motion elimination was based on the assumption that less than half of the 3-D points visible in a view undergo a deformation at a time, so that the main trend can be estimated using the median of the changes in the image coordinates. This assumption may be restrictive in some cases where changes of facial expressions include deformations within the whole face. In the experiments, there also appeared points on the neckline and clothes of the upper body of the test persons. These points do not move consistently with the points on the head if the person turns his or her head, and this may result in a false alarm of deformation on the neckline or upper body. A solution would be to segment out these points. In the experiments, however, the test persons held their heads mostly in the same positions with respect to the upper body and the points on the neckline and upper body were thus included in the analysis.
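As background for this step, once the points moving only rigidly between two consecutive frames have been identified, the rigid transformation of their 3-D coordinates can be estimated in closed form, for instance with the familiar SVD-based rotation fitting (a generic sketch in the spirit of the rotation fitting analysed by Kanatani (1994); it is not claimed to be the estimator of Section 2.6):

import numpy as np

def rigid_transform(X_prev, X_curr):
    # X_prev, X_curr: (n, 3) arrays of corresponding 3-D points assumed to move
    # only rigidly between two consecutive frames; returns R and t such that
    # X_curr is approximately R @ X_prev + t in the least squares sense.
    c_prev, c_curr = X_prev.mean(axis=0), X_curr.mean(axis=0)
    H = (X_prev - c_prev).T @ (X_curr - c_curr)      # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid a reflection
    R = Vt.T @ D @ U.T
    t = c_curr - R @ c_prev
    return R, t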

4. Conclusions

The paper has proposed a new method for tracking corresponding points in multi-image sequences based on simultaneous matching of multiple spatial and temporal template images against each search image. The novelty is that the spatial template images are from the same frame as the search images, which implies that these template images change during adjustment rather than being fixed. The test results prove that the proposed method gives better results for rapidly changing areas of the face than if only fixed template images of the previous frame were used. In one test case, the proposed method is also able to correct a small misplacement of corresponding points originally determined by SIFT keypoint detection and matching. On the other hand, the proposed method yields slightly lower mean precision for the tracked image points than if fixed template images were used. The experimental tests also show that the best tracking results are obtained when all the four cameras are used instead of a subset of three or two cameras.

The paper has presented a new method for elimination of the effect of the rigid motion of the head and estimation of the pure deformation of the face based on analyzing changes in the image coordinates of the tracked corresponding points in multiple views. The test results verify the good performance of the method: uniform movements in the paths of the image points disappear after elimination of the effect of rigid motion, and the remaining large changes in the paths occur only in areas where the face is truly deforming. The assumptions include that less than half of the points are deforming at a time and that the cameras have been calibrated and the exterior orientations solved in advance. Related to calibration, the paper has proposed a novel adaptive threshold for extraction and tracking of circular targets of a moving calibration object. This adaptive threshold works very well for the image sequences tested.

Overall, the emphasis of the paper has been on developing highly accurate methods suitable also for applications other than tracking facial deformations. Such applications may include monitoring long-term deformations of various construction targets such as bridges or roofs of buildings under a heavy load of snow. In these applications, the processing time is not so critical, while a parallel implementation of the multi-image least squares matching algorithm, adequate for studying facial deformations, is left for future research.

Acknowledgments

We would like to thank Henrik Haggrén and Tapani Jokinen for acting as test persons in addition to the author. Henrik Haggrén is also acknowledged for fruitful discussions on tracking changes in image measurements. David Lowe is acknowledged for providing demo software for SIFT keypoint detection and matching (Lowe, 2005). Harris-Affine and Hessian-Affine detectors were implemented using software available online (Mikolajczyk, 2007). The computations were performed on an HP ProLiant DL580 G7 server of CSC – IT Center for Science Ltd.

References

Andriyenko, A., Schindler, K., Roth, S., 2012. Discrete-continuous optimization for multi-target tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June, pp. 1926–1933.

Baltrusaitis, T., Robinson, P., Morency, L.-P., 2012. 3D constrained local model for rigid and non-rigid facial tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June, pp. 2610–2617.

Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., Gross, M., 2011. High-quality passive facial performance capture using anchor frames. In: Proc. ACM SIGGRAPH, Vancouver, BC, Canada, 7–11 August, ACM Transactions on Graphics 30 (4), (Paper No. 75).

Berclaz, J., Fleuret, F., Türetken, E., Fua, P., 2011. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (9), 1806–1819.

D'Apuzzo, N., 2003. Surface Measurement and Tracking of Human Body Parts from Multistation Video Sequences. Ph.D. Thesis, Institute of Geodesy and Photogrammetry, ETH Zurich, Switzerland, Mitteilungen No. 81.

Denker, K., Umlauf, G., 2011. Accurate real-time multi-camera stereo-matching on the GPU for 3D reconstruction. Journal of WSCG 19 (1), 9–16.

Fasel, B., Luettin, J., 2003. Automatic facial expression analysis: a survey. Pattern Recognition 36, 259–275.

Fraser, C.S., 1984. Network design considerations for non-topographic photogrammetry. Photogrammetric Engineering and Remote Sensing 50 (8), 1115–1126.

Furukawa, Y., Ponce, J., 2009. Dense 3D motion capture for human faces. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June, pp. 1674–1681.

Gauglitz, S., Höllerer, T., Turk, M., 2011. Evaluation of interest point detectors and feature descriptors for visual tracking. International Journal of Computer Vision 94 (3), 335–360.

Gokturk, S.B., Bouguet, Y.-J., Grzeszczuk, R., 2001. A data-driven model for monocular face tracking. In: Proc. 8th IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July, pp. 701–708.

Gruen, A., 1985. Adaptive least squares correlation: a powerful image matching technique. South African Journal of Photogrammetry, Remote Sensing and Cartography 14 (3), 175–187.

Gruen, A., Baltsavias, E.P., 1988. Geometrically constrained multiphoto matching. Photogrammetric Engineering and Remote Sensing 54 (5), 633–641.

Hartley, R., Zisserman, A., 2000. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge.

Jokinen, O., Haggrén, H., 2011. Estimation of 3-D deformation from one or more images with weak imaging geometry. The Photogrammetric Journal of Finland 22 (2), 14–26.

Jokinen, O., Haggrén, H., 2012. Detection and correction of changes in exterior and interior orientations while estimating 3-D object deformations from multiple images with weak or strong imaging geometry. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences I-3, 43–48.

Kanatani, K., 1994. Analysis of 3-D rotation fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5), 543–549.

Leal-Taixé, L., Pons-Moll, G., Rosenhahn, B., 2012. Branch-and-price global optimization for multi-view multi-target tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June, pp. 1987–1994.

Lowe, D., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), 91–110.

Lowe, D., 2005. SIFT Demo Program, Version 4. <http://www.cs.ubc.ca/~lowe/keypoints/> (accessed 18.06.13).

Mikolajczyk, K., 2007. Software for Harris-Affine and Hessian-Affine Detectors. <http://www.robots.ox.ac.uk/~vgg/research/affine/> (accessed 18.06.13).

Mikolajczyk, K., Schmid, C., 2005. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (10), 1615–1630.

Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L., 2005. A comparison of affine region detectors. International Journal of Computer Vision 65 (1/2), 43–72.

Oka, K., Sato, Y., 2005. Real-time modeling of face deformation for 3D head pose estimation. In: Proc. Second International Workshop on Analysis and Modeling of Faces and Gestures, Beijing, China, 16 October, pp. 308–320.

Remondino, F., El-Hakim, S., 2006. Image-based 3D modelling: a review. The Photogrammetric Record 21 (115), 269–291.

Wang, P., Ji, Q., 2008. Robust face tracking via collaboration of generic and specific models. IEEE Transactions on Image Processing 17 (7), 1189–1199.

Wang, S., Lu, H., Yang, F., Yang, M.-H., 2011. Superpixel tracking. In: Proc. 13th IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November, pp. 1323–1330.

Yang, B., Nevatia, R., 2012. An online learned CRF model for multi-target tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June, pp. 2034–2041.

Zhang, Z., 1994. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision 13 (2), 119–152.

Zhang, W., Wang, Q., Tang, X., 2008. Real time feature based 3-D deformable face tracking. In: Proc. 10th European Conference on Computer Vision, Marseille, France, 12–18 October, Part II, LNCS 5303, pp. 720–732.