
Gaze-Driven Object Tracking Based on Optical Flow Estimation

Bartosz Bazyluk and Radosław Mantiuk

West Pomeranian University of Technology, Szczecin, Faculty of Computer Science and Information Technology

Żołnierska Str. 49, 71–210 Szczecin, Poland
{bbazyluk,rmantiuk}@wi.zut.edu.pl

Abstract. To efficiently deploy eye tracking in gaze-dependent image analysis tasks, we present an optical flow-aided extension of the gaze-driven object tracking (GDOT) technique. GDOT assumes that objects in 3-dimensional space are fixation targets and computes, with high probability, the fixation directions towards the target observed by the user. We investigate whether this technique remains effective for video footage in 2-dimensional space, where the targets are tracked by an optical flow technique with the inaccuracies characteristic of this method. In the conducted perceptual experiments, we assess the efficiency of gaze-driven object identification by comparing results with reference data for which the attended objects are known. The GDOT extension reveals higher errors than in 3D graphics tasks, but still outperforms typical fixation detection techniques.

1 Introduction

Gaze tracking is a powerful technique which can be applied to visual saliency analysis. It can indicate the regions of an image that are most frequently attended by a human observer. Gaze data captured by eye trackers show how visual space is scanned by the constantly changing gaze to extract information and build the conscious image of a 3-dimensional scene. Gaze tracking could be successfully applied in many computer vision applications, for example in the analysis of advertisement visibility in movie clips, or in gaze-driven image segmentation. However, it is disappointing that neither display devices nor image analysis methods make full use of this property of the human visual system (HVS). We argue that the main reason for this is the low accuracy of eye-tracking systems.

Eye trackers capture the gaze direction indicated by the physical pose of the eye, in particular the location of the pupil centre [1]. The actual fixation direction (the direction consistent with the intention of the observer) is determined with the aid of gaze data that changes over time. Typical fixation detection algorithms are prone to an accuracy error of up to two degrees of visual angle [3], which is equivalent to an 80-pixel distance between the reference fixation point watched by the observer and the fixation point captured by the device (value estimated for a typical 22-inch display of 1680x1050 pixel resolution, seen from a distance of 65 cm).
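The 80-pixel figure follows directly from the display geometry. Below is a minimal sketch of the conversion (the function and its name are ours, not part of the paper):

```python
import math

def visual_angle_to_pixels(angle_deg, view_dist_cm, diag_inch, res_x, res_y):
    """Convert a visual angle to an on-screen distance in pixels,
    assuming square pixels."""
    diag_px = math.hypot(res_x, res_y)        # screen diagonal in pixels
    px_per_cm = diag_px / (diag_inch * 2.54)  # pixel density
    size_cm = 2 * view_dist_cm * math.tan(math.radians(angle_deg / 2))
    return size_cm * px_per_cm

# 2 degrees on a 22-inch 1680x1050 display viewed from 65 cm:
print(round(visual_angle_to_pixels(2.0, 65.0, 22.0, 1680, 1050)))  # ~80
```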

In this work we propose a gaze-tracking system that combines gaze data with information about the image content to greatly improve the accuracy and stability of contemporary eye trackers. General purpose fixation detection techniques are not suitable for tracking moving objects, which are followed with the eyes in smooth pursuit movement [9]. To overcome this limitation we use the gaze-driven object tracking (GDOT) technique [2], which clearly outperforms the standard fixation detection techniques in this respect. GDOT treats distinct objects in a scene as potential fixation targets and computes, with high probability, the fixation directions towards the target observed by the viewer. It has been demonstrated that this technique can be successfully applied to depth-of-field effect simulation [10], to tone mapping [6], and as an auxiliary controller in a computer game [2].

In this paper we present a novel application of GDOT in which it is used to identify visually attended objects in a TV broadcast or other moving pictures. A set of targets (e.g. players in a football game) is tracked by a sparse optical flow-aided method. We use the output from this system as a set of potential attention targets that allow us to compute attention-related statistics, e.g. an estimate of the most attended player in the game. The basic difference between the original algorithm and our extension is that GDOT assumes perfect locations of the targets, while in the optical flow approach inaccuracies in target positions may occur. We also take into account cases in which targets go beyond the camera boundaries during the footage and must be temporarily or permanently removed from the list of potential targets. We test whether this deterioration in data quality can be compensated for by the regular GDOT routines. We also compare the achieved results with the efficiency of typical fixation protocols.

Sect. 2 introduces the main concept of our work, i.e. an optical flow-aided attention tracking system. In Sect. 3 we describe the performed experiments and then discuss their results in Sect. 4. The paper ends with Conclusions.

2 Gaze-Attended Target Identification

An overview of the proposed tracking system is shown in Fig. 1. The optical flow module supplies the GDOT module with the current positions of the potential targets of attention, and the eye tracker sends the current gaze positions. The task of GDOT is to select the most likely target of attention for the current time instance.

Proposed Optical Flow-Based Tracking Extension. To use the GDOT technique for effective tracking, a set of potential attention targets first has to be provided. In our solution, an expert with knowledge about the scope of the study is responsible for identifying and marking the targets. Since our goal is to choose targets that are semantically important for the study and can attract the observer's attention, general purpose automated feature extraction methods cannot be used instead.


Fig. 1. The design of our gaze-driven target identification system supplied by the optical flow object tracking

In this work we use re-playable finite video clips as stimuli. This allows us to introduce a semi-manual, computer vision-aided framework for target designation and tracking of their movement. In a football match the expert would select game-related visible items like the ball, players and referees, on-screen information like the score and the TV station logo, as well as other visual attractors like advertisements if they are of interest for the study. By marking their positions within their respective first frames of appearance, an automated tracking procedure can be initiated.

Sparse optical flow estimation is used to follow the marked targets' movement in frame space. To accomplish this task, a frame-to-frame Lucas-Kanade method is used [8]. This well-known algorithm considers tracked features to be single pixels whose movement between every two consecutive frames is approximately shared within their local neighbourhoods. This way an over-determined system of equations is usually produced, which is then solved with a least-squares method.
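As an illustration, the sketch below propagates the marked target positions from frame to frame with a pyramidal Lucas-Kanade implementation. The paper does not state which implementation was used; here we assume OpenCV, and the function name and parameter values are ours:

```python
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           30, 0.01))

def track_targets(prev_gray, curr_gray, points):
    """Propagate expert-marked target positions from the previous
    greyscale frame to the current one with sparse Lucas-Kanade flow."""
    p0 = np.float32(points).reshape(-1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                p0, None, **LK_PARAMS)
    # status[i] == 0 marks a lost target (e.g. occlusion, motion blur);
    # such cases call for the manual key frames described below.
    return p1.reshape(-1, 2), status.ravel().astype(bool)
```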

However, tracking visual features using simple optical flow analysis can be problematic in real-world videos. Such stimuli are often prone to artefacts and general quality issues. A low frame rate combined with slow shutter speeds can lead to motion blur, which affects sparse flow estimation [7]; likewise, rapid movement and the natural tendency of moving objects to rotate and occlude each other often lead to unrecoverable gaps in the automatic tracking process. Therefore we found it necessary to implement in our software a way of manually inserting key frames to help tracking in these critical moments (e.g. when a quickly moving ball is occluded by players and its tracking cannot be recovered automatically). The expert is also allowed to perform basic linear interpolation of an object's path to cater for transient disappearance periods, during which the user's attention can still be bound to the target despite its temporary lack of visibility [9]. These tasks are performed off-line; however, implementation of a fully real-time version is also feasible as long as a reliable automated tracking method can be provided. We consider it to be a part of our future work.
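Such a gap fill might look as follows; `path` is a hypothetical per-frame array of (x, y) target positions with valid key frames at indices t0 and t1:

```python
import numpy as np

def interpolate_gap(path, t0, t1):
    """Linearly fill the target positions for frames t0+1 .. t1-1,
    where the target is lost between two expert key frames
    (e.g. during a transient occlusion)."""
    for t in range(t0 + 1, t1):
        alpha = (t - t0) / (t1 - t0)  # fraction of the way through the gap
        path[t] = (1 - alpha) * np.asarray(path[t0]) \
                + alpha * np.asarray(path[t1])
    return path
```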

GDOT Algorithm. A small distance between the eye-tracker gaze point and the target is the strongest indicator that an object is attended. The GDOT technique models this indicator as a position probability that depends on the Euclidean distance between the gaze point and the targets. If the position consistency becomes unreliable, the object can still be tracked if its velocity is consistent with the smooth pursuit motion of the eye. The velocity computed directly from scan paths (gaze data) is an unreliable measure, as it is dominated by eye-tracker noise, saccades and the tremor of the eye. Fortunately, the smooth pursuit motion operates over longer time periods and thus can be extracted from the noisy data with the help of a low-pass filter. The sum of the position and velocity consistency probabilities is used because it is likely that either position or velocity is inconsistent even when the target is attended. A naive target tracking algorithm could compute the probability of each target at a given point in time (or at a given frame) and choose the object with the highest probability. This, however, would result in frequent shifts of tracking from one target to another. An elegant way to penalise implausible interpretations of the data without excessive time delay is to model the attention shifts between targets as a hidden Markov process. Each state in such a process corresponds to tracking a particular target. Since a fixation cannot be too short, the probability of staying in a particular state is usually much higher than the probability of moving to another state (in our case 95% vs. 5%). This way the eye is more likely to keep tracking the same target than to rapidly jump from one target to another. Further details on the GDOT technique can be found in [2].
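The following sketch illustrates how one step of such a tracker could be realised. It is our simplified reading of the scheme described above, not the authors' implementation; the constants (sigma_p, sigma_v) are illustrative:

```python
import numpy as np

def gdot_step(log_post, gaze_pos, gaze_vel, targets, target_vels,
              sigma_p=50.0, sigma_v=30.0, p_stay=0.95):
    """One update of a simplified GDOT-style tracker.

    log_post    : per-target log-probabilities from the previous step
    gaze_pos    : current gaze position (pixels)
    gaze_vel    : low-pass filtered gaze velocity (pixels/frame)
    targets     : (n, 2) current target positions
    target_vels : (n, 2) current target velocities
    """
    n = len(targets)
    # Sum of position and velocity consistency terms: either one alone
    # may fail even when the target is genuinely attended.
    d_pos = np.linalg.norm(targets - gaze_pos, axis=1)
    d_vel = np.linalg.norm(target_vels - gaze_vel, axis=1)
    evidence = (np.exp(-0.5 * (d_pos / sigma_p) ** 2)
                + np.exp(-0.5 * (d_vel / sigma_v) ** 2))
    # Hidden Markov transition model: staying with the current target
    # (0.95) is far more likely than switching (0.05 shared by the rest).
    p_switch = (1.0 - p_stay) / max(n - 1, 1)
    trans = np.full((n, n), p_switch)
    np.fill_diagonal(trans, p_stay)
    # Viterbi-style recursion over the most likely state sequence.
    log_post = (np.max(log_post[:, None] + np.log(trans), axis=0)
                + np.log(evidence + 1e-12))
    return log_post, int(np.argmax(log_post))  # new state, attended target
```

On the first frame, log_post can be initialised uniformly, e.g. np.zeros(n), which equals a uniform prior up to an additive constant in log space.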

3 Experiment

The main objective of the performed perceptual experiments was to test the accuracy of the optical flow-based GDOT extension when used with a motion picture as the stimulus.

Stimuli and Procedure. Three short video clips (18, 29 and 13 seconds long), labelled A, B and C, each containing a different fragment of a football game, were shown to the observers. In the first case the observer was asked to freely watch the match. This training session allowed for familiarisation with the stimuli, procedure and apparatus. For video B, the participant was asked to follow with their eyes a coloured, moving marker associated with either one of the players or the ball (see Fig. 2, left). For video C, the observer was asked to follow the player with the ball (no marker was displayed). The whole experiment was repeated 3 times for each person. For every sequence a manually defined set of potential attention targets was distributed over the scene. The targets were associated with the ball, the players, both the score and time indicators, and the TV station logo (see the example in Fig. 2, right). Their positions were tracked during playback using the optical flow-aided technique described in Section 2.

Each observer sat at a distance of 65 cm from the display (this distance was enforced with a chin rest). The actual experiment was preceded by a 9-point eye tracker calibration. Then, the three videos were displayed one by one while eye movements were recorded. In particular, we captured the raw gaze positions and fixation points. The latter were computed by the proprietary eye tracker software based on a combination of the Dispersion Threshold Identification (I-DT) [4] and Velocity Threshold Identification (I-VT) [5] fixation detection techniques.
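For reference, a toy version of the dispersion-threshold (I-DT) idea [4] is sketched below. The proprietary software's actual implementation is unknown to us, and the threshold values here are illustrative:

```python
import numpy as np

def idt_fixations(gaze, disp_thresh=30.0, min_len=6):
    """Detect fixations in a gaze trace of shape (N, 2) by growing
    windows whose dispersion stays below a threshold (in pixels);
    min_len is the minimum fixation duration in samples."""
    def dispersion(w):  # (max_x - min_x) + (max_y - min_y)
        return np.ptp(w[:, 0]) + np.ptp(w[:, 1])

    fixations, start = [], 0
    while start + min_len <= len(gaze):
        end = start + min_len
        if dispersion(gaze[start:end]) <= disp_thresh:
            # Grow the window while the points stay tightly clustered.
            while end < len(gaze) and dispersion(gaze[start:end + 1]) <= disp_thresh:
                end += 1
            fixations.append(gaze[start:end].mean(axis=0))  # fixation centroid
            start = end
        else:
            start += 1
    return fixations
```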

Fig. 2. Left: a frame from video B with an example location of the marker. Right: allocation of the potential fixation targets.

Participants and Apparatus. Gaze points were collected from 9 individual observers (aged between 27 and 44 years, 7 males and 2 females). All participants had normal or corrected-to-normal vision. No session took longer than 5 minutes. The participants were aware that their gaze was being recorded, but were naive about the purpose of the experiment.

Our experimental setup consisted of a P-CR Mirametrix S2 eye tracker controlled by proprietary software (version 2.5.1.152). The S2 captures the locations of both the corneal reflection (the first Purkinje image) and the centre of the pupil. The data was collected at a rate of 60 Hz. The eye tracker was positioned under a 24-inch NEC MultiSync PA241W LCD display with screen dimensions of 53x32.5 cm and a native resolution of 1920x1200 pixels (60 Hz). A PC (2.8 GHz Intel i7 930 CPU equipped with an NVIDIA GeForce 480 GTI 512 MB graphics card and 8 GB of RAM, Windows 7 OS) was used to run our evaluation software controlled by MATLAB, and to store the experimental results.

4 Results

We processed the gaze data captured during the experiments using the GDOT technique. As a result, the most probable fixation targets were identified for each time instance in the videos. For videos B and C, the reference target can also be identified, because we asked observers to look either at the marker, whose position is known, or at the player in possession of the ball, whose position can likewise be extracted from the video with some degree of accuracy. Consistency between the reference targets and the targets identified by the GDOT algorithm gives a measure of its accuracy. We compare these results with the data obtained using raw gaze data and the fixation points computed by the eye tracker software. In the two latter cases, the target closest to the gaze point/fixation point is assumed to be the identified target.


Quality Metric. We used the error metric introduced by Mantiuk et al. [2]. They noticed that the objective measure of accuracy, represented by the difference between the reference and the captured target, corresponds poorly to the subjective experience of using a gaze-contingent application. This measure, called the error rate, is the percentage of time a wrong object is tracked. In practice, however, minimising the error rate does not necessarily improve the perceived quality of gaze-dependent object identification. This is because fixation detection algorithms calibrated for a low error rate usually produce rapid changes of fixation from one target to another, which are very distracting. Such unwanted rapid changes can be described by the number of times per second an algorithm changes fixation to a wrong target, which is called the error frequency.
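Under one plausible reading of these definitions, both measures can be computed from per-frame target IDs as follows (function and variable names are ours):

```python
import numpy as np

def error_metrics(identified, reference, fps):
    """Error rate [%] and error frequency [1/s] from per-frame target IDs.

    identified, reference : sequences of integer target IDs, one per frame
    fps                   : frames (samples) per second
    """
    identified = np.asarray(identified)
    wrong = identified != np.asarray(reference)
    error_rate = 100.0 * wrong.mean()  # % of time a wrong target is tracked
    # Count switches that land on a wrong target, normalised per second.
    to_wrong = (np.diff(identified) != 0) & wrong[1:]
    error_freq = to_wrong.sum() / (len(wrong) / fps)
    return error_rate, error_freq
```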

Accuracy. Fig. 3 compares the error rate and the error frequency for the three tested techniques. The results were averaged over all observers and repetitions. The proposed GDOT extension has a far superior overall quality compared to the other algorithms, mostly because of the consistency of its predictions (low error frequency), but also because it could track attended objects for longer (smallest overall error rate).

Fig. 3. Average error rate [%] (left) and error frequency [1/sec] (right) for the two individual videos (B and C), comparing GDOT, fixations and raw gaze points

Temporal Characteristics of Object Tracking. The temporal characteristics of each algorithm, shown in Fig. 4, provide an insight into why the GDOT algorithm is judged significantly better. The plots show the timing of attended object identification, each for a single session of the experiment. The reference targets that observers were asked to follow are shown as a bold red dashed line. The blue line is the target identified by the corresponding tracking/fixation algorithm.

The raw gaze data (top plot) resulted in a relatively low error rate. However, raw gaze points also gave an unacceptable level of flickering, resulting in the greatest error frequency.

Fig. 4. Target tracking results for an example experimental session (video B, observer paf, repetition 3). The vertical axis shows the IDs of targets, while the horizontal axis is time [sec]. The dashed red lines depict the IDs of the reference targets, while the blue ones show the identified target. Per-plot statistics: raw gaze: error rate 66.08%, error freq. 5.23 [1/s]; fixations: error rate 68.51%, error freq. 2.10 [1/s]; GDOT: error rate 32.79%, error freq. 0.10 [1/s].

The fixation detection (middle plot) lowers the error frequency but at the same time gives an unacceptably high error rate. For the majority of sessions the proposed GDOT correctly identified more than 50% of targets for video B and more than 70% for video C (77.21% for the individual session presented in Fig. 4). The dominant sources of error were inaccuracies when switching between targets, as well as the delay caused by observers not being able to move their gaze instantaneously when the marker jumped from one object to another. This rather high error value shows that complex video footage, which includes a large number of fast-moving targets with hard-to-predict trajectories, still poses a challenge for modern eye tracking because of insufficient device accuracy and temporal lag.


5 Conclusions

We described a novel use of the GDOT method and have shown that it can be successfully used with video stimuli, provided certain features of the motion picture are designated as potential fixation targets and tracked by the optical flow system. As shown in the conducted perceptual experiments, the GDOT technique outperforms the typical fixation detection techniques in this new field. The overall accuracy is high enough to enable the use of object-oriented gaze and attention tracking in video-based applications.

Acknowledgements. The project was partially funded by the Polish National Science Centre (decision number DEC-2013/09/B/ST6/02270).

References

1. Duchowski, A.T.: Eye Tracking Methodology: Theory and Practice, 2nd edn. Springer (2007)

2. Mantiuk, R., Bazyluk, B., Mantiuk, R.K.: Gaze-driven Object Tracking for Real Time Rendering. Computer Graphics Forum (Proc. of Eurographics 2013) 32(2), 163–173 (2013)

3. Salvucci, D.D., Goldberg, J.H.: Identifying fixations and saccades in eye-tracking protocols. In: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications (ETRA), New York, pp. 71–78 (2000)

4. Widdel, H.: Operational problems in analysing eye movements. In: Gale, G., Johnson, F. (eds.) Theoretical and Applied Aspects of Eye Movement Research, pp. 21–29. Elsevier Science Publishers B.V., North-Holland (1984)

5. Erkelens, C.J., Vogels, I.M.L.C.: The initial direction and landing position of saccades. In: Eye Movement Research: Mechanisms, Processes and Applications, pp. 133–144 (1995)

6. Mantiuk, R., Markowski, M.: Gaze-dependent Tone Mapping. In: Kamel, M., Campilho, A. (eds.) ICIAR 2013. LNCS, vol. 7950, pp. 426–433. Springer, Heidelberg (2013)

7. Jin, H., Favaro, P., Cipolla, R.: Visual tracking in the presence of motion blur. In: Proc. of Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 18–25 (2005)

8. Lucas, B.D., Kanade, T.: An Iterative Image Registration Technique with an Application to Stereo Vision. In: Proc. of the 7th International Joint Conference on Artificial Intelligence, Canada, vol. 2, pp. 674–679 (1981)

9. Becker, W., Fuchs, A.F.: Prediction in the oculomotor system: smooth pursuit during transient disappearance of a visual target. Experimental Brain Research 57(3), 562–575 (1985)

10. Mantiuk, R., Bazyluk, B., Tomaszewska, A.: Gaze-Dependent Depth-of-Field Effect Rendering in Virtual Environments. In: Ma, M., Fradinho Oliveira, M., Madeiras Pereira, J. (eds.) SGDA 2011. LNCS, vol. 6944, pp. 1–12. Springer, Heidelberg (2011)

