
Media Engineering and Technology Faculty
German University in Cairo

Analysis of eye fixations in static and dynamic stimuli using neural networks

Bachelor Thesis

Author: Mohamed Ashraf

Supervisors: Prof. Florian Röhrbein

Prof. Hazem Mahmoud Abbas

Submission Date: 04 September, 2014


Acknowledgments

First of all I would like to thank my supervisor Prof. Florian Röhrbein for his continued support and guidance throughout the writing of my thesis and the creation of the related project. He was indispensable during the tough process of honing the quality of the output, as well as providing references and ideas for better feature extraction from the stimuli.

I would like to thank the GUC for sponsoring my stay in Germany to work on my bachelor thesis. I would also like to thank the GUC Berlin branch for their support during my stay in Berlin, and specifically Sarah El Sawaf for her personal assistance. I would also like to extend my gratitude to my friends Hossam Mossalam, Mina Nagy and Ronja Pawellek for their motivation during the long process of working on the project and writing the thesis, and for making the downtime more fun, and to Nourhan Mohamed specifically for her help with the thesis template as well as for all the other times she helped me during my years at the GUC.

I am thankful for the existence of Stack Overflow, which helped me fix many of the bugs I had during the project and navigate the complications of the Matlab language; for Coursera, whose online course on machine learning and neural networks provided a large part of the basis of the project; and for Google, which helped me find all the resources I needed wherever they were online.

And finally I would like to thank my parents, who supported me financially and emotionally during my education and especially during my stay abroad in Germany for the duration of my bachelor thesis.


Abstract

The ability to predict eye movements is much sought after. It opens the door for many applications, including improving the visibility of advertisements and improving video compression schemes, for example by decreasing the quality of areas of the video where a viewer is not likely to look while keeping the visually salient areas at higher quality. In this thesis, eye movements of subjects were recorded while they viewed static and dynamic grayscale stimuli. We then analysed the recorded movements and tried to create a machine learning model that would predict where a human eye would look at new stimuli. Pre-processing and post-processing were applied to the resulting heatmaps to remove noise and leave only the most salient regions behind. The same model was used for both the static and dynamic stimuli, with temporal factors also taken into account for the dynamic stimuli. The resulting model performed better than an equivalent random control model.


Contents

Acknowledgments

Abstract

1 Introduction
1.1 Saccades and Fixations
1.2 Saliency Maps
1.3 Top-down approach
1.4 Bottom-up approach
1.5 Which approach to choose
1.6 What we chose

2 Background
2.1 Previous work on static stimuli
2.1.1 Itti and Koch [2000]
2.1.2 Judd et al. [2009]
2.2 Previous work on dynamic stimuli
2.2.1 Le Meur et al. [2005]
2.2.2 Le Meur et al. [2007]

3 Methodology
3.1 Our stimuli
3.2 Collecting the data
3.3 Preprocessing the collected data
3.4 Dividing the stimuli
3.5 Training the neural network

4 Results and Discussion
4.1 Results
4.2 Discussion

5 Conclusion and Future work
5.1 Conclusion
5.2 Future work

Appendix
A Acronyms

Bibliography


Chapter 1

Introduction

1.1 Saccades and Fixations

Due to the way the retina is built, only a small part, called the fovea, has the highest resolving power and provides the clearest image to the brain. The rest of the retina provides a less than clear image. To compensate, the brain directs the eye to point the fovea at the most important part of the image and then approximates the rest to provide an almost clear image.

Eye motion is divided into saccades and fixations. Saccades are rapid eye movements that switch between important parts of the image. Fixations are when the eye temporarily stops moving to focus on one of the important parts of the observed scene. Fixations are connected by saccades.

1.2 Saliency Maps

An idea central to the most common approaches in fixation analysis and prediction is the saliency map. These maps assign a value to each part of the image indicating its importance. It is theorized that the brain creates a similar saliency map when it views stimuli. It then tries to focus on the most important part of the image and fixates on it. This part is then inhibited to encourage fixation on the next most important part of the image, and so on.


1.3 Top-down approach

There are two approaches to the creation of the saliency map. One is the top-down approach, which takes into account prior information about high-level features such as regions identified as background and foreground, cars on the road, or human faces. These are all things that may not visually stand out on their own, but we are attracted to them due to familiarity. The top-down approach also takes into account the aim during viewing of the stimuli. For example the saliency map, for the same stimulus, may look different if we are searching for a car on the road, a face in a crowd, or keys on a desk. During a search task, the top-down approach usually boosts salience artificially in places where the target object may be found.

1.4 Bottom-up approach

The other approach is the bottom-up approach. This is mainly concerned with the low-level features being sensed by the retina, which usually manifest as color difference, intensity and contrast. This approach predicts that a subject will look at the parts of the stimuli that visually stand out from the rest. Biological analysis of the sensors in the retina, and of how well the brain senses differences in color, intensity and other features, is taken into account when creating a model using the bottom-up approach.

1.5 Which approach to choose

Depending on the task at hand one approach may be better than the other, and sometimes a mixture of the two is best. The top-down approach is usually prevalent when a search task is being performed, since speed is then of the essence and the brain tries to search the most probable places first rather than the most visually prominent ones. So you block out parts of the image where you are unlikely to find your query, even if they visually stand out, and focus on the areas where your brain thinks it will find the query.

The bottom-up approach is usually prevalent in memorization-then-identification tasks, where a subject memorizes a set of images and is later quizzed on whether or not they have seen an image before, where the image may or may not be from outside the memorized set. When a subject wants to memorize an image for a later test, they will look at the most visually conspicuous areas of the image.


1.6 What we chose

In this project we focused on the bottom-up approach. This approach was shown to provide good results when used with natural scenes (Itti and Koch [2000]) such as the ones used here. We focused on contrast and intensity since the stimuli were grayscale, without color channels.


Chapter 2

Background

2.1 Previous work on static stimuli

2.1.1 Itti and Koch [2000]

The idea of saliency maps has been well researched and many approaches to creating them have been attempted, the most popular being that of Itti and Koch [2000]. They used sensory input to create a map of the most salient regions. They then chose the most salient region as the most probable first fixation, inhibited that region, and recreated the map, choosing the next most salient region as the next fixation point. Through this method they were able to create an ordered list of the most probable fixation locations.

To create the saliency map they first extracted low-level visual features such as red, green, blue and yellow hues, orientation and brightness at different scales, forming an image pyramid. The pyramid was formed by low-pass filtering the image and then sub-sampling with a ratio of 2; the pyramid they used had 9 levels. The features were computed using a center-surround mechanism so that they are sensitive to local changes rather than absolute amplitude. They were calculated as differences between different scales of the pyramid, acting as center and surround. 6 scales of center-surround were calculated for each feature.
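
As an illustration, here is a minimal Python sketch of such a pyramid and the across-scale center-surround differences; the level indices, filter sigma, and interpolation order below are assumptions in the spirit of Itti and Koch, not their exact parameters:

    import numpy as np
    from scipy.ndimage import gaussian_filter, zoom

    def gaussian_pyramid(img, levels=9):
        """Low-pass filter and subsample by 2 at each level."""
        pyr = [img.astype(float)]
        for _ in range(levels - 1):
            smoothed = gaussian_filter(pyr[-1], sigma=1.0)
            pyr.append(smoothed[::2, ::2])
        return pyr

    def center_surround(pyr, centers=(2, 3, 4), deltas=(3, 4)):
        """Across-scale differences: 3 centers x 2 deltas = 6 maps per feature."""
        maps = []
        for c in centers:
            for d in deltas:
                s = c + d
                factor = (pyr[c].shape[0] / pyr[s].shape[0],
                          pyr[c].shape[1] / pyr[s].shape[1])
                surround = zoom(pyr[s], factor, order=1)  # upsample surround level
                maps.append(np.abs(pyr[c] - surround))
        return maps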

They used 7 features that have a basis in the visual systems of mammals: on/off image intensity contrast, red/green and blue/yellow double-opponent channels, and 4 features for local orientation contrast (at 0°, 45°, 90° and 135°). Thus a total of 42 feature maps were created for each image.

To combine those 42 maps into one, they started by normalising each feature map to a fixed dynamic range. They then convolved each map with a large Difference of Gaussians (DoG) filter, added the result to the original map, and clamped negative values to 0; this was done 10 times. Next, the feature maps were summed into three maps, one for intensity, one for color and one for orientation, and normalised separately. The resulting 3 maps were then subjected to another 10 rounds of DoG filtering and finally linearly summed into a single saliency map.
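
A sketch of that iterative DoG normalisation, again in Python; the excitation and inhibition widths below are illustrative assumptions, not the published constants:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def normalize_range(m, new_max=1.0):
        """Rescale a map to a fixed dynamic range [0, new_max]."""
        lo, hi = m.min(), m.max()
        return (m - lo) / (hi - lo + 1e-12) * new_max

    def dog_iterate(m, iters=10, sig_ex=0.02, sig_inh=0.25):
        """Add a DoG of the map to itself, clamping negatives to zero."""
        size = max(m.shape)
        for _ in range(iters):
            excite = gaussian_filter(m, sig_ex * size)
            inhibit = gaussian_filter(m, sig_inh * size)
            m = np.maximum(m + excite - inhibit, 0.0)
        return m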

2.1.2 Judd et al. [2009]

Judd et al. [2009] tried to improve greatly on the work of Itti and Koch [2000], as well as other previous efforts, by extracting many more features. They divided the features used into 3 categories: low-level, mid-level and high-level features.

For the low-level features they extracted features that are physiologically plausible and have a track record of affecting visual attention, for example steerable pyramid filters, intensity, color and contrast. They also included the red, green and blue channels and the probability of each color, using 3D color histograms of the image filtered by a median filter at 6 scales.

For the mid-level features, a horizon line detector was trained separately and used as a feature for the saliency map. As for the high-level features, they used face, person and car detectors and added them as features, since it was shown that humans fixate on cars and faces with a higher probability. They also took into consideration the fact that humans tend to fixate nearer to the center, so the distance from the center at each location was used as a feature as well.

For training their model they split their corpus into 903 training images and 100 test images. For each image, 10 pixels were chosen from the top 20% most salient regions and another 10 from the bottom 70% least salient regions.

They used a support vector machine with a linear kernel to train on the aforementioned 9030 positive and 9030 negative pixels. They measured the performance of the model using a Receiver Operating Characteristic (ROC) curve and found that their resulting model outperforms chance, center fixation and other models based on a subset of the features under test. They also found that it reaches 88% of average human performance in detecting salient regions.
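
For illustration, a hedged sketch of that training step using scikit-learn's LinearSVC as a stand-in for their setup; the random arrays and the feature count are placeholders for the real per-pixel feature vectors, not their data:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(18060, 33))   # placeholder: 9030 pos + 9030 neg pixels
    y = np.repeat([1, 0], 9030)        # 1 = salient pixel, 0 = non-salient

    clf = LinearSVC(C=1.0).fit(X, y)   # linear-kernel SVM
    scores = clf.decision_function(X)  # continuous saliency scores
    print("ROC AUC:", roc_auc_score(y, scores))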


2.2 Previous work on dynamic stimuli

2.2.1 Le Meur et al. [2005]

While most research considered only static stimuli, some did consider dynamic stimuli, such as Le Meur et al. [2005], who tried to create a spatial saliency map using spatial features, as well as a temporal saliency map considering motion and other changes in the stimuli across frames, and then combine them into a spatio-temporal saliency map.

For the spatial saliency map they used a psychovisual model as the basis for the map-creation process. First they converted the RGB color components to the Krauskopf color space: A (the achromatic component), CR1 (the red and green antagonist component) and CR2 (the blue and yellow antagonist component). They then applied perceptual channel decomposition by splitting the 2D spatial frequency domain both in spatial radial frequency and in orientation. Contrast sensitivity functions were then used to assess visibility depending on the spatial frequencies of the stimuli. Next came visual masking, to inhibit the visibility threshold (the maximum level of detail detectable by the eye) due to interactions within the component maps. They then simulated the center-surround suppressive organisation of neurons, and two structural descriptions (for the chromatic and the achromatic components) were created. These were then linearly combined to form the final spatial saliency map.

To create the temporal saliency map they proposed that visual attention focuses on motion contrast, so they wanted to calculate the local relative motion at each pixel, since it is correlated with visual saliency. They used hierarchical block matching to find and calculate the local motion in each part of the image. They created two Gaussian image pyramids for each pair of consecutive frames, scaling down the image by a factor of 2 and convolving with a 2D Gaussian filter. At each level, starting from the lowest resolution, the motion vector that produces the smallest sum of absolute differences is chosen, up-sampled, and transmitted to the next level. A refinement algorithm is used to reach the final local motion vector for each pixel in the original-resolution image.
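
A simplified, single-level sketch of the sum-of-absolute-differences (SAD) block search used at each pyramid level; the block size and search range here are illustrative, not the paper's values:

    import numpy as np

    def block_motion(prev, curr, block=8, search=4):
        """For each block of `curr`, find the offset into `prev`
        with the smallest sum of absolute differences."""
        h, w = curr.shape
        vectors = np.zeros((h // block, w // block, 2), dtype=int)
        for by in range(h // block):
            for bx in range(w // block):
                y0, x0 = by * block, bx * block
                ref = curr[y0:y0 + block, x0:x0 + block].astype(float)
                best, best_v = np.inf, (0, 0)
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        yy, xx = y0 + dy, x0 + dx
                        if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                            continue
                        cand = prev[yy:yy + block, xx:xx + block].astype(float)
                        sad = np.abs(ref - cand).sum()
                        if sad < best:
                            best, best_v = sad, (dy, dx)
                vectors[by, bx] = best_v
        return vectors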

The next step is to calculate the global motion of the camera, so as to cancel it out when calculating the relative local motion, which is the quantity of interest. They used the previously calculated motion vectors of successive frames to estimate the affine transformation between them, using a robust technique based on M-estimators.

Once they had both the local motion and the global motion, the relative motion could be obtained simply as local minus global motion. They discarded areas with motion faster than is detectable by the human eye. They then created the temporal saliency map by weighting the relative motion by the median motion in the entire frame, since moving parts increase in saliency when there are fewer areas of relative motion in the image.
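
A sketch of that weighting step; the exact formula in Le Meur et al. [2005] may differ, so treat the normalisation below as an assumption:

    import numpy as np

    def temporal_saliency(local_vecs, global_vecs):
        rel = local_vecs - global_vecs        # relative = local - global motion
        mag = np.linalg.norm(rel, axis=-1)    # per-block motion magnitude
        return mag / (np.median(mag) + 1.0)   # rarer motion -> more salient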


They combined the spatial and temporal saliency maps through weighted addition of the normalized maps. They gave more weight to the spatial map in static frames, in frames with very high global motion, and at cut scenes, and gave the temporal saliency map more weight in stimuli with some relative motion that is not overpowering. The final model, based only on low-level features, predicted human fixations with 80% accuracy on average.

2.2.2 Le Meur et al. [2007]

Le Meur et al. [2007] also used dynamic stimuli in their procedure, but tried to improve upon Le Meur et al. [2005] and Le Meur et al. [2006]. They used their collected experimental data to create a fixation map per observer for each stimulus frame. They then averaged the maps over all observers and convolved them with a Gaussian filter.

Similar to their earlier work, they converted the colors from RGB to the Krauskopf color space (A, CR1 and CR2). They then described the components in the frequency domain. Each frequency is compared to the visibility threshold, and anything higher is perceptible. The visibility threshold is determined by the contrast sensitivity function and then inhibited due to interactions within a component map. Irrelevant data is then removed using a center-surround filter. This produces the 3 spatial saliency maps. The temporal map is generated as in Le Meur et al. [2005] above.

They extracted four saliency maps (1 achromatic, 2 chromatic and 1 temporal), and unlike Le Meur et al. [2005] they did not combine the spatial ones into one spatial saliency map. Instead they proposed a method called Local Non-Linear Normalization (LNLN) for combining all 4 saliency maps into one priority map. The maps are normalized using the theoretical maximum of the features they represent. Then intra-map competition as well as inter-map competition is used to combine each pair of saliency maps into one.

They proposed and compared multiple models: the spatio-temporal model combining all the maps, a spatial model combining only the 3 spatial maps, a temporal-only model, and a spatio-temporal model using simple Normalized Summation (NS) rather than LNLN when combining the maps.

They used multiple metrics for comparison, including the linear correlation coefficient, KL-divergence, cumulative probability and an ROC curve analysis. All the metrics showed that the spatio-temporal model using LNLN provided the most accurate output on average. However, all models were shown to be less accurate than a trivial model that only emphasises a central tendency for fixations, without considering the spatial or temporal features of the stimuli.


Chapter 3

Methodology

3.1 Our stimuli

For the purposes of our model we placed two important restrictions on our stimuli. First, we used natural stimuli only: pictures and videos of natural elements like trees, clouds and animals in the wild. We stayed away from man-made objects or situations as much as possible, believing that natural stimuli would fit well with the evolutionarily developed visual system in humans, since man-made structures such as cars and airplanes would elicit mostly learned responses rather than evolutionary ones.

Second, we used grayscale stimuli only. This reduced the amount of processing and storage we needed, as well as the amount of input the brain has to use to settle on a fixation location, making it easier to create a more accurate model.

We only used high-quality uncompressed stimuli, to avoid the visual artifacts that may arise during lossy compression and might affect the quality of the resulting model. The stimuli were all of size 768 by 576 pixels.

We had 40 dynamic stimuli and 40 static stimuli. The static stimuli were the middle frames of the dynamic stimuli. The dynamic stimuli consisted of 5-second clips at 25 frames per second.

3.2 Collecting the data

To create our model we needed ground truth from actual people viewing the stimuli, recording where they looked. We had 10 subjects view each stimulus. They were not given any specific instructions and were told to look wherever they pleased. Each stimulus was shown to them and the first 30 fixations were recorded. Subjects had an average of 11 fixations on dynamic stimuli and 14 fixations on static stimuli. We only used the first 4 fixations on static stimuli, since these represent the most salient regions.

3.3 Preprocessing the collected data

Fixations were recorded with a resolution of 1 pixel. However, since more than one pixel was in the subject's field of view during a fixation, the pixels around the fixation location should also be taken into account. So, in order to improve performance, we divided the stimuli into patches of 8 by 8 pixels each and grouped all fixations within the borders of a patch as fixations on the patch as a whole. This greatly reduced storage space and processing time without losing any valuable information. After the grouping we convolved with a Gaussian low-pass filter and used percentiles to remove noise.
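
A minimal sketch of this grouping and smoothing step, assuming fixations are given as (x, y) pixel coordinates on a 768 by 576 stimulus; the smoothing sigma is an illustrative choice:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    PATCH, H, W = 8, 576, 768

    def fixation_patch_map(fixations):
        """Accumulate pixel fixations into 8x8 patches, then smooth."""
        grid = np.zeros((H // PATCH, W // PATCH))
        for x, y in fixations:
            grid[int(y) // PATCH, int(x) // PATCH] += 1
        return gaussian_filter(grid, sigma=1.0)  # low-pass over patches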

We considered only the upper 10% of patches and the lower 70% in our calculations, taking the upper 10% as fixated and the lower 70% as not fixated when training our neural network. We created binary maps of the fixated and non-fixated locations and then used them to select 50 random patches from each static stimulus. For dynamic stimuli we extracted the same number of patches on a frame-by-frame basis; however, since the differences between consecutive frames were negligible, and to avoid redundant data, we processed only every 6th frame.
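
A sketch of the thresholding and sampling; the percentile cut-offs follow the text, while the helper name and sampling details are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_patches(grid, n=50):
        """Label the top 10% of patches fixated and the bottom 70%
        non-fixated, then draw n random patches from each class."""
        pos = np.argwhere(grid >= np.percentile(grid, 90))
        neg = np.argwhere(grid <= np.percentile(grid, 70))
        pos = pos[rng.choice(len(pos), min(n, len(pos)), replace=False)]
        neg = neg[rng.choice(len(neg), min(n, len(neg)), replace=False)]
        return pos, neg

    # Dynamic stimuli are stepped through six frames at a time,
    # e.g. for f in range(0, n_frames, 6): ...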

3.4 Dividing the stimuli

In order to both train and test the neural network, we divided the stimuli into 3 groups (training, cross-validation and test) with a ratio of 6:2:2. After using the training group to train the neural network, we used the cross-validation group to choose a suitable lambda value to protect against overfitting. We then finally used the test set to see how well the neural network works on new stimuli.
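
A sketch of the split and the lambda selection, using scikit-learn's MLPClassifier as a stand-in for our network (the project itself was written in Matlab); the hidden-layer size and candidate lambdas are assumptions:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def split_622(X, y, rng):
        """Shuffle and split into training, cross-validation and test sets."""
        idx = rng.permutation(len(X))
        a, b = int(0.6 * len(X)), int(0.8 * len(X))
        return [(X[p], y[p]) for p in (idx[:a], idx[a:b], idx[b:])]

    def pick_lambda(train, val, lambdas=(1e-3, 1e-2, 1e-1, 1.0)):
        """Choose the regularisation strength minimising validation error."""
        best, best_err = None, np.inf
        for lam in lambdas:
            clf = MLPClassifier(hidden_layer_sizes=(25,), alpha=lam,
                                activation='logistic', max_iter=500)
            clf.fit(*train)                 # alpha plays the role of lambda
            err = 1.0 - clf.score(*val)
            if err < best_err:
                best, best_err = lam, err
        return best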

3.5 Training the neural network

To train a neural network we had to extract the data at the selected patches. We extracted the intensity, normalized by the median intensity of the frame. We also used Prewitt edge detection to get areas of high contrast at all the pixels in the patch. For dynamic frames we added the difference between the current frame and the previous frame as simple motion detection. On static frames we just added zeros for this difference, since there was technically no difference between frames.
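
A sketch of this per-patch feature extraction, assuming grayscale float frames laid out as in Section 3.3:

    import numpy as np
    from scipy.ndimage import prewitt

    PATCH = 8

    def patch_features(frame, prev_frame=None):
        """Median-normalised intensity, Prewitt edge magnitude and
        frame difference, flattened into one vector per 8x8 patch."""
        intensity = frame / (np.median(frame) + 1e-12)
        edges = np.hypot(prewitt(frame, axis=0), prewitt(frame, axis=1))
        if prev_frame is None:
            diff = np.zeros_like(frame)   # static stimuli: zero motion
        else:
            diff = frame - prev_frame
        stacked = np.stack([intensity, edges, diff], axis=-1)
        h, w, c = stacked.shape
        patches = stacked.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
        return patches.transpose(0, 2, 1, 3, 4).reshape(
            (h // PATCH) * (w // PATCH), PATCH * PATCH * c)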

We then fed the neural network the features, with an expected result of 1 for fixated patches and 0 for non-fixated patches. We used a sigmoidal activation function, so the neural network outputs were also in the range 0 to 1.

In order to create a binary map of patches in a frame, we first serialize the frame into patches and feed the neural network the features in those patches. We then take the result and deserialize it back into a 2D frame. After that we convolve the frame with a DoG filter and then a median filter, and finally take the patches at or above the 90th percentile as fixated and the rest as non-fixated.
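
A sketch of that post-processing on a predicted patch map; the filter sigmas and sizes are illustrative assumptions:

    import numpy as np
    from scipy.ndimage import gaussian_filter, median_filter

    def postprocess(pred_map):
        """DoG filter, median filter, then a 90th-percentile cut."""
        dog = gaussian_filter(pred_map, 1.0) - gaussian_filter(pred_map, 3.0)
        smoothed = median_filter(dog, size=3)
        return smoothed >= np.percentile(smoothed, 90)  # binary fixation map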


Chapter 4

Results and Discussion

4.1 Results

The heatmap the trained neural network creates from a frame is, after post-processing, usually a single blob on the most salient area of the image. Since both the original ground truth heatmap and the final result of the neural network are binary maps after post-processing, we can use a simple summation of the absolute delta to get an error for each frame. Dividing that by the number of patches in a frame gives an error percentage of 15.26%.
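
This error is a simple mismatch fraction over patches, as in the following sketch:

    import numpy as np

    def frame_error(pred_binary, truth_binary):
        """Fraction of patches where prediction and ground truth disagree."""
        delta = np.abs(pred_binary.astype(int) - truth_binary.astype(int))
        return delta.sum() / delta.size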

Considering each type of stimulus separately, the average error for dynamic stimuli is 15.19%, while the average error for static stimuli is 18.83%.

Figure 4.1 shows one of the images with a high overlap between the predicted salient region and the actual salient region, and thus a low percentage error. Figure 4.2 shows one of the images with a very low overlap, and thus a high percentage error.

4.2 Discussion

According to the results, the machine learning model after training performs better than chance in predicting the areas of highest saliency. It appears that the final map after convolution with the DoG amplifies only the biggest peak in the saliency map while inhibiting the others, so the resulting positive patches are usually clustered together in one area. This works well when the actual most salient regions are close together; however, if more than one separate region of the image is salient, the model fails to account for that accurately and may dismiss the smaller saliency peaks as noise.


Motion seems to play an important role in the saliency of dynamic scenes. People tend to fixate on the moving parts of a scene more than on the static parts. As the results show, it is easier for the machine learning model to detect saliency when motion is involved than in static scenes with no motion.


Figure 4.1: One of the images with a low error. Top left is the predicted heatmap after processing, top right is the original heatmap from the experimental data, bottom left is the original image and bottom right is the delta between the ground truth heatmap and the predicted one.


Figure 4.2: One of the images with a high error. Top left is the predicted heatmap after processing, top right is the original heatmap from the experimental data, bottom left is the original image and bottom right is the delta between the ground truth heatmap and the predicted one.


Chapter 5

Conclusion and Future work

5.1 Conclusion

Predicting where someone will look in a video or at a scene is very beneficial and provides insight that could be used, for example, to improve warning signs on roads or buildings by making sure they are part of the most salient region of a scene. The results of our work can also be used in video compression. There is, for example, an adaptive compression scheme [Farid et al., 2002] that selectively compresses more heavily the areas of the video that are least likely to be looked at, while keeping the visually conspicuous areas at a higher quality. Normally eye-tracking data from actual subjects viewing the video is used to determine the most important part of a frame; our model could automate this process and provide a good replacement for actual experimental measurements.

The result is of course not error free. While there is considerable overlap between the salient regions detected by the machine learning model and the most salient regions actually observed in the experiment, the prediction is still not perfect. Performance depends on the stimulus: the model may work very well on some stimuli, with great overlap, while on others it may fail to account for the most salient region and predict the wrong areas of the stimulus as the most salient.

5.2 Future work

Looking at our findings, we decided the next direction to proceed in is to test our resulting model on subjects given a specific task, such as a search task, to check how well our model predicts saliency when the observer's task is different.


Also, adapting our model to full-color images may make it more applicable to the outside world, since most of what we see is in full color rather than grayscale. It may even improve the quality of the results, since the color channels provide more of the low-level detail that drives the bottom-up saliency mechanism in the brain that we are trying to model.

Finally, the dynamic stimuli we used all had a static camera, which saved us from calculating local and dominant motion. We could expand our process to also handle a moving camera, giving us a larger variety of dynamic stimuli to work with.


Appendix


Appendix A

Acronyms

LNLN Local Non-Linear Normalization

NS Normalized Summation

ROC Receiver Operating Characteristic

DoG Difference of Gaussians


Bibliography

N. D. B. Bruce and J. K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9:1–24, 2009. doi: 10.1167/9.3.5.

M. Farid, F. Kurugollu, and F. Murtagh. Adaptive wavelet eye-gaze based video compression. Proceedings of the SPIE, 4877:255–263, 2002.

T. Foulsham and G. Underwood. What can saliency models predict about eye movements? Spatial and sequential aspects of fixations during encoding and recognition. Journal of Vision, 8:1–17, 2008. doi: 10.1167/8.2.6.

A. Hyvärinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer, 2009.

L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489–1506, 2000.

T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In International Conference on Computer Vision (ICCV), pages 2106–2113. IEEE, 2009.

W. Kienzle, F. A. Wichmann, B. Schölkopf, and M. O. Franz. A nonparametric approach to bottom-up visual saliency. In Advances in Neural Information Processing Systems 19, 2006.

O. Le Meur, D. Thoreau, P. Le Callet, and D. Barba. A spatio-temporal model of the selective human visual attention. In IEEE International Conference on Image Processing (ICIP), volume 3, 2005.

O. Le Meur, P. Le Callet, D. Thoreau, and D. Barba. A coherent computational approach to model the bottom-up visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:802–817, 2006.

O. Le Meur, P. Le Callet, and D. Barba. Predicting visual fixations on video based on low-level visual features. Vision Research, 2007.
