
Line Drawings of Natural Scenes Guide Visual Attention

Kai-Fu Yang, Wen-Wen Jiang, Teng-Fei Zhan, and Yong-Jie Li, Member, IEEE

Abstract—Visual search is an important strategy of the human visual system for fast scene perception. The guided search theory suggests that the global layout and other top-down sources of a scene play a crucial role in guiding object searching. In order to verify the specific roles of scene layout and regional cues in guiding visual attention, we conducted a psychophysical experiment to record human fixations on line drawings of natural scenes with an eye-tracking system. We collected the fixations of ten subjects on 498 natural images and of another ten subjects on the corresponding 996 human-marked line drawings of boundaries (two boundary maps per image) under a free-viewing condition. The experimental results show that, even in the absence of basic features such as color and luminance, the distribution of fixations on the line drawings is highly correlated with that on the natural images. Moreover, compared with basic regional cues, subjects pay more attention to the closed regions of line drawings, which are usually related to the dominant objects of the scenes. Finally, we built a computational model to demonstrate that the fixation information on the line drawings can be used to significantly improve the performance of classical bottom-up models for fixation prediction in natural scenes. These results support the view that Gestalt features and scene layout are important cues for guiding fast visual object searching.

Index Terms—scene layout, visual attention, eye movements, visual search, scene perception

I. INTRODUCTION

Attention via visual search is necessary for rapid scene perception and object searching because information processing in the visual system is limited to one or a few targets or regions at one time [1]. From the viewpoint of engineering, modelling visual search usually facilitates subsequent higher visual processing tasks, such as image re-targeting [2], image compression [3], and object recognition [4]. However, how visual attention is guided in real visual searching tasks remains an interesting but not fully solved question in the field of visual neuroscience.

Studies of physiology and psychophysics have revealed that several sources of factors play important roles in guiding visual attention. The first is bottom-up guidance. The classical Feature Integration Theory (FIT) suggests that visual attention is driven by low-level visual cues such as local luminance, color, and orientation [5], [6]. In a bottom-up manner, these cues are processed in parallel channels and combined to guide visual attention. This stimulus-driven guidance causes some local regions with feature differences to attract more attention than others. Moreover, studies of computational models based on the FIT (e.g., IT [7], [8]) also show that models employing local cues usually predict attention limited to regions with high local contrast (e.g., object boundaries).

Kai-Fu Yang, Wen-Wen Jiang, Teng-Fei Zhan, and Yong-Jie Li are with the MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, China.

On the other hand, visual attention is also guided by task-related prior knowledge beyond the low-level visual cues [9]. Imagine that we are searching for an object in a complex scene: our attention is guided by the current task and directed to objects with the known features of the desired targets. Previous studies have revealed that the feedback pathways from high-level cortices carry rich and varying information about the behavioural context, including visual attention, memory, expectation, etc. [10]. For example, the classical study by Yarbus [11] shows that the human fixation distribution depends on the question asked when viewing a picture. In addition, user- or task-driven guidance is also widely used in visual processing models in the computer vision area [12]–[14]. Recent interesting works [15], [16] show that human attention is guided by knowledge of the plausible size of target objects. It was suggested that this is a useful brain strategy for humans to rapidly discount distracters (e.g., objects with atypical size relative to their surroundings) during visual searching, and that this is neglected in current machine learning systems [15].

Besides the stimulus-driven and task-driven guidance, contextual guidance is another important source of guidance for general visual searching tasks [17]. According to the Guided Search Theory (GST) [18], the scene context and global statistical information are important for guiding visual object searching. This indicates that rapid global scene analysis will facilitate the prediction of the locations of potential objects. The GST suggests that rapid visual searching involves two parallel pathways [1]: (1) the non-selective pathway, which serves to extract spatial layout (or gist) information rapidly from the entire scene; and (2) the selective pathway, which works to extract and bind the low-level features under the guidance of the contextual information of the scene extracted from the non-selective pathway. This two-pathway strategy provides a unified framework to integrate the context of the scene and the local information for rapid visual searching. Recently, we have built a computational model based on the GST to effectively execute the general task of salient structure detection in natural scenes [19], which further supports the efficiency of context guidance in rapid visual searching in a task-independent manner. However, it has been revealed that such context guidance also employs the prior knowledge or experience in our memory and guides the low-level feature integration in a top-down manner, similar to the task-driven guidance [20], [21].



According to the source of contextual information, we can simply classify contextual guidance into two categories: object-based and scene-based guidance. Nuthmann et al. found a preferred viewing location close to the center of objects within natural scenes when subjects were directed with different task instructions, which suggests that visual attentional selection in scenes is object-based [22]. A related biologically plausible model was proposed to attend to the proto-objects in natural scenes [23]. In addition, object-based guidance also indicates that meaningful visual objects usually follow certain shape principles. For example, the well-known Gestalt theory [24] summarizes some universal principles of visual perception (e.g., closure and symmetry), which are crucial factors for facilitating visual object searching. Based on this object-based prior, Kootstra et al. employed the feature of symmetry for fixation prediction and object segmentation in complex scenes [25], [26]. Yu et al. used Gestalt grouping cues to select the salient regions from over-segmented scenes [27].

As for the scene-based guidance, previous studies have shown that structural scene cues, such as layout, horizontal line, openness, depth, etc., usually guide visual searching [28], [29]. Beyond the geometry of the scenes, visual attention is also guided by the semantic information of real-world scenes, such as the gist of the scenes [28], scene-object relations, and object-object relations [30], [31]. Therefore, a clear understanding of the neural basis of scene perception, object perception, perceptual learning, and memory will also facilitate the understanding of visual searching [32].

Moreover, scene structure can provide specific scene-related information for specific visual tasks. Recent studies have addressed the role of context in driving [33], [34]. By analysing the fixation data of 40 non-drivers and experienced drivers when viewing 100 traffic images, Deng et al. showed that drivers' attention was mostly concentrated on the vanishing points of the roads [33], [34]. In addition, their models further support that vanishing point information is more important than other bottom-up cues in guiding drivers' attention during driving. At the same time, Borji et al. also found that vanishing points can guide attention in free-viewing and visual searching tasks [35].

On the other hand, physiological evidence has shown that the global contours delineating the outlines of visual objects can be responded to quite early by neurons of higher visual cortices, which can provide sufficient feedback modulation to enhance the object-related responses at the lower visual levels [36], [37]. Therefore, in this paper we focus on the role of line drawings in visual guidance, which is closely related to important contextual guidance cues such as regional shapes and scene layout. Specifically, we explore the role of line drawings in visual searching with a psychophysical experiment. We first collected the human fixations from 498 natural images and the corresponding 996 human-marked line drawings (two line drawings for each image) in a free-viewing manner. The experimental results show that, in the absence of basic features such as color and luminance, the distribution of fixations on the line drawings has a high correlation with that on the natural images. Moreover, the subjects pay more attention to the closed regions of the line drawings, which are usually related to the dominant objects in the scenes. We also built a computational model to demonstrate that the information of line drawings can be used to significantly improve the performance of some classical bottom-up models for fixation prediction in natural scenes.

II. MATERIAL AND METHODS

A. Stimuli and Observers

We employed 498 natural images from the Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) [38] and collected the eye fixations of 10 subjects on the natural images in a free-viewing manner. This eye-tracking experiment was conducted in our previous work [39]. In this work, with the help of the human-marked segmentations and the contour maps (i.e., the line drawings) provided by BSDS500 [38], we further collected the human fixations of another 10 subjects when they freely viewed the line drawings.

In this work, we chose two contour maps as visual stimuli from the 5-10 human-marked contour maps available in BSDS500 for each natural scene: the one containing the most contour pixels (denoted by Cmax) and the one containing the least contour pixels (denoted by Cmin). This choice helps demonstrate the consistent role of contours in guiding visual searching, whether or not the detailed parts are contained in the contour maps. Moreover, in order to keep consistent with our previous experiment with the natural scenes [39], each image was resized to 1024×768 pixels in this work by adding gray margins around it while maintaining the aspect ratio. The observers included 5 males and 5 females with a mean age of 23.8 (22-25) years. The selected observers had normal or corrected-to-normal vision. They were naive to the purpose of the experiment and had not previously seen the stimuli. This study was carried out in accordance with the recommendations of the Guideline for Human Behaviour Studies of the Institutional Review Board of the UESTC MRI Research Center for Brain Research, with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Institutional Review Board of the UESTC MRI Research Center for Brain Research.

B. Procedure

The procedure is the same as that of our previous experiment on the corresponding natural scenes [39]. Ten subjects were asked to watch the given images displayed on the screen in a free-viewing manner. Each image was presented for 5 seconds after a visual cue of a cross ("+") in the center of the gray screen. An Eyelink-2000 eye-tracking system was used to collect the fixations of the subjects at a sampling rate of 1000 Hz, and eye-tracking re-calibration was performed every 30 images. A chin rest was used to stabilize the head.

The procedure of the experiment was designed and programmed based on the Psychtoolbox of MATLAB. The images were presented on the display in a random order, and the contour maps of the same natural images were separated by at least 30 trials. The observers were instructed to simply watch and enjoy the pictures (i.e., a free-viewing task).


Finally, we obtained the fixations of 10 subjects on the line drawings of 498 natural images from the BSDS500 dataset [38]. Note that for two images of BSDS500 we failed to collect reasonable fixation data because the objects in those images are too small.

Fig. 1. The flowchart of the visual attention model guided by line drawings.

C. Model-based analysis

In the field of computer vision, numerous saliency detection methods have been proposed, some of which show promising performance, especially when deep learning technology is used to predict human fixations [40]. However, researchers in the field of neuroscience devote themselves to making progress in understanding how humans execute efficient visual searching. Some researchers have proposed various biologically inspired models for fixation prediction [7], [8]. These methods usually have bottom-up architectures and respond to regions with high local contrast in various feature channels.

Our experimental results suggest that closed regions have a higher possibility of forming meaningful objects and may attract most of our visual fixations (see Results). Therefore, we build a simple model to extract the closed regions and generate a prior for the locations where objects are likely to be present. This prior is used to guide the generation of bottom-up saliency and improve fixation prediction. Figure 1 shows the flowchart of the guided visual attention model.

Firstly, we extract the edges from the input image with the Structured Forests method presented in [41] and group the main edges into contours with the edge linking method [42]. Then, we estimate the possibility that each point in the image lies within a closed region based on the contour map. Figure 2 shows two examples of the closure computation at different points. For each point (x, y), we search for contour pixels around it along D directions starting from (x, y) (we set D = 8 in this study). We use N_d(x, y) to denote the number of searched directions along which a contour pixel is found around (x, y). For example, P1(x, y) has N_d = 5, while P2(x, y) has N_d = 8. That means point P2(x, y) has a higher possibility of being located in a closed region than P1(x, y).

Computationally, the possibility of (x, y) within a closed region (denoted by Closure Degree) is obtained with

C(x, y) = exp(N_d(x, y) − D)    (1)

In addition, we further consider the radii of regions to refine the estimate of the closure degree. Let R_i(x, y), i ∈ [1, D], denote the radii along all the considered directions from point (x, y). We then compute the final closure degree of (x, y) as

S_cc(x, y) = C(x, y) / (R(x, y) + S(x, y))    (2)

where R(x, y) is the mean and S(x, y) is the standard deviation of all the radii along the D directions from point (x, y). With the definition in Equation (2), small regions (with low R(x, y)) have higher closure degrees than larger regions. Meanwhile, a low S(x, y) means the region is closer to a circle, which attracts more attention.
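As an illustration of how Equations (1) and (2) can be computed, the following is a minimal sketch in Python/NumPy (not the authors' released code); the ray-marching search, the function name, and the maximum search radius max_radius are our own illustrative assumptions.

```python
import numpy as np

def closure_degree_map(contour, D=8, max_radius=200):
    """Closure degree S_cc of Eqs. (1)-(2): from each pixel, cast D rays and
    count the rays that hit a contour pixel (N_d), recording the hit radii.
    contour: boolean map of contour pixels."""
    h, w = contour.shape
    angles = np.linspace(0.0, 2.0 * np.pi, D, endpoint=False)
    s_cc = np.zeros((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            radii = []
            for a in angles:
                dy, dx = np.sin(a), np.cos(a)
                for r in range(1, max_radius + 1):           # walk outward along the ray
                    yy, xx = int(round(y + r * dy)), int(round(x + r * dx))
                    if not (0 <= yy < h and 0 <= xx < w):     # left the image without a hit
                        break
                    if contour[yy, xx]:                       # hit a contour pixel
                        radii.append(r)
                        break
            n_d = len(radii)
            c = np.exp(n_d - D)                               # Eq. (1)
            if n_d > 0:
                s_cc[y, x] = c / (np.mean(radii) + np.std(radii))  # Eq. (2)
    return s_cc
```

In the full model the contour input would come from the Structured Forests edge map [41] after edge linking [42]; here it is simply assumed to be a binary map.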

Fig. 2. Examples of computing the closure degree for each point.

In order to evaluate the contribution of contextual guidance in fixation prediction, we build a simple model to improve the performance of fixation prediction by integrating the spatial layout information into classical bottom-up methods. In detail, we employ a Gaussian mixture model (GMM) [43] to fit the closure degree map obtained with Equation (2) and find the K main components (we set K = 5 in the experiments). The line drawing guided prior is obtained by combining the multiple GMM components G_i(x, y) with weights ω_i:

W_GMM(x, y) = Σ_{i=1..K} ω_i · G_i(x, y)    (3)

ω_i = ω_i^porp · ω_i^ellip · ω_i^dis    (4)

where each component is weighted by its component proportion (ω_i^porp), ellipticity (ω_i^ellip), and distance weight (ω_i^dis). In detail, ω_i^porp is the component proportion from the GMM fitting. ω_i^ellip is related to the ellipticity, which is computed from the long and short radii of the i-th Gaussian component along its two main axes; a rounder component has a greater weight. ω_i^dis is related to the distance of the center of the i-th Gaussian component to the image center, which indicates that a component nearer to the image center has a greater weight.

Finally, the combined saliency map S_com can be obtained as

S_com(x, y) = S_bottom-up(x, y) · W_GMM(x, y)    (5)

where S_bottom-up(x, y) is a saliency map obtained by a low-level saliency method, such as IT [7]. With Equation (5), we simply obtain the line drawing (closure cue) guided visual saliency map.
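The GMM-based prior of Equations (3)-(5) could be sketched as follows, assuming scikit-learn's GaussianMixture for the fitting; the exact forms of the ellipticity and distance weights, the sampling scheme, and the normalization are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def line_drawing_prior(s_cc, K=5, n_samples=5000, dist_scale=0.5, seed=0):
    """Line-drawing-guided prior W_GMM of Eqs. (3)-(4): fit a K-component GMM to the
    closure-degree map and weight each component by proportion, roundness, and centrality."""
    rng = np.random.default_rng(seed)
    h, w = s_cc.shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = s_cc.ravel() / s_cc.sum()
    idx = rng.choice(h * w, size=n_samples, p=p)               # sample points ~ closure degree
    pts = np.stack([xs.ravel()[idx], ys.ravel()[idx]], axis=1).astype(float)
    gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=seed).fit(pts)

    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    center = np.array([w / 2.0, h / 2.0])
    prior = np.zeros(h * w)
    for i in range(K):
        g_i = multivariate_normal(gmm.means_[i], gmm.covariances_[i]).pdf(grid)
        evals = np.linalg.eigvalsh(gmm.covariances_[i])
        w_porp = gmm.weights_[i]                                # component proportion
        w_ellip = np.sqrt(evals.min() / evals.max())            # rounder component -> larger weight
        w_dis = np.exp(-np.linalg.norm(gmm.means_[i] - center) /
                       (dist_scale * max(h, w)))                # nearer to image center -> larger weight
        prior += w_porp * w_ellip * w_dis * g_i                 # Eqs. (3)-(4)
    prior = prior.reshape(h, w)
    return prior / prior.max()

def guided_saliency(s_bottom_up, prior):
    """Eq. (5): modulate a bottom-up saliency map (e.g., IT) pixel-wise by the guided prior."""
    return s_bottom_up * prior
```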

III. RESULTS

A. Human fixation analysis

Compared to a natural scene, the line drawing of an image excludes basic low-level features such as color and luminance. However, what is the difference between the distribution of fixations on the line drawings and that on the natural images of the same scenes? Figure 3 shows several examples of the fixation distributions on the natural images and the corresponding line drawings. In this figure we display all the fixation points from the 10 subjects on each visual stimulus. Note that we removed the first fixation point of each subject in order to eliminate the center bias, considering that the visual cue of a cross marker was presented at the center of the gray screen.

From Figure 3, we can clearly see that the fixations on different stimuli (natural scenes and line drawings) of the same scenes are quite similar. Most of the fixation points (green dots) are located on the dominant objects in both the natural scenes and the line drawings. In particular, even without color and surface information, visual attention is still attracted by the objects in the line drawings, which are defined by the shapes and the Gestalt features of local regions. Note that without the specific surface features inside the objects, the fixations on the line drawings are distributed more dispersively than those on the natural scenes. This suggests that visual attention is attracted by the structure of scenes and the shape features of the potential objects that are outlined by the dominant contours in the line drawings. When we are viewing the natural scenes, the local feature contrasts can further gather our attention to some salient regions of the visual objects. This also indicates that the layout and the structure of scenes may guide our attention to the dominant objects for further visual processing.

For a quantitative comparison, we also calculated the correlation coefficient between the distribution of fixations on the line drawings and that on the natural images. Considering that the fixation points are extremely sparse on the stimuli, various scales of Gaussian blurring were used to obtain the spatial distribution of fixations. The correlation coefficients (CC) were computed between the blurred distributions of fixations on the line drawings and the natural scenes, i.e.,

CC(f_ld, f_ns) = cov(f_ld, f_ns) / (σ_ld σ_ns)    (6)

where cov(·) is the covariance, and σ_ld and σ_ns are the standard deviations of the blurred distributions of fixations on the line drawings (f_ld) and the natural scenes (f_ns). Figure 4 shows the correlation coefficients with varying blurring scales. The fixation distributions on the line drawings of Cmax and Cmin are denoted by FCmax and FCmin, respectively. We can see that, compared with the line drawings Cmin containing fewer details, the line drawings Cmax containing more details provide more visual information, and the fixation distribution on Cmax is slightly closer to that on the natural images.
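A minimal sketch of the correlation measure of Equation (6), assuming the fixations are given as binary maps that are blurred with a Gaussian of scale sigma (the varying scale of Figure 4):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map_cc(fix_ld, fix_ns, sigma=30):
    """Eq. (6): Pearson correlation between Gaussian-blurred fixation maps.
    fix_ld / fix_ns are binary maps of fixation locations on the line drawing
    and on the natural scene (same size); sigma is the blur scale in pixels."""
    f_ld = gaussian_filter(fix_ld.astype(float), sigma)
    f_ns = gaussian_filter(fix_ns.astype(float), sigma)
    cov = np.mean((f_ld - f_ld.mean()) * (f_ns - f_ns.mean()))
    return cov / (f_ld.std() * f_ns.std())
```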

Fig. 3. Examples of fixation points (green dots) on natural scenes and the corresponding line drawings. From top to bottom row: the fixations from 10 subjects when viewing the natural scenes, the line drawings Cmax, and Cmin, respectively.

Figure 4 indicates the high correlation (the correlation coefficients are around 0.7) of the fixation distributions between the natural scenes and their line drawings. However, it is easily understood that the fixation locations of different subjects are usually not exactly consistent with each other, even when they are viewing the same objects. Therefore, we suppose that different subjects are attending to the same objects if their fixation points are located within different parts of the same objects. In order to evaluate the object-based consistence between the visual attention on the natural scenes and the line drawings, we first generated the salient regions based on the human-marked segmentations provided by BSDS500 as follows. For each segment, we computed the density of the human fixation points as its saliency level. Thus, if a segment (or a region) attracts more fixation points, it obtains a higher saliency score. For each scene, the two salient object maps generated based on the two line drawings (i.e., Cmax and Cmin) and the corresponding fixations were linearly summed together and integrated into a final map containing the hierarchical salient objects. Similarly, we obtained the hierarchical salient objects based on the human-marked segmentations and the fixations collected on the original images. Figure 5 shows several examples of hierarchical salient objects generated from the fixations on the natural images and the line drawings.
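The assignment of a saliency level to each human-marked segment from its fixation density, as described above, can be sketched as follows; the max-normalization is an illustrative assumption.

```python
import numpy as np

def hierarchical_salient_objects(segmentation, fixations):
    """Assign each segment a saliency equal to its fixation density (fixations per pixel),
    as used to build the hierarchical salient-object maps.
    segmentation: integer label map; fixations: binary map of fixation locations."""
    saliency = np.zeros(segmentation.shape, dtype=float)
    for label in np.unique(segmentation):
        mask = segmentation == label
        density = fixations[mask].sum() / mask.sum()   # fixation density of this segment
        saliency[mask] = density
    return saliency / saliency.max() if saliency.max() > 0 else saliency
```

The two maps obtained from the fixations on Cmax and Cmin can then be linearly summed to form the hierarchical salient-object map described in the text.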

Fig. 4. Correlation coefficients between the fixation distributions on the line drawings (Cmax and Cmin) and the natural images with varying scales (σ, in pixels) of Gaussian blurring. The fixation distributions on the line drawings of Cmax and Cmin are denoted by FCmax and FCmin, respectively. The error bars are the 95% confidence intervals.

In this work, the object-based consistence of visual attention was evaluated based on the hierarchical salient objects using two metrics: the correlation coefficient [44] and the Mean Absolute Error (MAE) [19], [45]. As indicated in Figure 6, with the salient objects generated from the fixations on the natural scenes (SFN) as the baseline, the salient objects generated from the fixations on the line drawings (SFC) obtain a clearly higher correlation coefficient and lower MAE than some representative computational models (including MR [46], HS [47], and CGVS [19]). Admittedly, this is an obvious conclusion considering that these hierarchical salient objects were marked based on the fixations from human subjects. However, this experiment indeed suggests the high consistence of visual attention when we are viewing the natural images and the line drawings of the same scenes. Even without the color and surface information, human subjects can still execute efficient visual searching by employing only the limited scene structure information.

B. Guidance cues

In the absence of color and luminance, how do the line drawings guide our visual attention? Some studies have revealed the role of scene structure for visual guidance in specific visual scenes or tasks. For example, previous studies have shown that the vanishing point of a road plays an important role in traffic scenes, especially with the driving task [33]. However, most of the stimuli used in this experiment are task-irrelevant general images, each of which usually contains at least one dominant object. Therefore, we believe that the shape features of regions and the scene layout mainly contribute to predicting potential objects.

We tested ten shape-related common features and evaluated their correlations with the fixation distributions.


Fig. 5. Allocating a saliency score to each region with the fixation density. SFN denotes the hierarchical salient objects generated based on the human-marked segmentations and fixations on the natural images, while SFC denotes the hierarchical salient objects generated based on the human-marked segmentations and fixations on the contours (i.e., line drawings). Brighter regions have higher saliency scores.


Fig. 6. Correlation coefficients and MAEs of the salient regions based on the hierarchical salient objects. The baseline is the salient objects generated with the fixations on the natural scenes (SFN), while SFC denotes the hierarchical salient objects generated based on the human-marked segmentations and fixations on the line drawings.

These shape features were extracted from each segment of each image; they are listed in Table I. These features reflect different characteristics of regions. For example, regions with a higher degree of Closure are usually related to meaningful objects with higher probability. Perimeter Ratio reflects the continuity of the boundary of a region; a higher Perimeter Ratio indicates higher irregularity of a region. It has been verified that some of these features are helpful for salient object detection [48]. However, in this work, we further analysed the contribution of the individual features to the specific task of visual attention.
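For illustration, a sketch of extracting several of the Table I features from a label map of segments is given below; it relies on skimage's regionprops naming conventions, and the Centralization term uses the centroid distance as a proxy for the mean pixel distance, so the exact normalizations may differ from those used in the paper.

```python
import numpy as np
from skimage.measure import regionprops

def shape_features(segmentation):
    """Several of the Table I region features per segment (skimage regionprops naming).
    segmentation: integer label map (labels > 0)."""
    h, w = segmentation.shape
    diag = np.hypot(h, w)
    feats = {}
    for rp in regionprops(segmentation):
        feats[rp.label] = {
            'area_ratio': rp.area / float(h * w),
            'perimeter_ratio': rp.perimeter / (2.0 * (h + w)),
            'min_major_axis': rp.minor_axis_length / max(rp.major_axis_length, 1e-6),
            'eccentricity': rp.eccentricity,
            'orientation': rp.orientation,
            'equiv_diameter': np.sqrt(4.0 * rp.area / np.pi) / diag,   # circle-equivalent diameter / image diagonal
            'solidity': rp.solidity,
            'extent': rp.extent,
            # centroid distance to the image center, normalized by the diagonal (proxy for Centralization)
            'centralization': np.hypot(rp.centroid[0] - h / 2.0,
                                       rp.centroid[1] - w / 2.0) / diag,
        }
    return feats
```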

Figure 7 shows the relations between the saliency values (which are defined by the fixation density) and the corresponding shape cues of each segment of Cmax. In order to clearly show the distributions, the saliency values are shown in log scale in Figure 7. The relations between the saliency values and the corresponding shape cues of each segment on Cmin are similar to those on Cmax. Table I also lists the correlation coefficient of the saliency values with each shape cue on Cmax and Cmin. We can find that the feature of Closure contributes most to visual attention among all the cues considered here. In addition, the contributions of Closure, EquivDiameter, and Perimeter Ratio are higher than that of the well-known center bias (Centralization). In contrast, some features like Orientation contribute little to object detection. This makes sense: most of the regions in the natural scenes tested in this work are distributed around the 0-degree and 90-degree orientations (the bottom-left panel in Figure 7), so Orientation cannot provide enough information to distinguish between object and non-object regions.

It has been shown that closure, the most important one among the considered features, plays a pivotal role in describing an object proposal [27]. Here we further evaluate the contribution of the closure feature to the specific task of visual searching.


TABLE I
TEN SHAPE-RELATED COMMON FEATURES USED IN THIS STUDY AND THEIR CORRELATION WITH THE FIXATION DISTRIBUTIONS.

Features | Description | Cmax | Cmin
Area Ratio | The ratio of the number of pixels in the region to the total pixels in the image | -0.459 | -0.483
Centralization | The mean distance of pixels in the region to the center of the image, normalized by the image size | -0.409 | -0.301
Perimeter Ratio | The ratio of the perimeter of the region to the perimeter of the image | -0.510 | -0.526
MinAxis/MajorAxis | The length of the major axis divided by the length of the minor axis of the region | -0.072 | 0.072
Eccentricity | The eccentricity of the ellipse that has the same second moments as the region | -0.067 | -0.065
Orientation | The orientation of the major axis of the ellipse that has the same second moments as the region | 0.013 | -0.006
EquivDiameter | The ratio of the diameter of a circle with the same area as the region to the diagonal length of the image | -0.609 | -0.601
Solidity | The proportion of the pixels in the convex hull that are also in the region | -0.047 | -0.013
Extent | The ratio of pixels in the region to pixels in the total bounding box | -0.066 | -0.234
Closure | The closure score of a region as defined in Equation (7) | 0.632 | 0.736

Fig. 7. Correlations between various shape features and the saliency levels of each region based on Cmax. The number listed for r at the top of each panel is the linear correlation coefficient between the feature and the saliency values (in log scale).

Firstly, we define a score to measure the degree of closure of a region based on the human-marked segmentation. Let NB_i denote the number of intersections between the boundary pixels of segment i and the border pixel set of the image, and let NR_i denote the number of all boundary pixels of segment i. Then, the closure score of the contour pixels on segment i is defined as

CS_i = 1 − NB_i / NR_i    (7)

We denote the set of fixations as S_f and the set of contour pixels as S_c. In addition, we define a closed region as a segment with a closure score higher than 0.9 (i.e., CS_i > 0.9), and the set of pixels on closed contours and in the closed regions as S_cc.
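A small sketch of the closure score of Equation (7) for a single human-marked segment, assuming a boolean segment mask and using binary erosion to obtain its boundary pixels:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def closure_score(segment_mask):
    """Eq. (7): CS_i = 1 - NB_i / NR_i, where NR_i counts all boundary pixels of the
    segment and NB_i counts those lying on the image border.
    segment_mask: boolean mask of one segment."""
    boundary = segment_mask & ~binary_erosion(segment_mask)     # boundary pixels of the segment
    border = np.zeros_like(segment_mask, dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    nr = boundary.sum()
    nb = (boundary & border).sum()
    return 1.0 - nb / nr if nr > 0 else 0.0
```

Segments with a score above 0.9 would then be collected into the closed-region set S_cc, as defined above.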

Then, we define four metrics as follows: (1) the percentage of the fixations around the contours among all fixations (PoF); (2) the percentage of the contours around the fixations among all contour pixels (PoC); (3) the percentage of the fixations around and in the closed regions among the fixations around all contours (PoFC); and (4) the percentage of the closed regions in the whole image (PoCC). These four metrics can be computed as

PoF = |D_N(S_c) ∩ S_f| / |S_f|    (8)

PoC = |S_c ∩ D_N(S_f)| / |S_c|    (9)

PoFC = |D_N(S_cc) ∩ S_f| / |S_f|    (10)

PoCC = |D_N(S_cc)| / (W × H)    (11)

where D_N(S) denotes the dilation operator with a size of N×N pixels applied on S (a binary map), and W and H represent the width and height (in pixels) of the image, respectively.
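The four metrics of Equations (8)-(11) can be sketched as follows, assuming binary maps for the fixations, the contour pixels, and the closed-region pixels, and an N×N structuring element for the dilation:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def contour_fixation_metrics(fixations, contours, closed_regions, N=15):
    """Eqs. (8)-(11). fixations, contours, closed_regions are binary maps of the same size;
    closed_regions marks the pixels on closed contours and inside closed regions (S_cc);
    D_N is a binary dilation with an N x N structuring element."""
    se = np.ones((N, N), dtype=bool)
    d_contours = binary_dilation(contours, structure=se)
    d_fix = binary_dilation(fixations, structure=se)
    d_closed = binary_dilation(closed_regions, structure=se)
    w_h = float(fixations.size)                                 # W x H
    pof = (d_contours & fixations).sum() / fixations.sum()      # Eq. (8)
    poc = (contours & d_fix).sum() / contours.sum()             # Eq. (9)
    pofc = (d_closed & fixations).sum() / fixations.sum()       # Eq. (10)
    pocc = d_closed.sum() / w_h                                 # Eq. (11)
    return pof, poc, pofc, pocc
```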

Figure 8 graphically illustrates these metrics with varying N values on the line drawings of Cmax and Cmin. From Figure 8 (left), PoF only reaches around 0.5 even with a large N, which indicates that less than half of the fixation points are located near the contour lines and more than half of the fixation points are located in non-contour regions (without any local contrast). This suggests that there are some higher-order visual cues beyond local contrast (contours) for guiding visual attention.


In addition, although contour lines attract around half of the visual fixations, PoC further shows that these fixations are located around only 30% of all contour pixels. This means that some parts of the contour lines play a more important role than others in guiding visual attention.

A simple supposition is that visual objects are commonly defined by the closed regions in line drawings and that visual attention usually focuses on the potential visual objects. The examples shown in Figure 4 and the results listed in Table I suggest that closed regions attract more visual fixations. In order to further verify this supposition, we analyzed the metrics PoFC and PoCC with varying N values on the line drawings of Cmax and Cmin. Figure 8 (right) indicates that around 80% of the fixations are located within or around the closed regions (high PoFC), although the closed regions cover only a small percentage of the area of the whole scene (low PoCC).

C. Model evaluation

In this experiment, we employed several representative classical models, including the FIT-based model (IT) [7], AIM [49], SIG [50], and the graph-based model (GB) [51]. To evaluate the performance of multiple models for the task of fixation prediction, we employed the receiver operating characteristic (ROC) curve, a metric widely used in the field of computer vision [44], [49]. In this experiment, we used the revised version of the ROC [52], which focuses mainly on the true positive rate, i.e., the proportion of actual fixations that are correctly identified as such; the human fixations on the natural scenes are regarded as the ground truth. As shown in Figure 9, the fixation prediction performances of all the considered bottom-up models are significantly improved after integrating the guidance information from line drawings using Equation (5). For example, the IT model is a classical bottom-up model, which predicts the fixations in natural images by combining the contrast features of color, luminance, and orientation [7]. Therefore, in the absence of scene structure features, the IT model mainly detects regions with high local contrast while missing the meaningful object regions (e.g., the inner surfaces of objects). However, line drawings can provide additional scene structure and regional shape information, which serve to guide visual attention to focus on the inner parts of potential objects. Therefore, the closure prior from the line drawings, which represents the important region information, can remarkably improve the performance of IT. Figure 10 shows several examples of saliency maps predicted by the original IT and the improved model (Guided IT).
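For completeness, a hedged sketch of the evaluation step: a generic ROC-AUC with the human fixations as ground truth, used here only as a stand-in for the revised ROC measure of Judd et al. [52]; the function names and the combination with the prior defined earlier are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fixation_auc(saliency, fixations):
    """Generic ROC-AUC for fixation prediction: saliency values at fixated pixels are
    positives, all remaining pixels are negatives. This is a simple stand-in, not the
    exact revised ROC variant of [52]."""
    y_true = (fixations.ravel() > 0).astype(int)
    y_score = saliency.ravel().astype(float)
    return roc_auc_score(y_true, y_score)

# Usage sketch: improve a bottom-up map with the line-drawing prior and compare AUCs.
# s_guided = guided_saliency(s_it, line_drawing_prior(closure_degree_map(contour)))
# print(fixation_auc(s_it, fix_map), fixation_auc(s_guided, fix_map))
```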

IV. DISCUSSION

Classical theories of visual processing view the brain as a stimulus-driven device, in which visual information is extracted hierarchically along the visual pathway [5]. However, numerous recent neurophysiological and psychological studies support that a coarse-to-fine mechanism plays an important role in various visual tasks [53], such as stereoscopic depth perception [54], temporal changes [55], object recognition in context [21], etc. In addition, Oliva et al. reveal that the scale usage of visual information is flexible and can provide the information required depending on the task constraints [56]. On the other hand, it has been proven that the coarse-to-fine strategy can contribute to various computer vision applications, e.g., contour detection [57] and image segmentation [58]. These studies further confirm the efficiency of the coarse-to-fine mechanism in visual information processing from the viewpoint of computational modeling.

It is widely accepted that visual processing is an active and highly selective process [20], which reflects the dynamic interaction between bottom-up feature extraction and top-down constraints [1], [59]. This means that visual information processing is not purely stimulus-evoked, but is constrained by top-down influences that strongly shape the intrinsic dynamics of cortical circuits and constantly create predictions about the forthcoming information [10], [59], [60]. For example, in object recognition in context, context-based prediction makes the task of object recognition more efficient [21]. These studies support that context information, acting as coarse and global scene information, can be rapidly extracted and used to facilitate object recognition in complex scenes [21], [61].

As an important aspect of scene perception, visual attention is also considered a process that encompasses many aspects central to human visual and cognitive function [20]. Knudsen also proposed a well-known conceptual framework indicating that attention combines the contributions of four distinct processes: working memory, competitive selection, top-down sensitivity control, and automatic filtering for salient stimuli [62]. Computational modeling of visual searching has also demonstrated the contribution of information from different levels of perception [17].

According to the scales of features, visual information can be coarsely divided into three levels: low, middle, and high. Firstly, at the low-level scale, classical bottom-up frameworks (e.g., Koch, Itti et al. [5]–[7]) have shown that local contrast in various feature channels can attract visual attention. These models are usually based on local filtering and obtain a stimulus-driven saliency map without taking into consideration the behavioral goals of searching [8]. Secondly, at the region scale (mid-level), perceptual organization principles, which describe how basic visual features are organized into more coherent units [63] (e.g., Gestalt [24]), could guide visual searching. For example, regions that match some specific principles (e.g., closure, continuity, symmetry, etc.) usually indicate special visual objects that are more interesting for human visual tasks [25], [64]. Finally, at the scene scale (high-level), scene layout and structure are important guidance for predicting where the interesting objects are present in the current scene [1], [65]. According to the guided search theory, the high-level scene semantics and gist could provide important spatial cues for targets, which will facilitate the binding of low-level features and speed up visual searching [1], [17], [28], [33]. These principles are usually solidified in our memory as certain general knowledge obtained through learning in our daily life.

In this paper, we focus on the role of line drawings in guiding visual attention. Generally, line drawings of natural scenes provide two types of structure-related information, i.e., the shape features of regions (mid-level) and the layout of the scene (high-level).


Fig. 8. The multiple metrics under the dilation operation with varying operator sizes of N × N pixels on the line drawings of Cmax (top row) and Cmin (bottom row). Left: PoF and PoC; Right: PoFC and PoCC. The error bars are the 95% confidence intervals.

Fig. 9. Performance comparison between the bottom-up models and the improved models when integrating the guidance information from line drawings.

Therefore, line drawings can represent two types of contextual sources. On the one hand, line drawings segment an image into various perceptual regions (before the perception of specific objects) that may be used by the visual system to rapidly construct a rough sketch of the scene structure in space [66], and they also contribute to filling in the surfaces of regions [67]. At the level of regions, visual object shapes usually obey some general principles that can be considered certain general priors or knowledge. For example, the Gestalt principles describe some general principles of visual perception [24]. As for visual attention, the shapes of regions matching some Gestalt principles (e.g., closure, continuity, etc.) could attract more visual attention because they represent meaningful visual objects with higher probability. Moreover, geometrical properties of regions have been strongly proven to contribute to early global topological perception [68].

On the other hand, line drawings provide the coarse scene structure. The visual guided search theory proposed by Wolfe et al. [1] suggests that global scene information (including the scene spatial layout) can be transferred rapidly to higher visual cortices via the so-called non-selective visual pathway. Line drawings could give the rough layout of surfaces in space and provide the basis for scene-based guidance of visual searching.


Fig. 10. Examples of saliency maps predicted by the original IT and the improved model (Guided IT).

Physiological evidence shows that the global contours delineating the outlines of visual objects may be responded to quite early (perhaps via a special pathway) by neurons of higher cortices, which, although producing only a crude signal about the position and shape of the objects, can provide sufficient feedback modulation to enhance the contour-related responses at lower levels and suppress the irrelevant background inputs [37]. In addition, coarse contour is low-frequency visual information [57], which carries coarse scene information and provides enough signal to infer the scene structure and layout [21], [61].

To summarize, visual search should be a unified framework that integrates information from different sources in the brain [69]. As important guidance cues, line drawings of natural scenes provide a rapid representation of the scene structure and regional shape. Coarse scene layout may not be enough for object identification tasks, but it can provide efficient cues to predict where the potential target is and contribute to an active vision system that achieves highly efficient visual searching and scene perception [20].

V. CONCLUSION

In this paper, we collected a dataset of human fixations on natural scenes and the corresponding line drawings, which is useful for analyzing the mechanisms of visual attention at the object or shape level. Our experiments reveal a high correlation between the distribution of fixations on the natural images and that on the line drawings of the same scenes. In addition, we systematically analyzed the effects of various shape-based visual cues in guiding visual attention. The results suggest that closed regions have a higher possibility of forming meaningful objects and may attract most of our visual fixations. Finally, the computational model further verifies that the information of line drawings can be used to significantly improve the fixation prediction performance of classical bottom-up methods on natural scenes.

In conclusion, we suggest that the cortical processing involved in visual attention or visual search should be decomposed not only into various parallel feature channels (as in the IT model [7]), but also into various hierarchical levels, including low, mid, and high levels. At the same time, the information from the various levels plays different roles in visual searching. Therefore, our future work will be extended in the following aspects: (1) to build a computational model to predict the scene layout, which will be a powerful complement to the low-level visual features for visual object searching; (2) to further study how the scene structure information guides our visual attention; and (3) to integrate the guidance information from various scales and build a unified framework for visual guided searching tasks combining the low-, mid-, and high-level guidance cues.

ACKNOWLEDGMENT

This work was supported by the Natural Science Foundations of China under Grant 61703075, the Sichuan Province Science and Technology Support Project under Grant 2017SZDZX0019, and the Guangdong Province Key Project under Grant 2018B030338001.


REFERENCES

[1] J. M. Wolfe, M. L.-H. Vo, K. K. Evans, and M. R. Greene, "Visual search in scenes involves selective and nonselective pathways," Trends in Cognitive Sciences, vol. 15, no. 2, pp. 77–84, 2011.
[2] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 1915–1926, 2012.
[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: an overview," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103–1127, 2000.
[4] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. II-37–II-44.
[5] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[6] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," in Matters of Intelligence, 1987, pp. 115–141.
[7] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 11, pp. 1254–1259, 1998.
[8] L. Itti and C. Koch, "Computational modelling of visual attention," Nature Reviews Neuroscience, vol. 2, no. 3, pp. 194–203, 2001.
[9] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao, "Predicting human gaze beyond pixels," Journal of Vision, vol. 14, no. 1, pp. 1–20, 2014.
[10] C. D. Gilbert and W. Li, "Top-down influences on visual processing," Nature Reviews Neuroscience, vol. 14, no. 5, pp. 350–363, 2013.
[11] A. L. Yarbus, "Eye movements during perception of complex objects," in Eye Movements and Vision. Springer, 1967, pp. 171–211.
[12] V. Navalpakkam and L. Itti, "Modeling the influence of task on attention," Vision Research, vol. 45, no. 2, pp. 205–231, 2005.
[13] C. Kanan, M. H. Tong, L. Zhang, and G. W. Cottrell, "SUN: Top-down saliency using natural statistics," Visual Cognition, vol. 17, no. 6-7, pp. 979–1003, 2009.
[14] A. Borji, D. N. Sihite, and L. Itti, "What/where to look next? Modeling top-down visual attention in complex interactive environments," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 5, pp. 523–538, 2014.
[15] M. P. Eckstein, K. Koehler, L. E. Welbourne, and E. Akbas, "Humans, but not deep neural networks, often miss giant targets in scenes," Current Biology, vol. 27, no. 18, pp. 2827–2832, 2017.
[16] J. M. Wolfe, "Visual attention: Size matters," Current Biology, vol. 27, no. 18, pp. R1002–R1003, 2017.
[17] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson, "Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search," Psychological Review, vol. 113, no. 4, pp. 766–786, 2006.
[18] J. M. Wolfe, "Guided search 2.0: a revised model of visual search," Psychonomic Bulletin & Review, vol. 1, no. 2, pp. 202–238, 1994.
[19] K.-F. Yang, H. Li, C.-Y. Li, and Y.-J. Li, "A unified framework for salient structure detection by contour-guided visual search," IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3475–3488, 2016.
[20] M. P. Eckstein, "Visual search: A retrospective," Journal of Vision, vol. 11, no. 5, pp. 1–36, 2011.
[21] M. Bar, "Visual objects in context," Nature Reviews Neuroscience, vol. 5, no. 8, pp. 617–629, 2004.
[22] A. Nuthmann and J. M. Henderson, "Object-based attentional selection in scene viewing," Journal of Vision, vol. 10, no. 8, pp. 1–19, 2010.
[23] D. Walther and C. Koch, "Modeling attention to salient proto-objects," Neural Networks, vol. 19, no. 9, pp. 1395–1407, 2006.
[24] K. Koffka, "Principles of Gestalt psychology," A Harbinger Book, vol. 20, no. 5, pp. 623–628, 1935.
[25] G. Kootstra, A. Nederveen, and B. De Boer, "Paying attention to symmetry," in Proceedings of the British Machine Vision Conference, 2008, pp. 1115–1125.
[26] G. Kootstra, N. Bergstrom, and D. Kragic, "Using symmetry to select fixation points for segmentation," in Proceedings of the International Conference on Pattern Recognition. IEEE, 2010, pp. 3894–3897.
[27] J.-G. Yu, G.-S. Xia, C. Gao, and A. Samal, "A computational model for object-based visual saliency: Spreading attention along Gestalt cues," IEEE Transactions on Multimedia, vol. 18, no. 2, pp. 273–286, 2016.
[28] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[29] M. G. Ross and A. Oliva, "Estimating perception of scene layout properties from global image features," Journal of Vision, vol. 10, no. 1, pp. 1–25, 2009.
[30] C.-C. Wu, F. A. Wick, and M. Pomplun, "Guidance of visual attention by semantic information in real-world scenes," Frontiers in Psychology, vol. 5, no. 54, pp. 1–13, 2014.
[31] S. E. Boettcher, D. Draschkow, E. Dienhart, and M. L.-H. Vo, "Anchoring visual search in scenes: Assessing the role of anchor objects on eye movements during visual search," Journal of Vision, vol. 18, no. 13, pp. 1–13, 2018.
[32] M. V. Peelen and S. Kastner, "Attention in the real world: toward understanding its neural basis," Trends in Cognitive Sciences, vol. 18, no. 5, pp. 242–250, 2014.
[33] T. Deng, K. Yang, Y. Li, and H. Yan, "Where does the driver look? Top-down-based saliency detection in a traffic driving environment," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 2051–2062, 2016.
[34] T. Deng, H. Yan, and Y.-J. Li, "Learning to boost bottom-up fixation prediction in driving environments via random forest," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 9, pp. 3059–3067, 2017.
[35] A. Borji, "Vanishing point detection with convolutional neural networks," arXiv preprint arXiv:1609.00967, 2016.
[36] J. Poort, F. Raudies, A. Wannig, V. A. Lamme, H. Neumann, and P. R. Roelfsema, "The role of attention in figure-ground segregation in areas V1 and V4 of the visual cortex," Neuron, vol. 75, no. 1, pp. 143–156, 2012.
[37] M. Chen, Y. Yan, X. Gong, C. Gilbert, and H. Liang, "Incremental integration of global contours through interplay between visual cortical areas," Neuron, vol. 82, no. 3, pp. 682–694, 2014.
[38] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2011.
[39] K.-F. Yang, X. Gao, J.-R. Zhao, and Y.-J. Li, "Segmentation-based salient object detection," in Proceedings of the Chinese Conference on Computer Vision. Springer, 2015, pp. 94–102.
[40] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
[41] P. Dollar and C. L. Zitnick, "Structured forests for fast edge detection," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1841–1848.
[42] P. Kovesi, "MATLAB and Octave functions for computer vision and image processing," 2000, available from: <http://www.peterkovesi.com/matlabfns/>.
[43] H. Permuter, J. Francos, and I. H. Jermyn, "Gaussian mixture models of texture and colour for image database retrieval," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2003, pp. 569–572.
[44] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti, "Analysis of scores, datasets, and models in visual saliency prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 921–928.
[45] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 733–740.
[46] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[47] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155–1162.
[48] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The secrets of salient object segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 280–287.
[49] N. Bruce and J. Tsotsos, "Saliency based on information maximization," in Proceedings of the Advances in Neural Information Processing Systems, 2006, pp. 155–162.
[50] X. Hou, J. Harel, and C. Koch, "Image signature: Highlighting sparse salient regions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 194–201, 2012.


[51] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proceedings of the Advances in Neural Information Processing Systems, 2006, pp. 545–552.
[52] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 2106–2113.
[53] M. Mermillod, N. Guyader, and A. Chauvin, "The coarse-to-fine hypothesis revisited: Evidence from neuro-computational modeling," Brain and Cognition, vol. 57, no. 2, pp. 151–157, 2005.
[54] M. D. Menz and R. D. Freeman, "Stereoscopic depth processing in the visual cortex: a coarse-to-fine mechanism," Nature Neuroscience, vol. 6, no. 1, pp. 59–65, 2003.
[55] J. Hegde, "Time course of visual perception: coarse-to-fine processing and beyond," Progress in Neurobiology, vol. 84, no. 4, pp. 405–439, 2008.
[56] A. Oliva and P. G. Schyns, "Coarse blobs or fine edges? Evidence that information diagnosticity changes the perception of complex visual stimuli," Cognitive Psychology, vol. 34, no. 1, pp. 72–107, 1997.
[57] C. Zeng, Y. Li, and C. Li, "Center-surround interaction with adaptive inhibition: A computational model for contour detection," NeuroImage, vol. 55, no. 1, pp. 49–66, 2011.
[58] J. Yao, M. Boben, S. Fidler, and R. Urtasun, "Real-time coarse-to-fine topologically preserving segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2947–2955.
[59] A. K. Engel, P. Fries, and W. Singer, "Dynamic predictions: oscillations and synchrony in top-down processing," Nature Reviews Neuroscience, vol. 2, no. 10, pp. 704–716, 2001.
[60] J.-M. Hopf, S. J. Luck, M. Girelli, T. Hagner, G. R. Mangun, H. Scheich, and H.-J. Heinze, "Neural sources of focused attention in visual search," Cerebral Cortex, vol. 10, no. 12, pp. 1233–1241, 2000.
[61] M. Bar, K. S. Kassam, A. S. Ghuman, J. Boshyan, A. M. Schmid, A. M. Dale, M. S. Hamalainen, K. Marinkovic, D. L. Schacter, B. R. Rosen et al., "Top-down facilitation of visual recognition," Proceedings of the National Academy of Sciences, vol. 103, no. 2, pp. 449–454, 2006.
[62] E. I. Knudsen, "Fundamental components of attention," Annual Review of Neuroscience, vol. 30, pp. 57–78, 2007.
[63] J. Wagemans, "Perceptual organization," Stevens' Handbook of Experimental Psychology and Cognitive Neuroscience, vol. 2, pp. 1–70, 2018.
[64] J. E. Dickinson, K. Haley, V. K. Bowden, and D. R. Badcock, "Visual search reveals a critical component to shape," Journal of Vision, vol. 18, no. 2, pp. 1–25, 2018.
[65] M. P. Eckstein, "Probabilistic computations for attention, eye movements, and search," Annual Review of Vision Science, vol. 3, pp. 319–342, 2017.
[66] A. Oliva and A. Torralba, "Building the gist of a scene: The role of global image features in recognition," Progress in Brain Research, vol. 155, pp. 23–36, 2006.
[67] S. Zweig, G. Zurawel, R. Shapley, and H. Slovin, "Representation of color surfaces in V1: edge enhancement and unfilled holes," Journal of Neuroscience, vol. 35, no. 35, pp. 12103–12115, 2015.
[68] L. Chen, "The topological approach to perceptual organization," Visual Cognition, vol. 12, no. 4, pp. 553–637, 2005.
[69] H. H. Schutt, L. O. Rothkegel, H. A. Trukenbrod, R. Engbert, and F. A. Wichmann, "Disentangling bottom-up versus top-down and low-level versus high-level influences on eye movements over time," Journal of Vision, vol. 19, no. 3, pp. 1–23, 2019.