
Source: holgerweb.net/PhD/Research/papers/diss_winnemoeller06.pdf

NORTHWESTERN UNIVERSITY

Perceptually-motivated Non-Photorealistic Graphics

A DISSERTATION

SUBMITTED TO THE GRADUATE SCHOOL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

for the degree

DOCTOR OF PHILOSOPHY

Field of Computer Science

By

Holger Winnemöller

EVANSTON, ILLINOIS

December 2006


© Copyright by Holger Winnemöller 2006

All Rights Reserved


ABSTRACT

Perceptually-motivated Non-Photorealistic Graphics

Holger Winnemöller

At a high level, computer graphics deals with conveying information to an observer by visual means. Generating realistic images for this task requires considerable time and computing resources. Human vision faces the opposite challenge: to distill knowledge of the world from a massive influx of visual information. It is reasonable to assume that synthetic images based on human perception and tailored for a given task can (1) decrease image synthesis costs by obviating a physically realistic lighting simulation, and (2) increase human task performance by omitting superfluous detail and enhancing visually important features.

This dissertation argues that the connection between non-realistic depiction and human perception is a valuable tool to improve the effectiveness of computer-generated images in supporting visual communication tasks and, conversely, to learn more about human perception of such images. Artists have capitalized on non-realistic imagery to great effect, and have become masters of conveying complex and even abstract messages by visual means. The relatively new field of non-photorealistic computer graphics attempts to harness artists' implicit expertise by imitating their visual styles, media, and tools, but only a few works move beyond such simulations to verify the effectiveness of generated images with perceptual studies, or to investigate which stylistic elements are effective for a given visual communication task.

This dissertation demonstrates the mutual benefit of non-realistic computer graphics and perception with two rendering frameworks and accompanying psychophysical studies:

(1) Inspired by low-level human perception, a novel image-based abstraction framework simplifies and enhances images to make them easier to understand and remember.

(2) A non-realistic rendering framework generates isolated visual shape cues to study human perception of fast-moving objects.

The first framework leverages perception to increase the effectiveness of (non-realistic) images for visually-driven tasks, while the second framework uses non-realistic images to learn about task-specific perception, thus closing the loop. As instances of the bi-directional connections between perception and non-realistic imagery, the frameworks illustrate numerous benefits, including effectiveness (e.g. better recognition of abstractions versus photographs), high performance (e.g. real-time image abstraction), and relevance (e.g. shape perception in non-impoverished conditions).


Dedication

To my parents, for their unconditional love and support.


Acknowledgements

There are many people who can be blamed, to various degrees, for helping me get away with a PhD:

My parents endowed me with a working brain, and always made sure that I used it to its full potential. My sister, Martina, kept telling me to believe in myself, and I am starting to listen to her. Angela has been my confidante and friend in good times and when things were rough, which they were a bit.

Bruce Gooch, my advisor at Northwestern University, believed in my ideas, gave me the freedom to pursue my goals, supported me generously wherever he could, and has been the mentor that I had wanted for a long time. He also managed to give the graphics group a sense of family and belonging. Without Jack Tumblin, I would never have come to Northwestern to begin with. He made the first contact, invited me to come to Evanston as a scholar, and has been supportive and interested even when I decided to work with Bruce. Talking to Jack reminds you that there is always more to learn in life. Amy Gooch was one of my NPR contacts when I was at the University of Cape Town (UCT), looking for a place to finish my PhD. She was helpful then, and has helped me with papers, gruesome corrections, and good advice ever since. Bryan Pardo graciously agreed to be on my PhD committee, and gave me much of his time and many helpful suggestions for the dissertation. James Gain kindly offered to be my co-advisor at UCT when all else failed. I am sure he would have done a fine job.


Ankit Mohan and Pin Ren have been my Evanston friends from the first day they welcomed me in their office. Since that day, we've had many interesting, silly, and funny times together. I'll always remember our 48-hour retreat around Lake Michigan. Sven Olsen has been my conspirator for the Videoabstraction project and a great companion during those long office hours when everybody else was already sleeping. David Feng joined the team only later, but quickly became an integral part of the crew. I miss having a worthy squash opponent. Marc Nienhaus joined the graphics group as a post-doc and left three months later as a cool roommate and good friend. I will also miss the rest of the graphics lab, Tom Lechner, Sangwon Lee, Yolanda Rankin, Vidya Setlur, and Conrad Albrecht-Buehler, but I am sure that our paths will cross in the times to come.

The rest of my family, especially my brother Ronald, and my friends in Germany and South Africa have been a constant source of inspiration in my life. They have achieved so much, made me so proud, and given me good reasons not to give up whenever times were tough. You know who you are.

I would like to thank the many volunteers for my experiments, who were always patient, courteous, and interested. I also owe thanks to Rosalee Wolfe and Karen Alkoby for the deaf signing video; Douglas DeCarlo and Anthony Santella for proof-reading and supplying data-driven abstractions and eye-tracking data for the Videoabstraction project; as well as Marcy Morris and James Bass for acquiring image permission from Cameron Diaz.


Table of Contents

ABSTRACT
Dedication
Acknowledgements
List of Tables
List of Figures
Chapter 1. Introduction
   1.1. Realistic versus Non-realistic Graphics
   1.2. The Art of Perception and the Perception of Art
   1.3. Contributions
Chapter 2. General Related Work
   2.1. Simple Error Metrics (Non-Perceptual)
   2.2. Saliency
   2.3. Visible Difference Predictors (VDPs)
   2.4. Applications
Chapter 3. Real-Time Video Abstraction
   3.1. Related Work
   3.2. Human Visual System
   3.3. Implementation
   3.4. Experiments
   3.5. Framework Results and Discussion
   3.6. Summary
Chapter 4. An Experiment to Study Shape-from-X of Moving Objects
   4.1. Introduction
   4.2. Human Visual System
   4.3. Related Work
   4.4. Implementation
   4.5. Procedure
   4.6. Evaluation
   4.7. Results and Discussion
   4.8. Summary
Chapter 5. General Future Work
   5.1. Vision Shortcuts
   5.2. Discussion
Chapter 6. Conclusion
   6.1. Conclusions drawn from Real-time Video Abstraction Chapter
   6.2. Conclusions drawn from Shape-from-X Chapter
   6.3. Summary
References
Appendix A. User-data for Videoabstraction Studies
Appendix B. User-data for Shape-from-X Study
Appendix C. Links for Selected Objects

List of Tables

2.1 Comparison Metrics
4.1 Within-“Aggregate Measure” Effects
4.2 Significance Analysis
A.1 Data for Videoabstraction Study 1
A.2 Data for Videoabstraction Study 2
B.1 Shading Data
B.2 Outline Data
B.3 Mixed Data
B.4 TexISO Data
B.5 TexNOI Data
B.6 Questionnaire Data
C.1 Internet references

List of Figures

1.1 Photorealistic Graphics
1.2 Realism vs. Non-Realism - Subway System
1.3 Perceptual Constancy
1.4 Realism in Art
2.1 Simple Error Metrics
3.1 Abstraction Example
3.2 Explicit Image Structure
3.3 Scale-Space Abstraction
3.4 Framework Overview
3.5 Linear vs. Non-linear Filtering
3.6 Diffusion Conduction Functions and Derivatives
3.7 Progressive Abstraction
3.8 Data-driven Abstraction
3.9 Painted Abstraction
3.10 Separable Bilateral Approximation
3.11 Center-Surround Cell Activation
3.12 DoG Edge Detection and Enhancement
3.13 DoG Parameter Variations
3.14 Edge Cleanup Passes
3.15 DoG vs. Canny Edges
3.16 IWB Effect
3.17 Computing Warp Fields
3.18 Luminance Quantization Parameters
3.19 Sample Images for Study 1
3.20 Sample Images from Study 2
3.21 Participant-data for Video Abstraction Experiments
3.22 Failure Case
3.23 Benefits for Vectorization
3.24 Automatic Indication
3.25 Motion Blur Examples
3.26 Motion Blur Result
4.1 Shape-from-X Cues
4.2 Left: Depth Ambiguity
4.3 Right: Tilt & Slant
4.4 Real-time Models
4.5 Experimental Setup
4.6 Display Modes
4.7 The First Version of the Experiment
4.8 Constructing Shapes
4.9 Shape Categories
4.10 Experiment Object Matrix
4.11 Mistaken Identity
4.12 Aggregate Comparison
4.13 Detailed Aggregate Measures
4.14 Detailed Aggregate Measures Histograms
5.1 Lifecycle of a Synthetic Image
5.2 Flicker Color Designs
5.3 Retinex Images
5.4 Originals for Retinex Images
5.5 Anomalous Motion
5.6 Deleted Contours
B.1 Questionnaire


CHAPTER 1

Introduction

This dissertation presents two rendering frameworks and accompanying validation studies to demonstrate, by example, the intimate connection between non-photorealistic (NPR) graphics and perception, and to show how the two research areas can form an effective and natural symbiosis. While the notion of such a connection is not novel in itself, it is also not commonly leveraged, particularly within the NPR community. It is my hope that future researchers will adopt the frameworks, methodologies, and experiments documented in this dissertation to the mutual benefit of both communities.

To explain the origins of this connection and its significance, it is instructive to discuss non-photorealism and then list the commonalities of non-photorealism and perception. Since non-photorealism is defined by exclusion (as being not realistic) rather than by explicit goals, it seems appropriate to look briefly at the historical contrast between realistic and non-realistic graphics.

1.1. Realistic versus Non-realistic Graphics

Traditionally, the ultimate goal of computer graphics has been photo-realism: to generate synthetic images that are indistinguishable from photographs [166, 63, 85, 28]. Today, this goal has arguably been achieved. Given enough time and resources, synthetic renderers can generate imagery that is indistinguishable from photographic images to the naked eye (Figure 1.1), and models exist that simulate optical processes down to the level of individual photons [37]. While this success does not foreclose further research to advance the number and types of optical phenomena that can be modeled, or to improve efficiency, an increasing number of researchers question realism as the only viable goal for computer graphics. The question these scientists ask is: What are the images we create used for?¹

Figure 1.1. Photorealistic Graphics. This image shows a state-of-the-art rendering of a synthetic scene (using POV-Ray 3.6). Notable realistic optical effects include: reflection, refraction, global illumination, depth-of-field, and lens distortion. (Image by Gilles Tran, Public Domain. See Table C.1 for URLs to selected images.)

¹ Pat Hanrahan, in his Eurographics 2005 keynote address, saw slideshow presentations at conferences as one of the main uses.

1.1.1. Depiction Purpose

“Because pictures always have a purpose, producing a picture is essentially an optimization process. Depiction consists in trying to make the picture that best satisfies the goals.” ([40], pg. 116, original emphasis). If this purpose is the simulation of physical interaction between light and matter (for research, realistic conceptualization, or entertainment) [162, 37, 165], then photo-realism is a logically sound choice. If, on the other hand, the purpose is more general or abstract (to convey an idea, to give directions, to explain a situation, to give an example), then photo-realism may confuse the issue at hand through unnecessary specificity, visual clutter (masking), and physical limitations. For example, the spatial layout map of a subway system does not include every bend and corner (specificity) because only the stations and their relative positions are of interest to the viewer (Figure 1.2). The map does not include all the buildings and streets where the subway runs (visual clutter) because this would make it difficult to see the subway paths. Lastly, the map could not have been captured in a single photograph (physical limitation) because most parts of the subway system are underground and mutually hidden.

(a) Photographic (b) Schematic

Figure 1.2. Realism vs. Non-Realism - Subway System. (a) An aerial photograph of London. This image is ill-suited to show the underground subway system covering the photographed area. (b) A schematic (non-realistic) map of part of the London subway system with a variety of abstractions/simplifications: all streets, buildings, and parks are omitted. Train paths are drawn color-coded and so that angles are multiples of 45°. Train stations are symbolized by circles, indicating connections through connected circles. Other symbols list additional services offered at a given station. (Image credits: (a) Andrew Bossi, GNU Free Documentation License; (b) after maps of Transport for London.)


1.1.2. Realistic Non-realism

Images generated with a specific purpose in mind could thus be called artistic, symbolic, stylistic, comprehensive, instructive, expressive, or communicative but, unfortunately, the rather unimaginative term non-photorealism has become established. Perhaps because of this lack of a purpose-statement, much of the research in non-photorealism has focused, again, on realism instead. This dissertation uses a classification similar to that of Gooch and Gooch [61], who identify three main areas of NPR research: (1) natural media simulation; (2) artistic tools; and (3) artistic style simulation.

Natural media simulation. Natural media simulation concerns itself with simulating (realistically) the substance that is applied to an image (e.g. oil, acrylic, coal), the instruments with which the substance is applied (e.g. brush, pencil, crayon), and the substrate to which the substance is applied (e.g. canvas, paper) [61]. In all cases, the simulated media is intended to produce surface marks that are indistinguishable from the real media [131, 170, 143, 32].

Artistic Tools. Simulated media by itself is of little practical use if it is not controlled by some entity. Assisting users in creating images is therefore a worthwhile endeavor. Commercial products, such as Photoshop or CorelDraw, provide a rich set of tools and functionality by repurposing standard input devices (mouse, keyboard, digital tablet). Other software and research work assist users with technically challenging, tedious, or repetitive tasks [152, 190, 120, 72, 24], but ultimately the user still has to create the image and therefore make all decisions about layout, design, placement, etc.


Artistic Styles. The last category of NPR research takes inspiration from existing artistic styles and attempts to automatically transform some data (usually geometric models or photographs) into images in a given artistic style. Examples of this work include the creation of line drawings from three-dimensional models [170, 38, 86, 174], light-models for cartoon-like shading [99, 25, 84, 173], and painterly systems from geometric models [110], videos [73, 68], or photographs [176].
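The flavor of such photograph-to-style transformations can be illustrated with a minimal sketch. The difference-of-Gaussians (DoG) edge extractor below is a common building block of cartoon-style image abstraction; it marks pixels where a narrow Gaussian blur of the image falls below a scaled wide blur. The function names, parameter values, and thresholding scheme here are illustrative assumptions, not the implementation of any system cited above.

```python
# Illustrative DoG edge extraction, a typical ingredient of cartoon-style
# NPR stylization of photographs. Pure NumPy; not any cited system's code.
import numpy as np

def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D Gaussian kernel, truncated at roughly 3 sigma and normalized."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur: filter every row, then every column."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def dog_edges(img: np.ndarray, sigma: float = 1.0,
              k: float = 1.6, tau: float = 0.98) -> np.ndarray:
    """Binary edge map: 1 where the narrow blur drops below tau * wide blur.

    sigma, k, and tau are illustrative defaults; real stylization systems
    tune them (and often soft-threshold) for the desired line look.
    """
    narrow = blur(img, sigma)
    wide = blur(img, k * sigma)
    return (narrow - tau * wide < 0.0).astype(np.float32)
```

For a synthetic luminance step (a dark half and a bright half), `dog_edges` marks a thin band along the boundary and leaves flat regions untouched, which is exactly the behavior cartoon-shading pipelines exploit to draw outlines.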

It should be noted that none of these systems presumes to create Art; they merely generate images that (as realistically as possible) resemble a particular artistic style. In short, much of NPR research is still devoted to realistic picture (as opposed to photo) creation. There exist several noteworthy exceptions to this trend, which serve as inspiration and which I discuss throughout this dissertation. Saito and Takahashi [142] increased the comprehensibility of images using G-Buffers. Gooch and Willemsen used a non-realistic virtual environment to study virtual distance perception [60]. DeCarlo and Santella created meaningful abstractions guided by eye-tracking data [36, 144]. Gooch et al. showed that illustrations of faces were more effective than photographs for facial recognition [62]. Raskar et al. visually enhanced images with multi-flash hardware [138]. DeCarlo et al. facilitated shape perception with suggestive contours [35]. It is clear from these citations that currently only a fairly small number of researchers are addressing NPR issues beyond realistic simulation of artistic media and styles.

1.1.3. Stylistic Effects and Effective Styles

So what is wrong with reproducing an artistic style? After all, artists have been very successful

at communicating abstract ideas, expressing emotions, triggering curiosity, and entertaining.

The answer is that there might be nothing wrong at all, but we cannot be sure. As Durand put

Page 20: Perceptually-motivated Non-Photorealistic Graphicsholgerweb.net › PhD › Research › papers › diss_winnemoeller06.pdf · Perceptually-motivated Non-Photorealistic Graphics

20

it, “[the] availability of this new variety of styles raises the important question of the choice

of an appropriate style, especially when clarity is paramount” ([40], pg. 120). Santella and

DeCarlo [144] argued that many NPR systems abstract imagery, but often without meaning

or purpose other than stylistic. In their experiment, they found that lack of

shading and uniform variation of detail in pure line drawings had no effect on the

number of viewers’ fixation points, whereas targeted control of detail “[. . . ] affected viewers in

a way that supports an interpretation of enhanced understanding.” ([144], pg. 77).

Many authors of NPR systems motivate their work with the expressiveness and commu-

nicative benefits of stylistic imagery, but few go on to prove that transferring visual aspects

of a given style to their synthesized imagery satisfies these higher perceptual or cognitive

goals. Admittedly, some NPR systems do not stand to gain much from perceptual validation,

particularly artistic tools and natural media simulations. Such systems are better served with

the extreme programming method and physical validation, respectively. Most other NPR sys-

tems and the NPR community at large, however, stand to benefit from perceptual evaluation and

validation. The measures currently employed to compare different NPR algorithms and imple-

mentations are mainly taken from realistic graphics performance measures, chiefly frame-rates

dependent on geometric complexity or screen resolution. While such measures are suitable to

demonstrate performance enhancements for algorithmic improvements, they are ill suited to

objectively answer questions like: “Does this system capture the essence of an artistic style?”;

“Does this system help a user in detecting certain features of an image faster?”; or “How do

we know which style to choose to support a given perceptual task?”

Several authors, including myself, believe that the answers to these questions lie in percep-

tion. Seeing as most visual art is conceived through visual experience and expressed through an


(a) Size Constancy (b) Shape Constancy

Figure 1.3. Perceptual Constancy. (a) Objects at a distance or reflections in a mirror produce greatly reduced retinal images, yet they are perceived as being of a normal size (e.g. a train at a distance does not appear to be a toy train; it appears as a normal-sized train at a distance). (b) Although the two views of the bunny generate radically different retinal images, they are nonetheless perceived as depicting the same bunny. Note that shape constancy does not require an observer to have seen any particular view previously. — Bunny model courtesy of Stanford University Computer Graphics Laboratory.

artistic process that heavily relies on feedback from the human visual system (HVS), it is likely

that (1) Perception is a large influence in the creation of Art; and (2) the analysis of artistic

principles may lead to insights on human perception. The following Section discusses these

connections between perception and art (and by extension NPR).

1.2. The Art of Perception and the Perception of Art

The neurobiologist Semir Zeki claims that “[. . . ] the overall function of art is an extension

of the function of the brain” ([189], pg. 76). More specifically, Zeki defines the function of the

(visual) brain as the “[. . . ] search for constancies with the aim of obtaining knowledge about

the world” (pg. 79). Similarly, he defines the general function of art as a “[. . . ] search for the

constant, lasting, essential, and enduring features of objects, surfaces, situations, and so on.”

(pg. 79).


1.2.1. Constancy

Why is constancy so important? In the words of Durand, “The notion of invariants and con-

stancy are crucial in studying vision and the complex dualism of pictures. Invariants are intrinsic

properties of scenes or objects, such as reflectance, as opposed to accidental extrinsic properties

such as outgoing light that vary with, e.g., the lighting condition or the viewpoint. Constancy is

the ability to discount the accidental conditions and to extract invariants.” ([40], pg. 113, origi-

nal emphasis). There are many examples of perceptual constancy (Figure 1.3): color constancy

allows us to see a green apple as green, regardless of whether we encounter it during an orange

sunset or in a fluorescently-lit room. Size constancy allows us to subjectively perceive our own

reflection in a mirror as normal-sized, although the dimensions of the reflection are objectively

halved. Shape constancy permits objects to be recognized from a variety of viewpoints, even

novel ones that have not been experienced before.

Not surprisingly, the notion of intrinsic and extrinsic properties has had a profound impact

on the evolution of art. For example, the Dutch Golden Age of the 17th century focussed on

high detail and realism, whereas many of the modern artistic styles, like cubism, pointillism,

fauvism, and expressionism focussed instead on cognitive and perceptual aspects of depiction.

The difference between the realistic and expressionistic art forms (Figure 1.4) “[. . . ] can also

be stated in terms of depicting ’what I see’ (extrinsic) as opposed to depicting ’what I know’

(intrinsic).” ([40], pg. 113).

1.2.2. Goals of Art and Vision

Focussing again on vision, Gregory believes that, “[. . . ] perception involves going beyond

the immediately given evidence of the senses: this evidence is assessed on many grounds and


(a) Photorealistic Painting (b) Expressionistic Painting

Figure 1.4. Realism in Art. Two approaches to engage a viewer. (a) This painting, Escaping Criticism (1874) by Pere Borrell del Caso, is an example of a trompe l’oeil, a work of art so realistic that it tricks the observer into believing that the depicted scene exists in reality. (b) This Portrait of Dr. Gachet (1890), painted by Vincent van Gogh shortly before his suicide, employs various stylistic elements like visible brush-strokes, contrasting colors, and symbolism (the foxglove was used for medical cures and thus attributes Gachet). — Both images in public domain.

generally we make the best bet, and see things more or less correctly. But the senses do not give

us a picture of the world directly; rather they provide evidence for the checking of hypotheses

about what lies before us. Indeed, we may say that the perception of an object is an hypothesis,

suggested and tested by sensory data” ([65], p. 13). The process of seeing is therefore not just

a passive absorption of electromagnetic radiation, but an active, highly complex, and parallel

search2 to gain knowledge from our visual surroundings. It appears, then, that many of the

goals of art and perception are similar: “[. . . ] the brain must discount much of the information

2Given the complexity of the vision process, Hoffman refers to the mechanisms which allow for our effortless visual experience as Visual Intelligence [75]. Biological evolutionists have even offered that much of the human brain’s cognitive and intellectual capabilities owe to the great computational demands of vision [117, 34].


reaching it, select only what is necessary in order to obtain knowledge about the visual world,

and compare the selected information with its stored record of all that it has seen.” ([189],

pg. 78). An “[. . . ] artist must also be selective and invest his work with attributes that are

essential, discarding much that is superfluous. It follows that one of the functions of art is an

extension of the major function of the visual brain.” ([189], pg. 79).

Given this agreement of goals, it is not far-fetched to assume that artistic images (e.g. pictures,

paintings) that are designed appropriately can greatly assist the brain in performing its difficult

task.

1.2.3. Perceptual Art(ists)

Some authors go as far as claiming that many artistic styles are based upon the collective per-

ceptual insight of generations of artists (e.g. [189, 135]). Zeki (himself a leading neurologist)

writes, “artists are neurologists, studying the brain with techniques that are unique to them and

reaching interesting but unspecified conclusions about the organization of the brain. Or, rather,

that they are exploiting the characteristics of the parallel processing-perceptual systems of the

brain to create their works, sometimes even restricting themselves largely or wholly to one sys-

tem, as in kinetic art.” ([189], pg. 80). Specifically, Zeki and Lamb found that various types

of late kinetic art are ideal stimuli for the motion-sensitive cells in area V5 of the visual cor-

tex [185]. In another experiment, Zeki and Marini [186] showed that fauvist paintings, which

often divorce shapes from their naturally assumed colors, excite quite distinct neurological path-

ways from representational art where objects appear in normal color. Gooch et al. demonstrated


that caricatured line drawings of unknown faces are learned up to two times faster than the cor-

responding photographs [62]. Ryan and Schwartz reported similar findings for drawings and

cartoons of objects [141].

Zeki refers to Art that is designed to specifically stimulate particular types of cortical cells

(intentionally or not) as art of the receptive field. “The receptive field is one of the most im-

portant concepts to emerge from sensory physiology in the past fifty years. It refers to the part

of the body (in the case of the visual system, the part of the retina or its projection into the

visual field) that, when stimulated, results in a reaction from the cell, specifically, an increase

or decrease in its resting electrical discharge rate. To be able to activate a cell in the visual

brain, one must not only stimulate in the correct place (i.e., stimulate the receptive field) but

also stimulate the receptive field with the correct visual stimulus, because cells in the visual

brain are remarkably fussy about the kind of visual stimulus to which they will respond. The art

of the receptive field may thus be defined as that art whose characteristic components resemble

the characteristics of the receptive fields of cells in the visual brain and which can therefore be

used to activate such cells.” ([189], pp. 88).

1.2.4. Benefits of combining NPR and Perception

One principled method3 of unlocking the perceptual potential of art for the purpose of creating

task-oriented computer-generated imagery, then, is to study and leverage the different visual

areas of the brain, or more precisely, the cells comprising these areas and the stimuli to which

these cells are responsive. The benefit of designing imagery based on perceptual principles

3This is not to say that artistic development is unprincipled, but rather that less quantifiable factors, like experience, intuition, and aesthetic sense, play a more marked role than is commonly regarded as scientific (of course, it is often exactly these qualities that lead to the most exciting and groundbreaking scientific discoveries).


instead of physical/optical laws is that we can focus on creating and supporting the visual stimuli

pertinent for a given perceptual task and eliminate unnecessary detail.

The reverse approach is similarly advantageous (compared to fully realistic imagery): by

generating non-realistic images that purposefully only trigger certain visual areas we can study

how the generated visual stimuli influence task-specific perception in isolation.

These are the two approaches exemplified by the NPR rendering frameworks and perceptual

studies in this dissertation.

1.3. Contributions

This dissertation presents two frameworks and accompanying studies that demonstrate the

important link between non-realistic graphics and perception research. Each framework uses

fundamental concepts of one research area to inform the other.

1.3.1. Perception informing Graphics

Chapter 3 presents a real-time NPR image processing framework to convert images or video

into abstracted representations of the input data. The framework is designed to operate on gen-

eral natural scenes and produces abstractions that can improve the communication content of

the resulting imagery. Specifically, participants in two user studies are able to recognize/identify

objects and faces more quickly than in the source photographs. The framework achieves meaningful

abstraction by implementing a simple model of low-level human vision. This model estimates

regional perceptual importance within images and removes superfluous detail (simplification)

while at the same time supporting perception of important regions by increasing local con-

trast (enhancement) and thus catering specifically to edge-sensitive cortical cells. Compared


to other automatic abstraction systems, the framework presented here offers superior temporal

coherence, does not rely on an explicit image structure representation, and can be efficiently

implemented on modern parallel graphics hardware.
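Chapter 3 details the actual model; as a crude, hypothetical 1-D illustration of the simplification-plus-enhancement idea, the sketch below flattens fine detail by iterated neighbor averaging and re-emphasizes strong transitions with a difference-of-smoothings term. All function names and parameters are invented for illustration, not taken from the framework itself.

```python
def smooth(signal, passes=4):
    """Simplification: repeated neighbor averaging removes fine detail
    (a crude stand-in for the smoothing used by abstraction frameworks)."""
    s = list(signal)
    for _ in range(passes):
        s = [(s[max(0, i - 1)] + s[i] + s[min(len(s) - 1, i + 1)]) / 3
             for i in range(len(s))]
    return s

def abstract(signal, edge_gain=2.0):
    """Enhancement: re-darken locations of strong change, found by comparing
    a narrow and a wide smoothing (a difference-of-smoothings edge term)."""
    base = smooth(signal)              # lightly smoothed
    wide = smooth(signal, passes=12)   # heavily smoothed
    return [b - edge_gain * abs(b - w) for b, w in zip(base, wide)]

# A luminance step with superimposed fine ripple: the ripple is flattened
# (simplification) while the step edge is accentuated (enhancement).
step = [50 + 4 * (i % 2) for i in range(16)] + [150 + 4 * (i % 2) for i in range(16)]
out = abstract(step)
```

The dip in `out` near index 16 marks the step edge, while the per-pixel ripple in the flat regions is almost entirely removed.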

1.3.2. Graphics informing Perception

The human visual system derives shape from a multitude of shape cues. Chapter 4 presents a

novel experiment to study shape perception of dynamically moving objects. The experimental

framework generates NPR display conditions that specifically target individual shape perception

mechanisms. By comparing user performance for a highly dynamic, interactive task under

each of the display conditions, the experiment establishes a relative effectiveness ordering for

the given shape cues. Data collected during experimentation indicates that shape perception

in a severely time-constrained condition may behave differently from static shape perception

and that a shape cue prioritization may occur in the former condition. The sensitivity of the

experimental design and its flexibility enable a large number of future investigations into the

effects of isolated shape cues and their parameterizations. Such research, in turn, should help

in the design of better graphics and visualization systems.

1.3.3. Evaluation for NPR systems

Several reasons exist why psychophysical evaluation and validation experiments are not per-

formed more commonly for NPR systems designed to increase the communication potential or

expressiveness of images. Experiments are difficult to devise, time consuming to perform, and

require careful analysis. These issues could be somewhat mitigated by establishing a corpus

of experiments for NPR validation, along with a database of test imagery. This dissertation


contributes to such a corpus by defining clear perceptual goals for the stylization frameworks

presented, along with psychophysical experiments to test the effectiveness of achieving these

goals. It is my hope that the presented frameworks will provide a foundation for future NPR

work, and similarly, that the given validatory experiments will be used to evaluate and compare

future NPR systems.


CHAPTER 2

General Related Work

Various existing works have used mathematical and perceptual models and metrics to guide

approximation algorithms, to control data compression, and to determine data similarity. Al-

though many metrics designed for photorealistic imagery do not directly apply to NPR imagery,

they are nonetheless illustrative of the different approaches to compression, comparison, and

analysis that realistic and non-realistic imagery require. I therefore discuss related photorealis-

tic works in this chapter and defer the discussion of non-photorealistic works to the individual

frameworks in Chapter 3 and Chapter 4.

Most research into perception for photorealistic graphics1 centers around perceptual models

and metrics. In the context of this dissertation a perceptual model is an algorithm that simulates

a particular aspect of human visual perception (for example saliency or contrast sensitivity),

whereas a perceptual metric may use a given model to quantify the perceived differences be-

tween two stimuli or the probability that artifacts (e.g. as a result of compression) in a stimulus

may be detected.

2.1. Simple Error Metrics (Non-Perceptual)

A number of commonly used metrics, particularly in the compression and signal processing

communities, are mathematical in nature and not derived from perceptual models. Among

1It is interesting to note that many photorealistic applications employ perceptual metrics to degrade imagery upto the point where such degradation becomes perceptible or even objectionable. Their ultimate goal thereforeshifts from physical realism to perceived realism; a goal much more in line with other perceptually-guided butintentionally non-realistic graphics.


these are the relative error (RE), the mean-squared error (MSE), and the peak signal-to-noise

ratio (PSNR). Given two grayscale images2, A and B, with J pixels in the horizontal direction

(width) and I pixels in the vertical direction (height), the measures are defined as:

RE(A, B) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j} - B_{i,j})^2}{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j})^2},   (2.1)

MSE(A, B) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j} - B_{i,j})^2}{I \cdot J},   (2.2)

PSNR(A, B, m) = 10 \cdot \log_{10} \left( \frac{m^2}{MSE(A, B)} \right).   (2.3)

While Equation 2.2 yields an absolute value depending on the range of A and B, Equa-

tion 2.1 and Equation 2.3 give a relative error value. In the case of PSNR, the result is based on

a maximum possible value, m, for each pixel3, and expressed in decibels (dB).
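These three measures are straightforward to implement. The following sketch (pure Python, with a hypothetical pair of 8-pixel "images" flattened to lists) computes RE, MSE, and PSNR exactly as defined above:

```python
import math

def re_metric(a, b):
    """Relative error (Eq. 2.1): squared differences normalized by the energy of A."""
    num = sum((x - y) ** 2 for x, y in zip(a, b))
    den = sum(x ** 2 for x in a)
    return num / den

def mse(a, b):
    """Mean-squared error (Eq. 2.2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, m=255):
    """Peak signal-to-noise ratio in dB (Eq. 2.3), with peak value m."""
    return 10 * math.log10(m ** 2 / mse(a, b))

# Two tiny grayscale 'images', flattened to pixel lists (invented data).
a = [52, 55, 61, 59, 79, 61, 76, 41]
b = [50, 60, 58, 60, 70, 65, 75, 45]

print(re_metric(a, b), mse(a, b), psnr(a, b))
```

Note that identical images give MSE = 0, so PSNR is undefined (infinite) in that case; practical implementations guard against division by zero.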

Figure 2.1 puts the use of these error metrics for quantifying image quality or image fidelity

into perspective. I generated these images by creating an abstraction (see Chapter 3) of an

original image, computing the error values between original and abstraction, and generating two

other types of common image distortion (noise and blur) with similar error values (Table 2.1).

Comparing the images in Figure 2.1 should make it clear that the perceived quality of each

image, and the perceived fidelity to the original image differs greatly between the types of

distortions, despite the fact that their RE, MSE, and PSNR scores are nearly identical.

2This discussion uses scalar-valued images for simplicity, but applies equally to color images.
3For a common grayscale image, m = 2^8 − 1 = 255.


Figure 2.1. Simple Error Metrics. An original image and three variations with the same level of errors. Noise: The original image with salt-and-pepper noise added. Blur: The original image with a Gaussian filter applied. Abstract: The original image processed with the real-time abstraction framework discussed in Chapter 3.

Metric        Noise         Blur          Abstract      Polarity
RE            0.0576        0.0570        0.0578        ↓
MSE           1.004 × 10^3  0.991 × 10^3  1.006 × 10^3  ↓
PSNR          18.106        18.169        18.113        ↑
PNG (78.4%)   94.7%         31.3%         40.5%         n.a.
PDIFF4        1070          2917          3144          ↓
HDR-VDP5      42.53%        99.71%        54.72%        ↓

Table 2.1. Comparison Metrics. This table lists numeric values for a number of error and comparison metrics applied to the images in Figure 2.1. Polarity symbolizes whether a low numeric value indicates a small error (↓) or a large error (↑).

Another method of comparing images is to look at their information content (entropy). Con-

sidering that humans have to extract information from images in order to understand them, this

seems like a sensible approach.

4Number of pixels perceived to be different from original. Settings: gamma = 2.2, luminance = 100 lux, fov = 6.
5Percentage of pixels with p > 95% chance of being perceived as different. Same settings as PDIFF.


Table 2.1, row 4, lists file-size ratios for the lossless PNG compression6 compared to an

uncompressed image. The compression ratio of the original image is given in the Metric col-

umn. When examining the other columns we can see that adding noise to the image increases

the entropy of the original image, while blurring (averaging) reduces the entropy, as expected.

The problem here is that my generic use of the word information (or entropy) does not deter-

mine how useful this information might be for visual communication purposes. The addition

of random information (noise), uncorrelated to the content of the image, does not enhance the

image. Conversely, I demonstrate in Chapter 3 that targeted removal of information (unlike the

uniform blur in Figure 2.1) can actually help perceptual tasks based on image understanding.

From Section 1.2.1, we know that much of visual perception is concerned with removing

extrinsic information while distilling intrinsic information; it is therefore not the amount of

information that matters, but its type. Simple metrics are not designed to

make such distinctions.
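The point can be illustrated with a lossless compressor used as a crude entropy probe, analogous to the PNG row of Table 2.1. This is a minimal sketch using zlib rather than PNG, on synthetic 1-D "images" invented for the example:

```python
import random
import zlib

random.seed(1)
n = 20_000

# Low-entropy "image": a smooth luminance ramp.
smooth = bytes(min(255, i // 100) for i in range(n))
# High-entropy "image": uniform random noise.
noise = bytes(random.randrange(256) for _ in range(n))

def ratio(data):
    """Compressed size over raw size: a rough proxy for information content."""
    return len(zlib.compress(data, 9)) / len(data)

print(f"smooth: {ratio(smooth):.3f}, noise: {ratio(noise):.3f}")
```

The noise stream is nearly incompressible (ratio close to 1) yet carries no useful visual information, which is exactly why raw entropy cannot stand in for communicative value.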

2.2. Saliency

Simple, mathematical metrics commonly fail for perceptual applications because, when it

comes to the human visual system, not all pixels are created equal7. The location, neighborhood,

and semantic meaning of (a group of) pixels are generally more important than their exact color.

Humans can only focus in a very narrow foveal region8, so pixels in this region have more

impact on the perceived image. Additionally, color discrimination in this region is fairly good

6ISO standard, ISO/IEC 15948:2003 (E).
7Besides the fact that humans do not operate on pixels per se, anyway.
8The fovea spans only about 1.5° of visual angle.


but motion detection is better outside the foveal region. Pixels can further be masked9 by texture

or noise [46].

As a rule, some image regions are visually more important (have a higher saliency) than

others. Given their narrow foveal extent, humans have to continually scan their visual field

with head movements and quick saccadic eye movements. For visual efficiency and to preserve

energy, these movements are mostly directed towards salient regions in the visual field. Saliency

is therefore an important tool to model and predict perceptual attention10.

Itti et al. [79, 78] computed explicit contrast measures for brightness, color opponency,

and orientation (via Gabor filters) at multiple spatial scales. They then averaged the individual

contrasts over all scales onto an arbitrary common scale. Finally, they normalized and averaged

all contrasts to obtain a combined saliency map. From this, they predicted the sequence and

durations of eye fixations using local maxima and a capacitance-based model of inhibition-of-

return11.
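A heavily simplified sketch of the brightness channel of such a saliency model follows, with box filters standing in for the Gaussian pyramid and only two scales; all names, the test image, and the radii are illustrative, not taken from Itti et al.'s implementation:

```python
def box_blur(img, r):
    """Box-filtered copy of a 2-D grayscale image (list of lists), radius r."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def saliency(img, radii=(1, 3)):
    """Center-surround contrast averaged over scales: a crude stand-in for
    the multi-scale brightness channel of a saliency map."""
    h, w = len(img), len(img[0])
    sal = [[0.0] * w for _ in range(h)]
    for r in radii:
        blurred = box_blur(img, r)
        for y in range(h):
            for x in range(w):
                sal[y][x] += abs(img[y][x] - blurred[y][x]) / len(radii)
    return sal

# A dark image with one bright spot: the spot dominates the saliency map.
img = [[10.0] * 9 for _ in range(9)]
img[4][4] = 200.0
s = saliency(img)
peak = max((v, (y, x)) for y, row in enumerate(s) for x, v in enumerate(row))
print(peak)
```

A fixation-prediction step would then repeatedly pick such local maxima, suppressing each one after selection (inhibition-of-return).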

Privitera and Stark [130] analyzed the effectiveness of 10, partly perceptually-inspired, im-

age processing operators (including Gabor, discrete wavelet transform, and Laplacian of Gauss-

ian) to predict human eye fixations. They computed the 10 operators for a test image and

clustered local maxima until they reached a predetermined number of clusters. By comparing

the remaining clusters with actual fixation locations obtained from human subjects, they deter-

mined the reliability of each operator to predict fixation points. Privitera and Stark’s approach

9For example, a green leaf on a red blanket is perceived very prominently, whereas the same leaf would probably not be noticed in a pile of other leaves. The pile of leaves thus masks the single leaf.
10Santella and DeCarlo [36, 144] exploit this fact by using eye-tracking data to guide their NPR abstraction system.
11Preventing visits to the same maxima in short succession.


was novel in that they did not assert a priori which image processing operator would model hu-

man attention accurately. Rather, they assembled a number of suitable operators and evaluated

them empirically.
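The evaluation step can be sketched as follows: the local maxima of some operator are merged down to a predetermined number of clusters, and the operator is scored by how many human fixations land near a predicted cluster. This is a toy stand-in for Privitera and Stark's procedure; the point coordinates and the tolerance radius are invented for illustration:

```python
def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def cluster(points, k):
    """Greedily merge the closest pair of points until k clusters remain."""
    pts = [tuple(p) for p in points]
    while len(pts) > k:
        i, j = min(((i, j) for i in range(len(pts)) for j in range(i + 1, len(pts))),
                   key=lambda ij: dist2(pts[ij[0]], pts[ij[1]]))
        merged = ((pts[i][0] + pts[j][0]) / 2, (pts[i][1] + pts[j][1]) / 2)
        pts = [p for n, p in enumerate(pts) if n not in (i, j)] + [merged]
    return pts

def hit_rate(clusters, fixations, radius=2.0):
    """Fraction of human fixations that fall within `radius` of a cluster."""
    return sum(any(dist2(f, c) <= radius ** 2 for c in clusters)
               for f in fixations) / len(fixations)

maxima = [(1, 1), (1.5, 1.2), (8, 8), (8.2, 7.9), (4, 6)]  # operator maxima
fixations = [(1, 1), (8, 8), (0, 9)]                       # measured fixations
clusters = cluster(maxima, 3)
print(hit_rate(clusters, fixations))
```

Running each candidate operator through the same scoring loop yields the kind of empirical reliability ranking the authors reported.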

Because image distortions in non-salient regions commonly remain unnoticed, saliency

forms a central component in many perceptual error metrics (Section 2.3) as well as optimiza-

tion and compression algorithms (Section 2.4).

2.3. Visible Difference Predictors (VDPs)

To address the shortcomings of simple error metrics, researchers have designed several

perceptually-based difference predictors that take into account a limited number of low level

human vision mechanisms, including saliency. As the name suggests, a VDP metric predicts whether

a human could tell two images apart, or how different a human would judge two images to be.

Daly’s [33] VDP modeled three aspects of human vision: non-linear brightness percep-

tion, the contrast sensitivity function (CSF), and masking [46] due to texture and other noise.

Mantiuk et al. [106] modified Daly’s VDP for use with high-dynamic range (HDR) imagery.

Yee et al. [180] defined a predictive error map, ℵ, that considered intensity, color, orien-

tation, and motion at different spatial scales to estimate visual saliency and to determine the

perceived visual differences in salient regions.
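The core difference to the pixel-wise metrics of Section 2.1 can be illustrated with a toy predictor that keeps only one perceptual ingredient, Weber's law (a detection threshold proportional to local luminance). Real VDPs such as Daly's add a full contrast sensitivity function and masking model; the function name and threshold below are invented:

```python
def visible_diff(a, b, weber=0.02):
    """Count pixels whose difference exceeds a Weber-law threshold.
    Toy model: a difference is 'visible' only if it is a sufficiently
    large *fraction* of the local luminance."""
    return sum(abs(x - y) > weber * max(x, 1) for x, y in zip(a, b))

bright = [200] * 100
dark = [20] * 100

# The same absolute distortion (+3 everywhere, hence identical MSE) ...
print(visible_diff(bright, [v + 3 for v in bright]))  # invisible on bright pixels
print(visible_diff(dark, [v + 3 for v in dark]))      # visible on dark pixels
```

Two image pairs with equal MSE thus receive very different visibility scores, which is precisely what the simple metrics of Section 2.1 cannot express.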

The PDIFF and HDR-VDP entries in Table 2.1 list the pairwise difference scores between

the original image and the distorted images in Figure 2.1 for Yee et al.’s [180] (PDIFF) and Man-

tiuk et al.’s [106] (HDR-VDP) metrics. For the comparisons, I chose environmental conditions

similar to the user-studies in Section 3.4. The PDIFF scores for Noise and Blur appropriately

indicate the aggressive distortion of the blur operation. Note though, that the abstracted image,


itself derived using a model of human perception, attains the worst score of all. Although HDR-

VDP still prefers the Noise image to the Abstract image, the metric at least performs better at

judging the excessive visual loss in the Blur image.

The problem lies not in the abstracted image and not even necessarily in the VDPs but in

my use of the VDPs. The above VDPs predict perceivable differences between images; they

do not predict the perceived likeness of images. Many forms of art are exceptionally good

likenesses of a scene, despite the fact that their visual appearance is markedly different from the

real world. For this reason, standard VDPs and other perceptual metrics devised for realistic

scenes generally fare poorly on NPR imagery.

To the best of my knowledge, no NPR image quality or fidelity metrics exist to-date and

I believe this to be an excellent opportunity for future research. As a starting point it might

be interesting to leverage the null-operator qualities12 of some NPR systems to transform both

images to be compared into the same domain and then compute a simple error score.
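This idea can be sketched concretely. Below, a repeated neighbor-averaging filter stands in for an abstraction operator (the real framework of Chapter 3 is far more involved); because re-applying the operator changes its own output very little, original and stylized image can be mapped into a common abstracted domain before a simple error score is computed. All names and the example data are hypothetical:

```python
def abstract_op(pixels, passes=8):
    """Stand-in 'abstraction' operator: repeated neighbor averaging.
    Re-applying it changes its output only slightly, approximating the
    null-operator property described in the text."""
    p = list(pixels)
    for _ in range(passes):
        p = [(p[max(0, i - 1)] + p[i] + p[min(len(p) - 1, i + 1)]) / 3
             for i in range(len(p))]
    return p

def mse(a, b):
    """Mean-squared error between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

original = [10, 12, 11, 200, 210, 205, 15, 14, 13, 12]
stylized = abstract_op(original)   # a faithful 'NPR' rendition of original
unrelated = [255] * 10             # an unrelated image

# Score both candidates against the original in the shared, abstracted domain.
d_styl = mse(abstract_op(original), abstract_op(stylized))
d_unrel = mse(abstract_op(original), abstract_op(unrelated))
print(d_styl < d_unrel)   # the faithful stylization scores far closer
```

The stylized image scores close to the original despite looking different pixel-wise, which is the likeness behavior standard VDPs fail to capture.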

2.4. Applications

While the above models and metrics can be used directly to compare images, for example

in database searches, they are more commonly integrated into applications to control image

distortion. The two application areas I focus on, lossy compression (Section 2.4.1) and adaptive

rendering (Section 2.4.2), are both very active research areas in their own right. Because this

section only addresses peripherally related work and because this work is too vast to present

comprehensively in this space, I limit my discussion to exemplary applications, instead.

Two points are worth remembering throughout the following discussion. First, all of the

listed works are perceptually motivated, yet only the smallest number of them perform user
12For example, abstracting an already abstracted image in Chapter 3 changes almost nothing.


studies for perceptual validation. Second, even the most sophisticated perceptual models are

mostly used only to hide artifacts and to degrade images without an objectionable loss in

visual quality; they are generally not designed to make an image easier or quicker to understand.

Although counter-examples exist, particularly for contrast reduction and tone-mapping work,

these shortcomings prevent us from harnessing the full potential of perceptual models for graph-

ical applications.

2.4.1. Lossy Compression

To obtain very high compression ratios, lossy compression methods sacrifice some information,

that is, the signal recovered from a compressed stream is commonly not identical to the orig-

inal signal. To ensure that this information loss remains below a perceivable threshold, or at

least does not become objectionable, perceptual models and metrics can guide the compression

process.

In the past, many researchers have developed lossy compression methods for a number of

different signal types and signal dimensions. The most common types are images, video, and

geometric meshes, while the most common dimensions are spatial (domain), temporal, and

dynamic range.

Images. Reid et al. [139] gave an overview of so-called second-generation (2G) coding

techniques, i.e. lossy image compression systems that incorporate a simple HVS model. They

concluded that most existing 2G systems outperform first-generation systems, that the 2G sys-

tems are of similar complexity, and that an objective quality comparison is impossible until a

quantitative quality metric is adopted.


In similar work, Kambhatla et al. [87] compared several image compression schemes, in-

cluding mixture of principal components (MPC), wavelets, and Karhunen Loeve transform

(KLT; also known as principal component analysis, PCA). They found that while PSNR for

wavelet transform and KLT is higher than for MPC, the MPC method produced fewer subjective

errors as judged by (a13) radiologist(s) analyzing brain magnetic resonance images (MRI).

Video. Bordes and Philippe [15] proposed perceptual enhancements to the MPEG-2

compression standard14. They developed a quality map based on a pyramid decomposition of

spatial frequencies together with a multi-resolution motion representation. This quality map

was then used in a pre-process to remove non-visible15 information to limit the amount of data

to be encoded. The second use of the quality map was to locally adapt the encoding quantization

for constant bitrate encoding.

Meshes. Williams et al. [167] developed a view-dependent mesh simplification algorithm

sensitive to an object’s silhouette, its texture, as well as the dynamic scene illumination. The

authors weighed the cost-benefit trade-off between these factors in terms of distortion effects

and rendering costs, and allocated run-time resources accordingly.

Watson et al. [163] applied two different mesh simplification schemes, VClust and QSlim,

to 36 polygonal models of animals and manmade artifacts. They then compared results of a se-

ries of user-studies including naming times, ratings, and preferences, to the results of numerous

automatic measures computed in object and image space. They found that ratings and prefer-

ences were predicted adequately with automatic measures, while naming times were not. The

13 The authors gave no details on their subjective evaluation.

14 This codec, most commonly used for high-quality DVD video-encoding, is part of the larger MPEG compression and coding family. More information is available at http://www.mpeg.org.

15 The authors did not define this term clearly. I assume they referred to quality-loss below a certain threshold.


authors also found significant effects between the two object types, indicating that mesh simpli-

fication systems may need to consider a broader range of information than mere geometry and

connectivity.

Dynamic Range. To address the problem of displaying high dynamic range images on

low dynamic range displays, Tumblin et al. [157] developed two contrast reduction methods.

The first method, practical only for synthetic images, computed separate image channels for

lighting and surface information. By compressing only the lighting channels the authors were

able to reduce the overall contrast of images while preserving much of the surface information.

The second, generally applicable method allowed users to manually specify foveal fixation

locations. The algorithm then adjusted global contrast based on foveal contrast adaptation while

attempting to preserve local contrast in the fixation regions.

Tumblin and Turk [158] took inspiration from artists’ approach to high dynamic range re-

production in developing their low curvature image simplifier (LCIS). They argued that skilled

artists preserve details by drawing scene contents in coarse-to-fine order using a hierarchy of

scene boundaries and shadings. The LCIS operator, a partial differential equation inspired by

anisotropic diffusion, was designed to dissect a scene into smooth regions bounded by sharp

gradient discontinuities. A single parameter, K, chosen for each LCIS, controlled region size

and boundary complexity. Using a hierarchy of LCISs the authors could compress the dynamic

range of large contrast features and then add detail from small features back into the final image.

In addition to its value as a tone reproduction operator, this work is relevant to my research due

to its similar approach (albeit for different reasons) to feature analysis and simplification via

anisotropic-like diffusion (Section 3.3.2).
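The anisotropic-diffusion idea referenced here can be made concrete with a small sketch. The following Python/NumPy code is a generic Perona-Malik-style diffusion step, not the LCIS operator itself (LCIS uses higher-order derivatives); the conductance function and all parameter values are illustrative assumptions:

```python
import numpy as np

def anisotropic_diffusion(img, n_iter=20, kappa=0.1, step=0.2):
    """Perona-Malik-style diffusion: smooths low-gradient regions
    while preserving sharp gradient discontinuities."""
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # Finite-difference gradients toward the four neighbors.
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # Conductance g falls off with gradient magnitude, so
        # diffusion effectively stops at strong edges.
        g = lambda d: np.exp(-(d / kappa) ** 2)
        u += step * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u
```

Because the conductance g decays with gradient magnitude, smooth regions are flattened while sharp boundaries survive, which is the behavior both LCIS and my simplification step rely on.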


Temporal Dynamic Range. Mantiuk et al. [107] extended the MPEG-4 video com-

pression standard to deal with high dynamic range video. They described a luminance quanti-

zation method optimized for contrast threshold perception of the HVS. Additionally, the pro-

posed quantization offered perceptually-optimized luminance sampling to implement global

tone mapping operators via simple and efficient look-up tables.

Pattanaik et al. [122] proposed a new operator to account for transient scene intensity adjust-

ments of the HVS in animation or interactive real-time simulations. Their operator simulated

the dramatic compression of visual responses, and the gradual recovery of normal vision, caused

by large contrast fluctuations, for example when quickly entering or leaving a dark tunnel on a

bright sunny day.

2.4.2. Adaptive Rendering

Realistic image synthesis is computationally extremely expensive due to the complexity of in-

teractions between light and matter that need to be modeled to achieve a convincing level of

optical/physical realism. This problem can be mitigated by lowering the goal from optical re-

alism to perceived realism, instead. Using perceptual models and metrics, applications can

allocate rendering resources to salient image regions, while reducing computational accuracy

and resolution in less salient regions.

Error Sources. Arvo et al. [2] defined three main causes of error in global illumina-

tion algorithms: (1) Perturbed Boundary Data - errors in the input data due to limitations of

measurement or modeling; (2) Discretization Errors - introduced when analytical functions are

replaced by finite-dimensional linear systems for actual computations; and (3) Computational


Errors - due to limited arithmetic precision. Any or all of these errors can result in the vi-

sual degradation of synthetic images and objectionable artifacts, such as faceting on tessellated

curved surfaces, banding (and even exaggerated Mach-banding effects) due to quantization,

aliasing as a result of insufficient sampling, and noise as a residual effect of stochastic models

used in random sample placement.

Static Scenes. Ferwerda et al. [46] made use of the common observation that some of

these artifacts can be masked (hidden) when they appear co-located with visual texture. The

authors developed a computational model of visual masking that predicted how the presence of

one visual pattern affected the detection of another. Using their system, the authors could select

and devise texture patterns to use in synthetic image generation that would hide artifacts due to

the above error-types.

Bolin and Meyer [13] presented a perceptually inspired approach to optimize sampling dis-

tributions for image synthesis. They computed a wavelet representation of the currently ren-

dered scene and used a custom image quality model in combination with statistical information

about the spatial frequency distribution of natural images to determine locations where addi-

tional samples needed to be taken. Their approach was able to predict masking effects and

could be used to attain equivalent visual quality from different rendering techniques by control-

ling sample placement.

In similar work, Ramasubramanian et al. [136] devised a physical error metric that ac-

counted for the HVS’s loss of sensitivity at high background illumination levels, high spatial

frequencies, and high contrast levels (visual masking). To reduce the cost of their metric for

adaptive rendering, the authors separated luminance-dependent processing from the expensive

spatially-dependent component, which could be pre-computed once.


Recently, Cater et al. [23] performed user studies to demonstrate that different visual tasks

affect viewers’ eye movements over images (effectively changing saliency within an image). They

therefore extended previous HVS-based systems by additionally considering a so-called task-

map. The task map encoded information about objects’ locations and their purpose for a given

task, and was generally specified manually. The authors modified the Radiance rendering en-

gine [162] to synthesize images optimized for a given task. Unlike Santella and DeCarlo [144]

they did not perform further user studies to prove that their optimized images retained the same

fixation locations as the unoptimized images.

Dynamic Scenes. In addition to a saliency map and spatial frequency estimation, the

perceptual model of Yee et al. [180] included an estimate of retinal velocity. Because detail

resolution in high velocity regions is limited, the authors could speed up global illumination

solutions by up to an order of magnitude.

Myszkowski [116] developed an extension to Daly’s [33] VDP, called Animation Quality

Metric (AQM) to facilitate high-quality walk-throughs of static environments and to speed up

global illumination computations of dynamic environments.


CHAPTER 3

Real-Time Video Abstraction

Figure 3.1. Abstraction Example. Abstractions like the one shown here can be more effective in visual communication tasks than photographs. Original: Snapshot of two business students on an overcast day. Abstracted: After several bilateral filtering passes and with DoG-edges overlaid. Quantized: Luminance channel soft-quantized to 8 bins. Note how folds in the clothing and shadows on the ground are emphasized.

In this chapter, I present an automatic, real-time video and image processing framework with

the goal of improving the effectiveness of imagery for visual communication tasks (Figure 3.1).

This goal is naturally broken down into two tasks: (1) Modifying imagery based on visual

perception principles (Sections 3.2-3.3); and (2) proving that such modifications can lead to


improved performance in visual communication (Section 3.4). Additionally, I show how the

various processing steps in my framework can be utilized for artistic stylization purposes.

The framework operates by modifying the contrast of perceptually important features, namely

luminance and color opponency. It reduces contrast in low-contrast regions using an approx-

imation to anisotropic diffusion, and artificially increases contrast in higher contrast regions

with difference-of-Gaussian edges. The abstraction step is extensible and allows for artistic or

data-driven control. Abstracted images can optionally be stylized using soft color quantization

to create cartoon-like effects.
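The two contrast-polarizing ingredients named above, smoothed difference-of-Gaussian edges and soft luminance quantization, can be sketched in a few lines. This is an illustrative NumPy/SciPy approximation; the tanh-smoothed function shapes follow the spirit of the framework, but the parameter names and values are assumptions rather than the exact definitions given later in this chapter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(lum, sigma=1.0, k=1.6, tau=0.98, phi=5.0):
    """Soft-thresholded difference-of-Gaussians edge map.
    tanh smooths the edge/no-edge decision, which aids temporal
    coherence (small input changes -> small output changes)."""
    d = gaussian_filter(lum, sigma) - tau * gaussian_filter(lum, k * sigma)
    return np.where(d > 0, 1.0, 1.0 + np.tanh(phi * d))

def soft_quantize(lum, bins=8, phi=3.0):
    """Soft luminance quantization: pulls values toward the nearest
    bin center via tanh instead of a hard staircase function."""
    dq = 1.0 / bins
    nearest = (np.floor(lum / dq) + 0.5) * dq
    return nearest + (dq / 2.0) * np.tanh(phi * (lum - nearest) / dq)
```

Running both on a luminance channel and multiplying the edge map into the quantized result yields a crude cartoon-like effect; the full framework additionally smooths in CIELab over several filtering passes.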

Technical Contributions. Unlike most previous video stylization systems, my framework

is purely image-based and refrains from deriving an explicit image representation1. That is, in-

stead of computing a structural description of the image content and then subsequently stylizing

or otherwise modifying this description, my framework directly manipulates perceptual features

of an image, in image space. While this may seem to limit stylization capabilities at first sight,

I devise several soft quantization functions that offer important benefits for abstraction, perfor-

mance, and stylization: (1) a significant improvement in temporal coherence without requiring

user-correction; (2) a highly parallel framework design, allowing for a GPU-based, real-time

implementation; and (3) parameters for the quantization functions which allow for a different,

but rich set of stylization options, not easily available to previous systems.

Theoretical Contributions. I demonstrate the effectiveness of the abstraction framework

with two user-studies and find that participants are faster at naming abstracted faces of known

persons compared to photographs. Traditionally, faces are considered very difficult to abstract

1While implementation details may vary, an explicit image representation generally describes an image in terms of vector or curve-based bounded areas. See Section 3.1.1, pg. 47 and Figure 3.2 for details.


and stylize. Participants are also better at remembering abstracted images of arbitrary scenes in

a memory task. The user studies employ small images to emulate portable display technology.

I believe that small imagery will play an increasingly important role in the immediate future,

with the onset in ubiquity of mobile display-enabled devices like mobile phones, digital cam-

eras, personal digital assistants, game consoles, and multimedia players. To keep these devices

portable, their display size is necessarily limited and the given screen space has to be used

effectively. A framework that offers increased recognition of image features for visual com-

munication purposes while reducing the complexity of images and thus aiding compression is

therefore a valuable asset.

My framework is one of only a few existing automatic abstraction systems built upon per-

ceptual principles, and the only one to date that achieves real-time performance.

3.1. Related Work

A number of issues are important for most stylization and abstraction systems and can be

used to differentiate my work from previous systems. These are defined in the following section

and later used to discuss previous systems.

3.1.1. Definitions

Automatic vs. User-driven. As discussed in Section 2.3 various computational models of

low-level human perception have been proposed. These automatically approximate a limited

set of visual perceptual phenomena. No computational (or even theoretical) model exists to-

date that satisfactorily predicts or synthesizes anything but the most basic visual features. Most

models break down when attempting to analyze global effects requiring semantic information


or integration over the entire visual field, and effects based on binocular vision. These limita-

tions are partly due to the fact that not much is known about how humans achieve such global

analysis [188]. Consequently, any system relying on semantic information or intended to cre-

ate art requires human interaction. Other systems, particularly those intended to aid (and not

replace) humans in a particular visual task, can well benefit from automation. Ideally, a system

should offer a best-effort automatic solution along with an overriding or extension mechanism

to improve upon the results. This is the approach I have taken in my automatic video abstraction

framework.

Real-time vs. Off-line. By definition, the amount of computation that can be performed

by a real-time system is limited by the intended frame-rate. Because my framework is designed

to support visual communication, real-time performance is paramount to support interactive

applications like video-telephony or video-conferencing. Other applications, like visual data-

base searches or summaries can be created off-line and then accessed asynchronously. My

framework design leverages parallelism of the underlying image processing operations wher-

ever possible, enabling real-time performance on modern GPU processors.

Temporal Coherence. Temporal coherence is a desirable property of any animation and

video system because unintentional incoherence draws perceptual attention and is therefore

distracting. A system exhibits temporal coherence if small input changes lead to small out-

put changes; this property does not hold for most stylization systems, which use discrete conditionals and hard

quantization functions. Additional problems arise if scene objects need to be identified and

tracked through computer vision algorithms, as those algorithms are often brittle (see Explicit


Figure 3.2. Explicit Image Structure. Two pairs of images showing explicit image structure (left image of each pair shows color-coded segments; right image of each pair shows colors derived from the original image). Coarse Segmentation: The level of detail is manually chosen to segment the image into semantically meaningful segments. Some detail, like the face, is too fine to be resolved at this level. Fine Segmentation: The level of detail is chosen so that the face is resolved, but this leads to over-segmentation in the remaining image. A common approach to this problem is to over-segment an image and then use a heuristic method to merge adjacent segments, but such heuristics are commonly non-robust and temporally incoherent, requiring user correction.

image structure, below). My framework offers temporal coherence by two different mecha-

nisms: (1) reducing noise in the input images with non-linear diffusion; and (2) soft pseudo-

quantization functions that are all continuous or semi-continuous2 (and adaptive where applica-

ble).

2 Formally, a function f, defined on some topological space X, f : X → R, is upper semi-continuous at x0 if lim sup_{x→x0} f(x) ≤ f(x0), and lower semi-continuous if lim inf_{x→x0} f(x) ≥ f(x0). For my soft quantization functions, it is also true that the ranges of the continuous intervals are much greater than the ranges of the discontinuities.


Explicit Image Structure and Stylization. An explicit image structure is the logical rep-

resentation of image elements, such as objects, and their relative positioning (Figure 3.2). Im-

age structure is commonly represented with a (possibly multi-resolution) hierarchy of contour-

bound areas, expressed as polylines or parametric curves. There exist several advantages of

such explicit representations. They can be arbitrarily scaled, they can be recombined in differ-

ent ways, and most importantly for stylization systems, their geometric descriptions can be pa-

rameterized and then simplified or stylized freely. Several disadvantages counterbalance these

benefits. Correctly identifying and extracting image structure from raw images is a difficult and

costly vision problem, often requiring user-correction and preventing real-time performance

(see Automatic vs. User-driven and Real-time vs. Off-line, above). A related problem is that

of tracking image structure between successive frames, particularly for noisy input, non-trivial

camera movements, and occlusions. My framework stays clear of these vision problems to be-

come fully automatic as well as real-time at the cost of a more limited range of stylistic options.

I offset this limitation by providing a rich set of user-parameters to the quantization functions

of the framework.

In addition to the points mentioned above, the discussion in Section 1.1.3 on the merits of

psychophysical validation applies directly to related works as well.

Having defined the most important design factors for work directly related to mine, I can

now continue to discuss previous systems in terms of these factors.

3.1.2. Previous Systems

Among the earliest work on image-based NPR was that of Saito and Takahashi [142] who

performed image processing operations on data buffers derived from geometric properties of


3-D scenes. These buffers contained highly accurate values for scene normals, curvature, depth

discontinuities and other measures that are difficult to derive from natural images without knowl-

edge of the underlying scene geometry. Unlike my own framework, their approach was mainly

limited to visualizing synthetic scenes with known geometry.

To reliably derive limited image structure from their source data, Raskar et al. [138] com-

puted ordinal depth from pictures taken with purpose-built multi-flash hardware. This allowed

them to separate texture edges from depth edges and perform effective texture removal and

other stylization effects. My own framework cannot derive ordinal depth information or deal

well with general repeated texture but also requires no specialized hardware and therefore does

not face the technical challenges of multi-flash for video.

Several video stylization systems have been proposed, mainly to help artists with labor-

intensive procedures [161, 26]. Such systems computed explicit image structure by extending

the mean-shift-based stylization approach of DeCarlo and Santella [36] to computationally ex-

pensive3 three-dimensional segmentation surfaces. Difficulties with contour tracking required

substantial user intervention to correct errors in the segmentation results, particularly in the

presence of occlusions and camera movement. My framework does not derive an explicit rep-

resentation of image structure but offers a different mechanism for stylization, which is much

faster to compute, fully automatic, and temporally coherent.

Contemporaneous work by Fischer et al. [49] explored the use of automatic stylization tech-

niques in augmented reality applications. To visually merge virtual objects with a live video

stream, they applied stylization effects to both virtual and real inputs. Although parts of their

3Wang et al.’s [161] system took over 12 hours to segment 300 frames (10 seconds of video) and users had to correct errors in approximately every third frame.


system are similar to the framework presented here, their approach is style-driven instead of per-

ceptually motivated, leading to different implementation approaches. As a result, their system

is limited in the amount of detail it can resolve, their stylization edges require a post-processing

step for thickening, and their edges tend to suffer from temporal noise4.

Recently, some authors of NPR systems have defined task-dependent objectives for their

stylized imagery and tested these with perceptual user studies. DeCarlo and Santella [36] used

eye-tracking data to guide image simplification in a multi-scale system. In follow-up work, San-

tella and DeCarlo [144] found that their eye-tracking-driven simplifications guided viewers to

regions determined to be important. They also considered the use of computational saliency as

an alternative to measured saliency. My own work does not rely on eye-tracking data, although

such data can be used. My implicit visual saliency model is less elaborate than the explicit

model of Santella and DeCarlo’s later work, but can be computed in real-time and can be ex-

tended for a more sophisticated off-line version. Their explicit image structure representation

allowed for more aggressive stylization, but included no provisions for the temporal coherence

featured in my framework.

Gooch et al. [62] automatically created monochromatic human facial illustrations from

Difference-of-Gaussian (DoG) edges and a simple model of brightness perception. Using an

extended soft-quantization version of a DoG edge detector, my framework can create similar

illustrations in a single pass and additionally address color, real-time performance and temporal

coherence. My face recognition study follows closely the protocol set forth by Stevenage [149]

and consequently used by Gooch et al. [62].

4It should be noted that while these drawbacks are generally not desirable for a video stylization system, they helped to effectively hide the boundaries between real and virtual objects in Fischer et al.’s system.


Work by Tumblin and Turk [158], traditionally associated with the tone-mapping literature,

is worth mentioning for its use of related techniques and the fact that the authors took inspira-

tion from artistic painterly techniques5. In order to map high-dynamic range (HDR) images into

a range displayable on standard display devices, Tumblin and Turk decomposed an HDR im-

age into a hierarchy of large and fine features (as defined by a conductance threshold function,

related to local contrast). Hierarchical levels with a large dynamic range were then compressed

before combination with smaller features, effectively compressing the range of the entire im-

age without sacrificing small detail. The low curvature image simplifiers (LCIS) used at each

hierarchy level are closely related to the approximate anisotropic diffusion operation I use for

simplification, but are based on higher order derivatives. Despite this similarity, Tumblin and

Turk’s goals were different in that they would not modify low dynamic range images, whereas

I am interested in simplifying and abstracting these.

3.2. Human Visual System

Visual processing of information in humans involves a large part of the brain and processing

operations too vast and complex to be currently fully understood, let alone be modeled. Given

the design considerations defined in Section 3.1.1 (automation, real-time performance, temporal

coherence) I limit the framework to modeling a small part of visual processing and base my

design on the following assumptions:

(1) The human visual system operates on various features of a scene.

(2) Changes in these features (contrasts) are of perceptual importance and therefore visu-

ally interesting (salient).

5Artists are commonly faced with the difficulty of capturing high dynamic range, real-world scenes on a canvas of limited dynamic range.


(3) Polarizing contrast (decreasing low contrast while increasing high contrast) is a basic

but useful method for automatic image abstraction.

3.2.1. Features

Although the human visual experience is generally holistic, several distinct visual features are

believed to play a vital role in low level human vision, among these are luminance, color oppo-

nency, orientation, and motion [121]. Evidence for such features derives from several sources.

Within the visual cortex, several structurally different and variedly connected sub-regions

have been identified, whose comprising cells are selectively sensitive to very distinct visual

stimuli (e.g. Area V3: Orientation. Area V4: Color. Area V5: Global Motion) [188]. In ad-

dition, cerebral lesions and other pathological conditions can lead to cases where the holistic

visual experience is selectively impaired (e.g. color blindness types: protanopia, deuteranopia,

tritanopia, monochromasy, and cerebral achromatopsia; form deficiency: visual agnosia; mo-

tion blindness: akinetopsia) [188]. Similar evidence can be gleaned from blind people who

regain sight. Their visual system is generally heavily underdeveloped and (depending on age)

may never fully recover, but they can almost immediately perceive lines, edges, brightness, and

color6 [65].

Based on this evidence, I consider luminance, color, and edges (which really are a sec-

ondary feature) in my real-time framework. The framework uses the perceptually uniform

CIELab [179] color space to encode the luminance (L) and color opponency (a and b) features

of input images and performs all abstraction operations in this feature space. The perceptual

6The problem often does not lie in perceiving the individual features of the visual world, but their meaningful integration and interpretation.


uniformity of CIELab guarantees that small distances measured in this space correspond to per-

ceptually just noticeable differences (see Contrast, below). The framework design further allows

for inclusion of additional features for off-line processing or when such features can be viably

computed in real-time on future hardware (Section 3.5.4).
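For concreteness, the sRGB-to-CIELab conversion underlying this feature space can be written out as follows. The matrix and constants are the published sRGB/CIE D65 values; the function itself is a reference sketch, not the framework’s GPU implementation:

```python
import numpy as np

# sRGB -> XYZ matrix and D65 white point (published CIE/sRGB constants).
_M = np.array([[0.4124, 0.3576, 0.1805],
               [0.2126, 0.7152, 0.0722],
               [0.0193, 0.1192, 0.9505]])
_WHITE = np.array([0.95047, 1.0, 1.08883])

def rgb_to_lab(rgb):
    """Convert an (..., 3) array of sRGB values in [0, 1] to CIELab.
    L carries luminance; a and b carry red-green and yellow-blue opponency."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Undo the sRGB gamma (linearize).
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = lin @ _M.T / _WHITE
    # CIE f(t): cube root above a small threshold, linear below it.
    eps = (6.0 / 29.0) ** 3
    f = np.where(xyz > eps, np.cbrt(xyz), xyz / (3 * (6.0 / 29.0) ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)
```

L then carries the luminance feature and (a, b) the two color-opponency features on which the abstraction operates.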

3.2.2. Contrast

Constant features are generally not a prime source of biologically vital information (e.g. a fea-

tureless blue sky; a tree with uniformly green leaves; a stationary bush). Changes in features

(feature contrasts) are often much more important (e.g. the silhouette of a hawk hovering above;

the color-contrast of a red apple on a green tree; the motion of a tiger moving in the bushes).

For this reason, humans are notoriously inept at estimating absolute stimuli and much more pro-

ficient at distinguishing even small differences between two similar stimuli [105]. People can

name and describe only a handful of colors, yet they can differentiate hundreds of thousands of

colors. People have difficulty estimating speed when moving, yet they are extremely sensitive

to acceleration and deceleration7. Only very few people can tell the frequency of a pure sinu-

soidal sound wave, yet most people can distinguish two different notes. In technical terms, the

resolution of absolute measures of features can be orders of magnitude less than the differen-

tial resolution of so-called just-noticeable differences8 (JND) [105, 45]. Because changes play

such an important role in perception, much of my framework is based on contrasts (see below).

7For example, without visual feedback, one cannot tell if an elevator is moving or stationary, only if it is starting or stopping.

8It is therefore not surprising to find differential measures becoming increasingly prominent in computer graphics research [123, 156, 59].


3.2.3. Saliency

To remove extraneous detail from imagery while emphasizing important detail requires a mea-

sure of visual importance. Itti et al. [79, 78] recognized the biological importance of high fea-

ture contrasts in their saliency model, introduced in Section 2.2. Because their explicit model

is computationally rather expensive and thus too complex for real-time applications, I employ

a simpler, implicit9 saliency model for my automatic, real-time implementation. Within my

framework, the following restrictions apply:

(1) It considers just two feature contrasts: luminance, and color opponency.

(2) It does not model effects requiring global integration.

(3) It processes images only within a small range of spatial scales (Section 3.2.5).

Since the framework (Figure 3.4) optionally allows for externally-guided abstraction via user-

maps (Equation 3.2 and Figure 3.9), a more complex saliency map, like that of Itti et al., can be

supplied at the cost of sacrificing real-time performance.
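As an illustration of what such an implicit, contrast-based importance measure could look like, the sketch below (hypothetical Python/NumPy code, not the dissertation's GPU implementation; the function names and the box-blur radius are illustrative choices) scores each pixel by how much its L, a, and b values deviate from their local mean, covering exactly the two feature contrasts of restriction (1):

```python
import numpy as np

def local_feature_contrast(channel, radius=2):
    # Contrast of one feature channel (e.g. L, a, or b in CIELab),
    # measured as absolute deviation from the local box-blurred mean.
    # Illustrative only; the implicit model folds this measure directly
    # into the diffusion filter rather than computing an explicit map.
    h, w = channel.shape
    padded = np.pad(channel, radius, mode='edge')
    mean = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            mean += padded[radius + dy:radius + dy + h,
                           radius + dx:radius + dx + w]
    mean /= (2 * radius + 1) ** 2
    return np.abs(channel - mean)

def implicit_saliency(L, a, b):
    # Restriction (1): only luminance (L) and color opponency (a, b).
    return (local_feature_contrast(L)
            + local_feature_contrast(a)
            + local_feature_contrast(b))
```

A uniform region receives zero importance, while pixels near a luminance or color-opponency discontinuity receive high importance.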

3.2.4. Contrast Polarization

Exaggerating feature contrasts can aid in visual perception. For example, super-portraits and

caricatures have been shown to help recognition of faces10 [17, 149, 62] and can be considered

a special case of the more general peak-shift effect [66].

My approach for image simplification and abstraction is therefore to simply polarize the

existing contrast characteristics of an image: to diminish feature contrast in low contrast regions,

9Here, implicit means that contrast is both the measure that defines saliency and the operand that is modified via saliency.
10Here, feature refers to facial features (like big nose, tight lips); and contrast refers to feature differentials compared to an ideal norm-face.


Figure 3.3. Scale-Space Abstraction. Left: Image of a man used as the base level in a scale-space representation. Left to right: Difference-of-Gaussians (DoG) feature edges computed at increasingly coarser scales. As the kernel size for the DoG filters increases (about an order of magnitude from left to right), the visual depiction changes from a concrete instance of a man in shirt and trousers, to a generic and abstract standing figure.

while increasing feature contrast in high contrast regions in order to yield abstractions that are

easier and faster to understand.

3.2.5. Scale-Space

Real-world entities are composed of structural elements at different scales (Figure 3.3). A forest can span dozens of kilometers, each tree can be dozens of meters high, branches are several meters long, leaves are best measured in centimeters, while the leaves' cells extend only fractions of millimeters. It makes as little sense to describe a forest in terms of millimeters as it does to

describe a leaf in terms of kilometers. The fact that scale is such an important aspect when discussing structure has led to the development of several scale-space theories [175, 93, 102]. In

terms of the human visual system, Witkin’s continuous (linear) scale-space theory is compatible

with results by De Valois and De Valois [159], showing that receptive fields (Figure 3.11) of

cortical cells include a fairly dense representation of sizes in the spatial frequency domain [121].


Figure 3.4. Framework Overview. Each step lists the function performed, along with user parameters. The right-most paired images show alternative results, depending on whether luminance quantization is enabled (right) or not (left). The top image pair shows the final output after the optional image-based warping step.— Cameron Diaz with permission of Cameron Diaz.

My framework supports structural scale with various framework parameters (σd, σe), which

can be used to extract and smooth features at a given scale (Figures 3.3 and 3.13). Particularly,

a single spatial scale can be defined for edge-detection, and the non-linear diffusion process

(Section 3.3.2) inherently operates at multiple scales due to its iterative nature11.
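The scale behavior of Figure 3.3 can be sketched with a minimal DoG filter whose kernel size grows with the scale parameter; this is illustrative NumPy code (a sketch, not the dissertation's implementation), with a separable Gaussian and the √1.6 center-to-surround ratio from Section 3.3.3 as stated assumptions:

```python
import numpy as np

def gauss_kernel(sigma):
    # 1-D Gaussian kernel truncated at about 3 standard deviations.
    r = int(3 * sigma) + 1
    x = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gauss_blur(img, sigma):
    # Separable Gaussian blur: convolve rows, then columns.
    k = gauss_kernel(sigma)
    tmp = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 0, tmp)

def dog_edges(img, sigma_e):
    # Difference-of-Gaussians at scale sigma_e; larger sigma_e
    # responds to coarser structure, as in Figure 3.3.
    return gauss_blur(img, sigma_e) - gauss_blur(img, np.sqrt(1.6) * sigma_e)
```

Running `dog_edges` with sigma_e of 1, 2, 4, ... reproduces the progression from concrete detail to coarse, abstract structure.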

3.3. Implementation

The basic workflow of my framework is shown in Figure 3.4. The framework first polarizes

the given contrast in an image using nonlinear diffusion (Section 3.3.2). It then adds highlighting edges to increase local contrast (Section 3.3.3), and it optionally stylizes (Section 3.3.5) and

sharpens (Section 3.3.4) the resulting images.

11It should be noted that only the base scale of the non-linear diffusion is well-defined. Additional scales are spatially varying due to the non-linearity. As such, the multi-scale operations are less powerful than those based on Itti et al.'s explicit representation.


3.3.1. Notation

This work combines results from such diverse disciplines as Psychology, Physics, Computer Vision, and Computer Graphics, each of which tends to have its own formalisms and notation.

In my own work, I try to use recognizable formulations of existing results and favor readability

over mathematical rigor. Specifically, I mix notation from continuous and discrete domains and

I do not discuss issues arising due to boundary conditions, as these issues are not specific to

my work. Additionally, numerical accuracy is not a deciding factor in my framework because

there exists no ground-truth to judge against and because the filters I employ are stable for the

parameter ranges given, unless explicitly stated otherwise.

3.3.2. Extended Nonlinear Diffusion

Linear vs. Non-linear diffusion. Linear filters, like the well-known Gaussian blur, are an

effective method for decreasing the contrast of an image (Figure 3.5). In the frequency domain,

the Gaussian blur acts as a low-pass filter, meaning that high-frequency components are subdued

or even eliminated. As a result, edges become softer and contrast decreases. Unfortunately, this

particularly applies to sharp edges, which contain a broad spectrum of frequency components.

As it is my goal not only to lower low contrast, but also to preserve or even increase high

contrast, edge blurring poses a problem.

To explain the relevant technical terms in their historical context, it is useful to introduce

an alternative description of the Gaussian blur. Several linear filters, the Gaussian blur among

them, can be interpreted as solutions to the heat equation [43]. To gain an intuitive understanding, one can imagine a room filled with a gas of spatially varying temperature. Because the gas

is free to move around, it will attempt to reach an equilibrium state of constant temperature


Figure 3.5. Linear vs. Non-linear Filtering. Scanlines (top row) of luminance values for the horizontal lines marked in green in the bottom row images. Significant luminance discontinuities are marked with vertical lines. Original: The original scanline contains several large and sharp discontinuities, corresponding to semantically meaningful regions in the source image that I would like to preserve (wall, guard outline right, right leg fold, guard outline left). The scanline also contains a large amount of small, high frequency components on top of the base signal. These smaller components generally constitute texture or noise, which I would like to subdue. Linear Filter: Linear filtering (here, Gaussian blur) successfully subdues high-frequency components, thus simplifying the scanline. Since a linear filter operates isotropically and homogeneously, it also suppresses the high-frequency components of the sharp discontinuities, thus smoothing these undesirably. Non-Linear Filter: The anisotropic and inhomogeneous action of the non-linear filter smooths away high frequencies in low contrast regions, while preserving most frequencies in high contrast regions. Compare the shape of all scanlines, especially at the discontinuities marked with vertical lines.

everywhere. If there exists no spatial bias in the way the gas can move (apart from boundary

conditions), the system is said to have a constant diffusion conduction function and the gas diffuses isotropically in all directions. In that case a linear diffusion function, like the Gaussian

blur, can be used to model the diffusion process12.

12To bring this example back to the image domain, imagine an arbitrary image to which one applies a very small Gaussian blur. As a result neighboring colors mix. Repeating this process ad infinitum mixes all colors into a single color that is the average of the initial colors.
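The intuition of footnote 12 can be reproduced with a toy linear diffusion. The hypothetical sketch below (illustrative Python/NumPy, with a 4-neighbor average standing in for a very small Gaussian blur and periodic boundaries standing in for "no spatial bias") drives any image toward a single constant color equal to its average:

```python
import numpy as np

def diffusion_step(img):
    # One step of linear (isotropic, homogeneous) diffusion: average
    # each cell with its four neighbors. Wrap-around boundaries avoid
    # any spatial bias; the update conserves the image mean.
    p = np.pad(img, 1, mode='wrap')
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0
```

Iterating `diffusion_step` many times leaves the mean untouched while the per-pixel variation decays toward zero, i.e. the constant-temperature equilibrium described above.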


Perona and Malik [125] defined a class of filters with spatially varying diffusion conduction functions, resulting in anisotropic diffusion. These filters have the property of blurring small discontinuities and sharpening edges13 (Figure 3.5). Using such a filter with a conduction function based on feature contrast, the contrast can be effectively amplified or subdued.

Unfortunately, Perona and Malik’s neural-net-like implementation is not efficient enough for a

real-time system on standard graphics hardware, due to its very small (see Footnote 12) spatial

support14.

Barash and Comaniciu [4] demonstrated that anisotropic diffusion solvers can be extended

to larger spatial neighborhoods, thus producing a broader class of extended nonlinear diffusion

filters. This class includes iterated bilateral filters as one special case, which I prefer due to

their larger support size and the fact that they can be approximated quickly and with few visual

artifacts using a separated kernel [126].

Extended Nonlinear Diffusion. Given an input image f(·), which maps pixel locations

into some feature space, I define the following customized bilateral filter, H(·):

(3.1)    H(x, σd, σr) = [ ∫ e^(−½ (‖x̂−x‖/σd)²) · w(x, x̂, σr) · f(x̂) dx̂ ] / [ ∫ e^(−½ (‖x̂−x‖/σd)²) · w(x, x̂, σr) dx̂ ]

13A filter that increases the steepness of edges towards a true step-like discontinuity is sometimes called shock-forming.14Their approach is still very much parallelizable and could be efficiently implemented in special hardware.


x : Pixel location                w(·) : Range weight function
x̂ : Neighboring pixel             m(·) : Linear weighting function
σd : Neighborhood size            w′(·) : Diffusion conduction function
σr : Conductance threshold        u(·) : User-defined map

In this formulation, x is a pixel location and x̂ ranges over neighboring pixels, where the neighborhood size is defined by σd (blur radius). For implementation purposes, I limit the evaluation

radius to two standard deviations, ±2σd, and normalize the convolution kernel to account for

the missing area under the curve. This rule applies similarly to all following functions involving

convolutions with exponential fall-off.

Increasing σd results in more blurring, but if σd is too large, features may blur across significant boundaries. The range weighting function, w(·), is closely related to the diffusion

conduction function (see below) and determines where in the image contrasts are smoothed or

sharpened by iterative applications of H(·).
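For the automatic case (m(·) = 0, Gaussian range term), a brute-force reading of Equation 3.1 for a single channel can be sketched as follows; this is illustrative NumPy code, not the separable GPU implementation discussed later, and the default parameter values are arbitrary:

```python
import numpy as np

def bilateral(img, sigma_d=2.0, sigma_r=10.0):
    # Brute-force single-channel bilateral filter in the spirit of
    # Equation 3.1: a spatial Gaussian times a range weight w'_E,
    # evaluated out to +/- 2 sigma_d and normalized.
    r = int(2 * sigma_d)
    h, w = img.shape
    pad = np.pad(img, r, mode='edge')
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            nb = pad[r + dy:r + dy + h, r + dx:r + dx + w]        # f(x_hat)
            g_d = np.exp(-0.5 * (dy * dy + dx * dx) / sigma_d ** 2)  # spatial term
            g_r = np.exp(-0.5 * ((nb - img) / sigma_r) ** 2)         # range term
            num += g_d * g_r * nb
            den += g_d * g_r
    return num / den
```

Low-contrast variation (small differences relative to σr) is smoothed away, while contrasts much larger than σr receive near-zero range weight and are preserved.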

(3.2)    w(x, x̂, σr) = (1 − m(x)) · w′(x, x̂, σr) + m(x) · u(x)

Range Weights and Diffusion Conduction Functions. My definition of Equation 3.2

extends the traditional bilateral filter to become more customizable for data-driven or artistic

control.

For the real-time, automatic case, I set m(·) = 0, such that w(·) = w′(·) and Equation 3.1

becomes the familiar bilateral filter [155]. Here, w′(·), is the traditional diffusion conduction


Figure 3.6. Diffusion Conduction Functions and Derivatives. Three possiblerange functions, w′, for use in Equation 3.2. All functions have a Gauss-likebell shape, but differ in their differentiability and differential function shape.Since all functions produce very similar results when applied to an image, thebest choice for a given application depends largely on the support for optimizedimplementations of each function.

function and can take on numerous forms (Figure 3.6), given that w′(x, x̂ = x, σr) = c, with c some finite constant, and lim x̂→±∞ w′(x, x̂, σr) = 0.


w′E(x, x̂, σr) = e^(−½ (∆fx/σr)²)    (3.3)

w′I(x, x̂, σr) = A / (1 + (∆fx/σr)²)    (3.4)

w′C(x, x̂, σr) = (A/2) · (1 + cos(∆fx · π / (3·σr)))   if −3σr ≤ ∆fx ≤ 3σr,
                0                                      otherwise.    (3.5)

where ∆fx = ‖f(x) − f(x̂)‖    (3.6)

Equation 3.3 is the conduction function used by Tomasi and Manduchi [155] (they use the

term range weighting function, as above) and employed for most images in this chapter. Equation 3.4 is based on one of Perona and Malik's [125] original functions and Equation 3.5 is a

function I devised for its finite spatial support (both other functions have infinite support and

need to be truncated and normalized for practical implementations). Figure 3.6 shows comparisons of these functions along with their first two derivatives. In practice, I find that all functions give comparable results and a selection is best based on implementation efficiency on a given platform and subjective quality estimates. As I am interested in manipulating contrast, all proposed conduction functions operate on local contrast15, as defined in Equation 3.6. Perona and

Malik [125] called parameter σr in Equations 3.2-3.5 the conductance threshold in reference

to its deciding role in whether contrasts are sharpened or blurred. Small values of σr preserve

almost all contrasts, and thus lead to filters with little effect on the image, whereas for large

15Other non-linear diffusion filters that operate on higher-order derivatives of the image have been proposed to achieve different goals [164, 158].


Figure 3.7. Progressive Abstraction. This figure shows a source image (unfiltered) that is progressively abstracted by successive applications of an extended nonlinear diffusion filter. Note how low contrast detail (e.g. the texture in the stone wall and the soft folds in the guard's garments) is smoothed away, while high contrast detail (facial features, belts, sharp creases in garment) is preserved and possibly enhanced.

values, lim σr→∞ w′(·) = 1, thus turning H(·) into a standard, linear Gaussian blur. For intermediate values of σr, iterative filtering of H(·) results in an extended nonlinear diffusion process,

where the degree of smoothing or sharpening is determined by local contrasts in f(·)’s feature

space. Figure 3.7 shows the progressive removal of low contrast detail due to iterative nonlinear

diffusion.
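The behavior of the conduction functions and of σr described above can be sketched directly from Equations 3.3-3.5; this is illustrative Python, with the normalization constant A set to 1 as an arbitrary choice:

```python
import numpy as np

A = 1.0  # normalization constant; set to 1 here for illustration

def w_E(delta_f, sigma_r):
    # Gaussian form (Equation 3.3), as in Tomasi and Manduchi.
    return np.exp(-0.5 * (delta_f / sigma_r) ** 2)

def w_I(delta_f, sigma_r):
    # Rational form (Equation 3.4), after Perona and Malik.
    return A / (1.0 + (delta_f / sigma_r) ** 2)

def w_C(delta_f, sigma_r):
    # Raised-cosine form (Equation 3.5) with finite support [-3sr, 3sr].
    inside = np.abs(delta_f) <= 3 * sigma_r
    return np.where(inside,
                    (A / 2) * (1 + np.cos(delta_f * np.pi / (3 * sigma_r))),
                    0.0)
```

All three peak at zero contrast; very large σr drives the weight toward 1 everywhere (the linear Gaussian-blur limit), while small σr assigns near-zero weight to any appreciable contrast, leaving the image almost unchanged.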

Automatic vs. Data-driven Abstraction. With m(·) ≠ 0, the range weighting function,

w(·), turns into a weighted sum of w′(·) and an arbitrary importance field, u(·), defined over

the image. In this case, m(·) and u(·) can be computed via a more elaborate visual saliency

model [130, 78], derived from eye-tracking data [36], or painted by an artist [71]. Figure 3.8

shows comparisons between DeCarlo and Santella’s [36] explicit stylization system and my


Figure 3.8. Data-driven Abstraction. This figure shows images abstracted automatically (left) vs. abstractions guided by eye-tracking data (right). The top row shows original images by DeCarlo and Santella [36], while the bottom row shows my results given the same data. It is noteworthy that despite some stylistic differences the two systems abstract images very similarly, although the systems themselves are radically different in design. Particularly, it is not necessary to derive a computationally expensive explicit image representation to achieve meaningful abstraction.— Top images and eye-tracking data by Doug DeCarlo and Anthony Santella, with permission.

implicit framework. I created the data-driven example by converting DeCarlo and Santella’s

eye-tracking data into an importance map, u(·), setting m(·) := u(·), and tuning the remaining

framework parameters to approximate the spatial scales and simplification levels found in DeCarlo and Santella's original image, to allow for better comparability. After setting the initial


Figure 3.9. Painted Abstraction. User-painted masks achieve an effective separation of foreground and background objects. Automatic: The automatic abstraction of the source image yields the same level of abstraction everywhere. Foreground & Background: Masks (shown as insets) selectively focus abstraction primarily on the background and foreground, respectively.— Original source image in public domain.

parameters, the framework ran automatically. Note that although the two abstraction systems are

radically different in design and implementation (e.g. DeCarlo and Santella’s image-structure-

based system versus my image-based framework), the level of abstraction achieved by both is

very similar.

Figure 3.9 demonstrates the use of a user-painted importance mask, u(·). As above, I set

m(·) := u(·). The masks, shown as insets in the figure, are kept simple for demonstrative

reasons but could be arbitrarily complex. In effect, a user can simply paint abstraction onto

an image with a brush, the level of abstraction depending on the brightness and spatial extent

of the brush. Since the framework operates in real-time, this process affords immediate visual

feedback to the user and allows even novice users to easily create abstractions with a simple

and intuitive interaction common in many image manipulation products.


Optimizations and Other Considerations. Applying a full extended non-linear diffusion solver with reasonable spatial support and sufficient iterations to achieve valuable abstraction is computationally too expensive for real-time purposes. Fischer et al. [49] addressed this

problem by applying their full filter implementation on a downsampled input image and then

interpolating the result to the original size. While this allowed them to perform at least one

iteration in real-time, the upsampling interpolation caused blurring of the resulting image, as

expected.

My solution uses a separable implementation of the non-linear diffusion kernel. A two-dimensional kernel is separable if it is equal to the convolution of two one-dimensional kernels:

∫_ℝ² k₁(x₁, x₂) dx₁ dx₂ = ∫_ℝ k₂(x₁) dx₁ ∗ ∫_ℝ k₃(x₂) dx₂

The one-dimensional kernels can be applied sequentially (the latter operating on the result of the

former), thus reducing the computational complexity from O(n²) to O(n), where n is the radius of the convolution kernel, and in turn limiting costly memory fetches. Mathematically, a non-linear filter is generally not separable, the bilateral filter included. Still, I have obtained good

results with this approach in practice. My results show empirically that a separable approximation to a bilateral filter produces minor (difficult to see with the naked eye) spatial biasing

artifacts compared to the full implementation for a small number of iterations (< 5 for most

images tested and using the default values in this chapter). Due to the shock-forming behavior

of the bilateral filter, these biases tend to harden and become more pronounced with successive

iterations (Figure 3.10).
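A minimal sketch of the separable approximation, two one-dimensional bilateral passes with the second operating on the result of the first, is given below; this is illustrative NumPy code under the same Gaussian range term as before (the dissertation's version runs on the GPU, and the default parameters are arbitrary):

```python
import numpy as np

def bilateral_1d(img, sigma_d, sigma_r, axis):
    # One-dimensional bilateral pass along `axis` (0 = vertical,
    # 1 = horizontal), truncated at +/- 2 sigma_d.
    r = int(2 * sigma_d)
    moved = np.moveaxis(img, axis, 0)
    h = moved.shape[0]
    pad = np.pad(moved, ((r, r), (0, 0)), mode='edge')
    num = np.zeros_like(moved, dtype=float)
    den = np.zeros_like(num)
    for d in range(-r, r + 1):
        shifted = pad[r + d:r + d + h]
        g = np.exp(-0.5 * (d / sigma_d) ** 2) * \
            np.exp(-0.5 * ((shifted - moved) / sigma_r) ** 2)
        num += g * shifted
        den += g
    return np.moveaxis(num / den, 0, axis)

def separable_bilateral(img, sigma_d=2.0, sigma_r=10.0):
    # The second 1-D pass operates on the result of the first,
    # giving O(n) kernel taps per pixel instead of O(n^2).
    return bilateral_1d(bilateral_1d(img, sigma_d, sigma_r, 1),
                        sigma_d, sigma_r, 0)
```

As in the full filter, strong discontinuities survive both passes, which is why the approximation works well for the first few iterations.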


Figure 3.10. Separable Bilateral Approximation. Two images of a bilateral filter diffusion process after 41 iterations. Full: Using the full two-dimensional implementation. Approximate: Using two separate one-dimensional passes. In most cases these errors are fairly small and only become prominent after a large number of iterations.— Fair use: The images shown here for educational purposes are derivations of a small portion of an original image as shown on daily television.

Pham and Vliet [126] corroborate this result in contemporaneous work. They show empirically that a single iteration of a separable bilateral filter produces few visual artifacts, even for the worst-case scenario of a 45° tilted discontinuous edge.

Figure 3.10 shows results for large number of iterations where errors tend to accumulate.

I observe two types of effects: (1) sharp diagonal edges often evolve into jagged horizontal and vertical steps (examples ①, ②, and ③); and (2) soft diagonal edges fail to evolve (examples ④ and ⑤). For most of my videos and images, including the user-study, I have found it

sufficient to apply between 2-4 iterations, so that spatial biases are rarely noticeable. The speed

improvement, on the other hand, is in excess of 30 times in the GPU implementation.


Figure 3.11. Center-Surround Cell Activation. The receptive field of a cortical cell is modeled as an antagonistic system in which the stimulation of the central cell (blue) is inhibited by the simultaneous excitation of its surrounding neighbors (green). In other words, a center-surround cell is only triggered if it itself receives a signal while its receptive field is not stimulated. This system gives rise to the Mexican hat shape in 3-D (left, checkered shape) and the corresponding curves shown in the right image. The combined response curve can be modeled by subtracting two Gaussian distribution functions whose standard deviations are proportional to the spatial extent of the central cell and its receptive field [121].

As noted previously, all abstraction operations are performed in CIELab space. Consequently, the parameter values given here and in the following sections are based on the assumption that L ∈ [0, 100] and (a, b) ∈ [−127, 127].

3.3.3. Edge detection

In general, edges are defined by high local contrast, so adding visually distinct edges to regions

of high contrast further increases the visual distinctiveness of these locations.


Figure 3.12. DoG Edge Detection and Enhancement. The center-surround mechanism described in Figure 3.11 can be used to detect edges in an image. Source: An abstracted image used to detect edges. DoG Result: The raw output of a DoG filter needs to be quantized to obtain high contrast edges. Step Quantization: Discontinuous quantization results in temporally incoherent edges near the step boundary. Smooth Quantization: Using Equation 3.7 for quantization results in edges that quickly fade at the quantization boundary, leading to improved temporal coherence. Compare the circled edges in the bottom images.

Marr and Hildreth [109] formulated an edge detection mechanism based on zero-crossings

of the second derivative of the luminance function. They postulated that retinal cells (center), which are stimulated while their surrounding cells are not stimulated, could act as neural implementations of this edge detector (Figure 3.11). A computationally efficient approximation to this edge detection mechanism is the quantized result of the difference-of-Gaussians


Figure 3.13. DoG Parameter Variations. Extending the standard DoG edge detector with soft quantization parameters allows me to create a rich set of stylistic variations. Left: A classic DoG result (no shading) with fine edges (low scale parameter σe). (σe, τ, ε) = (0.7, 0.9904, 0). Center: Same edge scale as in left image, but with additional shading information. Note that this image is not simply a combination of edges with luminance information as in Gooch et al. [62], because edges in dark regions (e.g. person's right cheek, bottom of beard) are still visible (as bright lines against dark background). In terms of style, the image has a distinct charcoal-and-pencil appearance. (σe, τ, ε) = (0.7, 0.9896, 0.00292). Right: Coarse edges using a large spatial kernel (compare detail in hair and hat with left image) and light shading around eyes, cheek and throat. (σe, τ, ε) = (1.8, 0.9650, 0.01625). Parameter ϕe = 5.0 throughout. Given the above parameters, these images are created fully automatically in a single processing step.— Original photograph used as input © Andrew Calder, with permission.

(DoG) operator (Figure 3.12). Rather than using a binary edge quantization model as in previous works [22, 62, 49], I define my edges using a slightly smoothed continuous function,

D(·), (Equation 3.7; depicted in Figure 3.4, bottom inset) to increase temporal coherence in

animations and to allow for a wider range of stylization effects (Figure 3.13) than previous

implementations.


Figure 3.14. Edge Cleanup Passes. DoG edges are extracted after ne < nb bilateral filter passes to eliminate noise that could lead to temporal incoherence in the edges. From left to right, this figure shows the original edges contained in a source image, the edges extracted after two and after four bilateral cleanup passes. Note that the differences between no cleanup and two passes are much greater than between two and four passes, indicating that a point of diminishing returns is quickly reached.— Original photograph used as input © Andrew Calder, with permission.

D(x, σe, τ, ε, ϕe) = { 1                                   if (Sσe − τ · Sσr) > ε,
                       1 + tanh(ϕe · (Sσe − τ · Sσr))      otherwise.    (3.7)

Sσe ≡ S(x, σe)    (3.8)

Sσr ≡ S(x, √1.6 · σe)    (3.9)

S(x, σe) = 1/(2πσe²) ∫ f(x̂) · e^(−½ (‖x̂−x‖/σe)²) dx̂    (3.10)

Equation 3.8 and Equation 3.9 represent Gaussian blurs (Equation 3.10) with different standard deviations and correspond to the center and negative surround responses of a cell, respectively. The factor of 1.6 in Equation 3.9 relates the size of a typical center-surround cell to


Figure 3.15. DoG vs. Canny Edges. DoG Edges: Soft DoG edges tuned to yield results comparable to the Canny edges. The thickness of the lines is proportional to the strength of edges as well as the scale at which edges are detected (Figure 3.3), giving the lines an organic feel. Canny Edges: Canny edge-lines are designed to be infinitely thin, irrespective of scale. This is advantageous for image segmentation (Figure 3.2), but often belies the true scale of edges, making it more difficult to visually interpret the resulting lines. Canny Edges Eroded: Morphological thickening of lines, as in Fischer et al. [49], can easily hide small detail (e.g. threads in hat).— Original photograph used as input © Andrew Calder, with permission.

the extent of its receptive field [109]. Together, the parameters τ and ε in Equation 3.7 control

the amount of center-surround difference required for cell activation. Parameter ε commonly

remains zero, while τ is smaller yet very close to one. Various visual effects can be achieved by

changing these default values (Figure 3.13). Parameter ϕe controls the sharpness of the activation falloff. A larger value of ϕe increases the sharpness of the fall-off function thereby creating

a highly sensitive edge detector with reduced temporal coherence, while a small value increases

temporal coherence but only detects strong edges. Typically, I set ϕe ∈ [0.75, 5.0]. Parameter

σe determines the spatial scale for edge detection (Figure 3.3). The larger the value, the coarser

the edges that are detected. For nb bilateral iterations, I extract edges after ne < nb iterations to

reduce noise (Figures 3.14 and 3.23). Typically, ne ∈ {1, 2} and nb ∈ {3, 4}.
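Given precomputed center and surround responses Sσe and Sσr (Equations 3.8-3.10), the soft quantization of Equation 3.7 reduces to a few lines. A sketch in illustrative Python, with default parameter values chosen from the ranges quoted above:

```python
import numpy as np

def soft_dog(S_e, S_r, tau=0.98, eps=0.0, phi_e=2.0):
    # Soft edge quantization of Equation 3.7. Where the
    # center-minus-surround difference exceeds eps the response is
    # fully "on" (1); elsewhere it fades via tanh instead of a hard
    # step, which is what improves temporal coherence.
    d = S_e - tau * S_r
    return np.where(d > eps, 1.0, 1.0 + np.tanh(phi_e * d))
```

A large center-surround difference in either direction saturates the response toward 1 or 0, while values near the threshold fade smoothly, so small frame-to-frame fluctuations no longer flip an edge on and off.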


Canny [22] devised a more sophisticated edge detection algorithm (sometimes called optimal), which due to its computer vision roots is commonly used to derive explicit image representations via segmentation [36], but has also been used in purely image-based systems [49].

Canny edges are well suited for image segmentation because they are infinitely thin16 and guaranteed to lie on any real edge in an image, but at the same time they can become disconnected

for large values of σe and are computationally more expensive than DoG edges. DoG edges

are cheaper to compute and not prone to disconnectedness, but may drift from real image edges

for large values of σe. I prefer DoG edges for computational efficiency, temporal coherence,

because their thickness scales naturally with σe (Figure 3.3 and Figure 3.15), and because my

soft-quantization version (Equation 3.7) allows for a number of stylistic variations. I address

edge drift with image-based warping.

3.3.4. Image-based warping (IBW)

DoG edges can become dislodged from true edges for large values of σe and may not line up

perfectly with edges in the color channels. To address such small edge drifts and to sharpen

the overall appearance of the final result (Figure 3.4, top-right), I optionally perform an image-

based warp, or warpsharp filter. IBW is a technique first proposed by Arad and Gotsman [1] for

image sharpening and edge-preserving upscaling, in which they moved pixels along a warping

field towards nearby edges (Figure 3.16).

16For their image-based system, Fischer et al. [49] artificially increase the thickness of Canny edges using morphological operations.


Figure 3.16. IBW Effect. Top Row: An image before and after warping and the color-coded differences between the two (Green = black expands; Red = black recedes). Bottom Row: Detail of the person's left eye. Note that although the effect is fairly subtle (zoom in for better comparison) it generally improves the subjective quality of images considerably, particularly for upscaled images. This figure uses an edge image as input for clarity, but in the full implementation the entire color image is warped.— Original photograph used as input © Andrew Calder, with permission.

Given an image, f(·), and a warp-field, Mw : ℝ² ↦ ℝ², which maps the image-plane onto itself17, the warped image, W(x), is constructed as:

(3.11) W(x) = f(Mw⁻¹(x))

This notation is after Arad and Gotsman [1], where Mw⁻¹ is used to indicate backward mapping, which is preferable for upscaling interpolation.

17That is, the warp-field maps pixel positions rather than pixel values.


Figure 3.17. Computing Warp Fields. An input image is blurred and convolved with horizontal and vertical Sobel kernels, resulting in spatially varying warp fields for sharpening an image. — Original photograph used as input © Andrew Calder, with permission.

In my implementation, which closely follows Loviscach's [103] simpler IBW approach, Mw is the blurred and scaled result of a Sobel filter, a simple 2-valued vector field that in the discrete domain (see Section 3.3.1) is easily invertible to obtain Mw⁻¹:

(3.12)   Mw(x, σw, ϕw) = ϕw · (1 / (2π σw²)) ∫ Ψ(x̄) · e^(−½ (‖x − x̄‖ / σw)²) dx̄

(3.13)   Ψ(x) = ( fL(x) ∗ [−1 0 +1; −2 0 +2; −1 0 +1] ,  fL(x) ∗ [+1 +2 +1; 0 0 0; −1 −2 −1] )ᵀ

(Rows of the 3 × 3 Sobel kernels in Equation 3.13 are separated by semicolons.)

Here, parameter σw in Equation 3.12 controls the area of influence that edges have on the

resulting warp. The larger the value, the more distant pixels are affected. Parameter ϕw controls

the warp-strength, that is, how much affected pixels are warped toward edges. A value of zero

has no effect, while very large values can significantly distort the image and push pixels beyond


the attracting edge.18 For most images, I use σw = 1.5 and ϕw = 2.7 with bi-linear or bi-cubic

backward mapping.

Note that, while Equation 3.11 operates on all channels of the input image, Equation 3.13 is

only based on the Luminance channel, fL, of the image. Figure 3.17 shows the horizontal and

vertical Sobel components of Ψ(·) for a given input image.
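The construction of the warp field in Equations 3.12 and 3.13 can be sketched as follows. This is a minimal CPU sketch in Python, not the GPU version; the kernel radius and clamp-to-edge border handling are illustrative choices.

```python
# Sketch of Eqs. 3.12-3.13: the warp field Mw is built by filtering the
# luminance channel fL with the two Sobel kernels (Psi), blurring each
# component with a Gaussian of width sigma_w, and scaling by phi_w.
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal kernel, Eq. 3.13
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]   # vertical kernel, Eq. 3.13

def convolve(img, kernel):
    """Direct 2-D filtering with clamp-to-edge border handling."""
    h, w, k = len(img), len(img[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    s += kernel[dy + k][dx + k] * img[yy][xx]
            out[y][x] = s
    return out

def gaussian2d(sigma, radius):
    """Normalized 2-D Gaussian kernel (outer product of a 1-D profile)."""
    row = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    n = sum(row)
    return [[a * b / (n * n) for b in row] for a in row]

def warp_field(lum, sigma_w=1.5, phi_w=2.7):
    """Return the two components of Mw: blurred, scaled Sobel responses."""
    g = gaussian2d(sigma_w, radius=2)
    mx = convolve(convolve(lum, SOBEL_X), g)
    my = convolve(convolve(lum, SOBEL_Y), g)
    return ([[phi_w * v for v in r] for r in mx],
            [[phi_w * v for v in r] for r in my])
```

On a vertical luminance step, the horizontal component responds near the edge while the vertical component vanishes; pixels are then displaced along this field toward the edge.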

3.3.5. Temporally Coherent Stylization

To further simplify an image (in terms of its color histogram) and to open the framework further

for creative use, I perform an optional color quantization step on the abstracted images, which

results in cartoon or paint-like effects (Figures 3.1 and 3.18).

(3.14)   Q(x, q, ϕq) = qnearest + (∆q / 2) · tanh(ϕq · (f(x) − qnearest))

In Equation 3.14, Q(·) is the pseudo-quantized image, ∆q is the bin width, qnearest is the bin

boundary closest to f(x), and ϕq controls the sharpness of the transition from one bin to another

(top inset, Figure 3.4). Equation 3.14 is formally a discontinuous function, but for sufficiently

large ϕq, these discontinuities are not noticeable.
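A minimal sketch of Equation 3.14 for a single luminance value, assuming q equal bins over [0, 1] with boundaries at multiples of ∆q (an illustrative bin layout, not prescribed by the text):

```python
# Sketch of Eq. 3.14: pseudo-quantization of one luminance value.
import math

def soft_quantize(lum, q=8, phi_q=3.0):
    dq = 1.0 / q                              # bin width (Delta q)
    nearest = round(lum / dq) * dq            # q_nearest: closest bin boundary
    return nearest + (dq / 2.0) * math.tanh(phi_q * (lum - nearest))
```

For large ϕq the tanh saturates and values snap to bin centers (hard, toon-like quantization); for small ϕq the transition is spread out, producing the soft, paint-like look.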

For a fixed ϕq, the transition sharpness is independent of the underlying image, possibly

creating many noticeable transitions in large smooth-shaded regions. To minimize jarring tran-

sitions, I define the sharpness parameter, ϕq, to be a function of the luminance gradient in the

abstracted image. I allow hard bin boundaries only where the luminance gradient is high. In

low gradient regions, bin boundaries are spread out over a larger area. I thus offer the user

a trade-off between reduced color variation and increased quantization artifacts by defining a

18 For completeness: negative values of ϕw push pixels away from edges, which looks interesting, but is generally not useful for meaningful image abstraction.


Figure 3.18. Luminance Quantization Parameters. An original image along with parameters resulting in sharp and soft quantizations. Compare details in marked regions. Sharp: A very large ϕq creates hard, toon-shading-like boundaries. (q, Λϕ, Ωϕ, ϕq) = (8, 2.0, 32.0, 500.0). Soft: A larger number of quantization bins and a low value of ϕq creates soft, paint-like whisks at the quantization boundaries. (q, Λϕ, Ωϕ, ϕq) = (14, 3.4, 10.6, 9.7). Edge scale σe = 2.0 for both abstractions.

target sharpness range [Λϕ, Ωϕ] and a gradient range [Λδ, Ωδ]. I clamp the calculated gradients

to [Λδ, Ωδ] and then generate a ϕq value by mapping them linearly to [Λϕ, Ωϕ]. The effect for typical parameter values is hard, cartoon-like boundaries in high gradient regions and soft, painterly transitions in low gradient regions (Figure 3.18). Typical values for these parameters are q ∈ [8, 10] equal-sized bins and a gradient range of [Λδ = 0, Ωδ = 2], mapped to sharpness values in [Λϕ = 3, Ωϕ = 14].
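The gradient-to-sharpness mapping described above can be sketched directly; parameter names and defaults follow the typical values quoted in the text:

```python
# Sketch of the adaptive sharpness control: the luminance gradient is
# clamped to [Ld, Od] and mapped linearly onto the sharpness range [Lp, Op]
# (Lambda/Omega in the text). Defaults are the quoted typical values.

def sharpness(grad, Ld=0.0, Od=2.0, Lp=3.0, Op=14.0):
    g = min(max(grad, Ld), Od)        # clamp gradient to [Ld, Od]
    t = (g - Ld) / (Od - Ld)          # normalize to [0, 1]
    return Lp + t * (Op - Lp)         # linear map to [Lp, Op]
```

Low-gradient regions thus receive a small ϕq (soft, spread-out bin transitions), while high-gradient regions receive a large ϕq (hard, cartoon-like boundaries).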

Although soft quantization is not a novel idea, it has hardly been used for abstraction sys-

tems, particularly in a locally adaptive form. My pseudo quantization approach, apart from be-

ing effective and efficient to implement, offers significant temporal coherence advantages over

previous systems using discontinuous quantization or automatic image-structure-based systems.


In standard quantization, an arbitrarily small luminance change can push a value to a different

bin, thus causing a large output change for a small input change, which is particularly trou-

blesome for noisy input. With soft quantization, such a change is spread over a larger area,

making it less noticeable. Using a gradient-based sharpness control, sudden changes are further

subdued in low-contrast regions, where they would be most objectionable. Finally, an adaptive

controlling mechanism offers the benefits of both effective quantization and temporal coherence

with easily adjustable trade-off parameters set by the user.

3.3.6. Optimizations

In designing my framework, I capitalize on two types of optimizations: parallelism and separa-

bility.

Parallelism. Modern graphics processor units (GPUs) are highly efficient parallel com-

putation machines and are particularly well suited for many image processing operations. To

take advantage of this parallel computing power, every element in my processing framework is

highly parallelizable, that is, it does not rely on global operations (like min(·), max(·), Σ(·), etc.) and all operations rely only on previous processing steps (i.e., no forward dependencies).

In addition, the non-linear diffusion (Section 3.3.2) and edge-detection (Section 3.3.3) opera-

tions after the initial noise-removal iterations (n > ne) can be performed in parallel, as can the

center and surround kernel convolutions of the edge-detection itself. I use Olsen’s [119] GPU

image processing system to automatically compute and schedule processes and resolve memory

dependencies.


Separability. As discussed in Section 3.3.2, the separable implementation of a two-

dimensional filter kernel yields a significant performance gain. Since the Gaussian(-like) con-

volution features so heavily in the abstraction framework (see Section 6.1.3 for a discussion of

this observation), I take advantage of this optimization in almost every processing step (non-

linear diffusion, edge-detection, and image-based-warping).
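The separability optimization can be sketched as a 2-D Gaussian blur implemented as a horizontal then a vertical 1-D pass, reducing the per-pixel cost from O(n²) to O(2n) kernel taps. The kernel radius and clamp-to-edge borders are illustrative choices, not the dissertation's exact implementation.

```python
# Separable Gaussian blur: two 1-D passes instead of one n x n convolution.
import math

def gauss1d(sigma, radius):
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]          # normalized 1-D kernel

def blur_separable(img, sigma=1.0, radius=2):
    k = gauss1d(sigma, radius)
    h, w = len(img), len(img[0])
    def pass1d(src, horizontal):
        out = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for i, kv in enumerate(k):
                    d = i - radius
                    if horizontal:     # clamp indices at the borders
                        acc += kv * src[y][min(max(x + d, 0), w - 1)]
                    else:
                        acc += kv * src[min(max(y + d, 0), h - 1)][x]
                out[y][x] = acc
        return out
    return pass1d(pass1d(img, True), False)
```

Because the Gaussian is the outer product of two 1-D profiles, the two passes are mathematically equivalent to the full 2-D convolution while touching far fewer pixels per output sample.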

3.4. Experiments

Section 3.2 explains the perceptual considerations that have gone into the framework design

and Section 3.3 details the various image processing operations that implement the correspond-

ing image simplification and abstraction steps, but this still does not guarantee that the ab-

stracted images are effective for visual communication. To verify that my abstractions preserve

or even distill perceptually important information, I performed two task-based studies that test

recognition speed and short-term memory retention. The studies use small images because (1) I expect portable visual communication and low-bandwidth applications to benefit most from my framework in practice, and (2) small images may be a more telling test of the framework

as each pixel represents a larger percentage of the image.

Participants. In each study, 10 (5 male, 5 female) undergraduates, graduate students or

research staff acted as volunteers.

Materials. Images in Study 1 are scaled to 176 × 220 pixels, while those in Study 2 are scaled to 152 × 170 pixels. These resolutions approximate those of many portable devices. Images are shown centered on an Apple Cinema Display at a distance of 24 inches to subtend visual angles of 6.5° and 6.0°, respectively. The unused portion of the monitor framing the images is set to white.


Figure 3.19. Sample Images for Study 1. The top row shows the original images (non-professional photographs) and the bottom row shows the abstracted versions. Note how many wrinkles and individual strands of hair are smoothed away, reducing the complexity of the images while actually improving recognition in the experiment. All images use the same σe for edges and the same number of simplification steps, nb. — Pierce Brosnan and Ornella Muti by Rita Molnar, Creative Commons License. Paris Hilton by Peter Schafermeier, Creative Commons License. George Clooney, public domain.

In Study 1, 50 images depicting the faces of 25 famous movie stars are used as visual stimuli.

Each face is depicted as a color photograph and as a color abstracted image created with my

framework (Figure 3.19). In Study 2, 32 images depicting arbitrary scenes are used as visual

stimuli. Humans are a component in 16 of these images (Figure 3.20).

Analysis. For both studies, p-values are computed using two-way analysis of variance

(ANOVA), with α = 0.05.


Figure 3.20. Sample Images from Study 2. The top row shows the original snapshot-style photographs and the bottom row shows the abstracted versions. Note how much of the texture in the original photographs (like water waves, sand, and grass) is abstracted away to simplify the images. All images use the same σe for edges and the same number of simplification steps, nb.

3.4.1. Study 1: Recognition Speed

Hypothesis. Study 1 tests the hypothesis (H1) that abstracted images of familiar faces are

recognized more quickly than normal photographs. Faces are a very important component of

daily human visual communication and I want the framework to help in the efficient represen-

tation of faces.

Procedure. To ensure that participants in the study are likely to know the persons depicted

in the test images, I use photographs of celebrities as source images and controls. The study

uses a protocol [149] demonstrated to be useful in the evaluation of recognition times for facial


images [62] and consists of two phases: (1) reading the list of 25 movie star names out loud;

and (2) a reaction time task in which participants are presented with sequences of the 25 facial

images. All faces take up approximately the same space in the images and are three-quarter

views. By pronouncing the names of the people that are rated, participants tend to reduce

the tip-of-the-tongue effect where a face is recognized without being able to quickly recall the

associated name [149]. For the same reason, participants are told that first, last, or both names

can be given, whichever is easiest. Each participant is asked to say the name of the pictured

person as soon as that person’s face is recognized. A study coordinator records reaction times,

as well as accuracy of the answers. Images are shown for 5 seconds at 5-second intervals, and reaction times are recorded, using the Superlab software product. The order of image presentation

is randomized for each participant.

Data Conditioning. Two additional volunteers were eliminated from the study after

failing familiarity requirements. One volunteer was not familiar with at least 25 celebrities.

Another volunteer claimed familiarity with at least 25 celebrities, but his or her accuracy for

both photographs and abstractions was more than three standard deviations from the remainder

of the group, indicating that the volunteer was not reliably able to associate faces with names.

By the same reasoning, two images were deleted from the experimental evaluation because

their accuracy (in both conditions) was more than three standard deviations from the mean. This

could indicate that those images simply were not good likenesses of the depicted celebrities or

that familiarity with the celebrities’ names was higher than with their faces.

Results and Discussion. Data for this study (Figure 3.21, Top Graph; Table A.1) shows a

correlation trend between timings for abstractions and photographs. Three data pairs (2, 4 & 5)


Figure 3.21. Participant-data for Video Abstraction Experiments. Top Graph: Data for Study 1 showing per-participant averages for all faces. Middle & Bottom Graphs: Data for Study 2 showing timings and number of clicks for participants to complete two memory games, one with photographs and one with abstractions. Data pairs for both experiments are not intended to refer to the same participant and are sorted in ascending order of abstraction time.


show only a very small difference between recognition times in both presentation conditions,

but for all data pairs, the abstraction condition requires less time than the photographs.

Averaging over all participants shows that participants are faster at naming abstract images

(M = 1.32s) compared to photographs (M = 1.51s), thus rejecting the null hypothesis in favor

of H1 (p < 0.018). In other words, the likelihood of obtaining the results of the study by pure

chance is less than 1.8% and it is therefore more reasonable to assume that the results were

caused by a significant increase in recognizability of the abstracted images. The accuracies for recognizing abstract images and photographs are 97% and 99%, respectively, and there is no significant speed-for-accuracy trade-off. I can thus conclude that substituting abstract images

for fully detailed photographs reduces recognition latency by 12.6%.
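As a sanity check (not part of the original analysis), the quoted 12.6% follows directly from the two reported means:

```python
# Worked check of the latency reduction from Study 1's reported means:
# abstractions 1.32 s vs. photographs 1.51 s.

photo_mean, abstract_mean = 1.51, 1.32                    # seconds
reduction = (photo_mean - abstract_mean) / photo_mean * 100.0
# reduction is approximately 12.58, i.e. 12.6% when rounded
```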

Interestingly, this significant improvement was neither reported by Stevenage [149] nor by

Gooch et al. [62]. Since both of these authors only used black-and-white stimuli, I suspect

that the simplified color information in my abstraction framework contributes to the measured

improvement in recognition speed. This promises to be a worthwhile avenue for future research.

It is worth pointing out that the performance improvement measured in this study might

seem small in terms of percentage, but it represents an improvement in a task that humans are

already extremely proficient at. In fact, there exist brain structures dedicated to the recognition

of faces [188] and many people can recognize familiar faces from the image of a single eye or

the mouth alone. A similar remark can be made about the results of the next study, which are

even more marked.


3.4.2. Study 2: Memory Game

Hypothesis. Study 2 tests the hypothesis (H2) that abstracted images are easier to memorize

(in a memory game) compared with photographs. By removing extraneous detail from source

images and highlighting perceptually important features, my framework emphasizes the essen-

tial information in these images. If done successfully, less information needs to be remembered

and prominent details are remembered more easily.

Procedure. Study 2 assesses short term memory retention for abstract images versus

photographs with a memory game, consisting of a grid of 24 cards (12 pairs) that are randomly

distributed and placed face-down. The goal is to create a match by turning over two identical

cards. If a match is made, the matched cards are removed. Otherwise, both cards are returned

to their face-down position and another set of cards is turned over. The game ends when all

pairs are matched. The study uses a Java program of the card game in which a user turns over a

virtual card with a mouse click. The 12 images used in any given memory game are randomly

chosen from the pool of 32 images without replacement, and randomly arranged. The program

records the time it takes to complete a game and the number of cards turned over (clicks) before

all pairs are matched.

Study 2 consists of three phases: (1) a practice memory game with alphabet cards (no

images); (2) a memory game of photographs; and (3) a memory game of abstract images. All

participants first play a practice game with alphabet cards to learn the user-interface and to

develop a game strategy without being biased with any of the real experimental stimuli. No

data is recorded for the practice phase. For the remaining two phases, half the participants are


presented with photographs followed by abstracted images; and the other half is presented with

abstracted images followed by photographs.

Results and Discussion. In the study, participants were significantly faster in com-

pleting a memory game using abstract images (Mtime = 60.0s) compared to photographs

(Mtime = 76.1s), thus rejecting the null hypothesis in favor of H2 (ptime < 0.003). The fact that

the probability of obtaining the measured timings by pure chance is less than 0.3% indicates a

statistically highly significant result. Participants further needed to turn over far fewer cards in the game with abstract images (Mclicks = 49.2) compared to photographs (Mclicks = 62.4), with a type-I error likelihood of pclicks < 0.004, again highly significant. Presentation order (abstractions first or photographs first) did not have a significant effect. Despite the fact

that the measured reduction in time (21.3%) and the reduction in the number of cards turned

over (21.2%) were almost identical, the per-participant data in Figure 3.21 (Middle and Bottom

graphs) and Table A.1 does not indicate a strong correlation between timing results and clicks.

As in study 1, the results (both timing and clicks) for all participants were lower for abstractions

than for photographs (only minimally so for the timing data pairs 2 & 10). Since the number

of clicks corresponds to the number of matching errors made before completing the game,19 the

lower number of clicks for the abstracted images indicates significantly fewer matching errors

compared to photographs and I conclude that my framework can simplify images in a way that

makes them easier to remember.

19 The minimum number of clicks is 24, one per card. This is unrealistic, however, as the probability for randomly picking a matching pair by turning two cards out of 24 is 1 : 23. By removing this pair, no additional knowledge of the game is discovered, so that even with perfect memory the probability for the next pair is 1 : 21, and so on.


Figure 3.22. Failure Case. A case where the contrast-based importance assumption fails. Left: The subject of this photograph has very low contrast compared with its background. Right: The cat's low contrast fur is abstracted away, while the detail in the structured carpet is further emphasized. Despite this rare reversal of contrast assignment, the cat is still well represented.

3.5. Framework Results and Discussion

3.5.1. Performance

The framework was implemented and tested in both a GPU-based real-time version, using

OpenGL and fragment shader programs, and a CPU-based version. Both versions were tested

on an Athlon 64 3200+ with Windows XP and a GeForce GT 6800. Performance values depend

on graphics drivers, image size, and framework parameters. Typical values for a 640 × 480

video stream and the default parameters given in this text are 9–15 frames per second (FPS) for the GPU version and 0.3–0.5 FPS for the CPU version.

3.5.2. Limitations

Contrast. The framework depends on local contrast to estimate visual saliency. Images with


very low contrast do not carry much visual information to abstract (e.g. the fur in Figure 3.22).

Simply increasing contrast of the original image may reduce this problem, but also increases

noise. Figure 3.22 demonstrates a rare inversion of this general assumption, where the main

subject exhibits low contrast and is deemphasized, while the background exhibits high contrast

and is emphasized. Extracting semantic meaning about foreground versus background from

images automatically and reliably is a hard problem, which is why I use the contrast heuristic,

instead. Note that despite the contrast reversal the cat in the abstracted image in Figure 3.22 is

still clearly separated from the similarly colored background due to overall contrast polarization.

In practice, I have obtained good results for many indoor and outdoor scenes.

Scale-Space. Human vision operates at a large range of spatial scales simultaneously. By

applying multiple iterations of a non-linear diffusion filter, the framework covers a small range

of spatial scales, but the range is not explicitly parameterized and not as extensive as that of real

human vision.

Global Integration. Several features that may be emphasized by my framework are actu-

ally deemphasized in human vision, among these are specular highlights and repeated texture

(like the high-contrast carpet in Figure 3.22). Repeated texture can be considered a higher-

order contrast problem: while the weaving of the carpet exhibits high-contrast locally, at a

global level the high-contrast texture itself is very regular and therefore exhibits low contrast in

terms of texture-variability. Dealing with these phenomena using existing techniques requires

global image processing, which is impractical in real-time on today’s GPUs, due to their limited

gather-operation capabilities.20

20 The framework deals partially with some types of repeated texture. See Section 3.5.7 (Indication) for details.


3.5.3. Compression

A thorough discussion of theoretical data compression and codecs exceeds the scope of this

dissertation because traditional compression schemes and error metrics are optimized for natural

images, not abstractions (Sections 2.3 and 2.4.1). To recall, many existing error metrics, even

perceptual ones, yield a high error value for the image pairs in Figures 3.19 and 3.20, although

I have shown in Section 3.4 that my abstractions are often better at representing image content

for visual communication purposes than photographs.

An interesting point of discussion in this respect is the error source. Several popular block-

based encoding schemes (e.g. JPEG, MPEG-1, MPEG-2) exhibit blockiness artifacts at low bit-

rates while many frequency-based compression schemes produce ringing around sharp edges.

All of these artifacts are perceptually very noticeable. Artifacts in abstraction systems, like the one presented here, are of a stylistic nature, and people tend to be much more accepting of these [145]

because they do not expect a realistic result. Non-realistic image compression promises to be

an exciting new research direction.

In terms of the constituent filters in the framework, Pham and Vliet [126] have shown that

video compresses better using traditional coding methods when bilaterally filtered beforehand,

judged by RMS error and MPEG quality score. Collomosse et al. [26] list theoretical compres-

sion results for vectorized cartoon images. Possibly most applicable to my abstractions is work

by Elder [42], who describes a method for storing the color information of an image only in

high-contrast regions, achieving impressive compression results.

Without going into technical detail, it can be shown that the individual filter steps in the

framework simplify an image in the Shannon [146] sense and a suitable component compression

scheme should be able to capitalize on that. For example, the emphasis edges in Section 3.3.3


pose a problem for most popular compression schemes due to their large spectral range, yet the

edges before quantization are derived from a severely band-limited DoG filter of an image. In

general, an effective compression scheme would not attempt to compress the final images, but

rather the individual filter outputs before quantization. The final composition of the channels

would then be left to the decompressor. Another advantage of this approach, which promises

novel applications for streaming video, is that only selected channels may be distributed for

extreme low-bandwidth transmission (e.g. only the highlight edges) and that the stylistic options

represented by the quantization parameters can be chosen by a decompression client (viewer)

instead of hard-coded into the image-stream.

3.5.4. Feature Extension

I do not include an orientation dependent feature in the contrast feature space because of its

relatively high computational cost and because orientation is generally only necessary for high-

level vision processes, like object recognition, whereas my work focuses on using low level

human vision processes to improve visual communication. Should such a feature be required,

the combined response for Gabor filters at different angular orientations can be included in the

input feature space conversion step in Figure 3.4. This response would need to be scaled to a

comparable range as the other feature channels to retain perceptual uniformity. For implemen-

tation details of a separable, recursive Gabor filter, compatible with the framework, see Young

et al. [181].


Figure 3.23. Benefits for Vectorization. Vectorizing abstracted images carries several advantages. Edges: Extracting edges after ne smoothing passes removes many noisy artifacts that would otherwise have to be vectorized. Here, the difference between two consecutive passes is shown. Color: Quantization results after 1 and 5 non-linear diffusion passes, respectively. The simplification achieved by the abstraction is evident in the simplified quantization contours, which require fewer control-points for vector-encoding. Zoom in for full detail.

3.5.5. Video Segmentation

My stylization step in Section 3.3.5 is a relatively simple modification to an abstracted image.

Despite this, I have found that it yields surprisingly good results in terms of color flattening and

is much faster than the mean-shift procedures used in off-line cartoon stylization for video [26,

161]. Interestingly, several authors [4, 14] have shown that anisotropic diffusion filters are

closely related to the mean-shift algorithm [27]. It is thus conceivable that various graphics

applications that today rely on mean-shift could benefit from a much faster anisotropic diffusion

implementation, at least as a pre-process to speed up convergence.


3.5.6. Vectorization

An explicit image representation is an integral part of many existing stylization systems. Al-

though I already discussed the trade-offs between these explicit representations and my image-

based approach, I show here how my abstraction framework can be used as a pre-process to

derive an efficient explicit image representation.

Benefits. Vectorization21 of images of natural scenes requires significant simplification

for most practical applications because generally neighboring pixels are of different colors so

that a true representation of an input image might require a single polygon for each pixel. This

simplification is essentially analogous to the abstraction qualities I have discussed so far: I want

to simplify the contours and colors of a vector representation of an image, while retaining

most of the perceptually important information. Consequently, it is not surprising that my ab-

straction framework can aid in this simplification process with its use of a non-linear diffusion

filter and pseudo color quantization. Figure 3.23 demonstrates the two key benefits: (1) noise

removal; and (2) contour simplification. Because the non-linear diffusion step of the framework

removes high-frequency noise, this information does not need to be encoded into a complex

vector representation. Similarly, the quantization contours of the abstracted images are pro-

gressively simplified in their shape, requiring fewer control points to encode into any standard

vector format. This approach of simplification followed by vectorization contrasts with the

traditional approach of vectorization followed by simplification [36, 161, 26]. The main advan-

tage of the traditional approach is that vector representations at different spatial scales can be

treated independently, in the course of which some features may be removed completely as part

of the simplification. The advantages of the approach presented here are that, as above, many

21 Here, defined as the act of converting an image into bounded polygons of a single color.


features do not need to be vectorized in the first place, that the simplification can happen much

faster, and that temporal coherence of consecutive frames is improved. The reason for increased

temporal coherence is rooted in the sparse parametric representation that vectorization affords.

Given an efficient (low redundancy) vector representation (e.g. B-splines) of a shape, this shape

can change considerably if one of its control-points is removed or altered excessively. Since

simplification of vectors includes just these types of modifications, the traditional vectorization

approach is prone to very unstable shape representations.22 If vectorization is performed after

simplification, then temporal coherence is mainly a function of the coherence quality of the

vectorization input. Given the good coherence characteristics of my framework,23 this leads to

improved temporal coherence after vectorization.

Implementation. The vectorization implementation I have chosen is based on simple iso-

contour extraction of the color information in the abstracted and hard-quantized images. I vec-

torize the edge and color information separately to keep the vectorized representation as simple

as possible. Individual polygons are expressed as polylines or Bézier curves, depending on the local curvature of the underlying contours, and written out as PostScript files. Vectorization of a single image takes on the order of 1-3 seconds, depending on the resolution of the input image and the desired complexity of the vectorized output. This process is not optimized for efficiency.
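The framework's own curve-fitting code is not reproduced in this dissertation, but the core idea of encoding a contour with only as many control points as its shape requires can be sketched with the standard Ramer-Douglas-Peucker algorithm. This is a minimal illustrative sketch, not the implementation used here; the function name and tolerance are hypothetical.

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker polyline simplification: keep only the
    control points needed to stay within 'epsilon' of the raw contour."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    # Find the interior point farthest from the chord between endpoints.
    dmax, imax = 0.0, 0
    for i, (x, y) in enumerate(points[1:-1], start=1):
        d = abs(dy * (x - x0) - dx * (y - y0)) / norm
        if d > dmax:
            dmax, imax = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]          # chord is close enough
    left = rdp(points[:imax + 1], epsilon)      # recurse on both halves
    right = rdp(points[imax:], epsilon)
    return left[:-1] + right

# An iso-contour that is mostly straight with one corner: the corner
# survives simplification, the collinear points are dropped.
contour = [(x, 0.0) for x in range(10)] + [(9.0, y) for y in range(1, 10)]
simplified = rdp(contour, epsilon=0.5)
```

Simpler (smoother) input contours thus yield fewer control points, which is exactly why simplification before vectorization pays off.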

Limitation. An advantage of temporal vectorization extensions, in addition to increased

temporal coherence, is the possibility of a more compact temporal vector representation. Instead

of encoding each frame independently, one can specify an initial shape and then encode how the

22This has led to some computationally very expensive temporal vectorization extensions [161, 26].
23This refers to coherence as a result of simplification and smoothing, not soft-quantization functions, as these functions cannot be used for vectorization (see Limitation, below).


shape transforms for successive frames24. Unfortunately, this is a difficult problem, as it requires

accurate knowledge of where a shape in one frame can be found in the next. This object-

tracking problem (in this case, contour-tracking) is a major research effort in the computer

vision community and not, as yet, robustly solved. For this reason, both Wang et al. [161]

and Collomosse et al. [26] require user-interaction to correct for tracking mistakes, particularly

in the presence of camera movement and occlusions. My vectorization approach faces the

same challenges and limitations when moving from a single frame encoding to an inter-frame

encoding scheme.

Vectorization, as defined here, requires true discontinuous quantization boundaries (for both

edges and color information). As a result, my vectorized images lose those temporal coherence advantages that stem from the soft-quantization functions of my framework.

3.5.7. Complementary Framework Effects

In addition to the design goals that I implemented within the abstraction framework, a handful

of stylistic effects presented themselves for free as a result of the framework’s various image

processing operations. Initially, this came as a surprise to me, considering that (1) I did not

intentionally program these effects; (2) most of the effects are traditionally considered artistic,

not perceptual or computational; and (3) most effects are considered challenging research ob-

jectives in their own right (see Indication, below). Upon reflection, though, these observations

strengthen my belief that there are many unknown connections between perception and art that have yet to be modeled and measured with the use of NPR. In this dissertation, I include the two

most prominent effects that have also been discussed in previous work.

24Most video compression schemes make use of this inter-frame coherence by encoding just the information that changes between two frames in so-called delta-frames.


Figure 3.24. Automatic Indication. The inhomogeneous texture in these images causes spatially varying abstraction. As a result, fine detail subsists in some regions, while being abstracted away in other regions. Note how the bricks in the top and middle images are only represented intermittently with edges, yet the observer perceives the entire wall as bricked. The few visible brick instances are interpreted as indicating a brick wall and empty regions are visually interpolated. The same applies to the blinds in the middle image and the shingles, the wheat and the trees in the bottom image. These types of indication are commonly used by artists, particularly the shadows indicated underneath the windowsill in the top image and the fine branches in the bottom image, which are hinted at by faint color, while only the main branches are drawn with edges.


Indication. Indication is the process of representing a repeated texture with a small num-

ber of exemplary patches and relying on an observer to interpolate between patches. Winken-

bach and Salesin [169] explain the associated challenges thus: “Indication is one of the most

notoriously difficult techniques for the pen-and-ink student to master. It requires putting just

enough detail in just the right places, and also fading the detail out into the unornamented

parts of the surface in a subtle and unobtrusive way. Clearly, a purely automated method for

artistically placing indication is a challenging research project.”

For structurally simple, slightly inhomogeneous textures with limited scale variation, like

the examples in Figure 3.24, my framework can perform simple automatic indication, includ-

ing stroke texture25 (Figure 3.24: top-right image, shadows under window-sill). The frame-

work achieves indication by extracting edges after a number of abstraction simplification steps.

Depending on the given image contrast and Equation 3.2, some parts of an image are simpli-

fied more, some less, in an approximation to the perceived difference in those image regions.

The DoG edges then emphasize high-contrast texture regions that remain prominent

throughout the simplification process. All other edges in textured regions are removed, leaving

the missing texture to be inferred by an observer.
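As a heavily simplified illustration of this mechanism, the following sketch computes a binary DoG edge response in plain NumPy. The σ, k, and τ defaults are illustrative values in the spirit of the framework's DoG stage, not its actual GPU implementation, and a hard sign test stands in for the smooth ramp of Equation 3.7.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with reflect padding (pure NumPy)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, "valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, "valid"), 0, rows)

def dog_edges(img, sigma=1.0, k=1.6, tau=0.98):
    """Mark pixels where the fine scale falls below tau times the coarse
    scale, i.e. where the DoG response turns negative; these become the
    dark edge lines of the stylized output."""
    d = gaussian_blur(img, sigma) - tau * gaussian_blur(img, k * sigma)
    return (d < 0).astype(float)  # 1.0 = edge pixel

# A vertical luminance step produces an edge line along the boundary,
# while flat regions stay edge-free.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
edges = dog_edges(img)
```

Only contrasts that survive blurring at both scales fire the detector, which is the mechanism behind the indication effect: low-contrast texture regions are simplified away before the edge pass ever sees them.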

As DeCarlo and Santella [36] noted, such simple indication does not deal well with complex

or foreshortened textures. The automatic indication in my framework is not as effective as

the user-drawn indications of Winkenbach and Salesin [169], but some user guidance can be

supplied via Equation 3.2, to provide vital semantic meaning.

25Winkenbach and Salesin [169] refer to line markings that represent both texture and tone (brightness) as stroke texture.


Figure 3.25. Motion Blur Examples. Motion Lines: Cartoons often indicate motion with motion lines. Motion Blur: This sequence shows a radial pattern of rays at different orientations (angle) and of varying width (radius), which is convolved with a motion blur filter at different orientations. Note that lines parallel to the direction of the motion blur are preserved, while lines perpendicular to the motion blur are maximally blurred.

Figure 3.26. Motion Blur Result. Original: Images of a stationary car and a moving motion-blurred car. DoG Filter: Corresponding images from my modified DoG filter. Note how many of the remaining horizontal lines resemble the speed lines used by comic artists. (Original image released under the GNU Free Documentation License.)

Motion Lines. Comic artists commonly indicate motion with motion lines parallel to the

suggested direction of movement (Figure 3.25, Motion Lines). Interestingly, Kim and Fran-

cis [89, 52] showed that these motion lines are not purely artistic and actually have perceptual

foundations, which is likely the reason why artists have adopted them in the first place and why


they are so easily understood. The DoG edges in my framework automatically create streaks

resembling motion lines as shown in Figure 3.26. Although I did not explicitly program this

behavior (as in Collomosse et al. [26]), it can be easily explained.

Motion blur is a temporal accumulation effect that occurs when a camera moves relative to

a photographed scene. This relative movement can be any affine transformation, such as a translation, rotation, or scaling, but I focus this discussion on translational movements only. A motion blur, or oriented blur, O(·), can be formulated using a modification of the familiar Gaussian kernel:

O(\hat{x}, \sigma_o, \theta) = \frac{\int f(x)\, e^{-\frac{1}{2}\left( \|\Theta(\hat{x}-x,\,\theta)\| / \sigma_o \right)^2} dx}{\int e^{-\frac{1}{2}\left( \|\Theta(\hat{x}-x,\,\theta)\| / \sigma_o \right)^2} dx} \qquad (3.15)

\Theta(x, \theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ 0 & 0 \end{pmatrix} \cdot x \qquad (3.16)

Here, parameter σo determines how much the image is blurred, i.e. the duration of the expo-

sure in relation to the speed of the scene relative to the camera. Parameter θ indicates the blur

direction in the image plane. Equation 3.15 is a very simple but sufficient model for this dis-

cussion that does not take into account depth (image elements moving at different speeds) and

assumes that only the camera moves with respect to the scene. Figure 3.25 (Motion Blur) shows

the result of this filter on a pattern of lines of varying widths and different orientations. Lines

parallel to the blur direction are blended only with themselves and appear unaffected, while

lines perpendicular to the blur direction are blended with neighboring lines and lose sharpness. Intermediate angles vary with the sine of the angle. In the car example in Figure 3.26, the

vertical line of the door is blurred away, while the door’s horizontal line (parallel to the motion)

is preserved.
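A discrete version of this filter is straightforward to sketch: accumulate copies of the image shifted along the blur direction, weighted by a 1-D Gaussian, and renormalize (the denominator of Equation 3.15). Pixel offsets are rounded to integers and boundaries wrap, so this is only an approximation of the continuous formulation, not the implementation used for Figure 3.25.

```python
import numpy as np

def oriented_blur(img, sigma_o, theta):
    """Approximate Eq. 3.15: Gaussian-weighted average of pixels along
    the line through each pixel in direction theta (offsets rounded to
    whole pixels; boundaries wrap for simplicity)."""
    radius = max(1, int(3 * sigma_o))
    dx, dy = np.cos(theta), np.sin(theta)
    acc = np.zeros_like(img, dtype=float)
    weight_sum = 0.0
    for t in range(-radius, radius + 1):
        w = float(np.exp(-0.5 * (t / sigma_o) ** 2))
        shift = (int(round(t * dy)), int(round(t * dx)))  # (rows, cols)
        acc += w * np.roll(img, shift, axis=(0, 1))
        weight_sum += w
    return acc / weight_sum  # normalization = denominator of Eq. 3.15

# Horizontal stripes are unchanged by a horizontal blur (theta = 0),
# but wash out toward gray under a vertical blur (theta = pi / 2).
stripes = np.zeros((32, 32))
stripes[::2, :] = 1.0
parallel = oriented_blur(stripes, 2.0, 0.0)
perpendicular = oriented_blur(stripes, 2.0, np.pi / 2)
```

The stripe test reproduces the behavior described above: structure parallel to the blur direction is blended only with itself and survives, while perpendicular structure is averaged away.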


The DoG filter therefore mainly detects edges in the direction of motion, because other

edges are largely blurred away. As a consequence, the resulting image looks like it has motion

lines added.

3.5.8. Comparison to Previous Systems

I have pointed out throughout this Chapter how my framework differs from previous systems

in terms of the design goals I have chosen, and I have demonstrated performance increases

in perceptual tasks not evident in previous work [149, 62]. However, these comparisons are

still not as detailed as they should be, mainly due to the fact that the NPR community is lack-

ing comparison criteria above the level of simple frame-rate counts. There are other issues

that compound this problem. Most stylization systems are not based on perceptual principles

and therefore not psychophysically validated. Performing such comparative analyses oneself

is complicated by the fact that previous stylization systems are rarely freely available, difficult

and time-consuming to implement, and that they generally have a limited amount of results

openly available. There simply is no standard repository of imagery for NPR applications (like

the Stanford bunny for meshes, or the Lena image for image processing). I hope that my work

can contribute to the solution to these problems by making available a large number of input

and result images and videos, and more importantly, by validating my own framework with

psychophysical experiments that can be used in direct comparison with future NPR systems.


3.5.9. Future Work

Despite the numerous processing steps that comprise my video abstraction framework, it is

simple to implement and shows great potential in terms of computational and perceptual effi-

ciency. I therefore hope that the framework will be adopted for a number of interesting research

directions.

NPR Compression. As noted in Section 3.5.3, I believe that abstractions generated by

my framework are subject to good compression ratios, yet most current compression schemes

are likely to perform sub-optimally. Non-photorealistic compression is basically unheard of,

partly because the compression community has very well-defined and rigid ideas about realism

and desirable image fidelity. I believe NPR compression to be promising future research mainly

because of the significant removal of information in abstractions and because of the ability to

alter the reconstruction parameters on the decompression side for stylistic effect and perceptual

efficiency.

Minimal Graphics. In their paper called Minimal Graphics, Herman and Duke [69]

state that “[the] main question which still remains is how to automatically extract the minimal

amount of information necessary for a particular task?”. I have shown that two specific tasks

can be performed better given my abstractions, but I did not show (nor do I believe) that this

performance increase is maximal. As Section 3.4 demonstrated, removal of information can

actually lead to better efficiency for specific perceptual tasks, but there is a point at which addi-

tional removal of information will bring about a decline in efficiency26. It would be interesting

and valuable to use a framework like the one presented here to chart image information against task efficiency and to map these findings to framework parameters. Such perceptual

26This can be proven by considering the extreme case of removing all information.


research using an NPR framework would be another example of how to close the loop of mutual

beneficence that this dissertation is intended to demonstrate.

3.6. Summary

In this chapter, I presented a video and image abstraction framework (Figure 3.4) that works

in real-time, is temporally coherent, and can increase perceptual performance for two recogni-

tion and memory tasks (Section 3.4).

Framework. Unlike previous systems, my framework is purely image-based and demon-

strates that meaningful abstraction is possible without requiring a computationally expensive

explicit image representation. To the best of my knowledge, my framework is one of only three

automatic abstraction systems that prove effectiveness for visual communication tasks with

user studies. Of these, my studies are the most comprehensive (two tasks with colored stim-

uli compared to one study with colored stimuli and two studies with black-and-white stimuli,

respectively).

By basing the framework design on perceptual principles, I obtain at least two visual effects

(Section 3.5.7) in my output images for free27, which previous systems implemented explicitly

and with computational overhead. These effects are indication, the suggestion of extensive

image texture by sparse texture elements, and motion lines, an artistic technique to illustrate

motion in static images.

Customizable Non-linear Diffusion. I developed an extension (Equation 3.2) to the

bilateral filter (Equation 3.1) as an approximation to non-linear diffusion that allows for external

control via user-data in various forms (painted, data-driven, or computed).

27Without explicit computation devoted to these effects.


Temporally Coherent Quantization. I constructed two smooth quantization functions,

both of which (1) increase temporal coherence for animation; and (2) offer stylization options

not available to previous systems using discontinuous quantization functions. The first quanti-

zation function (Equation 3.7) operates on the well-known DoG function to extract edges from

an image. The second quantization function (Equation 3.14) flattens colors in an image for

data reduction and artistic purposes. Another contribution of this second function is its spatially

adaptive behavior, which achieves a good trade-off between the desired level of quantization

and temporal coherence by adapting to local image gradients.
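The behavior of the second function can be illustrated with a short sketch: luminance is pulled toward the nearest bin center through a tanh ramp rather than a hard staircase. The bin count and sharpness φ below are illustrative, and the gradient-adaptive sharpness of Equation 3.14 is deliberately omitted.

```python
import numpy as np

def soft_quantize(lum, n_bins=8, phi=3.0):
    """Smoothly pull luminance values in [0, 1] toward the nearest of
    n_bins levels. The tanh ramp replaces a hard staircase, so a pixel
    whose luminance drifts across a bin boundary changes its output
    gradually (better temporal coherence). Eq. 3.14 additionally adapts
    phi to the local luminance gradient, which this sketch omits."""
    delta = 1.0 / n_bins
    idx = np.minimum(np.floor(lum / delta), n_bins - 1)
    nearest = (idx + 0.5) * delta            # bin centers
    return nearest + (delta / 2) * np.tanh(phi * (lum - nearest) / delta)

# A luminance ramp: the output clusters near bin centers but remains
# monotone and continuous, with no hard jumps at bin boundaries.
lum = np.linspace(0.0, 1.0, 11)
q = soft_quantize(lum)
```

A large φ approaches hard quantization; a small φ approaches the identity, which is the stylization trade-off the framework exposes.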

Additional Materials. More information on this project, including a conference pa-

per [171], GPU code, and an explanatory video, can be found on the Siggraph 2006 conference DVD. The same materials and additional images are also available online at http://videoabstraction.net.


CHAPTER 4

An Experiment to Study Shape-from-X of Moving Objects

Figure 4.1. Shape-from-X Cues. The human visual system derives shape information from a number of distinct visual cues, which can be targeted and tested using non-photorealistic imagery. Shading: Lambertian shading varies with the cosine of the angle between light direction and surface normal. Flat surfaces exhibit constant color, while curved surfaces show color gradients. Texture: Texture elements, or texels, accommodate their form to align with the underlying surface, causing texture compression. Contours: Discontinuities of various surface properties are shown in different colors (red: silhouette, black: outline, green: ridges, blue: valleys). Note that the objects' shadows are synonymous with silhouette-contours as seen from the casting light's point-of-view. Motion: Under rigid body motion, points on the surface move at different speeds and in different directions.

In this Chapter, I present a psychophysical experiment that uses non-photorealistic imagery

to study the perception of several shape cues for rigidly moving objects in an interactive task.

Traditionally, most shape perception studies display only a small number (generally one, or two for some comparison experiments; see Section 4.3.4) of static objects. Yet, most interac-

tive graphical environments, such as medical visualization, architectural visualization, virtual


reality, physical simulations, and games, contain a large number of concurrent dynamic shapes

and objects that move independently or relative to an observer. Because shape perception is

vital to many recognition and interaction tasks, it is of great interest to study shape perception

for multiple shapes in dynamic environments, in order to develop effective display algorithms.

The experiment I propose in this chapter benefits greatly from carefully designed non-

photorealistic imagery to separate and individually study shape cues that find common usage in

many computer graphics applications.

4.1. Introduction

The art and science of photorealism in computer graphics, as exemplified in Figure 1.1,

has shown impressive improvements over the last decades, but the associated computational

demands have put this level of realism out of reach of most real-time applications. As a result,

real-time 3-D graphics commonly only offer best-effort approximations in terms of realistic

lighting, shading, and material effects. These limitations raise several important questions. What

effects do the approximations have on applications depending on shape perception? If we want

to prioritize computational resources for the most effective shape cues for a given set of shapes

or a given application, how do we determine this effectiveness?

Another set of questions concerns the necessity for realism. I have already mentioned in

Chapter 1 and demonstrated in Chapter 3 that sometimes less is more when it comes to visual

stimuli for humans1. Realistic images can, at times, be overbearing or conflicting in terms of

the information that is presented to a viewer, and it may be more effective for a given task to

display less information that is emphasized appropriately [150]. Being freed from the restric-

tions that reality (even an approximate one) imposes, how can we emphasize the shape of an

1Incidentally, results in this Chapter reiterate this concept.


object effectively using stylistic (non-realistic) elements? Similarly, how can we compare the

effects of various known stylization techniques for conveying shape?

I believe the answers to these questions to be important, not only because they will advance

the state-of-the-art in realistic and non-realistic graphics, but because they may provide insights

into the development of art, and our perception of art. Of course, this chapter provides only

very few actual answers to these questions. What it provides instead is a simple and flexible

experiment that enables research into these questions.

4.1.1. Experimental Design Goals

The set of shape cues I investigate (shading, contours, and textures) is not meant to be exhaus-

tive, but rather demonstrative. The number of additional existing shape cues, their possible

parameterizations, and the permutations of combined effects are probably too vast to explore

in a single lifetime. As such, the main purpose of this chapter is to demonstrate an example

of the types of studies that my experiment supports and to offer my methodology up for other

researchers in computer graphics and perception to perform their own investigations.

In designing the experiment, I take special care to address the following key issues:

(1) A number of different shape cues can be studied in isolation and in combination —

This is important to support a broad range of studies.

(2) The difficulty of the experimental task can be easily adjusted — If the task is too easy

or too difficult no meaningful data can be gathered. The task should be designed so that

participants at different performance levels can provide meaningful statistical data.


(3) The interaction itself is simple — It is important to separate the task from the inter-

action necessary to perform the task. While the task should be as difficult as possible

(without being impossible), the interaction should be very simple to ensure that the

performance of the task is measured and not that of the interaction.

(4) The performance of participants can be tested under time-constrained conditions —

Most traditional shape experiments have no time limit for their trials. Because humans

can only attend to very few stimuli simultaneously [39, 160] the results under time-

pressure might very well be different from these static experiments and offer important

guidance for the design of real-time applications.

(5) The experimental shapes are general, relevant, and parameterizable — This is impor-

tant so that valid and meaningful statements can be made about the shapes that are

tested and the results that apply to them. It also facilitates replication and verification

of experimental results by third parties.

(6) Learning effects and other biases for the task are minimal — After an initial period of

getting acquainted with the interaction and developing a strategy, learning and memory

of the experimental procedure should not impact the performance of the interactive

task2, so that performance differences are due to varied experimental conditions and

not increasing experience. For the same reason, the experimental conditions should not

be biased or otherwise predictable to ensure the experimental data reflects perceptual

performance instead of system biases or deductive reasoning abilities.

2Note that this is different from studying memory performance, as in Section 3.4.2. Even there, the position of cards between trials was randomized, so that participants could not remember the correct position from the previous trial. Instead, participants had to remember the positions anew for each trial, thus making each trial independent.


It is my hope that these design goals are specific enough to provide meaningful results, yet

general enough to allow other researchers to (1) adopt my experimental framework to study

other types of shape cues, and to (2) evaluate the effectiveness of interactive non-photorealistic

rendering systems to convey shape information. Section 4.4 explains how I implement the

above goals in my own experiment and Section 4.7 demonstrates via data analysis that these

goals were attained.

4.1.2. Overview

In the experiment I present here, participants are shown 16 moving objects, 4 of which are des-

ignated targets, rendered in different shape-from-X styles. Participants select these targets by

simply touching a touch-sensitive table onto which the objects are projected. The experimen-

tal data shows that simple Lambertian shading offers the best shape cue, followed by outline

contours and, lastly, texturing. The data also indicates that multiple shape cues should be used

with care, as these may not behave additively in a highly dynamic environment. This result is

in contrast to previous additive shape cue studies for static environments and reflects the impor-

tance of investigating shape perception in the presence of motion. To the best of my knowledge,

my experiment is unique in its capacity to compare the effectiveness of multiple shape cues in

dynamic environments and it represents a step away from traditional, impoverished (reduction-

ist) test conditions, which may not translate well to real-time, interactive applications. Other

advantages of the experiment are that it is simple to implement, engaging and intuitive for par-

ticipants, and sensitive enough to detect significant performance differences between all single

shape cues.


4.1.3. Note on Chapter structure

Although this chapter follows largely the same structure as Chapter 3, it does so in a slightly different order: Chapter 3 presents an automatic abstraction system based

on perception and verified by two experiments; whereas this chapter presents a psychophysical

experiment to study perception, based on non-photorealistic imagery. In this chapter, I therefore

introduce important aspects of the human visual system (Section 4.2) before discussing related

work (Section 4.3).

4.2. Human Visual System

Most interaction with our visual world requires some shape identification or categorization.

The shape of the visible portion of an object can be correctly interpreted if the distance between

each point on the surface of the object, PO, and its projection onto the eye’s retina, PE , is

known (Figure 4.2). I will refer to this distance as the depth at PE . Calculating the depth from

the light-signal at the retina is an ill-constrained problem, because the light reaching PE could

have emanated from any point along the view-ray cast through PO. To address this problem,

the human visual system is equipped with a number of mechanisms to infer depth information

from an image. The convergence of depth interpretations from different mechanisms leads to

a stable perception of shape3. The different depth interpretation mechanisms that allow shape

perception are collectively referred to as Shape-from-X. The most important shape cues for

computer graphics applications are shading, texture, contours, and motion. Other important

3Sometimes convergence does not occur, leading to multiple possible shape interpretations, as in the famous Necker cube illusion. It is interesting that in such cases only one interpretation can be perceived at a time and that the different perceived interpretations alternate perpetually [75].


Figure 4.2. Left: Depth Ambiguity. Light reflects off a surface point P_O, reaching the retina at point P_E. The length of the vector |\vec{v}|, \vec{v} = P_O - P_E, is the distance between the surface point and the viewer. This situation is ambiguous for the viewer, because the light could have emanated anywhere along the ray P_E + \alpha \cdot \vec{v}, \alpha \in \mathbb{R}^+.

Figure 4.3. Right: Tilt & Slant. The orientation of a surface at a point can be described by the tilt and slant of a thumbtack gimbal placed at that point and aligned with the surface normal. Both the length of the gimbal's rod and the elongation of the attached disk in the image plane indicate the local surface orientation.

shape cues exist, like binocular stereopsis and ocular accommodation and vergence, but are less

commonly applied in a computer graphics context.

4.2.1. Shading

The shading of an object is a complex function of the object’s properties, such as shape and

material, as well as those of its environment, including lights, other objects and the direction

from which it is viewed. For simple illumination conditions (see below) a change in surface

orientation can be inferred from a change in surface shading [95, 96]. Real-time computer

graphics commonly approximate realistic shading with the Phong reflection model [127], a


local illumination model that considers ambient light, diffuse reflection, and specular reflection.

To reduce the number of free variables in my experiment, I set the ambient contribution to zero

and only model diffuse reflection (Lambertian shading) as

I_r = k_d \cdot I_0 \cdot \left( \vec{n} \cdot \vec{\ell} \right),

where I_0 and I_r are the incoming and reflected light intensities, respectively, \vec{n} is the surface normal at P_O, \vec{\ell} is the incoming light direction (as in Figure 4.2), \cdot denotes the vector dot product, and k_d \in [0, 1] indicates the diffuse reflectance properties of the object. I use a single

point-light-source at infinity. Since the dot-product of two vectors changes smoothly according

to the cosine of the angle between the vectors, the change in light-intensity on a Lambertian

surface is a good indicator of change in surface orientation (Figure 4.1).
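This shading model can be sketched in a few lines. The max(0, ·) clamp for back-facing surfaces is a standard addition not written out in the equation above, and the k_d and I_0 values below are illustrative.

```python
import numpy as np

def lambert(normal, light_dir, k_d=0.8, i0=1.0):
    """Diffuse (Lambertian) reflection: I_r = k_d * I_0 * max(0, n . l).
    Both vectors are normalized so the dot product equals the cosine of
    the angle between them; the max() clamps back-facing points to 0."""
    n = np.asarray(normal, float)
    n /= np.linalg.norm(n)
    l = np.asarray(light_dir, float)
    l /= np.linalg.norm(l)
    return k_d * i0 * max(0.0, float(n @ l))

# Intensity falls off with the cosine of the angle between n and l.
head_on = lambert([0, 0, 1], [0, 0, 1])    # n parallel to l: maximum
oblique = lambert([0, 0, 1], [0, 1, 1])    # 45-degree incidence
behind  = lambert([0, 0, 1], [0, 0, -1])   # back-facing: clamped to 0
```

The smooth cosine falloff is what makes the resulting intensity gradients such a reliable cue for changes in surface orientation.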

4.2.2. Contours

Smooth changes in depth often indicate smooth changes in the shape of a surface. Depth dis-

continuities, on the other hand, are a likely sign of figure/ground separation (where figure is

an object and ground is everything else), changes in local topology, or abutting but distinct

surfaces. Such discontinuities are therefore important visual markers for the distinction of

figure from ground, object components, and surfaces. Changes in surface normals and other

differential geometry measures, such as principal curvatures, can also be used to mark shape

discontinuities or extrema in images. Together, these define the set of contours of an object,

some of which are shown in Figure 4.1, Contours. Note that while some contour-types depend


only on object shape, others also depend on the observer’s point-of-view [97]. Several non-

photorealistic rendering algorithms rely on contours to convey essential, but much condensed

shape information [70, 35].

4.2.3. Texture

Texture is most often described in terms of elemental texture units, called texels, and their distribution on a surface. Figure 4.1, Texture, illustrates the use of a random cell texture to indicate shape. Many natural scenes and materials contain textures, such as fields of grass or flowers, heads in a crowd, woven fabric, etc. While Gibson [56] was the first to identify and investigate the importance of texture as a depth cue, several works have since extended his research. Cumming et al. [31] defined three distinct parameters along which texture covaries with depth: compression4, density, and perspective. They found that compression accounts for the majority of texture variation with shape, so I focus on this cue. Compression refers to the change in shape of a texel when mapped onto a surface non-orthogonal to the viewer. Another important factor in texturing is the distribution of texels on a surface, which is generally achieved through a parametrization function of the surface. This function provides a mapping relating texel distribution and orientation to surface shape.
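Compression can be illustrated numerically: a circular texel on a surface slanted away from the viewer projects to an ellipse whose minor axis shrinks by the cosine of the slant angle. A toy sketch (the slant angles are arbitrary, and the model ignores perspective and density effects):

```python
import math

def texel_compression(slant_deg):
    """Aspect ratio (minor/major axis) of a circular texel's projection
    when its surface is slanted away from the view direction by slant_deg."""
    return math.cos(math.radians(slant_deg))

# A frontal texel stays circular; strongly slanted texels flatten out:
for slant in (0, 45, 80):
    print(f"slant {slant:2d} deg -> aspect ratio {texel_compression(slant):.2f}")
```

The steep falloff near grazing angles is why compression is such a strong indicator of surface orientation.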

4.2.4. Motion

If an object moves relative to an observer (via translation, rotation, or a combination of the two), then points on its surface that are at different depths move at different relative speeds. Therefore, the relative movements of these points convey the underlying depths for rigid objects (Figure 4.1, Motion). The rigidity constraint is necessary because plastic deformations can also lead to relative movement of surface points, and the two types of motion, rigid and plastic, cannot be distinguished visually. Such a constraint is also employed by human perception in the form of a bias towards recognizing motion as rigid, if such an interpretation is consistent with the visual stimulus [121].

4 Not to be confused with the term compression used in information theory.
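The depth-dependence of these relative speeds can be sketched with a simple pinhole-camera model: under lateral translation, a point's image-space speed is inversely proportional to its depth. This is an illustrative simplification (the focal length, camera speed, and depths below are made up):

```python
def image_velocity(focal_len, cam_speed, depth):
    """Image-space speed of a static point under lateral camera
    translation, using the pinhole relation v_img = f * T / Z."""
    return focal_len * cam_speed / depth

# Nearer points sweep across the image faster than distant ones,
# which is the motion-parallax cue to relative depth:
for z in (1.0, 2.0, 4.0):
    print(f"depth {z} -> image speed {image_velocity(500, 0.1, z):.1f} px/s")
```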

4.2.5. Limitations

None of the shape cues above is by itself sufficient for shape perception. The shading of an object may be indistinguishable from the color variation of its material. The efficiency of shape-from-texture depends largely on the texture used, its homogeneity, and its parametrization, all of which are arbitrary. Contours are highly localized visual markers, requiring visual interpolation, and are commonly under-constrained. Lastly, shape-from-motion depends on the reliable tracking of surface points, as well as the robust distinction of rigid from plastic motion, neither of which can be guaranteed. This insufficiency of any single shape cue to provide robust depth information explains the redundant shape detection mechanisms of the human visual system.

4.3. Related Work

Compared to Chapter 3, the rendering techniques used in my experiment are too basic to warrant a comparison to previous work. Instead, I focus here on the related experiments that researchers have undertaken to study shape. The following discussion is structured according to the shape-from-X cues of Section 4.2.


4.3.1. Shape from Shading

In early pioneering work on non-realistic shape perception, Ryan and Schwartz [141] presented participants with photographs and shaded images of objects in different configurations and measured the time it took participants to correctly identify the depicted configuration. Due to the preliminary nature of their study, they used only three arbitrary real objects. The configurations of the objects depended on their functionality, which may not have been known to participants. More importantly, the authors lacked computer graphics capabilities and commissioned an artist to produce the shaded images. Their experiment therefore largely measured the artist's craftsmanship at conveying the different object configurations.

Koenderink et al. [95, 96] invented a thumbtack-shaped widget, as in Figure 4.3, for participants to indicate the perceived shape on a Lambertian surface. Sweet and Ware [153] investigated the interaction between shading and texture and also included specular reflections in one of their experiments. Again, participants used Koenderink's thumbtack widget for feedback. Johnston and Passmore [83] mapped Phong-shaded spheres with band-limited random-dot textures. Instead of using the thumbtack widget, they asked subjects forced-choice questions about the spheres and a paired surface patch, which was oriented in a different direction or had a different curvature from the spheres. As explained in Section 4.3.4, none of the presented evaluation techniques lend themselves to experimentation with moving objects, and their results may therefore not apply to many highly dynamic computer graphics applications.

Barfield et al. [5] investigated the effect of simple computer shading techniques (wireframe, flat shading with one or two light sources, and smooth shading) on the mental rotation performance (see Section 4.3.4, Mental Rotation) of participants. The mental rotation task is similar to the task in my experiment, but several differences exist to enable real-time dynamic interaction: my experiment uses multiple concurrent shapes, the shapes all move, and the shapes differ in their constituent parts, not in their arrangement.

Rademacher et al. [132] compared photographs to shaded computer graphics images of simple geometric shapes and asked participants whether they were seeing a photograph or a synthetic image. In a similar experiment, Ferwerda et al. [47] compared the perceived realism of photographs of automobile designs with versions rendered in OpenGL and rendered with a global illumination model. As such, neither experiment directly measured the perception of shape, but rather the contribution of soft shading, number of lights, and surface properties to the perception of subjective realism.

Kayert et al. [88] probed the neural activity of macaque monkeys using invasive surgical probes to study the modulation of inferior temporal cells to nonaccidental shape properties (NAP) versus metric shape properties (MP) of shaded objects. Such experiments obviously cannot be performed on human subjects.

Biederman and Bar [8] used nonsensical objects with diffuse and specular shading to compare shape perception theories based on NAP against theories based on MP. The effects of shading itself were not measured.

4.3.2. Shape from Contours

In their study described above, Ryan and Schwartz [141] also presented participants with line drawings and cartoons of different object configurations. Because an artist generated their images, it is likely that considerable perceptual and cognitive effort went into creating effective images5. At the same time, the method for creating the images cannot be described quantitatively. The results of their experiment are thus largely biased by the artist and difficult to replicate.

Shepard and Metzler [147] evaluated mental rotation performance of three-dimensional chained cubes presented in a line-drawing style. Yuille and Steiger [184] performed follow-up work based on the same experiment.

In his recognition-by-components (RBC) paper, Biederman [10] used line drawings to illustrate his theory and performed various experiments to determine the effects of reducing the number of components used to represent an object, and of deleting parts of the lines used to represent each component. The experiments were designed to support the RBC theory and not to test the effectiveness of line drawings at conveying shape.

In contrast to the large corpus of research in rendering techniques for contours on polygonal meshes [35, 70], implicit surfaces [16, 128], and even volumetric data [20, 41], the literature on the perceptual evaluation of contour rendering from 3-D models is relatively sparse. Gooch and Willemsen [60] performed a blind walking task in an immersive virtual environment, rendered with contours, to determine perceived distances as compared to distances estimated in the real world. They did not evaluate how contours compared to other shape cues. Another difference from my work is that Gooch and Willemsen probed the estimation of quantitative distances, a task that humans find notoriously difficult, whereas my experimental design is geared towards shape estimation and categorization, for which only relative or qualitative depth information is required.

5 In fact, the different types of representations varied not only in their use of shading or lines, but also in the amount of detail that was depicted. In particular, the cartoon representations were more symbolic than literal copies of the original scene.


4.3.3. Shape from Texture

Various authors have shown that texture elements, aligned with the first and second principal curvature directions of a surface, are good candidates for indicating local surface shape and curvature [77, 58, 90, 153]. The specific experiments of these authors do not translate well to dynamic scenes, but it will be interesting future work to verify their results for dynamic environments using my experiment.

4.3.4. Measurements

Most work on shape perception uses one of the following established methods to measure perceived surface shape:

(1) Thumbtack gimbal. Participants place a (virtual) gimbal widget akin to a thumbtack at a particular orientation on the object's surface so that the pin's direction is aligned with the estimated surface normal. Both the direction of the pin and the eccentricity of the attached disk are used to indicate and measure estimated tilt and slant, as in Figure 4.3 [95, 98, 118, 96, 58, 90, 153].

(2) Mental Rotation. Participants are shown a pair of images in one of two configurations: (a) depicting the same object but from different viewpoints; and (b) depicting different objects. The experimenter measures the time a participant takes to decide on a given configuration. This task requires participants to mentally rotate one of the shapes to match the other and can employ 2-D shapes and rotations (in-plane) [30, 51] or 3-D shapes and rotations [147, 184, 5, 11].


(3) Exemplars/Comparisons. Several physical objects with a variety of surface shapes are kept at hand. These are used as exemplars, so that a study participant can indicate a position on an exemplar surface with properties similar to those of the object being studied (e.g. [151]). When measuring perceived distances in virtual environments, some experiments required users to walk the estimated distance in the real world as an analogous measure [60, 111], while others asked participants to estimate the time it would take them to walk the perceived distance at a constant pace in the real world [129].

(4) Naming of objects. Subjects are shown depictions of real-world objects and are asked to name the depicted object as quickly as they can [10].

Discussion. While the gimbal and exemplar methods are capable of yielding highly sensitive quantitative data, they do not transfer to a fully dynamic context like the one I am investigating. Moving objects simply do not hold still long enough to perform these types of measurements. The same restriction applies to mental rotation, because at least one of the two shapes to be compared has to be stationary. Naming of objects requires participants to be familiar with the object they are presented with. Even if they know the name, participants might still suffer from the tip-of-the-tongue effect, where a known word is not readily verbalized [149].

My solution is to opt for the more qualitative shape perception task detailed in Section 4.4. Technically, this task is closer to shape categorization than exact shape quantification, but then, so are most everyday shape-dependent tasks. To distinguish a plate from a cup, humans need to make qualitative judgements about the objects' shapes rather than compare their exact dimensions or other geometric measures.


Direct vs. Indirect Measurements. Another way to classify measurements is in terms of direct versus indirect measurements. Placing a widget on a surface position yields direct numerical values for the estimated surface normal. As discussed, this is not practical for moving objects. Consequently, much of the work on distance perception in Virtual Environments (VEs) uses indirect measurements. Plumert et al. [129], Gooch and Willemsen [60], and Messing and Durgin [111] have all used walking-related tasks to indirectly estimate perceived distances in VEs. The task was either to guess the time it would take to walk from the current position to another position inside the VE, or to actually walk the estimated distance in the real world without visual feedback from the VE.

Indirect measurements have the disadvantage that they generally include larger individual variations and have to be related to the measure of interest via some mapping, which may introduce additional error. The advantage is that they allow a dynamic and often more natural and intuitive experimental scenario, e.g. walking versus orienting a widget using a mouse. In my experiment, I use several indirect measures of performance, allowing me to present participants with an intuitive task and interaction paradigm that evokes competitive performance levels. I address individual variations and cumulative errors in the statistical analysis of the experimental data (Section 4.6.2 and Section 4.6.3).

4.3.5. Test Shapes

The test objects in previous studies can be broken down into two broad categories: (1) representational, i.e. representing real-world objects (e.g. flashlight, banana, table, chair) [141, 10, 11, 60, 47, 129]; and (2) non-representational (or nonsensical) objects.


Figure 4.4. Real-time Models. The complexity of models used in real-time and interactive applications, like games and 3-D visualizations, is often kept fairly low to minimize the computational demand of the rendering process. Many of these simple models can thus be well described in terms of generalized cylinders [12, 67].

Because the perception of representational objects is affected by a person's familiarity with the object (e.g. a telephone versus a specialized technical instrument), most related work employed non-representational shapes instead. Some authors have used wire-like [140, 18] or tube-shaped [5] objects that resemble bent paper-clips and pipes, while others used similar stimuli comprised of chained cubes instead of wires [147, 184, 154]. Several authors have used soft, organic, blob-shaped or amoebic objects [18, 118]. Some experiments were based on generalized cylinders [8, 11, 88]. Finally, a number of experiments, mostly those involving shape-from-texture studies, used shapes resembling undulating hills and valleys or folded cloth [151, 153, 90, 3].

My own experiment uses a form of generalized cylinders, called geons [10] (Section 4.4.6), to avoid the familiarity problem associated with real-world objects. Additionally, generalized cylinders are easily parameterized and, unlike wires, cubes, blobs, or cloth, are flexible enough to describe many basic shapes, and can be combined to form a large number of real-world objects, particularly the low-resolution objects commonly found in real-time graphics (Figure 4.4).

4.4. Implementation

In this section, I use the concepts and terminology introduced in the related work (Section 4.3) to explain how my experimental design implements the goals put forth in the introduction (Section 4.1). I start off with a brief overview of the experiment to define the given task and interaction, and then list the details of implementing each of the goals.

4.4.1. Overview

Participants in the experiment have to distinguish particular shapes from a set of moving objects under different display conditions. Figure 4.5 shows the experimental setup. Participants sit in front of a touch-sensitive board onto which I project moving test shapes with an overhead data projector. Participants are asked to select objects that share certain shape characteristics. An object is selected by simply touching a finger to the board where the object is displayed. Each experimental trial consists of different phases during which objects are displayed in different shape-from-X modes (Figure 4.6). Although I am mainly interested in shape from shading, contours, and texture, I test two additional display modes, one combining shading and contours and one using an alternative color texture (for details, see Section 4.4.2). The system records all user events, as well as several system events (Section 4.6).


Figure 4.5. Experimental Setup. Left: Schematic diagram of the setup. A data projector projects imagery onto a conductivity-based touch-sensitive surface. The user, grounded through the seat and a foot mat, simply taps on virtual objects displayed on the surface. A single computer synthesizes imagery and gathers data. Right: Photograph of the actual setup.

4.4.2. Shape Cue Variety (1)

In the real world (or in photorealistic rendering) all shape cues are simultaneously present in a scene to various degrees. The fact that the human visual system derives shape information from numerous sources does not mean, however, that all of these sources are equally valuable or can be leveraged equally. To test the individual contributions of shape cues to shape perception, the cues have to be separated into orthogonal (mutually independent) stimuli. The following list describes the rendering techniques used to produce these non-photorealistic stimuli, which are depicted in Figure 4.6.

Figure 4.6. Display Modes. Left column: Screen-shots from the experiment for each of the display modes. Test objects are rendered onto a static background with the same visual characteristics as the foreground objects to prevent outlines from depth-discontinuities (as in the right column). Static objects are extremely difficult to identify. Once they are moving, they pop out from the background immediately. Right column: Objects highlighted visually for comparison.

(1) Outline. Of the many possible contours I can render (silhouette, outline, creases, ridges, valleys, etc. [70, 137]), I choose outlines, because they are the basis of most NPR line-style algorithms. Outlines are those edges, e, for which

(4.1) dot(~v,~t1) · dot(~v,~t2) < 0,

where ~v is the view vector, and e is the edge shared by the triangles6 with normals ~t1 and ~t2. There exist many efficient methods to implement Equation 4.1 [70], but one of the fastest methods is a two-pass rendering approach in which first only back-facing (dot(~v,~tb) > 0) triangles are rendered into the graphics card's depth buffer, followed by front-facing (dot(~v,~tf ) < 0) triangles that are rendered into the color buffer with OpenGL's line drawing mode.

Since colored, shaded, and textured objects create a natural silhouette (a type of contour) against a differently colored, shaded, or textured background (right column in Figure 4.6), I have to ensure that the other display modes do not inadvertently create a contour cue. To ensure this, I fill the display background with static random elements, as in Figure 4.6, left column. These backgrounds are designed to resemble the current display mode without containing any complete instance of the 16 test objects (this ensures that participants are not exposed to targets in the background with which they might try to interact) and thereby create a homogeneous display. While this makes static identification extremely difficult, the test objects pop out immediately from their surroundings when animated (see also the discussion on Motion, below).

6 I use triangles for the outline definition because triangles form a generic geometric primitive supported by most rendering systems.

(2) Shading. The open graphics language (OpenGL [177]) provides built-in support for the shading model described in Section 4.2.1. To obtain smooth shading across triangles approximating curved surfaces, OpenGL interpolates the normals at the vertices. Special care needs to be taken when rendering sharp edges (e.g. the boxes in Row 3 of Figure 4.10) to prevent the interpolation scheme from visually smoothing out these edges. I achieve this with so-called smoothing groups, which only interpolate normals within each group, but not across groups. This requires specifying several normals per vertex, depending on which face references the vertex.

(3) Mixed. In anticipation that Shading and Outline might yield statistically indistinguishable data, I add a Mixed mode combining the two, to test for cumulative effects7.

(4) TexISO & TexNOI. For texturing I rely on OpenGL's built-in texturing capabilities. I use a trichromatic random-design texture, which is sphere-mapped onto the objects8. To prevent the texture cue from interfering with the shading cue, the colors of the texture are isoluminant (TexISO mode), i.e. they have different chrominance values but the same luminance9. The colors are chosen to roughly fall within the red, green, and blue parts of the color spectrum, and are calibrated for equal luminance at the participant's head position using a Pentax Spotmeter V light meter. In case an isoluminant texture interacts particularly strongly with motion [104], I include a control texture mode without isoluminant colors (TexNOI mode).

7 See Section 4.7.3 for a discussion of possible interactions between shape cues.
8 Along with the choices for lighting model and contour type, this mapping was picked with reason but can still be considered fairly arbitrary. There exist many more possible mappings than can be explored in this dissertation. See Section 6.2 for further discussion.
9 Note that colors may reproduce non-isoluminantly on your printer or display device.
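The outline test of Equation 4.1 can be sketched directly: an edge is an outline when the view vector faces one of its adjacent triangles and faces away from the other. This is an illustrative sketch with made-up normals, not the two-pass OpenGL implementation described above:

```python
def dot(a, b):
    """Plain vector dot-product over same-length tuples."""
    return sum(x * y for x, y in zip(a, b))

def is_outline_edge(view, n1, n2):
    """Equation 4.1: an edge is an outline if its two adjacent
    triangles face opposite ways relative to the view vector."""
    return dot(view, n1) * dot(view, n2) < 0

view = (0, 0, 1)
# One front-facing and one back-facing triangle -> outline edge:
print(is_outline_edge(view, (0, 0.6, 0.8), (0, 0.6, -0.8)))  # True
# Two front-facing triangles -> interior edge, not an outline:
print(is_outline_edge(view, (0, 0.6, 0.8), (0, -0.6, 0.8)))  # False
```

In practice this per-edge test is exactly what the two-pass depth-buffer trick computes implicitly, without ever enumerating edges.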

Motion. As illustrated in Figure 4.1, motion itself is a shape cue. There are several reasons why I do not separate motion from the other shape cues. First, many real-time graphics environments, including games, immersive VR, and visualizations, contain a significant number of dynamic elements. Since shape cues may perform differently in highly dynamic environments compared to static environments, it is sensible to include motion in the experimental setup. Second, shape-from-Motion relies on discernible parts of an object to move. In general, these parts are only discernible because of their color, shading, contour, or texture properties. Although motion is processed independently in separate cortical structures (area V5, to be specific [188]), these structures rely on the output from other cortical areas. It is therefore simply impractical to separate motion from other shape cues.

One concern might still be that motion interacts with (depends on) some cues more strongly than with others. Critics of my experiment have even ventured that motion is the main effect I am able to measure. My position is, as above, that I am interested in the effectiveness of different shape cues in dynamic environments, irrespective of any naturally existing bias, which is therefore also part of any real or virtual dynamic scene. In response to the above criticism, I refer to the results in Section 4.7, showing significant performance differences across the different types of shape cues.

4.4.3. Adjusting Task Difficulty (2)

If an experimental task is too easy, all participants are likely to excel and no statistical variation can be measured to compute the effect of different experimental conditions. The same applies to a task that is too difficult: if no participant is able to perform the task, no meaningful data can be gathered. If we think of the performance graph as a simple statistical curve with total failure (0%) on one side and absolute success (100%) on the other, then there exists some transitional region in the middle that represents the performance threshold for that task. The best data is measured at the point of greatest slope in the transition region, because small changes in subjective task difficulty there lead to the greatest effect in measured performance. Of course, this point is difficult to find in practice because it depends on many variables, including priming, learning, daily form, and individual variability. Unlike physical performance measures like speed and strength, perceptual performance tends not to vary as greatly between participants. Small variations do exist, though, and should be accounted for in the task design.
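The transition region described above behaves like a logistic psychometric curve, whose slope is maximal at its midpoint, which is where a difficulty parameter is most informative. A toy numerical check (the curve, its midpoint, and its width are invented for illustration):

```python
import math

def performance(difficulty, midpoint=5.0, width=1.0):
    """Toy psychometric curve: success rate falls from ~1 to ~0
    as task difficulty increases past the midpoint."""
    return 1.0 / (1.0 + math.exp((difficulty - midpoint) / width))

# Numerically locate where performance changes fastest:
xs = [i * 0.01 for i in range(1001)]  # difficulty values 0..10
slopes = [abs(performance(x + 1e-4) - performance(x - 1e-4)) / 2e-4 for x in xs]
steepest = xs[slopes.index(max(slopes))]
print(round(steepest, 2))  # the 50% point of the curve
```

Tuning object speed and count so that average performance sits near this midpoint is what keeps the task sensitive to the display-mode manipulation.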

In short, the given task should be difficult enough to challenge experts, yet manageable for novices. I design towards this goal by including two system parameters that affect task difficulty: object speed and object count. The speed of objects is set low enough to ensure adequate visual detection and interaction; that is, all participants can interact with at least some objects, and the number of objects they interact with, correctly or incorrectly, determines their performance. The number of objects, on the other hand, is set high enough to make a perfect trial (correct interaction with all objects) unlikely, even for an expert. I determined the actual parameters for speed and object count heuristically by performing a small set of trials.

4.4.4. Simple Interaction (3)

Version 0.1: What not to do. In a first version of the experiment, I used a task that required participants to drive a virtual car through a winding course (Figure 4.7). The idea was to use a task that participants were used to from daily experience and that would bear practical relevance for interactive tasks in computer graphics applications (e.g. navigation and orienteering). A fair amount of effort and coding went into modeling the car's interaction with gravity, the terrain, friction, inertia, etc., down to the wheels spinning at the correct rotational velocity when in contact with the ground. I took every precaution to ensure that the car would handle as expected from a real car, yet would be easy to drive. I even set the initial acceleration and maximum speed so that participants only had to steer right or left. Despite all this, the experiment turned out to be a failure because for the majority of participants the driving task was simply too difficult. Because my intention in this chapter is not merely to introduce a single experiment, but rather a flexible methodology, I believe the mistakes of that first design are instructive as good examples of what problems to be aware of and what design decisions can impact interaction performance (the main discussion continues with the Conclusion paragraph on page 128).

Adjusting task difficulty. As discussed above, the task should accommodate participants with different performance levels or skills. In the car experiment, the main means of adjusting the task difficulty was to change the maximum speed of the car. Because a faster car allowed less reaction time, the task difficulty could be increased. Due to the realistic physics, however, a faster car was also more difficult to control, would break out in sharp turns, etc. (Figure 4.7, right). Despite several trials to establish a good common speed, the interaction turned out to be too difficult for most participants and too easy for some.

Measuring performance. I employed several indirect performance measurements in the car experiment, including lap time and deviation from an ideal path. In the analysis of the data, I determined that lap time was directly correlated with deviation from the ideal path and that the deviation was mostly a factor of interaction difficulty and not of the difficulty in perceiving the shape of the driving terrain. Participants constantly oversteered in one direction, followed by overcompensation in the opposite direction (Figure 4.7, right). As a result, the distance traveled by some participants was almost 1.7 times that of the ideal path, and many participants suffered various degrees of motion sickness.

Figure 4.7. The First Version of the Experiment. Left: Screen-shots of single (top row) and dual (bottom row) shape cue modes. The outline mode provides a strong visual cue for the bottom of the valley (center of screen), a good guide to drive towards. Shading provides the same cue although less pronounced, but additionally yields curvature information. Texture does not provide much shape information in a static screen-shot but helps to produce shape-from-Motion when animated. Right: Analysis view (top-down) of 1.5 laps of a good driving performance with inset showing the participant's view at the simulated time. Note how the driver undulated around the ideal path, even on straight sections.

Viewpoint. To increase the available reaction time and give a better sense of spatial

awareness, the car experiment featured a third person (bird’s eye) view of the scene, common in

many games (Figure 4.7, left). This allowed participants to see further ahead and perceive the

car in its environmental context. It also placed the participants in a very unusual position for driving a car, and most participants had trouble adjusting to this view. As it turned out, participants with


game experience had few problems adjusting to the third person perspective, while for most

others the cognitive leap seemed too large without significant practice.

Conclusion. The interaction with the system should fulfil the following requirements:

• Learning – The interaction should be simple to learn so that the amount of necessary

training per participant is minimized.

• Intuition – The interaction should be intuitive, so that participants are not preoccupied

with remembering arbitrary mappings between interaction and task.

• Unobtrusiveness – The interaction should not be obtrusive (e.g. by attaching many

wires or restrictive head-gear) to ensure that participants behave naturally.

• Dynamics – The interaction should be suitable for a task with moving objects.

The car experiment was able to address the dynamics and unobtrusiveness requirements, but

appeared difficult to learn and unintuitive for some participants.

Solution. My approach in the present version of the experiment is a touch-to-select interaction paradigm (Figure 4.5), which requires no technical skills¹⁰ and is commonplace in many social situations (simple learning).

To the best of my knowledge, the use of a touch-table interface for shape perception studies

is novel and offers three distinct advantages. First, it removes the level of indirection associated

with many pointing devices (intuition). Mice, for example, translate motion on the plane of

their supporting surface (e.g. table) into motion in the plane of the display device. These two

planes are commonly perpendicular to each other, resulting in some learning effort for novice

users. Second, it eliminates the need for a cursor to indicate the current position of the pointing

¹⁰In fact, pointing is one of the earliest methods of gestural expression, used by children from infancy onward [50].


device, which might otherwise distract. Third, it does not require any external pointing device

(unobtrusiveness).

A disadvantage of the particular front-projection model I use is that the participant’s hands

cast visible shadows on the display. Interestingly, only 7 of the 21 participants ever noticed

these shadows, and of those 7, only 3 felt somewhat impaired by them.

4.4.5. High-demand Task/Time-Constraint (4)

In shape discrimination tasks involving reaction time there are two conceptual methods to increase the task difficulty. One is to decrease the perceptual difference between target and distractor stimuli, thereby increasing the time participants take to correctly distinguish the two.

The other is to decrease the time participants have to make a distinction.

I implement the second method by exposing participants to a large number of potential targets (not by minimizing the time each target is displayed) so that there is less reaction time available per target¹¹. The reason I prefer the second method is that, in my opinion, it better reflects the type of dynamic interactions we encounter in daily life. When driving

down a road we have to make split-second decisions about objects that are either negligible or

potential hazards. When playing sports or even just walking in a crowded room, we have to

constantly make decisions about the shape and motion of our surroundings, often without much

time to make these decisions.

In a severely time-constrained situation, we may not have the luxury of taking in all the

available evidence and making a correct decision. Instead, we have to make a best-effort decision with the information available at the time. Additionally, it is known that humans can only

¹¹Of course, participants can choose to ignore most targets and instead focus only on a few, but with more accuracy. My experiment accounts for such strategic variations (Section 4.6).


attend to a rather limited number of perceptual stimuli at a time [39, 160]. This may lead to

shape cue prioritization not evident in static experiments but important for real-time graphics

applications.

4.4.6. Test Shapes (5)

Section 4.3.5 briefly discussed the different test shapes that have featured in previous work. I

now pick up some of the concepts introduced there to define what I consider to be important

characteristics of a test shape set.

Familiarity Bias. Familiarity with the shapes should not influence their perception.

Some people might be more familiar with dogs than with chinchillas, and this familiarity may

influence the reaction time and accuracy for shape-dependent tasks, particularly those using

naming times as measurements.

Generality. The test shapes should represent a large number of basic shapes. If, for

example, a shape set only consists of shapes with right angles, then the results I obtain from a

study using these shapes do not apply to rounded shapes, or shapes with non-orthogonal angles.

Relevance. While I prefer nonsensical shapes to avoid familiarity biases, I want the set

of shapes to reasonably approximate a number of real-world objects in conjunction with each

other. For example, stick-shaped objects could be used to model a broom or rake, but they

would make poor components for building voluminous objects like a book or a refrigerator.

Parametrization. In reporting my experimental findings, I should be able to parameterize the shapes that I used, so that others can get an intuition for the types of shapes the findings


are valid for and so that they can replicate my results. A counter-example I mentioned previously is the work by Ryan and Schwartz [141]. They used images of a hand, a switch, and a steam valve as stimuli. No obvious correlation exists between these objects; they are not representative of a class of objects, and it is not obvious which of their shape characteristics had an influence on the experimental outcome.

Shape Theories. A number of different theories exist to explain human perception of shape [94, 108, 57, 10, 64], each with its own strengths and limitations. While some theories are based on exemplars, others define metric properties or non-accidental properties of shapes¹².

For the purpose of my experiment, I can choose any theory that suitably fulfills my requirements of non-bias, generality, relevance, and parametrization. An important point to note is that I do not actually require the theory to be valid in terms of modeling human perception because I use the theory only to model shapes, not to model perception¹³.

Geon Theory. One theory that suits my requirements, and which describes shapes

that are easily parameterized with standard computer graphics techniques is Biederman’s [10]

recognition-by-components (RBC), or geon theory. A geon is a volumetric shape similar to a

generalized cylinder [12, 67], i.e. a volume constructed by sweeping a two-dimensional shape

along a possibly curved axis. Geons can vary in terms of the geometry and symmetry of the

sweeping shape and axis. Biederman defined four categories in which geons may vary, by imposing the restriction that each category must produce non-accidental viewing features. That is,

the feature (e.g. curved vs. straight sweeping axis) must be evident for all but a small number

¹²An in-depth discussion of different theories is beyond the scope of this dissertation, but the interested reader might find Gordon [64] useful as a starting reference.
¹³A good theory can, of course, help to interpret the results of an experiment.


Figure 4.8. Constructing Shapes. Each experimental object consists of a main body and two attachments. To ensure that attachments are visible from any viewpoint, they are duplicated for each object, mirrored, and rotated by 90°. Colors are used only for illustration purposes.

Figure 4.9. Shape Categories. Both the main body of objects and the attachments vary along two non-accidental shape categories. Main Body: Main bodies either have a round or square cross-section, and have a longitudinal axis of constant or tapered width. Attachments: Attachments also have a round or square cross-section, but their longitudinal axis is either straight or curved.


of distinct accidental views and that slight perturbation of an accidental view must reveal the

feature. This restriction ensures that unique geon identification is invariant to most translations

and rotations.
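As an illustrative aside, the generalized-cylinder construction described above is straightforward to sample programmatically. The following sketch (in Python for brevity; the actual system was implemented in C++/OpenGL, and all names and geometric conventions here are my own assumptions, not the dissertation's code) generates vertices of a geon-like shape under the non-accidental variations of Figure 4.9:

```python
import math

def sweep(cross_section="round", axis="straight", taper="constant",
          n_axis=16, n_ring=16):
    """Sample vertices of a geon-like generalized cylinder.

    A 2D cross-section (round or square) is swept along a straight or
    curved axis; 'tapered' shrinks the cross-section linearly along
    the sweep. All parameter choices here are illustrative.
    """
    verts = []
    for i in range(n_axis):
        t = i / (n_axis - 1)
        # Axis samples: a straight line or a gentle sinusoidal arc.
        if axis == "straight":
            cx, cy = 0.0, t
        else:
            cx, cy = 0.3 * math.sin(math.pi * t), t
        # Constant width, or tapering from 1.0 down to 0.4.
        r = 1.0 if taper == "constant" else 1.0 - 0.6 * t
        for j in range(n_ring):
            a = 2 * math.pi * j / n_ring
            x, z = math.cos(a), math.sin(a)
            if cross_section == "square":
                # Project the unit circle onto the unit square.
                m = max(abs(x), abs(z))
                x, z = x / m, z / m
            verts.append((cx + r * x, cy, r * z))
    return verts

v = sweep("square", "curved", "tapered")
```

Varying the three choices independently yields exactly the kind of small, principled shape family used for the main bodies and attachments below.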

To construct compound objects, geon theory devises a hierarchical network, whose nodes

are geons and whose structure indicates relative geon positioning. Like all other existing shape

categorization theories, geon theory faces problems of generality, i.e. it is not evident how subtle

shape differences, like those between apples and oranges, could be modeled. Nonetheless, the

simple shapes described by geon theory are reminiscent of the basic shape primitives found

in many virtual environments, architectural mockups, and computer games (Figure 4.4). The

principled categorization of geon shapes further allows me to specify exactly the types of shapes

and objects for which my experimental results are valid.

Constructing Shapes. To construct shapes for my experiment, I combine a main body

shape with two identical but mirrored and rotated attachment shapes (Figure 4.8). Because a

single large shape is easy to see and differentiate, participants are instructed to ignore the shape

of the main body and only differentiate shapes according to their attachments. The main body

therefore merely serves to increase the difficulty of the perceptual task without increasing the

cognitive load on participants¹⁴. The attachments are duplicated and transformed so that they

are visible from any direction as the compound object moves across the display.

The main bodies and attachments vary along three parametric dimensions (CS, LS, LA, see

Figure 4.10 caption), adapted from a subset of Biederman’s descriptors. Main bodies vary

according to their cross-section (CS) and longitudinal size (LS), whereas attachments vary

according to their cross-section and longitudinal axis (LA), (Figure 4.9). Parameters for the

¹⁴Compared to a hypothetical combination task, where participants would have to look for attachments only on particular main body shapes.


Figure 4.10. Experiment Object Matrix. The complete set of experimental objects comprises 2 × 2 × 2 × 2 = 16 shape permutations. The variational parameters are: CS, cross-section (square/round); LA, longitudinal axis (straight/curved); LS, longitudinal size (constant/tapered). Parameters for these properties are chosen to preserve the volumes of main bodies and attachments.


construction of the main body and attachments are chosen to yield an approximately constant

volume, to ensure that the average display size of objects under random rotation is approximately

equal.

Targets and Distractors. Together, the permutations of parameters add up to 2 (main-body shapes) × 2³ (3 parametric dimensions with 2 choices each) = 16 objects, listed in Figure 4.10. For each trial of the experiment, a different column (same attachments, different main

body) in Figure 4.10 is selected as the set of target objects, with the remaining objects acting as

distractors.
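The object matrix and column-wise target selection can be sketched as follows (a hypothetical Python transcription of Figure 4.10; the dictionary keys and parameter names are illustrative, not taken from the actual implementation):

```python
from itertools import product

# Variational parameters adapted from Figure 4.10 (names illustrative).
BODY_CS = ("square", "round")      # main-body cross-section
BODY_LS = ("constant", "tapered")  # main-body longitudinal size
ATT_CS = ("square", "round")       # attachment cross-section
ATT_LA = ("straight", "curved")    # attachment longitudinal axis

# 2 x 2 x 2 x 2 = 16 shape permutations.
objects = [
    {"body": (bcs, bls), "attachment": (acs, ala)}
    for bcs, bls, acs, ala in product(BODY_CS, BODY_LS, ATT_CS, ATT_LA)
]

def target_column(att_cs, att_la):
    """One trial's targets: same attachments, all four main bodies."""
    return [o for o in objects if o["attachment"] == (att_cs, att_la)]

targets = target_column("round", "curved")
distractors = [o for o in objects if o not in targets]
```

Each of the four attachment columns can serve as the target set for one trial, with the remaining twelve objects acting as distractors.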

4.4.7. Learning and Biasing (6)

The performance of participants in a visual perception experiment depends partly on the participants (e.g. acuteness of vision, reaction time) and partly on the experimental setup (e.g. display

modes, test-shapes). The setup parameters affecting performance should not be predictable

to ensure that the experimental data reflects the participants’ perceptual abilities and not their

memory or deductive reasoning skills. For example, if participants knew that every fourth object

was a target object, while the three intermediate objects were distractors, then they would only

have to recognize one target correctly and thence continue to count. In the car experiment described in Section 4.4.4, I only used two different tracks, one for training and one for the trials,

so that participants’ performances were the aggregated effect of the different visual stimuli and

of learning the curves of the trial-track. Such aggregation can be separated into its constituent

components using statistical techniques, but this requires many more independent trials.

In practice, learning of some sort is often unavoidable, even if it is only to practice an

interaction technique or to internalize the experimental instructions. To ensure that this learning


does not affect the experimental data, the experimenter can perform a practice trial without

collecting data.

A system bias is an effect that results in two experimental conditions differing by any measure other than the intended free variable. If, for example, the target-to-distractor ratio of the

Outline mode was different from the Shading mode, then this could affect the experimental

data even if the two modes otherwise behaved identically in terms of their ability to provide

shape information. The following paragraphs list the precautions I took to minimize learning

effects and biases during experimental trials.

Display Strategies. Because the different shape cues provide different shape information, participants have to develop varied strategies to distinguish targets from distractors (Figure 4.11). For example: flat, shaded surfaces are single-colored, while curved, shaded surfaces

show color gradients. Outlines, on the other hand, do not use color at all. To enable participants

to develop a strategy for each display mode and to ensure that the learned strategy for one mode

does not bias the performance in a later display mode that uses a similar strategy, I require

participants to perform a trial run that shows all display modes in random order (the detailed

experimental procedure is listed in Section 4.5).

Randomization. To avoid introducing a system bias into the experiment, I fully randomize all system variables. In particular, I randomize the order in which the columns in Figure 4.10 are chosen as target objects, including the practice trial. The order of the 5 display modes for

each trial and the practice trial is also random.

Objects move across the screen in random linear paths, but I ensure that they always cross

the entire display, that all objects take the same amount of time to cross the display (8 seconds),


Figure 4.11. Mistaken Identity. During the experiment, objects constantly rotate randomly. This ensures that the objects can be viewed from all directions, generates a depth-from-motion cue, and separates objects from the background. The rotation also increases the likelihood of accidental views for which some objects may look alike. By definition of accidental views, these views are inherently unstable and will quickly disambiguate, but the viewer is required to track objects for a finite amount of time to reliably interpret the scene. Labels in the image correspond to the labels in Figure 4.10, i.e. objects with the same label are different views of the same object, while objects with different labels are views of different objects. The top row illustrates different objects that look similar for some views. The bottom row shows that for some views (middle) the silhouette of two objects (left and right) can be identical. Different perceptual strategies may be necessary to disambiguate similar looking shapes.

and that they constantly rotate at similar speeds (between 1 and 3 radians per second). I determined

the values for linear and angular velocities empirically based on a small group of participants.

I ensure that the ratio of target objects to distractor objects always remains 1 : 4 by only

using the fixed set of objects shown in Figure 4.10. When any object is selected by a participant

(correct or incorrect), or when an object has crossed the display entirely, it is re-initialized,

which causes its trajectory and rotational velocity to be reset. The object is also deactivated for


a random time between 2 and 7 seconds, so that participants cannot anticipate the type of object

re-appearing on the display.
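The re-initialization logic described above might look like the following sketch (Python for illustration; the entry/exit convention, display coordinates, and all names are my own assumptions rather than the actual C++ implementation):

```python
import random

DISPLAY_CROSS_TIME = 8.0   # seconds for any object to cross the display
SPIN_RANGE = (1.0, 3.0)    # rotational speed, radians per second
DELAY_RANGE = (2.0, 7.0)   # deactivation time after a reset, seconds

def reinitialize(rng, width=1.0, height=1.0):
    """Pick a fresh random trajectory that crosses the whole display.

    Entry on the left edge and exit on the right edge is an assumed
    convention; the dissertation only requires that every path crosses
    the entire display in the same fixed time.
    """
    entry = (0.0, rng.uniform(0.0, height))    # random point, left edge
    exit_ = (width, rng.uniform(0.0, height))  # random point, right edge
    vx = (exit_[0] - entry[0]) / DISPLAY_CROSS_TIME
    vy = (exit_[1] - entry[1]) / DISPLAY_CROSS_TIME
    return {
        "pos": entry,
        "velocity": (vx, vy),
        "spin": rng.uniform(*SPIN_RANGE),
        "inactive_for": rng.uniform(*DELAY_RANGE),
    }

obj = reinitialize(random.Random(1))
```

Because the crossing time is fixed, the linear speed follows from the path length, while the spin and the deactivation delay are drawn independently from their ranges.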

4.4.8. Hardware & Software

I implemented the entire experimental system in C++, using OpenGL for rendering. The system,

including rendering and data acquisition, ran on an AMD Athlon™ 3500+ with 2 GB of RAM and

displayed via a Dell 3300MP digital projector. Participants interacted via a touch-sensitive

DiamondTouch table interface (Figure 4.5).

4.5. Procedure

All participants are asked to be seated in front of the experimentation desk and given brief

oral instructions as to the duration of the experiment as well as the interaction method. They

are then asked to follow the instructions on the screen and address any questions they may have

after reading the instructions but before commencing the experiment. The participants wear

ear-mufflers and sit in an isolated partition to minimize distractions. Participants read a few

short pages of instructions. The instructions introduce the objects that are used in the experi-

ment and explain with text and visual examples how to differentiate targets from distractors. To

advance to the next instruction page, participants use the same interaction as in the experiment

(i.e. touching the table). Participants are then asked to perform a short practice trial (20 seconds per display mode). In the practice trial, participants are given visual feedback about their

performance. The instructions state that such feedback is not given during the experimental

trials. The user display is replicated on an external display visible to the experimenter who can


monitor the participants’ performances and note any obvious problems (e.g. a participant only

hitting distractors instead of targets). After the practice trial, the experimental trials begin.

Each experimental trial is preceded by a single instruction summary page, followed by a

textual description and visual example of targets versus distractors. Afterwards, the actual trial

begins. Each trial consists of the same set of targets shown in all 5 display modes in randomized

order. Each display mode is shown for 60 seconds, followed by a fade-to-black and several

seconds of darkness to prevent delayed interactions from one mode affecting the following

mode. Each trial ends with instructions to the participants informing them of the completion

of the trial and allowing them to rest for up to a minute before continuing. Altogether, each

participant performs 4 trials, one for each column in Figure 4.10, for a total time of about

25-30 minutes, including the practice trial and rest-periods.

After the last trial, participants are asked to fill out a short questionnaire with yes/no and

Likert-type (1 to 5) questions (Figure B.1) to collect subjective ratings for shape cues, personal

performance, experimental duration, fatigue, and discomfort.

4.6. Evaluation

In Section 4.3.4, I explained the different measurement methods commonly used in shape

perception experiments. In this section, I discuss the direct measurements I gather for each participant (Section 4.6.1) and how these are converted into indirect measures to discount individual variations due to risk disposition and interaction strategy (Section 4.6.2). Finally, Section 4.6.3 describes the statistical analysis of the acquired data.


4.6.1. Measurements

For each trial of each participant, the system records the following named interaction events

(direct measurements), along with time-stamps:

• shots – The number of times a participant indicates an object selection by touching the table input device.

• correct – The number of times the touched object is of the target type.

• incorrect – The number of times the touched object is of the distractor type.

• missed – The number of times the participant touches the background instead of an object. This is seldom due to a participant mistaking the background for an actual object, but rather because of imprecise hand-eye coordination (missed events are generally followed immediately by correct or incorrect events).

Using the above definitions, it is always true that shots = correct + incorrect + missed.
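As a minimal sketch (with a hypothetical event log), the invariant can be checked directly:

```python
from collections import Counter

# Hypothetical event log for one trial: every touch the system records
# is classified as exactly one of 'correct', 'incorrect', or 'missed',
# so shots = correct + incorrect + missed holds by construction.
events = ["correct", "missed", "correct", "incorrect", "correct"]
counts = Counter(events)
shots = len(events)

assert shots == counts["correct"] + counts["incorrect"] + counts["missed"]
```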

The system also records the following system events:

• lost – The number of target objects that traverse the screen completely without intervention. This happens if the participant fails to identify the object as being of the target type, or if the participant is too busy interacting with other objects.

• initialized – The number of objects that are initialized to traverse the screen. An object is re-initialized every time the participant touches it, or if it traverses the screen completely without interaction.


4.6.2. Aggregate (Indirect) Measures

Independent of their objective performance skills, participants’ data in task-driven studies is

sensitive to subjective factors such as competitiveness, strategy, and risk profile. In my experiment, some people might only take a shot if they are very certain of the object type they are selecting, while others might take as many shots as possible while risking false target identification. To normalize for these factors, and to answer other performance questions that cannot

be measured directly, I define the following aggregate measures in terms of their name, the

question the measure is supposed to help answer, a motivation for the measure’s definition, the

formula to compute the measure, and the measure’s scale/range. All aggregate measures are

defined in terms of the direct measures of Section 4.6.1.

success — Of the shots taken, how many are correct? — Some participants might be more

aggressive and willing to take risks. If they shoot often, then their absolute correct count might

be high despite also making many mistakes. To compare such participants to more conservative

ones, who shoot less but are more accurate, I normalize the correct count on the number of

shots taken:

success = correct / shots.

Scale: The range of success is normalized, with a value of 1 indicating a perfect score.

failure — Of the shots taken, how many are incorrect? — An equivalent motivation as for

success (above) applies here:

failure = incorrect / shots.

Scale: The range of failure is normalized, with a value of 1 indicating that all attempted shots

were incorrect.


risk — Of the objects crossing the screen, how many shot attempts are taken? — To assess

the risk profile of a participant, I measure the readiness of that participant to shoot at an object

crossing the screen. The higher the risk measure, the more a participant is willing to risk an

incorrect target choice (or the more confident the participant is in his or her decision). This

makes risk more of a personality measure than a performance measure:

risk = shots / (initialized − shots).

Scale: Because each shot itself causes an object re-initialization (to keep the number of objects

on the screen roughly constant), I subtract shots from the denominator. This means that risk

has no upper bound (and is not normalized to 1). A risk value of 0 indicates no risk (no shots).

A value of 1 indicates very high risk (the number of shots equals the number of freely initialized

objects). Values above 1 indicate extreme risk (more shot attempts than new objects traversing

the screen), but no participant in the study exhibited such risk behavior..

placement — How good is each participant’s hand-eye coordination, and is this a function

of display mode? — When participants interact with the system they are supposed to select a

target object (whether their actual selection is correct or incorrect is irrelevant). Failure to do so

(selecting the background instead), indicates poor hand-eye coordination, which may be linked

to the display mode:

placement = (correct + incorrect) / shots = (shots − missed) / shots.

Scale: The range of placement is normalized, with a value of 1 indicating perfect placement.


detection — How well can correct target objects be detected on the display? — This

measure compares the correctly shot targets to the number of target objects that traversed the

screen completely without having been shot at. Because lack of hand-eye coordination can

lower the number of correctly identified targets that are actually shot, the numerator takes into

account both the correct shots and a fraction of missed shots that likely would have been

correct given overall performance:

detection = (correct + missed · correct / (correct + incorrect)) / lost.

Scale: Like risk, the range of detection is not normalized to 1, but uses the value of 1 as a

qualitative threshold. A detection value of 0 means that no objects were correctly detected. A

value of 1 indicates that half the target-type objects were detected. A value of 2 means that

twice as many target-type objects were detected as were lost, etc.
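The five aggregate measures are direct arithmetic on the counts of Section 4.6.1. The following sketch transcribes the formulas above (Python for illustration; the function name, argument order, and example counts are hypothetical):

```python
def aggregate(correct, incorrect, missed, lost, initialized):
    """Aggregate measures of Section 4.6.2, computed from the direct
    measures of Section 4.6.1. A transcription for illustration only.
    """
    shots = correct + incorrect + missed
    return {
        "success": correct / shots,
        "failure": incorrect / shots,
        # Each shot re-initializes an object, hence the subtraction.
        "risk": shots / (initialized - shots),
        "placement": (correct + incorrect) / shots,
        # Credit a fraction of missed shots in proportion to accuracy.
        "detection": (correct + missed * correct / (correct + incorrect))
                     / lost,
    }

# Hypothetical counts for one participant and one display mode.
m = aggregate(correct=30, incorrect=10, missed=10, lost=20, initialized=150)
```

For these counts, success = 0.6, risk = 0.5 (50 shots against 100 freely initialized objects), and detection = 1.875 (nearly twice as many targets detected as lost).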

4.6.3. Analysis

I first convert the direct measurements into aggregate measures and average the latter over the

4 trials that each participant performs (Figure 4.13 and Figure 4.14). I then average these per-

participant averages (Table B.1–B.5) over each display mode (Figure 4.12) to obtain overall

performance values for each display mode.

To establish whether different shape-from-X display modes have a significant effect on aggregate performance measures, I perform a repeated-measures analysis of variance (ANOVA)

for each of the aggregate measures (Table 4.1).


Figure 4.12. Aggregate Comparison. These bar graphs show aggregate measures averaged over all participants and all trials. Error bars are normalized standard errors. Note the different vertical scales (to show small details) when comparing. Interpretations for scales are given in Section 4.6.2.


Figure 4.13. Detailed Aggregate Measures. These charts show per-participant values for each display mode and each aggregate measure. Most data is reasonably normally distributed (see Figure 4.14), with occasional outliers, some of which are extreme. Individual performance appears mostly consistent throughout, in accordance with the general analysis remarks in this section. Note the different scales of the individual charts.


Figure 4.14. Detailed Aggregate Measures Histograms. These charts show the histograms (frequency distributions) for the detailed participant data in Figure 4.13. Despite the erratic appearance of the traces (the number of participants limits the resolution of the histograms), the distributions appear to be evenly or normally distributed and no clustering is evident. Note the different scales of the individual charts.


The ANOVA determines only the overall effect of display modes. Further analysis of pairs of results for different display modes using Student's t-test allows me to detect significantly different means for each display mode pair. Table 4.2 shows the t-test results for all combinations of display modes. Since the likelihood of a false positive is higher across multiple tests than for each individual test, I use the highly conservative Bonferroni correction, which divides the alpha value, α, by the number of tests, n, so that α → α/n.
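As a sketch of this analysis step (pure Python for illustration; the actual analysis presumably used a statistics package, and the sample data below is hypothetical): with 5 display modes there are C(5,2) = 10 pairwise comparisons, so an overall α = 0.05 becomes a per-test threshold of 0.005.

```python
import math
from itertools import combinations

def paired_t(a, b):
    """Student's t statistic for paired samples, e.g. one aggregate
    value per participant under each of two display modes."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-corrected per-test significance threshold."""
    return alpha / n_tests

n_tests = len(list(combinations(range(5), 2)))  # 10 mode pairs
alpha = bonferroni_alpha(0.05, n_tests)

# Hypothetical per-participant scores under two display modes.
t = paired_t([2, 4, 6, 8], [1, 3, 4, 9])
```

A pair of display modes is then reported as significantly different only if the t statistic's p-value falls below the corrected threshold.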

4.7. Results and Discussion

I tested 21 participants (8 female, 13 male), comprising graduate students and university

staff volunteers. All participants had normal or corrected-to-normal vision.

4.7.1. Data Consistency and Distribution

The scatter-spread charts in Figure 4.13 show that the group of participants as a whole performed fairly consistently (the vertical ordering of participants does not change dramatically between the different display modes). Good performers generally performed well throughout and poor performers generally performed worse for most modes. Some notable outliers are evident,

though. One extreme outlier (more than three interquartile ranges from the third quartile) is

recorded for participant BA056’s Mixed-risk performance. Further analysis of this trial data

shows that the participant attempted more than double the number of shots during the first trial

compared to the remaining trials. Although this led to an 18% higher success rate in the first

trial, it also resulted in an up to ten-fold higher failure rate. The reason for this abnormality

is difficult to ascertain because the remaining trials show much more moderation in terms of

shots attempted. Since no feedback was given during or between trials, the participant decided


to adjust his or her interaction behavior autonomously. I eliminate the above-mentioned extreme

outlier from the analysis lest I contaminate the remaining data.
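The extreme-outlier criterion used above (more than three interquartile ranges beyond the third quartile) can be sketched as follows; the scores are invented, and the quartile convention is one of several common choices, not necessarily the one used in the study:

```python
# Flag extreme outliers: values more than 3 interquartile ranges (IQR)
# above the third quartile or below the first quartile.

def extreme_outliers(values):
    xs = sorted(values)
    n = len(xs)

    # Quartile estimate via linear interpolation between order statistics
    # (one common convention; the dissertation does not specify which it used).
    def quantile(q):
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 3 * iqr, q3 + 3 * iqr
    return [v for v in values if v < lo_fence or v > hi_fence]

# Invented per-trial scores: one value sits far above the rest.
scores = [0.41, 0.44, 0.46, 0.47, 0.49, 0.50, 0.52, 0.95]
print(extreme_outliers(scores))  # [0.95]
```

With seven tightly clustered scores and one far-off value, only the latter exceeds the upper fence and is flagged for removal.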

Another interesting example is BA055’s Mixed-detection performance, which, in this case,

is exceptionally good. As above, this anomaly is due to the performance of a single trial (third).

During this trial the participant lost almost no objects, and detection becomes very sensitive for very low values of lost because it is not a normalized measure. It should be noted, however, that

BA055’s detection score is among the highest of all participants for all display modes. Given

this detection proficiency, it is somewhat surprising that the participant’s success scores are

only average, especially considering that the failure scores are also among the lowest of all

participants. A possible reason may be the fairly low placement scores (hitting the background

instead of objects). Judging by the low failure scores, the poor placement is not likely due to

mistaking the background for real objects, but more likely due to poor hand-eye coordination

or fatigue (most missed objects occurred in the last trial).

4.7.2. Strategies

During the initial practice trial, participants develop strategies for categorizing target and dis-

tractor objects for the different display modes (Section 4.4.7). To support reasoning about shape

perception strategies, I use Figure 4.14, which displays the same data as Figures 4.13 and 4.12

but this time as histograms, to demonstrate distribution properties. The graphs in Figure 4.14

approximate a near normal distribution (when taking into account the low resolution of the his-

togram, limited by the number of participants in the study and the discrete nature of histograms).

Most importantly, there exists no evidence to suggest participants splitting into multiple distinct

clusters (e.g. like the two humps of a Bactrian camel). Such clusters would be suggestive of the


existence of a small number of shape recognition strategies with distinct performance character-

istics, applied by participant subgroups. The lack of clusters does not rule out the possibility of

several coexisting strategies, however. Multiple strategies could exist that happen to be equally

effective. Or there could be a large number of strategies with different performance charac-

teristics and the exhibited distributions are an effect of the population’s likelihood to adopt an

optimal strategy for a given display mode.

A simpler explanation, and one in line with exit interviews and personal experience, is that

all single shape cues suggest a single strategy15. The results for Mixed mode are therefore

highly interesting as several possibilities arise:

(1) Coexistence: The two strategies for Outline and Shading remain in effect independently and participants choose the better strategy to apply to Mixed. In that case, Mixed should perform like the better of Outline and Shading.

(2) Interference: The two strategies for Outline and Shading remain in effect but interdependently. In the case of constructive interference, participants make use of both strategies simultaneously and performance rises above the level of Outline or Shading alone. In the case of destructive interference, one strategy hinders the other and performance is less than the better of Outline and Shading.

(3) Synergy: The simultaneous presence of Outline and Shading shape cues allows for a novel strategy only applicable to Mixed. In this case, participants can choose to use the new strategy or stick with the better of the Outline and Shading strategies. The performance should thus not be worse than for the individual shape cues.

15 The following discussion is facilitated by assuming a single strategy per display mode, but does not depend on it. Each occurrence of strategy could be replaced by a distinct set of strategies without affecting the arguments.


4.7.3. Interference vs. Synergy

Altogether, I find that Shading provides the best shape cue in my study (as determined by

success, failure, and detection scores), followed by Mixed, Outline, and the texture modes,

TexISO & TexNOI (Figure 4.12).

According to the above discussion, the particular ordering of Shading, Mixed, and Outline

suggests a destructive interference effect instead of coexistence, synergy, or constructive inter-

ference.

Intriguingly, the combination of Outline and Shading actually decreases the efficiency of

Shading, instead of adding constructively by helping to disambiguate, as might be expected.

This finding reiterates a common theme throughout this dissertation, that for perceptual tasks

less visual information can be more effective than more information. Indeed, several partici-

pants commented in the exit interview that the Mixed mode offered too much information and

confused them. My explanation for this result is that different detection strategies for shape-

from-contours and shape-from-shading could impede each other. For the Outline mode it is

advantageous to compare the terminating contour angle of attachments, while Shading offers

the most reliable information in terms of presence or absence of gradients along the surface interior of attachments. If these strategies are different enough, or even mutually exclusive,

participants may find it difficult to focus on one strategy while ignoring the other. These results

are therefore highly valuable for the design of effective shapes and shape-cues for interactive

non-realistic rendering systems.

This result is also important because it partly contradicts and partly augments findings of

previous shape perception studies. Bulthoff [19], for example, found that subjects underesti-

mated curvature of static objects shown with shading or texture alone, but results improved


when shading and texture were shown in conjunction, lending support to a synergy or construc-

tive interference theory. I believe the fact that Bulthoff detected an additive effect while I found

that multiple shape-cues may be counterproductive can be explained by expanding upon my

above theory on detection strategies with a timing argument.

Participants may find it difficult to focus on one strategy while ignoring the other, under

time-constrained conditions. While there is no reason to believe that humans would not take

all available evidence under consideration when given the time, I have mentioned previously

that the human visual system can only attend to a limited number of stimuli simultaneously.

It is therefore conceivable that for static scenes and when given ample time humans use mul-

tiple shape cues constructively, while in a time-critical interactive situation a shape cue prior-

itization takes place16. Such an argument could also find support in findings by Moutoussis

and Zeki [114], stating that each of the different visual processing systems of the HVS “[...]

terminates its perceptual task and reaches its perceptual endpoint at a slightly different time

than the others, thus leading to a perceptual asynchrony in vision - color is seen before form,

which is seen before motion, with the advantage of colour over motion being of the order of

60-100 ms [...]” ([189], pg. 79). I thus believe it is vital to perform more studies on shape-

perception for real-time, interactive tasks.

4.7.4. Interaction

Table 4.1 shows analysis of variance (ANOVA) results for the different aggregate measure-

ments, to test if varying the display mode had a significant effect on the means of these mea-

sures. The F(dof, n) value in the first column represents the ratio of two independent estimates

16 Although the results were not statistically significant, I also found evidence for such prioritization in the data trends of the car-experiment.


Measure      F(4,84)    p
success      48.594     1.14 · 10^−20
failure      50.154     4.68 · 10^−21
risk         13.317     2.23 · 10^−8
placement     1.412     0.238
detection    49.625     6.32 · 10^−21

Table 4.1. Within-“Aggregate Measure” Effects. This table lists the F-value for the given degrees of freedom, and the p-value for each of the aggregate measures across all display modes. Display modes are averaged over all trials of all participants. Given values assume sphericity.

of the variance of a normal distribution, where dof = m − 1 are the degrees of freedom, m is

the factor level (here, different display modes: m = 5), and n = 84 are the number of observations under identical conditions (4 trials for 21 participants). Higher F-ratios indicate a greater dissimilarity of the variances under investigation. For the given dof and n values, an F-ratio

above 2.5 indicates statistical significance at the p = 0.05 level (actual p-values are shown in

the second column). That is, for F-ratios above 2.5 the chance of obtaining the observed data

is less than or equal to five percent. This means that it is much more likely that the observations

were not obtained by chance and instead represent an actual effect, in this case that the different

display modes have a significant effect on aggregate performance measures.
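The F-ratio logic described above can be illustrated with a plain one-way ANOVA sketch. Note that the study used a repeated-measures design assuming sphericity; this simplified between/within decomposition and its data are illustrative only:

```python
# One-way ANOVA F-ratio: the between-group variance estimate divided by the
# within-group variance estimate. A large F means the group means differ far
# more than within-group noise alone would suggest.

def f_ratio(groups):
    k = len(groups)                       # factor levels (display modes)
    n = sum(len(g) for g in groups)       # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)

# Invented per-mode success scores for three hypothetical "modes":
modes = [
    [0.70, 0.72, 0.68, 0.71],   # a strong cue (Shading-like)
    [0.60, 0.62, 0.59, 0.61],   # a medium cue (Outline-like)
    [0.40, 0.43, 0.41, 0.39],   # a weak cue (texture-like)
]
print(round(f_ratio(modes), 1))  # large F: mode means differ far beyond noise
```

With means this well separated and within-mode scatter this small, the F-ratio lands far above the significance threshold of roughly 2.5 quoted in the text.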

Given the values in Table 4.1, I find that the different display modes have a (highly, p <

0.01) significant effect on all our aggregate measures, except placement. This is an ideal result,

because it means that while performance measures related to the task are critically affected by

the display modes, the placement measure, related to interaction, does not vary significantly

with display mode. In other words, participants are able to consistently touch their intended

moving objects, even if the objects themselves may be difficult to differentiate.


Modes                  succ.   fail.   risk    place.  detec.
Outline vs. TexNOI     vsig    vsig    vsig    0.061   vsig
Outline vs. TexISO     vsig    vsig    vsig    0.080   vsig
Outline vs. Shading    vsig    vsig    vsig    0.184   vsig
Shading vs. TexNOI     vsig    vsig    vsig    0.469   vsig
Shading vs. TexISO     vsig    vsig    vsig    0.401   vsig
Mixed vs. TexNOI       vsig    vsig    0.001   0.282   vsig
Mixed vs. TexISO       vsig    vsig    0.001   0.190   vsig
Mixed vs. Outline      0.025   0.002   0.018   0.740   vsig
Mixed vs. Shading      0.025   vsig    0.346   0.380   0.099
TexISO vs. TexNOI      0.267   0.237   0.490   0.987   0.347

Table 4.2. Significance Analysis. This table lists p-values for Student’s paired t-test of all combinations of display modes. The columns refer to each of the aggregate measures. A value of p < 0.005 is considered significant (bold-italic), while a value of vsig (p < 0.0005) is highly significant, under the highly conservative Bonferroni correction.

Another conclusion is that performance differences for the different display modes can be

detected with the aggregate measures, number of participants and number of trials specified in

this chapter. This suggests reusability of the experimental setup for numerous other dynamic

shape perception studies (Section 6.2).

As evident in Figure 4.12, the success rates for all display modes are significantly higher than pure chance (25%), and than chance (50%) had participants ignored the instructions and considered only one of the two attachment categories (CS & LA, in Figure 4.10). This indicates that participants understand the instructions correctly (using both attachment categories for distinction) and find the task easy enough to perform.
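Whether a success rate exceeds a chance level can be checked with a one-sided binomial test, sketched below; the shot counts are invented, and only the 25% chance rate comes from the task design:

```python
from math import comb

# One-sided binomial test: probability of observing at least k successes in
# n attempts if each attempt succeeded purely by chance with probability p.

def binom_sf(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented example: 55 correct identifications out of 100 shots, tested
# against the 25% pure-chance rate of the four-category task.
p_value = binom_sf(55, 100, 0.25)
print(p_value < 0.001)  # True: such a rate is far above chance
```

A chance-level performer (p = 0.25) would hit 55 of 100 with vanishingly small probability, so success rates in this range rule out guessing.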

These results prove the successful implementation of the third design goal (interaction simplicity), and I am hopeful that the experimental methodology can be adopted for a large variety of additional display modes.


4.7.5. Motion and Color

For the detailed t-test analysis in Table 4.2 no significant differences are found between the

different texture modes. This is surprising, as motion perception is commonly thought to be

linked to luminance channels and independent of color [187, 134, 115]. In that case, the isolu-

minant texture mode, TexISO, should perform worse than the non-isoluminant texture mode,

TexNOI. In fact, the results of my study show the opposite trend (although that trend is not statistically significant) and are in line with subjective responses from the exit interview (Table B.6,

bottom row), indicating that TexISO appears easier than TexNOI to most participants. This

is an interesting finding and may substantiate recent studies that propose two different motion

pathways in the HVS, which process slow motion (chromatically) differently from fast motion

(achromatically) [55, 168, 104].

4.7.6. Risk Assessment

An interesting result, evident in Figure 4.12, is the positive correlation between success and

risk, and the negative correlation between failure and risk (significant at p(dof=3) < 0.01 in

both cases). Intuitively, it seems that more risk should lower success and increase failure, to

the point of ultimate risk, equating to pure chance. In my interpretation of this data, participants

are generally able to judge their limitations well and behave rather conservatively, in line with

the instruction to be as fast as possible without making any mistakes.
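Such a correlation over per-mode means can be computed with a plain Pearson coefficient (five display modes give dof = n − 2 = 3); the values below are invented stand-ins, so only the method, not the numbers, reflects the study:

```python
# Pearson correlation coefficient between two paired samples, e.g.
# per-display-mode mean success and mean risk (5 modes => dof = 3).

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

success = [0.70, 0.65, 0.55, 0.45, 0.40]  # invented per-mode means
risk    = [0.60, 0.55, 0.50, 0.35, 0.30]  # invented per-mode means
print(pearson_r(success, risk) > 0)  # True: here, risk rises with success
```

A positive coefficient over the per-mode means corresponds to the counter-intuitive pattern reported above: the modes in which participants succeed more are also the ones in which they take more risk.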

Mostly non-significant differences are found for Mixed vs. Outline, and Mixed vs.

Shading. I attribute this to the fact that Mixed performance is between Shading and Outline

for all measures but risk (Figure 4.12), the relatively large variability of Mixed, and the use

of the conservative Bonferroni correction. A possible explanation for the high risk value of


Mixed is that participants became more daring because they assumed that more visual infor-

mation would improve their correct target identification. The objective performance measures

(success and failure) do not corroborate this notion, and, interestingly, neither do the exit

interview results (Table B.6).

4.7.7. Exit Interview

From the exit interview (Figure B.1 and Table B.6) I gather that participants were content with

the duration of the experiment. No participant felt dizzy or disoriented during or after the

experiment and participants found the interaction paradigm very intuitive. Most participants

described the experiment as “fun”, even though this was not asked in the questionnaire.

4.8. Summary

In this chapter, I presented an experiment to study shape perception of multiple concurrent

dynamic objects. This experimental approach is novel as it deviates from traditional reductionist

(single, static shapes) studies whose results may not apply directly to most interactive graphics

applications.

Experimental Framework. I created several non-realistic display modes specifically

designed to target only a single shape cue at a time, allowing me to study individual shape cues

as well as combinations thereof.

My framework implementation carefully follows a number of high-level design goals de-

scribed in Section 4.1.1 and the statistical significance of the data collected during my study


(Section 4.6.3) indicates a high-quality experimental setup where these design goals were at-

tained. Results further indicate that my experiment supports a number of additional shape-from-

X studies with only minor modifications (Section 6.2).

Interaction. I presented a novel interaction paradigm that does not rely on pointing de-

vices or other indirect mechanisms. Participants interact with the experiment by simply touch-

ing a table at the position where they see an object. This interaction is simple, intuitive, unob-

trusive, and reliable (see placement values in Figure 4.12).

Task. Compared to most previous shape perception studies, which can quickly become

monotonous and tiring, my experimental task is inspired by games (Section 4.4) and intended to

be motivating. Participants have to react quickly and stay alert to achieve a good performance.

Most participants became very competitive during the experiment and described the task as fun

(Section 4.7.7).

Results. Although the main contribution of this chapter is a reusable experimental design

that allows for a large number of graphics-relevant shape perception studies, the results for the

single study I performed are already interesting (Section 4.7). The most important of these

results is that, in dynamic situations, shape cues do not seem to add constructively and may

even interfere destructively. This is an important result for interactive graphics, and one which

contrasts previous studies on static objects (Section 4.7.3).

Shape Cues in Action. It is common knowledge amongst graphic designers that different

types of shape-cues are effective for different design goals. I have made use of this concept

throughout the figures in this chapter: Figure 4.10 uses shading and coloring to indicate shape

and differentiate object parts, Figure 4.6 uses contours to draw attention to target objects, and


Figure 4.3 uses texture to illustrate a curved surface. It is important to study the effectiveness

of these shape cues for various perceptual tasks, to apply the cues appropriately in a given

graphical situation.


CHAPTER 5

General Future Work

Given the beneficial relationship between NPR and Perception advocated in this dissertation,

one might naturally ask questions such as: How far can we push this relationship? or What is

the ultimate perceptual depiction? or What other non-realistic imagery, apart from that inspired

by art, can be used for visual communication purposes? Although I do not presume to have

conclusive answers to these questions, I believe the direction outlined by my dissertation points

towards some interesting leads. To start this discussion off, I revisit the topic of realism versus

non-realism in the light of the issues addressed in previous chapters.

5.0.1. Realistic Images

Figure 5.1 illustrates simplistically the general lifecycle of a synthetic image from conception

to perception and onto cognition. I argued in Section 1.1.1 that every image serves a purpose.

For now, let this purpose be to convey a message, even if this message is only the image itself

(e.g. “A table and chair in the corner of a room”). In the purely photorealistic approach, this

message is encoded into a life-like visual representation, without reference to the HVS1, to be

consumed by an observer. The observer’s task, then, is to decipher the message given the input

image. If all elements of this encoding and decoding process work well, the observer recovers

a good approximation of the original message. Because the entire process is rather lengthy

1 As noted in Section 2.4.2, even adaptive rendering, which sometimes does consider the HVS, does so mostly to hide artifacts, not to enhance images.


Figure 5.1. Lifecycle of a Synthetic Image. The image generation (rendering, blue outlines) starts with a concept: A table and chair stand in the corner of a room. A user models the objects, sets up the scene, and renders the image to a display device. An observer views the final image on the display and starts deconstructing the retinal projection (vision, red outlines). The observer goes through various low-level and cognitive processing steps before recognizing the depicted scene: A chair next to a table in a room. If rendering and vision work in perfect harmony, the initial concept and the recognized scene are identical. Vision shortcuts are the attempt to bypass some of the rendering and visual decoding pipeline to effect a more direct visual communication.

and complicated, and because there are no possible shortcuts (see below) for realistic image

synthesis, there are various stages at which the message can be degraded or confused.

5.0.2. Non-realistic Images

Non-realistic image synthesis is not bound by the constraints of the physical world. It thus

becomes easier to eliminate detail that (1) does not contribute to representing the message and

could in the worst case mask the message (confusion), and (2) requires additional rendering


resources, thereby incurring unnecessary costs. The best example of purposeful omission of

information is abstraction. Restrooms around the world generally do not post photographs of a

man and a woman on their doors. Doing so would give too much information, be too specific.

Patrons may be led to believe that the room behind the door belongs to the depicted person.

Instead, restroom signs are abstract representations of men and women, so that any person of

the appropriate gender can identify with the depiction. The allowed shortcuts for non-realistic

images are to bypass optical models required for realistic image synthesis. I should note that

my use of the term shortcut chiefly refers to optimizations in visual communication. While it

is possible that such shortcuts are also computationally efficient (as is the case for many non-

realistic image synthesis algorithms that do not rely on global illumination solutions), I do not require computational efficiency for a shortcut to be considered effective.

5.0.3. Perceptually-based Images

I have argued throughout this dissertation that the effectiveness of non-realistic imagery can be

further increased by considering human perception. I indicate this with the perceptually-based

rendering label in Figure 5.1. To generate images optimized for low-level human vision, the

rendering process needs to include a model of perception (light-blue rendering input). Although

such a model does not introduce additional shortcuts on the rendering side, it might increase

efficiency on the visual decoding side. This is the approach I took in Chapter 3, which led to increased performance in two perceptual tasks.

One way to discuss the questions I pose at the beginning of this Section is to investigate any

perceptual shortcuts beyond those already mentioned. In other words, Can we generate images

that convey a given message while bypassing more of the coding/decoding pipeline? I believe


the answer is, yes. To substantiate this claim, let me give a few examples of what I refer to as

vision shortcuts.

5.1. Vision Shortcuts

In most realistic and even non-realistic graphics, there exists a fairly straightforward con-

nection between a generated visual stimulus and its perceptual response. The intensity of a pixel

on a monitor is related to the perceived brightness of that pixel. The perceived color of a pixel

is related to the red, green, and blue intensities of that pixel, and so on.

There exist, however, various examples of visual stimuli producing a perceptual sensation

that is naturally associated with a very different type of stimulus: a sequence of black-and-white

signals can create the illusion of colors. An interlaced duo-chrome image can be perceived to

contain colors outside the gamut of additive mixture. A static texture pattern can elicit the

sensation of motion. Partially deleted outlines can be perceived as complete. The following

sections introduce these perceptual phenomena in terms of non-realistic imagery and discuss

some of their potential applications for visual communication.

5.1.1. Benham-Fechner Illusion: Flicker Color

The Phenomenon. In 1895, a toy-maker named Charles Benham created a spinning top

painted with a pattern similar to the left pattern in Figure 5.2. This toy was inspired by his

finding that when the pattern was spun, it created the appearance of multiple colored, concen-

tric rings2 [7]. Gustav Fechner [44] and Hermann von Helmholtz investigated the phenomenon

2 I first experienced this illusion in a Natural Science museum in India. The exhibit was in motion when I read the accompanying instructions and it was not until the disk was almost stationary that I was finally convinced that there were, indeed, no colors.


Figure 5.2. Flicker Color Designs. The left and center circular designs can be enlarged, cut out and placed on an old record turntable with adjustable speed. When viewing the animated pattern, most people experience concentric circles in different colors. When the rotation is reversed, the color ordering reverses accordingly. The square design is intended for a conveyor-belt motion, or to be painted onto a cylinder. These designs are but a few of many others possible. Note though, that all designs contain half a period of blackness.

more generally and termed it pattern induced flicker color (PIFC), or flicker color for short.

Although the effect has been researched for a long time [21], a satisfactory explanation remains

elusive. An early theory stipulated that the pulse patterns of the Benham design approximated

neural coding of color information, similar to Morse code. Festinger et al. [48] argued that

Benham’s induced (or subjective) colors were only faint because they poorly approximated real

neural codes. They devised several new patterns with cell-typical activation and fall-off charac-

teristics and demonstrated that their patterns did not require the half-period rest-state of typical

Benham-like patterns (Figure 5.2). Festinger et al.’s theory was later disputed, particularly by

Jarvis [81] who could not reproduce their results. A currently accepted partial explanation ar-

gues that lateral inhibition of neighboring HVS cells exposed to flicker stimuli causes subjective

colors to be seen [182].
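The temporal layout of such patterns can be sketched as the black/white signal one retinal location receives per revolution; a toy illustration of the half-period-black construction mentioned in Figure 5.2, not a model of the illusion itself:

```python
# Temporal black/white signal at one radius of a spinning Benham-style disk:
# half of each revolution is solid black, and each ring adds a short black
# arc at a ring-specific phase within the white half.

def benham_signal(arc_start, arc_len=2, samples=20):
    """Return one revolution sampled in `samples` steps: 1 = white, 0 = black.
    `arc_start` indexes where this ring's black arc begins (in the white half)."""
    signal = []
    for i in range(samples):
        if i < samples // 2:
            signal.append(0)          # the half-period of blackness
        elif arc_start <= i < arc_start + arc_len:
            signal.append(0)          # this ring's short black arc
        else:
            signal.append(1)
    return signal

# Different rings place the arc at different phases; this phase offset is
# what distinguishes the perceived colors of the concentric rings.
inner = benham_signal(arc_start=11)
outer = benham_signal(arc_start=17)
print(inner)
print(outer)
```

Both signals contain the same amounts of black and white per revolution; only the timing of the short arc relative to the black half-period differs, which is the stimulus property the flicker-color theories above try to relate to neural color coding.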


Applications. Apart from research work, the PIFC effect has found applications in oph-

thalmic treatment and even numerous patents (BD Patent Nr. 931533 11. Aug. 1955; U.S.

Patent #2844990, July 29, 1958; U.S. Patent #3311699, Mar. 28, 1967), including novelty ad-

vertisement before the era of color television. If more knowledge existed about the causes of

PIFCs and a reliable method to synthesize saturated and vibrant PIFCs were known, it could

be possible to induce color sensations in retinally color-blind or otherwise retinally-damaged

people.

5.1.2. Retinex Color

The Phenomenon. In Section 1.2.1, I mentioned the phenomenon of color constancy, which

allows humans to perceive the true color of a material instead of its reflected color. Another

phenomenon, sometimes called color illusion, is best explained with an example: If a small grey

square (the shape is not important) is placed upon a larger green square, then the grey square

appears tinted lightly red. Similarly, if a grey square is placed upon a larger red square, the

grey square appears tinted slightly green, i.e. the tint appears as the opposite color of the square

it is placed upon. This effect also works with cyan/yellow combinations and poses problems

for theories that posit that the cones in humanoid retinas are independently sensitive to red,

blue, and green wavelengths. To explain these color illusions, the cones’ responses cannot be

interpreted independently. Alternative theories, along with supporting physiological evidence

exist, based on antagonistic interactions between combinations of cones resulting in spectrally

opposing stimulation [76, 80].


Figure 5.3. Retinex Images. Viewing Instructions: Due to interlacing, the images may not display well at some magnification levels. In the electronic version of this document, zoom into each image until all the horizontal lines comprising the images appear of the same height. Then adjust your viewing distance to the display until the individual lines cease to be discernible. In this configuration, examine the images for 30 seconds or more and then determine what colors you see. Afterwards you can compare the real colors in Figure 5.4. Finally, zoom fully into the above images to inspect the actual colors used.

Edwin Land devised an experiment using both phenomena to suggest subjective colors,

which are objectively not present. In this experiment, he implemented the color illusion phe-

nomenon with a picture slide and a few color-filters to produce duo-chrome images that induced

the illusion of colors which were present only in the original image. The HVS interpreted

the overall color bias of his images as a global illuminant, thus taking advantage of the color

constancy phenomenon. Land described this experiment and the accompanying theory in his

Retinex3 publication [100].

Figure 5.3 shows two Retinex image examples. The images are best viewed on a computer

display with adjusted viewing conditions (see Figure 5.3 caption for instructions). The left

3 Retinex = Retina + visual cortex.


Figure 5.4. Originals for Retinex Images. Originals used to construct the images in Figure 5.3. Left: Public Domain. Right: Creative Commons License.

image really only uses one color, red, but induces the sensation of green (and other colors) with

interlaced grey bands. Note the brown tinge of the burger bun, the yellow of the fries and the

bluish-green tray. The right image uses two different colors, green and red, to achieve a much

fuller color appearance. Note the grey color of the sweater-vest and the blue color of the shirt.

None of these colors are in the gamut of additive mixture of red and green. The image borders

are not strictly necessary but they help to improve the effect. In the left image, I selected a green

that is suggestive of the perceived color of the tray. In the right image, I selected a substitute

white, sampled from the bright stripes in the sweater-vest.
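The construction of such interlaced images is straightforward to sketch. The following is a minimal illustration of the principle, not the code used to produce Figure 5.3, assuming numpy: alternating pixel rows carry only the red channel (rendered in pure red), while the remaining rows carry achromatic luminance; the viewer's color-constancy mechanisms then supply the missing hues.

```python
import numpy as np

def interlace_retinex(rgb):
    """Build a duo-chrome interlaced image from an RGB array (H, W, 3),
    floats in [0, 1]: even rows carry luminance as grey bands, odd rows
    keep only the red channel (displayed as red). Colors beyond red are
    then 'filled in' by the viewer."""
    out = np.zeros_like(rgb)
    lum = rgb @ np.array([0.299, 0.587, 0.114])  # Rec. 601 luminance
    out[0::2, :, :] = lum[0::2, :, None]          # grey bands
    out[1::2, :, 0] = rgb[1::2, :, 0]             # red-only bands
    return out
```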

In previous work [172], I have conducted a distributed user-study4, to test whether subjective

colors could be induced reliably on different monitors and under different illumination condi-

tions. The results indicated that this was possible for a variety of monitors and illumination

4Similar to those suggested in Section 6.2.


conditions; that participants perceived colors clearly outside the gamut of additive mixture; but

that some colors were not identified uniquely.

Applications. Retinex theory, even though heavily criticized when Land first presented

it, has regained some interest in the research community and is used increasingly in image

enhancement applications [54, 6], including images taken by NASA5 [133, 178] during orbital

and space missions. Along with the flicker color illusion, discussed above, applied Retinex

theory is one of the prime examples of vision shortcuts. In fact, because these two examples

address visual processes so early on in the HVS, it could be possible to do away with an external

display device altogether. Once artificial retinas become a reality, Retinex theory and flicker

color could be used to encode color for direct neural stimulation.

5.1.3. Anomalous Motion

The Phenomenon. Anomalous motion is an example of a more indirect method of triggering

a sensation generally associated with a different stimulus. When viewing images like those in

Figure 5.5 at a suitable magnification level, the motion texture elements, which I call motiels,

appear to be moving. Akiyoshi Kitaoka has created many different types of anomalous motion

illusions6 and published several papers and books on the phenomenon [92, 91]. Despite Ki-

taoka’s and other research efforts, there exist many more types of anomalous motion designs

than theories explaining their perceptual mechanisms. In a sense, these illusions are great ex-

amples of Zeki’s observation about artists acting as neurologists7 (Section 1.2.3). There exist

5 http://dragon.larc.nasa.gov.
6 A large number of Kitaoka’s designs are available at http://www.ritsumei.ac.jp/˜akitaoka/index-e.html.
7 This is not to imply that A. Kitaoka’s scientific prowess is in any way inferior to his artistic talents, but rather that less scholarly individuals throughout the Internet have found it possible to adopt and modify his original designs.


Figure 5.5. Anomalous Motion. Top row: Two anomalous motion designs using the same motiel shape, but different color schemes. Bottom row: The Rotating Snakes illusion, after A. Kitaoka. Changing the viewing distance or zooming in/out affects the magnitude of the effect. Try viewing only one image at a time.


various rules of thumb to create anomalous motion designs. Most designs require a repeated

texture element (motiel) with the following characteristics: One side of the shape is brighter

than the center, while the opposing side is darker than the center. The brightness of the center

region should not be too different from the background. The shape of the motiel can be varied.

Most observers perceive motion in the light-to-dark direction of the motiel. The size of the

motiel has a significant effect on the magnitude of the illusion. These and additional rules help

in designing anomalous motion illusions, but they do not explain them. However, parametriza-

tion of these rules combined with computer graphics visualization may help us to learn more

about the extent to which these rules apply and when they break down. This, in turn, is likely to

increase our understanding of the illusions, and may lead to perceptual models explaining them

in more detail, again reiterating the leitmotif of my dissertation.
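As an illustration of such a parametrization, the rules of thumb above translate almost directly into code. The sketch below is hypothetical (not a design from this dissertation) and assumes numpy; it tiles a grid of disc-shaped motiels whose left sides are bright, right sides dark, and centers close to the background luminance:

```python
import numpy as np

def motiel_grid(n=8, size=32, bg=0.5, bright=0.9, dark=0.1):
    """Tile n x n motiels: each is a disc whose luminance ramps from
    `bright` on the left to `dark` on the right, with a center value
    close to the background `bg`. Most observers see drift in the
    light-to-dark direction."""
    img = np.full((n * size, n * size), bg)
    y, x = np.mgrid[0:size, 0:size]
    cx = cy = (size - 1) / 2.0
    disc = (x - cx) ** 2 + (y - cy) ** 2 <= (size * 0.4) ** 2
    ramp = bright + (dark - bright) * x / (size - 1)  # left bright -> right dark
    for i in range(n):
        for j in range(n):
            tile = img[i * size:(i + 1) * size, j * size:(j + 1) * size]
            tile[disc] = ramp[disc]   # writes through the view into img
    return img
```

Varying `bright`, `dark`, `bg`, and `size` against observer reports is exactly the kind of parametrized study suggested above.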

Applications. Possible uses of these illusions, in addition to their entertainment factor

and scientific interest, could include indication of motion in print media, motion visualization

in static displays, and velocity indication of slow moving objects. These applications are similar

to those Freeman et al. [53] proposed, although their system required a short animation sequence

not realizable on truly static media.

5.1.4. Deleted Contours

The Phenomenon. As noted in Section 4.2, the HVS is equipped with a fair amount of redun-

dancy to increase robustness of visual tasks and to deal with underconstrained visual situations.

Figure 5.6 illustrates another facet of this concept. The cube example shows that the straight

edges connecting the corners of the cube do not add much more visually useful information

to the image. Their presence can be inferred from the termination points of the corners. The


exact mechanism by which humans are able to automatically complete such missing contours

is not fully understood, but Hoffman [75] composed a set of rules that are viable candidates

for visual hypothesis testing, as introduced in Section 1.2.2. While Koenderink [94, 97] in-

vestigated the geometric properties of contours that allow shape recovery, Biederman and oth-

ers [10, 9, 112, 113] demonstrated via user-studies which types of contour deletions the HVS

could recover. The scissor example in Figure 5.6 shows that, as mentioned in Section 2.2, not all

visual information is of equal importance. While some contour deletions are easily recovered,

others are not. Interestingly, adding arbitrary plausible masking shapes to the unrecoverable

scissor image re-enables recognition.

I believe deleted contours are an excellent example of the minimal graphics described by

Herman et al. [69], which I mentioned in Section 3.5.9. If we do not require a complete contour

description to obtain shape, then how much do we need, and what? Junctions (corners and in-

tersections) are good candidates for a necessity requirement, but we need additional information

to discern curved features (e.g. a circle). Hoffman’s minima rule (referring to an extremum in

curvature) and other shape parsing rules [74, 148] could help in that respect.
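A first step toward such a measure is easy to prototype. The sketch below is an illustration of the idea, assuming numpy, and not an implementation from this dissertation: it scores the points of a closed contour by their discrete turning angle, so that corners and curvature extrema survive a deletion pass while straight, redundant runs are removed first.

```python
import numpy as np

def salient_contour_mask(pts, keep=0.4):
    """Given a closed contour as an (N, 2) array of points, score each
    point by its discrete curvature (absolute turning angle) and keep
    the fraction `keep` with the largest scores. Corners and curvature
    extrema -- the cues Hoffman's rules single out -- survive; straight
    runs are the first to be deleted."""
    prev = np.roll(pts, 1, axis=0)
    nxt = np.roll(pts, -1, axis=0)
    v1 = pts - prev
    v2 = nxt - pts
    ang1 = np.arctan2(v1[:, 1], v1[:, 0])
    ang2 = np.arctan2(v2[:, 1], v2[:, 0])
    # turning angle, wrapped into [0, pi]
    turn = np.abs((ang2 - ang1 + np.pi) % (2 * np.pi) - np.pi)
    thresh = np.quantile(turn, 1.0 - keep)
    return turn >= thresh
```

On a sampled square, for example, only the four corner points survive a low `keep` fraction, which matches the intuition from the cube example in Figure 5.6.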

Applications. In terms of applications of this phenomenon, it is surprising that (to the best

of my knowledge) no adaptive rendering technique (Section 2.4.2) makes use of the fact that

some parts of an outline are more salient than others. Given that most rendering systems have

3-D information readily available and could easily compute contour saliency, this is an avenue

worth investigating; not only to speed up rendering and hide artifacts, but to actively increase

the visual clarity of a rendered image.


Figure 5.6. Deleted Contours. Cube: To perceive a cube, it is not necessary to fully depict it because the HVS fills in missing information automatically. Scissors: Not all information is equally valuable. Strategically placed information in the recoverable version facilitates the recognition task. Both the recoverable and non-recoverable versions contain about the same length of total contour, but redundant and coincidental information in the non-recoverable version makes identification very difficult. Adding masking cues to the non-recoverable version (recoverable again) disambiguates coincidental information, leading to renewed recovery. Scissor example after Biederman [10].

5.2. Discussion

The examples of vision shortcuts I give in Section 5.1 are all more or less ad hoc and there

exists no unified framework that ties them all together. Some of their possible applications may

sound fantastical – for now. Too little is still known about the perceptual mechanisms whereby

vision shortcuts operate. NPR systems in conjunction with perceptual studies might bridge this

knowledge gap some day.


The great potential I see in vision shortcuts is as a continuation of what art (and NPR) has

already achieved: a divorce of function from form. This separation allows for greater freedom

in the design of images and more direct targeting of visual information. Art is not bound by the

requirement to simulate optical processes in order to convey a message, and more often than not

this helps the visual communication purpose of an image. Vision shortcuts take the separation

of function and form one step further. By almost directly addressing HVS processes (as in

the Flicker color examples) function (perception of color) can be targeted completely without

form. Note that form, here, does not refer to a generic medium through which function may

be applied, but only to the natural medium associated with the function. In the case of color,

the natural medium is light of different wavelengths. Flicker color replaces this medium with a

series of pulses that objectively are nothing more than intermittent light signals, but in the HVS

these become perceptions of color. The divorce of color from wavelengths may enable us to

create revolutionary new display devices and techniques.

I do concede that we are still a long way from incorporating vision shortcuts into standard

rendering pipelines, but I hope that research at the interface between NPR and Perception, as

advocated in my dissertation, will bring us closer to that goal.


CHAPTER 6

Conclusion

In the beginning of this dissertation, I argue that the connection between non-realistic de-

piction and human perception is a valuable tool to improve the visual communication potential

of computer-generated images, and conversely, to learn more about human perception of such

images.

My perception-centric approach to non-realistic depiction differs from most previous NPR

work in that I am not interested in merely replicating an artistic style1, but that I focus on the

perceptual motivation for using non-realistic imagery.

In Chapter 3, I show how a perceptually inspired image processing framework can create

images that are effective for visual communication tasks. These images also resemble cartoons;

not primarily because my motivation was to imitate a cartoon-style, but because in the design

of the framework I used the same perceptual principles that make cartoons highly effective for

visual communication purposes. This subtle difference has important consequences: although

the resulting images of my framework may resemble those of previous works, my perceptually

inspired framework is faster than previous systems, more temporally coherent, and implicitly

generates certain visual effects (indication and motion lines), that other NPR cartooning systems

have to program explicitly. The appearance of these complementary effects is likely linked

to the fact that, although the effects are commonly considered merely stylistic, they actually

1 This is not to say that I argue against purely artistic use of NPR. There definitely is merit in such use for creative expression and aesthetic purposes. For this reason, I included the various stylistic parameters introduced in Chapter 3.


have roots in perceptual mechanisms and physiological structures [89, 52]. This demonstrates

some of the benefits of re-examining non-realistic graphics and artistic styles in the light of

perceptual motivations. Not only can this approach teach us about art and perception of art, but

it can provide insights to leverage the perceptual principles that make art so effective for visual

communication. We can then use that knowledge to improve computer graphics, realistic and

non-realistic alike.

Similarly, we can leverage existing non-realistic imaging techniques and the immense pro-

cessing power of graphics hardware to perform perceptual studies more relevant to interactive

computer graphics applications (Chapter 4) than the impoverished studies that are tradition-

ally performed. The knowledge gained from such studies is not only valuable for non-realistic

graphics, but is likely to transfer to improving realistic computer graphics, as well.

6.1. Conclusion drawn from Real-time Video Abstraction Chapter

In Chapter 3, I have presented a simple and effective real-time framework that abstracts

images while retaining much of their perceptually important information, as demonstrated by

two user studies (Section 3.4).

In addition to the contribution of the actual framework, I can draw several high-level con-

clusions from Chapter 3. While not all of these conclusions are necessarily novel, they are in

my opinion particularly well reflected in the framework’s design and implementation.

6.1.1. Contrast

All of the important processing steps in the framework are based on contrasts, not absolute

values, continuing a recently developing trend in the graphics community towards differential


(change-based) models and algorithms [123, 156, 59]. Particularly, the automatic version of

the non-linear diffusion approximation in Section 3.3.2 uses the given contrast in an image

to change said contrast, forming a closed-loop, implicit algorithm. I believe that differential

methods will play an increasingly important role in future systems, particularly in those based

on perceptual principles.
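To make the closed-loop idea concrete, here is a minimal sketch in the spirit of such contrast-gated diffusion. It is a standard Perona-Malik-style illustration, assuming numpy, not the framework's actual GPU implementation: the image's own gradient magnitudes decide where smoothing is applied, so the given contrast is what changes the contrast.

```python
import numpy as np

def diffuse(img, iters=10, k=0.1, lam=0.2):
    """Contrast-gated smoothing: low-contrast regions flatten while
    strong edges persist, because the conductance g falls off with the
    local difference. (np.roll wraps at the borders; adequate for a
    sketch.)"""
    img = np.asarray(img, dtype=float).copy()
    g = lambda d: np.exp(-(d / k) ** 2)   # conductance: ~0 across strong edges
    for _ in range(iters):
        n = np.roll(img, -1, axis=0) - img   # differences to the four neighbors
        s = np.roll(img, 1, axis=0) - img
        e = np.roll(img, -1, axis=1) - img
        w = np.roll(img, 1, axis=1) - img
        img += lam * (g(n) * n + g(s) * s + g(e) * e + g(w) * w)
    return img
```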

6.1.2. Soft Quantization

Temporal coherence has been a major problem for many animated stylization systems from the

very beginning [101]. There are several reasons for this. Stylization is often an arbitrary external

addition to an image (e.g. waviness or randomness in line-drawings of exact 3-D models) and

should therefore be controlled via a temporally smooth function (e.g. Perlin noise [124] or

wavelet noise [29]). Another problem that is more difficult to address is that of quantization.

Many existing stylization systems force discontinuous quantizations, particularly to derive an

explicit image representation [36, 161, 26]. My approach is different. I want to aid the human

visual system to increase efficiency for certain visual tasks, but I do not endeavor to perform

the visual task for the user, who is much more capable than any system that I can devise. In

terms of quantization this means that I will not force a quantization if I cannot be relatively sure

to make the correct decision (e.g. whether a pixel belongs to one object or another). Instead, I

perform a quasi-quantization, or soft-quantization, which suggests rather than enforces, and lets

the observers mentally complete the picture for themselves. This principle is used effectively

in the color-quantization in Section 3.3.5 and the edge detection in Section 3.3.3. In essence, it is often better to give a good partial solution than an erroneous full solution that needs to be corrected with the help of the user [161, 26].
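The soft-quantization principle can be written down compactly. The sketch below assumes numpy, and its constants are illustrative; the sharpness control of the actual color-quantization step in Section 3.3.5 differs in detail. Instead of a hard step to the nearest level, luminance is nudged toward it through a smooth tanh ramp:

```python
import numpy as np

def soft_quantize(lum, n_bins=8, phi=3.0):
    """Soft luminance quantization: snap values in [0, 1] toward the
    nearest of n_bins levels with a smooth tanh ramp instead of a hard
    step. Small phi merely suggests bands; large phi approaches a hard
    posterization."""
    dq = 1.0 / (n_bins - 1)
    nearest = np.round(lum / dq) * dq                         # nearest level
    return nearest + (dq / 2.0) * np.tanh(phi * (lum - nearest) / dq)
```

Because the output never strays more than half a bin from the nearest level, bands are suggested without forcing a possibly wrong hard decision at ambiguous pixels.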


6.1.3. Gaussians, Art, and Perception

Gaussian filters or Gaussian-like convolutions recur throughout Chapter 3, from scale-space

theory and edge-detection to diffusion-approximations and motion-blur. Why should this be

so? My personal explanation is related to the receptive-field concept, which is illustrated in

Figure 3.11 and to which Zeki refers as “[...] one of the most important concepts to emerge

from sensory physiology in the past fifty years.” ([189], p. 88). The receptive field of a cell

connects the cell with its physical or logical neighbors2 and thus performs information integra-

tion over larger and larger areas, and eventually the entire visual field. In essence, the output

of a cortical cell depends on the weighted input of its own trigger mechanism and that of

its neighbors, not entirely unlike the convolution of a Gaussian-like kernel with an image. This

connection has been shown physiologically [183] before, but I believe it to be important for two

additional reasons.

First, using Gaussian-like convolutions allows for very efficient, parallel, and implicit in-

formation processing frameworks, which become akin to neural net implementations when

processed iteratively. It might thus be interesting to look at neural nets and related artificial

intelligence applications in terms of image processing operations that can leverage the parallel

processing power of modern GPUs for high-performance computations.

Second, many of the Gaussian-based image processing operations I have used show inter-

esting connections to well-known artistic techniques and principles. There are obvious con-

nections3 like the one between DoG edges and line-drawings, but there are also less obvious

connections, like the indication and motion lines of Section 3.5.7. Another effect (not shown in

2 This corresponds to the topological (physical) versus feature-based (logical) mappings found to connect the different visual cortical areas [188].
3 In terms of relatedness, not causality.


detail) is that recursively bilaterally filtered images often tend to look like water-color paintings

(in which non-linear color diffusion through canvas and water plays a pivotal role).

In short, Gaussian-like filters seem to play heavily into both perception and art, and given

the fact that various artistic techniques have been traced back to cortical structures and low-

level perception [66, 17, 89, 52, 189], it might be worthwhile to attempt an explanation and

parametrization of art in terms of a perceptually-based computational information-integration

model using Gaussian-like functions.
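As one concrete instance of such Gaussian-based processing, the DoG edge operator mentioned above reduces to two separable Gaussian blurs and a subtraction. This is a simplified sketch assuming numpy; the parameter values are illustrative, and the framework's actual edge step in Section 3.3.3 adds the soft thresholding discussed earlier:

```python
import numpy as np

def gauss_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel."""
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def dog_edges(img, sigma=1.0, k=1.6, tau=0.98, eps=0.0):
    """Difference-of-Gaussians edge response: blur at two nearby scales,
    subtract, and mark a 'line' wherever the difference falls below eps.
    Separable 1-D convolutions keep the cost linear in kernel width."""
    def blur(im, s):
        kern = gauss_kernel(s)
        im = np.apply_along_axis(lambda r: np.convolve(r, kern, mode='same'), 1, im)
        return np.apply_along_axis(lambda c: np.convolve(c, kern, mode='same'), 0, im)
    d = blur(img, sigma) - tau * blur(img, k * sigma)
    return d < eps   # True where an edge line would be drawn
```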

6.2. Conclusions drawn from Shape-from-X Chapter

In Chapter 4, I have presented an experiment to study shape perception of moving objects

using non-realistic imagery. My main contributions in designing the experiment are the use of

simple, non-realistic visual stimuli to separate shape cues onto orthogonal perceptual shape-

from-X axes, and to display these cues in a highly dynamic environment for a time-critical,

game-like task. The experiment therefore demonstrates well the contribution that NPR can

make to perceptual studies, as well as a methodology that NPR researchers can use to validate

and improve their systems.

One of the most interesting results of the user study I performed using my experimental de-

sign is that shape cues in a time-constrained condition need not interact constructively. In fact,

my results indicate that the opposite may be the case. When multiple shape cues are present,

they can conflict and impede each other. Apart from the other results discussed in Section 4.7,

the most important conclusion to be drawn from Chapter 4 is that the experimental framework

fulfilled all the design goals put forth in Section 4.1.1, and that this allows for numerous ad-

ditional studies to be performed and evaluated with my experimental design. One benefit of


the design is that setting up new studies can be as simple as generating new sets of shapes to

be tested, or varying the texture parametrization. The following sections describe some of the

possible dynamic shape studies that my experiment supports, and which might yield valuable

perceptual insights to improve existing rendering systems and to develop new display algo-

rithms for interactive applications.

6.2.1. Contours

The different types of contours (silhouettes, outlines, ridges, valleys, creases, etc.) illustrated in

Figure 4.1, Contours, might be investigated.

Of particular interest would be the evaluation of the perceptually motivated suggestive

contours [35]. I actually included DeCarlo et al.’s suggestive contour code in the initial car-

experiment, but ended up not using it because the coarse real-time models of my setup did not

provide enough geometric detail for the suggestive contour method to work properly, and higher

resolution models prohibited real-time rendering. The same limitations apply to the geon shapes

of my current experiment. Because studying the effectiveness of suggestive contours requires a

more complex shape set and higher resolution models, some obstacles will have to be overcome

to enable real-time rendering performance. An exciting development in that regard is the new

feature set of the latest generation of graphics cards, which allows for geometry generation in

GPU code. This, together with instancing, might enable real-time suggestive contour rendering

of multiple complex, high-resolution models.


6.2.2. Textures

Apart from the simple sphere-mapping used in my experiment, a number of other shape param-

eterizations can be used to map texture onto objects [77, 58, 90, 153]. The type of texture can

also be varied. My experiment uses a random-design texture comprised of structure at a variety

of spatial scales. Most perception studies that focus solely on texture use sinusoidal gratings

of a well-defined frequency and amplitude. It will be interesting to study how these textures

perform in a dynamic experiment.
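Such grating stimuli are simple to generate, which makes this variation cheap to add to the experiment. A sketch of an oriented sinusoidal grating texture, assuming numpy (parameters are illustrative):

```python
import numpy as np

def grating(size=256, cycles=8, amplitude=0.5, phase=0.0, angle=0.0):
    """Sinusoidal grating texture with a well-defined spatial frequency
    (cycles across the texture), amplitude, phase, and orientation,
    centered on mid-grey."""
    y, x = np.mgrid[0:size, 0:size] / size
    u = x * np.cos(angle) + y * np.sin(angle)   # coordinate along the grating
    return 0.5 + amplitude * np.sin(2 * np.pi * cycles * u + phase)
```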

6.2.3. Shapes

I varied attachment shape along two categorical axes but there are other shape categories to

explore. Some of my results (Section 4.7.3) suggest that different shape cues might be more

effective for certain types of shapes than for others, so it will be interesting to perform additional

studies that match shape classes with optimal shape cues or shape cue combinations.

6.2.4. Dynamics

Another result of my experiment indicates that shape perception in dynamic environments may

abide by different rules than perception in static environments. In line with recent research

on motion detection of variously colored stimuli [55, 168, 104], it might be interesting to see

what exactly constitutes a static versus a dynamic environment. What kind of translational and

rotational velocities can be considered dynamic or even highly dynamic? How fast can the

different shape-from-X mechanisms of the human visual system reliably detect subtle shape

differences?


I explained in Section 4.4.2, Motion (pg. 124), that shape-from-Motion in isolation is diffi-

cult to study because for motion to be perceived, something has to move. Experiments on the

perception of biological motion have shown that motion can indeed be divorced from form [82].

Coherently moving dots in a random dot display can be perceived to represent the motion of

various rigid bodies or even complex biological entities. The dots are totally devoid of shape

when stationary, but become part of a moving form when animated (similar to the effect of the

complex background textures in my experiment). The resolution of sparsely distributed dots on

a display is obviously too limited to resolve the subtle differences between the shape categories

tested in my study, but it will be interesting to devise a modified version of my experiment that

can help to separate the effect of shape-from-Motion from the contribution of the other shape

cues in a dynamic environment.
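The random-dot stimulus itself is easy to prototype. In the hypothetical sketch below, assuming numpy, dots are scattered on a sphere and rotated rigidly about the vertical axis before orthographic projection; any single projected frame is a structureless scatter, while the frame-to-frame coherence carries the shape:

```python
import numpy as np

def rotate_dots(dots, angle):
    """Rigidly rotate a 3-D dot cloud about the vertical (y) axis and
    project orthographically to 2-D by dropping z. One static frame is
    just scattered dots; only the coherent motion across frames reveals
    the rigid shape."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return (dots @ rot.T)[:, :2]

# e.g. dots sampled uniformly on the surface of a sphere
rng = np.random.default_rng(0)
v = rng.normal(size=(200, 3))
sphere = v / np.linalg.norm(v, axis=1, keepdims=True)
frames = [rotate_dots(sphere, a) for a in np.linspace(0, np.pi, 30)]
```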

6.2.5. Games and Distributed Studies

Finally, I am encouraged by the positive user-feedback from the experiment. Perceptual studies

often employ repetitive, time-consuming, and tedious tasks to obtain accurate data. Such tasks

can negatively impact concentration and performance levels. I found in my experiment that

participants generally enjoyed the interaction because it was simple and engaging. The task

also triggered competitive behavior in participants who wanted to do well and shoot as many

correct targets as possible. I see game-like interaction tasks for perceptual studies as a way of

obtaining data that is more relevant to real-life activities than that of most traditional, reductionist

experiments. Of course, there are problems and pitfalls, as I have found out with the car-

experiment, but a careful experimental design can minimize those problems. I am interested

to see how popular game paradigms, like racing games, first-person shooters, or third-person


obstacle games can be modified to yield perceptually valuable and scientifically sound data.

One big advantage of such an approach, in addition to its immediate applicability to interactive

graphics, game design, and perception research, is the large base of volunteer gamers who could

download the experiment/game and would generate valuable data just by playing. The obvious

problems of limited control over the environmental conditions during the experimental trials

would have to be weighed against the benefits of fast and copious data-gathering possible in a

distributed, autonomous experiment.

6.3. Summary

The graphics community at large has acquired much knowledge about the design and per-

formance of rendering algorithms as well as interactive and even immersive applications. Yet,

very little is known about the perceptual effects of these algorithms and applications on human

task performance. It is my hope that in the future we will harness more of the advanced ren-

dering systems and processing power that computer graphics has to offer, to perform perceptual

studies that would otherwise not be possible. In return, the insights gained from such perceptual

studies can flow right back into designing graphical systems that are not only fast and photore-

alistic, but that provide verifiably effective visual stimuli for the human tasks they are intended

to support.


References

[1] Nur Arad and Craig Gotsman. Enhancement by image-dependent warping. IEEE Trans. on Image Processing, 8(9):1063–1074, 1999. 72, 73

[2] James Arvo, Kenneth Torrance, and Brian Smits. A framework for the analysis of error in global illumination algorithms. In SIGGRAPH ’94: Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pages 75–84, New York, NY, USA, 1994. ACM Press. 39

[3] Alethea Bair, Donald House, and Colin Ware. Perceptually optimizing textures for layered surfaces. In APGV ’05: Proceedings of the 2nd symposium on Applied perception in graphics and visualization, pages 67–74, New York, NY, USA, 2005. ACM Press. 118

[4] Danny Barash and Dorin Comaniciu. A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift. Image and Video Computing, 22(1):73–81, 2004. 58, 90

[5] Woodrow Barfield, James Sandford, and James Foley. The mental rotation and perceived realism of computer-generated three-dimensional images. Intl. J. Man-Machine Studies, 29:669–684, 1988. 112, 115, 118

[6] H.G. Barrow and J.M. Tenenbaum. Line drawings as three-dimensional surfaces. Artificial Intelligence, 17:75–116, 1981. 166

[7] C.E. Benham. The artificial spectrum top. Nature (London), 51:200, 1894. 161

[8] I. Biederman and M. Bar. One-shot viewpoint invariance in matching novel objects. Vision Research, 39(17):2885–2899, 1999. 113, 118

[9] I. Biederman and E. E. Cooper. Priming contour-deleted images: evidence for intermediate representations in visual object recognition. Cognitive Psychology, 23(3):393–419, 1991. 169


[10] Irving Biederman. Recognition-by-components: A theory of human image understand-ing. Psychological Review, 94(2):115–147, 1987. 114, 116, 117, 118, 131, 169, 170

[11] Irving Biederman and Peter C. Gerhardstein. Recognizing depth-rotated objects: Evi-dence and conditions for three-dimensional viewpoint invariance. Experimental Psychol-ogy, 19(6):1162–1182, 1993. 115, 117, 118

[12] T. O. Binford. Generalized cylinders representation. In S. C. Shapiro, editor, Encyclope-dia of Artificial Intelligence, pages 321–323, New York, 1987. John Wiley & Sons. 118,131

[13] Mark R. Bolin and Gary W. Meyer. A perceptually based adaptive sampling algorithm.In SIGGRAPH ’98: Proceedings of the 25th annual conference on Computer graphicsand interactive techniques, pages 299–309, New York, NY, USA, 1998. ACM Press. 40

[14] R. Van den Boomgaard and J. Van de Weijer. On the equivalence of local-mode finding,robust estimation and mean-shift analysis as used in early vision tasks. 16th Internat.Conf. on Pattern Recog., 3:927–930, 2002. 90

[15] Philippe Bordes and Philippe Guillotel. Perceptually adapted MPEG video encoding.Human Vision and Electronic Imaging V, 3959(1):168–175, 2000. 37

[16] D. J. Bremer and J. F. Hughes. Rapid approximate silhouette rendering of implicit sur-faces. Implicit Surfaces ’98, pages 155–164, 1998. 114

[17] S. E. Brennan. Caricature generator: The dynamic exaggeration of faces by computer.Leonardo, 18(3):170–178, 1985. 53, 176

[18] H. H. Bulthoff and S. Edelman. Psychophysical Support for a Two-Dimensional ViewInterpolation Theory of Object Recognition. Proc. of the Natl. Ac. of Sciences, 89(1):60–64, 1992. 118

[19] H. H. Bulthoff and H. A. Mallot. Integration of stereo, shading and texture. In A. Blake and T. Troscianko, editors, AI and the Eye, pages 119–146. Wiley, London, UK, 1990. 150

[20] Michael Burns, Janek Klawe, Szymon Rusinkiewicz, Adam Finkelstein, and Doug DeCarlo. Line drawings from volume data. ACM Trans. Graph., 24(3):512–518, 2005. 114

[21] C. Von Campenhausen and J. Schramme. 100 years of Benham’s top in colour science. Perception, 24(6):695–717, 1995. 162

[22] J. F. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8:769–798, 1986. 69, 72

[23] K. Cater, A. Chalmers, and G. Ward. Detail to attention: exploiting visual tasks for selective rendering. In EGRW ’03: Proceedings of the 14th Eurographics workshop on Rendering, pages 270–280, Aire-la-Ville, Switzerland, 2003. Eurographics Association. 41

[24] Stephen Chenney, Mark Pingel, Rob Iverson, and Marcin Szymanski. Simulating cartoon style animation. In NPAR ’02: Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering, pages 133–138, New York, NY, USA, 2002. ACM Press. 18

[25] Johan Claes, Fabian Di Fiore, Gert Vansichem, and Frank Van Reeth. Fast 3D cartoon rendering with improved quality by exploiting graphics hardware. In Proceedings of Image and Vision Computing New Zealand (IVCNZ) 2001, pages 13–18. IVCNZ, November 2001. 19

[26] John P. Collomosse, David Rowntree, and Peter M. Hall. Stroke surfaces: Temporally coherent artistic animations from video. IEEE Trans. on Visualization and Computer Graphics, 11(5):540–549, 2005. 48, 88, 90, 91, 92, 93, 97, 174

[27] Dorin Comaniciu and Peter Meer. Mean shift analysis and applications. In ICCV ’99: Proceedings of the Int. Conference on Computer Vision-Volume 2, page 1197, Washington, DC, USA, 1999. IEEE Computer Society. 90

[28] Robert L. Cook, Loren Carpenter, and Edwin Catmull. The Reyes image rendering architecture. In SIGGRAPH ’87: Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pages 95–102, New York, NY, USA, 1987. ACM Press. 15

[29] Robert L. Cook and Tony DeRose. Wavelet noise. ACM Trans. Graph., 24(3):803–811, 2005. 174

[30] Lynn A. Cooper. Mental rotation of random two-dimensional shapes. Cognitive Psychology, 7:20–43, 1975. 115

[31] B. Cumming, E. Johnston, and A. Parker. Effects of different texture cues on curved surfaces viewed stereoscopically. Vision Research, 33(5-6):827–838, 1993. 110

[32] Cassidy J. Curtis, Sean E. Anderson, Joshua E. Seims, Kurt W. Fleischer, and David H. Salesin. Computer-generated watercolor. Proceedings of SIGGRAPH 97, pages 421–430, August 1997. 18

[33] S. J. Daly. Visible differences predictor: an algorithm for the assessment of image fidelity. Proc. SPIE, 1666:2–15, 1992. 34, 41

[34] Richard Dawkins. Climbing Mount Improbable. W. W. Norton & Company, 1997. 23

[35] Doug DeCarlo, Adam Finkelstein, and Szymon Rusinkiewicz. Interactive rendering of suggestive contours with temporal coherence. In NPAR ’04, pages 15–24, New York, NY, USA, 2004. ACM Press. 19, 110, 114, 177

[36] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs. ACM Trans. Graph., 21(3):769–776, 2002. 19, 33, 48, 49, 62, 63, 72, 91, 95, 174

[37] Michael F. Deering. A photon accurate model of the human eye. ACM Trans. Graph., 24(3):649–658, 2005. 15, 17

[38] Oliver Deussen and Thomas Strothotte. Computer-generated pen-and-ink illustration of trees. Proceedings of SIGGRAPH 2000, pages 13–18, July 2000. 19

[39] J. Duncan. Selective attention and the organization of visual information. Journal of Experimental Psychology: General, 113(4):501–517, December 1984. 105, 130

[40] Fredo Durand. An invitation to discuss computer depiction. In NPAR ’02: Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering, pages 111–124, New York, NY, USA, 2002. ACM Press. 16, 20, 22

[41] David Ebert and Penny Rheingans. Volume illustration: non-photorealistic rendering of volume models. In VIS ’00: Proceedings of the conference on Visualization ’00, pages 195–202, Los Alamitos, CA, USA, 2000. IEEE Computer Society Press. 114

[42] James H. Elder. Are edges incomplete? Internat. Journal of Computer Vision, 34(2-3):97–122, 1999. 88

[43] L. C. Evans. Partial Differential Equations. American Mathematical Society, Providence,1998. 56

[44] G. T. Fechner. Über eine Scheibe zur Erzeugung subjectiver Farben. Annalen der Physik und Chemie. Verlag von Johann Ambrosius Barth, Leipzig, pages 227–232, 1838. 161

[45] G. T. Fechner. Elemente der Psychophysik, volume 2. Breitkopf und Haertel, Leipzig, 1860. 52

[46] James A. Ferwerda, Peter Shirley, Sumanta N. Pattanaik, and Donald P. Greenberg. A model of visual masking for computer graphics. In SIGGRAPH ’97: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 143–152, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. 33, 34, 40

[47] James A. Ferwerda, Stephen H. Westin, Randall C. Smith, and Richard Pawlicki. Effects of rendering on shape perception in automobile design. In APGV ’04: Proceedings of the 1st Symposium on Applied perception in graphics and visualization, pages 107–114, New York, NY, USA, 2004. ACM Press. 113, 117

[48] L. Festinger, M. R. Allyn, and C. W. White. The perception of color with achromatic stimulation. Vision Res., 11(6):591–612, 1971. 162

[49] J. Fischer, D. Bartz, and W. Straßer. Stylized Augmented Reality for Improved Immersion. In Proc. of IEEE VR, pages 195–202, 2005. 48, 65, 69, 71, 72

[50] Alan Fogel and Thomas E. Hannan. Manual actions of nine- to fifteen-week-old human infants during face-to-face interaction with their mothers. Child Development, 56(5):1271–1279, Oct. 1985. 128

[51] Mark D. Folk and R. Duncan Luce. Effects of stimulus complexity on mental rotation rate of polygons. Experimental Psychology: Human Perception and Performance, 13(3):395–404, 1987. 115

[52] Gregory Francis and Hyungjun Kim. Motion parallel to line orientation: Disambiguation of motion percepts. Perception, 28:1243–1255, 1999. 96, 173, 176

[53] William T. Freeman, Edward H. Adelson, and David J. Heeger. Motion without movement. In SIGGRAPH ’91: Proceedings of the 18th annual conference on Computer graphics and interactive techniques, pages 27–30, New York, NY, USA, 1991. ACM Press. 168

[54] B. Funt, K. Barnard, M. Brockington, and V. Cardei. Luminance based multi scale retinex. In Proceedings AIC Colour 97 Kyoto 8th Congress of the International Colour Association, volume 1, pages 330–333, May 1997. 166

[55] K. R. Gegenfurtner and M. J. Hawken. Interaction of motion and color in the visual pathways. Trends Neuroscience, 19(9):394–401, 1996. 154, 178

[56] J. J. Gibson. The perception of the visible world. American Journal of Psychology, 63:367–384, 1950. 110

[57] J. J. Gibson. The Ecological Approach to Visual Perception. Lawrence Erlbaum Assoc. Inc., 1987. 131

[58] Ahna Girshick, Victoria Interrante, Steven Haker, and Todd Lemoine. Line direction matters: an argument for the use of principal directions in 3D line drawings. In NPAR ’00, pages 43–52, New York, NY, USA, 2000. ACM Press. 115, 178

[59] Amy Ashurst Gooch. Preserving Salience By Maintaining Perceptual Differences for Image Creation and Manipulation. PhD thesis, Northwestern University, 2006. 52, 174

[60] Amy Ashurst Gooch and Peter Willemsen. Evaluating space perception in NPR immersive environments. In NPAR ’02, pages 105–110, New York, NY, USA, 2002. ACM Press. 19, 114, 116, 117

[61] Bruce Gooch and Amy Ashurst Gooch. Non-Photorealistic Rendering. A. K. Peters,2001. 18

[62] Bruce Gooch, Erik Reinhard, and Amy Gooch. Human facial illustrations: Creation and psychophysical evaluation. ACM Trans. Graph., 23(1):27–44, 2004. 19, 25, 49, 53, 69, 81, 83, 98

[63] Cindy M. Goral, Kenneth E. Torrance, Donald P. Greenberg, and Bennett Battaile. Modeling the interaction of light between diffuse surfaces. In SIGGRAPH ’84: Proceedings of the 11th annual conference on Computer graphics and interactive techniques, pages 213–222, New York, NY, USA, 1984. ACM Press. 15

[64] Ian E. Gordon. Theories of Visual Perception. Psychology Press, New York, 3rd edition, Dec. 2004. 131

[65] R. L. Gregory. Eye and Brain - The Psychology of Seeing. Oxford University Press, 1994. 23, 51

[66] M. H. Hansen. Effects of discrimination training on stimulus generalization. Journal of Experimental Psychology, 58:321–334, 1959. 53, 176

[67] J. W. Harris and H. Stocker. General cylinder. In Handbook of Mathematics and Computational Science, page 103, New York, 1998. Springer-Verlag. 4.6.1. 118, 131

[68] James Hays and Irfan Essa. Image and video based painterly animation. In NPAR ’04: Proceedings of the 3rd international symposium on Non-photorealistic animation and rendering, pages 113–120, New York, NY, USA, 2004. ACM Press. 19

[69] Ivan Herman and D. J. Duke. Minimal graphics. IEEE Computer Graphics and Applications, 21(6):18–21, 2001. 99, 169

[70] Aaron Hertzmann. Introduction to 3D non-photorealistic rendering: Silhouettes and outlines. In Non-Photorealistic Rendering (Siggraph ’99 Course Notes), 1999. 110, 114, 122

[71] Aaron Hertzmann. Paint by relaxation. In CGI ’01: Computer Graphics Internat. 2001, pages 47–54, 2001. 62

[72] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In SIGGRAPH ’01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 327–340, New York, NY, USA, 2001. ACM Press. 18

[73] Aaron Hertzmann and Ken Perlin. Painterly rendering for video and interaction. In NPAR ’00: Proceedings of the 1st international symposium on Non-photorealistic animation and rendering, pages 7–12, New York, NY, USA, 2000. ACM Press. 19

[74] D. D. Hoffman and M. Singh. Salience of visual parts. Cognition, 63(1):29–78, 1997. 169

[75] Donald D. Hoffman. Visual Intelligence: How We Create What We See. W. W. Norton & Company, NY, 2000. 23, 107, 169

[76] L. Hurvich. Color Vision. Sinauer Assoc., Sunderland, Mass., 1981. 163

[77] Victoria Interrante. Illustrating surface shape in volume data via principal direction-driven 3D line integral convolution. In SIGGRAPH ’97, pages 109–116, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. 115, 178

[78] Laurent Itti and Christof Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, 2001. 33, 53, 62

[79] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259, 1998. 33, 53

[80] D. Jameson and L. M. Hurvich. Some quantitative aspects of an opponent-colors theory. I. Chromatic response and spectral saturation. II. Brightness, saturation and hue in normal and dichromatic vision. Journal of the Optical Society of America, 45(8):602–616, 1955. 163

[81] J. R. Jarvis. On Fechner-Benham subjective colour. Vision Res., 17(3):445–451, 1977. 162

[82] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):201–211, 1973. 179

[83] Alan Johnston and Peter J. Passmore. Shape from shading. I: Surface curvature and orientation. Perception, 23:169–189, 1994. 112

[84] Scott F. Johnston. Lumo: illumination for cel animation. In NPAR ’02: Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering, pages 45–52, New York, NY, USA, 2002. ACM Press. 19

[85] James T. Kajiya. The rendering equation. In SIGGRAPH ’86: Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pages 143–150, New York, NY, USA, 1986. ACM Press. 15

[86] Robert D. Kalnins, Philip L. Davidson, Lee Markosian, and Adam Finkelstein. Coherent stylized silhouettes. ACM Trans. Graph., 22(3):856–861, 2003. 19

[87] Nanda Kambhatla, Simon Haykin, and Robert D. Dony. Image compression using KLT, wavelets and an adaptive mixture of principal components model. J. VLSI Signal Process. Syst., 18(3):287–296, 1998. 37

[88] G. Kayaert, I. Biederman, and R. Vogels. Shape Tuning in Macaque Inferior Temporal Cortex. Journal of Neuroscience, 23(7):3016–3027, 2003. 113, 118

[89] Hyungjun Kim and Gregory Francis. A computational and perceptual account of motion lines. Perception, 27:785–797, 1998. 96, 173, 176

[90] Sunghee Kim, Haleh Hagh-Shenas, and Victoria Interrante. Conveying three-dimensional shape with texture. In APGV ’04: Proceedings of the 1st Symposium on Applied perception in graphics and visualization, pages 119–122, New York, NY, USA, 2004. ACM Press. 115, 118, 178

[91] Akiyoshi Kitaoka. Trick Eyes. Barnes & Noble Books, 2005. 166

[92] Akiyoshi Kitaoka and Hiroshi Ashida. Phenomenological characteristics of the peripheral drift illusion. Vision, 15(4):261–262, 2003. 166

[93] J. J. Koenderink. The structure of images. Biological Cybernetics, 50:363–370, 1984. 54

[94] J. J. Koenderink and A. J. Doorn. The internal representation of solid shape with respect to vision. Biological Cybernetics, 32(4):211–216, 1979. 131, 169

[95] J. J. Koenderink, A. J. Van Doorn, and A. M. L. Kappers. Surface perception in pictures. Perception and Psychophysics, 52(5):487–496, 1992. 108, 112, 115

[96] J. J. Koenderink, A. J. Van Doorn, and A. M. L. Kappers. Pictorial surface attitude and local depth comparisons. Perception and Psychophysics, 58(2):163–173, 1996. 108, 112, 115

[97] Jan J. Koenderink. What does the occluding contour tell us about solid shape? Perception, 13:321–330, 1984. 110, 169

[98] Jan J. Koenderink and Andrea J. Van Doorn. Relief: Pictorial and otherwise. Image and Vision Computing, pages 321–334, 1995. 115

[99] Adam Lake, Carl Marshall, Mark Harris, and Marc Blackstein. Stylized rendering techniques for scalable real-time 3D animation. In NPAR ’00: Proceedings of the 1st international symposium on Non-photorealistic animation and rendering, pages 13–20, New York, NY, USA, 2000. ACM Press. 19

[100] Edwin H. Land. The retinex theory of color vision. Scientific American, 237(6):108–128, 1977. 164

[101] John Lansdown and Simon Schofield. Expressive rendering: A review of nonphotorealistic techniques. IEEE Comput. Graph. Appl., 15(3):29–37, 1995. 174

[102] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer, Netherlands, 1994. 54

[103] Joern Loviscach. Scharfzeichner: Klare Bilddetails durch Verformung. Computer Technik, 22:236–237, 1999. 74

[104] Zhong-Lin Lu, Luis A. Lesmes, and George Sperling. The mechanism of isoluminant chromatic motion perception. Proc. Natl. Acad. Science USA, 96(14):8289–8294, 1999. 124, 154, 178

[105] R. Duncan Luce and Ward Edwards. The derivation of subjective scales from just noticeable differences. Psychol. Rev., 65(4):222–237, 1958. 52

[106] Rafał Mantiuk, Scott Daly, Karol Myszkowski, and Hans-Peter Seidel. Predicting visible differences in high dynamic range images - model and its calibration. In Bernice E. Rogowitz, Thrasyvoulos N. Pappas, and Scott J. Daly, editors, Human Vision and Electronic Imaging X, volume 5666, pages 204–214, 2005. 34

[107] Rafał Mantiuk, Grzegorz Krawczyk, Karol Myszkowski, and Hans-Peter Seidel. Perception-motivated high dynamic range video encoding. ACM Trans. Graph., 23(3):733–741, 2004. 39

[108] D. Marr. Vision. W. H. Freeman, San Francisco, 1982. 131

[109] D. Marr and E. C. Hildreth. Theory of edge detection. Proc. Royal Soc. London, Bio. Sci., 207:187–217, 1980. 68, 71

[110] Barbara J. Meier. Painterly rendering for animation. Proceedings of SIGGRAPH 96, pages 477–484, August 1996. 19

[111] Ross Messing and Frank H. Durgin. Distance perception and the visual horizon in head-mounted displays. ACM Trans. Appl. Percept., 2(3):234–250, 2005. 116, 117

[112] A. S. Meyer, A. M. Sleiderink, and W. J. M. Levelt. Viewing and naming objects: Eye movements during noun phrase production. Cognition, 66(2):25–33, 1998. 169

[113] C. Moore and P. Cavanagh. Recovery of 3D volume from 2-tone images of novel objects. Cognition, 67(1):45–71, 1998. 169

[114] K. Moutoussis and S. Zeki. A direct demonstration of perceptual asynchrony in vision. Proc. R. Soc. Lond. B Biol. Sci., 264(1380):393–399, 1997. 151

[115] K. T. Mullen and C. L. Baker Jr. A motion aftereffect from an isoluminant stimulus. Vision Res., 25(5):685–688, 1985. 154

[116] Karol Myszkowski. Perception-based global illumination, rendering, and animation techniques. In SCCG ’02: Proceedings of the 18th spring conference on Computer graphics, pages 13–24, New York, NY, USA, 2002. ACM Press. 41

[117] D. E. Nilsson and S. Pelger. A pessimistic estimate of the time required for an eye to evolve. Proc. R. Soc. Lond. B Bio. Sci., 256(1345):53–58, 1994. 23

[118] J. F. Norman, J. T. Todd, and F. Phillips. The perception of surface orientation from multiple sources of information. Perception and Psychophysics, 57(5):629–636, 1995. 115, 118

[119] Sven C. Olsen, Holger Winnemoller, and Bruce Gooch. Implementing real-time video abstraction. In Proceedings of SIGGRAPH 2006 Sketches. 77

[120] Victor Ostromoukhov. Digital facial engraving. Proceedings of SIGGRAPH 99, pages 417–424, August 1999. 18

[121] Stephen E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press, 1999. 51, 54, 67, 111

[122] Sumanta N. Pattanaik, Jack Tumblin, Hector Yee, and Donald P. Greenberg. Time-dependent visual adaptation for fast realistic image display. In SIGGRAPH ’00: Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 47–54, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co. 39

[123] Patrick Perez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Trans. Graph., 22(3):313–318, 2003. 52, 174

[124] Ken Perlin. Improving noise. In SIGGRAPH ’02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 681–682, New York, NY, USA, 2002. ACM Press. 174

[125] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1991. 58, 61

[126] Tuan Q. Pham and Lucas J. Van Vliet. Separable bilateral filtering for fast video preprocessing. In IEEE Internat. Conf. on Multimedia & Expo, pages CD1–4, Amsterdam, July 2005. 58, 66, 88

[127] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):311–317, 1975. 108

[128] Simon Plantinga and Gert Vegter. Contour generators of evolving implicit surfaces. In SM ’03: Proceedings of the eighth ACM symposium on Solid modeling and applications, pages 23–32, New York, NY, USA, 2003. ACM Press. 114

[129] Jodie M. Plumert, Joseph K. Kearney, James F. Cremer, and Kara Recker. Distance perception in real and virtual environments. ACM Trans. Appl. Percept., 2(3):216–233, 2005. 116, 117

[130] Claudio M. Privitera and Lawrence W. Stark. Algorithms for defining visual regions-of-interest: Comparison with eye fixations. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(9):970–982, 2000. 33, 62

[131] Thierry Pudet. Real time fitting of hand-sketched pressure brushstrokes. Eurographics 1994, 13(3):277–292, August 1994. 18

[132] Paul Rademacher, Jed Lengyel, Edward Cutrell, and Turner Whitted. Measuring the perception of visual realism in images. In Proceedings of the 12th Eurographics Workshop on Rendering Techniques, pages 235–248, London, UK, 2001. Springer-Verlag. 113

[133] Z. Rahman, D. J. Jobson, G. A. Woodell, and G. D. Hines. Automated, on-board terrain analysis for precision landings. In Visual Information Processing XIV, Proc. SPIE 6246, 2006. 166

[134] V. S. Ramachandran and R. L. Gregory. Does colour provide an input to human motion perception? Nature, 275:55–56, Sep. 1978. 154

[135] V. S. Ramachandran and W. Hirstein. The science of art. Journal of Consciousness Studies, 6(6–7):15–51, 1999. 24

[136] Mahesh Ramasubramanian, Sumanta N. Pattanaik, and Donald P. Greenberg. A perceptually based physical error metric for realistic image synthesis. In SIGGRAPH ’99: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 73–82, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co. 40

[137] Ramesh Raskar. Hardware support for non-photorealistic rendering. In HWWS ’01: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pages 41–47, New York, NY, USA, 2001. ACM Press. 122

[138] Ramesh Raskar, Kar-Han Tan, Rogerio Feris, Jingyi Yu, and Matthew Turk. Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Trans. Graph., 23(3):679–688, 2004. 19, 48

[139] M. M. Reid, R. J. Millar, and N. D. Black. Second-generation image coding: an overview. ACM Comput. Surv., 29(1):3–29, 1997. 36

[140] I. Rock and J. DiVita. A case of viewer-centered perception. Cognitive Psychology, 19:280–293, 1987. 118

[141] T. A. Ryan and C. B. Schwartz. Speed of perception as a function of mode of representation. American Journal of Psychology, 69(1):60–69, March 1956. 25, 112, 113, 117, 131

[142] Takafumi Saito and Tokiichiro Takahashi. Comprehensible rendering of 3-D shapes. In Proc. of ACM SIGGRAPH 90, pages 197–206, 1990. 19, 47

[143] Michael P. Salisbury, Michael T. Wong, John F. Hughes, and David H. Salesin. Orientable textures for image-based pen-and-ink illustration. In SIGGRAPH ’97: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 401–406, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. 18

[144] Anthony Santella and Doug DeCarlo. Visual interest and NPR: An evaluation and manifesto. In Proc. of NPAR ’04, pages 71–78, 2004. 19, 20, 33, 41, 49

[145] Jutta Schumann, Thomas Strothotte, Andreas Raab, and Stefan Laser. Assessing the effect of non-photorealistic rendered images in CAD. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Common Ground, pages 35–41, 1996. 88

[146] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:623–656, October 1948. 88

[147] R. N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, New Series, 171(3972):701–703, Feb. 1971. 114, 115, 118

[148] M. Singh, G. D. Seyranian, and D. D. Hoffman. Parsing silhouettes: the short-cut rule. Perceptual Psychophysics, 61(4):636–660, 1999. 169

[149] Sarah V. Stevenage. Can caricatures really produce distinctiveness effects? British Journal of Psychology, 86:127–146, 1995. 49, 53, 80, 81, 83, 98, 116

[150] Thomas Strothotte and Stefan Schlechtweg. Non-Photorealistic Computer Graphics: Modeling, Rendering, and Animation. Morgan Kaufmann, 2002. 103

[151] Kim Sunghee, H. Hagh-Shenas, and Victoria Interrante. Conveying shape with texture: An experimental investigation of the impact of texture type on shape categorization judgments. 2003 IEEE Symposium on Information Visualization, pages 163–170, 2003. 116, 118

[152] Ivan Sutherland. Sketchpad: A man-machine graphical communication system. In Proc. AFIPS Spring Joint Computer Conference, pages 329–346, Washington, D.C., 1963. Spartan Books. 18

[153] Graeme Sweet and Colin Ware. View direction, surface orientation and texture orientation for perception of surface shape. In GI ’04: Proceedings of the 2004 conference on Graphics interface, pages 97–106. Canadian Human-Computer Communications Society, 2004. 112, 115, 118, 178

[154] M. J. Tarr. Orientation Dependence in Three-Dimensional Object Recognition. PhD thesis, Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 1989. 118

[155] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proceedings of ICCV ’98, pages 839–846, 1998. 59, 61

[156] J. Tumblin, A. Agarwal, and R. Raskar. Why I want a gradient camera. Computer Vision and Pattern Recognition (CVPR), pages 103–110, 2005. 52, 174

[157] Jack Tumblin, Jessica K. Hodgins, and Brian K. Guenter. Two methods for display of high contrast images. ACM Trans. Graph., 18(1):56–94, 1999. 38

[158] Jack Tumblin and Greg Turk. LCIS: A boundary hierarchy for detail-preserving contrast reduction. In SIGGRAPH ’99: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 83–90, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co. 38, 50, 61

[159] R. L. De Valois and K. K. De Valois. Spatial Vision. Oxford University Press, New York, 1988. 54

[160] P. Verghese and D. G. Pelli. The information capacity of visual attention. Vision Research, 32(5):983–995, May 1992. 105, 130

[161] Jue Wang, Yingqing Xu, Heung-Yeung Shum, and Michael F. Cohen. Video tooning. ACM Trans. Graph., 23(3):574–583, 2004. 48, 90, 91, 92, 93, 174

[162] Gregory J. Ward. The RADIANCE lighting simulation and rendering system. In SIGGRAPH ’94: Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pages 459–472, New York, NY, USA, 1994. ACM Press. 17, 41

[163] Benjamin Watson, Alinda Friedman, and Aaron McGaffey. Measuring and predicting visual fidelity. In SIGGRAPH ’01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 213–220, New York, NY, USA, 2001. ACM Press. 37

[164] Joachim Weickert. Anisotropic Diffusion in Image Processing. ECMI. Teubner, Stuttgart, 1998. 61

[165] Andreas Wenger, Andrew Gardner, Chris Tchou, Jonas Unger, Tim Hawkins, and Paul Debevec. Performance relighting and reflectance transformation with time-multiplexed illumination. ACM Trans. Graph., 24(3):756–764, 2005. 17

[166] Turner Whitted. An improved illumination model for shaded display. Commun. ACM, 23(6):343–349, 1980. 15

[167] Nathaniel Williams, David Luebke, Jonathan D. Cohen, Michael Kelley, and Brenden Schubert. Perceptually guided simplification of lit, textured meshes. In SI3D ’03: Proceedings of the 2003 symposium on Interactive 3D graphics, pages 113–121, New York, NY, USA, 2003. ACM Press. 37

[168] A. Willis and S. J. Anderson. Separate colour-opponent mechanisms underlie the detection and discrimination of moving chromatic targets. Proc. R. Soc. Lond. B Biol. Sci., 265(1413):2435–2441, 1998. 154, 178

[169] Georges Winkenbach and David H. Salesin. Computer-generated pen-and-ink illustration. In Proc. of ACM SIGGRAPH 94, pages 91–100, 1994. 95

[170] Georges Winkenbach and David H. Salesin. Rendering parametric surfaces in pen and ink. In SIGGRAPH ’96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 469–476, New York, NY, USA, 1996. ACM Press. 18, 19

[171] Holger Winnemoeller, Sven C. Olsen, and Bruce Gooch. Real-time video abstraction. ACM Trans. Graph., 25(3):1221–1226, 2006. 101

[172] Holger Winnemoller. Testing effects of color constancy for images displayed on CRT devices. Technical Report CS03-03-00, University of Cape Town, Computer Science Department, September 2003. 165

[173] Holger Winnemoller and Shaun Bangay. Geometric approximations towards free specular comic shading. Computer Graphics Forum, 21(3):309–316, September 2002. 19

[174] Holger Winnemoller and Shaun Bangay. Rendering Optimisations for Stylised Sketching. In ACM Afrigraph 2003: 2nd International Conference on Computer Graphics, Virtual Reality and Visualization in Africa, pages 117–122. ACM, ACM SIGGRAPH, February 2003. 19

[175] A. P. Witkin. Scale-space filtering. In 8th Int. Joint Conference on Artificial Intelligence,pages 1019–1022, Karlsruhe, Germany, 1983. 54

[176] Eric Wong. Artistic rendering of portrait photographs. Master’s thesis, Cornell University, 1999. 19

[177] M. Woo and M. B. Sheridan. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999. 123

[178] G. A. Woodell, D. J. Jobson, Z. Rahman, and G. D. Hines. Advanced image processing of aerial imagery. In Visual Information Processing XIV, Proc. SPIE 6246, 2006. 166

[179] G. Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, New York, NY, 1982. 51

[180] Hector Yee, Sumanta Pattanaik, and Donald P. Greenberg. Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments. ACM Trans. Graph., 20(1):39–65, 2001. 34, 41

[181] Ian T. Young, Lucas J. Van Vliet, and Michael Van Ginkel. Recursive Gabor filtering. IEEE Trans. on Signal Processing, 50(11):2798–2805, 2002. 89

[182] R. A. Young. Some observations on temporal coding of color vision: psychophysical results. Vision Res., 17(8):957–965, 1977. 162

[183] R. A. Young. The gaussian derivative model for spatial vision: I. Retinal mechanisms. Spatial Vision, 2:273–293, 1987. 175

[184] John C. Yuille and James H. Steiger. Nonholistic processing in mental rotation: Some suggestive evidence. Perception & Psychophysics, 31(3):201–209, 1982. 114, 115, 118

[185] S. Zeki and M. Lamb. The neurology of kinetic art. Brain, 117:607–636, 1994. 24

[186] S. Zeki and M. Marini. Three cortical stages of colour processing in the human brain. Brain, 121:1669–1685, 1998. 24

[187] S. M. Zeki. Colour coding in the superior temporal sulcus of rhesus monkey visual cortex. Proc. R. Soc. Lond. B Biol. Sci., 197(1127):195–223, 1977. 154

[188] Semir Zeki. A vision of the brain. Blackwell Scientific Publications, Oxford, 1993. 45, 51, 83, 124, 175

[189] Semir Zeki. Art and the brain. Journal of Consciousness Studies, 6(6–7):76–96, 1999. 21, 24, 25, 151, 175, 176

[190] Robert C. Zeleznik, Kenneth P. Herndon, and John F. Hughes. SKETCH: An interface for sketching 3D scenes. In SIGGRAPH ’96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 163–170, New York, NY, USA, 1996. ACM Press. 18


APPENDIX A

User-data for Videoabstraction Studies

Table A.1 and Table A.2 list the per-participant data values for study 1 and study 2 in Section 3.4. Figure 3.21 visualizes the data for both tables.

In these tables and the following, Std. Dev. stands for standard deviation, σ, and Std. Err. stands for standard error (not normalized), se ≡ σ/√n, where n is the number of samples.
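As a quick check of the summary rows in these tables, the mean, standard deviation, and standard error can be recomputed from the raw values. A minimal Python sketch, using the photograph recognition times from Table A.1 (any small differences from the printed Std. Dev. and Std. Err. reflect rounding in the tabulated per-participant averages):

```python
import math
import statistics

# Photograph recognition times (msec) from Table A.1, one value per participant.
photo = [1159, 1291, 1660, 1305, 1342, 1486, 1712, 1622, 1748, 1811]

n = len(photo)
mean = statistics.mean(photo)    # average recognition time
sigma = statistics.stdev(photo)  # sample standard deviation
se = sigma / math.sqrt(n)        # standard error, se = sigma / sqrt(n)

print(f"n={n}  mean={mean:.1f}  std_dev={sigma:.1f}  std_err={se:.1f}")
```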

Study 1: Recognition Time (msec)

Data Pair   Photograph   Abstraction
1             1159          965
2             1291         1237
3             1660         1281
4             1305         1285
5             1342         1330
6             1486         1367
7             1712         1378
8             1622         1388
9             1748         1435
10            1811         1520
Average       1513.5       1318.5
Std. Dev.      227.5        148.5
Std. Err.       72.0         47.0

Table A.1. Data for Videoabstraction Study 1. This table shows the average time (in milliseconds) each participant took to recognize a depicted face (photograph or abstraction), taken over all faces presented to the participant. The data pairs are ordered in ascending abstraction time, corresponding to Figure 3.21, top graph.


Study 2: Memory Time (secs) and Clicks

              Time (secs)                 Clicks
Data Pair   Photograph  Abstraction   Photograph  Abstraction
1             60.5        48.7           60          42
2             54.1        50.8           58          52
3             68.1        51.7           86          64
4             64.6        55.4           50          40
5             92.5        57.0           62          52
6             77.7        57.1           59          42
7             76.9        60.0           64          52
8             92.0        66.2           64          44
9             91.2        71.2           51          42
10            83.7        81.4           70          62
Average       75.5        59.4           62.4        49.2
Std. Dev.     13.3         9.9            9.7         8.2
Std. Err.      4.0         3.0            2.9         2.5

Table A.2. Data for Videoabstraction Study 2. This table shows the time (in seconds) and number of clicks each participant used to complete a memory game with photographs and a memory game with abstraction images. The data pairs are ordered in ascending abstraction time, corresponding to Figure 3.21, middle and bottom graphs (this ordering is not intended to correspond to Table A.1).


APPENDIX B

User-data for Shape-from-X Study

Tables B.1–B.5 list experimental data (aggregate values averaged over four trials) for each display mode for all 21 participants of the shape-from-X study described in Section 4.5. Figure B.1 shows the questionnaire given to participants after they completed the experimental trials. Table B.6 lists the numerical data gathered from the questionnaire.


Shading
UserID   success  failure  risk   placement  detection
BA044    0.868    0.018    0.525  0.885      2.970
BA046    0.910    0.073    0.235  0.983      1.021
BA047    0.863    0.050    0.460  0.910      2.311
BA048    0.920    0.055    0.688  0.978      3.344
BA049    0.928    0.020    0.403  0.948      2.135
BA050    0.878    0.043    0.505  0.923      2.445
BA051    0.860    0.060    0.360  0.923      1.585
BA052    0.820    0.040    0.335  0.860      1.483
BA053    0.855    0.020    0.895  0.880      6.654
BA054    0.800    0.035    0.443  0.833      2.196
BA055    0.835    0.000    0.910  0.835      5.767
BA056    0.670    0.198    0.793  0.870      1.758
BA057    0.810    0.035    0.813  0.845      4.565
BA058    0.903    0.008    0.643  0.910      3.868
BA059    0.900    0.033    0.393  0.933      1.857
BA060    0.785    0.053    0.463  0.835      1.976
BA061    0.778    0.073    0.625  0.850      2.957
BA062    0.785    0.020    0.800  0.805      3.545
BA063    0.788    0.038    0.345  0.823      1.547
BA064    0.793    0.038    0.493  0.830      2.289
BA065    0.880    0.010    0.465  0.890      2.354
Average  0.839    0.044    0.552  0.883      2.792
Std. Dev. 0.063   0.041    0.198  0.051      1.433
Std. Err. 0.014   0.009    0.043  0.011      0.313

Table B.1. Shading Data. Averages of each participant over four trials for the Shading display mode.


Outline
UserID   success  failure  risk   placement  detection
BA044    0.848    0.095    0.495  0.938      2.179
BA046    0.693    0.280    0.093  0.973      0.263
BA047    0.723    0.185    0.283  0.908      0.883
BA048    0.830    0.140    0.310  0.965      1.200
BA049    0.838    0.128    0.358  0.963      1.413
BA050    0.903    0.068    0.315  0.968      1.311
BA051    0.765    0.213    0.235  0.978      0.808
BA052    0.683    0.190    0.188  0.873      0.639
BA053    0.850    0.058    0.483  0.908      2.366
BA054    0.690    0.065    0.355  0.755      1.465
BA055    0.808    0.038    0.648  0.848      3.080
BA056    0.683    0.248    0.690  0.928      1.507
BA057    0.723    0.130    0.565  0.848      2.016
BA058    0.755    0.110    0.428  0.863      1.664
BA059    0.753    0.170    0.230  0.923      0.732
BA060    0.803    0.103    0.390  0.905      1.464
BA061    0.760    0.163    0.418  0.920      1.499
BA062    0.648    0.065    0.825  0.715      3.194
BA063    0.743    0.095    0.288  0.838      1.157
BA064    0.785    0.090    0.398  0.875      1.646
BA065    0.918    0.028    0.263  0.943      1.196
Average  0.771    0.127    0.393  0.897      1.509
Std. Dev. 0.075   0.069    0.177  0.069      0.740
Std. Err. 0.016   0.015    0.039  0.015      0.162

Table B.2. Outline Data. Averages of each participant over four trials for the Outline display mode.


Mixed
UserID   success  failure  risk   placement  detection
BA044    0.853    0.085    0.533  0.938      2.287
BA046    0.760    0.190    0.248  0.953      1.005
BA047    0.890    0.075    0.470  0.965      2.103
BA048    0.903    0.055    0.693  0.960      3.440
BA049    0.933    0.038    0.430  0.973      2.083
BA050    0.838    0.068    0.470  0.903      2.149
BA051    0.843    0.070    0.365  0.908      1.506
BA052    0.900    0.055    0.345  0.953      1.557
BA053    0.885    0.043    0.633  0.928      3.231
BA054    0.738    0.083    0.485  0.820      2.047
BA055    0.788    0.045    1.048  0.833      5.975
BA056    0.683    0.193    2.900  0.875      3.033
BA057    0.743    0.078    0.698  0.823      2.833
BA058    0.800    0.065    0.608  0.868      2.741
BA059    0.805    0.063    0.420  0.870      1.904
BA060    0.825    0.103    0.398  0.928      1.690
BA061    0.798    0.075    0.630  0.873      2.669
BA062    0.750    0.135    0.903  0.885      3.172
BA063    0.688    0.080    0.345  0.765      1.343
BA064    0.723    0.098    0.533  0.823      2.050
BA065    0.860    0.040    0.503  0.898      2.635
Average  0.810    0.083    0.650  0.892      2.450
Std. Dev. 0.073   0.043    0.549  0.057      1.043
Std. Err. 0.016   0.009    0.120  0.012      0.228

Table B.3. Mixed Data. Averages of each participant over four trials for the Mixed display mode.


TexISO
UserID   success  failure  risk   placement  detection
BA044    0.668    0.223    0.235  0.888      0.718
BA046    0.658    0.253    0.133  0.908      0.403
BA047    0.640    0.245    0.258  0.883      0.747
BA048    0.620    0.295    0.353  0.915      0.934
BA049    0.648    0.300    0.303  0.945      0.743
BA050    0.758    0.185    0.195  0.940      0.673
BA051    0.730    0.230    0.180  0.960      0.517
BA052    0.670    0.185    0.148  0.855      0.458
BA053    0.740    0.150    0.390  0.888      1.214
BA054    0.735    0.140    0.133  0.880      0.423
BA055    0.705    0.100    0.378  0.805      1.415
BA056    0.468    0.350    0.513  0.818      0.780
BA057    0.673    0.233    0.300  0.905      0.789
BA058    0.680    0.250    0.223  0.928      0.677
BA059    0.820    0.058    0.173  0.878      0.733
BA060    0.728    0.183    0.293  0.910      0.955
BA061    0.543    0.398    0.235  0.940      0.544
BA062    0.580    0.135    0.438  0.715      1.309
BA063    0.590    0.188    0.183  0.775      0.577
BA064    0.540    0.213    0.275  0.753      0.814
BA065    0.718    0.138    0.235  0.855      0.811
Average  0.662    0.212    0.265  0.873      0.773
Std. Dev. 0.084   0.082    0.103  0.066      0.274
Std. Err. 0.018   0.018    0.022  0.014      0.060

Table B.4. TexISO Data. Averages of each participant over four trials for the TexISO display mode.


TexNOI
UserID   success  failure  risk   placement  detection
BA044    0.688    0.273    0.193  0.958      0.555
BA046    0.675    0.218    0.153  0.893      0.439
BA047    0.510    0.385    0.240  0.895      0.533
BA048    0.620    0.323    0.275  0.945      0.717
BA049    0.520    0.458    0.300  0.975      0.655
BA050    0.773    0.200    0.258  0.968      0.878
BA051    0.645    0.258    0.193  0.900      0.532
BA052    0.573    0.323    0.115  0.895      0.344
BA053    0.653    0.238    0.295  0.893      0.903
BA054    0.715    0.123    0.115  0.838      0.374
BA055    0.690    0.120    0.383  0.808      1.404
BA056    0.565    0.315    0.533  0.880      0.719
BA057    0.695    0.165    0.358  0.858      1.127
BA058    0.650    0.228    0.233  0.880      0.731
BA059    0.690    0.135    0.170  0.828      0.577
BA060    0.640    0.113    0.328  0.753      1.135
BA061    0.628    0.258    0.285  0.888      0.794
BA062    0.480    0.165    0.395  0.645      1.111
BA063    0.710    0.170    0.145  0.880      0.489
BA064    0.580    0.255    0.263  0.833      0.707
BA065    0.795    0.143    0.210  0.938      0.770
Average  0.643    0.231    0.259  0.874      0.738
Std. Dev. 0.082   0.093    0.103  0.076      0.277
Std. Err. 0.018   0.020    0.023  0.017      0.060

Table B.5. TexNOI Data. Averages of each participant over four trials for the TexNOI display mode.


PiGeonAtor Questionnaire

1.) Please rank the display modes in order of difficulty. If some modes felt the same, you can assign the same number. (Scale: 1=Easiest … 5=Most difficult)
    (A) Shading           Rating: ...............
    (B) Lines             Rating: ...............
    (C) Shading & Lines   Rating: ...............
    (D) Texture 1         Rating: ...............
    (E) Texture 2         Rating: ...............

2.) Rate how well you think you performed in the experiment (did you hit most targets?)
    Performance: ...............    (Scale: 1=Very Good … 5=Poor)

3.) Rate how clear the instructions were to understand.
    Instructions: ...............    (Scale: 1=Very Clear … 5=Totally Unclear)

4.) Rate how difficult you found the mode of interaction with the system (i.e. clicking/tapping).
    Interaction: ...............    (Scale: 1=Very Easy … 5=Very difficult)

5.) Rate the duration of the experiment.
    Length: ...............    (Scale: 1=Too short … 5=Too long)

6.) Did the experiment tire/exhaust you?
    Exhaustion: ...............    (Scale: 1=Not at all … 5=I was very exhausted)

7.) Did the experiment cause you any discomfort?
    Comfort: ...............    (Scale: 1=Not at all … 5=I was very uncomfortable)

8.) If you did not answer "Not at all" above, please explain:
    Comfort explanation: ...............

9.) Did you notice that your hands were casting a shadow?
    ( ) Yes    ( ) No

10.) If yes above, do you think it impaired your performance?
    ( ) Yes    ( ) No

11.) Please give any additional comments or suggestions you may have.
    ...............

Figure B.1. Questionnaire. Participants were asked to fill out this short questionnaire after completing all trials.


Subjective Difficulty

Columns: 1a) Shading, 1b) Lines (Outlines), 1c) Shading&Lines (Mixed), 1d) Texture 1 (TexISO), 1e) Texture 2 (TexNOI), 2) Performance, 3) Instructions, 4) Interaction, 5) Duration, 6) Exhaustion, 7) Discomfort, 9) Shadow Cast, 10) Shadow Impair.

UserID     1a   1b   1c   1d   1e   2    3    4    5    6    7    9    10
BA044      1    2    2    5    4    3    1    2    4    4    3    0    0
BA046      2    5    1    4    3    3    2    1    3    3    1    0    0
BA047      2    3    1    5    4    4    1    1    3    1    1    0    0
BA048      1    5    2    3    4    2    1    1    3    2    2    0    0
BA049      3    4    1    5    5    4    2    3    3    2    1    0    0
BA050      2    3    1    4    5    2    1    1    3    2    1    0    0
BA051      1    4    1    5    5    3    2.5  2    5    4    2    1    0
BA052      2    3    1    5    5    3    1    2    4    2    1    0    0
BA053      2    4    1    3    5    2    4    2    3    3    3    1    0
BA054      1    2    1    4    3    3    3    1    4    3    1    0    0
BA055      2    3    1    4    5    4    1    1    3    1    1    0    0
BA056      3    1    1    4    5    3    4    3    2    2    1    0    0
BA057      1    3    1    5    4    2    2    3    5    4    3    0    0
BA058      1    4    2    5    5    2    1    3    3    4    1    1    1
BA059      1    3    2    4    5    2.5  3    1    3    2    1    1    0
BA060      2    3    1    4    5    3    2    1    4    2    1    0    0
BA061      2    3    1    4    5    3    1    1    3    2    2    0    0
BA062      2    3    1    4    5    3    1    3    3    1    1    0    0
BA063      2    3    1    4    5    2.5  4    2    2.5  2    2    1    1
BA064      2    3    1    5    4    3    2    2    5    3    3    1    1
BA065      2    3    1    4    5    4    5    2    3    2    2    0    0
Average    1.8  3.2  1.2  4.3  4.6  2.9  2.1  1.8  3.4  2.4  1.6  0.3  0.2
Std. Err.  0.1  0.2  0.1  0.1  0.2  0.2  0.3  0.2  0.2  0.2  0.2  0.1  0.1
Mode       2    3    1    4    5

Table B.6. Questionnaire Data. Numerical results for the questionnaire shown in Figure B.1. See the questionnaire for the meaning of each column and the scales used. Display mode names in parentheses are those used in this dissertation. For questions 9 and 10: 1=Yes and 0=No. Mode in the last row refers to the statistical measure (most frequent number), not display mode.
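The Mode row, i.e. the most frequent difficulty rating per display mode, can be recomputed from the per-participant ratings in columns 1a)–1e) of Table B.6. A minimal Python sketch, with the ratings transcribed from the table (half-point ratings such as 2.5 appear only in other columns and are not involved here):

```python
from statistics import mode

# Difficulty ratings per display mode (columns 1a-1e of Table B.6),
# one entry per participant, in order BA044-BA065.
ratings = {
    "Shading":  [1, 2, 2, 1, 3, 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Outlines": [2, 5, 3, 5, 4, 3, 4, 3, 4, 2, 3, 1, 3, 4, 3, 3, 3, 3, 3, 3, 3],
    "Mixed":    [2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1],
    "TexISO":   [5, 4, 5, 3, 5, 4, 5, 5, 3, 4, 4, 4, 5, 5, 4, 4, 4, 4, 4, 5, 4],
    "TexNOI":   [4, 3, 4, 4, 5, 5, 5, 5, 5, 3, 5, 5, 4, 5, 5, 5, 5, 5, 5, 4, 5],
}

# Most frequent rating per display mode (the Mode row of Table B.6).
modes = {name: mode(vals) for name, vals in ratings.items()}
print(modes)
```

This reproduces the Mode row of the table: 2, 3, 1, 4, 5 for Shading, Outlines, Mixed, TexISO, and TexNOI respectively.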


APPENDIX C

Links for Selected Objects

Table C.1 lists publicly accessible internet URLs for several images and other objects used in this dissertation. No guarantees can be made about the validity and availability of these links.


Figure 1.1: http://commons.wikimedia.org/wiki/Image:Glasses_800.png
Figure 1.2(a): http://commons.wikimedia.org/wiki/Image:IMG_0071_-_England%2C_London.JPG
Figure 1.3(b), Bunny model: http://graphics.stanford.edu/data/3Dscanrep/
Figure 1.4(a): http://commons.wikimedia.org/wiki/Image:Escaping_criticism_by_Caso.jpg
Figure 1.4(b): http://commons.wikimedia.org/wiki/Image:Portrait_of_Dr._Gachet.jpg
Figure 3.8, Eye-tracking data and source image: http://www.cs.rutgers.edu/~decarlo/abstract.html
Figure 3.9, Source images: http://upload.wikimedia.org/wikipedia/commons/4/4c/Pitt_Clooney_Damon.jpg
Figure 3.10, Source image: http://www.indcjournal.com/archives/Lehrer.jpg
Figures 3.13–3.17, Source image: http://www.flickr.com/photos/johnnydriftwood/115499900/
Figure 3.26, Original, stationary: http://commons.wikimedia.org/wiki/Image:Ferrari-250-GT-Berlinetta-1.jpg
Figure 4.4, Girl courtesy of: www.crystalspace3d.org
Figure 4.4, Man & Tool courtesy of: http://www.3dcafe.com
Figure 4.4, Architecture model courtesy of Google 3D Warehouse: http://sketchup.google.com/3dwarehouse
Figure 4.7, Rendering engine: http://fabio.policarpo.nom.br/fly3d/index.htm
Figure 5.4, Left: http://commons.wikimedia.org/wiki/Image:Burger_King_Whopper_Combo.jpg
Figure 5.4, Right: http://www.flickr.com/photo_zoom.gne?id=100995096&size=o

Table C.1. Internet references. Links to selected images and 3-D models.