UNIVERSITY OF MINNESOTA
This is to certify that I have examined this copy of a doctoral thesis by
TODD CAMERON WITTMAN
and have found that it is complete and satisfactory in all respects, and that any and all
revisions required by the final examining committee have been made.
DEPARTMENT OF MATHEMATICS
UNIVERSITY OF MINNESOTA
VARIATIONAL APPROACHES TO DIGITAL IMAGE ZOOMING
A THESIS
SUBMITTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
TODD CAMERON WITTMAN
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
August 2006
© Copyright 2006 TODD WITTMAN
Acknowledgments
I would like to thank my advisor Prof. Fadil Santosa for his patience, guidance, and
mentorship throughout my long stay at the University of Minnesota. It means a lot to me to
have someone looking out for me. I owe a lot to the faculty at the University of Minnesota,
especially Prof. Jackie Shen who helped me as a teacher, friend, and tour guide in China.
I would also like to thank Prof. Song-Chun Zhu of UCLA for inviting me to the Lotus
Hill Workshop, where I started my super-resolution work. Some of my photos from that
workshop appear in Chapter 4. Much of the barcode processing work was done as part of
an industrial internship under the supervision of Dr. Miroslav Trajkovic. The pictures of
human brains and advice on working with medical images were provided by my friend Dr.
Steen Moeller at the Center for Magnetic Resonance Research. I also need to thank my
PhD committee for actually reading this thesis and making constructive comments: Gilad
Lerman, Willard Miller, and Ravi Janardan.
As preparation for this thesis, I investigated several computational strategies and have
adapted software from various sources. Thanks to the authors who provided this code
by placing it online or sending it directly to me. Pilfered code includes Jackie Shen’s Γ-
convergence routine, Antonin Chambolle’s quantized TV minimization, Yuri Boykov’s graph
cut package, Guy Gilboa’s locally adaptive TV norm, Stanford’s BGL graph algorithm
library, and Miroslav Trajkovic’s barcode decoding software.
I would also like to thank my brothers, Scott and Andy, and my parents, Mimi and
Paul, for their support.
Finally I would like to thank you, gentle reader, for reading this thesis and the acknowl-
edgments that preface it. I hope you enjoy reading this as much as I enjoyed writing it.
Hopefully more.
Abstract
The purpose of this thesis is to discuss digital image resolution enhancement by varia-
tional methods and the associated computational issues. Two problems related to the basic
zooming problem are also studied: super-resolution and quantized deconvolution.
Digital zooming is important for mundane computing activities such as web browsing
as well as sophisticated applications like satellite imagery and medical diagnosis. Unfortunately,
zooming is an ill-posed mathematical problem and the linear filters common in
imaging software are often not adequate for the task. Other interpolation approaches in-
clude wavelets, PDEs, machine learning, and statistical filters, but variational methods offer
computational advantages in the application and flexibility of the models. We discuss the
theoretical and computational issues surrounding variational zooming, focusing on the Total
Variation (TV) and Mumford-Shah energies. The variational inpainting model is very
flexible and the interpolated result can be improved with energy modifications, including
locally adaptive fidelity weights, soft inpainting, and post-processing.
Super-resolution refers to the process of producing a single high-resolution image from
a set of low-resolution images such as a video sequence. Variational inpainting extends
naturally to the multiple-image case and is shown to be effective for video enhancement,
barcode processing, and MR image reconstruction. We propose a soft inpainting model to
handle local variation and motion within a video sequence.
Text and barcode images should appear as strictly binary-valued images, but due to
blurring and downsampling the actual image takes on many gray values and may be unreadable
by recognition systems. Given a blurred grayscale image, the goal of quantized
zooming is to produce a clean, high-resolution image taking on only a limited number of
gray values. The graph cut method has proven successful for exact minimization of the
quantized TV energy. We show the graph cut method is effective for denoising, segmenta-
tion, and inpainting, but deconvolution is an open problem in the literature. We propose
an alternating minimization method for deblurring that combines graph cuts and numeri-
cal relaxation inspired by linear programming. For the zooming problem, the approach is
improved by the addition of local gradient information. We provide numerical results for
barcode imaging, text enhancement, and medical image segmentation.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction 1
1.1 The Digital Zooming Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Organization and Contributions of this Thesis . . . . . . . . . . . . . . . . . 5
2 Survey of Zooming Approaches 7
2.1 Linear Interpolation Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Which Methods to Consider? . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 A PDE-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Heat Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 A Multiscale Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Wavelets and Hölder Regularity . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Wavelet-Based Interpolation . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 A Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Locally Linear Embedding (LLE) . . . . . . . . . . . . . . . . . . . . 20
2.5.2 LLE-Based Interpolation . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 A Statistical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.1 Local vs. Global Interpolation . . . . . . . . . . . . . . . . . . . . . 27
2.6.2 NL-Means Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.3 NL-Means Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.5 Further Research on NL-Means Interpolation . . . . . . . . . . . . . 36
2.7 Summary and Motivation for the Variational Approach . . . . . . . . . . . 38
3 Variational Zooming 40
3.1 Introduction to the Variational Approach . . . . . . . . . . . . . . . . . . . 40
3.2 The Total Variation (TV) Energy . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Theory and Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Numerical Computation: The Digital TV Filter . . . . . . . . . . . . 46
3.3 The Mumford-Shah Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 Theory and Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.2 Numerical Computation: The Γ-Convergence Approximation . . . . 53
3.4 Numerical Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Modifications to the Inpainting Model . . . . . . . . . . . . . . . . . . . . . 60
3.5.1 Incorporating a Blur Kernel . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2 Locally Adaptive Fidelity Weights . . . . . . . . . . . . . . . . . . . 61
3.5.3 Soft Inpainting with Nearest Neighbor Information . . . . . . . . . . 64
3.5.4 Variational Zooming as Post-Processing . . . . . . . . . . . . . . . . 67
4 Variational Super-resolution 71
4.1 Super-resolution of an Image Sequence . . . . . . . . . . . . . . . . . . . . . 71
4.2 Super-resolution by Variational Inpainting . . . . . . . . . . . . . . . . . . . 74
4.2.1 Data Fusion with Known Registration . . . . . . . . . . . . . . . . . 74
4.2.2 Simultaneous Registration and Fusion . . . . . . . . . . . . . . . . . 80
4.3 Artifact Reduction by Soft Inpainting . . . . . . . . . . . . . . . . . . . . . 83
4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1 Video Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.2 Barcode Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.3 Reconstruction from MRI Sensor Data . . . . . . . . . . . . . . . . . 100
5 Quantized Zooming 108
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.1.1 Quantized Image Processing and the Quantized TV Energy . . . . . 108
5.1.2 Previous Work on Quantized Image Processing . . . . . . . . . . . . 109
5.2 Quantized TV Minimization by Graph Cuts . . . . . . . . . . . . . . . . . . 110
5.2.1 Network Flows: Definitions . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.2 Network Flows: Algorithms . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.3 The Quantized TV Model . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Application to Low-Level Vision Tasks . . . . . . . . . . . . . . . . . . . . . 122
5.3.1 Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.3 Texture Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3.4 Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.5 Zooming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.4 Quantized TV Minimization with a Blur Kernel . . . . . . . . . . . . . . . . 132
5.4.1 Deblurring by Numerical Relaxation . . . . . . . . . . . . . . . . . . 133
5.4.2 Zooming Using Local Gradient Information . . . . . . . . . . . . . . 136
5.5 Extensions of the Quantized TV Model . . . . . . . . . . . . . . . . . . . . 142
5.5.1 Determining Intensity Levels . . . . . . . . . . . . . . . . . . . . . . 142
5.5.2 The TV-L1 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.5.3 The 8-connected Topology . . . . . . . . . . . . . . . . . . . . . . . . 145
5.5.4 3-D Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.6 Applications of Binary TV Minimization . . . . . . . . . . . . . . . . . . . . 147
5.6.1 Barcode Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6.2 Enhancement for Text Recognition . . . . . . . . . . . . . . . . . . . 151
5.6.3 Medical Image Segmentation . . . . . . . . . . . . . . . . . . . . . . 154
Bibliography 156
List of Figures
1.1 Actual vs. effective resolution. Left: Original image. Center: nearest neigh-
bor zoom. Right: zoom using Navier-Stokes inpainting from [12]. The two
zoomed images have the same number of pixels. . . . . . . . . . . . . . . . . 2
2.1 Part of Lena image downsampled and then upsampled by factor M = 2. . . 8
2.2 PDE-based upsampling with zoom M = 3 and time step δt = 0.1. Top Row:
Magnification of Miller image at time T = 9. Bottom Row: Close-up of
section of image and comparison to linear filters. . . . . . . . . . . . . . . . 14
2.3 PDE-based upsampling of text image with zoom M = 5 and time step δt =
0.1. Left: original image. Right: zoomed image at time T = 25. . . . . . . . 15
2.4 Discrete 3-level wavelet decomposition of noisy sine wave signal. . . . . . . 17
2.5 4-level Wavelet decomposition of noisy step function f . . . . . . . . . . . . . 18
2.6 LLE dimensionality reduction. Left: original 3D spherical data set. Right:
2D data set computed by LLE. . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Training set used for LLE-based interpolation. . . . . . . . . . . . . . . . . 24
2.8 LLE-based interpolation of a face image with zoom M = 3 and training set
shown in Figure 2.7. Left: original image. Right: LLE interpolated image. . 25
2.9 Close-up of eye in Figure 2.8 with zoom M = 3. Top left: nearest neigh-
bor. Top right: bilinear. Bottom left: bicubic. Bottom right: LLE-based
interpolation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10 Text image interpolated by LLE with zoom M = 3. The training set in
Figure 2.7 was used. Top: original image. Bottom: LLE interpolated image. 27
2.11 NL-means denoising on part of the Lena image. Left: noisy image. Right:
image after NL-means denoising. Taken from [19]. . . . . . . . . . . . . . . 30
2.12 Illustration of M -neighborhoods for a 3x3 pixel square topology. . . . . . . 32
2.13 Interpolation of Brodatz fabric texture with zoom M = 3. Left: Original
image. Center: Bicubic interpolation. Right: NL-means interpolation. . . . 33
2.14 NL-means interpolation of ringed image with zoom M = 3 compared to linear
interpolation filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.15 Interpolation of a noisy image by factor M = 3. The image at bottom-right
underwent bicubic interpolation followed by NL-means denoising. . . . . . . 35
2.16 NL-means interpolation of textured image with zoom M = 4. Left: original
image. Center: Bicubic interpolation. Right: NL-means interpolation. . . . 36
2.17 NL-means zooming of portion of MR brain image. Top: original MRI.
Bottom-left: lower left corner of brain. Bottom-right: NL-means zoom with
M = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Illustration of zooming by variational inpainting for magnification M = 3. . 42
3.2 Inpainting a simple image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 TV and Mumford-Shah zoom of checkerboard image for magnification M =
3. The fourth column is a detail view of the image in the third column. . . 56
3.4 Zoom of color image with M = 4. . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Zoom of MRI brain image with M = 3. . . . . . . . . . . . . . . . . . . . . 59
3.6 2x TV zooming of noisy image with locally adaptive fidelity weights. . . . . 63
3.7 Effect of σ on Mumford-Shah soft inpainting with λ = 20, γ = 2000, M = 5. 66
3.8 Comparison of zooming using standard and soft Mumford-Shah inpainting
with λ = 20, γ = 2000, σ = 1, M = 5. . . . . . . . . . . . . . . . . . . . . . 67
3.9 Different possible inpainting masks for a single image with magnification
M = 5. Left to right: original image, standard inpainting mask, average of
soft inpainting mask, Laplacian post-processing mask. . . . . . . . . . . . . 68
3.10 Comparison of standard variational zooming and post-processing methods
with magnification M = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.11 Zooming by magnification factor M = 2π using Mumford-Shah post-processing. 70
4.1 Illustration of image registration for super-resolution. The three images
u1, u2, u3 are aligned to a common high-resolution lattice ΩM by the re-
spective geometric transformations ϕ1, ϕ2, ϕ3. . . . . . . . . . . . . . . . . . 73
4.2 Super-resolution of 5-image sequence. Top left: original third image in se-
quence. Top right: 4x TV SR with λ = 20. Bottom left: 4x MS SR with
λ = 20, γ = 2000. Bottom right: 4x MS SR with registration incorrect by
1/2 pixel on low-resolution lattice. . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 4x color image zoom of 5-image sequence with known registration. Top row:
nearest neighbor, bilinear, bicubic. Bottom row: staircased bicubic, median
image, MS SR with λ = 20, γ = 2000. . . . . . . . . . . . . . . . . . . . . . 79
4.4 Super-resolution of 11-frame video sequence with known registration. Top
row: 4 frames from original sequence. Bottom row: corresponding 4 frames
from 4x MS SR with λ = 20, γ = 2000. . . . . . . . . . . . . . . . . . . . . . 80
4.5 Super-resolution video sequence with known and unknown registration. Top:
one frame from original 11-frame sequence. Center: 4x MS SR using ground-
truth registration. Bottom: 4x MS SR with simultaneous translational reg-
istration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 Artifact reduction on three frames of 7-frame video sequence. Top row: orig-
inal video frames. Center row: 2x MS SR with λ = 5, γ = 2000. Bottom
row: 2x MS SR with soft inpainting σ = 10. . . . . . . . . . . . . . . . . . . 89
4.7 Frame from traffic video of intersection in Karlsruhe. The four highlighted
cars were tracked for super-resolution enhancement. . . . . . . . . . . . . . 91
4.8 Super-resolution of four 11-frame sections of video in Figure 4.7. Left to right:
original base frame, 4x bicubic zoom, 4x MS SR with λ = 5 and γ = 2000,
4x MS SR with de-interlacing. . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.9 Tested scanlines on a barcode image. . . . . . . . . . . . . . . . . . . . . . . 94
4.10 Three degrees of freedom in barcode rotation. . . . . . . . . . . . . . . . . . 95
4.11 Creating a projected signal u(t) from a barcode image u0(x, y). Left: pro-
jection with parallel bars (roll). Right: projection from focal point F for
non-parallel bars (pitch). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.12 Super-resolution of a Code 128A barcode image with roll only. Top to bot-
tom: original image and final TV SR image, ideal signal, single scanline,
projected signal, TV SR signal with λ = 10. . . . . . . . . . . . . . . . . . . 99
4.13 Super-resolution of UPC barcode with severe pitch angle. Top: original
image with traced bars indicated by dots. Bottom: Scanline signal in red
superimposed on TV projected signal in blue. . . . . . . . . . . . . . . . . . 100
4.14 A image from an MRI sensor and contrast-adjusted zoom of two regions. . . 102
4.15 Positions of 16 MRI sensors found by tracing backwards from L2-norm image,
shown in center. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.16 Zoom of central area of brain. Left: L2-norm image with enhanced contrast.
Right: MS SR with λ = 100, γ = 2000. . . . . . . . . . . . . . . . . . . . . . 105
4.17 Mumford-Shah fusion of 16 MR sensor images. Top left: a sensor image.
Top right: L2-norm image. Bottom left: MS SR with λ = 100, γ = 2000.
Bottom right: MS SR with λ = 10, γ = 2000. All four images have the same
dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 Illustration of quantized TV graph model for neighboring pixels x ∼ y. . . . 116
5.2 Effect of λ on TV minimization with L = 4 levels. . . . . . . . . . . . . . . 120
5.3 Effect of # levels L on TV minimization with λ = 1. . . . . . . . . . . . . . 121
5.4 Running time of quantized TV model with preflow-push method. Left: Log-
log plot of # pixels N vs. runtime for repeatedly downsampled Barbara
image. Right: Log-log plot of # levels L vs. runtime on 50x50 Barbara
image. Linear regressions are shown in red. . . . . . . . . . . . . . . . . . . 121
5.5 TV denoising of Barbara image. Left to right: Original image, L = 5 and
λ = 1, L = 5 and λ = 0.1, L = 2 and λ = 1 . . . . . . . . . . . . . . . . . . 123
5.6 TV Poisson denoising. Left: Original image corrupted by Poisson noise.
Center: TV minimization assuming Gaussian noise with λ = 5, L = 3.
Right: TV minimization assuming Poisson noise with λ = 5, L = 3. . . . . . 124
5.7 Quantized TV segmentation of simple images. Left 2: segmentation of nat-
ural image with λ = 0.5, L = 2. Right 2: segmentation of noisy synthetic
with λ = 0.02, L = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.8 TV seeded segmentation. Left: Original image with 3 seed pixels shown in
red. Center: Quantized TV minimization with λ = 0.5, L = 3 levels selected
by (5.5). Right: Quantized TV minimization with λ = 0.5, L = 3 using seeds. 126
5.9 TV texture segmentation. Left: Original image. Center: TV minimization
with λ = 0.2, L = 2 of entropy statistics. Right: TV minimization with
λ = 0.05, L = 2 of skewness statistics. . . . . . . . . . . . . . . . . . . . . . 128
5.10 TV inpainting. Left: Original image with mask D shown in red. Right: TV
inpainting result with L = 3, λ = 0.1. . . . . . . . . . . . . . . . . . . . . . . 129
5.11 TV zooming by inpainting with L = 2, λ = 1, and magnification factor M . . 131
5.12 TV zooming by inpainting with L = 2, λ = 1, and magnification M = 2. . . 131
5.13 Illustration of writing an image in block-raster order for M = 2, N = 4. The
resulting matrices K and A are block-diagonal. . . . . . . . . . . . . . . . . 138
5.14 The binary 0-1 image at left is convolved with a 2x2 averaging kernel K and
downsampled by factor 2 to produce the grayscale image at right. . . . . . . 140
5.15 Results of 2x zoom by different methods. Top row: original image, bicubic
zoom, TV filter zoom. Bottom row: quantized TV inpainting, quantized TV
zooming by relaxation, quantized TV zooming using local gradients. . . . . 140
5.16 Quantized TV zooming on cameraman image. Left: original image. Center:
2x zoom by quantized TV inpainting with L = 2, λ = 1. Right: 2x zoom
with 2x2 averaging kernel and local gradients. . . . . . . . . . . . . . . . . . 142
5.17 Iterating on intensity levels for quantized TV minimization with λ = 1, L = 3. 143
5.18 TV minimization with L = 6 levels under L1 and L2 fidelity constraints.
Top row: TV-L2 minimization removes low-contrast features as λ decreases.
Bottom row: TV-L1 minimization removes finer-scale geometric features as
λ decreases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.19 TV inpainting under the 8-connected topology. The inpainting domain is
shown in red in the first image. Left to right: Original image, TV filter,
4-connected quantized TV, 8-connected quantized TV. . . . . . . . . . . . . 146
5.20 3D quantized TV denoising of simple volumes with λ = 0.005, L = 2. The
middle image slice is shown for comparison. Top row: 10x10x10 cube. Bot-
tom row: Sphere of radius 8. . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.21 Quantized TV denoising of UPC barcode with additive Gaussian noise and
Gaussian blur. Top: original image. 2nd row: thresholding at median in-
tensity. 3rd row: quantized TV denoising with λ = 0.005, L = 2. Bottom:
quantized TV denoising with anisotropic weights λy = 0.005, λx = 0, L = 2. 149
5.22 Quantized TV inpainting of damaged barcode. Left: original image with
damaged area shown in red. Center: TV inpainting with λ = 0.1, L = 2.
Right: TV inpainting with anisotropic weights λy = 0.1, λx = 0, L = 2. . . 150
5.23 Quantized TV denoising of a barcode projected signal with λ = 10, L = 2. . 150
5.24 Quantized TV zooming of large text. Top row: original image, 2x bicubic
zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic
zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized
2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix. 153
5.25 Quantized TV zooming of small text. Top row: original image, 2x bicubic
zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic
zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized
2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix. 153
5.26 Quantized TV segmentation of ideal brain images with λ = 0.1, L = 3. Left
2: CT image. Right 2: MR image. . . . . . . . . . . . . . . . . . . . . . . . 154
5.27 Quantized TV segmentation of low-contrast MR brain image. Left 2: Seg-
mentation of entire brain with λ = 50, L = 2. Right 2: Segmentation of
region indicated in first image with λ = 200, L = 2. . . . . . . . . . . . . . . 155
Chapter 1
Introduction
A digital image is not an exact snapshot of reality; it is only a discrete approximation.
This fact becomes apparent when an image is made much larger and the pixels become
visible to the human eye. A larger image should have higher resolution, but an enlarged
image sometimes appears less acceptable than its smaller original. The actual resolution of
an image is defined as the number of pixels, but the effective resolution that we perceive
is a much harder quantity to define as it depends on subjective human judgment. Simply
increasing the number of pixels comprising the image does not necessarily increase the
effective resolution, as illustrated in Figure 1.1. The goal of image zooming is to create an
image with higher effective resolution from a single observed image. The zooming method we
employ depends in large part on our definition of effective resolution, which is an essentially
aesthetic quantity.
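The gap between actual and effective resolution is easy to demonstrate: pixel replication (the nearest neighbor zoom shown in the center panel of Figure 1.1) multiplies the pixel count by M² without creating any new detail. A minimal NumPy sketch (an illustration, not code from the thesis):

```python
import numpy as np

def nearest_neighbor_zoom(u0: np.ndarray, M: int) -> np.ndarray:
    """Replicate each pixel M times in each direction.

    The actual resolution grows by a factor of M^2, but the effective
    resolution is unchanged: no new gray values or detail are created.
    """
    return np.kron(u0, np.ones((M, M), dtype=u0.dtype))

u0 = np.array([[0, 255],
               [255, 0]], dtype=np.uint8)
u = nearest_neighbor_zoom(u0, 3)
assert u.shape == (6, 6)          # 9x as many pixels...
assert len(np.unique(u)) == 2     # ...but still only two gray values
```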
The digital zooming problem goes by many names, depending on the application: inter-
polation, image resizing, image upsampling/downsampling, image magnification, resolution
enhancement, etc. The term super-resolution is sometimes used, although in the literature
this generally refers to producing a high-resolution image from multiple images such as a
video sequence. In this thesis, we will refer to the single image case as “zooming” and the
multiple image scenario as “super-resolution.”
The applications of image zooming range from the commonplace viewing of online images
Figure 1.1: Actual vs. effective resolution. Left: Original image. Center: nearest neighbor zoom. Right: zoom using Navier-Stokes inpainting from [12]. The two zoomed images have the same number of pixels.
to the more sophisticated magnification of satellite images. With the rise of consumer-based
digital photography, users expect to have a greater control over their digital images. Dig-
ital zooming has a role in picking up clues and details in surveillance images and video.
As high-definition television (HDTV) technology enters the marketplace, engineers are in-
terested in fast interpolation algorithms for viewing traditional low-definition programs on
HDTV. Astronomical images from rovers and probes are received at an extremely low trans-
mission rate (about 40 bytes per second), making the transmission of high-resolution data
infeasible [40]. In medical imaging, neurologists would like to have the ability to zoom in
on specific parts of brain tomography images. This is just a short list of applications, but
the wide variety cautions us that our desired interpolation result could vary depending on
the application and user.
1.1 The Digital Zooming Problem
In this section, we will establish the notation for image zooming used throughout the paper.
Suppose our image is defined over some rectangle Λ ⊂ ℝ². Let the function f : Λ → ℝ be
our ideal continuous image. In an abstract sense, we can think of f as being “reality” and
Λ as our “viewing window.” The observed image u0 is a discrete sampling of f at equally
spaced points in the plane. If we suppose the resolution of u0 is δx× δy, we can express u0
by
u0(x, y) = C_{δx,δy}(x, y) f(x, y), (x, y) ∈ Λ (1.1)
where C denotes the Dirac comb
C_{δx,δy}(x, y) = Σ_{k,l∈Z} δ(kδx, lδy), (x, y) ∈ ℝ²

and δ denotes the two-variable Dirac delta function

δ(x, y) = 1 if x = y, and 0 otherwise.
The goal of image interpolation is to produce an image u at a different resolution δx′×δy′.
For simplicity, we will assume that the Euclidean coordinates are scaled by the same factor
M :
u(x, y) = C_{δx/M, δy/M}(x, y) f(x, y), (x, y) ∈ Λ (1.2)
Given only the image u0, we will have to devise some reconstruction of f at the pixel values
specified by this new resolution. We will refer to M as our zoom or magnification factor.
Obviously, if M = 1 we trivially recover u0. The image u0 is upsampled if M > 1 and
downsampled if M < 1. In this paper, we will focus on the upsampling case when M > 1
is an integer.
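For integer M, the relation between f, u0, and the two lattices can be made concrete; in the following hypothetical NumPy sketch, array slicing stands in for the Dirac comb sampling of (1.1):

```python
import numpy as np

# A stand-in for the ideal image f, already sampled on a fine lattice.
f = np.arange(36, dtype=float).reshape(6, 6)

# Downsampling by integer factor M keeps every M-th sample in each
# direction -- the discrete analogue of sampling with a coarser comb.
# Upsampling (zooming) asks for the ill-posed inverse of this map.
M = 3
u0 = f[::M, ::M]

assert u0.shape == (2, 2)
assert u0[1, 1] == f[3, 3]
```

Many different fine-lattice images map to the same u0, which is precisely why the inverse problem requires additional assumptions.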
Let ΩM ⊂ Λ denote the lattice induced by (1.2) for a fixed zoom M . Note that the
lattice of the original image u0 in (1.1) is Ω1, written simply as Ω from this point. Also
note that for infinite magnification we obtain ΩM → Λ as M → ∞. For computational
purposes, we can shift the lattices to the positive integers Z+. So if the observed image u0
is an m× n image,
ΩM = [1, 2, . . . ,Mm]× [1, 2, . . . ,Mn] .
Many interpolation techniques impose the constraint Ω ⊆ ΩM . In this case, only a subset
of the pixels in ΩM needs to be determined and the zooming problem becomes a version of
the inpainting problem.
Given the notation above, we can state the image interpolation problem succinctly:
Given a low-resolution image u0 : Ω → ℝ and a magnification M > 1, find a high-resolution
image u : ΩM → ℝ. Obviously, this is an ill-posed problem. We need to impose assumptions
on the reconstruction of f in equation (1.2). The choice of the zooming strategy depends
on the choice of assumptions. In other words, we need a mathematical understanding of
what constitutes our perception of “reality” f.
Zooming methods differ in their mathematical description of a “good” interpolated
image. Although it is difficult to compare methods and judge their output, we propose
9 basic criteria for a good zooming method. Some of these criteria are image processing
axioms proposed by [2, 24]. The first 8 are visual properties of the interpolated image, the
last is a computational property of the zooming method.
1. Geometric Invariance: The interpolation method should preserve the geometry and
relative sizes of objects in an image. That is, the subject matter should not change
under interpolation.
2. Contrast Invariance: The method should preserve the luminance values of objects in
an image and the overall contrast of the image.
3. Noise: The method should not add noise or other artifacts to the image, such as
ringing artifacts near the boundaries.
4. Edge Preservation: The method should preserve edges and boundaries, sharpening
them where possible.
5. Aliasing: The method should not produce jagged or “staircase” edges.
6. Texture Preservation: The method should not blur or smooth textured regions.
7. Over-smoothing: The method should not produce undesirable piecewise constant or
blocky regions.
8. Application Awareness: The method should produce results appropriate to the type
of image and order of resolution. For example, the interpolated results should appear
realistic for photographic images, but for medical images the results should have crisp
edges and high contrast. If the interpolation is for general images, the method should
be independent of the type of image.
9. Sensitivity to Parameters: The method should not be too sensitive to internal param-
eters that may vary from image to image.
These are qualitative and somewhat subjective criteria, but they serve as a guide for
developing and evaluating digital zooming. In a sense, the methods discussed in this paper
each present a mathematical model of these visual criteria.
1.2 Organization and Contributions of this Thesis
Simple linear filters are the most common interpolation methods in computer software such
as web browsers and photo editors. In the next chapter, we first establish that linear
filters are inadequate for the zooming problem. To motivate the variational approach and
compare it to other strategies, in Chapter 2 we examine four methods representative of
current research in image processing:
• A wavelet-based interpolation method
• A heat diffusion PDE-based algorithm
• A machine learning strategy inspired by dimensionality reduction
• An adaptation of the Nonlocal Means denoising algorithm.
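For reference, the linear filters that these methods are compared against compute each new pixel as a fixed weighted average of nearby coarse pixels. A minimal pure-Python sketch of bilinear interpolation (an illustration under the assumption of an image with at least two rows and columns, not code from the thesis):

```python
def bilinear_zoom(u0, M):
    """Upsample a list-of-lists grayscale image by integer factor M
    using bilinear interpolation (assumes at least 2 rows and columns)."""
    m, n = len(u0), len(u0[0])
    out = [[0.0] * (M * n) for _ in range(M * m)]
    for i in range(M * m):
        for j in range(M * n):
            # Map the fine-lattice site back to coarse coordinates,
            # clamping so the four surrounding samples stay in range.
            x = min(i / M, m - 1.0)
            y = min(j / M, n - 1.0)
            x0 = min(int(x), m - 2)
            y0 = min(int(y), n - 2)
            dx, dy = x - x0, y - y0
            out[i][j] = ((1 - dx) * (1 - dy) * u0[x0][y0]
                         + dx * (1 - dy) * u0[x0 + 1][y0]
                         + (1 - dx) * dy * u0[x0][y0 + 1]
                         + dx * dy * u0[x0 + 1][y0 + 1])
    return out

u0 = [[0, 255], [255, 0]]
u = bilinear_zoom(u0, 2)
assert len(u) == 4 and len(u[0]) == 4
assert u[1][1] == 127.5   # midpoint blends all four neighbors equally
```

The averaging weights depend only on position, never on image content; this content-blindness is what blurs edges and textures, and it is the shortcoming the four methods above try to address.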
While each interpolation strategy has its strengths, the variational approach has certain
theoretical and computational advantages.
In Chapter 3, we discuss the use of variations of energy for image processing, specifically
the Total Variation (TV) and Mumford-Shah energies. We discuss theoretical and computational aspects of image inpainting and its extension to digital zooming. To improve the
quality of zoomed images, we propose several modifications to the basic inpainting model.
These modifications include the incorporation of a blur kernel, locally adaptive fidelity
weights, soft inpainting using nearest neighbor information, and variational post-processing.
The super-resolution problem seeks to produce a single high-resolution image from a
sequence of low-resolution images. The TV and Mumford-Shah image zooming models
extend naturally to image sequences, but meaningful data fusion requires that the images
be aligned to sub-pixel accuracy. In Chapter 4, we propose an alternating minimization
strategy to accurately align and fuse the image sequence. To address the problem of local
variation or motion within the sequence, a soft inpainting model can correct for image
artifacts created by the super-resolution process. Variational super-resolution is shown to
be effective in video enhancement, MRI reconstruction, and barcode processing.
Restricting the processed image to a few discrete gray values is useful for enhancing the
components of an image and can help restore simple images corrupted by noise, blur, and
downsampling. Chapter 5 discusses quantized image processing by minimizing the quantized
TV energy via graph cuts. In a graph-theoretic model, finding the global minimum of the
quantized TV energy is equivalent to finding the minimum cut of a flow network. This model
has previously been shown to be effective for image denoising and segmentation. The graph
cut method extends to inpainting and zooming, but incorporating a blur kernel is an open
problem. To address the graph cut deconvolution problem, we propose an approximation
method inspired by numerical relaxation in linear programming. Applications to text,
barcodes, and medical images are presented.
Chapter 2
Survey of Zooming Approaches
The goal of this chapter is to give a brief survey of different mathematical approaches to
image zooming. To illustrate each approach, we will focus on a particular method that
is representative of the approach. We will present numerical results for each method and
discuss its strengths and weaknesses. We begin by examining simple linear filters and why
better methods need to be developed.
2.1 Linear Interpolation Filters
The simplest approach is to assume that f in equation (1.2) is reconstructed by a convolution
kernel φ : ℝ² → ℝ where ∫∫ φ(x, y) dy dx = 1. Then we can approximate f by f ≈ u0 ∗ φ.
Substituting this into (1.2) gives rise to a general linear interpolation filter

u(x, y) = C_{δx/M, δy/M}(x, y) (u0 ∗ φ)(x, y),   (x, y) ∈ Ω.
The simplest linear filters are the bilinear and bicubic interpolation, which assume the
pixel values can be fit locally to linear and cubic functions, respectively [64]. Along with
simple nearest neighbor interpolation, these two filters are the most common interpolation
schemes in commercial software. These methods are easy to code as matrix multiplications
of u0. However, an image contains edges and texture, in other words discontinuities, so
the assumption that pixel values locally fit a polynomial function will produce undesirable
results. The bilinear and bicubic interpolation methods may introduce blurring, create
ringing artifacts, and produce a jagged aliasing effect along edges (see Figure 2.1). The
blurring effects arise from the fact that the methods compute a weighted average of nearby
pixels, just as in Gaussian blurring. The aliasing effects arise because the linear filters do
not take into consideration the presence of edges or how to reconstruct them.
Figure 2.1: Part of Lena image downsampled and then upsampled by factor M = 2.
Other linear interpolation filters include quadratic zoom, the B-spline method,
and zero-padding. But these schemes produce the same undesirable effects as the bilinear
and bicubic methods, as documented in [72]. Linear filters differ in the choice of φ, which
essentially determines how to compute the weighted average of nearby pixels. While this is
a natural interpolation scheme for general data sets, this is not necessarily appropriate for
visual data. In order to improve upon these linear filters, we need to consider interpolation
methods that somehow quantify and preserve visual information.
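To make the blurring effect concrete, here is a small sketch of bilinear upsampling in pure numpy. This is an illustration only, not code from this thesis; the function name and the toy edge image are our own. The weighted average of the four surrounding pixels is exactly the convex combination that smears a sharp edge.

```python
import numpy as np

def bilinear_zoom(u0, M):
    """Toy bilinear interpolation: sample u0 on an M-times-finer grid,
    taking a weighted average of the four nearest coarse pixels."""
    H, W = u0.shape
    ys = np.linspace(0, H - 1, M * H)
    xs = np.linspace(0, W - 1, M * W)
    y0 = np.clip(ys.astype(int), 0, H - 2)
    x0 = np.clip(xs.astype(int), 0, W - 2)
    wy = (ys - y0)[:, None]          # fractional offsets in y
    wx = (xs - x0)[None, :]          # fractional offsets in x
    Y0 = y0[:, None]
    X0 = x0[None, :]
    return ((1 - wy) * (1 - wx) * u0[Y0, X0]
            + (1 - wy) * wx * u0[Y0, X0 + 1]
            + wy * (1 - wx) * u0[Y0 + 1, X0]
            + wy * wx * u0[Y0 + 1, X0 + 1])

u0 = np.zeros((4, 4))
u0[:, 2:] = 1.0                      # sharp vertical edge
u = bilinear_zoom(u0, 2)
print(u.shape)                       # (8, 8)
# The edge is smeared: values strictly between 0 and 1 appear along it,
# the blurring effect described above.
print(np.any((u > 0) & (u < 1)))     # True
```

Because the output is a convex combination of input pixels, bilinear interpolation never overshoots; bicubic interpolation, by contrast, can overshoot near a jump, which produces the ringing artifacts mentioned above.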
2.2 Which Methods to Consider?
Generally speaking, mathematical approaches to image processing can be divided into five
categories:
1. PDE-Based Methods (e.g. heat diffusion, Perona-Malik, Navier-Stokes, mean curvature)
2. Multiscale Analysis (e.g. wavelets, Fourier analysis, Gabor analysis, Laplacian pyramids)
3. Machine Learning (e.g. unsupervised learning, data mining, Markov networks)
4. Statistical / Probabilistic Methods (e.g. Bayesian inference, Natural Scene Statistics,
pattern theory)
5. Variations of Energy (e.g. Total Variation, Mumford-Shah, active contours)
We are trying to describe the field in broad terms, but not to rank or pigeonhole work
in computer vision. Indeed, many techniques such as TV-wavelets inpainting certainly do
not fit into one category. Also, these methods differ at the mathematical level, but not
necessarily at the conceptual level. For example, some versions of the TV energy can be
minimized by solving a PDE or by optimizing a variation of energy.
In our attempt to survey recent work in image interpolation and also display the variety
of mathematics used, we will highlight one method from each of the first four categories.
The fifth category, variations of energy, will be discussed in detail in Chapter 3. In this
chapter, we will consider
1. A PDE-Based Approach: anisotropic heat diffusion [10]
2. A Multiscale Approach: wavelet-based interpolation [23]
3. A Machine Learning Approach: LLE-based neighbor embeddings [35]
4. A Statistical Approach: NL-means interpolation [19]
These methods are, in some sense, representative of the mathematical approaches to the
image interpolation problem and, in a larger sense, to the field of image processing. For
example, the heat equation is the most studied PDE in image processing and wavelet theory
has generated hundreds of research papers. We will briefly describe the mathematics and
motivation behind each method. Then we will present numerical results and discuss each
method’s advantages and drawbacks.
2.3 A PDE-Based Approach
A PDE-based approach evolves an image based on a specific driving differential equation.
For example, Cha and Kim proposed an interpolation method based on the PDE form of
the TV energy [25]. In their seminal paper on inpainting, Bertalmio et al. proposed a
fourth-order PDE based on Navier-Stokes fluid flow [12]. The most famous and well-studied
PDE in image processing is the classical heat equation. Anisotropic heat diffusion has been
successfully applied to image reconstruction and denoising and its behavior is well-known
[57, 80]. Belahmidi and Guichard have proposed an interpolation scheme based on the
classical heat diffusion model [11].
2.3.1 Heat Diffusion
The heat equation is a useful tool for smoothing noisy images. We assume that pixel values
behave like temperature values and diffuse throughout the image. Diffusion is directed by
the unit vectors ~n and ~t, which are oriented by the gradient vector Du normal and tangent
to the edges, respectively:
~n = Du/|Du| = (ux, uy) / √(ux² + uy²),    ~t = Du⊥/|Du| = (uy, −ux) / √(ux² + uy²)
Following the notation of Guichard and Morel [57], an image u(t, x) is evolved according to
the PDE

∂u/∂t = |Du| D²u(~t, ~t) + g(|Du|) D²u(~n, ~n)   (2.1)
where

D²u(~v, ~v) = ~v^T D²u ~v

for the 2×2 Hessian matrix D²u.
The function g(s) is an “edge-stopping function” satisfying 0 ≤ g ≤ 1 that is close to
0 when s is large and 1 when s is small. The most common choice is the Perona-Malik function

g(s) = 1 / (1 + (s/λ)²)
where λ is a parameter set experimentally [80]. The effect of g is shown in the following
theorem, which can be proven by direct calculation.
Theorem 2.1 (Belahmidi and Guichard, 2004) When g ≡ 1, equation (2.1) reduces to
the heat equation

∂u/∂t = ∆u. (2.2)

When g ≡ 0, equation (2.1) reduces to mean curvature motion

∂u/∂t = |Du| ∇ · (Du/|Du|) = |Du| curv(u). (2.3)
In smooth regions, Du is small, g is close to 1, and the two terms of (2.1) have equal
weight. The Laplacian of equation (2.2) will blur the image evenly by isotropic diffusion.
Near edges, Du is large, g is close to 0, and the diffusion will occur along edges, smoothing
the level lines but preserving the sharpness of the edges.
Belahmidi and Guichard adapted the heat equation (2.1) to image interpolation by
adding a fidelity term [10]. The heat equation will still smooth the image while preserving
edges, but the addition of a third term keeps the image u close to the original image u0.
The PDE and initial condition are
∂u/∂t = |Du| D²u(~t, ~t) + g(|Du|) D²u(~n, ~n) − Pu + Zu0,
u(0, x) = Zu0.   (2.4)
The operator Z : Ω → ΩM is the duplication zoom or nearest neighbor upsampling
technique. The upsampled coarse image Zu0 acts as the initialization. The projection
operator P computes the average of the image u over the M × M stencil used in the
upsampling Z. If we let N(x) denote the M ×M upsampling window containing pixel x,
we can write P as

(Pu)(x) = (1/M²) ∫_{N(x)} u(y) dy.
The classical heat diffusion (2.1) has been well-studied, but it is unclear how the addition
of the fidelity term in (2.4) affects the equation. Little is known about solutions to the
PDE (2.4), although some comments can be made in the viscosity framework. Writing
H(x, u,Du,D2u) for the right-hand side of equation (2.4), a viscosity solution u satisfies
u = 0 on ∂ΩM and for all v ∈ C2(ΩM ) we have:
1. H(x0, u,Du,D2u) ≤ 0 whenever u− v has a local maximum at (t0, x0).
2. H(x0, u,Du,D2u) ≥ 0 whenever u− v has a local minimum at (t0, x0).
Under this definition, Belahmidi proved the following theorem in [10].
Theorem 2.2 (Belahmidi, 2003) Suppose g(s) is the Perona-Malik function and u0 ∈
C(Ω). Then the PDE (2.4) with boundary condition u = 0 on ∂ΩM admits a unique viscosity
solution.
The proof is similar to the proof for viscosity solutions to the Hamilton-Jacobi equation
[47]. Of course, this is of limited usefulness for natural images because the original image
u0 is almost certainly not continuous.
2.3.2 Numerical Results
Equation (2.4) can be discretized in a straightforward manner using finite differences. For
a choice of small time step δt, we can write

u^(n+1)_ij = u^(n)_ij + δt (|Du| D²u(~t, ~t) + g(|Du|) D²u(~n, ~n) − Pu + Zu0)_ij.
A von Neumann analysis of the 2D heat equation u_t = ∆u shows that we require
δt/(δx)² < 1/4 to guarantee stability of an Euler numerical scheme. Using this as a
guideline, an image has spatial step δx = 1, so we expect a rough upper bound δt < 0.25.
We used Neumann
boundary conditions at the borders of the image.
Belahmidi and Guichard make a heuristic argument for the stopping time T. Running
the heat equation on an image u at scale t is equivalent to convolution with a Gaussian
kernel of standard deviation √(2t). Since the length of the diagonal of a pixel's upsampled
M × M window is √2 M, the authors argue that the desired standard deviation should
be √2 M. So we set the stopping time T = M². Our experiments with the PDE-based
method indicate that the image does not change much after this stopping time, so the image
may have reached its steady-state by this time.
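The explicit scheme above can be sketched in numpy. This is a rough illustration under assumed choices (np.gradient finite differences, which are one-sided at the borders and so give Neumann-like behavior, and block operators standing in for Z and P), not the authors' implementation.

```python
import numpy as np

def perona_malik(s, lam):
    return 1.0 / (1.0 + (s / lam) ** 2)

def Z(u0, M):
    """Duplication zoom: copy each coarse pixel into an M x M block."""
    return np.kron(u0, np.ones((M, M)))

def P(u, M):
    """Average u over each M x M upsampling window (fine grid to fine grid)."""
    H, W = u.shape
    b = u.reshape(H // M, M, W // M, M).mean(axis=(1, 3))
    return np.kron(b, np.ones((M, M)))

def pde_zoom(u0, M, dt=0.1, lam=100.0, eps=1e-8):
    Zu0 = Z(u0, M)
    u = Zu0.copy()
    for _ in range(round(M ** 2 / dt)):   # heuristic stopping time T = M^2
        gy, gx = np.gradient(u)           # one-sided at borders (Neumann-like)
        gxx = np.gradient(gx, axis=1)
        gyy = np.gradient(gy, axis=0)
        gxy = np.gradient(gx, axis=0)
        grad2 = gx ** 2 + gy ** 2
        mag = np.sqrt(grad2)
        # D^2u(t,t) and D^2u(n,n) for the unit tangent and normal vectors
        Dtt = (gy ** 2 * gxx - 2 * gx * gy * gxy + gx ** 2 * gyy) / (grad2 + eps)
        Dnn = (gx ** 2 * gxx + 2 * gx * gy * gxy + gy ** 2 * gyy) / (grad2 + eps)
        u = u + dt * (mag * Dtt + perona_malik(mag, lam) * Dnn - P(u, M) + Zu0)
    return u

# A constant image is a steady state: all derivatives vanish and P(Zu0) = Zu0.
flat = pde_zoom(np.full((4, 4), 0.5), M=2)
print(flat.shape)              # (8, 8)
print(np.allclose(flat, 0.5))  # True
```

The sanity check with a constant image exercises the fixed-point property of the fidelity term: when u = Zu0, both diffusion terms and −Pu + Zu0 vanish, so the scheme leaves u unchanged.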
The zooming method seems to do a good job smoothing edges, while maintaining the
sharpness of the edges. In terms of aliasing edges, it seems to perform better than linear
interpolation filters (see Figure 2.2). The PDE-based method seems to perform well on
natural images, although some textures are over-smoothed.
If the parameter λ in the Perona-Malik function g(s) is set too small, the method will
over-smooth textured regions, resulting in unrealistic images. We set λ very large to avoid
this side-effect. This preserved textures, but it also preserved noise and ringing effects
present in the original image (see Figure 2.3). Another side-effect, which is barely visible
in the figure below, is that the PDE-based method changes the overall contrast of the
image. This is because the diffusion across edges is limited, but still occurs. This may be
an undesirable side-effect in some applications, such as medical images where the gray value
of brain matter is crucial.
2.4 A Multiscale Approach
A multiscale approach tries to break an image down into its most basic components of
information and express the image in scales of those building blocks. Multiscale analysis
seems a natural fit for image interpolation, with image upsampling viewed as determining
finer scales of image detail to add to a low-resolution image. Wavelets and their variants
Figure 2.2: PDE-based upsampling with zoom M = 3 and time step δt = 0.1. Top Row: Magnification of Miller image at time T = 9. Bottom Row: Close-up of section of image and comparison to linear filters.
have received much attention for image interpolation, although most of the work has focused
on image super-resolution: interpolating a high-resolution image from an image sequence
rather than a single image. These techniques do not necessarily carry over to single image
super-resolution, as the sequence generally contains much more information than a single
image. The techniques are also highly dependent on precise sub-pixel registration of the
low-resolution images in the sequence [76]. Most of the wavelet-based work on single image
interpolation has focused on detecting extrema and singularities in the wavelet transform. In
this section, we describe a work by Carey, Chuang, and Hemami that focuses on producing
crisp well-defined edges in the interpolant [23].
Figure 2.3: PDE-based upsampling of text image with zoom M = 5 and time step δt = 0.1. Left: original image. Right: zoomed image at time T = 25.
2.4.1 Wavelets and Hölder Regularity
Carey et al. begin by defining the smoothness of an image in terms of Hölder regularity
of the wavelet transform. We say that a function f : ℝ → ℝ has Hölder regularity with
exponent α = n + r, n ∈ ℕ, 0 ≤ r < 1, if there exists a constant C satisfying

|f^(n)(x) − f^(n)(y)| ≤ C|x − y|^r,   x, y ∈ ℝ. (2.5)
Functions with a large Hölder exponent will be both mathematically and visually smooth.
Locally, an interval with high regularity will be a smooth region and an interval with low
regularity will correspond to roughness, such as at an edge in an image. To extend this
concept to edge detection in the wavelet domain, we need a technique for detecting local
Hölder regularity from the wavelet coefficients.
Let ψ be a compactly-supported discrete wavelet function, such as a Daubechies wavelet.
The discrete wavelet transform is computed by projecting a signal onto translations and
dilations of the mother wavelet ψ:
ψ_{k,l}(x) = ψ(2^k x − l),   k, l ∈ ℤ. (2.6)
The wavelet transform coefficients w_{k,l} at scale k and offset l are given mathematically as
an inner product with the mother wavelet:

w_{k,l} = (f, ψ_{k,l}). (2.7)
Numerically, these coefficients are computed using a filter bank with a scaling function φ
appropriate to the mother wavelet. The dyadic wavelet filter bank repeatedly divides the
signal at scale k into an approximation signal ak and a detail signal dk, also called the
averages and differences (see Figure 2.4). The coefficients of dk are precisely the wavelet
coefficients wk,l.
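The averages-and-differences structure of the filter bank can be sketched in a few lines of numpy. For brevity this sketch assumes the Haar wavelet rather than a Daubechies wavelet; each level splits the current approximation into averages a_k and differences d_k, the d_k being the wavelet coefficients.

```python
import numpy as np

def haar_decompose(signal, levels):
    """Minimal Haar filter bank: repeatedly split into averages and
    differences, returning the coarsest approximation and all details."""
    a = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        details.append(d)
    return a, details

x = np.array([4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # step signal
approx, details = haar_decompose(x, 3)
# The nonzero detail coefficients sit at the step location at every level,
# illustrating how an edge persists across scales:
print([np.count_nonzero(d) for d in details])  # [1, 1, 1]
```

Since the Haar transform is orthonormal, the energy of the signal is preserved across the decomposition, which is a convenient correctness check for any filter-bank implementation.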
The following theorem by Ingrid Daubechies establishes the connection between wavelet
coefficients and Holder regularity [41].
Theorem 2.3 (Daubechies, 1992) Let x0 ∈ ℝ and S be a set of index pairs (k, l) such
that for some ε > 0 we have (x0 − ε, x0 + ε) ⊂ supp(ψ_{k,l}). A signal has local Hölder regularity
with exponent α in the neighborhood (x0 − ε, x0 + ε) if there exists a constant C such that

max_{(k,l)∈S} |w_{k,l}| ≤ C 2^{−k(α + 1/2)}. (2.8)
Theorem 2.3 alone is not sufficient for determining the local Hölder regularity, because
it requires computation of two unknown constants C and α. It has been observed
experimentally that regions in a signal with low regularity tend to have greater similarity
across scales. Let dm(t) and dn(t) denote the wavelet sub-bands at scales 2^m and 2^n. The
correlation between sub-bands is given by

Corr(dm(t), dn(t)) = ∫_ℝ dm(τ) dn(τ − t) dτ. (2.9)
Applying Theorem 2.3 twice to this definition yields the following theorem.
Figure 2.4: Discrete 3-level wavelet decomposition of noisy sine wave signal.

Theorem 2.4 (Carey-Chuang-Hemami, 1999) Let f : ℝ → ℝ be C∞, except possibly
in a neighborhood of the origin, where it has Hölder regularity with exponent α. The
correlation between sub-bands dm(t) and dn(t) satisfies

|Corr(dm(t), dn(t))| ≤ C 2^{−(m+n)(α + 1/2)}. (2.10)
Theorem 2.4 shows that regions with high regularity will exhibit low correlation across
scales, and vice-versa. In other words, an edge will result in extrema in the wavelet coef-
ficients across several scales, while extrema in smooth regions will not persist across scales
(see Figure 2.5).
The two previous theorems give a heuristic for estimating the local regularity of a signal
by examining the correlation across wavelet sub-bands. Carey et al. claim that at a strong
edge in a signal, the inequalities in both theorems will be close to equality. By (2.8), in
an interval containing a strong edge the logarithm of the maximum coefficient magnitudes
should be close to linear across scales. The parameters C and α can then be estimated
using equality in (2.8) and (2.10).
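The regression heuristic can be sketched as follows. Assuming equality in (2.8), log2 of the maximum coefficient magnitude is linear in the scale k with slope −(α + 1/2), so a least-squares line through the per-scale maxima yields estimates of C and α. The function name and the synthetic coefficients are our own illustration, not the authors' code.

```python
import numpy as np

def estimate_holder(max_mags, scales):
    """max_mags[i] = max over offsets l of |w_{k,l}| at scale k = scales[i].
    Fit log2(max_mags) = log2(C) - k*(alpha + 1/2) by least squares."""
    slope, intercept = np.polyfit(scales, np.log2(max_mags), 1)
    alpha = -slope - 0.5
    C = 2.0 ** intercept
    return alpha, C

# Synthetic maxima obeying (2.8) with equality, alpha = 0.5 and C = 8:
scales = np.array([1, 2, 3, 4])
mags = 8.0 * 2.0 ** (-scales * (0.5 + 0.5))
alpha, C = estimate_holder(mags, scales)
print(round(alpha, 6), round(C, 6))  # 0.5 8.0
```

On real wavelet coefficients the fit is only approximate, and the strength of the linear correlation is itself the test for whether an interval contains a strong edge.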
Figure 2.5: 4-level wavelet decomposition of noisy step function f.
2.4.2 Wavelet-Based Interpolation
Suppose we are given an image u0 : Ω → ℝ and its corresponding L-level discrete wavelet
decomposition. Synthesizing a new sub-band dL+1 will produce a new image u : Ω2 → ℝ
that is twice the size of the original image. Since the theorems above apply to 1D data,
Carey et al. proceed by first processing the image data across each row and appending the
signals into a “row-interpolated image.” The same processing step is then applied to the
columns of this new image, with the end result being u.
Carey et al. suggest interpolating the sub-band signal dL−1 rather than the finest
original sub-band dL because the finest band generally contains too much noise information.
As an initialization for dL+1, the detail signal dL−1 is upsampled by a factor 4 using cubic
spline interpolation. For each subinterval of the signal, the algorithm then determines
similarity across scale by computing the linear regression across sub-bands of the maximum
coefficient magnitude. If the linear correlation is strong, then the interval should contain
an edge and the linear regression will predict the magnitude of the coefficient at sub-band
dL+1. The template from dL−1 is used except at edges, where the signal is modified to
achieve equality in (2.8).
On a small set of test images, Carey et al. [23] demonstrate that their wavelet-based interpolation
method results in higher Peak Signal-to-Noise Ratio (PSNR) than the standard bilinear
and bicubic methods. However, visually the methods exhibit little difference. The wavelet-
based method seems to sharpen the edges, but the textured and smooth regions of the
image are blurred. This effect is expected because the interpolation step is simply 1D
cubic interpolation, except at strong edges. This method requires the original image u0 to
be large enough to show a significant amount of information across many wavelet scales.
Also, the technique lacks resizing flexibility because it assumes the zoom factor M is a
multiple of 2.
This wavelet-based method reduces to the linear bicubic filter, except at strong edges.
It is best suited for images with strong, well-defined edges separating smooth regions. One
possible refinement to this method would be to incorporate texture information. Several
papers have demonstrated that wavelet coefficient magnitudes can be used to quantify and
classify textures [63, 92]. It would be interesting to incorporate this idea into texture
interpolation, resulting in a sharpened image that is visually pleasing as well.
2.5 A Machine Learning Approach
As image processing research advances, researchers are realizing that details, textures, and
other visual nuances of images cannot be expressed in compact mathematical form. In
the last few years, researchers have given more attention to machine learning approaches
to guide the computer to learn meaningful information from a set of training images. For
image interpolation, a set of high-resolution images and its corresponding downsampled
19
versions are provided, with the goal of learning how to connect the low-resolution version
to its high-resolution counterpart. William Freeman and his group at Mitsubishi Labs have
developed an approach based on Markov networks and belief propagation networks (Freeman
et al.). Bryan Russell, one of Freeman's students, extended this approach by incorporating
priors into the belief propagation networks, which results in realistic textured images with
sharper edges [84]. Most recently Chang, Yeung, and Xiong developed a learning system
inspired by dimensionality reduction techniques, which we highlight below [35].
2.5.1 Locally Linear Embedding (LLE)
In the last five years, much attention has been given to mathematical non-linear dimensionality reduction (NLDR) methods, also called manifold learning techniques. Given a
high-dimensional data set X, the goal is to interpolate a lower dimensional data set Y that
preserves the neighborhoods of the original data set in a geometrically meaningful way.
In 2000, Lawrence Saul and Sam Roweis proposed the Locally Linear Embedding (LLE)
manifold learning technique [86]. For each data point in X, LLE computes the nearest
neighbors and then projects the neighborhood to Y by assuming that the neighborhood is
planar. This technique has proven effective experimentally in reducing the dimensionality
of data sets in a geometrically meaningful manner (see Figure 2.6).
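The LLE recipe can be condensed into a short numpy sketch. This is an assumed minimal variant of the Roweis-Saul algorithm for illustration, not their reference code: reconstruct each point from its K nearest neighbors, then embed via the bottom eigenvectors of (I − W)^T (I − W).

```python
import numpy as np

def lle(X, K, out_dim, reg=1e-3):
    """Minimal LLE: local reconstruction weights, then spectral embedding."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:K + 1]        # K nearest, skipping the point itself
        A = X[nbrs] - X[i]                   # neighbors centered at X[i]
        G = A @ A.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(K)   # regularize (G may be singular)
        w = np.linalg.solve(G, np.ones(K))
        W[i, nbrs] = w / w.sum()             # weights sum to 1
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    # Smallest eigenvector is constant; the next out_dim give the embedding.
    return vecs[:, 1:out_dim + 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Y = lle(X, K=6, out_dim=2)
print(Y.shape)  # (60, 2)
```

The regularization term is a standard fix for the case where the number of neighbors exceeds the ambient dimension and the local Gram matrix is singular.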
Once a data set Y is determined, it is possible to add a new data point x to the manifold
X and add its projection y to Y without recomputing the entire embedding from X to Y .
Saul and Roweis suggest one solution for this out-of-sample extension is to first compute the
K nearest neighbors {x_i}_{i=1}^K of x in X. Next we compute normalized weights
{w_i}_{i=1}^K that best form a linear combination approximating x: x ≈ Σ_{i=1}^K w_i x_i.
Finally, we construct the interpolated point y by using these same weights in a linear
combination of {y_i}_{i=1}^K, the data points in Y corresponding to the x_i's:
y ≈ Σ_{i=1}^K w_i y_i. This procedure motivates a machine learning
technique for comparing a given image to the training data set. The key difference is that
we are given a low-resolution image and interpolate a high-dimensional image, so we need
to increase the dimensionality of our data points.
Figure 2.6: LLE dimensionality reduction. Left: original 3D spherical data set. Right: 2Ddata set computed by LLE.
2.5.2 LLE-Based Interpolation
Suppose we are given a collection of low-resolution image patches X = {x_i}_{i=1}^N and their
corresponding high-resolution patches Y = {y_i}_{i=1}^N, where images are expressed in raster
order as vectors. This training set could be prepared by dividing a set of high-resolution
images into patches Y and downsampling the images to patches X. The training set images
should be carefully chosen to reflect the textures and patterns that will be seen in the
interpolation phase and the downsampling rate should be the desired zoom M . Given a
new image patch x, the goal is to find its corresponding high-resolution image y.
Inspired by LLE's out-of-sample extension scheme, Chang et al. propose the following
interpolation. For each low-resolution image patch x, we perform the following steps:
1. Find the K nearest neighbors {x_i}_{i=1}^K of x in the data set X. The metric could be
the Euclidean distance, although more sophisticated image difference metrics could
be devised.
2. Compute weights {w_i}_{i=1}^K that minimize the reconstruction error:

err = |x − Σ_{i=1}^K w_i x_i|²   (2.11)

subject to the constraint Σ_{i=1}^K w_i = 1.
Note that this minimization is only performed over the neighborhood of x, so we
could enforce wi = 0 for any data point xi not in the neighborhood. We can solve
this constrained least squares problem by computing a Gram matrix

G = (x~1^T − A)^T (x~1^T − A)

where A is a matrix containing the neighbors {x_i}_{i=1}^K as its columns. Expressing the
weights as a vector ~w, the closed-form solution of (2.11) is

~w = G^{-1}~1 / (~1^T G^{-1}~1). (2.12)

Equivalently, we could solve G~w = ~1 and then normalize the weights so that Σ_{i=1}^K w_i = 1.
3. Project x to its high-dimensional image patch by computing

y = Σ_{i=1}^K w_i y_i

where the y_i's are the high-dimensional patches corresponding to the x_i's.
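Steps 2 and 3 can be sketched in numpy as follows. The function name and the toy patches are our own illustration of the Gram-matrix solve in (2.12), not the authors' code; a small regularization handles the case where G is singular.

```python
import numpy as np

def reconstruction_weights(x, neighbors):
    """Constrained least-squares weights of (2.11) via the Gram matrix,
    solving G w = 1 and normalizing so the weights sum to 1."""
    A = np.column_stack(neighbors)
    D = np.outer(x, np.ones(A.shape[1])) - A         # x 1^T - A
    G = D.T @ D
    G += 1e-8 * np.trace(G) * np.eye(G.shape[0])     # regularize if singular
    w = np.linalg.solve(G, np.ones(G.shape[0]))
    return w / w.sum()

# If x is exactly an affine combination of its neighbors, the weights
# recover that combination, and step 3 applies it to the high-res patches:
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
x = 0.25 * x1 + 0.75 * x2
w = reconstruction_weights(x, [x1, x2])
print(np.round(w, 6))  # [0.25 0.75]
y1, y2 = np.array([0.0, 0.0, 0.0, 0.0]), np.array([2.0, 2.0, 2.0, 2.0])
y = w[0] * y1 + w[1] * y2
print(y)  # [1.5 1.5 1.5 1.5]
```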
After completing these steps on each image patch, we have obtained a collection of
upsampled image patches which can be arranged into a high-resolution image. However,
these patches are not independent since they should form a single image. To help enforce
continuity between adjacent patches, the training set X is formed by selecting overlapping
patches from the training images. Overlapping the image patches is a common trick that
is used in many machine learning vision algorithms [51]. Since this still does not guarantee
continuity between the computed high-resolution patches, the high-resolution image is
constructed by averaging pixel values in the overlapping regions.
Note that if we use the raw pixel values, as suggested above, our method will be sensitive
to changes in luminance. That is, if we supply a test image that is brighter than the training
images, the first step of the interpolation will not match correct textures to the given image
patch. Chang et al. work around this problem by using the relative luminance changes
in their low-resolution patches X. Each pixel in a low-resolution patch is replaced with a
4D feature vector consisting of finite difference approximations of the first and second order
gradients. This helps the algorithm find neighbors with similar patterns rather than similar
luminances. Since this will prevent us from determining the overall luminance value of the
interpolated high-resolution image, the mean luminance value of each high-resolution patch
in the training set Y is subtracted from all pixel values. In step 3, the target high-resolution
patch y is constructed and the mean luminance value of the original low-resolution patch x
is added to y.
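One plausible reading of this feature construction (an assumption on our part, not Chang et al.'s exact filters) is to take first- and second-order finite differences in x and y at each pixel, giving a 4D feature per pixel and a length-36 vector per 3x3 patch. The point of the example is that derivative features are invariant to a constant brightness shift.

```python
import numpy as np

def patch_features(patch):
    """Hypothetical 4D-per-pixel feature: first- and second-order
    finite differences in x and y, flattened over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    gyy = np.gradient(gy, axis=0)
    gxx = np.gradient(gx, axis=1)
    return np.stack([gx, gy, gxx, gyy], axis=-1).ravel()

patch = np.arange(9.0).reshape(3, 3)
f = patch_features(patch)
print(f.shape)  # (36,)
# Invariant to a constant brightness shift, as the text requires:
print(np.allclose(f, patch_features(patch + 50.0)))  # True
```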
2.5.3 Numerical Results
We implemented the LLE-based interpolation by selecting a set of high-resolution photographs to form the set Y and downsampling them by a specified zoom factor M to obtain
our low-resolution training set X. For our image patch sizes, we used 3x3 windows for the
low-resolution images and 3M × 3M windows for the high-resolution images. Using the
relative luminance changes as our feature vector, each vector in X had length (4)(3²) = 36
and each high-resolution vector in Y had length 9M2. The low-resolution patches were
selected from the images with an overlap of 1 pixel width between adjacent patches. The
high-resolution patches necessarily had an overlap width of M pixels. These are the same
window sizes used in [35]. As suggested by the authors, we quadrupled the size of the train-
ing set by using rotations of the image patches (0°, 90°, 180°, 270°). This makes use of the
assumption that texture patches are often rotationally invariant. One of our training sets
consisting of face images is shown in Figure 2.7. The training phase is very time-consuming:
this particular data set of 5 images took 91 minutes to prepare.
Figure 2.7: Training set used for LLE-based interpolation.
In the interpolation phase, we use K = 5 nearest neighbors. Figure 2.8 shows the interpolation result on a face image with zoom M = 3. The interpolation phase is rather slow,
since the nearest neighbor computation is rather expensive and grows quadratically with
the size of the training set. This particular image took roughly 20 minutes to compute.
Some aliasing and discretization effects can be seen in the original image, and the interpolated image is noticeably smoother. The interpolated image is somewhat blocky, however,
reflecting the square windows used in the interpolation.
If we magnify a piece of the image in Figure 2.8, we can better see the blocky nature
of the reconstruction. Figure 2.9 shows a close-up of the eye. The LLE interpolation
definitely exhibits some aliasing, whereas the bilinear and bicubic filters smooth the image
better. Despite this effect, LLE does seem to do a good job interpolating texture.
The major drawback of LLE interpolation, and of machine learning methods in general, is
that they require the generation of a good training set. In this case, the training set should
reflect the textures that will be seen in the test image. That is, if we want to interpolate
Figure 2.8: LLE-based interpolation of a face image with zoom M = 3 and training set shown in Figure 2.7. Left: original image. Right: LLE interpolated image.
faces then the training set should consist of face images. Figure 2.10 shows the result of
interpolating a text image using the face image training set. The text in the image is blurred
and the overall contrast of the image is changed.
Not only should the training set reflect the type of image interpolated, but the selected
images should also reflect the order of magnitude of the resolution desired. For example, the
texture of a brick wall will change drastically depending on the viewer’s distance from the
wall. After images are selected, generating the training manifolds X and Y is very time-
consuming. This data preparation could be done as a pre-processing step, provided the
zoom factor M is known. The downsampling rate and patch sizes depend on M , so this
factor must be fixed before training begins. Selecting and preparing a training set requires
prior knowledge of the type of image to be interpolated, the resolution of the images, and
Figure 2.9: Close-up of eye in Figure 2.8 with zoom M = 3. Top left: nearest neighbor. Top right: bilinear. Bottom left: bicubic. Bottom right: LLE-based interpolation.
the desired zoom factor. While this information is not generally available beforehand, there
are applications in which these parameters are known, such as MR image interpolation.
2.6 A Statistical Approach
Many, if not all, approaches to image processing can be interpreted as having a statistical
or probabilistic motivation. Certainly, the linear interpolation filters mentioned earlier are
statistical in nature, devising a convolution kernel that produces a weighted sum of neighboring pixels. Several researchers, particularly in psychology, have focused on developing
a statistical theory of images and patterns [56] and recent efforts have tried to incorporate
this information into interpolation [51]. Interpolating textured images is related to the
problem of texture synthesis, which is based on computing local statistics that segment and
classify the image textures [98]. Several efforts have been made to develop Bayesian and
Figure 2.10: Text image interpolated by LLE with zoom M = 3. The training set in Figure 2.7 was used. Top: original image. Bottom: LLE interpolated image.
MAP estimators for constructing a super-resolved image from a sequence of low-resolution
images [8]. In this section, we will present a simple statistical filter based on global image
statistics that simultaneously denoises and interpolates a textured image.
2.6.1 Local vs. Global Interpolation
All of the methods we have discussed thus far are based on local image statistics
and properties. The PDE and variational methods are based on very local finite difference
calculations. The wavelet interpolation method seeks to detect local singularities in the
image. Even LLE-based interpolation uses only a small 3x3 window of the given image as
its basis for interpolation, even though this window is compared to other small windows in
a large training set. However, textured images will often contain repeatable and identifiable
patterns throughout the image.
Although the previous methods preserved edges and structures well, they had a much
harder time interpolating texture. Except for LLE interpolation, all the methods tended
to over-smooth textured regions. This may be because the PDE, variational, and wavelet
methods can be written in a simple closed form, but natural textured images defy com-
pact mathematical explanation. LLE interpolation could only reproduce the texture if the
texture was present in the training set at the desired order of resolution.
In summary, we have observed two simple facts:
1. Most interpolation schemes are local.
2. Most interpolation schemes do not preserve textures.
This motivates the creation of an interpolation scheme based on global image statistics.
Conveniently, a statistical filter based on global statistics has recently been developed
for image denoising. Appropriately, its creators refer to it as the Non-Local (NL)
filter.
2.6.2 NL-Means Denoising
Buades, Coll, and Morel proposed a new statistical filter for denoising images that uses the
information present in the entire image [19, 20]. Suppose we are given a noisy grayscale
image u0 : Ω → ℝ. For each pixel x ∈ Ω, we define a local neighborhood N(x) ⊆ Ω as a
subset satisfying two simple properties:
1. x ∈ N(x)
2. x ∈ N(y) ⇒ y ∈ N(x)
There are many possible choices of topology that will satisfy these two properties. Note
that a simple N × N window, with N > 1 odd, centered over pixel x will suffice. Each
neighborhood describes the local pattern or texture surrounding x. If x is a noise point, then
to determine the proper value of u0(x) we should consider the pixel values u0(y) surrounded
by neighborhoods N(y) similar to N(x). Not knowing the pattern of the image a priori, we
assume that the image neighborhoods are distributed according to a Gaussian distribution.
This gives rise to the NL-means filter:
\[
u(x) = \frac{1}{Z(x)} \int_\Omega u_0(y)\, \exp\left( -\frac{|u_0(N(x)) - u_0(N(y))|^2}{h^2} \right) dy \qquad (2.13)
\]
where Z(x) is the normalization factor
\[
Z(x) = \int_\Omega \exp\left( -\frac{|u_0(N(x)) - u_0(N(y))|^2}{h^2} \right) dy.
\]
The norm in (2.13) can be any matrix norm, such as the Frobenius norm or any Lp matrix
norm. Buades et al. recommend the L2 matrix norm.
The filtering parameter h controls the weight pixel values receive and needs to be set
carefully. If h is too small, the image u will closely resemble the original image. If h is too
large, then pixel values with dissimilar neighborhoods will contribute to the value of u(x)
and the result will resemble Gaussian blurring. Intuitively, h acts like a standard deviation
of the neighborhood distribution. Given the Gaussian nature of equation (2.13), we found
experimentally that a good choice is h = √2 σ, where σ is the standard deviation of the pixel
values in u0.
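To fix ideas, here is a minimal, deliberately brute-force sketch of the filter in equation (2.13). The function name, the replicate padding, and the default patch radius are our illustrative choices, not prescriptions from [19, 20]:

```python
import numpy as np

def nl_means_denoise(u0, radius=1, h=1.0):
    """Brute-force sketch of the NL-means filter (2.13): every pixel becomes a
    weighted average of all pixels, with weights given by Gaussian similarity of
    (2*radius+1)^2 neighborhoods. Callers should pass h ~ sqrt(2)*std(u0),
    following the heuristic in the text."""
    pad = np.pad(u0, radius, mode='edge')  # replicate padding: a Neumann-style boundary
    rows, cols = u0.shape
    # Flatten every neighborhood into a vector, one per pixel.
    patches = np.array([pad[i:i + 2*radius + 1, j:j + 2*radius + 1].ravel()
                        for i in range(rows) for j in range(cols)])
    flat = u0.ravel()
    out = np.empty_like(flat)
    for k in range(flat.size):
        d2 = ((patches - patches[k]) ** 2).sum(axis=1)  # squared L2 patch distances
        w = np.exp(-d2 / h**2)                          # Gaussian weights
        out[k] = (w * flat).sum() / w.sum()             # w.sum() plays the role of Z(x)
    return out.reshape(u0.shape)
```

For an n-pixel image this loop performs O(n²) patch comparisons, which is why the method is costly; practical implementations typically restrict the comparison to a large search window around each pixel.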
Buades et al. demonstrated that the NL-means filter successfully denoises textured
images. The authors showed that on test cases NL-means outperformed classical denoising
methods, including Gaussian smoothing, the Wiener filter, the TV filter, wavelet thresh-
olding, and anisotropic heat diffusion. They showed that under certain assumptions on the
noise distribution, NL-means minimizes the additive noise present in the original image [19].
In a follow-up paper, the authors showed that NL-means smooths edges and reduces the
staircasing effect in aliased images [20].
Figure 2.11: NL-means denoising on part of the Lena image. Left: noisy image. Right: image after NL-means denoising. Taken from [19].
2.6.3 NL-Means Interpolation
Based on the simple, elegant denoising filter in (2.13), we formulate a statistical filter for
image interpolation. Suppose we have a low-resolution, possibly noisy image u0 : Ω → ℝ
and a version of u0 upsampled by a factor M, v : ΩM → ℝ. Here, v will act as our
reference image on the finer lattice ΩM . The interpolation method for obtaining v could be
any chosen method, although the nearest neighbor interpolation would be a suitable choice
since it does not introduce any image artifacts or additional noise.
Similar to the NL-means denoising case, we wish to compare neighborhoods in the image
u0 that will allow us to interpolate pixel values that reproduce local patterns and textures.
To interpolate to the finer lattice, we should compare neighborhoods in v to neighborhoods
in the original image u0. Locally, we may not be able to correctly interpolate the texture
of an image. However, the downsampling process that created the image u0 may have
sampled the texture in a non-uniform fashion so that texture information may be present
in one portion of the image that is not present in another.
To motivate this comparison, consider the following scenario. Suppose we are interpolating
a low-resolution photograph of a brown-eyed woman. Suppose that the downsampling
procedure that transferred the real scene to a camera image did not sample the black pupil
in the left eye. When we attempt to zoom in on the left eye, we will have to decide what
pixel value to fill in-between the brown pixels of the iris. If we use any of the interpolation
schemes described previously, they will use local information to fill in the missing pixel
with brown. However, a more natural way to fill in the missing pixel value is to look at
the right eye which, if we’re lucky, may have sampled the black pupil in the low-resolution
photograph. NL-means would compare the neighborhoods throughout the image, decide
that the right eye’s neighborhood closely resembles the left eye’s, and give a large weight
to the black pupil contained in the right eye. Note that this example is different from the
denoising case described in the last section. The missing pupil in the left eye was not due
to noise; it was due to the coarse lattice of the original image.
There is a slight difficulty in comparing neighborhoods on the coarse lattice Ω to neigh-
borhoods on the finer lattice ΩM . Since the interpolation procedure essentially places empty
pixels between pixels, we should think of the neighborhoods as being spread out in a sim-
ilar manner when we move to a finer grid. Suppose we have a fixed zoom factor M and
a neighborhood topology on the original image lattice given by N(x). We define an M-neighborhood
NM(x) ⊆ ΩM as the set of pixels mapped from N(x) by the upsampling
Dirac comb in equation (1.2). Note that for M = 1, we have N1(x) = N(x) ⊆ Ω. Figure
2.12 illustrates M -neighborhoods where the original neighborhood topology is a 3x3 pixel
square. We can think of the M -neighborhood as placing M − 1 empty pixels between each
pixel in the original neighborhood.
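The M-neighborhood construction reduces to stretching the offsets of the original topology by the zoom factor; a tiny sketch (the helper name is ours) for a square topology:

```python
def m_neighborhood_offsets(radius, M):
    """Offsets of the M-neighborhood N_M for a (2*radius+1)^2 square topology:
    the original offsets scaled by the zoom factor M, which is equivalent to
    placing M-1 empty pixels between the pixels of the original neighborhood."""
    base = [(i, j) for i in range(-radius, radius + 1)
                   for j in range(-radius, radius + 1)]
    return [(M * i, M * j) for (i, j) in base]
```

For M = 1 this returns the original neighborhood unchanged, matching N1(x) = N(x) above.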
Taking the image v as our initialization on the finer lattice ΩM and using our definition
of the M-neighborhood, we can adapt equation (2.13) to interpolate an image u : ΩM → ℝ.
The NL-means interpolation filter becomes
\[
u(x) = \frac{1}{Z(x)} \int_\Omega u_0(y)\, \exp\left( -\frac{|v(N_M(x)) - u_0(N_1(y))|^2}{h^2} \right) dy, \quad x \in \Omega_M \qquad (2.14)
\]
Figure 2.12: Illustration of M -neighborhoods for a 3x3 pixel square topology.
where again Z(x) is the normalization factor
\[
Z(x) = \int_\Omega \exp\left( -\frac{|v(N_M(x)) - u_0(N_1(y))|^2}{h^2} \right) dy.
\]
Again, the fitting parameter h needs to be set experimentally. We used h = √2 σ as a
starting guess, where σ is the standard deviation of the pixel values in u0. However, we
found that this value did not work for all images. If h was too small, the image u would show
little change from its initialization v. If h was too large, the resulting interpolated image
would be blurred. But for an appropriate h, NL-means would simultaneously upsample and
denoise the image.
We found that for some natural images with little patterned or textured data, NL-means
would perform very poorly regardless of the value of h and would fill in many pixels with
value zero. Quite simply, if there is no pattern to learn, then NL-means will return a value
0. So we adjusted the filter slightly by adding the initialization point v(x) to the calculation
of u(x). The pixel v(x) will necessarily have weight 1 in the filter calculation, so the filter
equation (2.14) becomes
\[
u(x) = \frac{1}{1 + Z(x)} \left( v(x) + \int_\Omega u_0(y)\, \exp\left( -\frac{|v(N_M(x)) - u_0(N_1(y))|^2}{h^2} \right) dy \right) \qquad (2.15)
\]
with the same normalization constant Z(x) as before. With this adjustment, even if the
image contains no discernible pattern, NL-means should return the image v. Note
that we could use any interpolation scheme to determine v, so we may view NL-means
interpolation as a refinement step that can be added to another interpolation scheme.
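A sketch of the adjusted filter (2.15), under the assumptions stated above (nearest-neighbor initialization v, square topology, Neumann boundaries); the function name and looping strategy are our illustrative choices:

```python
import numpy as np

def nl_means_interpolate(u0, M, radius=1, h=1.0):
    """Sketch of NL-means interpolation (2.15). u0 lives on the coarse lattice;
    v is its nearest-neighbor upsampling on the fine lattice. Each fine pixel x
    compares its M-neighborhood in v with every 1-neighborhood in u0; the v(x)
    term with weight 1 guards against the no-pattern case."""
    v = np.kron(u0, np.ones((M, M)))            # nearest-neighbor upsampling
    r = radius
    pad0 = np.pad(u0, r, mode='edge')           # Neumann boundaries, coarse lattice
    padv = np.pad(v, M * r, mode='edge')        # Neumann boundaries, fine lattice
    rows, cols = u0.shape
    # All coarse 1-neighborhood patches, flattened: one per pixel y in Omega.
    coarse = np.array([pad0[i:i + 2*r + 1, j:j + 2*r + 1].ravel()
                       for i in range(rows) for j in range(cols)])
    u = np.empty_like(v)
    for x0 in range(v.shape[0]):
        for x1 in range(v.shape[1]):
            # M-neighborhood of x in v: the square patch subsampled with stride M.
            fine = padv[x0:x0 + 2*M*r + 1:M, x1:x1 + 2*M*r + 1:M].ravel()
            w = np.exp(-((coarse - fine) ** 2).sum(axis=1) / h**2)
            u[x0, x1] = (v[x0, x1] + (w * u0.ravel()).sum()) / (1.0 + w.sum())
    return u
```

On a constant image all weights equal 1 and the filter returns the initialization, which is exactly the fallback behavior the adjustment in (2.15) was designed to guarantee.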
2.6.4 Numerical Results
For our experiments we used a 5x5 pixel square topology or, when the original image was
large enough, a 7x7 pixel square. We used Neumann boundary conditions to determine
the neighborhoods of pixels at the border. The nearest neighbor interpolation scheme was
used to produce the initial image v. Because each pixel compares its neighborhood to the
neighborhood of every other pixel, NL-means interpolation is quadratic in the number of
image pixels. The computation time is high and may take several minutes to run, depending
on the image size.
For most images, we used the parameter value h = √2 σ, although we needed to adjust
this value for some images. Figure 2.13 shows the result of applying NL-means with h = √2 σ
to a Brodatz texture. The NL-means image appears less discretized than the bicubic image,
but is also more blurred.
Figure 2.13: Interpolation of Brodatz fabric texture with zoom M = 3. Left: Original image. Center: Bicubic interpolation. Right: NL-means interpolation.
Figure 2.14 shows the ability of NL-means to simultaneously remove noise and interpolate
an image. The original image contained ringing artifacts from the image conversion
process. The edges are sharper than in nearest neighbor interpolation and the ringing
artifacts are removed.
Figure 2.14: NL-means interpolation of ringed image with zoom M = 3 compared to linear interpolation filters.
Executing the interpolation and denoising processes simultaneously may have certain
advantages over performing them separately. If denoising is performed first, then denoising
may also remove fine structures which will be on the level of noise in a low-resolution
image. If interpolation is performed first, then noisy data will also be interpolated and the
larger noise points will be harder to remove. Figure 2.15 illustrates this concept. In the
image at bottom right, the salt and pepper noise points are made larger by the bicubic
interpolation and also blurred into the background. The NL-means denoising algorithm is
unable to remove the noise, creating a stippled black background. NL-means interpolation
is more successful in recovering the pure black background. However, it was harder to
remove the noise near the edges because fewer neighborhoods in the image matched these
neighborhoods of pixels near the edge.
Figure 2.15: Interpolation of a noisy image by factor M = 3. The image at bottom-right underwent bicubic interpolation followed by NL-means denoising.
If the parameter h is too large, the image will be blurred and fine structures may be
lost. In Figure 2.16, NL-means preserves the edges on the striped texture well and the
stripes are smoothed. However, the fine detail of the shirt collar is lost. The original image
resolution was a mere 60x60 pixels, which limited the number of neighborhoods available
for comparison. This is the paradox of relying on global information for image zooming:
in order to correctly interpolate a high-resolution image, the low-resolution image must be
fairly large to begin with.
When a portion of an image is zoomed upon, NL-means could use the entire original
image for its comparison neighborhoods. Figure 2.17 shows the result on a MR brain image.
The entire MR image was used for the neighborhood calculation. The resulting image has
Figure 2.16: NL-means interpolation of textured image with zoom M = 4. Left: originalimage. Center: Bicubic interpolation. Right: NL-means interpolation.
smoothed homogeneous regions, while still giving some hint at texture. The edges, fine
structures, and contrast of the image are preserved.
2.6.5 Further Research on NL-Means Interpolation
Although the algorithm is not appropriate for all images, NL-means interpolation is
promising and could yield a truly global approach to interpolation. The algorithm is very
sensitive to the value of the filter parameter h and this warrants more investigation. We will
also experiment with different interpolation schemes for producing the initialization image
v.
It may be possible to incorporate the downsampling or camera model into the algorithm.
For example, suppose we know the downsampling is preceded by convolution with a Gaus-
sian point spread function (PSF). When comparing neighborhoods, it may be worthwhile
to replicate the camera model by convolving v with the Gaussian PSF. Another adjustment
which may be promising is to modify the neighborhoods used in our comparisons. In the case when x
is a noise point, it might prove more meaningful to compare the neighborhood surrounding
the point but not the point itself.
As in Figure 2.17, NL-means can use image information that is not part of the
portion of the image to be zoomed. It might be feasible to extend this idea to consider
Figure 2.17: NL-means zooming of portion of MR brain image. Top: original MRI. Bottom-left: lower left corner of brain. Bottom-right: NL-means zoom with M = 3.
neighborhoods in other images, as LLE-based interpolation does. We might also use ro-
tated neighborhoods to better interpolate texture. NL-means might also prove useful for
super-resolution: producing a single high-resolution image from a sequence of low-resolution
images. Most super-resolution schemes require accurate registration of the image sequence,
which can be troublesome if the resolution is very low or the objects in the image undergo
more than translations and rotations. NL-means does not require image registration, only
a large set of neighborhoods to compare.
2.7 Summary and Motivation for the Variational Approach
In this chapter, we discussed three existing interpolation techniques and presented one new
technique, all representative of the different approaches to image processing. Keeping in
mind the criteria we introduced in Chapter 1, we briefly summarize the advantages and
drawbacks of the methods as follows:
• Heat diffusion interpolation: Preserves and smooths edges well, but may over-smooth
textured regions and change contrast levels.
• Wavelet-based interpolation: Preserves and sharpens edges, but not textures. Reduces
to bicubic interpolation in textured or smooth regions. Can only double the resolution
of the image.
• LLE-based interpolation: The training set needs to be carefully selected to represent
the type of images, textures, and order of resolution that will be needed. Tends to
create small blocky regions. Unclear if it outperforms bilinear or bicubic interpolation.
• NL-means interpolation: Interpolates texture, but not specifically set up to sharpen
edges. Best suited for large, textured images. Can simultaneously interpolate and
remove noise, but may remove fine structures as well. May result in aliasing or
blurring. Sensitive to value of parameter h.
As mentioned earlier, most interpolation schemes act only on local information and fail
to interpolate texture well. This motivated the idea behind the NL-means interpolation.
However, the NL-means interpolation scheme does not always produce satisfactory results,
especially on small natural images.
For images or applications where texture is not important, we should employ the oppo-
site strategy of concentrating on local information. A good model should be self-contained,
not relying on detecting patterns within the image or on a database of images. Since the
best interpolation results focused on edges, our model should specifically account for dis-
continuities in the image. We saw that some zooming methods work well for certain types
of images, but not others. Therefore our model should be flexible, with parameters or com-
ponents that can be tuned to the image and task at hand. Finally, the model should be
robust to noise, ideally removing the noise during the zooming process. These properties of
a good model motivate the variational approach, which we will define in the next chapter.
Chapter 3
Variational Zooming
3.1 Introduction to the Variational Approach
The motivation for the variational approach is best understood in terms of Bayes' Rule, the
starting point for all of computer vision [32, 74]. The goal is to recover an ideal, noise-free
image u : Ω → ℝ from an observed, noisy image u0. Bayes' Rule seeks the image u that
maximizes the probability
\[
\max_u \Pr(u \mid u_0) = \frac{\Pr(u)\,\Pr(u_0 \mid u)}{\Pr(u_0)}. \qquad (3.1)
\]
Note that the denominator Pr(u0) is a constant and can be ignored in the optimization. If
we assume that the image u0 is corrupted by additive Gaussian white noise n with zero mean
and variance σ², we can write u0 = u + n and at a pixel x
\[
\Pr(u_0(x) \mid u(x)) \propto \exp\left( -\frac{(u(x) - u_0(x))^2}{\sigma^2} \right).
\]
Furthermore, we can express Pr (u) as a Gibbs energy in terms of a functional R(u)
\[
\Pr(u) \propto \exp(-\beta R(u))
\]
for constant β. In statistical mechanics, β is related to the temperature of the system and
Boltzmann’s constant. The maximization in (3.1) is then equivalent to
\[
\max_u \Pr(u \mid u_0) = \exp\left( -\int_\Omega \frac{(u - u_0)^2}{\sigma^2}\, dx - \beta R(u) \right).
\]
Taking the negative log likelihood of both sides and dropping constants, maximizing the
Bayesian probability becomes a minimization of an image energy
\[
\min_u E[u \mid u_0] = R(u) + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx. \qquad (3.2)
\]
The constant λ is proportional to 1/σ². The first term on the right-hand side is called the
regularization term or image prior and generally describes the smoothness of the image u.
The second term is called the fidelity or matching term and forces the computed image to
remain close to the original image in the least squares sense.
The variational approach can be seen as a form of Tikhonov regularization used in the
context of ill-posed problems. Tikhonov and Arsenin proposed the regularization
R(u) = ∫Ω |∇u|² dx, assuming smoothness as an image prior [93]. Many image priors have been
developed since, often developed specifically for the image processing task or application.
The regularization need not be explicit and can be learned from data, as done in [8] for
human faces. In this thesis we will focus on the two most popular regularization strategies:
the Total Variation (TV) norm [83] and the Mumford-Shah energy [75].
The minimization in (3.2) has proven effective for image smoothing, denoising, deblur-
ring, and segmentation [32]. The image inpainting problem, as first described by Bertalmio
et al., seeks to fill in missing or corrupted information in a damaged image while also
possibly denoising the image as a whole [12]. Let D ⊆ Ω denote the damaged region of the
image. Variational inpainting minimizes the energy
\[
\min_u E[u \mid u_0] = R(u) + \frac{\lambda}{2} \int_{\Omega \setminus D} (u - u_0)^2\, dx. \qquad (3.3)
\]
The idea is that no information is available within D so the fidelity term is set to zero in
this region, while the regularization term smooths the image as a whole.
If we view image zooming as “filling in pixels in between pixels,” image inpainting
extends naturally to image zooming. For a magnification factor M ≥ 1, let Ω be the
domain of the original image u0 : Ω → ℝ and ΩM denote the high-resolution domain of the
zoomed image u : ΩM → ℝ. Assume for notational convenience that Ω ⊆ ΩM. For digital
images on integer lattices, this can be accomplished by inserting M − 1 pixels between
the pixels of the low-resolution lattice (see Figure 3.1). The inpainting domain becomes
D = ΩM \ Ω and the inpainting model (3.3) becomes the zooming model
\[
\min_u E[u \mid u_0] = R(u) + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx. \qquad (3.4)
\]
Figure 3.1: Illustration of zooming by variational inpainting for magnification M = 3.
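The lattice embedding of Figure 3.1 amounts to a strided assignment; a small sketch (the function name is ours) that also records the known-pixel mask whose complement is the inpainting domain D = ΩM \ Ω:

```python
import numpy as np

def embed_low_res(u0, M):
    """Place each low-resolution pixel on the fine lattice Omega_M with M-1
    empty pixels in between, and return a boolean mask marking the known
    pixels; the mask's complement is the inpainting domain D."""
    rows, cols = u0.shape
    hi = np.zeros((rows * M, cols * M))
    known = np.zeros((rows * M, cols * M), dtype=bool)
    hi[::M, ::M] = u0        # known data lands on the coarse sub-lattice
    known[::M, ::M] = True
    return hi, known
```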
In this chapter, we will give a brief survey of the existing mathematical theory behind
the TV and Mumford-Shah energies, with specific attention to results on the inpainting and
zooming problems. We will then discuss numerical computation of the minimum energy
zooming using the digital TV filter and the Mumford-Shah Γ-convergence approximation.
Numerical results will be presented and compared to the zooming processes discussed in
Chapter 2. Finally, we suggest modifications to the basic inpainting/zooming model that
can improve the quality of the image interpolant.
3.2 The Total Variation (TV) Energy
The TV regularization was first proposed for image processing in the seminal paper by
Rudin, Osher, and Fatemi [83]:
\[
R_{TV}(u) = \int_\Omega |\nabla u|\, dx.
\]
TV regularization encourages image smoothness while allowing for the presence of jumps
and discontinuities, a key feature for image processing because of the importance of edges
in human vision. The norm | · | is generally assumed to be the L²-norm
\[
|\nabla u| = \sqrt{u_x^2 + u_y^2}.
\]
In the literature, this is often referred to as the isotropic norm, as it is rotationally invariant.
In Chapter 5, we will discuss quantized TV minimization under the anisotropic L1-norm.
3.2.1 Theory and Theorems
As discussed by Chan and Shen [32], the TV norm can be derived from a level set viewpoint
by building it from statistics on level curves
\[
\Gamma_\alpha = \{ x \in \Omega : u(x) = \alpha \}.
\]
Note that if u is smooth, then each Γα will be a smooth curve. If we take the length
L of the curve as a measure of smoothness, then the regularization should be
\[
R(u) = \int_{-\infty}^{\infty} L(\Gamma_\alpha)\, d\alpha.
\]
The curve length is a natural choice for measuring smoothness and is exploited specifically
by the Mumford-Shah energy. Chan and Shen proved that the length is the only Euclidean
invariant, linear additive curve energy that can be expressed as a two-point accumulation:
\[
e(x_{1 \le i \le n}) = c \sum_{i=0}^{n-1} |x_{i+1} - x_i| = c\,L(\Gamma_\alpha), \qquad x_{1 \le i \le n} \in \Gamma_\alpha.
\]
Parametrize the level set Γα by orthogonal flows s and t that are tangent and normal to
the curve. Then we have
\[
d\alpha = |\nabla u|\, dt, \qquad L(\Gamma_\alpha) = \int_{\Gamma_\alpha} ds, \qquad ds\, dt = dx_1\, dx_2 = dx.
\]
So the regularization becomes
\[
R(u) = \int_{-\infty}^{\infty} \int_{\Gamma_\alpha} |\nabla u|\, dt\, ds = \int_\Omega |\nabla u|\, dx.
\]
A more formal derivation leads to the famous co-area formula expressing the TV norm as
the sum of level set perimeters.
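For reference, the co-area formula mentioned here is usually stated as follows (a standard result, e.g. for u ∈ BV(Ω)):

```latex
\int_\Omega |Du| \;=\; \int_{-\infty}^{\infty} \operatorname{Per}\bigl(\{x \in \Omega : u(x) > \alpha\};\, \Omega\bigr)\, d\alpha
```

where Per(E; Ω) denotes the perimeter of the set E inside Ω, so the total variation accumulates the lengths of all level set boundaries.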
The gradient in the TV norm implicitly assumes that the image u ∈ C1(Ω), although a
general image will contain corners and discontinuities at the edges. Computationally, this
does not pose a problem because the image is digital and the gradient is discretized by finite
differences. But theoretically, we should discuss the gradient in the distributional sense Du
and functions in BV space.
Definition 3.1 (BV(Ω)) For a bounded open set Ω ⊂ ℝ² and a function u ∈ L¹(Ω), set
\[
\int_\Omega |Du| = \sup\left\{ \int_\Omega u \left( \frac{\partial \phi_1}{\partial x_1} + \frac{\partial \phi_2}{\partial x_2} \right) dx \;:\; \phi = (\phi_1, \phi_2) \in C_0^1(\Omega)^2,\ |\phi|_{L^\infty(\Omega)} \le 1 \right\}
\]
under the Lebesgue measure dx. Define BV(Ω), the space of functions of bounded variation,
to be
\[
BV(\Omega) = \left\{ u \in L^1(\Omega) : \int_\Omega |Du| < \infty \right\}.
\]
Note that for u ∈ C¹(Ω), we have ∫Ω |Du| = ∫Ω |∇u| dx. Most theoretical results concerning
the TV norm are for functions in BV, which possesses several desirable properties
such as lower semicontinuity and compactness. For example, the TV energy (3.2) does not
generally attain a minimum in the Sobolev space W^{1,1}(Ω), but does in BV space [83].
Theorem 3.1 (Rudin-Osher-Fatemi, 1992) For an observed image u0 ∈ L²(Ω), the minimizer
of the TV energy
\[
E_{TV}[u \mid u_0] = \int_\Omega |Du| + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx
\]
exists and is unique in BV(Ω).
However, TV minimization for the inpainting problem is in general unstable, as the
example illustrated in Figure 3.2 shows. Suppose a binary 0-1 image consists of a black
rectangle with height h and width 3h. Suppose the inpainting domain D is a long rectangular
strip of width d centered over the rectangle. Note that for a 0-1 image, the TV norm
corresponds to the total perimeter of the geometric shapes, up to the choice of norm for the
image corners. For d < h, simple geometry shows that the minimum TV energy is attained
by a solid h × 3h black rectangle. For d > h, the minimizer will consist of two separated
black d× d squares. For the special case d = h, the minimum energy is attained by both of
the images just described. While it is somewhat upsetting that TV inpainting is unstable
for such a trivial example, this is actually consistent with Gestalt principles of human vision
psychology.
Figure 3.2: Inpainting a simple image.
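The bifurcation at d = h can be checked by a back-of-the-envelope perimeter count, under one consistent reading of the geometry (the strip crosses the short dimension of the rectangle, and corners carry no extra cost):

```latex
\underbrace{2(h + 3h)}_{\text{connected filling}} = 8h,
\qquad
\underbrace{2 \cdot 2\left(h + \tfrac{3h - d}{2}\right)}_{\text{two separated pieces}} = 10h - 2d.
```

The connected filling wins when 8h < 10h − 2d, i.e. d < h; the energies coincide at d = h (where the two pieces are exactly h × h squares); and the separated configuration wins for d > h, matching the three cases described above.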
Instability becomes even more troublesome for the zooming problem, where the known
domain consists of isolated pixels. In a continuous TV zooming model, the existence of a
minimum is not guaranteed and the interpolant depends on the chosen numerical scheme
[24]. One solution is to minimize the discrete TV energy, as presented in the next section.
3.2.2 Numerical Computation: The Digital TV Filter
The Euler-Lagrange equation associated with the TV energy is
\[
-\nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) + \lambda (u - u_0) = 0 \qquad (3.5)
\]
with Neumann boundary conditions
\[
\frac{\partial u}{\partial \vec{n}} = 0 \quad \text{on } \partial\Omega.
\]
This equation can be solved by numerical methods such as gradient descent
\[
\frac{\partial u}{\partial t} = \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) - \lambda (u - u_0).
\]
PDE-based methods have to control the size of the time step to make sure the computation
is stable while still converging in an efficient manner. Malgoyres and Guichard outlined
a stable gradient-based implementation of TV zooming in Fourier space which enhances
edges, but produces slight ringing artifacts [71, 72]. These methods assume a continuous
model and may not be appropriate for inpainting / zooming problems.
Since the inputs and outputs of the TV energy are digital images defined on discrete
square lattices, Chan, Osher, and Shen proposed calculating the digital TV norm [33]. Let
N(x) ⊆ Ω denote a neighborhood of the pixel x, consisting of pixels near x but not
including x itself. The standard topology is the 4-connected neighborhood: the neighbors
of pixel (i, j) are the pixels (i ± 1, j), (i, j ± 1). The digital TV norm is defined as
\[
R_{TV}(u) = \int_\Omega |\nabla u|\, dx = \sum_{x \in \Omega} \sqrt{\sum_{y \in N(x)} (u(x) - u(y))^2}.
\]
The Euler-Lagrange equation associated with the digital TV energy is
\[
\sum_{y \in N(x)} \frac{1}{|\nabla u|} (u(x) - u(y)) + \lambda (u - u_0) = 0.
\]
To solve this Euler-Lagrange equation, the authors suggest a lagged diffusivity fixed-point
iterative scheme. The term 1/|∇u| is frozen for one iteration and treated as a constant in the
update u^(n) → u^(n+1):
\[
\sum_{y \in N(x)} \frac{1}{|\nabla u^{(n)}|} \left( u^{(n+1)}(x) - u^{(n+1)}(y) \right) + \lambda (u^{(n+1)} - u_0) = 0.
\]
Solving for u^(n+1) yields the digital TV filter
\[
u^{(n+1)}(x) = \frac{\sum_{y \in N(x)} h^{(n)}(y)\, u^{(n)}(y) + \lambda u_0(x)}{\sum_{y \in N(x)} h^{(n)}(y) + \lambda}, \qquad h^{(n)}(y) = \frac{1}{|\nabla u^{(n)}(y)|}.
\]
There are several methods for discretizing the gradient |∇u(x)| in h^(n). Chan, Osher, and
Shen suggest a central difference scheme centered around the midpoint between pixel x and
its neighbor. For example, the discretization of |∇u| around pixel x = (i, j) for the neighbor
to the right (i + 1/2, j) is
\[
\sqrt{\left( u(i+1, j) - u(i, j) \right)^2 + \left( \frac{u(i+1, j+1) + u(i, j+1) - u(i+1, j-1) - u(i, j-1)}{4} \right)^2}.
\]
The discretization for the other three directions is very similar. To avoid division by zero
in smooth regions, a lifting parameter a is introduced
\[
|\nabla u|_a = \sqrt{|\nabla u|^2 + a^2}.
\]
The authors claim the algorithm is stable for a = 10⁻⁴.
The algorithm is known to be stable for all input u0. An interesting feature is that the
filter satisfies a maximum principle, in the sense that the values of u will not exceed the
maximum of u0. Other interpolation filters, such as bicubic, can overshoot the maximum
when attempting to fit the given data to a smooth function.
To adapt the model to inpainting, an indicator function is added to the fidelity
term to enforce matching only on the undamaged pixels. For an inpainting domain D, the
Euler-Lagrange equation is
\[
-\nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) + \mathbf{1}_{\Omega \setminus D}(x)\, \lambda (u - u_0) = 0.
\]
The digital TV filter is
\[
u^{(n+1)}(x) = \frac{\sum_{y \in N(x)} h^{(n)}(y)\, u^{(n)}(y) + \mathbf{1}_{\Omega \setminus D}(x)\, \lambda u_0(x)}{\sum_{y \in N(x)} h^{(n)}(y) + \mathbf{1}_{\Omega \setminus D}(x)\, \lambda}, \qquad h^{(n)}(y) = \frac{1}{|\nabla u^{(n)}(y)|}.
\]
For the zooming case, the domain Ω\D is replaced with the low-resolution lattice Ω ⊆ ΩM .
Since the original data is finite-dimensional, the digital TV energy always permits a solution.
The value of the parameter λ balances the smoothness and fidelity terms and has a large
effect on the resulting image. As λ→ 0, the image becomes a constant image corresponding
to the mean of u0. As λ → ∞, the minimizer u → u0. There are several computational
methods for setting the parameter, including generalized cross-validation and the L-curve
method [97]. However, there is no “optimal” parameter value since the desired result de-
pends on the noise level, the image, the application, and the user’s subjective expectations.
We set the parameter experimentally by inspection.
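To make the zooming iteration concrete, here is a simplified sketch of the masked digital TV filter. It deliberately departs from the scheme above in one respect we flag explicitly: |∇u| is evaluated by central differences at pixel centers rather than at the midpoints, and the function name, defaults, and iteration count are our illustrative choices:

```python
import numpy as np

def digital_tv_zoom(u0, M, lam=1.0, a=1e-4, iters=200):
    """Sketch of the digital TV filter adapted to zooming: lagged-diffusivity
    sweeps on the fine lattice, with the fidelity term switched on only at the
    known low-resolution pixels (the indicator of Omega\\D)."""
    u = np.kron(u0, np.ones((M, M)))   # nearest-neighbor initialization
    known = np.zeros_like(u, dtype=bool)
    known[::M, ::M] = True             # low-res data sits on the coarse sub-lattice
    f = np.zeros_like(u)
    f[::M, ::M] = u0
    for _ in range(iters):
        g = np.pad(u, 1, mode='edge')  # Neumann boundary conditions
        # Lifted gradient magnitude |grad u|_a via central differences.
        gx = (g[1:-1, 2:] - g[1:-1, :-2]) / 2.0
        gy = (g[2:, 1:-1] - g[:-2, 1:-1]) / 2.0
        hmap = 1.0 / np.sqrt(gx**2 + gy**2 + a**2)   # h = 1/|grad u|_a
        hp = np.pad(hmap, 1, mode='edge')
        up = np.pad(u, 1, mode='edge')
        # 4-connected neighbor sums, weighted by h evaluated at the neighbor.
        num = (hp[1:-1, 2:] * up[1:-1, 2:] + hp[1:-1, :-2] * up[1:-1, :-2] +
               hp[2:, 1:-1] * up[2:, 1:-1] + hp[:-2, 1:-1] * up[:-2, 1:-1])
        den = (hp[1:-1, 2:] + hp[1:-1, :-2] + hp[2:, 1:-1] + hp[:-2, 1:-1])
        u = (num + known * lam * f) / (den + known * lam)
    return u
```

Note that the update is a weighted average of neighbors plus (where the indicator is on) the data term, so the iterates obey the same maximum principle discussed above.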
3.3 The Mumford-Shah Energy
The model introduced by Mumford and Shah in 1989 simultaneously tracks the minimum
image u and the edge set Γ of the image [75]. The regularization term is
\[
R_{MS}(u, \Gamma) = \int_{\Omega \setminus \Gamma} |\nabla u|^2\, dx + \gamma\, \mathcal{H}^1(\Gamma)
\]
where H¹(Γ) denotes the one-dimensional Hausdorff measure and quantifies the total length
of the edges. The regularization smooths the image away from edges while controlling the
size of the edge set. Compared to the TV norm, the exponent 2 on the gradient in the
Mumford-Shah regularization enforces greater smoothing. The exponent 1 in the TV norm
gives equal preference to sharp edges and smooth gradients. Because it also tracks the
edges, the Mumford-Shah functional prefers smoother gradients away from the edges.
3.3.1 Theory and Theorems
Because the minimization is a free boundary problem, less is known theoretically about
the minimizers of the Mumford-Shah energy than the TV energy. In the original paper,
Mumford and Shah proved an interesting result about the geometry of the minimizer.
Theorem 3.2 (Mumford-Shah, 1989) Let (u ∈ W^{1,2}(Ω), Γ ⊂ Ω) be a minimizer of the
Mumford-Shah energy
\[
E_{MS}[u, \Gamma \mid u_0] = \int_{\Omega \setminus \Gamma} |\nabla u|^2\, dx + \gamma \mathcal{H}^1(\Gamma) + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx. \qquad (3.6)
\]
Suppose Γ = ∪Γi where each Γi is a simple C^{1,1}-curve and each curve meets another curve or
the boundary ∂Ω only at its endpoints. Then any vertex of Γ must be one of the following:
1. a point on ∂Ω where Γi meets ∂Ω perpendicularly.
2. a point where three Γi's meet with angle 2π/3 (a triple junction).
3. a point where Γi ends and meets nothing (a crack-tip).
This theorem, however, does not establish the existence of such a minimizer where Γ
consists of C1,1-curves. The existence is called the Mumford-Shah Conjecture and has been
studied by several researchers, notably Braides and Bonnet, but remains an open problem
[18]. Ambrosio established the existence of a minimizing image in a special subset of BV
[3]. This minimizer was later shown by other researchers to be a minimizer of (3.6), but
uniqueness of the image and the precise nature of Γ have not been established.
For inpainting, the minimum image is in general non-unique. If we assume that the
edges of a binary 0-1 image occur exactly at the discontinuities, then the Mumford-Shah
and TV energies are equivalent up to the choice of parameters. The binary image will be
perfectly smooth away from the edges and the length of the edges equals the magnitude
of the TV norm, except perhaps at corners. So the trivial example presented in the last
section also shows Mumford-Shah inpainting is non-unique.
Asymptotically, the Mumford-Shah model also has uniqueness issues for the zooming
problem. If we let γ →∞, it can be shown that the minimizing edge set Γ vanishes to ∅ to
50
compensate [32, 75]. The energy (3.6) becomes the Tikhonov or Sobolev smoothing
\[
E[u \mid u_0] = \int_\Omega |\nabla u|^2\, dx + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx.
\]
Note that inpainting a domain D changes the limit on the second integral to Ω \D. If we
further let λ → ∞, then we obtain harmonic inpainting:
\[
\Delta u = 0 \ \text{in } D, \qquad u(x) = u_0(x) \ \text{for } x \in \Omega \setminus D, \qquad \frac{\partial u}{\partial \vec{n}} = 0 \ \text{on } \partial\Omega. \qquad (3.7)
\]
For the zooming problem, D = ΩM \ Ω, so the known data consist of the isolated pixels of the
low-resolution lattice. Finding a harmonic function with boundary conditions given by such
zero-dimensional data is an ill-posed problem [47]. This suggests that Mumford-Shah zooming
may produce undesirable results, at least in the continuous model.
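Harmonic inpainting (3.7) itself is straightforward to relax numerically; here is a sketch by Jacobi iteration on the 4-neighbor discrete Laplacian (the function name and fixed sweep count are our illustrative choices):

```python
import numpy as np

def harmonic_inpaint(u0, mask, iters=500):
    """Sketch of harmonic inpainting (3.7) by Jacobi iteration: pixels where
    mask is True are known data; the rest are repeatedly relaxed toward the
    average of their 4 neighbors, the fixed point of the discrete Laplace
    equation."""
    u = np.where(mask, u0, u0[mask].mean())   # initialize unknowns with the data mean
    for _ in range(iters):
        p = np.pad(u, 1, mode='edge')         # Neumann condition on the image border
        avg = (p[1:-1, 2:] + p[1:-1, :-2] + p[2:, 1:-1] + p[:-2, 1:-1]) / 4.0
        u = np.where(mask, u0, avg)           # keep the data, relax the rest
    return u
```

On a one-pixel-wide gap between two known values this converges to linear interpolation, the 1D analogue of a harmonic fill.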
An error estimate for harmonic inpainting was developed by Chan and Shen [31] and
later by Chan and Kang [30]. A Green’s function G for a given domain D solves the
harmonic inpainting problem (3.7), if such a G exists. As described in [47], the Green’s
function satisfies
−ΔG = δ(y − x) in D,  G = 0 on ∂D.
Then the harmonic function uh satisfying (3.7) is given by

uh(x) = −∫_{∂D} u0(y(s)) ∂G(x, y)/∂n ds    (3.8)

where n is the outward normal along ∂D and s is the arclength parameter of ∂D. Suppose
the ideal image is utrue, with the image matching the given data outside D: utrue = u0 on
Ω \ D. The true image can be expressed in terms of the Green's function by the double
layer potential
utrue(x) = −∫_{∂D} u0(y(s)) ∂G(x, y)/∂n ds − ∫_D Δutrue(y) G(x, y) dy.    (3.9)

Subtracting equation (3.8) from (3.9) cancels the first term, yielding

utrue(x) − uh(x) = −∫_D Δutrue(y) G(x, y) dy.
This establishes the following bound on the inpainting error.
Theorem 3.3 (Chan-Shen, 2002) Suppose uh, u0, utrue ∈ C²(Ω) and the inpainting domain D has a smooth boundary. Then for any point x ∈ D,

|utrue(x) − uh(x)| ≤ L ∫_D G(x, y) dy

where L is a constant satisfying |Δutrue| ≤ L in D.
This theorem is quite elegant because it shows that the error arises from natural and intuitive sources. Inpainting error depends on three factors: the smoothness of the underlying
image (Δutrue), the size of the inpainting domain (integration over D), and the geometry
of the domain (G). By applying the Green’s function for an ellipse and the comparison
principle for Green’s functions, Chan and Kang obtained the following error bound.
Corollary 3.4 (Chan-Kang, 2005) If the inpainting domain D can be covered by an
ellipse with minor diameter d, then

|utrue(x) − uh(x)| ≤ 2Ld².
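A one-dimensional sanity check of this bound (our own illustration, not from the text): on an interval of width d, harmonic inpainting reduces to linear interpolation between the endpoint values, and for utrue with Δutrue ≡ L the maximum error is Ld²/8, comfortably inside 2Ld².

```python
import numpy as np

# interval of width d, u_true with constant Laplacian L
L, d, n = 3.0, 0.5, 101
x = np.linspace(0.0, d, n)
u_true = 0.5 * L * x**2                               # Laplacian is exactly L
u_h = u_true[0] + (u_true[-1] - u_true[0]) * x / d    # linear (harmonic) interpolant
err = np.max(np.abs(u_true - u_h))                    # max error is L*d^2/8
```

The worst error occurs at the midpoint of the interval, which matches the intuition that long, narrow inpainting domains (small d) are the favorable case.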
This corollary proves that harmonic inpainting is best for long and narrow domains, such
as scratches on a photograph. This is a well-known phenomenon that was observed in the
first research article on image inpainting [12]. Since the Mumford-Shah energy asymptotically approaches the harmonic inpainting model (3.7) as λ, γ → ∞, we can intuitively expect
the Mumford-Shah inpainting error to be bounded by the error for harmonic inpainting, at
least under some choice of parameters.
If we view the zooming process for magnification factor M locally as inpainting a distance
M between pixels, then we can conjecture based on Corollary 3.4 that the Mumford-Shah
zooming error is O(LM²). Unfortunately, the error analysis does not extend to the
zooming problem. To cover the inpainting domain in Figure 3.1 with an ellipse, we would
need to span the entire image. Also, the given data consists of isolated points, so the Green's
function cannot exist in the classical sense for the simple reason that ∂D = ∅. However,
the underlying data is digital and the pixels are only zero-dimensional in the continuous
sense, indicating that perhaps a modified error estimate is applicable to the discrete
computation.
3.3.2 Numerical Computation: The Γ-Convergence Approximation
The standard minimization approach in the calculus of variations is to solve the Euler-
Lagrange equation, but this is difficult for the Mumford-Shah model because the energy
(3.6) is not differentiable. There are two main approaches to minimizing the Mumford-Shah
energy: level set methods [79] and approximating the energy by a suitable functional [7].
We will discuss the latter technique, specifically the approximation developed by Ambrosio
and Tortorelli [4]. This Ambrosio-Tortorelli (AT) approximation has been shown to be
equivalent to the Mumford-Shah energy in the Γ-convergence sense, defined below.
Definition 3.2 (Γ-convergence) A sequence fj : X → ℝ ∪ {∞} Γ-converges in X to
f : X → ℝ ∪ {∞} if for all x ∈ X the following two properties hold:

• For every sequence xj converging to x, we have f(x) ≤ lim inf fj(xj).

• There exists a sequence xj converging to x such that f(x) ≥ lim sup fj(xj).
Under reasonable assumptions on the set X, the minimizer of the functional fj coincides
with the minimizer of its Γ-limit f [18]. The idea behind the AT approximation is to
replace the edge set Γ, which is difficult to track numerically, with an edge canyon function
z : Ω → [0, 1]. For a fixed parameter ε > 0, the function z ∈ L¹(Ω) is designed to be

z(x) = 0 if x ∈ Γ,  z(x) = 1 if d(x, Γ) > ε,

with all remaining values in Ω defined by L¹-extension. The AT approximation is then
given by

E_AT[u, z | u0] = ∫_Ω z²|∇u|² dx + γ ∫_Ω ( ε|∇z|² + (1 − z)²/(4ε) ) dx + (λ/2) ∫_Ω (u − u0)² dx.    (3.10)
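The three terms of (3.10) can be evaluated directly with finite differences; the sketch below (our own discretization, unit grid spacing, np.gradient derivatives) is useful for monitoring the energy during minimization:

```python
import numpy as np

def at_energy(u, z, u0, eps=1.0, gamma=1.0, lam=1.0):
    """Discrete Ambrosio-Tortorelli energy (3.10) with finite differences."""
    ux, uy = np.gradient(u)
    zx, zy = np.gradient(z)
    grad_u2 = ux**2 + uy**2
    grad_z2 = zx**2 + zy**2
    smooth = np.sum(z**2 * grad_u2)                        # edge-gated smoothness
    length = gamma * np.sum(eps * grad_z2 + (1 - z)**2 / (4 * eps))  # edge length
    fidel = 0.5 * lam * np.sum((u - u0)**2)                # data fidelity
    return smooth + length + fidel

u0 = np.zeros((8, 8))
u = u0.copy()
z = np.ones_like(u0)    # no edges: the flat image has zero energy
```

Marking the whole image as an edge (z ≡ 0) is penalized only through the (1 − z)²/(4ε) term, which is why γ controls how much of the image may be designated as edge.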
Comparing this functional term by term to E_MS in (3.6), the first term coincides with
∫_{Ω\Γ} |∇u|² dx because z = 0 on the edge set. This term also has the effect of forcing z to zero
in regions with large variation where |∇u| is large. The second and third terms correspond
to the length H¹(Γ), with the second term smoothing z and the third forcing z = 1 almost
everywhere. As ε → 0, the functional E_AT Γ-converges to E_MS in L¹(Ω). Furthermore,
E_AT admits a minimizer uε that converges in L¹(Ω) to a minimizer u of E_MS in a special
subset of BV(Ω). For an overview of Γ-convergence and the theory surrounding the AT
approximation, we refer the reader to Braides’ monograph [18] and Chapter 4 of the book
by Aubert and Kornprobst [7].
The AT approximation is differentiable and makes standard variational approaches possible. The Euler-Lagrange equations are

−∇·(z²∇u) + λ(u − u0) = 0

|∇u|²z + γ( −2εΔz + (z − 1)/(2ε) ) = 0.    (3.11)
We impose Neumann boundary conditions

∂u/∂n = ∂z/∂n = 0 on ∂Ω.
To phrase this as an elliptic system, Esedoglu and Shen [46] introduced the differential
operators

L_z = −∇·(z²∇) + λ

M_u = ( 1 + (2ε/γ)|∇u|² ) − 4ε²Δ.

Then the Euler-Lagrange equations in (3.11) can be written

L_z u = λu0,  M_u z = 1.    (3.12)
This system can be solved with an iterative solver such as Gauss-Jacobi, alternating the
minimization of u and z.
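One way to organize this computation is a pair of Jacobi sweeps per iteration, each freezing one variable. The discretization below is our own sketch (z² weights taken at the center pixel, replicate boundaries, fixed sweep count), not the exact scheme of [46]:

```python
import numpy as np

def neighbors_sum(a):
    """Sum of the 4-connected neighbors with replicate (Neumann) boundary."""
    p = np.pad(a, 1, mode='edge')
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def at_minimize(u0, lam=20.0, gamma=2000.0, eps=1.0, sweeps=100):
    """Alternating Jacobi sweeps for L_z u = lam*u0 and M_u z = 1."""
    u, z = u0.copy(), np.ones_like(u0)
    for _ in range(sweeps):
        # z-step: (1 + (2 eps/gamma)|grad u|^2) z - 4 eps^2 Laplacian(z) = 1
        gx, gy = np.gradient(u)
        g2 = gx**2 + gy**2
        z = (1.0 + 4.0 * eps**2 * neighbors_sum(z)) / \
            (1.0 + (2.0 * eps / gamma) * g2 + 16.0 * eps**2)
        # u-step: lam*(u - u0) - div(z^2 grad u) = 0, z^2 at the center pixel
        w = z**2
        u = (lam * u0 + w * neighbors_sum(u)) / (lam + 4.0 * w)
    return u, z

# a constant image is a fixed point: u stays at u0 and z stays at 1 (no edges)
u0 = np.full((16, 16), 0.5)
u, z = at_minimize(u0)
```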
To adapt this problem to inpainting, we simply restrict the fidelity parameter λ to be
zero on the damaged region D, replacing λ by 1_{Ω\D}(x)λ in L_z. The equations in (3.12) become

L_z u(x) = 1_{Ω\D}(x) λu0(x),  M_u z(x) = 1.

Zooming replaces Ω \ D above with the low-resolution domain Ω ⊆ ΩM. Esedoglu and Shen
note that for the inpainting problem, ε = 1 will suffice [46].
The other parameters γ and λ need to be set carefully, balancing the edge length, fidelity,
and implicit weight 1 on the smoothness term. As before, the parameter λ should be
inversely proportional to the amount of noise in the image: λ = O(1/σ²). The parameter γ
essentially determines how much of the image can be designated as an edge. As γ → ∞,
the edge canyon function z → 1 a.e. and the edge set vanishes. As γ → 0, z → 0 a.e. to
make the smoothness term z²|∇u|² smaller, effectively designating the entire image as an
edge.
3.4 Numerical Results and Discussion
Variational zooming was implemented using the digital TV filter and the Mumford-Shah
Γ-convergence model described in the previous sections. For TV zooming, we set the lifting
parameter a = 10⁻⁴. For Mumford-Shah zooming, we used the value ε = 1. The variational
models are sensitive to the other parameters, as shown in Figure 3.3. A simple checkerboard
image is zoomed by a factor M = 3. The interpolant should visually match the original
image, but the results vary widely depending on the choice of parameters.
Figure 3.3: TV and Mumford-Shah zoom of checkerboard image for magnification M = 3. The fourth column is a detail view of the image in the third column.
For the TV model, the interpolant becomes blurred if λ is set too small, with the
blurring most noticeable at corners. In a noise-free image, the value of λ can be set very
large. An artifact known as “scalloping” or the “zipper” artifact is shown in the image
at the top right. Under the TV L² norm, the inpainted regions between known pixels are
not necessarily piecewise constant, and oscillating gray values appear along the edges.
The zipper artifacts become more prominent for large values of λ, because the known pixel
values cannot be smoothed.
The Mumford-Shah interpolant depends on the balance between the fidelity weight λ
and the edge length parameter γ. As with the TV model, the image becomes more blurred
as λ → 0. If γ is too large relative to λ, the checkerboard images have wavy edges because
the model is minimizing the total edge length. Also, for large γ the image becomes more
blurred because the edge set is small and the smoothing term can blur a larger portion of
the image. Note that the first two Mumford-Shah images show the effect of Theorem 3.2.
Rather than the edges meeting perpendicularly, the corners are rounded off to form several
triple junctions with angle 2π/3 between the edges. The third image shows the best balance
of the parameters, but the detail reveals the presence of zipper artifacts.
In general, the value of λ should be set as large as possible while still removing any
noise or unwanted features. For natural images with little or no noise, a value λ = 100
usually works well for the digital TV filter. We found that the Mumford-Shah parameters
λ = 20 and γ = 2000 work well for natural images. Throughout this thesis, the variational
parameters are adapted for the situation, as the parameters depend on both the image and
the application.
To zoom a color image, each of the RGB color channels is enhanced separately. This
assumes the color channels are uncorrelated, which of course they are not. There has been
some research on adapting variational methods for color spaces, notably the work by Sapiro
and Ringach for the vector-valued TV norm [85]. Figure 3.4 shows the result of 4x image
zoom on a color image. Note that the variational methods will smooth out fine structures
such as the glasses on the face and the text on the board. Isolated pixels can be seen in
the text, which may actually correspond to a local minimum of the energy. We expect
more isolated pixels to appear in the interpolant as the magnification M gets larger and the
corresponding domain D to inpaint also grows. As can be seen on the face, textures are
over-smoothed and the resulting image may appear “plastic.”
This suggests variational zooming may not be appropriate for producing photo-realistic
images. Instead, the methods are best suited for applications where image smoothing is a
desired result, such as in medical image enhancement or preparing images for automatic
recognition routines [6]. Figure 3.5 shows the results of 3x image zoom of a noisy MRI
brain image. The bicubic interpolation actually enhances the noise, since each pixel in the
original image is given equal weight. The TV and Mumford-Shah methods help smooth
out the noise, while also smoothing out the texture to make the anatomical features more
distinct. Note that simultaneously removing noise and enhancing the resolution can be
difficult, because the zooming procedure isolates the noise points and uses them as guides
for inpainting. Whenever possible, the noise should be removed from the low-resolution
image first before zooming.
Figure 3.4: Zoom of color image with M = 4.
Images produced by variational zooming may contain artifacts, including:

1. Over-smoothing of textured regions and fine structures.

2. Zipper artifacts along edges.

3. Isolated pixels in the final image.

Figure 3.5: Zoom of MRI brain image with M = 3.
In the next section, we will suggest some modifications to the inpainting model to help
correct for these artifacts. Compared to the zooming methods discussed in Chapter 2, the
variational approach offers the following advantages:
1. Genericity: The variational method can be described completely in one energy equation and does not presuppose prior knowledge of the image or image class. The LLE-based zoom in Section 2.5 required a database of image textures, while the NL-means
zooming in Section 2.6 assumed the image contained detectable patterns. As noted
on page 258 of [32], variational inpainting is both local and functional. That is, the
inpainting is based only on information in the vicinity of the missing data and the al-
gorithm treats the images only as functions, not data that requires high-level pattern
recognition.
2. Flexibility: The parameters of variational zooming can be fine-tuned in an intuitive
manner to best suit the image and application, e.g. increasing the smoothing for
medical images. This tuning is harder to accomplish with the wavelet-based method
in Section 2.4 or the PDE-based method in Section 2.3. However, this property may make it
more difficult to select the appropriate parameters for a given image. We found that
fixed parameter values worked well across classes of images, e.g. medical images or text
images, so that the parameters did not need to be tuned for every new image.
3. Edge preservation: The TV norm is designed to allow discontinuities in the image.
The Mumford-Shah energy can enhance edges by smoothing the regions in the vicinity
of the edge. The Mumford-Shah edge length term will also smooth the edges, so
aliasing or staircasing effects should be less prominent than with linear filters. One-
dimensional edges, such as the glasses in Figure 3.4, may be smoothed out.
4. Stability: Variational zooming is robust to image noise and can even help remove
noise points. The variational approach is also somewhat robust to blur, although it
is difficult to remove without accurate knowledge of the blurring process. Algorithms
based on local pattern detection, such as the LLE and NL-means zoom, will be very
sensitive to noise and blur.
3.5 Modifications to the Inpainting Model
3.5.1 Incorporating a Blur Kernel
In the previous discussion, we assumed the image was corrupted by noise but not blur. A
more accurate image degradation model would be

u0 = K[u] + n

for some blur operator K. The blur operator could involve image corruption from many
physical sources: camera blur, optical blur, motion, atmospheric effects, etc. [32]. Generally,
K is assumed to be a convolution with a shift-invariant kernel k(x):

K[u](x) = (k ∗ u)(x) = ∫_Ω k(x − y) u(y) dy.
The TV inpainting model incorporating K becomes

min_u E[u|u0, K] = R(u) + (λ/2) ∫_{Ω\D} (K[u] − u0)² dx.

The associated Euler-Lagrange equation is

−∇·( ∇u/|∇u| ) + 1_{Ω\D}(x) λ K*(K[u] − u0) = 0

where K* denotes the adjoint of K. For a convolution operator, K* is the convolution
with the kernel k reflected about the origin. Incorporating a blur operator into the TV
zooming model has been shown to be effective in reducing blur in the interpolant [1]. The
fidelity term and Euler-Lagrange equations for the Mumford-Shah model are very similar.
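The adjoint relation can be checked numerically: with periodic boundary conditions (an assumption made here purely for simplicity), convolution diagonalizes under the FFT and ⟨Ku, v⟩ = ⟨u, K*v⟩ holds when K* convolves with the reflected kernel. A sketch (the helper name and kernel placement are our own):

```python
import numpy as np
from numpy.fft import fft2, ifft2

rng = np.random.default_rng(0)
k = rng.random((5, 5)); k /= k.sum()          # a normalized blur kernel
u = rng.random((16, 16))
v = rng.random((16, 16))

def conv_periodic(img, ker):
    """Circular convolution via the FFT, kernel center shifted to the origin."""
    K = np.zeros_like(img)
    K[:ker.shape[0], :ker.shape[1]] = ker
    K = np.roll(K, (-(ker.shape[0] // 2), -(ker.shape[1] // 2)), axis=(0, 1))
    return np.real(ifft2(fft2(img) * fft2(K)))

k_reflected = k[::-1, ::-1]                     # reflect kernel about the origin
lhs = np.sum(conv_periodic(u, k) * v)           # <K u, v>
rhs = np.sum(u * conv_periodic(v, k_reflected)) # <u, K* v>
```

The two inner products agree to floating-point precision, which is exactly the adjoint property used in the Euler-Lagrange equation above.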
In practice, the blur operator K is not known and needs to be estimated from the data
or the camera model. Chan and Wong developed a blind TV deblurring algorithm that
imposes a TV smoothness constraint on the blur kernel k:

min_{u,k} E[u, k | u0] = R(u) + (λ/2) ∫_{Ω\D} (k ∗ u − u0)² dx + β ∫_Ω |∇k| dx.
The algorithm alternately deblurs the image and smooths the blur operator. The algorithm
is known to converge for suitable pre-conditioners [34].
3.5.2 Locally Adaptive Fidelity Weights
While variational methods remove image noise by smoothing, they may also smooth out
textured regions in an image. The resulting images are often said to appear “plastic.” In
the TV energy the amount of smoothing is controlled by the parameter λ, so one solution
might be to relax the constant λ to a spatially varying function λ(x):

min_u E_TV[u|u0] = ∫_Ω |∇u| dx + (1/2) ∫_{Ω\D} λ(x) (u − u0)² dx.    (3.13)
The value of λ(x) should adapt to the neighborhood of x: small in noisy or smooth regions,
where strong smoothing is desired, and large in textured regions with large variation, where
detail should be preserved. A simple first attempt is to set

λ(x) ∝ σ²_loc(x) / σ²

where σ² is the variance of the noise in u0 and σ²_loc(x) is the local variance, calculated
over a fixed neighborhood size. The constant of proportionality needs to be determined,
so there is still a parameter to fine-tune for the image and application. If the variance
in a neighborhood of pixel x is large, then the value of λ(x) is set large so that the textured
region is not over-smoothed. The problem with this approach is that it requires estimates
of the noise variance, as opposed to the variance of the image gray values, which would not
distinguish noise from texture.
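The local variance σ²_loc(x) can be estimated with a sliding window; a minimal sketch (the uniform square window and the variable names are our own choices):

```python
import numpy as np

def local_variance(img, radius=2):
    """Variance over a (2*radius+1)^2 window around each pixel (replicate boundary)."""
    size = 2 * radius + 1
    p = np.pad(img, radius, mode='edge')
    windows = np.lib.stride_tricks.sliding_window_view(p, (size, size))
    return windows.var(axis=(-2, -1))

rng = np.random.default_rng(1)
img = np.zeros((12, 24))
img[:, 12:] = rng.normal(0.0, 1.0, (12, 12))  # right half noisy, left half flat
sigma2_loc = local_variance(img)
lam = sigma2_loc / 1.0                         # lambda(x) ~ sigma_loc^2 / sigma^2
```

As expected, the weight map vanishes on the flat half and is of order one on the high-variance half; it cannot by itself tell whether that variance is texture or noise.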
Gilboa, Zeevi, and Sochen proposed a solution that iteratively updates the fidelity
weights of the TV norm [52]. An initial image u(0) is calculated from the minimum TV
energy (3.13) using a constant λ(0) ∝ 1/σ². Then the fidelity weights for the next iteration
are calculated by

λ(n+1)(x) = ( σ²_loc(x)/σ² ) Q(n)(x),  Q(n)(x) = (u(n) − u0) ∇·( ∇u(n)/|∇u(n)| ).    (3.14)
The local variance is estimated by the variance over an N × N window of u0 convolved with
a Gaussian. The iteration stops when the image update is below some threshold. Note that
the formula for Q appears in the TV Euler-Lagrange equation (3.5). The idea is to take the
variance of the gray values as an estimate of the noise variance, then rework this estimate
into the TV minimization. If the value of Q(x) is large, then that pixel's region had a large
update in the last iteration's TV minimization. The value of λ(x) is then set larger, so that
the region is not smoothed as strongly in the next iteration. In this manner, the weights
are adapted to even out the smoothing.
Figure 3.6: 2x TV zooming of noisy image with locally adaptive fidelity weights.
For experimentation, the noisy image in Figure 3.6 was synthetically generated with
additive Gaussian noise with mean zero and known variance. The TV zooming result for
constant λ and M = 2 is shown in the third image. The fourth image shows the result
of TV zooming using the fidelity weights λ(x) calculated by (3.14) on the original low-
resolution domain. The window used for calculation of the local variance was 11×11. Note
that the noise is removed in both TV zooms, but the model with adaptive weights shows
more texture in the fur and shirt. The SNR of the noisy image is 10.01, the TV zoom with
constant λ has higher SNR 15.02, and the adaptive TV zoom improves the SNR further to
15.48.
Because of the difficulty in estimating the noise variance in practice, the locally adaptive
TV method should only be used for images corrupted with a large amount of noise. We
should note that the results are not as sharp for zooming as they are for the simple denoising
problem (M = 1). It is difficult to track textures across scales, as discussed in Section 2.5.
One possible correction would be to incorporate the change in resolution into (3.14) to
reflect the change in scale. Almansa et al. suggested a locally adaptive TV zooming based
on Chambolle's TV denoising algorithm [1].
3.5.3 Soft Inpainting with Nearest Neighbor Information
The zipper artifacts and isolated pixels that arise in variational zooming are due to the
absence of a fidelity weight on the unknown pixels. In the inpainting region D, the regularization term encourages smoothness and minimum edge length, but the artifacts may
actually correspond to local minima under these priors. For the fidelity term to have an
effect in the domain D, a natural choice is to have an unknown pixel weakly correlated with
its nearest neighbor in the known region Ω \ D. Let x̄ denote the nearest neighbor in the
known region of a pixel x ∈ Ω:

x̄ = argmin_{y∈Ω\D} d(x, y).

Trivially, for a known pixel x ∈ Ω \ D we have x̄ = x. Inspired by [91], we propose a “soft”
inpainting model of the form

min_u E[u|u0] = R(u) + (λ/2) ∫_Ω P(x) (u(x) − u0(x̄))² dx.
Here P(x) is a weight function that determines how strongly a pixel correlates with its
nearest neighbor. We would generally expect 0 ≤ P(x) ≤ 1, with P(x) = 1 for known
pixels x ∈ Ω \ D and the value of P decaying as the distance d(x, x̄) grows. Note that the
standard inpainting model is a special case of the soft model with P(x) = 1_{Ω\D}(x), which can be
thought of as “hard” inpainting. One possible choice for a soft weight function is a negative
exponential

P(x) = exp( −d²(x, x̄)/σ² )

where σ is a sensitivity parameter. As σ → 0, the model becomes the traditional hard
inpainting. As σ → ∞, P → 1 identically and the fidelity term assigns equal weight to the
pixels in the known and unknown regions. As with the other model parameters, the value
of σ needs to be set carefully to balance the two extremes.
The soft inpainting model generalizes to the K nearest neighbors of a pixel. Let
{x̄i}_{1≤i≤K} ⊆ Ω \ D denote the K nearest neighbors of a pixel x ∈ Ω. Averaging over
the K nearest neighbors, the soft inpainting model becomes

min_u E[u|u0] = R(u) + (λ/2) ∫_Ω (1/K(x)) Σ_{i=1}^{K(x)} Pi(x) (u(x) − u0(x̄i))² dx,

where Pi(x) defines the correlation between pixel x and its ith nearest neighbor x̄i. A
natural choice is again the exponential function

Pi(x) = exp( −d²(x, x̄i)/σ² ).
Note that the number of nearest neighbors K(x) could be spatially varying. In particular,
we expect K(x) = 1 for known pixels x ∈ Ω \D.
While the soft inpainting model is not necessarily appropriate for general domains, it
appears well-suited for the zooming problem. In particular, it seems reasonable to use
K = 4 nearest neighbors for pixels in the interior of the inpainting domain and K = 2
neighbors for pixels in a row or column of a known pixel. With respect to a low-resolution
lattice Ω contained in the high-resolution lattice ΩM, we define K(x) for x = (x1, x2) ∈ ΩM
to be

K(x) = 1 if x ∈ Ω;  2 if x ∉ Ω and (x1, y) ∈ Ω or (y, x2) ∈ Ω for some y;  4 otherwise.    (3.15)
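For a zoom factor M, identifying Ω with the sublattice of positions divisible by M (our own convention for illustration), (3.15) can be tabulated directly:

```python
import numpy as np

def neighbor_count(shape, M):
    """K(x) from (3.15): 1 on the low-res lattice, 2 on its rows/columns, 4 elsewhere."""
    K = np.full(shape, 4, dtype=int)
    on_row = np.arange(shape[0]) % M == 0    # shares a row with a known pixel
    on_col = np.arange(shape[1]) % M == 0    # shares a column with a known pixel
    K[on_row, :] = 2
    K[:, on_col] = 2
    K[np.ix_(np.flatnonzero(on_row), np.flatnonzero(on_col))] = 1
    return K

K = neighbor_count((10, 10), M=5)
```

For M = 5 on a 10×10 patch, the four lattice points get K = 1, the rest of their rows and columns K = 2, and the interior pixels K = 4.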
Figure 3.7 shows the result of zooming by soft inpainting on a trivial binary image.
Bicubic interpolation completely blurs the edges and staircasing is visible along the diagonal
edge. The magnification factor M is large enough so that the standard Mumford-Shah
interpolant is a set of isolated white pixels. As expected, when σ is too small the result
coincides with standard hard inpainting. If σ is too large, the result is essentially an average
of the neighbors. For an appropriate choice of σ, the soft inpainting model produces a binary
image with well-defined edges. Note that for σ = 1 the diagonal edge is smooth and there
are no staircasing artifacts, although the corners are rounded off.
Figure 3.7: Effect of σ on Mumford-Shah soft inpainting with λ = 20, γ = 2000, M = 5.
Using the value of σ suggested by the last example, Figure 3.8 shows the zooming result
on a natural image. Hard Mumford-Shah inpainting produces isolated pixels and the legs
of the camera tripod almost completely disappear. The soft inpainting model using the
same variational parameters removes these image artifacts. The edges are smoother and
the image regions are more distinct than in the bicubic zoom.
Figure 3.8: Comparison of zooming using standard and soft Mumford-Shah inpainting with λ = 20, γ = 2000, σ = 1, M = 5.
3.5.4 Variational Zooming as Post-Processing
In the soft inpainting model, if we let K = 1 and σ → ∞ then the fidelity term is equivalent
to matching the image u to the interpolant under the nearest neighbor or duplication zoom.
Similarly, if we define K as in (3.15) and define the weight function to be polynomial in
the distance, the fidelity term could match u to a bilinear or bicubic zoom of the image.
This suggests that a special case of the soft inpainting model is equivalent to matching the
image u to some zoomed image v, rather than matching u to the original image u0 on the
low-resolution lattice. Suppose the image v : ΩM → ℝ is a zoomed version of u0 under
some standard interpolation filter, such as bicubic zooming. Then variational zooming can
be seen as a post-processing step on the zoomed image v:

min_u E[u|v] = R(u) + (λ/2) ∫_{ΩM} P(x) (u − v)² dx.
The weight function P(x) essentially quantifies the confidence that the pixel x was zoomed
correctly by the process that created v. For post-processing the bicubic zoom, Cha and
Kim developed a fourth-order PDE method using a weight function that is polynomial in
the Laplacian of the zoomed image [26]. Adapting this weight function for the variational
approach and normalizing 0 ≤ P(x) ≤ 1, we set

P(x) = Q(x) / max_{x∈ΩM} Q(x),  Q(x) = (Δv)⁴.    (3.16)
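The weight (3.16) is cheap to compute; a sketch using our own 5-point Laplacian discretization with replicate boundaries:

```python
import numpy as np

def laplacian_weight(v):
    """P(x) = Q(x)/max Q with Q = (Laplacian v)^4, as in (3.16)."""
    p = np.pad(v, 1, mode='edge')
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * v
    Q = lap**4
    return Q / Q.max()

# the weight concentrates at the corners of a bright square and
# vanishes in homogeneous regions
v = np.zeros((16, 16))
v[4:12, 4:12] = 1.0
P = laplacian_weight(v)
```

On this binary square, P peaks at the inside corners, takes a smaller value along straight edges, and is exactly zero in the flat interior, matching the mask behavior described below for Figure 3.9.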
The inpainting masks for the standard, soft, and post-processing models are illustrated
in Figure 3.9. Note that the first two masks are independent of the image data. The
post-processing weight function is largest at the corners of the zoomed image and small in
homogeneous regions and along smooth edges.
Figure 3.9: Different possible inpainting masks for a single image with magnification M = 5. Left to right: original image, standard inpainting mask, average of soft inpainting mask, Laplacian post-processing mask.
Figure 3.10 compares zooming under the standard inpainting and post-processing models. The bicubic zoom blurs the edges and staircasing artifacts are clearly visible. TV and
Mumford-Shah inpainting maintain clear smoothed edges, but zipper artifacts are visible in
the TV zoom and Mumford-Shah rounds the corners. When the bicubic zoom is post-processed
using the weight function (3.16), both the TV and Mumford-Shah models produce sharper
edges with fewer artifacts. In particular, the Mumford-Shah model seems well-suited to post-processing. The zipper artifacts are no longer present, the edges are more distinct, and the
staircasing artifacts are removed. Because the Mumford-Shah model minimizes edge length,
the boundary of the circle appears piecewise linear rather than a smooth curve. To correct
this result, Esedoglu and Shen suggested adding a curvature term to the Mumford-Shah
energy [46].
Figure 3.10: Comparison of standard variational zooming and post-processing methods with magnification M = 5.
Another key advantage of the post-processing model is that the magnification factor is
no longer restricted to integers M ≥ 1. Figure 3.11 shows the result of zooming a natural
image by a factor M = 2π. Compared to the bicubic zoom, Mumford-Shah post-processing
produces flatter regions bounded by sharper, smoother edges.
Figure 3.11: Zooming by magnification factor M = 2π using Mumford-Shah post-processing.
Chapter 4
Variational Super-resolution
4.1 Super-resolution of an Image Sequence
The goal of super-resolution (SR) is to produce a high-resolution image u : ΩM → ℝ from
a sequence of N low-resolution images {ui : Ωi → ℝ}_{1≤i≤N}. We call the array of points
(grid of pixels) from which the image is formed the lattice. Here Ωi denotes the lattice
of the ith low-resolution image and ΩM is the high-resolution lattice that is a factor M
times larger than the original lattice. The input images {ui}_{1≤i≤N} are generally images of the
same visual scene from slightly different perspectives, such as a panning camera filming a
stationary object. Huang and Tsai were the first to notice that sub-pixel motion in the
sequence and image aliasing gave the potential for the construction of higher resolution
images. The authors described two basic steps in the super-resolution process: image
registration and data fusion [60]. These processes are sometimes treated separately in the
literature, although recent papers have addressed the steps jointly [58].
The first and probably most difficult step of super-resolution is to properly align the
images to the same grid ΩM . Let ϕi : Ωi → ΩM denote the coordinate transformation
mapping each image ui to the high-resolution grid. Then for a pixel x ∈ Ωi,

ui(x) = Ki(u ∘ ϕi)(x) + ni(x)
where Ki is the linear blur operator and ni is additive noise for the ith image. If the
magnification M = 1, the transformation ϕi describes the registration between the images.
For M > 1, ϕi describes both the motion and downsampling processes for the ith image.
These transformations are generally restricted to the class of planar homographies. If the
two-dimensional point (x, y) is represented in homogeneous coordinates as x = (x, y, 1), a planar
homography H can be expressed as a 3×3 matrix:

x′ = αHx,  H ∈ M_{3×3},  α ≠ 0

where α is an arbitrary scaling factor. Because of the scaling, a planar homography has 8
degrees of freedom. Capel and Zisserman outlined three real-world situations in which the
planar homography assumption is appropriate [22].
1. The visual scene or object being viewed is planar and the camera motion is arbitrary.
2. The visual scene is three-dimensional but the camera motion is restricted to rotation
about the optic center and zooming.
3. The camera is at a sufficient distance from the visual scene that the parallax effects caused
by the three-dimensional nature of the scene are negligible.
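Working in homogeneous coordinates makes the scale invariance x′ = αHx concrete; a small sketch (the helper name is our own, and we use a pure translation as the test homography):

```python
import numpy as np

def apply_homography(H, pts):
    """Map 2D points through a 3x3 planar homography in homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    hom = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y) -> (x, y, 1)
    out = hom @ H.T                                  # x' = H x (up to scale)
    return out[:, :2] / out[:, 2:3]                  # divide out the scale alpha

# a pure translation by (3, -1) is a special case of a planar homography
H = np.array([[1.0, 0.0, 3.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])
mapped = apply_homography(H, [[0.0, 0.0], [2.0, 5.0]])
```

Replacing H by any nonzero multiple αH gives the same mapped points, which is why the homography has only 8 degrees of freedom.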
In this chapter, we will assume that the input image sequence satisfies one of the above
assumptions reasonably well. Computationally we will restrict the camera/scene motion
to translations in Section 4.2.2, although the method could be extended to general planar
homographies. There exist several methods for image registration under a translational
model, notably the method by Irani and Peleg [61]. However, for a magnification factor
M > 1 the registration needs to be precise to the sub-pixel level, often a very difficult if
not insurmountable task. In general, to increase the resolution by factor M the registration
needs to be accurate to 1/M pixels on the high-resolution grid ΩM . It is assumed that
the transformation ϕi maps to the discrete gridpoints of ΩM , so for a continuous warping
it may be necessary to round the position of pixel ϕi(x) to its nearest gridpoint on ΩM .
Alternatively, the gray value at the point x ∈ ΩM could be interpolated from the pixel
neighborhood in ui surrounding ϕi⁻¹(x). Capel and Zisserman note that in addition to the
geometric registration, it may be necessary to perform a photometric registration between
the images to correct for changes in illumination and camera parameters. We will assume
the photometric differences between the images are negligible. Once the images are aligned
to a common high-resolution lattice ΩM , we obtain an image-like data set on ΩM with
some pixels having known value, some unknown, and some pixels having multiple values
addressed to them (see Figure 4.1). If the desired image grid ΩM is not large enough to
contain all mapped pixels of image ui, we will restrict attention to pixels in ΩM ∩ ϕi (Ωi).
Figure 4.1: Illustration of image registration for super-resolution. The three images u1, u2, u3 are aligned to a common high-resolution lattice ΩM by the respective geometric transformations ϕ1, ϕ2, ϕ3.
Next, the registered images are fused into a single high-resolution image u. Note that
even if the transformations ϕi and blur operators Ki are known, the fusion problem is ill-posed due to noise. The simplest image fusion approach is to take the median through all
pixel values

u(x) = median{ ui(ϕi⁻¹(x)) : ϕi⁻¹(x) ∈ Ωi },  x ∈ ΩM.
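The median fusion can be sketched directly from this formula; the bucket-by-pixel data layout below is our own illustration of pixels receiving zero, one, or several values after registration:

```python
import numpy as np

def median_fusion(shape, registered):
    """Median through all values addressed to each high-resolution pixel.

    registered : list of ((i, j), value) pairs after mapping each low-res
                 image onto the high-resolution lattice; pixels receiving
                 no value are returned as NaN (to be inpainted later).
    """
    buckets = {}
    for (i, j), val in registered:
        buckets.setdefault((i, j), []).append(val)
    u = np.full(shape, np.nan)
    for (i, j), vals in buckets.items():
        u[i, j] = np.median(vals)
    return u

# three images assign values to pixel (0, 0); pixel (0, 1) gets only one
samples = [((0, 0), 0.2), ((0, 0), 0.9), ((0, 0), 0.3), ((0, 1), 0.5)]
u = median_fusion((2, 2), samples)
```

The median discards the outlying value 0.9 at pixel (0, 0), which is why this simple fusion is a common robustness benchmark.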
The median image is commonly used as the benchmark for super-resolution algorithms. A
better approach is Maximum Likelihood Estimation (MLE), as used by Irani and
Peleg [61]. Other researchers have developed Maximum A Posteriori (MAP) models that
incorporate image priors with desirable properties. For example, Schultz and Stevenson
proposed an image prior measuring image smoothness as a function of local second derivatives [89]. The image prior need not be explicit and could be learned from the data, as in
Baker and Kanade’s method designed specifically for super-resolution of human faces [8].
The variational approach proposed in the next section can be viewed as a type of MAP
estimation.
4.2 Super-resolution by Variational Inpainting
4.2.1 Data Fusion with Known Registration
The variational inpainting model for a single image u0 extends naturally to multiple images
{ui}_{1≤i≤N}. Instead of the fidelity term matching to one image, the final image should match
on average all images in the sequence in the least squares sense. For a magnification factor
M, known registration functions {ϕi}_{1≤i≤N}, and regularization term R(u), the variational
super-resolution model is

min_u E[u | {ui}, {ϕi}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{ΩM ∩ ϕi(Ωi)} (u − ui ∘ ϕi⁻¹)² dx.    (4.1)
For convenience, denote the registered image domain for the ith image in the limit of the
last integral by

Di := ΩM ∩ ϕi(Ωi).
Referring to Figure 4.1, the model will perform variational smoothing on known pixels and
inpainting in unknown regions. For pixels with multiple values, the fidelity term will drive
the image toward the mean value. However, the model is not equivalent to matching to
the mean image, as pixels with multiple consistent values will receive more weight in the
minimization. That is, a pixel with multiple assignments of the color black will be more
likely to be black in the final image u, compared to a pixel with only one assignment.
The computation for both TV [83] and Mumford-Shah [75] regularization is a simple
modification of the single-image minimization. For the TV regularization
$$R(u) = \int_{\Omega_M} |\nabla u|\, dx$$
the corresponding Euler-Lagrange equation is
$$-\nabla \cdot \left(\frac{\nabla u}{|\nabla u|}\right) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)\left(u - u_i \circ \varphi_i^{-1}\right) = 0$$
with Neumann boundary conditions. This equation can be solved by standard gradient
descent or level set methods. Another approach is to modify the digital TV filter of Chan,
Osher, and Shen described in Section 3.2.2 [33]. The digital TV energy is minimized by
iterating for n ≥ 1 the formulas
$$u^{(n+1)}(x) = \frac{\displaystyle\sum_{y \in N(x)} h^{(n)}(y)\, u^{(n)}(y) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)\, u_i \circ \varphi_i^{-1}(x)}{\displaystyle\sum_{y \in N(x)} h^{(n)}(y) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)}$$
$$h^{(n)}(y) = \frac{1}{|\nabla u^{(n)}(y)|}$$
where N(x) is the 4-connected neighborhood of pixel x. The finite differences used for
discretizing h(n) are the same as in Section 3.2.2. To avoid division by zero, a lifting
parameter a > 0 is introduced into the norm
$$\frac{1}{|\nabla u(y)|_a} = \frac{1}{\sqrt{a^2 + |\nabla u(y)|^2}}.$$
The digital TV filter computation is generally stable for a = O(10^{-4}) [33].
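One sweep of the digital TV filter iteration above can be sketched as follows (an illustrative NumPy sketch with forward differences and the lifted norm; note that np.roll wraps at the image border rather than enforcing the Neumann condition, so a careful implementation would pad instead):

```python
import numpy as np

def digital_tv_step(u, data, masks, lam, a=1e-4):
    """One digital TV filter sweep with multi-image fidelity.

    u     : current high-resolution image, shape (H, W)
    data  : warped low-resolution images u_i o phi_i^{-1}, shape (N, H, W)
    masks : indicator arrays 1_{D_i}, shape (N, H, W)
    """
    N = data.shape[0]
    # Lifted gradient norm |grad u|_a via forward differences.
    ux = np.diff(u, axis=1, append=u[:, -1:])
    uy = np.diff(u, axis=0, append=u[-1:, :])
    h = 1.0 / np.sqrt(a**2 + ux**2 + uy**2)

    # Sums of h(y)*u(y) and h(y) over the 4-connected neighborhood N(x).
    hu = h * u
    num = np.zeros_like(u)
    den = np.zeros_like(u)
    for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
        num += np.roll(hu, shift, axis=axis)
        den += np.roll(h, shift, axis=axis)

    # Fidelity contribution (lambda/N) * sum_i 1_{D_i}(x) u_i o phi_i^{-1}(x);
    # nan_to_num guards NaN placeholders outside D_i, which masks zero out.
    num += (lam / N) * np.sum(masks * np.nan_to_num(data), axis=0)
    den += (lam / N) * np.sum(masks, axis=0)
    return num / den
```

On a constant image that agrees with the data, the filter is a fixed point, which is a quick sanity check of the update.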
The Mumford-Shah regularization for an image u : ΩM → ℝ and edge set Γ is
$$R_{MS}(u, \Gamma) = \int_{\Omega_M \setminus \Gamma} |\nabla u|^2\, dx + \gamma H^1(\Gamma)$$
where H1 denotes the one-dimensional Hausdorff measure. The first term smooths the
image away from the edges and the second term minimizes the total edge length. Using the
Ambrosio-Tortorelli Γ-convergence approximation, let z : ΩM → [0, 1] denote the continuous
edge canyon function with z = 0 on the edge set Γ and z = 1 otherwise [4]. For a parameter
ε > 0, the Γ-convergence approximation to the Mumford-Shah regularization is
$$R_{MS}[u, z] = \int_{\Omega_M} z^2 |\nabla u|^2\, dx + \gamma \int_{\Omega_M} \left(\varepsilon |\nabla z|^2 + \frac{(1-z)^2}{4\varepsilon}\right) dx.$$
The associated Euler-Lagrange equations and boundary conditions are
$$-\nabla \cdot \left(z^2 \nabla u\right) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)\left(u - u_i \circ \varphi_i^{-1}\right) = 0$$
$$|\nabla u|^2 z + \gamma \left(-2\varepsilon \Delta z + \frac{z-1}{2\varepsilon}\right) = 0$$
$$\frac{\partial u}{\partial \vec{n}} = \frac{\partial z}{\partial \vec{n}} = 0.$$
These equations can be solved by an elliptic solver such as Gauss-Jacobi, alternating the
minimization of u and z. For inpainting problems, setting the parameter ε = 1 will generally
suffice [46].
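The alternating minimization of u and z can be illustrated with a Gauss-Jacobi sweep for the edge function z (a sketch assuming unit grid spacing, a 4-point Laplacian, and edge padding as a Neumann-style border; not the thesis implementation):

```python
import numpy as np

def jacobi_z_update(z, grad_u_sq, gamma, eps=1.0):
    """One Gauss-Jacobi sweep for the Ambrosio-Tortorelli edge function z.

    Solves |grad u|^2 z + gamma*(-2*eps*Lap(z) + (z-1)/(2*eps)) = 0
    pointwise, with Lap(z) ~ (sum of 4 neighbors) - 4 z.
    """
    grad_u_sq = np.asarray(grad_u_sq)
    # Neighbor sum with replicated borders (Neumann-like condition).
    zp = np.pad(z, 1, mode='edge')
    nbr = zp[:-2, 1:-1] + zp[2:, 1:-1] + zp[1:-1, :-2] + zp[1:-1, 2:]
    # Solving the discretized equation for z(x) with neighbors frozen:
    num = 2 * eps * gamma * nbr + gamma / (2 * eps)
    den = grad_u_sq + 8 * eps * gamma + gamma / (2 * eps)
    return num / den
```

Away from edges (|∇u|² = 0), z = 1 is a fixed point of this sweep, matching the convention z = 1 off the edge set.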
This model assumes knowledge of each registration function ϕi : Ωi → ΩM aligning
image ui to the high-resolution lattice. Because precise alignment is generally difficult to
obtain, much of the super-resolution literature obtains the registration from synthetic data
and focuses on the data fusion step. One strategy is to sample the sequence from a single
high-resolution image. Some researchers generate video of stationary objects with a precisely
calibrated slow-moving camera, obtaining the registration from the known camera motion.
A third data generation technique is to align high-resolution images, downsample the images
maintaining the registration parameters, and then work with the low-resolution images. For
a first numerical experiment, we employ this last strategy. For the two image sequences
shown in Figures 4.2 - 4.5, in each high-resolution frame in the sequence we manually selected
several control points, generally corners and other distinct image features. The images were
downsampled, tracking the downsampled control points as well. The registration functions
ϕ1≤i≤N are the affine transformations that best match the corresponding control points
between images in the least squares sense. This procedure is completely synthetic and, of
course, not reproducible in practice. The goal of this experiment is simply to establish that
the variational model is effective in fusing data when accurate image registration is known.
Figure 4.2: Super-resolution of 5-image sequence. Top left: original third image in sequence. Top right: 4x TV SR with λ = 20. Bottom left: 4x MS SR with λ = 20, γ = 2000. Bottom right: 4x MS SR with registration incorrect by 1/2 pixel on low-resolution lattice.
Figure 4.2 shows the result of super-resolution of a 5-image sequence of a sign with
Chinese characters. The images were aligned to the upsampled lattice of the third image,
which we will call the base frame. Both the TV and Mumford-Shah models produce an
image that is clearly higher resolution than could be obtained from a single image. The
TV image is slightly more blurred; the Mumford-Shah model generally produces sharper
edges with smoother regions away from the edges. This suggests the Mumford-Shah model
is more appropriate for images with sharp edges such as text. However, the TV model may be better at preserving texture in natural images. Because it generally provides sharper
images, most of the results in this chapter are obtained from the Mumford-Shah model.
Figure 4.2 also highlights the importance of precise image registration. To produce the
fourth image, the control points used for alignment were shifted in a random direction by 1/2 pixel on the low-resolution lattice. While a sub-pixel shift would not affect the result if the
resolution stayed the same, after magnification by a factor M = 4 the control points are off
by 2 pixels in the common high-resolution lattice ΩM . The resulting image is blurred and
the artifacts of the mis-alignment are clearly visible. In general, a magnification by a factor
M requires that the images be aligned to 1/M pixel accuracy on their original lattices.
For color RGB images, the simplest solution is to find the minimum energy image over
each color channel separately. Figure 4.3 shows the result of color super-resolution on a
5-frame color video sequence, using the third frame as the base. Note the super-resolution
result recovers features that are not present in the original image, such as the small gaps in
the last character. Processing the channels separately is somewhat naive because it assumes
the image information is uncorrelated across color channels. There has been some research
on redefining the TV energy for color images, notably the work by Sapiro and Ringach
[85]. Some researchers indicate that for super-resolution it is sufficient to enhance only the
luminance channel of the YIQ color space and use a simple interpolation filter for the two
chrominance channels [35]. Farsiu, Elad, and Milanfar developed an image prior specifically
for color super-resolution designed to force correlated edge location and orientation between
the color channels [49].
The super-resolution procedure extends naturally to video. Each frame of the video is
Figure 4.3: 4x color image zoom of 5-image sequence with known registration. Top row: nearest neighbor, bilinear, bicubic. Bottom row: staircased bicubic, median image, MS SR with λ = 20, γ = 2000.
repeatedly selected as the base frame, aligning all other frames to the upsampled lattice
of the base. Figure 4.4 shows video super-resolution of an 11-frame video sequence with
known registration. The text is not legible in any of the 11 frames, but becomes much
clearer after the super-resolution. The features of the woman’s face are also improved,
but the face appears somewhat unrealistic. Because it minimizes the edge length, the
Mumford-Shah model is well-suited for lines and text, but tends to oversmooth textured
regions. This suggests variational super-resolution is best suited for applications that do
not require photo-realistic images.
Figure 4.4: Super-resolution of 11-frame video sequence with known registration. Top row: 4 frames from original sequence. Bottom row: corresponding 4 frames from 4x MS SR with λ = 20, γ = 2000.
4.2.2 Simultaneous Registration and Fusion
Note that for a fixed image u, minimizing the general energy in (4.1) with respect to ϕi
requires just the unweighted fidelity term
$$\min_{\varphi_{1\le i\le N}} E[\varphi_{1\le i\le N} \mid u, u_{1\le i\le N}] = \int_{D_i} \left(u - u_i \circ \varphi_i^{-1}\right)^2 dx = \left| u - u_i \circ \varphi_i^{-1} \right|^2. \qquad (4.2)$$
The registration functions ϕi should be restricted to a suitable class of spatial transfor-
mations for which registration methods exist. For example, Irani and Peleg outline an
iterative refinement based on a truncated Taylor series for affine transformations consisting
of rotations, translations, and scalings [61]. We found that the iterative refinement method
minimizing the L2 norm (4.2) worked well on the low-resolution lattices, but the result was
not accurate enough on the high-resolution lattice ΩM to produce acceptable SR results.
That is, the registration was accurate at the pixel level but not at the sub-pixel level.
To refine the registration, we propose an alternating minimization model. Suppose
one of the images uB : ΩB → ℝ in the sequence is identified as the base frame and the
high-resolution lattice ΩM is generated by upsampling the lattice ΩB. Each low-resolution
image ui is aligned to the low-resolution image uB by a function τi : Ωi → ΩB. The aligned
images are then upsampled to the lattice ΩM . The minimum energy u is computed from
this registration, followed by minimizing over the registration functions for this image. The
process continues, alternately freezing and minimizing the image and registration functions,
until the registration functions are no longer updated.
Super-resolution by Alternating Minimization
Input: Original image sequence u_{1≤i≤N}, base frame u_B, update threshold δ > 0.
Output: Super-resolved image u.
Compute initial registration τ_{1≤i≤N} aligning images to base image u_B.
Upsample τ_{1≤i≤N} to create ϕ^{(0)}_{1≤i≤N}.
Repeat
    Fix ϕ^{(n)}_{1≤i≤N} and compute image u by minimizing energy (4.1).
    Fix u and compute functions ϕ^{(n+1)}_{1≤i≤N} that minimize (4.2).
until max_{1≤i≤N} |ϕ^{(n+1)}_i − ϕ^{(n)}_i| < δ.
Note that if the initial registration is accurate to the pixel level on the low-resolution
lattice, then this registration will be accurate within ⌊M/2⌋ pixels on the high-resolution lattice.
For rigid transformations, the update to the registration functions can be computed by
a local search of pixel mappings on ΩM . We implemented the method above using the
Figure 4.5: Super-resolution video sequence with known and unknown registration. Top: one frame from original 11-frame sequence. Center: 4x MS SR using ground-truth registration. Bottom: 4x MS SR with simultaneous translational registration.
Mumford-Shah model and restricting the transformations to simple translations
$$\varphi_i(x, y) = (x + a,\, y + b) \uparrow M,$$
where ↑M denotes upsampling by a factor M. We assume the upsampling includes rounding to the closest lattice point of ΩM, unless a more accurate gray value is interpolated from ui. The initial registration was computed by the Irani-Peleg method and the updates were computed by a local enumerative search over [a − ⌊M/2⌋, a + ⌊M/2⌋] × [b − ⌊M/2⌋, b + ⌊M/2⌋].
For most sequences, the process converged within two or three iterations and resulted in a
better image than using the initial registration. However, if the initial registration was not
accurate enough, the resulting image u was poor and the iterates became increasingly blurred. This is because the alternating minimization is driven toward a local minimum
close to the initialization which may not correspond to the global minimum over u and ϕi
jointly. The alternating minimization helps refine the registration and corresponding image,
but the initial registration still needs to be precise.
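The local enumerative search over translations can be sketched as follows (an illustrative sketch; the function name is ours, and np.roll is used in place of proper boundary handling, which is adequate only for small shifts):

```python
import numpy as np

def refine_translation(u, ui_up, a0, b0, M):
    """Enumerative search for the integer translation on Omega_M that best
    matches the upsampled frame ui_up to the current estimate u, within
    floor(M/2) pixels of the initial guess (a0, b0), minimizing the L2
    mismatch as in (4.2)."""
    r = M // 2
    best, best_err = (a0, b0), np.inf
    for a in range(a0 - r, a0 + r + 1):
        for b in range(b0 - r, b0 + r + 1):
            # np.roll wraps around the border; a sketch-level shortcut.
            shifted = np.roll(np.roll(ui_up, a, axis=1), b, axis=0)
            err = np.sum((u - shifted) ** 2)
            if err < best_err:
                best, best_err = (a, b), err
    return best
```

When the current estimate u is an exact shift of the frame, the search recovers that shift, which is the sub-pixel correction (on ΩM) that the alternating minimization relies on.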
Figure 4.5 compares the alternating minimization SR method to SR using known reg-
istration. Both SR images are clearly an improvement over the original image, but the
second image is less blurred than the third. However, the second image was produced syn-
thetically, using known registration parameters. The third image is based on only the input
video sequence and is reproducible in practice. The blur derives from convergence to a local
minimum as well as the possibility that translations are not sufficient for describing the
motion between frames.
4.3 Artifact Reduction by Soft Inpainting
Variational super-resolution offers several computational and practical advantages, includ-
ing:
• Reconstruction from limited data: The conventional wisdom for super-resolution is to use O(M²) images for a magnification factor M, the idea being that this is the number of images required to fill in all pixels in ΩM. Browsing the literature shows that using roughly 2M² frames is the most common practice [8, 49]. Experimentally the performance appears to level off at this limit [82]. The Mumford-Shah and TV regularization terms smooth the image, inpainting the unknown regions. The number of images required to produce an adequate result appears to be much less than 2M². The results presented in this chapter use 5-11 images for a magnification factor M = 4.
• Flexibility: Depending on the application and the input data, the parameters in the
variational model can be tuned to give desired output. For example, for text im-
ages the parameter γ can be increased to give straight lines with sharp edges. For
images with low SNR, the parameter λ can be decreased to increase the smooth-
ing, although this will result in more image blur. Additional image priors are easily
added to the energy, for example Shen and Esedoglu suggest adding a curvature term
to the Mumford-Shah inpainting model to encourage curved edges [46]. Of course,
the sensitivity to the parameters also means that fine-tuning the parameters can be
troublesome for input images where the noise, blur, and image type are unknown.
• Edge enhancement: The edge length term in the Mumford-Shah functional encourages
smooth well-defined edges, while the smoothing term enhances the edges by decreasing
local variation near the edges. To a lesser extent, the TV norm also enhances edges
because the minimization tends to yield piecewise constant regions, sometimes called
“blocky” images [42].
• Registration refinement: The alternating minimization method can iteratively refine
an imprecise initial registration. However, the minimization may converge to a local
minimum which produces an unacceptable blurred image. To avoid this, the initial
registration should be precise as possible. The alternating minimization can correct
the registration with sub-pixel shifts, but it cannot correct an initial registration incor-
rect at the pixel level or a geometric transformation that is inadequate for describing
the motion in the given sequence.
Variational SR can make image features clearer, but the resulting images tend not to be
photo-realistic and contain image artifacts. These artifacts derive from the variational SR
process, the underlying data, our assumptions on the data, and the inherent computational
limits of SR. Part of the problem derives from the binary decision that a pixel x ∈ Ωi
either counts in the final SR image or not, with no room for adjusting for local properties
or differences in the images. Inspired by [91], we refer to our data fusion formula (4.1) as
the “hard” inpainting model:
$$\min_u E[u \mid u_{1\le i\le N}, \varphi_{1\le i\le N}] = R(u) + \frac{\lambda}{2N} \sum_{i=1}^{N} \int_{\Omega_M} \mathbf{1}_{D_i}(x)\left(u - u_i \circ \varphi_i^{-1}\right)^2 dx.$$
We can relax the characteristic function 1Di(x) to a “soft” inpainting model:
$$\min_u E[u \mid u_{1\le i\le N}, \varphi_{1\le i\le N}] = R(u) + \frac{\lambda}{2N} \sum_{i=1}^{N} \int_{\Omega_M} P_i(x)\left(u - u_i \circ \varphi_i^{-1}\right)^2 dx$$
where Pi : ΩM → ℝ is a weight function, or sensitivity profile, that determines how much weight the gray value u_i ∘ ϕ_i^{-1}(x) exerts in the final image. The function Pi can be viewed
as a probability function and we generally assume 0 ≤ Pi(x) ≤ 1. Note the hard inpainting
model is a subset of the soft model. Below we discuss different image artifacts that arise in
the SR process and briefly suggest how the soft inpainting model can help correct for these
errors.
• Texture oversmoothing: Images produced by Mumford-Shah SR tend to consist of
smooth regions with sharp boundaries. Although the TV norm performs less smooth-
ing because the exponent on the gradient is smaller, textured regions will also be
smoothed in TV SR. The texture can be preserved by increasing the value of λ, but
this also emphasizes noise and misaligned pixels. One solution is to increase the weight
of the fidelity term in textured regions and decrease the weight in noisy regions. The
difficulty is that locally texture resembles noise. Gilboa et al. suggested a locally
adaptive fidelity term for the TV energy that reduces noise while preserving texture
[52]. Along similar lines, He and Kondi recently proposed a SR scheme with the
fidelity weight varying across the image frames proportional to the amount of noise
in the frame [59]. Combining these ideas, the fidelity weight can be locally adap-
tive within each low-resolution image frame. For example, similar to [52] the weight
function could be
$$P_i(x) \propto \frac{\sigma_i^2}{\sigma_{loc}^2(x)}, \qquad x \in D_i \qquad (4.3)$$
where σ_i² is the variance of the noise in image u_i and σ_loc²(x) is the local variance of the noise in a neighborhood around pixel x. The constant of proportionality needs to be
determined, although this constant could be absorbed into the parameter λ. Besides
the local variance, other local statistics such as entropy and geometric moments could
be used to differentiate texture from noise [55].
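A weight of the form (4.3) can be sketched with a sliding-window variance estimate (an illustrative sketch; the constant of proportionality is taken as 1, and the small floor on the local variance is our own guard against division by zero):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_variance_weight(ui, noise_var, win=5):
    """Weight P_i ~ sigma_i^2 / sigma_loc^2(x) as in (4.3).

    ui        : low-resolution image, shape (H, W)
    noise_var : estimated noise variance sigma_i^2 for this image
    win       : odd side length of the local neighborhood
    """
    p = win // 2
    # Edge-padded win x win windows around every pixel.
    windows = sliding_window_view(np.pad(ui, p, mode='edge'), (win, win))
    local_var = np.maximum(windows.var(axis=(-1, -2)), 1e-12)  # avoid /0
    return noise_var / local_var
```

In a flat region the local variance is tiny and the weight is large (strong fidelity), while in a high-variance region the weight drops, which is the intended noise-suppressing behavior.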
• Camera and motion blur: Suppose Ki is a blur operator that describes the camera
and motion blur in image ui. The blur can be incorporated into the variational model
as:
$$\min_u E[u \mid u_{1\le i\le N}, \varphi_{1\le i\le N}] = R(u) + \frac{\lambda}{2N} \sum_{i=1}^{N} \int_{\Omega_M} \mathbf{1}_{D_i}(x)\left(K_i u - u_i \circ \varphi_i^{-1}\right)^2 dx.$$
Estimating the blur operator, the so-called blind deconvolution problem, is an open
research problem. There have been some results in the variational framework, notably
the work by Chan and Wong for the TV energy that minimizes the TV of the blur
kernel. However, this method requires accurate pre-conditioners or else the algorithm
converges to blurred local minima [34].
• Isolated pixels: Each upsampled image ui on the high-resolution lattice consists of
isolated pixels and the inpainting model does not always connect these single pixels
to other pixels. These isolated pixels are visible in the shadows of the SR images in
Figure 4.5 and are very prominent in the inaccurate SR in Figure 4.6. Decreasing the
value of λ will increase image smoothness, while also increasing blurring. The locally
adaptive model (4.3) should treat such pixels similar to noise points and should help
remove these pixels, although this may not be a desirable result in some images.
Another possibility is to use interpolated images ui that completely fill the high-
resolution lattice ΩM . A simple interpolation filter such as bilinear zoom or single-
image variational zooming could be used. The weights Pi(x) could reflect the distance
from a known pixel:
$$P_i(x) = \exp\left(-\frac{d^2(x, \bar{x})}{\sigma^2}\right)$$
where x̄ denotes the nearest known pixel in the original data.
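This distance-based profile can be computed directly with a Euclidean distance transform (a sketch; `distance_weight` is our own name, built on SciPy's `distance_transform_edt`):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_weight(known_mask, sigma):
    """P_i(x) = exp(-d^2(x, xbar)/sigma^2), with xbar the nearest pixel
    of Omega_M carrying original (known) data from u_i.

    known_mask: boolean array, True where u_i provides a pixel value.
    """
    # distance_transform_edt gives, for each nonzero (unknown) pixel,
    # the Euclidean distance to the nearest zero (known) pixel.
    d = distance_transform_edt(~known_mask)
    return np.exp(-d**2 / sigma**2)
```

The weight is 1 exactly at the known pixels and decays with distance from them, so interpolated values far from any original datum exert little pull on the fidelity term.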
• Inaccurate registration: Imprecise image registration results in blur and misaligned
pixels resembling noise points. Also, the chosen class of geometric transformations
may not be adequate for describing the motion between frames. The iterative refine-
ment method proposed in the last section helps correct for registration errors. The
soft inpainting model can help reduce these errors by making it proportional to the
registration energy (4.2) with respect to the base frame uB:
$$P_i(x) \propto \int_{D_i} \left(u_B - u_i \circ \varphi_i^{-1}(x)\right)^2 dx.$$
This function makes the weight functional proportional to the average alignment mis-
match with the base image. Alternatively, we could replace uB with the last image
iterate u(n−1) computed in the alternating minimization.
• Dynamic visual scenes: Super-resolution is effective for a very limited number of nat-
ural image sequences, for the simple reason that the real world is not static. Certain
types of motion can be accounted for by the registration functions, such as an object
moving in a plane parallel to the plane of the camera motion. However, planar pro-
jective transformation cannot account for non-rigid motion, such as moving limbs and
changing facial expressions. One approach is to incorporate temporal information and
assume that video frames will more closely resemble the base frame when they are
closer in time. Assuming the images u1≤i≤N are given in temporal order, a natural
weight function is a Gaussian centered over a base frame uB in the sequence:
$$P_i(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(i-B)^2}{\sigma^2}\right).$$
The value of σ should be inversely proportional to the rate of scene change. That is,
a video of a fast-changing scene should have a very low σ, indicating that only the
single frame uB gives an accurate depiction of the current scene. The weight function
Pi can also be modified to address dramatic changes, such as a movie cut. Another
approach for addressing scene change is to detect local deviations from the base image
uB:
$$P_i(x) = \exp\left(-\frac{\sum_{y \in N_i(x)} \left(N_i(y) - N_B(y)\right)^2}{\sigma^2}\right) \qquad (4.4)$$
where N_i(x) is the pixel neighborhood in image u_i ∘ ϕ_i^{-1} around x. Here we assume the
image uB has been upsampled to fill the lattice ΩM using some interpolation method,
such as bilinear interpolation or variational zooming. This function will detect regions
in which ui does not match the base image uB and decreases the weight of the fidelity
term in this region. A region featuring a large amount of variation, e.g. the path
of a fast-moving object, will cause the SR model to default to single-image zooming.
To some extent, this weighting can also correct for registration errors and noise. The
disadvantage of this model is that it also limits the amount of new information that
can be introduced to the base frame uB.
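The temporal Gaussian profile is simple enough to state in code (a minimal sketch; frames indexed 1..N as in the text, and the 1/(σ√2π) normalization kept as written):

```python
import numpy as np

def temporal_weights(N, B, sigma):
    """Gaussian temporal weights P_i centered on the base frame u_B.

    N     : number of frames (indexed 1..N)
    B     : index of the base frame
    sigma : inversely proportional to the rate of scene change
    """
    i = np.arange(1, N + 1)
    return np.exp(-(i - B)**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))
```

A small σ concentrates nearly all fidelity weight on the base frame itself, matching the fast-changing-scene regime described above; a large σ weights all frames almost equally.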
• Parallax effects: The grand challenge of SR research is to account for the three-
dimensional nature of the real world [48]. If the camera is distant from the visual
scene, these effects will be small. The soft inpainting model (4.4) can minimize the
distortion caused by parallax effects, but an ideal model would make use of all the
information provided. Such a model would require 3D scene reconstruction from 2D
images coupled with a 3D inpainting model.
Figure 4.6: Artifact reduction on three frames of 7-frame video sequence. Top row: original video frames. Center row: 2x MS SR with λ = 5, γ = 2000. Bottom row: 2x MS SR with soft inpainting σ = 10.
The video sequence in Figure 4.6 features a moving person tracked by a moving camera
and contains MPEG compression artifacts, motion blur, and aliasing. Super-resolution of
this sequence will result in many of the artifacts mentioned above: oversmoothed texture
in the face, isolated pixels surrounding the head, parallax effects from the head turning,
independent motion from the moving hand, and registration errors from the translational
model being insufficient. The soft inpainting function (4.4) was introduced using bilinear
zoom of the base video frame in the comparison with NB(x). The value of σ should be
chosen to balance the introduction of information from other frames with the removal of SR
artifacts. We found that the value σ = 10 removed the super-resolution artifacts, however
the SR images are over-smoothed and appear “plastic.”
4.4 Applications
4.4.1 Video Enhancement
Super-resolution has numerous applications to enhancing video: surveillance, tracking and
recognition, converting between movie formats such as DVD and HDTV, etc. As noted
earlier, the variational SR method can produce an entire video sequence by repeatedly
registering the images to a base frame. Note that the registration only needs to be performed
once for all frames, which is important because the registration step generally involves more
computational effort than the fusion step. For the alternating minimization method, the
iterative registration refinements could also be calculated once for all images although it is
probably best not to do so to account for discrepancies between base frames.
One interesting application is enhancing traffic video for vehicle tracking and recognition.
Figure 4.7 shows one frame of a video sequence taken from a high stationary camera over an
intersection in Karlsruhe, Germany. Performing SR on the original video would accomplish
little, as the streets would be blurred by moving vehicles and the stationary buildings do
not exhibit sub-pixel shifts to permit enhanced resolution. On the other hand, tracking a
moving vehicle would give a good candidate for SR. The camera is far enough from the
scene that parallax effects are negligible as long as the vehicle does not change direction.
To test the parallax effects, one of the four vehicles selected was the white van turning the
corner. The four vehicles identified in Figure 4.7 were tracked manually for 11 consecutive
frames. The tracking was not very accurate, which should not affect the result as long as
each frame is large enough to contain the vehicle but small enough so that registration will
Figure 4.7: Frame from traffic video of intersection in Karlsruhe. The four highlighted cars were tracked for super-resolution enhancement.
align to the vehicle rather than other features such as the white lines in the street.
Each of the four vehicle sequences was enhanced by a factor M = 4 with the Mumford-
Shah alternating minimization method. The registration assumes a translational model,
which may not be entirely appropriate for vehicles moving towards or away from the camera.
The vehicle scale should be fairly consistent since the camera is very distant from the street
and the video sequences are very short. As Figure 4.8 shows, the SR enhances the vehicle
shape as well as features such as the windows and tires. However, the images appear
blurred with a horizontal jitter effect. Without knowing the technical specifications of
the video camera, we conjecture that the video was interlaced: the odd and even lines
were acquired separately and the vehicle changed position slightly during the acquisition
phase. To extend SR to de-interlacing, each frame is separated into two images consisting
Figure 4.8: Super-resolution of four 11-frame sections of video in Figure 4.7. Left to right: original base frame, 4x bicubic zoom, 4x MS SR with λ = 5 and γ = 2000, 4x MS SR with de-interlacing.
of alternating horizontal lines and this new set consisting of twice the number of frames is
super-resolved. To maintain the aspect ratio of the original frame, a blank row is inserted
on alternating lines for the inpainting mask. This has the same effect as increasing the
dimension vertically by a factor 2M . The de-interlaced SR images are much crisper and
appear more realistic than the original frames. In the second row of Figure 4.8, the white
van is partially occluded by a road sign in the base frame. The van’s rear tire is correctly
recovered by the SR images, a result that would be impossible with single image inpainting.
The other vehicles were also partially occluded in some frames by pedestrians, street lights,
and trees. This shows that SR is effective for disocclusion of objects, assuming the entire
object becomes visible over the course of the sequence.
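The de-interlacing step of separating each frame into its two fields, with blank rows as the inpainting mask, can be sketched as follows (an illustrative sketch; NaN is our convention for the blank rows handed to the inpainting model):

```python
import numpy as np

def deinterlace_split(frames):
    """Split each interlaced frame into its even-line and odd-line fields,
    re-inserting blank (NaN) rows so each field keeps the original frame
    height; the NaN rows form the inpainting mask for the SR model."""
    out = []
    for f in frames:
        for start in (0, 1):                 # even field, then odd field
            field = np.full_like(f, np.nan, dtype=float)
            field[start::2] = f[start::2]    # keep the acquired lines only
            out.append(field)
    return out
```

The output contains twice as many frames as the input, each half-filled, which is exactly the enlarged sequence that is then super-resolved.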
4.4.2 Barcode Image Processing
A linear barcode is a series of alternating black and white stripes encoding information
in the relative widths of the bars. The most common barcode scanners are laser scanners
that read a 1D signal from the barcode. Imaging scanners that obtain a full 2D image of
the barcode are also used to accommodate nonlinear barcodes that encode information in both the horizontal and vertical directions. However, this accommodation results in lower decoding performance on linear barcodes for the imaging scanners as compared to the laser
scanners. One method for decoding linear barcode images is to repeatedly acquire signals
called scanlines which are perpendicular to the bar orientation until a scanline is decoded.
This method was a natural choice for industry because this allows the software to use the
existing decoding routines used by the laser scanners. Unfortunately, the imaging scanners
cannot obtain signal resolution that is as high as the laser scanners; the current imaging
scanners have not yet reached the mega-pixel level. Figure 4.9 shows a barcode image that
is not decoded by the current state-of-the-art software, along with the many scanlines that were tested. This method is somewhat wasteful, since an entire 2D image is acquired and stored
in memory but only small 1D portions of the image are used for decoding. Our goal is to
outline a computationally efficient method that uses the entire image to prepare a 1D signal
that is sent to the decoding software.
If we think of each of the scanlines as a one pixel high image, SR can be used to create
a single high-resolution signal from the acquired scanlines. Suppose the original image
u0(x, y) is cropped to contain just the barcode region. This cropping can be accomplished
by automatic barcode detection methods based on local line statistics [5]. Exploiting the
unique geometry of barcodes, it seems natural to register the scanlines by tracing each pixel
along a scanline down the bars in u0 to the base of the image, which we will refer to as the
t-axis (see Figure 4.11). The resulting 1D signal u(t) will be called the projected signal,
as it consists of the projection of all scanlines onto a common axis. To understand how to
Figure 4.9: Tested scanlines on a barcode image.
project the scanlines, first note that the orientation of the bars depends on the position of
the barcode image in three-dimensional space.
Although the barcode itself is assumed to be planar, the image surface can exhibit three
types of rotations: roll, pitch, and yaw (see Figure 4.10). Image roll occurs within a plane
parallel to the imaging plane of the camera. The bars will remain parallel under image
roll. Image pitch occurs when the top or bottom of the barcode is moved towards or away
from the camera. Under image pitch, the bars are no longer parallel and instead should
converge to some focal or vanishing point. Note that in practice an acquired image will
almost surely exhibit roll and/or pitch. The presence of these rotations will affect how the
projection is done and we will show their presence is actually necessary for super-resolution.
The third image rotation, yaw, occurs when the left or right side of the barcode is pulled
from the camera. Yaw affects only the relative widths of the bars and hence should not
affect our projection. Decoding software corrects for yaw distortion using the “self-clocking”
feature built into linear barcodes – the information is encoded so that an edge occurs at set
intervals.
Figure 4.10: Three degrees of freedom in barcode rotation.
For the case of image roll, the bars will be parallel and the projection can proceed by
tracing each pixel in u0 along a vector parallel to the bar orientation. Suppose the roll angle
θ with respect to the y-axis is known. In practice, this angle is the first thing computed by
the imaging scanner software because the scanlines are oriented at this angle θ. The t-axis
will be perpendicular to the bar orientation and is given by y = x tan θ, where the origin is
given at the lower left corner of the image. Some trigonometry shows that the point u(t)
on the projected signal is obtained from the image pixel u0(x, y) by
$$t = x \sec\theta + (y - x\tan\theta)\sin\theta. \qquad (4.5)$$
An example of the projected signal u(t) for the roll case is shown in the third signal in
Figure 4.12. Note that if the roll angle θ = 0 or π/2, the projection will trace pixels to the
same position along the t-axis. This is called degenerate sampling and the resolution of the
Figure 4.11: Creating a projected signal u(t) from a barcode image u0(x, y). Left: projection with parallel bars (roll). Right: projection from focal point F for non-parallel bars (pitch).
signal will not be improved. Degenerate sampling will only occur when tan θ is a rational
number. In practice, the chances of obtaining such a roll angle are very small. Thus, image
distortion is actually essential for super-resolution.
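The roll projection can be sketched by evaluating (4.5) at every pixel and accumulating values into bins along the t-axis (an illustrative sketch; the binning scheme and averaging within a bin are our own simplifications):

```python
import numpy as np

def project_roll(u0, theta, n_bins):
    """Project a barcode image distorted by roll onto the t-axis using
    (4.5), averaging all pixels that land in the same bin."""
    H, W = u0.shape
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    # t = x sec(theta) + (y - x tan(theta)) sin(theta), as in (4.5).
    t = x / np.cos(theta) + (y - x * np.tan(theta)) * np.sin(theta)
    # Map t values to n_bins equally spaced intervals along the t-axis.
    tn = (t - t.min()) / (t.max() - t.min() + 1e-9)
    bins = np.clip((tn * n_bins).astype(int), 0, n_bins - 1)
    sums = np.bincount(bins.ravel(), weights=u0.ravel(), minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)
```

With θ = 0 every column collapses onto a single bin, reproducing the original column pattern with no resolution gain, which illustrates the degenerate sampling discussed above.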
For image pitch, the bars will not be parallel and instead should converge to some focal
point F = (xF , yF ). Each image pixel u0(x, y) should be traced along a vector connecting
F and (x, y) to the t-axis at the barcode base. Suppose the base points indicating the left
and right lower corners of the image are P1 = (x1, y1) and P2 = (x2, y2), respectively (see
Figure 4.11). By similar triangles, the pixel u0(x, y) is projected to u(t) by
t = x1 + d (x2 − x1),
d = [(x − xF)(y1 − yF) − (y − yF)(x1 − xF)] / [(x − xF)(y1 − y2) − (y − yF)(x1 − x2)].
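A direct transcription of this projection, as a sketch; the focal point F and base corners P1, P2 are assumed known, and the names are illustrative:

```python
def project_pitch(x, y, F, P1, P2):
    """Trace pixel (x, y) along the line through the focal point F down to
    the barcode base P1-P2, returning its position t along the base."""
    xF, yF = F
    x1, y1 = P1
    x2, y2 = P2
    # similar-triangles ratio d from the displayed formula
    d = ((x - xF) * (y1 - yF) - (y - yF) * (x1 - xF)) / \
        ((x - xF) * (y1 - y2) - (y - yF) * (x1 - x2))
    return x1 + d * (x2 - x1)
```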
To calculate the position of the focal point F , the bar edges are traced to the point that
best matches the intersection in the least squares sense. This calculation is given by the
following simple theorem.
Theorem 4.1 Given a set of n lines Li : y = mi x + bi, 1 ≤ i ≤ n, let the point F = (xF, yF) be

F = argmin_{F ∈ ℝ²} Σ_{i=1}^n d²(F, Li)

for the Euclidean distance d. Then F is given by

xF = (CD − BE) / (AD − B²),   yF = (AE − BC) / (AD − B²)

where

A = Σ_{i=1}^n mi² / (mi² + 1),   B = Σ_{i=1}^n −mi / (mi² + 1),
C = Σ_{i=1}^n −mi bi / (mi² + 1),   D = Σ_{i=1}^n 1 / (mi² + 1),
E = Σ_{i=1}^n bi / (mi² + 1).
The proof follows immediately by setting the first derivative of Σ_{i=1}^n d²(F, Li) to zero and solving
for the coordinates of F. Computationally, the coordinates of the focal point are surprisingly
small, generally on the order of 10³ for barcodes with slight natural pitch angles. The computational
difficulty comes in accurately tracing the lines Li. We found a good strategy was to only
count distinct lines consisting of very low (black) or very high (white) pixel values. Note
that the procedure outlined for non-parallel bars handles images rotated by both pitch and
roll. The bars will be parallel if the image is distorted by roll only.
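Theorem 4.1 translates directly into a few lines of code. A sketch, assuming the traced lines are given by slope-intercept pairs (mi, bi):

```python
import numpy as np

def focal_point(ms, bs):
    """Least-squares intersection of the lines y = m_i x + b_i (Theorem 4.1)."""
    ms, bs = np.asarray(ms, float), np.asarray(bs, float)
    w = ms**2 + 1.0
    A = np.sum(ms**2 / w)
    B = np.sum(-ms / w)
    C = np.sum(-ms * bs / w)
    D = np.sum(1.0 / w)
    E = np.sum(bs / w)
    det = A * D - B**2
    return (C * D - B * E) / det, (A * E - B * C) / det
```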
If the bars in the image are parallel, the projected signal u(t) is calculated by estimat-
ing the roll angle θ and using (4.5). Otherwise, the non-parallel projection is given by
tracing lines to the focal point F and using (4.4.2). The resulting projected signal u(t) is
generally very noisy due to discretization, camera blur, hand jitter, electrical noise, defects
in the barcode paper, and inaccurate estimation of the parameters used in the projection
(θ, P1, P2, F ). Also, the projection points u(t) are non-uniformly spaced along the t-axis.
Our solution is to divide the t-axis into N equally spaced intervals. Just as in the image SR
case, each interval could contain 0, 1, or more pixel values. The signal is smoothed using the
1D version of the variational minimization for multiple images with known registration (4.1),
where ΩM is simply the discretized t-axis. We opted for the TV regularization, because the
TV norm has been shown to be effective in denoising 1D barcode signals [45, 100].
Figure 4.12 shows a barcode image that was not decoded by the current software. The
first signal is the ideal signal for this barcode. The second signal is a scanline taken from the
center of the image. The scanline has length 150, corresponding to the pixel width of the
barcode. Since the bars were detected to be parallel, the projection was calculated using the
procedure for image roll. The projected signal is very noisy and consists of several thousand
non-uniformly spaced values. The projected signal is divided into N = 500 equally spaced
intervals and smoothed with the digital TV filter super-resolution. The final SR signal
appears smoother and higher resolution than the scanline signal, but the key fact is that
the SR signal is decoded by the decoding software. This example shows that variational SR
can be used to decode barcodes that were previously not decodable.
We found that the projection method for non-parallel bars was less reliable than for the
parallel case, because the pitch distortion requires accurate tracing of the bars to the focal
point. In applications, reliable line tracing algorithms like the Hough transform are too
expensive computationally. Figure 4.13 shows a barcode image with a severe pitch angle.
We used a simple tracing technique that follows paths of the darkest pixels in the image,
indicated by the red dots in the figure. The resulting projected signal was higher quality
than the scanline through the center, notably at positions 150-200.
As an experiment, we obtained a particularly troublesome database consisting of 71
misdecoded barcode images. That is, the software was able to decode the barcode but the
result was not the correct encoded information. Misdecodes are very rare, since most bar-
code symbologies contain error detection features such as a check-sum digit. The projected
signal was calculated for each image, using the parameters and barcode region provided by
the decoding software. Because the scanner typically has very low computational power
and the algorithm has to run very quickly, we were unable to implement TV minimization
on an actual scanner. Instead, we took the mean of all gray values u(t) in each interval
Figure 4.12: Super-resolution of a Code 128A barcode image with roll only. Top to bottom: original image and final TV SR image, ideal signal, single scanline, projected signal, TV SR signal with λ = 10.
Figure 4.13: Super-resolution of UPC barcode with severe pitch angle. Top: original image with traced bars indicated by dots. Bottom: Scanline signal in red superimposed on TV projected signal in blue.
along the t-axis. Note that as λ→∞, the TV minimization is equivalent to this averaging
process. The mean filtered projected signals were then sent through the signal decoding
software. Of the 71 misdecoded images, 28 (39%) were decoded properly and the remaining
43 were detected as no-decodes. Although decoding 28 of the images is certainly a success,
the important result is that none of the images were misdecoded. This indicates the SR
method may be useful for decoding images that were previously not decodable and also for
checking for misdecodes.
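The interval-averaging step used in this experiment (the λ → ∞ limit of the TV minimization) can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def interval_means(t, u, N):
    """Average the projected samples u(t) over N equal intervals of the
    t-axis; empty intervals are returned as NaN."""
    t, u = np.asarray(t, float), np.asarray(u, float)
    edges = np.linspace(t.min(), t.max(), N + 1)
    # assign each sample to one of the N intervals
    idx = np.clip(np.searchsorted(edges, t, side='right') - 1, 0, N - 1)
    sums = np.bincount(idx, weights=u, minlength=N)
    counts = np.bincount(idx, minlength=N)
    out = np.full(N, np.nan)
    out[counts > 0] = sums[counts > 0] / counts[counts > 0]
    return out
```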
4.4.3 Reconstruction from MRI Sensor Data
Magnetic Resonance Imaging (MRI) is an increasingly important tool for detection and
diagnosis of medical conditions. In a phased-array MRI apparatus, N independent receiver
elements (coils) are placed around the subject, generally at equally spaced intervals along
a circle or ellipse. In the presence of a strong magnetic field, atoms with magnetic dipoles
align parallel to the magnetic field. All atoms in the sample are excited by a brief burst
of non-ionizing radiation. For the atoms precessing at the frequency of the excitation, this
results in the atoms being displaced from a state of equilibrium. A coil sensor measures
the “relaxation time” of the atoms, indicating the time it takes for the nucleus to return
to its equilibrium energy state. The biological relaxation time is dependent on how the
atoms are bound to the molecules and can be used to differentiate tissue types
[6]. With the aid of spatial encoding, the local differences in relaxation can be used to
generate images representing both concentration and biochemical properties. For standard
anatomical imaging, simple grayscale images are produced that illustrate differences in proton
density. From each of the N sensors, a grayscale image ui : Ωi → ℝ can be constructed.
The processing ensures that the N independent images are spatially aligned at the pixel
level. For an underlying (real) image of the subject u : Ω → ℝ, the image ui is theoretically
derived from u by multiplication by a sensitivity profile Pi : Ωi → ℝ with additive Gaussian
noise ni:
ui(x) = Pi(x)u(x) + ni(x). (4.6)
The profile Pi(x) is the transverse component of the magnetic field from the receiver element
and reflects the sensitivity or confidence of the ith sensor at pixel x. An example of an actual
sensor image is shown in Figure 4.14, with zoomed images of a region close to the sensor and
another distant. Note that close to the sensor, the image is well-defined with strong edges
and good contrast. As we move away from the sensor the image grows darker, indicating
the sensitivity profile decays to zero and only the noise remains.
The standard approach for combining the N sensor images into one MR image v is to
take the L2-norm through the images:
v(x) = √( Σ_{i=1}^N [ui(x)]² ).
Figure 4.14: An image from an MRI sensor and contrast-adjusted zoom of two regions.
Near a sensor, the L2-norm is close to the maximum gray value corresponding to the value
from that sensor image. In the center of the image, the L2-norm is roughly the mean of all
sensor images at that position. Larsson et al. showed that among all known reconstruction
techniques without knowledge of the sensitivity profiles, the L2-norm produces images with
the highest SNR [67].
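The root-sum-of-squares combination is one line in practice; a minimal sketch:

```python
import numpy as np

def sos_combine(sensor_images):
    """Root-sum-of-squares (L2-norm) combination of N aligned sensor images."""
    stack = np.stack([np.asarray(ui, float) for ui in sensor_images])
    return np.sqrt(np.sum(stack**2, axis=0))
```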
The soft inpainting model presented in Section 4.3 suggests that the reconstruction of u
from the sensor images u1≤i≤N could be accomplished by variational SR. For a magnification
M ≥ 1 and known sensitivity profiles in (4.6), the model is
min_u E[u | u_{1≤i≤N}, ϕ_{1≤i≤N}, P_{1≤i≤N}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{Di} ( Pi u − (ui ∘ ϕi^{−1}) )² dx.
However, this model requires knowledge of the sensitivity profile. Unfortunately, it is cur-
rently not possible to measure the sensitivity profiles, since they are dependent on the
sample. The intensity of ui(x) is proportional to a negative exponential of the relaxation
time. The relaxation time is, in turn, proportional to the strength of the magnetic field
exerted by the ith coil. Since magnetic force decays with the distance squared from the
source, we propose the sensitivity profile
Pi(x) = exp(−d²(x, si)/σ²)
where si is the position of the ith sensor and σ is a parameter indicating the rate of decay.
Note that P → 1 as the position approaches the sensor and P → 0 as the pixels grow more
distant. The sensor positions s1≤i≤N could be measured directly on the MRI apparatus.
We can also try to interpolate the sensor positions by tracing backwards from the L2-norm
image v to the sensor images ui. Matching Piv and ui in the least squares sense gives the
sensor positions and sensitivity parameter σ by
min_{si, σ} Σ_{i=1}^N ( exp(−d²(x, si)/σ²) v(x) − ui(x) )².   (4.7)
Assuming the sensors are placed evenly in a circle around the image center, we can write
the sensor position in polar coordinates as si = (r, θ + 2π(i−1)/N). Then the minimization
(4.7) need only find three parameters: r, θ, and σ. Since the functional is differentiable, the
minimization can be performed by gradient-based techniques. Figure 4.15 shows the sensor
positions found by backtracking from an L2-norm brain image for a system with N = 16
sensors.
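A sketch of this three-parameter fit using a gradient-based optimizer from SciPy; the discretization, the choice of image center, and all names here are our own illustration, not the thesis code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sensor_geometry(v, sensor_images, p0):
    """Fit (r, theta, sigma) in (4.7), assuming N sensors evenly spaced on
    a circle about the image center; p0 is an initial guess."""
    N = len(sensor_images)
    h, w = v.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]

    def objective(p):
        r, theta, sigma = p
        err = 0.0
        for i, ui in enumerate(sensor_images):
            ang = theta + 2 * np.pi * i / N
            sx, sy = cx + r * np.cos(ang), cy + r * np.sin(ang)
            d2 = (xs - sx) ** 2 + (ys - sy) ** 2
            # residual of matching P_i * v against u_i, as in (4.7)
            err += np.sum((np.exp(-d2 / sigma ** 2) * v - ui) ** 2)
        return err

    return minimize(objective, p0, method='BFGS').x
```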
Inserting the sensitivity profile and sensor positions given by (4.7), the variational SR
model is

min_u E[u | u_{1≤i≤N}, ϕ_{1≤i≤N}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{Di} ( exp(−d²(x, si)/σ²) u(x) − (ui ∘ ϕi^{−1})(x) )² dx.   (4.8)
Since the sensor images are aligned at the pixel level, for M = 1 we immediately have
Ωi = ΩM and ϕi = Id. For M > 1, the sensor images may have relative sub-pixel shifts
that need to be determined.

Figure 4.15: Positions of 16 MRI sensors found by tracing backwards from the L2-norm image, shown in center.

Note that to use the computational methods outlined in Section 4.2.1, we rewrite (4.8) as

min_u E[u | u_{1≤i≤N}, ϕ_{1≤i≤N}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{Di} exp(−2d²(x, si)/σ²) ( u(x) − exp(d²(x, si)/σ²) (ui ∘ ϕi^{−1})(x) )² dx.
This version of the minimization more closely resembles the soft inpainting model presented
in Section 4.3.
Figure 4.17 shows the result of Mumford-Shah SR using the sensor positions shown in
Figure 4.15. The resolution is not changed between the images (M = 1), because we want
to show a reconstruction rather than an interpolation. Hence there is no need for image
registration. This highlights the difference between effective and actual resolution. Even
though the SR image contains the same number of pixels as one of the 16 sensor images,
the SR image is clearly higher resolution. The four images in Figure 4.17 are displayed
from the minimum to maximum gray value, so all 4 images are equally bright at the sides
closest to the sensors. The L2-norm image is bright around the edges but very dark in
the center, a well-known problem in MR image processing. Note that the SR images are
considerably brighter in the center, making the raw image much easier to examine as a
whole. As λ is decreased, the image becomes brighter. This is partly because decreasing
λ results in greater smoothing and removes outliers that would affect the image display.
Figure 4.16 shows zooms of the central area of the brain of the L2-norm and SR image,
with the contrast enhanced in the L2-norm image to match that in the SR image. The SR
image clearly contains less noise, but the edges and shapes are still distinct.
Figure 4.16: Zoom of central area of brain. Left: L2-norm image with enhanced contrast. Right: MS SR with λ = 100, γ = 2000.
Compared to the standard L2-norm reconstruction, the variational SR reconstruction
from sensor data offers three main advantages. First, the contrast is enhanced to make
the central portion of the image more visible. Of course, a spatial contrast equalization
scheme could be used to achieve the same effect. But medical diagnosis is based not only
on shape, but also texture and color. The SR reconstruction enhances the contrast in a
physically meaningful way, so the intensities reflect biological tissue not just mathematical
normalization. Second, the Mumford-Shah functional smooths the image, removing noise
while preserving edges and fine image structures. This makes the medical image potentially
better suited for diagnosis, as well as further image processing such as segmentation and
automatic quantitative shape analysis. Third, the SR image values are on the same intensity
scale as the original sensor images, unlike the L2-norm gray values. This could potentially
make it easier to correlate values directly with signal strength and identify the type of tissue.
Figure 4.17: Mumford-Shah fusion of 16 MR sensor images. Top left: a sensor image. Top right: L2-norm image. Bottom left: MS SR with λ = 100, γ = 2000. Bottom right: MS SR with λ = 10, γ = 2000. All four images have the same dimensions.
Chapter 5
Quantized Zooming
5.1 Introduction
5.1.1 Quantized Image Processing and the Quantized TV Energy
Up to this point, we have considered images u : Ω → ℝ that take on a continuum of intensity
levels. Realistically, a digital image should map to a finite set of gray values. The range of
the image is only relaxed to the entire real line for computational and theoretical purposes.
An 8-bit image, the standard for JPEG, maps to integer values between 0 and 255. In
certain situations we may wish to restrict the range of possible values even further, such as
when processing a text image that should theoretically be binary-valued. For segmentation
and recognition tasks, cutting down the number of gray levels is a common image processing
trick for better defining the components of an image. Quantized image processing refers to
any transformation on a grayscale real-valued image u0 : Ω → ℝ that produces a digital
image u : Ω → I, where I is a given set of discrete gray values.
To adapt the variational approach to the quantized case, define the quantized TV energy
minimization as:
min_{u ∈ {I1,...,IL}} ETV[u | u0] = ∫_Ω |∇u| dx + (λ/2) ∫_Ω (u − u0)² dx   (5.1)
where I1, . . . , IL represent the L fixed intensity levels. For the moment, we will consider
the intensity levels to be specified a priori. In Section 5.5.1, we will discuss the problem
of determining the intensities I1, . . . , IL. For small values of L, the TV energy should
closely resemble the Mumford-Shah energy. Note that for a binary 0-1 image, the TV and
Mumford-Shah energies coincide up to choice of parameters.
In this chapter, we will discuss how to minimize the quantized TV energy using the graph
cut approach developed by Boykov et al. [17]. This method gives an exact minimization of
the quantized TV energy and can be computed in low-order polynomial time. In Sections 5.3
and 5.4, we will show how to apply this model to various low-level vision tasks, including
novel applications to inpainting, zooming, and deconvolution. Extensions of the model
to other energies will be presented in Section 5.5. We conclude this thesis by presenting
applications to the enhancement of text images, barcodes, and MR brain images.
5.1.2 Previous Work on Quantized Image Processing
There have been several PDE/variational approaches to quantized image processing, mainly
focusing on binary denoising and segmentation. The most common strategy has been to
drive the image toward the gray values 0 and 1 by introducing a “double-well” penalty term
u2(u − 1)2 into the energy. Esedoglu proposed adding this term to the 1D TV energy for
partially blind deconvolution of barcode signals [45]. Nikolova introduced a binary denoising
method that uses the anisotropic TV norm to encourage 0-1 values [78]. Modifying the
Mumford-Shah Γ-convergence model, Shen introduced the double-well function on the edge
set z, effectively restricting the image to two levels [90]. Lie et al. offered a level set
implementation of the binary Mumford-Shah model [68]. Chan, Esedoglu, and Nikolova
multiplied the fidelity term of the Mumford-Shah model by the double-well function to
drive the image to 0-1 [29]. Most recently Bertozzi, Esedoglu, and Gillette proposed a binary
inpainting model based on the Cahn-Hilliard equation, a fourth-order PDE containing the
double-well function in its time evolution. This last model is particularly interesting because
it can complete isophotes across large inpainting domains [14].
While each of the variational models above has its strengths, introducing a penalty term
drives the image towards specific binary values but the resulting image is still grayscale.
Of course, the result could be thresholded to a pure binary image, but this could introduce
rounding errors and the thresholded result may no longer correspond to a minimum energy
image. The goal is to incorporate the quantization and image enhancement steps into one
procedure. This suggests working in a combinatorial optimization structure rather than
finding a minimum energy image by the calculus of variations.
Incorporating an energy similar to TV into a network flow model, Boykov, Veksler, and
Zabih showed the energy can be minimized by computing a minimum graph cut [17]. The
key feature of this method is that the minimization is exact, both in the fact that the
minimum is global and that the resulting image takes on only the specified intensity values.
The graph cut strategy has been applied to many problems in image processing including
segmentation, object recognition, disocclusion, and multi-camera scene reconstruction. We
refer the reader to [16] for an overview. Darbon and Sigelle adapted the graph cut method
to the TV energy and proposed a fast solution to the L-level problem that repeatedly finds
binary cuts [39]. Chambolle studied the binary TV model in the context of binary Markov
Random Fields and outlined a fast implementation of the multi-level model [27]. We will
discuss the graph cut method in the next section, but first we need to establish some basic
graph theory definitions and algorithms.
5.2 Quantized TV Minimization by Graph Cuts
5.2.1 Network Flows: Definitions
In this section, we establish the basic graph theory definitions necessary for describing the
quantized TV model. Most of the concepts and theorems below were first stated in the
1956 book by Ford and Fulkerson [50]. For a review, we refer the reader to the introduction
to graph theory textbook by West [99]. We use the standard graph theory notation of
describing a graph G by a vertex set V and an edge set E. Denote a directed edge from
vertex u to vertex v by uv.
Definition 5.1 (Flow Networks) A two-terminal flow network is a connected directed
graph G = (V,E) equipped with a nonnegative edge weight function c : E → ℝ≥0 indicating
the capacity of each edge. A vertex s ∈ V is identified as the source and another vertex
t ∈ V as the sink. There should be no edges entering the source or leaving the sink:
c(vs) = c(tv) = 0 ∀ v ∈ V .
For notational convenience, we assume that missing edges have zero capacity: c(uv) =
0 ∀ uv /∈ E. Ford and Fulkerson first described flow networks by analogy to a system
of pipes with water flowing from a source to the sink. The term “capacity” refers to the
amount of water that can pass through each pipe. Beyond plumbing design, flow networks
have found applications to transportation networks and assignment problems. We can now
formalize our definition of the minimum cut of a network.
Definition 5.2 (Minimum Cut) A cut [S, T ] is a partition of the vertex set V such that
S ∪ T = V, S ∩ T = ∅, s ∈ S, t ∈ T . The value of a cut [S, T ] is the total capacity of edges
between S and T :
val([S, T]) = Σ_{u∈S, v∈T} [c(uv) + c(vu)].
The minimum cut of a network has the minimum value among all possible cuts.
Note that the minimum cut for a given network is not guaranteed to be unique. The
Min-Cut Problem is to find the minimum cut of a given flow network. This is a classical
combinatorial optimization problem that can be solved in low-order polynomial time. In
the plumbing analogy, if all pipes in a network had the same capacity, the Min-Cut Problem
would remove as few pipes as possible to sever the source from the sink. Most approaches to
this problem actually solve the related problem of finding the maximum flow of the network,
defined below.
Definition 5.3 (Maximum Flow) A nonnegative edge weight function f : E → ℝ≥0 is
called a feasible flow if f satisfies the two properties:
1. Feasibility: f(uv) ≤ c(uv) ∀ uv ∈ E
2. Conservation of Flow: Σ_{v∈V} f(uv) = Σ_{v∈V} f(vu) ∀ u ∈ V \ {s, t}.

The value of a flow is equal to the net flow into the sink:

val(f) = Σ_{v∈V} f(vt).
The maximum flow is a feasible flow with the maximum value.
The Max-Flow Problem is to find the maximum flow of a given flow network. By analogy,
the goal is to determine the total amount of water that could reach the sink. It turns out
that this amount of water equals the amount that would spill out if the pipes were severed
by a minimum cut. Ford and Fulkerson proved that the Max-Flow and Min-Cut problems
are equivalent, as stated in the Min-Cut Max-Flow Theorem below.
Theorem 5.1 (Ford-Fulkerson, 1956) In every network, the value of a maximum flow
equals the value of a minimum cut.
There is a stronger version of this theorem that says the minimum cut can be recovered
from a maximum flow. This procedure is called the Ford-Fulkerson labeling algorithm,
described in the next section.
5.2.2 Network Flows: Algorithms
The first and simplest approach to finding the maximum flow was proposed by Ford and
Fulkerson. It relies on repeatedly finding available paths to increase the flow. These available
flow paths are called augmenting or special paths.
Definition 5.4 (Augmenting Path) An edge uv is said to be saturated if f(uv) = c(uv)
and unsaturated if f(uv) < c(uv). An augmenting path for a flow f is a path P from
s to t consisting entirely of unsaturated edges: f(uv) < c(uv) ∀ uv ∈ P .
It is clear from the definition that a flow is at maximum if and only if there is no
augmenting path for the flow. The idea behind the Ford-Fulkerson algorithm is to add
augmenting paths to the flow until no paths remain.
Ford-Fulkerson Algorithm
Input: Network G = (V,E) with edge capacity function c.
Output: Maximum flow f .
For each uv ∈ E, initialize f(uv) = f(vu) = 0.
while there exists an augmenting path P from s to t
Set df = min{c(uv) − f(uv) : uv ∈ P}.
For each uv ∈ P , set f(uv) = f(uv) + df and f(vu) = −f(uv).
The order in which the augmenting paths are found is flexible and could affect the flow
found if the maximum is non-unique. The search procedure can also drastically affect the
running time. For a poorly chosen search the worst case performance is O(E · val(f)), which
can be unreasonably large even for trivial networks (see p. 596 of [36]). Using a breadth-first
search gives preference to finding the shortest paths in terms of the total number of path
edges. Edmonds and Karp proved that this implementation runs in O(V E²) time [43].
An alternate algorithm called the preflow-push method was developed by Goldberg and
Tarjan [54]. The idea is to assign each vertex a “height” that determines how quickly the flow
streams downhill from each junction. The algorithm has a running time of O(V E log(V²/E)).
A review and comparison of maximum flow algorithms for image processing can be found
in [16].
Boykov, Veksler, and Zabih developed an approximation method for finding near-optimal
solutions to the min-cut problem [17]. Theoretically the α-expansion algorithm finds the
minimum cut [S, T] in O(V² E · val([S, T])) time, although the authors claim that in practice
the running time is O(V). But to achieve this faster runtime, the solution is no longer
guaranteed to be exact. For image processing problems, the authors observed that the
numerical results are within 1% of the optimal solution.
The following corollary to the Ford-Fulkerson Theorem follows from the definition of an
augmenting path [99].
Corollary 5.2 For a minimum cut [S, T ], every edge joining a vertex in S and a vertex in
T is saturated.
Using this corollary, we can determine the minimum cut from a maximum flow by
checking saturated edges. The algorithm, a variant of the Ford-Fulkerson algorithm above,
keeps track of a set of reached vertices R and a set of searched vertices S. The search traces
forward from the source along unsaturated edges and backwards (towards the source) along
edges with positive flow. The algorithm terminates when there are no more vertices to
search. Note that the algorithm does not reach the sink t if and only if the given flow f is
maximum [99].
Ford-Fulkerson Labeling Algorithm
Input: Network G = (V,E) with capacity c and maximum flow f .
Output: Minimum cut [S, T ].
Set R = {s}, S = ∅.
while R ≠ S
Choose v ∈ R \ S.
For each vu ∈ E, if f(vu) < c(vu) then set R = R ∪ {u}.
For each uv ∈ E, if f(uv) > 0 then set R = R ∪ {u}.
Set S = S ∪ {v}.
Return minimum cut [S, V \ S].
The labeling algorithm runs in O(V E) time. It is possible to incorporate the labeling
algorithm into the max-flow algorithm by tracking edges that become saturated as the flow
is assigned. So the minimum cut does not need to be found as a separate step.
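The labeling search might be sketched as follows, assuming capacities and a maximum flow stored as nested dictionaries (names illustrative):

```python
def labeling_min_cut(capacity, flow, s):
    """Ford-Fulkerson labeling: given capacities and a maximum flow, both as
    nested dicts u -> {v: value}, return the source side S of a minimum cut
    by searching forward along unsaturated edges and backward along edges
    carrying positive flow."""
    S, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v, c in capacity.get(u, {}).items():      # forward, unsaturated
            if flow.get(u, {}).get(v, 0) < c and v not in S:
                S.add(v)
                stack.append(v)
        for v in capacity:                            # backward, positive flow
            if flow.get(v, {}).get(u, 0) > 0 and v not in S:
                S.add(v)
                stack.append(v)
    return S
```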
5.2.3 The Quantized TV Model
The quantized TV energy in (5.1) can be modeled by a flow network under an L1 regular-
ization term
|∇u|1 = |ux|+ |uy|.
This version is often called the anisotropic TV norm because, unlike the classical L2 TV
norm, it is not rotationally invariant. The L1 norm gives preference to edges parallel to the
axes and tends to produce blockier images with sharp corners. Using this norm is necessary
for the graph cut framework, but given the applications to barcodes and text images it
seems appropriate to use a model that prefers blocky images.
Let x ∼ y denote that two pixels x, y ∈ Ω are adjacent under the standard 4-connected
cross topology. Rewrite the TV energy in (5.1) in discrete form under the L1 regularization
as
min_{u ∈ {I1,...,IL}} ETV[u | u0] = Σ_{2≤j≤L} Σ_{x∼y ∈ Ω, u(x)≥Ij>u(y)} (Ij − Ij−1) + (λ/2) Σ_{1≤j≤L} Σ_{x ∈ Ω, u(x)=Ij} (Ij − u0(x))²   (5.2)
where the intensity levels I1, . . . , IL are given in ascending order. Although (5.2) appears
more cumbersome, it will illuminate how to model the TV energy as a flow network. For
neighboring pixels x and y with u(x) 6= u(y), we add a regularization penalty Ij − Ij−1
corresponding to the number of levels separating u(x) and u(y). For the fidelity term, we
add the amount (Ij − u0(x))2 if u(x) = Ij .
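For reference, (5.2) can be evaluated directly on a quantized image; a minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def quantized_tv_energy(u, u0, lam):
    """Discrete anisotropic TV energy (5.2): L1 gradient over 4-connected
    neighbors plus the quadratic fidelity term. u should already take
    values in the quantized level set."""
    u, u0 = np.asarray(u, float), np.asarray(u0, float)
    reg = np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()
    fid = 0.5 * lam * np.sum((u - u0) ** 2)
    return reg + fid
```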
For each image pixel, create a directed path of L + 1 vertices corresponding to the L
intensity levels plus a terminal. Each pixel’s path starts at a common vertex corresponding
to level I1, which we identify as the source s. The paths also terminate at a common
sink t, which we identify with the dummy variable IL+1. For pixel x ∈ Ω, we denote its
corresponding vertex at level Ij , 1 ≤ j ≤ L + 1, by xj . Define the capacity function c for
this graph by
c(xj, xj+1) = λ (Ij − u0(x))², 1 ≤ j ≤ L, x ∈ Ω   (5.3)
c(xj, yj) = Ij − Ij−1, 2 ≤ j ≤ L, x ∼ y, x, y ∈ Ω.   (5.4)
All edges in the graph not specified by these two equations are assumed to have capacity
zero (or non-existent). Matching these equations to (5.2) shows that a minimum cut of the
network also minimizes the TV energy. This network set-up is sometimes called a “ladder”
system and is illustrated in Figure 5.1. The fidelity term (5.3) is given on the sides of the
ladder and the regularization term (5.4) on the rungs.
Figure 5.1: Illustration of quantized TV graph model for neighboring pixels x ∼ y.
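The ladder construction (5.3)-(5.4) can be written out explicitly. The following sketch builds the capacity table for a grayscale image; the vertex encoding and names are our own illustration:

```python
import numpy as np

def build_tv_ladder(u0, levels, lam):
    """Build the 'ladder' network (5.3)-(5.4) for the quantized TV energy.
    A pixel p at level j is the vertex (p, j); level 1 is collapsed to the
    common source 's' and level L+1 to the sink 't'."""
    L = len(levels)
    h, w = u0.shape
    cap = {}

    def add(u, v, c):
        cap.setdefault(u, {})[v] = cap.get(u, {}).get(v, 0) + c

    def node(p, j):
        return 's' if j == 1 else ('t' if j == L + 1 else (p, j))

    for p in np.ndindex(h, w):
        # fidelity edges along each pixel's path, eq. (5.3)
        for j in range(1, L + 1):
            add(node(p, j), node(p, j + 1), lam * (levels[j - 1] - u0[p]) ** 2)
        # regularization "rungs" between 4-neighbors, eq. (5.4), both directions
        y, x = p
        for q in ((y + 1, x), (y, x + 1)):
            if q[0] < h and q[1] < w:
                for j in range(2, L + 1):
                    add(node(p, j), node(q, j), levels[j - 1] - levels[j - 2])
                    add(node(q, j), node(p, j), levels[j - 1] - levels[j - 2])
    return cap
```

Feeding these capacities to any max-flow solver and reading off where the minimum cut crosses each pixel's path recovers the quantized minimizer.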
Note that there are two regularization edges in (5.4) for each pair xj , yj , one going each
direction. So the total regularization penalty is actually twice the TV regularization in
(5.2). Hence, the fidelity weight in (5.3) is also doubled, with parameter λ instead of λ/2 as
in (5.2). Defining the value of u(x) to correspond to where the cut crosses the path for
pixel x, the minimum cut produces the minimum energy image u. Note that the minimum
cut will cross each pixel’s path exactly once assuming that λ > 0 and the intensity levels
I1, . . . , IL are distinct. We credit the following theorem to Boykov et al., who developed
the “ladder” graph cut framework for the Potts model [17].
Theorem 5.3 (Boykov-Veksler-Zabih, 2001) Let u0 : Ω → ℝ≥0 be an image, λ > 0 a
parameter, and I1 < I2 < · · · < IL intensity levels. Let [S, T] be a minimum
cut of the flow network defined by (5.3)-(5.4). Then the image u : Ω → ℝ≥0 defined by

u(x) = Ij if xj ∈ S, xj+1 ∈ T, 1 ≤ j ≤ L

is the minimizer of the anisotropic quantized TV energy given by (5.2). Furthermore,
val([S, T]) = 2 ETV[u | u0].
Note that the reason the L1 regularization is appropriate for the graph model is that
it can be written as a sum of pairwise interactions, which cannot be done with the L2
TV norm. Furthermore, the L1 norm can be represented as a summation of levels. For
neighboring pixels x ∼ y at levels Ij and Ij+k, k ≥ 1, respectively, the anisotropic TV
regularization RTV = |u(x) − u(y)| can be written

RTV(u(x) = Ij, u(y) = Ij+k) = Σ_{j ≤ i < j+k} RTV(u(x) = Ii, u(y) = Ii+1).
Note this formula is essentially a discrete version of the TV norm’s co-area formula. This
is called the “levelable” property of the regularization term. Darbon and Sigelle proved
that the anisotropic TV norm is the only convex and levelable image prior that is invariant
to gray-level shifts [39]. Kolmogorov and Zabih gave more general criteria describing
types of energies can be minimized by graph cuts [66].
For an image with N pixels and L desired levels, the number of vertices in the flow
network is N(L − 1) + 2. Except for pixels on the image border, each vertex has six
outgoing edges for the directions up, down, left, right, sink, and source. So the number of
edges is also O(NL). The adjacency matrix is technically of order NL × NL, but if the
graph is stored as a sparse matrix the memory requirement is only O(NL), equivalent to
storing L copies of the original image. The running time of the Ford-Fulkerson algorithm
for this network is O(N³L³). The preflow-push method takes O(N²L² log(NL)) time and
the α-expansion approximation algorithm runs in O(NL) time.
Compared to a continuous-valued TV implementation such as the TV filter or gradient
descent, the graph cut TV model offers the following advantages:
• Exact minimization: The graph cut method computes the global minimum of the
quantized TV energy. A continuous TV implementation generally converges to a
local minimum and may face convergence issues such as controlling the time step
or pre-conditioning. There is no notion of convergence in the graph cut method; when the
minimum cut algorithm terminates, the quantized image is at the minimum.
• Quantization: Thresholding a continuous-valued image or quantizing a discrete-valued
image further can introduce round-off errors. There is no guarantee that a minimizer
of the continuous-valued TV energy can produce the minimizer of the quantized TV
energy. By design, the output of the graph cut method is the minimizer of the
quantized TV energy and the resulting image takes on only the specified intensity
levels.
• Speed: Graph cuts can be computed in time that is low-order polynomial in the
number of pixels by algorithms with fairly low coding complexity. Approximation
algorithms run in linear time in practice.
• Derivative-free: Beyond the gradient in the regularization term, the graph cut method
does not require the computation of any derivatives. The method is not dependent
on a discretization or a computational adjustment like lagged diffusivity.
• No artificial boundary conditions or parameters: Most continuous TV minimization
routines assume the image has Neumann boundary conditions for computational purposes.
To avoid division by zero in the Euler-Lagrange equation, most routines intro-
duce a lifting parameter to the norm of the gradient. For a time-stepping method,
the TV minimization can be thrown off if the size of the time step ∆t is too large.
Disadvantages of the graph cut TV model include:
• Anisotropic TV: The regularization term must be the anisotropic L1 TV norm. It is
well-known that L2 TV minimization tends to produce images that are “blocky,” in
the sense that the minimization favors piecewise constant regions [42]. The images
resulting from L1 quantized TV minimization will appear even more blocky, with
sharp corners and quantized gray levels. In Section 5.5.3, we show that the image will
more closely resemble the isotropic energy if we use a different topology.
• Deconvolution: Incorporating a blur kernel into the quantized TV model is an open
problem, while it is relatively straightforward in the continuous case. We propose an
approximation strategy to the deconvolution problem in Section 5.4.
• Pre-determined intensity levels: The resulting image depends heavily on the choice
of intensity levels, which are assumed to be specified a priori. In Section 5.5.1, we
discuss how to update the levels for a given image, but the initial choice of intensities
is still important. In some applications, it may be easier to use the continuous TV
model for image enhancement and then determine the quantization levels afterwards.
• Large number of levels: The memory requirements and running time of the graph
cut method both scale with the number of intensity levels L, making the graph cut
method computationally expensive for large values of L. For more than a few levels,
say L > 10, the quantized image closely resembles the continuous-valued image (see
Figure 5.3). Hence, we expect the possible thresholding errors in the continuous case
to be small for large L.
5.2.4 Numerical Results
We implemented TV graph cut minimization in Matlab using the preflow-push algorithm
in Stanford's Boost Graph Library [53]. This implementation was found to be significantly
faster than the Ford-Fulkerson method using a breadth-first search. To save memory, the
graph weights were stored in sparse matrices. For simplicity, the number of levels L is
specified as a parameter and the intensity levels were set by a linear increment:
I_j = \min(u_0) + (j-1)\,\frac{\max(u_0) - \min(u_0)}{L-1}, \quad 1 \le j \le L. \quad (5.5)
One drawback of the preflow-push algorithm is that it assumes all edge capacities are
integer-valued. If λ is rational, we can place integer weights on the regularization and
fidelity terms so that their relative importance is preserved. That is, if λ = α/β for α, β ∈ Z+,
then the minimization

\min_u \int_\Omega |\nabla u| \, dx + \lambda \int_\Omega (u - u_0)^2 \, dx

is equivalent to

\min_u \, \beta \int_\Omega |\nabla u| \, dx + \alpha \int_\Omega (u - u_0)^2 \, dx,

since the two energies differ only by the constant factor β.
Then assuming the original image u0 is integer-valued, as is usually the case for 8-bit images,
all capacities defined by (5.3)-(5.4) will also be integer-valued.
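The rescaling step can be sketched as follows; `integer_capacities` is a hypothetical helper name, and `fractions.Fraction` recovers α and β from a rational λ:

```python
from fractions import Fraction

import numpy as np

def integer_capacities(u0, levels, lam):
    """Rescale weights for a rational λ = α/β: put β on every regularization
    edge and α(I_j - u0(x))^2 on the fidelity edges, so that all capacities
    are integers whenever u0 and the levels are integer-valued."""
    frac = Fraction(lam).limit_denominator()
    alpha, beta = frac.numerator, frac.denominator
    fid = alpha * (levels[:, None] - u0.ravel()[None, :]) ** 2
    return beta, fid

reg, fid = integer_capacities(np.array([[3]]), np.array([0, 5]), 0.3)
# λ = 0.3 = 3/10: regularization weight 10, fidelity capacities 3*(I_j - 3)^2
```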
Figure 5.2: Effect of λ on TV minimization with L = 4 levels.
The value of the fidelity weight λ has a large effect on the resulting image. The appro-
priate value of λ is hard to determine, because it depends on the original image, the desired
Figure 5.3: Effect of # levels L on TV minimization with λ = 1.
Figure 5.4: Running time of quantized TV model with preflow-push method. Left: Log-log plot of # pixels N vs. runtime for repeatedly downsampled Barbara image. Right: Log-log plot of # levels L vs. runtime on 50x50 Barbara image. Linear regressions are shown in red.
intensity levels, the amount of noise, and the application. We found that λ = 1 worked well
for 8-bit images. For large values of λ, the result resembles thresholding to the specified
intensities. As λ→ 0, the image becomes smoother and will eventually result in a constant
image. Note that the rightmost image in Figure 5.2 takes on only 2 gray values even though
the image was specified for L = 4 levels. The quantized TV energy restricts the output to
intensities {I1, . . . , IL}, but the resulting image u is only guaranteed to take on
a subset of {I1, . . . , IL}.
If the number of gray levels L is large, the resulting image looks very similar to the
result of continuous-valued TV minimization. This suggests that the quantized TV model
is best for small L. The applications presented in this chapter will be for images at few
levels, especially binary-valued images.
Theoretically, the running time of the preflow-push method is O(N^2 L^2 \log(NL)) for an
N-pixel image processed over L levels. As a numerical experiment, we repeatedly ran the
graph cut procedure on an image with fixed λ and intensity levels. The “Barbara” image
was repeatedly downsampled and we compared the number of pixels N to the running time
of the TV minimization at L = 2 levels. A linear regression of the log-log plot has a slope
of roughly 2.29. Next, a 50x50 Barbara image was processed at different values of L. The
log-log linear regression has slope 2.38. This confirms numerically that the running time is
slightly worse than quadratic in both N and L.
5.3 Application to Low-Level Vision Tasks
5.3.1 Denoising
Quantized TV minimization denoises an image in two ways: the TV norm minimizes the
amount of local variation, and reducing the number of gray levels results in a smoother
image. Quantization is a denoising process by itself. As discussed earlier, the value of
the fidelity weight λ is supposed to be inversely proportional to the variance of the noise.
Reducing the value of λ results in a smoother image, but reducing the number of levels L
seems to have an even greater smoothing effect. In Figure 5.5, the rightmost image appears
the least noisy because the number of levels is the lowest even though the value of λ is fairly
large. However, features are obscured in the binary image. Quantized TV denoising seems
to make the most sense when the underlying noise-free image is quantized, such as with
barcodes or text. The quantized TV denoising model has been studied by Chambolle, who
presented an efficient adjoint formulation implementation [27].
Figure 5.5: TV denoising of Barbara image. Left to right: Original image; L = 5 and λ = 1; L = 5 and λ = 0.1; L = 2 and λ = 1.
The Bayesian rationale for the TV energy assumes that the image was corrupted by
additive Gaussian noise. Jonsson, Huang, and Chan studied the TV energy assuming a
Poisson noise model, which has applications to denoising Positron Emission Tomography
(PET) images [62]. At a pixel x, the Poisson noise model assumes
\Pr(u_0(x) \,|\, u(x)) = \frac{e^{-u(x)} \, [u(x)]^{u_0(x)}}{[u_0(x)]!}.
Taking the negative log likelihood and ignoring constants in the minimization, the fidelity
term becomes

\frac{\lambda}{2} \int_\Omega (u - u_0 \log u) \, dx.
This can be implemented in the graph cut method by changing the fidelity capacity in (5.3)
to
c(x_j, x_{j+1}) = \lambda \left( I_j - u_0(x) \log I_j + C_L \right), \quad 1 \le j \le L, \; x \in \Omega
where CL is a constant satisfying CL ≥ max(u0) log IL and all intensity levels Ij ≥ 1. The
constant CL is added to ensure all edge capacities are non-negative and its presence will
not affect the minimization. Since the TV norm is shift-invariant, we can temporarily shift
the image to handle gray values less than one.
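A sketch of the shifted Poisson capacities follows; the helper name and test image are invented for illustration, and the capacity form is taken directly from the equation above:

```python
import numpy as np

def poisson_fidelity_caps(u0, levels, lam):
    """Poisson-model fidelity capacities λ(I_j - u0(x) log I_j + C_L), with
    C_L = max(u0) log(I_L) so that every capacity is non-negative
    (all levels I_j >= 1 assumed)."""
    CL = u0.max() * np.log(levels.max())
    return lam * (levels[:, None, None]
                  - u0 * np.log(levels[:, None, None]) + CL)

u0 = np.array([[1.0, 4.0], [2.0, 9.0]])
caps = poisson_fidelity_caps(u0, levels=np.array([1.0, 5.0, 9.0]), lam=1.0)
```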
Figure 5.6 shows the result of denoising an image corrupted by Poisson noise. For the
same value of λ and intensity levels, the Poisson TV model is much better at removing the
noise than the standard Gaussian model. The noise points in the interior of the shapes
will persist in the Gaussian model until λ = 0.03, but this low fidelity weight results in
oversmoothing of the shapes’ corners. This simple example shows that knowledge of the
noise process can be incorporated into the quantized TV model and can result in better
images.
Figure 5.6: TV Poisson denoising. Left: Original image corrupted by Poisson noise. Center: TV minimization assuming Gaussian noise with λ = 5, L = 3. Right: TV minimization assuming Poisson noise with λ = 5, L = 3.
5.3.2 Segmentation
Identifying each connected region of constant gray value as an object, quantization can be
thought of as a segmentation procedure. For example, foreground/background segmenta-
tion might suggest processing at L = 2 levels. The quantized TV energy will divide the
image into segments, removing noise and giving preference to connected regions with sharp
boundaries and little local variation [39]. This approach works well for simple images with
constant intensity shapes and high contrast, as in Figure 5.7. However, the segmentation
may not perform well for natural images with texture, low contrast, and complicated ob-
jects taking on many intensity levels. Another problem is that the intensity levels need to
be specified a priori, which may cause difficulties when the objects are close together both
in proximity and gray value.
Figure 5.7: Quantized TV segmentation of simple images. Left 2: segmentation of natural image with λ = 0.5, L = 2. Right 2: segmentation of noisy synthetic image with λ = 0.02, L = 4.
Boykov and Jolly suggested user interaction to guide the segmentation process [15].
Suppose the user selects L “seed” pixels from the image, each pixel belonging to a different
region in the desired segmentation. Assume also that the seeds are at different gray levels
so that we can assign the intensities Ij to match the seed pixel values, with I1, . . . , IL
then sorted in ascending order. For a seed s ∈ Ω, we then define the fidelity capacity for
levels 1 ≤ j ≤ L to be
c(s_j, s_{j+1}) =
\begin{cases}
0 & \text{if } u_0(s) = I_j \\
\infty & \text{otherwise.}
\end{cases}
This forces the minimum cut to pass through the edge corresponding to the gray value
u_0(s) at pixel s. This solves the problem of determining the intensity levels, while also
forcing the seed pixels to be in different regions.
Figure 5.8: TV seeded segmentation. Left: Original image with 3 seed pixels shown in red. Center: Quantized TV minimization with λ = 0.5, L = 3 levels selected by (5.5). Right: Quantized TV minimization with λ = 0.5, L = 3 using seeds.
The implicit assumption is that the selected seeds are not noise points in the image,
which would result in a poor choice of intensity levels. One solution would be to assign the
intensity to be an average over a local neighborhood surrounding the seed pixel. Another
possibility would be to have the user select several pixels in each region and average those
gray values. Such interactive segmentation systems have been developed where the user
“scribbles” in the selected regions [70].
Note that the seeded minimization in Figure 5.8 is superior visually, but the image
is still not segmented properly. An ideal segmentation at L = 3 levels would put the
person, camera, and background into 3 different regions. However, such a segmentation is
not possible using just intensity information; more sophisticated segmentation techniques
are required. The quantized TV energy should be understood as a pre-processing step
for segmentation, rather than the final segmentation result. Darbon and Sigelle showed
that quantization can improve the performance of edge detection and object recognition
algorithms [39].
5.3.3 Texture Segmentation
The image intensities alone do not appear to be enough to truly segment an image u0, but
segmentation could be achieved by applying a statistical filter to u0 designed to discriminate
certain properties of the image. For example, an object detection filter could assign a value
to each pixel indicating the probability that a pixel belongs to a certain object. Minimizing
the quantized TV energy of the filtered image would be similar to a classification system,
such as a support vector machine. The quantized levels divide the image into regions, while
the TV regularization tries to maintain connected components. The applications could
include object detection, recognition, motion tracking, and texture segmentation, the last
of which is examined in this section.
For texture segmentation, a simple texture filter is based on the intensity histogram.
Suppose the range of gray values is divided into n intervals zi, 1 ≤ i ≤ n. Let Pr (zi)
represent the relative frequency of interval zi in the histogram of the image u0. The entropy
e of the image is defined by
e = -\sum_{i=1}^{n} \Pr(z_i) \log_2 \Pr(z_i).
The entropy measures the amount of randomness in the image [55]. Note that the entropy
could be calculated over segments of the image, rather than the whole image. For a fixed
odd integer N , let e(x) denote the entropy calculated over an N ×N window of u0 centered
over pixel x. To handle pixels at the border, impose Neumann boundary conditions on the
image.
Figure 5.9: TV texture segmentation. Left: Original image. Center: TV minimization with λ = 0.2, L = 2 of entropy statistics. Right: TV minimization with λ = 0.05, L = 2 of skewness statistics.
Another basic histogram statistic used for texture discrimination is the skewness s. The
skewness is the third moment of the histogram, calculated with respect to the mean µ:
s = \sum_{i=1}^{n} (z_i - \mu)^3 \Pr(z_i), \qquad \mu = \sum_{i=1}^{n} z_i \Pr(z_i).
As the name implies, this statistic measures the amount of symmetry in the histogram. A
symmetric histogram gives s = 0, a right-skewed histogram s > 0, and left-skewed s < 0
[55]. Let s(x) denote the skewness of the N ×N neighborhood of x.
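The two windowed histogram filters can be sketched together as follows. The bin count is a free parameter not specified in the text, `np.pad` with replicated edges plays the role of the Neumann boundary condition, and bin centers stand in for the gray values z_i:

```python
import numpy as np

def window_stats(u0, N=5, bins=8):
    """Local entropy e(x) and skewness s(x) over an N x N window with
    replicated (Neumann) boundaries -- a sketch of the histogram filters."""
    pad = N // 2
    up = np.pad(u0.astype(float), pad, mode='edge')
    h, w = u0.shape
    e = np.zeros((h, w))
    s = np.zeros((h, w))
    edges = np.linspace(up.min(), up.max() + 1e-9, bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    for i in range(h):
        for j in range(w):
            win = up[i:i + N, j:j + N].ravel()
            p, _ = np.histogram(win, bins=edges)
            p = p / p.sum()                              # relative frequencies
            mu = np.sum(centers * p)
            nz = p > 0
            e[i, j] = -np.sum(p[nz] * np.log2(p[nz]))    # entropy
            s[i, j] = np.sum((centers - mu) ** 3 * p)    # skewness
    return e, s
```

A constant image has zero entropy and zero skewness everywhere, which is a quick sanity check on the filters.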
Figure 5.9 shows the result of 2-level quantized TV minimization on the entropy and
skewness images of u0. The window size was 5x5 and the value of λ was tuned to give the best
result. The results are quite good, considering the simplicity of the filters. Better results
should be obtained from more sophisticated texture filters, such as a linear combination of
histogram-based statistics.
Figure 5.10: TV inpainting. Left: Original image with mask D shown in red. Right: TV inpainting result with L = 3, λ = 0.1.
5.3.4 Inpainting
Suppose image information is missing or damaged in a set D ⊆ Ω. The basic variational
inpainting model defines the fidelity term to be zero in the region D:
\min_{u \in \{I_1, \ldots, I_L\}} E_{TV}[u \,|\, u_0] = \int_\Omega |\nabla u| \, dx + \frac{\lambda}{2} \int_\Omega \mathbf{1}_{\Omega \setminus D}(x) \, (u - u_0)^2 \, dx. \quad (5.6)
In the graph framework, this suggests setting all fidelity capacities to zero in the unknown
region D. However, this could potentially allow the minimum cut to cross the chain of a
pixel x ∈ D more than once. To define the resulting image u in (5.3), the cut should cross
each pixel’s chain exactly once. This is easily remedied by setting all fidelity capacities
along the chain to a positive constant, say 1. As long as all values along the chain are
identical, the minimization will not give preference to any level Ij for an unknown pixel.
Define the fidelity weight for pixel x ∈ Ω at level 1 ≤ j ≤ L by
c(x_j, x_{j+1}) =
\begin{cases}
\lambda \, (I_j - u_0(x))^2 & \text{if } x \in \Omega \setminus D \\
1 & \text{if } x \in D.
\end{cases}
The regularization weights will still hold throughout the image, so that image u is smoothed
in the unknown region D.
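The capacity rule above can be sketched directly; `inpainting_caps` is an illustrative helper name:

```python
import numpy as np

def inpainting_caps(u0, D, levels, lam):
    """Fidelity capacities for quantized TV inpainting: the usual quadratic
    weight outside the damaged set D, and a constant 1 on every chain edge
    inside D so the cut crosses each chain once without favoring a level."""
    caps = lam * (levels[:, None, None] - u0) ** 2
    caps[:, D] = 1.0
    return caps

u0 = np.array([[10.0, 20.0], [30.0, 40.0]])
D = np.array([[False, True], [False, False]])   # pixel (0,1) is damaged
caps = inpainting_caps(u0, D, np.array([0.0, 50.0]), lam=0.1)
```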
The quantized TV inpainting model inherits all the flaws of continuous-valued TV in-
painting: oversmoothing textured regions, blocky images, not completing broken lines, and
completing curves with straight edges. For small values of L, these flaws can become even
more pronounced for the quantized model. As described in Chapter 3, the inpainting model
is best suited for long, thin domains such as scratches. As the diameter of D increases, the
inpainting errors will become more obvious.
5.3.5 Zooming
Using the quantized TV inpainting method described in the last section, an image can be
zoomed by recasting the zooming problem as an inpainting problem. As in Chapter 3, to
zoom by a magnification factor M > 1 separate the pixels in the image u0 by M − 1 pixels.
Define the unknown pixel domain D to be the buffer region separating the known pixels in
u0. The inpainting routine should then fill in the unknown pixels “in between” the known pixels.
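The embedding step can be sketched as follows. The exact grid-size convention is an assumption made for illustration: here the known samples sit at every M-th position, so an h × w image is placed on a grid of size M(h−1)+1 × M(w−1)+1 with M − 1 unknown pixels between known neighbors:

```python
import numpy as np

def zoom_as_inpainting(u0, M):
    """Embed u0 on a finer grid with known samples M pixels apart;
    everything in between is the unknown inpainting domain D."""
    h, w = u0.shape
    big = np.zeros((M * (h - 1) + 1, M * (w - 1) + 1), dtype=u0.dtype)
    D = np.ones_like(big, dtype=bool)
    big[::M, ::M] = u0      # known pixels separated by M-1 unknowns
    D[::M, ::M] = False
    return big, D

big, D = zoom_as_inpainting(np.array([[1, 2], [3, 4]]), M=3)
```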
Unfortunately, the pixels in between may not be filled in. For larger magnification fac-
tors, the image consisting of isolated pixels may actually correspond to the global minimum.
Figure 5.11 shows a simple example of zooming a white square. For magnification M = 2,
a square is achieved but the bottom of the square is not where one would expect. For
magnification M ≥ 3, the minimizer is an image consisting of isolated white pixels. For a
binary image, the TV norm is a constant multiple of the edge length. The results in this
figure are the correct global minimizers because the isolated small squares have lower total
edge length than a large square for M ≥ 3.
For natural images, the failure of the zooming method will generally not be so pronounced.
Figure 5.11: TV zooming by inpainting with L = 2, λ = 1, and magnification factor M.
Figure 5.12: TV zooming by inpainting with L = 2, λ = 1, and magnification M = 2.
However, isolated pixels could appear in regions of high contrast and around thin
lines. In Figure 5.12, note the isolated pixels along the handle of the camera.
For the continuous-valued TV energy, the zooming method can be pushed toward a
local minimum by starting the process with a proper initialization, such as the result of a
linear zooming filter. However, the quantized TV model does not require an initialization
and even if one were incorporated it would still achieve the same global minimum. Another
strategy is to use the “soft” inpainting model where the unknown pixels are given some
affinity for their nearest neighbors.
5.4 Quantized TV Minimization with a Blur Kernel
A black-and-white image, such as a barcode or text, that has been blurred will appear to
be a grayscale image with more than two intensity levels. Recovering the original binary
image should combine the quantization and deblurring procedures. Given an image u0 that
has been blurred by a known operator K, the quantized TV deblurring model is
\min_{u \in \{I_1, \ldots, I_L\}} E_{TV}[u \,|\, u_0, K] = \int_\Omega |\nabla u| \, dx + \frac{\lambda}{2} \int_\Omega (Ku - u_0)^2 \, dx. \quad (5.7)
If the blur is shift-invariant, K can be expressed as a convolution by some kernel function
k(x). The continuous TV deblurring model has been well-studied and can be implemented
by gradient-based or level set methods [32]. The quantized model, however, has proven
more difficult to implement. In terms of graph models, the difficulty lies in the fact that the
blur operator acts on a group of pixel values so the fidelity term cannot be simply expressed
on a single pixel’s edges. Raj and Zabih proposed an approximation method for the special
case when the blur matrix is diagonal, but no deconvolution method exists for the general
case [81].
5.4.1 Deblurring by Numerical Relaxation
To make use of linear algebra notation, express the original image u0 and the ideal image
u as column vectors by reading the pixels, for example, in raster order. For an image with
N pixels, express the linear blur operator as an N ×N matrix K. Note that this blurring
could be spatially varying and is more general than a convolution. Then the fidelity term
can be written ‖Ku − u0‖2 in the L2-norm. If we expand the fidelity term of (5.7), we
obtain

\|Ku - u_0\|^2 = (Ku - u_0)^T (Ku - u_0)
             = u^T K^T K u - u^T K^T u_0 - u_0^T K u + u_0^T u_0
             = (K^T K u, u) - 2(u, K^T u_0) + \|u_0\|^2. \quad (5.8)
In the last line, the second term is linear in u and the third term is a constant. If we could
make the first term linear in u, then we could model the TV energy as a flow network. Bect
et al. showed how to break this term into linear components, which we present below [9].
Inspired by relaxation techniques in linear programming, introduce a vector w representing
slack variables or weights. Our goal is to rewrite the first term in (5.8) as
(K^T K u, u) = \min_w \|u - w\|^2 + w^T A w \quad (5.9)
where A is an N × N matrix that depends on the blur operator K. The idea is to freeze the
image u, solve for w, and then update the image u. We will first discuss how to derive w
and A.
First note that the right-hand side of (5.9) can be expanded as
\|u - w\|^2 + w^T A w = (u - w, u - w) + (Aw, w)
                      = \|u\|^2 - 2(u, w) + \|w\|^2 + (Aw, w). \quad (5.10)
Differentiating with respect to w and setting equal to zero yields
−2u+ 2w + 2Aw = 0
⇒ (I +A)w = u
where I denotes the N × N identity matrix. Solving for w gives
w = (I + A)^{-1} u. \quad (5.11)
Assuming A and u are fixed, this gives the minimum of (5.9) with respect to w.
Plugging this new expression for w into (5.10) gives
\|u\|^2 - 2(u, w) + \|w\|^2 + (Aw, w)
= (u - (I+A)^{-1}u, \, u - (I+A)^{-1}u) + (A(I+A)^{-1}u, \, (I+A)^{-1}u)
= (u, u) - 2((I+A)^{-1}u, u) + ((I+A)^{-1}(I+A)^{-1}u, u) + ((I+A)^{-1}A(I+A)^{-1}u, u)
= ([I - 2(I+A)^{-1} + (I+A)^{-1}(I+A)^{-1}(I+A)]u, u)
= ([I - (I+A)^{-1}]u, u).
For (5.9) to hold, we require
([I - (I+A)^{-1}]u, u) = (K^T K u, u)
\;\Rightarrow\; I - (I+A)^{-1} = K^T K
\;\Rightarrow\; A = (I - K^T K)^{-1} - I.
Using the linear algebra identity (I - B)^{-1}B = (I - B)^{-1} - I with B = K^T K, we obtain

A = (I - K^T K)^{-1} K^T K.
However, this solution for A is not computationally feasible for most blur matrices K because
I − K^T K will be ill-conditioned. To control the condition number, introduce a parameter µ
to replace K^T K with \tfrac{1}{\mu} K^T K. Then (5.9) becomes

\mu \left( \tfrac{1}{\mu} K^T K u, u \right) = \mu \left[ \min_w \|u - w\|^2 + w^T A w \right]

and the solution for A is

A = \left( I - \tfrac{1}{\mu} K^T K \right)^{-1} \tfrac{1}{\mu} K^T K. \quad (5.12)

If we choose µ > ‖K^T K‖, then the largest eigenvalue of I − \tfrac{1}{\mu} K^T K will be guaranteed to
be less than one.
Putting together equations (5.8)-(5.12), the fidelity term of the TV energy (5.7) can be
written as
\|Ku - u_0\|^2 = \mu \|u - w\|^2 + \mu \, w^T A w - 2(u, K^T u_0) + \|u_0\|^2

where

w = (I + A)^{-1} u, \qquad A = \left( I - \tfrac{1}{\mu} K^T K \right)^{-1} \tfrac{1}{\mu} K^T K, \qquad \mu > \|K^T K\|.
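The relaxation identity (5.9), with the scaled A of (5.12), can be verified numerically on a random small K. This is a sanity-check sketch, not part of the thesis code:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
K = rng.standard_normal((N, N))    # stand-in blur matrix
u = rng.standard_normal(N)

mu = np.linalg.norm(K.T @ K, 2) + 1.0              # µ > ‖KᵀK‖ (spectral norm)
B = (K.T @ K) / mu
A = np.linalg.solve(np.eye(N) - B, B)              # A = (I - KᵀK/µ)⁻¹ KᵀK/µ
w = np.linalg.solve(np.eye(N) + A, u)              # minimizing w = (I + A)⁻¹ u

lhs = u @ (K.T @ K) @ u                            # (KᵀK u, u)
rhs = mu * (np.sum((u - w) ** 2) + w @ A @ w)      # µ [‖u - w‖² + wᵀAw]
```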
So in the flow network model, we can express the fidelity capacity of pixel x with given
fidelity weight λ as
c(x_j, x_{j+1}) = \lambda \left[ \mu (I_j - w(x))^2 + \mu (w^T A w)(x) - 2 I_j (K^T u_0)(x) + \|u_0\|^2 \right], \quad 1 \le j \le L. \quad (5.13)
The regularization capacities will remain the same as in the original model in Section 5.2.3.
By alternating the computation of the image u and the weights w, the deblurring prob-
lem can be solved by the TV graph cut method. The minimization could proceed for a fixed
number of iterations or until some stopping criterion is achieved, such as when the image
u is no longer updated. The proposed alternating minimization algorithm is summarized
below.
Quantized TV Deblurring Algorithm
Input: Blurred image u0, blur operator K, fidelity weight λ, intensity levels I1, . . . , IL.
Output: Deblurred image u ∈ {I1, . . . , IL}.
Set u = u0, µ = ‖K^T K‖ + 1.
Compute A = (I − (1/µ) K^T K)^{-1} (1/µ) K^T K.
Initialize graph with regularization capacities given by (5.4).
Repeat for a fixed number of iterations:
    Compute weights w = (I + A)^{-1} u.
    Set graph fidelity capacities by (5.13).
    Compute image u from minimum graph cut.
There are two serious drawbacks to this approach. First, for an image with N pixels the
resulting matrix A will be N ×N , which creates a great demand on memory storage even
for moderate size images. Second, the method may get stuck at a local minimum due to the
nature of alternating minimization. The computation of the image u produces the global
minimum for fixed weights w, and vice-versa. However, alternating the minimization of u
and w does not guarantee convergence to the global minimum of u,w jointly. Indeed, the
approach generally yields unsatisfactory results because it is driven toward local minima.
Both of these problems are addressed in the next section by solving the deblurring and zooming
problems simultaneously.
5.4.2 Zooming Using Local Gradient Information
Suppose the observed low-resolution image u0 was obtained from the ideal high-resolution
image u by convolving u with a blur kernel k(x) and downsampling by a factor M > 1:
u0 = k ∗ u ↓M.
Suppose furthermore that the kernel k is defined digitally over an M ×M window. Then
each pixel in u0 is a weighted sum of M2 pixels in a neighborhood in u and each pixel’s
high-resolution neighborhood does not overlap with another’s.
If the observed image u0 has N pixels, the ideal image has M2N pixels since both
dimensions of u0 are increased by a factor M . Writing the images as column vectors, the
blur matrix K will be N ×M2N . Matching the operation Ku to the convolution k ∗ u
shows that each row of the matrix K will contain at most M2 nonzero entries. The process
for writing u as a vector is up to the programmer and sometimes a proper choice will result
in easier computation. Convert u to a vector by listing the pixels in block-raster order :
the pixels are read first from within each M ×M convolution block in raster order, then
the blocks are read in raster order (see Figure 5.13). Then the resulting blur matrix K
will be sparse with a vector K ′ of length M2 down the diagonal, where K ′ corresponds to
the elements of the kernel k(x) listed in raster order. The computation of the matrix A in
equation (5.12) will also result in a block-diagonal matrix. Assuming the kernel is spatially
invariant, each block A′ of A will be the same M2 ×M2 matrix:
A' = \left( I_{M^2 \times M^2} - \tfrac{1}{\mu} (K')^T K' \right)^{-1} \tfrac{1}{\mu} (K')^T K'. \quad (5.14)
This solves the problem of storing a large matrix A; the much smaller matrix A′ only needs to
be calculated once. Subsequent calculations can be processed over non-overlapping vectors
of length M2, saving computational costs in calculating the weights w. To compute the
M2 weights wi corresponding to the jth pixel of u0, processing equation (5.11) over the jth
block gives
w_{1+(j-1)M^2 \le i \le jM^2} = \left( I_{M^2 \times M^2} + A' \right)^{-1} u_{1+(j-1)M^2 \le i \le jM^2}, \quad 1 \le j \le N. \quad (5.15)
The second problem that needs to be addressed is the tendency of the alternating min-
imization to converge to inaccurate local minima. To drive the computation towards more
appropriate images, additional information can be incorporated into the fidelity term
that connects the low-resolution observation and the high-resolution result. One possibility
is to match local gradients in the two images. The mean horizontal and vertical gradi-
ents within each M ×M convolution block should match the gradient in the corresponding
low-resolution neighborhood. For example, in Figure 5.13 the average horizontal gradients
calculated from the pixel pairs 1,2 and 3,4 would be compared to the horizontal gradient
Figure 5.13: Illustration of writing an image in block-raster order for M = 2, N = 4. The resulting matrices K and A are block-diagonal.
between the pixels a,b. The modified quantized energy becomes
\min_{u \in \{I_1, \ldots, I_L\}} E_{TV}[u \,|\, u_0] = \int_\Omega |\nabla u| \, dz + \frac{\lambda}{2} \int_\Omega (k * u(z) - u_0(z))^2 \, dz

+ \frac{\beta_1}{2} \int_\Omega \left[ \frac{1}{M^2} \sum_{p \in N(z)} \frac{\partial u}{\partial x}(p) - \frac{\partial u_0}{\partial x}(z) \right]^2 dz
+ \frac{\beta_2}{2} \int_\Omega \left[ \frac{1}{M^2} \sum_{p \in N(z)} \frac{\partial u}{\partial y}(p) - \frac{\partial u_0}{\partial y}(z) \right]^2 dz
where N(z) denotes the M ×M high-resolution neighborhood corresponding to pixel z. In
general the blurring term is more important than the gradient information, so we would
expect the weights λ ≥ β1 = β2. We found that λ = β1 = β2 worked well for natural
images.
Discretizing the partial derivatives by forward differences, the mean gradients can be
written as a weighted sum of the pixels within a block. Expressing the images as vectors in
block-raster order, the calculation of gradients can be absorbed into the block’s convolution
matrix K ′. The 1×M2 matrix will become 3×M2, with the first row listing the kernel k in
raster order and the next two rows describing the mean horizontal and vertical gradients.
Then the fidelity term can be written
\left\| K' u|_{N(z)} - \begin{pmatrix} u_0(z) \\ D_x u_0(z) \\ D_y u_0(z) \end{pmatrix} \right\|^2
where u|N(z) is the high-resolution block in u as a column vector and D denotes the central
finite difference. With this modified K ′, the calculation of the matrix A′ and weights w for
the block are still given by (5.14)-(5.15).
We illustrate this set-up for magnification M = 2 with the 2x2 blur kernel k(x) =
[kij ]1≤i,j≤2. The modified K ′ including both blur and gradient matching is
K' = \begin{pmatrix} k_{11} & k_{12} & k_{21} & k_{22} \\ -1/2 & 1/2 & -1/2 & 1/2 \\ 1/2 & 1/2 & -1/2 & -1/2 \end{pmatrix}
If u1, u2, u3, u4 denotes the 2x2 block of u corresponding to pixel z of u0, the above matrix
gives the desired fidelity term at z:
\left\| K' \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} - \begin{pmatrix} u_0(z) \\ D_x u_0(z) \\ D_y u_0(z) \end{pmatrix} \right\|^2
= (k_{11} u_1 + k_{12} u_2 + k_{21} u_3 + k_{22} u_4 - u_0(z))^2
+ \left( \tfrac{u_2 - u_1}{2} + \tfrac{u_4 - u_3}{2} - D_x u_0(z) \right)^2
+ \left( \tfrac{u_1 - u_3}{2} + \tfrac{u_2 - u_4}{2} - D_y u_0(z) \right)^2
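The expansion can be checked numerically for the averaging kernel k_ij = 1/4. The block values and targets below are made up for illustration, and the gradient rows of K′ are written with signs matching the expanded expression:

```python
import numpy as np

Kp = np.array([
    [0.25, 0.25, 0.25, 0.25],   # blur row: 2x2 averaging kernel, raster order
    [-0.5,  0.5, -0.5,  0.5],   # mean horizontal difference
    [ 0.5,  0.5, -0.5, -0.5],   # mean vertical difference
])
u_blk = np.array([1.0, 2.0, 3.0, 5.0])    # u1..u4 in block-raster order
target = np.array([2.0, 0.5, -1.0])       # u0(z), Dx u0(z), Dy u0(z)

fidelity = np.sum((Kp @ u_blk - target) ** 2)

# term-by-term check against the expanded expression
blur = (0.25 * (1 + 2 + 3 + 5) - 2.0) ** 2
dx = ((2 - 1) / 2 + (5 - 3) / 2 - 0.5) ** 2
dy = ((1 - 3) / 2 + (2 - 5) / 2 - (-1.0)) ** 2
```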
Figure 5.14: The binary 0-1 image at left is convolved with a 2x2 averaging kernel K anddownsampled by factor 2 to produce the grayscale image at right.
Figure 5.15: Results of 2x zoom by different methods. Top row: original image, bicubiczoom, TV filter zoom. Bottom row: quantized TV inpainting, quantized TV zooming byrelaxation, quantized TV zooming using local gradients.
For magnification M = 2, we will generally assume the kernel is a 2x2 averaging kernel
k_{ij} = 1/4, the only isotropic 2x2 kernel with unit volume. Figure 5.14 shows a simple 8x8
binary image that has been convolved with the 2x2 averaging kernel and downsampled.
The result is a 4x4 grayscale image because of the blur kernel, even though the final image
does not appear blurred. Recovering the original shape is a deceptively simple problem. A
good reconstruction should take into account the blur kernel and the quantized nature of
the original image. As shown in Figure 5.15, continuous-valued bicubic and TV zooming
produce blurred grayscale images. Quantized zooming by inpainting, as in Section 5.3.5,
produces isolated white pixels. Quantized zooming incorporating the blur kernel gets stuck
at a local minimum, but adding the gradient information produces the correct diagonal in
the right corner. Unfortunately, the gradient information also tends to round off corners,
as shown in the top left corner of the last image. The best method incorporates three
separate pieces of information: quantization, blurring, and local gradients. The results on
this simple shape suggest that the method outlined in this section should yield positive
results for blocky binary-valued images such as barcodes and text.
The zooming method also produces favorable results on natural images, even when the
true blur kernel in the camera model is unknown. Figure 5.16 shows 2x zoom on the
original cameraman image, which was not synthetically blurred or downsampled. Note that
quantized inpainting shows isolated pixels along the handle of the camera and the face is
largely blurred out. Quantized zooming using gradients produces a strong diagonal along
the handle and the facial features appear more distinct, a difficult task because the image
is only binary-valued. Given the problems with the deconvolution method described in
the last section, it appears that quantized deblurring is best achieved by simultaneously
increasing the resolution of the image.
Figure 5.16: Quantized TV zooming on cameraman image. Left: original image. Center: 2x zoom by quantized TV inpainting with L = 2, λ = 1. Right: 2x zoom with 2x2 averaging kernel and local gradients.
5.5 Extensions of the Quantized TV Model
5.5.1 Determining Intensity Levels
The standard approach for determining levels for quantization and compression is to match
the intensity histogram to a probability distribution [55]. The approach used in the previous
sections assumes the histogram is uniformly distributed, generally not a practical assump-
tion. The gray values can be iteratively updated by recalculating intensities for a given
quantized image. For a fixed quantized image u : Ω → {I1, . . . , IL}, the fidelity term of the
TV energy (5.1) is minimized by updating the intensity levels to the mean gray value:
I_j^{new} = \frac{\int_\Omega \mathbf{1}_{u(x) = I_j}(x) \, u_0(x) \, dx}{\int_\Omega \mathbf{1}_{u(x) = I_j}(x) \, dx}.
We propose an alternating minimization strategy in which a quantized image is calculated,
the intensity levels are updated, and then the image is recomputed under the new intensities.
The iteration continues until the intensities are no longer updated. A similar approach
has been suggested for the binary Mumford-Shah segmentation model [90]. Experiments on
natural images suggest this produces a better quantization than the uniform level assign-
ment (see Figure 5.17). The result is sensitive to the initial assignment of levels. A poor
initialization could lead to levels disappearing: the number of distinct levels in the final
image is less than the desired number.
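The alternating strategy can be sketched as follows. This is a minimal illustration rather than the thesis code: the quantized TV solve is replaced by plain nearest-level assignment (the λ → ∞ limit, which reduces the iteration to 1-D k-means on gray values), and the function name is an assumption.

```python
import numpy as np

def alternate_levels(u0, init_levels, max_iter=50):
    """Alternate between quantizing the image and updating the levels.
    Here the quantization step is nearest-level assignment (a stand-in
    for the quantized TV solve); the update step sets each level to the
    mean gray value of its region, the minimizer of the fidelity term."""
    levels = np.asarray(init_levels, dtype=float).copy()
    labels = np.zeros(np.shape(u0), dtype=int)
    for _ in range(max_iter):
        # Quantization: assign each pixel to its nearest intensity level.
        labels = np.argmin(np.abs(u0[..., None] - levels), axis=-1)
        new_levels = levels.copy()
        for j in range(len(levels)):
            region = u0[labels == j]
            if region.size:        # an empty region means the level disappears
                new_levels[j] = region.mean()
        if np.allclose(new_levels, levels):
            break                  # levels no longer updated: converged
        levels = new_levels
    return levels, labels
```

A poor choice of init_levels can still leave a level with an empty region, mirroring the level-disappearance behavior noted above.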
Figure 5.17: Iterating on intensity levels for quantized TV minimization with λ = 1, L = 3.
5.5.2 The TV-L1 Norm
The standard Rudin-Osher-Fatemi model gives strong preference to preserving high-contrast
features. To give stronger emphasis to geometric rather than contrast features, Chan and
Esedoglu suggested reducing the exponent on the fidelity term [28]. The quantized TV-L1
model is:
min_{u ∈ {I1, . . . , IL}}  ETV[u | u0] = ∫Ω |∇u| dx + (λ/2) ∫Ω |u − u0| dx.
Chan and Esedoglu argued that the L1 norm is particularly well-suited for images quantized
to few levels. In particular they showed that TV-L1 minimization will perfectly recover a
binary image when u0 is binary, which is not true for the classical L2 norm.
Computing the minimum of this energy with a classical gradient-based approach is
difficult because the fidelity term is no longer differentiable. Chan and Esedoglu proposed a
gradient descent that requires approximating the derivative of the L1 norm by introducing
a lifting parameter. This energy is more easily minimized in the graph cut method by
changing the fidelity capacities to
c(xj, xj+1) = λ |Ij − u0(x)|,   1 ≤ j ≤ L,  x ∈ Ω.
The graph cut method will compute the global minimum of the quantized TV-L1 energy,
an improvement over the gradient-based approximation method. Unlike the classical TV
model, the TV-L1 energy is not strictly convex so the global minimum may be non-unique.
Figure 5.18: TV minimization with L = 6 levels under L1 and L2 fidelity constraints. Top row: TV-L2 minimization removes low-contrast features as λ decreases. Bottom row: TV-L1 minimization removes finer-scale geometric features as λ decreases.
Darbon showed that TV-L1 minimization is a contrast invariant filter. That is, if u(x)
is a minimizer for the observed image u0(x), then cu(x) is the minimizer for the image
cu0(x). Darbon suggested a level set method similar in nature to the graph cut method for
computing the global minimum of the quantized TV-L1 energy [38].
Figure 5.18 shows a simple experiment on minimizing the classical TV and the TV-L1
energies on an image with squares of varying contrast and size. Under both norms, the
result approaches a constant image as λ → 0. Under the classical L2 norm the squares
with low contrast on the left side of the image disappear as λ gets smaller, with the smaller
squares disappearing first. Under the L1 norm the squares disappear based on size, with the
3 squares of the same size vanishing as a group. This suggests that the value of λ is inversely
proportional to the size of features that are preserved. Chan and Esedoglu suggested that
the TV-L1 norm gives rise to a scale-space in which geometric features of a specific size
disappear at critical values of λ [28]. Note that under the graph cut minimization the fidelity
term is easily modified to any Lp norm, with large values of p placing more emphasis on
contrast and small p emphasizing geometry.
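As a sketch of this Lp family, the fidelity capacities can be tabulated directly for a grayscale image. The function name and the (level, row, column) array layout are illustrative choices, not from the thesis.

```python
import numpy as np

def fidelity_capacities(u0, levels, lam, p=1):
    """Capacities on the level edges of the graph,
    c(x_j, x_{j+1}) = lam * |I_j - u0(x)|**p for 1 <= j <= L.
    p = 1 gives the quantized TV-L1 model, p = 2 the classical fidelity;
    large p emphasizes contrast, small p emphasizes geometry."""
    levels = np.asarray(levels, dtype=float)
    # Result has shape (L, H, W): one capacity per level edge per pixel.
    return lam * np.abs(levels[:, None, None] - np.asarray(u0)[None, :, :]) ** p
```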
5.5.3 The 8-connected Topology
The anisotropic TV norm gives preference to edges parallel to the axes, resulting in rect-
angular images with sharp corners. Diagonal edges are “staircased” into square blocks. At
each interior pixel, the regularization term compares the pixel's value to the values of the
4 neighbors at distance one (the “cross” topology). The regularization weights can be made
more rotationally invariant by incorporating the diagonally connected neighbors at distance
√2. For a pixel x ∈ Ω and a diagonally connected neighbor y, define the regularization
capacity to be

c(xj, yj) = (Ij − Ij−1) / √2,   2 ≤ j ≤ L.
The regularization is still not truly isotropic, but the 8-connected topology will be less likely
to staircase diagonal edges.
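The 8-connected regularization weights can be tabulated as follows; the function name and offset-keyed layout are assumptions for illustration. Axis-aligned edges keep the capacity Ij − Ij−1 of the 4-connected construction, and the diagonal edges are scaled by 1/√2.

```python
import math

def regularization_weights(levels):
    """Regularization capacities c(x_j, y_j), 2 <= j <= L, keyed by
    neighbor offset (dy, dx). Only two of the four diagonal offsets are
    stored because the regularization edges are symmetric. Diagonal
    edges are scaled by 1/sqrt(2) for the longer inter-pixel distance."""
    offsets = {
        (0, 1): 1.0, (1, 0): 1.0,                        # 4-connected "cross"
        (1, 1): 1.0 / math.sqrt(2), (1, -1): 1.0 / math.sqrt(2),
    }
    return {
        off: [scale * (levels[j] - levels[j - 1]) for j in range(1, len(levels))]
        for off, scale in offsets.items()
    }
```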
For most images, the difference between the minimization under the 4- or 8-connected
topologies will be very small. The difference becomes apparent for image inpainting, as
in Figure 5.19. Quantized TV minimization under the 8-connected topology more closely
resembles the continuous-valued minimization of the isotropic TV norm. Under the 4-
connected topology, there are more geometric configurations with the same global minimum
energy for inpainting this domain. In this sense, the 8-connected minimization is less prone to non-uniqueness.
Figure 5.19: TV inpainting under the 8-connected topology. The inpainting domain is shown in red in the first image. Left to right: Original image, TV filter, 4-connected quantized TV, 8-connected quantized TV.
5.5.4 3-D Image Processing
The TV graph cut method extends naturally to 3D volumes by including regularization
links along the third z dimension. In addition to the 4 standard neighbors within each 2D
slice (up, down, left, right), add 2 edges moving forward and backward between the slices.
Because more links are added to each pixel, the value of λ should be smaller than in the
2D model to preserve the balance between the regularization and fidelity terms. Depending
on the application, the fidelity weights can be set to be anisotropic. For example, in a
video sequence of a fast-moving object the weights along the z dimension should be low.
Conversely in a video or volume where there is little change between slices but each image
slice contains fine structures, the fidelity weight should be higher in the z component than
along other directions. As in the last section, it is also possible to incorporate diagonal
elements in three dimensions, giving rise to a 24-connected topology.
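The 6-connected 3D topology described above can be written down as an offset table; the per-direction weights on these links and the function name are illustrative assumptions.

```python
def neighbor_offsets_3d(wx=1.0, wy=1.0, wz=1.0):
    """Offsets (dz, dy, dx) and weights for the 6-connected 3D topology:
    the 4 standard in-slice neighbors plus one edge forward and one
    backward between slices. A small wz suits video of a fast-moving
    object; a larger wz suits volumes that change little between slices."""
    return [
        ((0, 0, 1), wx), ((0, 0, -1), wx),   # left/right within a slice
        ((0, 1, 0), wy), ((0, -1, 0), wy),   # up/down within a slice
        ((1, 0, 0), wz), ((-1, 0, 0), wz),   # forward/backward between slices
    ]
```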
Figure 5.20: 3D quantized TV denoising of simple volumes with λ = 0.005, L = 2. The middle image slice is shown for comparison. Top row: 10x10x10 cube. Bottom row: Sphere of radius 8.
5.6 Applications of Binary TV Minimization
5.6.1 Barcode Image Processing
The ubiquitous barcode is a series of black and white stripes encoding information in the relative
widths of the bars. Although the most common barcode scanners read a signal with laser
optics, barcode images are also decoded with digital cameras to allow greater flexibility
in reading both linear and two-dimensional barcode symbologies. The ideal barcode for
decoding should be binary-valued, but the observed image is generally a grayscale image
corrupted by camera blur, hand jitter, electrical noise, speckle noise, and defects in the
original material such as stray marks on the paper.
The classical TV model has been shown to be effective for denoising and deblurring 1D
bilevel signals [45, 100]. But even after adding a penalty term to force black and white
values, the resulting signal is not strictly binary-valued. The signal could be thresholded
before being sent to the decoder, but this could introduce errors. The quantized TV model
solves this problem, while also smoothing the image to remove blur and noise. The CPU of
a typical barcode scanner has very limited memory and computing power, but in practice
the decoding process must be very fast. Barcode manufacturers generally require a runtime
less than 100 milliseconds with operations involving only integer arithmetic. Classical TV
minimization, such as gradient descent, is generally too slow and may have convergence
issues. The graph cut method can be implemented in polynomial time while using only
integer-valued variables.
In a barcode image with parallel vertical bars, one would expect the variation along the
y-direction to be very small. But the variation along the x-direction would be very large,
especially if the image is low-resolution and consists of very thin bars. This suggests making
the regularization weights anisotropic:
λ|∇u|1 = λx|ux| + λy|uy|.
If the bars are perfectly vertical, the value of λy would be very large and λx = 0. If the
orientation of the barcode is not vertical, the image could be rotated or the derivative uy
could be modified to trace tangent to the bars. It is safe to assume the orientation angle
is known because in practice the barcode orientation is the first characteristic of the image
that is identified by decoding software.
Figure 5.21 shows a UPC barcode synthetically distorted with both Gaussian blurring
and Gaussian additive noise. Thresholding the image at the median intensity value does not
recover well-defined bars. Quantized TV minimization with isotropic regularization weights
forms rectangles, but it omits the two thin bars at positions 160 and 190. Setting λx = 0
allows for greater variation along the x-direction and all bars are recovered.
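The anisotropic regularization term can be evaluated with forward differences; this is a small sketch with a hypothetical function name, not the graph cut solver itself.

```python
import numpy as np

def anisotropic_tv(u, lam_x, lam_y):
    """Anisotropic regularization lam_x*|u_x| + lam_y*|u_y| using forward
    differences. For an image of vertical bars, lam_x = 0 leaves the
    sharp transitions across the bars unpenalized while lam_y still
    smooths along the bars."""
    ux = np.diff(u, axis=1)   # horizontal (across-bar) differences
    uy = np.diff(u, axis=0)   # vertical (along-bar) differences
    return lam_x * np.abs(ux).sum() + lam_y * np.abs(uy).sum()
```

On a perfect two-bar image the energy with λx = 0 is zero, so vertical bars of any width incur no penalty.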
Stray marks or damaged regions of the barcode may render it undecodable if no line
remains that gives the proper signal. This is a common problem in the shipping industry
because routing directions are stamped onto the package, sometimes mistakenly across the
Figure 5.21: Quantized TV denoising of UPC barcode with additive Gaussian noise and Gaussian blur. Top: original image. 2nd row: thresholding at median intensity. 3rd row: quantized TV denoising with λ = 0.005, L = 2. Bottom: quantized TV denoising with anisotropic weights λy = 0.005, λx = 0, L = 2.
barcode label. Figure 5.22 shows the result of quantized TV inpainting to fill in the damaged
regions. Setting anisotropic regularization weights with λx = 0 gives a better center portion
of the image. Note that in the last image the black bars are extended too far on both the
top and bottom within damaged regions. This would not adversely affect the decoding
since only the middle portion of the barcode is sent to the decoder. In this example the
inpainting domain D was known, but theoretically the damaged areas could be determined
by calculating regions that do not match the local orientation of the barcode [5].
The traditional approach to barcode imaging is to repeatedly run rows of the image,
called scanlines, through the decoder until a signal decodes. The graph cut method of
course works on 1D signals as well as 2D images, but using the 2D information could allow
Figure 5.22: Quantized TV inpainting of damaged barcode. Left: original image with damaged area shown in red. Center: TV inpainting with λ = 0.1, L = 2. Right: TV inpainting with anisotropic weights λy = 0.1, λx = 0, L = 2.
Figure 5.23: Quantized TV denoising of a barcode projected signal with λ = 10, L = 2.
for the creation of a better single scanline. In Chapter 4, we described how to form a
high-resolution signal by projecting multiple scanlines onto the same axis. The resulting
projected signal is very noisy, but after smoothing the result is potentially better than any
scanline available from the original image. Figure 5.23 shows the result of quantized TV
minimization on such a projection signal.
5.6.2 Enhancement for Text Recognition
Text images are very sensitive to changes in image size, a phenomenon familiar to academics
in preparing figures for reports. The underlying text should be binary-valued, but an im-
age corrupted by camera blur, compression artifacts, and poor interpolation will appear
grayscale. Recovering the black-and-white text is crucial for automatic text recognition.
Most optical character recognition (OCR) systems require a strictly binary image before
decoding begins. Developing binarization algorithms specifically for text is an active re-
search area for the OCR community.
A common binarization strategy is to define a local threshold T (i, j) at each pixel (i, j) ∈
Ω. The binary image u is then
u(i, j) = 0 if u0(i, j) ≤ T(i, j),  and  u(i, j) = 1 if u0(i, j) > T(i, j).
Niblack suggested calculating the local threshold T (i, j) using the local mean µ and standard
deviation σ of the gray values in the b× b window centered over pixel (i, j) [77]. For a fixed
parameter k and odd integer b specifying the window size, Niblack’s method is given by
T(i, j) = µb×b(i, j) + k σb×b(i, j).
Based on numerical evidence, Trier and Jain suggested the optimal values for 8-bit text
images are k = −0.2 and b = 15 [94]. Later, Sauvola and Pietaksinen suggested the
following modification to Niblack’s method
T(i, j) = µb×b(i, j) [1 + k (σb×b(i, j)/R − 1)]
where R is a fixed parameter. The authors suggest the values k = 0.5, R = 128, and b = 15
[87]. Two independent surveys of document binarization techniques concluded that the
modified Niblack method is the best strategy for preparing text for OCR systems [88, 95].
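Both local thresholds are easy to state in code. The sketch below computes the local statistics with a padded sliding window; the helper names are assumptions, and the parameter defaults follow the values quoted above (k = −0.2, b = 15 for Niblack; k = 0.5, R = 128, b = 15 for the modification).

```python
import numpy as np

def local_stats(u0, b):
    """Mean and standard deviation over the b x b window centered at each
    pixel (b odd), computed via an edge-padded sliding window."""
    pad = b // 2
    up = np.pad(np.asarray(u0, dtype=float), pad, mode="edge")
    wins = np.lib.stride_tricks.sliding_window_view(up, (b, b))
    return wins.mean(axis=(-2, -1)), wins.std(axis=(-2, -1))

def niblack(u0, b=15, k=-0.2):
    """Niblack's method: threshold T = mu + k*sigma per pixel."""
    mu, sigma = local_stats(u0, b)
    return (u0 > mu + k * sigma).astype(np.uint8)

def sauvola(u0, b=15, k=0.5, R=128.0):
    """Modified Niblack (Sauvola): T = mu * (1 + k*(sigma/R - 1))."""
    mu, sigma = local_stats(u0, b)
    return (u0 > mu * (1 + k * (sigma / R - 1))).astype(np.uint8)
```

On a tiny test image with a single bright pixel, both methods keep only that pixel white, illustrating the sensitivity of local thresholds to small gray-value variations noted below.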
The current state of the art in OCR can detect text as small as 5 pixels high and properly
decode text 7 pixels high. Most OCR software requires larger images, so the text needs to be
zoomed as well as binarized. The most common strategy for OCR software is to interpolate
using bicubic zooming, followed by binarization using Niblack’s original method [21]. Using
the quantized TV energy, the zooming and binarization processes can be combined into one
step, while also deblurring the given image. This can produce large binary text images that
are more pleasing visually. But it is unclear if this will improve OCR performance, as the
existing systems are built around the bicubic-Niblack combination.
Figure 5.24 compares TV quantization of a text image to Niblack’s method and its
modification. Note that the local thresholding methods place black dots in clearly white
regions, because even small variations in the gray values result in thresholding to the larger
binary value. Using the image min and max for intensity levels, the first iteration of TV
minimization produces an unacceptable image. Updating the intensity values as in Section
5.5.1 converges to a much better image in 8 iterations. The original image was 8 pixels
high, so the bicubic zooming was possibly unnecessary. Niblack’s method actually performs
better on the original image than on the zoomed image. Figure 5.25 shows the result on a
smaller 6 pixel high image, where zooming is probably necessary. The TV result is not as
clean as before, but it picks up some features better than the local thresholding methods.
Notably, the dot in the “i” is more distinct in the TV image.
Figure 5.24: Quantized TV zooming of large text. Top row: original image, 2x bicubic zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized 2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix.
Figure 5.25: Quantized TV zooming of small text. Top row: original image, 2x bicubic zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized 2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix.
5.6.3 Medical Image Segmentation
Quantized segmentation in medical imaging is helpful in clearly identifying different bio-
logical tissues, defining the different regions by intensity level. This allows for automated
quantitative shape analysis, e.g. tracking the volume of gray matter in a brain or the
geometry of a tumor [6]. Figure 5.26 shows binary TV segmentation of Computerized To-
mography (CT) and Magnetic Resonance (MR) brain images. In each image, 3 seed pixels
were selected to identify the background, dark tissue, and light tissue. In the CT image, the
white region indicates bone. In the MR image, the white region indicates fat tissue (lipids).
The TV results are very good, but this is partly because the images are ideal. Both images
were provided by the Visible Human Project, so the images were of high quality and already
smoothed by image processing algorithms.
Figure 5.26: Quantized TV segmentation of ideal brain images with λ = 0.1, L = 3. Left 2: CT image. Right 2: MR image.
An actual acquired MR image is corrupted by noise, contains textured regions, and
generally has low, spatially varying contrast. Figure 5.27 shows how this change
in contrast complicates the segmentation process. Attempting to segment the entire image
results in a large black region in the center corresponding to the low-contrast region in the
original image. One solution is to segment the image in blocks, adjusting the quantization
levels for the contrast within each block. Another possible solution is to equalize the image
contrast, such as using the MR super-resolution presented in Chapter 4.
Figure 5.27: Quantized TV segmentation of low-contrast MR brain image. Left 2: Segmentation of entire brain with λ = 50, L = 2. Right 2: Segmentation of region indicated in first image with λ = 200, L = 2.
Bibliography
[1] A. Almansa, V. Caselles, G. Haro, and B. Rouge. “Restoration and zoom of irregularly
sampled, blurred and noisy images by accurate total variation minimization with local
constraints.” Multiscale Model. Simul., 5: 235-272, 2006.
[2] L. Alvarez, F. Guichard, P.L. Lions, and J.M. Morel. “Axioms and fundamental equa-
tions of image processing.” Arch. Rational Mech. Anal., 123: 199-257, 1993.
[3] L. Ambrosio. “A compactness theorem for a new class of functions of bounded varia-
tion.” Boll. Un. Mat. Ital., 3: 857-881, 1989.
[4] L. Ambrosio and V.M. Tortorelli. “Approximation of functionals depending on jumps
by elliptic functional via Γ-convergence.” Comm. Pure Appl. Math., 43: 999-1036,
1990.
[5] S. Ando and H. Hontani. “Automatic visual searching and reading of barcodes in 3-D
scene.” Proc. IEEE Vehicle Electronics Conf., p. 49-54, 2001.
[6] S. Angenent, E. Pichon, and A. Tannenbaum. “Mathematical methods in medical image
processing.” Bulletin of the American Mathematical Society, 43(3): 365-396, 2006.
[7] G. Aubert and P. Kornprobst. Mathematical Problems in Image Processing. Springer-
Verlag, New York, 2001.
[8] S. Baker and T. Kanade. “Limits on super-resolution and how to break them.” IEEE
Trans. Pattern Analysis and Machine Intelligence, 24: 1167-1183, 2002.
[9] J. Bect, L. Blanc-Feraud, G. Aubert, and A. Chambolle. “A l1-unified variational
framework for image restoration.” Proc. Euro. Conf. on Computer Vision, Springer-
Verlag LNCS 3024: 1-13, 2004.
[10] A. Belahmidi. PDEs Applied to Image Restoration and Image Zooming. PhD thesis,
Universite de Paris XI Dauphine, 2003.
[11] A. Belahmidi and F. Guichard. “A partial differential equation approach to image
zoom.” Proc. Int. Conf. on Image Processing, 2004.
[12] M. Bertalmio, A. Bertozzi, and G. Sapiro. “Navier-Stokes, fluid dynamics, and image
and video inpainting.” Proc. IEEE Conf. on Computer Vision and Pattern Recognition,
p. 355-362, 2001.
[13] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher. “Simultaneous structure and texture
inpainting.” Proc. IEEE Conf. Computer Vision and Pattern Recognition p. 707-720,
2003.
[14] A. Bertozzi, S. Esedoglu, and A. Gillette. “Inpainting of binary images using the Cahn-
Hilliard equation.” IEEE Trans. Image Processing, to appear.
[15] Y. Boykov and M.-P. Jolly. “Interactive graph cuts for optimal boundary and region
segmentation of objects in N-D images.” Proc. Int. Conf. Computer Vision, p. 105-112,
2001.
[16] Y. Boykov and V. Kolmogorov. “An experimental comparison of min-cut/max-flow
algorithms for energy minimization in vision.” IEEE Trans. Pattern Anal. and Machine
Intelligence, 26: 1124-1137, 2004.
[17] Y. Boykov, O. Veksler, and R. Zabih. “Fast approximate energy minimization via graph
cuts.” IEEE Trans. Pattern Anal. and Machine Intelligence, 23: 1222-1239, 2001.
[18] A. Braides. Γ-convergence for Beginners. Oxford Lecture Series in Mathematics, No.
22, 2002.
[19] A. Buades, B. Coll, and J.M. Morel. “A review of image denoising methods, with a
new one.” Multiscale Model. Simul., 4: 490-530, 2005.
[20] A. Buades, B. Coll, and J.M. Morel. “The staircasing effect in neighborhood filters and
its solution.” IEEE Trans. Image Processing, 15: 1499-1505, 2006.
[21] D. Capel and A. Zisserman. “Super-resolution of text image sequences.” Proc. Int.
Conf. on Pattern Recognition, 2000.
[22] D. Capel and A. Zisserman. “Computer vision applied to super resolution.” IEEE
Signal Processing Mag., 2003.
[23] K. Carey, D. Chuang, and S. Hemami. “Regularity-preserving image interpolation.”
IEEE Trans. Image Processing, 8: 1293-1297, 1999.
[24] V. Caselles, J.M. Morel, and C. Sbert. “An axiomatic approach to image interpolation.”
IEEE Trans. Image Processing, 7: 376-386, 1998.
[25] Y. Cha and S. Kim. “Edge-forming methods for image zooming.” Proc. IEEE Conf.
Computer Vision and Pattern Recognition, p. 275-282, 2004.
[26] Y. Cha and S. Kim. “Edge-forming methods for color image zooming.” IEEE Trans.
Image Processing, 15: 2315-2323, 2006.
[27] A. Chambolle. “Total variation minimization and a class of binary MRF models.”
Proc. Int. Workshop on Energy Minimization Methods in Computer Vision and Pattern
Recognition, p. 136-152, 2005.
[28] T.F. Chan and S. Esedoglu. “Aspects of total variation regularized L1 function ap-
proximation.” SIAM J. Appl. Math., 65: 1817-1837, 2005.
[29] T.F. Chan, S. Esedoglu, and M. Nikolova. “Algorithms for finding global minimizers
of image segmentation and denoising models.” SIAM J. Appl. Math., to appear.
[30] T.F. Chan and S.H. Kang. “An error analysis on image inpainting problems.” J. Math.
Imaging and Vision, to appear.
[31] T.F. Chan and J. Shen. “Mathematical models for local nontexture inpainting.” SIAM
J. Appl. Math., 62: 1019-1043, 2002.
[32] T.F. Chan and J. Shen. Image Processing and Analysis: Variational, PDE, Wavelet,
and Stochastic Methods. SIAM Press, Philadelphia, PA, 2005.
[33] T.F. Chan, S. Osher, and J. Shen. “The digital TV filter and nonlinear denoising.”
IEEE Trans. Image Processing, 10: 231-241, 2001.
[34] T.F. Chan and C.K. Wong. “Total variation blind deconvolution.” IEEE Trans. Image
Processing, 7: 370-375, 1998.
[35] H. Chang, D.-Y. Yeung, and Y. Xiong. “Super-resolution through neighbor embed-
ding.” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, p. 275-282,
2004.
[36] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cam-
bridge, MA, 1995.
[37] A. Criminisi, P. Perez, and K. Toyama. “Object removal by exemplar-based inpainting.”
Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2: 721-728, 2003.
[38] J. Darbon. “Total variation minimization with L1 data fidelity as a contrast invariant
filter.” Proc. Int. Symp. Image and Signal Processing and Anal., 2005.
[39] J. Darbon and M. Sigelle. “A fast and exact algorithm for total variation minimization.”
Proc. Iberian Conf. Pattern Recognition and Image Anal., p. 351-359, 2005.
[40] P. Davis. “Mathematics and Imaging.” Mathematical Awareness Week Theme Essay,
1998. Online at http://www.mathaware.org/mam/98/articles/theme.essay.html.
[41] I. Daubechies. Ten Lectures on Wavelets. SIAM Press, Philadelphia, PA, 1992.
[42] D. Dobson and F. Santosa. “Recovery of blocky images from noisy and blurred data.”
SIAM J. Appl. Math., 56: 1181-1198, 1996.
[43] J. Edmonds and R. Karp. “Theoretical improvements in the algorithmic efficiency for
network flow problems.” J. of the ACM, 19: 248-264, 1972.
[44] A.A. Efros and T.K. Leung. “Texture synthesis by non-parametric sampling.” Proc.
IEEE Int. Conf. on Computer Vision, p. 1033-1038, 1999.
[45] S. Esedoglu. “Blind deconvolution of barcode signals.” Inverse Problems, 20: 121-135,
2004.
[46] S. Esedoglu and J. Shen. “Digital inpainting based on the Mumford-Shah-Euler image
model.” European J. Appl. Math., 13: 353-370, 2002.
[47] L. Evans. Partial Differential Equations. AMS Press, Providence, RI, 2000.
[48] S. Farsiu, M. Elad, and P. Milanfar. “Advances and challenges in super-resolution.”
Int. J. Imaging Systems Technology, 14: 47-57, 2004.
[49] S. Farsiu, M. Elad, and P. Milanfar. “Multi-Frame demosaicing and super-resolution
of color images.” IEEE Trans. Image Processing, 15: 141-159, 2006.
[50] L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, Prince-
ton, NJ, 1962.
[51] W. Freeman, T. Jones, and E. Pasztor. “Example-based super-resolution.” MERL
Technical Report, TR 2001-30, 2001.
[52] G. Gilboa, N. Sochen, and Y.Y. Zeevi. “Texture preserving variational denoising using
an adaptive fidelity term.” Proc. Conf. on Geometric and Level Set Methods, p. 137-
144, 2003.
[53] D. Gleich. Matlab Boost Graph Library. Software online at
www.stanford.edu/~dgleich/programs/matlab_bgl/
[54] A. Goldberg and R. Tarjan. “A new approach to the maximum flow problem.” Proc.
18th Annual ACM Sym. on Theory of Computing, p. 136-146, 1986.
[55] R. Gonzalez, R. Woods, and S. Eddins. Digital Image Processing Using Matlab. Pearson
Prentice Hall, Upper Saddle River, NJ, 2004.
[56] U. Grenander. “Toward a theory of natural scenes.” Brown Technical Report, 2003.
[57] F. Guichard and J.M. Morel. Image Analysis and PDE’s. IPAM GBM Tutorial, March
2001.
[58] R. Hardie, K. Barnard, and E. Armstrong. “Joint MAP registration and high-resolution
image estimation using a sequence of undersampled images.” IEEE Trans. Image Pro-
cessing, 6: 1621-1633, 1997.
[59] H. He and L. Kondi. “An image super-resolution algorithm for different error levels per
frame.” IEEE Trans. Image Processing, 15: 592-603, 2006.
[60] T. Huang and R. Tsai. “Multi-frame image restoration and registration.” Adv. Com-
puter Vision and Image Processing, 1: 317-339, 1984.
[61] M. Irani and S. Peleg. “Improving resolution by image registration.” Graphical Models
and Image Processing, 53: 231-239, 1991.
[62] E. Jonsson, S. Huang, and T. Chan. “Total Variation Regularization in Positron Emis-
sion Tomography.” UCLA CAM Report, 98-48, 1998.
[63] B. Julesz. “Textons, the elements of texture perception and their interactions.” Nature,
290, 1981.
[64] R. Keys. “Cubic convolution interpolation for digital image processing.” IEEE Trans.
Acoustic, Speech, and Signal Processing, 29: 1153-1160, 1981.
[65] S. Kindermann, S. Osher, and P. Jones. “Deblurring and denoising of images by non-
local functionals.” Multiscale Model. Simul., 4: 1091-1115, 2005.
[66] V. Kolmogorov and R. Zabih. “What energy functions can be minimized via graph
cuts?” IEEE Trans. Pattern Anal. and Machine Intelligence, 26: 147-159, 2004.
[67] E. Larsson, D. Erdogmus, R. Yan, J. Principe, and J. Fitzsimmons. “SNR optimality
of sum-of-squares reconstruction for phased-array magnetic resonance imaging.” J. of
Magnetic Resonance, 163: 121-123, 2003.
[68] J. Lie, M. Lysaker, and X.C. Tai. “A binary level set model and some applications to
Mumford-Shah image segmentation.” IEEE Trans. Image Processing, 15: 1171-1181,
2006.
[69] Z. Lin and H.-Y. Shum. “Fundamental limits of reconstruction based superresolution
algorithms under local translation.” IEEE Trans. Pattern Anal. and Machine Intelli-
gence, 26: 1-15, 2004.
[70] H. Lombaert, Y. Sun, L. Grady, and C. Xu. “A multilevel banded graph cuts method
for fast image segmentation.” Proc. IEEE Conf. on Computer Vision, p. 259-265, 2005.
[71] F. Malgouyres. Increase in the resolution of digital images: Variational theory and
applications. PhD thesis, Ecole Normale Superieure de Cachan, 2000.
[72] F. Malgouyres and F. Guichard. “Edge direction preserving image zooming: A math-
ematical and numerical analysis.” SIAM J. Numer. Anal., 39: 1-37, 2001.
[73] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, New York, 1998.
[74] D. Mumford. “The Bayesian rationale for energy functionals.” In Geometry Driven
Diffusion in Computer Vision, Kluwer Academic, p. 141-153, 1994.
[75] D. Mumford and J. Shah. “Optimal approximations by piecewise smooth functions and
associated variational problems.” Comm. Pure Appl. Math., 42: 577-685, 1989.
[76] N. Nguyen and P. Milanfar. “A wavelet-based interpolation-restoration method for
super-resolution.” Circuits, Systems, and Signal Processing, 19: 321-338, 2000.
[77] W. Niblack. An Introduction to Digital Image Processing. Prentice Hall, Upper Saddle
River, NJ, 1986.
[78] M. Nikolova. “Estimation of binary images by minimizing convex criteria.” Proc. Int.
Conf. Image Processing, p. 108-112, 1998.
[79] S. Osher and J.A. Sethian. “Fronts propagating with curvature-dependent speed: Al-
gorithms based on Hamilton-Jacobi formulations.” J. Comput. Physics, 79: 12-49,
1988.
[80] P. Perona and J. Malik. “A scale space and edge detection using anisotropic diffusion.”
Proc. IEEE Workshop on Computer Vision, p. 16-22, 1987.
[81] A. Raj and R. Zabih. “A graph cut algorithm for generalized image deconvolution.”
Proc. IEEE Int. Conf. on Computer Vision, p. 1-7, 2005.
[82] D. Robinson and P. Milanfar. “Statistical Performance Analysis of Super-Resolution.”
IEEE Trans. Image Processing, 15: 1413-1428, 2006.
[83] L. Rudin, S. Osher, and E. Fatemi. “Nonlinear total variation based noise removal
algorithms.” Physica D, 60: 259-268, 1992.
[84] B. Russell. “Exploiting the sparse derivative prior for super-resolution.” M.S. thesis,
MIT, 2003.
[85] G. Sapiro and D. Ringach. “Anisotropic diffusion of multi-valued images with applica-
tions to color filtering.” IEEE Trans. Image Processing, 5: 1582-1586, 1996.
[86] L. Saul and S. Roweis. “Think globally, fit locally: Unsupervised learning of low di-
mensional manifolds.” J. Machine Learning Research, 4: 119-155, 2003.
[87] J. Sauvola and M. Pietaksinen. “Adaptive document image binarization.” Pattern
Recognition, 33: 225-236, 2000.
[88] M. Sezgin and B. Sankur. “Survey over image thresholding techniques and quantitative
performance evaluation.” J. Electronic Imaging, 13: 146-165, 2004.
[89] R. Schultz and R. Stevenson. “Extraction of high-resolution frames from video se-
quences.” IEEE Trans. Image Processing, 5: 996-1011, 1996.
[90] J. Shen. “Γ-convergence approximation to piecewise constant Mumford-Shah segmen-
tation.” Proc. Int. Conf. Advanced Concepts in Intelligent Vision Systems, p. 499-506,
2005.
[91] J. Shen. “A stochastic-variational model for soft Mumford-Shah segmentation.” Int. J.
Biomedical Imaging, 2006: ID 92329, 2006.
[92] E. Simoncelli and J. Portilla. “Texture characterization via second-order statistics of
wavelet coefficient amplitudes.” Proc. 5th IEEE Conf. Image Processing, 1998.
[93] A. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems. Winston and Sons,
Washington D.C., 1977.
[94] O.D. Trier and A.K. Jain. “Goal-directed evaluation of binarization methods.” IEEE
Trans. Pattern Anal. and Machine Intelligence, 17: 1191-1201, 1995.
[95] O.D. Trier and T. Taxt. “Evaluation of binarization methods for document images.”
IEEE Trans. Pattern Anal. and Machine Intelligence, 17: 312-315, 1995.
[96] A. Tsai, A. Yezzi, and A. Willsky. “Curve evolution implementation of the Mumford-
Shah functional for image segmentation, denoising, interpolation and magnification.”
IEEE Trans. Image Processing, 10: 1169-1186, 2001.
[97] C. Vogel. Computational Methods for Inverse Problems. SIAM Press, Philadelphia,
2002.
[98] L. Wang and K. Mueller. “Generating sub-resolution detail in images and volumes using
constrained texture synthesis.” Proc. IEEE Conf. on Visualization, p. 75-82, 2004.
[99] D. West. Introduction to Graph Theory. Prentice Hall, Upper Saddle River, NJ, 1996.
[100] T. Wittman. “Lost in the supermarket: Decoding blurry barcodes.” SIAM News, 37,
2004.