UNIVERSITY OF MINNESOTA
This is to certify that I have examined this copy of a doctoral thesis by
TODD CAMERON WITTMAN
and have found that it is complete and satisfactory in all respects, and that any and all
revisions required by the final examining committee have been made.
DEPARTMENT OF MATHEMATICS
UNIVERSITY OF MINNESOTA
VARIATIONAL APPROACHES TO DIGITAL IMAGE ZOOMING
A THESIS
SUBMITTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
TODD CAMERON WITTMAN
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
August 2006
© Copyright 2006 TODD WITTMAN
Acknowledgments
I would like to thank my advisor Prof. Fadil Santosa for his patience, guidance, and
mentorship throughout my long stay at the University of Minnesota. It means a lot to me to
have someone looking out for me. I owe a lot to the faculty at the University of Minnesota,
especially Prof. Jackie Shen who helped me as a teacher, friend, and tour guide in China.
I would also like to thank Prof. Song-Chun Zhu of UCLA for inviting me to the Lotus
Hill Workshop, where I started my super-resolution work. Some of my photos from that
workshop appear in Chapter 4. Much of the barcode processing work was done as part of
an industrial internship under the supervision of Dr. Miroslav Trajkovic. The pictures of
human brains and advice on working with medical images were provided by my friend Dr.
Steen Moeller at the Center for Magnetic Resonance Research. I also need to thank my
PhD committee for actually reading this thesis and making constructive comments: Gilad
Lerman, Willard Miller, and Ravi Janardan.
As preparation for this thesis, I investigated several computational strategies and have
adapted software from various sources. Thanks to the authors who provided this code
by placing it online or sending it directly to me. Pilfered code includes Jackie Shen’s Γ-
convergence routine, Antonin Chambolle’s quantized TV minimization, Yuri Boykov’s graph
cut package, Guy Gilboa’s locally adaptive TV norm, Stanford’s BGL graph algorithm
library, and Miroslav Trajkovic’s barcode decoding software.
I would also like to thank my brothers, Scott and Andy, and my parents, Mimi and
Paul, for their support.
Finally I would like to thank you, gentle reader, for reading this thesis and the acknowl-
edgments that preface it. I hope you enjoy reading this as much as I enjoyed writing it.
Hopefully more.
Abstract
The purpose of this thesis is to discuss digital image resolution enhancement by varia-
tional methods and the associated computational issues. Two problems related to the basic
zooming problem are also studied: super-resolution and quantized deconvolution.
Digital zooming is important for mundane computing activities such as web browsing
as well as sophisticated applications like satellite imagery and medical diagnosis. Unfortunately,
zooming is an ill-posed mathematical problem and the linear filters common in
imaging software are often not adequate for the task. Other interpolation approaches in-
clude wavelets, PDEs, machine learning, and statistical filters, but variational methods offer
computational advantages in the application and flexibility of the models. We discuss the
theoretical and computational issues surrounding variational zooming, focusing on the Total
Variation (TV) and Mumford-Shah energies. The variational inpainting model is very
flexible and the interpolated result can be improved with energy modifications, including
locally adaptive fidelity weights, soft inpainting, and post-processing.
Super-resolution refers to the process of producing a single high-resolution image from
a set of low-resolution images such as a video sequence. Variational inpainting extends
naturally to the multiple-image case and is shown to be effective for video enhancement,
barcode processing, and MR image reconstruction. We propose a soft inpainting model to
handle local variation and motion within a video sequence.
Text and barcode images should appear as strictly binary-valued images, but due to
blurring and downsampling the actual image takes on many gray values and may be unreadable
by recognition systems. Given a blurred grayscale image, the goal of quantized
zooming is to produce a clean, high-resolution image taking on only a limited number of
gray values. The graph cut method has proven successful for exact minimization of the
quantized TV energy. We show the graph cut method is effective for denoising, segmenta-
tion, and inpainting, but deconvolution is an open problem in the literature. We propose
an alternating minimization method for deblurring that combines graph cuts and numeri-
cal relaxation inspired by linear programming. For the zooming problem, the approach is
improved by the addition of local gradient information. We provide numerical results for
barcode imaging, text enhancement, and medical image segmentation.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction 1
1.1 The Digital Zooming Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Organization and Contributions of this Thesis . . . . . . . . . . . . . . . . . 5
2 Survey of Zooming Approaches 7
2.1 Linear Interpolation Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Which Methods to Consider? . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 A PDE-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Heat Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 A Multiscale Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Wavelets and Hölder Regularity . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Wavelet-Based Interpolation . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 A Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Locally Linear Embedding (LLE) . . . . . . . . . . . . . . . . . . . . 20
2.5.2 LLE-Based Interpolation . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 A Statistical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.1 Local vs. Global Interpolation . . . . . . . . . . . . . . . . . . . . . 27
2.6.2 NL-Means Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.3 NL-Means Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.5 Further Research on NL-Means Interpolation . . . . . . . . . . . . . 36
2.7 Summary and Motivation for the Variational Approach . . . . . . . . . . . 38
3 Variational Zooming 40
3.1 Introduction to the Variational Approach . . . . . . . . . . . . . . . . . . . 40
3.2 The Total Variation (TV) Energy . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Theory and Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Numerical Computation: The Digital TV Filter . . . . . . . . . . . . 46
3.3 The Mumford-Shah Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 Theory and Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.2 Numerical Computation: The Γ-Convergence Approximation . . . . 53
3.4 Numerical Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Modifications to the Inpainting Model . . . . . . . . . . . . . . . . . . . . . 60
3.5.1 Incorporating a Blur Kernel . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2 Locally Adaptive Fidelity Weights . . . . . . . . . . . . . . . . . . . 61
3.5.3 Soft Inpainting with Nearest Neighbor Information . . . . . . . . . . 64
3.5.4 Variational Zooming as Post-Processing . . . . . . . . . . . . . . . . 67
4 Variational Super-resolution 71
4.1 Super-resolution of an Image Sequence . . . . . . . . . . . . . . . . . . . . . 71
4.2 Super-resolution by Variational Inpainting . . . . . . . . . . . . . . . . . . . 74
4.2.1 Data Fusion with Known Registration . . . . . . . . . . . . . . . . . 74
4.2.2 Simultaneous Registration and Fusion . . . . . . . . . . . . . . . . . 80
4.3 Artifact Reduction by Soft Inpainting . . . . . . . . . . . . . . . . . . . . . 83
4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1 Video Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.2 Barcode Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.3 Reconstruction from MRI Sensor Data . . . . . . . . . . . . . . . . . 100
5 Quantized Zooming 108
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.1.1 Quantized Image Processing and the Quantized TV Energy . . . . . 108
5.1.2 Previous Work on Quantized Image Processing . . . . . . . . . . . . 109
5.2 Quantized TV Minimization by Graph Cuts . . . . . . . . . . . . . . . . . . 110
5.2.1 Network Flows: Definitions . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.2 Network Flows: Algorithms . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.3 The Quantized TV Model . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Application to Low-Level Vision Tasks . . . . . . . . . . . . . . . . . . . . . 122
5.3.1 Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.3 Texture Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3.4 Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.5 Zooming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.4 Quantized TV Minimization with a Blur Kernel . . . . . . . . . . . . . . . . 132
5.4.1 Deblurring by Numerical Relaxation . . . . . . . . . . . . . . . . . . 133
5.4.2 Zooming Using Local Gradient Information . . . . . . . . . . . . . . 136
5.5 Extensions of the Quantized TV Model . . . . . . . . . . . . . . . . . . . . 142
5.5.1 Determining Intensity Levels . . . . . . . . . . . . . . . . . . . . . . 142
5.5.2 The TV-L1 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.5.3 The 8-connected Topology . . . . . . . . . . . . . . . . . . . . . . . . 145
5.5.4 3-D Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.6 Applications of Binary TV Minimization . . . . . . . . . . . . . . . . . . . . 147
5.6.1 Barcode Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6.2 Enhancement for Text Recognition . . . . . . . . . . . . . . . . . . . 151
5.6.3 Medical Image Segmentation . . . . . . . . . . . . . . . . . . . . . . 154
Bibliography 156
List of Figures
1.1 Actual vs. effective resolution. Left: Original image. Center: nearest neigh-
bor zoom. Right: zoom using Navier-Stokes inpainting from [12]. The two
zoomed images have the same number of pixels. . . . . . . . . . . . . . . . . 2
2.1 Part of Lena image downsampled and then upsampled by factor M = 2. . . 8
2.2 PDE-based upsampling with zoom M = 3 and time step δt = 0.1. Top Row:
Magnification of Miller image at time T = 9. Bottom Row: Close-up of
section of image and comparison to linear filters. . . . . . . . . . . . . . . . 14
2.3 PDE-based upsampling of text image with zoom M = 5 and time step δt =
0.1. Left: original image. Right: zoomed image at time T = 25. . . . . . . . 15
2.4 Discrete 3-level wavelet decomposition of noisy sine wave signal. . . . . . . 17
2.5 4-level Wavelet decomposition of noisy step function f . . . . . . . . . . . . . 18
2.6 LLE dimensionality reduction. Left: original 3D spherical data set. Right:
2D data set computed by LLE. . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Training set used for LLE-based interpolation. . . . . . . . . . . . . . . . . 24
2.8 LLE-based interpolation of a face image with zoom M = 3 and training set
shown in Figure 2.7. Left: original image. Right: LLE interpolated image. . 25
2.9 Close-up of eye in Figure 2.8 with zoom M = 3. Top left: nearest neigh-
bor. Top right: bilinear. Bottom left: bicubic. Bottom right: LLE-based
interpolation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10 Text image interpolated by LLE with zoom M = 3. The training set in
Figure 2.7 was used. Top: original image. Bottom: LLE interpolated image. 27
2.11 NL-means denoising on part of the Lena image. Left: noisy image. Right:
image after NL-means denoising. Taken from [19]. . . . . . . . . . . . . . . 30
2.12 Illustration of M -neighborhoods for a 3x3 pixel square topology. . . . . . . 32
2.13 Interpolation of Brodatz fabric texture with zoom M = 3. Left: Original
image. Center: Bicubic interpolation. Right: NL-means interpolation. . . . 33
2.14 NL-means interpolation of ringed image with zoom M = 3 compared to linear
interpolation filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.15 Interpolation of a noisy image by factor M = 3. The image at bottom-right
underwent bicubic interpolation followed by NL-means denoising. . . . . . . 35
2.16 NL-means interpolation of textured image with zoom M = 4. Left: original
image. Center: Bicubic interpolation. Right: NL-means interpolation. . . . 36
2.17 NL-means zooming of portion of MR brain image. Top: original MRI.
Bottom-left: lower left corner of brain. Bottom-right: NL-means zoom with
M = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Illustration of zooming by variational inpainting for magnification M = 3. . 42
3.2 Inpainting a simple image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 TV and Mumford-Shah zoom of checkerboard image for magnification M =
3. The fourth column is a detail view of the image in the third column. . . 56
3.4 Zoom of color image with M = 4. . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Zoom of MRI brain image with M = 3. . . . . . . . . . . . . . . . . . . . . 59
3.6 2x TV zooming of noisy image with locally adaptive fidelity weights. . . . . 63
3.7 Effect of σ on Mumford-Shah soft inpainting with λ = 20, γ = 2000, M = 5. 66
3.8 Comparison of zooming using standard and soft Mumford-Shah inpainting
with λ = 20, γ = 2000, σ = 1, M = 5. . . . . . . . . . . . . . . . . . . . . . 67
3.9 Different possible inpainting masks for a single image with magnification
M = 5. Left to right: original image, standard inpainting mask, average of
soft inpainting mask, Laplacian post-processing mask. . . . . . . . . . . . . 68
3.10 Comparison of standard variational zooming and post-processing methods
with magnification M = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.11 Zooming by magnification factor M = 2π using Mumford-Shah post-processing. 70
4.1 Illustration of image registration for super-resolution. The three images
u1, u2, u3 are aligned to a common high-resolution lattice ΩM by the re-
spective geometric transformations ϕ1, ϕ2, ϕ3. . . . . . . . . . . . . . . . . . 73
4.2 Super-resolution of 5-image sequence. Top left: original third image in se-
quence. Top right: 4x TV SR with λ = 20. Bottom left: 4x MS SR with
λ = 20, γ = 2000. Bottom right: 4x MS SR with registration incorrect by
1/2 pixel on low-resolution lattice. . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 4x color image zoom of 5-image sequence with known registration. Top row:
nearest neighbor, bilinear, bicubic. Bottom row: staircased bicubic, median
image, MS SR with λ = 20, γ = 2000. . . . . . . . . . . . . . . . . . . . . . 79
4.4 Super-resolution of 11-frame video sequence with known registration. Top
row: 4 frames from original sequence. Bottom row: corresponding 4 frames
from 4x MS SR with λ = 20, γ = 2000. . . . . . . . . . . . . . . . . . . . . . 80
4.5 Super-resolution video sequence with known and unknown registration. Top:
one frame from original 11-frame sequence. Center: 4x MS SR using ground-
truth registration. Bottom: 4x MS SR with simultaneous translational reg-
istration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 Artifact reduction on three frames of 7-frame video sequence. Top row: orig-
inal video frames. Center row: 2x MS SR with λ = 5, γ = 2000. Bottom
row: 2x MS SR with soft inpainting σ = 10. . . . . . . . . . . . . . . . . . . 89
4.7 Frame from traffic video of intersection in Karlsruhe. The four highlighted
cars were tracked for super-resolution enhancement. . . . . . . . . . . . . . 91
4.8 Super-resolution of four 11-frame sections of video in Figure 4.7. Left to right:
original base frame, 4x bicubic zoom, 4x MS SR with λ = 5 and γ = 2000,
4x MS SR with de-interlacing. . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.9 Tested scanlines on a barcode image. . . . . . . . . . . . . . . . . . . . . . . 94
4.10 Three degrees of freedom in barcode rotation. . . . . . . . . . . . . . . . . . 95
4.11 Creating a projected signal u(t) from a barcode image u0(x, y). Left: pro-
jection with parallel bars (roll). Right: projection from focal point F for
non-parallel bars (pitch). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.12 Super-resolution of a Code 128A barcode image with roll only. Top to bot-
tom: original image and final TV SR image, ideal signal, single scanline,
projected signal, TV SR signal with λ = 10. . . . . . . . . . . . . . . . . . . 99
4.13 Super-resolution of UPC barcode with severe pitch angle. Top: original
image with traced bars indicated by dots. Bottom: Scanline signal in red
superimposed on TV projected signal in blue. . . . . . . . . . . . . . . . . . 100
4.14 A image from an MRI sensor and contrast-adjusted zoom of two regions. . . 102
4.15 Positions of 16 MRI sensors found by tracing backwards from L2-norm image,
shown in center. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.16 Zoom of central area of brain. Left: L2-norm image with enhanced contrast.
Right: MS SR with λ = 100, γ = 2000. . . . . . . . . . . . . . . . . . . . . . 105
4.17 Mumford-Shah fusion of 16 MR sensor images. Top left: a sensor image.
Top right: L2-norm image. Bottom left: MS SR with λ = 100, γ = 2000.
Bottom right: MS SR with λ = 10, γ = 2000. All four images have the same
dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 Illustration of quantized TV graph model for neighboring pixels x ∼ y. . . . 116
5.2 Effect of λ on TV minimization with L = 4 levels. . . . . . . . . . . . . . . 120
5.3 Effect of # levels L on TV minimization with λ = 1. . . . . . . . . . . . . . 121
5.4 Running time of quantized TV model with preflow-push method. Left: Log-
log plot of # pixels N vs. runtime for repeatedly downsampled Barbara
image. Right: Log-log plot of # levels L vs. runtime on 50x50 Barbara
image. Linear regressions are shown in red. . . . . . . . . . . . . . . . . . . 121
5.5 TV denoising of Barbara image. Left to right: Original image, L = 5 and
λ = 1, L = 5 and λ = 0.1, L = 2 and λ = 1 . . . . . . . . . . . . . . . . . . 123
5.6 TV Poisson denoising. Left: Original image corrupted by Poisson noise.
Center: TV minimization assuming Gaussian noise with λ = 5, L = 3.
Right: TV minimization assuming Poisson noise with λ = 5, L = 3. . . . . . 124
5.7 Quantized TV segmentation of simple images. Left 2: segmentation of nat-
ural image with λ = 0.5, L = 2. Right 2: segmentation of noisy synthetic
with λ = 0.02, L = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.8 TV seeded segmentation. Left: Original image with 3 seed pixels shown in
red. Center: Quantized TV minimization with λ = 0.5, L = 3 levels selected
by (5.5). Right: Quantized TV minimization with λ = 0.5, L = 3 using seeds. 126
5.9 TV texture segmentation. Left: Original image. Center: TV minimization
with λ = 0.2, L = 2 of entropy statistics. Right: TV minimization with
λ = 0.05, L = 2 of skewness statistics. . . . . . . . . . . . . . . . . . . . . . 128
5.10 TV inpainting. Left: Original image with mask D shown in red. Right: TV
inpainting result with L = 3, λ = 0.1. . . . . . . . . . . . . . . . . . . . . . . 129
5.11 TV zooming by inpainting with L = 2, λ = 1, and magnification factor M . . 131
5.12 TV zooming by inpainting with L = 2, λ = 1, and magnification M = 2. . . 131
5.13 Illustration of writing an image in block-raster order for M = 2, N = 4. The
resulting matrices K and A are block-diagonal. . . . . . . . . . . . . . . . . 138
5.14 The binary 0-1 image at left is convolved with a 2x2 averaging kernel K and
downsampled by factor 2 to produce the grayscale image at right. . . . . . . 140
5.15 Results of 2x zoom by different methods. Top row: original image, bicubic
zoom, TV filter zoom. Bottom row: quantized TV inpainting, quantized TV
zooming by relaxation, quantized TV zooming using local gradients. . . . . 140
5.16 Quantized TV zooming on cameraman image. Left: original image. Center:
2x zoom by quantized TV inpainting with L = 2, λ = 1. Right: 2x zoom
with 2x2 averaging kernel and local gradients. . . . . . . . . . . . . . . . . . 142
5.17 Iterating on intensity levels for quantized TV minimization with λ = 1, L = 3. 143
5.18 TV minimization with L = 6 levels under L1 and L2 fidelity constraints.
Top row: TV-L2 minimization removes low-contrast features as λ decreases.
Bottom row: TV-L1 minimization removes finer-scale geometric features as
λ decreases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.19 TV inpainting under the 8-connected topology. The inpainting domain is
shown in red in the first image. Left to right: Original image, TV filter,
4-connected quantized TV, 8-connected quantized TV. . . . . . . . . . . . . 146
5.20 3D quantized TV denoising of simple volumes with λ = 0.005, L = 2. The
middle image slice is shown for comparison. Top row: 10x10x10 cube. Bot-
tom row: Sphere of radius 8. . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.21 Quantized TV denoising of UPC barcode with additive Gaussian noise and
Gaussian blur. Top: original image. 2nd row: thresholding at median in-
tensity. 3rd row: quantized TV denoising with λ = 0.005, L = 2. Bottom:
quantized TV denoising with anisotropic weights λy = 0.005, λx = 0, L = 2. 149
5.22 Quantized TV inpainting of damaged barcode. Left: original image with
damaged area shown in red. Center: TV inpainting with λ = 0.1, L = 2.
Right: TV inpainting with anisotropic weights λy = 0.1, λx = 0, L = 2. . . 150
5.23 Quantized TV denoising of a barcode projected signal with λ = 10, L = 2. . 150
5.24 Quantized TV zooming of large text. Top row: original image, 2x bicubic
zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic
zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized
2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix. 153
5.25 Quantized TV zooming of small text. Top row: original image, 2x bicubic
zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic
zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized
2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix. 153
5.26 Quantized TV segmentation of ideal brain images with λ = 0.1, L = 3. Left
2: CT image. Right 2: MR image. . . . . . . . . . . . . . . . . . . . . . . . 154
5.27 Quantized TV segmentation of low-contrast MR brain image. Left 2: Seg-
mentation of entire brain with λ = 50, L = 2. Right 2: Segmentation of
region indicated in first image with λ = 200, L = 2. . . . . . . . . . . . . . . 155
Chapter 1
Introduction
A digital image is not an exact snapshot of reality; it is only a discrete approximation.
This fact becomes apparent when an image is made much larger and the pixels become
visible to the human eye. A larger image should have higher resolution, but an enlarged
image sometimes appears less acceptable than its smaller original. The actual resolution of
an image is defined as the number of pixels, but the effective resolution that we perceive
is a much harder quantity to define as it depends on subjective human judgment. Simply
increasing the number of pixels comprising the image does not necessarily increase the
effective resolution, as illustrated in Figure 1.1. The goal of image zooming is to create an
image with higher effective resolution from a single observed image. The zooming method we
employ depends in large part on our definition of effective resolution, which is an essentially
aesthetic quantity.
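The gap between actual and effective resolution is easy to demonstrate: pixel replication (the nearest neighbor zoom shown in the center panel of Figure 1.1) multiplies the pixel count by M² without creating any new detail. A minimal NumPy sketch (an illustration, not code from the thesis):

```python
import numpy as np

def nearest_neighbor_zoom(u0: np.ndarray, M: int) -> np.ndarray:
    """Replicate each pixel M times in each direction.

    The actual resolution grows by a factor of M^2, but the effective
    resolution is unchanged: no new gray values or detail are created.
    """
    return np.kron(u0, np.ones((M, M), dtype=u0.dtype))

u0 = np.array([[0, 255],
               [255, 0]], dtype=np.uint8)
u = nearest_neighbor_zoom(u0, 3)
assert u.shape == (6, 6)          # 9x as many pixels...
assert len(np.unique(u)) == 2     # ...but still only two gray values
```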
The digital zooming problem goes by many names, depending on the application: inter-
polation, image resizing, image upsampling/downsampling, image magnification, resolution
enhancement, etc. The term super-resolution is sometimes used, although in the literature
this generally refers to producing a high-resolution image from multiple images such as a
video sequence. In this thesis, we will refer to the single image case as “zooming” and the
multiple image scenario as “super-resolution.”
The applications of image zooming range from the commonplace viewing of online images
Figure 1.1: Actual vs. effective resolution. Left: Original image. Center: nearest neighbor zoom. Right: zoom using Navier-Stokes inpainting from [12]. The two zoomed images have the same number of pixels.
to the more sophisticated magnification of satellite images. With the rise of consumer-based
digital photography, users expect to have a greater control over their digital images. Dig-
ital zooming has a role in picking up clues and details in surveillance images and video.
As high-definition television (HDTV) technology enters the marketplace, engineers are in-
terested in fast interpolation algorithms for viewing traditional low-definition programs on
HDTV. Astronomical images from rovers and probes are received at an extremely low trans-
mission rate (about 40 bytes per second), making the transmission of high-resolution data
infeasible [40]. In medical imaging, neurologists would like to have the ability to zoom in
on specific parts of brain tomography images. This is just a short list of applications, but
the wide variety cautions us that our desired interpolation result could vary depending on
the application and user.
1.1 The Digital Zooming Problem
In this section, we will establish the notation for image zooming used throughout the paper.
Suppose our image is defined over some rectangle Λ ⊂ ℝ². Let the function f : Λ → ℝ be
our ideal continuous image. In an abstract sense, we can think of f as being “reality” and
Λ as our “viewing window.” The observed image u0 is a discrete sampling of f at equally
spaced points in the plane. If we suppose the resolution of u0 is δx× δy, we can express u0
by
u0(x, y) = C_{δx,δy}(x, y) f(x, y), (x, y) ∈ Λ (1.1)
where C denotes the Dirac comb
C_{δx,δy}(x, y) = Σ_{k,l∈Z} δ(kδx, lδy), (x, y) ∈ ℝ²

and δ denotes the two-variable Dirac delta function

δ(x, y) = 1 if x = y, and 0 otherwise.
The goal of image interpolation is to produce an image u at a different resolution δx′×δy′.
For simplicity, we will assume that the Euclidean coordinates are scaled by the same factor
M :
u(x, y) = C_{δx/M, δy/M}(x, y) f(x, y), (x, y) ∈ Λ (1.2)
Given only the image u0, we will have to devise some reconstruction of f at the pixel values
specified by this new resolution. We will refer to M as our zoom or magnification factor.
Obviously, if M = 1 we trivially recover u0. The image u0 is upsampled if M > 1 and
downsampled if M < 1. In this paper, we will focus on the upsampling case when M > 1
is an integer.
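For integer M, the relation between f, u0, and the two lattices can be made concrete; in the following hypothetical NumPy sketch, array slicing stands in for the Dirac comb sampling of (1.1):

```python
import numpy as np

# A stand-in for the ideal image f, already sampled on a fine lattice.
f = np.arange(36, dtype=float).reshape(6, 6)

# Downsampling by integer factor M keeps every M-th sample in each
# direction -- the discrete analogue of sampling with a coarser comb.
# Upsampling (zooming) asks for the ill-posed inverse of this map.
M = 3
u0 = f[::M, ::M]

assert u0.shape == (2, 2)
assert u0[1, 1] == f[3, 3]
```

Many different fine-lattice images map to the same u0, which is precisely why the inverse problem requires additional assumptions.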
Let ΩM ⊂ Λ denote the lattice induced by (1.2) for a fixed zoom M . Note that the
lattice of the original image u0 in (1.1) is Ω1, written simply as Ω from this point. Also
note that for infinite magnification we obtain ΩM → Λ as M → ∞. For computational
purposes, we can shift the lattices to the positive integers Z+. So if the observed image u0
is an m× n image,
ΩM = [1, 2, . . . ,Mm]× [1, 2, . . . ,Mn] .
Many interpolation techniques impose the constraint Ω ⊆ ΩM . In this case, only a subset
of the pixels in ΩM needs to be determined and the zooming problem becomes a version of
the inpainting problem.
Given the notation above, we can state the image interpolation problem succinctly:
Given a low-resolution image u0 : Ω → ℝ and a magnification M > 1, find a high-resolution
image u : ΩM → ℝ. Obviously, this is an ill-posed problem. We need to impose assumptions
on the reconstruction of f in equation (1.2). The choice of the zooming strategy depends
on the choice of assumptions. In other words, we need a mathematical understanding of
what constitutes our perception of “reality” f.
Zooming methods differ in their mathematical description of a “good” interpolated
image. Although it is difficult to compare methods and judge their output, we propose
9 basic criteria for a good zooming method. Some of these criteria are image processing
axioms proposed by [2, 24]. The first 8 are visual properties of the interpolated image, the
last is a computational property of the zooming method.
1. Geometric Invariance: The interpolation method should preserve the geometry and
relative sizes of objects in an image. That is, the subject matter should not change
under interpolation.
2. Contrast Invariance: The method should preserve the luminance values of objects in
an image and the overall contrast of the image.
3. Noise: The method should not add noise or other artifacts to the image, such as
ringing artifacts near the boundaries.
4. Edge Preservation: The method should preserve edges and boundaries, sharpening
them where possible.
5. Aliasing: The method should not produce jagged or “staircase” edges.
6. Texture Preservation: The method should not blur or smooth textured regions.
7. Over-smoothing: The method should not produce undesirable piecewise constant or
blocky regions.
8. Application Awareness: The method should produce results appropriate to the type
of image and order of resolution. For example, the interpolated results should appear
realistic for photographic images, but for medical images the results should have crisp
edges and high contrast. If the interpolation is for general images, the method should
be independent of the type of image.
9. Sensitivity to Parameters: The method should not be too sensitive to internal param-
eters that may vary from image to image.
These are qualitative and somewhat subjective criteria, but they serve as a guide for
developing and evaluating digital zooming. In a sense, the methods discussed in this paper
each present a mathematical model of these visual criteria.
1.2 Organization and Contributions of this Thesis
Simple linear filters are the most common interpolation methods in computer software such
as web browsers and photo editors. In the next chapter, we first establish that linear
filters are inadequate for the zooming problem. To motivate the variational approach and
compare it to other strategies, in Chapter 2 we examine four methods representative of
current research in image processing:
• A wavelet-based interpolation method
• A heat diffusion PDE-based algorithm
• A machine learning strategy inspired by dimensionality reduction
• An adaptation of the Nonlocal Means denoising algorithm.
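For reference, the linear filters that these methods are compared against compute each new pixel as a fixed weighted average of nearby coarse pixels. A minimal pure-Python sketch of bilinear interpolation (an illustration under the assumption of an image with at least two rows and columns, not code from the thesis):

```python
def bilinear_zoom(u0, M):
    """Upsample a list-of-lists grayscale image by integer factor M
    using bilinear interpolation (assumes at least 2 rows and columns)."""
    m, n = len(u0), len(u0[0])
    out = [[0.0] * (M * n) for _ in range(M * m)]
    for i in range(M * m):
        for j in range(M * n):
            # Map the fine-lattice site back to coarse coordinates,
            # clamping so the four surrounding samples stay in range.
            x = min(i / M, m - 1.0)
            y = min(j / M, n - 1.0)
            x0 = min(int(x), m - 2)
            y0 = min(int(y), n - 2)
            dx, dy = x - x0, y - y0
            out[i][j] = ((1 - dx) * (1 - dy) * u0[x0][y0]
                         + dx * (1 - dy) * u0[x0 + 1][y0]
                         + (1 - dx) * dy * u0[x0][y0 + 1]
                         + dx * dy * u0[x0 + 1][y0 + 1])
    return out

u0 = [[0, 255], [255, 0]]
u = bilinear_zoom(u0, 2)
assert len(u) == 4 and len(u[0]) == 4
assert u[1][1] == 127.5   # midpoint blends all four neighbors equally
```

The averaging weights depend only on position, never on image content; this content-blindness is what blurs edges and textures, and it is the shortcoming the four methods above try to address.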
While each interpolation strategy has its strengths, the variational approach has certain
theoretical and computational advantages.
In Chapter 3, we discuss the use of variations of energy for image processing, specifically
the Total Variation (TV) and Mumford-Shah energies. We discuss theoretical and computational aspects of image inpainting and its extension to digital zooming. To improve the
quality of zoomed images, we propose several modifications to the basic inpainting model.
These modifications include the incorporation of a blur kernel, locally adaptive fidelity
weights, soft inpainting using nearest neighbor information, and variational post-processing.
The super-resolution problem seeks to produce a single high-resolution image from a
sequence of low-resolution images. The TV and Mumford-Shah image zooming models
extend naturally to image sequences, but meaningful data fusion requires that the images
be aligned to sub-pixel accuracy. In Chapter 4, we propose an alternating minimization
strategy to accurately align and fuse the image sequence. To address the problem of local
variation or motion within the sequence, a soft inpainting model can correct for image
artifacts created by the super-resolution process. Variational super-resolution is shown to
be effective in video enhancement, MRI reconstruction, and barcode processing.
Restricting the processed image to a few discrete gray values is useful for enhancing the
components of an image and can help restore simple images corrupted by noise, blur, and
downsampling. Chapter 5 discusses quantized image processing by minimizing the quantized
TV energy via graph cuts. In a graph-theoretic model, finding the global minimum of the
quantized TV energy is equivalent to finding the minimum cut of a flow network. This model
has previously been shown to be effective for image denoising and segmentation. The graph
cut method extends to inpainting and zooming, but incorporating a blur kernel is an open
problem. To address the graph cut deconvolution problem, we propose an approximation
method inspired by numerical relaxation in linear programming. Applications to text,
barcodes, and medical images are presented.
Chapter 2
Survey of Zooming Approaches
The goal of this chapter is to give a brief survey of different mathematical approaches to
image zooming. To illustrate each approach, we will focus on a particular method that
is representative of the approach. We will present numerical results for each method and
discuss its strengths and weaknesses. We begin by examining simple linear filters and why
better methods need to be developed.
2.1 Linear Interpolation Filters
The simplest approach is to assume that f in equation (1.2) is reconstructed by a convolution
kernel φ : ℝ² → ℝ where ∫∫ φ(x, y) dy dx = 1. Then we can approximate f by f ≈ u0 ∗ φ.
Substituting this into (1.2) gives rise to a general linear interpolation filter

u(x, y) = C_{δx/M, δy/M}(x, y) (u0 ∗ φ)(x, y),   (x, y) ∈ Ω.
The simplest linear filters are the bilinear and bicubic interpolation, which assume the
pixel values can be fit locally to linear and cubic functions, respectively [64]. Along with
simple nearest neighbor interpolation, these two filters are the most common interpolation
schemes in commercial software. These methods are easy to code as matrix multiplications
of u0. However, an image contains edges and texture, in other words discontinuities, so
the assumption that pixel values locally fit a polynomial function will produce undesirable
results. The bilinear and bicubic interpolation methods may introduce blurring, create
ringing artifacts, and produce a jagged aliasing effect along edges (see Figure 2.1). The
blurring effects arise from the fact that the methods compute a weighted average of nearby
pixels, just as in Gaussian blurring. The aliasing effects arise because the linear filters do
not take into consideration the presence of edges or how to reconstruct them.
Figure 2.1: Part of Lena image downsampled and then upsampled by factor M = 2.
Other linear interpolation filters include quadratic zoom, the B-spline method,
and zero-padding. But these schemes produce the same undesirable effects as the bilinear
and bicubic methods, as documented in [72]. Linear filters differ in the choice of φ, which
essentially determines how to compute the weighted average of nearby pixels. While this is
a natural interpolation scheme for general data sets, this is not necessarily appropriate for
visual data. In order to improve upon these linear filters, we need to consider interpolation
methods that somehow quantify and preserve visual information.
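To make the blurring effect concrete, here is a small sketch of bilinear upsampling in pure numpy. This is an illustration only, not code from this thesis; the function name and the toy edge image are our own. The weighted average of the four surrounding pixels is exactly the convex combination that smears a sharp edge.

```python
import numpy as np

def bilinear_zoom(u0, M):
    """Toy bilinear interpolation: sample u0 on an M-times-finer grid,
    taking a weighted average of the four nearest coarse pixels."""
    H, W = u0.shape
    ys = np.linspace(0, H - 1, M * H)
    xs = np.linspace(0, W - 1, M * W)
    y0 = np.clip(ys.astype(int), 0, H - 2)
    x0 = np.clip(xs.astype(int), 0, W - 2)
    wy = (ys - y0)[:, None]          # fractional offsets in y
    wx = (xs - x0)[None, :]          # fractional offsets in x
    Y0 = y0[:, None]
    X0 = x0[None, :]
    return ((1 - wy) * (1 - wx) * u0[Y0, X0]
            + (1 - wy) * wx * u0[Y0, X0 + 1]
            + wy * (1 - wx) * u0[Y0 + 1, X0]
            + wy * wx * u0[Y0 + 1, X0 + 1])

u0 = np.zeros((4, 4))
u0[:, 2:] = 1.0                      # sharp vertical edge
u = bilinear_zoom(u0, 2)
print(u.shape)                       # (8, 8)
# The edge is smeared: values strictly between 0 and 1 appear along it,
# the blurring effect described above.
print(np.any((u > 0) & (u < 1)))     # True
```

Because the output is a convex combination of input pixels, bilinear interpolation never overshoots; bicubic interpolation, by contrast, can overshoot near a jump, which produces the ringing artifacts mentioned above.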
2.2 Which Methods to Consider?
Generally speaking, mathematical approaches to image processing can be divided into five
categories:
1. PDE-Based Methods (e.g. heat diffusion, Perona-Malik, Navier-Stokes, mean curvature)
2. Multiscale Analysis (e.g. wavelets, Fourier analysis, Gabor analysis, Laplacian pyramids)
3. Machine Learning (e.g. unsupervised learning, data mining, Markov networks)
4. Statistical / Probabilistic Methods (e.g. Bayesian inference, Natural Scene Statistics,
pattern theory)
5. Variations of Energy (e.g. Total Variation, Mumford-Shah, active contours)
We are trying to describe the field in broad terms, but not to rank or pigeonhole work
in computer vision. Indeed, many techniques such as TV-wavelets inpainting certainly do
not fit into one category. Also, these methods differ at the mathematical level, but not
necessarily at the conceptual level. For example, some versions of the TV energy can be
minimized by solving a PDE or by optimizing a variation of energy.
In our attempt to survey recent work in image interpolation and also display the variety
of mathematics used, we will highlight one method from each of the first four categories.
The fifth category, variations of energy, will be discussed in detail in Chapter 3. In this
chapter, we will consider
1. A PDE-Based Approach: anisotropic heat diffusion [10]
2. A Multiscale Approach: wavelet-based interpolation [23]
3. A Machine Learning Approach: LLE-based neighbor embeddings [35]
4. A Statistical Approach: NL-means interpolation [19]
These methods are, in some sense, representative of the mathematical approaches to the
image interpolation problem and, in a larger sense, to the field of image processing. For
example, the heat equation is the most studied PDE in image processing and wavelet theory
has generated hundreds of research papers. We will briefly describe the mathematics and
motivation behind each method. Then we will present numerical results and discuss each
method’s advantages and drawbacks.
2.3 A PDE-Based Approach
A PDE-based approach evolves an image based on a specific driving differential equation.
For example, Cha and Kim proposed an interpolation method based on the PDE form of
the TV energy [25]. In their seminal paper on inpainting, Bertalmio et al. proposed a
fourth-order PDE based on Navier-Stokes fluid flow [12]. The most famous and well-studied
PDE in image processing is the classical heat equation. Anisotropic heat diffusion has been
successfully applied to image reconstruction and denoising and its behavior is well-known
[57, 80]. Belahmidi and Guichard have proposed an interpolation scheme based on the
classical heat diffusion model [11].
2.3.1 Heat Diffusion
The heat equation is a useful tool for smoothing noisy images. We assume that pixel values
behave like temperature values and diffuse throughout the image. Diffusion is directed by
the unit vectors ~n and ~t, which are oriented by the gradient vector Du normal and tangent
to the edges, respectively:
~n = Du/|Du| = (ux, uy) / √(ux² + uy²),    ~t = Du⊥/|Du| = (uy, −ux) / √(ux² + uy²)
Following the notation of Guichard and Morel [57], an image u(t, x) is evolved according to
the PDE

∂u/∂t = |Du| D²u(~t, ~t) + g(|Du|) D²u(~n, ~n)   (2.1)
where

D²u(~v, ~v) = ~v^T D²u ~v

for the 2×2 Hessian matrix D²u.
The function g(s) is an “edge-stopping function” satisfying 0 ≤ g ≤ 1 that is close to
0 when s is large and 1 when s is small. The most common choice is the Perona-Malik function

g(s) = 1 / (1 + (s/λ)²)
where λ is a parameter set experimentally [80]. The effect of g is shown in the following
theorem, which can be proven by direct calculation.
Theorem 2.1 (Belahmidi and Guichard, 2004) When g ≡ 1, equation (2.1) reduces to
the heat equation

∂u/∂t = ∆u. (2.2)

When g ≡ 0, equation (2.1) reduces to mean curvature motion

∂u/∂t = |Du| ∇ · (Du/|Du|) = |Du| curv(u). (2.3)
In smooth regions, Du is small, g is close to 1, and the two terms of (2.1) have equal
weight. The Laplacian of equation (2.2) will blur the image evenly by isotropic diffusion.
Near edges, Du is large, g is close to 0, and the diffusion will occur along edges, smoothing
the level lines but preserving the sharpness of the edges.
Belahmidi and Guichard adapted the heat equation (2.1) to image interpolation by
adding a fidelity term [10]. The heat equation will still smooth the image while preserving
edges, but the addition of a third term keeps the image u close to the original image u0.
The PDE and initial condition are
∂u/∂t = |Du| D²u(~t, ~t) + g(|Du|) D²u(~n, ~n) − Pu + Zu0,
u(0, x) = Zu0.   (2.4)
The operator Z : Ω → ΩM is the duplication zoom or nearest neighbor upsampling
technique. The upsampled coarse image Zu0 acts as the initialization. The projection
operator P computes the average of the image u over the M × M stencil used in the
upsampling Z. If we let N(x) denote the M ×M upsampling window containing pixel x,
we can write P as

(Pu)(x) = (1/M²) ∫_{N(x)} u(y) dy.
The classical heat diffusion (2.1) has been well-studied, but it is unclear how the addition
of the fidelity term in (2.4) affects the equation. Little is known about solutions to the
PDE (2.4), although some comments can be made in the viscosity framework. Writing
H(x, u,Du,D2u) for the right-hand side of equation (2.4), a viscosity solution u satisfies
u = 0 on ∂ΩM and for all v ∈ C2(ΩM ) we have:
1. H(x0, u,Du,D2u) ≤ 0 whenever u− v has a local maximum at (t0, x0).
2. H(x0, u,Du,D2u) ≥ 0 whenever u− v has a local minimum at (t0, x0).
Under this definition, Belahmidi proved the following theorem in [10].
Theorem 2.2 (Belahmidi, 2003) Suppose g(s) is the Perona-Malik function and u0 ∈
C(Ω). Then the PDE (2.4) with boundary condition u = 0 on ∂ΩM admits a unique viscosity
solution.
The proof is similar to the proof for viscosity solutions to the Hamilton-Jacobi equation
[47]. Of course, this is of limited usefulness for natural images because the original image
u0 is almost certainly not continuous.
2.3.2 Numerical Results
Equation (2.4) can be discretized in a straightforward manner using finite differences. For
a choice of small time step δt, we can write

u^(n+1)_ij = u^(n)_ij + δt (|Du| D²u(~t, ~t) + g(|Du|) D²u(~n, ~n) − Pu + Zu0)_ij.
A von Neumann analysis of the 2D heat equation u_t = ∆u shows that we require
δt/(δx)² < 1/4 to guarantee stability of an Euler numerical scheme. Using this as a
guideline, an image has spatial step δx = 1, so we expect a rough upper bound δt < 0.25.
We used Neumann
boundary conditions at the borders of the image.
Belahmidi and Guichard make a heuristic argument for the stopping time T. Running
the heat equation on an image u at scale t is equivalent to convolution with a Gaussian
kernel of standard deviation √(2t). Since the length of the diagonal of a pixel's upsampled
M × M window is √2 M, the authors argue that the desired standard deviation should
be √2 M. So we set the stopping time T = M². Our experiments with the PDE-based
method indicate that the image does not change much after this stopping time, so the image
may have reached its steady-state by this time.
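The explicit scheme above can be sketched in numpy. This is a rough illustration under assumed choices (np.gradient finite differences, which are one-sided at the borders and so give Neumann-like behavior, and block operators standing in for Z and P), not the authors' implementation.

```python
import numpy as np

def perona_malik(s, lam):
    return 1.0 / (1.0 + (s / lam) ** 2)

def Z(u0, M):
    """Duplication zoom: copy each coarse pixel into an M x M block."""
    return np.kron(u0, np.ones((M, M)))

def P(u, M):
    """Average u over each M x M upsampling window (fine grid to fine grid)."""
    H, W = u.shape
    b = u.reshape(H // M, M, W // M, M).mean(axis=(1, 3))
    return np.kron(b, np.ones((M, M)))

def pde_zoom(u0, M, dt=0.1, lam=100.0, eps=1e-8):
    Zu0 = Z(u0, M)
    u = Zu0.copy()
    for _ in range(round(M ** 2 / dt)):   # heuristic stopping time T = M^2
        gy, gx = np.gradient(u)           # one-sided at borders (Neumann-like)
        gxx = np.gradient(gx, axis=1)
        gyy = np.gradient(gy, axis=0)
        gxy = np.gradient(gx, axis=0)
        grad2 = gx ** 2 + gy ** 2
        mag = np.sqrt(grad2)
        # D^2u(t,t) and D^2u(n,n) for the unit tangent and normal vectors
        Dtt = (gy ** 2 * gxx - 2 * gx * gy * gxy + gx ** 2 * gyy) / (grad2 + eps)
        Dnn = (gx ** 2 * gxx + 2 * gx * gy * gxy + gy ** 2 * gyy) / (grad2 + eps)
        u = u + dt * (mag * Dtt + perona_malik(mag, lam) * Dnn - P(u, M) + Zu0)
    return u

# A constant image is a steady state: all derivatives vanish and P(Zu0) = Zu0.
flat = pde_zoom(np.full((4, 4), 0.5), M=2)
print(flat.shape)              # (8, 8)
print(np.allclose(flat, 0.5))  # True
```

The sanity check with a constant image exercises the fixed-point property of the fidelity term: when u = Zu0, both diffusion terms and −Pu + Zu0 vanish, so the scheme leaves u unchanged.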
The zooming method seems to do a good job smoothing edges, while maintaining the
sharpness of the edges. In terms of aliasing edges, it seems to perform better than linear
interpolation filters (see Figure 2.2). The PDE-based method seems to perform well on
natural images, although some textures are over-smoothed.
If the parameter λ in the Perona-Malik function g(s) is set too small, the method will
over-smooth textured regions, resulting in unrealistic images. We set λ very large to avoid
this side-effect. This preserved textures, but it also preserved noise and ringing effects
present in the original image (see Figure 2.3). Another side-effect, which is barely visible
in the figure below, is that the PDE-based method changes the overall contrast of the
image. This is because the diffusion across edges is limited, but still occurs. This may be
an undesirable side-effect in some applications, such as medical images where the gray value
of brain matter is crucial.
2.4 A Multiscale Approach
A multiscale approach tries to break an image down into its most basic components of
information and express the image in scales of those building blocks. Multiscale analysis
seems a natural fit for image interpolation, with image upsampling viewed as determining
finer scales of image detail to add to a low-resolution image. Wavelets and their variants
Figure 2.2: PDE-based upsampling with zoom M = 3 and time step δt = 0.1. Top Row: Magnification of Miller image at time T = 9. Bottom Row: Close-up of section of image and comparison to linear filters.
have received much attention for image interpolation, although most of the work has focused
on image super-resolution: interpolating a high-resolution image from an image sequence
rather than a single image. These techniques do not necessarily carry over to single image
super-resolution, as the sequence generally contains much more information than a single
image. The techniques are also highly dependent on precise sub-pixel registration of the
low-resolution images in the sequence [76]. Most of the wavelet-based work on single image
interpolation has focused on detecting extrema and singularities in the wavelet transform. In
this section, we describe a work by Carey, Chuang, and Hemami that focuses on producing
crisp well-defined edges in the interpolant [23].
Figure 2.3: PDE-based upsampling of text image with zoom M = 5 and time step δt = 0.1. Left: original image. Right: zoomed image at time T = 25.
2.4.1 Wavelets and Hölder Regularity
Carey et al. begin by defining the smoothness of an image in terms of Hölder regularity
of the wavelet transform. We say that a function f : ℝ → ℝ has Hölder regularity with
exponent α = n + r, n ∈ ℕ, 0 ≤ r < 1, if there exists a constant C satisfying

|f^(n)(x) − f^(n)(y)| ≤ C|x − y|^r,   x, y ∈ ℝ. (2.5)
Functions with a large Hölder exponent will be both mathematically and visually smooth.
Locally, an interval with high regularity will be a smooth region and an interval with low
regularity will correspond to roughness, such as at an edge in an image. To extend this
concept to edge detection in the wavelet domain, we need a technique for detecting local
Hölder regularity from the wavelet coefficients.
Let ψ be a compactly-supported discrete wavelet function, such as a Daubechies wavelet.
The discrete wavelet transform is computed by projecting a signal onto translations and
dilations of the mother wavelet ψ:
ψ_{k,l}(x) = ψ(2^k x − l),   k, l ∈ ℤ. (2.6)
The wavelet transform coefficients w_{k,l} at scale k and offset l are given mathematically as
an inner product with the mother wavelet:

w_{k,l} = (f, ψ_{k,l}). (2.7)
Numerically, these coefficients are computed using a filter bank with a scaling function φ
appropriate to the mother wavelet. The dyadic wavelet filter bank repeatedly divides the
signal at scale k into an approximation signal ak and a detail signal dk, also called the
averages and differences (see Figure 2.4). The coefficients of dk are precisely the wavelet
coefficients wk,l.
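The averages-and-differences structure of the filter bank can be sketched in a few lines of numpy. For brevity this sketch assumes the Haar wavelet rather than a Daubechies wavelet; each level splits the current approximation into averages a_k and differences d_k, the d_k being the wavelet coefficients.

```python
import numpy as np

def haar_decompose(signal, levels):
    """Minimal Haar filter bank: repeatedly split into averages and
    differences, returning the coarsest approximation and all details."""
    a = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        details.append(d)
    return a, details

x = np.array([4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # step signal
approx, details = haar_decompose(x, 3)
# The nonzero detail coefficients sit at the step location at every level,
# illustrating how an edge persists across scales:
print([np.count_nonzero(d) for d in details])  # [1, 1, 1]
```

Since the Haar transform is orthonormal, the energy of the signal is preserved across the decomposition, which is a convenient correctness check for any filter-bank implementation.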
The following theorem by Ingrid Daubechies establishes the connection between wavelet
coefficients and Holder regularity [41].
Theorem 2.3 (Daubechies, 1992) Let x0 ∈ ℝ and S be a set of index pairs (k, l) such
that for some ε > 0 we have (x0 − ε, x0 + ε) ⊂ supp(ψ_{k,l}). A signal has local Hölder regularity
with exponent α in the neighborhood (x0 − ε, x0 + ε) if there exists a constant C such that

max_{(k,l)∈S} |w_{k,l}| ≤ C 2^{−k(α + 1/2)}. (2.8)
Theorem 2.3 alone is not sufficient for determining the local Hölder regularity, because
it requires computation of two unknown constants C and α. It has been observed
experimentally that regions in a signal with low regularity tend to have greater similarity
across scales. Let dm(t) and dn(t) denote the wavelet sub-bands at scales 2^m and 2^n. The
correlation between sub-bands is given by

Corr(dm(t), dn(t)) = ∫_ℝ dm(τ) dn(τ − t) dτ. (2.9)
Applying Theorem 2.3 twice to this definition yields the following theorem.
Figure 2.4: Discrete 3-level wavelet decomposition of noisy sine wave signal.

Theorem 2.4 (Carey-Chuang-Hemami, 1999) Let f : ℝ → ℝ be C∞, except possibly
in a neighborhood of the origin, where it has Hölder regularity with exponent α. The
correlation between sub-bands dm(t) and dn(t) satisfies

|Corr(dm(t), dn(t))| ≤ C 2^{−(m+n)(α + 1/2)}. (2.10)
Theorem 2.4 shows that regions with high regularity will exhibit low correlation across
scales, and vice-versa. In other words, an edge will result in extrema in the wavelet coef-
ficients across several scales, while extrema in smooth regions will not persist across scales
(see Figure 2.5).
The two previous theorems give a heuristic for estimating the local regularity of a signal
by examining the correlation across wavelet sub-bands. Carey et al. claim that at a strong
edge in a signal, the inequalities in both theorems will be close to equality. By (2.8), in
an interval containing a strong edge the logarithm of the maximum coefficient magnitudes
should be close to linear across scales. The parameters C and α can then be estimated
using equality in (2.8) and (2.10).
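The regression heuristic can be sketched as follows. Assuming equality in (2.8), log2 of the maximum coefficient magnitude is linear in the scale k with slope −(α + 1/2), so a least-squares line through the per-scale maxima yields estimates of C and α. The function name and the synthetic coefficients are our own illustration, not the authors' code.

```python
import numpy as np

def estimate_holder(max_mags, scales):
    """max_mags[i] = max over offsets l of |w_{k,l}| at scale k = scales[i].
    Fit log2(max_mags) = log2(C) - k*(alpha + 1/2) by least squares."""
    slope, intercept = np.polyfit(scales, np.log2(max_mags), 1)
    alpha = -slope - 0.5
    C = 2.0 ** intercept
    return alpha, C

# Synthetic maxima obeying (2.8) with equality, alpha = 0.5 and C = 8:
scales = np.array([1, 2, 3, 4])
mags = 8.0 * 2.0 ** (-scales * (0.5 + 0.5))
alpha, C = estimate_holder(mags, scales)
print(round(alpha, 6), round(C, 6))  # 0.5 8.0
```

On real wavelet coefficients the fit is only approximate, and the strength of the linear correlation is itself the test for whether an interval contains a strong edge.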
Figure 2.5: 4-level wavelet decomposition of noisy step function f.
2.4.2 Wavelet-Based Interpolation
Suppose we are given an image u0 : Ω → ℝ and its corresponding L-level discrete wavelet
decomposition. Synthesizing a new sub-band dL+1 will produce a new image u : Ω2 → ℝ
that is twice the size of the original image. Since the theorems above apply to 1D data,
Carey et al. proceed by first processing the image data across each row and appending the
signals into a “row-interpolated image.” The same processing step is then applied to the
columns of this new image, with the end result being u.
Carey et al. suggest interpolating the sub-band signal dL−1 rather than the finest
original sub-band dL because the finest band generally contains too much noise information.
As an initialization for dL+1, the detail signal dL−1 is upsampled by a factor 4 using cubic
spline interpolation. For each subinterval of the signal, the algorithm then determines
similarity across scale by computing the linear regression across sub-bands of the maximum
coefficient magnitude. If the linear correlation is strong, then the interval should contain
an edge and the linear regression will predict the magnitude of the coefficient at sub-band
dL+1. The template from dL−1 is used except at edges, where the signal is modified to
achieve equality in (2.8).
On a small set of test images, Carey et al. [23] demonstrate that their wavelet-based interpolation
method results in higher Peak Signal-to-Noise Ratio (PSNR) than the standard bilinear
and bicubic methods. However, visually the methods exhibit little difference. The wavelet-
based method seems to sharpen the edges, but the textured and smooth regions of the
image are blurred. This effect is expected because the interpolation step is simply 1D
cubic interpolation, except at strong edges. This method requires the original image u0 to
be large enough to show a significant amount of information across many wavelet scales.
Also, the technique lacks resizing flexibility because it assumes the zoom factor M is a
multiple of 2.
This wavelet-based method reduces to the linear bicubic filter, except at strong edges.
It is best suited for images with strong, well-defined edges separating smooth regions. One
possible refinement to this method would be to incorporate texture information. Several
papers have demonstrated that wavelet coefficient magnitudes can be used to quantify and
classify textures [63, 92]. It would be interesting to incorporate this idea into texture
interpolation, resulting in a sharpened image that is visually pleasing as well.
2.5 A Machine Learning Approach
As image processing research advances, researchers are realizing that details, textures, and
other visual nuances of images cannot be expressed in compact mathematical form. In
the last few years, researchers have given more attention to machine learning approaches
to guide the computer to learn meaningful information from a set of training images. For
image interpolation, a set of high-resolution images and its corresponding downsampled
19
versions are provided, with the goal of learning how to connect the low-resolution version
to its high-resolution counterpart. William Freeman and his group at Mitsubishi Labs have
developed an approach based on Markov networks and belief propagation networks (Freeman
et al.). Bryan Russell, one of Freeman's students, extended this approach by incorporating
priors into the belief propagation networks, which results in realistic textured images with
sharper edges [84]. Most recently Chang, Yeung, and Xiong developed a learning system
inspired by dimensionality reduction techniques, which we highlight below [35].
2.5.1 Locally Linear Embedding (LLE)
In the last five years, much attention has been given to mathematical non-linear dimensionality reduction (NLDR) methods, also called manifold learning techniques. Given a
high-dimensional data set X, the goal is to interpolate a lower dimensional data set Y that
preserves the neighborhoods of the original data set in a geometrically meaningful way.
In 2000, Lawrence Saul and Sam Roweis proposed the Locally Linear Embedding (LLE)
manifold learning technique [86]. For each data point in X, LLE computes the nearest
neighbors and then projects the neighborhood to Y by assuming that the neighborhood is
planar. This technique has proven effective experimentally in reducing the dimensionality
of data sets in a geometrically meaningful manner (see Figure 2.6).
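The LLE recipe can be condensed into a short numpy sketch. This is an assumed minimal variant of the Roweis-Saul algorithm for illustration, not their reference code: reconstruct each point from its K nearest neighbors, then embed via the bottom eigenvectors of (I − W)^T (I − W).

```python
import numpy as np

def lle(X, K, out_dim, reg=1e-3):
    """Minimal LLE: local reconstruction weights, then spectral embedding."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:K + 1]        # K nearest, skipping the point itself
        A = X[nbrs] - X[i]                   # neighbors centered at X[i]
        G = A @ A.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(K)   # regularize (G may be singular)
        w = np.linalg.solve(G, np.ones(K))
        W[i, nbrs] = w / w.sum()             # weights sum to 1
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    # Smallest eigenvector is constant; the next out_dim give the embedding.
    return vecs[:, 1:out_dim + 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Y = lle(X, K=6, out_dim=2)
print(Y.shape)  # (60, 2)
```

The regularization term is a standard fix for the case where the number of neighbors exceeds the ambient dimension and the local Gram matrix is singular.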
Once a data set Y is determined, it is possible to add a new data point x to the manifold
X and add its projection y to Y without recomputing the entire embedding from X to Y .
Saul and Roweis suggest one solution for this out-of-sample extension is to first compute the
K nearest neighbors {x_i}_{i=1}^K of x in X. Next we compute normalized weights
{w_i}_{i=1}^K that best form a linear combination approximating x: x ≈ Σ_{i=1}^K w_i x_i.
Finally, we construct the interpolated point y by using these same weights in a linear
combination of {y_i}_{i=1}^K, the data points in Y corresponding to the x_i's:
y ≈ Σ_{i=1}^K w_i y_i. This procedure motivates a machine learning
technique for comparing a given image to the training data set. The key difference is that
we are given a low-resolution image and interpolate a high-dimensional image, so we need
to increase the dimensionality of our data points.
Figure 2.6: LLE dimensionality reduction. Left: original 3D spherical data set. Right: 2Ddata set computed by LLE.
2.5.2 LLE-Based Interpolation
Suppose we are given a collection of low-resolution image patches X = {x_i}_{i=1}^N and their
corresponding high-resolution patches Y = {y_i}_{i=1}^N, where images are expressed in raster
order as vectors. This training set could be prepared by dividing a set of high-resolution
images into patches Y and downsampling the images to patches X. The training set images
should be carefully chosen to reflect the textures and patterns that will be seen in the
interpolation phase and the downsampling rate should be the desired zoom M . Given a
new image patch x, the goal is to find its corresponding high-resolution image y.
Inspired by LLE's out-of-sample extension scheme, Chang et al. propose the following
interpolation. For each low-resolution image patch x, we perform the following steps:
1. Find the K nearest neighbors {x_i}_{i=1}^K of x in the data set X. The metric could be
the Euclidean distance, although more sophisticated image difference metrics could
be devised.
2. Compute weights {w_i}_{i=1}^K that minimize the reconstruction error:

err = |x − Σ_{i=1}^K w_i x_i|²   (2.11)

subject to the constraint Σ_{i=1}^K w_i = 1.
Note that this minimization is only performed over the neighborhood of x, so we
could enforce wi = 0 for any data point xi not in the neighborhood. We can solve
this constrained least squares problem by computing a Gram matrix

G = (x~1^T − A)^T (x~1^T − A)

where A is a matrix containing the neighbors {x_i}_{i=1}^K as its columns. Expressing the
weights as a vector ~w, the closed-form solution of (2.11) is

~w = G^{-1}~1 / (~1^T G^{-1}~1). (2.12)

Equivalently, we could solve G~w = ~1 and then normalize the weights so that Σ_{i=1}^K w_i = 1.
3. Project x to its high-dimensional image patch by computing

y = Σ_{i=1}^K w_i y_i

where the y_i's are the high-dimensional patches corresponding to the x_i's.
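Steps 2 and 3 can be sketched in numpy as follows. The function name and the toy patches are our own illustration of the Gram-matrix solve in (2.12), not the authors' code; a small regularization handles the case where G is singular.

```python
import numpy as np

def reconstruction_weights(x, neighbors):
    """Constrained least-squares weights of (2.11) via the Gram matrix,
    solving G w = 1 and normalizing so the weights sum to 1."""
    A = np.column_stack(neighbors)
    D = np.outer(x, np.ones(A.shape[1])) - A         # x 1^T - A
    G = D.T @ D
    G += 1e-8 * np.trace(G) * np.eye(G.shape[0])     # regularize if singular
    w = np.linalg.solve(G, np.ones(G.shape[0]))
    return w / w.sum()

# If x is exactly an affine combination of its neighbors, the weights
# recover that combination, and step 3 applies it to the high-res patches:
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
x = 0.25 * x1 + 0.75 * x2
w = reconstruction_weights(x, [x1, x2])
print(np.round(w, 6))  # [0.25 0.75]
y1, y2 = np.array([0.0, 0.0, 0.0, 0.0]), np.array([2.0, 2.0, 2.0, 2.0])
y = w[0] * y1 + w[1] * y2
print(y)  # [1.5 1.5 1.5 1.5]
```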
After completing these steps on each image patch, we have obtained a collection of
upsampled image patches which can be arranged into a high-resolution image. However,
these patches are not independent since they should form a single image. To help enforce
continuity between adjacent patches, the training set X is formed by selecting overlapping
patches from the training images. Overlapping the image patches is a common trick that
is used in many machine learning vision algorithms [51]. Since this still does not guarantee
continuity between the computed high-resolution patches, the high-resolution image is
constructed by averaging pixel values in the overlapping regions.
Note that if we use the raw pixel values, as suggested above, our method will be sensitive
to changes in luminance. That is, if we supply a test image that is brighter than the training
images, the first step of the interpolation will not match correct textures to the given image
patch. Chang et al. work around this problem by using the relative luminance changes
in their low-resolution patches X. Each pixel in a low-resolution patch is replaced with a
4D feature vector consisting of finite difference approximations of the first and second order
gradients. This helps the algorithm find neighbors with similar patterns rather than similar
luminances. Since this will prevent us from determining the overall luminance value of the
interpolated high-resolution image, the mean luminance value of each high-resolution patch
in the training set Y is subtracted from all pixel values. In step 3, the target high-resolution
patch y is constructed and the mean luminance value of the original low-resolution patch x
is added to y.
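One plausible reading of this feature construction (an assumption on our part, not Chang et al.'s exact filters) is to take first- and second-order finite differences in x and y at each pixel, giving a 4D feature per pixel and a length-36 vector per 3x3 patch. The point of the example is that derivative features are invariant to a constant brightness shift.

```python
import numpy as np

def patch_features(patch):
    """Hypothetical 4D-per-pixel feature: first- and second-order
    finite differences in x and y, flattened over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    gyy = np.gradient(gy, axis=0)
    gxx = np.gradient(gx, axis=1)
    return np.stack([gx, gy, gxx, gyy], axis=-1).ravel()

patch = np.arange(9.0).reshape(3, 3)
f = patch_features(patch)
print(f.shape)  # (36,)
# Invariant to a constant brightness shift, as the text requires:
print(np.allclose(f, patch_features(patch + 50.0)))  # True
```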
2.5.3 Numerical Results
We implemented the LLE-based interpolation by selecting a set of high-resolution photographs to form the set Y and downsampling them by a specified zoom factor M to obtain
our low-resolution training set X. For our image patch sizes, we used 3x3 windows for the
low-resolution images and 3M × 3M windows for the high-resolution images. Using the
relative luminance changes as our feature vector, each vector in X had length (4)(3²) = 36
and each high-resolution vector in Y had length 9M2. The low-resolution patches were
selected from the images with an overlap of 1 pixel width between adjacent patches. The
high-resolution patches necessarily had an overlap width of M pixels. These are the same
window sizes used in [35]. As suggested by the authors, we quadrupled the size of the train-
ing set by using rotations of the image patches (0°, 90°, 180°, 270°). This makes use of the
assumption that texture patches are often rotationally invariant. One of our training sets
consisting of face images is shown in Figure 2.7. The training phase is very time-consuming:
this particular data set of 5 images took 91 minutes to prepare.
Figure 2.7: Training set used for LLE-based interpolation.
In the interpolation phase, we use K = 5 nearest neighbors. Figure 2.8 shows the interpolation result on a face image with zoom M = 3. The interpolation phase is rather slow,
since the nearest neighbor computation is rather expensive and grows quadratically with
the size of the training set. This particular image took roughly 20 minutes to compute.
Some aliasing and discretization effects can be seen in the original image, and the interpolated image is noticeably smoother. The interpolated image is somewhat blocky, however,
reflecting the square windows used in the interpolation.
If we magnify a piece of the image in Figure 2.8, we can better see the blocky nature
of the reconstruction. Figure 2.9 shows a close-up of the eye. The LLE interpolation
definitely exhibits some aliasing, whereas the bilinear and bicubic filters smooth the image
better. Despite this effect, LLE does seem to do a good job interpolating texture.
The major drawback of LLE interpolation, and of machine learning methods in general, is
that they require the generation of a good training set. In this case, the training set should
reflect the textures that will be seen in the test image. That is, if we want to interpolate
Figure 2.8: LLE-based interpolation of a face image with zoom M = 3 and training set shown in Figure 2.7. Left: original image. Right: LLE interpolated image.
faces then the training set should consist of face images. Figure 2.10 shows the result of
interpolating a text image using the face image training set. The text in the image is blurred
and the overall contrast of the image is changed.
Not only should the training set reflect the type of image interpolated, but the selected
images should also reflect the order of magnitude of the resolution desired. For example, the
texture of a brick wall will change drastically depending on the viewer’s distance from the
wall. After images are selected, generating the training manifolds X and Y is very time-
consuming. This data preparation could be done as a pre-processing step, provided the
zoom factor M is known. The downsampling rate and patch sizes depend on M , so this
factor must be fixed before training begins. Selecting and preparing a training set requires
prior knowledge of the type of image to be interpolated, the resolution of the images, and
Figure 2.9: Close-up of eye in Figure 2.8 with zoom M = 3. Top left: nearest neighbor. Top right: bilinear. Bottom left: bicubic. Bottom right: LLE-based interpolation.
the desired zoom factor. While this information is not generally available beforehand, there
are applications in which these parameters are known, such as MR image interpolation.
2.6 A Statistical Approach
Many, if not all, approaches to image processing can be interpreted as having a statistical
or probabilistic motivation. Certainly, the linear interpolation filters mentioned earlier are
statistical in nature, devising a convolution kernel that produces a weighted sum of neighboring pixels. Several researchers, particularly in psychology, have focused on developing
a statistical theory of images and patterns [56] and recent efforts have tried to incorporate
this information into interpolation [51]. Interpolating textured images is related to the
problem of texture synthesis, which is based on computing local statistics that segment and
classify the image textures [98]. Several efforts have been made to develop Bayesian and
Figure 2.10: Text image interpolated by LLE with zoom M = 3. The training set in Figure 2.7 was used. Top: original image. Bottom: LLE interpolated image.
MAP estimators for constructing a super-resolved image from a sequence of low-resolution
images [8]. In this section, we will present a simple statistical filter based on global image
statistics that simultaneously denoises and interpolates a textured image.
2.6.1 Local vs. Global Interpolation
All of the methods we have discussed thus far are based on local image statistics
and properties. The PDE and variational methods are based on very local finite difference
calculations. The wavelet interpolation method seeks to detect local singularities in the
image. Even LLE-based interpolation uses only a small 3x3 window of the given image as
its basis for interpolation, even though this window is compared to other small windows in
a large training set. However, textured images will often contain repeatable and identifiable
patterns throughout the image.
Although the previous methods preserved edges and structures well, they had a much
harder time interpolating texture. Except for LLE interpolation, all the methods tended
to over-smooth textured regions. This may be because the PDE, variational, and wavelet
methods can be written in a simple closed form, but natural textured images defy com-
pact mathematical explanation. LLE interpolation could only reproduce the texture if the
texture was present in the training set at the desired order of resolution.
In summary, we have observed two simple facts:
1. Most interpolation schemes are local.
2. Most interpolation schemes do not preserve textures.
This motivates the creation of an interpolation scheme based on global image statistics.
Conveniently, a statistical filter based on global statistics has recently been developed
for image denoising. Appropriately, its creators refer to it as the Non-Local (NL)
filter.
2.6.2 NL-Means Denoising
Buades, Coll, and Morel proposed a new statistical filter for denoising images that uses the
information present in the entire image [19, 20]. Suppose we are given a noisy grayscale
image u0 : Ω → ℝ. For each pixel x ∈ Ω, we define a local neighborhood N(x) ⊆ Ω as a
subset satisfying two simple properties:
1. x ∈ N(x)
2. x ∈ N(y) ⇒ y ∈ N(x)
There are many possible choices of topology that will satisfy these two properties. Note
that a simple N × N window, with N > 1 odd, centered over pixel x will suffice. Each
neighborhood describes the local pattern or texture surrounding x. If x is a noise point, then
to determine the proper value of u0(x) we should consider the pixel values u0(y) surrounded
by neighborhoods N(y) similar to N(x). Not knowing the pattern of the image a priori, we
assume that the image neighborhoods are distributed according to a Gaussian distribution.
This gives rise to the NL-means filter:
\[
u(x) = \frac{1}{Z(x)} \int_\Omega u_0(y)\, \exp\left( -\frac{|u_0(N(x)) - u_0(N(y))|^2}{h^2} \right) dy \qquad (2.13)
\]
where Z(x) is the normalization factor
\[
Z(x) = \int_\Omega \exp\left( -\frac{|u_0(N(x)) - u_0(N(y))|^2}{h^2} \right) dy.
\]
The norm in (2.13) can be any matrix norm, such as the Frobenius norm or any Lp matrix
norm. Buades et al. recommend the L2 matrix norm.
The filtering parameter h controls the weight pixel values receive and needs to be set
carefully. If h is too small, the image u will closely resemble the original image. If h is too
large, then pixel values with dissimilar neighborhoods will contribute to the value of u(x)
and the result will resemble Gaussian blurring. Intuitively, h acts like a standard deviation
of the neighborhood distribution. Given the Gaussian nature of equation (2.13), we found
experimentally that a good choice is h = √2 σ, where σ is the standard deviation of the pixel
values in u0.
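To fix ideas, here is a minimal, deliberately brute-force sketch of the filter in equation (2.13). The function name, the replicate padding, and the default patch radius are our illustrative choices, not prescriptions from [19, 20]:

```python
import numpy as np

def nl_means_denoise(u0, radius=1, h=1.0):
    """Brute-force sketch of the NL-means filter (2.13): every pixel becomes a
    weighted average of all pixels, with weights given by Gaussian similarity of
    (2*radius+1)^2 neighborhoods. Callers should pass h ~ sqrt(2)*std(u0),
    following the heuristic in the text."""
    pad = np.pad(u0, radius, mode='edge')  # replicate padding: a Neumann-style boundary
    rows, cols = u0.shape
    # Flatten every neighborhood into a vector, one per pixel.
    patches = np.array([pad[i:i + 2*radius + 1, j:j + 2*radius + 1].ravel()
                        for i in range(rows) for j in range(cols)])
    flat = u0.ravel()
    out = np.empty_like(flat)
    for k in range(flat.size):
        d2 = ((patches - patches[k]) ** 2).sum(axis=1)  # squared L2 patch distances
        w = np.exp(-d2 / h**2)                          # Gaussian weights
        out[k] = (w * flat).sum() / w.sum()             # w.sum() plays the role of Z(x)
    return out.reshape(u0.shape)
```

For an n-pixel image this loop performs O(n²) patch comparisons, which is why the method is costly; practical implementations typically restrict the comparison to a large search window around each pixel.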
Buades et al. demonstrated that the NL-means filter successfully denoises textured
images. The authors showed that on test cases NL-means outperformed classical denoising
methods, including Gaussian smoothing, the Wiener filter, the TV filter, wavelet thresh-
olding, and anisotropic heat diffusion. They showed that under certain assumptions on the
noise distribution, NL-means minimizes the additive noise present in the original image [19].
In a follow-up paper, the authors showed that NL-means smooths edges and reduces the
staircasing effect in aliased images [20].
Figure 2.11: NL-means denoising on part of the Lena image. Left: noisy image. Right: image after NL-means denoising. Taken from [19].
2.6.3 NL-Means Interpolation
Based on the simple, elegant denoising filter in (2.13), we formulate a statistical filter for
image interpolation. Suppose we have a low-resolution, possibly noisy image u0 : Ω → ℝ
and a version of u0 upsampled by a factor M, v : ΩM → ℝ. Here, v will act as our
reference image on the finer lattice ΩM . The interpolation method for obtaining v could be
any chosen method, although the nearest neighbor interpolation would be a suitable choice
since it does not introduce any image artifacts or additional noise.
Similar to the NL-means denoising case, we wish to compare neighborhoods in the image
u0 that will allow us to interpolate pixel values that reproduce local patterns and textures.
To interpolate to the finer lattice, we should compare neighborhoods in v to neighborhoods
in the original image u0. Locally, we may not be able to correctly interpolate the texture
of an image. However, the downsampling process that created the image u0 may have
sampled the texture in a non-uniform fashion so that texture information may be present
in one portion of the image that is not present in another.
To motivate this comparison, consider the following scenario. Suppose we are interpolating
a low-resolution photograph of a brown-eyed woman. Suppose that the downsampling
procedure that transferred the real scene to a camera image did not sample the black pupil
in the left eye. When we attempt to zoom in on the left eye, we will have to decide what
pixel value to fill in-between the brown pixels of the iris. If we use any of the interpolation
schemes described previously, they will use local information to fill in the missing pixel
with brown. However, a more natural way to fill in the missing pixel value is to look at
the right eye which, if we’re lucky, may have sampled the black pupil in the low-resolution
photograph. NL-means would compare the neighborhoods throughout the image, decide
that the right eye’s neighborhood closely resembles the left eye’s, and give a large weight
to the black pupil contained in the right eye. Note that this example is different from the
denoising case described in the last section. The missing pupil in the left eye was not due
to noise; it was due to the coarse lattice of the original image.
There is a slight difficulty in comparing neighborhoods on the coarse lattice Ω to neigh-
borhoods on the finer lattice ΩM . Since the interpolation procedure essentially places empty
pixels between pixels, we should think of the neighborhoods as being spread out in a sim-
ilar manner when we move to a finer grid. Suppose we have a fixed zoom factor M and
a neighborhood topology on the original image lattice given by N(x). We define an M-neighborhood
NM(x) ⊆ ΩM as the set of pixels mapped from N(x) by the upsampling
Dirac comb in equation (1.2). Note that for M = 1, we have N1(x) = N(x) ⊆ Ω. Figure
2.12 illustrates M -neighborhoods where the original neighborhood topology is a 3x3 pixel
square. We can think of the M -neighborhood as placing M − 1 empty pixels between each
pixel in the original neighborhood.
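The M-neighborhood construction reduces to stretching the offsets of the original topology by the zoom factor; a tiny sketch (the helper name is ours) for a square topology:

```python
def m_neighborhood_offsets(radius, M):
    """Offsets of the M-neighborhood N_M for a (2*radius+1)^2 square topology:
    the original offsets scaled by the zoom factor M, which is equivalent to
    placing M-1 empty pixels between the pixels of the original neighborhood."""
    base = [(i, j) for i in range(-radius, radius + 1)
                   for j in range(-radius, radius + 1)]
    return [(M * i, M * j) for (i, j) in base]
```

For M = 1 this returns the original neighborhood unchanged, matching N1(x) = N(x) above.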
Taking the image v as our initialization on the finer lattice ΩM and using our definition
of the M-neighborhood, we can adapt equation (2.13) to interpolate an image u : ΩM → ℝ.
The NL-means interpolation filter becomes
\[
u(x) = \frac{1}{Z(x)} \int_\Omega u_0(y)\, \exp\left( -\frac{|v(N_M(x)) - u_0(N_1(y))|^2}{h^2} \right) dy, \quad x \in \Omega_M \qquad (2.14)
\]
Figure 2.12: Illustration of M -neighborhoods for a 3x3 pixel square topology.
where again Z(x) is the normalization factor
\[
Z(x) = \int_\Omega \exp\left( -\frac{|v(N_M(x)) - u_0(N_1(y))|^2}{h^2} \right) dy.
\]
Again, the fitting parameter h needs to be set experimentally. We used h = √2 σ as a
starting guess, where σ is the standard deviation of the pixel values in u0. However, we
found that this value did not work for all images. If h was too small, the image u would show
little change from its initialization v. If h was too large, the resulting interpolated image
would be blurred. But for an appropriate h, NL-means would simultaneously upsample and
denoise the image.
We found that for some natural images with little patterned or textured data, NL-means
would perform very poorly regardless of the value of h and would fill in many pixels with
value zero. Quite simply, if there is no pattern to learn, then NL-means will return a value
0. So we adjusted the filter slightly by adding the initialization point v(x) to the calculation
of u(x). The pixel v(x) will necessarily have weight 1 in the filter calculation, so the filter
equation (2.14) becomes
\[
u(x) = \frac{1}{1 + Z(x)} \left( v(x) + \int_\Omega u_0(y)\, \exp\left( -\frac{|v(N_M(x)) - u_0(N_1(y))|^2}{h^2} \right) dy \right) \qquad (2.15)
\]
with the same normalization constant Z(x) as before. With this adjustment, even if the
image contains no discernible pattern, NL-means should return the image v. Note
that we could use any interpolation scheme to determine v, so we may view NL-means
interpolation as a refinement step that can be added to another interpolation scheme.
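A sketch of the adjusted filter (2.15), under the assumptions stated above (nearest-neighbor initialization v, square topology, Neumann boundaries); the function name and looping strategy are our illustrative choices:

```python
import numpy as np

def nl_means_interpolate(u0, M, radius=1, h=1.0):
    """Sketch of NL-means interpolation (2.15). u0 lives on the coarse lattice;
    v is its nearest-neighbor upsampling on the fine lattice. Each fine pixel x
    compares its M-neighborhood in v with every 1-neighborhood in u0; the v(x)
    term with weight 1 guards against the no-pattern case."""
    v = np.kron(u0, np.ones((M, M)))            # nearest-neighbor upsampling
    r = radius
    pad0 = np.pad(u0, r, mode='edge')           # Neumann boundaries, coarse lattice
    padv = np.pad(v, M * r, mode='edge')        # Neumann boundaries, fine lattice
    rows, cols = u0.shape
    # All coarse 1-neighborhood patches, flattened: one per pixel y in Omega.
    coarse = np.array([pad0[i:i + 2*r + 1, j:j + 2*r + 1].ravel()
                       for i in range(rows) for j in range(cols)])
    u = np.empty_like(v)
    for x0 in range(v.shape[0]):
        for x1 in range(v.shape[1]):
            # M-neighborhood of x in v: the square patch subsampled with stride M.
            fine = padv[x0:x0 + 2*M*r + 1:M, x1:x1 + 2*M*r + 1:M].ravel()
            w = np.exp(-((coarse - fine) ** 2).sum(axis=1) / h**2)
            u[x0, x1] = (v[x0, x1] + (w * u0.ravel()).sum()) / (1.0 + w.sum())
    return u
```

On a constant image all weights equal 1 and the filter returns the initialization, which is exactly the fallback behavior the adjustment in (2.15) was designed to guarantee.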
2.6.4 Numerical Results
For our experiments we used a 5x5 pixel square topology or, when the original image was
large enough, a 7x7 pixel square. We used Neumann boundary conditions to determine
the neighborhoods of pixels at the border. The nearest neighbor interpolation scheme was
used to produce the initial image v. Because each pixel compares its neighborhood to the
neighborhood of every other pixel, NL-means interpolation is quadratic in the number of
image pixels. The computation time is high and may take several minutes to run, depending
on the image size.
For most images, we used the parameter value h = √2 σ, although we needed to adjust
this value for some images. Figure 2.13 shows the result of applying NL-means with h = √2 σ
to a Brodatz texture. The NL-means image appears less discretized than the bicubic image,
but is also more blurred.
Figure 2.13: Interpolation of Brodatz fabric texture with zoom M = 3. Left: Original image. Center: Bicubic interpolation. Right: NL-means interpolation.
Figure 2.14 shows the ability of NL-means to simultaneously remove noise and interpolate
an image. The original image contained ringing artifacts from the image conversion
process. The edges are sharper than in nearest neighbor interpolation and the ringing
artifacts are removed.
Figure 2.14: NL-means interpolation of ringed image with zoom M = 3 compared to linear interpolation filters.
Executing the interpolation and denoising processes simultaneously may have certain
advantages over performing them separately. If denoising is performed first, then denoising
may also remove fine structures which will be on the level of noise in a low-resolution
image. If interpolation is performed first, then noisy data will also be interpolated and the
larger noise points will be harder to remove. Figure 2.15 illustrates this concept. In the
image at bottom right, the salt and pepper noise points are made larger by the bicubic
interpolation and also blurred into the background. The NL-means denoising algorithm is
unable to remove the noise, creating a stippled black background. NL-means interpolation
is more successful in recovering the pure black background. However, it was harder to
remove the noise near the edges because fewer neighborhoods in the image matched these
neighborhoods of pixels near the edge.
Figure 2.15: Interpolation of a noisy image by factor M = 3. The image at bottom-right underwent bicubic interpolation followed by NL-means denoising.
If the parameter h is too large, the image will be blurred and fine structures may be
lost. In Figure 2.16, NL-means preserves the edges on the striped texture well and the
stripes are smoothed. However, the fine detail of the shirt collar is lost. The original image
resolution was a mere 60x60 pixels, which limited the number of neighborhoods available
for comparison. This is the paradox of relying on global information for image zooming:
in order to correctly interpolate a high-resolution image, the low-resolution image must be
fairly large to begin with.
When a portion of an image is zoomed upon, NL-means could use the entire original
image for its comparison neighborhoods. Figure 2.17 shows the result on a MR brain image.
The entire MR image was used for the neighborhood calculation. The resulting image has
Figure 2.16: NL-means interpolation of textured image with zoom M = 4. Left: originalimage. Center: Bicubic interpolation. Right: NL-means interpolation.
smoothed homogeneous regions, while still giving some hint at texture. The edges, fine
structures, and contrast of the image are preserved.
2.6.5 Further Research on NL-Means Interpolation
Although the algorithm is not appropriate for all images, NL-means interpolation is
promising and could yield a truly global approach to interpolation. The algorithm is very
sensitive to the value of the filter parameter h and this warrants more investigation. We will
also experiment with different interpolation schemes for producing the initialization image
v.
It may be possible to incorporate the downsampling or camera model into the algorithm.
For example, suppose we know the downsampling is preceded by convolution with a Gaus-
sian point spread function (PSF). When comparing neighborhoods, it may be worthwhile
to replicate the camera model by convolving v with the Gaussian PSF. Another adjustment
which may be promising is to modify the neighborhoods used in our comparisons. In the case when x
is a noise point, it might prove more meaningful to compare the neighborhood surrounding
the point but not the point itself.
As in Figure 2.17, NL-means can use image information that is not part of the
portion of the image to be zoomed. It might be feasible to extend this idea to consider
Figure 2.17: NL-means zooming of portion of MR brain image. Top: original MRI. Bottom-left: lower left corner of brain. Bottom-right: NL-means zoom with M = 3.
neighborhoods in other images, as LLE-based interpolation does. We might also use ro-
tated neighborhoods to better interpolate texture. NL-means might also prove useful for
super-resolution: producing a single high-resolution image from a sequence of low-resolution
images. Most super-resolution schemes require accurate registration of the image sequence,
which can be troublesome if the resolution is very low or the objects in the image undergo
more than translations and rotations. NL-means does not require image registration, only
a large set of neighborhoods to compare.
2.7 Summary and Motivation for the Variational Approach
In this chapter, we discussed three existing interpolation techniques and presented one new
technique, all representative of the different approaches to image processing. Keeping in
mind the criteria we introduced in Chapter 1, we briefly summarize the advantages and
drawbacks of the methods as follows:
• Heat diffusion interpolation: Preserves and smooths edges well, but may over-smooth
textured regions and change contrast levels.
• Wavelet-based interpolation: Preserves and sharpens edges, but not textures. Reduces
to bicubic interpolation in textured or smooth regions. Can only double the resolution
of the image.
• LLE-based interpolation: The training set needs to be carefully selected to represent
the type of images, textures, and order of resolution that will be needed. Tends to
create small blocky regions. Unclear if it outperforms bilinear or bicubic interpolation.
• NL-means interpolation: Interpolates texture, but not specifically set up to sharpen
edges. Best suited for large, textured images. Can simultaneously interpolate and
remove noise, but may remove fine structures as well. May result in aliasing or
blurring. Sensitive to value of parameter h.
As mentioned earlier, most interpolation schemes act only on local information and fail
to interpolate texture well. This motivated the idea behind the NL-means interpolation.
However, the NL-means interpolation scheme does not always produce satisfactory results,
especially on small natural images.
For images or applications where texture is not important, we should employ the oppo-
site strategy of concentrating on local information. A good model should be self-contained,
not relying on detecting patterns within the image or on a database of images. Since the
best interpolation results focused on edges, our model should specifically account for dis-
continuities in the image. We saw that some zooming methods work well for certain types
of images, but not others. Therefore our model should be flexible, with parameters or com-
ponents that can be tuned to the image and task at hand. Finally, the model should be
robust to noise, ideally removing the noise during the zooming process. These properties of
a good model motivate the variational approach, which we will define in the next chapter.
Chapter 3
Variational Zooming
3.1 Introduction to the Variational Approach
The motivation for the variational approach is best understood in terms of Bayes' Rule, the
starting point for all of computer vision [32, 74]. The goal is to recover an ideal, noise-free
image u : Ω → ℝ from an observed, noisy image u0. Bayes' Rule seeks the image u that
maximizes the probability
\[
\max_u \Pr(u \mid u_0) = \frac{\Pr(u)\,\Pr(u_0 \mid u)}{\Pr(u_0)}. \qquad (3.1)
\]
Note that the denominator Pr(u0) is a constant and can be ignored in the optimization. If
we assume that the image u0 is corrupted by additive Gaussian white noise n with zero mean
and variance σ², we can write u0 = u + n and at a pixel x
\[
\Pr(u_0(x) \mid u(x)) \propto \exp\left( -\frac{(u(x) - u_0(x))^2}{\sigma^2} \right).
\]
Furthermore, we can express Pr (u) as a Gibbs energy in terms of a functional R(u)
\[
\Pr(u) \propto \exp(-\beta R(u))
\]
for constant β. In statistical mechanics, β is related to the temperature of the system and
Boltzmann’s constant. The maximization in (3.1) is then equivalent to
\[
\max_u \Pr(u \mid u_0) = \exp\left( -\int_\Omega \frac{(u - u_0)^2}{\sigma^2}\, dx - \beta R(u) \right).
\]
Taking the negative log likelihood of both sides and dropping constants, maximizing the
Bayesian probability becomes a minimization of an image energy
\[
\min_u E[u \mid u_0] = R(u) + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx. \qquad (3.2)
\]
The constant λ is proportional to 1/σ². The first term on the right-hand side is called the
regularization term or image prior and generally describes the smoothness of the image u.
The second term is called the fidelity or matching term and forces the computed image to
remain close to the original image in the least squares sense.
The variational approach can be seen as a form of Tikhonov regularization used in the
context of ill-posed problems. Tikhonov and Arsenin proposed the regularization
R(u) = ∫Ω |∇u|² dx, assuming smoothness as an image prior [93]. Many image priors have been
developed since, often developed specifically for the image processing task or application.
The regularization need not be explicit and can be learned from data, as done in [8] for
human faces. In this thesis we will focus on the two most popular regularization strategies:
the Total Variation (TV) norm [83] and the Mumford-Shah energy [75].
The minimization in (3.2) has proven effective for image smoothing, denoising, deblur-
ring, and segmentation [32]. The image inpainting problem, as first described by Bertalmio
et al., seeks to fill in missing or corrupted information in a damaged image while also
possibly denoising the image as a whole [12]. Let D ⊆ Ω denote the damaged region of the
image. Variational inpainting minimizes the energy
\[
\min_u E[u \mid u_0] = R(u) + \frac{\lambda}{2} \int_{\Omega \setminus D} (u - u_0)^2\, dx. \qquad (3.3)
\]
The idea is that no information is available within D so the fidelity term is set to zero in
this region, while the regularization term smooths the image as a whole.
If we view image zooming as “filling in pixels in between pixels,” image inpainting
extends naturally to image zooming. For a magnification factor M ≥ 1, let Ω be the
domain of the original image u0 : Ω → ℝ and ΩM denote the high-resolution domain of the
zoomed image u : ΩM → ℝ. Assume for notational convenience that Ω ⊆ ΩM. For digital
images on integer lattices, this can be accomplished by inserting M − 1 pixels between
the pixels of the low-resolution lattice (see Figure 3.1). The inpainting domain becomes
D = ΩM \ Ω and the inpainting model (3.3) becomes the zooming model
\[
\min_u E[u \mid u_0] = R(u) + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx. \qquad (3.4)
\]
Figure 3.1: Illustration of zooming by variational inpainting for magnification M = 3.
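The lattice embedding of Figure 3.1 amounts to a strided assignment; a small sketch (the function name is ours) that also records the known-pixel mask whose complement is the inpainting domain D = ΩM \ Ω:

```python
import numpy as np

def embed_low_res(u0, M):
    """Place each low-resolution pixel on the fine lattice Omega_M with M-1
    empty pixels in between, and return a boolean mask marking the known
    pixels; the mask's complement is the inpainting domain D."""
    rows, cols = u0.shape
    hi = np.zeros((rows * M, cols * M))
    known = np.zeros((rows * M, cols * M), dtype=bool)
    hi[::M, ::M] = u0        # known data lands on the coarse sub-lattice
    known[::M, ::M] = True
    return hi, known
```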
In this chapter, we will give a brief survey of the existing mathematical theory behind
the TV and Mumford-Shah energies, with specific attention to results on the inpainting and
zooming problems. We will then discuss numerical computation of the minimum energy
zooming using the digital TV filter and the Mumford-Shah Γ-convergence approximation.
Numerical results will be presented and compared to the zooming processes discussed in
Chapter 2. Finally, we suggest modifications to the basic inpainting/zooming model that
can improve the quality of the image interpolant.
3.2 The Total Variation (TV) Energy
The TV regularization was first proposed for image processing in the seminal paper by
Rudin, Osher, and Fatemi [83]:
\[
R_{TV}(u) = \int_\Omega |\nabla u|\, dx.
\]
TV regularization encourages image smoothness while allowing for the presence of jumps
and discontinuities, a key feature for image processing because of the importance of edges
in human vision. The norm | · | is generally assumed to be the L²-norm
\[
|\nabla u| = \sqrt{u_x^2 + u_y^2}.
\]
In the literature, this is often referred to as the isotropic norm, as it is rotationally invariant.
In Chapter 5, we will discuss quantized TV minimization under the anisotropic L1-norm.
3.2.1 Theory and Theorems
As discussed by Chan and Shen [32], the TV norm can be derived from a level set viewpoint
by building it from statistics on level curves
\[
\Gamma_\alpha = \{ x \in \Omega : u(x) = \alpha \}.
\]
Note that if u is smooth, then each Γα will be a smooth curve. If we take the length
L of the curve as a measure of smoothness, then the regularization should be
\[
R(u) = \int_{-\infty}^{\infty} L(\Gamma_\alpha)\, d\alpha.
\]
The curve length is a natural choice for measuring smoothness and is exploited specifically
by the Mumford-Shah energy. Chan and Shen proved that the length is the only Euclidean
invariant, linear additive curve energy that can be expressed as a two-point accumulation:
\[
e(x_{1 \le i \le n}) = c \sum_{i=0}^{n-1} |x_{i+1} - x_i| = c\,L(\Gamma_\alpha), \qquad x_{1 \le i \le n} \in \Gamma_\alpha.
\]
Parametrize the level set Γα by orthogonal flows s and t that are tangent and normal to
the curve. Then we have
\[
d\alpha = |\nabla u|\, dt, \qquad L(\Gamma_\alpha) = \int_{\Gamma_\alpha} ds, \qquad ds\, dt = dx_1\, dx_2 = dx.
\]
So the regularization becomes
\[
R(u) = \int_{-\infty}^{\infty} \int_{\Gamma_\alpha} |\nabla u|\, dt\, ds = \int_\Omega |\nabla u|\, dx.
\]
A more formal derivation leads to the famous co-area formula expressing the TV norm as
the sum of level set perimeters.
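For reference, the co-area formula mentioned here is usually stated as follows (a standard result, e.g. for u ∈ BV(Ω)):

```latex
\int_\Omega |Du| \;=\; \int_{-\infty}^{\infty} \operatorname{Per}\bigl(\{x \in \Omega : u(x) > \alpha\};\, \Omega\bigr)\, d\alpha
```

where Per(E; Ω) denotes the perimeter of the set E inside Ω, so the total variation accumulates the lengths of all level set boundaries.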
The gradient in the TV norm implicitly assumes that the image u ∈ C1(Ω), although a
general image will contain corners and discontinuities at the edges. Computationally, this
does not pose a problem because the image is digital and the gradient is discretized by finite
differences. But theoretically, we should discuss the gradient in the distributional sense Du
and functions in BV space.
Definition 3.1 (BV(Ω)) For a bounded open set Ω ⊂ ℝ² and a function u ∈ L¹(Ω), set
\[
\int_\Omega |Du| = \sup\left\{ \int_\Omega u \left( \frac{\partial \phi_1}{\partial x_1} + \frac{\partial \phi_2}{\partial x_2} \right) dx \;:\; \phi = (\phi_1, \phi_2) \in C_0^1(\Omega)^2,\ |\phi|_{L^\infty(\Omega)} \le 1 \right\}
\]
under the Lebesgue measure dx. Define BV(Ω), the space of functions of bounded variation,
to be
\[
BV(\Omega) = \left\{ u \in L^1(\Omega) : \int_\Omega |Du| < \infty \right\}.
\]
Note that for u ∈ C¹(Ω), we have ∫Ω |Du| = ∫Ω |∇u| dx. Most theoretical results concerning
the TV norm are for functions in BV, which possesses several desirable properties
such as lower semicontinuity and compactness. For example, the TV energy (3.2) does not
generally attain a minimum in the Sobolev space W^{1,1}(Ω), but does in BV space [83].
Theorem 3.1 (Rudin-Osher-Fatemi, 1992) For an observed image u0 ∈ L²(Ω), the minimizer
of the TV energy
\[
E_{TV}[u \mid u_0] = \int_\Omega |Du| + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx
\]
exists and is unique in BV(Ω).
However, TV minimization for the inpainting problem is in general unstable, as the
example illustrated in Figure 3.2 shows. Suppose a binary 0-1 image consists of a black
rectangle with height h and width 3h. Suppose the inpainting domain D is a long rectangular
strip of width d centered over the rectangle. Note that for a 0-1 image, the TV norm
corresponds to the total perimeter of the geometric shapes, up to the choice of norm for the
image corners. For d < h, simple geometry shows that the minimum TV energy is attained
by a solid h × 3h black rectangle. For d > h, the minimizer will consist of two separated
black d× d squares. For the special case d = h, the minimum energy is attained by both of
the images just described. While it is somewhat upsetting that TV inpainting is unstable
for such a trivial example, this is actually consistent with Gestalt principles of human vision
psychology.
Figure 3.2: Inpainting a simple image.
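The bifurcation at d = h can be checked by a back-of-the-envelope perimeter count, under one consistent reading of the geometry (the strip crosses the short dimension of the rectangle, and corners carry no extra cost):

```latex
\underbrace{2(h + 3h)}_{\text{connected filling}} = 8h,
\qquad
\underbrace{2 \cdot 2\left(h + \tfrac{3h - d}{2}\right)}_{\text{two separated pieces}} = 10h - 2d.
```

The connected filling wins when 8h < 10h − 2d, i.e. d < h; the energies coincide at d = h (where the two pieces are exactly h × h squares); and the separated configuration wins for d > h, matching the three cases described above.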
Instability becomes even more troublesome for the zooming problem, where the known
domain consists of isolated pixels. In a continuous TV zooming model, the existence of a
minimum is not guaranteed and the interpolant depends on the chosen numerical scheme
[24]. One solution is to minimize the discrete TV energy, as presented in the next section.
3.2.2 Numerical Computation: The Digital TV Filter
The Euler-Lagrange equation associated with the TV energy is
\[
-\nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) + \lambda (u - u_0) = 0 \qquad (3.5)
\]
with Neumann boundary conditions
\[
\frac{\partial u}{\partial \vec{n}} = 0 \quad \text{on } \partial\Omega.
\]
This equation can be solved by numerical methods such as gradient descent
\[
\frac{\partial u}{\partial t} = \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) - \lambda (u - u_0).
\]
PDE-based methods have to control the size of the time step to make sure the computation
is stable while still converging in an efficient manner. Malgoyres and Guichard outlined
a stable gradient-based implementation of TV zooming in Fourier space which enhances
edges, but produces slight ringing artifacts [71, 72]. These methods assume a continuous
model and may not be appropriate for inpainting / zooming problems.
Since the inputs and outputs of the TV energy are digital images defined on discrete
square lattices, Chan, Osher, and Shen proposed calculating the digital TV norm [33]. Let
N(x) ⊆ Ω denote a neighborhood of the pixel x, consisting of pixels near x but not
including x itself. The standard topology is the 4-connected neighborhood: the neighbors
of pixel (i, j) are the pixels (i ± 1, j), (i, j ± 1). The digital TV norm is defined as
\[
R_{TV}(u) = \int_\Omega |\nabla u|\, dx = \sum_{x \in \Omega} \sqrt{\sum_{y \in N(x)} (u(x) - u(y))^2}.
\]
The Euler-Lagrange equation associated with the digital TV energy is
\[
\sum_{y \in N(x)} \frac{1}{|\nabla u|} (u(x) - u(y)) + \lambda (u - u_0) = 0.
\]
To solve this Euler-Lagrange equation, the authors suggest a lagged diffusivity fixed-point
iterative scheme. The term 1/|∇u| is frozen for one iteration and treated as a constant in the
update u^(n) → u^(n+1):
\[
\sum_{y \in N(x)} \frac{1}{|\nabla u^{(n)}|} \left( u^{(n+1)}(x) - u^{(n+1)}(y) \right) + \lambda (u^{(n+1)} - u_0) = 0.
\]
Solving for u^(n+1) yields the digital TV filter
\[
u^{(n+1)}(x) = \frac{\sum_{y \in N(x)} h^{(n)}(y)\, u^{(n)}(y) + \lambda u_0(x)}{\sum_{y \in N(x)} h^{(n)}(y) + \lambda}, \qquad h^{(n)}(y) = \frac{1}{|\nabla u^{(n)}(y)|}.
\]
There are several methods for discretizing the gradient |∇u(x)| in h^(n). Chan, Osher, and
Shen suggest a central difference scheme centered around the midpoint between pixel x and
its neighbor. For example, the discretization of |∇u| around pixel x = (i, j) for the neighbor
to the right (i + 1/2, j) is
\[
\sqrt{\left( u(i+1, j) - u(i, j) \right)^2 + \left( \frac{u(i+1, j+1) + u(i, j+1) - u(i+1, j-1) - u(i, j-1)}{4} \right)^2}.
\]
The discretization for the other three directions is very similar. To avoid division by zero
in smooth regions, a lifting parameter a is introduced
\[
|\nabla u|_a = \sqrt{|\nabla u|^2 + a^2}.
\]
The authors claim the algorithm is stable for a = 10⁻⁴.
The algorithm is known to be stable for all input u0. An interesting feature is that the
filter satisfies a maximum principle, in the sense that the values of u will not exceed the
maximum of u0. Other interpolation filters, such as bicubic, can overshoot the maximum
when attempting to fit the given data to a smooth function.
To adapt the model to inpainting, an indicator function is added to the fidelity
term to enforce matching only on the undamaged pixels. For an inpainting domain D, the
Euler-Lagrange equation is
\[
-\nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) + \mathbf{1}_{\Omega \setminus D}(x)\, \lambda (u - u_0) = 0.
\]
The digital TV filter is
\[
u^{(n+1)}(x) = \frac{\sum_{y \in N(x)} h^{(n)}(y)\, u^{(n)}(y) + \mathbf{1}_{\Omega \setminus D}(x)\, \lambda u_0(x)}{\sum_{y \in N(x)} h^{(n)}(y) + \mathbf{1}_{\Omega \setminus D}(x)\, \lambda}, \qquad h^{(n)}(y) = \frac{1}{|\nabla u^{(n)}(y)|}.
\]
For the zooming case, the domain Ω\D is replaced with the low-resolution lattice Ω ⊆ ΩM .
Since the original data is finite-dimensional, the digital TV energy always permits a solution.
The value of the parameter λ balances the smoothness and fidelity terms and has a large
effect on the resulting image. As λ→ 0, the image becomes a constant image corresponding
to the mean of u0. As λ → ∞, the minimizer u → u0. There are several computational
methods for setting the parameter, including generalized cross-validation and the L-curve
method [97]. However, there is no “optimal” parameter value since the desired result de-
pends on the noise level, the image, the application, and the user’s subjective expectations.
We set the parameter experimentally by inspection.
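To make the zooming iteration concrete, here is a simplified sketch of the masked digital TV filter. It deliberately departs from the scheme above in one respect we flag explicitly: |∇u| is evaluated by central differences at pixel centers rather than at the midpoints, and the function name, defaults, and iteration count are our illustrative choices:

```python
import numpy as np

def digital_tv_zoom(u0, M, lam=1.0, a=1e-4, iters=200):
    """Sketch of the digital TV filter adapted to zooming: lagged-diffusivity
    sweeps on the fine lattice, with the fidelity term switched on only at the
    known low-resolution pixels (the indicator of Omega\\D)."""
    u = np.kron(u0, np.ones((M, M)))   # nearest-neighbor initialization
    known = np.zeros_like(u, dtype=bool)
    known[::M, ::M] = True             # low-res data sits on the coarse sub-lattice
    f = np.zeros_like(u)
    f[::M, ::M] = u0
    for _ in range(iters):
        g = np.pad(u, 1, mode='edge')  # Neumann boundary conditions
        # Lifted gradient magnitude |grad u|_a via central differences.
        gx = (g[1:-1, 2:] - g[1:-1, :-2]) / 2.0
        gy = (g[2:, 1:-1] - g[:-2, 1:-1]) / 2.0
        hmap = 1.0 / np.sqrt(gx**2 + gy**2 + a**2)   # h = 1/|grad u|_a
        hp = np.pad(hmap, 1, mode='edge')
        up = np.pad(u, 1, mode='edge')
        # 4-connected neighbor sums, weighted by h evaluated at the neighbor.
        num = (hp[1:-1, 2:] * up[1:-1, 2:] + hp[1:-1, :-2] * up[1:-1, :-2] +
               hp[2:, 1:-1] * up[2:, 1:-1] + hp[:-2, 1:-1] * up[:-2, 1:-1])
        den = (hp[1:-1, 2:] + hp[1:-1, :-2] + hp[2:, 1:-1] + hp[:-2, 1:-1])
        u = (num + known * lam * f) / (den + known * lam)
    return u
```

Note that the update is a weighted average of neighbors plus (where the indicator is on) the data term, so the iterates obey the same maximum principle discussed above.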
3.3 The Mumford-Shah Energy
The model introduced by Mumford and Shah in 1989 simultaneously tracks the minimum
image u and the edge set Γ of the image [75]. The regularization term is
\[
R_{MS}(u, \Gamma) = \int_{\Omega \setminus \Gamma} |\nabla u|^2\, dx + \gamma\, \mathcal{H}^1(\Gamma)
\]
where H¹(Γ) denotes the one-dimensional Hausdorff measure and quantifies the total length
of the edges. The regularization smooths the image away from edges while controlling the
size of the edge set. Compared to the TV norm, the exponent 2 on the gradient in the
Mumford-Shah regularization enforces greater smoothing. The exponent 1 in the TV norm
gives equal preference to sharp edges and smooth gradients. Because it also tracks the
edges, the Mumford-Shah functional prefers smoother gradients away from the edges.
3.3.1 Theory and Theorems
Because the minimization is a free boundary problem, less is known theoretically about
the minimizers of the Mumford-Shah energy than the TV energy. In the original paper,
Mumford and Shah proved an interesting result about the geometry of the minimizer.
Theorem 3.2 (Mumford-Shah, 1989) Let (u ∈ W^{1,2}(Ω), Γ ⊂ Ω) be a minimizer of the
Mumford-Shah energy
\[
E_{MS}[u, \Gamma \mid u_0] = \int_{\Omega \setminus \Gamma} |\nabla u|^2\, dx + \gamma \mathcal{H}^1(\Gamma) + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx. \qquad (3.6)
\]
Suppose Γ = ∪Γi where each Γi is a simple C^{1,1}-curve and each curve meets another curve or
the boundary ∂Ω only at its endpoints. Then any vertex of Γ must be one of the following:
1. a point on ∂Ω where Γi meets ∂Ω perpendicularly.
2. a point where three Γi's meet with angle 2π/3 (a triple junction).
3. a point where Γi ends and meets nothing (a crack-tip).
This theorem, however, does not establish the existence of such a minimizer where Γ
consists of C1,1-curves. The existence is called the Mumford-Shah Conjecture and has been
studied by several researchers, notably Braides and Bonnet, but remains an open problem
[18]. Ambrosio established the existence of a minimizing image in a special subset of BV
[3]. This minimizer was later shown by other researchers to be a minimizer of (3.6), but
uniqueness of the image and the precise nature of Γ have not been established.
For inpainting, the minimum image is in general non-unique. If we assume that the
edges of a binary 0-1 image occur exactly at the discontinuities, then the Mumford-Shah
and TV energies are equivalent up to the choice of parameters. The binary image will be
perfectly smooth away from the edges and the length of the edges equals the magnitude
of the TV norm, except perhaps at corners. So the trivial example presented in the last
section also shows Mumford-Shah inpainting is non-unique.
Asymptotically, the Mumford-Shah model also has uniqueness issues for the zooming
problem. If we let γ →∞, it can be shown that the minimizing edge set Γ vanishes to ∅ to
50
compensate [32, 75]. The energy (3.6) becomes the Tikhonov or Sobolev smoothing
\[
E[u \mid u_0] = \int_\Omega |\nabla u|^2\, dx + \frac{\lambda}{2} \int_\Omega (u - u_0)^2\, dx.
\]
Note that inpainting a domain D changes the limit on the second integral to Ω \D. If we
further let λ → ∞, then we obtain harmonic inpainting:
\[
\Delta u = 0 \ \text{in } D, \qquad u(x) = u_0(x) \ \text{for } x \in \Omega \setminus D, \qquad \frac{\partial u}{\partial \vec{n}} = 0 \ \text{on } \partial\Omega. \qquad (3.7)
\]
For the zooming problem, D = ΩM \ Ω, so the known data consist of the isolated pixels of the
low-resolution lattice. Finding a harmonic function with boundary conditions given by such
zero-dimensional data is an ill-posed problem [47]. This suggests that Mumford-Shah zooming
may produce undesirable results, at least in the continuous model.
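Harmonic inpainting (3.7) itself is straightforward to relax numerically; here is a sketch by Jacobi iteration on the 4-neighbor discrete Laplacian (the function name and fixed sweep count are our illustrative choices):

```python
import numpy as np

def harmonic_inpaint(u0, mask, iters=500):
    """Sketch of harmonic inpainting (3.7) by Jacobi iteration: pixels where
    mask is True are known data; the rest are repeatedly relaxed toward the
    average of their 4 neighbors, the fixed point of the discrete Laplace
    equation."""
    u = np.where(mask, u0, u0[mask].mean())   # initialize unknowns with the data mean
    for _ in range(iters):
        p = np.pad(u, 1, mode='edge')         # Neumann condition on the image border
        avg = (p[1:-1, 2:] + p[1:-1, :-2] + p[2:, 1:-1] + p[:-2, 1:-1]) / 4.0
        u = np.where(mask, u0, avg)           # keep the data, relax the rest
    return u
```

On a one-pixel-wide gap between two known values this converges to linear interpolation, the 1D analogue of a harmonic fill.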
An error estimate for harmonic inpainting was developed by Chan and Shen [31] and
later by Chan and Kang [30]. A Green’s function G for a given domain D solves the
harmonic inpainting problem (3.7), if such a G exists. As described in [47], the Green’s
function satisfies
−ΔG = δ(y − x) in D,  G = 0 on ∂D.
Then the harmonic function uh satisfying (3.7) is given by

uh(x) = −∫_{∂D} u0(y(s)) ∂G(x, y)/∂n ds    (3.8)

where n is the outward normal along ∂D and s is the arclength parameter of ∂D. Suppose
the ideal image is utrue, with the image matching the given data outside D: utrue = u0 on
Ω \ D. The true image can be expressed in terms of the Green's function by the double
layer potential
utrue(x) = −∫_{∂D} u0(y(s)) ∂G(x, y)/∂n ds − ∫_D Δutrue(y) G(x, y) dy.    (3.9)

Subtracting equation (3.8) from (3.9) cancels the first term, yielding

utrue(x) − uh(x) = −∫_D Δutrue(y) G(x, y) dy.
This establishes the following bound on the inpainting error.
Theorem 3.3 (Chan-Shen, 2002) Suppose uh, u0, utrue ∈ C²(Ω) and the inpainting domain D has a smooth boundary. Then for any point x ∈ D,

|utrue(x) − uh(x)| ≤ L ∫_D G(x, y) dy

where L is a constant satisfying |Δutrue| ≤ L in D.
This theorem is quite elegant because it shows that the error arises from natural and intuitive sources. Inpainting error depends on three factors: the smoothness of the underlying
image (Δutrue), the size of the inpainting domain (integration over D), and the geometry
of the domain (G). By applying the Green’s function for an ellipse and the comparison
principle for Green’s functions, Chan and Kang obtained the following error bound.
Corollary 3.4 (Chan-Kang, 2005) If the inpainting domain D can be covered by an
ellipse with minor diameter d, then

|utrue(x) − uh(x)| ≤ 2Ld².
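A one-dimensional sanity check of this bound (our own illustration, not from the text): on an interval of width d, harmonic inpainting reduces to linear interpolation between the endpoint values, and for utrue with Δutrue ≡ L the maximum error is Ld²/8, comfortably inside 2Ld².

```python
import numpy as np

# interval of width d, u_true with constant Laplacian L
L, d, n = 3.0, 0.5, 101
x = np.linspace(0.0, d, n)
u_true = 0.5 * L * x**2                               # Laplacian is exactly L
u_h = u_true[0] + (u_true[-1] - u_true[0]) * x / d    # linear (harmonic) interpolant
err = np.max(np.abs(u_true - u_h))                    # max error is L*d^2/8
```

The worst error occurs at the midpoint of the interval, which matches the intuition that long, narrow inpainting domains (small d) are the favorable case.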
This corollary proves that harmonic inpainting is best for long and narrow domains, such
as scratches on a photograph. This is a well-known phenomenon that was observed in the
first research article on image inpainting [12]. Since the Mumford-Shah energy asymptotically approaches the harmonic inpainting model (3.7) as λ, γ → ∞, we can intuitively expect
the Mumford-Shah inpainting error to be bounded by the error for harmonic inpainting, at
least under some choice of parameters.
If we view the zooming process for magnification factor M locally as inpainting a distance
M between pixels, then we can conjecture based on Corollary 3.4 that the Mumford-Shah
zooming error is O(LM²). Unfortunately, the error analysis does not extend to the
zooming problem. To cover the inpainting domain in Figure 3.1 with an ellipse, we would
need to span the entire image. Also, the given data consists of isolated points, so the Green's
function cannot exist in the classical sense for the simple reason that ∂D = ∅. However,
the underlying data is digital and the pixels are only zero-dimensional in the continuous
sense, indicating that perhaps a modified error estimate is applicable to the discrete
computation.
3.3.2 Numerical Computation: The Γ-Convergence Approximation
The standard minimization approach in the calculus of variations is to solve the Euler-
Lagrange equation, but this is difficult for the Mumford-Shah model because the energy
(3.6) is not differentiable. There are two main approaches to minimizing the Mumford-Shah
energy: level set methods [79] and approximating the energy by a suitable functional [7].
We will discuss the latter technique, specifically the approximation developed by Ambrosio
and Tortorelli [4]. This Ambrosio-Tortorelli (AT) approximation has been shown to be
equivalent to the Mumford-Shah energy in the Γ-convergence sense, defined below.
Definition 3.2 (Γ-convergence) A sequence fj : X → ℝ ∪ {∞} Γ-converges in X to
f : X → ℝ ∪ {∞} if for all x ∈ X the following two properties hold:

• For every sequence xj converging to x, we have f(x) ≤ lim inf fj(xj).

• There exists a sequence xj converging to x such that f(x) ≥ lim sup fj(xj).
Under reasonable assumptions on the set X, the minimizer of the functional fj coincides
with the minimizer of its Γ-limit f [18]. The idea behind the AT approximation is to
replace the edge set Γ, which is difficult to track numerically, with an edge canyon function
z : Ω → [0, 1]. For a fixed parameter ε > 0, the function z ∈ L¹(Ω) is designed to be

z(x) = 0 if x ∈ Γ,  z(x) = 1 if d(x, Γ) > ε,

with all remaining values in Ω defined by L¹-extension. The AT approximation is then
given by

E_AT[u, z | u0] = ∫_Ω z²|∇u|² dx + γ ∫_Ω ( ε|∇z|² + (1 − z)²/(4ε) ) dx + (λ/2) ∫_Ω (u − u0)² dx.    (3.10)
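The three terms of (3.10) can be evaluated directly with finite differences; the sketch below (our own discretization, unit grid spacing, np.gradient derivatives) is useful for monitoring the energy during minimization:

```python
import numpy as np

def at_energy(u, z, u0, eps=1.0, gamma=1.0, lam=1.0):
    """Discrete Ambrosio-Tortorelli energy (3.10) with finite differences."""
    ux, uy = np.gradient(u)
    zx, zy = np.gradient(z)
    grad_u2 = ux**2 + uy**2
    grad_z2 = zx**2 + zy**2
    smooth = np.sum(z**2 * grad_u2)                        # edge-gated smoothness
    length = gamma * np.sum(eps * grad_z2 + (1 - z)**2 / (4 * eps))  # edge length
    fidel = 0.5 * lam * np.sum((u - u0)**2)                # data fidelity
    return smooth + length + fidel

u0 = np.zeros((8, 8))
u = u0.copy()
z = np.ones_like(u0)    # no edges: the flat image has zero energy
```

Marking the whole image as an edge (z ≡ 0) is penalized only through the (1 − z)²/(4ε) term, which is why γ controls how much of the image may be designated as edge.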
Comparing this functional term by term to E_MS in (3.6), the first term coincides with
∫_{Ω\Γ} |∇u|² dx because z = 0 on the edge set. This term also has the effect of forcing z to zero
in regions with large variation where |∇u| is large. The second and third terms correspond
to the length H¹(Γ), with the second term smoothing z and the third forcing z = 1 almost
everywhere. As ε → 0, the functional E_AT Γ-converges to E_MS in L¹(Ω). Furthermore,
E_AT admits a minimizer uε that converges in L¹(Ω) to a minimizer u of E_MS in a special
subset of BV(Ω). For an overview of Γ-convergence and the theory surrounding the AT
approximation, we refer the reader to Braides’ monograph [18] and Chapter 4 of the book
by Aubert and Kornprobst [7].
The AT approximation is differentiable and makes standard variational approaches possible. The Euler-Lagrange equations are

−∇·(z²∇u) + λ(u − u0) = 0

|∇u|²z + γ( −2εΔz + (z − 1)/(2ε) ) = 0.    (3.11)
We impose Neumann boundary conditions

∂u/∂n = ∂z/∂n = 0 on ∂Ω.
To phrase this as an elliptic system, Esedoglu and Shen [46] introduced the differential
operators

L_z = −∇·(z²∇) + λ

M_u = ( 1 + (2ε/γ)|∇u|² ) − 4ε²Δ.

Then the Euler-Lagrange equations in (3.11) can be written

L_z u = λu0,  M_u z = 1.    (3.12)
This system can be solved with an iterative solver such as Gauss-Jacobi, alternating the
minimization of u and z.
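One way to organize this computation is a pair of Jacobi sweeps per iteration, each freezing one variable. The discretization below is our own sketch (z² weights taken at the center pixel, replicate boundaries, fixed sweep count), not the exact scheme of [46]:

```python
import numpy as np

def neighbors_sum(a):
    """Sum of the 4-connected neighbors with replicate (Neumann) boundary."""
    p = np.pad(a, 1, mode='edge')
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def at_minimize(u0, lam=20.0, gamma=2000.0, eps=1.0, sweeps=100):
    """Alternating Jacobi sweeps for L_z u = lam*u0 and M_u z = 1."""
    u, z = u0.copy(), np.ones_like(u0)
    for _ in range(sweeps):
        # z-step: (1 + (2 eps/gamma)|grad u|^2) z - 4 eps^2 Laplacian(z) = 1
        gx, gy = np.gradient(u)
        g2 = gx**2 + gy**2
        z = (1.0 + 4.0 * eps**2 * neighbors_sum(z)) / \
            (1.0 + (2.0 * eps / gamma) * g2 + 16.0 * eps**2)
        # u-step: lam*(u - u0) - div(z^2 grad u) = 0, z^2 at the center pixel
        w = z**2
        u = (lam * u0 + w * neighbors_sum(u)) / (lam + 4.0 * w)
    return u, z

# a constant image is a fixed point: u stays at u0 and z stays at 1 (no edges)
u0 = np.full((16, 16), 0.5)
u, z = at_minimize(u0)
```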
To adapt this problem to inpainting, we simply restrict the fidelity parameter λ to be
zero on the damaged region D, replacing λ by 1_{Ω\D}(x)λ in L_z. The equations in (3.12) become

L_z u(x) = 1_{Ω\D}(x) λu0(x),  M_u z(x) = 1.

Zooming replaces Ω \ D above with the low-resolution domain Ω ⊆ ΩM. Esedoglu and Shen
note that for the inpainting problem, ε = 1 will suffice [46].
The other parameters γ and λ need to be set carefully, balancing the edge length, fidelity,
and implicit weight 1 on the smoothness term. As before, the parameter λ should be
inversely proportional to the amount of noise in the image: λ = O(1/σ²). The parameter γ
essentially determines how much of the image can be designated as an edge. As γ → ∞,
the edge canyon function z → 1 a.e. and the edge set vanishes. As γ → 0, z → 0 a.e. to
make the smoothness term z²|∇u|² smaller, effectively designating the entire image as an
edge.
3.4 Numerical Results and Discussion
Variational zooming was implemented using the digital TV filter and the Mumford-Shah
Γ-convergence model described in the previous sections. For TV zooming, we set the lifting
parameter a = 10⁻⁴. For Mumford-Shah zooming, we used the value ε = 1. The variational
models are sensitive to the other parameters, as shown in Figure 3.3. A simple checkerboard
image is zoomed by a factor M = 3. The interpolant should visually match the original
image, but the results vary widely depending on the choice of parameters.
Figure 3.3: TV and Mumford-Shah zoom of checkerboard image for magnification M = 3. The fourth column is a detail view of the image in the third column.
For the TV model, the interpolant becomes blurred if λ is set too small, with the
blurring most noticeable at corners. In a noise-free image, the value of λ can be set very
large. An artifact known as “scalloping” or the “zipper” artifact is shown in the image
at the top right. Under the TV L² norm, the inpainted regions between known pixels are
not necessarily piecewise constant, and oscillating gray values appear along the edges.
The zipper artifacts become more prominent for large values of λ, because the known pixel
values cannot be smoothed.
The Mumford-Shah interpolant depends on the balance between the fidelity weight λ
and the edge length parameter γ. As with the TV model, the image becomes more blurred
as λ → 0. If γ is too large relative to λ, the checkerboard images have wavy edges because
the model is minimizing the total edge length. Also, for large γ the image becomes more
blurred because the edge set is small and the smoothing term can blur a larger portion of
the image. Note that the first two Mumford-Shah images show the effect of Theorem 3.2.
Rather than the edges meeting perpendicularly, the corners are rounded off to form several
triple junctions with angle 2π/3 between the edges. The third image shows the best balance
of the parameters, but the detail reveals the presence of zipper artifacts.
In general, the value of λ should be set as large as possible while still removing any
noise or unwanted features. For natural images with little or no noise, a value λ = 100
usually works well for the digital TV filter. We found that the Mumford-Shah parameters
λ = 20 and γ = 2000 work well for natural images. Throughout this thesis, the variational
parameters are adapted for the situation, as the parameters depend on both the image and
the application.
To zoom a color image, each of the RGB color channels is enhanced separately. This
assumes the color channels are uncorrelated, which of course they are not. There has been
some research on adapting variational methods for color spaces, notably the work by Sapiro
and Ringach for the vector-valued TV norm [85]. Figure 3.4 shows the result of 4x image
zoom on a color image. Note that the variational methods will smooth out fine structures
such as the glasses on the face and the text on the board. Isolated pixels can be seen in
the text, which may actually correspond to a local minimum of the energy. We expect
more isolated pixels to appear in the interpolant as the magnification M gets larger and the
corresponding domain D to inpaint also grows. As can be seen on the face, textures are
over-smoothed and the resulting image may appear “plastic.”
This suggests variational zooming may not be appropriate for producing photo-realistic
images. Instead, the methods are best suited for applications where image smoothing is a
desired result, such as in medical image enhancement or preparing images for automatic
recognition routines [6]. Figure 3.5 shows the results of 3x image zoom of a noisy MRI
brain image. The bicubic interpolation actually enhances the noise, since each pixel in the
original image is given equal weight. The TV and Mumford-Shah methods help smooth
out the noise, while also smoothing out the texture to make the anatomical features more
distinct. Note that simultaneously removing noise and enhancing the resolution can be
difficult, because the zooming procedure isolates the noise points and uses them as guides
for inpainting. Whenever possible, the noise should be removed from the low-resolution
image first before zooming.
Figure 3.4: Zoom of color image with M = 4.
Images produced by variational zooming may contain artifacts, including:

1. Over-smoothing of textured regions and fine structures.

2. Zipper artifacts along edges.

3. Isolated pixels in the final image.

Figure 3.5: Zoom of MRI brain image with M = 3.
In the next section, we will suggest some modifications to the inpainting model to help
correct for these artifacts. Compared to the zooming methods discussed in Chapter 2, the
variational approach offers the following advantages:
1. Genericity: The variational method can be described completely in one energy equation and does not presuppose prior knowledge of the image or image class. The LLE-based zoom in Section 2.5 required a database of image textures, while the NL-means
zooming in Section 2.6 assumed the image contained detectable patterns. As noted
on page 258 of [32], variational inpainting is both local and functional. That is, the
inpainting is based only on information in the vicinity of the missing data and the al-
gorithm treats the images only as functions, not data that requires high-level pattern
recognition.
2. Flexibility: The parameters of variational zooming can be fine-tuned in an intuitive
manner to best suit the image and application, e.g. increasing the smoothing for
medical images. This tuning is harder to accomplish with the wavelet-based method
in Section 2.4 or the PDE-based method in Section 2.3. However, this property may make it
more difficult to select the appropriate parameters for a given image. We found that
fixed parameter values worked well across classes of images, e.g. medical images or text
images, so that the parameters did not need to be tuned for every new image.
3. Edge preservation: The TV norm is designed to allow discontinuities in the image.
The Mumford-Shah energy can enhance edges by smoothing the regions in the vicinity
of the edge. The Mumford-Shah edge length term will also smooth the edges, so
aliasing or staircasing effects should be less prominent than with linear filters. One-
dimensional edges, such as the glasses in Figure 3.4, may be smoothed out.
4. Stability: Variational zooming is robust to image noise and can even help remove
noise points. The variational approach is also somewhat robust to blur, although it
is difficult to remove without accurate knowledge of the blurring process. Algorithms
based on local pattern detection, such as the LLE and NL-means zoom, will be very
sensitive to noise and blur.
3.5 Modifications to the Inpainting Model
3.5.1 Incorporating a Blur Kernel
In the previous discussion, we assumed the image was corrupted by noise but not blur. A
more accurate image degradation model would be

u0 = K[u] + n

for some blur operator K. The blur operator could involve image corruption from many
physical sources: camera blur, optical blur, motion, atmospheric effects, etc. [32]. Generally,
K is assumed to be a convolution with a shift-invariant kernel k(x):

K[u](x) = (k ∗ u)(x) = ∫_Ω k(x − y) u(y) dy.
The TV inpainting model incorporating K becomes

min_u E[u|u0, K] = R(u) + (λ/2) ∫_{Ω\D} (K[u] − u0)² dx.

The associated Euler-Lagrange equation is

−∇·( ∇u/|∇u| ) + 1_{Ω\D}(x) λ K*(K[u] − u0) = 0

where K* denotes the adjoint of K. For a convolution operator, K* is the convolution
with the kernel k reflected about the origin. Incorporating a blur operator into the TV
zooming model has been shown to be effective in reducing blur in the interpolant [1]. The
fidelity term and Euler-Lagrange equations for the Mumford-Shah model are very similar.
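The adjoint relation can be checked numerically: with periodic boundary conditions (an assumption made here purely for simplicity), convolution diagonalizes under the FFT and ⟨Ku, v⟩ = ⟨u, K*v⟩ holds when K* convolves with the reflected kernel. A sketch (the helper name and kernel placement are our own):

```python
import numpy as np
from numpy.fft import fft2, ifft2

rng = np.random.default_rng(0)
k = rng.random((5, 5)); k /= k.sum()          # a normalized blur kernel
u = rng.random((16, 16))
v = rng.random((16, 16))

def conv_periodic(img, ker):
    """Circular convolution via the FFT, kernel center shifted to the origin."""
    K = np.zeros_like(img)
    K[:ker.shape[0], :ker.shape[1]] = ker
    K = np.roll(K, (-(ker.shape[0] // 2), -(ker.shape[1] // 2)), axis=(0, 1))
    return np.real(ifft2(fft2(img) * fft2(K)))

k_reflected = k[::-1, ::-1]                     # reflect kernel about the origin
lhs = np.sum(conv_periodic(u, k) * v)           # <K u, v>
rhs = np.sum(u * conv_periodic(v, k_reflected)) # <u, K* v>
```

The two inner products agree to floating-point precision, which is exactly the adjoint property used in the Euler-Lagrange equation above.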
In practice, the blur operator K is not known and needs to be estimated from the data
or the camera model. Chan and Wong developed a blind TV deblurring algorithm that
imposes a TV smoothness constraint on the blur kernel k:

min_{u,k} E[u, k | u0] = R(u) + (λ/2) ∫_{Ω\D} (k ∗ u − u0)² dx + β ∫_Ω |∇k| dx.
The algorithm alternately deblurs the image and smooths the blur operator. The algorithm
is known to converge for suitable pre-conditioners [34].
3.5.2 Locally Adaptive Fidelity Weights
While variational methods remove image noise by smoothing, they may also smooth out
textured regions in an image. The resulting images are often said to appear “plastic.” In
the TV energy the amount of smoothing is controlled by the parameter λ, so one solution
might be to relax the constant λ to a spatially varying function λ(x):

min_u E_TV[u|u0] = ∫_Ω |∇u| dx + (1/2) ∫_{Ω\D} λ(x) (u − u0)² dx.    (3.13)
The value of λ(x) should adapt to the neighborhood of x: small in noisy or smooth regions,
where strong smoothing is desired, and large in textured regions with large variation, where
detail should be preserved. A simple first attempt is to set

λ(x) ∝ σ²_loc(x) / σ²

where σ² is the variance of the noise in u0 and σ²_loc(x) is the local variance, calculated
over a fixed neighborhood size. The constant of proportionality needs to be determined,
so there is still a parameter to fine-tune for the image and application. If the variance
in a neighborhood of pixel x is large, then the value of λ(x) is set large so that the textured
region is not over-smoothed. The problem with this approach is that it requires estimates
of the noise variance, as opposed to the variance of the image gray values, which would not
distinguish noise from texture.
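The local variance σ²_loc(x) can be estimated with a sliding window; a minimal sketch (the uniform square window and the variable names are our own choices):

```python
import numpy as np

def local_variance(img, radius=2):
    """Variance over a (2*radius+1)^2 window around each pixel (replicate boundary)."""
    size = 2 * radius + 1
    p = np.pad(img, radius, mode='edge')
    windows = np.lib.stride_tricks.sliding_window_view(p, (size, size))
    return windows.var(axis=(-2, -1))

rng = np.random.default_rng(1)
img = np.zeros((12, 24))
img[:, 12:] = rng.normal(0.0, 1.0, (12, 12))  # right half noisy, left half flat
sigma2_loc = local_variance(img)
lam = sigma2_loc / 1.0                         # lambda(x) ~ sigma_loc^2 / sigma^2
```

As expected, the weight map vanishes on the flat half and is of order one on the high-variance half; it cannot by itself tell whether that variance is texture or noise.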
Gilboa, Zeevi, and Sochen proposed a solution that iteratively updates the fidelity
weights of the TV norm [52]. An initial image u(0) is calculated from the minimum TV
energy (3.13) using a constant λ(0) ∝ 1/σ². Then the fidelity weights for the next iteration
are calculated by

λ(n+1)(x) = ( σ²_loc(x)/σ² ) Q(n)(x),  Q(n)(x) = (u(n) − u0) ∇·( ∇u(n)/|∇u(n)| ).    (3.14)
The local variance is estimated by the variance over an N × N window of u0 convolved with
a Gaussian. The iteration stops when the image update is below some threshold. Note that
the formula for Q appears in the TV Euler-Lagrange equation (3.5). The idea is to take the
variance of the gray values as an estimate of the noise variance, then rework this estimate
into the TV minimization. If the value of Q(x) is large, then that pixel's region had a large
update in the last iteration's TV minimization. The value of λ(x) is then set larger, so that
the region is not smoothed as strongly in the next iteration. In this manner, the weights
are adapted to even out the smoothing.
Figure 3.6: 2x TV zooming of noisy image with locally adaptive fidelity weights.
For experimentation, the noisy image in Figure 3.6 was synthetically generated with
additive Gaussian noise with mean zero and known variance. The TV zooming result for
constant λ and M = 2 is shown in the third image. The fourth image shows the result
of TV zooming using the fidelity weights λ(x) calculated by (3.14) on the original low-
resolution domain. The window used for calculation of the local variance was 11×11. Note
that the noise is removed in both TV zooms, but the model with adaptive weights shows
more texture in the fur and shirt. The SNR of the noisy image is 10.01, the TV zoom with
constant λ has higher SNR 15.02, and the adaptive TV zoom improves the SNR further to
15.48.
Because of the difficulty in estimating the noise variance in practice, the locally adaptive
TV method should only be used for images corrupted with a large amount of noise. We
should note that the results are not as sharp for zooming as they are for the simple denoising
problem (M = 1). It is difficult to track textures across scales, as discussed in Section 2.5.
One possible correction would be to incorporate the change in resolution into (3.14) to
reflect the change in scale. Almansa et al. suggested a locally adaptive TV zooming based
on Chambolle's TV denoising algorithm [1].
3.5.3 Soft Inpainting with Nearest Neighbor Information
The zipper artifacts and isolated pixels that arise in variational zooming are due to the
absence of a fidelity weight on the unknown pixels. In the inpainting region D, the regularization term encourages smoothness and minimum edge length, but the artifacts may
actually correspond to local minima under these priors. For the fidelity term to have an
effect in the domain D, a natural choice is to have an unknown pixel weakly correlated with
its nearest neighbor in the known region Ω \ D. Let x̄ denote the nearest neighbor in the
known region of a pixel x ∈ Ω:

x̄ = argmin_{y∈Ω\D} d(x, y).

Trivially, for a known pixel x ∈ Ω \ D we have x̄ = x. Inspired by [91], we propose a “soft”
inpainting model of the form

min_u E[u|u0] = R(u) + (λ/2) ∫_Ω P(x) (u(x) − u0(x̄))² dx.
Here P(x) is a weight function that determines how strongly a pixel correlates with its
nearest neighbor. We would generally expect 0 ≤ P(x) ≤ 1, with P(x) = 1 for known
pixels x ∈ Ω \ D and the value of P decaying as the distance d(x, x̄) grows. Note that the
standard inpainting model is a special case of the soft model with P(x) = 1_{Ω\D}(x), which can be
thought of as “hard” inpainting. One possible choice for a soft weight function is a negative
exponential

P(x) = exp( −d²(x, x̄)/σ² )

where σ is a sensitivity parameter. As σ → 0, the model becomes the traditional hard
inpainting. As σ → ∞, P → 1 identically and the fidelity term assigns equal weight to the
pixels in the known and unknown regions. As with the other model parameters, the value
of σ needs to be set carefully to balance the two extremes.
The soft inpainting model generalizes to the K nearest neighbors of a pixel. Let
{x̄i}_{1≤i≤K} ⊆ Ω \ D denote the K nearest neighbors of a pixel x ∈ Ω. Averaging over
the K nearest neighbors, the soft inpainting model becomes

min_u E[u|u0] = R(u) + (λ/2) ∫_Ω (1/K(x)) Σ_{i=1}^{K(x)} Pi(x) (u(x) − u0(x̄i))² dx,

where Pi(x) defines the correlation between pixel x and its ith nearest neighbor x̄i. A
natural choice is again the exponential function

Pi(x) = exp( −d²(x, x̄i)/σ² ).
Note that the number of nearest neighbors K(x) could be spatially varying. In particular,
we expect K(x) = 1 for known pixels x ∈ Ω \D.
While the soft inpainting model is not necessarily appropriate for general domains, it
appears well-suited for the zooming problem. In particular, it seems reasonable to use
K = 4 nearest neighbors for pixels in the interior of the inpainting domain and K = 2
neighbors for pixels in a row or column of a known pixel. With respect to a low-resolution
lattice Ω contained in the high-resolution lattice ΩM, we define K(x) for x = (x1, x2) ∈ ΩM
to be

K(x) = 1 if x ∈ Ω;  2 if x ∉ Ω and (x1, y) ∈ Ω or (y, x2) ∈ Ω for some y;  4 otherwise.    (3.15)
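For a zoom factor M, identifying Ω with the sublattice of positions divisible by M (our own convention for illustration), (3.15) can be tabulated directly:

```python
import numpy as np

def neighbor_count(shape, M):
    """K(x) from (3.15): 1 on the low-res lattice, 2 on its rows/columns, 4 elsewhere."""
    K = np.full(shape, 4, dtype=int)
    on_row = np.arange(shape[0]) % M == 0    # shares a row with a known pixel
    on_col = np.arange(shape[1]) % M == 0    # shares a column with a known pixel
    K[on_row, :] = 2
    K[:, on_col] = 2
    K[np.ix_(np.flatnonzero(on_row), np.flatnonzero(on_col))] = 1
    return K

K = neighbor_count((10, 10), M=5)
```

For M = 5 on a 10×10 patch, the four lattice points get K = 1, the rest of their rows and columns K = 2, and the interior pixels K = 4.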
Figure 3.7 shows the result of zooming by soft inpainting on a trivial binary image.
Bicubic interpolation completely blurs the edges and staircasing is visible along the diagonal
edge. The magnification factor M is large enough so that the standard Mumford-Shah
interpolant is a set of isolated white pixels. As expected, when σ is too small the result
coincides with standard hard inpainting. If σ is too large, the result is essentially an average
of the neighbors. For an appropriate choice of σ, the soft inpainting model produces a binary
image with well-defined edges. Note that for σ = 1 the diagonal edge is smooth and there
are no staircasing artifacts, although the corners are rounded off.
Figure 3.7: Effect of σ on Mumford-Shah soft inpainting with λ = 20, γ = 2000, M = 5.
Using the value of σ suggested by the last example, Figure 3.8 shows the zooming result
on a natural image. Hard Mumford-Shah inpainting produces isolated pixels and the legs
of the camera tripod almost completely disappear. The soft inpainting model using the
same variational parameters removes these image artifacts. The edges are smoother and
the image regions are more distinct than in the bicubic zoom.
Figure 3.8: Comparison of zooming using standard and soft Mumford-Shah inpainting with λ = 20, γ = 2000, σ = 1, M = 5.
3.5.4 Variational Zooming as Post-Processing
In the soft inpainting model, if we let K = 1 and σ → ∞ then the fidelity term is equivalent
to matching the image u to the interpolant under the nearest neighbor or duplication zoom.
Similarly, if we define K as in (3.15) and define the weight function to be polynomial in
the distance, the fidelity term could match u to a bilinear or bicubic zoom of the image.
This suggests that a special case of the soft inpainting model is equivalent to matching the
image u to some zoomed image v, rather than matching u to the original image u0 on the
low-resolution lattice. Suppose the image v : ΩM → ℝ is a zoomed version of u0 under
some standard interpolation filter, such as bicubic zooming. Then variational zooming can
be seen as a post-processing step on the zoomed image v:

min_u E[u|v] = R(u) + (λ/2) ∫_{ΩM} P(x) (u − v)² dx.
The weight function P(x) essentially quantifies the confidence that the pixel x was zoomed
correctly by the process that created v. For post-processing the bicubic zoom, Cha and
Kim developed a fourth-order PDE method using a weight function that is polynomial in
the Laplacian of the zoomed image [26]. Adapting this weight function for the variational
approach and normalizing 0 ≤ P(x) ≤ 1, we set

P(x) = Q(x) / max_{x∈ΩM} Q(x),  Q(x) = (Δv)⁴.    (3.16)
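The weight (3.16) is cheap to compute; a sketch using our own 5-point Laplacian discretization with replicate boundaries:

```python
import numpy as np

def laplacian_weight(v):
    """P(x) = Q(x)/max Q with Q = (Laplacian v)^4, as in (3.16)."""
    p = np.pad(v, 1, mode='edge')
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * v
    Q = lap**4
    return Q / Q.max()

# the weight concentrates at the corners of a bright square and
# vanishes in homogeneous regions
v = np.zeros((16, 16))
v[4:12, 4:12] = 1.0
P = laplacian_weight(v)
```

On this binary square, P peaks at the inside corners, takes a smaller value along straight edges, and is exactly zero in the flat interior, matching the mask behavior described below for Figure 3.9.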
The inpainting masks for the standard, soft, and post-processing models are illustrated
in Figure 3.9. Note that the first two masks are independent of the image data. The
post-processing weight function is largest at the corners of the zoomed image and small in
homogeneous regions and along smooth edges.
Figure 3.9: Different possible inpainting masks for a single image with magnification M = 5. Left to right: original image, standard inpainting mask, average of soft inpainting mask, Laplacian post-processing mask.
Figure 3.10 compares zooming under the standard inpainting and post-processing models. The bicubic zoom blurs the edges and staircasing artifacts are clearly visible. TV and
Mumford-Shah inpainting maintain clear smoothed edges, but zipper artifacts are visible in
the TV zoom and Mumford-Shah rounds the corners. When the bicubic zoom is post-processed
using the weight function (3.16), both the TV and Mumford-Shah models produce sharper
edges with fewer artifacts. In particular, the Mumford-Shah model seems well-suited to post-processing. The zipper artifacts are no longer present, the edges are more distinct, and the
staircasing artifacts are removed. Because the Mumford-Shah model minimizes edge length,
the boundary of the circle appears piecewise linear rather than a smooth curve. To correct
this result, Esedoglu and Shen suggested adding a curvature term to the Mumford-Shah
energy [46].
Figure 3.10: Comparison of standard variational zooming and post-processing methods with magnification M = 5.
Another key advantage of the post-processing model is that the magnification factor is
no longer restricted to integers M ≥ 1. Figure 3.11 shows the result of zooming a natural
image by a factor M = 2π. Compared to the bicubic zoom, Mumford-Shah post-processing
produces flatter regions bounded by sharper, smoother edges.
Figure 3.11: Zooming by magnification factor M = 2π using Mumford-Shah post-processing.
Chapter 4
Variational Super-resolution
4.1 Super-resolution of an Image Sequence
The goal of super-resolution (SR) is to produce a high-resolution image u : ΩM → ℝ from
a sequence of N low-resolution images {ui : Ωi → ℝ}_{1≤i≤N}. We call the array of points
(grid of pixels) from which the image is formed the lattice. Here Ωi denotes the lattice
of the ith low-resolution image and ΩM is the high-resolution lattice that is a factor M
times larger than the original lattice. The input images {ui}_{1≤i≤N} are generally images of the
same visual scene from slightly different perspectives, such as a panning camera filming a
stationary object. Huang and Tsai were the first to notice that sub-pixel motion in the
sequence and image aliasing gave the potential for the construction of higher resolution
images. The authors described two basic steps in the super-resolution process: image
registration and data fusion [60]. These processes are sometimes treated separately in the
literature, although recent papers have addressed the steps jointly [58].
The first and probably most difficult step of super-resolution is to properly align the
images to the same grid ΩM . Let ϕi : Ωi → ΩM denote the coordinate transformation
mapping each image ui to the high-resolution grid. Then for a pixel x ∈ Ωi,

ui(x) = Ki(u ∘ ϕi)(x) + ni(x)
where Ki is the linear blur operator and ni is additive noise for the ith image. If the
magnification M = 1, the transformation ϕi describes the registration between the images.
For M > 1, ϕi describes both the motion and downsampling processes for the ith image.
These transformations are generally restricted to the class of planar homographies. If the
two-dimensional point (x, y) is represented in homogeneous coordinates as x = (x, y, 1), a planar
homography H can be expressed as a 3×3 matrix:

x′ = αHx,  H ∈ M_{3×3},  α ≠ 0

where α is an arbitrary scaling factor. Because of the scaling, a planar homography has 8
degrees of freedom. Capel and Zisserman outlined three real-world situations in which the
planar homography assumption is appropriate [22].
1. The visual scene or object being viewed is planar and the camera motion is arbitrary.
2. The visual scene is three-dimensional but the camera motion is restricted to rotation
about the optic center and zooming.
3. The camera is at a sufficient distance from the visual scene that the parallax effects caused
by the three-dimensional nature of the scene are negligible.
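Working in homogeneous coordinates makes the scale invariance x′ = αHx concrete; a small sketch (the helper name is our own, and we use a pure translation as the test homography):

```python
import numpy as np

def apply_homography(H, pts):
    """Map 2D points through a 3x3 planar homography in homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    hom = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y) -> (x, y, 1)
    out = hom @ H.T                                  # x' = H x (up to scale)
    return out[:, :2] / out[:, 2:3]                  # divide out the scale alpha

# a pure translation by (3, -1) is a special case of a planar homography
H = np.array([[1.0, 0.0, 3.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])
mapped = apply_homography(H, [[0.0, 0.0], [2.0, 5.0]])
```

Replacing H by any nonzero multiple αH gives the same mapped points, which is why the homography has only 8 degrees of freedom.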
In this chapter, we will assume that the input image sequence satisfies one of the above
assumptions reasonably well. Computationally we will restrict the camera/scene motion
to translations in Section 4.2.2, although the method could be extended to general planar
homographies. There exist several methods for image registration under a translational
model, notably the method by Irani and Peleg [61]. However, for a magnification factor
M > 1 the registration needs to be precise to the sub-pixel level, often a very difficult if
not insurmountable task. In general, to increase the resolution by factor M the registration
needs to be accurate to 1/M pixels on the high-resolution grid ΩM . It is assumed that
the transformation ϕi maps to the discrete gridpoints of ΩM , so for a continuous warping
it may be necessary to round the position of pixel ϕi(x) to its nearest gridpoint on ΩM .
Alternatively, the gray value at the point x ∈ ΩM could be interpolated from the pixel
neighborhood in ui surrounding ϕi⁻¹(x). Capel and Zisserman note that in addition to the
geometric registration, it may be necessary to perform a photometric registration between
the images to correct for changes in illumination and camera parameters. We will assume
the photometric differences between the images are negligible. Once the images are aligned
to a common high-resolution lattice ΩM , we obtain an image-like data set on ΩM with
some pixels having known value, some unknown, and some pixels having multiple values
addressed to them (see Figure 4.1). If the desired image grid ΩM is not large enough to
contain all mapped pixels of image ui, we will restrict attention to pixels in ΩM ∩ ϕi (Ωi).
Figure 4.1: Illustration of image registration for super-resolution. The three images u1, u2, u3 are aligned to a common high-resolution lattice ΩM by the respective geometric transformations ϕ1, ϕ2, ϕ3.
Next, the registered images are fused into a single high-resolution image u. Note that
even if the transformations ϕi and blur operators Ki are known, the fusion problem is ill-posed due to noise. The simplest image fusion approach is to take the median through all
pixel values

u(x) = median{ ui(ϕi⁻¹(x)) : ϕi⁻¹(x) ∈ Ωi },  x ∈ ΩM.
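The median fusion can be sketched directly from this formula; the bucket-by-pixel data layout below is our own illustration of pixels receiving zero, one, or several values after registration:

```python
import numpy as np

def median_fusion(shape, registered):
    """Median through all values addressed to each high-resolution pixel.

    registered : list of ((i, j), value) pairs after mapping each low-res
                 image onto the high-resolution lattice; pixels receiving
                 no value are returned as NaN (to be inpainted later).
    """
    buckets = {}
    for (i, j), val in registered:
        buckets.setdefault((i, j), []).append(val)
    u = np.full(shape, np.nan)
    for (i, j), vals in buckets.items():
        u[i, j] = np.median(vals)
    return u

# three images assign values to pixel (0, 0); pixel (0, 1) gets only one
samples = [((0, 0), 0.2), ((0, 0), 0.9), ((0, 0), 0.3), ((0, 1), 0.5)]
u = median_fusion((2, 2), samples)
```

The median discards the outlying value 0.9 at pixel (0, 0), which is why this simple fusion is a common robustness benchmark.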
The median image is commonly used as the benchmark for super-resolution algorithms. A
better approach is Maximum Likelihood Estimation (MLE), as used by Irani and
Peleg [61]. Other researchers have developed Maximum A Posteriori (MAP) models that
incorporate image priors with desirable properties. For example, Schultz and Stevenson
proposed an image prior measuring image smoothness as a function of local second derivatives [89]. The image prior need not be explicit and could be learned from the data, as in
Baker and Kanade’s method designed specifically for super-resolution of human faces [8].
The variational approach proposed in the next section can be viewed as a type of MAP
estimation.
4.2 Super-resolution by Variational Inpainting
4.2.1 Data Fusion with Known Registration
The variational inpainting model for a single image u0 extends naturally to multiple images
{ui}_{1≤i≤N}. Instead of the fidelity term matching to one image, the final image should match
on average all images in the sequence in the least squares sense. For a magnification factor
M, known registration functions {ϕi}_{1≤i≤N}, and regularization term R(u), the variational
super-resolution model is

min_u E[u | {ui}, {ϕi}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{ΩM ∩ ϕi(Ωi)} (u − ui ∘ ϕi⁻¹)² dx.    (4.1)
For convenience, denote the registered image domain for the ith image in the limit of the
last integral by

Di := ΩM ∩ ϕi(Ωi).
Referring to Figure 4.1, the model will perform variational smoothing on known pixels and
inpainting in unknown regions. For pixels with multiple values, the fidelity term will drive
the image toward the mean value. However, the model is not equivalent to matching to
the mean image, as pixels with multiple consistent values will receive more weight in the
minimization. That is, a pixel with multiple assignments of the color black will be more
likely to be black in the final image u, compared to a pixel with only one assignment.
The computation for both TV [83] and Mumford-Shah [75] regularization is a simple
modification of the single-image minimization. For the TV regularization
$$R(u) = \int_{\Omega_M} |\nabla u|\, dx$$
the corresponding Euler-Lagrange equation is
$$-\nabla \cdot \left(\frac{\nabla u}{|\nabla u|}\right) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)\left(u - u_i \circ \varphi_i^{-1}\right) = 0$$
with Neumann boundary conditions. This equation can be solved by standard gradient
descent or level set methods. Another approach is to modify the digital TV filter of Chan,
Osher, and Shen described in Section 3.2.2 [33]. The digital TV energy is minimized by
iterating for n ≥ 1 the formulas
$$u^{(n+1)}(x) = \frac{\displaystyle\sum_{y \in N(x)} h^{(n)}(y)\, u^{(n)}(y) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)\, u_i \circ \varphi_i^{-1}(x)}{\displaystyle\sum_{y \in N(x)} h^{(n)}(y) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)}$$
$$h^{(n)}(y) = \frac{1}{|\nabla u^{(n)}(y)|}$$
where N(x) is the 4-connected neighborhood of pixel x. The finite differences used for
discretizing h(n) are the same as in Section 3.2.2. To avoid division by zero, a lifting
parameter a > 0 is introduced into the norm
$$\frac{1}{|\nabla u(y)|_a} = \frac{1}{\sqrt{a^2 + |\nabla u(y)|^2}}.$$
The digital TV filter computation is generally stable for a = O(10^{-4}) [33].
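One sweep of the digital TV filter iteration above can be sketched as follows (an illustrative NumPy sketch with forward differences and the lifted norm; note that np.roll wraps at the image border rather than enforcing the Neumann condition, so a careful implementation would pad instead):

```python
import numpy as np

def digital_tv_step(u, data, masks, lam, a=1e-4):
    """One digital TV filter sweep with multi-image fidelity.

    u     : current high-resolution image, shape (H, W)
    data  : warped low-resolution images u_i o phi_i^{-1}, shape (N, H, W)
    masks : indicator arrays 1_{D_i}, shape (N, H, W)
    """
    N = data.shape[0]
    # Lifted gradient norm |grad u|_a via forward differences.
    ux = np.diff(u, axis=1, append=u[:, -1:])
    uy = np.diff(u, axis=0, append=u[-1:, :])
    h = 1.0 / np.sqrt(a**2 + ux**2 + uy**2)

    # Sums of h(y)*u(y) and h(y) over the 4-connected neighborhood N(x).
    hu = h * u
    num = np.zeros_like(u)
    den = np.zeros_like(u)
    for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
        num += np.roll(hu, shift, axis=axis)
        den += np.roll(h, shift, axis=axis)

    # Fidelity contribution (lambda/N) * sum_i 1_{D_i}(x) u_i o phi_i^{-1}(x);
    # nan_to_num guards NaN placeholders outside D_i, which masks zero out.
    num += (lam / N) * np.sum(masks * np.nan_to_num(data), axis=0)
    den += (lam / N) * np.sum(masks, axis=0)
    return num / den
```

On a constant image that agrees with the data, the filter is a fixed point, which is a quick sanity check of the update.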
The Mumford-Shah regularization for an image u : ΩM → ℝ and edge set Γ is
$$R_{MS}(u, \Gamma) = \int_{\Omega_M \setminus \Gamma} |\nabla u|^2\, dx + \gamma H^1(\Gamma)$$
where H1 denotes the one-dimensional Hausdorff measure. The first term smooths the
image away from the edges and the second term minimizes the total edge length. Using the
Ambrosio-Tortorelli Γ-convergence approximation, let z : ΩM → [0, 1] denote the continuous
edge canyon function with z = 0 on the edge set Γ and z = 1 otherwise [4]. For a parameter
ε > 0, the Γ-convergence approximation to the Mumford-Shah regularization is
$$R_{MS}[u, z] = \int_{\Omega_M} z^2 |\nabla u|^2\, dx + \gamma \int_{\Omega_M} \left(\varepsilon |\nabla z|^2 + \frac{(1-z)^2}{4\varepsilon}\right) dx.$$
The associated Euler-Lagrange equations and boundary conditions are
$$-\nabla \cdot \left(z^2 \nabla u\right) + \frac{\lambda}{N} \sum_{i=1}^{N} \mathbf{1}_{D_i}(x)\left(u - u_i \circ \varphi_i^{-1}\right) = 0$$
$$|\nabla u|^2 z + \gamma \left(-2\varepsilon \Delta z + \frac{z-1}{2\varepsilon}\right) = 0$$
$$\frac{\partial u}{\partial \vec{n}} = \frac{\partial z}{\partial \vec{n}} = 0.$$
These equations can be solved by an elliptic solver such as Gauss-Jacobi, alternating the
minimization of u and z. For inpainting problems, setting the parameter ε = 1 will generally
suffice [46].
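The alternating minimization of u and z can be illustrated with a Gauss-Jacobi sweep for the edge function z (a sketch assuming unit grid spacing, a 4-point Laplacian, and edge padding as a Neumann-style border; not the thesis implementation):

```python
import numpy as np

def jacobi_z_update(z, grad_u_sq, gamma, eps=1.0):
    """One Gauss-Jacobi sweep for the Ambrosio-Tortorelli edge function z.

    Solves |grad u|^2 z + gamma*(-2*eps*Lap(z) + (z-1)/(2*eps)) = 0
    pointwise, with Lap(z) ~ (sum of 4 neighbors) - 4 z.
    """
    grad_u_sq = np.asarray(grad_u_sq)
    # Neighbor sum with replicated borders (Neumann-like condition).
    zp = np.pad(z, 1, mode='edge')
    nbr = zp[:-2, 1:-1] + zp[2:, 1:-1] + zp[1:-1, :-2] + zp[1:-1, 2:]
    # Solving the discretized equation for z(x) with neighbors frozen:
    num = 2 * eps * gamma * nbr + gamma / (2 * eps)
    den = grad_u_sq + 8 * eps * gamma + gamma / (2 * eps)
    return num / den
```

Away from edges (|∇u|² = 0), z = 1 is a fixed point of this sweep, matching the convention z = 1 off the edge set.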
This model assumes knowledge of each registration function ϕi : Ωi → ΩM aligning
image ui to the high-resolution lattice. Because precise alignment is generally difficult to
obtain, much of the super-resolution literature obtains the registration from synthetic data
and focuses on the data fusion step. One strategy is to sample the sequence from a single
high-resolution image. Some researchers generate video of stationary objects with a precisely
calibrated slow-moving camera, obtaining the registration from the known camera motion.
A third data generation technique is to align high-resolution images, downsample the images
maintaining the registration parameters, and then work with the low-resolution images. For
a first numerical experiment, we employ this last strategy. For the two image sequences
shown in Figures 4.2 - 4.5, in each high-resolution frame in the sequence we manually selected
several control points, generally corners and other distinct image features. The images were
downsampled, tracking the downsampled control points as well. The registration functions
ϕ1≤i≤N are the affine transformations that best match the corresponding control points
between images in the least squares sense. This procedure is completely synthetic and, of
course, not reproducible in practice. The goal of this experiment is simply to establish that
the variational model is effective in fusing data when accurate image registration is known.
Figure 4.2: Super-resolution of 5-image sequence. Top left: original third image in sequence. Top right: 4x TV SR with λ = 20. Bottom left: 4x MS SR with λ = 20, γ = 2000. Bottom right: 4x MS SR with registration incorrect by 1/2 pixel on low-resolution lattice.
Figure 4.2 shows the result of super-resolution of a 5-image sequence of a sign with
Chinese characters. The images were aligned to the upsampled lattice of the third image,
which we will call the base frame. Both the TV and Mumford-Shah models produce an
image that is clearly higher resolution than could be obtained from a single image. The
TV image is slightly more blurred; the Mumford-Shah model generally produces sharper
edges with smoother regions away from the edges. This suggests the Mumford-Shah model
is more appropriate for images with sharp edges such as text. However, the TV model may be better at preserving texture in natural images. Because it generally provides sharper
images, most of the results in this chapter are obtained from the Mumford-Shah model.
Figure 4.2 also highlights the importance of precise image registration. To produce the
fourth image, the control points used for alignment were shifted in a random direction by 1/2 pixel on the low-resolution lattice. While a sub-pixel shift would not affect the result if the
resolution stayed the same, after magnification by a factor M = 4 the control points are off
by 2 pixels in the common high-resolution lattice ΩM . The resulting image is blurred and
the artifacts of the mis-alignment are clearly visible. In general, a magnification by a factor
M requires that the images be aligned to 1/M pixel accuracy on their original lattices.
For color RGB images, the simplest solution is to find the minimum energy image over
each color channel separately. Figure 4.3 shows the result of color super-resolution on a
5-frame color video sequence, using the third frame as the base. Note the super-resolution
result recovers features that are not present in the original image, such as the small gaps in
the last character. Processing the channels separately is somewhat naive because it assumes
the image information is uncorrelated across color channels. There has been some research
on redefining the TV energy for color images, notably the work by Sapiro and Ringach
[85]. Some researchers indicate that for super-resolution it is sufficient to enhance only the
luminance channel of the YIQ color space and use a simple interpolation filter for the two
chrominance channels [35]. Farsiu, Elad, and Milanfar developed an image prior specifically
for color super-resolution designed to force correlated edge location and orientation between
the color channels [49].
The super-resolution procedure extends naturally to video. Each frame of the video is
Figure 4.3: 4x color image zoom of 5-image sequence with known registration. Top row: nearest neighbor, bilinear, bicubic. Bottom row: staircased bicubic, median image, MS SR with λ = 20, γ = 2000.
repeatedly selected as the base frame, aligning all other frames to the upsampled lattice
of the base. Figure 4.4 shows video super-resolution of an 11-frame video sequence with
known registration. The text is not legible in any of the 11 frames, but becomes much
clearer after the super-resolution. The features of the woman’s face are also improved,
but the face appears somewhat unrealistic. Because it minimizes the edge length, the
Mumford-Shah model is well-suited for lines and text, but tends to oversmooth textured
regions. This suggests variational super-resolution is best suited for applications that do
not require photo-realistic images.
Figure 4.4: Super-resolution of 11-frame video sequence with known registration. Top row: 4 frames from original sequence. Bottom row: corresponding 4 frames from 4x MS SR with λ = 20, γ = 2000.
4.2.2 Simultaneous Registration and Fusion
Note that for a fixed image u, minimizing the general energy in (4.1) with respect to ϕi
requires just the unweighted fidelity term
$$\min_{\varphi_{1\le i\le N}} E[\varphi_{1\le i\le N} \mid u, u_{1\le i\le N}] = \int_{D_i} \left(u - u_i \circ \varphi_i^{-1}\right)^2 dx = \left| u - u_i \circ \varphi_i^{-1} \right|^2. \qquad (4.2)$$
The registration functions ϕi should be restricted to a suitable class of spatial transfor-
mations for which registration methods exist. For example, Irani and Peleg outline an
iterative refinement based on a truncated Taylor series for affine transformations consisting
of rotations, translations, and scalings [61]. We found that the iterative refinement method
minimizing the L2 norm (4.2) worked well on the low-resolution lattices, but the result was
not accurate enough on the high-resolution lattice ΩM to produce acceptable SR results.
That is, the registration was accurate at the pixel level but not at the sub-pixel level.
To refine the registration, we propose an alternating minimization model. Suppose
one of the images uB : ΩB → ℝ in the sequence is identified as the base frame and the
high-resolution lattice ΩM is generated by upsampling the lattice ΩB. Each low-resolution
image ui is aligned to the low-resolution image uB by a function τi : Ωi → ΩB. The aligned
images are then upsampled to the lattice ΩM . The minimum energy u is computed from
this registration, followed by minimizing over the registration functions for this image. The
process continues, alternately freezing and minimizing the image and registration functions,
until the registration functions are no longer updated.
Super-resolution by Alternating Minimization
Input: Original image sequence u_{1≤i≤N}, base frame u_B, update threshold δ > 0.
Output: Super-resolved image u.
Compute initial registration τ_{1≤i≤N} aligning images to base image u_B.
Upsample τ_{1≤i≤N} to create ϕ^{(0)}_{1≤i≤N}.
Repeat
    Fix ϕ^{(n)}_{1≤i≤N} and compute image u by minimizing energy (4.1).
    Fix u and compute functions ϕ^{(n+1)}_{1≤i≤N} that minimize (4.2).
until max_{1≤i≤N} |ϕ^{(n+1)}_i − ϕ^{(n)}_i| < δ.
Note that if the initial registration is accurate to the pixel level on the low-resolution
lattice, then this registration will be accurate within ⌊M/2⌋ pixels on the high-resolution lattice.
For rigid transformations, the update to the registration functions can be computed by
a local search of pixel mappings on ΩM . We implemented the method above using the
Figure 4.5: Super-resolution video sequence with known and unknown registration. Top: one frame from original 11-frame sequence. Center: 4x MS SR using ground-truth registration. Bottom: 4x MS SR with simultaneous translational registration.
Mumford-Shah model and restricting the transformations to simple translations
$$\varphi_i(x, y) = (x + a,\, y + b) \uparrow M,$$
where ↑M denotes upsampling by a factor M. We assume the upsampling includes rounding to the closest lattice point of ΩM, unless a more accurate gray value is interpolated from ui. The initial registration was computed by the Irani-Peleg method and the updates were computed by a local enumerative search over [a − ⌊M/2⌋, a + ⌊M/2⌋] × [b − ⌊M/2⌋, b + ⌊M/2⌋].
For most sequences, the process converged within two or three iterations and resulted in a
better image than using the initial registration. However, if the initial registration was not
accurate enough, the resulting image u was poor and the iterates became increasingly blurred. This is because the alternating minimization is driven toward a local minimum
close to the initialization which may not correspond to the global minimum over u and ϕi
jointly. The alternating minimization helps refine the registration and corresponding image,
but the initial registration still needs to be precise.
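The local enumerative search over translations can be sketched as follows (an illustrative sketch; the function name is ours, and np.roll is used in place of proper boundary handling, which is adequate only for small shifts):

```python
import numpy as np

def refine_translation(u, ui_up, a0, b0, M):
    """Enumerative search for the integer translation on Omega_M that best
    matches the upsampled frame ui_up to the current estimate u, within
    floor(M/2) pixels of the initial guess (a0, b0), minimizing the L2
    mismatch as in (4.2)."""
    r = M // 2
    best, best_err = (a0, b0), np.inf
    for a in range(a0 - r, a0 + r + 1):
        for b in range(b0 - r, b0 + r + 1):
            # np.roll wraps around the border; a sketch-level shortcut.
            shifted = np.roll(np.roll(ui_up, a, axis=1), b, axis=0)
            err = np.sum((u - shifted) ** 2)
            if err < best_err:
                best, best_err = (a, b), err
    return best
```

When the current estimate u is an exact shift of the frame, the search recovers that shift, which is the sub-pixel correction (on ΩM) that the alternating minimization relies on.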
Figure 4.5 compares the alternating minimization SR method to SR using known reg-
istration. Both SR images are clearly an improvement over the original image, but the
second image is less blurred than the third. However, the second image was produced syn-
thetically, using known registration parameters. The third image is based on only the input
video sequence and is reproducible in practice. The blur derives from convergence to a local
minimum as well as the possibility that translations are not sufficient for describing the
motion between frames.
4.3 Artifact Reduction by Soft Inpainting
Variational super-resolution offers several computational and practical advantages, includ-
ing:
• Reconstruction from limited data: The conventional wisdom for super-resolution is to use O(M²) images for a magnification factor M, the idea being that this is the number of images required to fill in all pixels in ΩM. Browsing the literature shows that using roughly 2M² frames is the most common practice [8, 49]. Experimentally the performance appears to level off at this limit [82]. The Mumford-Shah and TV regularization terms smooth the image, inpainting the unknown regions. The number of images required to produce an adequate result appears to be much less than 2M². The results presented in this chapter use 5-11 images for a magnification factor M = 4.
• Flexibility: Depending on the application and the input data, the parameters in the
variational model can be tuned to give desired output. For example, for text im-
ages the parameter γ can be increased to give straight lines with sharp edges. For
images with low SNR, the parameter λ can be decreased to increase the smooth-
ing, although this will result in more image blur. Additional image priors are easily
added to the energy, for example Shen and Esedoglu suggest adding a curvature term
to the Mumford-Shah inpainting model to encourage curved edges [46]. Of course,
the sensitivity to the parameters also means that fine-tuning the parameters can be
troublesome for input images where the noise, blur, and image type are unknown.
• Edge enhancement: The edge length term in the Mumford-Shah functional encourages
smooth well-defined edges, while the smoothing term enhances the edges by decreasing
local variation near the edges. To a lesser extent, the TV norm also enhances edges
because the minimization tends to yield piecewise constant regions, sometimes called
“blocky” images [42].
• Registration refinement: The alternating minimization method can iteratively refine
an imprecise initial registration. However, the minimization may converge to a local
minimum which produces an unacceptable blurred image. To avoid this, the initial
registration should be precise as possible. The alternating minimization can correct
the registration with sub-pixel shifts, but it cannot correct an initial registration incor-
rect at the pixel level or a geometric transformation that is inadequate for describing
the motion in the given sequence.
Variational SR can make image features clearer, but the resulting images tend not to be
photo-realistic and contain image artifacts. These artifacts derive from the variational SR
process, the underlying data, our assumptions on the data, and the inherent computational
limits of SR. Part of the problem derives from the binary decision that a pixel x ∈ Ωi
either counts in the final SR image or not, with no room for adjusting for local properties
or differences in the images. Inspired by [91], we refer to our data fusion formula (4.1) as
the “hard” inpainting model:
$$\min_u E[u \mid u_{1\le i\le N}, \varphi_{1\le i\le N}] = R(u) + \frac{\lambda}{2N} \sum_{i=1}^{N} \int_{\Omega_M} \mathbf{1}_{D_i}(x)\left(u - u_i \circ \varphi_i^{-1}\right)^2 dx.$$
We can relax the characteristic function 1Di(x) to a “soft” inpainting model:
$$\min_u E[u \mid u_{1\le i\le N}, \varphi_{1\le i\le N}] = R(u) + \frac{\lambda}{2N} \sum_{i=1}^{N} \int_{\Omega_M} P_i(x)\left(u - u_i \circ \varphi_i^{-1}\right)^2 dx$$
where Pi : ΩM → ℝ is a weight function, or sensitivity profile, that determines how much weight the gray value u_i ∘ ϕ_i^{-1}(x) exerts in the final image. The function Pi can be viewed
as a probability function and we generally assume 0 ≤ Pi(x) ≤ 1. Note the hard inpainting
model is a subset of the soft model. Below we discuss different image artifacts that arise in
the SR process and briefly suggest how the soft inpainting model can help correct for these
errors.
• Texture oversmoothing: Images produced by Mumford-Shah SR tend to consist of
smooth regions with sharp boundaries. Although the TV norm performs less smooth-
ing because the exponent on the gradient is smaller, textured regions will also be
smoothed in TV SR. The texture can be preserved by increasing the value of λ, but
this also emphasizes noise and misaligned pixels. One solution is to increase the weight
of the fidelity term in textured regions and decrease the weight in noisy regions. The
difficulty is that locally texture resembles noise. Gilboa et al. suggested a locally
adaptive fidelity term for the TV energy that reduces noise while preserving texture
[52]. Along similar lines, He and Kondi recently proposed a SR scheme with the
fidelity weight varying across the image frames proportional to the amount of noise
in the frame [59]. Combining these ideas, the fidelity weight can be locally adap-
tive within each low-resolution image frame. For example, similar to [52] the weight
function could be
$$P_i(x) \propto \frac{\sigma_i^2}{\sigma_{loc}^2(x)}, \qquad x \in D_i \qquad (4.3)$$
where σ_i² is the variance of the noise in image u_i and σ_loc²(x) is the local variance of the noise in a neighborhood around pixel x. The constant of proportionality needs to be
determined, although this constant could be absorbed into the parameter λ. Besides
the local variance, other local statistics such as entropy and geometric moments could
be used to differentiate texture from noise [55].
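A weight of the form (4.3) can be sketched with a sliding-window variance estimate (an illustrative sketch; the constant of proportionality is taken as 1, and the small floor on the local variance is our own guard against division by zero):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_variance_weight(ui, noise_var, win=5):
    """Weight P_i ~ sigma_i^2 / sigma_loc^2(x) as in (4.3).

    ui        : low-resolution image, shape (H, W)
    noise_var : estimated noise variance sigma_i^2 for this image
    win       : odd side length of the local neighborhood
    """
    p = win // 2
    # Edge-padded win x win windows around every pixel.
    windows = sliding_window_view(np.pad(ui, p, mode='edge'), (win, win))
    local_var = np.maximum(windows.var(axis=(-1, -2)), 1e-12)  # avoid /0
    return noise_var / local_var
```

In a flat region the local variance is tiny and the weight is large (strong fidelity), while in a high-variance region the weight drops, which is the intended noise-suppressing behavior.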
• Camera and motion blur: Suppose Ki is a blur operator that describes the camera
and motion blur in image ui. The blur can be incorporated into the variational model
as:
$$\min_u E[u \mid u_{1\le i\le N}, \varphi_{1\le i\le N}] = R(u) + \frac{\lambda}{2N} \sum_{i=1}^{N} \int_{\Omega_M} \mathbf{1}_{D_i}(x)\left(K_i u - u_i \circ \varphi_i^{-1}\right)^2 dx.$$
Estimating the blur operator, the so-called blind deconvolution problem, is an open
research problem. There have been some results in the variational framework, notably
the work by Chan and Wong for the TV energy that minimizes the TV of the blur
kernel. However, this method requires accurate pre-conditioners or else the algorithm
converges to blurred local minima [34].
• Isolated pixels: Each upsampled image ui on the high-resolution lattice consists of
isolated pixels and the inpainting model does not always connect these single pixels
to other pixels. These isolated pixels are visible in the shadows of the SR images in
Figure 4.5 and are very prominent in the inaccurate SR in Figure 4.6. Decreasing the
value of λ will increase image smoothness, while also increasing blurring. The locally
adaptive model (4.3) should treat such pixels similar to noise points and should help
remove these pixels, although this may not be a desirable result in some images.
Another possibility is to use interpolated images ui that completely fill the high-
resolution lattice ΩM . A simple interpolation filter such as bilinear zoom or single-
image variational zooming could be used. The weights Pi(x) could reflect the distance
from a known pixel:
$$P_i(x) = \exp\left(-\frac{d^2(x, \bar{x})}{\sigma^2}\right)$$
where x̄ denotes the nearest known pixel in the original data.
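This distance-based profile can be computed directly with a Euclidean distance transform (a sketch; `distance_weight` is our own name, built on SciPy's `distance_transform_edt`):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_weight(known_mask, sigma):
    """P_i(x) = exp(-d^2(x, xbar)/sigma^2), with xbar the nearest pixel
    of Omega_M carrying original (known) data from u_i.

    known_mask: boolean array, True where u_i provides a pixel value.
    """
    # distance_transform_edt gives, for each nonzero (unknown) pixel,
    # the Euclidean distance to the nearest zero (known) pixel.
    d = distance_transform_edt(~known_mask)
    return np.exp(-d**2 / sigma**2)
```

The weight is 1 exactly at the known pixels and decays with distance from them, so interpolated values far from any original datum exert little pull on the fidelity term.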
• Inaccurate registration: Imprecise image registration results in blur and misaligned
pixels resembling noise points. Also, the chosen class of geometric transformations
may not be adequate for describing the motion between frames. The iterative refine-
ment method proposed in the last section helps correct for registration errors. The
soft inpainting model can help reduce these errors by making it proportional to the
registration energy (4.2) with respect to the base frame uB:
$$P_i(x) \propto \int_{D_i} \left(u_B - u_i \circ \varphi_i^{-1}(x)\right)^2 dx.$$
This function makes the weight functional proportional to the average alignment mis-
match with the base image. Alternatively, we could replace uB with the last image
iterate u(n−1) computed in the alternating minimization.
• Dynamic visual scenes: Super-resolution is effective for a very limited number of nat-
ural image sequences, for the simple reason that the real world is not static. Certain
types of motion can be accounted for by the registration functions, such as an object
moving in a plane parallel to the plane of the camera motion. However, planar pro-
jective transformation cannot account for non-rigid motion, such as moving limbs and
changing facial expressions. One approach is to incorporate temporal information and
assume that video frames will more closely resemble the base frame when they are
closer in time. Assuming the images u1≤i≤N are given in temporal order, a natural
weight function is a Gaussian centered over a base frame uB in the sequence:
$$P_i(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(i-B)^2}{\sigma^2}\right).$$
The value of σ should be inversely proportional to the rate of scene change. That is,
a video of a fast-changing scene should have a very low σ, indicating that only the
single frame uB gives an accurate depiction of the current scene. The weight function
Pi can also be modified to address dramatic changes, such as a movie cut. Another
approach for addressing scene change is to detect local deviations from the base image
uB:
$$P_i(x) = \exp\left(-\frac{\sum_{y \in N_i(x)} \left(N_i(y) - N_B(y)\right)^2}{\sigma^2}\right) \qquad (4.4)$$
where N_i(x) is the pixel neighborhood in image u_i ∘ ϕ_i^{-1} around x. Here we assume the
image uB has been upsampled to fill the lattice ΩM using some interpolation method,
such as bilinear interpolation or variational zooming. This function will detect regions
in which ui does not match the base image uB and decreases the weight of the fidelity
term in this region. A region featuring a large amount of variation, e.g. the path
of a fast-moving object, will cause the SR model to default to single-image zooming.
To some extent, this weighting can also correct for registration errors and noise. The
disadvantage of this model is that it also limits the amount of new information that
can be introduced to the base frame uB.
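The temporal Gaussian profile is simple enough to state in code (a minimal sketch; frames indexed 1..N as in the text, and the 1/(σ√2π) normalization kept as written):

```python
import numpy as np

def temporal_weights(N, B, sigma):
    """Gaussian temporal weights P_i centered on the base frame u_B.

    N     : number of frames (indexed 1..N)
    B     : index of the base frame
    sigma : inversely proportional to the rate of scene change
    """
    i = np.arange(1, N + 1)
    return np.exp(-(i - B)**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))
```

A small σ concentrates nearly all fidelity weight on the base frame itself, matching the fast-changing-scene regime described above; a large σ weights all frames almost equally.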
• Parallax effects: The grand challenge of SR research is to account for the three-
dimensional nature of the real world [48]. If the camera is distant from the visual
scene, these effects will be small. The soft inpainting model (4.4) can minimize the
distortion caused by parallax effects, but an ideal model would make use of all the
information provided. Such a model would require 3D scene reconstruction from 2D
images coupled with a 3D inpainting model.
Figure 4.6: Artifact reduction on three frames of 7-frame video sequence. Top row: original video frames. Center row: 2x MS SR with λ = 5, γ = 2000. Bottom row: 2x MS SR with soft inpainting σ = 10.
The video sequence in Figure 4.6 features a moving person tracked by a moving camera
and contains MPEG compression artifacts, motion blur, and aliasing. Super-resolution of
this sequence will result in many of the artifacts mentioned above: oversmoothed texture
in the face, isolated pixels surrounding the head, parallax effects from the head turning,
independent motion from the moving hand, and registration errors from the translational
model being insufficient. The soft inpainting function (4.4) was introduced using bilinear
zoom of the base video frame in the comparison with NB(x). The value of σ should be
chosen to balance the introduction of information from other frames with the removal of SR
artifacts. We found that the value σ = 10 removed the super-resolution artifacts, however
the SR images are over-smoothed and appear “plastic.”
4.4 Applications
4.4.1 Video Enhancement
Super-resolution has numerous applications to enhancing video: surveillance, tracking and
recognition, converting between movie formats such as DVD and HDTV, etc. As noted
earlier, the variational SR method can produce an entire video sequence by repeatedly
registering the images to a base frame. Note that the registration only needs to be performed
once for all frames, which is important because the registration step generally involves more
computational effort than the fusion step. For the alternating minimization method, the
iterative registration refinements could also be calculated once for all images although it is
probably best not to do so to account for discrepancies between base frames.
One interesting application is enhancing traffic video for vehicle tracking and recognition.
Figure 4.7 shows one frame of a video sequence taken from a high stationary camera over an
intersection in Karlsruhe, Germany. Performing SR on the original video would accomplish
little, as the streets would be blurred by moving vehicles and the stationary buildings do
not exhibit sub-pixel shifts to permit enhanced resolution. On the other hand, tracking a
moving vehicle would give a good candidate for SR. The camera is far enough from the
scene that parallax effects are negligible as long as the vehicle does not change direction.
To test the parallax effects, one of the four vehicles selected was the white van turning the
corner. The four vehicles identified in Figure 4.7 were tracked manually for 11 consecutive
frames. The tracking was not very accurate, which should not affect the result as long as
each frame is large enough to contain the vehicle but small enough so that registration will
Figure 4.7: Frame from traffic video of intersection in Karlsruhe. The four highlighted cars were tracked for super-resolution enhancement.
align to the vehicle rather than other features such as the white lines in the street.
Each of the four vehicle sequences was enhanced by a factor M = 4 with the Mumford-
Shah alternating minimization method. The registration assumes a translational model,
which may not be entirely appropriate for vehicles moving towards or away from the camera.
The vehicle scale should be fairly consistent since the camera is very distant from the street
and the video sequences are very short. As Figure 4.8 shows, the SR enhances the vehicle
shape as well as features such as the windows and tires. However, the images appear
blurred with a horizontal jitter effect. Without knowing the technical specifications of
the video camera, we conjecture that the video was interlaced: the odd and even lines
were acquired separately and the vehicle changed position slightly during the acquisition
phase. To extend SR to de-interlacing, each frame is separated into two images consisting
Figure 4.8: Super-resolution of four 11-frame sections of video in Figure 4.7. Left to right: original base frame, 4x bicubic zoom, 4x MS SR with λ = 5 and γ = 2000, 4x MS SR with de-interlacing.
of alternating horizontal lines and this new set consisting of twice the number of frames is
super-resolved. To maintain the aspect ratio of the original frame, a blank row is inserted
on alternating lines for the inpainting mask. This has the same effect as increasing the
dimension vertically by a factor 2M . The de-interlaced SR images are much crisper and
appear more realistic than the original frames. In the second row of Figure 4.8, the white
van is partially occluded by a road sign in the base frame. The van’s rear tire is correctly
recovered by the SR images, a result that would be impossible with single image inpainting.
The other vehicles were also partially occluded in some frames by pedestrians, street lights,
and trees. This shows that SR is effective for disocclusion of objects, assuming the entire
object becomes visible over the course of the sequence.
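The de-interlacing step of separating each frame into its two fields, with blank rows as the inpainting mask, can be sketched as follows (an illustrative sketch; NaN is our convention for the blank rows handed to the inpainting model):

```python
import numpy as np

def deinterlace_split(frames):
    """Split each interlaced frame into its even-line and odd-line fields,
    re-inserting blank (NaN) rows so each field keeps the original frame
    height; the NaN rows form the inpainting mask for the SR model."""
    out = []
    for f in frames:
        for start in (0, 1):                 # even field, then odd field
            field = np.full_like(f, np.nan, dtype=float)
            field[start::2] = f[start::2]    # keep the acquired lines only
            out.append(field)
    return out
```

The output contains twice as many frames as the input, each half-filled, which is exactly the enlarged sequence that is then super-resolved.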
4.4.2 Barcode Image Processing
A linear barcode is a series of alternating black and white stripes encoding information
in the relative widths of the bars. The most common barcode scanners are laser scanners
that read a 1D signal from the barcode. Imaging scanners that obtain a full 2D image of
the barcode are also used to accommodate nonlinear barcodes that encode information in both the horizontal and vertical directions. However, this accommodation results in lower decoding performance on linear barcodes for the imaging scanners as compared to the laser
scanners. One method for decoding linear barcode images is to repeatedly acquire signals
called scanlines which are perpendicular to the bar orientation until a scanline is decoded.
This method was a natural choice for industry because this allows the software to use the
existing decoding routines used by the laser scanners. Unfortunately, the imaging scanners
cannot obtain signal resolution that is as high as the laser scanners; the current imaging
scanners have not yet reached the mega-pixel level. Figure 4.9 shows a barcode image that
is not decoded by the current state-of-the-art software, along with the many scanlines that were tested. This method is somewhat wasteful, since an entire 2D image is acquired and stored
in memory but only small 1D portions of the image are used for decoding. Our goal is to
outline a computationally efficient method that uses the entire image to prepare a 1D signal
that is sent to the decoding software.
If we think of each of the scanlines as a one pixel high image, SR can be used to create
a single high-resolution signal from the acquired scanlines. Suppose the original image
u0(x, y) is cropped to contain just the barcode region. This cropping can be accomplished
by automatic barcode detection methods based on local line statistics [5]. Exploiting the
unique geometry of barcodes, it seems natural to register the scanlines by tracing each pixel
along a scanline down the bars in u0 to the base of the image, which we will refer to as the
t-axis (see Figure 4.11). The resulting 1D signal u(t) will be called the projected signal,
as it consists of the projection of all scanlines onto a common axis. To understand how to
Figure 4.9: Tested scanlines on a barcode image.
project the scanlines, first note that the orientation of the bars depends on the position of
the barcode image in three-dimensional space.
Although the barcode itself is assumed to be planar, the image surface can exhibit three
types of rotations: roll, pitch, and yaw (see Figure 4.10). Image roll occurs within a plane
parallel to the imaging plane of the camera. The bars will remain parallel under image
roll. Image pitch occurs when the top or bottom of the barcode is moved towards or away
from the camera. Under image pitch, the bars are no longer parallel and instead should
converge to some focal or vanishing point. Note that in practice an acquired image will
almost surely exhibit roll and/or pitch. The presence of these rotations will affect how the
projection is done and we will show their presence is actually necessary for super-resolution.
The third image rotation, yaw, occurs when the left or right side of the barcode is pulled
from the camera. Yaw affects only the relative widths of the bars and hence should not
affect our projection. Decoding software corrects for yaw distortion using the “self-clocking”
feature built into linear barcodes – the information is encoded so that an edge occurs at set
intervals.
Figure 4.10: Three degrees of freedom in barcode rotation.
For the case of image roll, the bars will be parallel and the projection can proceed by
tracing each pixel in u0 along a vector parallel to the bar orientation. Suppose the roll angle
θ with respect to the y-axis is known. In practice, this angle is the first thing computed by
the imaging scanner software because the scanlines are oriented at this angle θ. The t-axis
will be perpendicular to the bar orientation and is given by y = x tan θ, where the origin is
given at the lower left corner of the image. Some trigonometry shows that the point u(t)
on the projected signal is obtained from the image pixel u0(x, y) by
$$t = x \sec\theta + (y - x\tan\theta)\sin\theta. \qquad (4.5)$$
An example of the projected signal u(t) for the roll case is shown in the third signal in
Figure 4.12. Note that if the roll angle θ = 0 or π/2, the projection will trace pixels to the
same position along the t-axis. This is called degenerate sampling and the resolution of the
Figure 4.11: Creating a projected signal u(t) from a barcode image u0(x, y). Left: projection with parallel bars (roll). Right: projection from focal point F for non-parallel bars (pitch).
signal will not be improved. Degenerate sampling will only occur when tan θ is a rational
number. In practice, the chances of obtaining such a roll angle are very small. Thus, image
distortion is actually essential for super-resolution.
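The roll projection can be sketched by evaluating (4.5) at every pixel and accumulating values into bins along the t-axis (an illustrative sketch; the binning scheme and averaging within a bin are our own simplifications):

```python
import numpy as np

def project_roll(u0, theta, n_bins):
    """Project a barcode image distorted by roll onto the t-axis using
    (4.5), averaging all pixels that land in the same bin."""
    H, W = u0.shape
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    # t = x sec(theta) + (y - x tan(theta)) sin(theta), as in (4.5).
    t = x / np.cos(theta) + (y - x * np.tan(theta)) * np.sin(theta)
    # Map t values to n_bins equally spaced intervals along the t-axis.
    tn = (t - t.min()) / (t.max() - t.min() + 1e-9)
    bins = np.clip((tn * n_bins).astype(int), 0, n_bins - 1)
    sums = np.bincount(bins.ravel(), weights=u0.ravel(), minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)
```

With θ = 0 every column collapses onto a single bin, reproducing the original column pattern with no resolution gain, which illustrates the degenerate sampling discussed above.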
For image pitch, the bars will not be parallel and instead should converge to some focal
point F = (xF , yF ). Each image pixel u0(x, y) should be traced along a vector connecting
F and (x, y) to the t-axis at the barcode base. Suppose the base points indicating the left
and right lower corners of the image are P1 = (x1, y1) and P2 = (x2, y2), respectively (see
Figure 4.11). By similar triangles, the pixel u0(x, y) is projected to u(t) by
t = x1 + d (x2 − x1),
d = [(x − xF)(y1 − yF) − (y − yF)(x1 − xF)] / [(x − xF)(y1 − y2) − (y − yF)(x1 − x2)].
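A direct transcription of this projection, as a sketch; the focal point F and base corners P1, P2 are assumed known, and the names are illustrative:

```python
def project_pitch(x, y, F, P1, P2):
    """Trace pixel (x, y) along the line through the focal point F down to
    the barcode base P1-P2, returning its position t along the base."""
    xF, yF = F
    x1, y1 = P1
    x2, y2 = P2
    # similar-triangles ratio d from the displayed formula
    d = ((x - xF) * (y1 - yF) - (y - yF) * (x1 - xF)) / \
        ((x - xF) * (y1 - y2) - (y - yF) * (x1 - x2))
    return x1 + d * (x2 - x1)
```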
To calculate the position of the focal point F , the bar edges are traced to the point that
best matches the intersection in the least squares sense. This calculation is given by the
following simple theorem.
Theorem 4.1 Given a set of n lines Li : y = mi x + bi, 1 ≤ i ≤ n, let the point F = (xF, yF) be

F = argmin_{F ∈ ℝ²} Σ_{i=1}^n d²(F, Li)

for the Euclidean distance d. Then F is given by

xF = (CD − BE) / (AD − B²),   yF = (AE − BC) / (AD − B²)

where

A = Σ_{i=1}^n mi² / (mi² + 1),   B = Σ_{i=1}^n −mi / (mi² + 1),
C = Σ_{i=1}^n −mi bi / (mi² + 1),   D = Σ_{i=1}^n 1 / (mi² + 1),
E = Σ_{i=1}^n bi / (mi² + 1).
The proof follows immediately by setting the first derivative of Σ_{i=1}^n d²(F, Li) to zero and solving
for the coordinates of F. Computationally, the coordinates of the focal point are surprisingly
small, generally on the order of 10³ for barcodes with slight natural pitch angles. The computational
difficulty comes in accurately tracing the lines Li. We found a good strategy was to only
count distinct lines consisting of very low (black) or very high (white) pixel values. Note
that the procedure outlined for non-parallel bars handles images rotated by both pitch and
roll. The bars will be parallel if the image is distorted by roll only.
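Theorem 4.1 translates directly into a few lines of code. A sketch, assuming the traced lines are given by slope-intercept pairs (mi, bi):

```python
import numpy as np

def focal_point(ms, bs):
    """Least-squares intersection of the lines y = m_i x + b_i (Theorem 4.1)."""
    ms, bs = np.asarray(ms, float), np.asarray(bs, float)
    w = ms**2 + 1.0
    A = np.sum(ms**2 / w)
    B = np.sum(-ms / w)
    C = np.sum(-ms * bs / w)
    D = np.sum(1.0 / w)
    E = np.sum(bs / w)
    det = A * D - B**2
    return (C * D - B * E) / det, (A * E - B * C) / det
```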
If the bars in the image are parallel, the projected signal u(t) is calculated by estimat-
ing the roll angle θ and using (4.5). Otherwise, the non-parallel projection is given by
tracing lines to the focal point F and using (4.4.2). The resulting projected signal u(t) is
generally very noisy due to discretization, camera blur, hand jitter, electrical noise, defects
in the barcode paper, and inaccurate estimation of the parameters used in the projection
(θ, P1, P2, F ). Also, the projection points u(t) are non-uniformly spaced along the t-axis.
Our solution is to divide the t-axis into N equally spaced intervals. Just as in the image SR
case, each interval could contain 0, 1, or more pixel values. The signal is smoothed using the
1D version of the variational minimization for multiple images with known registration (4.1),
where ΩM is simply the discretized t-axis. We opted for the TV regularization, because the
TV norm has been shown to be effective in denoising 1D barcode signals [45, 100].
Figure 4.12 shows a barcode image that was not decoded by the current software. The
first signal is the ideal signal for this barcode. The second signal is a scanline taken from the
center of the image. The scanline has length 150, corresponding to the pixel width of the
barcode. Since the bars were detected to be parallel, the projection was calculated using the
procedure for image roll. The projected signal is very noisy and consists of several thousand
non-uniformly spaced values. The projected signal is divided into N = 500 equally spaced
intervals and smoothed with the digital TV filter super-resolution. The final SR signal
appears smoother and higher resolution than the scanline signal, but the key fact is that
the SR signal is decoded by the decoding software. This example shows that variational SR
can be used to decode barcodes that were previously not decodable.
We found that the projection method for non-parallel bars was less reliable than for the
parallel case, because the pitch distortion requires accurate tracing of the bars to the focal
point. In applications, reliable line tracing algorithms like the Hough transform are too
expensive computationally. Figure 4.13 shows a barcode image with a severe pitch angle.
We used a simple tracing technique that follows paths of the darkest pixels in the image,
indicated by the red dots in the figure. The resulting projected signal was higher quality
than the scanline through the center, notably at positions 150-200.
As an experiment, we obtained a particularly troublesome database consisting of 71
misdecoded barcode images. That is, the software was able to decode the barcode but the
result was not the correct encoded information. Misdecodes are very rare, since most bar-
code symbologies contain error detection features such as a check-sum digit. The projected
signal was calculated for each image, using the parameters and barcode region provided by
the decoding software. Because the scanner typically has very low computational power
and the algorithm has to run very quickly, we were unable to implement TV minimization
on an actual scanner. Instead, we took the mean of all gray values u(t) in each interval
Figure 4.12: Super-resolution of a Code 128A barcode image with roll only. Top to bottom: original image and final TV SR image, ideal signal, single scanline, projected signal, TV SR signal with λ = 10.
Figure 4.13: Super-resolution of UPC barcode with severe pitch angle. Top: original image with traced bars indicated by dots. Bottom: Scanline signal in red superimposed on TV projected signal in blue.
along the t-axis. Note that as λ→∞, the TV minimization is equivalent to this averaging
process. The mean filtered projected signals were then sent through the signal decoding
software. Of the 71 misdecoded images, 28 (39%) were decoded properly and the remaining
43 were detected as no-decodes. Although decoding 28 of the images is certainly a success,
the important result is that none of the images were misdecoded. This indicates the SR
method may be useful for decoding images that were previously not decodable and also for
checking for misdecodes.
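The interval-averaging step used in this experiment (the λ → ∞ limit of the TV minimization) can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def interval_means(t, u, N):
    """Average the projected samples u(t) over N equal intervals of the
    t-axis; empty intervals are returned as NaN."""
    t, u = np.asarray(t, float), np.asarray(u, float)
    edges = np.linspace(t.min(), t.max(), N + 1)
    # assign each sample to one of the N intervals
    idx = np.clip(np.searchsorted(edges, t, side='right') - 1, 0, N - 1)
    sums = np.bincount(idx, weights=u, minlength=N)
    counts = np.bincount(idx, minlength=N)
    out = np.full(N, np.nan)
    out[counts > 0] = sums[counts > 0] / counts[counts > 0]
    return out
```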
4.4.3 Reconstruction from MRI Sensor Data
Magnetic Resonance Imaging (MRI) is an increasingly important tool for detection and
diagnosis of medical conditions. In a phased-array MRI apparatus, N independent receiver
elements (coils) are placed around the subject, generally at equally spaced intervals along
a circle or ellipse. In the presence of a strong magnetic field, atoms with magnetic dipoles
align parallel to the magnetic field. All atoms in the sample are excited by a brief burst
of non-ionizing radiation. For the atoms precessing at the frequency of the excitation, this
results in the atoms being displaced from a state of equilibrium. A coil sensor measures
the “relaxation time” of the atoms, indicating the time it takes for the nucleus to return
to its equilibrium energy state. The biological relaxation time is dependent on how the
atoms are bound to the molecules and can be used to differentiate tissue types
[6]. With the aid of spatial encoding, the local differences in relaxation can be used to
generate images representing both concentration and biochemical properties. For standard
anatomical imaging, simple grayscale images are produced that illustrate differences in proton
density. From each of the N sensors, a grayscale image ui : Ωi → ℝ can be constructed.
The processing ensures that the N independent images are spatially aligned at the pixel
level. For an underlying (real) image of the subject u : Ω → ℝ, the image ui is theoretically
derived from u by multiplication by a sensitivity profile Pi : Ωi → ℝ with additive Gaussian
noise ni:
ui(x) = Pi(x)u(x) + ni(x). (4.6)
The profile Pi(x) is the transverse component of the magnetic field from the receiver element
and reflects the sensitivity or confidence of the ith sensor at pixel x. An example of an actual
sensor image is shown in Figure 4.14, with zoomed images of a region close to the sensor and
another distant. Note that close to the sensor, the image is well-defined with strong edges
and good contrast. As we move away from the sensor the image grows darker, indicating
the sensitivity profile decays to zero and only the noise remains.
The standard approach for combining the N sensor images into one MR image v is to
take the L2-norm through the images:
v(x) = √( Σ_{i=1}^N [ui(x)]² ).
Figure 4.14: An image from an MRI sensor and contrast-adjusted zoom of two regions.
Near a sensor, the L2-norm is close to the maximum gray value corresponding to the value
from that sensor image. In the center of the image, the L2-norm is roughly the mean of all
sensor images at that position. Larsson et al. showed that among all known reconstruction
techniques without knowledge of the sensitivity profiles, the L2-norm produces images with
the highest SNR [67].
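The root-sum-of-squares combination is one line in practice; a minimal sketch:

```python
import numpy as np

def sos_combine(sensor_images):
    """Root-sum-of-squares (L2-norm) combination of N aligned sensor images."""
    stack = np.stack([np.asarray(ui, float) for ui in sensor_images])
    return np.sqrt(np.sum(stack**2, axis=0))
```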
The soft inpainting model presented in Section 4.3 suggests that the reconstruction of u
from the sensor images u1≤i≤N could be accomplished by variational SR. For a magnification
M ≥ 1 and known sensitivity profiles in (4.6), the model is
min_u E[u | u_{1≤i≤N}, ϕ_{1≤i≤N}, P_{1≤i≤N}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{Di} ( Pi u − (ui ∘ ϕi^{−1}) )² dx.
However, this model requires knowledge of the sensitivity profile. Unfortunately, it is cur-
rently not possible to measure the sensitivity profiles, since they are dependent on the
sample. The intensity of ui(x) is proportional to a negative exponential of the relaxation
time. The relaxation time is, in turn, proportional to the strength of the magnetic field
exerted by the ith coil. Since magnetic force decays with the distance squared from the
source, we propose the sensitivity profile
Pi(x) = exp(−d²(x, si)/σ²)
where si is the position of the ith sensor and σ is a parameter indicating the rate of decay.
Note that P → 1 as the position approaches the sensor and P → 0 as the pixels grow more
distant. The sensor positions s1≤i≤N could be measured directly on the MRI apparatus.
We can also try to interpolate the sensor positions by tracing backwards from the L2-norm
image v to the sensor images ui. Matching Piv and ui in the least squares sense gives the
sensor positions and sensitivity parameter σ by
min_{si, σ} Σ_{i=1}^N ( exp(−d²(x, si)/σ²) v(x) − ui(x) )².   (4.7)
Assuming the sensors are placed evenly in a circle around the image center, we can write
the sensor position in polar coordinates as si = (r, θ + 2π(i−1)/N). Then the minimization
(4.7) need only find three parameters: r, θ, and σ. Since the functional is differentiable, the
minimization can be performed by gradient-based techniques. Figure 4.15 shows the sensor
positions found by backtracking from an L2-norm brain image for a system with N = 16
sensors.
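A sketch of this three-parameter fit using a gradient-based optimizer from SciPy; the discretization, the choice of image center, and all names here are our own illustration, not the thesis code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sensor_geometry(v, sensor_images, p0):
    """Fit (r, theta, sigma) in (4.7), assuming N sensors evenly spaced on
    a circle about the image center; p0 is an initial guess."""
    N = len(sensor_images)
    h, w = v.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]

    def objective(p):
        r, theta, sigma = p
        err = 0.0
        for i, ui in enumerate(sensor_images):
            ang = theta + 2 * np.pi * i / N
            sx, sy = cx + r * np.cos(ang), cy + r * np.sin(ang)
            d2 = (xs - sx) ** 2 + (ys - sy) ** 2
            # residual of matching P_i * v against u_i, as in (4.7)
            err += np.sum((np.exp(-d2 / sigma ** 2) * v - ui) ** 2)
        return err

    return minimize(objective, p0, method='BFGS').x
```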
Inserting the sensitivity profile and sensor positions given by (4.7), the variational SR
model is

min_u E[u | u_{1≤i≤N}, ϕ_{1≤i≤N}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{Di} ( exp(−d²(x, si)/σ²) u(x) − (ui ∘ ϕi^{−1})(x) )² dx.   (4.8)
Since the sensor images are aligned at the pixel level, for M = 1 we immediately have
Ωi = ΩM and ϕi = Id. For M > 1, the sensor images may have relative sub-pixel shifts
that need to be determined.

Figure 4.15: Positions of 16 MRI sensors found by tracing backwards from the L2-norm image, shown in center.

Note that to use the computational methods outlined in Section 4.2.1, we rewrite (4.8) as

min_u E[u | u_{1≤i≤N}, ϕ_{1≤i≤N}] = R(u) + (λ/2N) Σ_{i=1}^N ∫_{Di} exp(−2d²(x, si)/σ²) ( u(x) − exp(d²(x, si)/σ²) (ui ∘ ϕi^{−1})(x) )² dx.
This version of the minimization more closely resembles the soft inpainting model presented
in Section 4.3.
Figure 4.17 shows the result of Mumford-Shah SR using the sensor positions shown in
Figure 4.15. The resolution is not changed between the images (M = 1), because we want
to show a reconstruction rather than an interpolation. Hence there is no need for image
registration. This highlights the difference between effective and actual resolution. Even
though the SR image contains the same number of pixels as one of the 16 sensor images,
the SR image is clearly higher resolution. The four images in Figure 4.17 are displayed
from the minimum to maximum gray value, so all 4 images are equally bright at the sides
closest to the sensors. The L2-norm image is bright around the edges but very dark in
the center, a well-known problem in MR image processing. Note that the SR images are
considerably brighter in the center, making the raw image much easier to examine as a
whole. As λ is decreased, the image becomes brighter. This is partly because decreasing
λ results in greater smoothing and removes outliers that would affect the image display.
Figure 4.16 shows zooms of the central area of the brain of the L2-norm and SR image,
with the contrast enhanced in the L2-norm image to match that in the SR image. The SR
image clearly contains less noise, but the edges and shapes are still distinct.
Figure 4.16: Zoom of central area of brain. Left: L2-norm image with enhanced contrast. Right: MS SR with λ = 100, γ = 2000.
Compared to the standard L2-norm reconstruction, the variational SR reconstruction
from sensor data offers three main advantages. First, the contrast is enhanced to make
the central portion of the image more visible. Of course, a spatial contrast equalization
scheme could be used to achieve the same effect. But medical diagnosis is based not only
on shape, but also texture and color. The SR reconstruction enhances the contrast in a
physically meaningful way, so the intensities reflect biological tissue not just mathematical
normalization. Second, the Mumford-Shah functional smooths the image, removing noise
while preserving edges and fine image structures. This makes the medical image potentially
better suited for diagnosis, as well as further image processing such as segmentation and
automatic quantitative shape analysis. Third, the SR image values are on the same intensity
scale as the original sensor images, unlike the L2-norm gray values. This could potentially
make it easier to correlate values directly with signal strength and identify the type of tissue.
Figure 4.17: Mumford-Shah fusion of 16 MR sensor images. Top left: a sensor image. Top right: L2-norm image. Bottom left: MS SR with λ = 100, γ = 2000. Bottom right: MS SR with λ = 10, γ = 2000. All four images have the same dimensions.
Chapter 5
Quantized Zooming
5.1 Introduction
5.1.1 Quantized Image Processing and the Quantized TV Energy
Up to this point, we have considered images u : Ω → ℝ that take on a continuum of intensity
levels. Realistically, a digital image should map to a finite set of gray values. The range of
the image is only relaxed to the entire real line for computational and theoretical purposes.
An 8-bit image, the standard for JPEG, maps to integer values between 0 and 255. In
certain situations we may wish to restrict the range of possible values even further, such as
when processing a text image that should theoretically be binary-valued. For segmentation
and recognition tasks, cutting down the number of gray levels is a common image processing
trick for better defining the components of an image. Quantized image processing refers to
any transformation on a grayscale real-valued image u0 : Ω → ℝ that produces a digital
image u : Ω → I, where I is a given set of discrete gray values.
To adapt the variational approach to the quantized case, define the quantized TV energy
minimization as:
min_{u ∈ {I1,...,IL}} ETV[u | u0] = ∫_Ω |∇u| dx + (λ/2) ∫_Ω (u − u0)² dx   (5.1)
where I1, . . . , IL represent the L fixed intensity levels. For the moment, we will consider
the intensity levels to be specified a priori. In Section 5.5.1, we will discuss the problem
of determining the intensities I1, . . . , IL. For small values of L, the TV energy should
closely resemble the Mumford-Shah energy. Note that for a binary 0-1 image, the TV and
Mumford-Shah energies coincide up to choice of parameters.
In this chapter, we will discuss how to minimize the quantized TV energy using the graph
cut approach developed by Boykov et al. [17]. This method gives an exact minimization of
the quantized TV energy and can be computed in low-order polynomial time. In Sections 5.3
and 5.4, we will show how to apply this model to various low-level vision tasks, including
novel applications to inpainting, zooming, and deconvolution. Extensions of the model
to other energies will be presented in Section 5.5. We conclude this thesis by presenting
applications to the enhancement of text images, barcodes, and MR brain images.
5.1.2 Previous Work on Quantized Image Processing
There have been several PDE/variational approaches to quantized image processing, mainly
focusing on binary denoising and segmentation. The most common strategy has been to
drive the image toward the gray values 0 and 1 by introducing a “double-well” penalty term
u2(u − 1)2 into the energy. Esedoglu proposed adding this term to the 1D TV energy for
partially blind deconvolution of barcode signals [45]. Nikolova introduced a binary denoising
method that uses the anisotropic TV norm to encourage 0-1 values [78]. Modifying the
Mumford-Shah Γ-convergence model, Shen introduced the double-well function on the edge
set z, effectively restricting the image to two levels [90]. Lie et al. offered a level set
implementation of the binary Mumford-Shah model [68]. Chan, Esedoglu, and Nikolova
multiplied the fidelity term of the Mumford-Shah model by the double-well function to
drive the image to 0-1 [29]. Most recently Bertozzi, Esedoglu, and Gillette proposed a binary
inpainting model based on the Cahn-Hilliard equation, a fourth-order PDE containing the
double-well function in its time evolution. This last model is particularly interesting because
it can complete isophotes across large inpainting domains [14].
While each of the variational models above has its strengths, introducing a penalty term
drives the image towards specific binary values but the resulting image is still grayscale.
Of course, the result could be thresholded to a pure binary image, but this could introduce
rounding errors and the thresholded result may no longer correspond to a minimum energy
image. The goal is to incorporate the quantization and image enhancement steps into one
procedure. This suggests working in a combinatorial optimization structure rather than
finding a minimum energy image by the calculus of variations.
Incorporating an energy similar to TV into a network flow model, Boykov, Veksler, and
Zabih showed the energy can be minimized by computing a minimum graph cut [17]. The
key feature of this method is that the minimization is exact, both in the fact that the
minimum is global and that the resulting image takes on only the specified intensity values.
The graph cut strategy has been applied to many problems in image processing including
segmentation, object recognition, disocclusion, and multi-camera scene reconstruction. We
refer the reader to [16] for an overview. Darbon and Sigelle adapted the graph cut method
to the TV energy and proposed a fast solution to the L-level problem that repeatedly finds
binary cuts [39]. Chambolle studied the binary TV model in the context of binary Markov
Random Fields and outlined a fast implementation of the multi-level model [27]. We will
discuss the graph cut method in the next section, but first we need to establish some basic
graph theory definitions and algorithms.
5.2 Quantized TV Minimization by Graph Cuts
5.2.1 Network Flows: Definitions
In this section, we establish the basic graph theory definitions necessary for describing the
quantized TV model. Most of the concepts and theorems below were first stated in the
1956 book by Ford and Fulkerson [50]. For a review, we refer the reader to the introduction
to graph theory textbook by West [99]. We use the standard graph theory notation of
describing a graph G by a vertex set V and an edge set E. Denote a directed edge from
vertex u to vertex v by uv.
Definition 5.1 (Flow Networks) A two-terminal flow network is a connected directed
graph G = (V,E) equipped with a nonnegative edge weight function c : E → ℝ≥0 indicating
the capacity of each edge. A vertex s ∈ V is identified as the source and another vertex
t ∈ V as the sink. There should be no edges entering the source or leaving the sink:
c(vs) = c(tv) = 0 ∀ v ∈ V .
For notational convenience, we assume that missing edges have zero capacity: c(uv) =
0 ∀ uv /∈ E. Ford and Fulkerson first described flow networks by analogy to a system
of pipes with water flowing from a source to the sink. The term “capacity” refers to the
amount of water that can pass through each pipe. Beyond plumbing design, flow networks
have found applications to transportation networks and assignment problems. We can now
formalize our definition of the minimum cut of a network.
Definition 5.2 (Minimum Cut) A cut [S, T ] is a partition of the vertex set V such that
S ∪ T = V, S ∩ T = ∅, s ∈ S, t ∈ T . The value of a cut [S, T ] is the total capacity of edges
between S and T :
val([S, T]) = Σ_{u∈S, v∈T} [c(uv) + c(vu)].
The minimum cut of a network has the minimum value among all possible cuts.
Note that the minimum cut for a given network is not guaranteed to be unique. The
Min-Cut Problem is to find the minimum cut of a given flow network. This is a classical
combinatorial optimization problem that can be solved in low-order polynomial time. In
the plumbing analogy, if all pipes in a network had the same capacity, the Min-Cut Problem
would remove as few pipes as possible to sever the source from the sink. Most approaches to
this problem actually solve the related problem of finding the maximum flow of the network,
defined below.
Definition 5.3 (Maximum Flow) A nonnegative edge weight function f : E → ℝ≥0 is
called a feasible flow if f satisfies the two properties:
1. Feasibility: f(uv) ≤ c(uv) ∀ uv ∈ E
2. Conservation of Flow: Σ_{v∈V} f(uv) = Σ_{v∈V} f(vu) ∀ u ∈ V \ {s, t}.

The value of a flow is equal to the net flow into the sink:

val(f) = Σ_{v∈V} f(vt).
The maximum flow is a feasible flow with the maximum value.
The Max-Flow Problem is to find the maximum flow of a given flow network. By analogy,
the goal is to determine the total amount of water that could reach the sink. It turns out
that this amount of water equals the amount that would spill out if the pipes were severed
by a minimum cut. Ford and Fulkerson proved that the Max-Flow and Min-Cut problems
are equivalent, as stated in the Min-Cut Max-Flow Theorem below.
Theorem 5.1 (Ford-Fulkerson, 1956) In every network, the value of a maximum flow
equals the value of a minimum cut.
There is a stronger version of this theorem that says the minimum cut can be recovered
from a maximum flow. This procedure is called the Ford-Fulkerson labeling algorithm,
described in the next section.
5.2.2 Network Flows: Algorithms
The first and simplest approach to finding the maximum flow was proposed by Ford and
Fulkerson. It relies on repeatedly finding available paths to increase the flow. These available
flow paths are called augmenting or special paths.
Definition 5.4 (Augmenting Path) An edge uv is said to be saturated if f(uv) = c(uv)
and unsaturated if f(uv) < c(uv). An augmenting path for a flow f is a path P from
s to t consisting entirely of unsaturated edges: f(uv) < c(uv) ∀ uv ∈ P .
It is clear from the definition that a flow is at maximum if and only if there is no
augmenting path for the flow. The idea behind the Ford-Fulkerson algorithm is to add
augmenting paths to the flow until no paths remain.
Ford-Fulkerson Algorithm
Input: Network G = (V,E) with edge capacity function c.
Output: Maximum flow f .
For each uv ∈ E, initialize f(uv) = f(vu) = 0.
while there exists an augmenting path P from s to t
Set df = min{c(uv) − f(uv) : uv ∈ P}.
For each uv ∈ P , set f(uv) = f(uv) + df and f(vu) = −f(uv).
The order in which the augmenting paths are found is flexible and could affect the flow
found if the maximum is non-unique. The search procedure can also drastically affect the
running time. For a poorly chosen search the worst case performance is O(E · val(f)), which
can be unreasonably large even for trivial networks (see p. 596 of [36]). Using a breadth-first
search gives preference to finding the shortest paths in terms of the total number of path
edges. Edmonds and Karp proved that this implementation runs in O(V E²) time [43].
An alternate algorithm called the preflow-push method was developed by Goldberg and
Tarjan [54]. The idea is to assign each vertex a “height” that determines how quickly the flow
streams downhill from each junction. The algorithm has a running time of O(V E log(V²/E)).
A review and comparison of maximum flow algorithms for image processing can be found
in [16].
Boykov, Veksler, and Zabih developed an approximation method for finding near-optimal
solutions to the min-cut problem [17]. Theoretically the α-expansion algorithm finds the
minimum cut [S, T] in O(V² E · val([S, T])) time, although the authors claim that in practice
the running time is O(V). But to achieve this faster runtime, the solution is no longer
guaranteed to be exact. For image processing problems, the authors observed that the
numerical results are within 1% of the optimal solution.
The following corollary to the Ford-Fulkerson Theorem follows from the definition of an
augmenting path [99].
Corollary 5.2 For a minimum cut [S, T ], every edge joining a vertex in S and a vertex in
T is saturated.
Using this corollary, we can determine the minimum cut from a maximum flow by
checking saturated edges. The algorithm, a variant of the Ford-Fulkerson algorithm above,
keeps track of a set of reached vertices R and a set of searched vertices S. The search traces
forward from the source along unsaturated edges and backwards (towards the source) along
edges with positive flow. The algorithm terminates when there are no more vertices to
search. Note that the algorithm does not reach the sink t if and only if the given flow f is
maximum [99].
Ford-Fulkerson Labeling Algorithm
Input: Network G = (V,E) with capacity c and maximum flow f .
Output: Minimum cut [S, T ].
Set R = {s}, S = ∅.
while R ≠ S
Choose v ∈ R \ S.
For each vu ∈ E, if f(vu) < c(vu) then set R = R ∪ {u}.
For each uv ∈ E, if f(uv) > 0 then set R = R ∪ {u}.
Set S = S ∪ {v}.
Return minimum cut [S, V \ S].
The labeling algorithm runs in O(V E) time. It is possible to incorporate the labeling
algorithm into the max-flow algorithm by tracking edges that become saturated as the flow
is assigned. So the minimum cut does not need to be found as a separate step.
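The labeling search might be sketched as follows, assuming capacities and a maximum flow stored as nested dictionaries (names illustrative):

```python
def labeling_min_cut(capacity, flow, s):
    """Ford-Fulkerson labeling: given capacities and a maximum flow, both as
    nested dicts u -> {v: value}, return the source side S of a minimum cut
    by searching forward along unsaturated edges and backward along edges
    carrying positive flow."""
    S, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v, c in capacity.get(u, {}).items():      # forward, unsaturated
            if flow.get(u, {}).get(v, 0) < c and v not in S:
                S.add(v)
                stack.append(v)
        for v in capacity:                            # backward, positive flow
            if flow.get(v, {}).get(u, 0) > 0 and v not in S:
                S.add(v)
                stack.append(v)
    return S
```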
5.2.3 The Quantized TV Model
The quantized TV energy in (5.1) can be modeled by a flow network under an L1 regular-
ization term
|∇u|1 = |ux|+ |uy|.
This version is often called the anisotropic TV norm because, unlike the classical L2 TV
norm, it is not rotationally invariant. The L1 norm gives preference to edges parallel to the
axes and tends to produce blockier images with sharp corners. Using this norm is necessary
for the graph cut framework, but given the applications to barcodes and text images it
seems appropriate to use a model that prefers blocky images.
Let x ∼ y denote that two pixels x, y ∈ Ω are adjacent under the standard 4-connected
cross topology. Rewrite the TV energy in (5.1) in discrete form under the L1 regularization
as
min_{u ∈ {I1,...,IL}} ETV[u | u0] = Σ_{2≤j≤L} Σ_{x∼y ∈ Ω, u(x)≥Ij>u(y)} (Ij − Ij−1) + (λ/2) Σ_{1≤j≤L} Σ_{x ∈ Ω, u(x)=Ij} (Ij − u0(x))²   (5.2)
where the intensity levels I1, . . . , IL are given in ascending order. Although (5.2) appears
more cumbersome, it will illuminate how to model the TV energy as a flow network. For
neighboring pixels x and y with u(x) 6= u(y), we add a regularization penalty Ij − Ij−1
corresponding to the number of levels separating u(x) and u(y). For the fidelity term, we
add the amount (Ij − u0(x))2 if u(x) = Ij .
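For reference, (5.2) can be evaluated directly on a quantized image; a minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def quantized_tv_energy(u, u0, lam):
    """Discrete anisotropic TV energy (5.2): L1 gradient over 4-connected
    neighbors plus the quadratic fidelity term. u should already take
    values in the quantized level set."""
    u, u0 = np.asarray(u, float), np.asarray(u0, float)
    reg = np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()
    fid = 0.5 * lam * np.sum((u - u0) ** 2)
    return reg + fid
```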
For each image pixel, create a directed path of L + 1 vertices corresponding to the L
intensity levels plus a terminal. Each pixel’s path starts at a common vertex corresponding
to level I1, which we identify as the source s. The paths also terminate at a common
sink t, which we identify with the dummy variable IL+1. For pixel x ∈ Ω, we denote its
corresponding vertex at level Ij , 1 ≤ j ≤ L + 1, by xj . Define the capacity function c for
this graph by
c(xj, xj+1) = λ (Ij − u0(x))², 1 ≤ j ≤ L, x ∈ Ω   (5.3)
c(xj, yj) = Ij − Ij−1, 2 ≤ j ≤ L, x ∼ y, x, y ∈ Ω.   (5.4)
All edges in the graph not specified by these two equations are assumed to have capacity
zero (or non-existent). Matching these equations to (5.2) shows that a minimum cut of the
network also minimizes the TV energy. This network set-up is sometimes called a “ladder”
system and is illustrated in Figure 5.1. The fidelity term (5.3) is given on the sides of the
ladder and the regularization term (5.4) on the rungs.
Figure 5.1: Illustration of quantized TV graph model for neighboring pixels x ∼ y.
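The ladder construction (5.3)-(5.4) can be written out explicitly. The following sketch builds the capacity table for a grayscale image; the vertex encoding and names are our own illustration:

```python
import numpy as np

def build_tv_ladder(u0, levels, lam):
    """Build the 'ladder' network (5.3)-(5.4) for the quantized TV energy.
    A pixel p at level j is the vertex (p, j); level 1 is collapsed to the
    common source 's' and level L+1 to the sink 't'."""
    L = len(levels)
    h, w = u0.shape
    cap = {}

    def add(u, v, c):
        cap.setdefault(u, {})[v] = cap.get(u, {}).get(v, 0) + c

    def node(p, j):
        return 's' if j == 1 else ('t' if j == L + 1 else (p, j))

    for p in np.ndindex(h, w):
        # fidelity edges along each pixel's path, eq. (5.3)
        for j in range(1, L + 1):
            add(node(p, j), node(p, j + 1), lam * (levels[j - 1] - u0[p]) ** 2)
        # regularization "rungs" between 4-neighbors, eq. (5.4), both directions
        y, x = p
        for q in ((y + 1, x), (y, x + 1)):
            if q[0] < h and q[1] < w:
                for j in range(2, L + 1):
                    add(node(p, j), node(q, j), levels[j - 1] - levels[j - 2])
                    add(node(q, j), node(p, j), levels[j - 1] - levels[j - 2])
    return cap
```

Feeding these capacities to any max-flow solver and reading off where the minimum cut crosses each pixel's path recovers the quantized minimizer.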
Note that there are two regularization edges in (5.4) for each pair xj , yj , one going each
direction. So the total regularization penalty is actually twice the TV regularization in
(5.2). Hence, the fidelity weight in (5.3) is also doubled, with parameter λ instead of λ/2 as
in (5.2). Defining the value of u(x) to correspond to where the cut crosses the path for
pixel x, the minimum cut produces the minimum energy image u. Note that the minimum
cut will cross each pixel’s path exactly once assuming that λ > 0 and the intensity levels
I1, . . . , IL are distinct. We credit the following theorem to Boykov et al., who developed
the “ladder” graph cut framework for the Potts model [17].
Theorem 5.3 (Boykov-Veksler-Zabih, 2001) Let u0 : Ω → ℝ≥0 be an image, λ > 0 a
parameter, and I1 < I2 < · · · < IL intensity levels. Let [S, T] be a minimum
cut of the flow network defined by (5.3)-(5.4). Then the image u : Ω → ℝ≥0 defined by

u(x) = Ij if xj ∈ S, xj+1 ∈ T, 1 ≤ j ≤ L

is the minimizer of the anisotropic quantized TV energy given by (5.2). Furthermore,
val([S, T]) = 2 ETV[u | u0].
Note that the reason the L1 regularization is appropriate for the graph model is that
it can be written as a sum of pairwise interactions, which cannot be done with the L2
TV norm. Furthermore, the L1 norm can be represented as a summation of levels. For
neighboring pixels x ∼ y at levels Ij and Ij+k, k ≥ 1, respectively, the anisotropic TV
regularization RTV = |u(x) − u(y)| can be written

RTV(u(x) = Ij, u(y) = Ij+k) = Σ_{j ≤ i < j+k} RTV(u(x) = Ii, u(y) = Ii+1).
Note this formula is essentially a discrete version of the TV norm’s co-area formula. This
is called the “levelable” property of the regularization term. Darbon and Sigelle proved
that the anisotropic TV norm is the only convex and levelable image prior that is invariant
to gray-level shifts [39]. Kolmogorov and Zabih gave more general criteria describing
types of energies can be minimized by graph cuts [66].
For an image with N pixels and L desired levels, the number of vertices in the flow
network is N(L − 1) + 2. Except for pixels on the image border, each vertex has six
outgoing edges for the directions up, down, left, right, sink, and source. So the number of
edges is also O(NL). The adjacency matrix is technically of order NL × NL, but if the
graph is stored as a sparse matrix the memory requirement is only O(NL), equivalent to
storing L copies of the original image. The running time of the Ford-Fulkerson algorithm
for this network is O(N³L³). The preflow-push method takes O(N²L² log(NL)) time and
the α-expansion approximation algorithm runs in O(NL) time.
Compared to a continuous-valued TV implementation such as the TV filter or gradient
descent, the graph cut TV model offers the following advantages:
• Exact minimization: The graph cut method computes the global minimum of the
quantized TV energy. A continuous TV implementation generally converges to a
local minimum and may face convergence issues such as controlling the time step
or pre-conditioning. There is no notion of convergence in the graph cut method; when the
minimum cut algorithm terminates, the quantized image is at the minimum.
• Quantization: Thresholding a continuous-valued image or quantizing a discrete-valued
image further can introduce round-off errors. There is no guarantee that a minimizer
of the continuous-valued TV energy can produce the minimizer of the quantized TV
energy. By design, the output of the graph cut method is the minimizer of the
quantized TV energy and the resulting image takes on only the specified intensity
levels.
• Speed: Graph cuts can be computed in time that is low-order polynomial in the
number of pixels by algorithms with fairly low coding complexity. Approximation
algorithms run in linear time in practice.
• Derivative-free: Beyond the gradient in the regularization term, the graph cut method
does not require the computation of any derivatives. The method is not dependent
on a discretization or a computational adjustment like lagged diffusivity.
• No artificial boundary conditions or parameters: Most continuous TV minimization
routines assume the image has Neumann boundary conditions for computational purposes.
To avoid division by zero in the Euler-Lagrange equation, most routines intro-
duce a lifting parameter to the norm of the gradient. For a time-stepping method,
the TV minimization can be thrown off if the size of the time step ∆t is too large.
Disadvantages of the graph cut TV model include:
• Anisotropic TV: The regularization term must be the anisotropic L1 TV norm. It is
well-known that L2 TV minimization tends to produce images that are “blocky,” in
the sense that the minimization favors piecewise constant regions [42]. The images
resulting from L1 quantized TV minimization will appear even more blocky, with
sharp corners and quantized gray levels. In Section 5.5.3, we show that the image will
more closely resemble the isotropic energy if we use a different topology.
• Deconvolution: Incorporating a blur kernel into the quantized TV model is an open
problem, while it is relatively straightforward in the continuous case. We propose an
approximation strategy to the deconvolution problem in Section 5.4.
• Pre-determined intensity levels: The resulting image depends heavily on the choice
of intensity levels, which are assumed to be specified a priori. In Section 5.5.1, we
discuss how to update the levels for a given image, but the initial choice of intensities
is still important. In some applications, it may be easier to use the continuous TV
model for image enhancement and then determine the quantization levels afterwards.
• Large number of levels: The memory requirements and running time of the graph
cut method both scale with the number of intensity levels L, making the graph cut
method computationally expensive for large values of L. For more than a few levels,
say L > 10, the quantized image closely resembles the continuous-valued image (see
Figure 5.3). Hence, we expect the possible thresholding errors in the continuous case
to be small for large L.
5.2.4 Numerical Results
We implemented TV graph cut minimization in Matlab using the preflow-push algorithm
in Stanford's Boost Graph Library [53]. This implementation was found to be significantly
faster than the Ford-Fulkerson method using a breadth-first search. To save memory, the
graph weights were stored in sparse matrices. For simplicity, the number of levels L is
specified as a parameter and the intensity levels were set by a linear increment:
I_j = \min(u_0) + (j-1)\,\frac{\max(u_0) - \min(u_0)}{L-1}, \quad 1 \le j \le L. \quad (5.5)
One drawback of the preflow-push algorithm is that it assumes all edge capacities are
integer-valued. If λ is rational, we can place integer weights on the regularization and
fidelity terms so that their relative importance is preserved. That is, if λ = α/β for α, β ∈ Z+,
then the minimization

\min_u \int_\Omega |\nabla u| \, dx + \lambda \int_\Omega (u - u_0)^2 \, dx

is equivalent to

\min_u \, \beta \int_\Omega |\nabla u| \, dx + \alpha \int_\Omega (u - u_0)^2 \, dx,

since the two energies differ only by the constant factor β.
Then assuming the original image u0 is integer-valued, as is usually the case for 8-bit images,
all capacities defined by (5.3)-(5.4) will also be integer-valued.
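The rescaling step can be sketched as follows; `integer_capacities` is a hypothetical helper name, and `fractions.Fraction` recovers α and β from a rational λ:

```python
from fractions import Fraction

import numpy as np

def integer_capacities(u0, levels, lam):
    """Rescale weights for a rational λ = α/β: put β on every regularization
    edge and α(I_j - u0(x))^2 on the fidelity edges, so that all capacities
    are integers whenever u0 and the levels are integer-valued."""
    frac = Fraction(lam).limit_denominator()
    alpha, beta = frac.numerator, frac.denominator
    fid = alpha * (levels[:, None] - u0.ravel()[None, :]) ** 2
    return beta, fid

reg, fid = integer_capacities(np.array([[3]]), np.array([0, 5]), 0.3)
# λ = 0.3 = 3/10: regularization weight 10, fidelity capacities 3*(I_j - 3)^2
```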
Figure 5.2: Effect of λ on TV minimization with L = 4 levels.
The value of the fidelity weight λ has a large effect on the resulting image. The appro-
priate value of λ is hard to determine, because it depends on the original image, the desired
Figure 5.3: Effect of # levels L on TV minimization with λ = 1.
Figure 5.4: Running time of quantized TV model with preflow-push method. Left: Log-log plot of # pixels N vs. runtime for repeatedly downsampled Barbara image. Right: Log-log plot of # levels L vs. runtime on 50x50 Barbara image. Linear regressions are shown in red.
intensity levels, the amount of noise, and the application. We found that λ = 1 worked well
for 8-bit images. For large values of λ, the result resembles thresholding to the specified
intensities. As λ→ 0, the image becomes smoother and will eventually result in a constant
image. Note that the rightmost image in Figure 5.2 takes on only 2 gray values even though
the image was specified for L = 4 levels. The quantized TV energy restricts the output to
intensities {I1, . . . , IL}, but the resulting image u is only guaranteed to take on
a subset of {I1, . . . , IL}.
If the number of gray levels L is large, the resulting image looks very similar to the
result of continuous-valued TV minimization. This suggests that the quantized TV model
is best for small L. The applications presented in this chapter will be for images at few
levels, especially binary-valued images.
Theoretically, the running time of the preflow-push method is O(N^2 L^2 \log(NL)) for an
N-pixel image processed over L levels. As a numerical experiment, we repeatedly ran the
graph cut procedure on an image with fixed λ and intensity levels. The “Barbara” image
was repeatedly downsampled and we compared the number of pixels N to the running time
of the TV minimization at L = 2 levels. A linear regression of the log-log plot has a slope
of roughly 2.29. Next, a 50x50 Barbara image was processed at different values of L. The
log-log linear regression has slope 2.38. This confirms numerically that the running time is
slightly worse than quadratic in both N and L.
5.3 Application to Low-Level Vision Tasks
5.3.1 Denoising
Quantized TV minimization denoises an image in two ways: the TV norm minimizes the
amount of local variation, and reducing the number of gray levels results in a smoother
image. Quantization is a denoising process by itself. As discussed earlier, the value of
the fidelity weight λ is supposed to be inversely proportional to the variance of the noise.
Reducing the value of λ results in a smoother image, but reducing the number of levels L
seems to have an even greater smoothing effect. In Figure 5.5, the rightmost image appears
the least noisy because the number of levels is the lowest even though the value of λ is fairly
large. However, features are obscured in the binary image. Quantized TV denoising seems
to make the most sense when the underlying noise-free image is quantized, such as with
barcodes or text. The quantized TV denoising model has been studied by Chambolle, who
presented an efficient adjoint formulation implementation [27].
Figure 5.5: TV denoising of Barbara image. Left to right: Original image; L = 5 and λ = 1; L = 5 and λ = 0.1; L = 2 and λ = 1.
The Bayesian rationale for the TV energy assumes that the image was corrupted by
additive Gaussian noise. Jonsson, Huang, and Chan studied the TV energy assuming a
Poisson noise model, which has applications to denoising Positron Emission Tomography
(PET) images [62]. At a pixel x, the Poisson noise model assumes
\Pr(u_0(x) \,|\, u(x)) = \frac{e^{-u(x)} \, [u(x)]^{u_0(x)}}{[u_0(x)]!}.
Taking the negative log likelihood and ignoring constants in the minimization, the fidelity
term becomes

\frac{\lambda}{2} \int_\Omega (u - u_0 \log u) \, dx.
This can be implemented in the graph cut method by changing the fidelity capacity in (5.3)
to
c(x_j, x_{j+1}) = \lambda \left( I_j - u_0(x) \log I_j + C_L \right), \quad 1 \le j \le L, \; x \in \Omega
where CL is a constant satisfying CL ≥ max(u0) log IL and all intensity levels Ij ≥ 1. The
constant CL is added to ensure all edge capacities are non-negative and its presence will
not affect the minimization. Since the TV norm is shift-invariant, we can temporarily shift
the image to handle gray values less than one.
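A sketch of the shifted Poisson capacities follows; the helper name and test image are invented for illustration, and the capacity form is taken directly from the equation above:

```python
import numpy as np

def poisson_fidelity_caps(u0, levels, lam):
    """Poisson-model fidelity capacities λ(I_j - u0(x) log I_j + C_L), with
    C_L = max(u0) log(I_L) so that every capacity is non-negative
    (all levels I_j >= 1 assumed)."""
    CL = u0.max() * np.log(levels.max())
    return lam * (levels[:, None, None]
                  - u0 * np.log(levels[:, None, None]) + CL)

u0 = np.array([[1.0, 4.0], [2.0, 9.0]])
caps = poisson_fidelity_caps(u0, levels=np.array([1.0, 5.0, 9.0]), lam=1.0)
```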
Figure 5.6 shows the result of denoising an image corrupted by Poisson noise. For the
same value of λ and intensity levels, the Poisson TV model is much better at removing the
noise than the standard Gaussian model. The noise points in the interior of the shapes
will persist in the Gaussian model until λ = 0.03, but this low fidelity weight results in
oversmoothing of the shapes’ corners. This simple example shows that knowledge of the
noise process can be incorporated into the quantized TV model and can result in better
images.
Figure 5.6: TV Poisson denoising. Left: Original image corrupted by Poisson noise. Center: TV minimization assuming Gaussian noise with λ = 5, L = 3. Right: TV minimization assuming Poisson noise with λ = 5, L = 3.
5.3.2 Segmentation
Identifying each connected region of constant gray value as an object, quantization can be
thought of as a segmentation procedure. For example, foreground/background segmenta-
tion might suggest processing at L = 2 levels. The quantized TV energy will divide the
image into segments, removing noise and giving preference to connected regions with sharp
boundaries and little local variation [39]. This approach works well for simple images with
constant intensity shapes and high contrast, as in Figure 5.7. However, the segmentation
may not perform well for natural images with texture, low contrast, and complicated ob-
jects taking on many intensity levels. Another problem is that the intensity levels need to
be specified a priori, which may cause difficulties when the objects are close together both
in proximity and gray value.
Figure 5.7: Quantized TV segmentation of simple images. Left 2: segmentation of natural image with λ = 0.5, L = 2. Right 2: segmentation of noisy synthetic image with λ = 0.02, L = 4.
Boykov and Jolly suggested user interaction to guide the segmentation process [15].
Suppose the user selects L “seed” pixels from the image, each pixel belonging to a different
region in the desired segmentation. Assume also that the seeds are at different gray levels
so that we can assign the intensities Ij to match the seed pixel values, with I1, . . . , IL
then sorted in ascending order. For a seed s ∈ Ω, we then define the fidelity capacity for
levels 1 ≤ j ≤ L to be
c(s_j, s_{j+1}) =
\begin{cases}
0 & \text{if } u_0(s) = I_j \\
\infty & \text{otherwise.}
\end{cases}
This forces the minimum cut to pass through the edge corresponding to the gray value
u_0(s) at pixel s. This solves the problem of determining the intensity levels, while also
forcing the seed pixels to be in different regions.
Figure 5.8: TV seeded segmentation. Left: Original image with 3 seed pixels shown in red. Center: Quantized TV minimization with λ = 0.5, L = 3 levels selected by (5.5). Right: Quantized TV minimization with λ = 0.5, L = 3 using seeds.
The implicit assumption is that the selected seeds are not noise points in the image,
which would result in a poor choice of intensity levels. One solution would be to assign the
intensity to be an average over a local neighborhood surrounding the seed pixel. Another
possibility would be to have the user select several pixels in each region and average those
gray values. Such interactive segmentation systems have been developed where the user
“scribbles” in the selected regions [70].
Note that the seeded minimization in Figure 5.8 is superior visually, but the image
is still not segmented properly. An ideal segmentation at L = 3 levels would put the
person, camera, and background into 3 different regions. However, such a segmentation is
not possible using just intensity information; more sophisticated segmentation techniques
are required. The quantized TV energy should be understood as a pre-processing step
for segmentation, rather than the final segmentation result. Darbon and Sigelle showed
that quantization can improve the performance of edge detection and object recognition
algorithms [39].
5.3.3 Texture Segmentation
The image intensities alone do not appear to be enough to truly segment an image u0, but
segmentation could be achieved by applying a statistical filter to u0 designed to discriminate
certain properties of the image. For example, an object detection filter could assign a value
to each pixel indicating the probability that a pixel belongs to a certain object. Minimizing
the quantized TV energy of the filtered image would be similar to a classification system,
such as a support vector machine. The quantized levels divide the image into regions, while
the TV regularization tries to maintain connected components. The applications could
include object detection, recognition, motion tracking, and texture segmentation, the last
of which is examined in this section.
For texture segmentation, a simple texture filter is based on the intensity histogram.
Suppose the range of gray values is divided into n intervals zi, 1 ≤ i ≤ n. Let Pr (zi)
represent the relative frequency of interval zi in the histogram of the image u0. The entropy
e of the image is defined by
e = -\sum_{i=1}^{n} \Pr(z_i) \log_2 \Pr(z_i).
The entropy measures the amount of randomness in the image [55]. Note that the entropy
could be calculated over segments of the image, rather than the whole image. For a fixed
odd integer N , let e(x) denote the entropy calculated over an N ×N window of u0 centered
over pixel x. To handle pixels at the border, impose Neumann boundary conditions on the
image.
Figure 5.9: TV texture segmentation. Left: Original image. Center: TV minimization with λ = 0.2, L = 2 of entropy statistics. Right: TV minimization with λ = 0.05, L = 2 of skewness statistics.
Another basic histogram statistic used for texture discrimination is the skewness s. The
skewness is the third moment of the histogram, calculated with respect to the mean µ:
s = \sum_{i=1}^{n} (z_i - \mu)^3 \Pr(z_i), \qquad \mu = \sum_{i=1}^{n} z_i \Pr(z_i).
As the name implies, this statistic measures the amount of symmetry in the histogram. A
symmetric histogram gives s = 0, a right-skewed histogram s > 0, and left-skewed s < 0
[55]. Let s(x) denote the skewness of the N ×N neighborhood of x.
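The two windowed histogram filters can be sketched together as follows. The bin count is a free parameter not specified in the text, `np.pad` with replicated edges plays the role of the Neumann boundary condition, and bin centers stand in for the gray values z_i:

```python
import numpy as np

def window_stats(u0, N=5, bins=8):
    """Local entropy e(x) and skewness s(x) over an N x N window with
    replicated (Neumann) boundaries -- a sketch of the histogram filters."""
    pad = N // 2
    up = np.pad(u0.astype(float), pad, mode='edge')
    h, w = u0.shape
    e = np.zeros((h, w))
    s = np.zeros((h, w))
    edges = np.linspace(up.min(), up.max() + 1e-9, bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    for i in range(h):
        for j in range(w):
            win = up[i:i + N, j:j + N].ravel()
            p, _ = np.histogram(win, bins=edges)
            p = p / p.sum()                              # relative frequencies
            mu = np.sum(centers * p)
            nz = p > 0
            e[i, j] = -np.sum(p[nz] * np.log2(p[nz]))    # entropy
            s[i, j] = np.sum((centers - mu) ** 3 * p)    # skewness
    return e, s
```

A constant image has zero entropy and zero skewness everywhere, which is a quick sanity check on the filters.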
Figure 5.9 shows the result of 2-level quantized TV minimization on the entropy and
skewness images of u0. The window size was 5x5 and the value of λ was tuned to give the best
result. The results are quite good, considering the simplicity of the filters. Better results
should be obtained from more sophisticated texture filters, such as a linear combination of
histogram-based statistics.
Figure 5.10: TV inpainting. Left: Original image with mask D shown in red. Right: TV inpainting result with L = 3, λ = 0.1.
5.3.4 Inpainting
Suppose image information is missing or damaged in a set D ⊆ Ω. The basic variational
inpainting model defines the fidelity term to be zero in the region D:
\min_{u \in \{I_1, \ldots, I_L\}} E_{TV}[u \,|\, u_0] = \int_\Omega |\nabla u| \, dx + \frac{\lambda}{2} \int_\Omega \mathbf{1}_{\Omega \setminus D}(x) \, (u - u_0)^2 \, dx. \quad (5.6)
In the graph framework, this suggests setting all fidelity capacities to zero in the unknown
region D. However, this could potentially allow the minimum cut to cross the chain of a
pixel x ∈ D more than once. To define the resulting image u in (5.3), the cut should cross
each pixel’s chain exactly once. This is easily remedied by setting all fidelity capacities
along the chain to a positive constant, say 1. As long as all values along the chain are
identical, the minimization will not give preference to any level Ij for an unknown pixel.
Define the fidelity weight for pixel x ∈ Ω at level 1 ≤ j ≤ L by
c(x_j, x_{j+1}) =
\begin{cases}
\lambda \, (I_j - u_0(x))^2 & \text{if } x \in \Omega \setminus D \\
1 & \text{if } x \in D.
\end{cases}
The regularization weights will still hold throughout the image, so that image u is smoothed
in the unknown region D.
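The capacity rule above can be sketched directly; `inpainting_caps` is an illustrative helper name:

```python
import numpy as np

def inpainting_caps(u0, D, levels, lam):
    """Fidelity capacities for quantized TV inpainting: the usual quadratic
    weight outside the damaged set D, and a constant 1 on every chain edge
    inside D so the cut crosses each chain once without favoring a level."""
    caps = lam * (levels[:, None, None] - u0) ** 2
    caps[:, D] = 1.0
    return caps

u0 = np.array([[10.0, 20.0], [30.0, 40.0]])
D = np.array([[False, True], [False, False]])   # pixel (0,1) is damaged
caps = inpainting_caps(u0, D, np.array([0.0, 50.0]), lam=0.1)
```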
The quantized TV inpainting model inherits all the flaws of continuous-valued TV in-
painting: oversmoothing textured regions, blocky images, not completing broken lines, and
completing curves with straight edges. For small values of L, these flaws can become even
more pronounced for the quantized model. As described in Chapter 3, the inpainting model
is best suited for long, thin domains such as scratches. As the diameter of D increases, the
inpainting errors will become more obvious.
5.3.5 Zooming
Using the quantized TV inpainting method described in the last section, an image can be
zoomed by recasting the zooming problem as an inpainting problem. As in Chapter 3, to
zoom by a magnification factor M > 1 separate the pixels in the image u0 by M − 1 pixels.
Define the unknown pixel domain D to be the buffer region separating the known pixels in
u0. The inpainting routine should then fill in the unknown pixels “in between” the known pixels.
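The embedding step can be sketched as follows. The exact grid-size convention is an assumption made for illustration: here the known samples sit at every M-th position, so an h × w image is placed on a grid of size M(h−1)+1 × M(w−1)+1 with M − 1 unknown pixels between known neighbors:

```python
import numpy as np

def zoom_as_inpainting(u0, M):
    """Embed u0 on a finer grid with known samples M pixels apart;
    everything in between is the unknown inpainting domain D."""
    h, w = u0.shape
    big = np.zeros((M * (h - 1) + 1, M * (w - 1) + 1), dtype=u0.dtype)
    D = np.ones_like(big, dtype=bool)
    big[::M, ::M] = u0      # known pixels separated by M-1 unknowns
    D[::M, ::M] = False
    return big, D

big, D = zoom_as_inpainting(np.array([[1, 2], [3, 4]]), M=3)
```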
Unfortunately, the pixels in between may not be filled in. For larger magnification fac-
tors, the image consisting of isolated pixels may actually correspond to the global minimum.
Figure 5.11 shows a simple example of zooming a white square. For magnification M = 2,
a square is achieved but the bottom of the square is not where one would expect. For
magnification M ≥ 3, the minimizer is an image consisting of isolated white pixels. For a
binary image, the TV norm is a constant multiple of the edge length. The results in this
figure are the correct global minimizers because the isolated small squares have lower total
edge length than a large square for M ≥ 3.
For natural images, the failure of the zooming method will generally not be so pronounced.
Figure 5.11: TV zooming by inpainting with L = 2, λ = 1, and magnification factor M.
Figure 5.12: TV zooming by inpainting with L = 2, λ = 1, and magnification M = 2.
However, isolated pixels could appear in regions of high contrast and around thin
lines. In Figure 5.12, note the isolated pixels along the handle of the camera.
For the continuous-valued TV energy, the zooming method can be pushed toward a
local minimum by starting the process with a proper initialization, such as the result of a
linear zooming filter. However, the quantized TV model does not require an initialization
and even if one were incorporated it would still achieve the same global minimum. Another
strategy is to use the “soft” inpainting model where the unknown pixels are given some
affinity for their nearest neighbors.
5.4 Quantized TV Minimization with a Blur Kernel
A black-and-white image, such as a barcode or text, that has been blurred will appear to
be a grayscale image with more than two intensity levels. Recovering the original binary
image should combine the quantization and deblurring procedures. Given an image u0 that
has been blurred by a known operator K, the quantized TV deblurring model is
\min_{u \in \{I_1, \ldots, I_L\}} E_{TV}[u \,|\, u_0, K] = \int_\Omega |\nabla u| \, dx + \frac{\lambda}{2} \int_\Omega (Ku - u_0)^2 \, dx. \quad (5.7)
If the blur is shift-invariant, K can be expressed as a convolution by some kernel function
k(x). The continuous TV deblurring model has been well-studied and can be implemented
by gradient-based or level set methods [32]. The quantized model, however, has proven
more difficult to implement. In terms of graph models, the difficulty lies in the fact that the
blur operator acts on a group of pixel values so the fidelity term cannot be simply expressed
on a single pixel’s edges. Raj and Zabih proposed an approximation method for the special
case when the blur matrix is diagonal, but no deconvolution method exists for the general
case [81].
5.4.1 Deblurring by Numerical Relaxation
To make use of linear algebra notation, express the original image u0 and the ideal image
u as column vectors by reading the pixels, for example, in raster order. For an image with
N pixels, express the linear blur operator as an N ×N matrix K. Note that this blurring
could be spatially varying and is more general than a convolution. Then the fidelity term
can be written ‖Ku − u0‖2 in the L2-norm. If we expand the fidelity term of (5.7), we
obtain

\|Ku - u_0\|^2 = (Ku - u_0)^T (Ku - u_0)
             = u^T K^T K u - u^T K^T u_0 - u_0^T K u + u_0^T u_0
             = (K^T K u, u) - 2(u, K^T u_0) + \|u_0\|^2. \quad (5.8)
In the last line, the second term is linear in u and the third term is a constant. If we could
make the first term linear in u, then we could model the TV energy as a flow network. Bect
et al. showed how to break this term into linear components, which we present below [9].
Inspired by relaxation techniques in linear programming, introduce a vector w representing
slack variables or weights. Our goal is to rewrite the first term in (5.8) as
(K^T K u, u) = \min_w \|u - w\|^2 + w^T A w \quad (5.9)
where A is an N × N matrix that depends on the blur operator K. The idea is to freeze the
image u, solve for w, and then update the image u. We will first discuss how to derive w
and A.
First note that the right-hand side of (5.9) can be expanded as
\|u - w\|^2 + w^T A w = (u - w, u - w) + (Aw, w)
                      = \|u\|^2 - 2(u, w) + \|w\|^2 + (Aw, w). \quad (5.10)
Differentiating with respect to w and setting equal to zero yields
−2u+ 2w + 2Aw = 0
⇒ (I +A)w = u
where I denotes the N × N identity matrix. Solving for w gives
w = (I + A)^{-1} u. \quad (5.11)
Assuming A and u are fixed, this gives the minimum of (5.9) with respect to w.
Plugging this new expression for w into (5.10) gives
\|u\|^2 - 2(u, w) + \|w\|^2 + (Aw, w)
= (u - (I+A)^{-1}u, \, u - (I+A)^{-1}u) + (A(I+A)^{-1}u, \, (I+A)^{-1}u)
= (u, u) - 2((I+A)^{-1}u, u) + ((I+A)^{-1}(I+A)^{-1}u, u) + ((I+A)^{-1}A(I+A)^{-1}u, u)
= ([I - 2(I+A)^{-1} + (I+A)^{-1}(I+A)^{-1}(I+A)]u, u)
= ([I - (I+A)^{-1}]u, u).
For (5.9) to hold, we require
([I - (I+A)^{-1}]u, u) = (K^T K u, u)
\;\Rightarrow\; I - (I+A)^{-1} = K^T K
\;\Rightarrow\; A = (I - K^T K)^{-1} - I.
Using the linear algebra identity (I - B)^{-1}B = (I - B)^{-1} - I with B = K^T K, we obtain

A = (I - K^T K)^{-1} K^T K.
However, this solution for A is not computationally feasible for most blur matrices K because
I − K^T K will be ill-conditioned. To control the condition number, introduce a parameter µ
to replace K^T K with \tfrac{1}{\mu} K^T K. Then (5.9) becomes

\mu \left( \tfrac{1}{\mu} K^T K u, u \right) = \mu \left[ \min_w \|u - w\|^2 + w^T A w \right]

and the solution for A is

A = \left( I - \tfrac{1}{\mu} K^T K \right)^{-1} \tfrac{1}{\mu} K^T K. \quad (5.12)

If we choose µ > ‖K^T K‖, then the largest eigenvalue of I − \tfrac{1}{\mu} K^T K will be guaranteed to
be less than one.
Putting together equations (5.8)-(5.12), the fidelity term of the TV energy (5.7) can be
written as
\|Ku - u_0\|^2 = \mu \|u - w\|^2 + \mu \, w^T A w - 2(u, K^T u_0) + \|u_0\|^2

where

w = (I + A)^{-1} u, \qquad A = \left( I - \tfrac{1}{\mu} K^T K \right)^{-1} \tfrac{1}{\mu} K^T K, \qquad \mu > \|K^T K\|.
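The relaxation identity (5.9), with the scaled A of (5.12), can be verified numerically on a random small K. This is a sanity-check sketch, not part of the thesis code:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
K = rng.standard_normal((N, N))    # stand-in blur matrix
u = rng.standard_normal(N)

mu = np.linalg.norm(K.T @ K, 2) + 1.0              # µ > ‖KᵀK‖ (spectral norm)
B = (K.T @ K) / mu
A = np.linalg.solve(np.eye(N) - B, B)              # A = (I - KᵀK/µ)⁻¹ KᵀK/µ
w = np.linalg.solve(np.eye(N) + A, u)              # minimizing w = (I + A)⁻¹ u

lhs = u @ (K.T @ K) @ u                            # (KᵀK u, u)
rhs = mu * (np.sum((u - w) ** 2) + w @ A @ w)      # µ [‖u - w‖² + wᵀAw]
```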
So in the flow network model, we can express the fidelity capacity of pixel x with given
fidelity weight λ as
c(x_j, x_{j+1}) = \lambda \left[ \mu (I_j - w(x))^2 + \mu (w^T A w)(x) - 2 I_j (K^T u_0)(x) + \|u_0\|^2 \right], \quad 1 \le j \le L. \quad (5.13)
The regularization capacities will remain the same as in the original model in Section 5.2.3.
By alternating the computation of the image u and the weights w, the deblurring prob-
lem can be solved by the TV graph cut method. The minimization could proceed for a fixed
number of iterations or until some stopping criterion is achieved, such as when the image
u is no longer updated. The proposed alternating minimization algorithm is summarized
below.
Quantized TV Deblurring Algorithm
Input: Blurred image u0, blur operator K, fidelity weight λ, intensity levels I1, . . . , IL.
Output: Deblurred image u ∈ {I1, . . . , IL}.
Set u = u0, µ = ‖K^T K‖ + 1.
Compute A = (I − (1/µ) K^T K)^{-1} (1/µ) K^T K.
Initialize graph with regularization capacities given by (5.4).
Repeat for a fixed number of iterations:
    Compute weights w = (I + A)^{-1} u.
    Set graph fidelity capacities by (5.13).
    Compute image u from minimum graph cut.
There are two serious drawbacks to this approach. First, for an image with N pixels the
resulting matrix A will be N ×N , which creates a great demand on memory storage even
for moderate size images. Second, the method may get stuck at a local minimum due to the
nature of alternating minimization. The computation of the image u produces the global
minimum for fixed weights w, and vice-versa. However, alternating the minimization of u
and w does not guarantee convergence to the global minimum of u,w jointly. Indeed, the
approach generally yields unsatisfactory results because it is driven toward local minima.
Both of these problems are addressed in the next section by solving the deblurring and zooming
problems simultaneously.
5.4.2 Zooming Using Local Gradient Information
Suppose the observed low-resolution image u0 was obtained from the ideal high-resolution
image u by convolving u with a blur kernel k(x) and downsampling by a factor M > 1:
u0 = k ∗ u ↓M.
Suppose furthermore that the kernel k is defined digitally over an M ×M window. Then
each pixel in u0 is a weighted sum of M2 pixels in a neighborhood in u and each pixel’s
high-resolution neighborhood does not overlap with another’s.
If the observed image u0 has N pixels, the ideal image has M2N pixels since both
dimensions of u0 are increased by a factor M . Writing the images as column vectors, the
blur matrix K will be N ×M2N . Matching the operation Ku to the convolution k ∗ u
shows that each row of the matrix K will contain at most M2 nonzero entries. The process
for writing u as a vector is up to the programmer and sometimes a proper choice will result
in easier computation. Convert u to a vector by listing the pixels in block-raster order :
the pixels are read first from within each M ×M convolution block in raster order, then
the blocks are read in raster order (see Figure 5.13). Then the resulting blur matrix K
will be sparse with a vector K ′ of length M2 down the diagonal, where K ′ corresponds to
the elements of the kernel k(x) listed in raster order. The computation of the matrix A in
equation (5.12) will also result in a block-diagonal matrix. Assuming the kernel is spatially
invariant, each block A′ of A will be the same M2 ×M2 matrix:
A' = \left( I_{M^2 \times M^2} - \tfrac{1}{\mu} (K')^T K' \right)^{-1} \tfrac{1}{\mu} (K')^T K'. \quad (5.14)
This solves the problem of storing a large matrix A; the much smaller matrix A′ only needs to
be calculated once. Subsequent calculations can be processed over non-overlapping vectors
of length M2, saving computational costs in calculating the weights w. To compute the
M2 weights wi corresponding to the jth pixel of u0, processing equation (5.11) over the jth
block gives
w_{1+(j-1)M^2 \le i \le jM^2} = \left( I_{M^2 \times M^2} + A' \right)^{-1} u_{1+(j-1)M^2 \le i \le jM^2}, \quad 1 \le j \le N. \quad (5.15)
The second problem that needs to be addressed is the tendency of the alternating min-
imization to converge to inaccurate local minima. To drive the computation towards more
appropriate images, additional information can be incorporated into the fidelity term
that connects the low-resolution observation and the high-resolution result. One possibility
is to match local gradients in the two images. The mean horizontal and vertical gradi-
ents within each M ×M convolution block should match the gradient in the corresponding
low-resolution neighborhood. For example, in Figure 5.13 the average horizontal gradients
calculated from the pixel pairs 1,2 and 3,4 would be compared to the horizontal gradient
Figure 5.13: Illustration of writing an image in block-raster order for M = 2, N = 4. The resulting matrices K and A are block-diagonal.
between the pixels a,b. The modified quantized energy becomes
\min_{u \in \{I_1, \ldots, I_L\}} E_{TV}[u \,|\, u_0] = \int_\Omega |\nabla u| \, dz + \frac{\lambda}{2} \int_\Omega (k * u(z) - u_0(z))^2 \, dz

+ \frac{\beta_1}{2} \int_\Omega \left[ \frac{1}{M^2} \sum_{p \in N(z)} \frac{\partial u}{\partial x}(p) - \frac{\partial u_0}{\partial x}(z) \right]^2 dz
+ \frac{\beta_2}{2} \int_\Omega \left[ \frac{1}{M^2} \sum_{p \in N(z)} \frac{\partial u}{\partial y}(p) - \frac{\partial u_0}{\partial y}(z) \right]^2 dz
where N(z) denotes the M ×M high-resolution neighborhood corresponding to pixel z. In
general the blurring term is more important than the gradient information, so we would
expect the weights λ ≥ β1 = β2. We found that λ = β1 = β2 worked well for natural
images.
Discretizing the partial derivatives by forward differences, the mean gradients can be
written as a weighted sum of the pixels within a block. Expressing the images as vectors in
block-raster order, the calculation of gradients can be absorbed into the block’s convolution
matrix K ′. The 1×M2 matrix will become 3×M2, with the first row listing the kernel k in
raster order and the next two rows describing the mean horizontal and vertical gradients.
Then the fidelity term can be written
\left\| K' u|_{N(z)} - \begin{pmatrix} u_0(z) \\ D_x u_0(z) \\ D_y u_0(z) \end{pmatrix} \right\|^2
where u|N(z) is the high-resolution block in u as a column vector and D denotes the central
finite difference. With this modified K ′, the calculation of the matrix A′ and weights w for
the block are still given by (5.14)-(5.15).
We illustrate this set-up for magnification M = 2 with the 2x2 blur kernel k(x) =
[kij ]1≤i,j≤2. The modified K ′ including both blur and gradient matching is
K' = \begin{pmatrix} k_{11} & k_{12} & k_{21} & k_{22} \\ -1/2 & 1/2 & -1/2 & 1/2 \\ 1/2 & 1/2 & -1/2 & -1/2 \end{pmatrix}
If u1, u2, u3, u4 denotes the 2x2 block of u corresponding to pixel z of u0, the above matrix
gives the desired fidelity term at z:
\left\| K' \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} - \begin{pmatrix} u_0(z) \\ D_x u_0(z) \\ D_y u_0(z) \end{pmatrix} \right\|^2
= (k_{11} u_1 + k_{12} u_2 + k_{21} u_3 + k_{22} u_4 - u_0(z))^2
+ \left( \tfrac{u_2 - u_1}{2} + \tfrac{u_4 - u_3}{2} - D_x u_0(z) \right)^2
+ \left( \tfrac{u_1 - u_3}{2} + \tfrac{u_2 - u_4}{2} - D_y u_0(z) \right)^2
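The expansion can be checked numerically for the averaging kernel k_ij = 1/4. The block values and targets below are made up for illustration, and the gradient rows of K′ are written with signs matching the expanded expression:

```python
import numpy as np

Kp = np.array([
    [0.25, 0.25, 0.25, 0.25],   # blur row: 2x2 averaging kernel, raster order
    [-0.5,  0.5, -0.5,  0.5],   # mean horizontal difference
    [ 0.5,  0.5, -0.5, -0.5],   # mean vertical difference
])
u_blk = np.array([1.0, 2.0, 3.0, 5.0])    # u1..u4 in block-raster order
target = np.array([2.0, 0.5, -1.0])       # u0(z), Dx u0(z), Dy u0(z)

fidelity = np.sum((Kp @ u_blk - target) ** 2)

# term-by-term check against the expanded expression
blur = (0.25 * (1 + 2 + 3 + 5) - 2.0) ** 2
dx = ((2 - 1) / 2 + (5 - 3) / 2 - 0.5) ** 2
dy = ((1 - 3) / 2 + (2 - 5) / 2 - (-1.0)) ** 2
```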
Figure 5.14: The binary 0-1 image at left is convolved with a 2x2 averaging kernel K anddownsampled by factor 2 to produce the grayscale image at right.
Figure 5.15: Results of 2x zoom by different methods. Top row: original image, bicubiczoom, TV filter zoom. Bottom row: quantized TV inpainting, quantized TV zooming byrelaxation, quantized TV zooming using local gradients.
For magnification M = 2, we will generally assume the kernel is a 2x2 averaging kernel
k_{ij} = 1/4, the only isotropic 2x2 kernel with unit volume. Figure 5.14 shows a simple 8x8
binary image that has been convolved with the 2x2 averaging kernel and downsampled.
The result is a 4x4 grayscale image because of the blur kernel, even though the final image
does not appear blurred. Recovering the original shape is a deceptively simple problem. A
good reconstruction should take into account the blur kernel and the quantized nature of
the original image. As shown in Figure 5.15, continuous-valued bicubic and TV zooming
produce blurred grayscale images. Quantized zooming by inpainting, as in Section 5.3.5,
produces isolated white pixels. Quantized zooming incorporating the blur kernel gets stuck
at a local minimum, but adding the gradient information produces the correct diagonal in
the right corner. Unfortunately, the gradient information also tends to round off corners,
as shown in the top left corner of the last image. The best method incorporates three
separate pieces of information: quantization, blurring, and local gradients. The results on
this simple shape suggest that the method outlined in this section should yield positive
results for blocky binary-valued images such as barcodes and text.
The zooming method also produces favorable results on natural images, even when the
true blur kernel in the camera model is unknown. Figure 5.16 shows 2x zoom on the
original cameraman image, which was not synthetically blurred or downsampled. Note that
quantized inpainting shows isolated pixels along the handle of the camera and the face is
largely blurred out. Quantized zooming using gradients produces a strong diagonal along
the handle and the facial features appear more distinct, a difficult task because the image
is only binary-valued. Given the problems with the deconvolution method described in
the last section, it appears that quantized deblurring is best achieved by simultaneously
increasing the resolution of the image.
Figure 5.16: Quantized TV zooming on cameraman image. Left: original image. Center: 2x zoom by quantized TV inpainting with L = 2, λ = 1. Right: 2x zoom with 2x2 averaging kernel and local gradients.
5.5 Extensions of the Quantized TV Model
5.5.1 Determining Intensity Levels
The standard approach for determining levels for quantization and compression is to match
the intensity histogram to a probability distribution [55]. The approach used in the previous
sections assumes the histogram is uniformly distributed, generally not a practical assump-
tion. The gray values can be iteratively updated by recalculating intensities for a given
quantized image. For a fixed quantized image u : Ω → {I1, . . . , IL}, the fidelity term of the
TV energy (5.1) is minimized by updating the intensity levels to the mean gray value:
I_j^{new} = \frac{\int_\Omega \mathbf{1}_{u(x) = I_j}(x) \, u_0(x) \, dx}{\int_\Omega \mathbf{1}_{u(x) = I_j}(x) \, dx}.
We propose an alternating minimization strategy in which a quantized image is calculated,
the intensity levels are updated, and then the image is recomputed under the new intensities.
The iteration continues until the intensities are no longer updated. A similar approach
has been suggested for the binary Mumford-Shah segmentation model [90]. Experiments on
natural images suggest this produces a better quantization than the uniform level assign-
ment (see Figure 5.17). The result is sensitive to the initial assignment of levels. A poor
initialization could lead to levels disappearing: the number of distinct levels in the final
image is less than the desired number.
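The alternating strategy can be sketched as follows. This is a minimal illustration rather than the thesis code: the quantized TV solve is replaced by plain nearest-level assignment (the λ → ∞ limit, which reduces the iteration to 1-D k-means on gray values), and the function name is an assumption.

```python
import numpy as np

def alternate_levels(u0, init_levels, max_iter=50):
    """Alternate between quantizing the image and updating the levels.
    Here the quantization step is nearest-level assignment (a stand-in
    for the quantized TV solve); the update step sets each level to the
    mean gray value of its region, the minimizer of the fidelity term."""
    levels = np.asarray(init_levels, dtype=float).copy()
    labels = np.zeros(np.shape(u0), dtype=int)
    for _ in range(max_iter):
        # Quantization: assign each pixel to its nearest intensity level.
        labels = np.argmin(np.abs(u0[..., None] - levels), axis=-1)
        new_levels = levels.copy()
        for j in range(len(levels)):
            region = u0[labels == j]
            if region.size:        # an empty region means the level disappears
                new_levels[j] = region.mean()
        if np.allclose(new_levels, levels):
            break                  # levels no longer updated: converged
        levels = new_levels
    return levels, labels
```

A poor choice of init_levels can still leave a level with an empty region, mirroring the level-disappearance behavior noted above.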
Figure 5.17: Iterating on intensity levels for quantized TV minimization with λ = 1, L = 3.
5.5.2 The TV-L1 Norm
The standard Rudin-Osher-Fatemi model gives strong preference to preserving high-contrast
features. To give stronger emphasis to geometric rather than contrast features, Chan and
Esedoglu suggested reducing the exponent on the fidelity term [28]. The quantized TV-L1
model is:
min_{u ∈ {I1, . . . , IL}}  ETV[u | u0] = ∫Ω |∇u| dx + (λ/2) ∫Ω |u − u0| dx.
Chan and Esedoglu argued that the L1 norm is particularly well-suited for images quantized
to few levels. In particular they showed that TV-L1 minimization will perfectly recover a
binary image when u0 is binary, which is not true for the classical L2 norm.
Computing the minimum of this energy with a classical gradient-based approach is
difficult because the fidelity term is no longer differentiable. Chan and Esedoglu proposed a
gradient descent that requires approximating the derivative of the L1 norm by introducing
a lifting parameter. This energy is more easily minimized in the graph cut method by
changing the fidelity capacities to
c(xj, xj+1) = λ |Ij − u0(x)|,   1 ≤ j ≤ L,  x ∈ Ω.
The graph cut method will compute the global minimum of the quantized TV-L1 energy,
an improvement over the gradient-based approximation method. Unlike the classical TV
model, the TV-L1 energy is not strictly convex so the global minimum may be non-unique.
Figure 5.18: TV minimization with L = 6 levels under L1 and L2 fidelity constraints. Top row: TV-L2 minimization removes low-contrast features as λ decreases. Bottom row: TV-L1 minimization removes finer-scale geometric features as λ decreases.
Darbon showed that TV-L1 minimization is a contrast invariant filter. That is, if u(x)
is a minimizer for the observed image u0(x), then cu(x) is the minimizer for the image
cu0(x). Darbon suggested a level set method similar in nature to the graph cut method for
computing the global minimum of the quantized TV-L1 energy [38].
Figure 5.18 shows a simple experiment on minimizing the classical TV and the TV-L1
energies on an image with squares of varying contrast and size. Under both norms, the
result approaches a constant image as λ → 0. Under the classical L2 norm the squares
with low contrast on the left side of the image disappear as λ gets smaller, with the smaller
squares disappearing first. Under the L1 norm the squares disappear based on size, with the
3 squares of the same size vanishing as a group. This suggests that the value of λ is inversely
proportional to the size of features that are preserved. Chan and Esedoglu suggested that
the TV-L1 norm gives rise to a scale-space in which geometric features of a specific size
disappear at critical values of λ [28]. Note that under the graph cut minimization the fidelity
term is easily modified to any Lp norm, with large values of p placing more emphasis on
contrast and small p emphasizing geometry.
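As a sketch of this Lp family, the fidelity capacities can be tabulated directly for a grayscale image. The function name and the (level, row, column) array layout are illustrative choices, not from the thesis.

```python
import numpy as np

def fidelity_capacities(u0, levels, lam, p=1):
    """Capacities on the level edges of the graph,
    c(x_j, x_{j+1}) = lam * |I_j - u0(x)|**p for 1 <= j <= L.
    p = 1 gives the quantized TV-L1 model, p = 2 the classical fidelity;
    large p emphasizes contrast, small p emphasizes geometry."""
    levels = np.asarray(levels, dtype=float)
    # Result has shape (L, H, W): one capacity per level edge per pixel.
    return lam * np.abs(levels[:, None, None] - np.asarray(u0)[None, :, :]) ** p
```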
5.5.3 The 8-connected Topology
The anisotropic TV norm gives preference to edges parallel to the axes, resulting in rect-
angular images with sharp corners. Diagonal edges are “staircased” into square blocks. At
each interior pixel, the regularization term compares the pixel's value to the values of the
4 neighbors at distance one (the “cross” topology). The regularization weights can be made
more rotationally invariant by incorporating the diagonally connected neighbors at distance
√2. For a pixel x ∈ Ω and a diagonally connected neighbor y, define the regularization
capacity to be

c(xj, yj) = (Ij − Ij−1) / √2,   2 ≤ j ≤ L.
The regularization is still not truly isotropic, but the 8-connected topology will be less likely
to staircase diagonal edges.
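The 8-connected regularization weights can be tabulated as follows; the function name and offset-keyed layout are assumptions for illustration. Axis-aligned edges keep the capacity Ij − Ij−1 of the 4-connected construction, and the diagonal edges are scaled by 1/√2.

```python
import math

def regularization_weights(levels):
    """Regularization capacities c(x_j, y_j), 2 <= j <= L, keyed by
    neighbor offset (dy, dx). Only two of the four diagonal offsets are
    stored because the regularization edges are symmetric. Diagonal
    edges are scaled by 1/sqrt(2) for the longer inter-pixel distance."""
    offsets = {
        (0, 1): 1.0, (1, 0): 1.0,                        # 4-connected "cross"
        (1, 1): 1.0 / math.sqrt(2), (1, -1): 1.0 / math.sqrt(2),
    }
    return {
        off: [scale * (levels[j] - levels[j - 1]) for j in range(1, len(levels))]
        for off, scale in offsets.items()
    }
```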
For most images, the difference between the minimization under the 4- or 8-connected
topologies will be very small. The difference becomes apparent for image inpainting, as
in Figure 5.19. Quantized TV minimization under the 8-connected topology more closely
resembles the continuous-valued minimization of the isotropic TV norm. Under the 4-
connected topology, there are more geometric configurations with the same global minimum
energy for inpainting this domain. In this sense, the 8-connected minimization is less prone to non-uniqueness.
Figure 5.19: TV inpainting under the 8-connected topology. The inpainting domain is shown in red in the first image. Left to right: Original image, TV filter, 4-connected quantized TV, 8-connected quantized TV.
5.5.4 3-D Image Processing
The TV graph cut method extends naturally to 3D volumes by including regularization
links along the third z dimension. In addition to the 4 standard neighbors within each 2D
slice (up, down, left, right), add 2 edges moving forward and backward between the slices.
Because more links are added to each pixel, the value of λ should be smaller than in the
2D model to preserve the balance between the regularization and fidelity terms. Depending
on the application, the fidelity weights can be set to be anisotropic. For example, in a
video sequence of a fast-moving object the weights along the z dimension should be low.
Conversely in a video or volume where there is little change between slices but each image
slice contains fine structures, the fidelity weight should be higher in the z component than
along other directions. As in the last section, it is also possible to incorporate diagonal
elements in three dimensions, giving rise to a 24-connected topology.
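The 6-connected 3D topology described above can be written down as an offset table; the per-direction weights on these links and the function name are illustrative assumptions.

```python
def neighbor_offsets_3d(wx=1.0, wy=1.0, wz=1.0):
    """Offsets (dz, dy, dx) and weights for the 6-connected 3D topology:
    the 4 standard in-slice neighbors plus one edge forward and one
    backward between slices. A small wz suits video of a fast-moving
    object; a larger wz suits volumes that change little between slices."""
    return [
        ((0, 0, 1), wx), ((0, 0, -1), wx),   # left/right within a slice
        ((0, 1, 0), wy), ((0, -1, 0), wy),   # up/down within a slice
        ((1, 0, 0), wz), ((-1, 0, 0), wz),   # forward/backward between slices
    ]
```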
Figure 5.20: 3D quantized TV denoising of simple volumes with λ = 0.005, L = 2. The middle image slice is shown for comparison. Top row: 10x10x10 cube. Bottom row: Sphere of radius 8.
5.6 Applications of Binary TV Minimization
5.6.1 Barcode Image Processing
The ubiquitous barcode is a series of black and white stripes encoding information in the relative
widths of the bars. Although the most common barcode scanners read a signal with laser
optics, barcode images are also decoded with digital cameras to allow greater flexibility
in reading both linear and two-dimensional barcode symbologies. The ideal barcode for
decoding should be binary-valued, but the observed image is generally a grayscale image
corrupted by camera blur, hand jitter, electrical noise, speckle noise, and defects in the
original material such as stray marks on the paper.
The classical TV model has been shown to be effective for denoising and deblurring 1D
bilevel signals [45, 100]. But even after adding a penalty term to force black and white
values, the resulting signal is not strictly binary-valued. The signal could be thresholded
before being sent to the decoder, but this could introduce errors. The quantized TV model
solves this problem, while also smoothing the image to remove blur and noise. The CPU of
a typical barcode scanner has very limited memory and computing power, but in practice
the decoding process must be very fast. Barcode manufacturers generally require a runtime
less than 100 milliseconds with operations involving only integer arithmetic. Classical TV
minimization, such as gradient descent, is generally too slow and may have convergence
issues. The graph cut method can be implemented in polynomial time while using only
integer-valued variables.
In a barcode image with parallel vertical bars, one would expect the variation along the
y-direction to be very small. But the variation along the x-direction would be very large,
especially if the image is low-resolution and consists of very thin bars. This suggests making
the regularization weights anisotropic:
λ|∇u|1 = λx|ux| + λy|uy|.
If the bars are perfectly vertical, the value of λy would be very large and λx = 0. If the
orientation of the barcode is not vertical, the image could be rotated or the derivative uy
could be modified to trace tangent to the bars. It is safe to assume the orientation angle
is known because in practice the barcode orientation is the first characteristic of the image
that is identified by decoding software.
Figure 5.21 shows a UPC barcode synthetically distorted with both Gaussian blurring
and Gaussian additive noise. Thresholding the image at the median intensity value does not
recover well-defined bars. Quantized TV minimization with isotropic regularization weights
forms rectangles, but it omits the two thin bars at positions 160 and 190. Setting λx = 0
allows for greater variation along the x-direction and all bars are recovered.
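The anisotropic regularization term can be evaluated with forward differences; this is a small sketch with a hypothetical function name, not the graph cut solver itself.

```python
import numpy as np

def anisotropic_tv(u, lam_x, lam_y):
    """Anisotropic regularization lam_x*|u_x| + lam_y*|u_y| using forward
    differences. For an image of vertical bars, lam_x = 0 leaves the
    sharp transitions across the bars unpenalized while lam_y still
    smooths along the bars."""
    ux = np.diff(u, axis=1)   # horizontal (across-bar) differences
    uy = np.diff(u, axis=0)   # vertical (along-bar) differences
    return lam_x * np.abs(ux).sum() + lam_y * np.abs(uy).sum()
```

On a perfect two-bar image the energy with λx = 0 is zero, so vertical bars of any width incur no penalty.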
Stray marks or damaged regions of the barcode may render it undecodable if no line
remains that gives the proper signal. This is a common problem in the shipping industry
because routing directions are stamped onto the package, sometimes mistakenly across the
Figure 5.21: Quantized TV denoising of UPC barcode with additive Gaussian noise and Gaussian blur. Top: original image. 2nd row: thresholding at median intensity. 3rd row: quantized TV denoising with λ = 0.005, L = 2. Bottom: quantized TV denoising with anisotropic weights λy = 0.005, λx = 0, L = 2.
barcode label. Figure 5.22 shows the result of quantized TV inpainting to fill in the damaged
regions. Setting anisotropic regularization weights with λx = 0 gives a better center portion
of the image. Note that in the last image the black bars are extended too far on both the
top and bottom within damaged regions. This would not adversely affect the decoding
since only the middle portion of the barcode is sent to the decoder. In this example the
inpainting domain D was known, but theoretically the damaged areas could be determined
by calculating regions that do not match the local orientation of the barcode [5].
The traditional approach to barcode imaging is to repeatedly run rows of the image,
called scanlines, through the decoder until a signal decodes. The graph cut method of
course works on 1D signals as well as 2D images, but using the 2D information could allow
Figure 5.22: Quantized TV inpainting of damaged barcode. Left: original image with damaged area shown in red. Center: TV inpainting with λ = 0.1, L = 2. Right: TV inpainting with anisotropic weights λy = 0.1, λx = 0, L = 2.
Figure 5.23: Quantized TV denoising of a barcode projected signal with λ = 10, L = 2.
for the creation of a better single scanline. In Chapter 4, we described how to form a
high-resolution signal by projecting multiple scanlines onto the same axis. The resulting
projected signal is very noisy, but after smoothing the result is potentially better than any
scanline available from the original image. Figure 5.23 shows the result of quantized TV
minimization on such a projection signal.
5.6.2 Enhancement for Text Recognition
Text images are very sensitive to changes in image size, a phenomenon familiar to academics
in preparing figures for reports. The underlying text should be binary-valued, but an im-
age corrupted by camera blur, compression artifacts, and poor interpolation will appear
grayscale. Recovering the black-and-white text is crucial for automatic text recognition.
Most optical character recognition (OCR) systems require a strictly binary image before
decoding begins. Developing binarization algorithms specifically for text is an active re-
search area for the OCR community.
A common binarization strategy is to define a local threshold T (i, j) at each pixel (i, j) ∈
Ω. The binary image u is then
u(i, j) = 0 if u0(i, j) ≤ T(i, j),  and  u(i, j) = 1 if u0(i, j) > T(i, j).
Niblack suggested calculating the local threshold T (i, j) using the local mean µ and standard
deviation σ of the gray values in the b× b window centered over pixel (i, j) [77]. For a fixed
parameter k and odd integer b specifying the window size, Niblack’s method is given by
T(i, j) = µb×b(i, j) + k σb×b(i, j).
Based on numerical evidence, Trier and Jain suggested the optimal values for 8-bit text
images are k = −0.2 and b = 15 [94]. Later, Sauvola and Pietaksinen suggested the
following modification to Niblack’s method
T(i, j) = µb×b(i, j) [1 + k (σb×b(i, j)/R − 1)]
where R is a fixed parameter. The authors suggest the values k = 0.5, R = 128, and b = 15
[87]. Two independent surveys of document binarization techniques concluded that the
modified Niblack method is the best strategy for preparing text for OCR systems [88, 95].
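Both local thresholds are easy to state in code. The sketch below computes the local statistics with a padded sliding window; the helper names are assumptions, and the parameter defaults follow the values quoted above (k = −0.2, b = 15 for Niblack; k = 0.5, R = 128, b = 15 for the modification).

```python
import numpy as np

def local_stats(u0, b):
    """Mean and standard deviation over the b x b window centered at each
    pixel (b odd), computed via an edge-padded sliding window."""
    pad = b // 2
    up = np.pad(np.asarray(u0, dtype=float), pad, mode="edge")
    wins = np.lib.stride_tricks.sliding_window_view(up, (b, b))
    return wins.mean(axis=(-2, -1)), wins.std(axis=(-2, -1))

def niblack(u0, b=15, k=-0.2):
    """Niblack's method: threshold T = mu + k*sigma per pixel."""
    mu, sigma = local_stats(u0, b)
    return (u0 > mu + k * sigma).astype(np.uint8)

def sauvola(u0, b=15, k=0.5, R=128.0):
    """Modified Niblack (Sauvola): T = mu * (1 + k*(sigma/R - 1))."""
    mu, sigma = local_stats(u0, b)
    return (u0 > mu * (1 + k * (sigma / R - 1))).astype(np.uint8)
```

On a tiny test image with a single bright pixel, both methods keep only that pixel white, illustrating the sensitivity of local thresholds to small gray-value variations noted below.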
The current state of the art in OCR can detect text as small as 5 pixels high and properly
decode text 7 pixels high. Most OCR software requires larger images, so the text needs to be
zoomed as well as binarized. The most common strategy for OCR software is to interpolate
using bicubic zooming, followed by binarization using Niblack’s original method [21]. Using
the quantized TV energy, the zooming and binarization processes can be combined into one
step, while also deblurring the given image. This can produce large binary text images that
are more pleasing visually. But it is unclear if this will improve OCR performance, as the
existing systems are built around the bicubic-Niblack combination.
Figure 5.24 compares TV quantization of a text image to Niblack’s method and its
modification. Note that the local thresholding methods place black dots in clearly white
regions, because even small variations in the gray values result in thresholding to the larger
binary value. Using the image min and max for intensity levels, the first iteration of TV
minimization produces an unacceptable image. Updating the intensity values as in Section
5.5.1 converges to a much better image in 8 iterations. The original image was 8 pixels
high, so the bicubic zooming was possibly unnecessary. Niblack’s method actually performs
better on the original image than on the zoomed image. Figure 5.25 shows the result on a
smaller 6 pixel high image, where zooming is probably necessary. The TV result is not as
clean as before, but it picks up some features better than the local thresholding methods.
Notably, the dot in the “i” is more distinct in the TV image.
Figure 5.24: Quantized TV zooming of large text. Top row: original image, 2x bicubic zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized 2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix.
Figure 5.25: Quantized TV zooming of small text. Top row: original image, 2x bicubic zoom, bicubic zoom followed by Niblack’s method. Bottom row: bicubic zoom followed by modified Niblack’s method, iterations 1 and 8 of quantized 2x TV zooming with λ = 0.1, L = 2. Assumes kernel is 2x2 averaging matrix.
5.6.3 Medical Image Segmentation
Quantized segmentation in medical imaging is helpful in clearly identifying different bio-
logical tissues, defining the different regions by intensity level. This allows for automated
quantitative shape analysis, e.g. tracking the volume of gray matter in a brain or the
geometry of a tumor [6]. Figure 5.26 shows binary TV segmentation of Computerized To-
mography (CT) and Magnetic Resonance (MR) brain images. In each image, 3 seed pixels
were selected to identify the background, dark tissue, and light tissue. In the CT image, the
white region indicates bone. In the MR image, the white region indicates fat tissue (lipids).
The TV results are very good, but this is partly because the images are ideal. Both images
were provided by the Visible Human Project, so the images were of high quality and already
smoothed by image processing algorithms.
Figure 5.26: Quantized TV segmentation of ideal brain images with λ = 0.1, L = 3. Left 2: CT image. Right 2: MR image.
An actual acquired MR image is corrupted by noise, contains textured regions, and
generally has low, spatially varying contrast. Figure 5.27 shows how this change
in contrast complicates the segmentation process. Attempting to segment the entire image
results in a large black region in the center corresponding to the low-contrast region in the
original image. One solution is to segment the image in blocks, adjusting the quantization
levels for the contrast within each block. Another possible solution is to equalize the image
contrast, such as using the MR super-resolution presented in Chapter 4.
Figure 5.27: Quantized TV segmentation of low-contrast MR brain image. Left 2: Segmentation of entire brain with λ = 50, L = 2. Right 2: Segmentation of region indicated in first image with λ = 200, L = 2.
Bibliography
[1] A. Almansa, V. Caselles, G. Haro, and B. Rouge. “Restoration and zoom of irregularly
sampled, blurred and noisy images by accurate total variation minimization with local
constraints.” Multiscale Model. Simul., 5: 235-272, 2006.
[2] L. Alvarez, F. Guichard, P.L. Lions, and J.M. Morel. “Axioms and fundamental equa-
tions of image processing.” Arch. Rational Mech. Anal., 123: 199-257, 1993.
[3] L. Ambrosio. “A compactness theorem for a new class of functions of bounded varia-
tion.” Boll. Un. Mat. Ital., 3: 857-881, 1989.
[4] L. Ambrosio and V.M. Tortorelli. “Approximation of functionals depending on jumps
by elliptic functional via Γ-convergence.” Comm. Pure Appl. Math., 43: 999-1036,
1990.
[5] S. Ando and H. Hontani. “Automatic visual searching and reading of barcodes in 3-D
scene.” Proc. IEEE Vehicle Electronics Conf., p. 49-54, 2001.
[6] S. Angenent, E. Pichon, and A. Tannenbaum. “Mathematical methods in medical image
processing.” Bulletin of the American Mathematical Society, 43(3): 365-396, 2006.
[7] G. Aubert and P. Kornprobst. Mathematical Problems in Image Processing. Springer-
Verlag, New York, 2001.
[8] S. Baker and T. Kanade. “Limits on super-resolution and how to break them.” IEEE
Trans. Pattern Analysis and Machine Intelligence, 24: 1167-1183, 2002.
[9] J. Bect, L. Blanc-Feraud, G. Aubert, and A. Chambolle. “A l1-unified variational
framework for image restoration.” Proc. Euro. Conf. on Computer Vision, Springer-
Verlag LNCS 3024: 1-13, 2004.
[10] A. Belahmidi. PDEs Applied to Image Restoration and Image Zooming. PhD thesis,
Universite de Paris XI Dauphine, 2003.
[11] A. Belahmidi and F. Guichard. “A partial differential equation approach to image
zoom.” Proc. Int. Conf. on Image Processing, 2004.
[12] M. Bertalmio, A. Bertozzi, and G. Sapiro. “Navier-Stokes, fluid dynamics, and image
and video inpainting.” Proc. IEEE Conf. on Computer Vision and Pattern Recognition,
p. 355-362, 2001.
[13] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher. “Simultaneous structure and texture
inpainting.” Proc. IEEE Conf. Computer Vision and Pattern Recognition p. 707-720,
2003.
[14] A. Bertozzi, S. Esedoglu, and A. Gillette. “Inpainting of binary images using the Cahn-
Hilliard equation.” IEEE Trans. Image Processing, to appear.
[15] Y. Boykov and M.-P. Jolly. “Interactive graph cuts for optimal boundary and region
segmentation of objects in N-D images.” Proc. Int. Conf. Computer Vision, p. 105-112,
2001.
[16] Y. Boykov and V. Kolmogorov. “An experimental comparison of min-cut/max-flow
algorithms for energy minimization in vision.” IEEE Trans. Pattern Anal. and Machine
Intelligence, 26: 1124-1137, 2004.
[17] Y. Boykov, O. Veksler, and R. Zabih. “Fast approximate energy minimization via graph
cuts.” IEEE Trans. Pattern Anal. and Machine Intelligence, 23: 1222-1239, 2001.
[18] A. Braides. Γ-convergence for Beginners. Oxford Lecture Series in Mathematics, No.
22, 2002.
[19] A. Buades, B. Coll, and J.M. Morel. “A review of image denoising methods, with a
new one.” Multiscale Model. Simul., 4: 490-530, 2005.
[20] A. Buades, B. Coll, and J.M. Morel. “The staircasing effect in neighborhood filters and
its solution.” IEEE Trans. Image Processing, 15: 1499-1505, 2006.
[21] D. Capel and A. Zisserman. “Super-resolution of text image sequences.” Proc. Int.
Conf. on Pattern Recognition, 2000.
[22] D. Capel and A. Zisserman. “Computer vision applied to super resolution.” IEEE
Signal Processing Mag., 2003.
[23] K. Carey, D. Chuang, and S. Hemami. “Regularity-preserving image interpolation.”
IEEE Trans. Image Processing, 8: 1293-1297, 1999.
[24] V. Caselles, J.M. Morel, and C. Sbert. “An axiomatic approach to image interpolation.”
IEEE Trans. Image Processing, 7: 376-386, 1998.
[25] Y. Cha and S. Kim. “Edge-forming methods for image zooming.” Proc. IEEE Conf.
Computer Vision and Pattern Recognition, p. 275-282, 2004.
[26] Y. Cha and S. Kim. “Edge-forming methods for color image zooming.” IEEE Trans.
Image Processing, 15: 2315-2323, 2006.
[27] A. Chambolle. “Total variation minimization and a class of binary MRF models.”
Proc. Int. Workshop on Energy Minimization Methods in Computer Vision and Pattern
Recognition, p. 136-152, 2005.
[28] T.F. Chan and S. Esedoglu. “Aspects of total variation regularized L1 function ap-
proximation.” SIAM J. Appl. Math., 65: 1817-1837, 2005.
[29] T.F. Chan, S. Esedoglu, and M. Nikolova. “Algorithms for finding global minimizers
of image segmentation and denoising models.” SIAM J. Appl. Math., to appear.
[30] T.F. Chan and S.H. Kang. “An error analysis on image inpainting problems.” J. Math.
Imaging and Vision, to appear.
[31] T.F. Chan and J. Shen. “Mathematical models for local nontexture inpainting.” SIAM
J. Appl. Math., 62: 1019-1043, 2002.
[32] T.F. Chan and J. Shen. Image Processing and Analysis: Variational, PDE, Wavelet,
and Stochastic Methods. SIAM Press, Philadelphia, PA, 2005.
[33] T.F. Chan, S. Osher, and J. Shen. “The digital TV filter and nonlinear denoising.”
IEEE Trans. Image Processing, 10: 231-241, 2001.
[34] T.F. Chan and C.K. Wong. “Total variation blind deconvolution.” IEEE Trans. Image
Processing, 7: 370-375, 1998.
[35] H. Chang, D.-Y. Yeung, and Y. Xiong. “Super-resolution through neighbor embed-
ding.” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, p. 275-282,
2004.
[36] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cam-
bridge, MA, 1995.
[37] A. Criminisi, P. Perez, and K. Toyama. “Object removal by exemplar-based inpainting.”
Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2: 721-728, 2003.
[38] J. Darbon. “Total variation minimization with L1 data fidelity as a contrast invariant
filter.” Proc. Int. Symp. Image and Signal Processing and Anal., 2005.
[39] J. Darbon and M. Sigelle. “A fast and exact algorithm for total variation minimization.”
Proc. Iberian Conf. Pattern Recognition and Image Anal., p. 351-359, 2005.
[40] P. Davis. “Mathematics and Imaging.” Mathematical Awareness Week Theme Essay,
1998. Online at http://www.mathaware.org/mam/98/articles/theme.essay.html.
[41] I. Daubechies. Ten Lectures on Wavelets. SIAM Press, Philadelphia, PA, 1992.
[42] D. Dobson and F. Santosa. “Recovery of blocky images from noisy and blurred data.”
SIAM J. Appl. Math., 56: 1181-1198, 1996.
[43] J. Edmonds and R. Karp. “Theoretical improvements in the algorithmic efficiency for
network flow problems.” J. of the ACM, 19: 248-264, 1972.
[44] A.A. Efros and T.K. Leung. “Texture synthesis by non-parametric sampling.” Proc.
IEEE Int. Conf. on Computer Vision, p. 1033-1038, 1999.
[45] S. Esedoglu. “Blind deconvolution of barcode signals.” Inverse Problems, 20: 121-135,
2004.
[46] S. Esedoglu and J. Shen. “Digital inpainting based on the Mumford-Shah-Euler image
model.” European J. Appl. Math., 13: 353-370, 2002.
[47] L. Evans. Partial Differential Equations. AMS Press, Providence, RI, 2000.
[48] S. Farsiu, M. Elad, and P. Milanfar. “Advances and challenges in super-resolution.”
Int. J. Imaging Systems Technology, 14: 47-57, 2004.
[49] S. Farsiu, M. Elad, and P. Milanfar. “Multi-Frame demosaicing and super-resolution
of color images.” IEEE Trans. Image Processing, 15: 141-159, 2006.
[50] L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, Prince-
ton, NJ, 1962.
[51] W. Freeman, T. Jones, and E. Pasztor. “Example-based super-resolution.” MERL
Technical Report, TR 2001-30, 2001.
[52] G. Gilboa, N. Sochen, and Y.Y. Zeevi. “Texture preserving variational denoising using
an adaptive fidelity term.” Proc. Conf. on Geometric and Level Set Methods, p. 137-
144, 2003.
[53] D. Gleich. Matlab Boost Graph Library. Software online at
www.stanford.edu/~dgleich/programs/matlab_bgl/
[54] A. Goldberg and R. Tarjan. “A new approach to the maximum flow problem.” Proc.
18th Annual ACM Sym. on Theory of Computing, p. 136-146, 1986.
[55] R. Gonzalez, R. Woods, and S. Eddins. Digital Image Processing Using Matlab. Pearson
Prentice Hall, Upper Saddle River, NJ, 2004.
[56] U. Grenander. “Toward a theory of natural scenes.” Brown Technical Report, 2003.
[57] F. Guichard and J.M. Morel. Image Analysis and PDE’s. IPAM GBM Tutorial, March
2001.
[58] R. Hardie, K. Barnard, and E. Armstrong. “Joint MAP registration and high-resolution
image estimation using a sequence of undersampled images.” IEEE Trans. Image Pro-
cessing, 6: 1621-1633, 1997.
[59] H. He and L. Kondi. “An image super-resolution algorithm for different error levels per
frame.” IEEE Trans. Image Processing, 15: 592-603, 2006.
[60] T. Huang and R. Tsai. “Multi-frame image restoration and registration.” Adv. Com-
puter Vision and Image Processing, 1: 317-339, 1984.
[61] M. Irani and S. Peleg. “Improving resolution by image registration.” Graphical Models
and Image Processing, 53: 231-239, 1991.
[62] E. Jonsson, S. Huang, and T. Chan. “Total Variation Regularization in Positron Emis-
sion Tomography.” UCLA CAM Report, 98-48, 1998.
[63] B. Julesz. “Textons, the elements of texture perception and their interactions.” Nature,
290, 1981.
[64] R. Keys. “Cubic convolution interpolation for digital image processing.” IEEE Trans.
Acoustic, Speech, and Signal Processing, 29: 1153-1160, 1981.
[65] S. Kindermann, S. Osher, and P. Jones. “Deblurring and denoising of images by non-
local functionals.” Multiscale Model. Simul., 4: 1091-1115, 2005.
[66] V. Kolmogorov and R. Zabih. “What energy functions can be minimized via graph
cuts?” IEEE Trans. Pattern Anal. and Machine Intelligence, 26: 147-159, 2004.
[67] E. Larsson, D. Erdogmus, R. Yan, J. Principe, and J. Fitzsimmons. “SNR optimality
of sum-of-squares reconstruction for phased-array magnetic resonance imaging.” J. of
Magnetic Resonance, 163: 121-123, 2003.
[68] J. Lie, M. Lysaker, and X.C. Tai. “A binary level set model and some applications to
Mumford-Shah image segmentation.” IEEE Trans. Image Processing, 15: 1171-1181,
2006.
[69] Z. Lin and H.-Y. Shum. “Fundamental limits of reconstruction based superresolution
algorithms under local translation.” IEEE Trans. Pattern Anal. and Machine Intelli-
gence, 26: 1-15, 2004.
[70] H. Lombaert, Y. Sun, L. Grady, and C. Xu. “A multilevel banded graph cuts method
for fast image segmentation.” Proc. IEEE Conf. on Computer Vision, p. 259-265, 2005.
[71] F. Malgouyres. Increase in the resolution of digital images: Variational theory and
applications. PhD thesis, Ecole Normale Superieure de Cachan, 2000.
[72] F. Malgouyres and F. Guichard. “Edge direction preserving image zooming: A math-
ematical and numerical analysis.” SIAM J. Numer. Anal., 39: 1-37, 2001.
[73] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, New York, 1998.
[74] D. Mumford. “The Bayesian rationale for energy functionals.” In Geometry Driven
Diffusion in Computer Vision, Kluwer Academic, p. 141-153, 1994.
[75] D. Mumford and J. Shah. “Optimal approximations by piecewise smooth functions and
associated variational problems.” Comm. Pure Appl. Math., 42: 577-685, 1989.
[76] N. Nguyen and P. Milanfar. “A wavelet-based interpolation-restoration method for
super-resolution.” Circuits, Systems, and Signal Processing, 19: 321-338, 2000.
[77] W. Niblack. An Introduction to Digital Image Processing. Prentice Hall, Upper Saddle
River, NJ, 1986.
[78] M. Nikolova. “Estimation of binary images by minimizing convex criteria.” Proc. Int.
Conf. Image Processing, p. 108-112, 1998.
[79] S. Osher and J.A. Sethian. “Fronts propagating with curvature-dependent speed: Al-
gorithms based on Hamilton-Jacobi formulations.” J. Comput. Physics, 79: 12-49,
1988.
[80] P. Perona and J. Malik. “A scale space and edge detection using anisotropic diffusion.”
Proc. IEEE Workshop on Computer Vision, p. 16-22, 1987.
[81] A. Raj and R. Zabih. “A graph cut algorithm for generalized image deconvolution.”
Proc. IEEE Int. Conf. on Computer Vision, p. 1-7, 2005.
[82] D. Robinson and P. Milanfar. “Statistical Performance Analysis of Super-Resolution.”
IEEE Trans. Image Processing, 15: 1413-1428, 2006.
[83] L. Rudin, S. Osher, and E. Fatemi. “Nonlinear total variation based noise removal
algorithms.” Physica D, 60: 259-268, 1992.
[84] B. Russell. “Exploiting the sparse derivative prior for super-resolution.” M.S. thesis,
MIT, 2003.
[85] G. Sapiro and D. Ringach. “Anisotropic diffusion of multi-valued images with applica-
tions to color filtering.” IEEE Trans. Image Processing, 5: 1582-1586, 1996.
[86] L. Saul and S. Roweis. “Think globally, fit locally: Unsupervised learning of low di-
mensional manifolds.” J. Machine Learning Research, 4: 119-155, 2003.
[87] J. Sauvola and M. Pietaksinen. “Adaptive document image binarization.” Pattern
Recognition, 33: 225-236, 2000.
[88] M. Sezgin and B. Sankur. “Survey over image thresholding techniques and quantitative
performance evaluation.” J. Electronic Imaging, 13: 146-165, 2004.
[89] R. Schultz and R. Stevenson. “Extraction of high-resolution frames from video se-
quences.” IEEE Trans. Image Processing, 5: 996-1011, 1996.
[90] J. Shen. “Γ-convergence approximation to piecewise constant Mumford-Shah segmen-
tation.” Proc. Int. Conf. Advanced Concepts in Intelligent Vision Systems, p. 499-506,
2005.
[91] J. Shen. “A stochastic-variational model for soft Mumford-Shah segmentation.” Int. J.
Biomedical Imaging, 2006: ID 92329, 2006.
[92] E. Simoncelli and J. Portilla. “Texture characterization via second-order statistics of
wavelet coefficient amplitudes.” Proc. 5th IEEE Conf. Image Processing, 1998.
[93] A. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems. Winston and Sons,
Washington D.C., 1977.
[94] O.D. Trier and A.K. Jain. “Goal-directed evaluation of binarization methods.” IEEE
Trans. Pattern Anal. and Machine Intelligence, 17: 1191-1201, 1995.
[95] O.D. Trier and T. Taxt. “Evaluation of binarization methods for document images.”
IEEE Trans. Pattern Anal. and Machine Intelligence, 17: 312-315, 1995.
[96] A. Tsai, A. Yezzi, and A. Willsky. “Curve evolution implementation of the Mumford-
Shah functional for image segmentation, denoising, interpolation and magnification.”
IEEE Trans. Image Processing, 10: 1169-1186, 2001.
[97] C. Vogel. Computational Methods for Inverse Problems. SIAM Press, Philadelphia,
2002.
[98] L. Wang and K. Mueller. “Generating sub-resolution detail in images and volumes using
constrained texture synthesis.” Proc. IEEE Conf. on Visualization, p. 75-82, 2004.
[99] D. West. Introduction to Graph Theory. Prentice Hall, Upper Saddle River, NJ, 1996.
[100] T. Wittman. “Lost in the supermarket: Decoding blurry barcodes.” SIAM News, 37,
2004.