mos aici ng o f cam er a-capt ur ed do cumen ts 2007. 1. 2. · 84 di"eren t resolutio n, no...

Mosaicing of Camera-captured Documents1

Without Pose Restriction2

Jian Liang a, Daniel DeMenthon b, David Doermann b3

aJian Liang is with Amazon.com; Seattle, WA; USA.4

bDaniel DeMenthon and David Doermann are with University of Maryland;5

College Park, MD; USA.6

Preprint submitted to Elsevier 2 July 2006

Abstract7

In this paper we present an image mosaicing method for camera-captured docu-8

ments. Our method is unique in not restricting the camera position, thus allowing9

greater flexibility than approaches using scanner or fixed-cameras. To accommodate10

for the perspective distortions introduced by varying poses, we implement a two-11

step image registration process that accurately computes the projectivity between12

any two document images with an overlapping area that may be as small as 10%. In13

the overlapping area, we apply a sharpness based selection process to obtain seam-14

less composition across the border and within. Experiments show that our approach15

can produce a very sharp, high resolution and accurate full page mosaic from small16

image patches of a document captured from unconstrained positions.17

Key words: Camera-based document analysis, image mosaicing, image registration18

1 Introduction19

Digital image mosaicing has been studied for several decades, starting from20

the mosaicing of aerial and satellite pictures, and now expanding into the21

consumer market for panoramic picture generation. Its success depends on22

two key components: image registration and image blending. The first aims at23

finding the geometric relationship between the to-be-mosaiced images, while24

the latter is concerned with creating a seamless composition.25

Many researchers have developed techniques for the special case of document26

image mosaicing [3,7,8,10,11,12]. The basic idea is to create a full view of a27

Email address: [email protected] (Jian Liang).

2

document page, often too large to capture during a single scan or in a sin-28

gle frame, by stitching together many small patches. If the small images are29

obtained through flatbed scanners [3,10], image registration is less difficult30

because the overlapping part of two images differ only by a 2D Euclidean31

transformation. However, if the images are captured by cameras, the over-32

lapping images differ by a projective transformation. Virtually all reported33

work that we are aware of on document mosaicing using cameras impose some34

restrictions on the camera position to avoid perspective distortion. Some of35

them simply ask the user to point the camera straight at the document plane36

[7,11]. Others require hardware support. Nakao et al [8] attach a video camera37

to a mouse, facing down at the document page. While a user drags the mouse38

across the page, a sequence of pictures are taken, and registered pairwise with39

the help of mouse movement. In [12] a overhead camera is fixed facing down40

while the document is moved on the desktop. While hardware support re-41

duces projective transformations to Euclidean transformations, it defeats one42

purpose of using cameras, which is to achieve flexibility and convenience.43

Our goal is to remove the constraints on camera position and orientation44

such that users can take pictures of a document from any position without45

requirement of special hardware. Figure 1 shows two image patches of a doc-46

ument captured by a camera. Due to unconstrained camera zooming and po-47

sitions, these two images differ greatly in perspective, resolution, brightness,48

contrast, and sharpness. Although many methods have been proposed for im-49

age registration ([9,4], to name a few), the images in Fig 1 still present great50

challenges because of large displacement, small overlapping area (∼ 10%), sig-51

nificant perspective distortion, and periodicity of printed text which presents52

indistinguishable texture patterns everywhere. Applying a global registration53

3

technique such as the Fourier-Mellin transform [9] did not give good results54

in our experiments. We also tested a feature points detector (PCA-SIFT [4])55

with two robust estimators (RANSAC and Graduated Assignment [2]). The56

result is also unsatisfactory because of the large number of outliers (above57

90%) in feature point matches. Fig. 2 shows the matched feature points found58

using the PCA-SIFT method for three pairs of images. In Fig. 2(a), most of59

the matches are correct (or close to be correct) as their connection lines have60

roughly the same length and direction. Only a couple of incorrect ones exist61

and they stand out clearly. In Fig. 2(b), the two document image patches62

have the same displacement and scale difference as the two scenery images in63

Fig. 2(a). However, the percentage of incorrect matches is significantly higher64

than Fig. 2(a) because the periodicity of text lines and characters makes fea-65

ture points less distinguishable from one another. Fig. 2(c) shows the matched66

points between the two images in Fig. 1. In this case, the incorrect matches67

are so overwhelming that it is very difficult to identify any good matches at68

a glance. Overall, this example shows that while it is easy to locate feature69

points in document images, it is more difficult to find good matches under70

perspective distortion and with small overlapping areas.71

Fig. 1. Document image patches captured from unconstrained positions. Two cor-

responding words in the overlapping areas are shown by the two overlaid boxes.

With respect to image blending, there are three possible problems that have72

not been well addressed for document images. Conventional blending computes73

4

(a) (b) (c)

Fig. 2. Match points found by PCA-SIFT. (a) Two sub-images with different scale

and rotation generated from a scenery image. (b) Two sub-images from a document

image with the same scaling and rotation as in (a). (c) Two camera-captured images

of two overlapping parts of a document page. A thick black line shows one correct

match.

the weighted average in an overlapped area, i.e., f = a1f1 +a2f2, where f1 and74

f2 are pixel values from two images, a1 and a2 are two weights that sum to 1.75

By varying the weights, one achieves a gradual transition from one image to76

another across the overlapping area. Other more sophisticated methods exist,77

which are essentially variations of weighted averaging [1]. Though averaging78

usually works well for general images, it is not optimal for document images.79

First, averaging methods treat only the overlapping area. They do not ad-80

dress the overall uneven lighting across images. Second, registration may have81

errors. In mis-registered areas, weighted averaging would result in so-called82

‘ghost’ images. Third, two images may have different sharpness because of83

different resolution, noise level, zooming, out-of-focus blur, motion blur, or84

lighting change. Weighted averaging essentially reduces the sharpness of the85

5

sharper image by blending a blurred image into it. Figure 3 shows the short-86

comings of averaging method. For general scenery or portrait images, a certain87

amount of lighting variation and blurring is acceptable and ‘ghosts’ can be88

softened by blurring. However, for document images, viewers and OCR algo-89

rithms expect sharp contrast between text and background and a minimum90

lighting variation. Therefore, averaging does not represent the optimal way of91

creating document mosaics.92

(b) (c)

(a) (d) (e)

Fig. 3. Challenges for seamless image blending. (a) For two document patches with

uneven lighting, their weighted average results in inconsistent contrast across the

composite image. (b) A small portion of the darker image. (c) The same portion of

the lighter image. (d) Weighted averaging result of (b) and (c) extracted from (a).

(e) Our selective image blending result.

Our proposed registration method for two overlapping views consists of two93

steps. First we remove perspective distortion and relative rotation of individ-94

ual views using text lines and vertical character strokes detected in document95

images. This step removes perspective foreshortening and rotation, and leaves96

only a translation and a scaling between the two views. Then, we find fea-97

ture point matches between views using PCA-SIFT. Although outliers still98

dominate, we are able to filter them out efficiently because of the simplified99

transformation between the two views. After refining the transformation with100

cross-correlation block matching results, we can obtain a very accurate regis-101

6

tration for the image pair.102

We treat the inconsistency of lighting by localized histogram normalization,103

which balances the brightness and contrast across two images as well as within104

each image. Then in the overlapped area, we perform a component level selec-105

tive image composition which preserves the sharpness of the printed markings,106

and ensures a smooth transition near the overlapping area border.107

A shorter version of our work has appeared in [6]. In this paper we present108

our method in full details and provide more experimental results.109

2 Document Image Registration110

The first step of our image registration method is to remove perspective fore-111

shortening and rotation between two images of the document, and leave only112

a three-parameter transformation combining a translation and a scaling be-113

tween the two images. We achieve this by removing the perspective distortion114

in both images. The basic idea is to detect text line directions and vertical115

character stroke directions, find their vanishing points, and the homography116

that maps the vanishing points back to infinity (see [5]). The two images in117

Fig. 3 are transformed to the images in Fig. 4 after rectification and localized118

histogram normalization.119

Ideally, after rectification, the two perspective-free images should only differ by120

a translation and a scaling. Although projectivity is gone, large displacement,121

small overlap, and periodicity of texture still prevent common registration122

methods from succeeding. For example, our global Fourier-Mellin registration123

method still fails, and PCA-SIFT still gives a lot of false matches that defeat124

7

Fig. 4. Document patches after rectification and localized histogram normalization.

Graduated Assignment and make RANSAC ineffective. Our solution is to filter125

out the outliers using a voting mechanism in the spirit of the Hough transform,126

and to use the good matches to register the two images.127

First, let us assume the scale is known. Suppose two images (called A and B)128

are placed within the same coordinate system after proper scaling, and the129

true translation of image B with respect to image A is (x0, y0). Let {pi}Ni=1 be130

the feature points in image A, and {qi}Ni=1 be the matched points in image B.131

If pi and qi are a correct match, we have qi − pi = (x0, y0), and an inequality132

otherwise. We compute all the displacements between matched points, i.e., let133

qi−pi = (xi, yi). We have (xj, yj) = (xk, yk) (we say that they are compatible),134

where j and k denote any two correct matches. In the meantime, the probabil-135

ity of having (xs, ys) = (xt, yt), where either s or t denotes an incorrect match,136

is extremely low assuming incorrect matches are randomly distributed across137

the image. We group the matches with equal displacement (within a certain138

quantization bound) into compatible groups. Ideally, all correct matches are139

assigned to one group, while each incorrect match constitutes a group of its140

own. Hence, the correct matches are the matches in the largest group, and141

their displacements represent the correct translation. In practice, due to the142

8

quantization in the histogram used in our compatibility test (see below), some143

incorrect matches that are similar may be placed in the same group. Even so,144

the sizes of such groups are highly unlikely to exceed the size of the group of145

correct matches.146

If the scale estimate deviates from the correct value, the compatibility mea-147

sure among correct matches will degrade. A small scale error can be absorbed148

by the histogram quantization. As the error increases, the group of correct149

matches will eventually split. Given a completely incorrect scale, the displace-150

ment distribution of correct matches will be as random as incorrect matches, so151

the largest compatible group will split into single-match groups. In summary,152

the largest compatible group is generated when the scale is correct.153

Based on the above analysis, searching for the largest compatible group of154

matches as a function of scale can simultaneously solve the problems of finding155

1) the correct matches, 2) the correct scale, and 3) the correct translation156

between two images. The specific procedure is as follows:157

(1) For every scale s in a range, construct the compatible groups and let g(s)158

be the largest.159

(2) Select s∗ which maximize |g(s)| and s∗ is the correct scale.160

(3) Find all matches in g(s∗), and compute the mean of their displacements,161

which is the correct translation.162

For a given scale, we use a 2D histogram of the match displacements in x163

and y to find the compatible groups. We divide the 2D displacement space164

into bins, and the displacement of each match falls in one bin. To address165

quantization error at bin boundaries, we smooth the 2D histogram by a 3× 3166

averaging kernel. Then, the bin with the most votes is the largest compatible167

9

group. The optimal bin size should be proportional to the average position168

error of the correctly matched feature points. In practice, we find that this is169

not critical. We use 1/20 of the image diagonal length as the bin size.170

We use PCA-SIFT to find the matches between the two images in Fig. 4. Fig. 5171

shows the number of matches for the first and second largest compatible groups172

found in 2D histograms as a function of the scale. The highest peak in the173

solid curve identifies the correct scale. At the correct scale, the second largest174

group (only three votes) is much smaller than the largest group (12 votes).175

This shows good aggregation of correct matches. After examination, we found176

the second largest group resides in a neighboring bin of the largest group, and177

the three matches are approximately correct. These two groups would merge178

if the bin size is increased. With different bin sizes we obtain curves slightly179

different from those in Fig. 5. The correct scale is always found.180

−1 −0.5 0 0.5 10

5

10

15

log(scale)

Size of largest compatible groupSize of 2nd largest group

Fig. 5. 2D histogram peak values vs. scales

The figure also shows that when the scale is set slightly larger than the best181

value, the solid curve drops while the dotted line climbs. This means some182

matches in the largest group shift to the second largest group in the neigh-183

boring bin. This confirms that the largest group splits when the scale is not184

perfect. When the scale differs significantly from the best value, either to the185

left of right, the solid curve drops to two matches 1 and the dotted curve shows186

1 The largest group keeps two matches because PCA-SIFT repeated a pair of

10

only one match.187

Given the best scale, we use the corresponding 2D histogram to find the188

matches aggregated in the largest group at this scale. Fig. 6 shows the correct189

and incorrect matches.190

Fig. 6. Correct (left) and incorrect (right) matches in the PCA-SIFT result.

Taking the correct matches, we compute an initial projective transformation191

between the two images and map one into the other, as shown in Fig. 7(a).192

However, because good matches tend to reside near the overlapped region’s193

center, the registration is inaccurate near the border. We further refine the194

registration using cross-correlation block matching. This results in a dense195

and accurate set of matched points covering the whole overlapped area, which196

allows us to compute a refined projective transformation (see Fig. 7(b)).197

matched points in its output.

11

(a) (b)

Fig. 7. Image registration results where squares and crosses indicate the matched

points in two images. (a) Registration using correct PCA-SIFT matches shows

misalignment near borders of overlapping region. (c) Registration using additional

matches obtained from cross-correlation block matching is very accurate.

3 Seamless Composition198

As we have stated in the introduction, there are three difficulties in creating199

a seamless document mosaic. The first is due to inconsistent lighting across200

two images. Conventional blending does not address overall lighting incon-201

sistency, and it works well for general photos only because people accept202

lighting changes in natural scenes. However, documents are fundamentally203

binary with black print on white paper, and viewers’ eyes are very sensitive204

to varying shade in documents. Typically, the histogram of a document im-205

age is bimodal. Different lighting conditions cause the two modes to shift.206

One way of balancing the lighting across two document images is to binarize207

both images. However, binarization introduces artifacts. Instead, we choose208

localized histogram normalization. The basic idea is to compute the local his-209

togram in a small neighborhood, normalize the histogram such that the two210

modes are transformed to black and white respectively (or very dark and light211

12

gray). Histogram normalization preserves the transition between background212

and foreground, so the result is more pleasing to view.213

The second problem is registration error, and the third is uneven sharpness of214

patch images. We solve both with selective image composition, i.e., each pixel215

in the result is chosen from the image with the best sharpness. We measure216

sharpness in an image by the local average of gradient magnitudes. In the217

following, the index of the selected image for a pixel is called the decision for218

this pixel.219

The pixel-level decisions can be represented by a map in which the same220

decisions are grouped into regions. The boundaries of decision regions may221

intersect characters and words. Thus, if we apply pixel-level decisions directly,222

some characters or words may consist of pieces with different sharpness chosen223

from different images, which is not desirable. Furthermore, mis-registration224

tends to break decision regions into small pieces, resulting in ‘ghost’ images.225

Therefore we aggregate the pixel-level decisions at the word level. This re-226

quires finding words. To do this, we compose an averaging image for the over-227

lapped area, binarize it, dilate the foreground, and find connected components.228

The dilation has two effects. First, areas that may contain ‘ghost’ images are229

merged into the nearest component. Second, the width of our dilation kernel230

is set to be larger than its height, so components of a word are more likely to231

be merged than components from upper or lower text lines. As a result, most232

connected components contain a word. Next, all the pixels inside a connected233

component vote with their pixel-level decision, and the majority vote is taken234

as the component decision. The values for all the pixels of the component are235

selected from the winning image. This process ensures that ‘ghost’ images are236

13

eliminated and words do not have an uneven sharpness. For background areas237

(areas that are not included in word regions), the variation of sharpness is not238

visible, so we use the pixel-level decisions directly (without voting) to assign239

their values.240

Fig. 8 illustrates the process of selective image composition and the results.241

Fig. 8(a) shows that most components consist of a single word. Fig. 8(b) shows242

the component-level decision map by two shades of gray. The arrows indicate243

words that are cut into different parts in pixel-level decision map but not not244

broken in component-level decision map.245

Words may still be broken by the boundaries of the overlapping area (e.g.,246

“previously” and “interpret” in the lower left part of Fig. 8(b)). In this case,247

half of the broken word has better sharpness than the other half. We may opt248

to select the entire word from the image with lower sharpness to eliminate the249

difference in sharpness. This choice depends on user preference.250

In the background area, the pixel-level decisions result in a large light gray251

region embedded in dark gray area. This does not create visible differences252

in the final image because the variation of sharpness in the background is253

small. In Fig. 8, the comparison between (c) and (d) shows that our approach254

preserves the sharpness. In Fig. 8(e), the overlapping area boundary is visible.255

It is eliminated in Fig. 8(f).256

4 Experiments257

We test our mosaicing algorithm with real camera-captured images. For each258

test document, we scan it at 300 dpi and use the scan as the ground truth259

14

(a) (b)

(c) (d) (e) (f)

Fig. 8. Selective image composition. (a) Connected components are represented by

white. The overlapping area is represented by light gray. (b) The binary selection

decision map distinguished by dark and light gray. (c, e) Weighted averaging result.

(d, f) Selective image composition result.

image. We capture an image of the whole document page with a digital camera.260

In our test, we used a Canon camera with 6M pixels. The camera is fixed on a261

tripod, the document is posted on a wall, and we carefully calibrate the camera262

to face straight at the wall so the image is perspective free. In some sense, this263

image represents the optimal image we can obtain with the digital camera264

and we use it as the base line in evaluation. Then we hold the camera in hand265

and take a set of images of the document from different positions. Each image266

patch contains a part of the document and has perspective distortion. These267

image patches are the input to our mosaicing algorithm. Fig. 9 shows some268

example patches that come from the same document. Throughout our test,269

we use the default auto-mode settings of the camera.270

15

Fig. 9. Examples of image patches captured from a document page.

In the first experiment, we capture four image patches for each document and271

we make sure that each two patches overlap. We feed the four image patches272

in all possible orders to the mosaicing module which always takes the first273

image as the base image and register the next incoming image to the current274

composite image. In this way we generate 24 (= C44) compositions.275

In the second experiment, we capture the images in a more unconstrained way.276

The number of image patches for a document varies up to twelve. We do not277

force each pair of images to overlap. Instead, we visually inspect the captured278

images to determine if they overlap and record this information. Based on279

this overlapping information, ten random mosaicing sequences are generated280

under the condition that each image patch must overlap with at least one281

image patch in the sequence before it. This ensures that when an image patch282

is fed to the mosaicing module, it overlaps with the current composition.283

Fig. 10 shows the camera-captured frontal views of two documents, and two284

composite images for each document. We see small differences among compos-285

ite images generated from different mosaicing sequences and we find that the286

difference largely depends on the first image in their mosaicing sequences. If287

the first images are not the same, the small amount of perspective remained in288

the rectified images after geometric rectification may lead to a weak projective289

transformation between them, and therefore results in the difference between290

16

the composite images.291

(a) (c)

(b) (d)

Fig. 10. Perspective-free images compared to composite images. (a, b) Camera cap-

tured documents without perspective distortion. To be fair, they are also enhanced

by the same localized histogram normalization module used for image patches in

document mosaicing so that all have the same level of contrast. The slight barrel

effect in both (a) and (b) is caused by the wide lens distortion of the camera. (c,

d) Two composite images for each document. All compositions have higher resolu-

tion and do not show the barrel effect.

Despite this small difference, all compositions are very close to their base line292

counterparts. This is also confirmed by the OCR performance comparison test.293

We apply OCR to the digital scan, the base line camera-captured image, and294

17

all the composition results. We use the OCR text from the scan as the ground295

truth text and compute the character and word recognition rates for the base296

line and composite images. Table. 1 shows the averaged result. We see that the297

OCR performance on composite images is very close to that on the perspective298

free images of the whole page. In some cases, we actually obtain better OCR299

result from the composite image. In those cases, the base line image suffer300

from slight out-of-focus blur or sub-optimal exposure which are common for301

consumer level cameras, while the image patches have better sharpness and302

exposure that also show up in the composite result.303

Base line images Composite images

Character recognition rate 92.3% 91.0%

Word recognition rate 89.2% 89.5%

Table 1

Average OCR rates of base line images and composite images.

5 Summary304

With camera-based document analysis becoming a new trend in the OCR305

community, the mosaicing of document images expands from scans to camera-306

captured documents. While traditional document mosaicing techniques require307

perspective-free images, the method we propose in this paper can handle docu-308

ment images captured from unconstrained positions. Our registration method309

is robust against severe perspective distortion, large displacement and small310

overlapping area. The novel and important step in our approach is to re-311

move the perspective distortion using a rectification method. This step reduces312

18

the difference between two images to only a translation and a scaling, which313

greatly simplifies the feature point verification problem.314

We also propose an image blending method that is optimized for document315

images. Our method addresses the inconsistent lighting, ‘ghost’ image, and316

varying sharpness problems that are common for camera-captured documents.317

Our method produces composite image very close to the frontal view of the318

document. In some cases, we observe that the quality of the composite image319

exceeds that of the frontal image captured by a camera which suffers from320

blur, reflection, or barrel distortion.321

The mosaicing technique proposed in this paper can be applied in many322

camera-based document analysis applications. Especially for mobile devices323

equipped with low-resolution cameras, mosaicing is the only practical way to324

produce a full page image at high resolution.325

In this paper, mosaicing is mainly motivated by the need for a substitute ap-326

proach when direct frontal view is unavailable. However, for curved documents,327

even when direct frontal view is obtained, the non-planar shape will still dis-328

tort the image. Therefore, in our future work we will address the mosaicing of329

curved documents.330

References331

[1] P. J. Burt and E. H. Adelson. A multiresolution spline with application to332

image mosaics. ACM Transactions on Graphics, 2(4):217–236, 1983.333

[2] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph334

19

matching. IEEE Trans. PAMI, 18(4):377–388, April 1996.335

[3] F. Isgro and M. Pilu. A fast and robust image registration method based on an336

early consensus paradigm. Pattern Recognition Letters, 25(8):943–954, 2004.337

[4] Y. Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for338

local image descriptors. In Proc. CVPR, volume 2, pages 506–513, 2004.339

[5] J. Liang, D. DeMenthon, and D. Doermann. Flattening curved documents in340

images. In Proc. CVPR, pages 338–345, 2005.341

[6] J. Liang, D. DeMenthon, and D. Doermann. Camera-based document image342

mosaicing. In Proc. ICPR, 2006.343

[7] M. Mirmehdi, P. Clark, and J. Lam. Extracting low resolution text with an344

active camera for OCR. In Proc. IX Spanish Sym. Pat. Rec. and Image Proc.,345

pages 43–48, May 2001.346

[8] T. Nakao, A. Kashitani, and A. Kaneyoshi. Scanning a document with a small347

camera attached to a mouse. In Proc. WACV’98, pages 63–68, 1998.348

[9] B. S. Reddy and B. N. Chatterji. An FFT-based technique for translation,349

rotation, and scale-invariant image registration. IEEE Trans. Image Proc.,350

5(8):1266–1271, 1996.351

[10] K. Schutte and A. M. Vossepoel. Accurate mosaicking of scanned maps, or how352

to generate a virtual a0 scanner. In Proc. ASCI’95, pages 353–359, 1995.353

[11] A. P. Whichello and H. Yan. Document image mosaicing. In Proc. ICPR, pages354

1081–1083, 1998.355

[12] A. Zappala, A. Gee, and M. J. Taylor. Document mosaicing. Image and Vision356

Computing, 17(8):585–595, 1999.357

20