Mosaicing of Camera-captured Documents
Without Pose Restriction
Jian Liang a, Daniel DeMenthon b, David Doermann b
a Jian Liang is with Amazon.com; Seattle, WA; USA.
b Daniel DeMenthon and David Doermann are with University of Maryland;
College Park, MD; USA.
Preprint submitted to Elsevier 2 July 2006
Abstract
In this paper we present an image mosaicing method for camera-captured documents. Our method is unique in not restricting the camera position, thus allowing greater flexibility than approaches using scanners or fixed cameras. To account for the perspective distortions introduced by varying poses, we implement a two-step image registration process that accurately computes the projectivity between any two document images with an overlapping area that may be as small as 10%. In the overlapping area, we apply a sharpness-based selection process to obtain seamless composition across the border and within. Experiments show that our approach can produce a very sharp, high-resolution, and accurate full-page mosaic from small image patches of a document captured from unconstrained positions.
Key words: Camera-based document analysis, image mosaicing, image registration
1 Introduction
Digital image mosaicing has been studied for several decades, starting from the mosaicing of aerial and satellite pictures, and now expanding into the consumer market for panoramic picture generation. Its success depends on two key components: image registration and image blending. The first aims at finding the geometric relationship between the to-be-mosaiced images, while the latter is concerned with creating a seamless composition.
Many researchers have developed techniques for the special case of document image mosaicing [3,7,8,10,11,12]. The basic idea is to create a full view of a document page, often too large to capture during a single scan or in a single frame, by stitching together many small patches. If the small images are obtained through flatbed scanners [3,10], image registration is less difficult because the overlapping parts of two images differ only by a 2D Euclidean transformation. However, if the images are captured by cameras, the overlapping images differ by a projective transformation. Virtually all reported work that we are aware of on document mosaicing using cameras imposes some restrictions on the camera position to avoid perspective distortion. Some methods simply ask the user to point the camera straight at the document plane [7,11]. Others require hardware support. Nakao et al. [8] attach a video camera to a mouse, facing down at the document page. While a user drags the mouse across the page, a sequence of pictures is taken and registered pairwise with the help of the mouse movement. In [12] an overhead camera is fixed facing down while the document is moved on the desktop. While hardware support reduces projective transformations to Euclidean transformations, it defeats one purpose of using cameras, which is to achieve flexibility and convenience.
Email address: [email protected] (Jian Liang).
Our goal is to remove the constraints on camera position and orientation such that users can take pictures of a document from any position without requiring special hardware. Figure 1 shows two image patches of a document captured by a camera. Due to unconstrained camera zooming and positions, these two images differ greatly in perspective, resolution, brightness, contrast, and sharpness. Although many methods have been proposed for image registration ([9,4], to name a few), the images in Fig. 1 still present great challenges because of large displacement, small overlapping area (about 10%), significant perspective distortion, and periodicity of printed text, which presents indistinguishable texture patterns everywhere. Applying a global registration technique such as the Fourier-Mellin transform [9] did not give good results in our experiments. We also tested a feature point detector (PCA-SIFT [4]) with two robust estimators (RANSAC and Graduated Assignment [2]). The result is also unsatisfactory because of the large number of outliers (above 90%) in feature point matches. Fig. 2 shows the matched feature points found using the PCA-SIFT method for three pairs of images. In Fig. 2(a), most of the matches are correct (or close to correct) as their connection lines have roughly the same length and direction. Only a couple of incorrect ones exist and they stand out clearly. In Fig. 2(b), the two document image patches have the same displacement and scale difference as the two scenery images in Fig. 2(a). However, the percentage of incorrect matches is significantly higher than in Fig. 2(a) because the periodicity of text lines and characters makes feature points less distinguishable from one another. Fig. 2(c) shows the matched points between the two images in Fig. 1. In this case, the incorrect matches are so overwhelming that it is very difficult to identify any good matches at a glance. Overall, this example shows that while it is easy to locate feature points in document images, it is more difficult to find good matches under perspective distortion and with small overlapping areas.
Fig. 1. Document image patches captured from unconstrained positions. Two corresponding words in the overlapping areas are shown by the two overlaid boxes.
Fig. 2. Match points found by PCA-SIFT. (a) Two sub-images with different scale and rotation generated from a scenery image. (b) Two sub-images from a document image with the same scaling and rotation as in (a). (c) Two camera-captured images of two overlapping parts of a document page. A thick black line shows one correct match.
With respect to image blending, there are three possible problems that have not been well addressed for document images. Conventional blending computes the weighted average in an overlapped area, i.e., $f = a_1 f_1 + a_2 f_2$, where $f_1$ and $f_2$ are pixel values from two images, and $a_1$ and $a_2$ are two weights that sum to 1. By varying the weights, one achieves a gradual transition from one image to another across the overlapping area. Other more sophisticated methods exist, which are essentially variations of weighted averaging [1]. Though averaging usually works well for general images, it is not optimal for document images. First, averaging methods treat only the overlapping area. They do not address the overall uneven lighting across images. Second, registration may have errors. In mis-registered areas, weighted averaging would result in so-called ‘ghost’ images. Third, two images may have different sharpness because of different resolution, noise level, zooming, out-of-focus blur, motion blur, or lighting change. Weighted averaging essentially reduces the sharpness of the sharper image by blending a blurred image into it. Figure 3 shows the shortcomings of the averaging method. For general scenery or portrait images, a certain amount of lighting variation and blurring is acceptable and ‘ghosts’ can be softened by blurring. However, for document images, viewers and OCR algorithms expect sharp contrast between text and background and minimal lighting variation. Therefore, averaging is not the optimal way of creating document mosaics.
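To make the weighted average $f = a_1 f_1 + a_2 f_2$ concrete, the following minimal sketch blends two already registered grayscale patches with weights that ramp linearly across the overlap. The function name `linear_blend` and the left-to-right ramp are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def linear_blend(f1, f2):
    """Blend two registered, same-sized overlap patches with a linear ramp.

    a1 falls from 1 to 0 from left to right and a2 = 1 - a1,
    so the two weights sum to 1 at every pixel.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    width = f1.shape[1]
    a1 = np.linspace(1.0, 0.0, width)[np.newaxis, :]  # weight of image 1 per column
    a2 = 1.0 - a1                                     # weight of image 2 per column
    return a1 * f1 + a2 * f2

# A sharp edge blended with a blurred copy of the same edge.
sharp = np.tile(np.array([0.0, 0.0, 255.0, 255.0]), (2, 1))
blurred = np.tile(np.array([0.0, 85.0, 170.0, 255.0]), (2, 1))
blended = linear_blend(sharp, blurred)
```

In this toy case the blended edge is softer than the edge in `sharp`, which is the loss of sharpness described above.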
Fig. 3. Challenges for seamless image blending. (a) For two document patches with uneven lighting, their weighted average results in inconsistent contrast across the composite image. (b) A small portion of the darker image. (c) The same portion of the lighter image. (d) Weighted averaging result of (b) and (c) extracted from (a). (e) Our selective image blending result.
Our proposed registration method for two overlapping views consists of two steps. First, we remove perspective distortion and relative rotation of individual views using text lines and vertical character strokes detected in document images. This step removes perspective foreshortening and rotation, and leaves only a translation and a scaling between the two views. Then, we find feature point matches between views using PCA-SIFT. Although outliers still dominate, we are able to filter them out efficiently because of the simplified transformation between the two views. After refining the transformation with cross-correlation block matching results, we obtain a very accurate registration for the image pair.
We treat the inconsistency of lighting by localized histogram normalization, which balances the brightness and contrast across two images as well as within each image. Then in the overlapped area, we perform a component-level selective image composition which preserves the sharpness of the printed markings and ensures a smooth transition near the overlapping area border.
A shorter version of our work has appeared in [6]. In this paper we present our method in full detail and provide more experimental results.
2 Document Image Registration
The first step of our image registration method is to remove perspective foreshortening and rotation between two images of the document, and leave only a three-parameter transformation combining a translation and a scaling between the two images. We achieve this by removing the perspective distortion in both images. The basic idea is to detect text line directions and vertical character stroke directions, find their vanishing points, and compute the homography that maps the vanishing points back to infinity (see [5]). The two images in Fig. 3 are transformed to the images in Fig. 4 after rectification and localized histogram normalization.
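The idea of mapping vanishing points back to infinity can be illustrated with a standard projective-geometry construction: the line through the two vanishing points is the vanishing line of the page, and a homography whose last row equals that line sends every point on it to infinity. The sketch below shows only this construction and is not the full rectification procedure of [5]; the function names and the example vanishing points are hypothetical.

```python
def cross(a, b):
    """Cross product of two homogeneous 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rectifying_homography(v_h, v_v):
    """Homography whose last row is the vanishing line through v_h and v_v.

    v_h: vanishing point of the text lines (homogeneous coordinates).
    v_v: vanishing point of the vertical strokes (homogeneous coordinates).
    """
    line = cross(v_h, v_v)
    if abs(line[2]) < 1e-12:
        raise ValueError("vanishing line passes through the coordinate origin")
    line = tuple(c / line[2] for c in line)  # normalize so the last entry is 1
    return [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [line[0], line[1], line[2]]]

def apply_homography(H, point):
    """Multiply a 3x3 matrix with a homogeneous point."""
    return tuple(sum(H[r][c] * point[c] for c in range(3)) for r in range(3))

# Hypothetical vanishing points far outside the image.
v_h = (2000.0, 100.0, 1.0)
v_v = (50.0, -3000.0, 1.0)
H = rectifying_homography(v_h, v_v)
mapped_h = apply_homography(H, v_h)
mapped_v = apply_homography(H, v_v)
```

After the mapping, the third homogeneous coordinate of both vanishing points is zero, meaning they lie at infinity and the corresponding line families are parallel again. This step alone leaves an affine distortion; removing the remaining rotation and shear needs additional constraints that this sketch does not cover.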
Fig. 4. Document patches after rectification and localized histogram normalization.
Ideally, after rectification, the two perspective-free images should only differ by a translation and a scaling. Although projectivity is gone, large displacement, small overlap, and periodicity of texture still prevent common registration methods from succeeding. For example, our global Fourier-Mellin registration method still fails, and PCA-SIFT still gives many false matches that defeat Graduated Assignment and make RANSAC ineffective. Our solution is to filter out the outliers using a voting mechanism in the spirit of the Hough transform, and to use the good matches to register the two images.
First, let us assume the scale is known. Suppose two images (called A and B) are placed within the same coordinate system after proper scaling, and the true translation of image B with respect to image A is $(x_0, y_0)$. Let $\{p_i\}_{i=1}^{N}$ be the feature points in image A, and $\{q_i\}_{i=1}^{N}$ be the matched points in image B. If $p_i$ and $q_i$ are a correct match, we have $q_i - p_i = (x_0, y_0)$, and $q_i - p_i \neq (x_0, y_0)$ otherwise. We compute all the displacements between matched points, i.e., let $q_i - p_i = (x_i, y_i)$. For any two correct matches $j$ and $k$, we have $(x_j, y_j) = (x_k, y_k)$ (we say that they are compatible). In the meantime, the probability of having $(x_s, y_s) = (x_t, y_t)$, where either $s$ or $t$ denotes an incorrect match, is extremely low assuming incorrect matches are randomly distributed across the image. We group the matches with equal displacement (within a certain quantization bound) into compatible groups. Ideally, all correct matches are assigned to one group, while each incorrect match constitutes a group of its own. Hence, the correct matches are the matches in the largest group, and their displacements represent the correct translation. In practice, due to the quantization in the histogram used in our compatibility test (see below), some incorrect matches that are similar may be placed in the same group. Even so, the sizes of such groups are highly unlikely to exceed the size of the group of correct matches.
If the scale estimate deviates from the correct value, the compatibility measure among correct matches will degrade. A small scale error can be absorbed by the histogram quantization. As the error increases, the group of correct matches will eventually split. Given a completely incorrect scale, the displacement distribution of correct matches will be as random as incorrect matches, so the largest compatible group will split into single-match groups. In summary, the largest compatible group is generated when the scale is correct.
Based on the above analysis, searching for the largest compatible group of matches as a function of scale can simultaneously solve the problems of finding 1) the correct matches, 2) the correct scale, and 3) the correct translation between two images. The specific procedure is as follows:
(1) For every scale $s$ in a range, construct the compatible groups and let $g(s)$ be the largest.
(2) Select the scale $s^*$ that maximizes $|g(s)|$; $s^*$ is the correct scale.
(3) Find all matches in $g(s^*)$, and compute the mean of their displacements, which is the correct translation.
For a given scale, we use a 2D histogram of the match displacements in x and y to find the compatible groups. We divide the 2D displacement space into bins, and the displacement of each match falls in one bin. To address quantization error at bin boundaries, we smooth the 2D histogram by a 3 × 3 averaging kernel. Then, the bin with the most votes is the largest compatible group. The optimal bin size should be proportional to the average position error of the correctly matched feature points. In practice, we find that this is not critical. We use 1/20 of the image diagonal length as the bin size.
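The three-step search and the smoothed 2D histogram can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scale is applied to the points of image B, the members of the winning group are taken to be all matches in the 3 × 3 neighborhood of the peak bin, and the names `largest_group` and `register`, the candidate scale list, the bin size, and the synthetic matches are all hypothetical.

```python
import numpy as np

def largest_group(p, q, scale, bin_size):
    """Return (votes, member indices) of the largest compatible group at one scale."""
    d = q * scale - p                          # displacement of every match, shape (N, 2)
    idx = np.floor(d / bin_size).astype(int)   # 2D bin index of every displacement
    idx = idx - idx.min(axis=0) + 1            # shift so a one-bin zero border surrounds the data
    hist = np.zeros(tuple(idx.max(axis=0) + 2))
    np.add.at(hist, (idx[:, 0], idx[:, 1]), 1)
    # 3x3 box sum: the 3x3 averaging kernel up to the constant factor 1/9.
    box = sum(np.roll(np.roll(hist, dx, axis=0), dy, axis=1)
              for dx in (-1, 0, 1) for dy in (-1, 0, 1))
    bx, by = np.unravel_index(np.argmax(box), box.shape)
    near = (np.abs(idx[:, 0] - bx) <= 1) & (np.abs(idx[:, 1] - by) <= 1)
    return int(box[bx, by]), np.flatnonzero(near)

def register(p, q, scales, bin_size):
    """Pick the scale with the biggest group, then average its displacements."""
    results = [(s,) + largest_group(p, q, s, bin_size) for s in scales]
    s_star, votes, members = max(results, key=lambda r: r[1])
    translation = (q[members] * s_star - p[members]).mean(axis=0)
    return s_star, translation, members

# Synthetic matches: 12 correct ones (scale 2, translation (30, -20)) and 40 random ones.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1000, size=(52, 2))
q = rng.uniform(0, 1000, size=(52, 2))
q[:12] = (p[:12] + np.array([30.0, -20.0])) / 2.0
s_star, translation, members = register(
    p, q, scales=[0.5, 0.71, 1.0, 1.41, 2.0, 2.83], bin_size=50.0)
```

Because the group is read from the smoothed histogram, a random match that happens to fall next to the peak can slip into `members` and bias the mean slightly; in the paper this initial estimate is refined later by block matching.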
We use PCA-SIFT to find the matches between the two images in Fig. 4. Fig. 5 shows the number of matches for the first and second largest compatible groups found in 2D histograms as a function of the scale. The highest peak in the solid curve identifies the correct scale. At the correct scale, the second largest group (only three votes) is much smaller than the largest group (12 votes). This shows good aggregation of correct matches. After examination, we found that the second largest group resides in a neighboring bin of the largest group, and its three matches are approximately correct. These two groups would merge if the bin size were increased. With different bin sizes we obtain curves slightly different from those in Fig. 5, but the correct scale is always found.
[Plot: size of the largest compatible group (solid curve) and size of the 2nd largest group (dotted curve) versus log(scale); horizontal axis from −1 to 1, vertical axis from 0 to 15.]
Fig. 5. 2D histogram peak values vs. scales
The figure also shows that when the scale is set slightly larger than the best value, the solid curve drops while the dotted curve climbs. This means some matches in the largest group shift to the second largest group in the neighboring bin. This confirms that the largest group splits when the scale is not perfect. When the scale differs significantly from the best value, either to the left or right, the solid curve drops to two matches (the largest group keeps two matches because PCA-SIFT repeated a pair of matched points in its output) and the dotted curve shows only one match.
Given the best scale, we use the corresponding 2D histogram to find the matches aggregated in the largest group at this scale. Fig. 6 shows the correct and incorrect matches.
Fig. 6. Correct (left) and incorrect (right) matches in the PCA-SIFT result.
Taking the correct matches, we compute an initial projective transformation between the two images and map one into the other, as shown in Fig. 7(a). However, because good matches tend to reside near the overlapped region’s center, the registration is inaccurate near the border. We further refine the registration using cross-correlation block matching. This results in a dense and accurate set of matched points covering the whole overlapped area, which allows us to compute a refined projective transformation (see Fig. 7(b)).
Fig. 7. Image registration results where squares and crosses indicate the matched points in two images. (a) Registration using correct PCA-SIFT matches shows misalignment near borders of the overlapping region. (b) Registration using additional matches obtained from cross-correlation block matching is very accurate.
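The refinement by cross-correlation block matching can be sketched as an exhaustive local search: for a block in one image, try every offset within a small radius in the other image and keep the offset with the highest normalized cross-correlation. The paper does not give block size, search radius, or the exact correlation measure, so `size`, `radius`, and the zero-mean normalization below are assumptions for illustration.

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equally sized blocks."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def match_block(img_a, img_b, top, left, size, radius):
    """Return (score, dy, dx): the best offset of a block of img_a inside img_b."""
    block = img_a[top:top + size, left:left + size]
    best = (-2.0, 0, 0)  # NCC is never below -1, so any valid offset beats this
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > img_b.shape[0] or x + size > img_b.shape[1]:
                continue  # candidate block would leave the image
            score = ncc(block, img_b[y:y + size, x:x + size])
            if score > best[0]:
                best = (score, dy, dx)
    return best

# Toy check: img_b is img_a shifted by 2 rows down and 3 columns left.
rng = np.random.default_rng(1)
img_a = rng.uniform(0, 255, size=(40, 40))
img_b = np.roll(img_a, shift=(2, -3), axis=(0, 1))
score, dy, dx = match_block(img_a, img_b, top=15, left=15, size=8, radius=4)
```

Running this on a grid of blocks over the overlap would give the dense set of point pairs from which a refined projective transformation can be estimated.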
3 Seamless Composition
As we have stated in the introduction, there are three difficulties in creating a seamless document mosaic. The first is due to inconsistent lighting across two images. Conventional blending does not address overall lighting inconsistency, and it works well for general photos only because people accept lighting changes in natural scenes. However, documents are fundamentally binary with black print on white paper, and viewers’ eyes are very sensitive to varying shade in documents. Typically, the histogram of a document image is bimodal. Different lighting conditions cause the two modes to shift. One way of balancing the lighting across two document images is to binarize both images. However, binarization introduces artifacts. Instead, we choose localized histogram normalization. The basic idea is to compute the local histogram in a small neighborhood and normalize the histogram such that the two modes are transformed to black and white respectively (or very dark and light gray). Histogram normalization preserves the transition between background and foreground, so the result is more pleasing to view.
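A rough sketch of localized normalization is given below. The paper does not specify how the two histogram modes are located, so this example substitutes a simpler technique: it uses a low and a high percentile of each non-overlapping tile as stand-ins for the dark and light modes and stretches them to 0 and 255. The tile size, the percentiles, and the name `local_normalize` are assumptions.

```python
import numpy as np

def local_normalize(img, block=32, lo_pct=5, hi_pct=95):
    """Stretch each tile so its dark level maps to 0 and its light level to 255."""
    img = np.asarray(img, dtype=float)
    out = np.empty_like(img)
    for top in range(0, img.shape[0], block):
        for left in range(0, img.shape[1], block):
            tile = img[top:top + block, left:left + block]
            lo, hi = np.percentile(tile, [lo_pct, hi_pct])  # proxies for the two modes
            if hi - lo < 1e-6:
                # Assumption: a flat tile contains only paper, so render it white.
                out[top:top + block, left:left + block] = 255.0
            else:
                stretched = np.clip((tile - lo) / (hi - lo), 0.0, 1.0)
                out[top:top + block, left:left + block] = stretched * 255.0
    return out

# Two tiles showing the same stripe pattern under different lighting.
rows = (np.arange(32) % 4 == 0)[:, None]                 # every 4th row is "ink"
dim = np.where(rows, 40.0, 120.0) * np.ones((32, 32))
bright = np.where(rows, 100.0, 220.0) * np.ones((32, 32))
page = np.hstack([dim, bright])
balanced = local_normalize(page, block=32)
```

Because the mapping is linear between the two levels, gray transition pixels at character edges keep intermediate values instead of being forced to black or white, which is the advantage over binarization noted above. Non-overlapping tiles can leave visible tile borders on real images; a sliding neighborhood would avoid that at higher cost.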
The second problem is registration error, and the third is uneven sharpness of patch images. We solve both with selective image composition, i.e., each pixel in the result is chosen from the image with the best sharpness. We measure sharpness in an image by the local average of gradient magnitudes. In the following, the index of the selected image for a pixel is called the decision for this pixel.
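The sharpness measure and the pixel-level decision can be sketched as follows. The gradient operator (central differences via `np.gradient`), the square averaging window, and its radius are assumptions; the paper only states that sharpness is the local average of gradient magnitudes.

```python
import numpy as np

def box_mean(a, radius):
    """Mean over a (2*radius+1) x (2*radius+1) window, replicating edge pixels."""
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = a.shape
    total = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return total / (k * k)

def sharpness(img, radius=2):
    """Local average of the gradient magnitude."""
    gy, gx = np.gradient(np.asarray(img, dtype=float))
    return box_mean(np.hypot(gx, gy), radius)

def pixel_decisions(images, radius=2):
    """For every pixel, the index of the image with the highest local sharpness."""
    return np.argmax(np.stack([sharpness(im, radius) for im in images]), axis=0)

# A sharp step edge versus a smooth ramp between the same gray levels.
sharp = np.tile(np.array([0.0, 0.0, 0.0, 255.0, 255.0, 255.0]), (6, 1))
ramp = np.tile(np.array([0.0, 51.0, 102.0, 153.0, 204.0, 255.0]), (6, 1))
decision = pixel_decisions([sharp, ramp], radius=1)
```

In this toy case the sharp image wins at the edge itself, while the ramp wins in the flat areas where the sharp image has no gradient at all. Pixel-level decisions can therefore flip close to a stroke, which motivates the word-level aggregation described next.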
The pixel-level decisions can be represented by a map in which the same decisions are grouped into regions. The boundaries of decision regions may intersect characters and words. Thus, if we apply pixel-level decisions directly, some characters or words may consist of pieces with different sharpness chosen from different images, which is not desirable. Furthermore, mis-registration tends to break decision regions into small pieces, resulting in ‘ghost’ images.
Therefore we aggregate the pixel-level decisions at the word level. This requires finding words. To do this, we compose an averaged image for the overlapped area, binarize it, dilate the foreground, and find connected components. The dilation has two effects. First, areas that may contain ‘ghost’ images are merged into the nearest component. Second, the width of our dilation kernel is set to be larger than its height, so components of a word are more likely to be merged than components from upper or lower text lines. As a result, most connected components contain a word. Next, all the pixels inside a connected component vote with their pixel-level decision, and the majority vote is taken as the component decision. The values for all the pixels of the component are selected from the winning image. This process ensures that ‘ghost’ images are eliminated and words do not have uneven sharpness. For background areas (areas that are not included in word regions), the variation of sharpness is not visible, so we use the pixel-level decisions directly (without voting) to assign their values.
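The majority vote inside each connected component can be sketched as below. The sketch assumes the components are already available as an integer label map with 0 marking background (such a map could come from a connected-component routine like `scipy.ndimage.label` applied to the dilated binary image); the function name and the toy arrays are hypothetical.

```python
import numpy as np

def component_decisions(labels, pixel_decision, n_images=2):
    """Replace pixel-level decisions inside each component by its majority vote."""
    labels = np.asarray(labels)
    pixel_decision = np.asarray(pixel_decision, dtype=int)
    out = pixel_decision.copy()
    for comp in np.unique(labels):
        if comp == 0:
            continue  # background keeps its pixel-level decisions
        mask = labels == comp
        votes = np.bincount(pixel_decision[mask], minlength=n_images)
        out[mask] = np.argmax(votes)  # ties go to the lower image index
    return out

# Two word components; the pixel-level map cuts the first one into two parts.
labels = np.array([[1, 1, 1, 0],
                   [2, 2, 0, 0]])
pixel_decision = np.array([[0, 1, 1, 0],
                           [0, 0, 1, 1]])
word_level = component_decisions(labels, pixel_decision)
```

After voting, every pixel of component 1 is taken from image 1 and every pixel of component 2 from image 0, while the background pixels are left as they were.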
Fig. 8 illustrates the process of selective image composition and the results. Fig. 8(a) shows that most components consist of a single word. Fig. 8(b) shows the component-level decision map by two shades of gray. The arrows indicate words that are cut into different parts in the pixel-level decision map but are not broken in the component-level decision map.
Words may still be broken by the boundaries of the overlapping area (e.g., “previously” and “interpret” in the lower left part of Fig. 8(b)). In this case, half of the broken word has better sharpness than the other half. We may opt to select the entire word from the image with lower sharpness to eliminate the difference in sharpness. This choice depends on user preference.
In the background area, the pixel-level decisions result in a large light gray region embedded in the dark gray area. This does not create visible differences in the final image because the variation of sharpness in the background is small. In Fig. 8, the comparison between (c) and (d) shows that our approach preserves the sharpness. In Fig. 8(e), the overlapping area boundary is visible. It is eliminated in Fig. 8(f).
Fig. 8. Selective image composition. (a) Connected components are represented by white. The overlapping area is represented by light gray. (b) The binary selection decision map distinguished by dark and light gray. (c, e) Weighted averaging result. (d, f) Selective image composition result.
4 Experiments
We test our mosaicing algorithm with real camera-captured images. For each test document, we scan it at 300 dpi and use the scan as the ground truth image. We capture an image of the whole document page with a digital camera. In our test, we used a 6-megapixel Canon camera. The camera is fixed on a tripod, the document is posted on a wall, and we carefully calibrate the camera to face straight at the wall so the image is perspective free. In some sense, this image represents the optimal image we can obtain with the digital camera and we use it as the base line in evaluation. Then we hold the camera in hand and take a set of images of the document from different positions. Each image patch contains a part of the document and has perspective distortion. These image patches are the input to our mosaicing algorithm. Fig. 9 shows some example patches that come from the same document. Throughout our test, we use the default auto-mode settings of the camera.
Fig. 9. Examples of image patches captured from a document page.
In the first experiment, we capture four image patches for each document and we make sure that every two patches overlap. We feed the four image patches in all possible orders to the mosaicing module, which always takes the first image as the base image and registers the next incoming image to the current composite image. In this way we generate 24 (= 4!) compositions.
In the second experiment, we capture the images in a more unconstrained way. The number of image patches for a document varies up to twelve. We do not force each pair of images to overlap. Instead, we visually inspect the captured images to determine if they overlap and record this information. Based on this overlapping information, ten random mosaicing sequences are generated under the condition that each image patch must overlap with at least one image patch in the sequence before it. This ensures that when an image patch is fed to the mosaicing module, it overlaps with the current composition.
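One simple way to generate such sequences is to grow each sequence one patch at a time, always choosing at random among the patches that overlap something already placed. The paper does not describe how its sequences were drawn, so the procedure, the function name, and the example overlap table below are assumptions; note that this scheme does not sample all valid sequences with equal probability.

```python
import random

def random_sequence(overlaps, rng):
    """Random ordering in which every patch overlaps at least one earlier patch.

    overlaps: dict mapping a patch id to the set of patch ids it overlaps with.
    rng: a random.Random instance.
    """
    remaining = set(overlaps)
    first = rng.choice(sorted(remaining))
    sequence = [first]
    remaining.remove(first)
    while remaining:
        placed = set(sequence)
        candidates = sorted(p for p in remaining if overlaps[p] & placed)
        if not candidates:
            raise ValueError("overlap graph is not connected")
        nxt = rng.choice(candidates)
        sequence.append(nxt)
        remaining.remove(nxt)
    return sequence

# Hypothetical overlap record for four patches arranged in a chain.
overlaps = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
sequences = [random_sequence(overlaps, random.Random(seed)) for seed in range(10)]
```

Each generated list can be fed to the mosaicing module in order, since every patch after the first is guaranteed to overlap the composition built so far.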
Fig. 10 shows the camera-captured frontal views of two documents, and two composite images for each document. We see small differences among composite images generated from different mosaicing sequences and we find that the difference largely depends on the first image in their mosaicing sequences. If the first images are not the same, the small amount of perspective remaining in the rectified images after geometric rectification may lead to a weak projective transformation between them, and therefore results in the difference between the composite images.
Fig. 10. Perspective-free images compared to composite images. (a, b) Camera-captured documents without perspective distortion. To be fair, they are also enhanced by the same localized histogram normalization module used for image patches in document mosaicing so that all have the same level of contrast. The slight barrel effect in both (a) and (b) is caused by the wide lens distortion of the camera. (c, d) Two composite images for each document. All compositions have higher resolution and do not show the barrel effect.
Despite this small difference, all compositions are very close to their base line counterparts. This is also confirmed by the OCR performance comparison test. We apply OCR to the digital scan, the base line camera-captured image, and all the composition results. We use the OCR text from the scan as the ground truth text and compute the character and word recognition rates for the base line and composite images. Table 1 shows the averaged result. We see that the OCR performance on composite images is very close to that on the perspective-free images of the whole page. In some cases, we actually obtain a better OCR result from the composite image. In those cases, the base line image suffers from slight out-of-focus blur or sub-optimal exposure, which are common for consumer-level cameras, while the image patches have better sharpness and exposure that also show up in the composite result.
                             Base line images   Composite images
Character recognition rate   92.3%              91.0%
Word recognition rate        89.2%              89.5%
Table 1
Average OCR rates of base line images and composite images.
5 Summary
With camera-based document analysis becoming a new trend in the OCR community, the mosaicing of document images expands from scans to camera-captured documents. While traditional document mosaicing techniques require perspective-free images, the method we propose in this paper can handle document images captured from unconstrained positions. Our registration method is robust against severe perspective distortion, large displacement and small overlapping area. The novel and important step in our approach is to remove the perspective distortion using a rectification method. This step reduces the difference between two images to only a translation and a scaling, which greatly simplifies the feature point verification problem.
We also propose an image blending method that is optimized for document images. Our method addresses the inconsistent lighting, ‘ghost’ image, and varying sharpness problems that are common for camera-captured documents. Our method produces a composite image very close to the frontal view of the document. In some cases, we observe that the quality of the composite image exceeds that of the frontal image captured by a camera which suffers from blur, reflection, or barrel distortion.
The mosaicing technique proposed in this paper can be applied in many camera-based document analysis applications. Especially for mobile devices equipped with low-resolution cameras, mosaicing is the only practical way to produce a full page image at high resolution.
In this paper, mosaicing is mainly motivated by the need for a substitute approach when a direct frontal view is unavailable. However, for curved documents, even when a direct frontal view is obtained, the non-planar shape will still distort the image. Therefore, in our future work we will address the mosaicing of curved documents.
References
[1] P. J. Burt and E. H. Adelson. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics, 2(4):217–236, 1983.
[2] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. IEEE Trans. PAMI, 18(4):377–388, April 1996.
[3] F. Isgro and M. Pilu. A fast and robust image registration method based on an early consensus paradigm. Pattern Recognition Letters, 25(8):943–954, 2004.
[4] Y. Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local image descriptors. In Proc. CVPR, volume 2, pages 506–513, 2004.
[5] J. Liang, D. DeMenthon, and D. Doermann. Flattening curved documents in images. In Proc. CVPR, pages 338–345, 2005.
[6] J. Liang, D. DeMenthon, and D. Doermann. Camera-based document image mosaicing. In Proc. ICPR, 2006.
[7] M. Mirmehdi, P. Clark, and J. Lam. Extracting low resolution text with an active camera for OCR. In Proc. IX Spanish Sym. Pat. Rec. and Image Proc., pages 43–48, May 2001.
[8] T. Nakao, A. Kashitani, and A. Kaneyoshi. Scanning a document with a small camera attached to a mouse. In Proc. WACV’98, pages 63–68, 1998.
[9] B. S. Reddy and B. N. Chatterji. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Proc., 5(8):1266–1271, 1996.
[10] K. Schutte and A. M. Vossepoel. Accurate mosaicking of scanned maps, or how to generate a virtual A0 scanner. In Proc. ASCI’95, pages 353–359, 1995.
[11] A. P. Whichello and H. Yan. Document image mosaicing. In Proc. ICPR, pages 1081–1083, 1998.
[12] A. Zappala, A. Gee, and M. J. Taylor. Document mosaicing. Image and Vision Computing, 17(8):585–595, 1999.