Submission to Image Communication, Special Issue on Interactive Representation of Still and Dynamic Scenes
Mosaic-Based 3D Scene Representation and Rendering
Zhigang Zhu (Corresponding Author)
Department of Computer Science 130th Street and Convent Avenue
City College of New York, New York, NY 10031, USA Tel: 1 (212) 650-8799, Fax: 1 (212) 650-6248
Allen R. Hanson Department of Computer Science
130 Governors Drive University of Massachusetts Amherst, MA 01003, USA
[email protected]

Abstract – In this paper we address the problem of fusing images from many video cameras or a moving video camera. The captured images exhibit obvious motion parallax, but they can be aligned and integrated into a few mosaics with a large field-of-view (FOV) that preserve 3D information. We have developed a compact geometric representation that re-organizes the original perspective images into a set of parallel projections with different oblique viewing angles. In addition to providing a wide field of view, mosaics with various oblique views effectively represent occlusion regions that cannot be seen in a usual nadir view. Stereo pairs can be formed from pairs of mosaics with different oblique viewing angles, so that image-based 3D viewing can be achieved. This representation can be used both as an advanced video interface and as a pre-processing step for 3D reconstruction. A ray interpolation approach for generating the parallel-projection mosaics is presented, and efficient 3D scene/object rendering based on multiple parallel-projection mosaics is discussed. Several real-world examples are provided, with applications ranging from aerial video surveillance/environmental monitoring and ground mobile robot navigation to under-vehicle inspection.
Keywords – image-based rendering, parallel projection mosaics, ray interpolation, three-dimensional viewing, interactive visual representation.
I. INTRODUCTION
This paper presents a novel approach for fusing images from many spatially distributed video cameras
(or a moving video camera) into a few mosaiced images that preserve 3D information. In both cases, a
virtual 2D array of cameras with field-of-view (FOV) overlaps is formed to generate complete coverage
of a scene (or an object). The proposed mosaic representation has been applied to a variety of
applications, including airborne video for environmental monitoring and urban surveillance, ground
mobile robot navigation, and under-vehicle inspection (Fig. 1). These applications represent very different
imaging scenarios, from far-range through extreme close-range. We will show that the same approach can
be applied to all these cases.
Fig. 1. A few application examples: (a) airborne video for environmental monitoring; (b) airborne urban surveillance; (c) ground mobile robot; and (d) under-vehicle inspection.
A mosaic representation with a single viewpoint is not appropriate for representing a 3D scene
captured by a translating camera, due to the well-known problems of occlusion as illustrated in Fig. 2a. A
2D panoramic mosaic of a 3D scene generated from video from a translating camera has the geometry of
multiple viewpoints [1,2], but it only preserves information from a single viewing direction. An example
is shown in Fig. 2b with parallel projection in the direction of the drawing (i.e. in the plane of the page)
and perspective projection in the orthogonal direction (i.e. into the page). Three-dimensional (3D)
structure and surface information from other viewing directions of the original video is lost in such a
representation. A digital elevation map (DEM) generated using traditional aerial photogrammetry
consists of a sampled array of elevations (depths) for a number of ground positions at regularly spaced
intervals [3]. Even though 3D and texture data can be represented, a DEM generated from such a scenario
usually only has a nadir viewing direction (as in Fig. 2b, with parallel projections in both directions),
hence the surfaces from other viewing directions cannot be represented. However, in some applications
such as surveillance and security inspection, a scene or an object (e.g. a vehicle) needs to be observed
from many viewing directions to reveal hidden anomalies (see Section VI.C for an example). Stereo
panoramas [4,5] have been presented as a mechanism for obtaining the “best” 3D information from an
off-center rotating camera. In the case of a translating camera, various layered representations [6-8] have
been proposed to represent both 3D information and occlusions, but such representations need 3D
reconstructions.
Fig. 2. Mosaic representations with different projections: (a) perspective; (b) orthogonal (nadir); (c) oblique looking forward; and (d) oblique looking backward. The combination of (b) to (d) gives our multi-view parallel mosaic representation.
Figure 2a illustrates the observation that many viewing directions are already included in each of the original camera views. This property has been noted and used before, for example in the X-slit mosaics
with non-parallel rays [9] for image-based rendering. In this paper we propose a representation that can
re-organize the original perspective images into a set of parallel projections with different oblique
viewing angles (in both the x and the y directions of the 2D images I (x, y)). Representations with parallel
projections are efficient since only a few 2D images are needed. They are also effective in recovering the
3D structure of the scene in view using the optimal parallel stereo geometry [10-12]. Mosaics with 2D
oblique parallel projections are a unified representation for our previous work on parallel-perspective
stereo mosaics [11,12] and multi-camera mosaics [13,14]. Such representations provide a wide field of
view, optimal 3D information for stereo viewing and reconstruction, and the capability to represent
occlusions. Fig. 2 (from b to d) shows three oblique views where all the surfaces can be represented with
the combination of the three views. In practice, more views may be needed for both 3D reconstruction
and 3D viewing.
This paper will focus on mosaic representation and our approach for mosaic-based rendering of 3D
scenes, but other research issues, such as mosaic generation and 3D reconstruction will also be briefly
mentioned. Therefore, the rest of the paper is organized as follows. Section II will briefly introduce
mosaic representations with 2D oblique parallel projection and their inherent properties. In Section III, we
will present several practical cases where 2D parallel-projection mosaics can be generated, and discuss
related research issues in generating and using the parallel mosaics. In Section IV, we will present a
general ray interpolation approach for parallel-projection mosaic generation. This section will also discuss
some practical issues in generating the mosaics. An efficient mosaic-based 3D rendering method will be
presented in Section V. Experimental results are given in Section VI for three important applications –
aerial video surveillance, ground mobile robot navigation, and under vehicle inspection. Section VII is a
brief summary.
II. 2D OBLIQUE PARALLEL PROJECTION
A normal perspective camera has a single viewpoint, which means all the light rays pass through a
common nodal point. On the other hand, in orthogonal images with parallel projections in both the x and y
directions, all the rays are parallel to each other. If we imagine a sensor with parallel projections, we could turn the sensor to capture images with different oblique angles (including both nadir
and oblique angles) in both the x and y directions. Thus we can create many pairs of parallel stereo
images, each with two different oblique angles, and can observe surfaces occluded in a nadir view.
Fig. 3. Depth from parallel stereo with multiple viewpoints: 1D case (viewing angles β1 < 0 and β2 > 0, baseline B, scene point P at depth Z).
Fig. 3 shows parallel stereo in a 1D case, where two oblique angles β1 and β2 are chosen. The depth of a point P can be calculated as

Z = B / (tan β2 − tan β1)    (1)
where β1 and β2 are the angles of the two viewing directions, respectively, and B is the adaptive baseline
between the two viewpoints. This adaptive baseline information is recorded in a pair of stereo mosaics
with these two angles, and is proportional to the displacement of the corresponding image projections of
the point P. The baseline is adaptive since a large depth will have a larger baseline given the two angles
than a smaller depth. It has been shown by others [10] and by us [11, 12] that parallel stereo is superior to
both conventional perspective stereo and to the recently developed multi-perspective stereo with
concentric mosaics for 3D reconstruction (e.g., in [5]). The adaptive baseline inherent in the parallel-
perspective geometry permits depth accuracy independent of absolute depth in theory [10,11]. This result
can be easily obtained from Eq. (1) since depth Z is proportional to the adaptive baseline B and therefore
to the recorded visual displacement of the corresponding pair in the two mosaics. In practice, the depth
accuracy is a linear function of depth in stereo mosaics generated from perspective image sequences [12],
due to the ray interpolation process that will be discussed in Section IV. However, this is still better than
perspective stereo or concentric stereo. In contrast, the depth error of perspective stereo and concentric
stereo is proportional to the square of depth.
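Eq. (1) and the adaptive-baseline property can be checked numerically. The following sketch is illustrative only (angles in degrees, arbitrary length units); it simply evaluates Eq. (1):

```python
import math

def parallel_stereo_depth(baseline, beta1_deg, beta2_deg):
    """Eq. (1): Z = B / (tan(beta2) - tan(beta1)) for parallel stereo."""
    return baseline / (math.tan(math.radians(beta2_deg))
                       - math.tan(math.radians(beta1_deg)))

# The baseline is adaptive: for fixed angles, B = Z * (tan b2 - tan b1),
# so a point twice as deep is recorded with twice the baseline, which is
# why depth resolution is uniform in theory.
```

For example, with β1 = −45° and β2 = 45°, a recorded baseline of 10 units corresponds to a depth of 5 units.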
Fig. 4. Parallel projections with two oblique angles α and β (around the x and y axes, respectively): (a) nadir view (α=β=0); (b) β-oblique view (α=0, β≠0); (c) α-oblique view (α≠0, β=0); and (d) dual-oblique view (α≠0, β≠0). Parallel mosaics can be formed by populating each single selected ray in both the x and y directions.
We can make two extensions to this 1D case of parallel stereo. First, we can select various oblique
angles (other than just two) for constructing multiple parallel projections. By doing so we can observe
various degrees of occlusions and can construct stereo pairs with different depth resolution via the
selection of different pairs of oblique angles. Second, the 1D parallel projection can be extended to 2D
(Fig. 4), obtaining a mosaiced image that has a nadir view (Fig. 4a), oblique angle(s) only in one direction
(Fig. 4b and c) or oblique angles in both the x and the y directions (Fig. 4d).
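For a pinhole image with focal length f (in pixels), the ray with oblique angles (α, β) corresponds to a fixed pixel offset from the principal point. A minimal sketch of this mapping, under a simplified sign convention that is our assumption rather than the paper's:

```python
import math

def oblique_ray_pixel(f, alpha_deg, beta_deg):
    """Pixel offset (u, v) from the principal point whose viewing ray has
    oblique angles (alpha, beta); beta rotates about the y axis (x offset),
    alpha about the x axis (y offset), following Fig. 4's convention."""
    u = f * math.tan(math.radians(beta_deg))
    v = f * math.tan(math.radians(alpha_deg))
    return u, v
```

Collecting this one pixel from every camera of a dense array yields the parallel mosaic for that (α, β) pair.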
III. PRACTICAL SCENARIOS AND RESEARCH ISSUES
It is impractical to use a single sensor to capture orthogonal images with full parallel projections in
both x and y dimensions for a large-scale scene, and with various oblique directions. However, there are at
least three practical ways of generating images with oblique parallel projections using existing sensors: a
2D sensor array of many spatially distributed cameras (Fig. 5a), a “scanner” with a 1D array of cameras
(Fig. 5b), and a single perspective camera that moves in 2D (Fig. 5c).
With a 2D array of many perspective cameras (Fig. 5a), we first assume that the optical axes of all the
cameras point in the same direction (into the page in Fig. 5a), and the viewpoints of all cameras are on a
single plane perpendicular to their optical axes. Then the perspective images can be organized into
mosaiced images with any oblique viewing angles by extracting rays from the original perspective images
with the same viewing directions, one ray from each image. If the camera array is dense enough, then
densely mosaiced images can be generated.
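This extraction step can be sketched as follows, assuming an idealized regular camera grid with identical orientations and principal points at the image centers (all names here are illustrative, not from the paper):

```python
import math

def parallel_mosaic(images, f, alpha_deg=0.0, beta_deg=0.0):
    """Build an oblique parallel mosaic by taking, from every camera of a
    regular 2D grid, the single pixel whose ray has the chosen angles.
    images[i][j] is the (rows x cols) image of camera (i, j); f is the
    common focal length in pixels."""
    du = int(round(f * math.tan(math.radians(beta_deg))))
    dv = int(round(f * math.tan(math.radians(alpha_deg))))
    mosaic = []
    for row in images:
        mosaic_row = []
        for img in row:
            h, w = len(img), len(img[0])
            cy, cx = h // 2, w // 2   # principal point at image center
            mosaic_row.append(img[cy + dv][cx + du])
        mosaic.append(mosaic_row)
    return mosaic
```

One ray per image gives one mosaic pixel per camera; a dense camera array is what makes the resulting mosaic dense.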
Fig. 5. Parallel mosaics from a 2D bed of cameras: (a) 2D array; (b) 1D scan array; and (c) a single scan camera.
If only a 1D linear array of perspective cameras is available (Fig. 5b), the camera array can be
‘scanned’ over the scene to synthesize a virtual 2D camera array. Then stereo mosaic pairs with oblique
parallel projections in both directions can still be generated, given that we can accurately control or
estimate the translation of the camera array. We have actually used this approach in an Under Vehicle
Inspection System (UVIS: Section VI.C) [13, 14, 18].
Even if only a single camera is used, we can still generate a 2D virtual bed of cameras by moving the
camera in two dimensions, along a “2D scan” path as shown in Fig. 5c. This is the case for aerial video
mosaics [11, 12, 15, 20] where a single camera is mounted on a light aircraft flying over an area (Section
VI.A).
In real applications where parallel-projection mosaics must be generated, there are two challenging
research issues. The first problem is camera orientation estimation (calibration). In our previous study on
an aerial video application, we used external orientation instruments, i.e., GPS, INS and a laser profiler, to
ease the problem of camera orientation estimation [11, 12, 20]. In the case of under-vehicle inspection
using a 1D array of cameras [13], relative relations among cameras can be obtained by an offline camera
calibration procedure. However, the motion of the cameras or vehicles should be estimated through image
matching. In applications of 3D rendering where accurate 3D estimation is not the main issue, an image-
based camera-motion estimation method [11] is used to get an approximation of the camera orientation
parameters, i.e., the affine transformation parameters. In this paper, we assume that the extrinsic and
intrinsic camera parameters are known at each camera location so that parallel-projection mosaics can be
generated.
The second problem is to generate dense parallel mosaics with a sparse, uneven, camera array, and for
a complicated 3D scene. To solve this problem, a Parallel Ray Interpolation for Stereo Mosaics (PRISM)
approach was proposed in [11]. While the PRISM algorithm was originally designed to generate parallel-
perspective stereo mosaics (parallel projection in one direction and perspective projection in the other),
the core idea of ray interpolation can be used for generating a mosaic with full parallel projection at any
oblique angle. This will be discussed in the next section.
In summary, in the stereo mosaic approach for large-scale 3D scene modeling and rendering, the
computation is efficiently distributed in three steps (Fig. 6): camera pose estimation via the external
measurement units, image mosaicing via ray interpolation, and 3D reconstruction from a pair of stereo
mosaics [11, 12, 15] or 3D rendering with multi-view mosaics [17]. In estimating camera poses (for
image rectification), only sparse tie points widely distributed in the two images are needed for performing
bundle adjustments [21]. In generating dense parallel rays in stereo mosaics, local matches are only
performed for parallel-perspective rays between small overlapping regions of successive frames (Section
IV). In using stereo mosaics for 3D recovery, matches are only carried out between the two final mosaics;
for 3D viewing, only mosaic selection and viewing window cropping are needed. Mosaic-based 3D
viewing will be discussed in Section V.
Step 1. Orientation Estimation (calibration, geo-location, bundle)
Step 2. Stereo Mosaicing (matching, ray interpolation)
Step 3. 3D Recovery/Viewing (stereo, motion, rendering)
Fig. 6. Three step approach for generating and using parallel-projection mosaics.
IV. PRISM: VIDEO MOSAICING ALGORITHM
This section discusses the generalized Parallel Ray Interpolation for Stereo Mosaics (PRISM) approach
to generating dense parallel mosaics with a sparse, uneven, camera array, and for a complicated 3D scene.
Fig. 7 shows how the PRISM algorithm works for 1D images. The 1D camera has two axes – the optical
axis (Z) and the X-axis. Given the known camera orientation at each camera location, one ray with a
given oblique angle β can be chosen from the image at each camera location to contribute to the parallel
mosaic with this oblique angle β. The oblique angle is defined against the direction perpendicular to the
mosaicing direction, which is the dominant direction of the camera path (Fig. 7).
Fig. 7. Ray interpolation for parallel mosaicing from a camera array. Cameras A and B (nodal points, with optical axis Z and image plane shown) lie on the camera path; one ray with oblique angle β is selected at each camera, and the interpolated Ray I is generated between the parallel rays of A and B along the mosaicing direction.
However, the "mosaiced" image constructed from only those existing rays will be sparse and uneven, since the camera arrays are usually neither regular nor very dense. Therefore, interpolated parallel rays between a pair of existing parallel rays (from two neighboring images) are generated by performing local matching between these two images. The assumption is that we can find at least two images to generate the parallel ray. Such an interpolated ray is shown in Fig. 7, where Ray I is interpolated from Image A and Image B.
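The blending at the heart of this step can be illustrated with a toy 1D sketch; the real PRISM algorithm first performs local matching to find the corresponding points, which is omitted here:

```python
def interpolate_ray(ray_pos, cam_a, cam_b, color_a, color_b):
    """Interpolate the parallel ray at mosaic position ray_pos between two
    neighboring images (a toy 1D stand-in, not the full PRISM algorithm).
    cam_a, cam_b: camera positions along the mosaicing direction;
    color_a, color_b: intensities sampled by the matched rays in each image."""
    t = (ray_pos - cam_a) / (cam_b - cam_a)   # 0 at camera A, 1 at camera B
    return (1.0 - t) * color_a + t * color_b  # weighted blend of the two rays
```

A ray exactly at camera A's position reproduces A's sample; a ray midway between A and B averages the two matched observations.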
One interesting property of the parallel mosaics is that all the (virtual) viewpoints are at infinity.
Therefore, even if the original camera path deviates greatly in the direction perpendicular to the mosaicing direction, we can still generate full parallel mosaics. Note, however, that in practice too large a deviation in the perpendicular direction will produce a captured image sequence whose frames have rather different spatial resolutions of the scene; the resulting mosaics, generated via ray interpolation, will then have varying spatial quality.
The extension of this approach to 2D images is straightforward, and a region triangulation strategy
similar to that in [11] can be applied here to deal with 2D cases. In principle, we need to match all the
points between the two overlapping slices of the successive frames to generate a complete parallel-
perspective mosaic. To reduce the computational complexity, a fast PRISM algorithm has been developed [11]. It only requires matches between a set of
point pairs in two successive images; the rest of the points are generated by warping a set of triangulated
regions defined by the control points in each of the two images. The proposed fast PRISM algorithm can
be easily extended to use more feature points (thus smaller triangles) in the overlapping slices so that each
triangle really covers a planar patch or a patch that is visually indistinguishable from a planar patch, or to
perform pixel-wise dense matches to achieve true parallel-perspective geometry.
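The per-triangle warp used by the fast algorithm can be driven by a 6-parameter affine map solved from the three matched control points of each triangle. A self-contained sketch (Cramer's rule on a 3x3 system, no external libraries; not the paper's implementation):

```python
def affine_from_triangles(src, dst):
    """Solve the 6-parameter affine map (a11, a12, a13, a21, a22, a23)
    taking triangle `src` to triangle `dst`, each given as three (x, y)
    control-point pairs, as used to warp one triangulated region."""
    def solve3(A, b):
        # Cramer's rule for a 3x3 linear system A x = b
        def det(m):
            return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                  - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                  + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
        D = det(A)
        xs = []
        for col in range(3):
            M = [row[:] for row in A]
            for r in range(3):
                M[r][col] = b[r]
            xs.append(det(M) / D)
        return xs
    A = [[sx, sy, 1.0] for (sx, sy) in src]
    a11, a12, a13 = solve3(A, [dx for (dx, _) in dst])
    a21, a22, a23 = solve3(A, [dy for (_, dy) in dst])
    return (a11, a12, a13, a21, a22, a23)
```

Applying the resulting map to every pixel inside the source triangle warps that region into the mosaic; three matched points per triangle suffice because an affine transform has exactly six degrees of freedom.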
One important issue here is the selection of neighborhood images for ray interpolation. For example,
with a 1D scan sequence of a single camera, it is hard to generate full parallel projection in the other
direction (i.e., the Y direction), which is perpendicular to the motion of the camera, since the interpolated
parallel rays far off the center of the images in the y direction have to use rays with rather different
oblique angles in the original perspective images.
Fig. 8 shows mosaic results for an aerial video sequence of a cultural scene, captured with a 1D scan of a single camera. Note that parallel-perspective mosaics are generated in this example, due to the
aforementioned problem. Here we want to show the effectiveness of the PRISM algorithm. A few frames
of this 1000+-frame sequence are shown in Fig. 8a. In order to save matching time, every 10th frame is
used for generating the parallel-perspective mosaics. Please compare the results of parallel-perspective
mosaicing via the PRISM approach [11] vs. 2D mosaicing using a similar approach (manifold mosaicing
[2]), by looking along the many building boundaries (associated with depth changes) in the complete
4448x1616 set of mosaics at our web site [15]. Since it is hard to see subtle errors in 2D mosaics of the size of Fig. 8a, Fig. 8c and Fig. 8d show close-up windows of the 2D and 3D mosaics for the same portion of the scene with the tall Campus Center building. In Fig. 8c the multi-perspective mosaic via 2D mosaicing has obvious seams along the stitching boundaries between two frames. This can be observed by looking at the region indicated by circles, where some fine structures (parts of a white blob and two rectangles) are missing due to misalignments. As expected, the parallel-perspective mosaic constructed using 3D mosaicing (Fig. 8d) does not exhibit these problems.
Fig. 8. Parallel-perspective mosaics of a campus scene from an airborne camera. (a) A few frames of the 1000-frame sequence. (b) The left mosaic generated from a sub-sampled "sparse" image sequence (every 10th frame) using the proposed PRISM algorithm. The bottom two zoomed sub-images show how PRISM deals with the large motion parallax of a tall building: (c) 2D mosaic result with obvious seams due to misalignment; (d) 3D mosaic result, seamless after local matching with PRISM.
V. STEREO VIEWING AND 3D RECONSTRUCTION
Parallel mosaics with various oblique angles represent scenes from the corresponding viewing angles
with parallel rays and with wide fields of view. There are two obvious applications of such a representation.
First, for 3D recovery, matches are only performed on a pair of mosaics, not on individual video frames.
The stereo mosaic method also solves the baseline versus field-of-view (FOV) dilemma efficiently by
extending the FOV in the directions of mosaicing. More importantly, the parallel stereo mosaics have
optimal/adaptive baselines for all the points, which leads to uniform depth resolution in theory and linear
depth resolution in practice. For 3D reconstruction, epipolar geometry is rather simple due to the full
parallel projections in the mosaic pair. We will present an example of 3D reconstruction of forest scenes
in Section VI; methods and results on 3D reconstruction of urban scenes can be found in [19].
Fig. 9. 3D rendering based on multi-view parallel-projection mosaics. Each grid cell corresponds to a mosaic pair with particular oblique angles (α, β); the labeled pairs (L1a, R1a), (L1b, R1b), (L1c, R1c), (L2, R2), (L3, R3) and (L4, R4) are discussed in the text, and γ denotes in-plane rotation about the optical axis Z.
Second, a human can perceive the 3D scene from a pair of mosaics with different oblique angles (e.g.,
using polarized glasses) without any 3D recovery. If we have mosaics with various oblique angles in both
the x and the y direction, a virtual fly/walk-through can be generated by a simple procedure of selecting
different pairs of mosaics, cropping corresponding windows, and performing 2D image rotation and
scaling.
Fig. 9 shows the basic concept of mosaic-based rendering. Each grid represents an oblique viewing
direction (please also refer to Fig. 2). Translation, rotation and zoom of the virtual camera can be
simulated as the following simple operations:
(1) Translation in the xy plane can be simulated by shifting the current displayed mosaic pair, due to
the multiple-viewpoint property of the parallel-projection mosaics.
(2) Rotations around the X and the Y axes can be simulated by selecting different pairs of mosaics with
different oblique angles. In Fig. 9, the pair (L1a, R1a) gives a user a stereo view with the effect of
“panning” to the left, the pair (L1b, R1b) gives a stereo view with the effect of looking down (into the
paper), while the pair (L1c, R1c) gives a stereo view with the effect of panning to the right, all with a zero
α-oblique angle. The pair (L2, R2) gives a stereo view having both α- and β-oblique angles, thus allowing
the user to perform both panning and tilting.
(3) Rotation (with angle γ) around the optical axis Z only requires selection of an appropriate pair of
mosaics followed by rotation of this pair of mosaics in their image planes by angle γ. In Fig. 9, selecting
the pair (L3, R3) gives a stereo view with the effect of rotating the camera 90 degrees (γ=90), and turning
to the right, whereas the pair (L4, R4) gives a stereo view with the effect of rotating the camera 45 degrees
(γ=45) and turning the camera’s head to the right as well.
(4) The visual disparities can also be controlled by changing the selected angles between the two mosaics for stereo viewing. In other words, if two mosaics that are close together are selected ('close' in the sense of Fig. 9), the visual disparities will be smaller; otherwise they will be larger. This is useful when viewing 3D scenes with varying ranges.
(5) Camera zoom can be simulated by scaling the 2D images in the viewing window. By incorporating
visual disparity control with the scaling operation, a Z-translation effect can also be approximately
simulated.
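Operations (1) to (5) amount to a mosaic lookup followed by purely 2D image operations. A schematic controller, with all names hypothetical, might look like:

```python
def render_view(mosaics, alpha, beta, pan_xy, roll_deg, zoom):
    """Map virtual-camera motions to mosaic operations. `mosaics` is a dict
    keyed by available (alpha, beta) oblique-angle pairs; returns the
    operations needed for one rendered frame (a sketch, not a renderer)."""
    # Rotation about X/Y: select the closest available oblique-angle pair
    key = min(mosaics, key=lambda ab: (ab[0] - alpha) ** 2 + (ab[1] - beta) ** 2)
    return {
        "mosaic": key,          # simulated pan/tilt via mosaic selection
        "crop_offset": pan_xy,  # simulated XY translation: shift the window
        "rotate": roll_deg,     # simulated roll: in-plane 2D rotation
        "scale": zoom,          # simulated zoom (with disparity control,
                                # approximates a Z translation)
    }
```

No 3D computation is involved: the expensive work was done once when the mosaics were generated.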
In a virtual fly-/walk-through, the number of mosaics and the switching between different pairs should allow smooth viewing changes. Rendering results based on real mosaics will be shown in the next section. Here we want to point out that, due to the parallel-projection nature of each mosaiced image, the 3D visual effect differs from that of the usual perspective stereo viewing: exaggerated 3D effects will usually be observed. However, the mosaic-based rendering approach provides 3D information, occlusions, and virtual translation and rotation of the cameras, almost without any computation. In fact, independently moving objects can also be represented and visualized in this approach. A discussion of the dynamic aspects of the stereo mosaics can be found in [19].
VI. EXPERIMENTAL EXAMPLES
The proposed mosaic representation has been applied to a variety of applications, including (1)
airborne video for environmental monitoring and urban surveillance; (2) ground mobile robot navigation;
and (3) under-vehicle inspection. In this section, we will mainly show 3D rendering results with multi-
view parallel (perspective) mosaics in these three scenarios. These applications represent very different
imaging scenarios, including far-range, medium-range and extreme close-range imaging. We will show
that the same approach can be applied to all these three cases.
A. Video Mosaics from Aerial Video

In theory, with a camera on an airplane undergoing an ideal 1D translation and a nadir view direction,
two spatio-temporal images can be generated by extracting two rows of pixels at the front and rear edges
of each frame perpendicular to the direction of motion (Fig. 10). The mosaic images thus generated are
parallel-perspective, with parallel projection in the motion direction and perspective projection in the
other. In addition, these mosaics are obtained from two different oblique viewing angles of a single
camera’s field of view, so that a stereo pair of left and right mosaics captures the inherent 3D information.
Note that we do not generate parallel projection in the y direction for this 1D scan case due to the
difficulty mentioned in Section IV.
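For the idealized 1D-translation case described above, the two slit mosaics can be sketched as follows (a toy version with integer-column slits; real sequences require the PRISM interpolation of Section IV):

```python
def slit_stereo_mosaics(frames, dx):
    """Build left/right parallel-perspective mosaics by stacking one column
    from the front slit and one from the rear slit of every frame (the two
    slits are dx columns apart, centered on the frame). `frames` is a list
    of 2D images (rows x cols); one column is contributed per frame."""
    left, right = [], []
    for frame in frames:
        w = len(frame[0])
        front = w // 2 + dx // 2   # leading slit  -> one oblique view
        rear = w // 2 - dx // 2    # trailing slit -> the other oblique view
        left.append([row[front] for row in frame])
        right.append([row[rear] for row in frame])
    return left, right   # parallel in the motion direction, perspective in y
```

Each mosaic column comes from a different camera position, which is exactly why the result is parallel (multi-viewpoint) along the motion direction.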
In our aerial video environmental monitoring application, a single camera is mounted in a small aircraft
undergoing 6 DOF motion, together with a GPS, INS and laser profiler to measure the moving camera
locations and the distance to the terrain [11, 12]. Given the acquired data, seamless stereo parallel-
perspective video mosaic strips can be generated from the image sequences with a 1D scan path, but with
a rather general motion model, using the proposed parallel ray interpolation for stereo mosaicing
(PRISM) approach [11].
Fig. 10. Parallel-perspective stereo mosaics with a 1D camera scan path. In each perspective image, the rays of the left view pass through the front slit and the rays of the right view through the rear slit (the two slits are separated by dx along the motion direction); stacking these rays over the sequence forms the left-view and right-view mosaics.
A real-world example of video mosaicing using the PRISM approach was shown in Fig. 8. As another example, Fig. 11 shows stereo mosaics (with two β-oblique angles) generated from a telephoto camera, and 3D recovery, for a forest scene in the Amazon rain forest. The average height of the airplane is H = 385 m (about 1260 feet), and the distance between the two slit windows is selected as dx = 160 pixels (in the x direction), with images of 720 (x) * 480 (y) pixels. The image resolution is about 7.65 pixels/meter. The
depth map (Fig. 11c) generated from the stereo mosaics (Fig. 11 a and b) was obtained by using a
hierarchical sub-pixel dense correlation method [16]. The range of depth variations of the forest scene
(from a stereo fixation plane) is from -24.0 m (tree canopy) to 24.0 m (the ground). Even before any 3D
recovery, a human observer can perceive the 3D scene from the stereo pair using a pair of red/blue stereo
glasses (Fig. 11d).
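As a rough consistency check on these numbers, assuming the disparity-to-depth relation Δd ≈ dx·ΔZ/H commonly used for parallel-perspective stereo mosaics (this relation is our assumption here, not stated in the text):

```python
# Back-of-the-envelope disparity estimate for the forest example,
# assuming mosaic disparity relative to the fixation plane scales as
# delta_d = dx * dZ / H (an assumed relation, for illustration only).
H, dx = 385.0, 160.0   # flying height (m), slit separation (pixels)
dZ = 24.0              # depth variation from the fixation plane (m)
delta_d = dx * dZ / H  # expected mosaic disparity in pixels
```

Under this assumption the +/-24 m canopy-to-ground range corresponds to roughly +/-10 pixels of disparity in the stereo mosaics, a plausible working range for the sub-pixel correlation method cited.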
Fig. 11. Stereo mosaics and 3D reconstruction of a 166-frame telephoto video sequence: (a) left mosaic; (b) right mosaic; (c) depth map; and (d) stereoscopic view (use left-blue/right-red glasses).
Fig. 12. Multi-view parallel-perspective mosaics for 3D reconstruction and stereo viewing. Seven mosaics with seven different viewing directions, generated from a single camera moving along the motion direction, support multi-view stereo viewing and multi-view 3D reconstruction; one original perspective frame is shown for comparison.
Fig. 13. Mosaic-based fly-through: snapshots. A 3D effect will be seen with a pair of red-blue glasses. The snapshots also show varying occlusions and independently moving targets (cars on the road).
Multiple oblique parallel-perspective mosaics (Fig. 12) generated in a similar way can be used for
image-based rendering as discussed in Section V. In the case of these parallel-perspective mosaics with
only β-oblique angles, only translation in the X and Y directions, rotation around the Y axis, and camera
zoom can be performed, but a very effective 3D virtual fly-through can be generated. A mosaic-based fly-
through demo may be found at [17], which uses 9 oblique mosaics generated from a real video sequence
of the UMass campus. This result shows motion parallax, occlusion and also moving objects in multiple
parallel-perspective mosaics; a few snapshots are shown in Fig. 13 to illustrate these effects. We note that
the rendering shows a parallel-perspective rather than a true perspective perception. A true perspective
fly-through will be enabled by 3D reconstruction from the multiple mosaics.
B. Video Mosaics for Mobile Robot Applications

The same approach has also been applied to ground mobile robot applications, where the ranges of the roadside scenes from the camera on a mobile robot run from tens of feet (indoor) to hundreds of feet (outdoor). The
road-side parallel (-perspective) stereo mosaics can be used for human-robot interaction in robot
navigation. Fig. 14 shows three parallel-perspective mosaics from a 517-frame video sequence captured
from a mobile robot viewing a group of bookshelves and cabinets at close range as the robot moves from
one end to the other.
Fig. 14. Ground video application. (a) A few frames from a 517-frame sequence with image size 320×240. (b) Ground video mosaics: a left view, the center view and a right view. Each mosaic is 4160×288.
Fig. 15. Mosaic-based walk-through: stereoscopic snapshots (with red/blue glasses)
For this example, eleven mosaics are generated. A video clip of a virtual walk-through using these 11 mosaics can be found at [22]. Fig. 15 shows a few snapshots extracted from the video clip at two camera locations, one viewing the connection between two bookshelves (the 1st row), and the other viewing one end of a cabinet (the 2nd row). In this example, the 3D and occlusion effects are dramatic.
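The red/blue stereoscopic snapshots themselves are simple to compose once a left-view and a right-view mosaic are available. A minimal sketch (assuming same-size RGB mosaics; the function name is ours):

```python
import numpy as np

def make_anaglyph(left, right):
    """Compose a red-blue anaglyph from a left-view and a right-view mosaic.

    left, right : HxWx3 uint8 RGB images of the same size.
    The red channel comes from the left view and green/blue from the right,
    so red-blue (or red-cyan) glasses route each view to one eye.
    """
    assert left.shape == right.shape
    ana = np.empty_like(left)
    ana[..., 0] = left[..., 0]   # red   <- left-eye view
    ana[..., 1] = right[..., 1]  # green <- right-eye view
    ana[..., 2] = right[..., 2]  # blue  <- right-eye view
    return ana
```

Because the two mosaics are already a parallel-perspective stereo pair, this composition alone produces the 3D effect seen through the glasses, with no per-frame computation.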
C. Video Mosaics for Under-Vehicle Inspection
As one of the real applications of full parallel stereo mosaics, an approximate version of mosaics with full parallel projections has been generated from a virtual 2D bed of camera arrays by driving a car over a 1D array of cameras in an under-vehicle inspection system (UVIS) [13, 14, 18]. UVIS is a system designed for security checkpoints such as those at borders, embassies, and large sporting events. It is an example of generating mosaics from very short-range video; a virtual 2D array of cameras is necessary for full coverage of the vehicle undercarriage.
Fig. 16. Conceptual 1D camera array (housed inside the platform) for under-vehicle inspection [13, 14].
Fig. 17. 2D parallel mosaic from “13 cameras” spaced 3 inches apart traveling down the length of the vehicle.
Fig. 16 illustrates the system setup, in which an array of cameras is housed in a platform. When a car drives over the platform, several mosaics of the underside of the car, with different oblique angles, are created. The mosaics can then be viewed by an inspector to thoroughly examine the underside of the vehicle from different angles. Fig. 17 shows such a mosaic covering the full under-body of a vehicle, generated from a 1D array of 13 cameras spaced 3 inches apart traveling down the length of the vehicle, taking pictures every 3 inches. This is equivalent to a stationary 1D array of cameras and a moving vehicle. The 1D array of 13 cameras is simulated by laterally shifting the real experimental set-up of 4 side-by-side cameras spaced 3 inches apart. Fig. 18 shows two more examples, with the array of 4 side-by-side cameras, under more general motion simulating the turning, backing-up and stopping-then-starting of the vehicle. The 2D parallel mosaics are generated in two steps. First, a lateral mosaic strip is generated from the images captured by the 1D array of cameras at each position of the camera array. Then, the sequence of mosaic strips is stitched together in the direction of the vehicle's motion to generate the full 2D mosaic.
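The two-step process can be sketched as follows. This is a deliberate simplification that butts central image strips together instead of performing the paper's ray interpolation, and the array layout and function names are our own assumptions:

```python
import numpy as np

def lateral_strip(frames, strip_h):
    """Step 1: build one lateral mosaic strip from the 1D camera array.

    frames : list of HxWx3 images, one per camera at a single array position,
             with image rows advancing along the vehicle's motion direction.
    Takes the central strip_h rows of each (rectified) frame and places the
    per-camera bands side by side along the lateral axis.
    """
    H, W, _ = frames[0].shape
    r0 = (H - strip_h) // 2
    return np.concatenate([f[r0:r0 + strip_h, :] for f in frames], axis=1)

def parallel_mosaic_2d(capture, strip_h=8):
    """Step 2: stitch the strips along the vehicle's direction of motion.

    capture : list over time of per-camera frame lists; each time step
              contributes one lateral strip to the full 2D parallel mosaic.
    """
    strips = [lateral_strip(frames, strip_h) for frames in capture]
    return np.concatenate(strips, axis=0)
```

In the actual system the seams between cameras and between time steps must be handled by ray interpolation, since simple butting leaves visible parallax discontinuities at these short viewing distances.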
Fig. 18. 2D parallel mosaics from 4 cameras spaced 3 inches apart traveling down the length of the vehicle: (a) as the vehicle turns sharply; and (b) as it stops and backs up before starting forward again.
Fig. 19b shows one of five mosaics, each with a different oblique view, generated from a 130-frame video sequence (sample video frames are shown in Fig. 19a). Different "occluded" regions under a pipe in the center can be observed by switching among the mosaics in the mosaic-based rendering results (Fig. 20). A PPT demo of these five oblique parallel views of the mosaics can be found at [18]. More results on 2D parallel-projection mosaics can be found at [14].
In the case of the 1D camera array, the fixed cameras were pre-calibrated, and the geometric and photometric distortions of these wide-FOV cameras were corrected. However, challenges remain since (1) the distance between cameras is large compared to the very short viewing distance to the bottom of the car; and (2) without the assistance of GPS/INS for pose estimation, the car's motion must be determined by other means, e.g., by tracking line features on the car. The proposed ray interpolation approach needs to take these two factors into consideration.
Fig. 19. Under-vehicle inspection: (a) four frames from a 130-frame video sequence with image size 611×447; (b) one mosaic of the stereo pair.
VII. CONCLUSIONS
This paper presents an approach to fusing images from many video cameras, or from a moving video camera with external orientation data, into a few mosaiced images with oblique parallel projections. In both cases, a virtual 2D array of cameras with overlapping FOVs is formed to generate coverage of the entire scene (or object). The proposed representation provides a wide FOV, preserves extensive 3D information, and represents occlusions. It can be used both as an advanced video interface for surveillance and as a pre-processing step for 3D reconstruction.
We present several practical cases in which 2D parallel-projection mosaics can be generated, and discuss related research issues in generating and using the parallel mosaics. In particular, we present a general ray interpolation approach for parallel-projection mosaic generation and discuss some practical issues in generating the mosaics. A mosaic-based 3D rendering method that requires almost no computation allows very effective 3D rendering of various complicated visual scenes, from forestry scenes to urban scenes, over various viewing ranges. Experimental results are given for three important applications: aerial video surveillance, ground mobile robot navigation, and under-vehicle inspection.
Fig. 20. Mosaic-based vehicle inspection: rendering snapshots with stereoscopic viewing capability
ACKNOWLEDGMENT
This work is supported by the National Science Foundation (NSF) under Award EIA-9726401, the Air Force Research Lab (AFRL) under Grants FA8650-05-1-1853 and F33615-03-1-63-83, the Army Research Office (ARO) under Award No. W911NF-05-1-0011, and by funding from the New York Institute for Advanced Studies and from Atlantic Coast Technologies, Inc.
REFERENCES
[1] J Y Zheng and S Tsuji, Panoramic representation for route recognition by a mobile robot, International Journal of Computer Vision, 9(1), 1992: 55-76
[2] S Peleg, B Rousso, A Rav-Acha, A Zomet, Mosaicing on adaptive manifolds, IEEE Trans. PAMI, 22(10), Oct 2000: 1144-1154.
[3] USGS DEM, http://data.geocomm.com/dem/
[4] S Peleg, M Ben-Ezra and Y Pritch, OmniStereo: panoramic stereo imaging, IEEE Trans. PAMI, March 2001:279-290.
[5] H-Y Shum and R Szeliski, Stereo reconstruction from multiperspective panoramas, ICCV’99: 14-21.
[6] S Baker, R Szeliski and P Anandan, A layered approach to stereo reconstruction. CVPR'98: 434-441
[7] J Shade, S Gortler, L He and R Szeliski, Layered depth images, SIGGRAPH'98: 231-242
[8] Z Zhu and A R Hanson, LAMP: 3D Layered, Adaptive-resolution and Multi-perspective Panorama - a New Scene Representation, Computer Vision and Image Understanding, 96(3), Dec 2004: 294-326.
[9] A Zomet, D Feldman, S Peleg, D Weinshall, Mosaicing new views: crossed-slits projection, IEEE Trans. PAMI 25(6), June 2003.
[10] J Chai and H-Y Shum, Parallel projections for stereo reconstruction, CVPR'00: II 493-500.
[11] Z Zhu, E M Riseman, A Hanson, Generalized Parallel-Perspective Stereo Mosaics from Airborne Videos, IEEE Trans. PAMI, 26(2), Feb 2004:226-237.
[12] Z Zhu, A R Hanson, H Schultz and E M Riseman, Generation and error characteristics of parallel-perspective stereo mosaics from real video. In Video Registration, M. Shah and R. Kumar (Eds.), Kluwer, 2003: 72-105.
[13] P Dickson, J Li, Z Zhu, A R Hanson, E M Riseman, H Sabrin, H Schultz and G Whitten, Mosaic generation for under-vehicle inspection. WACV’02: 251-256
[14] http://vis-www.cs.umass.edu/projects/uvis/index.html
[15] http://www-cs.engr.ccny.cuny.edu/~zhu/StereoMosaic.html
[16] H Schultz, Terrain reconstruction from widely separated images, SPIE 2486, April 1995: 113-123.
[17] http://www-cs.engr.ccny.cuny.edu/~zhu/CampusVirtualFly.avi
[18] http://www-cs.engr.ccny.cuny.edu/~zhu/mosaic4uvis.html
[19] Z. Zhu, H. Tang, B. Shen, G. Wolberg, 3D and Moving Target Extraction from Dynamic Pushbroom Stereo Mosaics, IEEE Workshop on Advanced 3D Imaging for Safety and Security, June 25, 2005, San Diego, CA, USA
[20] Z. Zhu, E. M. Riseman, A. R. Hanson and H. Schultz, An Efficient Method for Geo-Referenced Video Mosaicing for Environmental Monitoring. Machine Vision Applications Journal, Springer-Verlag, 16(4): 203-216, 2005
[21] C. C. Slama (Ed.), Manual of Photogrammetry, Fourth Edition, American Society of Photogrammetry, 1980
[22] http://www-cs.engr.ccny.cuny.edu/~zhu/Multiview/indoor1Render.avi
Zhigang Zhu received his B.E., M.E. and Ph.D. degrees, all in computer science, from Tsinghua University, Beijing, China, in 1988, 1991 and 1997, respectively. He is currently an Associate Professor in the Department of Computer Science at the City College of the City University of New York, where he directs the City College Visual Computing Laboratory (CcvcL). Previously he was an Associate Professor at Tsinghua University and a Senior Research Fellow at the University of Massachusetts, Amherst. His research interests include 3D computer vision, human-computer interaction (HCI), virtual/augmented reality, video representation, and various applications in education, environment, robotics, surveillance and transportation. He has published over 100 technical papers in these fields. Dr. Zhu received the Science and Technology Achievement Award (second prize) from the Ministry of Electronic Industry, China, in 1996, and the C. C. Lin Applied Mathematics Award (first prize) from Tsinghua University in 1997. His Ph.D. thesis, "On Environment Modeling for Visual Navigation," was selected in 1999 as one of the top 100 dissertations in China, and a book based on his Ph.D. thesis was published by China Higher Education Press in December 2001. He was a recipient of the CUNY Certificate of Recognition "Salute to Scholars" Award in both 2004 and 2005. He is a senior member of the IEEE and a member of the ACM.
Allen R. Hanson received his B.S. degree from Clarkson College of
Technology in 1964 and his M.S. and Ph.D. degrees in Electrical Engineering
from Cornell University in 1966 and 1969, respectively. He joined the
Computer Science Department at the University of Massachusetts as an Associate Professor in 1981 and has been a full Professor since 1989. Professor Hanson has conducted
research in computer vision, artificial intelligence, learning, and pattern recognition, and has over 200
publications. He is Co-Director of the Computer Vision Laboratory and has a diverse range of recent
research including aerial digital video analysis for environmental science, three-dimensional terrain
reconstruction, distributed sensor networks, motion analysis and tracking, mobile robot navigation, under-
vehicle inspection for security applications, object recognition, image information retrieval, and
technology for the aged. He has served in various capacities on most of the major computer vision conferences and is a member of the IEEE and the ACM.