SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System with Stitching


Nagoya University Parallel & Distributed Systems Lab.

Advanced Course on Pattern and Video Information Processing (May 6, 2016)

SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System with Stitching

Yuki Kitsukawa ([email protected])

Edahiro & Kato Lab.

Graduate School of Information Science

Nagoya University

1

Nagoya University Parallel & Distributed Systems Lab.

About Paper

• Title
  SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System with Stitching

• Authors
  Xiangyun Meng, Wei Wang, Ben Leong (National University of Singapore)

• Conference
  Proceedings of the 23rd ACM International Conference on Multimedia, 2015

URL - https://www.comp.nus.edu.sg/~bleong/publications/acmmm15-skystitch.pdf

2

Nagoya University Parallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

3

Nagoya University Parallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

4

Nagoya University Parallel & Distributed Systems Lab.

Introduction

Applications of UAV-based surveillance systems:

✓ firefighting
✓ search and rescue
✓ border monitoring

To cover a large target area…

Multiple UAVs are used to capture videos and put them together (stitch) consistently.

5

SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System with Stitching

Xiangyun Meng, Wei Wang and Ben Leong
School of Computing, National University of Singapore
[email protected], [email protected], [email protected]

ABSTRACT

Recent advances in unmanned aerial vehicle (UAV) technologies have made it possible to deploy an aerial video surveillance system to provide an unprecedented aerial perspective for ground monitoring in real time. Multiple UAVs would be required to cover a large target area, and it is difficult for users to visualize the overall situation if they were to receive multiple disjoint video streams. To address this problem, we designed and implemented SkyStitch, a multiple-UAV video surveillance system that provides a single and panoramic video stream to its users by stitching together multiple aerial video streams. SkyStitch addresses two key design challenges: (i) the high computational cost of stitching and (ii) the difficulty of ensuring good stitching quality under dynamic conditions. To improve the speed and quality of video stitching, we incorporate several practical techniques like distributed feature extraction to reduce workload at the ground station, the use of hints from the flight controller to improve stitching efficiency and a Kalman filter-based state estimation model to mitigate jerkiness. Our results show that SkyStitch can achieve a stitching rate that is 4 times faster than existing state-of-the-art methods and also improve perceptual stitching quality. We also show that SkyStitch can be easily implemented using commercial off-the-shelf hardware.

Categories and Subject Descriptors

I.4 [Image processing and computer vision]: [Applications]; I.2.9 [Robotics]: [Autonomous vehicles]

Keywords

UAV; stitching; real-time

1. INTRODUCTION

The proliferation of commodity UAV technologies like quadcopters has made it possible to deploy aerial video surveillance systems to achieve an unprecedented perspective when monitoring a target area from the air. UAV-based surveillance systems have a broad range of applications including firefighting [2], search and rescue [3], and border monitoring [8].


Figure 1: Illustration of SkyStitch. Each quadcopter transmits a live video stream to the ground station, which stitches them together to provide a single composite view of the target area.

For example, a UAV could be deployed to hover over a bushfire to transmit a live video feed to firefighters using wireless links to allow them to appraise the situation from an aerial perspective. However, because on-board cameras have only a limited field of view, multiple UAVs would be required to cover a large target area. If the firefighters were to receive multiple disjoint video streams from each UAV, it would be difficult for them to put together a consistent view of the situation on the ground.

To address this problem, we designed and implemented SkyStitch, a quadcopter-based HD video surveillance system that incorporates fast video stitching. With SkyStitch, users on the ground would receive a single composite live video feed produced from the video streams from multiple UAVs. This is illustrated in Figure 1, where an SOS sign is marked on the ground by victims trapped in a bushfire.

Although the high-level idea of SkyStitch is relatively straightforward, there are two practical design challenges. The first challenge is that image stitching inherently has a very high computational cost. Existing stitching algorithms in both industry and academia [5, 26, 13] were designed for better stitching quality rather than shorter stitching time. Although some algorithms can achieve shorter stitching times [6, 27, 31], they impose an additional requirement that the cameras be in fixed relative positions, which is infeasible in our operating scenario as the cameras are mounted on moving quadcopters. The other challenge is that existing image stitching algorithms are designed for static images, and not video. We found that even though the individual frames produced by image stitching algorithms might look fine, the video obtained by naively stitching them together would usually suffer from perspective distortion, perspective jerkiness and frame drops.

To address these challenges, we made a few key observations about our unique operating environment and developed several practical techniques to improve both stitching efficiency and quality. One observation is that, although each UAV has only a limited payload, …

Illustration of SkyStitch (UAV: Unmanned Aerial Vehicle)

http://www.robotshop.com/en/550mm-rtf-quadcopter-uav.html

Nagoya University Parallel & Distributed Systems Lab.

Introduction

SkyStitch - a quadcopter-based HD video stitching system for surveillance

Challenges

1. computational cost

2. video stitching (distortion, frame drops, moving objects…)

Solutions

1. Reduce workload of the ground station by offloading pre-processing to the UAVs

2. Exploit flight information to speed up feature matching

3. GPU-based RANSAC implementation

4. Utilize optical flow and Kalman filtering

5. Closing loops when stitching multiple video sources

6

Nagoya University Parallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

7

Nagoya University Parallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

8

Nagoya University Parallel & Distributed Systems Lab.

Real-time Video Stitching

Image stitching – finding the transformation (H: homography matrix) between two images

Estimate H using a feature-based method:

i. feature extraction
ii. feature matching
iii. robust homography estimation

9

[x′ y′ 1]ᵀ ~ H · [x y 1]ᵀ

…each on-board computer has a fair amount of computational power. In SkyStitch, we divide the video stitching process into several tasks and offload some pre-processing to the UAVs. In this way, we can improve the speed of stitching by reducing the workload of the ground station, which would be a potential bottleneck when there are a large number of UAVs. Another observation is that we can exploit the hints from instantaneous flight information like UAV attitude and GPS location from the on-board flight controller to greatly improve the efficiency of stitching and rectify perspective distortion. Our third observation is that image stitching and optical flow have complementary properties in terms of long-term stability and short-term smoothness. Therefore, a Kalman filter-based state estimation model can be applied by incorporating optical flow information to significantly reduce perspective jerkiness.

We deployed SkyStitch on a pair of low-cost self-assembled quadcopters and evaluated its performance for two ground scenarios with very different feature distributions. Our experiments showed that SkyStitch achieves a shorter stitching time than existing methods with GPU acceleration [26], e.g., SkyStitch is up to 4 times faster for 12 concurrent aerial video sources. In other words, for a fixed frame rate, SkyStitch can support many more video sources and thus cover a much larger target area in practice. For example, at a stitching rate of 20 frames per second (fps), SkyStitch is able to stitch videos from 12 quadcopters while existing methods can only support three with the same hardware. In addition, SkyStitch is able to achieve better stitching quality by rectifying perspective distortion, recovering stitching failures and mitigating video jerkiness. Furthermore, SkyStitch can be easily implemented using commercial off-the-shelf hardware, thereby reducing the cost of practical deployment.

The rest of this paper is organized as follows. In Section 2, we present the prior work related to SkyStitch. In Section 3, we discuss the design details of SkyStitch, followed by its implementation in Section 4. In Section 5, we present the performance evaluation of SkyStitch. Finally, we conclude this paper in Section 6.

2. RELATED WORK

Image stitching has been widely studied in the literature. Most available stitching algorithms are focused on improving stitching quality without considering stitching time [26, 16, 13]. Recently, image stitching algorithms have been applied in UAV-based incremental image mosaicking [32]. Basically, a UAV flies over a target area along a pre-planned route and continuously takes images, which are transmitted to a ground station and incrementally stitched to produce a large image of the target area. Unlike SkyStitch, such applications do not have a stringent requirement on stitching rate. Wischounig et al. improved the responsiveness of mosaicking by first transmitting meta data and low-resolution images in the air, and only transmitting the high-resolution images at a later stage, i.e., after landing [30]. There are also commercial systems for UAV-based image mosaicking. The company NAVSYS claims that its system can achieve a stitching rate of one image per second [12].

There are several existing video stitching algorithms that can achieve fast stitching rates, but most of them only work for a static camera array like the one used by Google Street View [10]. Tennoe et al. developed a real-time video stitching system using four static HD cameras installed in a sports stadium [27]. The key idea of their design is to pipeline the stitching process and implement it in GPU, and it is able to achieve a stitching rate of 30 fps. Adam et al. utilized belief propagation to obtain depth information of images captured on a pair of fixed cameras [9]. Their GPU-based implementation is able to achieve a rate of 37 fps when stitching two 1600×1200 images, but it requires a calibration process to

Figure 2: Overview of image stitching (1. feature extraction, 2. feature matching, 3. estimation of H, 4. compositing; steps 1–3 make up image alignment).

compute the overlapping region of the camera views. Another GPU-based method relied on a specific photogrammetric procedure to find the cameras' orientation and translation relative to a common coordinate system, and it could support a stitching rate up to 360 fps when stitching four 1360×1024 images [15]. A CPU-based method of video stitching can be found in [31]. Although these methods achieve fast stitching rates, they assume that the cameras are in a static position and require system calibration and are thus not applicable to stitching videos from highly dynamic UAVs.

As smartphones become ubiquitous, some researchers have tried in recent times to improve viewers' viewing experience by managing video streams from multiple smartphones. Kaheel et al. proposed a system called Mobicast which could stitch several 352×288 video streams from smartphones that captured the same event [22]. While the frame rate of Mobicast was able to meet real-time requirements, its stitching quality in real experiments was generally poor, with a stitching failure rate of some 60%. El-Saban et al. improved the frame rate of a similar system by considering the temporal correlation of frames [17]. Since most smartphones are equipped with various sensors, we believe that the methods proposed in this paper could potentially also be applied to improve the performance of smartphone-based video stitching systems.

3. REAL-TIME VIDEO STITCHING

Image stitching involves the alignment and blending of a pair of images with some overlap into a single seamless image. Video stitching is effectively the stitching of a series of image pairs (i.e., video frame pairs). Image stitching is computationally expensive and it takes several seconds to stitch a pair of HD images using Hugin [5], a popular image stitching software, on a workstation with a 3.4 GHz CPU. The challenge of real-time video stitching is to ensure that we can stitch a set of video frames within the required video frame interval. In this paper, we will use image and video frame interchangeably. In this section, we describe the steps required in image stitching and explain how SkyStitch improves the efficiency and quality of video stitching.

The key task in image stitching is to find the transformation between two images. Assume the ground is a planar surface, and let each pixel location in an aerial image be represented by a homogeneous coordinate (x, y, 1). Then the pixel locations of the overlapping area in the two images are related by

[x′ y′ 1]ᵀ ∼ H · [x y 1]ᵀ   (1)

where H is a 3×3 homography matrix for transformation. In order to align two images, we need to estimate H.

The most popular method for H estimation is the feature-based method illustrated in Figure 2. Features refer to the unique parts in an image, e.g., corners are often considered as features. A feature is usually represented using a high-dimensional vector, called a feature descriptor, which encodes feature information like gradient and orientation histogram in the neighborhood. A feature-based …

Overview of image stitching

Nagoya University Parallel & Distributed Systems Lab.

Offloading Feature Extraction

Traditional stitching systems – transmit only video frames

Feature extraction is performed on the ground station.

It would become a bottleneck as the number of UAVs increases.

SkyStitch – transmits video frames and the corresponding features

Feature extraction is performed on the UAVs*.

* Extracting 1,000 features from a 1280×1024 frame on an NVIDIA Jetson TK1 (192 CUDA cores) takes less than 35 ms.

10
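A rough sketch of the UAV-side offloading step is shown below; the message layout, JPEG re-encoding and ORB detector are assumptions for illustration and not SkyStitch's actual protocol.

```python
# Sketch of on-board pre-processing: extract features on the UAV and send the
# keypoints and descriptors together with the encoded frame and flight status.
# The packet layout here is hypothetical.
import pickle
import cv2

orb = cv2.ORB_create(nfeatures=1000)  # roughly 1,000 features per frame

def build_uav_packet(frame_bgr, flight_status):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    ok, jpeg = cv2.imencode(".jpg", frame_bgr)  # stand-in for the real video codec
    return pickle.dumps({
        "frame": jpeg.tobytes(),
        "keypoints": [(kp.pt, kp.size, kp.angle) for kp in keypoints],
        "descriptors": descriptors,
        "flight_status": flight_status,  # attitude, GPS, altitude from the flight controller
    })
```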

Nagoya University Parallel & Distributed Systems Lab.

Exploiting Flight Status Information

Difficulties in achieving fast and accurate stitching:

1. perspective distortion due to wind turbulence

→ Orthorectification of Perspective Distortion

2. time-consuming feature matching between two images

ex) two images with 1,000 features each – 1,000,000 combinations!

→ Search Region Reduction for Feature Matching

11

Nagoya University Parallel & Distributed Systems Lab.

Orthorectification of Perspective Distortion

Orthorectification – transform the image so that it looks as if the camera is pointed vertically downwards.

✓ Transformation matrix R – rotates the downward camera direction to the tilted direction (roll θ_r, pitch θ_p, yaw θ_y):

R = [1, 0, 0; 0, cos θ_r, sin θ_r; 0, −sin θ_r, cos θ_r] · [cos θ_p, 0, −sin θ_p; 0, 1, 0; sin θ_p, 0, cos θ_p] · [cos θ_y, sin θ_y, 0; −sin θ_y, cos θ_y, 0; 0, 0, 1]

✓ Orthorectification matrix

M = K B R⁻¹ B⁻¹ K⁻¹

K: transforms pixels into world coordinates (camera intrinsic matrix)

B: transforms the camera frame into the flight-controller frame

Orthorectified coordinates can be expressed by

[x′ y′ z′]ᵀ ~ M · [x y 1]ᵀ

12
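A small numpy sketch of the orthorectification step above; the intrinsic matrix K and camera-to-flight-controller matrix B are placeholder values, since the real ones come from calibration and the mounting geometry.

```python
# Build R from the flight controller's roll/pitch/yaw, form M = K B R^-1 B^-1 K^-1,
# and warp the frame so it appears to be taken with the camera pointing straight down.
import numpy as np
import cv2

K = np.array([[800.0, 0.0, 640.0],   # assumed camera intrinsics
              [0.0, 800.0, 512.0],
              [0.0, 0.0, 1.0]])
B = np.eye(3)                        # assumed camera-to-flight-controller alignment

def rotation_from_attitude(roll, pitch, yaw):
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, sr], [0, -sr, cr]])
    Ry = np.array([[cp, 0, -sp], [0, 1, 0], [sp, 0, cp]])
    Rz = np.array([[cy, sy, 0], [-sy, cy, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def orthorectification_matrix(roll, pitch, yaw):
    R = rotation_from_attitude(roll, pitch, yaw)
    return K @ B @ np.linalg.inv(R) @ np.linalg.inv(B) @ np.linalg.inv(K)

# M maps a pixel (x, y, 1) to orthorectified coordinates (x', y', z'); the whole
# frame can be rectified with cv2.warpPerspective(frame, M, (width, height)).
```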

Figure 3: Flight attitude of a quadcopter: roll, pitch and yaw (with heading).

Figure 4: Overlap determination using flight status information. (a) Mapping a pixel location from one image to another (pixel coordinates after orthorectification, aligned to north). (b) Merging coordinate systems using the GPS difference between the two cameras.

Then the final world coordinates (x_w, y_w, z_w) can be obtained by

x_w = s (x′/z′ − W/2),  y_w = s (y′/z′ − H/2),  z_w = h   (6)

In other words, each pixel of an image is mapped to a point (x_w, y_w) on the ground, with the center of the camera sensor as the coordinate origin. The two world coordinate systems can then be correlated using the difference in the GPS locations. With such information, we only need to search in a small local neighborhood for the matched feature. Because the area of the local neighborhood is often much smaller than the whole image, we can significantly reduce the search time of feature matching as compared with the traditional method of searching the entire image for the target feature.

This simple idea is illustrated in Figure 5. P is the location in Image 2 that we map for a feature point from Image 1, and r is the search radius for the matched feature. Since an image covers a finite 2D space, a grid indexing approach is preferred. We divide Image 2 into 50 × 40 bins. 50 × 40 was chosen because it is proportional to the aspect ratio of the image and it yields bins that are small enough. We would use more bins for a bigger image. If a bin is within a distance r from P (i.e., the colored bins), all the features in the bin will be considered as candidates for feature matching.

It remains for us to determine an appropriate value for the search radius r. Fortunately, the simple approach of setting a fixed and conservative value for r based on empirical experiments seems to be sufficient. In particular, we found that r = 200 pixels was sufficient in our implementation. In Section 3.4, we show that r can be further reduced if we estimate and compensate for GPS errors.
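A sketch of the grid-indexed candidate search described in the two paragraphs above (50 × 40 bins, r = 200 pixels); the feature layout is an assumption for illustration.

```python
# Hash Image 2's features into 50x40 bins, then gather candidates only from bins
# that lie within the search radius r of the mapped location P.
import math
from collections import defaultdict

BINS_X, BINS_Y = 50, 40
SEARCH_RADIUS = 200.0  # pixels

def build_grid(features, width, height):
    """features: iterable of (x, y, descriptor) tuples for Image 2."""
    grid = defaultdict(list)
    for x, y, desc in features:
        bx = min(int(x * BINS_X / width), BINS_X - 1)
        by = min(int(y * BINS_Y / height), BINS_Y - 1)
        grid[(bx, by)].append((x, y, desc))
    return grid

def candidates_near(grid, px, py, width, height, r=SEARCH_RADIUS):
    """Collect features from every bin whose center lies within r of P = (px, py),
    plus half a bin diagonal so border bins are not missed."""
    bw, bh = width / BINS_X, height / BINS_Y
    margin = 0.5 * math.hypot(bw, bh)
    out = []
    for (bx, by), feats in grid.items():
        cx, cy = (bx + 0.5) * bw, (by + 0.5) * bh
        if math.hypot(cx - px, cy - py) <= r + margin:
            out.extend(feats)
    return out
```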

3.3 Optimizing RANSAC

The step after feature matching is to use the standard RANSAC method [18] to remove outliers. RANSAC arbitrarily selects four matched features to compute a homography matrix H, applies this estimated H on all the matched features, repeats it for N iterations and finds the one with the most inliers. In challenging cases where the inlier ratio is low, we often set N to a large value (e.g., 500) in the hope that at least one subset of the four matched features would not contain any outliers.

While a large N improves RANSAC's likelihood of finding the correct homography, it also increases its computation time. For example, it takes about 11 ms to perform a 512-iteration RANSAC on a computer with a 3.4 GHz Intel CPU. This means that it is impossible to stitch more than 4 videos at 30 fps. In order to speed up RANSAC, we noticed that the traditional Singular Value Decomposition (SVD) method is unnecessary for 4-point homography and can be replaced with Gauss-Jordan elimination. The Gauss-Jordan method runs efficiently on GPU due to its simple and branchless structure, whereas an SVD implementation is usually much more complicated and GPU-unfriendly. Finally, we implemented an OpenCL-based RANSAC homography estimator that can compute a 512-iteration RANSAC in 0.6 ms.

To further improve RANSAC performance when the number of image pairs is large, we maximize parallelism and minimize IO overhead by performing RANSAC on all image pairs in one pass (Figure 6). First, candidate homographies of all image pairs are computed in a single kernel execution. The candidate homographies are then routed to corresponding image pairs for reprojection error computation and inlier mask generation. Finally, the scores and inlier masks of all candidate homographies are downloaded to main memory. Our RANSAC pipeline has only four CPU-GPU memory transactions and the amount of parallelism at each stage is much higher than performing RANSAC sequentially for each image pair.

In addition to execution speed, we also made improvements to the accuracy of RANSAC. Traditionally, each RANSAC iteration is scored based on the number of inliers. While this approach works well most of the time, we found that when the inliers are concentrated in a particular region, it may result in a bad estimation of H. To mitigate this problem, we modified the score by taking into consideration the distribution of inliers. Specifically, we use a score that is a function of both the number of inliers and the standard deviations of their locations:

score = N_inliers · σ_x · σ_y   (7)

This score can be efficiently computed on the GPU during inlier mask generation. The RANSAC iteration with the highest score under this new metric was used to provide the inliers for the final H estimation.

3.4 Optimizing Video Stitching Quality

Given the set of inliers identified by RANSAC, we can proceed to compute the homography matrix H using least squares and then composite the two images based on H [26]. However, even though the set of inliers appears to be the best that we can find, it is possible for the resulting H to be ill-formed and to produce a stitched image with some perceptual artefacts. In particular, part of a stitched video frame may suddenly "jerk" when compared to the video frames before and after it.

There are two common causes for such "stitching failures". One is that the overlapping area between the two images is too small, and the resulting matched features are not sufficient to produce a good estimate of H. Our experiments showed that the stitched images generally start exhibiting a certain degree of perceptual distortion when the number of inliers is smaller than 30. An example is shown in Figure 7(a). The other cause is that the original set of inliers selected by RANSAC are concentrated in a small region, and the resultant H becomes less stable, causing distant regions of images to be misaligned. An example is shown in Figure 7(b).

As human vision is sensitive to sudden scene change, we need to detect ill-formed H and prevent them from affecting the perceptual experience. In particular, we first determine whether there are sufficient inliers produced by H. If the number of inliers is smaller than 30 (which we had determined from empirical experiments), we consider it as a failure and discard the current frame.

Flight attitude of a quadcopter (roll, pitch and yaw) is obtained from the flight controller.

Nagoya University Parallel & Distributed Systems Lab.

Search Region Reduction for Feature Matching

To reduce feature matching time, the GPS location is exploited to find the mapping of pixel locations between two images.

1. Transform the pixel locations from pixel coordinates to a world coordinate system.

2. Two images in their world coordinate systems can be correlated using the difference in the GPS locations.

13

Figure 5: Search region reduction for feature matching. P is the feature point location transformed from Image 1; the bins of Image 2 within radius r of P are searched for the matched feature.

Figure 6: RANSAC GPU pipeline. Matched correspondences are uploaded from the CPU, candidate 4-point homographies are computed by Gauss-Jordan elimination, reprojection errors and inlier masks are evaluated per image pair, and the scores and inlier masks are downloaded from the GPU.

Figure 7: Examples of perceptual distortion (each blue cross represents a feature point). (a) Inaccurate rotation: notice the broken and bent touchline of the football field, below the running track. (b) Perspective distortion: notice that the left part of the image is erroneously warped and cannot be perfectly aligned with the right image.

When there are sufficient inliers, we also check whether H badly warps the images and introduces perspective distortion. Let θ be the angle between the two image planes after orthorectification using M. Ideally, θ should be zero, but in practice it could be a small non-zero value due to the errors in the readings of our on-board sensors. To find θ we do homography decomposition on H [24]:

H = R + t nᵀ   (8)

R is a 3×3 rotation matrix that transforms the camera from the reference image to the target image. We are concerned about the angle between the reference xy-plane and the target xy-plane, which is simply the angle between the reference z-axis and the target z-axis:

cos(θ) = r₃₃   (9)

By analyzing 300 stitched frames that do not exhibit noticeable perspective distortion (i.e., H is well-formed), we found that 90% of them have |θ| ≤ 5°. Thus, we use θ of 5° as the threshold to determine whether H is erroneously estimated.

3.4.1 Recovering from Failed Stitching

While discarding ill-formed H alleviates jerkiness in the stitched video frames, it also leads to frame drops and viewers may notice discontinuity in the video. To reduce frame drops, we use an alternative method to "predict" H when the original method fails, by utilizing the optical flow of adjacent frames.

This prediction method is illustrated in Figure 8. F is the 3×3 homography matrix of optical flow, which can be accurately and efficiently computed on each quadcopter as adjacent frames usually have large overlap. F_{n+1} is transmitted together with frame n+1 to the ground station, which will predict H_{n+1} as follows:

H_{n+1} = M_{2,n+1} F_{2,n+1} M_{2,n}⁻¹ H_n M_{1,n} F_{1,n+1}⁻¹ M_{1,n+1}⁻¹   (10)

where M is the orthorectification matrix described in Section 3.2.1 and H_n is the transformation matrix of the previous stitching which was not ill-formed.

Figure 8: Predicting H_{n+1} using optical flow when H_{n+1} is ill-formed (the per-camera optical-flow homographies F_{1,n+1} and F_{2,n+1} relate frame n to frame n+1 on Camera 1 and Camera 2).

Figure 9: Fusing information from both stitching and optical flow (the two homographies are decomposed; their rotations are fed into an MEKF as measurement and prediction; the filtered rotation R′ and a translation t′ re-solved from the matched features form the new homography H′).

We check that H_{n+1} is not ill-formed according to the criteria described in the previous section.

3.4.2 Fusing Stitching and Prediction

In cases where stitching succeeds, we have two homography solutions: one is from pairwise stitching and the other is from prediction by optical flow. Stitching is independently performed for each pair of frames so it does not take temporal information into consideration. Optical flow makes use of previous results to estimate the current homography but it could suffer from drift errors in the long term. The property of stitching is like an accelerometer: stable in the long term but noisy in the short term. Optical flow is like a gyroscope: accurate in the short term but prone to drift over time. This implies that sensor filtering techniques can be applied to achieve a better estimate by considering both stitching and prediction results.

The fusion procedure is shown in Figure 9. First, the two homography matrices are decomposed into rotations and translations (eq. 8). The rotation matrices R1 and R2 are converted to quaternions and fed into a Multiplicative Extended Kalman Filter (MEKF) [28]. The quaternion produced by stitching acts as a measurement vector and the quaternion produced by optical flow is used to generate a state transition matrix. We chose not to filter both the rotation and translation at the same time because the filtered homography would generate noticeable misalignment. Instead, the filtered rotation R′ is used to compute a least-squares solution of the translation t′ using matched features. Finally, the filtered rotation R′ and recomputed translation t′ are combined to form a new homography matrix H′.

3.4.3 GPS Drift Compensation

We explained in Section 3.2.2 that we can improve the speed of feature matching by focusing on a small search region with radius r. The purpose of r is to tolerate the potential errors when mapping pixels between two images since there are likely errors in the flight status information from on-board sensors, i.e., GPS drift errors.

We observed that when two images are successfully stitched, we will know the ground truth of the transformation. By comparing the ground truth with the transformation which relies on GPS locations …

Search region reduction for feature matching


Overlap determination using flight status information

Pixel coordinate (in the picture): (x′, y′, z′)

World coordinate (on the ground):

(x_w, y_w, z_w) = ( s (x′/z′ − W/2),  s (y′/z′ − H/2),  h )

s = 2h · tan(θ_h / 2) / W

s: number of meters represented by the side of a pixel

W×H: image resolution (in pixels)

θ_h: horizontal field of view of the camera

h: quadcopter altitude
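The mapping above can be written as a short sketch; altitude, field of view and resolution below are placeholder values.

```python
# Map an orthorectified pixel to ground coordinates centred on the camera; two
# frames are then related by the GPS difference between the quadcopters.
import numpy as np

W, H = 1280, 1024                 # image resolution in pixels (assumed)
THETA_H = np.radians(90.0)        # horizontal field of view (assumed)

def pixel_to_world(M, x, y, h):
    """M: orthorectification matrix, (x, y): pixel location, h: altitude in metres."""
    xp, yp, zp = M @ np.array([x, y, 1.0])
    s = 2.0 * h * np.tan(THETA_H / 2.0) / W   # metres per pixel side
    return np.array([s * (xp / zp - W / 2.0),
                     s * (yp / zp - H / 2.0),
                     h])

# The two world coordinate systems are then correlated by the GPS difference, e.g.
# p_in_world2 = pixel_to_world(M1, x, y, h1) + gps_offset_from_uav1_to_uav2
```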

Nagoya University Parallel & Distributed Systems Lab.

Optimizing RANSAC

RANSAC (Random Sample Consensus) – a method for removing outliers

i. Select four matched features to compute a homography matrix H
ii. Apply the estimated H on all the matched features
iii. Repeat i-ii for N iterations and find the H with the most inliers

A large N improves H but also increases computation time.

GPU-accelerated RANSAC

➢ Gauss-Jordan elimination instead of SVD (Singular Value Decomposition)

➢ Maximize parallelism and minimize IO

➢ Improve score calculation (use the distribution of inliers)

14
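The sketch below illustrates the two RANSAC changes listed above: each 4-point homography is solved by plain Gaussian elimination (np.linalg.solve, standing in for the GPU Gauss-Jordan kernel), and each hypothesis is scored by N_inliers · σ_x · σ_y rather than by the inlier count alone. This is an illustrative CPU version, not the OpenCL pipeline.

```python
import numpy as np

def homography_4pt(src, dst):
    """Solve H (with h33 = 1) from 4 point correspondences via Gaussian elimination."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))  # no SVD needed
    return np.append(h, 1.0).reshape(3, 3)

def ransac_homography(src, dst, iters=512, thresh=3.0, seed=0):
    """src, dst: (n, 2) arrays of matched feature locations."""
    rng = np.random.default_rng(seed)
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])
    best_H, best_score = None, -1.0
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, 4, replace=False)
        try:
            H = homography_4pt(src[idx], dst[idx])
        except np.linalg.LinAlgError:
            continue                        # degenerate sample
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() < 4:
            continue
        # modified score: inlier count weighted by the spread of inlier locations
        score = inliers.sum() * dst[inliers, 0].std() * dst[inliers, 1].std()
        if score > best_score:
            best_H, best_score, best_inliers = H, score, inliers
    return best_H, best_inliers
```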


RANSAC GPU Pipeline

Nagoya University Parallel & Distributed Systems Lab.

Optimizing Video Stitching Quality

Causes of stitching failures
✓ small overlapping area between the two images
✓ concentration of the inliers selected by RANSAC in a small region

How to detect an ill-formed H and prevent it from affecting the perceptual experience?

➢ Number of inliers produced by H

If insufficient (fewer than 30), discard the current frame.

➢ The angle θ between the two image planes

H = R + t nᵀ (R: 3×3 rotation matrix between the two images)

cos θ = r₃₃ (θ: angle between the reference and target z-axes)

If |θ| > 5°, H is considered erroneous.

15
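A sketch of this check, using cv2.decomposeHomographyMat as a stand-in for the decomposition cited in the paper; K is an assumed intrinsic matrix, and the smallest angle among the returned candidate rotations is used.

```python
# Reject H if it has too few inliers or tilts the (ideally parallel) orthorectified
# image planes by more than 5 degrees.
import numpy as np
import cv2

MIN_INLIERS = 30
MAX_ANGLE_DEG = 5.0

def homography_is_well_formed(H, num_inliers, K):
    if num_inliers < MIN_INLIERS:
        return False
    _, rotations, _, _ = cv2.decomposeHomographyMat(H, K)
    # cos(theta) = r33: angle between the reference and target z-axes
    angles = [np.degrees(np.arccos(np.clip(R[2, 2], -1.0, 1.0))) for R in rotations]
    return min(angles) <= MAX_ANGLE_DEG
```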


Example of perceptual distortion

Nagoya University Parallel & Distributed Systems Lab.

Optimizing Video Stitching Quality – Recovering from Failed Stitching

Prediction of H by utilizing the optical flow of adjacent frames, to reduce frame drops due to an ill-formed H:

H_{n+1} = M_{2,n+1} F_{2,n+1} M_{2,n}⁻¹ H_n M_{1,n} F_{1,n+1}⁻¹ M_{1,n+1}⁻¹

F: 3×3 homography matrix of optical flow

M: orthorectification matrix

H_n: transformation matrix of the previous stitching

16


Predict Hn+1 using optical flow when Hn+1 is ill-formed
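The prediction itself is just a chain of 3×3 matrix products; a small numpy sketch is shown below (matrix names follow the slide; how F is estimated on board is not shown).

```python
# H_{n+1} = M_{2,n+1} F_{2,n+1} M_{2,n}^-1 H_n M_{1,n} F_{1,n+1}^-1 M_{1,n+1}^-1
import numpy as np

def predict_homography(Hn, M1_n, M1_n1, M2_n, M2_n1, F1_n1, F2_n1):
    inv = np.linalg.inv
    return M2_n1 @ F2_n1 @ inv(M2_n) @ Hn @ M1_n @ inv(F1_n1) @ inv(M1_n1)
```

On each quadcopter, F could for instance be fitted from sparse optical flow between adjacent frames (e.g., cv2.calcOpticalFlowPyrLK followed by cv2.findHomography), since adjacent frames overlap heavily.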

Nagoya University Parallel & Distributed Systems Lab.

Optimizing Video Stitching Quality – Fusing Stitching and Prediction

Filtering techniques achieve a better estimate by considering both the stitching and the prediction results:

i. The two homography matrices are decomposed into rotations and translations.

ii. The rotation matrices (R1, R2) are converted to quaternions and fed into an MEKF (Multiplicative Extended Kalman Filter).

iii. A least-squares solution of the translation t′ is computed from the matched features, based on the filtered rotation R′.

iv. A new homography matrix H′ is formed by combining R′ and t′.

17

P

r

Feature point transformed from Image 1

Bins to be searched for the matched feature

Image 2

Figure 5: Search region reduction for fea-ture matching.

H

H

H

H

H

2 3 41

2 3 41

2 3 41

2 3 41

2 3 41

...

...

...

...

qpqpqpqpqp

qpqpqpqp

Uploadedfrom CPU

Uploadedfrom CPU

...

Reprojection

Error

Pair

Pair

ScoresInlier maskAll matchescorrespondences4−point

homographiesCandidate

Gauss Jordan

Elimination Pair

...

Pair

DownloadedDownloadedfrom GPU from GPU

...

...

...

i1, j1

ik, jk

i1, j1

ik, jkHn−1

Hn−1

H0

H0

Figure 6: RANSAC GPU Pipeline

(a) Inaccurate rotation. Notice thebroken and bent touchline of thefootball field, below the runningtrack.

(b) Perspective distortion. Noticethat the left part of the image iserroneously warped and cannot beperfectly aligned with the right im-age.

Figure 7: Examples of perceptual distortion. Each blue cross rep-resents a feature point.

When there are sufficient inliers, we also check whether H badlywarps the images and introduces perspective distortion. Let θ be theangle between the two image planes after orthorectification usingM. Ideally, θ should be zero, but in practice it could be a smallnon-zero value due to the errors in the readings of our on-boardsensors. To find θ we do homography decomposition on H [24]

H = R+ t nT (8)

R is a 3× 3 rotation matrix that transforms camera from the refer-ence image to the target image. We are concerned about the anglebetween the reference xy-plane and the target xy-plane, which issimply the angle between the reference z-axis and the target z-axis:

cos(θ) = r33 (9)

By analyzing 300 stitched frames that do not exhibit noticeable per-spective distortion (i.e., H is well-formed), we found that 90% ofthem have |θ| ≤ 5◦. Thus, we use θ of 5◦ as the threshold todetermine whether H is erroneously estimated.

3.4.1 Recovering from Failed Stitching
While discarding ill-formed H alleviates jerkiness in the stitched video frames, it also leads to frame drops and viewers may notice discontinuity in the video. To reduce frame drops, we use an alternative method to "predict" H when the original method fails, by utilizing the optical flow of adjacent frames.

This prediction method is illustrated in Figure 8. F is the 3×3 homography matrix of optical flow, which can be accurately and efficiently computed on each quadcopter as adjacent frames usually have large overlap. Fn+1 is transmitted together with frame n+1 to the ground station, which will predict Hn+1 as follows:

Hn+1 = M2,n+1 F2,n+1 M2,n⁻¹ Hn M1,n F1,n+1⁻¹ M1,n+1⁻¹   (10)

where M is the orthorectification matrix described in Section 3.2.1 and Hn is the transformation matrix of the previous stitching which

Figure 8: Predict Hn+1 using optical flow when Hn+1 is ill-formed (frames n and n+1 of cameras 1 and 2 are linked by the optical-flow homographies F1,n+1 and F2,n+1 and by the stitching homographies Hn and Hn+1).

Figure 9: Fusing information from both stitching and optical flow (H1 and H2 are decomposed into (R1, t1, n1) and (R2, t2, n2); the rotations enter the MEKF as measurement and prediction, the filtered R′ is used to solve for the translation t′ from the matched features, and R′ and t′ form the new homography H′).

was not ill-formed. We check that Hn+1 is not ill-formed according to the criteria described in the previous section.
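A one-function sketch of the prediction in eq. 10 (an assumed helper, not code from the paper; all inputs are 3×3 numpy arrays):

```python
import numpy as np

def predict_homography(Hn, M1_n, M1_next, M2_n, M2_next, F1_next, F2_next):
    # Hn+1 = M2,n+1 F2,n+1 M2,n^-1 Hn M1,n F1,n+1^-1 M1,n+1^-1   (eq. 10)
    inv = np.linalg.inv
    return (M2_next @ F2_next @ inv(M2_n) @ Hn
            @ M1_n @ inv(F1_next) @ inv(M1_next))
```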

3.4.2 Fusing Stitching and Prediction
In cases where stitching succeeds, we have two homography solutions: one is from pairwise stitching and the other is from prediction by optical flow. Stitching is independently performed for each pair of frames, so it does not take temporal information into consideration. Optical flow makes use of previous results to estimate the current homography but it could suffer from drift errors in the long term. The property of stitching is like an accelerometer: stable in the long term but noisy in the short term. Optical flow is like a gyroscope: accurate in the short term but prone to drift over time. This implies that sensor filtering techniques can be applied to achieve a better estimate by considering both stitching and prediction results.

The fusion procedure is shown in Figure 9. First, the two homography matrices are decomposed into rotations and translations (eq. 8). The rotation matrices R1 and R2 are converted to quaternions and fed into a Multiplicative Extended Kalman Filter (MEKF) [28]. The quaternion produced by stitching acts as a measurement vector and the quaternion produced by optical flow is used to generate a state transition matrix. We chose not to filter both the rotation and translation at the same time because the filtered homography would generate noticeable misalignment. Instead, the filtered rotation R′ is used to compute a least-squares solution of the translation t′ using matched features. Finally, the filtered rotation R′ and recomputed translation t′ are combined to form a new homography matrix H′.

3.4.3 GPS Drift Compensation
We explained in Section 3.2.2 that we can improve the speed of feature matching by focusing on a small search region with radius r. The purpose of r is to tolerate the potential errors when mapping pixels between two images, since there are likely errors in the flight status information from on-board sensors, i.e., GPS drift errors.

We observed that when two images are successfully stitched, we will know the ground truth of the transformation. By comparing the ground truth with the transformation which relies on GPS locations

Fusing information from both stitching and optical flow

Nagoya UniversityParallel & Distributed Systems Lab.

Optimizing Video Stitching Quality – GPS Drift Compensation

GPS drift error can be estimated by comparing the ground-truth transformation (known after a successful stitch) with the transformation derived from GPS locations

→ Feed this estimated GPS error to feature matching and reduce search region
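A minimal sketch of this bookkeeping, with illustrative constants (the initial radius and the safety margin below are assumptions, not values from the paper):

```python
R_CONSERVATIVE = 100.0   # assumed initial search radius in pixels
R_MARGIN = 5.0           # assumed safety margin in pixels

def update_search_radius(r, stitch_succeeded, gps_error_px):
    if not stitch_succeeded:
        return R_CONSERVATIVE            # reset to the conservative value
    # Subtract the GPS error estimated from the successful stitch, keeping a
    # small margin so the true match still falls inside the search region.
    return max(R_MARGIN, r - gps_error_px)
```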

18

Nagoya UniversityParallel & Distributed Systems Lab.

Multiple Video Sources

Stitching multiple video frames can be done by iteratively performing pairwise stitching based on a grid-like pattern.

ex) Stitching 6 images

Pairwise stitching <I1,I2>,<I2,I3>,<I3,I4>…<I6,I1>
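An illustrative sketch (not the authors' code) that chains the pairwise homographies around such a loop, ignoring the orthorectification matrices for brevity; the deviation of the composition from the identity is the loop-closure error that the loop-closing adjustment described later distributes over the pairs:

```python
import numpy as np

def loop_closure_error(homographies):
    """homographies: [H1, ..., Hk] for the pairs <I1,I2>, <I2,I3>, ..., <Ik,I1>."""
    loop = np.eye(3)
    for H in homographies:
        loop = H @ loop                  # accumulate <I1,I2>, then <I2,I3>, ...
    loop /= loop[2, 2]                   # homographies are defined up to scale
    return np.linalg.norm(loop - np.eye(3))
```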

19

Figure 10: A loop sequence of stitching 6 images (I1, I2, I3 in the top row and I6, I5, I4 in the bottom row).

Figure 11: Our self-assembled quadcopter.

(Section 3.2.2), we can estimate the GPS error. Subsequently, we feed this estimated GPS error to feature matching and reduce r by subtracting the GPS error for the next round of stitching. Experimental results show that r can be reduced to smaller than 30 pixels, which makes the search area less than 1/300 of the total image size. If stitching fails, we reset r to the original conservative value.

3.5 Multiple Video Sources
Stitching multiple video frames can be considered as iteratively performing pairwise stitching. One problem is to decide which image pairs can be stitched. Previous approaches perform pairwise feature matching for all the images to identify the pairs of overlapping images based on matched features [13]. Such exhaustive search is too costly for SkyStitch. Instead, we make use of available flight status information to identify overlapping images and find a sequence of images for stitching. If we assume the quadcopters hover in a grid-like pattern, a simple stitching sequence would be a loop that traverses all the images as shown in Figure 10.

While such loop-based stitching is simple, stitching errors will gradually accumulate along the loop. Consequently, there could be significant distortion between the first and last images, e.g., between images I1 and I6 in Figure 10. To address this problem, our solution is to perform adjustment on every pairwise stitching along the loop so that the distortion can be gradually corrected.

Suppose we want to stitch four images in Figure 10, according to the sequence I1, I2, I5 and I6. Let H1, H2, H3 and H4 be the homography matrices of image pairs ⟨I1, I2⟩, ⟨I2, I5⟩, ⟨I5, I6⟩ and ⟨I6, I1⟩ respectively. Ideally, we should have

M1 p ∼ H4 H3 H2 H1 M1 p

where p is a feature point in image I1 and M1 is the orthorectification matrix of I1. However, in practice, H4 H3 H2 H1 is often not an identity matrix (even after scaling) because of errors in the homography matrices. To make the above relationship hold, we can introduce a correction matrix H such that:

M1 p ∼ H H4 H3 H2 H1 M1 p

Then we decompose H into four "equal" components and distribute them among the original four homography matrices. One way to decompose H is RQ decomposition:

H = A R

Here A is an upper triangular matrix (affine) and R is a 3D rotation matrix. A can be further decomposed as follows:

A = T S D

where T is a 2D translation matrix, S is a scaling matrix and D is a shearing matrix along the x-axis. T, S and D can be taken to the nth real root, where n is the number of images in the loop. We can write H as

H = T′⁴ S′⁴ D′⁴ R′⁴

Since the reference image is well orthorectified, A and R have geometric meanings. They represent how a camera should be rotated

Table 1: Major components used in our quadcopter. The total weight (including components not listed) is 2.1 kilograms, and the total cost is about USD 1,200.
Flight controller: APM 2.6
GPS receiver: uBlox LEA-6H
Motor: SunnySky X3108S
Electronic speed controller: HobbyWing 30A OPTO
Propeller: APC 12×3.8
Airframe: HobbyKing X550
On-board computer: NVIDIA Jetson TK1
Camera: PointGrey BlackFly 14S2C
Lens: Computar H0514-MP2
WiFi adapter: Compex WLE350NX

and translated to match the reference camera frame. We move each element from H into H1, · · · , H4 to slightly adjust each camera's rotation and translation. For example, we move one R′ into H1 so that the rotational error is reduced by 1/4. In this way, we can partially correct the errors at each stitching step along the loop to mitigate the distortion between images I1 and I6. Note that for a 6-image stitching loop as in Figure 10, we can also perform a similar adjustment to correct the distortion between images I2 and I5.
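A sketch of how such a correction could be split, under stated assumptions: it uses scipy's RQ decomposition and rotation utilities, takes nth roots of the translation, scale, shear and rotation components for n = 4, and glosses over sign handling and degenerate cases. It illustrates the idea above and is not the authors' implementation:

```python
import numpy as np
from scipy.linalg import rq
from scipy.spatial.transform import Rotation

def split_correction(H, n=4):
    A, R = rq(H)                      # H = A R: A upper triangular, R orthogonal
    if np.linalg.det(R) < 0:          # force a proper rotation (det +1)
        A, R = -A, -R
    A = A / A[2, 2]                   # normalize the affine part (up to scale)
    sx, sy = A[0, 0], A[1, 1]
    shear = A[0, 1] / sx
    tx, ty = A[0, 2], A[1, 2]
    # nth roots of the translation, scaling and shear components ...
    T = np.array([[1, 0, tx / n], [0, 1, ty / n], [0, 0, 1.0]])
    S = np.diag([np.sign(sx) * abs(sx) ** (1 / n),
                 np.sign(sy) * abs(sy) ** (1 / n), 1.0])
    D = np.array([[1, shear / n, 0], [0, 1, 0], [0, 0, 1.0]])
    # ... and the nth root of the rotation (same axis, 1/n of the angle).
    Rn = Rotation.from_rotvec(Rotation.from_matrix(R).as_rotvec() / n).as_matrix()
    return [T, S, D, Rn]              # one factor to fold into each of H1..H4
```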

Comparison with Bundle Adjustment. The classical method to close loops is to do bundle adjustment [13]. We tested bundle adjustment in SkyStitch and noticed that it has a few limitations. First, bundle adjustment requires pairwise matching information to be available. This is not always possible because pairwise stitching may fail and have to be predicted by optical flow, in which case bundle adjustment is unusable. Second, bundle adjustment does not take any temporal information into consideration, so perspective jerkiness can be introduced between consecutive frames. Our loop closing approach works even when pairwise feature matching information is missing; there is very little jerkiness since cameras are smoothly and coherently adjusted. Even though we cannot prove that our approach will always reduce reprojection errors, we found that in practice, it introduces less perspective jerkiness and is much faster, while visually, it performs as well as bundle adjustment for small loops.

4. SYSTEM IMPLEMENTATION
The implementation of the SkyStitch system involves the integration of UAV design, computer vision and wireless communication. In this section, we describe the key implementation details of SkyStitch.

While there were commercial quadcopters available in the market, we chose to assemble our own quadcopter because it offered the greatest flexibility for system design and implementation. The components used in our quadcopter assembly are listed in Table 1, and our self-assembled quadcopter is shown in Figure 11. We built two identical quadcopters of the same design.

We used ArduPilot Mega (APM) 2.6 [1] as the flight controller. APM 2.6 has a built-in gyroscope, accelerometer and barometer, and both its hardware and software are open source. The on-board computer is a 120-gram NVIDIA Jetson TK1, featuring a 192-core GPU and a quad-core ARM Cortex A15 CPU. The GPU is dedicated to feature extraction, while the CPU is mainly used for data transmission and optical flow computation. The Jetson TK1 also has a built-in hardware video encoder, which we used for H.264 encoding. The Jetson TK1 board runs Ubuntu 14.04.

Each quadcopter is equipped with a PointGrey BlackFly camera, which has a 1/3 inch sensor with 1280×1024 resolution. The external lens we added to the camera has a 62.3° diagonal field of view.

A loop sequence of stitching 6 images.

Nagoya UniversityParallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

20

Nagoya UniversityParallel & Distributed Systems Lab.

System Implementation

21

Components Model

Flight controller APM 2.6

GPS receiver uBlox LEA-6H

Motor SunnySky X3108S

Electronic speed controller HobbyWing 30A OPTO

Propeller APC 12×3.8

Airframe HobbyKing X550

On-board computer NVIDIA Jetson TK1

Camera PointGrey BlackFly 14S2C

Lens Computar H0514-MP2

WiFi adapter Compex WLE350NX


Self-assembled quadcopter

Video Synchronization

If not synchronized…

Ghosting – a mobile object on the ground appears at different locations in the views from different cameras

→ Hardware-based video synchronization using GPS receiver
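An illustrative sketch of the pairing step such synchronization enables (not the authors' implementation): frames from the two streams are matched by their GPS-derived capture timestamps, and pairs whose timing gap could cause visible ghosting are rejected. The 10 ms threshold is an assumed value; the paper only reports that a 200 ms offset produces clear ghosting.

```python
def pair_frames(frames_a, frames_b, max_skew_s=0.010):
    """frames_*: lists of (timestamp_seconds, frame), sorted by timestamp."""
    pairs, j = [], 0
    for ts_a, frame_a in frames_a:
        # Advance j to the frame in stream B closest in time to ts_a.
        while (j + 1 < len(frames_b)
               and abs(frames_b[j + 1][0] - ts_a) < abs(frames_b[j][0] - ts_a)):
            j += 1
        ts_b, frame_b = frames_b[j]
        if abs(ts_b - ts_a) <= max_skew_s:
            pairs.append((frame_a, frame_b))
    return pairs
```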

Major components used in the quadcopter. Total weight: 2.1 kilograms, total cost: about USD 1,200

Nagoya UniversityParallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

22

Nagoya UniversityParallel & Distributed Systems Lab.

Evaluation

Venue

ü Grass field – rich in features

ü Running track – does not have many good features

Altitude – 20m

Benchmark for comparison - OpenCV (CPU and GPU) stitching algorithms

23

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Time – Execution Time of Individual Step

24

Figure 12: Execution time of each step of video stitching vs. the number of video sources, for SkyStitch and the CPU/GPU benchmarks: (a) feature extraction, (b) feature matching, (c) RANSAC. The number of features extracted from every frame is 1,000.

Figure 13: Stitching rate (in terms of frames per second) with respect to the number of video sources, for SkyStitch and the CPU/GPU benchmarks.

time of the remaining steps (i.e., feature matching, RANSAC and image compositing). Then the execution time per stitching is equal to max(T_extract, T_others) for SkyStitch and T_extract + T_others for the benchmarks.

Figure 13 shows that SkyStitch achieves a stitching rate up to 4 times faster than the GPU benchmark and up to 16 times faster than the CPU benchmark. The stitching rates of both benchmarks drop sharply as N increases due to the fact that all computations are performed at the ground station. SkyStitch, however, offloads feature extraction to the UAVs to greatly reduce the workload at the ground station. When N ≤ 10, SkyStitch's stitching rate is solely determined by the feature extraction time, which is around 28 fps. When N > 10, the bottleneck of SkyStitch shifts to the ground station and the stitching rate starts to drop. Nevertheless, we estimate that SkyStitch will be able to support a stitching rate of 22 fps when N = 12, while the two benchmarks could not maintain such a stitching rate when N > 3.
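A small sketch of this throughput model (the max/sum relationship is taken from the text above; the millisecond values in the example are assumed, purely for illustration):

```python
def stitching_rate_fps(t_extract_ms, t_others_ms, pipelined=True):
    # SkyStitch pipelines extraction (on the UAVs) with the remaining steps
    # (on the ground station); the benchmarks run everything serially.
    per_frame = max(t_extract_ms, t_others_ms) if pipelined else t_extract_ms + t_others_ms
    return 1000.0 / per_frame

# Example with assumed numbers: extraction ~36 ms (about 28 fps) dominates
# until the ground-station work grows past it.
print(stitching_rate_fps(36.0, 20.0, pipelined=True))    # ≈ 27.8 fps
print(stitching_rate_fps(36.0, 20.0, pipelined=False))   # ≈ 17.9 fps
```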

Impact of feature count and overlap percentage. In the above experiments, the number of extracted features is fixed at 1,000 and the overlap between two adjacent video sources is about 80%. We found that these two factors do not affect stitching time except the time of feature matching. Figure 14(a) shows the impact of feature count on the time of feature matching, for the case of two video sources. We can see that the feature matching time of both SkyStitch and the GPU benchmark increases with feature count, but the rate of increase for SkyStitch is slower. This is expected since the GPU benchmark requires pairwise comparison for all the features while SkyStitch only needs to search in a small area for the potential match.

The impact of overlap percentage on feature matching time is shown in Figure 14(b). Generally the time of SkyStitch increases with overlap since more features in the overlapping area need to be matched. The exception is at the overlap of 20%, where feature matching time does not follow the trend. Upon closer inspection, we found that it is because when the overlap is small there are more stitching failures, which lead to a larger search radius r (see Section 3.4.3).

Figure 14: Impact of (a) feature count and (b) overlap percentage on the execution time of feature matching, for SkyStitch and the GPU benchmark.

The feature matching time of the GPU benchmark appears to be independent of the overlapping area and is much longer than that of SkyStitch.

5.2 Stitching Quality
In this section, we show that SkyStitch produces high-quality stitching results. In particular, we evaluate the number of inliers, the stitching success rate, the degree of frame jerkiness and also the incidence of ghosting artefacts. Sample videos of stitching can be found in [4].

Number of inliers. The number of inliers is a basic measure of goodness for stitching algorithms. More inliers generally lead to a better H estimate. Figure 16 shows the number of inliers at different degrees of overlap. For both types of ground, SkyStitch generates more inliers than the benchmark. The reason is that the feature matching step in the benchmark uses pairwise comparison among all the features and the probability of false matching is high. In contrast, the feature matching in SkyStitch only searches in a small target area and thus it is less likely to produce false matches. From Figure 16 we can also see that the slope of Grass field is steeper than that of Running track. This is likely because these two types of ground have different feature distributions.

Success rate of stitching. In Section 3.4, we define the failure of stitching (or ill-formed H) based on the number of inliers and the angle θ. In SkyStitch, failed stitching can be potentially recovered using optical flow. We calculate the success rate of stitching for the above experiment, as shown in Figure 17. For both types of ground, we can see that SkyStitch has a higher success rate than the benchmark, because of the recovery using optical flow.

Frame jerkiness. Stitching every frame independently could introduce perspective jerkiness which degrades visual quality. To quantify perspective jerkiness and show how our approach can reduce jerkiness, we measure the rotation component in the homography matrix, which is directly related to camera perspective.

Execution time of each step of video stitching

SkyStitch: scalable to many video sources because feature extraction is offloaded to each quadcopter

Benchmarks: proportional to the # of video sources

×4 (×70 over CPU)

# of features: 1,000

Overlap: 80%

12 video sources (2×6 grid)

N > 7: the slope becomes steeper due to the 2×6 grid pattern

×18

×24

CPU-GPU gap is not large:
- the SVD solver does not run efficiently on the GPU
- many CPU-GPU memory transactions

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Time – Stitching Rate


25

Stitching rate (frames per second)

×4 over GPU, ×16 over CPU

Effect of offloading feature extraction to the UAVs

Execution time per stitching = max(T_extract, T_others) for SkyStitch; T_extract + T_others for the benchmarks

Bottleneck of SkyStitch shifts to the ground station

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Time – Impact of Feature Count and Overlap Percentage

26


Rate of increase is slower than the benchmark

Benchmark – pairwise comparison of all the features

SkyStitch – search in a small area

Because when the overlap is small, there are more stitching failures, which lead to a larger search radius

GPU: independent of the overlapping area

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Quality – Number of Inliers

27

Figure 15: Jerkiness of rotation components in computed homographies, shown as (a) roll, (b) pitch and (c) yaw angles (in degrees) over 250 frames, for "Only Stitching", "Only Optical Flow" and "Kalman Filtered".

Figure 16: Impact of overlap percentage on the number of inliers for (a) grass field and (b) running track, SkyStitch vs. benchmark. The number of features extracted from every frame is 1,000.

Figure 17: Impact of overlap percentage on success rate for (a) grass field and (b) running track, SkyStitch vs. benchmark.

To make things more intuitive, we convert the rotation matrix into the Euler angles yaw, pitch and roll. We stitched a pair of video sequences (40% overlap) in the Running track scene. Figure 15 shows the Euler angles extracted from the estimated homographies. Note that "Only stitching" produces very significant jerkiness in the pitch and roll angles. In contrast, the "Only optical flow" method is much smoother but suffers from drift, especially in the yaw and pitch angles. The "Kalman filtered" angles are not prone to either of these problems: they follow the results of "Only stitching" closely while mitigating most of the jerkiness. The perceptual difference can be seen from our demo videos [4].
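A small sketch of this measurement (an assumed helper, not the paper's code): the rotation part of each estimated homography is converted to yaw/pitch/roll, and jerkiness is summarized as the mean frame-to-frame change of those angles.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def euler_angles_deg(R):
    # yaw, pitch, roll from the rotation component of a homography
    return Rotation.from_matrix(R).as_euler("zyx", degrees=True)

def jerkiness(rotations):
    """Mean absolute frame-to-frame change of yaw/pitch/roll, in degrees."""
    angles = np.array([euler_angles_deg(R) for R in rotations])
    return np.mean(np.abs(np.diff(angles, axis=0)), axis=0)
```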

Multiple video sources. In Section 3.5, we proposed a loop closing method for multi-image stitching. Figure 18 illustrates the effect of alleviating the misalignment between the first and the last image during multi-image stitching. The six images were collected from the Running track scene. We can see in Figure 18(a) that there is a discernible misalignment along the stitching seam between the first and the last image. By applying the loop closing method discussed in Section 3.5, we were able to mitigate the misalignment so that viewers could hardly notice it (see Figure 18(b)). A sample video can be found in [4].

(a) No loop closing. (b) With loop closing.

Figure 18: Effect of loop closing for 6-image stitching.

(a) Unsynchronized frames. (b) Synchronized frames.

Figure 19: Effect of video synchronization on ghosting artefacts.

We follow a similar method to that in Section 3.4 to determine the success rate of multi-image stitching. Specifically, if any pairwise stitching is found to have an ill-formed H, we consider the multi-image stitching as a failure. Using the above trace with six video sources, we found that the success rate of SkyStitch is about 97%, while the success rate of the benchmark is only 32%.

Elimination of Ghosting Artefacts. In Section 4, we explained that our hardware-based method for video synchronization is important to ensure good stitching quality when there are mobile objects. Figure 19 illustrates the effectiveness of our synchronization method. The source images were taken from two quadcopters with hardware-based synchronization, while a man was jogging on the running track. Figure 19(a) is stitched from two video images that are 200 ms apart in capturing time. We can see a clear ghosting artefact around the stitching seam, i.e., two running men instead of one. In practice, such a ghosting artefact will affect the viewers' assessment of the actual situation on the ground. In Figure 19(b), the image is stitched from two video frames synchronized by GPS. We can see that our hardware-based synchronization method is sufficiently accurate to eliminate distracting ghosting artefacts.

6. CONCLUSION
In this paper, we present the design and implementation of SkyStitch, a multi-UAV-based video surveillance system that supports real-time video stitching. To improve the speed and quality of video stitching, we incorporated several practical and effective techniques

Impact of overlap percentage on the number of inliers

SkyStitch: only searches in a small target area and thus it is less likely to have false matching

Benchmark: pairwise comparison among all the features and the probability of false matching is high

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Quality – Success Rate of Stitching

28


Impact of overlap percentage on success rate

SkyStitch has a higher success rate than the benchmark, because failed stitching can be recovered using optical flow.

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Quality – Frame Jerkiness

29

Jerkiness of rotation components in computed homographies


Only stitching: produces very significant jerkiness in pitch and roll

Only optical flow: much smoother but suffers from drift especially in yaw and pitch

Kalman filtered: angles are not prone to either of the problems

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Quality – Multiple Video Sources

30

Effect of loop closing for 6-image stitching


The misalignment is mitigated by applying the loop closing method.

Nagoya UniversityParallel & Distributed Systems Lab.

Stitching Quality – Elimination of Ghosting Artefacts

31

Effect of video synchronization on ghosting artefacts


The image is stitched from two video frames synchronized by GPS.

Nagoya UniversityParallel & Distributed Systems Lab.

Contents

1. Introduction

2. Related Work

3. Real-time Video Stitching
   I. Offloading Feature Extraction
   II. Exploiting Flight Status Information
   III. Optimizing RANSAC
   IV. Optimizing Video Stitching Quality
   V. Multiple Video Sources

4. System Implementation

5. Evaluation
   I. Stitching Time
   II. Stitching Quality

6. Conclusion

32

33

Nagoya UniversityParallel & Distributed Systems Lab.

Conclusion

SkyStitch – a multi-UAV-based video surveillance system that supports real-time stitching

Ø Feature extraction offloading

Ø Utilization of flight status information

Ø Efficient GPU-based RANSAC implementation

Ø Jerkiness mitigation by Kalman filtering

4 times faster than the state-of-the-art GPU-accelerated method & good stitching quality

Future Work

ü Other types of features (color features, low-dimensional features…)

ü Performance in challenging environments (rough ground with buildings…)

34