
Tracking and Following a Moving Person Onboard a Pocket Drone

Tomás Duro
duro.tomas@gmail.com

Instituto Superior Técnico, Lisboa, Portugal

November 2016

Abstract

Flying a drone is not an easy task, usually requiring a trained pilot. In this paper a vision-based strategy is presented, designed to work fully onboard a pocket drone, for autonomously tracking and following a person. With the presented system it is possible to use a drone for filming or taking pictures from previously inaccessible places without the need for a person controlling the aircraft. The system comprises two main components, a tracker and a control system. The tracker has the function of estimating the position of the person to be followed, while the control system gets the drone near that person. Limited by payload weight, power consumption and processing power, the system results in a delicate balance between these constraints. The main contributions of this paper are the comparison between two state-of-the-art visual trackers running on Paparazzi, Struck and KCF; the control system that uses the output location of the tracker to perform the person-following task; and a new tracker, developed to be as computationally light as possible so that it can run onboard a small pocket drone. Based on HOG feature extraction, it uses logistic regression to train a detector on the appearance of a person. Both KCF and Struck proved too demanding to run onboard the drone, while the new tracker runs fully onboard, producing very promising results.

1. Introduction
In the last few years, drones have become relatively inexpensive, leading to a growing number of applications for them. Their agility and ability to reach previously inaccessible places make them particularly attractive for applications such as gathering aerial footage for cinema and television, and in particular for sports coverage. These tasks usually consist of following a moving target, either a person or an object, and consistently pointing a camera at it. However, drones are notoriously difficult to control, requiring highly skilled, trained pilots for their safe operation, so it is not straightforward to use them for the purpose of target following.

An alternative to manually piloting the drone is having it fly autonomously. This is a very active field of research, and in the context of following a moving target, the main challenges are detecting and tracking the position of the target, and guiding the drone towards it. Recently, commercial products have been developed for this task, like the Lily Drone [4], the AirDog [1], or the Hexo+ [2]. These rely on GPS markers on the target, which transmit their position back to the drone. Although this solution works mostly fine outdoors, it does not work in GPS-denied environments, making it unsuitable for indoor applications, among others.

To overcome this limitation, a vision-based system can be used instead. Object detection and tracking is an extensively studied problem in the computer vision literature, and many different methods have been developed to detect the in-frame position of the person or object that is being tracked.

Vision-based tracking is a very complex task, dealing with a number of different problems [26] [28], namely illumination changes, occlusions, changes to the appearance of the object, and the motion of both camera and target. By imposing constraints on the tracking problem, such as assuming that the movement of the target is smooth, with small displacements, or that a person mostly has an upright position, tracking strategies can be simplified. In cases where the camera has a fixed position, subtracting the background from the image can greatly simplify the problem, as it becomes much easier to identify moving objects in the scene.

Some trackers continuously update the model of the target online, resulting in greater accuracy, due to the increased robustness to changes in the scene, as well as in the appearance of the target itself. Failure detection and redetection capabilities also add greatly to tracking robustness, as they allow the tracker to reinitialise when the task deteriorates below a certain threshold. Though often used, these strategies may be too demanding to allow for smooth onboard running of the algorithm.

But visual trackers are, in general, computationally demanding, and when the purpose is to run them onboard a MAV, special care has to be taken, due to its limited memory, processing power, and energy consumption. A delicate balance has to be maintained between these constraints and the accuracy of the tracker. A commonly employed solution to this problem is sending the images from the camera to a ground station, e.g. a personal computer, where the heavy calculations are performed, and then sending the steering controls back to the drone. However, the use of a ground station is often impractical due to its reduced portability, eliminating one of the main virtues of using a drone for this task.

As far as we are aware, no previous vision-based tracker implementation has managed to work fully onboard an inexpensive drone without the assistance of a special feature to identify the target that is being followed and distinguish it from others. In this work, we aim at a system which does not require such special identifiers on the target, and that works fully onboard a small pocket drone, highly limited in both memory and processing power. We ported two state-of-the-art trackers to the Paparazzi UAV [21] [23] autopilot system, namely Struck [17] and KCF [18], and tested them on a Parrot Bebop drone. Both trackers are discriminative, and update the model of the target online with each new frame, making them very robust, but computationally unsuitable for onboard use. We then developed a simple tracking algorithm, designed to be computationally efficient, so that it can run onboard a pocket drone. This tracker is based on tracking by detection and uses HOG features to describe the model of the person; a logistic regression classifier is trained offline with a few hundred images of both positive and negative examples of people.

The remainder of the paper is organised as follows. In section 2 an overview of the related work is presented and in section 3 the hardware used in the project is described. Section 4 gives a detailed explanation of the visual trackers and section 5 explains the control system design. Section 6 presents the experimental results, and section 7 the conclusions and future work.

2. Related Work
There have been many previous vision-based approaches to the task of using a drone for person and object following. Most works use very similar control policies, achieving good results with a simple PID controller. What distinguishes the different systems is the tracker used to estimate the position of the target. A recent survey by Smeulders et al. [26] reviews and makes a detailed comparison between 19 of the most used and cited trackers in the literature. All the trackers are evaluated under the same test conditions, using 12 different performance metrics. In the following review we focus on the simplest trackers and on the ones that perform best.

Some of the simpler tracking strategies are based on colour thresholding: images are filtered in order to identify the target by its colour. These methods are not very robust, having difficulties with tracking targets whose colours are similar to those of the background. Dang et al. [12] use colour thresholding to track and follow a small red object from above. Even though such a system should be capable of running entirely onboard the AR Drone they used, they actually perform the computations offboard, using a ground station.

In recent work, Smedt et al. [13] develop a system to track and follow a walking person, onboard a UAV, using a colour-based particle filter as a tracker, and the Aggregate Channel Feature (ACF) [14] detector. The detector is used to reinitialise the tracker whenever its confidence level drops below a certain threshold, increasing the overall robustness of the system. To increase the speed of the detector, the ground plane is estimated, and used to reduce the region over which the target is searched.

More sophisticated methods have been developed to deal with the problems of the previously proposed colour-based techniques. The two trackers that performed best in the survey by Smeulders were Struck [17] and TLD [19].

The Struck tracker was developed by Hare et al. [17]. It makes use of a support vector machine (SVM), trained on the appearance of each training example, as well as on the translation from the previous detection. By incorporating translations in the learning phase, the algorithm is able to directly output the position of the new detection, unlike traditional trackers, which output a score corresponding to the likelihood of each tested location. We used Struck in some of our experiments; a more detailed description is given in section 4.1.

More recently, Kalal et al. developed the TLD tracking framework, consisting of 3 components: a Tracker, a Learning component, and a Detector. The main purpose of the tracker is estimating the motion of the target between successive frames. However, should the target move too much, possibly even getting out of sight, the detector then starts scanning each new frame independently, in order to find the target again and reinitialise the tracker. The learning component trains the detector based on the performance of both the tracker and the detector, discriminating the appearance of the target against the background, through the use of positive and negative training examples.

However, the TLD algorithm is computationally very demanding, and it is currently unfeasible to run it onboard a consumer drone. In the work by Pestana et al. [22], a ground station is used to run the TLD algorithm on the images sent by an AR Drone 2.0. The control commands are then generated by a PD controller, and sent back to the drone. Similar work has been performed by Bart et al. [6], again using a ground station for offboard computing of the TLD tracker.

Many trackers are based on the Histogram of Oriented Gradients (HOG) feature descriptor, which was introduced by Dalal and Triggs [8], and originally used to train an SVM for pedestrian detection. It has since been widely used in various branches of computer vision, for object and people detection. Danelljan et al. [10] worked on a person-following framework, composed of an Adaptive Color Tracker [11], and an SVM trained on HOG features to recover from tracking failures. All computations are performed on a ground station.

In Haag et al. [16], two correlation filter-based trackers are used for person pursuit, using a low-cost quadcopter, with the algorithms themselves running offboard. They extend the DSST [9] and KCF [18] tracking algorithms with target loss detection and redetection capabilities. Both trackers make use of HOG features for improved results, when compared with simple pixel intensity values. KCF is one of the trackers used in our work, and a detailed explanation of the algorithm is given in section 4.2.

From the wide variety of approaches used in the literature, we observe that no solution has yet emerged as the single best strategy for the task of target following using a drone. Most of the systems make use of a ground station to perform the intensive computations, sacrificing mobility, and the ones that do manage to perform the calculations onboard require powerful specialized hardware, usually not found in small inexpensive drones.

3. Hardware

Throughout the course of this work we made use of two hardware platforms: the Parrot Bebop and the pocket drone equipped with a stereo camera.

3.1. Parrot Bebop

The Bebop is a commercially available platform. It uses a Parrot P7 dual-core Cortex A9 CPU, a quad-core GPU and 8 GB of flash memory. The camera has a 180° 1/2.3 fisheye lens with 1920×1080 pixel resolution, and works at up to 30 fps.

Figure 1: Pocket drone with the stereo camera attached.

3.2. Pocket Drone & Stereo Camera
The pocket drone is a very small and light aircraft; it weighs only 43 grams when equipped with a stereo-vision camera, as seen in Fig. 1. The algorithm runs directly on the stereoboard, the board that does the processing of the images, which is limited to 196 kB of memory with a processor speed of 168 MHz. The camera has a horizontal field of view of 57.4° and can handle a maximum frame rate of 30 frames per second. The maximum pixel resolution of the images is 660×492, though we use a smaller resolution of 128×96.

4. Visual Trackers
In this section, we give a detailed explanation of each of the trackers we used in our experiments, namely the Struck [17] and KCF [18] trackers, and the Pocket Tracker, which we developed to work onboard the pocket drone.

Both Struck and KCF are adaptive trackers, using information provided in the first and following frames to build a model of the target, with no need for a pre-trained detector. The Pocket Tracker works differently: it was trained offline on the task of people detection, and is consequently less robust than the other two.

Most trackers, including the three we present, use a bounding box to identify the target, giving information not only on its position within the image frame, but also on its size. An example can be seen in Fig. 3(a).

Figure 2: General overview of the tracking and following process.

4.1. Struck
The authors of Struck [17] make a few comments on what they see as shortcomings of traditional adaptive tracking-by-detection approaches. Learning is usually decoupled from the information on the translation from the previous detection, based only on a binary label given to each training example. Moreover, examples are equally weighted, independent of the level of overlap with the bounding box of the tracker. Small inaccuracies during tracking can lead to poorly labeled examples, which worsen the tracker in the training phase, in turn leading to cyclic deterioration of the tracker. Usual strategies to generate training samples are based on fixed rules, such as considering the predicted location in the current frame as a positive training sample, and locations that are a certain distance from this location as negative samples. This is an error-prone strategy, as there is no control over what the negative samples look like; these samples may result in deterioration of the performance of the tracker.

The framework of Struck is built to deal with these issues: instead of learning a classifier, its prediction function directly estimates the object's transformation between each frame. The output space becomes all possible object transformations inside a search region, instead of binary labels ±1.

The prediction function is learned in a structured output SVM [7] [27]. SVMs keep both positive and negative samples from previous frames, called support vectors; a frame for which support vectors have been kept is called a support pattern.

In the optimisation process of the prediction function, a loss function that depends on the level of overlap with the current bounding box is incorporated. By doing so, they manage to differentiate negative samples that have significant overlap with the bounding box from those that have little or no overlap, addressing their comment on giving the same weight to all negative samples.

Three main optimisation stages perform the maintenance of the support vectors, namely ProcessNew, ProcessOld and Optimize. ProcessNew processes a new example, having the ability to add both positive and negative support vectors, while ProcessOld, as the name indicates, goes back to a previous support pattern, only having the capacity to add negative support vectors or change the coefficients associated with the existing support vectors. Unlike ProcessNew and ProcessOld, the Optimize stage cannot add support vectors, only change the coefficients associated with an existing support pattern, making it the least expensive operation among the three. Each full cycle of Struck runs ProcessNew once, and multiple instances of ProcessOld and Optimize.

The last big component of the tracker is its budget maintenance system. In SVMs the number of support vectors is not bounded; in a task like tracking they can grow indefinitely if a mechanism is not incorporated to deal with their maintenance. With more support vectors, storage and computational costs grow, and ProcessNew and ProcessOld also become more costly operations. In Struck, the budget maintenance process is run every time a new support vector may have been added: it checks whether or not the number of support vectors has exceeded the budget limit, and if it has, the support vector that causes the smallest change to the norm of the weight vector is removed, causing little impact on the performance of the tracker.

As in other support vector machine algorithms, by using different kernel functions they can implement different features, as well as a combination of more than one. In their experiments they make use of features like raw pixel intensities, Haar features and intensity histogram features, obtaining the best results for the combination of Haar features with intensity histograms. The overall structure of Struck, incorporating translations directly in the learning phase, together with its support vector optimisation and budgeting mechanisms, produces some of the best results among current adaptive trackers.

4.2. KCF
The tracker proposed by Henriques et al. [18] deals with what they consider to be the main factor inhibiting performance in trackers: undersampling, especially of negative samples. In most modern trackers, incorporating more samples comes at both a computational and a memory cost, and as such they usually try to add as many samples as possible while keeping both costs viable.

KCF manages to incorporate thousands of samples, and even though it is based on kernel ridge regression as proposed by [24], it manages to break the "curse of kernelization". The samples are generated by a specific translation model, circulant matrices, which in the Fourier domain allows for a fast learning algorithm. The tracker can also use multiple features, e.g. HOG features, which results in an increase in its performance.

Training samples are generated from a base sample that is used as a positive example; multiple negative samples then result from translations of that base sample. The aforementioned circulant matrix results from making each of its lines the base sample with increasing translations. What makes the use of circulant matrices so appealing is that these matrices become diagonal under the discrete Fourier transform.

Working in the Fourier domain with training data in the format of circular shifts allows for the simplification of linear regression: matrices become diagonal and all operations become element-wise on the diagonal elements.
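This diagonalization is easy to verify numerically. The sketch below, in Python with NumPy and SciPy (illustrative only; the implementations discussed in this paper are C/C++ ports), builds a circulant matrix from a random base sample and checks that conjugating it with the unitary DFT matrix yields a diagonal matrix whose entries are the DFT of the base sample:

import numpy as np
from scipy.linalg import circulant, dft

n = 8
x = np.random.randn(n)          # base sample
C = circulant(x)                # rows/columns are cyclic shifts of x
F = dft(n, scale='sqrtn')       # unitary DFT matrix

D = F @ C @ F.conj().T          # diagonalization by the DFT
assert np.allclose(D, np.diag(np.fft.fft(x)), atol=1e-10)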

Powerful non-linear regression functions can be used with the kernel trick: the optimisation problem maintains its linearity in the dual space, with the downside that the complexity of the resulting regression function now grows with the number of samples, the "curse of kernelization". By proving that the kernel matrix is circulant for certain kernel functions, including Gaussian, linear and polynomial, among others, and working in the Fourier domain, the regression problem is greatly accelerated.

The detection phase is also significantly accelerated. By modelling the patches that are to be tested as cyclic shifts, the entire detection response can be computed very quickly, again in the Fourier domain.

The dual formulation of the problem allows for easy incorporation of multiple channels, like HOG features, by simply summing the results for each channel in the Fourier domain. As mentioned before, the use of HOG features gives a good performance enhancement. Modelling samples with translations and making the resulting data and kernel matrices circulant allowed KCF to be one of the fastest state-of-the-art trackers, with speeds of hundreds of frames per second, compared to the low frame rates of other state-of-the-art trackers, such as Struck.
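To make the training and detection steps concrete, the following minimal 1-D sketch implements the linear-kernel case of this scheme in Python/NumPy. It is a hedged illustration, not the authors' implementation (which for this paper was the C++ port in [3]): ridge regression over all cyclic shifts of a base sample is solved element-wise in the Fourier domain, and the detection response over all shifts of a new sample comes from a single multiplication and inverse FFT.

import numpy as np

def kcf_train(x, y, lam=1e-4):
    # x: base sample; y: desired response over all cyclic shifts of x
    xf = np.fft.fft(x)
    kf = xf * np.conj(xf) / len(x)      # linear-kernel auto-correlation, Fourier domain
    return np.fft.fft(y) / (kf + lam)   # dual coefficients alpha, Fourier domain

def kcf_detect(alphaf, x, z):
    # response over every cyclic shift of the new sample z
    kzf = np.fft.fft(z) * np.conj(np.fft.fft(x)) / len(x)
    return np.real(np.fft.ifft(kzf * alphaf))

# Toy usage: the response should peak at the shift relating z to x.
x = np.random.randn(64)
y = np.zeros(64); y[0] = 1.0             # impulse centred on the base sample
alphaf = kcf_train(x, y)
z = np.roll(x, 5)                        # target moved by 5 samples
print(np.argmax(kcf_detect(alphaf, x, z)))   # prints 5, the detected shift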

The code for the Paparazzi implementation of both KCF and Struck is made available and can be found at [3].

4.3. Pocket Tracker
The stereoboard is extremely limited in terms of processing power and memory capacity, as shown in section 3.2. Both Struck and KCF are too demanding to run onboard the stereo camera, so simpler, less expensive strategies have to be taken. Adaptive trackers, which require online training, tend to be expensive in the updating of the model; for this reason we chose not to update our person model online, and use a pre-trained detector instead. The Pocket Tracker applies a traditional tracking-by-detection approach based on HOG feature extraction.

4.3.1 Detector

To train our detector we make use of logistic regression, a binary classification algorithm often used in machine learning problems. The function we use to represent our hypothesis is:

h_\theta(x) = g(\theta^T x)    (1)

where x is an input example, \theta are the parameters that describe our model and g is the sigmoid function:

g(z) = \frac{1}{1 + e^{-z}}    (2)

The sigmoid function outputs values between 0 and 1, and h_\theta(x) can be seen as the probability of x belonging to the positive class. By thresholding the output of Eq. 1 we can separate positive from negative predictions; a typical choice is to consider a prediction positive for h_\theta(x) \geq 0.5 and negative otherwise.

The training of our logistic regression classifier consists in finding the parameters \theta which minimise the cost function in Eq. 3 across all training examples:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2    (3)

where m represents the number of training examples and n the number of features, y^{(i)} is the ground truth label for training sample i, and \lambda is a regularization parameter to prevent overfitting. Usual in logistic regression, this cost function penalises false positives and false negatives in the training dataset. For false positives we have h_\theta(x^{(i)}) \geq 0.5 and y^{(i)} = 0, making the term (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) large. A similar situation happens with false negatives, h_\theta(x^{(i)}) \leq 0.5 and y^{(i)} = 1, with y^{(i)} \log(h_\theta(x^{(i)})) also becoming large. In the limit, for completely wrong predictions, i.e. y^{(i)} = 0 with h_\theta(x^{(i)}) = 1 or y^{(i)} = 1 with h_\theta(x^{(i)}) = 0, the cost function goes to infinity.

To find the set of parameters \theta that produced the best results we used the Matlab optimisation function fminunc.
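As an illustration of this training step, the sketch below implements the regularized cost of Eq. 3 and minimises it with SciPy's BFGS optimiser, a stand-in for the Matlab fminunc call actually used in this work. It is Python/NumPy with randomly generated placeholder data instead of the real HOG features, and it assumes the bias term is left unregularized (a detail the paper does not specify):

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    # Regularized logistic regression cost of Eq. 3.
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12                                   # avoid log(0)
    J = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    J += lam / (2 * m) * np.sum(theta[1:] ** 2)   # skip bias, sum j = 1..n
    return J

# Placeholder data standing in for HOG features of positive/negative patches.
rng = np.random.default_rng(0)
m, n = 200, 50
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # bias column
y = rng.integers(0, 2, m).astype(float)

res = minimize(cost, np.zeros(n + 1), args=(X, y, 1.0), method='BFGS')
theta = res.x                                      # trained detector parameters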

As we are limited in both memory and speed, our detector is run at a fixed scale; a 28×24 pixel window was chosen, which at 2 meters away from the camera includes the head, shoulders and a small portion of background. The size of the area that is used to search for the target is also an important factor influencing the speed of the algorithm. We could run our search over the entire picture, but that would needlessly slow down the tracker. The movement of our target, a human, should be mostly horizontal, as a human head maintains a relatively constant altitude. To give emphasis to this motion constraint of the target, we chose a region of interest (ROI) of 96×56 pixels, used to search for the target.

Various HOG feature configurations were tested and the results are presented in section 6.3; 2×2 pixel cells with 4 bins and no blocks was the chosen configuration. This configuration produced some of the best results as well as being one of the fastest: it uses a small number of bins, reducing the number of computations, and as it does not use blocks there is no need for the extra step of calculating them. Running the stereoboard with this configuration gives us a frame rate of 1.68 frames per second. Although this is a low frame rate, unlike for adaptive trackers like KCF and Struck, this should not be a problem for our tracker as long as the target remains inside the region of interest.
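For reference, a minimal HOG extractor for this configuration (2×2 pixel cells, 4 orientation bins, no blocks) might look as follows. This is an illustrative NumPy sketch under our own assumptions about details the paper leaves open, such as unsigned gradients over 0–180° and hard binning; the onboard version is plain C [5]:

import numpy as np

def hog_features(patch, cell=2, nbins=4):
    # Gradients via central differences
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation, [0, 180)

    h, w = patch.shape
    bins = np.minimum((ang / (180.0 / nbins)).astype(int), nbins - 1)
    feats = []
    for i in range(0, h - h % cell, cell):
        for j in range(0, w - w % cell, cell):
            hist = np.zeros(nbins)
            # accumulate gradient magnitude into orientation bins, per cell
            np.add.at(hist, bins[i:i+cell, j:j+cell].ravel(),
                      mag[i:i+cell, j:j+cell].ravel())
            feats.append(hist)
    return np.concatenate(feats)

# A 28x24 patch yields 14*12 cells * 4 bins = 672 features.
x = hog_features(np.random.randint(0, 255, (28, 24)))
print(x.shape)   # (672,)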

4.3.2 Pipeline of the Tracker

In the first frame we do not know the location of our target, and as we do not use the full image to run the detector, we start by selecting the ROI in the centre of the frame. Our aim is to find the patch, represented by its feature vector x, inside the ROI that has the highest likelihood of being our target, meaning the patch that produces the highest output of h_\theta(x). The patch that produces the highest h_\theta(x) is the same as the patch that produces the highest \theta^T x, so we evaluate patches directly by this measure. To prevent the detector from selecting patches that have a low probability of being the target, i.e. a patch that has a low \theta^T x but still the highest in the image, which is especially problematic when no human is present in the scene, we introduce a detection threshold. Whenever a frame fails to produce a detection above this threshold, the system considers that no detection has occurred and enters its redetection mode. During this redetection mode the output of the tracker is a perfectly centred bounding box, so that the drone just hovers in place while waiting for the next successful detection. Following a frame with a successful detection, the ROI is selected centred on the position of the previous detection, where the detector searches for a new target position.
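The per-frame logic described above fits in a few lines. The sketch below is a hedged Python rendering of that pipeline (the actual onboard code is in C [5]); hog_features is the extractor sketched earlier, theta is a trained weight vector (bias handling omitted for brevity), the 2-pixel scan step reflects the 1-cell step discussed in section 6.3, and the DETECTION_THRESHOLD value is a placeholder to be tuned:

import numpy as np

WIN_H, WIN_W = 28, 24          # fixed detector window
ROI_H, ROI_W = 56, 96          # region of interest (height x width)
DETECTION_THRESHOLD = 0.0      # placeholder; tuned on the training data

def track_frame(frame, theta, prev_center):
    # Scan every window position inside the ROI centred on the last detection.
    top = np.clip(prev_center[0] - ROI_H // 2, 0, frame.shape[0] - WIN_H)
    left = np.clip(prev_center[1] - ROI_W // 2, 0, frame.shape[1] - WIN_W)
    best_score, best_pos = -np.inf, None
    for i in range(top, min(top + ROI_H, frame.shape[0]) - WIN_H + 1, 2):
        for j in range(left, min(left + ROI_W, frame.shape[1]) - WIN_W + 1, 2):
            score = theta @ hog_features(frame[i:i+WIN_H, j:j+WIN_W])
            if score > best_score:
                best_score, best_pos = score, (i + WIN_H // 2, j + WIN_W // 2)
    if best_score < DETECTION_THRESHOLD:
        # Redetection mode: report a centred box so the drone hovers in place.
        return (frame.shape[0] // 2, frame.shape[1] // 2), False
    return best_pos, True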

As debugging is extremely limited on the stereoboard, where the only possible form of communication is a flashing LED, prototyping of the tracker was done in C and is made available at [5], as well as the code used for extracting the features, annotating the dataset and training the detector.

5. Control System Design
We could say that, in a person-following task, the perfect controller is the one that can maintain the person in the centre of the camera image, at a fixed distance from him or her. Most trackers, including the three used during this work, Struck [17], KCF [18] and the Pocket Tracker, output their location in the format of a bounding box, as described in section 4. As such, the control system was built to be easily implemented with any tracker whose output is also a bounding box, depending essentially on the camera that is being used. Using the field of view and the pixel resolution of the camera, it is possible to calculate angles in the image and use them in turn to calculate distances.

The implemented control scheme makes use of the control loops of Paparazzi: by analysing the behaviour of the output bounding box, the position of the person relative to the drone is calculated and used to generate the references that are fed to the control loops. Yaw and altitude control follow a very similar approach, while the control in the horizontal plane is slightly different. An overview of the various components of the system leading up to the control of the drone is presented in Fig. 2.

5.1. Yaw and Altitude Control
The reference point is the centre of the frame; centring the person on that point can be divided into two control problems, yaw and altitude control. By dividing the field of view of the camera, fov, by the corresponding number of pixels, n_{pixels}, we get the approximate angle per pixel, which we call angle_{step}:

angle_{step} = \frac{fov}{n_{pixels}}    (4)

When a detection is made, the centre of the bounding box has a horizontal and a vertical pixel displacement, respectively h_{pixel} and v_{pixel}, from the centre of the frame, as can be seen in Fig. 3(a). These displacements can be used to calculate the associated horizontal and vertical angle displacements, h_{angle} and v_{angle}, which can be seen in Fig. 3(b) and 3(c), using the following equation:

angle_{disp} = pixel_{disp} \times angle_{step}    (5)

The yaw reference, h_{angle}, can be directly calculated using Eq. 5, in which pixel_{disp} is h_{pixel}; v_{angle} can be calculated in an analogous way.

To calculate the altitude reference, one extra piece of information is needed: the distance between the target and the camera, d_{target}, which is calculated for the horizontal control, and as such is explained in section 5.2. Through the use of trigonometry, the vertical distance displacement, v_{dist}, is calculated from d_{target} and v_{angle}:

v_{dist} = d_{target} \times \tan(v_{angle})    (6)
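Putting Eqs. 4–6 together, the yaw and altitude reference generation reduces to a few lines. The following Python sketch is illustrative only (the real implementation feeds Paparazzi's control loops); the camera constants shown are those of the stereo camera in section 3.2, and square pixels are assumed so the same angle step serves both axes:

import math

FOV_H = math.radians(57.4)     # horizontal field of view of the stereo camera
IMG_W, IMG_H = 128, 96         # image resolution used for tracking

def yaw_altitude_refs(bb_center_x, bb_center_y, d_target):
    angle_step = FOV_H / IMG_W                      # Eq. 4: approx. angle per pixel
    h_pixel = bb_center_x - IMG_W / 2               # displacement from frame centre
    v_pixel = bb_center_y - IMG_H / 2
    h_angle = h_pixel * angle_step                  # Eq. 5: yaw reference
    v_angle = v_pixel * angle_step
    v_dist = d_target * math.tan(v_angle)           # Eq. 6: altitude correction
    return h_angle, v_dist

# e.g. target detected 20 px right and 10 px below centre, 2 m away:
print(yaw_altitude_refs(84, 58, 2.0))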

Figure 3: These three figures represent three views of a frame with a detection on it, the drawn bounding box, and the three-dimensional position of the camera in space. From left to right we have the frontal view, then the side view, and lastly the top view. The various parameters used to calculate the references for the control system are depicted in the figures. The bounding box height and width are respectively BB_h and BB_w, the centre of the bounding box is P_{BB}, and the centre of the frame is P_c. The rest of the parameters are explained in sections 5.1 and 5.2.

5.2. Horizontal Control

5.2.1 Scale Adaptive Trackers

The horizontal control works on the position along the axis pointing towards the front of the aircraft, which is also the camera axis. References are generated for the aircraft to move back and forth in this direction, while trying to maintain a fixed distance from the person. The distance we want to keep from the person, d_{desired}, is a parameter of the system, and the error we want to minimise, d_{error}, comes from subtracting the calculated distance to the target from this desired distance:

d_{error} = d_{desired} - d_{target}    (7)

The tracker is initialised at a fixed distance and with a fixed box size, so we know the real height, height_{real}, corresponding to the bounding box pixel height. If the person gets closer or further away from the camera, the pixel size of the bounding box changes, but the equivalent real size is kept constant, if tracking is successful of course. Through the use of Eq. 5 we can calculate the vertical angle associated with the pixel height of the bounding box, height_{angle}. Using height_{real} and height_{angle} we can calculate d_{target} with the following trigonometric relation:

d_{target} = \frac{height_{real}}{\tan(height_{angle})}    (8)
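A minimal illustration of Eq. 8, reusing the angle_{step} of Eq. 4 (Python, illustrative; height_real below is whatever real height, in meters, the initial bounding box corresponds to):

import math

def distance_from_bb_height(bb_height_px, height_real, angle_step):
    # Eq. 5 gives the vertical angle subtended by the bounding box,
    # Eq. 8 turns it into a distance estimate.
    height_angle = bb_height_px * angle_step
    return height_real / math.tan(height_angle)

# e.g. a 1.6 m tall bounding box spanning 40 px at ~0.0078 rad/pixel:
print(distance_from_bb_height(40, 1.6, math.radians(57.4) / 128))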

5.2.2 Fixed Scale Trackers

The detector used in the Pocket Tracker does not adapt to scale, as mentioned before, making the strategy used for the other trackers unfeasible. Working with a stereo camera gives us access to the disparity map corresponding to the images that we are using for tracking. With a pair of images from a stereo system, by matching the pixels in one of the images with the corresponding pixels in the other image, we can calculate the displacements between each pair of pixels; the image formed by all these displacements is the disparity map. The disparity map can in turn be used to calculate the depth map, which contains an estimate of the distance of each pixel from the camera. As will be seen in the experimental results in section 6.3, successful detections are very accurate, always with the head of the person being tracked in the centre of the detection window. Matching pixels from the centre of the detection window with the corresponding pixels in the depth map and averaging over them gives us an estimate of the distance between the drone and the target.
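A hedged sketch of this distance estimate (Python; focal_px and baseline_m are placeholder calibration constants of the stereo camera, and the standard pinhole relation depth = f·B/disparity is our assumption, as the paper does not give the stereoboard's exact formula):

import numpy as np

def target_distance(disparity_map, center, focal_px, baseline_m, patch=5):
    # Average disparity over a small patch at the centre of the detection window,
    # then convert to depth with the pinhole stereo relation z = f * B / d.
    cy, cx = center
    r = patch // 2
    d = disparity_map[cy - r:cy + r + 1, cx - r:cx + r + 1].astype(float)
    d = d[d > 0]                      # ignore pixels with no stereo match
    if d.size == 0:
        return None                   # no depth available at the target
    return focal_px * baseline_m / d.mean()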

6. Experimental Results
Our aim was to figure out what strategy would enable us to implement a person-following system onboard a small pocket drone. The Parrot Bebop was used as a platform to test two state-of-the-art algorithms, KCF and Struck. Both visual trackers had open-source C++ implementations by the respective authors, which were adapted to work with Paparazzi on the Bebop. Due to the large field of view of the camera of the Bebop, only a crop of the full image was used, a 124×96 pixel image with a field of view of 60 degrees. We first discuss the results for the two trackers, then the experiments on the Pocket Tracker, and finally the reference calculations.

6.1. Struck
When run on a normal computer, Struck presents very good results: it easily handles changes in appearance, e.g. a person rotating; abrupt movements like quick changes in the direction of motion are also not a problem; and it adapts very smoothly the size of the bounding box to the size of the person.


Figure 4: Detection results for the Pocket Tracker. The four images on the left correspond to positive detections by the Pocket Tracker, while the images on the right correspond to difficult situations where the Pocket Tracker failed to produce positive detections. The images were recovered from the stereo camera.

Though very promising, this performance comes at a cost: when faced with weaker hardware, the speed at which the tracker can run decreases, which results in deteriorated performance. Though it can still run fully onboard the Bebop, the frame rate drops dramatically to 0.3 frames per second, making it impossible to obtain any kind of tracking past the first frame.

6.2. KCF
Similarly to Struck, KCF shows impressive results when tested at higher speeds. With its much faster framework, when compared to Struck, it is able to run at 3.5 frames per second onboard the Bebop. Tracking is achieved, but the relative movement of the target with regard to the image plane that the tracker can handle becomes very limited. The capacity to adapt to scale also suffers from the low frame rate, performing very crudely in its adaptation to objects coming closer or moving further away from the camera. For a slow-moving object the tracker can handle the task, but for the normal speed of a person, it drifts away from the target very quickly.

6.3. Pocket Tracker
In this section we present the results obtained for the detector of the Pocket Tracker, its main component, trained and tested for different HOG feature configurations and for different regularisation parameters in the logistic regression training phase. The dataset used to train the detector was made using the stereo camera; ground truth positive samples were manually annotated and negative samples extracted from the background. We used 764 positive samples of a person with different orientation angles, i.e. samples from all 360 degrees of rotation, and 1000 negative samples extracted from the background. The test sequence used to evaluate the results of the various detector configurations was made of 167 images, with a section where tracking presented reasonable to excellent results depending on the configuration in place, and a section in which the pixel values of the background were very similar to those of the person, and as such the performance of the detector deteriorated for all configurations. Examples of these two sections can be seen in Fig. 4. We always used the same training dataset and test sequence in order to ensure comparable results.

A first phase of tests consisted in running our test sequence for different block size and block overlap pairs, for a fixed number of HOG bins, across a wide range of λ (see Eq. 3) values. Then, for the λ that showed the best result in each of the three best configurations, similar tests were run, now for a fixed λ, block size and block overlap, while varying the number of HOG bins.

To evaluate the performance of the detector we used three different metrics: the percentage of failed detections, the Object Tracking Performance (OTP) [25] and the Average Tracking Accuracy (ATA) [20]. For the first one we use the PASCAL criterion [15]:

\frac{|T^i \cap GT^i|}{|T^i \cup GT^i|} \geq 0.5    (9)

where T^i and GT^i represent respectively the detected bounding box in frame i and the ground truth bounding box in frame i. Whenever the PASCAL criterion is met, i.e. the intersection area of the two boxes divided by their union area is greater than 0.5, a detection is considered successful; if below this threshold, we consider it a failed detection. The OTP gives us a measure of how accurate the detector is in the cases where positive detections were achieved, discarding the frames where Eq. 9 is not met. OTP is given by:

OTP = \frac{1}{|M_s|} \sum_{i \in M_s} \frac{|T^i \cap GT^i|}{|T^i \cup GT^i|}    (10)

where M_s denotes the set of frames with positive detections. Finally ATA, very similar to OTP, calculates the average accuracy across all frames, including the ones with true positives and the ones with false positives:

ATA = \frac{1}{N_{frames}} \sum_{i} \frac{|T^i \cap GT^i|}{|T^i \cup GT^i|}    (11)

The ATA is the metric that gives us the best evaluation of the overall performance of the detector, penalised both by the inaccuracy of the detector on positive detections and by failed detections.
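These three metrics are straightforward to compute from per-frame boxes. The sketch below (Python; boxes are assumed to be (x, y, w, h) tuples, our convention) implements Eqs. 9–11:

def iou(a, b):
    # Intersection-over-union of two (x, y, w, h) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truths):
    scores = [iou(t, gt) for t, gt in zip(detections, ground_truths)]
    positives = [s for s in scores if s >= 0.5]                   # Eq. 9
    fail_pct = 100.0 * (1 - len(positives) / len(scores))
    otp = sum(positives) / len(positives) if positives else 0.0   # Eq. 10
    ata = sum(scores) / len(scores)                               # Eq. 11
    return fail_pct, otp, ata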

All the results of the tests that follow are presented in Fig. 5. Each configuration is identified by a letter in the legend, and the correspondence between that letter and the configuration specifications can be found in Table 1.

6.3.1 Block Configuration

As the bounding box in our detector is of fixed size, not all HOG configurations fit well into the size of the box; as an example, if we had a 5×5 pixel window and used 2×2 pixel cells, we would not be able to fit an integer number of cells into such a window. Each configuration was tested for different regularisation values, ranging from 0.03 up to 30, in steps of roughly three times the previous value, and whenever necessary, lower and higher values were tested, until convergence was achieved. In this testing phase we used 9 bins, which was the number of bins that gave the best equilibrium between speed and accuracy in the original work by Dalal and Triggs [8].

Figure 5: Performance results for the logistic regression trained detector. On top we have the results for each HOG feature configuration, while varying λ (see Eq. 3), evaluated from left to right on OTP, ATA and percentage of fails respectively. On the bottom we have the results for fixed λ, while varying the number of bins, evaluated on the three best configurations from the top row, again from left to right on OTP, ATA and percentage of fails respectively.

Table 1: HOG configurations.

Configuration | Block Size [cells] | Block Overlap [fraction of block]
A             | 1×1                | 0
B             | 2×2                | 0
C             | 2×2                | 1/2
D             | 3×3                | 2/3
E             | 4×4                | 3/4
F             | 4×4                | 1/2

The OTP, Fig. 5(a), was very high for all tested configurations and regularisation values, with values always close to 0.9; it reflects how well the output bounding box matches the ground truth for positive detections. It is the ATA, Fig. 5(b), and the percentage of fails, Fig. 5(c), that clearly show that two of the configurations have much better results than the others, respectively the 1×1 and 2×2 cells per block configurations without overlap. Block overlap turns out not to be beneficial for the detector's performance, contrary to what was expected.

6.3.2 Bin Configuration

Configurations A, B and C were the ones that achieved the best results. Even though A and C clearly surpass the performance of B and the others, we still tested all three for different numbers of bins, using the regularisation values that produced the best results within each configuration. Going from 1 to 9 bins, the lower numbers already presented impressive results; even with 1 bin some correct detections were achieved, so the intensity of the gradients alone gives us a hint at how good the HOG descriptor is. Increasing the number of bins quickly brings the results near the ones we get for 9 bins, with 4 bins sometimes getting better results than any other number. OTP, Fig. 5(d), is again very high for all tested configurations, while ATA, Fig. 5(e), and the percentage of fails, Fig. 5(f), give very close results for the A and C configurations. As 4 bins already produced very good results, and the difference in performance between using 1×1 and 2×2 cells per block is very small, we chose to use configuration A with 4 bins: it is both fast and among the most accurate, it benefits from not needing to calculate blocks, and the step between tested image patches is smaller, 1 cell instead of 2. The frame rate for configuration A, for a varying number of bins, is presented in Table 2.

Table 2: Frame rates with the number of bins for configuration A.

nBins | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9
fps   | 2.56 | 2.09 | 1.89 | 1.68 | 1.49 | 1.39 | 1.27 | 1.20 | 1.15

6.4. Reference Generation
As we make use of the control loops of Paparazzi, our evaluation focuses on the accuracy of the references we are generating. There are two main measurements influencing the references: angle measurements based on pixel displacements, and distances based on the adaptive size of the bounding box.

To evaluate the angle measurements we compared different angle positions, going from −25° to 25° in 5° steps, with 0° being the centre of the frame where the KCF tracker was initialised on the Bebop. These measurements can be seen as direct results for the yaw reference. For the distance measurements the camera of a MacBook Air was used, because of the aforementioned low frame rates onboard the drone, which result in the tracker not adapting well to scale, the main factor influencing distance calculations. Initialised with the target at a distance of 2 meters from the camera, we performed 0.25 meter steps from 1 meter to 2.5 meters away from the camera. For larger distances the results deviate too much due to the tracker's performance, and as such we stopped the tests there; for shorter distances the target gets too close to the camera.

For each experiment 5 repetitions were made, and the results plotted in box plots. In Fig. 6(a) we have the box plot for the yaw angle experiments, and in Fig. 6(b) for the distance calculation experiments. Both results are very positive, and the angle measurements present a very small variance. Note that the results for the distance equal to 2 meters are perfect: this is the distance at which the tracker was initialised. The references can be concluded to be well calculated, depending mainly on the performance of the tracker that is generating the bounding box.

Figure 6: Accuracy tests on angle estimation, above, and distance estimation, below, calculated from the detected bounding box, compared against ground truth results.

7. Conclusions and Future Work
In this paper, a vision-based strategy for tracking and following a person onboard a MAV is presented. Needing no external information on the location of the person, it retrieves an estimate of the position by processing the frames from the camera. We first tested the performance of two adaptive state-of-the-art trackers, KCF and Struck, onboard the Parrot Bebop drone. The overall result for both trackers tells us that these kinds of adaptive trackers work extremely well if they can run at normal frame rates, but the low frame rate imposed by the hardware of the drone makes tracking very difficult for a person moving at normal speed.

As our objective was to implement a person-following system onboard a pocket drone, which has very limited hardware when compared with the Bebop, it was clear that a different strategy had to be implemented. We chose a tracking-by-detection framework, based on a logistic regression classifier trained on HOG features. Our detector performs very well in scenes where the darkness of the background differs from the darkness of the person being tracked, but when there is very little difference between the intensity values of the pixels of the hair and those of the background, the detector has a hard time predicting the correct location of the person, as seen in Fig. 4.

The Pocket Tracker shows great results for such a lightweight tracker, though some work still has to be done to improve its robustness. One of the main features that would increase performance would be the introduction of a penalisation factor to give more emphasis to samples which are closer to the last detection, making it more resistant to the presence of multiple people in the scene.

Tests should be performed on the detector for different template window sizes, and a detailed evaluation should be made of the accuracy of the depth map recovered from the stereo camera. Finally, extensive flight tests need to be performed in order to evaluate the performance of our strategy in a real environment.

References
[1] AirDog. Information can be found at: https://www.airdog.com.
[2] Hexo+. Information can be found at: https://hexoplus.com.
[3] KCF & Struck ports to Paparazzi. Code can be found at: https://github.com/TomasDuro/kcf_str_paparazzi.
[4] Lily drone. Information can be found at: https://www.lily.camera.
[5] Pocket Tracker repository. Code can be found at: https://github.com/TomasDuro/pocketTracker.
[6] R. Bart, A. Vykovsk, et al. Any object tracking and following by a flying drone. In 2015 Fourteenth Mexican International Conference on Artificial Intelligence (MICAI), pages 35–41. IEEE, 2015.
[7] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In European Conference on Computer Vision, pages 2–15. Springer, 2008.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 886–893. IEEE, 2005.
[9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1–5, 2014. BMVA Press, 2014.
[10] M. Danelljan, F. S. Khan, M. Felsberg, K. Granström, F. Heintz, P. Rudol, M. Wzorek, J. Kvarnström, and P. Doherty. A low-level active vision framework for collaborative unmanned aircraft systems. In Workshop at the European Conference on Computer Vision, pages 223–237. Springer, 2014.
[11] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1090–1097, 2014.
[12] C.-T. Dang, H.-T. Pham, T.-B. Pham, and N.-V. Truong. Vision based ground object tracking using AR.Drone quadrotor. In Control, Automation and Information Sciences (ICCAIS), 2013 International Conference on, pages 146–151. IEEE, 2013.
[13] F. De Smedt, D. Hulens, and T. Goedemé. On-board real-time tracking of pedestrians on a UAV. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8, 2015.
[14] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
[15] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[16] K. Haag, S. Dotenco, and F. Gallwitz. Correlation filter based visual trackers for person pursuit using a low-cost quadrotor. In Innovations for Community Services (I4CS), 2015 15th International Conference on, pages 1–8. IEEE, 2015.
[17] S. Hare, A. Saffari, and P. H. Torr. Struck: Structured output tracking with kernels. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 263–270. IEEE, 2011.
[18] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
[20] B. Karasulu and S. Korukoglu. A software for performance evaluation and comparison of people detection and tracking methods in video processing. Multimedia Tools and Applications, 55(3):677–723, 2011.
[21] M. Mueller and A. Drouin. Paparazzi: the free autopilot. Build your own UAV. In 24th Chaos Communication Congress, Berliner Congress Center, December 27–30, 2007.
[22] J. Pestana, J. L. Sanchez-Lopez, P. Campoy, and S. Saripalli. Vision based GPS-denied object tracking and following for unmanned aerial vehicles. In 2013 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pages 1–6. IEEE, 2013.
[23] B. Remes, P. Esden-Tempski, F. Van Tienen, E. Smeur, C. De Wagter, and G. De Croon. Lisa-S 2.8 g autopilot for GPS-based flight of MAVs. In IMAV 2014: International Micro Air Vehicle Conference and Competition 2014, Delft, The Netherlands, August 12–15, 2014. Delft University of Technology, 2014.
[24] R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
[25] S. Salti, A. Cavallaro, and L. Di Stefano. Adaptive appearance modeling for video tracking: Survey and evaluation. IEEE Transactions on Image Processing, 21(10):4334–4348, 2012.
[26] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, 2014.
[27] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep):1453–1484, 2005.
[28] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.
