
Integrating Object Detection with 3D Tracking Towards a Better Driver Assistance System

Victor Adrian Prisacariu 1, Radu Timofte 2, Karel Zimmermann 2, Ian Reid 1, Luc Van Gool 2

1 Active Vision Laboratory, University of Oxford, United Kingdom
2 ESAT-PSI / IBBT, Katholieke Universiteit Leuven, Belgium
{victor, ian}@robots.ox.ac.uk, {rtimofte, kzimmerm, vangool}@esat.kuleuven.be

Abstract

Driver assistance helps save lives. Accurate 3D pose is required to establish if a traffic sign is relevant to the driver. We propose a real-time system that integrates single view detection with region-based 3D tracking of road signs. The optimal set of candidate detections is found, followed by AdaBoost cascades and SVMs. The 2D detections are then employed in simultaneous 2D segmentation and 3D pose tracking, using the known 3D model of the recognised traffic sign. We demonstrate the abilities of our system by tracking multiple road signs in real world scenarios.

1 Introduction

Traffic signs are designed to help drivers reach their destination safely, by providing them with useful information. However, when road signs are missed or misunderstood, accidents happen. A recent statistic shows that over 98% of car accidents happen because the driver was distracted. By attracting the driver's attention to the traffic signs on the road, many accidents could be averted. As a result there has been much work towards fast and reliable traffic sign detection and tracking systems.

Most current work involves combining a detector with a Kalman filter, as in [3, 8], or with a particle filter, as in [5, 4]. These methods rely on a predictable car motion model or on reliable feature detectors. For example, in [6] and [3] the car is assumed to move in a straight line at constant velocity. As feature descriptors, trackers usually use edges or some kind of information extracted from the traffic sign shape. In [3] the authors explicitly model geometric properties of the traffic sign shape (e.g. a triangle-shaped sign has to be equilateral). This leads to a lack of robustness under occlusions, deformations or motion blur. In [5] the authors track circular signs and assume these have a coloured border on a white interior and that clear edges can be extracted. This approach would not scale to differently shaped signs and is again vulnerable to motion blur. A solution to some of these problems is region-based tracking. Regions are more stable than edges, so tracking is more robust, and they are less affected by occlusions or motion blur.

To our knowledge, with notable exceptions such as [5] and [10], most previous road sign work was 2D: the position of the traffic sign was tracked in the image rather than in 3D space. In [5] the authors use inertial sensors mounted on the car to obtain an approximation of the 3D pose of the signs with respect to the car. Unfortunately this approach fails when the traffic sign does not point towards the car, as in the case shown in Figure 1, where the no right turn sign does not have the same rotation as the inertial sensor mounted on the car. An alternative method is presented in [10], where multiple views, 3D reconstruction, and a Minimum Description Length based method are used. Although a 3D pose is recovered, processing time is very long: it is an offline method intended for 3D mobile mapping.

Our approach 1 integrates a single view detection and recognition step with a multi-object, model- and region-based 3D tracker. This has several advantages: (i) we obtain the full 3D pose of the traffic signs in the image, handling the case in Figure 1; (ii) the tracking is region based, making it robust to motion blur and occlusions; (iii) because our tracker processes only a small region in and around the detection, we achieve real-time performance.

The remainder of this paper is structured as follows: we begin by presenting an overview of our algorithm in

1 This work was supported by the Flemish IBBT-URBAN project and by EPSRC through a DTA grant. The authors thank GeoAutomation for providing the images.


Figure 1. Importance of determining the traffic sign orientation: the no right turn sign does not point towards the car.

Section 2. In Section 3 we detail our single view detection and recognition step, while in Section 4 we present our 3D tracker. In Section 5 we include the results of applying our system to several images and videos. We conclude in Section 6.

2 Algorithm Overview

Figure 2. Algorithm overview: (a) original image → (b) single view detection and recognition → (c) selection of best detection for each object → (d) reconstruction of approximate 3D pose → (e) 3D tracker.

An outline of our algorithm is shown in Figure 2. It consists of two phases: first, the single view detector is run on the image and the best detection for each object is selected. Second, the 3D pose at the current frame is predicted, based on the 2D detection and a constant velocity motion model. The 2D detection bounding box is converted to a 3D pose using a 4 point planar pose recovery algorithm. The 3D tracker is then used to refine the 3D pose for each object in the image. If the detection corresponds to a newly seen traffic sign, the approximate 3D pose serves as its initialisation.
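The two-phase loop above can be sketched as follows. All component names (`detect`, `best_per_object`, `bbox_to_pose`, `predict`, `refine`) are hypothetical stand-ins for the paper's modules, passed in as callables so only the control flow is asserted:

```python
def process_frame(frame, detect, best_per_object, bbox_to_pose,
                  predict, refine, tracked):
    """One iteration of the detect-then-track loop of Figure 2.
    `tracked` maps object ids to their current 3D pose."""
    for obj_id, bbox in best_per_object(detect(frame)).items():
        approx = bbox_to_pose(bbox)          # (d) 4-point planar pose recovery
        if obj_id in tracked:                # known sign: predict, then refine
            tracked[obj_id] = refine(frame, predict(tracked[obj_id], approx))
        else:                                # new sign: approximate pose initialises it
            tracked[obj_id] = approx
    return tracked
```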

3 Object Detection and Recognition

When a new frame is available we first use the object detection and recognition algorithm of [10]. The single-view detection phase consists of the following steps:

1) Candidate extraction - a very fast preprocessing step, where an optimal combination of simple (i.e. computationally cheap), adjustable extraction methods selects bounding boxes that may contain traffic signs. This step requires an automatic offline learning stage, where an optimal subset of these extraction methods is learnt to yield very few false negatives, while keeping the number of false positives in check.

2) Detection - extracted candidates are verified further by a binary classifier which filters out remaining background regions. It is based on the Viola and Jones Discrete AdaBoost classifier [11]. Detection is performed by cascades of AdaBoost classifiers, followed by an SVM operating on normalised RGB channels, pyramids of HOGs [2] and AdaBoost-selected Haar-like features.

3) Recognition - a hierarchy of SVM classifiers splits the detections into six basic traffic sign subclasses (triangle-up, triangle-down, circle-blue, circle-red, rectangle and diamond), and then another hierarchy of SVM classifiers (one for each subclass) assigns the traffic sign type to each candidate detection.
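The cascade-then-SVM structure of step 2 can be sketched as below; the stage classifiers and the SVM scorer are illustrative stand-ins, not the trained models from the paper:

```python
def detect(candidates, cascade_stages, svm_score, threshold):
    """Sketch of cascaded detection: each candidate bounding box must pass
    every (cheap, boolean) cascade stage; survivors are then scored by an
    SVM and kept only if the score clears the threshold."""
    survivors = []
    for box, features in candidates:
        if all(stage(features) for stage in cascade_stages):  # AdaBoost cascade
            score = svm_score(features)                       # SVM verification
            if score > threshold:
                survivors.append((box, score))
    return survivors
```

Because most candidates are rejected by an early cheap stage, the expensive SVM only runs on a small fraction of the boxes.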

At this stage several detections might be available for each object, as shown in Figure 2b. We perform non-maximum suppression using the energy function of [1, 7] as the colour segmentation score. With b as the bounding box we write:

P(b) = \prod_{x \in \Omega} \Big( H_e(a(x))\, P_f + \big(1 - H_e(a(x))\big)\, P_b \Big)   (1)

Here Ω is the image domain, H_e is the smooth Heaviside function, x is a pixel in the image, a(x) equals 1 inside the bounding box and -1 outside, and:

P_f = \frac{P(y_i \mid M_f)}{\eta_f P(y_i \mid M_f) + \eta_b P(y_i \mid M_b)}   (2)

P_b = \frac{P(y_i \mid M_b)}{\eta_f P(y_i \mid M_f) + \eta_b P(y_i \mid M_b)}   (3)

with η_f the number of foreground pixels, η_b the number of background pixels, y_i the colour of the i-th pixel, P(y_i|M_f) the foreground model over pixel values y and P(y_i|M_b) the background model. We use RGB images and our models are histograms with 32 bins per channel, which are updated online, allowing for variations in illumination. When an object is first detected we choose the detection with the highest SVM score and use it to initialise P(y_i|M_f) and P(y_i|M_b).
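A minimal sketch of the score of Equations (1)-(3), assuming a hard (rather than smooth) Heaviside and per-channel 32-bin RGB histograms as described; the function and variable names are our own:

```python
import numpy as np

def box_score(image, bbox, hist_f, hist_b, eps=1e-8):
    """Log of the segmentation score of Eq. (1) for one bounding box.
    `hist_f`/`hist_b` are 32x32x32 joint RGB histograms giving
    P(y|Mf) and P(y|Mb) for each pixel's colour bin."""
    x0, y0, x1, y1 = bbox
    inside = np.zeros(image.shape[:2], dtype=bool)
    inside[y0:y1, x0:x1] = True              # He(a(x)) == 1 inside, 0 outside
    bins = (image // 8).astype(int)          # 8-bit channels -> 32 bins each
    pf_pix = hist_f[bins[..., 0], bins[..., 1], bins[..., 2]]
    pb_pix = hist_b[bins[..., 0], bins[..., 1], bins[..., 2]]
    nf, nb = inside.sum(), (~inside).sum()   # eta_f, eta_b of Eqs. (2)-(3)
    denom = nf * pf_pix + nb * pb_pix + eps
    per_pixel = np.where(inside, pf_pix / denom, pb_pix / denom)
    return np.log(per_pixel + eps).sum()     # log of the product over Omega
```

A well-placed box scores higher than a shifted one, which is what the non-maximum suppression exploits.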

4 3D Initialisation and Tracking

The core of our tracking is the PWP3D algorithm [7]. It assumes a known 3D model, a calibrated camera and known foreground/background region statistics. A level set embedding function is built from the contour of the projection of the model, and the posterior per-pixel foreground/background probability is maximised as a function of pose. The energy function that is minimised is similar to the log of Equation 1:

E(\Phi) = -\sum_{x \in \Omega} \log \Big( H_e(\Phi)\, P_f + \big(1 - H_e(\Phi)\big)\, P_b \Big)   (4)

with the level set embedding function Φ replacing a(x). The actual minimisation is done by computing the derivatives of this energy function with respect to the pose parameters and using gradient descent, as in [7]. This has the disadvantage that convergence is not guaranteed within the permitted number of iterations, i.e. more iterations would almost always lead to a better solution. In our testing we found that an average of 15 iterations is enough for a sufficiently accurate result.
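The fixed-budget gradient descent can be sketched generically; here `energy_grad` is a stand-in for the (much more involved) derivatives of the PWP3D energy with respect to the 7 pose parameters (3 translation + 4 quaternion):

```python
import numpy as np

def refine_pose(pose, energy_grad, step=1e-3, iters=15):
    """Fixed-iteration gradient descent in the spirit of minimising Eq. (4).
    As in the paper, we stop after a fixed iteration budget rather than
    waiting for convergence, trading accuracy for real-time performance."""
    pose = np.asarray(pose, dtype=float)
    for _ in range(iters):
        pose = pose - step * energy_grad(pose)
        pose[3:] /= np.linalg.norm(pose[3:])   # keep the quaternion unit length
    return pose
```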

The PWP3D tracker needs an initial 3D pose and values for the foreground/background membership probabilities. Also, at each new frame, the 2D detections need to be converted to (approximate) 3D poses to be combined with the 3D tracker. In our case the objects can be approximated by their planar counterparts, which means we can use any of several available planar pose recovery algorithms to convert the 2D bounding box to a 3D pose. We use the state-of-the-art algorithm introduced in [9]. This algorithm requires (at least) four 3D-2D point correspondences. We use the 4 corners of the detection bounding box as the 2D points and relate them to the 4 corners of the bounding box enclosing the 3D model of the object. An example result is depicted in Figure 2d.
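To illustrate 4-point planar pose recovery, here is a minimal homography-based sketch. The paper uses the more robust algorithm of [9]; this simplified stand-in assumes noise-free correspondences, 3D points on the z = 0 plane, and normalised (calibrated) image coordinates:

```python
import numpy as np

def planar_pose(pts3d, pts2d):
    """Recover (R, t) from 4 planar 3D-2D correspondences: estimate the
    homography by DLT, then factor it into rotation columns and translation."""
    A = []
    for (X, Y, _), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    H = np.linalg.svd(np.asarray(A))[2][-1].reshape(3, 3)
    H /= np.linalg.norm(H[:, 0])         # fix scale so ||h1|| = 1
    if H[2, 2] < 0:                      # fix sign so the object is in front
        H = -H
    r1, r2, t = H[:, 0], H[:, 1], H[:, 2]
    R = np.column_stack([r1, r2, np.cross(r1, r2)])
    U, _, Vt = np.linalg.svd(R)          # re-orthonormalise against noise
    return U @ Vt, t
```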

At each new frame, for all previously known objects, the new 3D poses (obtained from the 2D bounding boxes) must be integrated with the tracker. To do this we begin by defining a constant velocity motion model:

v^{i}_{t,k} = t_{k-1} - t_{k-2} \qquad v^{i}_{r,k} = r_{k-1}\, r_{k-2}^{-1}   (5)

where k is the current frame, k-1 is the previous frame, t is the translation and r is the rotation quaternion from the tracker. The velocity given by the 2D detections is:

v^{ii}_{t,k} = u_k - u_{k-1} \qquad v^{ii}_{r,k} = p_k\, p_{k-1}^{-1}   (6)

where u is the translation and p is the rotation quaternion, obtained from the detector by using the 4 point planar pose recovery algorithm. The predicted pose for the current frame becomes:

t_k = t_{k-1} + \alpha\, v^{i}_{t,k} + \beta\, v^{ii}_{t,k} \qquad r_k = r_{k-1}\, q_\alpha v^{i}_{r,k}\, q_\beta v^{ii}_{r,k}   (7)

where k is the current frame, k-1 is the previous frame, t is the translation from the tracker and u is the translation from the detector. The variables α, β, q_α and q_β depend on the distance between the object and the camera: α and q_α are inversely proportional to this distance, while β and q_β are proportional to it. Thus we give more importance to the motion model when the object is closer to the car, and vice versa.
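Equations (5)-(7) can be sketched as follows. The paper does not spell out how q_α and q_β weight the rotation velocities; here we assume they act as fractional rotations (quaternion powers), which is our own interpretation:

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a; w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def qinv(q):
    return np.array([q[0], -q[1], -q[2], -q[3]]) / np.dot(q, q)

def qpow(q, s):
    """Fractional rotation q**s: scale the rotation angle by s."""
    w = np.clip(q[0] / np.linalg.norm(q), -1.0, 1.0)
    theta = np.arccos(w)
    if theta < 1e-9:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = q[1:] / np.linalg.norm(q[1:])
    return np.concatenate([[np.cos(s * theta)], np.sin(s * theta) * axis])

def predict_pose(t_hist, r_hist, u_hist, p_hist, alpha, beta):
    """Blend of Eqs. (5)-(7): tracker history (t, r) gives velocity v_i,
    detector history (u, p) gives v_ii; their weighted combination is
    the predicted pose for the current frame."""
    v_t_i = t_hist[-1] - t_hist[-2]                    # Eq. (5), translation
    v_r_i = qmul(r_hist[-1], qinv(r_hist[-2]))         # Eq. (5), rotation
    v_t_ii = u_hist[-1] - u_hist[-2]                   # Eq. (6), translation
    v_r_ii = qmul(p_hist[-1], qinv(p_hist[-2]))        # Eq. (6), rotation
    t_k = t_hist[-1] + alpha * v_t_i + beta * v_t_ii   # Eq. (7)
    r_k = qmul(r_hist[-1], qmul(qpow(v_r_i, alpha), qpow(v_r_ii, beta)))
    return t_k, r_k / np.linalg.norm(r_k)
```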

Finally the predicted pose (t_k, r_k) is refined using the tracker. The tracker could have been used alone to obtain the pose changes between consecutive frames, but this would have led to longer processing times and would have increased the chance of losing track.

We could have used a Kalman filter for a purely statistical fusion of the tracking data with the approximate poses from the detector. A Kalman filter represents all measurements, system state and noise as multivariate Gaussian distributions. By iterating the tracker rather than performing a purely statistical data fusion, we make no assumptions about the form of these probability distributions.

5 Results

We tested our algorithm on a multitude of traffic sign types and shapes. In Figure 3 (top) we show our system tracking a pedestrian crossing sign and obtaining a reasonably accurate pose even when the object is far from the camera. It is possible to calculate the distance between the sign and the car, which could then be used to automatically adjust the speed of the car.

In Figure 3 (bottom) we show our system tracking multiple objects. At present the pose for each sign is estimated independently, though of course relative to the camera each sign undergoes the same rigid transformation. We might expect improved performance were we to introduce this coupling.

In Figure 4 we compare the performance of our system, tracking a single sign over a distance of 70 m (or 70 frames at 36 km/h), with and without the 3D tracker. At larger distances the object covers only tens of pixels, so detection and tracking are prone to errors; higher resolution images would allow the sign to be detected and tracked earlier. The car was moving in a straight line, at constant speed. Translation should therefore change linearly, while rotation should remain constant. While there is little to choose between the 3D tracker and the 4 point planar pose recovery algorithm alone with regard to translation, the tracker must be used to reliably obtain rotation.

A video processed by our system is available at http://homes.esat.kuleuven.be/~rtimofte. It shows our system tracking multiple objects of different types, orientations and colours, with and without motion blur and occlusions. Our CPU implementation of the detection phase runs at around 20 fps on 640x480 images


Figure 3. Filmstrip showing 5 frames from a video tracking a pedestrian crossing sign (top) and tracking multiple traffic signs (bottom).

Figure 4. System performance while tracking a sign over 70 m, with just the 4 point pose recovery (RPP) and with the tracker (PWP). The plots show translation X and Y, translation Z, and the rotation quaternion components (Qx, Qy, Qz, Qw) over the 70 frames, for both RPP and PWP.

while the GPU based tracker needs up to 20 ms per object (on a 640x480 image).

6 Conclusions

In this work we proposed a system that can track multiple traffic signs in 3D, from a single view. By integrating accurate detections with 3D region based tracking, our system is robust to motion blur and occlusions, while still running in real time. Future work could add person and car 3D detection and tracking to create a complete driver assistance system.

References

[1] C. Bibby and I. Reid. Robust real-time visual tracking using pixel-wise posteriors. In ECCV 2008, pages 831-844.

[2] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In CIVR 2007, pages 401-408.

[3] C.-Y. Fang, S.-W. Chen, and C.-S. Fuh. Road-sign detection and tracking. IEEE Transactions on Vehicular Technology, 52:1329-1341, 2003.

[4] L. D. Lopez and O. Fuentes. Color-based road sign detection and tracking. In ICIAR 2007, pages 1138-1147.

[5] M. Meuter, A. Kummert, and S. Muller-Schneiders. 3D traffic sign tracking using a particle filter. In ITSC 2008, pages 168-173.

[6] G. Piccioli, E. De Micheli, P. Parodi, and M. Campani. Robust road sign detection and recognition from image sequences. In Intelligent Vehicles '94 Symposium, pages 278-283.

[7] V. Prisacariu and I. Reid. PWP3D: Real-time segmentation and tracking of 3D objects. In BMVC 2009.

[8] A. Ruta, Y. Li, and X. Liu. Real-time traffic sign recognition from video by class-specific discriminative features. Pattern Recognition, 43:416-430, 2010.

[9] G. Schweighofer and A. Pinz. Robust pose estimation from a planar target. PAMI, 28:2024-2030, 2006.

[10] R. Timofte, K. Zimmermann, and L. Van Gool. Multi-view traffic sign detection, recognition, and 3D localisation. In WACV 2009, pages 69-76.

[11] P. Viola and M. Jones. Robust real-time face detection. In IJCV 2001, volume 2, pages 747-757.